R News: The Newsletter of the R Project. Volume 4/1, June 2004
Editorial
by Thomas Lumley
R has been well accepted in the academic world for some time, but this issue of R News begins with a description of the use of R in a business setting. Marc Schwartz describes how his consulting company, MedAnalytics, came to use R instead of commercial statistical software, and has since provided both code and financial support to the R Project.
The growth of the R community was also evident at the very successful useR! 2004, the first conference for R users. While R aims to blur the distinctions between users and programmers, useR! was definitely different in aims and audience from the previous DSC conferences, held in 1999, 2001, and 2003. John Fox provides a summary of the conference in this issue for those who could not make it to Vienna.
A free registration to useR! was the prize in the competition for a new graphic for http://www.r-project.org. You will have seen the winning graphic, from Eric Lecoutre, on the way to downloading this newsletter. The graphic is unedited R output (click on it to see the code), showing the power of R for creating complex graphs that are completely reproducible. I would also encourage R users to visit the Journal of Statistical Software (http://www.jstatsoft.org), which publishes peer-reviewed code and documentation. R is heavily represented in recent issues of the journal.
The success of R in penetrating so many statistical niches is due in large part to the wide variety of packages produced by the R community. We have articles describing contributed packages for principal component analysis (ade4, used in creating Eric Lecoutre's winning graphic) and statistical process control (qcc), and the recommended survival package. Jianhua Zhang and Robert Gentleman describe tools from the Bioconductor project that make it easier to explore the resulting variety of code and documentation.
The Programmers' Niche and Help Desk columns both come from guest contributors this time. Gabor Grothendieck, known as a frequent poster of answers on the mailing list r-help, has been invited to contribute to the R Help Desk. He presents "Date and Time Classes in R". For the Programmers' Niche, I have written a simple example of classes and methods using the old and new class systems in R. This example arose from a discussion at an R programming course.
Programmers will also find useful Doug Bates' article on least squares calculations. Prompted by discussions on the r-devel list, he describes why the simple matrix formulae from so many textbooks are not the best approach, either for speed or for accuracy.
Thomas Lumley
Department of Biostatistics, University of Washington, Seattle
thomas.lumley@R-project.org
Contents of this issue:

Editorial ........................................... 1
The Decision To Use R ............................... 2
The ade4 package - I: One-table methods ............. 5
qcc: An R package for quality control charting
  and statistical process control .................. 11
Least Squares Calculations in R .................... 17
Tools for interactively exploring R packages ....... 20
The survival package ............................... 26
useR! 2004 ......................................... 28
R Help Desk ........................................ 29
Programmers' Niche ................................. 33
Changes in R ....................................... 36
Changes on CRAN .................................... 41
R Foundation News .................................. 46
The Decision To Use R
A Consulting Business Perspective
by Marc Schwartz
Preface
The use of R has grown markedly during the past few years, which is a testament to the dedication and commitment of R Core, to the contributions of the growing useR base, and to the variety of disciplines in which R is being used. The recent useR! meeting in Vienna was another strong indicator of growth, and was an outstanding success.
In addition to its use in academic areas, R is increasingly being used in non-academic environments, such as government agencies and various businesses ranging from small firms to large multinational corporations, in a variety of industries.
In a sense, this diversification in the user base is a visible demonstration of the greater comfort that business-based decision makers have regarding the use of Open-Source operating systems, such as Linux, and applications, such as R. Whether these tools and associated support programs are purchased (such as commercial Linux distributions) or are obtained through online collaborative communities, businesses are increasingly recognizing the value of Open-Source software.
In this article, I will describe the use of R (and other Open-Source tools) in a healthcare-based consulting firm, and the value realized by us and, importantly, by our clients through the use of these tools in our company.
Some Background on the Company
MedAnalytics is a healthcare consulting firm located in a suburb of Minneapolis, Minnesota. The company was founded and legally incorporated in August of 2000 to provide a variety of analytic and related services to two principal constituencies. The first are healthcare providers (hospitals and physicians) engaged in quality improvement programs. These clients will typically have existing clinical (as opposed to administrative) databases and wish to use these to better understand the interactions between patient characteristics, clinical processes and outcomes. The second are drug and medical device companies engaged in late-phase and post-market clinical studies. For these clients, MedAnalytics provides services related to protocol and case report form design and development, interim and final analyses, and independent data safety monitoring board appointments. In these cases, MedAnalytics will typically partner with healthcare technology and service companies to develop web-based electronic data capture, study management and reporting. We have worked in many medical specialties, including cardiac surgery, cardiology, ophthalmology, orthopaedics, gastroenterology and oncology.
Prior to founding MedAnalytics, I had 5 years of clinical experience in cardiac surgery and cardiology, and over 12 years with a medical database software company, where I was principally responsible for the management and analysis of several large multi-site national and regional sub-specialty databases that were sponsored by professional medical societies. My duties included contributing to core dataset design, the generation of periodic aggregate and comparative analyses, the development of multi-variable risk models, and contributing to peer-reviewed papers. In that position, the majority of my analyses were performed using SAS (Base, Stat and Graph), though S-PLUS (then from StatSci) was evaluated in the mid-90's and used for a period of time with some smaller datasets.
Evaluating The Tools of the Trade - A Value-Based Equation
The founder of a small business must determine how he or she is going to finance initial purchases (of which there are many) and the ongoing operating costs of the company until such time as revenues equal (and, we hope, ultimately exceed) those costs. If financial decisions are not made wisely, a company can find itself rapidly "burning cash" and risking its viability. (It is for this reason that most small businesses fail within the first two or three years.)
Because there are market-driven thresholds of what prospective clients are willing to pay for services, the cost of the tools used to provide client services must be carefully considered. If your costs are too high, you will not recover them via client billings, and (despite attitudes that prevailed during the dot-com bubble) you can't make up per-item losses simply by increasing business volume.
An important part of my company's infrastructure would be the applications software that I used. However, it would be foolish for me to base this decision solely on cost. I must consider the "value", which takes into account four key characteristics: cost, time, quality and client service. Like any business, mine would strive to minimize cost and time, while maximizing quality and client service. However, there is a constant tension among these factors, and the key would be to find a reasonable balance among them, while meeting (and, we hope, exceeding) the needs and expectations of our clients.
First and foremost, a business must present itself and act in a responsible and professional manner, and meet legal and regulatory obligations. There would be no value in the software if my clients and I could not trust the results I obtained.

R News ISSN 1609-3631
In the case of professional tools, such as software, being used internally, one typically has many alternatives to consider. You must first be sure that the potential choices are validated against a list of required functional and quality benchmarks driven by the needs of your prospective clients. You must, of course, be sure that the choices fit within your operating budget. In the specific case of analytic software applications, the list to consider and the price points were quite varied. At the high end of the price range are applications like SAS and SPSS, which have become prohibitively expensive for small businesses that require commercial licenses. Despite my past lengthy experience with SAS, it would not have been financially practical to consider using it. A single-user desktop license for the three key components (Base, Stat and Graph) would have cost about (US)$5,000 per year, well beyond my budget.
Over a period of several months through late 2000, I evaluated demonstration versions of several statistical applications. At the time I was using Windows XP, and thus considered Windows versions of any applications. I narrowed the list to S-PLUS and Stata. Because of my prior experience with both SAS and S-PLUS, I was comfortable with command line applications, as opposed to those with menu-driven "point and click" GUIs. Both S-PLUS and Stata would meet my functional requirements regarding analytic methods and graphics. Because both provided for programming within the language, I could reasonably expect to introduce some productivity enhancements and reduce my costs by more efficient use of my time.
"An Introduction To R"
About then (late 2000), I became aware of R through contacts with other users and through posts to various online forums. I had not yet decided on S-PLUS or Stata, so I downloaded and installed R to gain some experience with it. This was circa R version 1.2.x. I began to experiment with R, reading the available online documentation, perusing posts to the r-help list, and using some of the books on S-PLUS that I had previously purchased, most notably the first edition (1994) of Venables and Ripley's MASS.
Although I was not yet specifically aware of the R Community and the general culture of the GPL and Open-Source application development, the basic premise and philosophy was not foreign to me. I was quite comfortable with online support mechanisms, having used Usenet, online "knowledge bases" and other online community forums in the past. I had found that, more often than not, these provided more competent and expedient support than did the telephone support from a vendor. I also began to realize that the key factor in the success of my business would be superior client service, and not expensive proprietary technology. My experience and my needs helped me accept the basic notion of Open-Source development and online community-based support.
From my earliest exposure to R, it was evident to me that this application had evolved substantially in a relatively short period of time (as it has continued to do) under the leadership of some of the best minds in the business. It also became rapidly evident that it would fit my functional requirements, and my comfort level with it increased substantially over a period of months. Of course, my prior experience with S-PLUS certainly helped me to learn R rapidly (although there are important differences between S-PLUS and R, as described in the R FAQ).
I therefore postponed any purchase decisions regarding S-PLUS or Stata, and made a commitment to using R.
I began to develop and implement various functions that I would need in providing certain analytic and reporting capabilities for clients. Over a period of several weeks, I created a package of functions for internal use that substantially reduced the time it takes to import and structure data from certain clients (typically provided as Excel workbooks or as ASCII files) and then generate a large number of tables and graphs. These types of reports typically use standardized datasets and are created reasonably frequently. Moreover, I could easily enhance and introduce new components to the reports as required by the changing needs of these clients. Over time, I introduced further efficiencies, and was able to take a process that initially had taken several days quite literally down to a few hours. As a result, I was able to spend less time on basic data manipulation and more time on the review, interpretation and description of key findings.
In this way, I could reduce the total time required for a standardized project, reducing my costs and also increasing the value to the client by being able to spend more time on the higher-value services to the client, instead of on internal processes. By enhancing the quality of the product and ultimately passing my cost reductions on to the client, I am able to increase substantially the value of my services.
A Continued Evolution
In my more than three years of using R, I have continued to implement other internal process changes. Most notably, I have moved day-to-day computing operations from Windows XP to Linux. Needless to say, R's support of multiple operating systems made
this transition essentially transparent for the application software. I also started using ESS as my primary working environment for R, and enjoyed its higher level of support for R compared to the editor that I had used under Windows, which simply provided syntax highlighting.
My transformation to Linux occurred gradually during the later Red Hat (RH) 8.0 beta releases, as I became more comfortable with Linux and with the process of replacing Windows functionality that had become "second nature" over my years of MS operating system use (dating back to the original IBM PC/DOS days of 1981). For example, I needed to replace my practice in R of generating WMF files to be incorporated into Word and Powerpoint documents. At first, I switched to generating EPS files for incorporation into OpenOffice.org's (OO.org) Writer and Impress documents, but now I primarily use LaTeX and the "seminar" slide creation package, frequently in conjunction with Sweave. Rather than using Powerpoint or Impress for presentations, I create landscape-oriented PDF files and display them with Adobe Acrobat Reader, which supports a full-screen presentation mode.
I also use OO.org's Writer for most documents that I need to exchange with clients. For the final version, I export a PDF file from Writer, knowing that clients will have access to Adobe's Reader. Because OO.org's Writer supports MS Word input formats, it provides for interchange with clients when jointly editable documents (or tracking of changes) are required. When clients send me Excel worksheets, I use OO.org's Calc, which supports Excel data formats (and in fact exports much better CSV files from these worksheets than does Excel). I have replaced Outlook with Ximian's Evolution for my e-mail, contact list and calendar management, including the coordination of electronic meeting requests with business partners and clients and, of course, synching my Sony Clie (which Sony recently announced will soon be deprecated in favor of smarter cell phones).
With respect to Linux, I am now using Fedora Core (FC) 2, having moved from RH 8.0 to RH 9 to FC 1. I have, in time, replaced all Windows-based applications with Linux-based alternatives, with one exception: some accounting software that I need in order to exchange electronic files with my accountant. In time, I suspect that this final hurdle will be removed and, with it, the need to use Windows at all. I should make it clear that this is not a goal for philosophic reasons, but for business reasons. I have found that the stability and functionality of the Linux environment has enabled me to engage in a variety of technical and business process related activities that would either be problematic or prohibitively expensive under Windows. This enables me to realize additional productivity gains that further reduce operating costs.
Giving Back
In closing, I would like to make a few comments about the R Community.
I have felt fortunate to be a useR, and I also have felt strongly that I should give back to the Community, recognizing that, ultimately, through MedAnalytics and the use of R, I am making a living using software that has been made available through voluntary efforts, and which I did not need to purchase.
To that end, over the past three years, I have committed to contributing and giving back to the Community in several ways. First and foremost, by responding to queries on the r-help list and, later, on the r-devel list. There has been, and will continue to be, a continuous stream of new useRs who need basic assistance. As current useRs progress in experience, each of us can contribute to the Community by assisting new useRs. In this way, a cycle develops in which the students evolve into teachers.
Secondly, in keeping with the philosophy that "useRs become developers", as I have found the need for particular features and have developed specific solutions, I have made them available via CRAN or the lists. Two functions in particular have been available in the "gregmisc" package since mid-2002, thanks to Greg Warnes. These are the barplot2() and CrossTable() functions. As time permits, I plan to enhance, most notably, the CrossTable() function with a variety of non-parametric correlation measures, and will make those available when ready.
Thirdly, in a different area, I have committed to financially contributing back to the Community by supporting the R Foundation and the recent useR! 2004 meeting. This is distinct from volunteering my time, and recognizes that, ultimately, even a voluntary Community such as this one has some operational costs to cover. When, in 2000-2001, I was contemplating the purchase of S-PLUS and Stata, these would have cost approximately (US)$2,500 and (US)$1,200, respectively. In effect, I have committed those dollars to supporting the R Foundation, at the Benefactor level, over the coming years.
I would challenge and urge other commercial useRs to contribute to the R Foundation at whatever level makes sense for your organization.
Thanks
It has been my pleasure and privilege to be a member of this Community over the past three years. It has been nothing short of phenomenal to experience the growth of this Community during that time.
I would like to thank Doug Bates for his kind invitation to contribute this article, the idea for which arose in a discussion we had during the recent useR! meeting in Vienna. I would also like to extend my thanks to R Core and to the R Community at large
for their continued drive to create and support this wonderful vehicle for international collaboration.
Marc Schwartz
MedAnalytics, Inc., Minneapolis, Minnesota, USA
MSchwartz@MedAnalytics.com
The ade4 package - I: One-table methods
by Daniel Chessel, Anne B. Dufour and Jean Thioulouse
Introduction
This paper is a short summary of the main classes defined in the ade4 package for one-table analysis methods (e.g., principal component analysis). Other papers will detail the classes defined in ade4 for two-table coupling methods (such as canonical correspondence analysis, redundancy analysis, and co-inertia analysis), for methods dealing with K-tables analysis (i.e., three-way tables), and for graphical methods.
This package is a complete rewrite of the ADE4 software (Thioulouse et al. (1997), http://pbil.univ-lyon1.fr/ADE-4/) for the R environment. It contains Data Analysis functions to analyse Ecological and Environmental data in the framework of Euclidean Exploratory methods, hence the name ade4 (i.e., 4 is not a version number, but means that there are four E in the acronym).
The ade4 package is available on CRAN, but it can also be used directly online thanks to the Rweb system (http://pbil.univ-lyon1.fr/Rweb/). This possibility is being used to provide multivariate analysis services in the field of bioinformatics, particularly for sequence and genome structure analysis at the PBIL (http://pbil.univ-lyon1.fr/). An example of these services is the automated analysis of the codon usage of a set of DNA sequences by correspondence analysis (http://pbil.univ-lyon1.fr/mva/coa.php).
The duality diagram class
The basic tool in ade4 is the duality diagram (Escoufier (1987)). A duality diagram is simply a list that contains a triplet (X, Q, D):
- X is a table with n rows and p columns, considered as p points in R^n (column vectors), or n points in R^p (row vectors).
- Q is a p x p diagonal matrix containing the weights of the p columns of X, used as a scalar product in R^p (Q is stored under the form of a vector of length p).
- D is an n x n diagonal matrix containing the weights of the n rows of X, used as a scalar product in R^n (D is stored under the form of a vector of length n).
For example, if X is a table containing normalized quantitative variables, if Q is the identity matrix I_p, and if D is equal to (1/n) I_n, the triplet corresponds to a principal component analysis on the correlation matrix (normed PCA). Each basic method corresponds to a particular triplet (see Table 1), but more complex methods can also be represented by their duality diagram.
Functions   Analyses                                      Notes
dudi.pca    principal component                           1
dudi.coa    correspondence                                2
dudi.acm    multiple correspondence                       3
dudi.fca    fuzzy correspondence                          4
dudi.mix    analysis of a mixture of numeric and factors  5
dudi.nsc    non symmetric correspondence                  6
dudi.dec    decentered correspondence                     7

Table 1: The dudi functions. 1: principal component analysis, same as prcomp/princomp. 2: correspondence analysis (Greenacre, 1984). 3: multiple correspondence analysis (Tenenhaus and Young, 1985). 4: fuzzy correspondence analysis (Chevenet et al., 1994). 5: analysis of a mixture of numeric variables and factors (Hill and Smith, 1976; Kiers, 1994). 6: non symmetric correspondence analysis (Kroonenberg and Lombardo, 1999). 7: decentered correspondence analysis (Dolédec et al., 1995).
The singular value decomposition of a triplet gives principal axes, principal components, and row and column coordinates, which are added to the triplet for later use.
We can use, for example, a well-known dataset from the base package:
> data(USArrests)
> pca1 <- dudi.pca(USArrests, scannf = FALSE, nf = 3)
scannf = FALSE means that the number of principal components that will be used to compute row and column coordinates should not be asked interactively of the user, but taken as the value of argument nf (by default, nf = 2). Other parameters allow one to choose between centered, normed or raw PCA (default is centered and normed), and to set arbitrary row and column weights. The pca1 object is a duality diagram, i.e., a list made of several vectors and dataframes:
> pca1
Duality diagramm
class: pca dudi
$call: dudi.pca(df = USArrests, scannf = FALSE, nf = 3)
$nf: 3 axis-components saved
$rank: 4
eigen values: 2.48 0.9898 0.3566 0.1734
  vector length mode    content
1 $cw    4      numeric column weights
2 $lw    50     numeric row weights
3 $eig   4      numeric eigen values
  data.frame nrow ncol content
1 $tab       50   4    modified array
2 $li        50   3    row coordinates
3 $l1        50   3    row normed scores
4 $co        4    3    column coordinates
5 $c1        4    3    column normed scores
other elements: cent norm
pca1$lw and pca1$cw are the row and column weights that define the duality diagram, together with the data table (pca1$tab). pca1$eig contains the eigenvalues. The row and column coordinates are stored in pca1$li and pca1$co. The variance of these coordinates is equal to the corresponding eigenvalue, and unit variance coordinates are stored in pca1$l1 and pca1$c1 (this is useful to draw biplots).
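The coordinate-variance property can be illustrated with base R's prcomp, bearing in mind that ade4 weights rows by 1/n while prcomp's sdev is based on the n-1 divisor, so the two eigenvalue scales differ by a factor of (n-1)/n. A hedged sketch:

```r
# Sketch: the variance of the principal-component scores equals the
# corresponding eigenvalue. With prcomp (n-1 divisor), the 1/n-weighted
# variance of each score column is sdev^2 * (n-1)/n.
pr <- prcomp(USArrests, scale. = TRUE)
n <- nrow(USArrests)
v <- colSums(pr$x^2) / n   # variance of scores under 1/n row weights
all.equal(unname(v), pr$sdev^2 * (n - 1) / n)
```
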
The general optimization theorems of data analysis take particular meanings for each type of analysis, and graphical functions are proposed to draw the canonical graphs, i.e., the graphical expression corresponding to the mathematical property of the object. For example, the normed PCA of a quantitative variable table gives a score that maximizes the sum of squared correlations with variables. The PCA canonical graph is therefore a graph showing how the sum of squared correlations is maximized for the variables of the data set. On the USArrests example, we obtain the following graphs:
> score(pca1)
> s.corcircle(pca1$co)
Figure 1: One-dimensional canonical graph for a normed PCA. Variables are displayed as a function of row scores, to get a picture of the maximization of the sum of squared correlations.
Figure 2: Two-dimensional canonical graph for a normed PCA (correlation circle): the direction and length of arrows show the quality of the correlation between variables, and between variables and principal components.
The scatter function draws the biplot of the PCA (i.e., a graph with both rows and columns superimposed):
> scatter(pca1)
Figure 3: The PCA biplot. Variables are symbolized by arrows, and they are superimposed on the individuals display. The scale of the graph is given by a grid, whose size is given in the upper right corner. Here, the length of the side of grid squares is equal to one. The eigenvalues bar chart is drawn in the upper left corner, with the two black bars corresponding to the two axes used to draw the biplot. Grey bars correspond to axes that were kept in the analysis, but not used to draw the graph.
Separate factor maps can be drawn with the s.corcircle (see Figure 2) and s.label functions:
> s.label(pca1$li)
Figure 4: Individuals factor map in a normed PCA.
Distance matrices
A duality diagram can also come from a distance matrix, if this matrix is Euclidean (i.e., if the distances in the matrix are the distances between some points in a Euclidean space). The ade4 package contains functions to compute dissimilarity matrices (dist.binary for binary data, and dist.prop for frequency data), test whether they are Euclidean (Gower and Legendre (1986)), and make them Euclidean (quasieuclid, lingoes (Lingoes (1971)), cailliez (Cailliez (1983))). These functions are useful to ecologists who use the works of Legendre and Anderson (1999) and Legendre and Legendre (1998).
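The Euclidean property can be checked directly in base R via Gower's double centering: a distance matrix is Euclidean exactly when the doubly centered matrix of -0.5 times the squared distances has no negative eigenvalues. The sketch below illustrates the idea; the function name is_euclidean is purely illustrative (ade4 provides its own test for this purpose).

```r
# Sketch of the Euclidean test via Gower's double centering:
# d is Euclidean iff -0.5 * B %*% d^2 %*% B (B the centering operator)
# is positive semi-definite.
is_euclidean <- function(d, tol = 1e-07) {
  m <- as.matrix(d)^2                   # squared distances
  n <- nrow(m)
  B <- diag(n) - matrix(1 / n, n, n)    # centering operator
  w <- -0.5 * B %*% m %*% B             # Gower's double centering
  ev <- eigen(w, symmetric = TRUE, only.values = TRUE)$values
  min(ev) > -tol * max(abs(ev))         # no (numerically) negative eigenvalue
}
is_euclidean(dist(USArrests))           # dist() is Euclidean by default
```
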
The Yanomama data set (Manly (1991)) contains three distance matrices between 19 villages of Yanomama Indians. The dudi.pco function can be used to compute a principal coordinates analysis (PCO, Gower (1966)), which gives a Euclidean representation of the 19 villages. This Euclidean representation allows one to compare the geographical, genetic and anthropometric distances.
> data(yanomama)
> gen <- quasieuclid(as.dist(yanomama$gen))
> geo <- quasieuclid(as.dist(yanomama$geo))
> ant <- quasieuclid(as.dist(yanomama$ant))
> geo1 <- dudi.pco(geo, scann = FALSE, nf = 3)
> gen1 <- dudi.pco(gen, scann = FALSE, nf = 3)
> ant1 <- dudi.pco(ant, scann = FALSE, nf = 3)
> par(mfrow = c(2, 2))
> scatter(geo1)
> scatter(gen1)
> scatter(ant1, posi = "bottom")
Figure 5: Comparison of the PCO analysis of the three distance matrices of the Yanomama data set: geographic, genetic and anthropometric distances.
Taking into account groups of individuals
In sites x species tables, rows correspond to sites, columns correspond to species, and the values are the number of individuals of species j found at site i. These tables can have many columns, and cannot be used in a discriminant analysis. In this case, between-class analyses (between function) are a better alternative, and they can be used with any duality diagram. The between-class analysis of triplet (X, Q, D) for a given factor f is the analysis of the triplet (G, Q, Dw), where G is the table of the means of table X for the groups defined by f, and Dw is the diagonal matrix of group weights. For example, a between-class correspondence analysis (BCA) is very simply obtained after a correspondence analysis (CA):
> data(meaudret)
> coa1 <- dudi.coa(meaudret$fau, scannf = FALSE)
> bet1 <- between(coa1, meaudret$plan$sta,
+     scannf = FALSE)
> plot(bet1)
The meaudret$fau dataframe is an ecological table with 24 rows corresponding to six sampling sites along a small French stream (the Meaudret). These six sampling sites were sampled four times (spring, summer, winter and autumn), hence the 24 rows. The 13 columns correspond to 13 ephemeroptera species. The CA of this data table is done with the dudi.coa function, giving the coa1 duality diagram. The corresponding between-class analysis is done with the between function, considering the sites as classes (meaudret$plan$sta is a factor defining the classes). Therefore, this is a between-sites analysis, whose aim is to discriminate the sites given the distribution of ephemeroptera species. This gives the bet1 duality diagram, and Figure 6 shows the graph obtained by plotting this object.
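The first step of a between-class analysis, building the table G of group means and the group weights Dw before analysing the (G, Q, Dw) triplet, is easy to sketch in base R. Since the meaudret data require ade4, the example below uses USArrests with a deliberately hypothetical two-level grouping factor:

```r
# Base-R sketch of the first step of a between-class analysis:
# build the group-means table G and group weights Dw from a table X
# and a factor f. The factor here is hypothetical, for illustration only.
X <- USArrests
f <- factor(rep(c("A", "B"), length.out = nrow(X)))
G <- apply(X, 2, function(col) tapply(col, f, mean))  # group-means table
Dw <- as.vector(table(f)) / nrow(X)                   # group weights
G          # one row per class, one column per variable
sum(Dw)    # the group weights sum to 1
```
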
Figure 6: BCA plot. This is a composed plot, made of: 1- the species canonical weights (top left), 2- the species scores (middle left), 3- the eigenvalues bar chart (bottom left), 4- the plot of plain CA axes projected into BCA (bottom center), 5- the gravity centers of classes (bottom right), 6- the projection of the rows with ellipses and gravity center of classes (main graph).
Like between-class analyses, linear discriminant analysis (discrimin function) can be extended to any duality diagram (G, (X^t D X)^-, Dw), where (X^t D X)^- is a generalized inverse. This gives, for example, a correspondence discriminant analysis (Perrière et al. (1996), Perrière and Thioulouse (2002)), which can be computed with the discrimin.coa function.

Opposite to between-class analyses are within-class analyses, corresponding to diagrams (X - Y Dw^(-1) Y^t D X, Q, D) (within functions, with Y the class indicator matrix). These analyses extend to any type of variables the Multiple Group Principal Component Analysis (MGPCA, Thorpe (1983a), Thorpe (1983b), Thorpe and Leamy (1983)). Furthermore, the within.coa function introduces the double within-class correspondence analysis (also named internal correspondence analysis, Cazes et al. (1988)).
Permutation tests
Permutation tests (also called Monte-Carlo tests, or randomization tests) can be used to assess the statistical significance of between-class analyses. Many permutation tests are available in the ade4 package, for example mantel.randtest, procuste.randtest, randtest.between, randtest.coinertia, RV.rtest and randtest.discrimin, and several of these tests are available both in the R (mantel.rtest) and in the C (mantel.randtest) programming language. The R
version allows to see how computations are per-formed and to write easily other tests while the Cversion is needed for performance reasons
The statistical significance of the BCA can be evaluated with the randtest.between function. By default, 999 permutations are simulated, and the resulting object (test1) can be plotted (Figure 7). The p-value is highly significant, which confirms the existence of differences between sampling sites. The plot shows that the observed value is very far to the right of the histogram of simulated values.
> test1 <- randtest.between(bet1)
> test1
Monte-Carlo test
Observation: 0.4292
Call: randtest.between(xtest = bet1)
Based on 999 replicates
Simulated p-value: 0.001
> plot(test1, main="Between class inertia")
Figure 7: Histogram of the 999 simulated values of the randomization test of the bet1 BCA. The observed value is given by the vertical line, at the right of the histogram.
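The Monte-Carlo principle behind such a randomization test can be sketched language-neutrally. The following Python sketch is an illustration of the idea only, not ade4 code; the statistic (absolute difference of group means) and the toy data are invented for the example:

```python
import random

def randtest(statistic, x, groups, nrepet=999, seed=42):
    """Monte-Carlo test: compare the observed statistic with its
    distribution under random permutations of the class labels."""
    rng = random.Random(seed)
    obs = statistic(x, groups)
    sims = []
    g = list(groups)
    for _ in range(nrepet):
        rng.shuffle(g)                # permute the class labels
        sims.append(statistic(x, g))
    # p-value: proportion of values >= the observed one; the observed
    # value itself is counted, hence the two +1 terms
    p = (1 + sum(s >= obs for s in sims)) / (nrepet + 1)
    return obs, p

# toy statistic: absolute difference of the two group means
def absdiff(x, g):
    a = [v for v, k in zip(x, g) if k == "A"]
    b = [v for v, k in zip(x, g) if k == "B"]
    return abs(sum(a) / len(a) - sum(b) / len(b))

x = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
grp = ["A"] * 4 + ["B"] * 4
obs, p = randtest(absdiff, x, grp)
```

As in randtest.between, a small p-value indicates that the observed between-class structure is unlikely to arise from a random allocation of rows to classes.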
Conclusion
We have described only the most basic functions of the ade4 package, considering only the simplest one-table data analysis methods. Many other dudi methods are available in ade4, for example multiple correspondence analysis (dudi.acm), fuzzy correspondence analysis (dudi.fca), analysis of a mixture of numeric variables and factors (dudi.mix), non-symmetric correspondence analysis (dudi.nsc) and decentered correspondence analysis (dudi.dec).
We are preparing a second paper, dealing with two-tables coupling methods, among which canonical correspondence analysis and redundancy analysis are the most frequently used in ecology (Legendre and Legendre (1998)). The ade4 package proposes an alternative to these methods, based on the co-inertia criterion (Dray et al. (2003)).
The third category of data analysis methods available in ade4 are K-tables analysis methods, that try to extract the stable part in a series of tables. These methods come from the STATIS strategy (Lavit et al. (1994); statis and pta functions) or from the multiple co-inertia strategy (mcoa function). The mfa and foucart functions perform two variants of K-tables analysis, and the STATICO method (function ktab.match2ktabs, Thioulouse et al. (2004)) makes it possible to extract the stable part of species-environment relationships, in time or in space.
Methods taking into account spatial constraints (multispati function) and phylogenetic constraints (phylog function) are under development.
Bibliography
Cailliez, F. (1983). The analytical solution of the additive constant problem. Psychometrika, 48:305–310.

Cazes, P., Chessel, D. and Dolédec, S. (1988). L'analyse des correspondances internes d'un tableau partitionné : son usage en hydrobiologie. Revue de Statistique Appliquée, 36:39–54.

Chevenet, F., Dolédec, S. and Chessel, D. (1994). A fuzzy coding approach for the analysis of long-term ecological data. Freshwater Biology, 31:295–309.

Dolédec, S., Chessel, D. and Olivier, J. (1995). L'analyse des correspondances décentrée : application aux peuplements ichtyologiques du Haut-Rhône. Bulletin Français de la Pêche et de la Pisciculture, 336:29–40.

Dray, S., Chessel, D. and Thioulouse, J. (2003). Co-inertia analysis and the linking of ecological tables. Ecology, 84:3078–3089.

Escoufier, Y. (1987). The duality diagram: a means of better practical applications. In Legendre, P. and Legendre, L., editors, Developments in Numerical Ecology, pages 139–156. NATO Advanced Institute Series G. Springer Verlag, Berlin.

Gower, J. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53:325–338.

Gower, J. and Legendre, P. (1986). Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 3:5–48.

Greenacre, M. (1984). Theory and Applications of Correspondence Analysis. Academic Press, London.

Hill, M. and Smith, A. (1976). Principal component analysis of taxonomic data with multi-state discrete characters. Taxon, 25:249–255.
Kiers, H. (1994). Simple structure in component analysis techniques for mixtures of qualitative and quantitative variables. Psychometrika, 56:197–212.

Kroonenberg, P. and Lombardo, R. (1999). Non-symmetric correspondence analysis: a tool for analysing contingency tables with a dependence structure. Multivariate Behavioral Research, 34:367–396.

Lavit, C., Escoufier, Y., Sabatier, R. and Traissac, P. (1994). The ACT (STATIS method). Computational Statistics and Data Analysis, 18:97–119.

Legendre, P. and Anderson, M. (1999). Distance-based redundancy analysis: testing multispecies responses in multifactorial ecological experiments. Ecological Monographs, 69:1–24.

Legendre, P. and Legendre, L. (1998). Numerical Ecology. Elsevier Science BV, Amsterdam, 2nd English edition.

Lingoes, J. (1971). Some boundary conditions for a monotone analysis of symmetric matrices. Psychometrika, 36:195–203.

Manly, B. F. J. (1991). Randomization and Monte Carlo Methods in Biology. Chapman and Hall, London.

Perrière, G., Combet, C., Penel, S., Blanchet, C., Thioulouse, J., Geourjon, C., Grassot, J., Charavay, C., Gouy, M., Duret, L. and Deléage, G. (2003). Integrated databanks access and sequence/structure analysis services at the PBIL. Nucleic Acids Research, 31:3393–3399.

Perrière, G., Lobry, J. and Thioulouse, J. (1996). Correspondence discriminant analysis: a multivariate method for comparing classes of protein and nucleic acid sequences. Computer Applications in the Biosciences, 12:519–524.

Perrière, G. and Thioulouse, J. (2002). Use of correspondence discriminant analysis to predict the subcellular location of bacterial proteins. Computer Methods and Programs in Biomedicine, in press.

Tenenhaus, M. and Young, F. (1985). An analysis and synthesis of multiple correspondence analysis, optimal scaling, dual scaling, homogeneity analysis and other methods for quantifying categorical multivariate data. Psychometrika, 50:91–119.

Thioulouse, J., Chessel, D., Dolédec, S. and Olivier, J. (1997). ADE-4: a multivariate analysis and graphical display software. Statistics and Computing, 7:75–83.

Thioulouse, J., Simier, M. and Chessel, D. (2004). Simultaneous analysis of a sequence of paired ecological tables. Ecology, 85:272–283.

Thorpe, R. S. (1983a). A biometric study of the effects of growth on the analysis of geographic variation: tooth number in green geckos (Reptilia: Phelsuma). Journal of Zoology, 201:13–26.

Thorpe, R. S. (1983b). A review of the numerical methods for recognising and analysing racial differentiation. In Felsenstein, J., editor, Numerical Taxonomy, Series G, pages 404–423. Springer-Verlag, New York.

Thorpe, R. S. and Leamy, L. (1983). Morphometric studies in inbred house mice (Mus sp.): multivariate analysis of size and shape. Journal of Zoology, 199:421–432.
qcc: An R package for quality control charting and statistical process control

by Luca Scrucca
Introduction
The qcc package for the R statistical environment makes it possible to:

• plot Shewhart quality control charts for continuous, attribute and count data;
• plot Cusum and EWMA charts for continuous data;
• draw operating characteristic curves;
• perform process capability analyses;
• draw Pareto charts and cause-and-effect diagrams.
I started writing the package to provide the students in the class I teach a tool for learning the basic concepts of statistical quality control, at the introductory level of a book such as Montgomery (2000). Given the intended final users, a free statistical environment was important, and R was a natural candidate.

The initial design, although written from scratch, reflected the set of functions available in S-Plus. Users of that environment will find some similarities, but also differences. In particular, during development I added more tools and re-designed most of the library to meet the requirements I thought were important to satisfy.
Creating a qcc object
A qcc object is the basic building block to start with. It can be simply created by invoking the qcc function, which in its simplest form requires two arguments: a data frame, a matrix or a vector containing the observed data, and a string value specifying the control chart to be computed. Standard Shewhart control charts are available. These are often classified according to the type of quality characteristic that they are supposed to monitor (see Table 1).

Control charts for continuous variables are usually based on several samples, with observations collected at different time points. Each sample, or "rational group", must be provided as a row of a data frame or a matrix. The function qcc.groups can be used to easily group a vector of data values based on a sample indicator. Unequal sample sizes are allowed.
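The reshaping performed by qcc.groups can be pictured as follows. This is an illustrative Python sketch of the same grouping idea, not the package code; here unequal sample sizes are padded with None (qcc uses NA):

```python
def group_by_sample(values, sample):
    """Collect observations sharing a sample indicator into rows,
    one row per rational group; None pads unequal sample sizes."""
    rows = {}
    for v, s in zip(values, sample):
        rows.setdefault(s, []).append(v)
    width = max(len(r) for r in rows.values())
    return [rows[k] + [None] * (width - len(rows[k]))
            for k in sorted(rows)]

# first seven piston-ring diameters and their sample indicators
diameter = [74.030, 74.002, 74.019, 73.992, 74.008, 73.995, 73.992]
sample = [1, 1, 1, 1, 1, 2, 2]
grouped = group_by_sample(diameter, sample)
```

Each row of the result corresponds to one rational group, ready to be treated as one sample of the charted statistic.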
Suppose we have extracted samples for a characteristic of interest from an ongoing production process:
> data(pistonrings)
> attach(pistonrings)
> dim(pistonrings)
[1] 200 3
> pistonrings
    diameter sample trial
1     74.030      1  TRUE
2     74.002      1  TRUE
3     74.019      1  TRUE
4     73.992      1  TRUE
5     74.008      1  TRUE
6     73.995      2  TRUE
7     73.992      2  TRUE
...
199   74.000     40 FALSE
200   74.020     40 FALSE
> diameter <- qcc.groups(diameter, sample)
> dim(diameter)
[1] 40 5
> diameter
     [,1]   [,2]   [,3]   [,4]   [,5]
1  74.030 74.002 74.019 73.992 74.008
2  73.995 73.992 74.001 74.011 74.004
...
40 74.010 74.005 74.029 74.000 74.020
Using the first 25 samples as training data, an X chart can be obtained as follows:

> obj <- qcc(diameter[1:25,], type="xbar")
By default a Shewhart chart is drawn (see Figure 1) and an object of class 'qcc' is returned. Summary statistics can be retrieved using the summary method (or simply by printing the object):
> summary(obj)
Call:
qcc(data = diameter[1:25, ], type = "xbar")

xbar chart for diameter[1:25, ]

Summary of group statistics:
  Min. 1st Qu. Median  Mean 3rd Qu.  Max.
 73.99   74.00  74.00 74.00   74.00 74.01

Group sample size: 5
Number of groups: 25
Center of group statistics: 74.00118
Standard deviation: 0.009887547

Control limits:
      LCL      UCL
 73.98791 74.01444
Control charts for continuous variables:

"xbar"      X chart. Sample means are plotted to control the mean level of a continuous process variable.
"xbar.one"  X chart. Sample values from a one-at-a-time data process are plotted to control the mean level of a continuous process variable.
"R"         R chart. Sample ranges are plotted to control the variability of a continuous process variable.
"S"         S chart. Sample standard deviations are plotted to control the variability of a continuous process variable.

Control charts for attributes:

"p"   p chart. The proportion of nonconforming units is plotted. Control limits are based on the binomial distribution.
"np"  np chart. The number of nonconforming units is plotted. Control limits are based on the binomial distribution.
"c"   c chart. The number of defects per unit is plotted. This chart assumes that defects of the quality attribute are rare, and the control limits are computed based on the Poisson distribution.
"u"   u chart. The average number of defects per unit is plotted. The Poisson distribution is used to compute control limits, but unlike the c chart this chart does not require a constant number of units.

Table 1: Shewhart control charts available in the qcc package.
The number of groups and their sizes are reported in this case, whereas a table is provided in the case of unequal sample sizes. Moreover, the center of group statistics (the overall mean for an X chart) and the within-group standard deviation of the process are returned.

The main goal of a control chart is to monitor a process. If special causes of variation are present, the process is said to be "out of control" and actions are to be taken to find, and possibly eliminate, such causes. A process is declared to be "in control" if all points charted lie randomly within the control limits. These are usually computed at ±3σ from the center. This default can be changed using the argument nsigmas (so-called "American practice") or specifying the confidence level (so-called "British practice") through the confidence.level argument.
Assuming a Gaussian distribution is appropriate for the statistic charted, the two practices are equivalent, since limits at ±3σ correspond to a two-tailed probability of 0.0027 under a standard normal curve.
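The correspondence between the two conventions is easy to check numerically. The sketch below uses only the Python standard library; the function names merely mirror the roles of the nsigmas and confidence.level arguments:

```python
from statistics import NormalDist

nd = NormalDist()  # standard normal distribution

def conf_from_nsigmas(nsigmas):
    """Two-tailed coverage of +/- nsigmas limits under a Gaussian."""
    return nd.cdf(nsigmas) - nd.cdf(-nsigmas)

def nsigmas_from_conf(conf):
    """Multiplier whose symmetric limits have two-tailed coverage conf."""
    return nd.inv_cdf(1 - (1 - conf) / 2)

# tail probability outside the usual 3-sigma limits
alpha = 1 - conf_from_nsigmas(3)
```

Rounding alpha to four decimals gives the 0.0027 quoted in the text, and applying the two conversions in sequence recovers the original multiplier.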
Instead of using standard calculations to compute the group statistics and the corresponding control limits (see Montgomery (2000), Wetherill and Brown (1991)), the center, the within-group standard deviation and the control limits may be specified directly by the user through the arguments center, std.dev and limits, respectively.
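The standard calculations behind an X̄ chart can be sketched as follows. This is an illustrative Python sketch only; it estimates the process standard deviation by the pooled within-group standard deviation, one of several textbook conventions, so its numbers need not match qcc's default estimator exactly:

```python
from math import sqrt
from statistics import mean, stdev

def xbar_limits(groups, nsigmas=3):
    """Center line and control limits of an X-bar chart, with sigma
    estimated by the pooled within-group standard deviation."""
    n = len(groups[0])                         # common sample size
    center = mean(mean(g) for g in groups)     # grand mean of group means
    sigma = sqrt(mean(stdev(g) ** 2 for g in groups))
    half = nsigmas * sigma / sqrt(n)           # half-width of the limits
    return center, center - half, center + half

# first two rational groups of the piston-ring data
groups = [[74.030, 74.002, 74.019, 73.992, 74.008],
          [73.995, 73.992, 74.001, 74.011, 74.004]]
center, lcl, ucl = xbar_limits(groups)
```

The limits are symmetric about the center line, and shrink as the sample size n grows, which is why larger rational groups give tighter charts.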
Figure 1: An X chart for the samples data.
Shewhart control charts
A Shewhart chart is automatically plotted when an object of class 'qcc' is created, unless the qcc function is called with the argument plot=FALSE. The method responsible for plotting a control chart can, however, be invoked directly as follows:

> plot(obj)
giving the plot in Figure 1. This control chart has the center line (horizontal solid line), the upper and lower control limits (dashed lines), and the sample group statistics drawn as points connected with lines. Unless we provide the argument add.stats=FALSE, at the bottom of the plot some summary statistics are shown, together with the number of points beyond control limits and the number of violating runs (a run has length 5 by default).
Once a process is considered to be "in control", we may use the computed limits for monitoring new data sampled from the same ongoing process. For example,

> obj <- qcc(diameter[1:25,], type="xbar",
+   newdata=diameter[26:40,])

plots the X chart for both calibration data and new data (see Figure 2), but all the statistics and the control limits are solely based on the first 25 samples.
Figure 2: An X chart with calibration data and new data samples.
If only the new data need to be plotted, we may provide the argument chart.all=FALSE in the call to the qcc function, or through the plotting method:

> plot(obj, chart.all=FALSE)
Many other arguments may be provided in the call to the qcc function and to the corresponding plotting method, and we refer the reader to the on-line documentation for details (help(qcc)).
As reported in Table 1, an X chart for one-at-a-time data may be obtained specifying type="xbar.one". In this case the data charted are simply the sample values, and the within-group standard deviation is estimated by moving ranges of k (by default 2) points.
Control charts for attributes mainly differ from the previous examples in that we need to provide sample sizes through the sizes argument (except for the c chart). For example, a p chart can be obtained as follows:
> data(orangejuice)
> attach(orangejuice)
> orangejuice
   sample  D size trial
1       1 12   50  TRUE
2       2 15   50  TRUE
3       3  8   50  TRUE
...
53     53  3   50 FALSE
54     54  5   50 FALSE
> obj2 <- qcc(D[trial], sizes=size[trial], type="p")
where we use the logical indicator variable trial to select the first 30 samples as calibration data.
Operating characteristic function
An operating characteristic (OC) curve provides information about the probability of not detecting a shift in the process. This is usually referred to as the type II error, that is, the probability of erroneously accepting a process as being "in control".
The OC curves can be easily obtained from an object of class 'qcc':

> par(mfrow=c(1,2))
> oc.curves(obj)
> oc.curves(obj2)
The function oc.curves invisibly returns a matrix or a vector of probability values for the type II error. More arguments are available (see help(oc.curves)); in particular, identify=TRUE allows one to interactively identify values on the plot.
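For an X̄ chart with ±3σ limits, the type II error after a mean shift of δ process standard deviations, with samples of size n, has the standard closed form β = Φ(3 − δ√n) − Φ(−3 − δ√n). The Python sketch below evaluates this textbook formula with the standard library only; it is an illustration of what the OC curves display, not the qcc implementation:

```python
from math import sqrt
from statistics import NormalDist

phi = NormalDist().cdf  # standard normal cdf

def beta_xbar(shift_sd, n, nsigmas=3):
    """Type II error: probability that a group mean still falls inside
    the limits after a shift of shift_sd process sd's."""
    return (phi(nsigmas - shift_sd * sqrt(n))
            - phi(-nsigmas - shift_sd * sqrt(n)))

b0 = beta_xbar(0.0, 5)  # no shift: complement of the 0.0027 tail
b2 = beta_xbar(2.0, 5)  # a two-sigma shift is detected quickly
```

As the curves in Figure 3 show, β decreases both with the size of the shift and with the sample size n.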
Figure 3: Operating characteristic curves.
Cusum charts
Cusum charts display how the group summary statistics deviate above or below the process center or target value, relative to the standard error of the summary statistics. They are useful to detect small and permanent variation in the mean of the process.
The basic cusum chart implemented in the qcc package is only available for continuous variables at the moment. Assuming a 'qcc' object has been created, we can simply input the following code:

> cusum(obj)
and the resulting cusum chart is shown in Figure 4. This displays on a single plot both the positive deviations and the negative deviations from the target, which by default is set to the center of group statistics. If a different target for the process is required, this value may be provided through the argument target when creating the object of class 'qcc'. Two further aspects may need to be controlled. The argument decision.int can be used to specify the number of standard errors of the summary statistics at which the cumulative sum displays an out of control signal; by default it is set at 5. The amount of shift in the process to be detected by the cusum chart, measured in standard errors of the summary statistics, is controlled by the argument se.shift, by default set at 1.
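The tabular cusum underlying such a chart accumulates standardized deviations above and below the target. The following Python sketch illustrates the textbook recursion; decision_int and se_shift mirror the roles of the qcc arguments described above, but the implementation details of the package may differ:

```python
def cusum(z, decision_int=5, se_shift=1):
    """Tabular cusum of standardized group statistics z. Returns the
    upper and lower sums and the indices beyond the boundaries."""
    k = se_shift / 2  # reference value: half the shift to detect
    hi, lo, out = [], [], []
    cp = cn = 0.0
    for i, zi in enumerate(z):
        cp = max(0.0, cp + zi - k)   # accumulates deviations above target
        cn = min(0.0, cn + zi + k)   # accumulates deviations below target
        hi.append(cp)
        lo.append(cn)
        if cp > decision_int or -cn > decision_int:
            out.append(i)
    return hi, lo, out

# in-control values followed by a sustained upward shift
z = [0.2, -0.1, 0.1, 0.0, 1.5, 1.8, 1.6, 1.7, 1.9, 1.6]
hi, lo, out = cusum(z)
```

Because small deviations accumulate, a persistent shift of about one standard error eventually crosses the decision boundary even though no single point is extreme, which is exactly the situation where a cusum chart beats a Shewhart chart.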
Figure 4: Cusum chart.
EWMA charts
Exponentially weighted moving average (EWMA) charts smooth a series of data based on a moving average with weights which decay exponentially. As for the cusum chart, this plot too can be used to detect small and permanent variation in the mean of the process.
The EWMA chart depends on the smoothing parameter λ, which controls the weighting scheme applied. By default it is set at 0.2, but it can be modified using the argument lambda. Assuming again that a 'qcc' object has been created, an EWMA chart can be obtained as follows:

> ewma(obj)

and the resulting graph is shown in Figure 5.
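The smoothing is the recursion z_i = λ x_i + (1 − λ) z_{i−1}, started at the target, with control limits that widen with i. The Python sketch below follows the usual textbook formulas and is assumed here for illustration rather than taken from the qcc source:

```python
from math import sqrt

def ewma(x, target, sigma_se, lam=0.2, nsigmas=3):
    """EWMA of the statistics x, with time-varying control limits.
    sigma_se is the standard error of each plotted statistic."""
    z, lcl, ucl = [], [], []
    prev = target
    for i in range(1, len(x) + 1):
        prev = lam * x[i - 1] + (1 - lam) * prev
        z.append(prev)
        # limit half-width converges to nsigmas*sigma_se*sqrt(lam/(2-lam))
        width = nsigmas * sigma_se * sqrt(
            lam / (2 - lam) * (1 - (1 - lam) ** (2 * i)))
        lcl.append(target - width)
        ucl.append(target + width)
    return z, lcl, ucl

z, lcl, ucl = ewma([10.0, 12.0, 9.0], target=10.0, sigma_se=1.0)
```

A small λ gives a long memory, so the chart reacts slowly to noise but reliably to persistent shifts; λ = 1 would reduce it to an ordinary Shewhart chart.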
Figure 5: EWMA chart.
Process capability analysis
Process capability indices for a characteristic of interest from a continuous process can be obtained through the function process.capability. This takes as mandatory arguments an object of class 'qcc', for an "xbar" or "xbar.one" type, and spec.limits, a vector specifying the lower (LSL) and upper (USL) specification limits. For example, using the previously created 'qcc' object, we may simply use:

> process.capability(obj,
+   spec.limits=c(73.95, 74.05))
Process Capability Analysis

Call:
process.capability(object = obj,
    spec.limits = c(73.95, 74.05))

Number of obs = 125    Target = 74
Center = 74.00118      LSL = 73.95
StdDev = 0.009887547   USL = 74.05

Capability indices:

      Value   2.5%  97.5%
Cp    1.686  1.476  1.895
Cp_l  1.725  1.539  1.912
Cp_u  1.646  1.467  1.825
Cp_k  1.646  1.433  1.859
Cpm   1.674  1.465  1.882

Exp<LSL 0%   Obs<LSL 0%
Exp>USL 0%   Obs>USL 0%
The graph shown in Figure 6 is a histogram of data values, with vertical dotted lines for the upper and the lower specification limits. The target is shown as a vertical dashed line, and its value can be provided using the target argument in the call; if missing, the value from the 'qcc' object is used if available, otherwise the target is set at the middle value between the specification limits.
The dashed curve is a normal density for a Gaussian distribution matching the observed mean and standard deviation. The latter is estimated using the within-group standard deviation, unless a value is specified in the call using the std.dev argument.

The capability indices reported on the graph are also printed at the console, together with confidence intervals computed at the level specified by the confidence.level argument (by default set at 0.95).
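The printed point estimates follow the usual definitions Cp = (USL − LSL)/6σ, Cp_l = (μ − LSL)/3σ, Cp_u = (USL − μ)/3σ and Cp_k = min(Cp_l, Cp_u), and can be reproduced directly from the numbers above. A Python sketch of these definitions (confidence intervals and Cpm omitted):

```python
def capability(center, stddev, lsl, usl):
    """Basic process capability indices (point estimates only)."""
    cp = (usl - lsl) / (6 * stddev)       # potential capability
    cp_l = (center - lsl) / (3 * stddev)  # capability vs lower limit
    cp_u = (usl - center) / (3 * stddev)  # capability vs upper limit
    return {"Cp": cp, "Cp_l": cp_l, "Cp_u": cp_u,
            "Cp_k": min(cp_l, cp_u)}

# values from the process.capability output above
idx = capability(center=74.00118, stddev=0.009887547,
                 lsl=73.95, usl=74.05)
```

Rounded to three decimals, these match the Cp = 1.686, Cp_l = 1.725, Cp_u = 1.646 and Cp_k = 1.646 shown in the listing.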
Figure 6: Process capability analysis.
A further display, called "process capability sixpack plots", is also available. This is a graphical summary formed by plotting on the same graphical device:
• an X chart
• an R or S chart (if sample sizes > 10)
• a run chart
• a histogram with specification limits
• a normal Q-Q plot
• a capability plot
As an example, we may input the following code:

> process.capability.sixpack(obj,
+   spec.limits=c(73.95, 74.05),
+   target=74.01)
Pareto charts and cause-and-effectdiagrams
A Pareto chart is a barplot where the categories are ordered by frequency in non-increasing order, with a line added to show their cumulative sum. Pareto charts are useful to identify those factors that have the greatest cumulative effect on the system, and thus screen out the less significant factors in an analysis.
A simple Pareto chart may be drawn using the function pareto.chart, which in its simpler form requires a vector of values. For example:

> defect <- c(80, 27, 66, 94, 33)
> names(defect) <-
+   c("price code", "schedule date",
+     "supplier code", "contact num.",
+     "part num.")
> pareto.chart(defect)
The function returns a table of descriptive statistics and the Pareto chart shown in Figure 7. More arguments can be provided to control axes labels and limits, the main title and bar colors (see help(pareto.chart)).
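The descriptive table behind a Pareto chart is just a sort followed by a cumulative sum. A Python sketch of that computation, using the defect counts above (an illustration of the idea, not the pareto.chart code):

```python
def pareto_table(counts):
    """Sort categories by decreasing frequency and add cumulative %."""
    total = sum(counts.values())
    items = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    table, cum = [], 0
    for name, freq in items:
        cum += freq
        table.append((name, freq, 100.0 * cum / total))
    return table

defect = {"price code": 80, "schedule date": 27, "supplier code": 66,
          "contact num.": 94, "part num.": 33}
tab = pareto_table(defect)
```

The first rows of the table identify the "vital few" categories; here the top two account for well over half of all defects.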
Figure 7: Pareto chart.
A cause-and-effect diagram, also called an Ishikawa diagram (or fish bone diagram), is used to associate multiple possible causes with a single effect. Thus, given a particular effect, the diagram is constructed to identify and organize possible causes for it.

A very basic implementation of this type of diagram is available in the qcc package. The function cause.and.effect requires at least two arguments: cause, a list of causes and branches providing descriptive labels, and effect, a string describing the effect. Other arguments are available, such as cex and font, which control, respectively, the character expansions and the fonts used for labeling branches, causes and the effect. The following example creates the diagram shown in Figure 8.
> cause.and.effect(
+   cause=list(Measurements=c("Microscopes",
+                             "Inspectors"),
+              Materials=c("Alloys",
+                          "Suppliers"),
+              Personnel=c("Supervisors",
+                          "Operators"),
+              Environment=c("Condensation",
+                            "Moisture"),
+              Methods=c("Brake", "Engager",
+                        "Angle"),
+              Machines=c("Speed", "Bits",
+                         "Sockets")),
+   effect="Surface Flaws")
Figure 8: Cause-and-effect diagram.
Tuning and extending the package
Options to control some aspects of the qcc package can be set using the function qcc.options. If this is called with no arguments it retrieves the list of available options and their current values, otherwise it sets the required option. From a user point of view, the most interesting options make it possible to set: the plotting character and the color used to highlight points beyond control limits or violating runs; the run length, i.e. the maximum value of a run before signalling a point as out of control; the background color used to draw the figure and the margin color; the character expansion used to draw the labels, the title and the statistics at the bottom of any plot. For details see help(qcc.options).
Another interesting feature is the possibility to easily extend the package by defining new control charts. Suppose we want to plot a p chart based on samples with different sizes. The resulting control chart will have non-constant control limits and thus may turn out to be difficult to read by some operators. One possible solution is to draw a standardized control chart. This can be accomplished by defining three new functions: one which computes and returns the group statistics to be charted (i.e. the z scores) and their center (zero in this case), another one which returns the within-group standard deviation (one in this case), and finally a function which returns the control limits. These functions need to be called stats.type, sd.type and limits.type, respectively, for a new chart of type "type". The following code may be used to define a standardized p chart:
stats.p.std <- function(data, sizes)
{
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data)/sum(sizes)
  z <- (data/sizes - pbar)/sqrt(pbar*(1-pbar)/sizes)
  list(statistics = z, center = 0)
}

sd.p.std <- function(data, sizes) return(1)

limits.p.std <- function(center, std.dev, sizes, conf)
{
  if (conf >= 1) { lcl <- -conf
                   ucl <- +conf }
  else
  { if (conf > 0 & conf < 1)
    { nsigmas <- qnorm(1 - (1 - conf)/2)
      lcl <- -nsigmas
      ucl <- +nsigmas }
    else stop("invalid conf argument") }
  limits <- matrix(c(lcl, ucl), ncol = 2)
  rownames(limits) <- rep("", length = nrow(limits))
  colnames(limits) <- c("LCL", "UCL")
  return(limits)
}
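The arithmetic performed by stats.p.std is easy to check independently. Below is a Python transcription of the same z-score formula (same numbers, different language; the data are made up for the example):

```python
from math import sqrt

def p_std_stats(data, sizes):
    """Standardized p chart statistics: z-scores of the group
    proportions around the pooled proportion pbar."""
    pbar = sum(data) / sum(sizes)
    z = [(d / n - pbar) / sqrt(pbar * (1 - pbar) / n)
         for d, n in zip(data, sizes)]
    return z, pbar

# numbers of nonconforming units in samples of unequal size
data = [12, 15, 8, 10]
sizes = [50, 50, 100, 100]
z, pbar = p_std_stats(data, sizes)
```

Because each proportion is standardized by its own standard error, the resulting chart has the constant limits ±nsigmas, which is the whole point of the extension.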
Then we may source the above code and obtain the control charts in Figure 9 as follows:

# set unequal sample sizes
> n <- c(rep(50,5), rep(100,5), rep(25,5))
# generate randomly the number of successes
> x <- rbinom(length(n), n, 0.2)
> par(mfrow=c(1,2))
# plot the control chart with variable limits
> qcc(x, type="p", size=n)
# plot the standardized control chart
> qcc(x, type="p.std", size=n)
Figure 9: A p chart with non-constant control limits (left panel) and the corresponding standardized control chart (right panel).
Summary
In this paper we briefly describe the qcc package. This provides functions to draw basic Shewhart quality control charts for continuous, attribute and count data; corresponding operating characteristic curves are also implemented. Other statistical quality tools available are Cusum and EWMA charts for continuous data, process capability analyses, Pareto charts and cause-and-effect diagrams.
References
Montgomery, D.C. (2000). Introduction to Statistical Quality Control, 4th ed. New York: John Wiley & Sons.

Wetherill, G.B. and Brown, D.W. (1991). Statistical Process Control. New York: Chapman & Hall.
Luca Scrucca
Dipartimento di Scienze Statistiche, Università degli Studi di Perugia, Italy
luca@stat.unipg.it
Least Squares Calculations in R
Timing different approaches

by Douglas Bates
Introduction
The S language encourages users to become programmers as they express new ideas and techniques of data analysis in S. Frequently the calculations in these techniques are expressed in terms of matrix operations. The subject of numerical linear algebra - how calculations with matrices can be carried out accurately and efficiently - is quite different from what many of us learn in linear algebra courses or in courses on linear statistical methods.

Numerical linear algebra is primarily based on decompositions of matrices, such as the LU decomposition, the QR decomposition, and the Cholesky decomposition, that are rarely discussed in linear algebra or statistics courses.
In this article we discuss one of the most common operations in statistical computing, calculating least squares estimates. We can write the problem mathematically as

    β̂ = arg min_β ‖y − Xβ‖²    (1)

where X is an n × p model matrix (p ≤ n), y is n-dimensional and β is p-dimensional. Most statistics texts state that the solution to (1) is

    β̂ = (X′X)⁻¹ X′y    (2)

when X has full column rank (i.e. the columns of X are linearly independent).

Indeed (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way. That is, they use code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.
General principles
A few general principles of numerical linear algebra are:

1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.

2. Use crossprod(X), not t(X) %*% X, to calculate X′X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X′y.

3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X′X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
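These principles can be demonstrated end-to-end on a tiny example: form X′X and X′y, factor X′X = LL′ by Cholesky, then solve with one forward and one back substitution. The following pure-Python sketch illustrates the algebra only, not any R internals, and is written for clarity rather than speed:

```python
def cholesky(a):
    """Lower-triangular L with A = L L' for symmetric positive definite A."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = (a[i][i] - s) ** 0.5
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

def lstsq(X, y):
    """Least squares via the normal equations and a Cholesky factor."""
    n, p = len(X), len(X[0])
    xtx = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(p)]
           for i in range(p)]                    # X'X
    xty = [sum(X[k][i] * y[k] for k in range(n)) for i in range(p)]  # X'y
    L = cholesky(xtx)
    # forward substitution: L w = X'y
    w = [0.0] * p
    for i in range(p):
        w[i] = (xty[i] - sum(L[i][k] * w[k] for k in range(i))) / L[i][i]
    # back substitution: L' beta = w
    beta = [0.0] * p
    for i in reversed(range(p)):
        beta[i] = (w[i] - sum(L[k][i] * beta[k]
                              for k in range(i + 1, p))) / L[i][i]
    return beta

# fit y = 1 + 2x, which the solver should recover exactly
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]
beta = lstsq(X, y)
```

No matrix is ever inverted: the factor L is reused for both triangular solves, which is the pattern the timings below exploit.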
Some timings on a large example
For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well:
> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850 712
> sys.gctime(naive.sol <- solve(t(mmm) %*%
+     mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00
(The function sys.gctime is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)

According to the principles above, it should be more effective to compute:
> sys.gctime(cpod.sol <- solve(crossprod(mmm),
+     crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE
Timing results will vary between computers, but in most cases the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X′X the naive way:
> sys.gctime(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00
because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.

However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result, the solution of (1) is performed using a general linear system solver based on an LU decomposition, when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly, but the code is awkward:
> sys.gctime(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sys.gctime(chol.sol <- backsolve(ch,
+     forwardsolve(ch, crossprod(mmm, y),
+         upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE
R News ISSN 1609-3631
Least squares calculations with Matrix classes
The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition.
> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(Mat.sol <- solve(crossprod(mmg),
+     crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE
Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object so that it can be reused without recalculation.
> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all.
> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(sparse.sol <- solve(crossprod(mm),
+     crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE
The model matrix mm has class "cscMatrix", indicating that it is a compressed, sparse, column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric, sparse, column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful, the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case, the decomposition is so fast that it is difficult to determine the difference in the solution times.
> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
Epilogue
Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS), such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or Atlas (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.
Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.
Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A′B are generally faster than those in AB.
Notice that this means that calculating A′B in R as t(A) %*% B, instead of crossprod(A, B), causes A to be transposed twice: once in the calculation of t(A) and a second time in the inner loop of the matrix product. The crossprod function does not do any transposition of matrices, yet another reason to use crossprod.
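A quick base-R check (the matrix is simulated, and timings are illustrative and machine-dependent) confirms that crossprod gives the same result as the explicit transpose-and-multiply form:

```r
set.seed(1)
A <- matrix(rnorm(400 * 300), 400, 300)

# Explicit form: materializes t(A), then multiplies
st1 <- system.time(s1 <- t(A) %*% A)
# crossprod form: no explicit transpose, exploits symmetry
st2 <- system.time(s2 <- crossprod(A))

all.equal(s1, s2)   # identical results; crossprod is typically faster
```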
Bibliography
J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).
D. M. Bates
U. of Wisconsin-Madison
bates@wisc.edu
Tools for interactively exploring R packages

by Jianhua Zhang and Robert Gentleman
Introduction
An R package has the required and perhaps additional subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer and they are intended for exploring packages, examples, and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.
Description
Our current implementation is in the form of tcltk widgets, and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcltk. Once both tkWidgets and widgetTools have been loaded into the working session the user has full access to these interactive tools.
pExplorer
pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from showing by the widget. The default behavior is to open the first package in the library of locally installed R (.libPaths()) with subdirectories/files named "Meta" and "latex" excluded if nothing is provided by a user.
Figure 1 shows the widget invoked by typing pExplorer("base").
The following tasks can be performed through the interface:
• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.
• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.
• Clicking the Browse button allows users to select a directory containing R packages to achieve the same results as typing in a new path.
• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.
Figure 1: A snapshot of the widget when pExplorer is called
• Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just being clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.
• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.
• Clicking the Try examples button invokes eExplorer, which is discussed next.
eExplorer
eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.
Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.
vExplorer
vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionalities provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of the vignettes of the package.
Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget as shown by Figure 4 appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:
• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.
• Clicking the View PDF button shows the PDF version of the vignette currently being explored.
• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in Results of Execution.
• Clicking the Export to R button exports the result of execution to the global environment of the current R session.
The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.
Bibliography
Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28-31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21-24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.
Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base")
Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer()
Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked
The survival package

by Thomas Lumley
This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).
Overview
Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.
In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.
Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).
Specifying survival data
In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.
In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.
> library(survival)
> data(veteran)  ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+
> ## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
+     diagtime*30 + time, status))
 [1] ( 210, 282 ]  ( 150, 561 ]
 [3] (  90, 318 ]  ( 270, 396 ]
 [9] ( 540, 854 ]  ( 180, 280+]
The first "Surv" object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.
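Underneath the printed form, a "Surv" object behaves like a two-column matrix of times and status indicators, which can be checked directly (the times below are invented for illustration):

```r
library(survival)
# first subject died at 5; second was censored at 10
s <- Surv(time = c(5, 10), event = c(1, 0))
m <- unclass(s)   # the underlying two-column matrix
m[, "status"]     # 1 for the event, 0 for the censored record
```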
One and two sample summaries
The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.
Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.
> data(veteran)
> plot(survfit(Surv(time, status) ~ trt, data = veteran),
+     xlab = "Years since randomisation", xscale = 365,
+     ylab = "% surviving", yscale = 100,
+     col = c("forestgreen", "blue"))
> survdiff(Surv(time, status) ~ trt, data = veteran)
Call:
survdiff(formula = Surv(time, status) ~ trt,
    data = veteran)

      N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394

      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928
R News ISSN 1609-3631
Vol 41 June 2004 27
Figure 1: Survival distributions for two lung cancer treatments (curves labelled Standard and New; x-axis "Years since randomisation", y-axis "% surviving")
Proportional hazards models
The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.
Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

    log h(t; z) = log h0(t) + βz,

equivalently, h(t; z) = h0(t) exp(βz).
Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.
A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.
> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
+     log(bili) + log(protime) + age + platelet,
+     data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
    log(bili) + log(protime) + age + platelet,
    data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999

             se(coef)     z       p
edtrt        0.300321  3.43 0.00061
log(bili)    0.097771  9.73 0.00000
log(protime) 1.031908  2.80 0.00520
age          0.008489  4.18 0.00003
platelet     0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312
The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.
> plot(survfit(Surv(time, status) ~ edtrt,
+     data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt +
+     ratetable(edtrt = edtrt, bili = bili,
+         platelet = platelet, age = age,
+         protime = protime),
+     data = pbc, subset = trt == -9,
+     ratetable = mayomodel, cohort = TRUE),
+     col = "purple")
The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.
Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.
Figure 2: Observed and predicted survival
The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The cox.zph function provides formal tests and graphical diagnostics for proportional hazards.
> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

> ## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])
Figure 3: Testing proportional hazards (smoothed estimate of Beta(t) for edtrt plotted against time)
There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.
Additional features
Computation for the Cox model extends straightforwardly to situations where z can vary over time, and to situations where an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.
Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
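As a sketch (the dataset and covariate are chosen here purely for illustration), a Weibull accelerated-failure-time model can be fitted with survreg:

```r
library(survival)
# Weibull AFT model on the lung cancer data shipped with the
# package, with age as the single covariate
fit <- survreg(Surv(time, status) ~ age,
               data = lung, dist = "weibull")
class(fit)   # "survreg"
```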
More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.
The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of followup.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
useR! 2004 - The R User Conference

by John Fox
McMaster University, Canada
More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.
The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.
A particularly interesting component of the program were keynote addresses by members of the R core team, aimed primarily at improving programming skills, and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:
• Paul Murrell on grid graphics and programming
• Martin Mächler on good programming practice
• Luke Tierney on namespaces and byte compilation
• Kurt Hornik on packaging, documentation, and testing
• Friedrich Leisch on S4 classes and methods
• Peter Dalgaard on language interfaces, using .Call and .External
• Douglas Bates on multilevel models in R
• Brian D. Ripley on data mining and related topics
Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.
An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.
The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19, and a conference trip in the afternoon of May 22 to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.
Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.
R Help Desk

Date and Time Classes in R
Gabor Grothendieck and Thomas Petzoldt
Introduction
R-1.9.0 and its contributed packages have a number of datetime (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:
• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).
• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).
• POSIX classes. POSIX classes refer to the two classes POSIXct, POSIXlt and their common super class POSIXt. These support times and dates, including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases and other applications that involve external communication with systems that also support dates, times, time zones and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970, GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
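The internal representations described above can be inspected directly with unclass(); a minimal base-R sketch (chron is omitted here since it is a contributed package):

```r
# Date: days since 1970-01-01
d <- as.Date("1970-01-02")
unclass(d)            # 1

# POSIXct: seconds since 1970-01-01 GMT
p <- as.POSIXct("1970-01-02", tz = "GMT")
as.numeric(p)         # 86400

# POSIXlt: a list of components such as day of month
lt <- as.POSIXlt(p, tz = "GMT")
lt$mday               # 2
```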
When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.
Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.
Other Applications
Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron respectively. It's possible to set Excel to use either origin, which is why the word "usually" was employed above.
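For example, using the Windows-Excel origin with the Date class (25569 is the Excel serial number for January 1, 1970):

```r
# Excel-on-Windows serial numbers: days (and fractions of days)
# since 1899-12-30; the second value is noon on the same day
x <- c(25569, 25569.5)
r <- as.Date("1899-12-30") + floor(x)
format(r)   # both map to "1970-01-01"
```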
SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).
Time Series
A common use of dates and datetimes are in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its) and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or datetime package to represent datetimes.
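For instance, a regular monthly series in the base ts class carries no actual dates, only a numeric time scale of years plus fractions of a year:

```r
# 24 monthly observations starting January 2004
z <- ts(rnorm(24), start = c(2004, 1), frequency = 12)
start(z)        # 2004 1
frequency(z)    # 12
time(z)[1:2]    # 2004.000 and 2004 + 1/12 (years + fraction)
```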
Input
By default, read.table will read in numeric data, such as 20040115, as numbers, and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹
# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")
# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
  as.is = 1:2)
chron(df$date, df$time)
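The same idiom can be tried without the data files by reading from a textConnection; the column values below are invented for illustration:

```r
# made-up yyyymmdd dates in a numeric column
tc <- textConnection("date value
20040115 3.1
20040116 3.3")
df <- read.table(tc, header = TRUE)
close(tc)
d <- as.Date(as.character(df$date), "%Y%m%d")
format(d)   # "2004-01-15" "2004-01-16"
```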
Avoiding Errors
The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.
With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:
# default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))
# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)
# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)
# if TRUE then year not displayed
options(chron.simplify = FALSE)
For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,
¹ This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.
orig <- chron("01/01/60")
x <- 0:9   # days since 01/01/60
# chron dates with default origin
orig + x
Regarding POSIX classes, the user should be aware of the following:
• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:
dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)
This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.
• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.
• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.
• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
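A small sketch of the "convert via character" advice, using "GMT" throughout since it can be assumed available:

```r
p <- as.POSIXct("2004-06-15 12:00:00", tz = "GMT")

# format in an explicit time zone rather than relying on defaults
s <- format(p, "%Y-%m-%d %H:%M:%S", tz = "GMT")

# convert via character, as recommended above
lt <- as.POSIXlt(s, tz = "GMT")
lt$hour   # 12, as expected; isdst is 0 for GMT
```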
Comparison Table
Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.
Bibliography
Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993): "Chronological objects for data analysis". In Proceedings of the 25th Symposium on the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001): "Date-time classes". R News, 1(2):8-11. URL http://CRAN.R-project.org/doc/Rnews/.
Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de
R News ISSN 1609-3631
now:
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
  POSIXct: Sys.time()

origin:
  Date:    structure(0, class="Date")
  chron:   chron(0)
  POSIXct: structure(0, class=c("POSIXt","POSIXct"))

x days since origin (GMT):
  Date:    structure(floor(x), class="Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class=c("POSIXt","POSIXct"))

diff in days:
  Date:    dt1-dt2
  chron:   dc1-dc2
  POSIXct: difftime(dp1, dp2, unit="day") [1]

time zone difference:
  POSIXct: dp-as.POSIXct(format(dp), tz="GMT")

compare:
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day:
  Date:    dt+1
  chron:   dc+1
  POSIXct: seq(dp0, length=2, by="DSTday")[2] [2]

previous day:
  Date:    dt-1
  chron:   dc-1
  POSIXct: seq(dp0, length=2, by="-1 DSTday")[2] [3]

x days since date:
  Date:    dt0+floor(x)
  chron:   dc0+x
  POSIXct: seq(dp0, length=2, by=paste(x, "DSTday"))[2] [4]

sequence:
  Date:    dts <- seq(dt0, length=10, by="day")
  chron:   dcs <- seq(dc0, length=10)
  POSIXct: dps <- seq(dp0, length=10, by="DSTday")

plot:
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz="GMT"), rnorm(10)) [5]

every 2nd week:
  Date:    seq(dt0, length=3, by="2 week")
  chron:   dc0+seq(0, length=3, by=14)
  POSIXct: seq(dp0, length=3, by="2 week")

first day of month:
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc)-month.day.year(dc)$day+1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month, which sorts:
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun=0):
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc)-3)%%7
  POSIXct: as.numeric(format(dp, "%w"))

day of year:
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year:
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means:
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output yyyy-mm-dd:
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output mm/dd/yy:
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output Sat Jul 1, 1970:
  Date:    format(dt, "%a %b %d, %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d, %Y")
  POSIXct: format(dp, "%a %b %d, %Y")

Input "1970-10-15":
  Date:    as.Date(z)
  chron:   chron(z, format="y-m-d")
  POSIXct: as.POSIXct(z)

Input "10/15/1970":
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz=""):
  to Date:
    from chron:   as.Date(dates(dc))
    from POSIXct: as.Date(format(dp))
  to chron:
    from Date:    chron(unclass(dt))
    from POSIXct: chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]
  to POSIXct:
    from Date:    as.POSIXct(format(dt))
    from chron:   as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz="GMT"):
  to Date:
    from chron:   as.Date(dates(dc))
    from POSIXct: as.Date(dp)
  to chron:
    from Date:    chron(unclass(dt))
    from POSIXct: as.chron(dp)
  to POSIXct:
    from Date:    as.POSIXct(format(dt), tz="GMT")
    from chron:   as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for the % codes.

[1] Can be reduced to dp1-dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp+24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp-24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp+24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz="GMT")).

Table 1: Comparison table.
Programmers' Niche: A simple class in S3 and S4
by Thomas Lumley
ROC curves
Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.
The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).
I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.
Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±∞ and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.
drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type="l")
}
It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.
There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.
drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  plot(mspec, sens, type="l")
}
One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
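To see the table trick in action, here is a small made-up check (my addition, not part of the article) that the cumulative sums from table(-T, D) give the true and false positive rates at each cutpoint:

```r
# Four test results: two controls (D=0) and two cases (D=1).
T <- c(0.1, 0.4, 0.35, 0.8)
D <- c(0, 0, 1, 1)
DD <- table(-T, D)                     # rows ordered by decreasing T
sens  <- cumsum(DD[, 2])/sum(DD[, 2])  # true positive rate per cutpoint
mspec <- cumsum(DD[, 1])/sum(DD[, 1])  # false positive rate per cutpoint
```

Here the case with the highest score (T=0.8) is picked up first, so sens rises before mspec, as it should for a test with some discriminating power.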
Classes and methods
Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.
The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens=sens, mspec=mspec, test=TT,
               call=sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}
The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.
plot.ROC <- function(x, type="b", null.line=TRUE,
                     xlab="1-Specificity", ylab="Sensitivity",
                     main=NULL, ...) {
  par(pty="s")
  plot(x$mspec, x$sens, type=type,
       xlab=xlab, ylab=ylab, ...)
  if(null.line)
    abline(0, 1, lty=3)
  if(is.null(main))
    main <- x$call
  title(main=main)
}
A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:
lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}
A method for identify allows the cutpoint to be found for any point on the curve:
identify.ROC <- function(x, labels=NULL, ...,
                         digits=1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels=labels, ...)
}
An AUC function to compute the area under the ROC curve is left as an exercise for the reader.
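One way the exercise could be approached is with the trapezoidal rule. The sketch below is my addition, not Lumley's; it assumes an S3 "ROC" object with the sens and mspec components built above, and pads the curve with the (0,0) corner:

```r
# Area under an S3 "ROC" object by the trapezoidal rule.
# Assumes x$mspec and x$sens are ordered as constructed by ROC() above,
# ending at (1,1); the (0,0) corner is prepended explicitly.
AUC <- function(x) {
  mspec <- c(0, x$mspec)
  sens  <- c(0, x$sens)
  n <- length(sens)
  sum(diff(mspec) * (sens[-1] + sens[-n])/2)
}
```

For a perfect classifier this returns 1, and for a chance-level curve along the diagonal it returns 0.5.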
Generalising the class
The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.
ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens=sens, mspec=mspec,
               test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}
The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.
ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type="response"))
  disease <- disease*weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously
 far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens=sens, mspec=mspec,
               test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}
S4 classes and methods
In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).
As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.
The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:
setClass("ROC",
  representation(sens="numeric", mspec="numeric",
                 test="numeric", call="call"),
  validity=function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })
The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new
function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:

ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  new("ROC", sens=sens, mspec=mspec,
      test=TT, call=sys.call())
}
Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:
setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })
The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.
The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.
setMethod("plot", c("ROC", "missing"),
  function(x, y, type="b", null.line=TRUE,
           xlab="1-Specificity",
           ylab="Sensitivity",
           main=NULL, ...) {
    par(pty="s")
    plot(x@mspec, x@sens, type=type,
         xlab=xlab, ylab=ylab, ...)
    if(null.line)
      abline(0, 1, lty=3)
    if(is.null(main))
      main <- x@call
    title(main=main)
  })
Creating a method for lines appears to work the same way:
setMethod("lines", "ROC",
  function(x, ...)
    lines(x@mspec, x@sens, ...)
  )
In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before
> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>
and after the setMethod call
> lines
standardGeneric for "lines" defined from
package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x
shows what happens. Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.
setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T, D) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type="response"))
    disease <- disease*weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far
 from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    new("ROC", sens=sens, mspec=mspec,
        test=TT, call=sys.call())
  })
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.
setGeneric("identify", signature=c("x"))

setMethod("identify", "ROC",
  function(x, labels=NULL, ...,
           digits=1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels=labels, ...)
  })
Discussion
Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.
This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R
by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc look for variables in their formulae.
User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

  Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has
  been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

  Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

  Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook -- see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.
New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values; thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace() parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name, and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package, which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.
• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

  The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

  It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

  ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma, ...) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

  If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

  If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

  The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split="" is optimized.

• subset() now allows a drop argument, which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

  - Renamed push/pop.viewport() to push/popViewport().

  - Added upViewport(), downViewport(), and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

  - Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.

  - Added vpList(), vpStack(), vpTree(), and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

  - Added vpPath() to allow specification of viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use.

    NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator :: in later versions).

  - Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

  - Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

      pushViewport(viewport(w=.5, h=.5,
                            name="A"))
      grid.rect()
      pushViewport(viewport(w=.5, h=.5,
                            name="B"))
      grid.rect(gp=gpar(col="grey"))
      upViewport(2)
      grid.rect(vp="A",
                gp=gpar(fill="red"))
      grid.rect(vp=vpPath("A", "B"),
                gp=gpar(fill="blue"))
  - Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

  - Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

    * grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

    * grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

    * the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob "pointed to" is modified.

    In addition, the following features are now possible with grobs:

    * grobs now save()/load() like any normal R object.

    * many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

    * there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

    * there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

    * there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

  - Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth", and "grobheight" unit objects, respectively).

  - Allowed viewports to turn off clipping altogether. Possible settings for viewport clip arg are now:

      "on"      = clip to the viewport (was TRUE)
      "inherit" = clip to whatever parent says (was FALSE)
      "off"     = turn off clipping

    Still accept logical values (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL', which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
R News ISSN 1609-3631
Vol. 4/1, June 2004 41
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.
• Vignette dependencies can now be checked and obtained via vignetteDepends().
• Option 'repositories' to list URLs for package repositories added.
• package.description() has been replaced by packageDescription().
• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.
• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.
• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.
• hsv2rgb and rgb2hsv are newly in the C API.
• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.
• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.
• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.
• print.coefmat() is defunct, replaced by printCoefmat().
• codes() and codes<-() are defunct.
• anovalist.lm (replaced in 1.2.0) is now defunct.
• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.
• print.atomic() is defunct.
• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).
• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.
• La.eigen() is deprecated; now eigen() uses LAPACK by default.
• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(x, deriv=2) and psigamma(x, deriv=3).
• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.
• package.contents() and package.description() have been deprecated.
• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.
The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.
The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.
• configure no longer lists bzip2 and PCRE as 'additional capabilities' as all builds of R have had them since 1.7.0.
• --with-blas=goto to use K. Goto's optimized BLAS will now work.
The above lists only new features: see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
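To give a flavour of the new grob/gTree machinery summarized above, here is a minimal sketch; the grob and viewport names ("parent", "innervp", "box") are invented for the example and this is not code from the NEWS file:

```r
library(grid)

## A gTree whose childrenvp viewport is pushed and then navigated up,
## so that the child rect can find it via a vpPath in its vp slot.
gt <- gTree(
  name       = "parent",
  childrenvp = viewport(width = 0.5, height = 0.5, name = "innervp"),
  children   = gList(rectGrob(name = "box", vp = vpPath("innervp")))
)

## A gPath names the child of (a child of ...) a gTree;
## editGrob() (the *Grob() counterpart of grid.edit()) uses it to
## modify a grob off-screen.
gt2 <- editGrob(gt, gPath("box"), gp = gpar(fill = "grey"))

grid.newpage()
grid.draw(gt2)
```

getGrob(gt2, gPath("box")) would extract the modified child; grid.add() and grid.remove() perform the corresponding operations on grobs that have already been drawn.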
Changes on CRAN
by Kurt Hornik
New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler
BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth
BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code
DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López
GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer and Korbinian Strimmer
HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel
Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal
MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn
MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk
NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee
R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges
RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath and David Bauer
RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig
SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton
SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe
SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi
assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke
bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen
betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas
circular Circular Statistics, from "Topics in Circular Statistics" (2001) S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli
crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer
debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington
distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla
dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg
energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. Energy-statistics concept based on a generalization of Newton's potential energy is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely
evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson
evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. By S original (EVIS) by Alexander McNeil; R port by Alec Stephenson
fBasics fBasics package from Rmetrics — Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file
fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file
fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file
fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file
fortunes R Fortunes. By Achim Zeileis, fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright
gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team
gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley
gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding
hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham
haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid
hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor
httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski
impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu
kernlab Kernel-based machine learning methods including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik
klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges
knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey
kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko
mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa
mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou
mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-versus-within chain approach of Gelman and Rubin. By Gregory R. Warnes
mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner
mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski
multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon
mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by debug package. By Mark V. Bravington
mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath
pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger
qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca
race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari
rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman)
ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data.frames. By Jens Oehlschlägel
regress Functions to fit Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh
rgl 3D visualization device (OpenGL). By Daniel Adler
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini
rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers
rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov
sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis
seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld
setRNG Set reproducible random number generator in R and S. By Paul Gilbert
sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others
skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson
spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford
spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth
spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha
supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler
svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie
treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto
urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff
urn Functions for sampling without replacement (Simulated Urns). By Micah Altman
verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou
zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck
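As a small taste of one of these contributed packages, the following sketch builds and manipulates an irregular series with zoo (assuming the package is installed; the dates and values here are invented for illustration):

```r
library(zoo)

## An irregularly spaced daily series: the index is totally ordered,
## but the observations need not be equally spaced.
idx <- as.Date(c("2004-06-01", "2004-06-02", "2004-06-07", "2004-06-15"))
z   <- zoo(c(1.2, 1.5, 1.1, 1.8), idx)

## Standard time-series operations respect the irregular index.
window(z, start = as.Date("2004-06-02"))  # subset by time
merge(z, lag(z, k = -1))                  # align by index, not by position
```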
Other changes
• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News
by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)
• Emanuele De Rinaldis (Italy)
• Norm Phillips (Canada)
New supporting institutions
• Breast Center at Baylor College of Medicine, Houston, Texas, USA
• Dana-Farber Cancer Institute, Boston, USA
• Department of Biostatistics, Vanderbilt University School of Medicine, USA
• Department of Statistics & Actuarial Science, University of Iowa, USA
• Division of Biostatistics, University of California, Berkeley, USA
• Norwegian Institute of Marine Research, Bergen, Norway
• TERRA Lab, University of Regina - Department of Geography, Canada
• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232
USA
Editorial Board:
Douglas Bates and Paul Murrell
Editor Programmer's Niche:
Bill Venables
Editor Help Desk:
Uwe Ligges
Email of editors and editorial board:
firstname.lastname@R-project.org
R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).
R Project Homepage:
http://www.R-project.org
This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
The Decision To Use R
A Consulting Business Perspective
by Marc Schwartz
Preface
The use of R has grown markedly during the past few years, which is a testament to the dedication and commitment of R Core, to the contributions of the growing useR base, and to the variety of disciplines in which R is being used. The recent useR meeting in Vienna was another strong indicator of growth and was an outstanding success.
In addition to its use in academic areas, R is increasingly being used in non-academic environments, such as government agencies and various businesses, ranging from small firms to large multi-national corporations, in a variety of industries.
In a sense, this diversification in the user base is a visible demonstration of the greater comfort that business-based decision makers have regarding the use of Open-Source operating systems, such as Linux, and applications, such as R. Whether these tools and associated support programs are purchased (such as commercial Linux distributions) or are obtained through online collaborative communities, businesses are increasingly recognizing the value of Open-Source software.
In this article, I will describe the use of R (and other Open-Source tools) in a healthcare-based consulting firm, and the value realized by us and, importantly, by our clients through the use of these tools in our company.
Some Background on the Company
MedAnalytics is a healthcare consulting firm located in a suburb of Minneapolis, Minnesota. The company was founded and legally incorporated in August of 2000 to provide a variety of analytic and related services to two principal constituencies. The first are healthcare providers (hospitals and physicians) engaged in quality improvement programs. These clients will typically have existing clinical (as opposed to administrative) databases and wish to use these to better understand the interactions between patient characteristics, clinical processes and outcomes. The second are drug and medical device companies engaged in late-phase and post-market clinical studies. For these clients, MedAnalytics provides services related to protocol and case report form design and development, interim and final analyses, and independent data safety monitoring board appointments. In these cases, MedAnalytics will typically partner with healthcare technology and service companies to develop web-based electronic data capture, study management and reporting. We have worked in many medical specialties, including cardiac surgery, cardiology, ophthalmology, orthopaedics, gastroenterology and oncology.
Prior to founding MedAnalytics, I had 5 years of clinical experience in cardiac surgery and cardiology, and over 12 years with a medical database software company, where I was principally responsible for the management and analysis of several large multi-site national and regional sub-specialty databases that were sponsored by professional medical societies. My duties included contributing to core dataset design, the generation of periodic aggregate and comparative analyses, the development of multi-variable risk models and contributing to peer-reviewed papers. In that position, the majority of my analyses were performed using SAS (Base, Stat and Graph), though S-PLUS (then from StatSci) was evaluated in the mid-90's and used for a period of time with some smaller datasets.
Evaluating The Tools of the Trade - A Value Based Equation
The founder of a small business must determine how he or she is going to finance initial purchases (of which there are many) and the ongoing operating costs of the company until such time as revenues equal (and we hope ultimately exceed) those costs. If financial decisions are not made wisely, a company can find itself rapidly "burning cash" and risking its viability. (It is for this reason that most small businesses fail within the first two or three years.)
Because there are market-driven thresholds of what prospective clients are willing to pay for services, the cost of the tools used to provide client services must be carefully considered. If your costs are too high, you will not recover them via client billings and (despite attitudes that prevailed during the dotcom bubble) you can't make up per-item losses simply by increasing business volume.
An important part of my company's infrastructure would be the applications software that I used. However, it would be foolish for me to base this decision solely on cost. I must consider the "value", which takes into account four key characteristics: cost, time, quality and client service. Like any business, mine would strive to minimize cost and time, while maximizing quality and client service. However, there is a constant tension among these factors, and the key would be to find a reasonable balance among them, while meeting (and we hope exceeding) the needs and expectations of our clients.
First and foremost, a business must present itself and act in a responsible and professional manner, and meet legal and regulatory obligations. There would be no value in the software if my clients and I could not trust the results I obtained.
In the case of professional tools, such as software, being used internally, one typically has many alternatives to consider. You must first be sure that the potential choices are validated against a list of required functional and quality benchmarks driven by the needs of your prospective clients. You must, of course, be sure that the choices fit within your operating budget. In the specific case of analytic software applications, the list to consider and the price points were quite varied. At the high end of the price range are applications like SAS and SPSS, which have become prohibitively expensive for small businesses that require commercial licenses. Despite my past lengthy experience with SAS, it would not have been financially practical to consider using it. A single-user desktop license for the three key components (Base, Stat and Graph) would have cost about (US)$5,000 per year, well beyond my budget.
Over a period of several months through late 2000, I evaluated demonstration versions of several statistical applications. At the time I was using Windows XP and thus considered Windows versions of any applications. I narrowed the list to S-PLUS and Stata. Because of my prior experience with both SAS and S-PLUS, I was comfortable with command line applications, as opposed to those with menu-driven "point and click" GUIs. Both S-PLUS and Stata would meet my functional requirements regarding analytic methods and graphics. Because both provided for programming within the language, I could reasonably expect to introduce some productivity enhancements and reduce my costs by more efficient use of my time.
"An Introduction To R"
About then (late 2000), I became aware of R through contacts with other users and through posts to various online forums. I had not yet decided on S-PLUS or Stata, so I downloaded and installed R to gain some experience with it. This was circa R version 1.2.x. I began to experiment with R, reading the available online documentation, perusing posts to the r-help list and using some of the books on S-PLUS that I had previously purchased, most notably the first edition (1994) of Venables and Ripley's MASS.
Although I was not yet specifically aware of the R Community and the general culture of the GPL and Open-Source application development, the basic premise and philosophy was not foreign to me. I was quite comfortable with online support mechanisms, having used Usenet, online "knowledge bases" and other online community forums in the past. I had found that, more often than not, these provided more competent and expedient support than did the telephone support from a vendor. I also began to realize that the key factor in the success of my business would be superior client service, and not expensive proprietary technology. My experience and my needs helped me accept the basic notion of Open-Source development and online community-based support.
From my earliest exposure to R, it was evident to me that this application had evolved substantially in a relatively short period of time (as it has continued to do) under the leadership of some of the best minds in the business. It also became rapidly evident that it would fit my functional requirements, and my comfort level with it increased substantially over a period of months. Of course, my prior experience with S-PLUS certainly helped me to learn R rapidly (although there are important differences between S-PLUS and R, as described in the R FAQ).
I therefore postponed any purchase decisions regarding S-PLUS or Stata, and made a commitment to using R.
I began to develop and implement various functions that I would need in providing certain analytic and reporting capabilities for clients. Over a period of several weeks, I created a package of functions for internal use that substantially reduced the time it takes to import and structure data from certain clients (typically provided as Excel workbooks or as ASCII files) and then generate a large number of tables and graphs. These types of reports typically use standardized datasets and are created reasonably frequently. Moreover, I could easily enhance and introduce new components to the reports as required by the changing needs of these clients. Over time, I introduced further efficiencies and was able to take a process that initially had taken several days quite literally down to a few hours. As a result, I was able to spend less time on basic data manipulation and more time on the review, interpretation and description of key findings.
In this way, I could reduce the total time required for a standardized project, reducing my costs and also increasing the value to the client by being able to spend more time on the higher-value services to the client, instead of on internal processes. By enhancing the quality of the product and, ultimately, passing my cost reductions on to the client, I am able to increase substantially the value of my services.
A Continued Evolution
In my more than three years of using R, I have continued to implement other internal process changes. Most notably, I have moved day-to-day computing operations from Windows XP to Linux. Needless to say, R's support of multiple operating systems made
R News ISSN 1609-3631
Vol. 4/1, June 2004
this transition essentially transparent for the application software. I also started using ESS as my primary working environment for R and enjoyed its higher level of support for R, compared to the editor that I had used under Windows, which simply provided syntax highlighting.
My transformation to Linux occurred gradually during the later Red Hat (RH) 8.0 beta releases, as I became more comfortable with Linux and with the process of replacing Windows functionality that had become "second nature" over my years of MS operating system use (dating back to the original IBM PC/DOS days of 1981). For example, I needed to replace my practice in R of generating WMF files to be incorporated into Word and PowerPoint documents. At first I switched to generating EPS files for incorporation into OpenOffice.org's (OO.org) Writer and Impress documents, but now I primarily use LaTeX and the "seminar" slide creation package, frequently in conjunction with Sweave. Rather than using PowerPoint or Impress for presentations, I create landscape oriented PDF files and display them with Adobe Acrobat Reader, which supports a full screen presentation mode.
I also use OO.org's Writer for most documents that I need to exchange with clients. For the final version, I export a PDF file from Writer, knowing that clients will have access to Adobe's Reader. Because OO.org's Writer supports MS Word input formats, it provides for interchange with clients when jointly editable documents (or tracking of changes) are required. When clients send me Excel worksheets, I use OO.org's Calc, which supports Excel data formats (and in fact exports much better CSV files from these worksheets than does Excel). I have replaced Outlook with Ximian's Evolution for my e-mail, contact list and calendar management, including the coordination of electronic meeting requests with business partners and clients and, of course, synching my Sony Clié (which Sony recently announced will soon be deprecated in favor of smarter cell phones).
With respect to Linux, I am now using Fedora Core (FC) 2, having moved from RH 8.0 to RH 9 to FC 1. I have, in time, replaced all Windows-based applications with Linux-based alternatives, with one exception, and that is some accounting software that I need in order to exchange electronic files with my accountant. In time, I suspect that this final hurdle will be removed and, with it, the need to use Windows at all. I should make it clear that this is not a goal for philosophic reasons, but for business reasons. I have found that the stability and functionality of the Linux environment has enabled me to engage in a variety of technical and business process related activities that would either be problematic or prohibitively expensive under Windows. This enables me to realize additional productivity gains that further reduce operating costs.
Giving Back
In closing, I would like to make a few comments about the R Community.
I have felt fortunate to be a useR, and I also have felt strongly that I should give back to the Community - recognizing that, ultimately, through MedAnalytics and the use of R, I am making a living using software that has been made available through voluntary efforts and which I did not need to purchase.
To that end, over the past three years, I have committed to contributing and giving back to the Community in several ways. First and foremost, by responding to queries on the r-help list and, later, on the r-devel list. There has been, and will continue to be, a continuous stream of new useRs who need basic assistance. As current useRs progress in experience, each of us can contribute to the Community by assisting new useRs. In this way, a cycle develops in which the students evolve into teachers.
Secondly, in keeping with the philosophy that "useRs become developers", as I have found the need for particular features and have developed specific solutions, I have made them available via CRAN or the lists. Two functions in particular have been available in the "gregmisc" package since mid-2002, thanks to Greg Warnes. These are the barplot2() and CrossTable() functions. As time permits, I plan to enhance, most notably, the CrossTable() function with a variety of non-parametric correlation measures, and will make those available when ready.
Thirdly, in a different area, I have committed to financially contributing back to the Community by supporting the R Foundation and the recent useR 2004 meeting. This is distinct from volunteering my time, and recognizes that ultimately even a voluntary Community such as this one has some operational costs to cover. When, in 2000-2001, I was contemplating the purchase of S-PLUS and Stata, these would have cost approximately (US)$2,500 and (US)$1,200, respectively. In effect, I have committed those dollars to supporting the R Foundation, at the Benefactor level, over the coming years.
I would challenge and urge other commercial useRs to contribute to the R Foundation at whatever level makes sense for your organization.
Thanks
It has been my pleasure and privilege to be a member of this Community over the past three years. It has been nothing short of phenomenal to experience the growth of this Community during that time.
I would like to thank Doug Bates for his kind invitation to contribute this article, the idea for which arose in a discussion we had during the recent useR meeting in Vienna. I would also like to extend my thanks to R Core and to the R Community at large
for their continued drive to create and support this wonderful vehicle for international collaboration.
Marc Schwartz
MedAnalytics, Inc., Minneapolis, Minnesota, USA
MSchwartz@MedAnalytics.com
The ade4 package - I: One-table methods

by Daniel Chessel, Anne B. Dufour and Jean Thioulouse
Introduction
This paper is a short summary of the main classes defined in the ade4 package for one-table analysis methods (e.g., principal component analysis). Other papers will detail the classes defined in ade4 for two-tables coupling methods (such as canonical correspondence analysis, redundancy analysis, and co-inertia analysis), for methods dealing with K-tables analysis (i.e., three-way tables), and for graphical methods.
This package is a complete rewrite of the ADE4 software (Thioulouse et al. (1997), http://pbil.univ-lyon1.fr/ADE-4/) for the R environment. It contains Data Analysis functions to analyse Ecological and Environmental data in the framework of Euclidean Exploratory methods, hence the name ade4 (i.e., 4 is not a version number, but means that there are four E in the acronym).
The ade4 package is available in CRAN, but it can also be used directly online, thanks to the Rweb system (http://pbil.univ-lyon1.fr/Rweb/). This possibility is being used to provide multivariate analysis services in the field of bioinformatics, particularly for sequence and genome structure analysis at the PBIL (http://pbil.univ-lyon1.fr/). An example of these services is the automated analysis of the codon usage of a set of DNA sequences by correspondence analysis (http://pbil.univ-lyon1.fr/mva/coa.php).
The duality diagram class
The basic tool in ade4 is the duality diagram (Escoufier, 1987). A duality diagram is simply a list that contains a triplet (X, Q, D):
- X is a table with n rows and p columns, considered as p points in Rn (column vectors) or n points in Rp (row vectors);

- Q is a p × p diagonal matrix containing the weights of the p columns of X, and used as a scalar product in Rp (Q is stored under the form of a vector of length p);

- D is an n × n diagonal matrix containing the weights of the n rows of X, and used as a scalar product in Rn (D is stored under the form of a vector of length n).
For example, if X is a table containing normalized quantitative variables, if Q is the identity matrix Ip, and if D is equal to (1/n) In, the triplet corresponds to a principal component analysis on the correlation matrix (normed PCA). Each basic method corresponds to a particular triplet (see Table 1), but more complex methods can also be represented by their duality diagram.
Functions   Analyses                                        Notes
dudi.pca    principal component                             1
dudi.coa    correspondence                                  2
dudi.acm    multiple correspondence                         3
dudi.fca    fuzzy correspondence                            4
dudi.mix    analysis of a mixture of numeric and factors    5
dudi.nsc    non symmetric correspondence                    6
dudi.dec    decentered correspondence                       7

Table 1: The dudi functions. 1: Principal component analysis, same as prcomp/princomp. 2: Correspondence analysis, Greenacre (1984). 3: Multiple correspondence analysis, Tenenhaus and Young (1985). 4: Fuzzy correspondence analysis, Chevenet et al. (1994). 5: Analysis of a mixture of numeric variables and factors, Hill and Smith (1976), Kiers (1994). 6: Non symmetric correspondence analysis, Kroonenberg and Lombardo (1999). 7: Decentered correspondence analysis, Dolédec et al. (1995).
The singular value decomposition of a triplet gives principal axes, principal components, and row and column coordinates, which are added to the triplet for later use.
We can use, for example, a well-known dataset from the base package:
> data(USArrests)
> pca1 <- dudi.pca(USArrests, scannf = FALSE, nf = 3)
scannf = FALSE means that the number of principal components that will be used to compute row and column coordinates should not be asked interactively to the user, but taken as the value of argument nf (by default, nf = 2). Other parameters allow one to choose between centered, normed or raw PCA (default is centered and normed), and to set arbitrary row and column weights. The pca1 object is a duality diagram, i.e., a list made of several vectors and dataframes:
> pca1
Duality diagramm
class: pca dudi
$call: dudi.pca(df = USArrests, scannf = FALSE, nf = 3)
$nf: 3 axis-components saved
$rank: 4
eigen values: 2.48 0.9898 0.3566 0.1734
  vector length mode    content
1 $cw    4      numeric column weights
2 $lw    50     numeric row weights
3 $eig   4      numeric eigen values
  data.frame nrow ncol content
1 $tab       50   4    modified array
2 $li        50   3    row coordinates
3 $l1        50   3    row normed scores
4 $co        4    3    column coordinates
5 $c1        4    3    column normed scores
other elements: cent norm
pca1$lw and pca1$cw are the row and column weights that define the duality diagram, together with the data table (pca1$tab). pca1$eig contains the eigenvalues. The row and column coordinates are stored in pca1$li and pca1$co. The variance of these coordinates is equal to the corresponding eigenvalue, and unit variance coordinates are stored in pca1$l1 and pca1$c1 (this is useful to draw biplots).
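This variance property can be checked with base R alone. The following is a minimal sketch that rebuilds the normed PCA triplet by hand (it does not use ade4 itself, and it assumes ade4's convention of uniform row weights 1/n and divisor-n standard deviations):

```r
# Normed PCA triplet (X, Ip, (1/n) In) by hand: the weighted variance of each
# row coordinate should equal the corresponding eigenvalue.
data(USArrests)
n <- nrow(USArrests)
m <- colMeans(USArrests)
s <- apply(USArrests, 2, sd) * sqrt((n - 1) / n)   # sd with divisor n
X <- sweep(sweep(as.matrix(USArrests), 2, m), 2, s, "/")
e <- eigen(cor(USArrests))
li <- X %*% e$vectors          # row coordinates, the analogue of pca1$li
v <- colSums(li^2) / n         # variances weighted by the row weights 1/n
all.equal(as.numeric(v), e$values)
round(e$values, 4)             # close to the 2.48 0.9898 0.3566 0.1734 shown above
```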
The general optimization theorems of data analysis take particular meanings for each type of analysis, and graphical functions are proposed to draw the canonical graphs, i.e., the graphical expression corresponding to the mathematical property of the object. For example, the normed PCA of a quantitative variable table gives a score that maximizes the sum of squared correlations with variables. The PCA canonical graph is therefore a graph showing how the sum of squared correlations is maximized for the variables of the data set. On the USArrests example, we obtain the following graphs:
> score(pca1)
> s.corcircle(pca1$co)
Figure 1: One dimensional canonical graph for a normed PCA. Variables are displayed as a function of row scores, to get a picture of the maximization of the sum of squared correlations.
Figure 2: Two dimensional canonical graph for a normed PCA (correlation circle): the direction and length of arrows show the quality of the correlation between variables, and between variables and principal components.
The scatter function draws the biplot of the PCA (i.e., a graph with both rows and columns superimposed):
> scatter(pca1)
Figure 3: The PCA biplot. Variables are symbolized by arrows, and they are superimposed to the individuals display. The scale of the graph is given by a grid, whose size is given in the upper right corner. Here, the length of the side of grid squares is equal to one. The eigenvalues bar chart is drawn in the upper left corner, with the two black bars corresponding to the two axes used to draw the biplot. Grey bars correspond to axes that were kept in the analysis, but not used to draw the graph.
Separate factor maps can be drawn with the s.corcircle (see Figure 2) and s.label functions:
> s.label(pca1$li)
Figure 4: Individuals factor map in a normed PCA.
Distance matrices
A duality diagram can also come from a distance matrix, if this matrix is Euclidean (i.e., if the distances in the matrix are the distances between some points in a Euclidean space). The ade4 package contains functions to compute dissimilarity matrices (dist.binary for binary data, and dist.prop for frequency data), test whether they are Euclidean (Gower and Legendre, 1986), and make them Euclidean (quasieuclid, lingoes, Lingoes (1971), cailliez, Cailliez (1983)). These functions are useful to ecologists who use the works of Legendre and Anderson (1999) and Legendre and Legendre (1998).
The Yanomama data set (Manly, 1991) contains three distance matrices between 19 villages of Yanomama Indians. The dudi.pco function can be used to compute a principal coordinates analysis (PCO, Gower (1966)), that gives a Euclidean representation of the 19 villages. This Euclidean representation allows one to compare the geographical, genetic and anthropometric distances.
> data(yanomama)
> gen <- quasieuclid(as.dist(yanomama$gen))
> geo <- quasieuclid(as.dist(yanomama$geo))
> ant <- quasieuclid(as.dist(yanomama$ant))
> geo1 <- dudi.pco(geo, scann = FALSE, nf = 3)
> gen1 <- dudi.pco(gen, scann = FALSE, nf = 3)
> ant1 <- dudi.pco(ant, scann = FALSE, nf = 3)
> par(mfrow = c(2, 2))
> scatter(geo1)
> scatter(gen1)
> scatter(ant1, posi = "bottom")
Figure 5: Comparison of the PCO analysis of the three distance matrices of the Yanomama data set: geographic, genetic and anthropometric distances.
Taking into account groups of individuals
In sites × species tables, rows correspond to sites, columns correspond to species, and the values are the number of individuals of species j found at site i. These tables can have many columns, and cannot be used in a discriminant analysis. In this case, between-class analyses (between function) are a better alternative, and they can be used with any duality diagram. The between-class analysis of triplet (X, Q, D) for a given factor f is the analysis of the triplet (G, Q, Dw), where G is the table of the means of table X for the groups defined by f, and Dw is the diagonal matrix of group weights. For example, a between-class correspondence analysis (BCA) is very simply obtained after a correspondence analysis (CA):
> data(meaudret)
> coa1 <- dudi.coa(meaudret$fau, scannf = FALSE)
> bet1 <- between(coa1, meaudret$plan$sta,
+     scannf = FALSE)
> plot(bet1)
The meaudret$fau dataframe is an ecological table with 24 rows corresponding to six sampling sites along a small French stream (the Meaudret). These six sampling sites were sampled four times (spring, summer, winter and autumn), hence the 24 rows. The 13 columns correspond to 13 ephemeroptera species. The CA of this data table is done with the dudi.coa function, giving the coa1 duality diagram. The corresponding between-class analysis is done with the between function, considering the sites as classes (meaudret$plan$sta is a factor defining the classes). Therefore, this is a between-sites analysis, whose aim is to discriminate the sites, given the distribution of ephemeroptera species. This gives the bet1 duality diagram, and Figure 6 shows the graph obtained by plotting this object.
Figure 6: BCA plot. This is a composed plot, made of: 1- the species canonical weights (top left), 2- the species scores (middle left), 3- the eigenvalues bar chart (bottom left), 4- the plot of plain CA axes projected into BCA (bottom center), 5- the gravity centers of classes (bottom right), 6- the projection of the rows with ellipses and gravity center of classes (main graph).
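The construction of the table G of group means and the group weights Dw described above can be sketched in base R. This is a toy illustration with simulated data, independent of ade4; the names X, f, G and Dw follow the notation in the text:

```r
# Toy between-class step: from a table X and a factor f, build the table G of
# class means and the class weights Dw (here row weights D are uniform).
set.seed(1)
X <- matrix(rnorm(24 * 3), nrow = 24)    # 24 rows, like the Meaudret table
f <- gl(6, 4)                            # 6 classes (e.g. sites), 4 rows each
D <- rep(1 / nrow(X), nrow(X))           # uniform row weights
G <- apply(X, 2, function(col) tapply(col, f, mean))   # 6 x 3 table of class means
Dw <- tapply(D, f, sum)                  # class weights: each is 4/24 = 1/6 here
```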
Like between-class analyses, linear discriminant analysis (discrimin function) can be extended to any duality diagram (G, (Xt D X)^-, Dw), where (Xt D X)^- is a generalized inverse. This gives, for example, a correspondence discriminant analysis (Perrière et al. (1996), Perrière and Thioulouse (2002)), that can be computed with the discrimin.coa function.
Opposite to between-class analyses are within-class analyses, corresponding to diagrams (X - Y Dw^-1 Yt D X, Q, D) (within functions). These analyses extend to any type of variables the Multiple Group Principal Component Analysis (MGPCA, Thorpe (1983a), Thorpe (1983b), Thorpe and Leamy (1983)). Furthermore, the within.coa function introduces the double within-class correspondence analysis (also named internal correspondence analysis, Cazes et al. (1988)).
Permutation tests
Permutation tests (also called Monte-Carlo tests, or randomization tests) can be used to assess the statistical significance of between-class analyses. Many permutation tests are available in the ade4 package, for example mantel.randtest, procuste.randtest, randtest.between, randtest.coinertia, RV.rtest, randtest.discrimin, and several of these tests are available both in R (mantel.rtest) and in C (mantel.randtest) programming language. The R
version allows one to see how computations are performed, and to write easily other tests, while the C version is needed for performance reasons.
The statistical significance of the BCA can be evaluated with the randtest.between function. By default, 999 permutations are simulated, and the resulting object (test1) can be plotted (Figure 7). The p-value is highly significant, which confirms the existence of differences between sampling sites. The plot shows that the observed value is very far to the right of the histogram of simulated values.
> test1 <- randtest.between(bet1)
> test1
Monte-Carlo test
Observation: 0.4292
Call: randtest.between(xtest = bet1)
Based on 999 replicates
Simulated p-value: 0.001
> plot(test1, main = "Between class inertia")
Figure 7: Histogram of the 999 simulated values of the randomization test of the bet1 BCA. The observed value is given by the vertical line, at the right of the histogram.
Conclusion
We have described only the most basic functions of the ade4 package, considering only the simplest one-table data analysis methods. Many other dudi methods are available in ade4, for example multiple correspondence analysis (dudi.acm), fuzzy correspondence analysis (dudi.fca), analysis of a mixture of numeric variables and factors (dudi.mix), non symmetric correspondence analysis (dudi.nsc), and decentered correspondence analysis (dudi.dec).
We are preparing a second paper dealing with two-tables coupling methods, among which canonical correspondence analysis and redundancy analysis are the most frequently used in ecology (Legendre and Legendre, 1998). The ade4 package proposes an alternative to these methods, based on the co-inertia criterion (Dray et al., 2003).
The third category of data analysis methods available in ade4 are K-tables analysis methods, which try to extract the stable part in a series of tables. These methods come from the STATIS strategy, Lavit et al. (1994) (statis and pta functions), or from the multiple coinertia strategy (mcoa function). The mfa and foucart functions perform two variants of K-tables analysis, and the STATICO method (function ktab.match2ktabs, Thioulouse et al. (2004)) allows one to extract the stable part of species-environment relationships, in time or in space.
Methods taking into account spatial constraints (multispati function) and phylogenetic constraints (phylog function) are under development.
Bibliography
Cailliez, F. (1983). The analytical solution of the additive constant problem. Psychometrika, 48:305–310.

Cazes, P., Chessel, D. and Dolédec, S. (1988). L'analyse des correspondances internes d'un tableau partitionné : son usage en hydrobiologie. Revue de Statistique Appliquée, 36:39–54.

Chevenet, F., Dolédec, S. and Chessel, D. (1994). A fuzzy coding approach for the analysis of long-term ecological data. Freshwater Biology, 31:295–309.

Dolédec, S., Chessel, D. and Olivier, J. (1995). L'analyse des correspondances décentrée : application aux peuplements ichtyologiques du haut-rhône. Bulletin Français de la Pêche et de la Pisciculture, 336:29–40.

Dray, S., Chessel, D. and Thioulouse, J. (2003). Co-inertia analysis and the linking of ecological tables. Ecology, 84:3078–3089.

Escoufier, Y. (1987). The duality diagramm: a means of better practical applications. In Legendre, P. and Legendre, L., editors, Development in numerical ecology, pages 139–156. NATO Advanced Institute, Serie G. Springer Verlag, Berlin.

Gower, J. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53:325–338.

Gower, J. and Legendre, P. (1986). Metric and euclidean properties of dissimilarity coefficients. Journal of Classification, 3:5–48.

Greenacre, M. (1984). Theory and applications of correspondence analysis. Academic Press, London.

Hill, M. and Smith, A. (1976). Principal component analysis of taxonomic data with multi-state discrete characters. Taxon, 25:249–255.
Kiers, H. (1994). Simple structure in component analysis techniques for mixtures of qualitative and quantitative variables. Psychometrika, 56:197–212.

Kroonenberg, P. and Lombardo, R. (1999). Non-symmetric correspondence analysis: a tool for analysing contingency tables with a dependence structure. Multivariate Behavioral Research, 34:367–396.

Lavit, C., Escoufier, Y., Sabatier, R. and Traissac, P. (1994). The act (statis method). Computational Statistics and Data Analysis, 18:97–119.

Legendre, P. and Anderson, M. (1999). Distance-based redundancy analysis: testing multispecies responses in multifactorial ecological experiments. Ecological Monographs, 69:1–24.

Legendre, P. and Legendre, L. (1998). Numerical ecology. Elsevier Science BV, Amsterdam, 2nd english edition.

Lingoes, J. (1971). Some boundary conditions for a monotone analysis of symmetric matrices. Psychometrika, 36:195–203.

Manly, B. F. J. (1991). Randomization and Monte Carlo methods in biology. Chapman and Hall, London.

Perrière, G., Combet, C., Penel, S., Blanchet, C., Thioulouse, J., Geourjon, C., Grassot, J., Charavay, C., Gouy, M., Duret, L. and Deléage, G. (2003). Integrated databanks access and sequence/structure analysis services at the PBIL. Nucleic Acids Research, 31:3393–3399.

Perrière, G., Lobry, J. and Thioulouse, J. (1996). Correspondence discriminant analysis: a multivariate method for comparing classes of protein and nucleic acid sequences. Computer Applications in the Biosciences, 12:519–524.

Perrière, G. and Thioulouse, J. (2002). Use of correspondence discriminant analysis to predict the subcellular location of bacterial proteins. Computer Methods and Programs in Biomedicine, in press.

Tenenhaus, M. and Young, F. (1985). An analysis and synthesis of multiple correspondence analysis, optimal scaling, dual scaling, homogeneity analysis and other methods for quantifying categorical multivariate data. Psychometrika, 50:91–119.

Thioulouse, J., Chessel, D., Dolédec, S. and Olivier, J. (1997). ADE-4: a multivariate analysis and graphical display software. Statistics and Computing, 7:75–83.

Thioulouse, J., Simier, M. and Chessel, D. (2004). Simultaneous analysis of a sequence of paired ecological tables. Ecology, 85:272–283.

Thorpe, R. S. (1983a). A biometric study of the effects of growth on the analysis of geographic variation: Tooth number in green geckos (Reptilia: Phelsuma). Journal of Zoology, 201:13–26.

Thorpe, R. S. (1983b). A review of the numerical methods for recognising and analysing racial differentiation. In Felsenstein, J., editor, Numerical Taxonomy, Series G, pages 404–423. Springer-Verlag, New York.

Thorpe, R. S. and Leamy, L. (1983). Morphometric studies in inbred house mice (Mus sp.): Multivariate analysis of size and shape. Journal of Zoology, 199:421–432.
qcc: An R package for quality control charting and statistical process control

by Luca Scrucca
Introduction
The qcc package for the R statistical environment allows one to:
• plot Shewhart quality control charts for continuous, attribute and count data;

• plot Cusum and EWMA charts for continuous data;

• draw operating characteristic curves;

• perform process capability analyses;

• draw Pareto charts and cause-and-effect diagrams.
I started writing the package to provide the students in the class I teach a tool for learning the basic concepts of statistical quality control, at the introductory level of a book such as Montgomery (2000). Due to the intended final users, a free statistical environment was important, and R was a natural candidate.
The initial design, although written from scratch, reflected the set of functions available in S-Plus. Users of that environment will find some similarities, but also differences. In particular, during the development I added more tools and re-designed most of the library to meet the requirements I thought were important to satisfy.
Creating a qcc object
A qcc object is the basic building block to start with. This can be simply created invoking the qcc function, which in its simplest form requires two arguments: a data frame, a matrix or a vector containing the observed data, and a string value specifying the control chart to be computed. Standard Shewhart control charts are available. These are often classified according to the type of quality characteristic that they are supposed to monitor (see Table 1).
Control charts for continuous variables are usually based on several samples, with observations collected at different time points. Each sample or "rationale group" must be provided as a row of a data frame or a matrix. The function qcc.groups can be used to easily group a vector of data values based on a sample indicator. Unequal sample sizes are allowed.
Suppose we have extracted samples for a characteristic of interest from an ongoing production process:
> data(pistonrings)
> attach(pistonrings)
> dim(pistonrings)
[1] 200 3
> pistonrings
    diameter sample trial
1     74.030      1  TRUE
2     74.002      1  TRUE
3     74.019      1  TRUE
4     73.992      1  TRUE
5     74.008      1  TRUE
6     73.995      2  TRUE
7     73.992      2  TRUE
...
199   74.000     40 FALSE
200   74.020     40 FALSE
> diameter <- qcc.groups(diameter, sample)
> dim(diameter)
[1] 40 5
> diameter
     [,1]   [,2]   [,3]   [,4]   [,5]
1  74.030 74.002 74.019 73.992 74.008
2  73.995 73.992 74.001 74.011 74.004
...
40 74.010 74.005 74.029 74.000 74.020
Using the first 25 samples as training data, an X chart can be obtained as follows:
> obj <- qcc(diameter[1:25,], type="xbar")
By default a Shewhart chart is drawn (see Figure 1) and an object of class 'qcc' is returned. Summary statistics can be retrieved using the summary method (or simply by printing the object):
> summary(obj)
Call:
qcc(data = diameter[1:25, ], type = "xbar")

xbar chart for diameter[1:25, ]

Summary of group statistics:
 Min. 1st Qu. Median  Mean 3rd Qu.  Max.
73.99   74.00  74.00 74.00   74.00 74.01

Group sample size: 5
Number of groups: 25
Center of group statistics: 74.00118
Standard deviation: 0.009887547

Control limits:
     LCL      UCL
73.98791 74.01444
type      Control charts for variables
xbar      X chart. Sample means are plotted to control the mean level of a continuous process variable.
xbar.one  X chart. Sample values from a one-at-time data process to control the mean level of a continuous process variable.
R         R chart. Sample ranges are plotted to control the variability of a continuous process variable.
S         S chart. Sample standard deviations are plotted to control the variability of a continuous process variable.

type      Control charts for attributes
p         p chart. The proportion of nonconforming units is plotted. Control limits are based on the binomial distribution.
np        np chart. The number of nonconforming units is plotted. Control limits are based on the binomial distribution.
c         c chart. The number of defectives per unit are plotted. This chart assumes that defects of the quality attribute are rare, and the control limits are computed based on the Poisson distribution.
u         u chart. The average number of defectives per unit is plotted. The Poisson distribution is used to compute control limits, but, unlike the c chart, this chart does not require a constant number of units.

Table 1: Shewhart control charts available in the qcc package.
The number of groups and their sizes are reported in this case, whereas a table is provided in the case of unequal sample sizes. Moreover, the center of group statistics (the overall mean for an X chart) and the within-group standard deviation of the process are returned.
The main goal of a control chart is to monitor a process. If special causes of variation are present, the process is said to be "out of control" and actions are to be taken to find, and possibly eliminate, such causes. A process is declared to be "in control" if all points charted lie randomly within the control limits. These are usually computed at ±3σ from the center. This default can be changed using the argument nsigmas (so called "american practice") or specifying the confidence level (so called "british practice") through the confidence.level argument.
Assuming a Gaussian distribution is appropriate for the statistic charted, the two practices are equivalent, since limits at ±3σ correspond to a two-tailed probability of 0.0027 under a standard normal curve.
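The equivalence is easy to verify numerically in base R (a quick aside, not part of the qcc package):

```r
## two-tailed probability outside +/- 3 sigma under a standard normal
2 * pnorm(-3)                  # about 0.0027
## conversely, the number of sigmas implied by a confidence level of 0.9973
qnorm(1 - (1 - 0.9973) / 2)    # about 3
```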
Instead of using standard calculations to compute the group statistics and the corresponding control limits (see Montgomery (2000); Wetherill and Brown (1991)), the center, the within-group standard deviation and the control limits may be specified directly by the user through the arguments center, std.dev and limits, respectively.
[Figure: "xbar Chart for diameter[1:25, ]" — group summary statistics plotted against group number, with center line, LCL and UCL. Number of groups = 25; Center = 74.00118; StdDev = 0.009887547; LCL = 73.98791; UCL = 74.01444; Number beyond limits = 0; Number violating runs = 0.]
Figure 1: An X̄ chart for the samples data.
Shewhart control charts
A Shewhart chart is automatically plotted when an object of class 'qcc' is created, unless the qcc function is called with the argument plot=FALSE. The method responsible for plotting a control chart can, however, be invoked directly as follows:
gt plot(obj)
giving the plot in Figure 1. This control chart has the center line (horizontal solid line), the upper and lower control limits (dashed lines), and the sample group statistics drawn as points connected with lines. Unless we provide the argument add.stats=FALSE, some summary statistics are shown at the bottom of the plot, together with the number of points beyond the control limits and the number of violating runs (a run has length 5 by default).
Once a process is considered to be "in control", we may use the computed limits for monitoring new data sampled from the same ongoing process. For example,
> obj <- qcc(diameter[1:25,], type="xbar",
+            newdata=diameter[26:40,])
plots the X̄ chart for both the calibration data and the new data (see Figure 2), but all the statistics and the control limits are solely based on the first 25 samples.
[Figure: "xbar Chart for diameter[1:25, ] and diameter[26:40, ]" — calibration data in diameter[1:25, ], new data in diameter[26:40, ]. Number of groups = 40; Center = 74.00118; StdDev = 0.009887547; LCL = 73.98791; UCL = 74.01444; Number beyond limits = 3; Number violating runs = 1.]
Figure 2: An X̄ chart with calibration data and new data samples.
If only the new data need to be plotted, we may provide the argument chart.all=FALSE in the call to the qcc function, or through the plotting method:
> plot(obj, chart.all=FALSE)
Many other arguments may be provided in the call to the qcc function and to the corresponding plotting method, and we refer the reader to the on-line documentation for details (help(qcc)).
As reported in Table 1, an X chart for one-at-a-time data may be obtained by specifying type="xbar.one". In this case the data charted are simply the sample values, and the within-group standard deviation is estimated by moving ranges of k (by default 2) points.
Control charts for attributes mainly differ from the previous examples in that we need to provide sample sizes through the sizes argument (except for the c chart). For example, a p chart can be obtained as follows:
gt data(orangejuice)
gt attach(orangejuice)
R News ISSN 1609-3631
Vol 41 June 2004 14
> orangejuice
   sample  D size trial
1       1 12   50  TRUE
2       2 15   50  TRUE
3       3  8   50  TRUE
...
53     53  3   50 FALSE
54     54  5   50 FALSE
> obj2 <- qcc(D[trial], sizes=size[trial], type="p")
where we use the logical indicator variable trial to select the first 30 samples as calibration data.
Operating characteristic function
An operating characteristic (OC) curve provides information about the probability of not detecting a shift in the process. This is usually referred to as the type II error, that is, the probability of erroneously accepting a process as being "in control".
The OC curves can be easily obtained from an object of class 'qcc':
> par(mfrow=c(1,2))
> oc.curves(obj)
> oc.curves(obj2)
The function oc.curves invisibly returns a matrix or a vector of probability values for the type II error. More arguments are available (see help(oc.curves)); in particular, identify=TRUE allows the user to interactively identify values on the plot.
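For the X̄ chart, the OC curve values can also be reproduced from first principles. The helper below is a hypothetical illustration (not part of qcc), assuming 3σ limits: for a shift of δ standard deviations and samples of size n, the type II error is Φ(3 − δ√n) − Φ(−3 − δ√n).

```r
## P(type II error) for an X-bar chart with 3-sigma limits,
## as a function of the shift delta (in std.dev units) and sample size n
beta_xbar <- function(delta, n) {
  pnorm(3 - delta * sqrt(n)) - pnorm(-3 - delta * sqrt(n))
}
beta_xbar(1, 5)    # about 0.78: a 1-sigma shift often goes undetected at n = 5
beta_xbar(1, 20)   # well below 0.1: larger samples detect the same shift
```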
[Figure: two panels. Left: "OC curves for xbar chart" — probability of type II error against process shift (std.dev units), for sample sizes n = 1, 5, 10, 15, 20. Right: "OC curves for p chart" — probability of type II error against p.]
Figure 3: Operating characteristic curves.
Cusum charts
Cusum charts display how the group summary statistics deviate above or below the process center or target value, relative to the standard error of the summary statistics. They are useful for detecting small and permanent variation in the mean of the process.
The basic cusum chart implemented in the qcc package is only available for continuous variables at the moment. Assuming a 'qcc' object has been created, we can simply input the following code:
gt cusum(obj)
and the resulting cusum chart is shown in Figure 4. This displays on a single plot both the positive deviations and the negative deviations from the target, which by default is set to the center of the group statistics. If a different target for the process is required, this value may be provided through the argument target when creating the object of class 'qcc'. Two further aspects may need to be controlled. The argument decision.int can be used to specify the number of standard errors of the summary statistics at which the cumulative sum signals out of control; by default it is set at 5. The amount of shift in the process to be detected by the cusum chart, measured in standard errors of the summary statistics, is controlled by the argument se.shift, by default set at 1.
[Figure: "Cusum Chart for diameter[1:25, ] and diameter[26:40, ]" — cumulative sums above and below the target plotted against group, with decision boundaries LDB and UDB. Calibration data in diameter[1:25, ], new data in diameter[26:40, ]. Number of groups = 40; Target = 74.00118; StdDev = 0.009887547; Decision boundaries (std.err.) = 5; Shift detection (std.err.) = 1; No. of points beyond boundaries = 4.]
Figure 4: Cusum chart.
EWMA charts
Exponentially weighted moving average (EWMA) charts smooth a series of data based on a moving average with weights which decay exponentially. As with the cusum chart, this plot can also be used to detect small and permanent variation in the mean of the process.
The EWMA chart depends on the smoothing parameter λ, which controls the weighting scheme applied. By default it is set at 0.2, but it can be modified using the argument lambda. Assuming again that a 'qcc' object has been created, an EWMA chart can be obtained as follows:
gt ewma(obj)
and the resulting graph is shown in Figure 5
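The smoothing itself is easy to describe. The following hypothetical helper (not the qcc implementation, which also computes the time-varying control limits) shows the recursion z_t = λ x_t + (1 − λ) z_{t−1}:

```r
## EWMA recursion with starting value z0 (typically the process target)
ewma_smooth <- function(x, lambda = 0.2, z0 = mean(x)) {
  z <- numeric(length(x))
  prev <- z0
  for (t in seq_along(x)) {
    prev <- lambda * x[t] + (1 - lambda) * prev
    z[t] <- prev
  }
  z
}
ewma_smooth(c(10, 10, 12, 11), lambda = 0.2, z0 = 10)  # 10.00 10.00 10.40 10.52
```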
[Figure: "EWMA Chart for diameter[1:25, ] and diameter[26:40, ]" — group summary statistics plotted against group. Calibration data in diameter[1:25, ], new data in diameter[26:40, ]. Number of groups = 40; Target = 74.00118; StdDev = 0.009887547; Smoothing parameter = 0.2; Control limits at 3*sigma.]
Figure 5: EWMA chart.
Process capability analysis
Process capability indices for a characteristic of interest from a continuous process can be obtained through the function process.capability. This takes as mandatory arguments an object of class 'qcc' for an "xbar" or "xbar.one" type chart, and spec.limits, a vector specifying the lower (LSL) and upper (USL) specification limits. For example, using the previously created 'qcc' object, we may simply use:
> process.capability(obj,
+    spec.limits=c(73.95,74.05))

Process Capability Analysis

Call:
process.capability(object = obj, spec.limits = c(73.95, 74.05))

Number of obs = 125          Target = 74
       Center = 74.00118        LSL = 73.95
       StdDev = 0.009887547     USL = 74.05

Capability indices:

       Value    2.5%   97.5%
Cp     1.686   1.476   1.895
Cp_l   1.725   1.539   1.912
Cp_u   1.646   1.467   1.825
Cp_k   1.646   1.433   1.859
Cpm    1.674   1.465   1.882

Exp<LSL 0%    Obs<LSL 0%
Exp>USL 0%    Obs>USL 0%
The graph shown in Figure 6 is a histogram of the data values, with vertical dotted lines for the upper and the lower specification limits. The target is shown as a vertical dashed line, and its value can be provided using the target argument in the call; if missing, the value from the 'qcc' object is used if available, otherwise the target is set at the middle value between the specification limits.
The dashed curve is a normal density for a Gaussian distribution matching the observed mean and standard deviation. The latter is estimated using the within-group standard deviation, unless a value is specified in the call using the std.dev argument.
The capability indices reported on the graph are also printed on the console, together with confidence intervals computed at the level specified by the confidence.level argument (by default set at 0.95).
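As a check, the indices printed above follow directly from their textbook definitions. The sketch below plugs in the center, standard deviation and specification limits reported in the example output:

```r
## capability indices computed from their definitions,
## using the values printed in the example above
USL <- 74.05; LSL <- 73.95
mu <- 74.00118; sigma <- 0.009887547
Cp   <- (USL - LSL) / (6 * sigma)   # about 1.686
Cp_l <- (mu - LSL) / (3 * sigma)    # about 1.725
Cp_u <- (USL - mu) / (3 * sigma)    # about 1.646
Cp_k <- min(Cp_l, Cp_u)             # about 1.646
c(Cp = Cp, Cp_l = Cp_l, Cp_u = Cp_u, Cp_k = Cp_k)
```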
[Figure: "Process Capability Analysis for diameter[1:25, ]" — histogram of the data with LSL, Target and USL marked. Number of obs = 125; Center = 74.00118; StdDev = 0.009887547; Target = 74; LSL = 73.95; USL = 74.05; Cp = 1.69; Cp_l = 1.73; Cp_u = 1.65; Cp_k = 1.65; Cpm = 1.67; Exp<LSL 0%; Exp>USL 0%; Obs<LSL 0%; Obs>USL 0%.]
Figure 6: Process capability analysis.
A further display, called a "process capability sixpack plot", is also available. This is a graphical summary formed by plotting on the same graphical device:
• an X̄ chart

• an R or S chart (if sample sizes > 10)

• a run chart

• a histogram with specification limits

• a normal Q-Q plot

• a capability plot
As an example, we may input the following code:
> process.capability.sixpack(obj,
+    spec.limits=c(73.95,74.05),
+    target=74.01)
Pareto charts and cause-and-effect diagrams

A Pareto chart is a barplot where the categories are ordered using frequencies in non-increasing order, with a line added to show their cumulative sum. Pareto charts are useful to identify those factors that have the greatest cumulative effect on the system,
and thus screen out the less significant factors in an analysis.

A simple Pareto chart may be drawn using the function pareto.chart, which in its simpler form requires a vector of values. For example:
> defect <- c(80, 27, 66, 94, 33)
> names(defect) <-
+   c("price code", "schedule date",
+     "supplier code", "contact num.",
+     "part num.")
> pareto.chart(defect)
The function returns a table of descriptive statistics and the Pareto chart shown in Figure 7. More arguments can be provided to control axis labels and limits, the main title and bar colors (see help(pareto.chart)).
[Figure: "Pareto Chart for defect" — bars for contact num., price code, supplier code, part num. and schedule date in decreasing order of frequency, with a cumulative percentage line.]
Figure 7: Pareto chart.
A cause-and-effect diagram, also called an Ishikawa diagram (or fishbone diagram), is used to associate multiple possible causes with a single effect. Thus, given a particular effect, the diagram is constructed to identify and organize the possible causes for it.
A very basic implementation of this type of diagram is available in the qcc package. The function cause.and.effect requires at least two arguments: cause, a list of causes and branches providing descriptive labels, and effect, a string describing the effect. Other arguments are available, such as cex and font, which control, respectively, the character expansion and the font used for labeling branches, causes and the effect. The following example creates the diagram shown in Figure 8.
> cause.and.effect(
+   cause=list(Measurements=c("Microscopes",
+                             "Inspectors"),
+              Materials=c("Alloys",
+                          "Suppliers"),
+              Personnel=c("Supervisors",
+                          "Operators"),
+              Environment=c("Condensation",
+                            "Moisture"),
+              Methods=c("Brake", "Engager",
+                        "Angle"),
+              Machines=c("Speed", "Bits",
+                         "Sockets")),
+   effect="Surface Flaws")
[Figure: "Cause-and-Effect diagram" — effect "Surface Flaws" with branches Measurements (Microscopes, Inspectors), Materials (Alloys, Suppliers), Personnel (Supervisors, Operators), Environment (Condensation, Moisture), Methods (Brake, Engager, Angle) and Machines (Speed, Bits, Sockets).]
Figure 8: Cause-and-effect diagram.
Tuning and extending the package
Options to control some aspects of the qcc package can be set using the function qcc.options. If this is called with no arguments, it retrieves the list of available options and their current values; otherwise, it sets the required option. From a user's point of view, the most interesting options allow one to set: the plotting character and the color used to highlight points beyond the control limits or violating runs; the run length, i.e. the maximum length of a run before a point is signalled as out of control; the background color used to draw the figure and the margin color; and the character expansion used to draw the labels, the title and the statistics at the bottom of any plot. For details see help(qcc.options).
Another interesting feature is the possibility of easily extending the package by defining new control charts. Suppose we want to plot a p chart based on samples with different sizes. The resulting control chart will have non-constant control limits and thus may be difficult for some operators to read. One possible solution is to draw a standardized control chart. This can be accomplished by defining three new functions: one which computes and returns the group statistics to be charted (i.e. the z scores) and their center (zero in this case); another one which returns the within-group standard deviation (one in this case); and finally a function which returns the control limits. These functions need to be
called stats.type, sd.type and limits.type, respectively, for a new chart of type "type". The following code may be used to define a standardized p chart:
stats.p.std <- function(data, sizes)
{
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data)/sum(sizes)
  z <- (data/sizes - pbar)/sqrt(pbar*(1-pbar)/sizes)
  list(statistics = z, center = 0)
}

sd.p.std <- function(data, sizes) return(1)

limits.p.std <- function(center, std.dev, sizes, conf)
{
  if (conf >= 1)
    { lcl <- -conf
      ucl <- +conf }
  else
    { if (conf > 0 & conf < 1)
        { nsigmas <- qnorm(1 - (1 - conf)/2)
          lcl <- -nsigmas
          ucl <- +nsigmas }
      else stop("invalid conf argument.") }
  limits <- matrix(c(lcl, ucl), ncol = 2)
  rownames(limits) <- rep("", length = nrow(limits))
  colnames(limits) <- c("LCL", "UCL")
  return(limits)
}
Then we may source the above code and obtain the control charts in Figure 9 as follows:
# set unequal sample sizes
> n <- c(rep(50,5), rep(100,5), rep(25,5))
# generate randomly the number of successes
> x <- rbinom(length(n), n, 0.2)
> par(mfrow=c(1,2))
# plot the control chart with variable limits
> qcc(x, type="p", sizes=n)
# plot the standardized control chart
> qcc(x, type="p.std", sizes=n)
[Figure: two panels. Left: "p Chart for x" with variable control limits — Number of groups = 15; Center = 0.1931429; StdDev = 0.3947641; LCL is variable; UCL is variable; Number beyond limits = 0; Number violating runs = 0. Right: "p.std Chart for x" — Number of groups = 15; Center = 0; StdDev = 1; LCL = −3; UCL = 3; Number beyond limits = 0; Number violating runs = 0.]
Figure 9: A p chart with non-constant control limits (left panel) and the corresponding standardized control chart (right panel).
Summary
In this paper we have briefly described the qcc package. This provides functions to draw basic Shewhart quality control charts for continuous, attribute and count data; the corresponding operating characteristic curves are also implemented. Other statistical quality tools available are cusum and EWMA charts for continuous data, process capability analyses, Pareto charts and cause-and-effect diagrams.
References
Montgomery, D.C. (2000) Introduction to Statistical Quality Control, 4th ed. New York: John Wiley & Sons.

Wetherill, G.B. and Brown, D.W. (1991) Statistical Process Control. New York: Chapman & Hall.
Luca Scrucca
Dipartimento di Scienze Statistiche, Università degli Studi di Perugia, Italy
luca@stat.unipg.it
Least Squares Calculations in R

Timing different approaches
by Douglas Bates
Introduction
The S language encourages users to become programmers as they express new ideas and techniques of data analysis in S. Frequently the calculations in these techniques are expressed in terms of matrix operations. The subject of numerical linear algebra (how calculations with matrices can be carried out accurately and efficiently) is quite different from what many of us learn in linear algebra courses or in courses on linear statistical methods.
Numerical linear algebra is primarily based on decompositions of matrices, such as the LU decomposition, the QR decomposition, and the Cholesky decomposition, that are rarely discussed in linear algebra or statistics courses.
In this article we discuss one of the most common operations in statistical computing: calculating least squares estimates. We can write the problem mathematically as

    β̂ = argmin_β ||y − Xβ||²    (1)

where X is an n × p model matrix (p ≤ n), y is n-dimensional and β is p-dimensional. Most statistics texts state that the solution to (1) is

    β̂ = (X′X)⁻¹X′y    (2)

when X has full column rank (i.e. the columns of X are linearly independent).
Indeed (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way. That is, they use code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.
General principles
A few general principles of numerical linear algebra are:
1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.

2. Use crossprod(X), not t(X) %*% X, to calculate X′X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X′y.
3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X′X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
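A small base-R sketch of principle 1 (the matrix here is simulated; the point is the form of the call, not the timings):

```r
## solving one system directly versus forming the inverse first
set.seed(1)
X <- matrix(rnorm(400 * 200), 400, 200)
A <- crossprod(X)        # a symmetric positive definite 200 x 200 matrix
b <- rnorm(200)
x1 <- solve(A, b)        # factorize A once and solve for this b
x2 <- solve(A) %*% b     # computes the full inverse: slower, less accurate
all.equal(x1, drop(x2))  # TRUE
```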
Some timings on a large example
For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well:
> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850  712
> sys.gc.time(naive.sol <- solve(t(mmm) %*%
+     mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00
(The function sys.gc.time is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)
According to the principles above, it should be more effective to compute:
> sys.gc.time(cpod.sol <- solve(crossprod(mmm),
+     crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE
Timing results will vary between computers but, in most cases, the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X′X the naive way:
> sys.gc.time(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00
because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.
However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result, the solution of (1) is performed using a general linear system solver based on an LU decomposition, when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly, but the code is awkward:
> sys.gc.time(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sys.gc.time(chol.sol <- backsolve(ch,
+     forwardsolve(ch, crossprod(mmm, y),
+                  upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE
Least squares calculations with Matrix classes
The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition.
> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(Mat.sol <- solve(crossprod(mmg),
+     crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE
Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object so that it can be reused without recalculation:
> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all:
> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(sparse.sol <- solve(crossprod(mm),
+     crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE
The model matrix mm has class "cscMatrix", indicating that it is a compressed, sparse, column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric, sparse, column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful, the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case, the decomposition is so fast that it is difficult to determine the difference in the solution times:
> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
Epilogue
Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS), such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or Atlas (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.
Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.
Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A′B are generally faster than those in AB.

Notice that this means that calculating A′B in R as t(A) %*% B, instead of crossprod(A, B), causes A to be transposed twice: once in the calculation of t(A) and a second time in the inner loop of the matrix
product. The crossprod function does not do any transposition of matrices: yet another reason to use crossprod.
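The equivalence of the two forms is easy to check; both compute A′B, but only the first explicitly materializes the transpose:

```r
## t(A) %*% B and crossprod(A, B) yield the same product A'B
set.seed(123)
A <- matrix(rnorm(1000 * 50), 1000, 50)
B <- matrix(rnorm(1000 * 60), 1000, 60)
all.equal(t(A) %*% B, crossprod(A, B))  # TRUE, with no explicit transpose
```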
Bibliography
J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).
D.M. Bates
U. of Wisconsin-Madison
bates@wisc.edu
Tools for interactively exploring R packages

by Jianhua Zhang and Robert Gentleman
Introduction
An R package has the required, and perhaps additional, subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command-line operations, both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer, and they are intended for exploring packages, examples, and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.
Description
Our current implementation is in the form of tcltk widgets, and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcltk. Once both tkWidgets and widgetTools have been loaded into the working session, the user has full access to these interactive tools.
pExplorer
pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from being shown by the widget. The default behavior is to open the first package in the library of locally installed R packages (.libPaths()), with subdirectories/files named "Meta" and "latex" excluded, if nothing is provided by a user.
Figure 1 shows the widget invoked by typing pExplorer("base").
The following tasks can be performed through the interface:
• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

• Clicking the Browse button allows users to select a directory containing R packages, achieving the same results as typing in a new path.

• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.
Figure 1: A snapshot of the widget when pExplorer is called.
• Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, to be discussed next.
eExplorer
eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files containing example code, stored in the R-ex subdirectory, in the Example code chunk list box, as shown in Figure 2.
Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.
vExplorer
vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionality provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of the vignettes of the package.
Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget, as shown in Figure 4, appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:
• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the PDF version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in Results of Execution.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.
The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.
References
Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28-31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21-24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.
Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base").
Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().
Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.
The survival package

by Thomas Lumley
This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).
Overview
Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.
In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.
Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).
Specifying survival data
In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.
In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.
> library(survival)
> data(veteran)  ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+

## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
                     diagtime*30+time, status))
 [1] ( 210, 282 ]  ( 150, 561 ]
 [3] (  90, 318 ]  ( 270, 396 ]
 [9] ( 540, 854 ]  ( 180, 280+]
The first "Surv" object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.
One and two sample summaries
The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.
Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the G-rho family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.
> data(veteran)
> plot(survfit(Surv(time, status) ~ trt, data = veteran),
       xlab = "Years since randomisation",
       xscale = 365, ylab = "% surviving", yscale = 100,
       col = c("forestgreen", "blue"))
> survdiff(Surv(time, status) ~ trt, data = veteran)

Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

      N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394

      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

Chisq= 0  on 1 degrees of freedom, p= 0.928
[Figure: x-axis "Years since randomisation", y-axis "% surviving"; curves labelled Standard and New.]
Figure 1: Survival distributions for two lung cancer treatments
Proportional hazards models
The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.
Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

    log h(t; z) = log h0(t) + βz.
Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.
A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.
> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
                       log(bili) + log(protime) +
                       age + platelet,
                     data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
      log(bili) + log(protime) + age + platelet,
      data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999

             se(coef)     z       p
edtrt        0.300321  3.43 0.00061
log(bili)    0.097771  9.73 0.00000
log(protime) 1.031908  2.80 0.00520
age          0.008489  4.18 0.00003
platelet     0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312
The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.
> plot(survfit(Surv(time, status) ~ edtrt,
               data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt +
         ratetable(edtrt = edtrt, bili = bili,
                   platelet = platelet, age = age,
                   protime = protime),
         data = pbc, subset = trt == -9,
         ratetable = mayomodel, cohort = TRUE),
       col = "purple")
The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.
Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.
Figure 2: Observed and predicted survival
The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The
cox.zph function provides formal tests and graphical diagnostics for proportional hazards.
> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796
## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])
[Figure: smoothed estimate of Beta(t) for edtrt plotted against time.]
Figure 3: Testing proportional hazards
There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve
shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.
Additional features
Computation for the Cox model extends straightforwardly to situations where z can vary over time, and to situations where an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.
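As a sketch of how these terms appear in a model formula (the data frame and variable names below are hypothetical, made up for illustration):

```r
library(survival)

## hypothetical data: 'id' identifies the individual (used by
## cluster() for a robust variance), and each level of 'sex'
## gets its own unspecified baseline hazard via strata()
d <- data.frame(time   = c(4, 3, 1, 1, 2, 2, 3, 5),
                status = c(1, 1, 1, 0, 1, 1, 0, 1),
                x      = c(0, 2, 1, 1, 1, 0, 0, 2),
                sex    = rep(c("m", "f"), each = 4),
                id     = 1:8)
fit <- coxph(Surv(time, status) ~ x + strata(sex) + cluster(id),
             data = d)
fit
```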
Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
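A minimal survreg call might look as follows, fitting a Weibull model to the veteran data used earlier; this is an illustrative sketch rather than a serious analysis:

```r
library(survival)

## Weibull accelerated-failure-time model; dist can also be
## "exponential", "lognormal", "loglogistic", ...
fit <- survreg(Surv(time, status) ~ trt, data = veteran,
               dist = "weibull")
summary(fit)
```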
More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.
The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of follow-up.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
useR 2004: The R User Conference

by John Fox
McMaster University, Canada
More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.
The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.
A particularly interesting component of the program was the series of keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:
• Paul Murrell on grid graphics and programming
• Martin Mächler on good programming practice
• Luke Tierney on namespaces and byte compilation
• Kurt Hornik on packaging, documentation, and testing
• Friedrich Leisch on S4 classes and methods
• Peter Dalgaard on language interfaces, using .Call and .External
• Douglas Bates on multilevel models in R
• Brian D. Ripley on data mining and related topics
Slides for the keynote addresses, along with abstracts of other papers presented at useR 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.
An innovation of useR 2004 was the deployment of 'island' sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.
The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19 and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.
Everyone with whom I spoke, both at and after useR 2004, was very enthusiastic about the conference. We look forward to an even larger useR in 2006.
R Help Desk

Date and Time Classes in R
Gabor Grothendieck and Thomas Petzoldt
Introduction
R-1.9.0 and its contributed packages have a number of datetime (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:

• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).

• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).

• POSIX classes. POSIX classes refer to the two classes POSIXct and POSIXlt and their common super class POSIXt. These support times and dates including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases, and other applications that involve external communication with systems that also support dates, times, time zones, and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970, GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of
the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIX classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
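These internal representations can be inspected directly with unclass(); the dates below are arbitrary examples:

```r
d <- as.Date("2004-06-01")
unclass(d)        # 12570: days since 1970-01-01

p <- as.POSIXct("1970-01-02 00:00:00", tz = "GMT")
unclass(p)        # 86400: seconds since 1970-01-01 GMT

lt <- as.POSIXlt(p, tz = "GMT")
## POSIXlt stores a list of components (sec, min, hour, mday, ...)
c(lt$year + 1900, lt$mon + 1, lt$mday)   # 1970 1 2
```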
When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.

A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.

Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.
Other Applications
Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word 'usually' was employed above.
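For example, a (hypothetical) vector of Windows-Excel serial day numbers can be converted like this:

```r
## hypothetical Excel (Windows) serial datetimes:
## whole days plus a fraction for the time of day
x <- c(38353, 38353.75)
d <- as.Date("1899-12-30") + floor(x)
d    # both elements fall on 2005-01-01
```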
SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).
Time Series
A common use of dates and datetimes is in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly, or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its), and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or time/date package to represent datetimes.
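A small illustration of the regular case: a ts object labels observations by period and frequency rather than by dates (the values are arbitrary):

```r
## a regular monthly series starting January 2003
x <- ts(sin(1:24), start = c(2003, 1), frequency = 12)
y <- window(x, start = c(2004, 1))   # the 2004 portion
start(y)       # c(2004, 1)
frequency(y)   # 12
```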
Input
By default, read.table will read in numeric data such as 20040115 as numbers and will read non-numeric data such as 12/15/04 or 2004-12-15 as factors. In either case one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹
## date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y%m%d")

## first two cols in format mm/dd/yy hh:mm:ss
## Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)
Avoiding Errors
The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:
## default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))
## if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)
## function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)
## if TRUE then year not displayed
options(chron.simplify = FALSE)
For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,
¹This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data.
orig <- chron("01/01/60")
x <- 0:9    ## days since 01/01/60
## chron dates with default origin
orig + x
Regarding POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year, and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:
dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)
This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general, caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.

• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.

• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
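A short sketch of the convert-via-character advice (the date is arbitrary):

```r
dp <- as.POSIXct("2004-06-01 23:00:00", tz = "GMT")
## a direct as.Date(dp) depends on the time zone the method
## assumes; formatting to character first makes the intended
## zone explicit
as.Date(format(dp, tz = "GMT"))   # "2004-06-01"
```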
Comparison Table
Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.
Bibliography
Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2):8-11. URL http://CRAN.R-project.org/doc/Rnews/.
Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de
now:
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
  POSIXct: Sys.time()

origin:
  Date:    structure(0, class="Date")
  chron:   chron(0)
  POSIXct: structure(0, class=c("POSIXt","POSIXct"))

x days since origin (GMT):
  Date:    structure(floor(x), class="Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class=c("POSIXt","POSIXct"))

diff in days:
  Date:    dt1-dt2
  chron:   dc1-dc2
  POSIXct: difftime(dp1, dp2, unit="day") [1]

time zone difference:
  POSIXct: dp-as.POSIXct(format(dp, tz="GMT"))

compare:
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day:
  Date:    dt+1
  chron:   dc+1
  POSIXct: seq(dp0, length=2, by="DSTday")[2] [2]

previous day:
  Date:    dt-1
  chron:   dc-1
  POSIXct: seq(dp0, length=2, by="-1 DSTday")[2] [3]

x days since date:
  Date:    dt0+floor(x)
  chron:   dc0+x
  POSIXct: seq(dp0, length=2, by=paste(x, "DSTday"))[2] [4]

sequence:
  Date:    dts <- seq(dt0, length=10, by="day")
  chron:   dcs <- seq(dc0, length=10)
  POSIXct: dps <- seq(dp0, length=10, by="DSTday")

plot:
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz="GMT"), rnorm(10)) [5]

every 2nd week:
  Date:    seq(dt0, length=3, by="2 week")
  chron:   dc0+seq(0, length=3, by=14)
  POSIXct: seq(dp0, length=3, by="2 week")

first day of month:
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc)-month.day.year(dc)$day+1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts:
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun=0):
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc)-3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year:
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year:
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means:
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output

yyyy-mm-dd:
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

mm/dd/yy:
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Sat Jul 1 1970:
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input

"1970-10-15":
  Date:    as.Date(z)
  chron:   chron(z, format="y-m-d")
  POSIXct: as.POSIXct(z)

"10/15/1970":
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz="")

to Date:
  chron:   as.Date(dates(dc))
  POSIXct: as.Date(format(dp))

to chron:
  Date:    chron(unclass(dt))
  POSIXct: chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]

to POSIXct:
  Date:    as.POSIXct(format(dt))
  chron:   as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz="GMT")

to Date:
  chron:   as.Date(dates(dc))
  POSIXct: as.Date(dp)

to chron:
  Date:    chron(unclass(dt))
  POSIXct: as.chron(dp)

to POSIXct:
  Date:    as.POSIXct(format(dt), tz="GMT")
  chron:   as.POSIXct(dc)
Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for % codes.

[1] Can be reduced to dp1-dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp+24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp-24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp+24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz="GMT")).
Table 1: Comparison Table
Vol 41 June 2004 33
Programmers' Niche

A simple class in S3 and S4

by Thomas Lumley
ROC curves
Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.
The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).
I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.
Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±Inf and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.
drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type = "l")
}
It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number rather than just an object for which > and <= are defined.
drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  plot(mspec, sens, type = "l")
}
One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
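As a quick check, the optimised function can be run on simulated data (made up for illustration; an invisible return of the two rates is added here so they can be inspected):

```r
drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  plot(mspec, sens, type = "l")
  invisible(list(sens = sens, mspec = mspec))
}

set.seed(1)
T <- c(rnorm(50, mean = 1), rnorm(50, mean = 0))  # test values
D <- rep(c(1, 0), each = 50)                      # disease indicator
r <- drawROC(T, D)
## both rates are cumulative sums of non-negative numbers,
## hence non-decreasing
```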
Classes and methods
Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens = sens, mspec = mspec, test = TT,
               call = sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}
The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.
plot.ROC <- function(x, type = "b", null.line = TRUE,
                     xlab = "1-Specificity",
                     ylab = "Sensitivity",
                     main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
       xlab = xlab, ylab = ylab, ...)
  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}
A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:
lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}
A method for identify allows the cutpoint to be found for any point on the curve:
identify.ROC <- function(x, labels = NULL, ..., digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}
An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
Generalising the class
The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.
ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.
ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:
setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })
The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
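The validity check can be exercised directly. In the sketch below the validity function returns a message string rather than a bare logical, which is the documented convention for reporting failures:

```r
library(methods)

setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    if (length(object@sens) == length(object@mspec) &&
        length(object@sens) == length(object@test)) TRUE
    else "lengths of sens, mspec and test must agree"
  })

ok  <- new("ROC", sens = c(0, 1), mspec = c(0, 1),
           test = c(2, 1), call = quote(ROC()))
bad <- tryCatch(new("ROC", sens = c(0, 1), mspec = 0,
                    test = c(2, 1), call = quote(ROC())),
                error = function(e) "rejected")
bad   # "rejected": new() ran the validity check and refused
```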
In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new
function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  new("ROC", sens = sens, mspec = mspec,
      test = TT, call = sys.call())
}
Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:
setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })
The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.
The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC","missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.
setMethod("plot", c("ROC", "missing"),
  function(x, y, type="b", null.line=TRUE,
           xlab="1-Specificity",
           ylab="Sensitivity",
           main=NULL, ...) {
    par(pty="s")
    plot(x@mspec, x@sens, type=type,
         xlab=xlab, ylab=ylab, ...)
    if (null.line)
      abline(0, 1, lty=3)
    if (is.null(main))
      main <- x@call
    title(main=main)
  })
Creating a method for lines appears to work the same way:
setMethod("lines", "ROC",
  function(x, ...)
    lines(x@mspec, x@sens, ...))
In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before
> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>
and after the setMethod call
> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object:
setGeneric("ROC")
setMethod("ROC", c("glm", "missing"),
  function(T) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type="response"))
    disease <- disease*weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    new("ROC", sens=sens, mspec=mspec,
        test=TT, call=sys.call())
  })
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.
setGeneric("identify", signature=c("x"))
setMethod("identify", "ROC",
  function(x, labels=NULL, ...,
           digits=1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels=labels, ...)
  })
Discussion
Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.
This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R
by the R Core Team
New features in 1.9.1
• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.
User-visible changes in 1.9.0
• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

  Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has
  been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

  Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

  Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook -- see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.
New features in 1.9.0
• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically those of get/assign to the environment with inherits=FALSE.
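A minimal illustration of this behaviour (the names here are arbitrary):

```r
e <- new.env()
e$x <- 1        # $<- applied to an environment
e[["y"]] <- 2   # [[<- applied to an environment
e$x + e[["y"]]  # 3; names must match exactly, no partial matching
```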
• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace() parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc., now also differentiate asin(), acos() and atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.
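For example, a few basic operations on the new class (dates chosen arbitrarily):

```r
d <- as.Date("2004-06-01")
d + 30          # arithmetic is in whole days
format(d + 30)  # "2004-07-01"
weekdays(d)     # day-of-week utility (output is locale-dependent)
```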
• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).
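For instance:

```r
factorial(5)     # gamma(5 + 1) = 120
lfactorial(100)  # lgamma(101): log(100!) without overflowing
```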
• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.
• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

  The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.
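For example, using a throwaway environment:

```r
e <- new.env()
assign("a", 1, envir = e)
assign("b", 2, envir = e)
mget(c("a", "b"), envir = e)  # list(a = 1, b = 2)
```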
• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

  It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

  ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for S compatibility.

  If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

  If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

  The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).
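Some small examples of the S-compatible rep() behaviour described above:

```r
length(rep(integer(0), length.out = 3))  # 3 (a vector of NAs), not 0
rep(1:2, each = 2, length.out = 3)       # 1 1 2: truncated, not NA-padded
rep(1:2, times = 5, length.out = 4)      # 1 2 1 2: times is ignored
```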
• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.
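For example, quoting a file name containing a single quote:

```r
f <- "it's here.txt"
shQuote(f)                # sh-style single-quoting, escaping the embedded quote
shQuote(f, type = "cmd")  # Windows cmd.exe style double-quoting
```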
• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split="" is optimized.
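For example:

```r
strsplit("a.b.c", ".", fixed = TRUE)[[1]]  # "a" "b" "c": "." taken literally
strsplit("abc", "")[[1]]                   # "a" "b" "c": the optimized split=""
```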
• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.
• Changes to package 'grid':

  – Renamed push/popviewport() to push/popViewport().

  – Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

  – Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.

  – Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport bundles that may be pushed at once (lists are pushed in parallel, stacks in series).

    current.vpTree() returns the current viewport tree.

  – Added vpPath() to allow specification of viewport path in downViewport() and seekViewport().

    See ?viewports for an example of its use.

    NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator :: in later versions).

  – Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

  – Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

      pushViewport(viewport(w=.5, h=.5, name="A"))
      grid.rect()
      pushViewport(viewport(w=.5, h=.5, name="B"))
      grid.rect(gp=gpar(col="grey"))
      upViewport(2)
      grid.rect(vp="A", gp=gpar(fill="red"))
      grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))
  – Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

  – Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

    * grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

    * grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

    * the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

    In addition, the following features are now possible with grobs:

    * grobs now save()/load() like any normal R object.

    * many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

    * there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

    * there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

    * there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

  – Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).

  – Allowed viewports to turn off clipping altogether. Possible settings for viewport clip arg are now:

      "on"      clip to the viewport (was TRUE)
      "inherit" clip to whatever parent says (was FALSE)
      "off"     turn off clipping

    Still accept logical values (and NA maps to "off").
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends().

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.
• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(, deriv=2) and psigamma(, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

  The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

  The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blas='goto' to use K. Goto's optimized BLAS will now work.
The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN
by Kurt Hornik
New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination
follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.
DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences; title: "The use of the language interface of R: two examples for modelling water flux and solute transport". By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001) S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check
given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.
debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. S original (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis; fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing hua Zhao, in collaboration with other colleagues and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik
klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges
knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey
kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko
mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa
mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou
mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-versus-within chain approach of Gelman and Rubin. By Gregory R. Warnes
mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner
mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal and external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski
multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon
mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by debug package. By Mark V. Bravington
mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson. R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath
pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger
qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca
race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari
rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman)
ref Small package with functions for creating references, reading from and writing to references, and a memory efficient refdata type that transparently encapsulates matrices and data.frames. By Jens Oehlschlägel
regress Functions to fit Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh
rgl 3D visualization device (OpenGL). By Daniel Adler
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini
rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers
rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov
sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis
seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972-974. By David A. Schoenfeld
setRNG Set reproducible random number generator in R and S. By Paul Gilbert
sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others
skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson
spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford
spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth
spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha
supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler
svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie
treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto
urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff
urn Functions for sampling without replacement (Simulated Urns). By Micah Altman
verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns. Ported to R by Nick Efthymiou
zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck
Other changes
• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News
by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)
• Emanuele De Rinaldis (Italy)
• Norm Phillips (Canada)
New supporting institutions
• Breast Center at Baylor College of Medicine, Houston, Texas, USA
• Dana-Farber Cancer Institute, Boston, USA
• Department of Biostatistics, Vanderbilt University School of Medicine, USA
• Department of Statistics & Actuarial Science, University of Iowa, USA
• Division of Biostatistics, University of California, Berkeley, USA
• Norwegian Institute of Marine Research, Bergen, Norway
• TERRA Lab, University of Regina - Department of Geography, Canada
• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board: Douglas Bates and Paul Murrell

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage: http://www.R-project.org/

This newsletter is available online at http://CRAN.R-project.org/doc/Rnews/
and act in a responsible and professional manner, and meet legal and regulatory obligations. There would be no value in the software if my clients and I could not trust the results I obtained.

In the case of professional tools, such as software, being used internally, one typically has many alternatives to consider. You must first be sure that the potential choices are validated against a list of required functional and quality benchmarks driven by the needs of your prospective clients. You must, of course, be sure that the choices fit within your operating budget. In the specific case of analytic software applications, the list to consider and the price points were quite varied. At the high end of the price range are applications like SAS and SPSS, which have become prohibitively expensive for small businesses that require commercial licenses. Despite my past lengthy experience with SAS, it would not have been financially practical to consider using it. A single-user desktop license for the three key components (Base, Stat and Graph) would have cost about (US)$5,000 per year, well beyond my budget.

Over a period of several months through late 2000, I evaluated demonstration versions of several statistical applications. At the time I was using Windows XP, and thus considered Windows versions of any applications. I narrowed the list to S-PLUS and Stata. Because of my prior experience with both SAS and S-PLUS, I was comfortable with command line applications, as opposed to those with menu-driven "point and click" GUIs. Both S-PLUS and Stata would meet my functional requirements regarding analytic methods and graphics. Because both provided for programming within the language, I could reasonably expect to introduce some productivity enhancements and reduce my costs by more efficient use of my time.
An Introduction To R
About then (late 2000), I became aware of R through contacts with other users and through posts to various online forums. I had not yet decided on S-PLUS or Stata, so I downloaded and installed R to gain some experience with it. This was circa R version 1.2.x. I began to experiment with R, reading the available online documentation, perusing posts to the r-help list, and using some of the books on S-PLUS that I had previously purchased, most notably the first edition (1994) of Venables and Ripley's MASS.
Although I was not yet specifically aware of the R Community and the general culture of the GPL and Open-Source application development, the basic premise and philosophy was not foreign to me. I was quite comfortable with online support mechanisms, having used Usenet, online "knowledge bases" and other online community forums in the past. I had found that, more often than not, these provided more competent and expedient support than did the telephone support from a vendor. I also began to realize that the key factor in the success of my business would be superior client service, and not expensive proprietary technology. My experience and my needs helped me accept the basic notion of Open-Source development and online community-based support.
From my earliest exposure to R, it was evident to me that this application had evolved substantially in a relatively short period of time (as it has continued to do) under the leadership of some of the best minds in the business. It also became rapidly evident that it would fit my functional requirements, and my comfort level with it increased substantially over a period of months. Of course, my prior experience with S-PLUS certainly helped me to learn R rapidly (although there are important differences between S-PLUS and R, as described in the R FAQ).

I therefore postponed any purchase decisions regarding S-PLUS or Stata, and made a commitment to using R.

I began to develop and implement various functions that I would need in providing certain analytic and reporting capabilities for clients. Over a period of several weeks, I created a package of functions for internal use that substantially reduced the time it takes to import and structure data from certain clients (typically provided as Excel workbooks or as ASCII files) and then generate a large number of tables and graphs. These types of reports typically use standardized datasets and are created reasonably frequently. Moreover, I could easily enhance and introduce new components to the reports as required by the changing needs of these clients. Over time, I introduced further efficiencies and was able to take a process that initially had taken several days quite literally down to a few hours. As a result, I was able to spend less time on basic data manipulation and more time on the review, interpretation and description of key findings.

In this way, I could reduce the total time required for a standardized project, reducing my costs and also increasing the value to the client by being able to spend more time on the higher-value services to the client, instead of on internal processes. By enhancing the quality of the product and ultimately passing my cost reductions on to the client, I am able to increase substantially the value of my services.
A Continued Evolution
In my more than three years of using R, I have continued to implement other internal process changes. Most notably, I have moved day-to-day computing operations from Windows XP to Linux. Needless to say, R's support of multiple operating systems made
this transition essentially transparent for the application software. I also started using ESS as my primary working environment for R, and enjoyed its higher level of support for R compared to the editor that I had used under Windows, which simply provided syntax highlighting.

My transformation to Linux occurred gradually, during the later Red Hat (RH) 8.0 beta releases, as I became more comfortable with Linux and with the process of replacing Windows functionality that had become "second nature" over my years of MS operating system use (dating back to the original IBM PC/DOS days of 1981). For example, I needed to replace my practice in R of generating WMF files to be incorporated into Word and Powerpoint documents. At first, I switched to generating EPS files for incorporation into OpenOffice.org's (OO.org) Writer and Impress documents, but now I primarily use LaTeX and the "seminar" slide creation package, frequently in conjunction with Sweave. Rather than using Powerpoint or Impress for presentations, I create landscape oriented PDF files and display them with Adobe Acrobat Reader, which supports a full screen presentation mode.

I also use OO.org's Writer for most documents that I need to exchange with clients. For the final version, I export a PDF file from Writer, knowing that clients will have access to Adobe's Reader. Because OO.org's Writer supports MS Word input formats, it provides for interchange with clients when jointly editable documents (or tracking of changes) are required. When clients send me Excel worksheets, I use OO.org's Calc, which supports Excel data formats (and, in fact, exports much better CSV files from these worksheets than does Excel). I have replaced Outlook with Ximian's Evolution for my e-mail, contact list and calendar management, including the coordination of electronic meeting requests with business partners and clients and, of course, synching my Sony Clie (which Sony recently announced will soon be deprecated in favor of smarter cell phones).

With respect to Linux, I am now using Fedora Core (FC) 2, having moved from RH 8.0 to RH 9 to FC 1. I have, in time, replaced all Windows-based applications with Linux-based alternatives, with one exception: some accounting software that I need in order to exchange electronic files with my accountant. In time, I suspect that this final hurdle will be removed and, with it, the need to use Windows at all. I should make it clear that this is not a goal for philosophic reasons, but for business reasons. I have found that the stability and functionality of the Linux environment has enabled me to engage in a variety of technical and business process related activities that would either be problematic or prohibitively expensive under Windows. This enables me to realize additional productivity gains that further reduce operating costs.
Giving Back
In closing, I would like to make a few comments about the R Community.

I have felt fortunate to be a useR, and I also have felt strongly that I should give back to the Community, recognizing that, ultimately, through MedAnalytics and the use of R, I am making a living using software that has been made available through voluntary efforts and which I did not need to purchase.

To that end, over the past three years, I have committed to contributing and giving back to the Community in several ways. First and foremost, by responding to queries on the r-help list and, later, on the r-devel list. There has been, and will continue to be, a continuous stream of new useRs who need basic assistance. As current useRs progress in experience, each of us can contribute to the Community by assisting new useRs. In this way, a cycle develops in which the students evolve into teachers.

Secondly, in keeping with the philosophy that "useRs become developers", as I have found the need for particular features and have developed specific solutions, I have made them available via CRAN or the lists. Two functions in particular have been available in the "gregmisc" package since mid-2002, thanks to Greg Warnes: the barplot2() and CrossTable() functions. As time permits, I plan to enhance, most notably, the CrossTable() function with a variety of non-parametric correlation measures, and will make those available when ready.

Thirdly, in a different area, I have committed to financially contributing back to the Community by supporting the R Foundation and the recent useR 2004 meeting. This is distinct from volunteering my time, and recognizes that, ultimately, even a voluntary Community such as this one has some operational costs to cover. When, in 2000-2001, I was contemplating the purchase of S-PLUS and Stata, these would have cost approximately (US)$2,500 and (US)$1,200 respectively. In effect, I have committed those dollars to supporting the R Foundation at the Benefactor level over the coming years.

I would challenge and urge other commercial useRs to contribute to the R Foundation at whatever level makes sense for your organization.
Thanks
It has been my pleasure and privilege to be a member of this Community over the past three years. It has been nothing short of phenomenal to experience the growth of this Community during that time.

I would like to thank Doug Bates for his kind invitation to contribute this article, the idea for which arose in a discussion we had during the recent useR meeting in Vienna. I would also like to extend my thanks to R Core and to the R Community at large
for their continued drive to create and support this wonderful vehicle for international collaboration.
Marc Schwartz
MedAnalytics, Inc., Minneapolis, Minnesota, USA
MSchwartz@MedAnalytics.com
The ade4 package - I: One-table methods
by Daniel Chessel, Anne B. Dufour and Jean Thioulouse
Introduction
This paper is a short summary of the main classes defined in the ade4 package for one table analysis methods (e.g., principal component analysis). Other papers will detail the classes defined in ade4 for two-tables coupling methods (such as canonical correspondence analysis, redundancy analysis, and co-inertia analysis), for methods dealing with K-tables analysis (i.e., three-ways tables), and for graphical methods.

This package is a complete rewrite of the ADE4 software (Thioulouse et al. (1997), http://pbil.univ-lyon1.fr/ADE-4/) for the R environment. It contains Data Analysis functions to analyse Ecological and Environmental data in the framework of Euclidean Exploratory methods, hence the name ade4 (i.e., 4 is not a version number, but means that there are four E in the acronym).

The ade4 package is available in CRAN, but it can also be used directly online, thanks to the Rweb system (http://pbil.univ-lyon1.fr/Rweb/). This possibility is being used to provide multivariate analysis services in the field of bioinformatics, particularly for sequence and genome structure analysis at the PBIL (http://pbil.univ-lyon1.fr/). An example of these services is the automated analysis of the codon usage of a set of DNA sequences by correspondence analysis (http://pbil.univ-lyon1.fr/mva/coa.php).
The duality diagram class
The basic tool in ade4 is the duality diagram, Escoufier (1987). A duality diagram is simply a list that contains a triplet (X, Q, D):

- X is a table with n rows and p columns, considered as p points in R^n (column vectors) or n points in R^p (row vectors).

- Q is a p x p diagonal matrix containing the weights of the p columns of X, used as a scalar product in R^p (Q is stored under the form of a vector of length p).

- D is an n x n diagonal matrix containing the weights of the n rows of X, used as a scalar product in R^n (D is stored under the form of a vector of length n).
For example, if X is a table containing normalized quantitative variables, if Q is the identity matrix Ip, and if D is equal to (1/n) In, the triplet corresponds to a principal component analysis on correlation matrix (normed PCA). Each basic method corresponds to a particular triplet (see table 1), but more complex methods can also be represented by their duality diagram.
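Written out in base R, the normed-PCA triplet above is just a small list. The sketch below is illustrative only (ade4's own dudi objects hold more components than this):

```r
## Illustrative base-R construction of the normed PCA triplet (X, Q, D);
## this is a sketch, not the ade4 internals.
n <- nrow(USArrests)
## centre and norm with 1/n (rather than 1/(n-1)) variances, as in ade4:
X <- scale(USArrests) * sqrt(n / (n - 1))
triplet <- list(tab = as.data.frame(X),
                cw  = rep(1, ncol(X)),   # Q = Ip: unit column weights
                lw  = rep(1 / n, n))     # D = (1/n) In: uniform row weights
## each column has unit variance under the row weights lw:
colSums(triplet$lw * triplet$tab^2)
```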
Functions   Analyses                        Notes
dudi.pca    principal component             1
dudi.coa    correspondence                  2
dudi.acm    multiple correspondence         3
dudi.fca    fuzzy correspondence            4
dudi.mix    analysis of a mixture of
            numeric and factors             5
dudi.nsc    non symmetric correspondence    6
dudi.dec    decentered correspondence       7

The dudi functions. 1: Principal component analysis, same as prcomp/princomp. 2: Correspondence analysis, Greenacre (1984). 3: Multiple correspondence analysis, Tenenhaus and Young (1985). 4: Fuzzy correspondence analysis, Chevenet et al. (1994). 5: Analysis of a mixture of numeric variables and factors, Hill and Smith (1976), Kiers (1994). 6: Non symmetric correspondence analysis, Kroonenberg and Lombardo (1999). 7: Decentered correspondence analysis, Dolédec et al. (1995).
The singular value decomposition of a triplet gives principal axes, principal components, and row and column coordinates, which are added to the triplet for later use.

We can use, for example, a well-known dataset from the base package:

> data(USArrests)
> pca1 <- dudi.pca(USArrests, scannf = FALSE, nf = 3)

scannf = FALSE means that the number of principal components that will be used to compute row and column coordinates should not be asked interactively to the user, but taken as the value of argument nf (by default, nf = 2). Other parameters allow one to choose between centered, normed or raw PCA (default is centered and normed), and to set arbitrary row and column weights. The pca1 object is a duality diagram, i.e., a list made of several vectors and dataframes:
> pca1
Duality diagramm
class: pca dudi
$call: dudi.pca(df = USArrests, scannf = FALSE, nf = 3)
$nf: 3 axis-components saved
$rank: 4
eigen values: 2.48 0.9898 0.3566 0.1734
  vector length mode    content
1 $cw    4      numeric column weights
2 $lw    50     numeric row weights
3 $eig   4      numeric eigen values
  data.frame nrow ncol content
1 $tab       50   4    modified array
2 $li        50   3    row coordinates
3 $l1        50   3    row normed scores
4 $co        4    3    column coordinates
5 $c1        4    3    column normed scores
other elements: cent norm
pca1$lw and pca1$cw are the row and column weights that define the duality diagram, together with the data table (pca1$tab). pca1$eig contains the eigenvalues. The row and column coordinates are stored in pca1$li and pca1$co. The variance of these coordinates is equal to the corresponding eigenvalue, and unit variance coordinates are stored in pca1$l1 and pca1$c1 (this is useful to draw biplots).
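For readers without ade4 at hand, the printed eigenvalues can be cross-checked in base R: a normed PCA diagonalizes the correlation matrix of the data, so eigen() applied to cor(USArrests) should reproduce pca1$eig (a sketch for verification, not ade4 code):

```r
## Base-R cross-check (a sketch; not part of ade4): the eigenvalues of
## a normed PCA are those of the correlation matrix of the data.
data(USArrests)
ev <- eigen(cor(USArrests), symmetric = TRUE)$values
round(ev, 4)   # reproduces the 'eigen values' printed above: 2.48 0.9898 ...
sum(ev)        # equals p = 4, the number of variables (trace of cor matrix)
```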
The general optimization theorems of data analysis take particular meanings for each type of analysis, and graphical functions are proposed to draw the canonical graphs, i.e., the graphical expression corresponding to the mathematical property of the object. For example, the normed PCA of a quantitative variable table gives a score that maximizes the sum of squared correlations with variables. The PCA canonical graph is therefore a graph showing how the sum of squared correlations is maximized for the variables of the data set. On the USArrests example, we obtain the following graphs:

> score(pca1)
> s.corcircle(pca1$co)
Figure 1: One dimensional canonical graph for a normed PCA. Variables are displayed as a function of row scores, to get a picture of the maximization of the sum of squared correlations.
Figure 2: Two dimensional canonical graph for a normed PCA (correlation circle): the direction and length of arrows show the quality of the correlation between variables, and between variables and principal components.
The scatter function draws the biplot of the PCA (i.e., a graph with both rows and columns superimposed):

> scatter(pca1)
Figure 3: The PCA biplot. Variables are symbolized by arrows, and they are superimposed to the individuals display. The scale of the graph is given by a grid, whose size is given in the upper right corner. Here, the length of the side of grid squares is equal to one. The eigenvalues bar chart is drawn in the upper left corner, with the two black bars corresponding to the two axes used to draw the biplot. Grey bars correspond to axes that were kept in the analysis, but not used to draw the graph.
Separate factor maps can be drawn with the s.corcircle (see figure 2) and s.label functions:

> s.label(pca1$li)
Figure 4: Individuals factor map in a normed PCA.
Distance matrices
A duality diagram can also come from a distance matrix, if this matrix is Euclidean (i.e., if the distances in the matrix are the distances between some points in a Euclidean space). The ade4 package contains functions to compute dissimilarity matrices (dist.binary for binary data, and dist.prop for frequency data), test whether they are Euclidean, Gower and Legendre (1986), and make them Euclidean (quasieuclid, lingoes, Lingoes (1971), cailliez, Cailliez (1983)). These functions are useful to ecologists who use the works of Legendre and Anderson (1999) and Legendre and Legendre (1998).
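The Euclidean criterion itself (Gower (1966)) is easy to sketch in base R. The helper below is an illustration of the idea, with an ad hoc name, and is not the actual ade4 implementation:

```r
## Sketch of the Euclidean criterion (illustrative; not ade4's code):
## d is Euclidean iff the doubly centred matrix -0.5 * J %*% d^2 %*% J
## has no negative eigenvalue.
is_euclid_sketch <- function(d, tol = 1e-7) {
  d <- as.matrix(d)
  n <- nrow(d)
  J <- diag(n) - matrix(1 / n, n, n)          # centring operator
  B <- -0.5 * J %*% (d * d) %*% J             # centred Gram matrix
  all(eigen(B, symmetric = TRUE, only.values = TRUE)$values > -tol)
}
## Distances between actual points are Euclidean:
is_euclid_sketch(dist(USArrests[1:10, ]))     # TRUE
## A matrix violating the triangle inequality is not:
bad <- matrix(c(0, 1, 3, 1, 0, 1, 3, 1, 0), 3)
is_euclid_sketch(as.dist(bad))                # FALSE
```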
The Yanomama data set (Manly (1991)) contains three distance matrices between 19 villages of Yanomama Indians. The dudi.pco function can be used to compute a principal coordinates analysis (PCO, Gower (1966)), that gives a Euclidean representation of the 19 villages. This Euclidean representation allows one to compare the geographical, genetic and anthropometric distances.
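Base R offers the same principal coordinates idea as cmdscale(); since the Yanomama matrices ship with ade4, a base dataset stands in here for a quick, dependency-free sketch:

```r
## Principal coordinates in base R via cmdscale (same idea as dudi.pco);
## USArrests stands in for the Yanomama matrices, which ship with ade4.
d   <- dist(scale(USArrests))
pco <- cmdscale(d, k = 3, eig = TRUE)
dim(pco$points)   # 50 rows, 3 Euclidean coordinates
```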
> data(yanomama)
> gen <- quasieuclid(as.dist(yanomama$gen))
> geo <- quasieuclid(as.dist(yanomama$geo))
> ant <- quasieuclid(as.dist(yanomama$ant))
> geo1 <- dudi.pco(geo, scann = FALSE, nf = 3)
> gen1 <- dudi.pco(gen, scann = FALSE, nf = 3)
> ant1 <- dudi.pco(ant, scann = FALSE, nf = 3)
> par(mfrow = c(2, 2))
> scatter(geo1)
> scatter(gen1)
> scatter(ant1, posi = "bottom")
Figure 5: Comparison of the PCO analysis of the three distance matrices of the Yanomama data set: geographic, genetic and anthropometric distances.
Taking into account groups of individuals
In sites x species tables, rows correspond to sites, columns correspond to species, and the values are the number of individuals of species j found at site i. These tables can have many columns and cannot be used in a discriminant analysis. In this case, between-class analyses (between function) are a better alternative, and they can be used with any duality diagram. The between-class analysis of triplet (X, Q, D) for a given factor f is the analysis of the triplet (G, Q, Dw), where G is the table of the means of table X for the groups defined by f, and Dw is the diagonal matrix of group weights. For example, a between-class correspondence analysis (BCA) is very simply obtained after a correspondence analysis (CA):
> data(meaudret)
> coa1 <- dudi.coa(meaudret$fau, scannf = FALSE)
> bet1 <- between(coa1, meaudret$plan$sta,
+     scannf = FALSE)
> plot(bet1)
The meaudret$fau dataframe is an ecological table with 24 rows corresponding to six sampling sites along a small French stream (the Meaudret). These six sampling sites were sampled four times (spring, summer, winter and autumn), hence the 24 rows. The 13 columns correspond to 13 ephemeroptera species. The CA of this data table is done with the dudi.coa function, giving the coa1 duality diagram. The corresponding between-class analysis is done with the between function, considering the sites as classes (meaudret$plan$sta is a factor defining the classes). Therefore, this is a between-sites analysis, whose aim is to discriminate the sites, given the distribution of ephemeroptera species. This gives the bet1 duality diagram, and Figure 6 shows the graph obtained by plotting this object.
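The table G of group means that between() analyses can also be sketched by hand; the following is a minimal illustration (it assumes the meaudret data layout described above, and between() additionally handles the group weights internally):

```r
> # Hypothetical sketch: average the species counts within each site
> # to obtain the table G of group means used by the between-class analysis
> data(meaudret)
> G <- aggregate(meaudret$fau,
+     by = list(site = meaudret$plan$sta), mean)
> dim(G)  # one row per site, one column per species (plus the site label)
```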
Figure 6: BCA plot. This is a composed plot, made of: 1- the species canonical weights (top left), 2- the species scores (middle left), 3- the eigenvalues bar chart (bottom left), 4- the plot of plain CA axes projected into BCA (bottom center), 5- the gravity centers of classes (bottom right), 6- the projection of the rows with ellipses and gravity center of classes (main graph).
Like between-class analyses, linear discriminant analysis (discrimin function) can be extended to any duality diagram (G, (XtDX)-, Dw), where (XtDX)- is a generalized inverse. This gives for example a correspondence discriminant analysis (Perrière et al. (1996), Perrière and Thioulouse (2002)), that can be computed with the discrimin.coa function.
Opposite to between-class analyses are within-class analyses, corresponding to diagrams (X - Y Dw^-1 Yt D X, Q, D) (within functions). These analyses extend to any type of variables the Multiple Group Principal Component Analysis (MGPCA, Thorpe (1983a), Thorpe (1983b), Thorpe and Leamy (1983)). Furthermore, the within.coa function introduces the double within-class correspondence analysis (also named internal correspondence analysis, Cazes et al. (1988)).
Permutation tests
Permutation tests (also called Monte-Carlo tests, or randomization tests) can be used to assess the statistical significance of between-class analyses. Many permutation tests are available in the ade4 package, for example mantel.randtest, procuste.randtest, randtest.between, randtest.coinertia, RV.rtest, randtest.discrimin, and several of these tests are available both in the R (mantel.rtest) and in the C (mantel.randtest) programming language. The R version makes it possible to see how the computations are performed, and to write other tests easily, while the C version is needed for performance reasons.
The statistical significance of the BCA can be evaluated with the randtest.between function. By default, 999 permutations are simulated, and the resulting object (test1) can be plotted (Figure 7). The p-value is highly significant, which confirms the existence of differences between sampling sites. The plot shows that the observed value is very far to the right of the histogram of simulated values.
> test1 <- randtest.between(bet1)
> test1
Monte-Carlo test
Observation: 0.4292
Call: randtest.between(xtest = bet1)
Based on 999 replicates
Simulated p-value: 0.001
> plot(test1, main = "Between class inertia")
Figure 7: Histogram of the 999 simulated values of the randomization test of the bet1 BCA. The observed value is given by the vertical line, at the right of the histogram.
Conclusion
We have described only the most basic functions of the ade4 package, considering only the simplest one-table data analysis methods. Many other dudi methods are available in ade4, for example multiple correspondence analysis (dudi.acm), fuzzy correspondence analysis (dudi.fca), analysis of a mixture of numeric variables and factors (dudi.mix), non-symmetric correspondence analysis (dudi.nsc), and decentered correspondence analysis (dudi.dec).
We are preparing a second paper, dealing with two-table coupling methods, among which canonical correspondence analysis and redundancy analysis are the most frequently used in ecology (Legendre and Legendre, 1998). The ade4 package proposes an alternative to these methods, based on the co-inertia criterion (Dray et al., 2003).
The third category of data analysis methods available in ade4 are K-tables analysis methods, that try to extract the stable part in a series of tables. These methods come from the STATIS strategy (Lavit et al., 1994) (statis and pta functions), or from the multiple co-inertia strategy (mcoa function). The mfa and foucart functions perform two variants of K-tables analysis, and the STATICO method (function ktab.match2ktabs, Thioulouse et al. (2004)) makes it possible to extract the stable part of species-environment relationships, in time or in space.
Methods taking into account spatial constraints (multispati function) and phylogenetic constraints (phylog function) are under development.
Bibliography
Cailliez, F. (1983). The analytical solution of the additive constant problem. Psychometrika, 48:305-310.

Cazes, P., Chessel, D. and Dolédec, S. (1988). L'analyse des correspondances internes d'un tableau partitionné: son usage en hydrobiologie. Revue de Statistique Appliquée, 36:39-54.

Chevenet, F., Dolédec, S. and Chessel, D. (1994). A fuzzy coding approach for the analysis of long-term ecological data. Freshwater Biology, 31:295-309.

Dolédec, S., Chessel, D. and Olivier, J. (1995). L'analyse des correspondances décentrée: application aux peuplements ichtyologiques du Haut-Rhône. Bulletin Français de la Pêche et de la Pisciculture, 336:29-40.

Dray, S., Chessel, D. and Thioulouse, J. (2003). Co-inertia analysis and the linking of ecological tables. Ecology, 84:3078-3089.

Escoufier, Y. (1987). The duality diagram: a means of better practical applications. In Legendre, P. and Legendre, L., editors, Development in numerical ecology, pages 139-156. NATO Advanced Institute Series G, Springer Verlag, Berlin.

Gower, J. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53:325-338.

Gower, J. and Legendre, P. (1986). Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 3:5-48.

Greenacre, M. (1984). Theory and applications of correspondence analysis. Academic Press, London.

Hill, M. and Smith, A. (1976). Principal component analysis of taxonomic data with multi-state discrete characters. Taxon, 25:249-255.
Kiers, H. (1994). Simple structure in component analysis techniques for mixtures of qualitative and quantitative variables. Psychometrika, 56:197-212.

Kroonenberg, P. and Lombardo, R. (1999). Non-symmetric correspondence analysis: a tool for analysing contingency tables with a dependence structure. Multivariate Behavioral Research, 34:367-396.

Lavit, C., Escoufier, Y., Sabatier, R. and Traissac, P. (1994). The ACT (STATIS method). Computational Statistics and Data Analysis, 18:97-119.

Legendre, P. and Anderson, M. (1999). Distance-based redundancy analysis: testing multispecies responses in multifactorial ecological experiments. Ecological Monographs, 69:1-24.

Legendre, P. and Legendre, L. (1998). Numerical ecology. Elsevier Science BV, Amsterdam, 2nd English edition.

Lingoes, J. (1971). Some boundary conditions for a monotone analysis of symmetric matrices. Psychometrika, 36:195-203.

Manly, B. F. J. (1991). Randomization and Monte Carlo methods in biology. Chapman and Hall, London.

Perrière, G., Combet, C., Penel, S., Blanchet, C., Thioulouse, J., Geourjon, C., Grassot, J., Charavay, C., Gouy, M., Duret, L. and Deléage, G. (2003). Integrated databanks access and sequence/structure analysis services at the PBIL. Nucleic Acids Research, 31:3393-3399.

Perrière, G., Lobry, J. and Thioulouse, J. (1996). Correspondence discriminant analysis: a multivariate method for comparing classes of protein and nucleic acid sequences. Computer Applications in the Biosciences, 12:519-524.

Perrière, G. and Thioulouse, J. (2002). Use of correspondence discriminant analysis to predict the subcellular location of bacterial proteins. Computer Methods and Programs in Biomedicine, in press.

Tenenhaus, M. and Young, F. (1985). An analysis and synthesis of multiple correspondence analysis, optimal scaling, dual scaling, homogeneity analysis and other methods for quantifying categorical multivariate data. Psychometrika, 50:91-119.

Thioulouse, J., Chessel, D., Dolédec, S. and Olivier, J. (1997). ADE-4: a multivariate analysis and graphical display software. Statistics and Computing, 7:75-83.

Thioulouse, J., Simier, M. and Chessel, D. (2004). Simultaneous analysis of a sequence of paired ecological tables. Ecology, 85:272-283.

Thorpe, R. S. (1983a). A biometric study of the effects of growth on the analysis of geographic variation: tooth number in green geckos (Reptilia: Phelsuma). Journal of Zoology, 201:13-26.

Thorpe, R. S. (1983b). A review of the numerical methods for recognising and analysing racial differentiation. In Felsenstein, J., editor, Numerical Taxonomy, Series G, pages 404-423. Springer-Verlag, New York.

Thorpe, R. S. and Leamy, L. (1983). Morphometric studies in inbred house mice (Mus sp.): multivariate analysis of size and shape. Journal of Zoology, 199:421-432.
qcc: An R package for quality control charting and statistical process control

by Luca Scrucca
Introduction
The qcc package for the R statistical environment allows the user to:
- plot Shewhart quality control charts for continuous, attribute and count data;
- plot Cusum and EWMA charts for continuous data;
- draw operating characteristic curves;
- perform process capability analyses;
- draw Pareto charts and cause-and-effect diagrams.
I started writing the package to provide the students in the class I teach a tool for learning the basic concepts of statistical quality control, at the introductory level of a book such as Montgomery (2000). Because of the intended final users, a free statistical environment was important, and R was a natural candidate.
The initial design, albeit written from scratch, reflected the set of functions available in S-Plus. Users of that environment will find some similarities, but also differences. In particular, during the development I added more tools and re-designed most of the library to meet the requirements I thought were important to satisfy.
Creating a qcc object
A qcc object is the basic building block to start with. It can be created simply by invoking the qcc function, which in its simplest form requires two arguments: a data frame, a matrix or a vector containing the observed data, and a string value specifying the control chart to be computed. Standard Shewhart control charts are available. These are often classified according to the type of quality characteristic that they are supposed to monitor (see Table 1).
Control charts for continuous variables are usually based on several samples, with observations collected at different time points. Each sample or "rational group" must be provided as a row of a data frame or a matrix. The function qcc.groups can be used to easily group a vector of data values based on a sample indicator. Unequal sample sizes are allowed.
Suppose we have extracted samples for a characteristic of interest from an ongoing production process:
> data(pistonrings)
> attach(pistonrings)
> dim(pistonrings)
[1] 200 3
> pistonrings
    diameter sample trial
1     74.030      1  TRUE
2     74.002      1  TRUE
3     74.019      1  TRUE
4     73.992      1  TRUE
5     74.008      1  TRUE
6     73.995      2  TRUE
7     73.992      2  TRUE
...
199   74.000     40 FALSE
200   74.020     40 FALSE
> diameter <- qcc.groups(diameter, sample)
> dim(diameter)
[1] 40 5
> diameter
     [,1]   [,2]   [,3]   [,4]   [,5]
1  74.030 74.002 74.019 73.992 74.008
2  73.995 73.992 74.001 74.011 74.004
...
40 74.010 74.005 74.029 74.000 74.020
Using the first 25 samples as training data, an X chart can be obtained as follows:
> obj <- qcc(diameter[1:25,], type="xbar")
By default a Shewhart chart is drawn (see Figure 1) and an object of class 'qcc' is returned. Summary statistics can be retrieved using the summary method (or simply by printing the object):
> summary(obj)
Call:
qcc(data = diameter[1:25, ], type = "xbar")

xbar chart for diameter[1:25, ]

Summary of group statistics:
 Min. 1st Qu. Median  Mean 3rd Qu.  Max.
73.99   74.00  74.00 74.00   74.00 74.01

Group sample size: 5
Number of groups: 25
Center of group statistics: 74.00118
Standard deviation: 0.009887547

Control limits:
      LCL      UCL
 73.98791 74.01444
type - Control charts for variables

xbar : X chart. Sample means are plotted to control the mean level of a continuous process variable.

xbar.one : X chart. Sample values from a one-at-time data process, to control the mean level of a continuous process variable.

R : R chart. Sample ranges are plotted to control the variability of a continuous process variable.

S : S chart. Sample standard deviations are plotted to control the variability of a continuous process variable.

type - Control charts for attributes

p : p chart. The proportion of nonconforming units is plotted. Control limits are based on the binomial distribution.

np : np chart. The number of nonconforming units is plotted. Control limits are based on the binomial distribution.

c : c chart. The number of defectives per unit are plotted. This chart assumes that defects of the quality attribute are rare, and the control limits are computed based on the Poisson distribution.

u : u chart. The average number of defectives per unit is plotted. The Poisson distribution is used to compute control limits, but, unlike the c chart, this chart does not require a constant number of units.

Table 1: Shewhart control charts available in the qcc package.
The number of groups and their sizes are reported in this case, whereas a table is provided in the case of unequal sample sizes. Moreover, the center of group statistics (the overall mean for an X chart) and the within-group standard deviation of the process are returned.
The main goal of a control chart is to monitor a process. If special causes of variation are present, the process is said to be "out of control" and actions are to be taken to find, and possibly eliminate, such causes. A process is declared to be "in control" if all points charted lie randomly within the control limits. These are usually computed at +/-3 sigmas from the center. This default can be changed using the argument nsigmas (so-called "American practice") or by specifying the confidence level (so-called "British practice") through the confidence.level argument.
Assuming a Gaussian distribution is appropriate for the statistic charted, the two practices are equivalent, since limits at +/-3 sigmas correspond to a two-tails probability of 0.0027 under a standard normal curve.
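This equivalence is easy to check at the console (a quick sketch; pnorm and qnorm are base R):

```r
> # two-tails probability beyond +/- 3 standard errors
> 2 * pnorm(-3)
[1] 0.002699796
> # conversely, the number of sigmas matching a 0.9973 confidence level
> qnorm(1 - (1 - 0.9973)/2)  # approximately 3
```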
Instead of using standard calculations to compute the group statistics and the corresponding control limits (see Montgomery (2000), Wetherill and Brown (1991)), the center, the within-group standard deviation and the control limits may be specified directly by the user through the arguments center, std.dev and limits, respectively.
Figure 1: An X chart for the samples data.
Shewhart control charts
A Shewhart chart is automatically plotted when an object of class 'qcc' is created, unless the qcc function is called with the argument plot=FALSE. The method responsible for plotting a control chart can however be invoked directly as follows:
> plot(obj)
giving the plot in Figure 1. This control chart has the center line (horizontal solid line), the upper and lower control limits (dashed lines), and the sample group statistics are drawn as points connected with lines. Unless we provide the argument add.stats=FALSE, at the bottom of the plot some summary statistics are shown, together with the number of points beyond control limits and the number of violating runs (a run has length 5 by default).
Once a process is considered to be "in-control", we may use the computed limits for monitoring new data sampled from the same ongoing process. For example,
> obj <- qcc(diameter[1:25,], type="xbar",
+            newdata=diameter[26:40,])
plots the X chart for both calibration data and new data (see Figure 2), but all the statistics and the control limits are solely based on the first 25 samples.
Figure 2: An X chart with calibration data and new data samples.
If only the new data need to be plotted, we may provide the argument chart.all=FALSE in the call to the qcc function, or through the plotting method:
> plot(obj, chart.all=FALSE)
Many other arguments may be provided in the call to the qcc function and to the corresponding plotting method, and we refer the reader to the on-line documentation for details (help(qcc)).
As reported on Table 1, an X chart for one-at-time data may be obtained by specifying type="xbar.one". In this case the data charted are simply the sample values, and the within-group standard deviation is estimated by moving ranges of k (by default 2) points.
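For instance, charting the raw diameter measurements one at a time might look like the following (a sketch; it assumes the pistonrings data from the earlier example is available):

```r
> # X chart for individual measurements; the within-group standard
> # deviation is estimated from moving ranges of 2 consecutive points
> qcc(pistonrings$diameter[1:125], type="xbar.one")
```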
Control charts for attributes mainly differ from the previous examples in that we need to provide sample sizes through the size argument (except for the c chart). For example, a p chart can be obtained as follows:
> data(orangejuice)
> attach(orangejuice)
> orangejuice
   sample  D size trial
1       1 12   50  TRUE
2       2 15   50  TRUE
3       3  8   50  TRUE
...
53     53  3   50 FALSE
54     54  5   50 FALSE
> obj2 <- qcc(D[trial], sizes=size[trial], type="p")
where we use the logical indicator variable trial to select the first 30 samples as calibration data.
Operating characteristic function
An operating characteristic (OC) curve provides information about the probability of not detecting a shift in the process. This is usually referred to as the type II error, that is, the probability of erroneously accepting a process as being "in control".
The OC curves can be easily obtained from an object of class 'qcc':
> par(mfrow=c(1,2))
> oc.curves(obj)
> oc.curves(obj2)
The function oc.curves invisibly returns a matrix or a vector of probability values for the type II error. More arguments are available (see help(oc.curves)); in particular, identify=TRUE allows the user to interactively identify values on the plot.
Figure 3: Operating characteristic curves for the xbar chart (left) and the p chart (right).
Cusum charts
Cusum charts display how the group summary statistics deviate above or below the process center or target value, relative to the standard errors of the summary statistics. They are useful to detect small and permanent variation on the mean of the process.
The basic cusum chart implemented in the qcc package is only available for continuous variables at the moment. Assuming a 'qcc' object has been created, we can simply input the following code:
> cusum(obj)
and the resulting cusum chart is shown in Figure 4. This displays on a single plot both the positive deviations and the negative deviations from the target, which by default is set to the center of group statistics. If a different target for the process is required, this value may be provided through the argument target when creating the object of class 'qcc'. Two further aspects may need to be controlled. The argument decision.int can be used to specify the number of standard errors of the summary statistics at which the cumulative sum displays an out of control; by default it is set at 5. The amount of shift in the process to be detected by the cusum chart, measured in standard errors of the summary statistics, is controlled by the argument se.shift, by default set at 1.
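A cusum chart tuned for a tighter decision interval and a larger shift might therefore be requested along these lines (a hedged sketch using the argument names described above; the exact call signature should be checked in help(cusum)):

```r
> # tighter decision boundaries (4 standard errors) and detection of a
> # 1.5 standard-error shift; argument names as described in the text
> cusum(obj, decision.int = 4, se.shift = 1.5)
```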
Figure 4: Cusum chart.
EWMA charts
Exponentially weighted moving average (EWMA) charts smooth a series of data based on a moving average with weights which decay exponentially. As for the cusum chart, this plot can also be used to detect small and permanent variation on the mean of the process.
The EWMA chart depends on the smoothing parameter lambda, which controls the weighting scheme applied. By default it is set at 0.2, but it can be modified using the argument lambda. Assuming again that a 'qcc' object has been created, an EWMA chart can be obtained as follows:
> ewma(obj)
and the resulting graph is shown in Figure 5.
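To see the effect of the smoothing parameter, one might compare the default against a heavier smoothing (a sketch; it assumes lambda can be passed in the call, as the description above suggests):

```r
> # heavier smoothing than the 0.2 default: more weight on past values
> ewma(obj, lambda = 0.4)
```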
Figure 5: EWMA chart.
Process capability analysis
Process capability indices for a characteristic of interest from a continuous process can be obtained through the function process.capability. This takes as mandatory arguments an object of class 'qcc', for a "xbar" or "xbar.one" type, and spec.limits, a vector specifying the lower (LSL) and upper (USL) specification limits. For example, using the previously created 'qcc' object, we may simply use:
> process.capability(obj,
+    spec.limits=c(73.95,74.05))

Process Capability Analysis

Call:
process.capability(object = obj,
    spec.limits = c(73.95, 74.05))

Number of obs = 125     Target = 74
Center = 74.00118       LSL = 73.95
StdDev = 0.009887547    USL = 74.05

Capability indices:
       Value    2.5%   97.5%
Cp     1.686   1.476   1.895
Cp_l   1.725   1.539   1.912
Cp_u   1.646   1.467   1.825
Cp_k   1.646   1.433   1.859
Cpm    1.674   1.465   1.882

Exp<LSL 0%   Obs<LSL 0%
Exp>USL 0%   Obs>USL 0%
The graph shown in Figure 6 is a histogram of data values, with vertical dotted lines for the upper and the lower specification limits. The target is shown as a vertical dashed line, and its value can be provided using the target argument in the call; if missing, the value from the 'qcc' object is used if available, otherwise the target is set at the middle value between the specification limits.
The dashed curve is a normal density for a Gaussian distribution matching the observed mean and standard deviation. The latter is estimated using the within-group standard deviation, unless a value is specified in the call using the std.dev argument.
The capability indices reported on the graph are also printed at the console, together with confidence intervals computed at the level specified by the confidence.level argument (by default set at 0.95).
Figure 6: Process capability analysis.
A further display, called a "process capability sixpack plot", is also available. This is a graphical summary formed by plotting on the same graphical device:
- an X chart
- an R or S chart (if sample sizes > 10)
- a run chart
- a histogram with specification limits
- a normal Q-Q plot
- a capability plot
As an example we may input the following code
> process.capability.sixpack(obj,
+    spec.limits=c(73.95,74.05),
+    target=74.01)
Pareto charts and cause-and-effect diagrams
A Pareto chart is a barplot where the categories are ordered using frequencies in non-increasing order, with a line added to show their cumulative sum. Pareto charts are useful to identify those factors that have the greatest cumulative effect on the system, and thus screen out the less significant factors in an analysis.
A simple Pareto chart may be drawn using the function pareto.chart, which in its simpler form requires a vector of values. For example:
> defect <- c(80, 27, 66, 94, 33)
> names(defect) <-
+   c("price code", "schedule date",
+     "supplier code", "contact num.",
+     "part num.")
> pareto.chart(defect)
The function returns a table of descriptive statistics and the Pareto chart shown in Figure 7. More arguments can be provided to control axes labels and limits, the main title and bar colors (see help(pareto.chart)).
Figure 7: Pareto chart.
A cause-and-effect diagram, also called an Ishikawa diagram (or fish bone diagram), is used to associate multiple possible causes with a single effect. Thus, given a particular effect, the diagram is constructed to identify and organize possible causes for it.
A very basic implementation of this type of diagram is available in the qcc package. The function cause.and.effect requires at least two arguments: cause, a list of causes and branches providing descriptive labels, and effect, a string describing the effect. Other arguments are available, such as cex and font, which control, respectively, the character expansions and the fonts used for labeling branches, causes and the effect. The following example creates the diagram shown in Figure 8.
> cause.and.effect(
+   cause=list(Measurements=c("Microscopes",
+                             "Inspectors"),
+              Materials=c("Alloys",
+                          "Suppliers"),
+              Personnel=c("Supervisors",
+                          "Operators"),
+              Environment=c("Condensation",
+                            "Moisture"),
+              Methods=c("Brake", "Engager",
+                        "Angle"),
+              Machines=c("Speed", "Bits",
+                         "Sockets")),
+   effect="Surface Flaws")
Figure 8: Cause-and-effect diagram.
Tuning and extending the package
Options to control some aspects of the qcc package can be set using the function qcc.options. If this is called with no arguments it retrieves the list of available options and their current values; otherwise it sets the required option. From a user point of view, the most interesting options allow the user to set: the plotting character and the color used to highlight points beyond control limits or violating runs; the run length, i.e. the maximum value of a run before signaling a point as out of control; the background color used to draw the figure and the margin color; the character expansion used to draw the labels, the title and the statistics at the bottom of any plot. For details see help(qcc.options).
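Querying and changing an option might look like the following (a sketch; the option name "run.length" is an assumption based on the description above, so check the list returned by qcc.options() for the actual names):

```r
> # list all available options and their current values
> qcc.options()
> # hypothetical: raise the run length used to flag violating runs
> qcc.options(run.length = 7)
```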
Another interesting feature is the possibility to easily extend the package by defining new control charts. Suppose we want to plot a p chart based on samples with different sizes. The resulting control chart will have non-constant control limits, and thus it may be difficult to read by some operators. One possible solution is to draw a standardized control chart. This can be accomplished by defining three new functions: one which computes and returns the group statistics to be charted (i.e. the z scores) and their center (zero in this case), another one which returns the within-group standard deviation (one in this case), and finally a function which returns the control limits. These functions need to be called stats.type, sd.type and limits.type, respectively, for a new chart of type "type". The following code may be used to define a standardized p chart:
stats.p.std <- function(data, sizes)
{
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data)/sum(sizes)
  z <- (data/sizes - pbar)/sqrt(pbar*(1-pbar)/sizes)
  list(statistics = z, center = 0)
}

sd.p.std <- function(data, sizes) return(1)

limits.p.std <- function(center, std.dev, sizes, conf)
{
  if (conf >= 1)
    { lcl <- -conf
      ucl <- +conf }
  else
    { if (conf > 0 & conf < 1)
        { nsigmas <- qnorm(1 - (1 - conf)/2)
          lcl <- -nsigmas
          ucl <- +nsigmas }
      else stop("invalid conf argument") }
  limits <- matrix(c(lcl, ucl), ncol = 2)
  rownames(limits) <- rep("", length = nrow(limits))
  colnames(limits) <- c("LCL", "UCL")
  return(limits)
}
Then, we may source the above code and obtain the control charts in Figure 9 as follows:
# set unequal sample sizes
> n <- c(rep(50,5), rep(100,5), rep(25,5))
# generate randomly the number of successes
> x <- rbinom(length(n), n, 0.2)
> par(mfrow=c(1,2))
# plot the control chart with variable limits
> qcc(x, type="p", size=n)
# plot the standardized control chart
> qcc(x, type="p.std", size=n)
Figure 9: A p chart with non-constant control limits (left panel) and the corresponding standardized control chart (right panel).
Summary
In this paper we briefly describe the qcc package. This provides functions to draw basic Shewhart quality control charts for continuous, attribute and count data; corresponding operating characteristic curves are also implemented. Other statistical quality tools available are Cusum and EWMA charts for continuous data, process capability analyses, Pareto charts and cause-and-effect diagrams.
References
Montgomery, D.C. (2000). Introduction to Statistical Quality Control, 4th ed. New York: John Wiley & Sons.

Wetherill, G.B. and Brown, D.W. (1991). Statistical Process Control. New York: Chapman & Hall.
Luca Scrucca
Dipartimento di Scienze Statistiche, Università degli Studi di Perugia, Italy
luca@stat.unipg.it
Least Squares Calculations in R
Timing different approaches

by Douglas Bates
Introduction
The S language encourages users to become programmers as they express new ideas and techniques of data analysis in S. Frequently the calculations in these techniques are expressed in terms of matrix operations. The subject of numerical linear algebra (how calculations with matrices can be carried out accurately and efficiently) is quite different from what many of us learn in linear algebra courses or in courses on linear statistical methods.
Numerical linear algebra is primarily based on decompositions of matrices, such as the LU decomposition, the QR decomposition, and the Cholesky decomposition, that are rarely discussed in linear algebra or statistics courses.
In this article we discuss one of the most common operations in statistical computing: calculating least squares estimates. We can write the problem mathematically as

    β̂ = arg min_β ||y − Xβ||²    (1)

where X is an n × p model matrix (p ≤ n), y is n-dimensional and β is p-dimensional. Most statistics texts state that the solution to (1) is

    β̂ = (X′X)⁻¹ X′y    (2)

when X has full column rank (i.e. the columns of X are linearly independent).
Indeed (2) is a mathematically correct way of writing a least squares solution, and, all too frequently, we see users of R calculating a least squares solution in exactly this way. That is, they use code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets, then this is a reasonable way to do the calculation. However, it is rare that code only gets used for small data sets or only a few times. By following a few rules we can make this code faster and more stable.
General principles
A few general principles of numerical linear algebra are:

1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.

2. Use crossprod(X), not t(X) %*% X, to calculate X′X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X′y.

3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X′X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
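As a sketch of these rules in action, here is a small simulated example (the matrix X and response y below are illustrative inventions, not the mm data used later in this article); all three formulations should produce the same coefficients:

```r
## Simulated example: compare the naive formula with the
## idioms recommended by rules 1-3 above.
set.seed(1)
X <- matrix(rnorm(1000 * 50), 1000, 50)
y <- rnorm(1000)

b.naive <- solve(t(X) %*% X) %*% t(X) %*% y      ## literal version of (2)
b.cross <- solve(crossprod(X), crossprod(X, y))  ## rules 1 and 2
ch      <- chol(crossprod(X))                    ## rule 3: Cholesky of X'X
b.chol  <- backsolve(ch,
               forwardsolve(ch, crossprod(X, y),
                            upper = TRUE, trans = TRUE))
## all three solutions should agree up to numerical error
all.equal(c(b.naive), c(b.cross))
all.equal(c(b.cross), c(b.chol))
```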
Some timings on a large example
For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well.
> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850  712
> sys.gc.time(naive.sol <- solve(t(mmm) %*%
+     mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00
(The function sys.gc.time is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)
According to the principles above, it should be more effective to compute
> sys.gc.time(cpod.sol <- solve(crossprod(mmm),
+     crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE
Timing results will vary between computers but in most cases the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X′X the naive way
> sys.gc.time(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00
because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.

However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result the solution of (1) is performed using a general linear system solver based on an LU decomposition when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly but the code is awkward
> sys.gc.time(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sys.gc.time(chol.sol <- backsolve(ch,
+     forwardsolve(ch, crossprod(mmm, y),
+         upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE
Least squares calculations with Matrix classes
The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition.
> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(Mat.sol <- solve(crossprod(mmg),
+     crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE
Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object so that it can be reused without recalculation.
> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all.
> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(sparse.sol <- solve(crossprod(mm),
+     crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE
The model matrix mm has class "cscMatrix", indicating that it is a compressed sparse column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric sparse column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case the decomposition is so fast that it is difficult to determine the difference in the solution times.
> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
Epilogue
Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS), such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or Atlas (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.

Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.

Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A′B are generally faster than those in AB.
Notice that this means that calculating A′B in R as t(A) %*% B instead of crossprod(A, B) causes A to be transposed twice: once in the calculation of t(A) and a second time in the inner loop of the matrix product. The crossprod function does not do any transposition of matrices; yet another reason to use crossprod.
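This is easy to check on one's own machine. The following sketch (the dimensions are arbitrary choices, not from the article) times both forms and confirms they agree:

```r
## crossprod(A, B) computes t(A) %*% B without explicitly
## transposing A, so it is typically faster.
A <- matrix(rnorm(2000 * 200), 2000, 200)
B <- matrix(rnorm(2000 * 200), 2000, 200)
system.time(C1 <- t(A) %*% B)
system.time(C2 <- crossprod(A, B))
all.equal(C1, C2)   ## the two results should agree
```

The absolute timings will depend on the machine and BLAS in use; only the relative comparison is of interest.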
Bibliography
J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1–2):3–35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).
D. M. Bates
U. of Wisconsin-Madison
bates@wisc.edu
Tools for interactively exploring R packages

by Jianhua Zhang and Robert Gentleman
Introduction
An R package has the required and perhaps additional subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer and they are intended for exploring packages, examples, and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.
Description
Our current implementation is in the form of tcltk widgets, and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcl/tk. Once both tkWidgets and widgetTools have been loaded into the working session the user has full access to these interactive tools.
pExplorer
pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from being shown by the widget. The default behavior is to open the first package in the library of locally installed R (.libPaths()), with subdirectories/files named Meta and latex excluded, if nothing is provided by a user.

Figure 1 shows the widget invoked by typing pExplorer("base").
The following tasks can be performed through the interface:

• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

• Clicking the Browse button allows users to select a directory containing R packages, to achieve the same results as typing in a new path.

• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.
Figure 1: A snapshot of the widget when pExplorer is called.
• Double clicking an item in the Contents list box results in different responses depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just being clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, which is discussed next.
eExplorer
eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.

Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.
vExplorer
vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionality provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of vignettes of the package.

Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget as shown by Figure 4 appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:
• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the pdf version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in Results of Execution.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.
The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.
Bibliography

Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28–31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21–24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.
Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base").
Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().
Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.
The survival package

by Thomas Lumley
This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).
Overview
Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).
Specifying survival data
In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.

In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.
> library(survival)
> data(veteran) ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+

## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
+     diagtime*30+time, status))
 [1] ( 210, 282 ] ( 150, 561 ]
 [3] (  90, 318 ] ( 270, 396 ]
 [9] ( 540, 854 ] ( 180, 280+]
The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.
One and two sample summaries
The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.
> data(veteran)
> plot(survfit(Surv(time, status) ~ trt, data = veteran),
+     xlab = "Years since randomisation",
+     xscale = 365.25, ylab = "% surviving",
+     yscale = 100, col = c("forestgreen", "blue"))
> survdiff(Surv(time, status) ~ trt, data = veteran)
Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

       N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394

      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928
Figure 1: Survival distributions for two lung cancer treatments (x-axis: years since randomisation; y-axis: % surviving; curves labelled Standard and New).
Proportional hazards models
The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.

Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies
    log h(t; z) = log h0(t) + βz
Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.

A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.
> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
+     log(bili) + log(protime) + age + platelet,
+     data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
      log(bili) + log(protime) + age + platelet,
      data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999

              se(coef)     z       p
edtrt         0.300321  3.43 0.00061
log(bili)     0.097771  9.73 0.00000
log(protime)  1.031908  2.80 0.00520
age           0.008489  4.18 0.00003
platelet      0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312
The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.
> plot(survfit(Surv(time, status) ~ edtrt,
+     data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt +
+     ratetable(edtrt = edtrt, bili = bili,
+         platelet = platelet, age = age,
+         protime = protime),
+     data = pbc, subset = trt == -9,
+     ratetable = mayomodel, cohort = TRUE),
+     col = "purple")
The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.
Figure 2: Observed and predicted survival.
The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The cox.zph function provides formal tests and graphical diagnostics for proportional hazards.
> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])
Figure 3: Testing proportional hazards (smoothed estimate of β(t) for edtrt against time).
There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.
Additional features
Computation for the Cox model extends straightforwardly to situations where z can vary over time, and when an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.

Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.

More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of followup.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
useR 2004: The R User Conference

by John Fox, McMaster University, Canada
More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.

The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.

A particularly interesting component of the program was the series of keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:
• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice
• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on data mining and related topics
Slides for the keynote addresses, along with abstracts of other papers presented at useR 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.
An innovation of useR 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.
The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19 and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR 2004, was very enthusiastic about the conference. We look forward to an even larger useR in 2006.
R Help Desk
Date and Time Classes in R

by Gabor Grothendieck and Thomas Petzoldt
Introduction
R-1.9.0 and its contributed packages have a number of date/time (i.e. date or date/time) classes. In particular, the following three classes are discussed in this article:
• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).
• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).
• POSIX classes. POSIX classes refer to the two classes POSIXct and POSIXlt and their common super class POSIXt. These support times and dates, including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases, and other applications that involve external communication with systems that also support dates, times, time zones, and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970, GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
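As a small sketch (the specific date is an arbitrary choice, and chron must first be installed from CRAN), the same calendar date can be constructed in each of the three systems, with unclass() exposing the internal representations described above:

```r
## The same date in each of the three systems.
library(chron)
d  <- as.Date("2004-06-01")                 ## Date: days since 1970-01-01
ch <- chron("06/01/2004")                   ## chron: days, with times as fractions
ct <- as.POSIXct("2004-06-01", tz = "GMT")  ## POSIXct: seconds since 1970, GMT
unclass(d)    ## 12570 (days since the origin)
unclass(ct)   ## 12570 * 86400 (seconds since the origin)
```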
When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.

A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three date/time systems above.

Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.
Other Applications
Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word usually was employed above.

SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).
Time Series
A common use of dates and datetimes is in time series. The ts class represents regular time series (i.e. equally spaced points) and is mostly used if the frequency is monthly, quarterly, or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its), and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or time/date package to represent datetimes.
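As a brief sketch of the difference (the series values and dates below are made-up data, and zoo must be installed from CRAN):

```r
## A regular monthly series uses ts; an irregularly spaced
## series can use zoo with a Date-class index.
library(zoo)
reg <- ts(rnorm(24), start = c(2002, 1), frequency = 12)
irr <- zoo(rnorm(3),
           as.Date(c("2004-01-05", "2004-01-07", "2004-02-01")))
time(irr)   ## the Date index is retained by zoo
```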
Input
By default read.table will read in numeric data, such as 20040115, as numbers and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹
## date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y%m%d")

## first two cols in format mm/dd/yy hh:mm:ss
## Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
    as.is = 1:2)
chron(df$date, df$time)
Avoiding Errors
The easiest way to avoid errors is to use the leastcomplex class consistent with your application asdiscussed in the Introduction
With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:
## default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))
## if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)
## function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)
## if TRUE then year not displayed
options(chron.simplify = FALSE)
For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,
¹This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.
orig <- chron("01/01/60")
x <- 0:9  ## days since 01/01/60
## chron dates with default origin
orig + x
Regarding POSIX classes, the user should be aware of the following:
• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year, and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

  dp <- seq(Sys.time(), len = 10, by = "day")
  plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz= and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.
• OS. Some of the date calculations done by POSIX functions are passed off to the operating system and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.
• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.
• conversion. Conversion between different date-time classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
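A minimal sketch illustrating two of the checks above (the date is chosen arbitrarily, late in the evening so that the local date and the GMT date can differ):

```r
## Sketch: checking tz= handling and converting via character
dp <- as.POSIXct("2004-06-15 23:00:00")
format(dp, tz = "")     # current time zone
format(dp, tz = "GMT")  # should differ in a non-GMT zone;
                        # if identical, tz= is being ignored
as.Date(format(dp))     # via character: the displayed local date
as.Date(dp)             # direct conversion is relative to GMT and
                        # may differ by one day
```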
Comparison Table
Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.
Bibliography
Alzola, C. F. and Harrell, F. E. (2002). An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf

James, D. A. and Pregibon, D. (1993). Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf

Ripley, B. D. and Hornik, K. (2001). Date-time classes. R News, 1(2):8-11. URL http://CRAN.R-project.org/doc/Rnews/
Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de
now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class="Date")
  chron:   chron(0)
  POSIXct: structure(0, class=c("POSIXt","POSIXct"))

x days since origin (GMT)
  Date:    structure(floor(x), class="Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class=c("POSIXt","POSIXct"))

days difference, in days
  Date:    dt1-dt2
  chron:   dc1-dc2
  POSIXct: difftime(dp1, dp2, unit="day") [1]

time zone difference
  POSIXct: dp-as.POSIXct(format(dp, tz="GMT"))

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt+1
  chron:   dc+1
  POSIXct: seq(dp0, length=2, by="DSTday")[2] [2]

previous day
  Date:    dt-1
  chron:   dc-1
  POSIXct: seq(dp0, length=2, by="-1 DSTday")[2] [3]

x days since date
  Date:    dt0+floor(x)
  chron:   dc0+x
  POSIXct: seq(dp0, length=2, by=paste(x, "DSTday"))[2] [4]

sequence
  Date:    dts <- seq(dt0, length=10, by="day")
  chron:   dcs <- seq(dc0, length=10)
  POSIXct: dps <- seq(dp0, length=10, by="DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz="GMT"), rnorm(10)) [5]

every 2nd week
  Date:    seq(dt0, length=3, by="2 week")
  chron:   dc0+seq(0, length=3, by=14)
  POSIXct: seq(dp0, length=3, by="2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc)-month.day.year(dc)$day+1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun=0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc)-3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean for each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output: yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output: mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output: Sat Jul 1, 1970
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input: "1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format="y-m-d")
  POSIXct: as.POSIXct(z)

Input: "10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz="")
  to Date:
    from chron:   as.Date(dates(dc))
    from POSIXct: as.Date(format(dp))
  to chron:
    from Date:    chron(unclass(dt))
    from POSIXct: chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]
  to POSIXct:
    from Date:    as.POSIXct(format(dt))
    from chron:   as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz="GMT")
  to Date:
    from chron:   as.Date(dates(dc))
    from POSIXct: as.Date(dp)
  to chron:
    from Date:    chron(unclass(dt))
    from POSIXct: as.chron(dp)
  to POSIXct:
    from Date:    as.POSIXct(format(dt), tz="GMT")
    from chron:   as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for % codes.

[1] Can be reduced to dp1-dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp+24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp-24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp+24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz="GMT")).

Table 1: Comparison Table
Programmers' Niche: A simple class in S3 and S4
by Thomas Lumley
ROC curves
Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±Inf and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.
drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
    function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
    function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type="l")
}
It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.
drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  plot(mspec, sens, type="l")
}
One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
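To see the function in action, here is a quick sketch with simulated data (my addition, not from the article); the cases are shifted upwards so the curve lies above the diagonal:

```r
## Simulated test: 100 controls (D=0) and 100 cases (D=1)
set.seed(1)
T <- c(rnorm(100), rnorm(100, mean = 1))
D <- rep(0:1, each = 100)
drawROC(T, D)  # assumes drawROC() as defined above
```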
Classes and methods
Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec, test=TT,
    call=sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}
The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
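As a sketch of dispatch in action (using simulated T and D, an assumption not in the original text):

```r
## Auto-printing calls print(), which dispatches to print.ROC
set.seed(1)
T <- c(rnorm(100), rnorm(100, mean = 1))
D <- rep(0:1, each = 100)
r <- ROC(T, D)
r            # shows "ROC curve: " followed by the call
unclass(r)   # the print.default view: the raw list components
```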
Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.
plot.ROC <- function(x, type="b", null.line=TRUE,
    xlab="1-Specificity", ylab="Sensitivity",
    main=NULL, ...) {
  par(pty="s")
  plot(x$mspec, x$sens, type=type,
    xlab=xlab, ylab=ylab, ...)
  if(null.line)
    abline(0, 1, lty=3)
  if(is.null(main))
    main <- x$call
  title(main=main)
}
A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:
lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}
A method for identify allows the cutpoint to be found for any point on the curve:
identify.ROC <- function(x, labels=NULL, ..., digits=1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels=labels, ...)
}
An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
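One possible solution to that exercise (my sketch, not the article's) applies the trapezoidal rule to the stored points:

```r
## Trapezoidal-rule AUC for an object of class "ROC";
## relies only on the sens and mspec components defined above
AUC <- function(x) {
  with(x, sum(diff(mspec) * (head(sens, -1) + tail(sens, -1))/2))
}
```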
Generalising the class
The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.
ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
    test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}
The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.
ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
      c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type="response"))
  disease <- disease*weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
    test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}
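A hypothetical use of the generic with a fitted model (the data and variable names here are illustrative, not from the article):

```r
## Sketch: ROC curve from a logistic regression
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(x))
fit <- glm(y ~ x, family = binomial)
r <- ROC(fit)   # dispatches to ROC.glm
plot(r)         # uses plot.ROC
```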
S4 classes and methods
In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:
setClass("ROC",
  representation(sens="numeric", mspec="numeric",
    test="numeric", call="call"),
  validity=function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })
The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
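A sketch of the validity check in action (the slot values are made up):

```r
## new() runs the validity function; mismatched slot lengths
## should make object creation fail (assumes the setClass above)
try(new("ROC", sens = c(.2, .5, 1), mspec = c(.1, 1),
        test = c(3, 2, 1), call = quote(ROC(T, D))))
```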
In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  new("ROC", sens=sens, mspec=mspec,
    test=TT, call=sys.call())
}
Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:
setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })
The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC","missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.
setMethod("plot", c("ROC","missing"),
  function(x, y, type="b", null.line=TRUE,
      xlab="1-Specificity",
      ylab="Sensitivity",
      main=NULL, ...) {
    par(pty="s")
    plot(x@mspec, x@sens, type=type,
      xlab=xlab, ylab=ylab, ...)
    if(null.line)
      abline(0, 1, lty=3)
    if(is.null(main))
      main <- x@call
    title(main=main)
  })
Creating a method for lines appears to work the same way:
setMethod("lines", "ROC",
  function(x, ...) {
    lines(x@mspec, x@sens, ...)
  })
In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.
setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T, D) {
    if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type="response"))
    disease <- disease*weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    new("ROC", sens=sens, mspec=mspec,
      test=TT, call=sys.call())
  })
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.
setGeneric("identify", signature=c("x"))

setMethod("identify", "ROC",
  function(x, labels=NULL, ..., digits=1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens, labels=labels, ...)
  })
Discussion
Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R

by the R Core Team

New features in 1.9.1
• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc look for variables in their formulae.
User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.
Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook; see ?setHook.
• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.
New features in 1.9.0
• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values; thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), paralleling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.
• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for S compatibility.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments and split="" is optimized.

• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends(), getDepList() and installFoundDepends(). These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

  - Renamed push/pop.viewport() to push/popViewport().

  - Added upViewport(), downViewport(), and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

  - Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.

  - Added vpList(), vpStack(), vpTree(), and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

  - Added vpPath() to allow specification of viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use.

    NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

  - Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

  - Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

      pushViewport(viewport(w=.5, h=.5, name="A"))
      grid.rect()
      pushViewport(viewport(w=.5, h=.5, name="B"))
      grid.rect(gp=gpar(col="grey"))
      upViewport(2)
      grid.rect(vp="A", gp=gpar(fill="red"))
      grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))
ndash Added enginedisplaylist() functionThis allows the user to tell grid NOT touse the graphics engine display list andto handle ALL redraws using its own dis-play list (including redraws after deviceresizes and copies)This provides a way to avoid some of theproblems with resizing a device when youhave used gridconvert() or the grid-Base package or even base functions suchas legend()There is a document discussing the useof display lists in grid on the grid website httpwwwstataucklandacnz~paulgridgridhtml
ndash Changed the implementation of grob ob-jects They are no longer implementedas external references They are nowregular R objects which copy-by-valueThis means that they can be savedloadedlike normal R objects In order to retainsome existing grob behaviour the follow-ing changes were necessary
grobs all now have a name slot Thegrob name is used to uniquely iden-tify a drawn grob (ie a grob on thedisplay list)
gridedit() and gridpack() nowtake a grob name as the firs argumentinstead of a grob (Actually they takea gPath see below)
the grobwidth and grobheightunits take either a grob OR a grobname (actually a gPath see below)Only in the latter case will the unitbe updated if the grob pointed to ismodified
In addition the following features arenow possible with grobs
* grobs now save()/load() like any normal R object.
* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).
* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a childrenvp slot, which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.
* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.
* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.
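A small sketch of the copy-by-value grobs and name-based editing just described; the calls follow the grid.*() / *Grob() naming convention named in the item and should be treated as illustrative:

```r
library(grid)

# Build a grob without drawing anything, via the *Grob() counterpart.
r <- rectGrob(width = 0.5, height = 0.5, name = "myrect",
              gp = gpar(fill = "grey"))

# Grobs are now ordinary R objects, so they copy by value:
r2 <- editGrob(r, gp = gpar(fill = "red"))  # r itself is unchanged

# Draw the original, then edit the *drawn* grob by name.
grid.draw(r)
grid.edit("myrect", gp = gpar(fill = "blue"))
```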
– Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth", and "grobheight" unit objects, respectively).
– Allowed viewports to turn off clipping altogether. Possible settings for the viewport clip arg are now:

    "on": clip to the viewport (was TRUE)
    "inherit": clip to whatever the parent says (was FALSE)
    "off": turn off clipping

  Still accepts logical values (and NA maps to "off").
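A minimal sketch of the difference between the "off" and "on" settings (viewport sizes are arbitrary):

```r
library(grid)

# A viewport that never clips its output, even though it is small:
pushViewport(viewport(width = 0.3, height = 0.3, clip = "off"))
grid.circle(r = 1)   # drawn well outside the viewport boundary
popViewport()

# "on" clips to the viewport; "inherit" defers to the parent's setting.
pushViewport(viewport(width = 0.3, height = 0.3, clip = "on"))
grid.circle(r = 1)   # clipped to the viewport
popViewport()
```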
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.
• undoc() in package 'tools' has a new default of 'use.values = NULL', which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.
• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.
• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.
• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.
• Vignette dependencies can now be checked and obtained via vignetteDepends.
• Option "repositories" to list URLs for package repositories added.
• package.description() has been replaced by packageDescription().
• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.
• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector), rather than assuming these are passed in.
• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.
• hsv2rgb and rgb2hsv are newly in the C API.
• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.
• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.
• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp(), and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.
• print.coefmat() is defunct, replaced by printCoefmat().
• codes() and codes<-() are defunct.
• anovalist.lm (replaced in 1.2.0) is now defunct.
• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.
• print.atomic() is defunct.
• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).
• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.
• La.eigen() is deprecated; now eigen() uses LAPACK by default.
• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(, deriv=2) and psigamma(, deriv=3).
• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.
• package.contents() and package.description() have been deprecated.
• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.
  The included PCRE sources have been updated to version 4.5, and PCRE >= 4.0 is now required if --with-pcre is used.
  The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.
• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.
• --with-blas-goto= to use K. Goto's optimized BLAS will now work.
The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN
by Kurt Hornik

New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked, or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.
BsMD Bayes screening and model discrimination
follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks, GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation, and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows, in particular, parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences; title: "The use of the language interface of R: two examples for modelling water flux and solute transport". By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check
given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S functions (EVIS) by Alexander McNeil; R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others; see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis; fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, and some statistics in linkage analysis and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. The package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc R functions for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are haplo.em, haplo.glm, and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST, and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-versus-within-chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal and external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so-called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted, and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972-974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function, and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS 2002, 99, pg. 15869, and J. Comput. Chem. 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.

Other changes

• Packages mclust1998, netCDF, and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News
by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)
• Emanuele De Rinaldis (Italy)
• Norm Phillips (Canada)
New supporting institutions
• Breast Center at Baylor College of Medicine, Houston, Texas, USA
• Dana-Farber Cancer Institute, Boston, USA
• Department of Biostatistics, Vanderbilt University School of Medicine, USA
• Department of Statistics & Actuarial Science, University of Iowa, USA
• Division of Biostatistics, University of California, Berkeley, USA
• Norwegian Institute of Marine Research, Bergen, Norway
• TERRA Lab, University of Regina - Department of Geography, Canada
• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA
Editorial Board: Douglas Bates and Paul Murrell

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges
Email of editors and editorial board: firstname.lastname@R-project.org
R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).
R Project Homepage: http://www.R-project.org/
This newsletter is available online at http://CRAN.R-project.org/doc/Rnews/
this transition essentially transparent for the application software. I also started using ESS as my primary working environment for R and enjoyed its higher level of support for R compared to the editor that I had used under Windows, which simply provided syntax highlighting.

My transformation to Linux occurred gradually during the later Red Hat (RH) 8.0 beta releases, as I became more comfortable with Linux and with the process of replacing Windows functionality that had become "second nature" over my years of MS operating system use (dating back to the original IBM PC/DOS days of 1981). For example, I needed to replace my practice in R of generating WMF files to be incorporated into Word and Powerpoint documents. At first I switched to generating EPS files for incorporation into OpenOffice.org's (OO.org) Writer and Impress documents, but now I primarily use LaTeX and the "seminar" slide creation package, frequently in conjunction with Sweave. Rather than using Powerpoint or Impress for presentations, I create landscape oriented PDF files and display them with Adobe Acrobat Reader, which supports a full screen presentation mode.

I also use OO.org's Writer for most documents that I need to exchange with clients. For the final version, I export a PDF file from Writer, knowing that clients will have access to Adobe's Reader. Because OO.org's Writer supports MS Word input formats, it provides for interchange with clients when jointly editable documents (or tracking of changes) are required. When clients send me Excel worksheets, I use OO.org's Calc, which supports Excel data formats (and, in fact, exports much better CSV files from these worksheets than does Excel). I have replaced Outlook with Ximian's Evolution for my e-mail, contact list, and calendar management, including the coordination of electronic meeting requests with business partners and clients and, of course, synching my Sony Clie (which Sony recently announced will soon be deprecated in favor of smarter cell phones).

With respect to Linux, I am now using Fedora Core (FC) 2, having moved from RH 8.0 to RH 9 to FC 1. I have, in time, replaced all Windows-based applications with Linux-based alternatives, with one exception: some accounting software that I need in order to exchange electronic files with my accountant. In time, I suspect that this final hurdle will be removed and, with it, the need to use Windows at all. I should make it clear that this is not a goal for philosophic reasons, but for business reasons. I have found that the stability and functionality of the Linux environment has enabled me to engage in a variety of technical and business process related activities that would either be problematic or prohibitively expensive under Windows. This enables me to realize additional productivity gains that further reduce operating costs.
Giving Back
In closing, I would like to make a few comments about the R Community.
I have felt fortunate to be a useR, and I also have felt strongly that I should give back to the Community, recognizing that, ultimately, through MedAnalytics and the use of R, I am making a living using software that has been made available through voluntary efforts and which I did not need to purchase.

To that end, over the past three years, I have committed to contributing and giving back to the Community in several ways. First and foremost, by responding to queries on the r-help list and, later, on the r-devel list. There has been, and will continue to be, a continuous stream of new useRs who need basic assistance. As current useRs progress in experience, each of us can contribute to the Community by assisting new useRs. In this way a cycle develops in which the students evolve into teachers.

Secondly, in keeping with the philosophy that "useRs become developers", as I have found the need for particular features and have developed specific solutions, I have made them available via CRAN or the lists. Two functions in particular have been available in the "gregmisc" package since mid-2002, thanks to Greg Warnes. These are the barplot2() and CrossTable() functions. As time permits, I plan to enhance, most notably, the CrossTable() function with a variety of non-parametric correlation measures, and will make those available when ready.

Thirdly, in a different area, I have committed to financially contributing back to the Community by supporting the R Foundation and the recent useR 2004 meeting. This is distinct from volunteering my time and recognizes that, ultimately, even a voluntary Community such as this one has some operational costs to cover. When, in 2000-2001, I was contemplating the purchase of S-PLUS and Stata, these would have cost approximately (US)$2,500 and (US)$1,200, respectively. In effect, I have committed those dollars to supporting the R Foundation at the Benefactor level over the coming years.
Thanks
It has been my pleasure and privilege to be a member of this Community over the past three years. It has been nothing short of phenomenal to experience the growth of this Community during that time.

I would like to thank Doug Bates for his kind invitation to contribute this article, the idea for which arose in a discussion we had during the recent useR meeting in Vienna. I would also like to extend my thanks to R Core and to the R Community at large
for their continued drive to create and support this wonderful vehicle for international collaboration.
Marc Schwartz
MedAnalytics, Inc., Minneapolis, Minnesota, USA
MSchwartz@MedAnalytics.com
The ade4 package - I: One-table methods
by Daniel Chessel, Anne B. Dufour and Jean Thioulouse
Introduction
This paper is a short summary of the main classesdefined in the ade4 package for one table analysismethods (eg principal component analysis) Otherpapers will detail the classes defined in ade4 for two-tables coupling methods (such as canonical corre-spondence analysis redundancy analysis and co-inertia analysis) for methods dealing with K-tablesanalysis (ie three-ways tables) and for graphicalmethods
This package is a complete rewrite of the ADE4 software (Thioulouse et al. (1997), http://pbil.univ-lyon1.fr/ADE-4/) for the R environment. It contains Data Analysis functions to analyse Ecological and Environmental data in the framework of Euclidean Exploratory methods, hence the name ade4 (i.e., 4 is not a version number, but means that there are four E in the acronym).
The ade4 package is available on CRAN, but it can also be used directly online, thanks to the Rweb system (http://pbil.univ-lyon1.fr/Rweb/). This possibility is being used to provide multivariate analysis services in the field of bioinformatics, particularly for sequence and genome structure analysis at the PBIL (http://pbil.univ-lyon1.fr/). An example of these services is the automated analysis of the codon usage of a set of DNA sequences by correspondence analysis (http://pbil.univ-lyon1.fr/mva/coa.php).
The duality diagram class
The basic tool in ade4 is the duality diagram (Escoufier, 1987). A duality diagram is simply a list that contains a triplet (X, Q, D):
- X is a table with n rows and p columns, considered as p points in Rn (column vectors) or n points in Rp (row vectors).
- Q is a p × p diagonal matrix containing the weights of the p columns of X, used as a scalar product in Rp (Q is stored as a vector of length p).
- D is an n × n diagonal matrix containing the weights of the n rows of X, used as a scalar product in Rn (D is stored as a vector of length n).
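The triplet above can be sketched with base R alone (this is an illustrative re-derivation, not the ade4 internals): the analysis diagonalises the p × p matrix X'DXQ, and for a normed PCA the triplet is (X, Ip, (1/n)In).

```r
# Base-R sketch of the duality diagram statistic for a normed PCA.
X0 <- scale(USArrests)                  # centred, unit variance (n-1 denominator)
n  <- nrow(X0)
X  <- X0 * sqrt(n / (n - 1))            # rescale so the 1/n row weights give cor()
d  <- rep(1 / n, n)                     # diagonal of D, stored as a vector
q  <- rep(1, ncol(X))                   # diagonal of Q = Ip

M   <- crossprod(X * d, X) %*% diag(q)  # X' D X Q: here the correlation matrix
eig <- eigen(M, symmetric = TRUE)
round(eig$values, 4)                    # 2.4802 0.9898 0.3566 0.1734
```

The eigenvalues reproduce those printed by dudi.pca for the USArrests example below.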
For example, if X is a table containing normalized quantitative variables, if Q is the identity matrix Ip, and if D is equal to (1/n)In, the triplet corresponds to a principal component analysis on the correlation matrix (normed PCA). Each basic method corresponds to a particular triplet (see Table 1), but more complex methods can also be represented by their duality diagram.
Functions   Analyses                       Notes
dudi.pca    principal component            1
dudi.coa    correspondence                 2
dudi.acm    multiple correspondence        3
dudi.fca    fuzzy correspondence           4
dudi.mix    analysis of a mixture of
            numerics and factors           5
dudi.nsc    non-symmetric correspondence   6
dudi.dec    decentered correspondence      7

Table 1: The dudi functions. 1: principal component analysis, same as prcomp/princomp. 2: correspondence analysis (Greenacre, 1984). 3: multiple correspondence analysis (Tenenhaus and Young, 1985). 4: fuzzy correspondence analysis (Chevenet et al., 1994). 5: analysis of a mixture of numeric variables and factors (Hill and Smith, 1976; Kiers, 1994). 6: non-symmetric correspondence analysis (Kroonenberg and Lombardo, 1999). 7: decentered correspondence analysis (Dolédec et al., 1995).
The singular value decomposition of a triplet gives principal axes, principal components, and row and column coordinates, which are added to the triplet for later use.
We can use, for example, a well-known dataset from the base package:
> data(USArrests)
> pca1 <- dudi.pca(USArrests, scannf = FALSE, nf = 3)
scannf = FALSE means that the number of principal components used to compute row and column coordinates should not be asked interactively of the user, but taken as the value of the argument nf (by default, nf = 2). Other parameters allow one to choose between centered, normed or raw PCA (centered and normed is the default), and to set arbitrary row and column weights. The pca1 object is a duality diagram, i.e., a list made of several vectors and dataframes:
> pca1
Duality diagramm
class: pca dudi
$call: dudi.pca(df = USArrests, scannf = FALSE, nf = 3)
$nf: 3 axis-components saved
$rank: 4
eigen values: 2.48 0.9898 0.3566 0.1734
  vector length mode    content
1 $cw    4      numeric column weights
2 $lw    50     numeric row weights
3 $eig   4      numeric eigen values

  data.frame nrow ncol content
1 $tab       50   4    modified array
2 $li        50   3    row coordinates
3 $l1        50   3    row normed scores
4 $co        4    3    column coordinates
5 $c1        4    3    column normed scores
other elements: cent norm
pca1$lw and pca1$cw are the row and column weights that define the duality diagram, together with the data table (pca1$tab). pca1$eig contains the eigenvalues. The row and column coordinates are stored in pca1$li and pca1$co. The variance of these coordinates is equal to the corresponding eigenvalue, and unit variance coordinates are stored in pca1$l1 and pca1$c1 (this is useful to draw biplots).
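The eigenvalue property stated above can be checked with base R alone (a hedged re-derivation on the same data, not the package code): each column of the row coordinates has weighted variance equal to the corresponding eigenvalue.

```r
# Check: (1/n) * sum of squared row coordinates = eigenvalue, per axis.
n  <- nrow(USArrests)
X  <- scale(USArrests) * sqrt(n / (n - 1))   # normed table with 1/n row weights
e  <- eigen(crossprod(X) / n, symmetric = TRUE)
li <- X %*% e$vectors                        # row coordinates (as in $li, up to sign)
colSums(li^2) / n                            # equals e$values, axis by axis
```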
The general optimization theorems of data analysis take particular meanings for each type of analysis, and graphical functions are proposed to draw the canonical graphs, i.e., the graphical expression corresponding to the mathematical property of the object. For example, the normed PCA of a quantitative variable table gives a score that maximizes the sum of squared correlations with variables. The PCA canonical graph is therefore a graph showing how the sum of squared correlations is maximized for the variables of the data set. On the USArrests example, we obtain the following graphs:
> score(pca1)
> s.corcircle(pca1$co)
Figure 1: One-dimensional canonical graph for a normed PCA. Variables are displayed as a function of row scores, to get a picture of the maximization of the sum of squared correlations.
Figure 2: Two-dimensional canonical graph for a normed PCA (correlation circle): the direction and length of arrows show the quality of the correlation between variables, and between variables and principal components.
The scatter function draws the biplot of the PCA (i.e., a graph with both rows and columns superimposed):
> scatter(pca1)
Figure 3: The PCA biplot. Variables are symbolized by arrows, superimposed on the individuals display. The scale of the graph is given by a grid, whose size is given in the upper right corner; here, the side of the grid squares is equal to one. The eigenvalues bar chart is drawn in the upper left corner, with the two black bars corresponding to the two axes used to draw the biplot. Grey bars correspond to axes that were kept in the analysis, but not used to draw the graph.
Separate factor maps can be drawn with the s.corcircle (see Figure 2) and s.label functions:
> s.label(pca1$li)
Figure 4: Individuals factor map in a normed PCA.
Distance matrices
A duality diagram can also come from a distance matrix, if this matrix is Euclidean (i.e., if the distances in the matrix are the distances between some points in a Euclidean space). The ade4 package contains functions to compute dissimilarity matrices (dist.binary for binary data, and dist.prop for frequency data), test whether they are Euclidean (Gower and Legendre, 1986), and make them Euclidean (quasieuclid, lingoes, Lingoes (1971); cailliez, Cailliez (1983)). These functions are useful to ecologists who use the works of Legendre and Anderson (1999) and Legendre and Legendre (1998).
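The Euclidean property can be tested in a few lines of base R (a hedged sketch in the spirit of these checks, not the ade4 code): by Gower's result, a distance matrix is Euclidean exactly when its double-centred squared-distance matrix has no negative eigenvalue.

```r
# Gower test: d is Euclidean iff B = -(1/2) J d^2 J is positive
# semi-definite, where J = I - 11'/n is the centring operator.
is_euclidean <- function(d) {
  d <- as.matrix(d)
  n <- nrow(d)
  J <- diag(n) - matrix(1 / n, n, n)           # centring operator
  B <- -0.5 * J %*% (d^2) %*% J
  all(eigen(B, symmetric = TRUE, only.values = TRUE)$values > -1e-8)
}
is_euclidean(dist(USArrests[1:10, ]))          # TRUE: dist() is Euclidean
```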
The Yanomama data set (Manly, 1991) contains three distance matrices between 19 villages of Yanomama Indians. The dudi.pco function can be used to compute a principal coordinates analysis (PCO, Gower (1966)), that gives a Euclidean representation of the 19 villages. This Euclidean representation allows one to compare the geographical, genetic and anthropometric distances.
> data(yanomama)
> gen <- quasieuclid(as.dist(yanomama$gen))
> geo <- quasieuclid(as.dist(yanomama$geo))
> ant <- quasieuclid(as.dist(yanomama$ant))
> geo1 <- dudi.pco(geo, scann = FALSE, nf = 3)
> gen1 <- dudi.pco(gen, scann = FALSE, nf = 3)
> ant1 <- dudi.pco(ant, scann = FALSE, nf = 3)
> par(mfrow = c(2, 2))
> scatter(geo1)
> scatter(gen1)
> scatter(ant1, posi = "bottom")
Figure 5: Comparison of the PCO analysis of the three distance matrices of the Yanomama data set: geographic, genetic and anthropometric distances.
Taking into account groups of individuals
In sites × species tables, rows correspond to sites, columns correspond to species, and the values are the number of individuals of species j found at site i. These tables can have many columns, and cannot be used in a discriminant analysis. In this case, between-class analyses (between function) are a better alternative, and they can be used with any duality diagram. The between-class analysis of triplet (X, Q, D) for a given factor f is the analysis of the triplet (G, Q, Dw), where G is the table of the means of table X for the groups defined by f, and Dw is the diagonal matrix of group weights. For example, a between-class correspondence analysis (BCA) is very simply obtained after a correspondence analysis (CA):
> data(meaudret)
> coa1 <- dudi.coa(meaudret$fau, scannf = FALSE)
> bet1 <- between(coa1, meaudret$plan$sta,
+     scannf = FALSE)
> plot(bet1)
The meaudret$fau dataframe is an ecological table with 24 rows corresponding to six sampling sites along a small French stream (the Meaudret). These six sampling sites were sampled four times (spring, summer, winter and autumn), hence the 24 rows. The 13 columns correspond to 13 ephemeroptera species. The CA of this data table is done with the dudi.coa function, giving the coa1 duality diagram. The corresponding between-class analysis is done with the between function, considering the sites as classes (meaudret$plan$sta is a factor defining the classes). Therefore, this is a between-sites analysis, whose aim is to discriminate the sites, given the distribution of ephemeroptera species. This gives the bet1 duality diagram, and Figure 6 shows the graph obtained by plotting this object.
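The first step of a between-class analysis, building the table G of class means and the class weights Dw, can be sketched in base R (illustrative random data here, not the meaudret table):

```r
# Hedged sketch: per-class weighted means G and class weights Dw
# for a factor f, the ingredients of the triplet (G, Q, Dw).
set.seed(1)
X <- matrix(rnorm(20 * 3), 20, 3)   # illustrative 20 x 3 data table
f <- gl(5, 4)                       # illustrative factor: 5 classes of 4 rows
d <- rep(1 / 20, 20)                # uniform row weights (diagonal of D)

dw <- tapply(d, f, sum)             # class weights: diagonal of Dw
G  <- apply(X, 2, function(x) tapply(d * x, f, sum) / dw)
dim(G)                              # 5 classes x 3 variables
```

With uniform row weights, G reduces to the table of plain class means.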
Figure 6: BCA plot. This is a composed plot, made of: 1- the species canonical weights (top left), 2- the species scores (middle left), 3- the eigenvalues bar chart (bottom left), 4- the plot of plain CA axes projected into BCA (bottom center), 5- the gravity centers of classes (bottom right), 6- the projection of the rows with ellipses and gravity centers of classes (main graph).
Like between-class analyses, linear discriminant analysis (discrimin function) can be extended to any duality diagram (G, (XtDX)−, Dw), where (XtDX)− is a generalized inverse. This gives, for example, a correspondence discriminant analysis (Perrière et al. (1996), Perrière and Thioulouse (2002)), that can be computed with the discrimin.coa function.
Opposite to between-class analyses are within-class analyses, corresponding to diagrams (X − Y Dw−1 Yt D X, Q, D) (within functions). These analyses extend to any type of variables the Multiple Group Principal Component Analysis (MGPCA, Thorpe (1983a), Thorpe (1983b), Thorpe and Leamy (1983)). Furthermore, the within.coa function introduces the double within-class correspondence analysis (also named internal correspondence analysis, Cazes et al. (1988)).
Permutation tests
Permutation tests (also called Monte-Carlo tests, or randomization tests) can be used to assess the statistical significance of between-class analyses. Many permutation tests are available in the ade4 package, for example mantel.randtest, procuste.randtest, randtest.between, randtest.coinertia, RV.rtest and randtest.discrimin, and several of these tests are available both in R (mantel.rtest) and in C (mantel.randtest). The R version allows one to see how the computations are performed, and to write other tests easily, while the C version is needed for performance reasons.
The statistical significance of the BCA can be evaluated with the randtest.between function. By default, 999 permutations are simulated, and the resulting object (test1) can be plotted (Figure 7). The p-value is highly significant, which confirms the existence of differences between sampling sites. The plot shows that the observed value is very far to the right of the histogram of simulated values.
> test1 <- randtest.between(bet1)
> test1
Monte-Carlo test
Observation: 0.4292
Call: randtest.between(xtest = bet1)
Based on 999 replicates
Simulated p-value: 0.001
> plot(test1, main = "Between class inertia")
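The same logic can be sketched with base R on illustrative data (a hedged re-implementation of the idea, not the ade4 code): permute the class labels, recompute the between-class inertia ratio each time, and locate the observed value in the simulated distribution.

```r
# Randomization test of a between-class inertia ratio, base R only.
set.seed(42)
x <- c(rnorm(10, mean = 0), rnorm(10, mean = 2))  # two separated groups
f <- gl(2, 10)

ratio <- function(x, f) {                 # between / total sum of squares
  m <- tapply(x, f, mean)
  sum(table(f) * (m - mean(x))^2) / sum((x - mean(x))^2)
}
obs <- ratio(x, f)
sim <- replicate(999, ratio(x, sample(f)))
(sum(sim >= obs) + 1) / (999 + 1)         # permutation p-value: small here
```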
Figure 7: Histogram of the 999 simulated values of the randomization test of the bet1 BCA. The observed value is given by the vertical line, at the right of the histogram.
Conclusion
We have described only the most basic functions of the ade4 package, considering only the simplest one-table data analysis methods. Many other dudi methods are available in ade4, for example multiple correspondence analysis (dudi.acm), fuzzy correspondence analysis (dudi.fca), analysis of a mixture of numeric variables and factors (dudi.mix), non-symmetric correspondence analysis (dudi.nsc), and decentered correspondence analysis (dudi.dec).
We are preparing a second paper, dealing with two-table coupling methods, among which canonical correspondence analysis and redundancy analysis are the most frequently used in ecology (Legendre and Legendre, 1998). The ade4 package proposes an alternative to these methods, based on the co-inertia criterion (Dray et al., 2003).
The third category of data analysis methods available in ade4 are K-tables analysis methods, that try to extract the stable part in a series of tables. These methods come from the STATIS strategy, Lavit et al. (1994) (statis and pta functions), or from the multiple coinertia strategy (mcoa function). The mfa and foucart functions perform two variants of K-tables analysis, and the STATICO method (function ktab.match2ktabs, Thioulouse et al. (2004)) allows one to extract the stable part of species-environment relationships, in time or in space.
Methods taking into account spatial constraints (multispati function) and phylogenetic constraints (phylog function) are under development.
Bibliography
Cailliez, F. (1983). The analytical solution of the additive constant problem. Psychometrika, 48:305–310.

Cazes, P., Chessel, D. and Dolédec, S. (1988). L'analyse des correspondances internes d'un tableau partitionné: son usage en hydrobiologie. Revue de Statistique Appliquée, 36:39–54.

Chevenet, F., Dolédec, S. and Chessel, D. (1994). A fuzzy coding approach for the analysis of long-term ecological data. Freshwater Biology, 31:295–309.

Dolédec, S., Chessel, D. and Olivier, J. (1995). L'analyse des correspondances décentrée: application aux peuplements ichtyologiques du Haut-Rhône. Bulletin Français de la Pêche et de la Pisciculture, 336:29–40.

Dray, S., Chessel, D. and Thioulouse, J. (2003). Co-inertia analysis and the linking of ecological tables. Ecology, 84:3078–3089.

Escoufier, Y. (1987). The duality diagramm: a means of better practical applications. In Legendre, P. and Legendre, L., editors, Development in numerical ecology, pages 139–156. NATO Advanced Institute, Serie G. Springer Verlag, Berlin.

Gower, J. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53:325–338.

Gower, J. and Legendre, P. (1986). Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 3:5–48.

Greenacre, M. (1984). Theory and applications of correspondence analysis. Academic Press, London.

Hill, M. and Smith, A. (1976). Principal component analysis of taxonomic data with multi-state discrete characters. Taxon, 25:249–255.
Kiers, H. (1994). Simple structure in component analysis techniques for mixtures of qualitative and quantitative variables. Psychometrika, 56:197–212.

Kroonenberg, P. and Lombardo, R. (1999). Non-symmetric correspondence analysis: a tool for analysing contingency tables with a dependence structure. Multivariate Behavioral Research, 34:367–396.

Lavit, C., Escoufier, Y., Sabatier, R. and Traissac, P. (1994). The ACT (STATIS method). Computational Statistics and Data Analysis, 18:97–119.

Legendre, P. and Anderson, M. (1999). Distance-based redundancy analysis: testing multispecies responses in multifactorial ecological experiments. Ecological Monographs, 69:1–24.

Legendre, P. and Legendre, L. (1998). Numerical ecology. Elsevier Science BV, Amsterdam, 2nd english edition.

Lingoes, J. (1971). Some boundary conditions for a monotone analysis of symmetric matrices. Psychometrika, 36:195–203.

Manly, B. F. J. (1991). Randomization and Monte Carlo methods in biology. Chapman and Hall, London.

Perrière, G., Combet, C., Penel, S., Blanchet, C., Thioulouse, J., Geourjon, C., Grassot, J., Charavay, C., Gouy, M., Duret, L. and Deléage, G. (2003). Integrated databanks access and sequence/structure analysis services at the PBIL. Nucleic Acids Research, 31:3393–3399.

Perrière, G., Lobry, J. and Thioulouse, J. (1996). Correspondence discriminant analysis: a multivariate method for comparing classes of protein and nucleic acid sequences. Computer Applications in the Biosciences, 12:519–524.

Perrière, G. and Thioulouse, J. (2002). Use of correspondence discriminant analysis to predict the subcellular location of bacterial proteins. Computer Methods and Programs in Biomedicine, in press.

Tenenhaus, M. and Young, F. (1985). An analysis and synthesis of multiple correspondence analysis, optimal scaling, dual scaling, homogeneity analysis and other methods for quantifying categorical multivariate data. Psychometrika, 50:91–119.

Thioulouse, J., Chessel, D., Dolédec, S. and Olivier, J. (1997). ADE-4: a multivariate analysis and graphical display software. Statistics and Computing, 7:75–83.

Thioulouse, J., Simier, M. and Chessel, D. (2004). Simultaneous analysis of a sequence of paired ecological tables. Ecology, 85:272–283.

Thorpe, R. S. (1983a). A biometric study of the effects of growth on the analysis of geographic variation: tooth number in green geckos (Reptilia: Phelsuma). Journal of Zoology, 201:13–26.

Thorpe, R. S. (1983b). A review of the numerical methods for recognising and analysing racial differentiation. In Felsenstein, J., editor, Numerical Taxonomy, Series G, pages 404–423. Springer-Verlag, New York.

Thorpe, R. S. and Leamy, L. (1983). Morphometric studies in inbred house mice (Mus sp.): multivariate analysis of size and shape. Journal of Zoology, 199:421–432.
qcc: An R package for quality control charting and statistical process control
by Luca Scrucca
Introduction
The qcc package for the R statistical environment allows the user to:
• plot Shewhart quality control charts for continuous, attribute and count data;
• plot Cusum and EWMA charts for continuous data;
• draw operating characteristic curves;
• perform process capability analyses;
• draw Pareto charts and cause-and-effect diagrams.
I started writing the package to provide the students in the class I teach a tool for learning the basic concepts of statistical quality control, at the introductory level of a book such as Montgomery (2000). Due to the intended final users, a free statistical environment was important, and R was a natural candidate.
The initial design, although the code was written from scratch, reflected the set of functions available in S-Plus. Users of that environment will find some similarities, but also differences. In particular, during development I added more tools and re-designed most of the library to meet the requirements I thought were important to satisfy.
Creating a qcc object
A qcc object is the basic building block to start with. It can be created simply by invoking the qcc function, which in its simplest form requires two arguments: a data frame, a matrix or a vector containing the observed data, and a string value specifying the control chart to be computed. Standard Shewhart control charts are available. These are often classified according to the type of quality characteristic that they are supposed to monitor (see Table 1).
Control charts for continuous variables are usually based on several samples, with observations collected at different time points. Each sample or "rationale group" must be provided as a row of a data frame or a matrix. The function qcc.groups can be used to easily group a vector of data values based on a sample indicator. Unequal sample sizes are allowed.
Suppose we have extracted samples for a characteristic of interest from an ongoing production process:
> data(pistonrings)
> attach(pistonrings)
> dim(pistonrings)
[1] 200 3
> pistonrings
    diameter sample trial
1     74.030      1  TRUE
2     74.002      1  TRUE
3     74.019      1  TRUE
4     73.992      1  TRUE
5     74.008      1  TRUE
6     73.995      2  TRUE
7     73.992      2  TRUE
...
199   74.000     40 FALSE
200   74.020     40 FALSE
> diameter <- qcc.groups(diameter, sample)
> dim(diameter)
[1] 40 5
> diameter
     [,1]   [,2]   [,3]   [,4]   [,5]
1  74.030 74.002 74.019 73.992 74.008
2  73.995 73.992 74.001 74.011 74.004
...
40 74.010 74.005 74.029 74.000 74.020
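For equal sample sizes, the reshaping step performed by qcc.groups can be sketched in base R (illustrative data and a plain split/rbind, not the package implementation):

```r
# Reshape a vector of observations into one row per rational subgroup,
# as qcc() expects: a 40 x 5 matrix from 200 measurements.
set.seed(1)
x        <- rnorm(200, mean = 74, sd = 0.01)   # illustrative measurements
subgroup <- rep(1:40, each = 5)                # sample indicator, 40 groups of 5

groups <- do.call(rbind, split(x, subgroup))   # one row per subgroup
dim(groups)                                    # 40 5
```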
Using the first 25 samples as training data, an X chart can be obtained as follows:
> obj <- qcc(diameter[1:25,], type="xbar")
By default a Shewhart chart is drawn (see Figure 1), and an object of class 'qcc' is returned. Summary statistics can be retrieved using the summary method (or simply by printing the object):
> summary(obj)
Call:
qcc(data = diameter[1:25, ], type = "xbar")

xbar chart for diameter[1:25, ]

Summary of group statistics:
  Min. 1st Qu. Median  Mean 3rd Qu.  Max.
 73.99   74.00  74.00 74.00   74.00 74.01

Group sample size: 5
Number of groups: 25
Center of group statistics: 74.00118
Standard deviation: 0.009887547

Control limits:
      LCL      UCL
 73.98791 74.01444
type      Control charts for variables
xbar      X chart: sample means are plotted to control the mean level of a
          continuous process variable.
xbar.one  X chart: sample values from a one-at-a-time data process are
          plotted to control the mean level of a continuous process
          variable.
R         R chart: sample ranges are plotted to control the variability of
          a continuous process variable.
S         S chart: sample standard deviations are plotted to control the
          variability of a continuous process variable.

type      Control charts for attributes
p         p chart: the proportion of nonconforming units is plotted.
          Control limits are based on the binomial distribution.
np        np chart: the number of nonconforming units is plotted. Control
          limits are based on the binomial distribution.
c         c chart: the number of defectives per unit is plotted. This chart
          assumes that defects of the quality attribute are rare; the
          control limits are computed based on the Poisson distribution.
u         u chart: the average number of defectives per unit is plotted.
          The Poisson distribution is used to compute control limits, but
          unlike the c chart this chart does not require a constant number
          of units.

Table 1: Shewhart control charts available in the qcc package.
The number of groups and their sizes are reported in this case, whereas a table is provided in the case of unequal sample sizes. Moreover, the center of group statistics (the overall mean for an X chart) and the within-group standard deviation of the process are returned.
The main goal of a control chart is to monitor a process. If special causes of variation are present, the process is said to be "out of control", and actions are to be taken to find, and possibly eliminate, such causes. A process is declared to be "in control" if all points charted lie randomly within the control limits. These are usually computed at ±3σ from the center. This default can be changed using the argument nsigmas (the so-called "American practice"), or by specifying the confidence level (the so-called "British practice") through the confidence.level argument.
Assuming a Gaussian distribution is appropriate for the statistic charted, the two practices are equivalent, since limits at ±3σ correspond to a two-tails probability of 0.0027 under a standard normal curve.
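The stated equivalence is a one-liner in base R:

```r
# Three-sigma limits match a two-tailed normal probability of 0.0027.
alpha <- 2 * pnorm(-3)       # P(|Z| > 3) for Z ~ N(0, 1)
round(alpha, 4)              # 0.0027
qnorm(1 - alpha / 2)         # recovers nsigmas = 3 from confidence level 0.9973
```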
Instead of using standard calculations to compute the group statistics and the corresponding control limits (see Montgomery (2000), Wetherill and Brown (1991)), the center, the within-group standard deviation and the control limits may be specified directly by the user, through the arguments center, std.dev and limits, respectively.
Figure 1: An X chart for samples data.
Shewhart control charts
A Shewhart chart is automatically plotted when an object of class 'qcc' is created, unless the qcc function is called with the argument plot=FALSE. The method responsible for plotting a control chart can, however, be invoked directly, as follows:
> plot(obj)
giving the plot in Figure 1. This control chart has the center line (horizontal solid line), the upper and lower control limits (dashed lines), and the sample group statistics drawn as points connected with lines. Unless we provide the argument add.stats=FALSE, some summary statistics are shown at the bottom of the plot, together with the number of points beyond control limits and the number of violating runs (a run has length 5 by default).
Once a process is considered to be "in-control", we may use the computed limits for monitoring new data sampled from the same ongoing process. For example,
> obj <- qcc(diameter[1:25,], type="xbar",
+            newdata=diameter[26:40,])
plots the X chart for both calibration data and new data (see Figure 2), but all the statistics and the control limits are solely based on the first 25 samples.
Figure 2: An X chart with calibration data and new data samples.
If only the new data need to be plotted, we may provide the argument chart.all=FALSE in the call to the qcc function, or through the plotting method:
> plot(obj, chart.all=FALSE)
Many other arguments may be provided in the call to the qcc function and to the corresponding plotting method, and we refer the reader to the on-line documentation for details (help(qcc)).
As reported in Table 1, an X chart for one-at-a-time data may be obtained by specifying type="xbar.one". In this case, the data charted are simply the sample values, and the within-group standard deviation is estimated by moving ranges of k (by default 2) points.
Control charts for attributes mainly differ from the previous examples in that we need to provide sample sizes through the size argument (except for the c chart). For example, a p chart can be obtained as follows:
> data(orangejuice)
> attach(orangejuice)
> orangejuice
   sample  D size trial
1       1 12   50  TRUE
2       2 15   50  TRUE
3       3  8   50  TRUE
...
53     53  3   50 FALSE
54     54  5   50 FALSE
> obj2 <- qcc(D[trial], sizes=size[trial], type="p")
where we use the logical indicator variable trial to select the first 30 samples as calibration data.
Operating characteristic function
An operating characteristic (OC) curve provides information about the probability of not detecting a shift in the process. This is usually referred to as the type II error, that is, the probability of erroneously accepting a process as being "in control".
The OC curves can be easily obtained from an object of class 'qcc':
> par(mfrow=c(1,2))
> oc.curves(obj)
> oc.curves(obj2)
The function oc.curves invisibly returns a matrix or a vector of probability values for the type II error. More arguments are available (see help(oc.curves)); in particular, identify=TRUE allows one to interactively identify values on the plot.
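The calculation behind the xbar-chart OC curve can be sketched directly (a hedged derivation under normality, not the package code): with 3-sigma limits and groups of size n, the type II error for a mean shift of delta process standard deviations is the probability that the group mean still falls inside the limits.

```r
# Type II error for an xbar chart with 3-sigma limits and group size n.
beta <- function(delta, n)
  pnorm(3 - delta * sqrt(n)) - pnorm(-3 - delta * sqrt(n))

beta(0, 5)   # 0.9973: no shift, the complement of the 0.0027 false-alarm rate
beta(1, 5)   # chance of missing a 1-sigma shift with n = 5
```

Larger shifts and larger groups both drive beta toward zero, which is what the plotted curves show.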
Figure 3: Operating characteristic curves.
Cusum charts
Cusum charts display how the group summary statistics deviate above or below the process center or target value, relative to the standard errors of the summary statistics. They are useful for detecting small and permanent variation in the mean of the process.
The basic cusum chart implemented in the qcc package is only available for continuous variables at the moment. Assuming a 'qcc' object has been created, we can simply input the following code:
> cusum(obj)
and the resulting cusum chart is shown in Figure 4. This displays on a single plot both the positive deviations and the negative deviations from the target, which by default is set to the center of group statistics. If a different target for the process is required, this value may be provided through the argument target when creating the object of class 'qcc'. Two further aspects may need to be controlled. The argument decision.int can be used to specify the number of standard errors of the summary statistics at which the cumulative sum displays an out-of-control signal; by default it is set at 5. The amount of shift in the process to be detected by the cusum chart, measured in standard errors of the summary statistics, is controlled by the argument se.shift, by default set at 1.
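The tabular cusum underlying the chart can be sketched in a few lines of base R (an illustrative re-implementation of the standard recursion, with a hypothetical helper name, not the qcc internals): standardised group statistics accumulate above and below the target with reference value k = se.shift/2, and the decision interval h = 5 flags a signal.

```r
# Tabular cusum recursion on standardised deviations z.
cusum_stat <- function(z, k = 0.5) {
  up <- dn <- numeric(length(z))
  for (i in seq_along(z)) {
    u <- if (i == 1) 0 else up[i - 1]
    d <- if (i == 1) 0 else dn[i - 1]
    up[i] <- max(0, u + z[i] - k)   # cumulative sum above target
    dn[i] <- max(0, d - z[i] - k)   # cumulative sum below target
  }
  list(up = up, dn = dn)
}
set.seed(7)
z  <- c(rnorm(20), rnorm(10, mean = 2))  # mean shifts up after group 20
cs <- cusum_stat(z)
any(cs$up > 5)                           # the shift crosses the decision interval
```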
Figure 4: Cusum chart.
EWMA charts
Exponentially weighted moving average (EWMA) charts smooth a series of data based on a moving average with weights which decay exponentially. As with the cusum chart, this plot can also be used to detect small and permanent variation in the mean of the process.
The EWMA chart depends on the smoothing parameter λ, which controls the weighting scheme applied. By default it is set at 0.2, but it can be modified using the argument lambda. Assuming again that a 'qcc' object has been created, an EWMA chart can be obtained as follows:
> ewma(obj)
and the resulting graph is shown in Figure 5.
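The statistic plotted is the standard EWMA recursion, sketched here in base R with a hypothetical helper name (illustrative, not the package implementation): z_t = λ x_t + (1 − λ) z_{t−1}, started at the target value.

```r
# EWMA recursion: weights on past observations decay as (1 - lambda)^j.
ewma_stat <- function(x, lambda = 0.2, target) {
  z <- numeric(length(x))
  prev <- target
  for (i in seq_along(x)) {
    z[i] <- lambda * x[i] + (1 - lambda) * prev
    prev <- z[i]
  }
  z
}
ewma_stat(c(10, 12, 9, 11, 10), lambda = 0.2, target = 10)
# 10.0000 10.4000 10.1200 10.2960 10.2368
```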
Figure 5: EWMA chart.
Process capability analysis
Process capability indices for a characteristic of interest from a continuous process can be obtained through the function process.capability. This takes as mandatory arguments an object of class 'qcc' of type "xbar" or "xbar.one", and spec.limits, a vector specifying the lower (LSL) and upper (USL) specification limits. For example, using the previously created 'qcc' object, we may simply use:
> process.capability(obj,
+     spec.limits=c(73.95,74.05))
Process Capability Analysis

Call:
process.capability(object = obj,
                   spec.limits = c(73.95, 74.05))

Number of obs = 125          Target = 74
       Center = 74.00118        LSL = 73.95
       StdDev = 0.009887547     USL = 74.05

Capability indices:

       Value   2.5%  97.5%
Cp     1.686  1.476  1.895
Cp_l   1.725  1.539  1.912
Cp_u   1.646  1.467  1.825
Cp_k   1.646  1.433  1.859
Cpm    1.674  1.465  1.882

Exp<LSL 0%   Obs<LSL 0%
Exp>USL 0%   Obs>USL 0%
The graph shown in Figure 6 is a histogram of data values, with vertical dotted lines for the upper and the lower specification limits. The target is shown as a vertical dashed line and its value can be provided using the target argument in the call; if missing, the value from the 'qcc' object is used if available, otherwise the target is set at the middle value between the specification limits.

The dashed curve is a normal density for a Gaussian distribution matching the observed mean and standard deviation. The latter is estimated using the within-group standard deviation, unless a value is specified in the call using the std.dev argument.

The capability indices reported on the graph are also printed at console, together with confidence intervals computed at the level specified by the confidence.level argument (by default set at 0.95).
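For instance, wider confidence intervals for the indices can be requested through confidence.level (a sketch, reusing the specification limits from the example above):

```r
# 99% confidence intervals instead of the default 95%
process.capability(obj, spec.limits = c(73.95, 74.05),
                   confidence.level = 0.99)
```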
[Figure: Process Capability Analysis for diameter[1:25,]: histogram of the data with LSL, Target and USL marked as vertical lines and a matching normal density curve. Number of obs = 125; Center = 74.00118; StdDev = 0.009887547; Target = 74; LSL = 73.95; USL = 74.05; Cp = 1.69; Cp_l = 1.73; Cp_u = 1.65; Cp_k = 1.65; Cpm = 1.67; Exp<LSL 0%; Exp>USL 0%; Obs<LSL 0%; Obs>USL 0%.]
Figure 6: Process capability analysis.
A further display, called "process capability sixpack plots", is also available. This is a graphical summary formed by plotting on the same graphical device:

• an X chart

• an R or S chart (if sample sizes > 10)

• a run chart

• a histogram with specification limits

• a normal Q-Q plot

• a capability plot
As an example, we may input the following code:

> process.capability.sixpack(obj,
+     spec.limits=c(73.95,74.05),
+     target=74.01)
Pareto charts and cause-and-effect diagrams
A Pareto chart is a barplot where the categories are ordered using frequencies in non-increasing order, with a line added to show their cumulative sum. Pareto charts are useful to identify those factors that have the greatest cumulative effect on the system,
and thus screen out the less significant factors in an analysis.
A simple Pareto chart may be drawn using the function pareto.chart, which in its simpler form requires a vector of values. For example:
> defect <- c(80, 27, 66, 94, 33)
> names(defect) <-
+   c("price code", "schedule date",
+     "supplier code", "contact num.",
+     "part num.")
> pareto.chart(defect)
The function returns a table of descriptive statistics and the Pareto chart shown in Figure 7. More arguments can be provided to control axes labels and limits, the main title and bar colors (see help(pareto.chart)).
[Figure: Pareto Chart for defect, with bars in decreasing order (contact num., price code, supplier code, part num., schedule date) and a cumulative-percentage line; left axis Frequency (0-300), right axis Cumulative Percentage (0-100%).]
Figure 7: Pareto chart.
A cause-and-effect diagram, also called an Ishikawa diagram (or fish bone diagram), is used to associate multiple possible causes with a single effect. Thus, given a particular effect, the diagram is constructed to identify and organize possible causes for it.

A very basic implementation of this type of diagram is available in the qcc package. The function cause.and.effect requires at least two arguments: cause, a list of causes and branches providing descriptive labels, and effect, a string describing the effect. Other arguments are available, such as cex and font, which control, respectively, the character expansions and the fonts used for labeling branches, causes and the effect. The following example creates the diagram shown in Figure 8.
> cause.and.effect(
+   cause=list(Measurements=c("Microscopes",
+                             "Inspectors"),
+              Materials=c("Alloys",
+                          "Suppliers"),
+              Personnel=c("Supervisors",
+                          "Operators"),
+              Environment=c("Condensation",
+                            "Moisture"),
+              Methods=c("Brake", "Engager",
+                        "Angle"),
+              Machines=c("Speed", "Bits",
+                         "Sockets")),
+   effect="Surface Flaws")
[Figure: Cause-and-Effect diagram for the effect "Surface Flaws", with branches Measurements (Microscopes, Inspectors), Materials (Alloys, Suppliers), Personnel (Supervisors, Operators), Environment (Condensation, Moisture), Methods (Brake, Engager, Angle) and Machines (Speed, Bits, Sockets).]
Figure 8: Cause-and-effect diagram.
Tuning and extending the package
Options to control some aspects of the qcc package can be set using the function qcc.options. If this is called with no arguments it retrieves the list of available options and their current values, otherwise it sets the required option. From a user point of view, the most interesting options allow the user to set: the plotting character and the color used to highlight points beyond control limits or violating runs; the run length, i.e. the maximum value of a run before signalling a point as out of control; the background color used to draw the figure and the margin color; and the character expansion used to draw the labels, the title and the statistics at the bottom of any plot. For details see help(qcc.options).
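As a sketch of both usage modes (the specific option name used here is illustrative; consult help(qcc.options) for the names actually recognised by your version of the package):

```r
# Called with no arguments: retrieve all options and their current values
qcc.options()

# Set a single option; "run.length" is assumed here to be the name of the
# run-length option described in the text
qcc.options(run.length = 5)
```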
Another interesting feature is the possibility to easily extend the package by defining new control charts. Suppose we want to plot a p chart based on samples with different sizes. The resulting control chart will have non-constant control limits and thus it may be difficult to read by some operators. One possible solution is to draw a standardized control chart. This can be accomplished by defining three new functions: one which computes and returns the group statistics to be charted (i.e. the z scores) and their center (zero in this case), another one which returns the within-group standard deviation (one in this case), and finally a function which returns the control limits. These functions need to be
called stats.type, sd.type and limits.type, respectively, for a new chart of type "type". The following code may be used to define a standardized p chart:
stats.p.std <- function(data, sizes)
{
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data)/sum(sizes)
  z <- (data/sizes - pbar)/sqrt(pbar*(1-pbar)/sizes)
  list(statistics = z, center = 0)
}

sd.p.std <- function(data, sizes) return(1)

limits.p.std <- function(center, std.dev, sizes, conf)
{
  if (conf >= 1)
    { lcl <- -conf
      ucl <- +conf }
  else
    { if (conf > 0 & conf < 1)
        { nsigmas <- qnorm(1 - (1 - conf)/2)
          lcl <- -nsigmas
          ucl <- +nsigmas }
      else stop("invalid conf argument.") }
  limits <- matrix(c(lcl, ucl), ncol = 2)
  rownames(limits) <- rep("", length = nrow(limits))
  colnames(limits) <- c("LCL", "UCL")
  return(limits)
}
Then we may source the above code and obtain the control charts in Figure 9 as follows:
# set unequal sample sizes
> n <- c(rep(50,5), rep(100,5), rep(25,5))
# generate randomly the number of successes
> x <- rbinom(length(n), n, 0.2)
> par(mfrow=c(1,2))
# plot the control chart with variable limits
> qcc(x, type="p", size=n)
# plot the standardized control chart
> qcc(x, type="p.std", size=n)
[Figure: left panel, p Chart for x with variable control limits: Number of groups = 15; Center = 0.1931429; StdDev = 0.3947641; LCL is variable; UCL is variable; Number beyond limits = 0; Number violating runs = 0. Right panel, p.std Chart for x: Number of groups = 15; Center = 0; StdDev = 1; LCL = -3; UCL = 3; Number beyond limits = 0; Number violating runs = 0.]
Figure 9: A p chart with non-constant control limits (left panel) and the corresponding standardized control chart (right panel).
Summary
In this paper we briefly describe the qcc package. This provides functions to draw basic Shewhart quality control charts for continuous, attribute and count data; corresponding operating characteristic curves are also implemented. Other statistical quality tools available are Cusum and EWMA charts for continuous data, process capability analyses, Pareto charts and cause-and-effect diagrams.
References
Montgomery, D.C. (2000) Introduction to Statistical Quality Control, 4th ed. New York: John Wiley & Sons.

Wetherill, G.B. and Brown, D.W. (1991) Statistical Process Control. New York: Chapman & Hall.
Luca Scrucca
Dipartimento di Scienze Statistiche, Università degli Studi di Perugia, Italy
luca@stat.unipg.it
Least Squares Calculations in R

Timing different approaches
by Douglas Bates
Introduction
The S language encourages users to become programmers as they express new ideas and techniques of data analysis in S. Frequently the calculations in these techniques are expressed in terms of matrix operations. The subject of numerical linear algebra - how calculations with matrices can be carried out accurately and efficiently - is quite different from what many of us learn in linear algebra courses or in courses on linear statistical methods.

Numerical linear algebra is primarily based on decompositions of matrices, such as the LU decomposition, the QR decomposition, and the Cholesky decomposition, that are rarely discussed in linear algebra or statistics courses.
In this article we discuss one of the most common operations in statistical computing: calculating least squares estimates. We can write the problem mathematically as
β̂ = arg min_β ‖y − Xβ‖²    (1)
where X is an n × p model matrix (p ≤ n), y is n-dimensional and β is p-dimensional. Most statistics texts state that the solution to (1) is
β̂ = (X′X)⁻¹ X′y    (2)
when X has full column rank (i.e. the columns of X are linearly independent).

Indeed (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way. That is, they use code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.
General principles
A few general principles of numerical linear algebra are:
1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.

2. Use crossprod(X), not t(X) %*% X, to calculate X′X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X′y.

3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X′X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
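The first two rules are easy to verify on a small simulated example (the data here are arbitrary, chosen only for illustration):

```r
set.seed(123)
X <- matrix(rnorm(5 * 3), nrow = 5, ncol = 3)  # a 5 x 3 model matrix
y <- rnorm(5)

# Rule 2: crossprod computes the same products as the explicit forms
stopifnot(all.equal(crossprod(X),    t(X) %*% X))
stopifnot(all.equal(crossprod(X, y), t(X) %*% y))

# Rule 1: solve(A, b) agrees with (but is cheaper than) solve(A) %*% b
A <- crossprod(X)
b <- crossprod(X, y)
stopifnot(all.equal(solve(A, b), solve(A) %*% b))
```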
Some timings on a large example
For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well:
> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850  712
> sys.gc.time(naive.sol <- solve(t(mmm) %*%
+     mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00
(The function sys.gc.time is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)

According to the principles above, it should be more effective to compute
> sys.gc.time(cpod.sol <- solve(crossprod(mmm),
+     crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE
Timing results will vary between computers but in most cases the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X′X the naive way:
> sys.gc.time(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00

because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.
However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result the solution of (1) is performed using a general linear system solver based on an LU decomposition, when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly but the code is awkward:
> sys.gc.time(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sys.gc.time(chol.sol <- backsolve(ch,
+     forwardsolve(ch, crossprod(mmm, y),
+         upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE
Least squares calculations with Matrix classes
The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition:
> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(Mat.sol <- solve(crossprod(mmg),
+     crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE
Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object so that it can be reused without recalculation:
> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all:
> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(sparse.sol <- solve(crossprod(mm),
+     crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE
The model matrix mm has class "cscMatrix", indicating that it is a compressed, sparse, column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric, sparse, column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful, the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case, the decomposition is so fast that it is difficult to determine the difference in the solution times:
> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
Epilogue
Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS), such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or Atlas (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.

Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.

Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A′B are generally faster than those in AB.

Notice that this means that calculating A′B in R as t(A) %*% B, instead of crossprod(A, B), causes A to be transposed twice: once in the calculation of t(A) and a second time in the inner loop of the matrix
product. The crossprod function does not do any transposition of matrices - yet another reason to use crossprod.
Bibliography
J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).
D.M. Bates
U. of Wisconsin-Madison
bates@wisc.edu
Tools for interactively exploring R packages

by Jianhua Zhang and Robert Gentleman
Introduction
An R package has the required and perhaps additional subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer and they are intended for exploring packages, examples and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.
Description
Our current implementation is in the form of tcltk widgets and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcltk. Once both tkWidgets and widgetTools have been loaded into the working session, the user has full access to these interactive tools.
pExplorer
pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from showing by the widget. The default behavior is to open the first package in the library of locally installed R (.libPaths()) with subdirectories/files named "Meta" and "latex" excluded, if nothing is provided by a user.

Figure 1 shows the widget invoked by typing pExplorer("base").
The following tasks can be performed through the interface:

• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

• Clicking the Browse button allows users to select a directory containing R packages, to achieve the same results as typing in a new path.

• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.
Figure 1: A snapshot of the widget when pExplorer is called.
• Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just being clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, to be discussed next.
eExplorer
eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.

Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.
vExplorer
vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionalities provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of the vignettes of the package.

Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget, as shown by Figure 4, appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:
• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the pdf version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in Results of Execution.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.

The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.
References
Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28-31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21-24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.
Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base").
Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().
Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.
The survival package

by Thomas Lumley
This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).
Overview
Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).
Specifying survival data
In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.

In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.
> library(survival)
> data(veteran)
# time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72 411 228 126 118  10  82
 [8] 110 314 100+  42   8 144  25+
...
# time from diagnosis to death
> with(veteran, Surv(diagtime*30,
+     diagtime*30+time, status))
 [1] ( 210, 282 ] ( 150, 561 ]
 [3] (  90, 318 ] ( 270, 396 ]
...
 [9] ( 540, 854 ] ( 180, 280+]
The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.
One and two sample summaries
The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.
> data(veteran)
> plot(survfit(Surv(time,status)~trt, data=veteran),
+     xlab="Years since randomisation",
+     xscale=365, ylab="% surviving", yscale=100,
+     col=c("forestgreen","blue"))
> survdiff(Surv(time,status)~trt, data=veteran)

Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

       N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394
      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928
[Figure: estimated survival curves, % surviving against years since randomisation (0 to 2.5), for the Standard and New treatments.]
Figure 1: Survival distributions for two lung cancer treatments.
Proportional hazards models
The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.
Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

log h(t; z) = log h0(t) + βz.
Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.

A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.
> data(pbc)
> mayomodel <- coxph(Surv(time,status)~edtrt+
+         log(bili)+log(protime)+
+         age+platelet,
+     data=pbc, subset=trt>0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
        log(bili) + log(protime) + age +
        platelet, data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999

             se(coef)     z       p
edtrt        0.300321  3.43 0.00061
log(bili)    0.097771  9.73 0.00000
log(protime) 1.031908  2.80 0.00520
age          0.008489  4.18 0.00003
platelet     0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312
The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.
> plot(survfit(Surv(time,status)~edtrt,
+     data=pbc, subset=trt==-9))
> lines(survexp(~edtrt+
+     ratetable(edtrt=edtrt, bili=bili,
+         platelet=platelet, age=age,
+         protime=protime),
+     data=pbc, subset=trt==-9,
+     ratetable=mayomodel, cohort=TRUE),
+   col="purple")
The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.
[Figure: observed and model-predicted survival curves, survival probability (0 to 1) against time (0 to 4000 days).]
Figure 2: Observed and predicted survival.
The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The
cox.zph function provides formal tests and graphical diagnostics for proportional hazards:
> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

# graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])
[Figure: smoothed estimate of Beta(t) for edtrt plotted against time (roughly 210 to 3400 days), with pointwise confidence bands.]
Figure 3: Testing proportional hazards.
There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.
Additional features
Computation for the Cox model extends straightforwardly to situations where z can vary over time, and to situations where an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term, indicating which records belong to the same individual, and a strata() term, indicating subgroups that get their own unspecified baseline hazard function.

Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
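As a brief sketch of the parametric counterpart (reusing the veteran data loaded earlier; "weibull" is assumed here as the error distribution, which is survreg's default):

```r
# Parametric (accelerated failure time) fit to the veteran data
wfit <- survreg(Surv(time, status) ~ trt, data = veteran,
                dist = "weibull")
summary(wfit)
```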
More recent additions include penalised likeli-hood estimation for smoothing splines and for ran-dom effects models
The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of follow-up.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
useR 2004: The R User Conference

by John Fox
McMaster University, Canada
More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.
The conference program included nearly 100 pre-sentations many in keynote plenary and semi-plenary sessions The diversity of presentations mdashfrom bioinformatics to user interfaces and finance tofights among lizards mdash reflects the wide range of ap-plications supported by R
A particularly interesting component of the pro-gram were keynote addresses by members of the Rcore team aimed primarily at improving program-ming skills and describing key features of R alongwith new and imminent developments These talksreflected a major theme of the conference mdash the closerelationship in R between use and development Thekeynote addresses included
• Paul Murrell on grid graphics and programming
• Martin Mächler on good programming practice
R News ISSN 1609-3631
Vol. 4/1, June 2004
• Luke Tierney on namespaces and byte compilation
• Kurt Hornik on packaging, documentation, and testing
• Friedrich Leisch on S4 classes and methods
• Peter Dalgaard on language interfaces, using .Call and .External
• Douglas Bates on multilevel models in R
• Brian D. Ripley on datamining and related topics
Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html

An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.

The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19 and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.
R Help Desk: Date and Time Classes in R
Gabor Grothendieck and Thomas Petzoldt
Introduction
R-1.9.0 and its contributed packages have a number of datetime (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:
• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Date. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).
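As a concrete illustration (not from the original article) of the internal representation just described:

```r
d <- as.Date("2004-06-15")
unclass(d)   # days since January 1, 1970: 12584
d + 7        # one week later: "2004-06-22"
weekdays(d)  # "Tuesday" in an English locale
```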
• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).
• POSIX classes. POSIX classes refer to the two classes POSIXct, POSIXlt, and their common super class POSIXt. These support times and dates, including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases, and other applications that involve external communication with systems that also support dates, times, time zones, and daylight/standard times. They are also useful for certain applications such as world currency markets where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970, GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of
the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
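To make the internal representations concrete, here is a small base-R illustration (not from the original article) of unclass() applied to both POSIX classes:

```r
dp <- as.POSIXct("1970-01-02 12:00:00", tz = "GMT")
unclass(dp)                      # seconds since 1970-01-01 GMT: 129600
names(unclass(as.POSIXlt(dp)))  # the 9 components: sec, min, hour, mday, ...
```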
When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.

A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.

Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.
Other Applications
Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word "usually" was employed above.
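A worked example of the Windows-origin recipe (the serial number here is a hypothetical value, not from the original article):

```r
x <- 38139.25                     # hypothetical Excel (Windows) date/time serial number
as.Date("1899-12-30") + floor(x)  # "2004-06-01"
```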
SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).
Time Series
A common use of dates and datetimes are in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its), and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or datetime package to represent datetimes.
Input
By default, read.table will read in numeric data, such as 20040115, as numbers and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹
# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y%m%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)
Avoiding Errors
The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:
# default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))

# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)

# function to map 2 digit year to 4 digits
options(chron.year.expand = "year.expand")

# if TRUE then year not displayed
options(chron.simplify = FALSE)
For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,
¹ This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/
orig <- chron("01/01/60")
x <- 0:9  # days since 01/01/60
# chron dates with default origin
orig + x
Regarding POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:
dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)
This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in Table 1 of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.
• OS. Some of the date calculations done by POSIX functions are passed off to the operating system and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general, caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.
• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.
• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
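As a quick check of the time-zone advice above, one can compare a function's output with tz="GMT" against the default. This is a small sketch (not from the original article) using format, which does honour tz=:

```r
dp <- as.POSIXct("2004-06-15 12:00:00", tz = "GMT")
format(dp, tz = "GMT")  # "2004-06-15 12:00:00"
format(dp)              # rendered in the current time zone; may differ
```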
Comparison Table

Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.
Bibliography

Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2):8–11. URL http://CRAN.R-project.org/doc/Rnews/
Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de
Table 1: Comparison Table

now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz = "GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class = "Date")
  chron:   chron(0)
  POSIXct: structure(0, class = c("POSIXt", "POSIXct"))

x days since origin
  Date:    structure(floor(x), class = "Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class = c("POSIXt", "POSIXct"))  # GMT

diff in days
  Date:    dt1 - dt2
  chron:   dc1 - dc2
  POSIXct: difftime(dp1, dp2, unit = "day")  [1]

time zone difference
  POSIXct: dp - as.POSIXct(format(dp), tz = "GMT")

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt + 1
  chron:   dc + 1
  POSIXct: seq(dp0, length = 2, by = "DSTday")[2]  [2]

previous day
  Date:    dt - 1
  chron:   dc - 1
  POSIXct: seq(dp0, length = 2, by = "-1 DSTday")[2]  [3]

x days since date
  Date:    dt0 + floor(x)
  chron:   dc0 + x
  POSIXct: seq(dp0, length = 2, by = paste(x, "DSTday"))[2]  [4]

sequence
  Date:    dts <- seq(dt0, length = 10, by = "day")
  chron:   dcs <- seq(dc0, length = 10)
  POSIXct: dps <- seq(dp0, length = 10, by = "DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz = "GMT"), rnorm(10))  [5]

every 2nd week
  Date:    seq(dt0, length = 3, by = "2 week")
  chron:   dc0 + seq(0, length = 3, by = 14)
  POSIXct: seq(dp0, length = 3, by = "2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc) - month.day.year(dc)$day + 1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun = 0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc) - 3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output: yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output: mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output: Sat Jul 1 1970
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input: "1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format = "y-m-d")
  POSIXct: as.POSIXct(z)

Input: "10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz = "")
  to Date:    as.Date(dates(dc));  as.Date(format(dp))
  to chron:   chron(unclass(dt));  chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S"))  [6]
  to POSIXct: as.POSIXct(format(dt));  as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz = "GMT")
  to Date:    as.Date(dates(dc));  as.Date(dp)
  to chron:   chron(unclass(dt));  as.chron(dp)
  to POSIXct: as.POSIXct(format(dt), tz = "GMT");  as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for % codes.

[1] Can be reduced to dp1 - dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp + 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp - 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp + 24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz = "GMT")).
Programmers' Niche: A simple class in S3 and S4

by Thomas Lumley
ROC curves
Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±Inf and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.
drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
    function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
    function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type = "l")
}
It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.
drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  plot(mspec, sens, type = "l")
}
One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
Classes and methods
Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec, test = TT,
               call = sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}
The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.
plot.ROC <- function(x, type = "b", null.line = TRUE,
                     xlab = "1-Specificity", ylab = "Sensitivity",
                     main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
       xlab = xlab, ylab = ylab, ...)
  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}
A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:
lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}
A method for identify allows the cutpoint to be found for any point on the curve:
identify.ROC <- function(x, labels = NULL, ...,
                         digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}
An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
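For readers who want to check their answer, here is one possible sketch (not from the original article), computing the area under the empirical ROC curve by the trapezoidal rule. It assumes the sens and mspec components of the S3 ROC objects defined above:

```r
AUC <- function(x) {
  sens <- x$sens
  mspec <- x$mspec
  # trapezoidal rule over the (1-specificity, sensitivity) points
  sum(diff(mspec) * (head(sens, -1) + tail(sens, -1))/2)
}
```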
Generalising the class
The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.
ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.
ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:
setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
    length(object@sens) == length(object@test)
  })
The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new
function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  new("ROC", sens = sens, mspec = mspec,
      test = TT, call = sys.call())
}
Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:
setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })
The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.
setMethod("plot", c("ROC", "missing"),
  function(x, y, type = "b", null.line = TRUE,
           xlab = "1-Specificity",
           ylab = "Sensitivity",
           main = NULL, ...) {
    par(pty = "s")
    plot(x@mspec, x@sens, type = type,
         xlab = xlab, ylab = ylab, ...)
    if (null.line)
      abline(0, 1, lty = 3)
    if (is.null(main))
      main <- x@call
    title(main = main)
  })
Creating a method for lines appears to work the same way:
setMethod("lines", "ROC",
  function(x, ...) {
    lines(x@mspec, x@sens, ...)
  })
In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before:
> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>
and after the setMethod call:
> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm except that new is used to create the ROC object:
setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T, ...) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type = "response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    new("ROC", sens = sens, mspec = mspec,
        test = TT, call = sys.call())
  })
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.
setGeneric("identify", signature = c("x"))

setMethod("identify", "ROC",
  function(x, labels = NULL, ...,
           digits = 1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels = labels, ...)
  })
Discussion
Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R

by the R Core Team

New features in 1.9.1
• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.
User-visible changes in 1.9.0

• Underscore _ is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has
been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook (see ?setHook).

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0
bull $ $lt- [[ [[lt- can be applied to environ-ments Only character arguments are allowedand no partial matching is done The semanticsare basically that of getassign to the environ-ment with inherits=FALSE
bull There are now print() and [ methods for acfobjects
bull aov() will now handle singular Error() mod-els with a warning
bull arima() allows models with no free parame-ters to be fitted (to find log-likelihood and AICvalues thanks to Rob Hyndman)
bull array() and matrix() now allow 0-lengthlsquodatarsquo arguments for compatibility with S
bull asdataframe() now has a method for ar-rays
bull asmatrixdataframe() now coerces an all-logical data frame to a logical matrix
bull New function assignInNamespace() paral-lelling fixInNamespace
bull There is a new function contourLines() toproduce contour lines (but not draw anything)This makes the CRAN package clines (with itsclines() function) redundant
bull D() deriv() etc now also differentiateasin() acos() atan() (thanks to a contri-bution of Kasper Kristensen)
bull The package argument to data() is no longerallowed to be a (unquoted) name and so can bea variable name or a quoted character string
bull There is a new class Date to represent dates(without times) plus many utility functionssimilar to those for date-times See Date
• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)
• Deprecated() has a new argument package which is used in the warning message for non-base packages.
• The print() method for "difftime" objects now handles arrays.
• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.
• dist() has a new method to calculate Minkowski distances.
• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.
• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.
• New functions factorial() defined as gamma(x+1) and, for S-PLUS compatibility, lfactorial() defined as lgamma(x+1).
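A quick check of those definitions:

```r
factorial(5)    # gamma(6) = 120
lfactorial(10)  # lgamma(11), i.e. log(10!)
all.equal(lfactorial(10), log(factorial(10)))  # TRUE
```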
• findInterval(x, v) now allows +/-Inf values, and NAs in x.
• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.
• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.
R News ISSN 1609-3631
Vol. 4/1, June 2004
• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.
The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.
• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)
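For example:

```r
head(letters, 4)  # "a" "b" "c" "d"
tail(letters, 2)  # "y" "z"
head(mtcars)      # first six rows of a data frame
```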
• legend() has a new argument 'text.col'.
• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).
• A new function mget() will retrieve multiple values from an environment.
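A small sketch:

```r
e <- new.env()
assign("a", 1, envir = e)
assign("b", 2, envir = e)
mget(c("a", "b"), envir = e)  # a named list of both values at once
```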
• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".
• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.
• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.
• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)
It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.
• ppr() now fully supports categorical explanatory variables.
ppr() is now interruptible at suitable places in the underlying FORTRAN code.
• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.
• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.
• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.
• read.table() now allows embedded newlines in quoted fields. (PR#4555)
• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.
If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.
If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)
The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).
• rgb2hsv() is new, an R interface to the C API function with the same name.
• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.
• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.
• New function shQuote() to quote strings to be passed to OS shells.
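For example, with sh-style quoting (the default on Unix-alikes):

```r
f <- "my file.txt"
shQuote(f)                         # 'my file.txt' on sh-like shells
cmd <- paste("ls -l", shQuote(f))  # now safe to hand to system()
```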
• sink() now has a split= argument to direct output to both the sink and the current output connection.
• split.screen() now works for multiple devices at once.
• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.
• strsplit() now has fixed and perl arguments and split = "" is optimized.
• subset() now allows a drop argument which is passed on to the indexing method for data frames.
• termplot() has an option to smooth the partial residuals.
• varimax() and promax() add class "loadings" to their loadings component.
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)
• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.
• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".
• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).
• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.
• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().
• Package stats4 exports S4 generics for AIC() and BIC().
• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.
• Added experimental support for conditionals in NAMESPACE files.
• Added as.list.environment to coerce environments to lists (efficiently).
• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)
• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode called from plot.dendrogram). The "diamond" frames around edge labels are more nicely scaled horizontally.
• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.
• Changes to package 'grid':
– Renamed push/pop.viewport() to push/popViewport().
– Added upViewport(), downViewport(), and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).
– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.
– Added vpList(), vpStack(), vpTree(), and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.
– Added vpPath() to allow specification of viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use. NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).
– Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).
– Allowed the vp slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:
      pushViewport(viewport(w=.5, h=.5, name="A"))
      grid.rect()
      pushViewport(viewport(w=.5, h=.5, name="B"))
      grid.rect(gp=gpar(col="grey"))
      upViewport(2)
      grid.rect(vp="A", gp=gpar(fill="red"))
      grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))
– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html
– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:
* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).
* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)
* the grobwidth and grobheight units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob "pointed to" is modified.
In addition, the following features are now possible with grobs:
* grobs now save()/load() like any normal R object.
* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).
* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a childrenvp slot which is a viewport which is pushed, and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.
* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.
* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.
– Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth", and "grobheight" unit objects, respectively).
– Allowed viewports to turn off clipping altogether. Possible settings for viewport clip arg are now:
      "on"      = clip to the viewport (was TRUE)
      "inherit" = clip to whatever parent says (was FALSE)
      "off"     = turn off clipping
Still accept logical values (and NA maps to "off").
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.
• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.
• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.
• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.
• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.
• Vignette dependencies can now be checked and obtained via vignetteDepends.
• Option "repositories" to list URLs for package repositories added.
• package.description() has been replaced by packageDescription().
• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.
• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.
• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.
• hsv2rgb and rgb2hsv are newly in the C API.
• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.
• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.
• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.
• print.coefmat() is defunct, replaced by printCoefmat().
• codes() and codes<-() are defunct.
• anovalist.lm (replaced in 1.2.0) is now defunct.
• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.
• print.atomic() is defunct.
• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).
• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.
• La.eigen() is deprecated; now eigen() uses LAPACK by default.
• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(, deriv=2) and psigamma(, deriv=3).
• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.
• package.contents() and package.description() have been deprecated.
• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.
The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.
The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.
• configure no longer lists bzip2 and PCRE as 'additional capabilities' as all builds of R have had them since 1.7.0.
• --with-blas=goto to use K. Goto's optimized BLAS will now work.
The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN
by Kurt Hornik

New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.
BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.
DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.
RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001) S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.
crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.
debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. Energy-statistics concept based on a generalization of Newton's potential energy is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S functions (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics – Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.
gap This is an integrated package for genetic data analysis of both population and family data. Currently it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
kernlab Kernel-based machine learning methods including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NNs. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns, ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between versus within chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.
multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory efficient refdata type that transparently encapsulates matrixes and data.frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen, ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.
spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg 15869, and J. Comput. Chem., 2003, 24, pg 1215. By Rajarshi Guha.

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (simulated urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns, ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.
Other changes
• Packages mclust1998, netCDF, and serialize were moved from the main CRAN section to the Archive.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R News ISSN 1609-3631
Vol. 4/1, June 2004

R Foundation News
by Bettina Grün
Donations
- Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)
- Emanuele De Rinaldis (Italy)
- Norm Phillips (Canada)
New supporting institutions
- Breast Center at Baylor College of Medicine, Houston, Texas, USA
- Dana-Farber Cancer Institute, Boston, USA
- Department of Biostatistics, Vanderbilt University School of Medicine, USA
- Department of Statistics & Actuarial Science, University of Iowa, USA
- Division of Biostatistics, University of California, Berkeley, USA
- Norwegian Institute of Marine Research, Bergen, Norway
- TERRA Lab, University of Regina - Department of Geography, Canada
- ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board: Douglas Bates and Paul Murrell

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing. Communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage: http://www.R-project.org/

This newsletter is available online at http://CRAN.R-project.org/doc/Rnews/
for their continued drive to create and support this wonderful vehicle for international collaboration.

Marc Schwartz
MedAnalytics, Inc., Minneapolis, Minnesota, USA
MSchwartz@MedAnalytics.com
The ade4 package - I: One-table methods
by Daniel Chessel, Anne B. Dufour and Jean Thioulouse
Introduction
This paper is a short summary of the main classes defined in the ade4 package for one table analysis methods (e.g., principal component analysis). Other papers will detail the classes defined in ade4 for two-tables coupling methods (such as canonical correspondence analysis, redundancy analysis, and co-inertia analysis), for methods dealing with K-tables analysis (i.e., three-ways tables), and for graphical methods.
This package is a complete rewrite of the ADE4 software (Thioulouse et al. (1997), http://pbil.univ-lyon1.fr/ADE-4/) for the R environment. It contains Data Analysis functions to analyse Ecological and Environmental data in the framework of Euclidean Exploratory methods, hence the name ade4 (i.e., 4 is not a version number but means that there are four E in the acronym).
The ade4 package is available in CRAN, but it can also be used directly online thanks to the Rweb system (http://pbil.univ-lyon1.fr/Rweb/). This possibility is being used to provide multivariate analysis services in the field of bioinformatics, particularly for sequence and genome structure analysis at the PBIL (http://pbil.univ-lyon1.fr/). An example of these services is the automated analysis of the codon usage of a set of DNA sequences by correspondence analysis (http://pbil.univ-lyon1.fr/mva/coa.php).
The duality diagram class
The basic tool in ade4 is the duality diagram (Escoufier (1987)). A duality diagram is simply a list that contains a triplet (X, Q, D):
- X is a table with n rows and p columns, considered as p points in R^n (column vectors) or n points in R^p (row vectors).

- Q is a p x p diagonal matrix containing the weights of the p columns of X, used as a scalar product in R^p (Q is stored under the form of a vector of length p).

- D is an n x n diagonal matrix containing the weights of the n rows of X, used as a scalar product in R^n (D is stored under the form of a vector of length n).
For example, if X is a table containing normalized quantitative variables, if Q is the identity matrix I_p, and if D is equal to (1/n) I_n, the triplet corresponds to a principal component analysis on the correlation matrix (normed PCA). Each basic method corresponds to a particular triplet (see Table 1), but more complex methods can also be represented by their duality diagram.
Functions   Analyses                                        Notes
dudi.pca    principal component                             1
dudi.coa    correspondence                                  2
dudi.acm    multiple correspondence                         3
dudi.fca    fuzzy correspondence                            4
dudi.mix    analysis of a mixture of numeric and factors    5
dudi.nsc    non symmetric correspondence                    6
dudi.dec    decentered correspondence                       7

The dudi functions. 1: Principal component analysis, same as prcomp/princomp. 2: Correspondence analysis, Greenacre (1984). 3: Multiple correspondence analysis, Tenenhaus and Young (1985). 4: Fuzzy correspondence analysis, Chevenet et al. (1994). 5: Analysis of a mixture of numeric variables and factors, Hill and Smith (1976), Kiers (1994). 6: Non symmetric correspondence analysis, Kroonenberg and Lombardo (1999). 7: Decentered correspondence analysis, Dolédec et al. (1995).
The singular value decomposition of a triplet gives principal axes, principal components, and row and column coordinates, which are added to the triplet for later use.

We can use, for example, a well-known dataset from the base package:
> data(USArrests)
> pca1 <- dudi.pca(USArrests, scannf = FALSE, nf = 3)
scannf = FALSE means that the number of principal components that will be used to compute row and column coordinates should not be asked interactively to the user, but taken as the value of argument nf (by default, nf = 2). Other parameters allow the user to choose between centered, normed or raw PCA (default is centered and normed), and to set arbitrary row and column weights. The pca1 object is a duality diagram, i.e., a list made of several vectors and dataframes:
R News ISSN 1609-3631
Vol 41 June 2004 6
> pca1
Duality diagramm
class: pca dudi
$call: dudi.pca(df = USArrests, scannf = FALSE, nf = 3)
$nf: 3 axis-components saved
$rank: 4
eigen values: 2.48 0.9898 0.3566 0.1734
  vector  length  mode     content
1 $cw     4       numeric  column weights
2 $lw     50      numeric  row weights
3 $eig    4       numeric  eigen values
  data.frame  nrow  ncol  content
1 $tab        50    4     modified array
2 $li         50    3     row coordinates
3 $l1         50    3     row normed scores
4 $co         4     3     column coordinates
5 $c1         4     3     column normed scores
other elements: cent norm
pca1$lw and pca1$cw are the row and column weights that define the duality diagram, together with the data table (pca1$tab). pca1$eig contains the eigenvalues. The row and column coordinates are stored in pca1$li and pca1$co. The variance of these coordinates is equal to the corresponding eigenvalue, and unit variance coordinates are stored in pca1$l1 and pca1$c1 (this is useful to draw biplots).
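As a cross-check of the statement that a normed PCA operates on the correlation matrix, the eigenvalues printed above can be reproduced with base R alone, without ade4. This is a sketch, relying only on the USArrests dataset shipped with R:

```r
# Eigenvalues of a normed PCA are those of the correlation matrix.
# These should match pca1$eig above: 2.48, 0.9898, 0.3566, 0.1734.
data(USArrests)
ev <- eigen(cor(USArrests), symmetric = TRUE)$values
round(ev, 4)

# For a normed PCA the eigenvalues sum to the number of variables:
sum(ev)  # 4
```

This also explains why pca1$rank is 4: the correlation matrix of four variables has four (here all non-zero) eigenvalues.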
The general optimization theorems of data analysis take particular meanings for each type of analysis, and graphical functions are proposed to draw the canonical graphs, i.e., the graphical expression corresponding to the mathematical property of the object. For example, the normed PCA of a quantitative variable table gives a score that maximizes the sum of squared correlations with variables. The PCA canonical graph is therefore a graph showing how the sum of squared correlations is maximized for the variables of the data set. On the USArrests example, we obtain the following graphs:
> score(pca1)
> s.corcircle(pca1$co)
Figure 1: One dimensional canonical graph for a normed PCA. Variables are displayed as a function of row scores, to get a picture of the maximization of the sum of squared correlations.
Figure 2: Two dimensional canonical graph for a normed PCA (correlation circle): the direction and length of arrows show the quality of the correlation between variables, and between variables and principal components.
The scatter function draws the biplot of the PCA (i.e., a graph with both rows and columns superimposed):

> scatter(pca1)
R News ISSN 1609-3631
Vol 41 June 2004 7
Figure 3: The PCA biplot. Variables are symbolized by arrows, and they are superimposed to the individuals display. The scale of the graph is given by a grid, whose size is given in the upper right corner. Here, the length of the side of grid squares is equal to one. The eigenvalues bar chart is drawn in the upper left corner, with the two black bars corresponding to the two axes used to draw the biplot. Grey bars correspond to axes that were kept in the analysis, but not used to draw the graph.
Separate factor maps can be drawn with the s.corcircle (see Figure 2) and s.label functions:

> s.label(pca1$li)
Figure 4: Individuals factor map in a normed PCA.
Distance matrices
A duality diagram can also come from a distance matrix, if this matrix is Euclidean (i.e., if the distances in the matrix are the distances between some points in a Euclidean space). The ade4 package contains functions to compute dissimilarity matrices (dist.binary for binary data, and dist.prop for frequency data), test whether they are Euclidean (Gower and Legendre (1986)), and make them Euclidean (quasieuclid, lingoes, Lingoes (1971), cailliez, Cailliez (1983)). These functions are useful to ecologists who use the works of Legendre and Anderson (1999) and Legendre and Legendre (1998).
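The Euclidean property can be checked directly from first principles: by Gower's criterion, a dissimilarity matrix is Euclidean exactly when the doubly-centered matrix of -d^2/2 has no negative eigenvalues. The following base-R sketch illustrates the idea; the helper name is ours, not ade4's (ade4 users would call is.euclid instead):

```r
# Gower's criterion: d is Euclidean iff B = -(1/2) J d^2 J is
# positive semi-definite, where J = I - 11'/n centers rows/columns.
is_euclidean <- function(d, tol = 1e-8) {
  d <- as.matrix(d)
  n <- nrow(d)
  J <- diag(n) - matrix(1 / n, n, n)
  B <- -0.5 * J %*% d^2 %*% J
  all(eigen(B, symmetric = TRUE, only.values = TRUE)$values > -tol)
}

# Distances between actual points in R^2 are Euclidean by construction:
pts <- matrix(rnorm(20), ncol = 2)
is_euclidean(dist(pts))  # TRUE
```

The quasieuclid, lingoes and cailliez functions mentioned above are different ways of repairing a matrix for which this check fails.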
The Yanomama data set (Manly (1991)) contains three distance matrices between 19 villages of Yanomama Indians. The dudi.pco function can be used to compute a principal coordinates analysis (PCO, Gower (1966)), that gives a Euclidean representation of the 19 villages. This Euclidean representation allows us to compare the geographical, genetic and anthropometric distances.
> data(yanomama)
> gen <- quasieuclid(as.dist(yanomama$gen))
> geo <- quasieuclid(as.dist(yanomama$geo))
> ant <- quasieuclid(as.dist(yanomama$ant))
> geo1 <- dudi.pco(geo, scann = FALSE, nf = 3)
> gen1 <- dudi.pco(gen, scann = FALSE, nf = 3)
> ant1 <- dudi.pco(ant, scann = FALSE, nf = 3)
> par(mfrow=c(2,2))
> scatter(geo1)
> scatter(gen1)
> scatter(ant1, posi="bottom")
Figure 5: Comparison of the PCO analysis of the three distance matrices of the Yanomama data set: geographic, genetic and anthropometric distances.
R News ISSN 1609-3631
Vol 41 June 2004 8
Taking into account groups of individuals
In sites x species tables, rows correspond to sites, columns correspond to species, and the values are the number of individuals of species j found at site i. These tables can have many columns and cannot be used in a discriminant analysis. In this case, between-class analyses (between function) are a better alternative, and they can be used with any duality diagram. The between-class analysis of triplet (X, Q, D) for a given factor f is the analysis of the triplet (G, Q, Dw), where G is the table of the means of table X for the groups defined by f, and Dw is the diagonal matrix of group weights. For example, a between-class correspondence analysis (BCA) is very simply obtained after a correspondence analysis (CA):
> data(meaudret)
> coa1 <- dudi.coa(meaudret$fau, scannf = FALSE)
> bet1 <- between(coa1, meaudret$plan$sta,
+     scannf = FALSE)
> plot(bet1)
The meaudret$fau dataframe is an ecological table with 24 rows corresponding to six sampling sites along a small French stream (the Meaudret). These six sampling sites were sampled four times (spring, summer, winter and autumn), hence the 24 rows. The 13 columns correspond to 13 ephemeroptera species. The CA of this data table is done with the dudi.coa function, giving the coa1 duality diagram. The corresponding between-class analysis is done with the between function, considering the sites as classes (meaudret$plan$sta is a factor defining the classes). Therefore, this is a between-sites analysis, whose aim is to discriminate the sites, given the distribution of ephemeroptera species. This gives the bet1 duality diagram, and Figure 6 shows the graph obtained by plotting this object.
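To make the definition of G concrete, the table of group means underlying a between-class analysis can be computed by hand. This is an unweighted sketch with a toy data set, showing the structure only: for a CA triplet, the between function uses the row weights of the duality diagram rather than plain means.

```r
# G has one row per class and one column per variable: the class
# means of the original table X for the groups defined by factor f.
group_means <- function(X, f) {
  apply(X, 2, function(col) tapply(col, f, mean))
}

# Toy example: 6 "sites" in 3 classes, 2 "species"
X <- data.frame(sp1 = c(1, 2, 3, 4, 5, 6),
                sp2 = c(6, 5, 4, 3, 2, 1))
f <- factor(c("a", "a", "b", "b", "c", "c"))
group_means(X, f)
# row a: 1.5 5.5 / row b: 3.5 3.5 / row c: 5.5 1.5
```

The between-class analysis then diagonalizes this (much smaller) table of means with the original column weights Q and the group weights Dw.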
Figure 6: BCA plot. This is a composed plot, made of: 1- the species canonical weights (top left), 2- the species scores (middle left), 3- the eigenvalues bar chart (bottom left), 4- the plot of plain CA axes projected into BCA (bottom center), 5- the gravity centers of classes (bottom right), 6- the projection of the rows with ellipses and gravity center of classes (main graph).
Like between-class analyses, linear discriminant analysis (discrimin function) can be extended to any duality diagram (G, (X^t D X)^-, Dw), where (X^t D X)^- is a generalized inverse. This gives for example a correspondence discriminant analysis (Perrière et al. (1996), Perrière and Thioulouse (2002)), that can be computed with the discrimin.coa function.

Opposite to between-class analyses are within-class analyses, corresponding to diagrams (X - Y Dw^-1 Y^t D X, Q, D) (within functions). These analyses extend to any type of variables the Multiple Group Principal Component Analysis (MGPCA, Thorpe (1983a), Thorpe (1983b), Thorpe and Leamy (1983)). Furthermore, the within.coa function introduces the double within-class correspondence analysis (also named internal correspondence analysis, Cazes et al. (1988)).
Permutation tests
Permutation tests (also called Monte-Carlo tests, or randomization tests) can be used to assess the statistical significance of between-class analyses. Many permutation tests are available in the ade4 package, for example mantel.randtest, procuste.randtest, randtest.between, randtest.coinertia, RV.rtest, randtest.discrimin, and several of these tests are available both in R (mantel.rtest) and in C (mantel.randtest) programming language. The R version allows the user to see how computations are performed, and to write other tests easily, while the C version is needed for performance reasons.
The statistical significance of the BCA can be evaluated with the randtest.between function. By default, 999 permutations are simulated, and the resulting object (test1) can be plotted (Figure 7). The p-value is highly significant, which confirms the existence of differences between sampling sites. The plot shows that the observed value is very far to the right of the histogram of simulated values.
> test1 <- randtest.between(bet1)
> test1
Monte-Carlo test
Observation: 0.4292
Call: randtest.between(xtest = bet1)
Based on 999 replicates
Simulated p-value: 0.001
> plot(test1, main="Between class inertia")
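The p-value reported above follows the usual Monte-Carlo convention: the observed statistic is ranked among the simulated ones, so with 999 permutations the smallest attainable p-value is 1/1000. A base-R sketch of that computation; the statistic here is a simple difference of group means, not the between-class inertia that ade4 uses:

```r
# Permutation test of a two-group difference in means.
set.seed(1)
x <- c(rnorm(10, mean = 0), rnorm(10, mean = 2))
g <- rep(c("A", "B"), each = 10)
obs <- abs(diff(tapply(x, g, mean)))

nrep <- 999
sim <- replicate(nrep, {
  gp <- sample(g)                      # permute the group labels
  abs(diff(tapply(x, gp, mean)))
})

# Monte-Carlo p-value: rank of the observation among the simulations
p <- (sum(sim >= obs) + 1) / (nrep + 1)
p  # very small for a shift this large
```

With this convention a p-value of exactly 0.001, as printed above, means the observed value exceeded all 999 simulated values.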
Figure 7: Histogram of the 999 simulated values of the randomization test of the bet1 BCA. The observed value is given by the vertical line, at the right of the histogram.
Conclusion
We have described only the most basic functions of the ade4 package, considering only the simplest one-table data analysis methods. Many other dudi methods are available in ade4, for example multiple correspondence analysis (dudi.acm), fuzzy correspondence analysis (dudi.fca), analysis of a mixture of numeric variables and factors (dudi.mix), non symmetric correspondence analysis (dudi.nsc), decentered correspondence analysis (dudi.dec).
We are preparing a second paper, dealing with two-tables coupling methods, among which canonical correspondence analysis and redundancy analysis are the most frequently used in ecology (Legendre and Legendre (1998)). The ade4 package proposes an alternative to these methods, based on the co-inertia criterion (Dray et al. (2003)).
The third category of data analysis methods available in ade4 are K-tables analysis methods, that try to extract the stable part in a series of tables. These methods come from the STATIS strategy, Lavit et al. (1994) (statis and pta functions), or from the multiple coinertia strategy (mcoa function). The mfa and foucart functions perform two variants of K-tables analysis, and the STATICO method (function ktab.match2ktabs, Thioulouse et al. (2004)) makes it possible to extract the stable part of species-environment relationships, in time or in space.
Methods taking into account spatial constraints(multispati function) and phylogenetic constraints(phylog function) are under development
Bibliography
Cailliez, F. (1983). The analytical solution of the additive constant problem. Psychometrika, 48:305-310.

Cazes, P., Chessel, D. and Dolédec, S. (1988). L'analyse des correspondances internes d'un tableau partitionné: son usage en hydrobiologie. Revue de Statistique Appliquée, 36:39-54.

Chevenet, F., Dolédec, S. and Chessel, D. (1994). A fuzzy coding approach for the analysis of long-term ecological data. Freshwater Biology, 31:295-309.

Dolédec, S., Chessel, D. and Olivier, J. (1995). L'analyse des correspondances décentrée: application aux peuplements ichtyologiques du haut-rhône. Bulletin Français de la Pêche et de la Pisciculture, 336:29-40.

Dray, S., Chessel, D. and Thioulouse, J. (2003). Co-inertia analysis and the linking of ecological tables. Ecology, 84:3078-3089.

Escoufier, Y. (1987). The duality diagramm: a means of better practical applications. In Legendre, P. and Legendre, L., editors, Development in numerical ecology, pages 139-156. NATO advanced Institute, Serie G. Springer Verlag, Berlin.

Gower, J. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53:325-338.

Gower, J. and Legendre, P. (1986). Metric and euclidean properties of dissimilarity coefficients. Journal of Classification, 3:5-48.

Greenacre, M. (1984). Theory and applications of correspondence analysis. Academic Press, London.

Hill, M. and Smith, A. (1976). Principal component analysis of taxonomic data with multi-state discrete characters. Taxon, 25:249-255.

Kiers, H. (1994). Simple structure in component analysis techniques for mixtures of qualitative and quantitative variables. Psychometrika, 56:197-212.

Kroonenberg, P. and Lombardo, R. (1999). Non-symmetric correspondence analysis: a tool for analysing contingency tables with a dependence structure. Multivariate Behavioral Research, 34:367-396.

Lavit, C., Escoufier, Y., Sabatier, R. and Traissac, P. (1994). The act (statis method). Computational Statistics and Data Analysis, 18:97-119.

Legendre, P. and Anderson, M. (1999). Distance-based redundancy analysis: testing multispecies responses in multifactorial ecological experiments. Ecological Monographs, 69:1-24.

Legendre, P. and Legendre, L. (1998). Numerical ecology. Elsevier Science BV, Amsterdam, 2nd english edition.

Lingoes, J. (1971). Some boundary conditions for a monotone analysis of symmetric matrices. Psychometrika, 36:195-203.

Manly, B. F. J. (1991). Randomization and Monte Carlo methods in biology. Chapman and Hall, London.

Perrière, G., Combet, C., Penel, S., Blanchet, C., Thioulouse, J., Geourjon, C., Grassot, J., Charavay, C., Gouy, M., Duret, L. and Deléage, G. (2003). Integrated databanks access and sequence/structure analysis services at the PBIL. Nucleic Acids Research, 31:3393-3399.

Perrière, G., Lobry, J. and Thioulouse, J. (1996). Correspondence discriminant analysis: a multivariate method for comparing classes of protein and nucleic acid sequences. Computer Applications in the Biosciences, 12:519-524.

Perrière, G. and Thioulouse, J. (2002). Use of correspondence discriminant analysis to predict the subcellular location of bacterial proteins. Computer Methods and Programs in Biomedicine, in press.

Tenenhaus, M. and Young, F. (1985). An analysis and synthesis of multiple correspondence analysis, optimal scaling, dual scaling, homogeneity analysis and other methods for quantifying categorical multivariate data. Psychometrika, 50:91-119.

Thioulouse, J., Chessel, D., Dolédec, S. and Olivier, J. (1997). Ade-4: a multivariate analysis and graphical display software. Statistics and Computing, 7:75-83.

Thioulouse, J., Simier, M. and Chessel, D. (2004). Simultaneous analysis of a sequence of paired ecological tables. Ecology, 85:272-283.

Thorpe, R. S. (1983a). A biometric study of the effects of growth on the analysis of geographic variation: Tooth number in green geckos (reptilia: Phelsuma). Journal of Zoology, 201:13-26.

Thorpe, R. S. (1983b). A review of the numerical methods for recognising and analysing racial differentiation. In Felsenstein, J., editor, Numerical Taxonomy, Series G, pages 404-423. Springer-Verlag, New York.

Thorpe, R. S. and Leamy, L. (1983). Morphometric studies in inbred house mice (mus sp.): Multivariate analysis of size and shape. Journal of Zoology, 199:421-432.
qcc: An R package for quality control charting and statistical process control
by Luca Scrucca
Introduction
The qcc package for the R statistical environment makes it possible to:

- plot Shewhart quality control charts for continuous, attribute and count data;

- plot Cusum and EWMA charts for continuous data;

- draw operating characteristic curves;

- perform process capability analyses;

- draw Pareto charts and cause-and-effect diagrams.
I started writing the package to provide the students in the class I teach a tool for learning the basic concepts of statistical quality control, at the introductory level of a book such as Montgomery (2000). Due to the intended final users, a free statistical environment was important, and R was a natural candidate.
The initial design, although written from scratch, reflected the set of functions available in S-Plus. Users of that environment will find some similarities, but also differences. In particular, during the development I added more tools and re-designed most of the library to meet the requirements I thought were important to satisfy.
Creating a qcc object
A qcc object is the basic building block to start with. This can be simply created invoking the qcc function, which in its simplest form requires two arguments: a data frame, a matrix or a vector containing the observed data, and a string value specifying the control chart to be computed. Standard Shewhart control charts are available. These are often classified according to the type of quality characteristic that they are supposed to monitor (see Table 1).
Control charts for continuous variables are usually based on several samples, with observations collected at different time points. Each sample or "rational group" must be provided as a row of a data frame or a matrix. The function qcc.groups can be used to easily group a vector of data values based on a sample indicator. Unequal sample sizes are allowed.
Suppose we have extracted samples for a characteristic of interest from an ongoing production process:
> data(pistonrings)
> attach(pistonrings)
> dim(pistonrings)
[1] 200 3
> pistonrings
    diameter sample trial
1     74.030      1  TRUE
2     74.002      1  TRUE
3     74.019      1  TRUE
4     73.992      1  TRUE
5     74.008      1  TRUE
6     73.995      2  TRUE
7     73.992      2  TRUE
...
199   74.000     40 FALSE
200   74.020     40 FALSE
> diameter <- qcc.groups(diameter, sample)
> dim(diameter)
[1] 40 5
> diameter
     [,1]   [,2]   [,3]   [,4]   [,5]
1  74.030 74.002 74.019 73.992 74.008
2  73.995 73.992 74.001 74.011 74.004
...
40 74.010 74.005 74.029 74.000 74.020
Using the first 25 samples as training data, an X chart can be obtained as follows:

> obj <- qcc(diameter[1:25,], type="xbar")
By default a Shewhart chart is drawn (see Figure 1), and an object of class 'qcc' is returned. Summary statistics can be retrieved using the summary method (or simply by printing the object):
> summary(obj)
Call:
qcc(data = diameter[1:25, ], type = "xbar")

xbar chart for diameter[1:25, ]

Summary of group statistics:
 Min. 1st Qu. Median  Mean 3rd Qu.  Max.
73.99   74.00  74.00 74.00   74.00 74.01

Group sample size: 5
Number of groups: 25
Center of group statistics: 74.00118
Standard deviation: 0.009887547

Control limits:
     LCL      UCL
73.98791 74.01444
type      Control charts for variables
xbar      X chart. Sample means are plotted to control the mean level of a continuous process variable.
xbar.one  X chart. Sample values from a one-at-time data process to control the mean level of a continuous process variable.
R         R chart. Sample ranges are plotted to control the variability of a continuous process variable.
S         S chart. Sample standard deviations are plotted to control the variability of a continuous process variable.

type      Control charts for attributes
p         p chart. The proportion of nonconforming units is plotted. Control limits are based on the binomial distribution.
np        np chart. The number of nonconforming units is plotted. Control limits are based on the binomial distribution.
c         c chart. The number of defectives per unit are plotted. This chart assumes that defects of the quality attribute are rare, and the control limits are computed based on the Poisson distribution.
u         u chart. The average number of defectives per unit is plotted. The Poisson distribution is used to compute control limits, but, unlike the c chart, this chart does not require a constant number of units.

Table 1: Shewhart control charts available in the qcc package.
The number of groups and their sizes are reported in this case, whereas a table is provided in the case of unequal sample sizes. Moreover, the center of group statistics (the overall mean for an X chart) and the within-group standard deviation of the process are returned.
The main goal of a control chart is to monitor a process. If special causes of variation are present, the process is said to be "out of control", and actions are to be taken to find, and possibly eliminate, such causes. A process is declared to be "in control" if all points charted lie randomly within the control limits. These are usually computed at +/- 3 sigmas from the center. This default can be changed using the argument nsigmas (so called "american practice") or specifying the confidence level (so called "british practice") through the confidence.level argument.
Assuming a Gaussian distribution is appropriate for the statistic charted, the two practices are equivalent, since limits at +/- 3 sigmas correspond to a two-tails probability of 0.0027 under a standard normal curve.
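The 0.0027 figure is easy to verify directly in R:

```r
# Two-tailed probability outside +/- 3 standard deviations
# under the standard normal distribution.
2 * pnorm(-3)
# [1] 0.002699796
```

So nsigmas = 3 and confidence.level = 1 - 0.0027 = 0.9973 describe the same limits for a Gaussian statistic.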
Instead of using standard calculations to compute the group statistics and the corresponding control limits (see Montgomery (2000), Wetherill and Brown (1991)), the center, the within-group standard deviation and the control limits may be specified directly by the user through the arguments center, std.dev and limits, respectively.
[xbar chart for diameter[1:25, ]: 25 groups, center = 74.00118, StdDev = 0.009887547, LCL = 73.98791, UCL = 74.01444; no points beyond limits, no violating runs.]
Figure 1: An X chart for samples data.
Shewhart control charts
A Shewhart chart is automatically plotted when an object of class 'qcc' is created, unless the qcc function is called with the argument plot=FALSE. The method responsible for plotting a control chart can, however, be invoked directly, as follows:

> plot(obj)

giving the plot in Figure 1. This control chart has the center line (horizontal solid line), the upper and lower control limits (dashed lines), and the sample group statistics are drawn as points connected with lines. Unless we provide the argument add.stats=FALSE, at the bottom of the plot some summary statistics are shown, together with the number of points beyond control limits and the number of violating runs (a run has length 5 by default).
Once a process is considered to be "in-control", we may use the computed limits for monitoring new data sampled from the same ongoing process. For example,

> obj <- qcc(diameter[1:25,], type="xbar",
+            newdata=diameter[26:40,])

plots the X chart for both calibration data and new data (see Figure 2), but all the statistics and the control limits are solely based on the first 25 samples.
[xbar chart for diameter[1:25, ] and diameter[26:40, ]: calibration data and new data on the same limits; 3 points beyond limits, 1 violating run.]
Figure 2: An X chart with calibration data and new data samples.
If only the new data need to be plotted, we may provide the argument chart.all=FALSE in the call to the qcc function, or through the plotting method:

> plot(obj, chart.all=FALSE)
Many other arguments may be provided in the call to the qcc function and to the corresponding plotting method, and we refer the reader to the on-line documentation for details (help(qcc)).
As reported on Table 1, an X chart for one-at-time data may be obtained specifying type="xbar.one". In this case, the data charted are simply the sample values, and the within-group standard deviation is estimated by moving ranges of k (by default 2) points.
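The moving-range estimate mentioned above can be sketched in a few lines of base R: the mean of the k = 2 moving ranges is divided by the bias-correction constant d2, which is approximately 1.128 for ranges of two points. This mirrors the textbook estimator, not the qcc internal code:

```r
# Estimate the process standard deviation from individual values
# using moving ranges of 2 consecutive points: sd.hat = mean(MR) / d2.
sd_moving_range <- function(x, d2 = 1.128) {
  mr <- abs(diff(x))  # moving ranges of consecutive points
  mean(mr) / d2
}

set.seed(123)
x <- rnorm(200, mean = 74, sd = 0.01)
sd_moving_range(x)  # close to the true value 0.01
```

Using moving ranges rather than the overall sample standard deviation keeps the estimate robust against slow drifts in the process mean.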
Control charts for attributes mainly differ from the previous examples in that we need to provide sample sizes through the size argument (except for the c chart). For example, a p chart can be obtained as follows:
gt data(orangejuice)
gt attach(orangejuice)
R News ISSN 1609-3631
Vol. 4/1, June 2004
> orangejuice
   sample  D size trial
1       1 12   50  TRUE
2       2 15   50  TRUE
3       3  8   50  TRUE
...
53     53  3   50 FALSE
54     54  5   50 FALSE
> obj2 <- qcc(D[trial], sizes=size[trial], type="p")
where we use the logical indicator variable trial to select the first 30 samples as calibration data.
Operating characteristic function
An operating characteristic (OC) curve provides information about the probability of not detecting a shift in the process. This is usually referred to as the type II error, that is, the probability of erroneously accepting a process as being "in control".
The OC curves can be easily obtained from an object of class 'qcc':
> par(mfrow=c(1,2))
> oc.curves(obj)
> oc.curves(obj2)

The function oc.curves invisibly returns a matrix or a vector of probabilities for the type II error. More arguments are available (see help(oc.curves)); in particular, identify=TRUE allows one to interactively identify values on the plot.
Figure 3: Operating characteristic curves for the X̄ chart (left panel, probability of type II error against process shift in std.dev units) and the p chart (right panel).
Cusum charts
Cusum charts display how the group summary statistics deviate above or below the process center or target value, relative to the standard errors of the summary statistics. They are useful for detecting small and permanent variation in the process mean.
The basic cusum chart implemented in the qcc package is only available for continuous variables at the moment. Assuming a 'qcc' object has been created, we can simply input the following code:

> cusum(obj)

and the resulting cusum chart is shown in Figure 4. This displays on a single plot both the positive deviations and the negative deviations from the target, which by default is set to the center of the group statistics. If a different target for the process is required, this value may be provided through the argument target when creating the object of class 'qcc'. Two further aspects may need to be controlled. The argument decision.int can be used to specify the number of standard errors of the summary statistics at which the cumulative sum signals an out-of-control state; by default it is set at 5. The amount of shift in the process to be detected by the cusum chart, measured in standard errors of the summary statistics, is controlled by the argument se.shift, by default set at 1.
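As a sketch of how these arguments fit together (the target value 74.002 is invented for illustration, and passing decision.int and se.shift directly to cusum() is an assumption based on the description above):

```r
# non-default target supplied when creating the 'qcc' object
> obj <- qcc(diameter[1:25,], type="xbar",
+            newdata=diameter[26:40,], target=74.002)
# tighter decision boundaries, tuned to detect a larger shift
> cusum(obj, decision.int=4, se.shift=1.5)
```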
Figure 4: Cusum chart.
EWMA charts
Exponentially weighted moving average (EWMA) charts smooth a series of data based on a moving average with weights which decay exponentially. Like the cusum chart, this plot can be used to detect small and permanent variation in the process mean.
The EWMA chart depends on the smoothing parameter λ, which controls the weighting scheme applied. By default it is set at 0.2, but it can be modified using the argument lambda. Assuming again that a 'qcc' object has been created, an EWMA chart can be obtained as follows:

> ewma(obj)

and the resulting graph is shown in Figure 5.
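Assuming, as the description above suggests, that lambda is accepted directly by ewma(), a different weighting scheme can be requested as, e.g.:

```r
# lambda closer to 1 puts heavier weight on recent observations
> ewma(obj, lambda=0.4)
```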
Figure 5: EWMA chart.
Process capability analysis
Process capability indices for a characteristic of interest from a continuous process can be obtained through the function process.capability. This takes as mandatory arguments an object of class 'qcc' for an "xbar" or "xbar.one" type, and spec.limits, a vector specifying the lower (LSL) and upper (USL) specification limits. For example, using the previously created 'qcc' object, we may simply use:
> process.capability(obj,
+        spec.limits=c(73.95, 74.05))

Process Capability Analysis

Call:
process.capability(object = obj,
                   spec.limits = c(73.95, 74.05))

Number of obs = 125     Target = 74
Center = 74.00118       LSL = 73.95
StdDev = 0.009887547    USL = 74.05

Capability indices:

       Value   2.5%  97.5%
Cp     1.686  1.476  1.895
Cp_l   1.725  1.539  1.912
Cp_u   1.646  1.467  1.825
Cp_k   1.646  1.433  1.859
Cpm    1.674  1.465  1.882

Exp<LSL 0%   Obs<LSL 0%
Exp>USL 0%   Obs>USL 0%
The graph shown in Figure 6 is a histogram of the data values, with vertical dotted lines for the upper and the lower specification limits. The target is shown as a vertical dashed line, and its value can be provided using the target argument in the call; if missing, the value from the 'qcc' object is used if available, otherwise the target is set at the middle value between the specification limits.
The dashed curve is a normal density for a Gaussian distribution matching the observed mean and standard deviation. The latter is estimated using the within-group standard deviation, unless a value is specified in the call using the std.dev argument.
The capability indices reported on the graph are also printed at the console, together with confidence intervals computed at the level specified by the confidence.level argument (by default set at 0.95).
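A sketch combining these optional arguments (the numeric values are invented for illustration):

```r
# explicit target, assumed process sd, and wider confidence intervals
> process.capability(obj,
+        spec.limits=c(73.95, 74.05),
+        target=74.01,
+        std.dev=0.01,
+        confidence.level=0.99)
```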
Figure 6: Process capability analysis.
A further display, called a "process capability sixpack plot", is also available. This is a graphical summary formed by plotting on the same graphical device:

• an X̄ chart

• an R or S chart (if sample sizes > 10)

• a run chart

• a histogram with specification limits

• a normal Q-Q plot

• a capability plot
As an example, we may input the following code:

> process.capability.sixpack(obj,
+        spec.limits=c(73.95, 74.05),
+        target=74.01)
Pareto charts and cause-and-effect diagrams
A Pareto chart is a barplot where the categories are ordered by frequency in non-increasing order, with a line added to show their cumulative sum. Pareto charts are useful to identify those factors that have the greatest cumulative effect on the system, and thus screen out the less significant factors in an analysis.
A simple Pareto chart may be drawn using the function pareto.chart, which in its simpler form requires a vector of values. For example:
> defect <- c(80, 27, 66, 94, 33)
> names(defect) <-
+   c("price code", "schedule date",
+     "supplier code", "contact num.",
+     "part num.")
> pareto.chart(defect)
The function returns a table of descriptive statistics and the Pareto chart shown in Figure 7. More arguments can be provided to control axis labels and limits, the main title, and bar colors (see help(pareto.chart)).
Figure 7: Pareto chart.
A cause-and-effect diagram, also called an Ishikawa diagram (or fishbone diagram), is used to associate multiple possible causes with a single effect. Thus, given a particular effect, the diagram is constructed to identify and organize possible causes for it.
A very basic implementation of this type of diagram is available in the qcc package. The function cause.and.effect requires at least two arguments: cause, a list of causes and branches providing descriptive labels, and effect, a string describing the effect. Other arguments are available, such as cex and font, which control, respectively, the character expansion and the font used for labeling branches, causes and the effect. The following example creates the diagram shown in Figure 8.
> cause.and.effect(
+   cause=list(Measurements=c("Microscopes",
+                             "Inspectors"),
+              Materials=c("Alloys",
+                          "Suppliers"),
+              Personnel=c("Supervisors",
+                          "Operators"),
+              Environment=c("Condensation",
+                            "Moisture"),
+              Methods=c("Brake", "Engager",
+                        "Angle"),
+              Machines=c("Speed", "Bits",
+                         "Sockets")),
+   effect="Surface Flaws")
Figure 8: Cause-and-effect diagram.
Tuning and extending the package
Options to control some aspects of the qcc package can be set using the function qcc.options. If this is called with no arguments it retrieves the list of available options and their current values; otherwise, it sets the required option. From a user point of view, the most interesting options allow one to set: the plotting character and the color used to highlight points beyond control limits or violating runs; the run length, i.e. the maximum length of a run before a point is signalled as out of control; the background color used to draw the figure and the margin color; and the character expansion used to draw the labels, the title and the statistics at the bottom of any plot. For details see help(qcc.options).
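A hypothetical session might look as follows; the option name run.length is an assumption, so consult the list returned by qcc.options() for the authoritative names:

```r
> qcc.options()                 # list all options and their current values
> qcc.options("run.length")     # query a single option
> qcc.options(run.length=7)     # signal a violating run only after 7 points
```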
Another interesting feature is the possibility of easily extending the package by defining new control charts. Suppose we want to plot a p chart based on samples with different sizes. The resulting control chart will have non-constant control limits, and thus may be difficult for some operators to read. One possible solution is to draw a standardized control chart. This can be accomplished by defining three new functions: one which computes and returns the group statistics to be charted (i.e. the z scores) and their center (zero in this case), another which returns the within-group standard deviation (one in this case), and finally a function which returns the control limits. These functions need to be called stats.type, sd.type and limits.type, respectively, for a new chart of type "type". The following code may be used to define a standardized p chart:
stats.p.std <- function(data, sizes)
{
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data)/sum(sizes)
  z <- (data/sizes - pbar)/sqrt(pbar*(1-pbar)/sizes)
  list(statistics = z, center = 0)
}

sd.p.std <- function(data, sizes) return(1)

limits.p.std <- function(center, std.dev, sizes, conf)
{
  if (conf >= 1)
  {
    lcl <- -conf
    ucl <- +conf
  }
  else
  {
    if (conf > 0 & conf < 1)
    {
      nsigmas <- qnorm(1 - (1 - conf)/2)
      lcl <- -nsigmas
      ucl <- +nsigmas
    }
    else stop("invalid conf argument")
  }
  limits <- matrix(c(lcl, ucl), ncol = 2)
  rownames(limits) <- rep("", length = nrow(limits))
  colnames(limits) <- c("LCL", "UCL")
  return(limits)
}
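Since these are ordinary R functions, they can be checked directly before qcc ever sees them; the definition of stats.p.std is repeated here so the snippet is self-contained. With the same observed proportion in every group, all standardized statistics should be exactly zero:

```r
stats.p.std <- function(data, sizes)
{
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data)/sum(sizes)                       # overall proportion
  z <- (data/sizes - pbar)/sqrt(pbar*(1-pbar)/sizes) # standardized scores
  list(statistics = z, center = 0)
}

# groups of different sizes but identical proportion 0.2
s <- stats.p.std(c(10, 20, 40), c(50, 100, 200))
s$statistics   # all zero: every group sits exactly at the center line
```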
Then we may source the above code and obtain the control charts in Figure 9 as follows:

# set unequal sample sizes
> n <- c(rep(50,5), rep(100,5), rep(25,5))
# generate randomly the number of successes
> x <- rbinom(length(n), n, 0.2)
> par(mfrow=c(1,2))
# plot the control chart with variable limits
> qcc(x, type="p", size=n)
# plot the standardized control chart
> qcc(x, type="p.std", size=n)
Figure 9: A p chart with non-constant control limits (left panel) and the corresponding standardized control chart (right panel).
Summary
In this paper we have briefly described the qcc package. This provides functions to draw basic Shewhart quality control charts for continuous, attribute and count data; corresponding operating characteristic curves are also implemented. Other statistical quality tools available are cusum and EWMA charts for continuous data, process capability analyses, Pareto charts and cause-and-effect diagrams.
References
Montgomery, D.C. (2000) Introduction to Statistical Quality Control, 4th ed. New York: John Wiley & Sons.

Wetherill, G.B. and Brown, D.W. (1991) Statistical Process Control. New York: Chapman & Hall.
Luca Scrucca
Dipartimento di Scienze Statistiche, Università degli Studi di Perugia, Italy
luca@stat.unipg.it
Least Squares Calculations in R
Timing different approaches
by Douglas Bates
Introduction
The S language encourages users to become programmers as they express new ideas and techniques of data analysis in S. Frequently the calculations in these techniques are expressed in terms of matrix operations. The subject of numerical linear algebra - how calculations with matrices can be carried out accurately and efficiently - is quite different from what many of us learn in linear algebra courses or in courses on linear statistical methods.
Numerical linear algebra is primarily based on decompositions of matrices, such as the LU decomposition, the QR decomposition, and the Cholesky decomposition, that are rarely discussed in linear algebra or statistics courses.
In this article we discuss one of the most common operations in statistical computing: calculating least squares estimates. We can write the problem mathematically as

    β̂ = arg min_β ||y − Xβ||²    (1)

where X is an n × p model matrix (p ≤ n), y is n-dimensional and β is p-dimensional. Most statistics texts state that the solution to (1) is

    β̂ = (X′X)⁻¹ X′y    (2)

when X has full column rank (i.e. the columns of X are linearly independent).
Indeed (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way. That is, they use code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.
General principles
A few general principles of numerical linear algebra are:

1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.

2. Use crossprod(X), not t(X) %*% X, to calculate X′X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X′y.

3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X′X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
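These rules are easy to verify on a small simulated example (the dimensions are chosen arbitrarily); all three approaches give the same coefficients:

```r
set.seed(1)
X <- matrix(rnorm(200 * 5), 200, 5)   # 200 x 5 model matrix
y <- rnorm(200)

A <- crossprod(X)        # X'X (rule 2)
b <- crossprod(X, y)     # X'y (rule 2)

beta1 <- solve(A) %*% b  # inverts A: works, but avoid this
beta2 <- solve(A, b)     # solves the system directly (rule 1)

# rule 3: exploit the symmetric positive-definite structure
R <- chol(A)             # A = R'R with R upper triangular
beta3 <- backsolve(R, forwardsolve(R, b, upper=TRUE, trans=TRUE))

all.equal(as.vector(beta1), as.vector(beta2))
all.equal(as.vector(beta2), as.vector(beta3))
```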
Some timings on a large example
For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well.
> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850  712
> sys.gctime(naive.sol <- solve(t(mmm) %*%
+     mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00
(The function sys.gctime is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)
According to the principles above it should bemore effective to compute
> sys.gctime(cpod.sol <- solve(crossprod(mmm),
+     crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE
Timing results will vary between computers, but in most cases the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X′X the naive way:
> sys.gctime(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00

because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.
However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result the solution of (1) is performed using a general linear system solver based on an LU decomposition, when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly, but the code is awkward:
> sys.gctime(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sys.gctime(chol.sol <- backsolve(ch,
+     forwardsolve(ch, crossprod(mmm, y),
+                  upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE
Least squares calculations with Matrix classes

The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition.
> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sys.gctime(Mat.sol <- solve(crossprod(mmg),
+     crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE
Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object so that it can be reused without recalculation.
> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sys.gctime(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sys.gctime(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all:
> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sys.gctime(sparse.sol <- solve(crossprod(mm),
+     crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE
The model matrix mm has class "cscMatrix", indicating that it is a compressed, sparse, column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric, sparse, column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful, the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case the decomposition is so fast that it is difficult to determine the difference in the solution times.
> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sys.gctime(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sys.gctime(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
Epilogue
Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS), such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or Atlas (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.
Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.
Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A′B are generally faster than those in AB.
Notice that this means that calculating A′B in R as t(A) %*% B, instead of crossprod(A, B), causes A to be transposed twice: once in the calculation of t(A) and a second time in the inner loop of the matrix product. The crossprod function does not do any transposition of matrices - yet another reason to use crossprod.
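The equivalence of the two forms, and hence the fact that the choice is purely one of speed, can be checked directly in base R:

```r
set.seed(2)
A <- matrix(rnorm(300 * 100), 300, 100)
B <- matrix(rnorm(300 * 50), 300, 50)

# same mathematical result, but t(A) is never formed by crossprod
all.equal(t(A) %*% B, crossprod(A, B))
```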
Bibliography
J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).
D.M. Bates
U. of Wisconsin-Madison
bates@wisc.edu
Tools for interactively exploring R packages
by Jianhua Zhang and Robert Gentleman
Introduction
An R package has the required, and perhaps additional, subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations, both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer and they are intended for exploring packages, examples, and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.
Description
Our current implementation is in the form of tcltk widgets, and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcltk. Once both tkWidgets and widgetTools have been loaded into the working session, the user has full access to these interactive tools.
pExplorer
pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from showing by the widget. The default behavior is to open the first package in the library of locally installed R (.libPaths()), with subdirectories/files named "Meta" and "latex" excluded, if nothing is provided by a user.
Figure 1 shows the widget invoked by typing pExplorer("base").
The following tasks can be performed through the interface:
• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

• Clicking the Browse button allows users to select a directory containing R packages, achieving the same results as typing in a new path.

• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.
Figure 1: A snapshot of the widget when pExplorer is called.
• Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, to be discussed next.
eExplorer
eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.
Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.
vExplorer
vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionalities provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of the vignettes of the package.
Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget, as shown by Figure 4, appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:
• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the pdf version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in the Results of Execution text box.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.

The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.
References

Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28-31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21-24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.
Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base").
Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().
Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.
The survival package
by Thomas Lumley

This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).
Overview
Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.
In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.
Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).
Specifying survival data
In the popular 'counting process' formulation of survival analysis, each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone, it may be omitted.
In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.
> library(survival)
> data(veteran)  ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+

## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
        diagtime*30 + time, status))
 [1] ( 210, 282 ]  ( 150, 561 ]
 [3] (  90, 318 ]  ( 270, 396 ]
 [9] ( 540, 854 ]  ( 180, 280+]
The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.
One and two sample summaries
The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.
Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.
> data(veteran)
> plot(survfit(Surv(time, status) ~ trt, data=veteran),
       xlab="Years since randomisation", xscale=365,
       ylab="% surviving", yscale=100,
       col=c("forestgreen", "blue"))
> survdiff(Surv(time, status) ~ trt, data=veteran)

Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

       N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394
      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928
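The Gρ family mentioned above is selected through the rho argument to survdiff; for example, rho=1 weights early events more heavily (the Peto-Peto modification of the Gehan-Wilcoxon test):

```r
# same comparison, but with the G^1 weighted logrank test
> survdiff(Surv(time, status) ~ trt, data=veteran, rho=1)
```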
Figure 1: Survival distributions for two lung cancer treatments.
Proportional hazards models
The mainstay of survival analysis in the medicalworld is the Cox proportional hazards model andits extensions This expresses the hazard (or rate)of events as an unspecified baseline hazard functionmultiplied by a function of the predictor variables
Writing h(t z) for the hazard at time t with pre-dictor variables Z = z the Cox model specifies
log h(t z) = log h0(t)eβz
Somewhat unusually for a semiparametric modelthere is very little loss of efficiency by leaving h0(t)unspecified and computation is if anything easierthan for parametric models
A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.
> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
       log(bili) + log(protime) + age + platelet,
       data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
    log(bili) + log(protime) + age + platelet,
    data = pbc, subset = trt > 0)

                 coef exp(coef)  se(coef)     z       p
edtrt         1.02980     2.800  0.300321  3.43 0.00061
log(bili)     0.95100     2.588  0.097771  9.73 0.00000
log(protime)  2.88544    17.911  1.031908  2.80 0.00520
age           0.03544     1.036  0.008489  4.18 0.00003
platelet     -0.00128     0.999  0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312
The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.
> plot(survfit(Surv(time, status) ~ edtrt,
       data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt +
       ratetable(edtrt = edtrt, bili = bili,
           platelet = platelet, age = age,
           protime = protime),
       data = pbc, subset = trt == -9,
       ratetable = mayomodel, cohort = TRUE),
     col = "purple")
The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.
Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.
Figure 2: Observed and predicted survival.
The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The
cox.zph function provides formal tests and graphical diagnostics for proportional hazards.
> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

# graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])
[Graph: smoothed estimate of Beta(t) for edtrt against time.]
Figure 3: Testing proportional hazards.
There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.
Additional features
Computation for the Cox model extends straightforwardly to situations where z can vary over time, and when an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.
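As a concrete illustration, a call combining these two terms might look like the following sketch (the data set and variable names here are invented, not taken from the text):

```r
## Hypothetical sketch: recurrent events per subject, with a robust
## (cluster) variance and a separate baseline hazard for each centre.
library(survival)
fit <- coxph(Surv(start, stop, status) ~ treatment +
                 cluster(id) + strata(centre),
             data = mydata)   # mydata is an invented data frame
```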
Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm, with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
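For comparison, a parametric fit to the veteran data used earlier might be sketched as follows (the choice of a Weibull accelerated-failure-time model here is illustrative):

```r
library(survival)
data(veteran)
## Weibull accelerated-failure-time model for the same treatment
## comparison as in the survfit/survdiff example above
survreg(Surv(time, status) ~ trt, data = veteran,
        dist = "weibull")
```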
More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.
The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of follow-up.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
useR! 2004
The R User Conference
by John Fox
McMaster University, Canada
More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis and David Meyer served as chairs of the conference, program committee and local organizing committee; Bettina Grün was in charge of local arrangements.
The conference program included nearly 100 presentations, many in keynote, plenary and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.
A particularly interesting component of the program was the series of keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:
• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice
• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on data mining and related topics
Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.
An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.
The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19, and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.
Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.
R Help Desk
Date and Time Classes in R
Gabor Grothendieck and Thomas Petzoldt
Introduction
R-1.9.0 and its contributed packages have a number of datetime (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:
• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).
• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).
• POSIX classes. POSIX classes refer to the two classes POSIXct and POSIXlt and their common super class POSIXt. These support times and dates, including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases and other applications that involve external communication with systems that also support dates, times, time zones and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970, GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of
the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIX classes can be found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
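The internal representations described above can be inspected directly; a minimal sketch (assuming the chron package is installed):

```r
library(chron)
d  <- as.Date("1970-01-02")            # Date: days since 1970-01-01
dc <- chron("01/02/70", "12:00:00")    # chron: days + fraction of day
dp <- as.POSIXct("1970-01-02 12:00:00", tz = "GMT")  # seconds
unclass(d)    # 1
unclass(dc)   # 1.5 (plus attributes)
unclass(dp)   # 129600 seconds since 1970-01-01 00:00:00 GMT
```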
When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.
Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.
Other Applications
Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron respectively. It's possible to set Excel to use either origin, which is why the word "usually" was employed above.
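For instance, the Windows-origin recipe above can be sketched with invented serial numbers (x is not from any real spreadsheet):

```r
x <- c(1, 36526.5)                  # invented Excel serial day numbers
as.Date("1899-12-30") + floor(x)    # Date class dates
library(chron)
chron("12/30/1899") + x             # chron datetimes
```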
SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).
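If Hmisc is not available, the SAS case can also be handled by hand, since the origin is known; a sketch with invented values:

```r
sas_secs <- c(0, 86400)                      # invented SAS datetime values
as.POSIXct("1960-01-01", tz = "GMT") + sas_secs
```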
Time Series
A common use of dates and datetimes is in time series. The ts class represents regular time series (i.e. equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its) and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or time/date package to represent datetimes.
Input
By default read.table will read in numeric data, such as 20040115, as numbers and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹
# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y%m%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)
Avoiding Errors
The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.
With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:
# default origin
options(chron.origin = c(month=1, day=1, year=1970))
# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)
# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)
# if TRUE then year not displayed
options(chron.simplify = FALSE)
For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig+x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,
¹This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.
R News ISSN 1609-3631
Vol 41 June 2004 31
orig <- chron("01/01/60")
x <- 0:9  # days since 01/01/60
# chron dates with default origin
orig + x
Regarding POSIX classes, the user should be aware of the following:
• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:
dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)
This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.
• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general, caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.
• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.
• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
Comparison Table
Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.
Bibliography
Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.
James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.
Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2):8-11. URL http://CRAN.R-project.org/doc/Rnews/.
Gabor Grothendieck
ggrothendieck@myway.com
Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de
R News ISSN 1609-3631
now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class="Date")
  chron:   chron(0)
  POSIXct: structure(0, class=c("POSIXt","POSIXct"))

x days since origin
  Date:    structure(floor(x), class="Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class=c("POSIXt","POSIXct"))  # GMT

diff in days
  Date:    dt1-dt2
  chron:   dc1-dc2
  POSIXct: difftime(dp1, dp2, unit="day")  [1]

time zone difference
  POSIXct: dp-as.POSIXct(format(dp, tz="GMT"))

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt+1
  chron:   dc+1
  POSIXct: seq(dp0, length=2, by="DSTday")[2]  [2]

previous day
  Date:    dt-1
  chron:   dc-1
  POSIXct: seq(dp0, length=2, by="-1 DSTday")[2]  [3]

x days since date
  Date:    dt0+floor(x)
  chron:   dc0+x
  POSIXct: seq(dp0, length=2, by=paste(x, "DSTday"))[2]  [4]

sequence
  Date:    dts <- seq(dt0, length=10, by="day")
  chron:   dcs <- seq(dc0, length=10)
  POSIXct: dps <- seq(dp0, length=10, by="DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz="GMT"), rnorm(10))  [5]

every 2nd week
  Date:    seq(dt0, length=3, by="2 week")
  chron:   dc0+seq(0, length=3, by=14)
  POSIXct: seq(dp0, length=3, by="2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc)-month.day.year(dc)$day+1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun=0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc)-3)%%7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output: yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output: mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output: Sat Jul 1, 1970
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input: "1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format="y-m-d")
  POSIXct: as.POSIXct(z)

Input: "10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz="")
  to Date:    as.Date(dates(dc));  as.Date(format(dp))
  to chron:   chron(unclass(dt));  chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S"))  [6]
  to POSIXct: as.POSIXct(format(dt));  as.POSIXct(paste(as.Date(dates(dc)), times(dc)%%1))

Conversion (tz="GMT")
  to Date:    as.Date(dates(dc));  as.Date(dp)
  to chron:   chron(unclass(dt));  as.chron(dp)
  to POSIXct: as.POSIXct(format(dt), tz="GMT");  as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables and variables beginning with dp are POSIXct variables; variables ending in 0 are scalars, others are vectors. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for % codes.

[1] Can be reduced to dp1-dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp+24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp-24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp+24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz="GMT")).

Table 1: Comparison Table
Programmers' Niche
A simple class in S3 and S4
by Thomas Lumley
ROC curves
Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.
The ROC curve plots the true positive rate (sensitivity), Pr(T > c|D = 1), against the false positive rate (1−specificity), Pr(T > c|D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).
I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.
Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±∞ and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.
drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
      function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
      function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type="l")
}
It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.
There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.
drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  plot(mspec, sens, type="l")
}
One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
Classes and methods
Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.
The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec, test=TT,
               call=sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}
The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.
plot.ROC <- function(x, type="b", null.line=TRUE,
       xlab="1-Specificity", ylab="Sensitivity",
       main=NULL, ...) {
  par(pty="s")
  plot(x$mspec, x$sens, type=type,
       xlab=xlab, ylab=ylab, ...)
  if(null.line)
    abline(0, 1, lty=3)
  if(is.null(main))
    main <- x$call
  title(main=main)
}
A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:
lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}
A method for identify allows the cutpoint to be found for any point on the curve:
identify.ROC <- function(x, labels=NULL, ...,
                         digits=1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels=labels, ...)
}
An AUC function to compute the area under theROC curve is left as an exercise for the reader
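One possible sketch of such a function applies the trapezoidal rule to the stored coordinates (the name AUC and the padding of the endpoints are choices made here, not taken from the text):

```r
AUC <- function(x) {
  mspec <- c(0, x$mspec, 1)   # pad to the (0,0) and (1,1) corners
  sens  <- c(0, x$sens, 1)
  sum(diff(mspec) * (head(sens, -1) + tail(sens, -1))/2)
}
```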
Generalising the class
The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.
ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
               test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}
The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.
ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type="response"))
  disease <- disease*weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously
             far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
               test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}
S4 classes and methods
In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).
As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.
The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:
setClass("ROC",
  representation(sens="numeric", mspec="numeric",
                 test="numeric", call="call"),
  validity=function(object) {
    length(object@sens)==length(object@mspec) &&
    length(object@sens)==length(object@test)
  })
The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
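The validity function can be seen in action: new() runs it, so an object with mismatched slot lengths is rejected. A small sketch (wrapped in try() so the error is displayed rather than stopping the session):

```r
## sens has 2 elements but mspec has 3, so validity fails
try(new("ROC", sens = c(0, 1), mspec = c(0, 0.5, 1),
        test = c(1, 2), call = quote(ROC())))
```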
In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new
function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  new("ROC", sens=sens, mspec=mspec,
      test=TT, call=sys.call())
}
Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:
setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })
The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.
The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC","missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.
setMethod("plot", c("ROC","missing"),
  function(x, y, type="b", null.line=TRUE,
           xlab="1-Specificity",
           ylab="Sensitivity",
           main=NULL, ...) {
    par(pty="s")
    plot(x@mspec, x@sens, type=type,
         xlab=xlab, ylab=ylab, ...)
    if(null.line)
      abline(0, 1, lty=3)
    if(is.null(main))
      main <- x@call
    title(main=main)
  })
Creating a method for lines appears to work the same way:
setMethod("lines", "ROC",
  function(x, ...) {
    lines(x@mspec, x@sens, ...)
  })
In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before
> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>
and after the setMethod call
> lines
standardGeneric for "lines" defined from
package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x
shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.
setGeneric("ROC")

setMethod("ROC", c("glm","missing"),
  function(T) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type="response"))
    disease <- disease*weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far
               from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    new("ROC", sens=sens, mspec=mspec,
        test=TT, call=sys.call())
  })
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.
setGeneric("identify", signature=c("x"))

setMethod("identify", "ROC",
  function(x, labels=NULL, ...,
           digits=1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels=labels, ...)
  })
Discussion
Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.
This example has not used inheritance where theS3 and S4 systems differ more dramatically Judgingfrom the available examples of S4 classes inheritanceseems most useful in defining data structures ratherthan objects representing statistical calculations Thismay be because inheritance extends a class by creat-ing a special case but statisticians more often extenda class by creating a more general case Reusing codefrom say linear models in creating generalised lin-ear models is more an example of delegation than in-heritance It is not that a generalised linear modelis a linear model more that it has a linear model(from the last iteration of iteratively reweighted leastsquares) associated with it
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R
by the R Core Team

New features in 1.9.1
• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc look for variables in their formulae.
User-visible changes in 1.9.0
• Underscore "_" is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook (see ?setHook).

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.
New features in 1.9.0
• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.
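As a brief illustration of the new environment operators (a sketch of the semantics described above):

```r
e <- new.env()
e$x <- 1          # like assign("x", 1, envir = e)
e[["y"]] <- 2     # character indices only; no partial matching
e$x + e[["y"]]    # like get(..., envir = e, inherits = FALSE); gives 3
```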
• There are now print() and "[" methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.
• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.
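For example (a small sketch of mget() on a scratch environment):

```r
e <- new.env()
assign("a", 1, envir = e)
assign("b", 2, envir = e)
mget(c("a", "b"), envir = e)   # a named list: list(a = 1, b = 2)
```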
• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma, ...) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for S compatibility.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).
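The three rep() compatibility changes above can be seen in a few one-liners (a sketch of the new behaviour):

```r
rep(integer(0), length.out = 3)      # NA NA NA: length 3, not length 0
rep(1:2, each = 2, length.out = 5)   # 1 1 2 2 1: recycles, no NA padding
rep(1:2, times = 3, length.out = 3)  # 1 2 1: times is ignored
```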
• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces; see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.
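A typical use of shQuote() is protecting file names containing spaces or shell metacharacters (a sketch; the file name is hypothetical):

```r
f <- "my results.txt"
shQuote(f)                          # single-quoted for a Unix-alike shell
cmd <- paste("wc -l", shQuote(f))   # now safe to pass to system(cmd)
```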
• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split="" is optimized.

• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode called from plot.dendrogram). The "diamond" frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.
• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport(), and seekViewport() to allow creation and navigation of a viewport tree (rather than just a viewport stack).

– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.

– Added vpList(), vpStack(), vpTree(), and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of a viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

– Added just argument to grid.layout() to allow justification of the layout relative to the parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).
– Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

pushViewport(viewport(w=.5, h=.5, name="A"))
grid.rect()
pushViewport(viewport(w=.5, h=.5, name="B"))
grid.rect(gp=gpar(col="grey"))
upViewport(2)
grid.rect(vp="A", gp=gpar(fill="red"))
grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))
– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

* the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

In addition, the following features are now possible with grobs:

* grobs now save()/load() like any normal R object.

* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

– Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth", and "grobheight" unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for the viewport clip arg are now:

"on" = clip to the viewport (was TRUE)
"inherit" = clip to whatever parent says (was FALSE)
"off" = turn off clipping

Still accept logical values (and NA maps to "off").
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories, as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector), rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t, rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(*, deriv=2) and psigamma(*, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blasgoto= to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN
by Kurt Hornik
New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked, or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.
DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.
debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. S original (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis; fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. The package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between versus within chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal and external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and the backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.
R News ISSN 1609-3631
Vol. 4/1, June 2004, page 45
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova and Tony Rossini

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers

rrcov Functions for robust location and scatter estimation and robust regression with high breakdown point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley and Achim Zeileis

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe algorithm using the method described in Schoenfeld, D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972-974. By David A. Schoenfeld

setRNG Set reproducible random number generator in R and S. By Paul Gilbert

sfsmisc Useful utilities ('goodies') from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford

spc Evaluation of control charts by means of the zero-state and steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha

supclust Methodology for supervised grouping of predictor variables. By Marcel Dettling and Martin Maechler

svmpath Computes the entire regularization path for the two-class SVM classifier, with essentially the same cost as a single SVM fit. By Trevor Hastie

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree rings. By Marco Bascietto

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff

urn Functions for sampling without replacement (simulated urns). By Micah Altman

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns, ported to R by Nick Efthymiou

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis and Gabor Grothendieck
Other changes
• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News
by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)
New supporting institutions
• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board: Douglas Bates and Paul Murrell

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing. Communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions go to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage: http://www.R-project.org/

This newsletter is available online at http://CRAN.R-project.org/doc/Rnews/
- Editorial
- The Decision To Use R
-
- Preface
- Some Background on the Company
- Evaluating The Tools of the Trade - A Value Based Equation
- An Introduction To R
- A Continued Evolution
- Giving Back
- Thanks
-
- The ade4 package - I One-table methods
-
- Introduction
- The duality diagram class
- Distance matrices
- Taking into account groups of individuals
- Permutation tests
- Conclusion
-
- qcc An R package for quality control charting and statistical process control
-
- Introduction
- Creating a qcc object
- Shewhart control charts
- Operating characteristic function
- Cusum charts
- EWMA charts
- Process capability analysis
- Pareto charts and cause-and-effect diagrams
- Tuning and extending the package
- Summary
- References
-
- Least Squares Calculations in R
- Tools for interactively exploring R packages
-
- References
-
- The survival package
-
- Overview
- Specifying survival data
- One and two sample summaries
- Proportional hazards models
- Additional features
-
- useR 2004
- R Help Desk
-
- Introduction
- Other Applications
- Time Series
- Input
- Avoiding Errors
- Comparison Table
-
- Programmer's Niche
-
- ROC curves
- Classes and methods
-
- Generalising the class
-
- S4 classes and method
- Discussion
-
- Changes in R
-
- New features in 1.9.1
- User-visible changes in 1.9.0
- New features in 1.9.0
-
- Changes on CRAN
-
- New contributed packages
- Other changes
-
- R Foundation News
-
- Donations
- New supporting institutions
- New supporting members
-
> pca1
Duality diagramm
class: pca dudi
$call: dudi.pca(df = USArrests, scannf = FALSE, nf = 3)
$nf: 3 axis-components saved
$rank: 4
eigen values: 2.48 0.9898 0.3566 0.1734
  vector length mode    content
1 $cw    4      numeric column weights
2 $lw    50     numeric row weights
3 $eig   4      numeric eigen values

  data.frame nrow ncol content
1 $tab       50   4    modified array
2 $li        50   3    row coordinates
3 $l1        50   3    row normed scores
4 $co        4    3    column coordinates
5 $c1        4    3    column normed scores
other elements: cent norm
pca1$lw and pca1$cw are the row and column weights that define the duality diagram, together with the data table (pca1$tab). pca1$eig contains the eigenvalues. The row and column coordinates are stored in pca1$li and pca1$co. The variance of these coordinates is equal to the corresponding eigenvalue, and unit variance coordinates are stored in pca1$l1 and pca1$c1 (this is useful to draw biplots).
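The coordinate/eigenvalue property just stated can be checked with a short base-R sketch (it does not use ade4; X, S, li and v are names introduced here):

```r
# Base-R check (not ade4): in a normed PCA, the variance of the row
# coordinates on each axis equals the corresponding eigenvalue.
X  <- scale(USArrests)               # centred, unit-variance columns
S  <- crossprod(X) / (nrow(X) - 1)   # correlation matrix
e  <- eigen(S, symmetric = TRUE)
li <- X %*% e$vectors                # row coordinates on the 4 axes
v  <- colSums(li^2) / (nrow(X) - 1)  # variance of each coordinate
stopifnot(all.equal(as.numeric(v), e$values))
round(e$values, 4)  # close to the eigenvalues printed by pca1: 2.48 0.9898 0.3566 0.1734
```

(ade4 uses uniform row weights 1/n rather than 1/(n-1), so its coordinates are scaled slightly differently, but the variance-equals-eigenvalue property is the same.)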
The general optimization theorems of data analysis take particular meanings for each type of analysis, and graphical functions are proposed to draw the canonical graphs, i.e., the graphical expression corresponding to the mathematical property of the object. For example, the normed PCA of a quantitative variable table gives a score that maximizes the sum of squared correlations with variables. The PCA canonical graph is therefore a graph showing how the sum of squared correlations is maximized for the variables of the data set. On the USArrests example, we obtain the following graphs:
> score(pca1)
> s.corcircle(pca1$co)
(Figure: four panels, one per variable (Murder, Assault, UrbanPop, Rape), each showing the variable values plotted against the row scores, with scores ranging from -2 to 2.)
Figure 1: One dimensional canonical graph for a normed PCA. Variables are displayed as a function of row scores, to get a picture of the maximization of the sum of squared correlations.
Figure 2: Two dimensional canonical graph for a normed PCA (correlation circle): the direction and length of arrows show the quality of the correlation between variables, and between variables and principal components.
The scatter function draws the biplot of the PCA (i.e., a graph with both rows and columns superimposed):
> scatter(pca1)
(Figure: PCA biplot, with the 50 state labels superimposed on arrows for Murder, Assault, UrbanPop and Rape; grid size d = 1.)
Figure 3: The PCA biplot. Variables are symbolized by arrows, and they are superimposed on the individuals display. The scale of the graph is given by a grid, whose size is given in the upper right corner; here the length of the side of grid squares is equal to one. The eigenvalues bar chart is drawn in the upper left corner, with the two black bars corresponding to the two axes used to draw the biplot. Grey bars correspond to axes that were kept in the analysis, but not used to draw the graph.
Separate factor maps can be drawn with the s.corcircle (see Figure 2) and s.label functions:
> s.label(pca1$li)
(Figure: factor map of the 50 states, labelled by state name; grid size d = 1.)
Figure 4: Individuals factor map in a normed PCA.
Distance matrices
A duality diagram can also come from a distance matrix, if this matrix is Euclidean (i.e., if the distances in the matrix are the distances between some points in a Euclidean space). The ade4 package contains functions to compute dissimilarity matrices (dist.binary for binary data, and dist.prop for frequency data), test whether they are Euclidean (Gower and Legendre (1986)), and make them Euclidean (quasieuclid; lingoes, Lingoes (1971); cailliez, Cailliez (1983)). These functions are useful to ecologists who use the works of Legendre and Anderson (1999) and Legendre and Legendre (1998).
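The criterion behind this Euclidean test can be sketched in base R (a simplified stand-in for ade4's functions; is_euclidean is a hypothetical helper, not part of ade4):

```r
# Base-R sketch of Gower's criterion (ade4's quasieuclid, lingoes and
# cailliez go further and repair non-Euclidean matrices): a distance
# matrix d is Euclidean iff B = -(1/2) J D2 J is positive
# semi-definite, where D2 holds the squared distances and
# J = I - 11'/n centres rows and columns.
is_euclidean <- function(d) {
  D2 <- as.matrix(d)^2
  n  <- nrow(D2)
  J  <- diag(n) - matrix(1 / n, n, n)
  B  <- -0.5 * J %*% D2 %*% J
  all(eigen(B, symmetric = TRUE, only.values = TRUE)$values > -1e-8)
}
is_euclidean(dist(USArrests))  # TRUE: distances between real points
d3 <- as.dist(matrix(c(0, 1, 10, 1, 0, 1, 10, 1, 0), 3, 3))
is_euclidean(d3)               # FALSE: d(1,3) = 10 violates the triangle inequality
```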
The Yanomama data set (Manly (1991)) contains three distance matrices between 19 villages of Yanomama Indians. The dudi.pco function can be used to compute a principal coordinates analysis (PCO, Gower (1966)), which gives a Euclidean representation of the 19 villages. This Euclidean representation allows the comparison of the geographical, genetic and anthropometric distances.
> data(yanomama)
> gen <- quasieuclid(as.dist(yanomama$gen))
> geo <- quasieuclid(as.dist(yanomama$geo))
> ant <- quasieuclid(as.dist(yanomama$ant))
> geo1 <- dudi.pco(geo, scann = FALSE, nf = 3)
> gen1 <- dudi.pco(gen, scann = FALSE, nf = 3)
> ant1 <- dudi.pco(ant, scann = FALSE, nf = 3)
> par(mfrow = c(2, 2))
> scatter(geo1)
> scatter(gen1)
> scatter(ant1, posi = "bottom")
(Figure: three PCO factor maps of the 19 villages, numbered 1-19; grid sizes d = 100, d = 20 and d = 100.)
Figure 5: Comparison of the PCO analysis of the three distance matrices of the Yanomama data set: geographic, genetic and anthropometric distances.
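The same principal coordinates idea is available in base R through cmdscale, which can serve as a cross-check of this kind of analysis (a sketch under the assumption that the distances really come from points in a low-dimensional Euclidean space):

```r
# Base-R cross-check of principal coordinates via cmdscale: distances
# computed from points in R^4 (the 4 USArrests variables) are
# reproduced exactly by 4 principal coordinates.
d  <- dist(USArrests)
xy <- cmdscale(d, k = 4)
all.equal(as.vector(dist(xy)), as.vector(d))  # TRUE
```

(dudi.pco adds row weights and the ade4 graphics on top of this basic construction.)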
Taking into account groups of individuals
In sites x species tables, rows correspond to sites, columns correspond to species, and the values are the number of individuals of species j found at site i. These tables can have many columns and cannot be used in a discriminant analysis. In this case, between-class analyses (between function) are a better alternative, and they can be used with any duality diagram. The between-class analysis of triplet (X, Q, D) for a given factor f is the analysis of the triplet (G, Q, Dw), where G is the table of the means of table X for the groups defined by f, and Dw is the diagonal matrix of group weights. For example, a between-class correspondence analysis (BCA) is very simply obtained after a correspondence analysis (CA):
> data(meaudret)
> coa1 <- dudi.coa(meaudret$fau, scannf = FALSE)
> bet1 <- between(coa1, meaudret$plan$sta,
+                scannf = FALSE)
> plot(bet1)
The meaudret$fau dataframe is an ecological table with 24 rows corresponding to six sampling sites along a small French stream (the Meaudret). These six sampling sites were sampled four times (spring, summer, winter and autumn), hence the 24 rows. The 13 columns correspond to 13 ephemeroptera species. The CA of this data table is done with the dudi.coa function, giving the coa1 duality diagram. The corresponding between-class analysis is done with the between function, considering the sites as classes (meaudret$plan$sta is a factor defining the classes). Therefore, this is a between-sites analysis, whose aim is to discriminate the sites, given the distribution of ephemeroptera species. This gives the bet1 duality diagram, and Figure 6 shows the graph obtained by plotting this object.
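The table G described above can be sketched in base R on made-up data standing in for meaudret (fau, sta and G are names introduced here, not ade4 objects):

```r
# Base-R sketch: the table G analysed by a between-class analysis is
# simply the table of class means of X for the groups defined by f.
set.seed(1)
fau <- matrix(rpois(24 * 13, lambda = 5), nrow = 24)  # 24 samples x 13 species (made up)
sta <- factor(rep(paste0("S", 1:6), times = 4))       # 6 sites, each sampled 4 times
G   <- rowsum(fau, sta) / as.vector(table(sta))       # 6 x 13 table of site means
dim(G)  # 6 13
```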
(Figure: composed plot with panels for the species canonical weights and variables (Eda, Bsp, Brh, Bni, Bpu, Cen, Ecd, Rhi, Hla, Hab, Par, Cae, Eig), the eigenvalues bar chart, the scores and classes, the inertia axes (Axis1, Axis2), and the classes S1-S5.)
Figure 6: BCA plot. This is a composed plot, made of: 1- the species canonical weights (top left), 2- the species scores (middle left), 3- the eigenvalues bar chart (bottom left), 4- the plot of plain CA axes projected into BCA (bottom center), 5- the gravity centers of classes (bottom right), 6- the projection of the rows with ellipses and gravity center of classes (main graph).
Like between-class analyses, linear discriminant analysis (discrimin function) can be extended to any duality diagram (G, (Xt D X)^-, Dw), where (Xt D X)^- is a generalized inverse. This gives, for example, a correspondence discriminant analysis (Perrière et al. (1996), Perrière and Thioulouse (2002)), which can be computed with the discrimin.coa function.

Opposite to between-class analyses are within-class analyses, corresponding to diagrams (X - Y Dw^-1 Yt D X, Q, D) (within functions). These analyses extend to any type of variables the Multiple Group Principal Component Analysis (MGPCA, Thorpe (1983a), Thorpe (1983b), Thorpe and Leamy (1983)). Furthermore, the within.coa function introduces the double within-class correspondence analysis (also named internal correspondence analysis, Cazes et al. (1988)).
Permutation tests
Permutation tests (also called Monte-Carlo tests, or randomization tests) can be used to assess the statistical significance of between-class analyses. Many permutation tests are available in the ade4 package, for example mantel.randtest, procuste.randtest, randtest.between, randtest.coinertia, RV.rtest and randtest.discrimin, and several of these tests are available both in the R (mantel.rtest) and in the C (mantel.randtest) programming language. The R
version allows one to see how computations are performed, and to easily write other tests, while the C version is needed for performance reasons.
The statistical significance of the BCA can be evaluated with the randtest.between function. By default, 999 permutations are simulated, and the resulting object (test1) can be plotted (Figure 7). The p-value is highly significant, which confirms the existence of differences between sampling sites. The plot shows that the observed value is very far to the right of the histogram of simulated values.
> test1 <- randtest.between(bet1)
> test1
Monte-Carlo test
Observation: 0.4292
Call: randtest.between(xtest = bet1)
Based on 999 replicates
Simulated p-value: 0.001
> plot(test1, main = "Between class inertia")
(Figure: histogram titled "Between class inertia" of the simulated values, with the x axis running from 0.1 to 0.5 and frequency on the y axis.)
Figure 7: Histogram of the 999 simulated values of the randomization test of the bet1 BCA. The observed value is given by the vertical line, at the right of the histogram.
Conclusion
We have described only the most basic functions of the ade4 package, considering only the simplest one-table data analysis methods. Many other dudi methods are available in ade4, for example multiple correspondence analysis (dudi.acm), fuzzy correspondence analysis (dudi.fca), analysis of a mixture of numeric variables and factors (dudi.mix), non-symmetric correspondence analysis (dudi.nsc), and decentered correspondence analysis (dudi.dec).

We are preparing a second paper, dealing with two-tables coupling methods, among which canonical correspondence analysis and redundancy analysis are the most frequently used in ecology (Legendre and Legendre (1998)). The ade4 package proposes an alternative to these methods, based on the co-inertia criterion (Dray et al. (2003)).

The third category of data analysis methods available in ade4 are K-tables analysis methods, which try to extract the stable part in a series of tables. These methods come from the STATIS strategy, Lavit et al. (1994) (statis and pta functions), or from the multiple coinertia strategy (mcoa function). The mfa and foucart functions perform two variants of K-tables analysis, and the STATICO method (function ktab.match2ktabs, Thioulouse et al. (2004)) allows one to extract the stable part of species-environment relationships, in time or in space.

Methods taking into account spatial constraints (multispati function) and phylogenetic constraints (phylog function) are under development.
Bibliography
Cailliez, F. (1983). The analytical solution of the additive constant problem. Psychometrika, 48:305-310.

Cazes, P., Chessel, D. and Dolédec, S. (1988). L'analyse des correspondances internes d'un tableau partitionné: son usage en hydrobiologie. Revue de Statistique Appliquée, 36:39-54.

Chevenet, F., Dolédec, S. and Chessel, D. (1994). A fuzzy coding approach for the analysis of long-term ecological data. Freshwater Biology, 31:295-309.

Dolédec, S., Chessel, D. and Olivier, J. (1995). L'analyse des correspondances décentrée: application aux peuplements ichtyologiques du haut-rhône. Bulletin Français de la Pêche et de la Pisciculture, 336:29-40.

Dray, S., Chessel, D. and Thioulouse, J. (2003). Co-inertia analysis and the linking of ecological tables. Ecology, 84:3078-3089.

Escoufier, Y. (1987). The duality diagramm: a means of better practical applications. In Legendre, P. and Legendre, L., editors, Development in numerical ecology, pages 139-156. NATO advanced Institute, Serie G. Springer Verlag, Berlin.

Gower, J. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53:325-338.

Gower, J. and Legendre, P. (1986). Metric and euclidean properties of dissimilarity coefficients. Journal of Classification, 3:5-48.

Greenacre, M. (1984). Theory and applications of correspondence analysis. Academic Press, London.

Hill, M. and Smith, A. (1976). Principal component analysis of taxonomic data with multi-state discrete characters. Taxon, 25:249-255.

Kiers, H. (1994). Simple structure in component analysis techniques for mixtures of qualitative and quantitative variables. Psychometrika, 56:197-212.

Kroonenberg, P. and Lombardo, R. (1999). Non-symmetric correspondence analysis: a tool for analysing contingency tables with a dependence structure. Multivariate Behavioral Research, 34:367-396.

Lavit, C., Escoufier, Y., Sabatier, R. and Traissac, P. (1994). The ACT (STATIS method). Computational Statistics and Data Analysis, 18:97-119.

Legendre, P. and Anderson, M. (1999). Distance-based redundancy analysis: testing multispecies responses in multifactorial ecological experiments. Ecological Monographs, 69:1-24.

Legendre, P. and Legendre, L. (1998). Numerical ecology. Elsevier Science BV, Amsterdam, 2nd english edition.

Lingoes, J. (1971). Some boundary conditions for a monotone analysis of symmetric matrices. Psychometrika, 36:195-203.

Manly, B. F. J. (1991). Randomization and Monte Carlo methods in biology. Chapman and Hall, London.

Perrière, G., Combet, C., Penel, S., Blanchet, C., Thioulouse, J., Geourjon, C., Grassot, J., Charavay, C., Gouy, M., Duret, L. and Deléage, G. (2003). Integrated databanks access and sequence/structure analysis services at the PBIL. Nucleic Acids Research, 31:3393-3399.

Perrière, G., Lobry, J. and Thioulouse, J. (1996). Correspondence discriminant analysis: a multivariate method for comparing classes of protein and nucleic acid sequences. Computer Applications in the Biosciences, 12:519-524.

Perrière, G. and Thioulouse, J. (2002). Use of correspondence discriminant analysis to predict the subcellular location of bacterial proteins. Computer Methods and Programs in Biomedicine, in press.

Tenenhaus, M. and Young, F. (1985). An analysis and synthesis of multiple correspondence analysis, optimal scaling, dual scaling, homogeneity analysis and other methods for quantifying categorical multivariate data. Psychometrika, 50:91-119.

Thioulouse, J., Chessel, D., Dolédec, S. and Olivier, J. (1997). ADE-4: a multivariate analysis and graphical display software. Statistics and Computing, 7:75-83.

Thioulouse, J., Simier, M. and Chessel, D. (2004). Simultaneous analysis of a sequence of paired ecological tables. Ecology, 85:272-283.

Thorpe, R. S. (1983a). A biometric study of the effects of growth on the analysis of geographic variation: tooth number in green geckos (Reptilia: Phelsuma). Journal of Zoology, 201:13-26.

Thorpe, R. S. (1983b). A review of the numerical methods for recognising and analysing racial differentiation. In Felsenstein, J., editor, Numerical Taxonomy, Series G, pages 404-423. Springer-Verlag, New York.

Thorpe, R. S. and Leamy, L. (1983). Morphometric studies in inbred house mice (Mus sp.): multivariate analysis of size and shape. Journal of Zoology, 199:421-432.
qcc: An R package for quality control charting and statistical process control
by Luca Scrucca
Introduction
The qcc package for the R statistical environment allows the user to:

• plot Shewhart quality control charts for continuous, attribute and count data;

• plot Cusum and EWMA charts for continuous data;

• draw operating characteristic curves;

• perform process capability analyses;

• draw Pareto charts and cause-and-effect diagrams.
I started writing the package to provide the students in the class I teach a tool for learning the basic concepts of statistical quality control, at the introductory level of a book such as Montgomery (2000). Given the intended final users, a free statistical environment was important, and R was a natural candidate.

The initial design, albeit written from scratch, reflected the set of functions available in S-Plus. Users of that environment will find some similarities, but also differences. In particular, during development I added more tools and re-designed most of the library to meet the requirements I thought were important to satisfy.
Creating a qcc object
A qcc object is the basic building block to start with. It can be created by invoking the qcc function, which in its simplest form requires two arguments: a data frame, a matrix or a vector containing the observed data, and a string value specifying the control chart to be computed. Standard Shewhart control charts are available. These are often classified according to the type of quality characteristic that they are supposed to monitor (see Table 1).

Control charts for continuous variables are usually based on several samples, with observations collected at different time points. Each sample or "rational group" must be provided as a row of a data frame or a matrix. The function qcc.groups can be used to easily group a vector of data values based on a sample indicator. Unequal sample sizes are allowed.
Suppose we have extracted samples for a charac-teristic of interest from an ongoing production pro-cess
> data(pistonrings)
> attach(pistonrings)
> dim(pistonrings)
[1] 200 3
> pistonrings
    diameter sample trial
1     74.030      1  TRUE
2     74.002      1  TRUE
3     74.019      1  TRUE
4     73.992      1  TRUE
5     74.008      1  TRUE
6     73.995      2  TRUE
7     73.992      2  TRUE
...
199   74.000     40 FALSE
200   74.020     40 FALSE
> diameter <- qcc.groups(diameter, sample)
> dim(diameter)
[1] 40 5
> diameter
     [,1]   [,2]   [,3]   [,4]   [,5]
1  74.030 74.002 74.019 73.992 74.008
2  73.995 73.992 74.001 74.011 74.004
...
40 74.010 74.005 74.029 74.000 74.020
Using the first 25 samples as training data, an X chart can be obtained as follows:

> obj <- qcc(diameter[1:25,], type="xbar")

By default a Shewhart chart is drawn (see Figure 1), and an object of class 'qcc' is returned. Summary statistics can be retrieved using the summary method (or simply by printing the object):
> summary(obj)
Call:
qcc(data = diameter[1:25, ], type = "xbar")

xbar chart for diameter[1:25, ]

Summary of group statistics:
 Min. 1st Qu. Median  Mean 3rd Qu.  Max.
73.99   74.00  74.00 74.00   74.00 74.01

Group sample size: 5
Number of groups: 25
Center of group statistics: 74.00118
Standard deviation: 0.009887547

Control limits:
     LCL      UCL
73.98791 74.01444
Control charts for variables:

xbar: X-bar chart. Sample means are plotted to control the mean level of a continuous process variable.

xbar.one: X chart. Sample values from a one-at-a-time data process, to control the mean level of a continuous process variable.

R: R chart. Sample ranges are plotted to control the variability of a continuous process variable.

S: S chart. Sample standard deviations are plotted to control the variability of a continuous process variable.

Control charts for attributes:

p: p chart. The proportion of nonconforming units is plotted. Control limits are based on the binomial distribution.

np: np chart. The number of nonconforming units is plotted. Control limits are based on the binomial distribution.

c: c chart. The number of defectives per unit is plotted. This chart assumes that defects of the quality attribute are rare, and the control limits are computed based on the Poisson distribution.

u: u chart. The average number of defectives per unit is plotted. The Poisson distribution is used to compute control limits, but, unlike the c chart, this chart does not require a constant number of units.
Table 1: Shewhart control charts available in the qcc package.
The number of groups and their sizes are reported in this case, whereas a table is provided in the case of unequal sample sizes. Moreover, the center of group statistics (the overall mean for an X chart) and the within-group standard deviation of the process are returned.

The main goal of a control chart is to monitor a process. If special causes of variation are present, the process is said to be "out of control", and actions are to be taken to find, and possibly eliminate, such causes. A process is declared to be "in control" if all points charted lie randomly within the control limits. These are usually computed at ±3σ from the center. This default can be changed using the argument nsigmas (so-called "american practice") or by specifying the confidence level (so-called "british practice") through the confidence.level argument.

Assuming a Gaussian distribution is appropriate for the statistic charted, the two practices are equivalent, since limits at ±3σ correspond to a two-tails probability of 0.0027 under a standard normal curve.

Instead of using standard calculations to compute the group statistics and the corresponding control limits (see Montgomery (2000), Wetherill and Brown (1991)), the center, the within-group standard deviation and the control limits may be specified directly by the user, through the arguments center, std.dev and limits, respectively.
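The ±3σ limits just described can be reproduced by hand in base R (xbar_limits is a made-up helper; the numeric inputs are the center and standard deviation reported in the summary above):

```r
# Base-R sketch of the "american practice" limits for an X-bar chart:
# center +/- nsigmas * sigma / sqrt(n).
xbar_limits <- function(center, std_dev, n, nsigmas = 3) {
  se <- std_dev / sqrt(n)
  c(LCL = center - nsigmas * se, UCL = center + nsigmas * se)
}
xbar_limits(74.00118, 0.009887547, n = 5)  # matches the LCL/UCL in the summary
2 * pnorm(-3)  # two-tail probability outside 3-sigma limits: about 0.0027
```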
(Figure: xbar chart for diameter[1:25, ], showing the group summary statistics for groups 1-25, the center line, and dashed LCL/UCL lines. Number of groups = 25, Center = 74.00118, StdDev = 0.009887547, LCL = 73.98791, UCL = 74.01444, Number beyond limits = 0, Number violating runs = 0.)
Figure 1: An X-bar chart for samples data.
Shewhart control charts
A Shewhart chart is automatically plotted when an object of class 'qcc' is created, unless the qcc function is called with the argument plot=FALSE. The method responsible for plotting a control chart can, however, be invoked directly, as follows:

> plot(obj)
giving the plot in Figure 1. This control chart has the center line (horizontal solid line), the upper and lower control limits (dashed lines), and the sample group statistics drawn as points connected with lines. Unless we provide the argument add.stats=FALSE, some summary statistics are shown at the bottom of the plot, together with the number of points beyond control limits and the number of violating runs (a run has length 5 by default).
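A rough base-R sketch of such a run rule (qcc's exact bookkeeping may differ; run_violations is a made-up helper):

```r
# Flag points that complete a run of `len` consecutive values on the
# same side of the center line.
run_violations <- function(x, center, len = 5) {
  side   <- sign(x - center)
  runlen <- integer(length(x))
  for (i in seq_along(x)) {
    runlen[i] <- if (i > 1 && side[i] == side[i - 1]) runlen[i - 1] + 1 else 1
  }
  which(runlen >= len)
}
run_violations(c(1, 2, 3, 4, 5, 6, -1), center = 0)  # points 5 and 6
```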
Once a process is considered to be "in control", we may use the computed limits for monitoring new data sampled from the same ongoing process. For example,

> obj <- qcc(diameter[1:25,], type="xbar",
+            newdata=diameter[26:40,])
plots the X chart for both calibration data and new data (see Figure 2), but all the statistics and the control limits are solely based on the first 25 samples.
(Figure: xbar chart for diameter[1:25, ] and diameter[26:40, ], with calibration data in diameter[1:25, ] and new data in diameter[26:40, ]. Number of groups = 40, Center = 74.00118, StdDev = 0.009887547, LCL = 73.98791, UCL = 74.01444, Number beyond limits = 3, Number violating runs = 1.)
Figure 2: An X-bar chart with calibration data and new data samples.
If only the new data need to be plotted we mayprovide the argument chartall=FALSE in the callto the qcc function or through the plotting method
> plot(obj, chart.all=FALSE)
Many other arguments may be provided in the call to the qcc function and to the corresponding plotting method, and we refer the reader to the on-line documentation for details (help(qcc)).
As reported in Table 1, an X chart for one-at-a-time data may be obtained by specifying type="xbar.one". In this case the data charted are simply the sample values, and the within-group standard deviation is estimated by moving ranges of k (by default 2) points.
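The moving-range estimator mentioned here is easy to sketch in base R (a generic illustration with made-up data, not the qcc internals): with k = 2, the average of the absolute differences between consecutive points is divided by the constant d2 = 1.128 for subgroups of size two.

```r
x  <- c(33.75, 33.05, 34.00, 33.81, 33.46)  # hypothetical one-at-a-time data
mr <- abs(diff(x))             # moving ranges of k = 2 consecutive points
sd.est <- mean(mr) / 1.128     # d2 constant for subgroups of size 2
sd.est
```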
Control charts for attributes mainly differ from the previous examples in that we need to provide sample sizes through the size argument (except for the c chart). For example, a p chart can be obtained as follows:
> data(orangejuice)
> attach(orangejuice)
R News ISSN 1609-3631
Vol. 4/1, June 2004
> orangejuice
   sample  D size trial
1       1 12   50  TRUE
2       2 15   50  TRUE
3       3  8   50  TRUE
...
53     53  3   50 FALSE
54     54  5   50 FALSE
> obj2 <- qcc(D[trial], sizes=size[trial], type="p")
where we use the logical indicator variable trial to select the first 30 samples as calibration data.
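For intuition, the p chart limits come from the binomial standard error; a minimal hand computation (a sketch, not qcc code) using just the first three samples shown above, whereas pbar would normally be estimated from all 30 calibration samples:

```r
D     <- c(12, 15, 8)        # nonconforming cans in the first 3 samples
sizes <- c(50, 50, 50)
pbar  <- sum(D) / sum(sizes)             # overall fraction nonconforming
se    <- sqrt(pbar * (1 - pbar) / sizes) # binomial standard error per group
lcl   <- pmax(0, pbar - 3 * se)          # limits vary when sizes vary
ucl   <- pbar + 3 * se
```

With unequal sample sizes, se differs from group to group, which is exactly why such a chart has non-constant limits (the problem revisited in the final section).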
Operating characteristic function
An operating characteristic (OC) curve provides information about the probability of not detecting a shift in the process. This is usually referred to as the type II error, that is, the probability of erroneously accepting a process as being "in control".
The OC curves can be easily obtained from an object of class 'qcc':
> par(mfrow=c(1,2))
> oc.curves(obj)
> oc.curves(obj2)
The function oc.curves invisibly returns a matrix or a vector of probability values for the type II error. More arguments are available (see help(oc.curves)); in particular, identify=TRUE allows the user to interactively identify values on the plot.
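For the X-bar chart the type II error has a closed form under normality; the following sketch (independent of qcc, assuming standard 3-sigma limits) computes the quantity plotted in the left panel of Figure 3:

```r
# beta = P(group mean stays inside the 3-sigma limits | true shift),
# with the shift measured in units of the process standard deviation
beta <- function(shift, n, nsigmas = 3)
  pnorm(nsigmas - shift * sqrt(n)) - pnorm(-nsigmas - shift * sqrt(n))

beta(0, 5)   # no shift: the usual in-control coverage, about 0.9973
beta(1, 5)   # chance of missing a 1-sd shift on the next group of 5
```

Larger shifts and larger group sizes both drive beta toward zero, which is the pattern visible in the plotted curves.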
[Figure: left panel, OC curves for the xbar chart (probability of type II error
versus process shift in std.dev units, for n = 1, 5, 10, 15, 20); right panel,
OC curves for the p chart (probability of type II error versus p).]

Figure 3: Operating characteristic curves.
Cusum charts
Cusum charts display how the group summary statistics deviate above or below the process center or target value, relative to the standard errors of the summary statistics. They are useful for detecting small and permanent variation in the process mean.
The basic cusum chart implemented in the qcc package is only available for continuous variables at the moment. Assuming a 'qcc' object has been created, we can simply input the following code:
> cusum(obj)
and the resulting cusum chart is shown in Figure 4. This displays on a single plot both the positive deviations and the negative deviations from the target, which by default is set to the center of the group statistics. If a different target for the process is required, this value may be provided through the argument target when creating the object of class 'qcc'. Two further aspects may need to be controlled. The argument decision.int can be used to specify the number of standard errors of the summary statistics at which the cumulative sum signals an out-of-control condition; by default it is set at 5. The amount of shift in the process to be detected by the cusum chart, measured in standard errors of the summary statistics, is controlled by the argument se.shift, by default set at 1.
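The cumulative sums themselves follow a simple recursion; here is a generic tabular-CUSUM sketch (an illustration, not the qcc implementation) with reference value k, conventionally half the shift to be detected in standard-error units:

```r
# Upper/lower cumulative sums of standardized statistics z, each sum
# reset to 0 whenever it would fall below 0 (generic tabular CUSUM)
tabular.cusum <- function(z, k = 0.5) {
  cp <- cn <- numeric(length(z))
  for (i in seq_along(z)) {
    cp[i] <- max(0,  z[i] - k + if (i > 1) cp[i - 1] else 0)
    cn[i] <- max(0, -z[i] - k + if (i > 1) cn[i - 1] else 0)
  }
  list(pos = cp, neg = cn)
}
tabular.cusum(c(1, 1, 1, -2))
```

An alarm would be raised when either sum crosses the decision boundary (5 standard errors in the chart above).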
[Figure: cusum chart for diameter[1:25,] and diameter[26:40,]. Calibration data in
diameter[1:25,], new data in diameter[26:40,]. Number of groups = 40,
Target = 74.00118, StdDev = 0.009887547, decision boundaries (std. err.) = 5,
shift detection (std. err.) = 1, no. of points beyond boundaries = 4.]

Figure 4: Cusum chart.
EWMA charts
Exponentially weighted moving average (EWMA) charts smooth a series of data based on a moving average with weights which decay exponentially. As with the cusum chart, this plot can be used to detect small and permanent variation in the process mean.
The EWMA chart depends on the smoothing parameter λ, which controls the weighting scheme applied. By default it is set at 0.2, but it can be modified using the argument lambda. Assuming again that a 'qcc' object has been created, an EWMA chart can be obtained as follows:
> ewma(obj)
and the resulting graph is shown in Figure 5.
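The smoothing is the recursion z[i] = lambda * x[i] + (1 - lambda) * z[i-1]; a minimal generic sketch (not qcc's internal code, which starts the recursion at the target value):

```r
# Exponentially weighted moving average of a series x; 'start' is the
# value the recursion begins from (here, simply the first observation)
ewma.smooth <- function(x, lambda = 0.2, start = x[1]) {
  z <- numeric(length(x))
  prev <- start
  for (i in seq_along(x)) {
    prev <- lambda * x[i] + (1 - lambda) * prev  # decaying weights
    z[i] <- prev
  }
  z
}
ewma.smooth(c(10, 10, 20), lambda = 0.5, start = 10)  # 10, 10, 15
```

Small lambda gives a long memory (good for small shifts); lambda = 1 reproduces the raw data.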
R News ISSN 1609-3631
Vol 41 June 2004 15
[Figure: EWMA chart for diameter[1:25,] and diameter[26:40,]. Calibration data in
diameter[1:25,], new data in diameter[26:40,]. Number of groups = 40,
Target = 74.00118, StdDev = 0.009887547, smoothing parameter = 0.2,
control limits at 3*sigma.]

Figure 5: EWMA chart.
Process capability analysis
Process capability indices for a characteristic of interest from a continuous process can be obtained through the function process.capability. This takes as mandatory arguments an object of class 'qcc' for an "xbar" or "xbar.one" type, and spec.limits, a vector specifying the lower (LSL) and upper (USL) specification limits. For example, using the previously created 'qcc' object, we may simply use:
> process.capability(obj,
+    spec.limits=c(73.95, 74.05))

Process Capability Analysis

Call:
process.capability(object = obj,
                   spec.limits = c(73.95, 74.05))

Number of obs = 125       Target = 74
Center = 74.00118         LSL = 73.95
StdDev = 0.009887547      USL = 74.05

Capability indices:
       Value   2.5%  97.5%
Cp     1.686  1.476  1.895
Cp_l   1.725  1.539  1.912
Cp_u   1.646  1.467  1.825
Cp_k   1.646  1.433  1.859
Cpm    1.674  1.465  1.882

Exp<LSL 0%    Obs<LSL 0%
Exp>USL 0%    Obs>USL 0%
The graph shown in Figure 6 is a histogram of the data values, with vertical dotted lines for the upper and lower specification limits. The target is shown as a vertical dashed line, and its value can be provided using the target argument in the call; if missing, the value from the 'qcc' object is used if available, otherwise the target is set at the middle value between the specification limits.
The dashed curve is a normal density for a Gaussian distribution matching the observed mean and standard deviation. The latter is estimated using the within-group standard deviation, unless a value is specified in the call using the std.dev argument.
The capability indices reported on the graph are also printed on the console, together with confidence intervals computed at the level specified by the confidence.level argument (by default set at 0.95).
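These indices follow the textbook formulas; recomputing them from the numbers printed above reproduces the Cp, Cp_l, Cp_u and Cp_k values (Cpm additionally involves the target and is omitted from this sketch):

```r
LSL <- 73.95; USL <- 74.05           # specification limits from the call
center  <- 74.00118                  # values printed by process.capability
std.dev <- 0.009887547
Cp   <- (USL - LSL) / (6 * std.dev)      # potential capability
Cp.l <- (center - LSL) / (3 * std.dev)   # one-sided, lower
Cp.u <- (USL - center) / (3 * std.dev)   # one-sided, upper
Cp.k <- min(Cp.l, Cp.u)                  # actual capability
round(c(Cp, Cp.l, Cp.u, Cp.k), 3)        # 1.686 1.725 1.646 1.646
```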
[Figure: process capability analysis for diameter[1:25,]: histogram of the data
with LSL, Target and USL marked. Number of obs = 125, Center = 74.00118,
StdDev = 0.009887547, Target = 74, LSL = 73.95, USL = 74.05; Cp = 1.69,
Cp_l = 1.73, Cp_u = 1.65, Cp_k = 1.65, Cpm = 1.67; Exp<LSL 0%, Exp>USL 0%,
Obs<LSL 0%, Obs>USL 0%.]

Figure 6: Process capability analysis.
A further display, called a "process capability sixpack plot", is also available. This is a graphical summary formed by plotting on the same graphical device:

• an X chart
• an R or S chart (if sample sizes > 10)
• a run chart
• a histogram with specification limits
• a normal Q-Q plot
• a capability plot
As an example, we may input the following code:

> process.capability.sixpack(obj,
+    spec.limits=c(73.95, 74.05),
+    target=74.01)
Pareto charts and cause-and-effect diagrams
A Pareto chart is a barplot where the categories are ordered by frequency in non-increasing order, with a line added to show their cumulative sum. Pareto charts are useful to identify those factors that have the greatest cumulative effect on the system,
and thus screen out the less significant factors in an analysis.
A simple Pareto chart may be drawn using the function pareto.chart, which in its simpler form requires a vector of values. For example:
> defect <- c(80, 27, 66, 94, 33)
> names(defect) <-
+   c("price code", "schedule date",
+     "supplier code", "contact num.",
+     "part num.")
> pareto.chart(defect)
The function returns a table of descriptive statistics and the Pareto chart shown in Figure 7. More arguments can be provided to control axes labels and limits, the main title and bar colors (see help(pareto.chart)).
[Figure: Pareto chart for defect: bars in decreasing order for contact num.,
price code, supplier code, part num. and schedule date, with frequency (0-300)
on the left axis and cumulative percentage (0-100%) on the right axis.]

Figure 7: Pareto chart.
A cause-and-effect diagram, also called an Ishikawa diagram (or fishbone diagram), is used to associate multiple possible causes with a single effect. Thus, given a particular effect, the diagram is constructed to identify and organize possible causes for it.
A very basic implementation of this type of diagram is available in the qcc package. The function cause.and.effect requires at least two arguments: cause, a list of causes and branches providing descriptive labels, and effect, a string describing the effect. Other arguments are available, such as cex and font, which control, respectively, the character expansion and the font used for labeling branches, causes and the effect. The following example creates the diagram shown in Figure 8.
> cause.and.effect(
+   cause=list(Measurements=c("Microscopes",
+                             "Inspectors"),
+              Materials=c("Alloys",
+                          "Suppliers"),
+              Personnel=c("Supervisors",
+                          "Operators"),
+              Environment=c("Condensation",
+                            "Moisture"),
+              Methods=c("Brake", "Engager",
+                        "Angle"),
+              Machines=c("Speed", "Bits",
+                         "Sockets")),
+   effect="Surface Flaws")
[Figure: cause-and-effect diagram for the effect "Surface Flaws", with branches
Measurements (Microscopes, Inspectors), Materials (Alloys, Suppliers), Personnel
(Supervisors, Operators), Environment (Condensation, Moisture), Methods (Brake,
Engager, Angle) and Machines (Speed, Bits, Sockets).]

Figure 8: Cause-and-effect diagram.
Tuning and extending the package
Options to control some aspects of the qcc package can be set using the function qcc.options. If this is called with no arguments it retrieves the list of available options and their current values; otherwise it sets the required option. From a user's point of view, the most interesting options allow the user to set: the plotting character and the color used to highlight points beyond control limits or violating runs; the run length, i.e. the maximum length of a run before a point is signalled as out of control; the background color used to draw the figure and the margin color; and the character expansion used to draw the labels, the title and the statistics at the bottom of any plot. For details see help(qcc.options).
Another interesting feature is the possibility of easily extending the package by defining new control charts. Suppose we want to plot a p chart based on samples with different sizes. The resulting control chart will have non-constant control limits and thus may be difficult for some operators to read. One possible solution is to draw a standardized control chart. This can be accomplished by defining three new functions: one which computes and returns the group statistics to be charted (i.e. the z scores) and their center (zero in this case), another which returns the within-group standard deviation (one in this case), and finally a function which returns the control limits. These functions need to be
called stats.type, sd.type and limits.type, respectively, for a new chart of type "type". The following code may be used to define a standardized p chart:
stats.p.std <- function(data, sizes)
{
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data)/sum(sizes)
  z <- (data/sizes - pbar)/sqrt(pbar*(1-pbar)/sizes)
  list(statistics = z, center = 0)
}

sd.p.std <- function(data, sizes) return(1)

limits.p.std <- function(center, std.dev, sizes, conf)
{
  if (conf >= 1)
    { lcl <- -conf
      ucl <- +conf }
  else
    { if (conf > 0 & conf < 1)
        { nsigmas <- qnorm(1 - (1 - conf)/2)
          lcl <- -nsigmas
          ucl <- +nsigmas }
      else stop("invalid 'conf' argument.") }
  limits <- matrix(c(lcl, ucl), ncol = 2)
  rownames(limits) <- rep("", length = nrow(limits))
  colnames(limits) <- c("LCL", "UCL")
  return(limits)
}
Then we may source the above code and obtain the control charts in Figure 9 as follows:
> # set unequal sample sizes
> n <- c(rep(50,5), rep(100,5), rep(25,5))
> # generate randomly the number of successes
> x <- rbinom(length(n), n, 0.2)
> par(mfrow=c(1,2))
> # plot the control chart with variable limits
> qcc(x, type="p", size=n)
> # plot the standardized control chart
> qcc(x, type="p.std", size=n)
[Figure: left panel, p chart for x with variable LCL and UCL: number of
groups = 15, center = 0.1931429, StdDev = 0.3947641, number beyond limits = 0,
number violating runs = 0; right panel, the corresponding p.std chart:
center = 0, StdDev = 1, LCL = -3, UCL = 3, number beyond limits = 0,
number violating runs = 0.]

Figure 9: A p chart with non-constant control limits (left panel) and the
corresponding standardized control chart (right panel).
Summary
In this paper we briefly described the qcc package. This provides functions to draw basic Shewhart quality control charts for continuous, attribute and count data; corresponding operating characteristic curves are also implemented. Other statistical quality tools available are Cusum and EWMA charts for continuous data, process capability analyses, Pareto charts and cause-and-effect diagrams.
References
Montgomery, D.C. (2000) Introduction to Statistical Quality Control, 4th ed. New York: John Wiley & Sons.

Wetherill, G.B. and Brown, D.W. (1991) Statistical Process Control. New York: Chapman & Hall.
Luca Scrucca
Dipartimento di Scienze Statistiche, Università degli Studi di Perugia, Italy
luca@stat.unipg.it
Least Squares Calculations in R
Timing different approaches
by Douglas Bates
Introduction
The S language encourages users to become programmers as they express new ideas and techniques of data analysis in S. Frequently the calculations in these techniques are expressed in terms of matrix operations. The subject of numerical linear algebra, how calculations with matrices can be carried out accurately and efficiently, is quite different from what many of us learn in linear algebra courses or in courses on linear statistical methods.

Numerical linear algebra is primarily based on decompositions of matrices, such as the LU decomposition, the QR decomposition, and the Cholesky decomposition, that are rarely discussed in linear algebra or statistics courses.
In this article we discuss one of the most common operations in statistical computing: calculating least squares estimates. We can write the problem mathematically as

    β̂ = argmin_β ‖y − Xβ‖²    (1)

where X is an n × p model matrix (p ≤ n), y is n-dimensional and β is p-dimensional. Most statistics texts state that the solution to (1) is

    β̂ = (X′X)⁻¹ X′y    (2)

when X has full column rank (i.e. the columns of X are linearly independent).
Indeed, (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way, that is, with code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets, then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.
General principles
A few general principles of numerical linear algebra are:

1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.

2. Use crossprod(X), not t(X) %*% X, to calculate X′X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X′y.

3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X′X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
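A tiny sketch of principle 1 on made-up data: the two forms agree numerically, but solve(A, b) never forms the inverse.

```r
set.seed(1)
A <- crossprod(matrix(rnorm(9), 3, 3))  # random 3x3 symmetric p.d. matrix
b <- rnorm(3)
x1 <- solve(A) %*% b   # inverts A first: slower and less accurate
x2 <- solve(A, b)      # solves the system directly
all.equal(as.vector(x1), x2)  # TRUE
```

On a 3 by 3 system the difference is invisible; the timings below show how much it matters at realistic sizes.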
Some timings on a large example
For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well:
> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850  712
> sys.gctime(naive.sol <- solve(t(mmm) %*%
+     mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00
(The function sys.gctime is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)
According to the principles above, it should be more effective to compute:
> sys.gctime(cpod.sol <- solve(crossprod(mmm),
+     crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE
Timing results will vary between computers but, in most cases, the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X′X the naive way:
> sys.gctime(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00
because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.
However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result, the solution of (1) is performed using a general linear system solver based on an LU decomposition, when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly but the code is awkward:
> sys.gctime(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sys.gctime(chol.sol <- backsolve(ch,
+     forwardsolve(ch, crossprod(mmm, y),
+                  upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE
Least squares calculations with Matrix classes
The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition:
> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sys.gctime(Mat.sol <- solve(crossprod(mmg),
+     crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE
Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object so that it can be reused without recalculation:
> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sys.gctime(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sys.gctime(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all:
> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sys.gctime(sparse.sol <- solve(crossprod(mm),
+     crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE
The model matrix mm has class "cscMatrix", indicating that it is a compressed, sparse, column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric, sparse, column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful, the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case the decomposition is so fast that it is difficult to determine the difference in the solution times:
> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sys.gctime(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sys.gctime(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
Epilogue
Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS), such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or Atlas (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.
Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.
Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A′B are generally faster than those in AB.
Notice that this means that calculating A′B in R as t(A) %*% B, instead of crossprod(A, B), causes A to be transposed twice: once in the calculation of t(A) and a second time in the inner loop of the matrix
product. The crossprod function does not do any transposition of matrices, yet another reason to use crossprod.
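The identity is easy to check on a small example (made-up matrices):

```r
A <- matrix(1:6, 2, 3)
B <- matrix(7:10, 2, 2)
all.equal(crossprod(A, B), t(A) %*% B)  # TRUE, with no explicit transpose of A
```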
Bibliography
J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).
D. M. Bates
U. of Wisconsin-Madison
bates@wisc.edu
Tools for interactively exploring R packages

by Jianhua Zhang and Robert Gentleman
Introduction
An R package has the required and perhaps additional subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations, both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer, and they are intended for exploring packages, examples, and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.
Description
Our current implementation is in the form of tcltk widgets, and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcltk. Once both tkWidgets and widgetTools have been loaded into the working session, the user has full access to these interactive tools.
pExplorer
pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from showing by the widget. The default behavior is to open the first package in the library of locally installed R (.libPaths()), with subdirectories/files named "Meta" and "latex" excluded if nothing is provided by a user.
Figure 1 shows the widget invoked by typing pExplorer("base").
The following tasks can be performed through the interface:
• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

• Clicking the Browse button allows users to select a directory containing R packages, to achieve the same results as typing in a new path.

• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.
Figure 1: A snapshot of the widget when pExplorer is called.
• Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, to be discussed next.
eExplorer
eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.

Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.
vExplorer
vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionality provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of the vignettes of the package.
Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget, as shown by Figure 4, appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:
• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the PDF version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in Results of Execution.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.
The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.
References
Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28-31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21-24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.
Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base").
Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().
Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.
The survival package

by Thomas Lumley
This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).
Overview
Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.
In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.
Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).
Specifying survival data
In the popular 'counting process' formulation of survival analysis, each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone, it may be omitted.
In data(veteran), time measures the time from the start of a lung cancer clinical trial, and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.
> library(survival)
> data(veteran) ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+
...
> ## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
+                    diagtime*30 + time,
+                    status))
 [1] ( 210, 282 ] ( 150, 561 ]
 [3] (  90, 318 ] ( 270, 396 ]
...
 [9] ( 540, 854 ] ( 180, 280+]
The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.
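A minimal sketch with made-up (start, stop] intervals shows the same behaviour:

```r
library(survival)
# Three hypothetical subjects; event = 1 means the death was observed,
# event = 0 means follow-up ended first (right censoring)
s <- Surv(time = c(0, 2, 5), time2 = c(8, 9, 12),
          event = c(1, 0, 1))
s   # the censored second subject prints with a +
```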
One and two sample summaries
The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.
Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.
> data(veteran)
> plot(survfit(Surv(time, status) ~ trt,
+              data = veteran),
+      xlab = "Years since randomisation",
+      xscale = 365, ylab = "% surviving",
+      yscale = 100,
+      col = c("forestgreen", "blue"))
> survdiff(Surv(time, status) ~ trt,
+          data = veteran)

Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

       N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394

      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928
[Figure 1 appears here: survival curves labelled "Standard" and "New"; x-axis "Years since randomisation" (0.0 to 2.5), y-axis "% surviving" (0 to 100).]
Figure 1: Survival distributions for two lung cancer treatments.
Proportional hazards models
The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.
Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

    h(t; z) = h0(t) exp(βz).
Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.
A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.
> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
       log(bili) + log(protime) + age + platelet,
       data=pbc, subset=trt>0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
      log(bili) + log(protime) + age + platelet,
      data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999

             se(coef)     z       p
edtrt        0.300321  3.43 0.00061
log(bili)    0.097771  9.73 0.00000
log(protime) 1.031908  2.80 0.00520
age          0.008489  4.18 0.00003
platelet     0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312
The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.
> plot(survfit(Surv(time, status) ~ edtrt,
       data=pbc, subset=trt==-9))
> lines(survexp(~ edtrt +
       ratetable(edtrt=edtrt, bili=bili,
                 platelet=platelet, age=age,
                 protime=protime),
       data=pbc, subset=trt==-9,
       ratetable=mayomodel, cohort=TRUE),
       col="purple")
The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.
Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.
[Figure 2 appears here: observed and predicted survival curves; x-axis 0 to 4000 (days), y-axis 0.0 to 1.0.]
Figure 2: Observed and predicted survival.
The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e., that β is constant over time. The
cox.zph function provides formal tests and graphical diagnostics for proportional hazards.
> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796
## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])
[Figure 3 appears here: smoothed estimate of Beta(t) for edtrt plotted against time (roughly 210 to 3400 days).]
Figure 3: Testing proportional hazards.
There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.
Additional features
Computation for the Cox model extends straightforwardly to situations where z can vary over time, and to situations where an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term, indicating which records belong to the same individual, and a strata() term, indicating subgroups that get their own unspecified baseline hazard function.
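As an illustration (a sketch, not from the article; the data frame and variable names are hypothetical), a formula using both terms might look like:

```r
library(survival)

## Hypothetical data frame 'mydata' with multiple records per
## subject: counting-process intervals (start, stop, event),
## a subject identifier 'id' and a grouping factor 'site'.
fit <- coxph(Surv(start, stop, event) ~ treatment +
             cluster(id) + strata(site),
             data = mydata)
```

Here cluster(id) requests standard errors that allow for correlation among records from the same subject, while strata(site) fits a separate baseline hazard for each site.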
Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm, with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
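For example, a parametric Weibull model for the veteran data used earlier could be fitted along these lines (a sketch, not output shown in the article; karno is the Karnofsky performance score in that dataset):

```r
library(survival)
data(veteran)

## parametric (Weibull) model for log survival time
wfit <- survreg(Surv(time, status) ~ trt + karno,
                data = veteran, dist = "weibull")
summary(wfit)
```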
More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.
The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of followup.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
useR! 2004
The R User Conference
by John Fox
McMaster University, Canada
More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis and David Meyer served as chairs of the conference, program committee and local organizing committee; Bettina Grün was in charge of local arrangements.
The conference program included nearly 100 presentations, many in keynote, plenary and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.
A particularly interesting component of the program were the keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:
• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice
• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on data mining and related topics
Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.
An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.
The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19 and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.
Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.
R Help Desk
Date and Time Classes in R
Gabor Grothendieck and Thomas Petzoldt
Introduction
R-1.9.0 and its contributed packages have a number of datetime (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:
• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).
• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).
• POSIX classes. POSIX classes refer to the two classes POSIXct and POSIXlt and their common super class POSIXt. These support times and dates, including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases and other applications that involve external communication with systems that also support dates, times, time zones and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970 GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of
the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIX classes can be found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
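The different internal representations can be seen directly with unclass(); a small sketch (the second line requires the chron package):

```r
library(chron)

unclass(as.Date("1970-01-02"))          # 1: one day after the origin
unclass(chron("01/02/70") + 0.5)        # 1.5: noon on January 2nd, 1970
unclass(as.POSIXct("1970-01-01 00:01:00",
                   tz = "GMT"))         # 60: seconds since the origin
```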
When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.
Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.
Other Applications
Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word "usually" was employed above.
SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).
Time Series
A common use of dates and datetimes are in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its) and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or datetime package to represent datetimes.
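For instance, a regular monthly series needs no explicit dates at all (a minimal sketch):

```r
## two years of monthly data starting January 2002;
## the dates are implicit in (start, frequency)
x <- ts(rnorm(24), start = c(2002, 1), frequency = 12)

## extract the first six months of 2003
window(x, start = c(2003, 1), end = c(2003, 6))
```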
Input
By default, read.table will read in numeric data, such as 20040115, as numbers and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table. [1]
## date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")

## first two cols in format mm/dd/yy hh:mm:ss
## Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)
Avoiding Errors
The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.
With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:
## default origin
options(chron.origin = c(month=1, day=1, year=1970))

## if TRUE, abbreviates year to 2 digits
options(chron.year.abb = TRUE)

## function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)

## if TRUE then year not displayed
options(chron.simplify = FALSE)
For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig+x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example:
[1] This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.
orig <- chron("01/01/60")
x <- 0:9    ## days since 01/01/60

## chron dates with default origin
orig + x
Regarding POSIX classes, the user should be aware of the following:
• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, the day of the week, the day of the month, the month of the year, the year and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:
dp <- seq(Sys.time(), len=10, by="day")
plot(dp, 1:10)
This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.
• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general, caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.
• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.
• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
Comparison Table
Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.
Bibliography
Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2), 8-11. URL http://CRAN.R-project.org/doc/Rnews/.
Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de
now
    Date:    Sys.Date()
    chron:   as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
    POSIXct: Sys.time()

origin
    Date:    structure(0, class="Date")
    chron:   chron(0)
    POSIXct: structure(0, class=c("POSIXt","POSIXct"))

x days since origin
    Date:    structure(floor(x), class="Date")
    chron:   chron(x)
    POSIXct: structure(x*24*60*60, class=c("POSIXt","POSIXct"))  (GMT days)

diff in days
    Date:    dt1-dt2
    chron:   dc1-dc2
    POSIXct: difftime(dp1, dp2, unit="day") (1)

time zone difference
    POSIXct: dp - as.POSIXct(format(dp, tz="GMT"))

compare
    Date:    dt1 > dt2
    chron:   dc1 > dc2
    POSIXct: dp1 > dp2

next day
    Date:    dt+1
    chron:   dc+1
    POSIXct: seq(dp0, length=2, by="DSTday")[2] (2)

previous day
    Date:    dt-1
    chron:   dc-1
    POSIXct: seq(dp0, length=2, by="-1 DSTday")[2] (3)

x days since date
    Date:    dt0+floor(x)
    chron:   dc0+x
    POSIXct: seq(dp0, length=2, by=paste(x, "DSTday"))[2] (4)

sequence
    Date:    dts <- seq(dt0, length=10, by="day")
    chron:   dcs <- seq(dc0, length=10)
    POSIXct: dps <- seq(dp0, length=10, by="DSTday")

plot
    Date:    plot(dts, rnorm(10))
    chron:   plot(dcs, rnorm(10))
    POSIXct: plot(as.POSIXct(format(dps), tz="GMT"), rnorm(10)) (5)

every 2nd week
    Date:    seq(dt0, length=3, by="2 week")
    chron:   dc0 + seq(0, length=3, by=14)
    POSIXct: seq(dp0, length=3, by="2 week")

first day of month
    Date:    as.Date(format(dt, "%Y-%m-01"))
    chron:   chron(dc) - month.day.year(dc)$day + 1
    POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
    Date:    as.numeric(format(dt, "%m"))
    chron:   months(dc)
    POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun=0)
    Date:    as.numeric(format(dt, "%w"))
    chron:   as.numeric(dates(dc)-3) %% 7
    POSIXct: as.numeric(format(dp, "%w"))

day of year
    Date:    as.numeric(format(dt, "%j"))
    chron:   as.numeric(format(as.Date(dc), "%j"))
    POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
    Date:    tapply(x, format(dt, "%Y-%m"), mean)
    chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
    POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
    Date:    tapply(x, format(dt, "%m"), mean)
    chron:   tapply(x, months(dc), mean)
    POSIXct: tapply(x, format(dp, "%m"), mean)

Output: yyyy-mm-dd
    Date:    format(dt)
    chron:   format(as.Date(dates(dc)))
    POSIXct: format(dp, "%Y-%m-%d")

Output: mm/dd/yy
    Date:    format(dt, "%m/%d/%y")
    chron:   format(dates(dc))
    POSIXct: format(dp, "%m/%d/%y")

Output: Sat Jul 1, 1970
    Date:    format(dt, "%a %b %d %Y")
    chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
    POSIXct: format(dp, "%a %b %d %Y")

Input: "1970-10-15"
    Date:    as.Date(z)
    chron:   chron(z, format="y-m-d")
    POSIXct: as.POSIXct(z)

Input: "10/15/1970"
    Date:    as.Date(z, "%m/%d/%Y")
    chron:   chron(z)
    POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz="")
    to Date:    from chron: as.Date(dates(dc)); from POSIXct: as.Date(format(dp))
    to chron:   from Date: chron(unclass(dt)); from POSIXct: chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) (6)
    to POSIXct: from Date: as.POSIXct(format(dt)); from chron: as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz="GMT")
    to Date:    from chron: as.Date(dates(dc)); from POSIXct: as.Date(dp)
    to chron:   from Date: chron(unclass(dt)); from POSIXct: as.chron(dp)
    to POSIXct: from Date: as.POSIXct(format(dt), tz="GMT"); from chron: as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for % codes.

(1) Can be reduced to dp1-dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
(2) Can be reduced to dp+24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
(3) Can be reduced to dp-24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
(4) Can be reduced to dp+24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
(5) Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
(6) An equivalent alternative is as.chron(as.POSIXct(format(dp), tz="GMT")).

Table 1: Comparison Table
Programmers' Niche
A simple class in S3 and S4
by Thomas Lumley
ROC curves
Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.
The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).
I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.
Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±Inf and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.
drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type="l")
}
It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.
There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.
drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  plot(mspec, sens, type="l")
}
One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
Classes and methods
Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.
The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec, test=TT,
               call=sys.call())
  class(rval) <- "ROC"
  rval
}
print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}
The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.
plot.ROC <- function(x, type="b", null.line=TRUE,
                     xlab="1-Specificity",
                     ylab="Sensitivity",
                     main=NULL, ...) {
  par(pty="s")
  plot(x$mspec, x$sens, type=type,
       xlab=xlab, ylab=ylab, ...)
  if (null.line)
    abline(0, 1, lty=3)
  if (is.null(main))
    main <- x$call
  title(main=main)
}
A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method.
lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}
A method for identify allows the cutpoint to be found for any point on the curve.
identify.ROC <- function(x, labels=NULL, ...,
                         digits=1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels=labels, ...)
}
An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
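For readers who want to check their answer, one possible sketch (not from the article) applies the trapezoidal rule to the components of the "ROC" object defined above:

```r
## area under the curve by the trapezoidal rule,
## prepending (0, 0) so the curve starts at the origin
AUC <- function(x) {
  mspec <- c(0, x$mspec)
  sens <- c(0, x$sens)
  sum(diff(mspec) * (sens[-1] + sens[-length(sens)])/2)
}
```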
Generalising the class
The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.
ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
               test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}
The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.
ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type="response"))
  disease <- disease*weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
               test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}
S4 classes and methods
In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).
As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.
The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions.
setClass("ROC",
  representation(sens="numeric", mspec="numeric",
                 test="numeric", call="call"),
  validity=function(object) {
    length(object@sens) == length(object@mspec) &&
    length(object@sens) == length(object@test)
  })
The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new
function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  new("ROC", sens=sens, mspec=mspec,
      test=TT, call=sys.call())
}
Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code.
setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })
The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.
The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.
setMethod("plot", c("ROC", "missing"),
  function(x, y, type="b", null.line=TRUE,
           xlab="1-Specificity",
           ylab="Sensitivity",
           main=NULL, ...) {
    par(pty="s")
    plot(x@mspec, x@sens, type=type,
         xlab=xlab, ylab=ylab, ...)
    if (null.line)
      abline(0, 1, lty=3)
    if (is.null(main))
      main <- x@call
    title(main=main)
  })
Creating a method for lines appears to work the same way.
setMethod("lines", "ROC",
  function(x, ...) {
    lines(x@mspec, x@sens, ...)
  })
In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before
> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>
and after the setMethod call
> lines
standardGeneric for "lines" defined from
package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines: calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.
setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T, D) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type="response"))
    disease <- disease*weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far
               from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    new("ROC", sens=sens, mspec=mspec,
        test=TT, call=sys.call())
  })
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.
setGeneric("identify", signature = c("x"))

setMethod("identify", "ROC",
  function(x, ..., labels = NULL,
           digits = 1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels = labels, ...)
  })
Discussion
Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.
This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R
by the R Core Team
New features in 1.9.1
• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.
User-visible changes in 1.9.0
• Underscore "_" is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.
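For example, code that was a syntax error in older versions of R now parses:

```r
my_rate <- 0.05            # underscore now legal in a name
make.names("net_income")   # left unchanged: "net_income"
```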
• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has
been returned to MASS. In all cases, a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook – see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.
New features in 1.9.0
• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits = FALSE.
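A quick illustration of the new environment accessors:

```r
e <- new.env()
e$x <- 1                   # like assign("x", 1, envir = e)
e$x                        # like get("x", envir = e, inherits = FALSE)
e[["x"]] <- e[["x"]] + 1   # [[ and [[<- work the same way
```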
• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc., now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.
• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.
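A brief illustration of the new class:

```r
d <- as.Date("2004-06-30")
d + 1                   # arithmetic in whole days: "2004-07-01"
weekdays(d)             # day-of-week label
format(d, "%d %b %Y")   # formatting as for date-times
```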
• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.
• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.
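For example:

```r
e <- new.env()
assign("a", 1, envir = e)
assign("b", 2, envir = e)
mget(c("a", "b"), envir = e)   # list(a = 1, b = 2)
```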
• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv = 0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma, ...) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out = n) now gives a vector of length n and not length 0, for compatibility with S.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.
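For example, on a Unix-alike (the file name here is made up):

```r
f <- "My Documents"      # a name containing a space
shQuote(f)               # "'My Documents'", safe to embed in a shell command
system(paste("ls -ld", shQuote(f)))
```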
• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc), strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split = "" is optimized.
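The fixed argument avoids treating the split string as a regular expression:

```r
strsplit("2004.06.30", ".", fixed = TRUE)  # splits on the literal dot
strsplit("2004.06.30", "\\.")              # same result, via an escaped regexp
```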
• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be re-installed.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).
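For example:

```r
e <- new.env()
e$a <- 1
e$b <- "two"
as.list(e)   # a list with components a and b (element order is not guaranteed)
```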
• New function addmargins() in the stats package to add marginal summaries to tables, e.g., row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.
• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport(), and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.

– Added vpList(), vpStack(), vpTree(), and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

– Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

– Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

  pushViewport(viewport(w = .5, h = .5, name = "A"))
  grid.rect()
  pushViewport(viewport(w = .5, h = .5, name = "B"))
  grid.rect(gp = gpar(col = "grey"))
  upViewport(2)
  grid.rect(vp = "A", gp = gpar(fill = "red"))
  grid.rect(vp = vpPath("A", "B"), gp = gpar(fill = "blue"))
– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

  * grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

  * grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

  * the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob "pointed to" is modified.

  In addition, the following features are now possible with grobs:

  * grobs now save()/load() like any normal R object.

  * many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

  * there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

  * there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

  * there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

– Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth", and "grobheight" unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for viewport clip arg are now:

  "on" = clip to the viewport (was TRUE)
  "inherit" = clip to whatever parent says (was FALSE)
  "off" = turn off clipping

  Still accept logical values (and NA maps to "off").
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.
• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(x, deriv = 2) and psigamma(x, deriv = 3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.
• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5, and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blasgoto= to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN
by Kurt Hornik
New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination
follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks, GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences; title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check
given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. By S original (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between versus within chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory efficient refdata type that transparently encapsulates matrixes and dataframes. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.
R News ISSN 1609-3631
Vol. 4/1, June 2004 45
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers

rrcov Functions for robust location and scatter estimation and robust regression with high breakdown point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen, ported to R, adapted and packaged by Valentin Todorov

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe algorithm using the method described in Schoenfeld, D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld

setRNG Set reproducible random number generator in R and S. By Paul Gilbert

sfsmisc Useful utilities ('goodies') from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha

supclust Methodology for supervised grouping of predictor variables. By Marcel Dettling and Martin Maechler

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff

urn Functions for sampling without replacement (simulated urns). By Micah Altman

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck
Other changes
• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News
by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)
• Emanuele De Rinaldis (Italy)
• Norm Phillips (Canada)
New supporting institutions
• Breast Center at Baylor College of Medicine, Houston, Texas, USA
• Dana-Farber Cancer Institute, Boston, USA
• Department of Biostatistics, Vanderbilt University School of Medicine, USA
• Department of Statistics & Actuarial Science, University of Iowa, USA
• Division of Biostatistics, University of California, Berkeley, USA
• Norwegian Institute of Marine Research, Bergen, Norway
• TERRA Lab, University of Regina - Department of Geography, Canada
• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia), Javier de las Rivas (Spain), Sandrine Dudoit (USA), Werner Engl (Austria), Balazs Györffy (Germany), Branimir K. Hackenberger (Croatia), O. Maurice Haynes (USA), Kai Hendry (UK), Arne Henningsen (Germany), Wolfgang Huber (Germany), Marjatta Kari (Finland), Osmo Kari (Finland), Peter Lauf (Germany), Andy Liaw (USA), Mu-Yen Lin (Taiwan), Jeffrey Lins (Denmark), James W. MacDonald (USA), Marlene Müller (Germany), Gillian Raab (UK), Roland Rau (Germany), Michael G. Schimek (Austria), R. Woodrow Setzer (USA), Thomas Telkamp (Netherlands), Klaus Thul (Taiwan), Shusaku Tsumoto (Japan), Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board: Douglas Bates and Paul Murrell.

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: firstname.lastname@R-project.org
R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).
R Project Homepage: http://www.R-project.org/

This newsletter is available online at http://CRAN.R-project.org/doc/Rnews/
R News ISSN 1609-3631
Figure 3: The PCA biplot. Variables are symbolized by arrows, and they are superimposed to the individuals display. The scale of the graph is given by a grid, whose size is given in the upper right corner: here, the length of the side of grid squares is equal to one. The eigenvalues bar chart is drawn in the upper left corner, with the two black bars corresponding to the two axes used to draw the biplot. Grey bars correspond to axes that were kept in the analysis, but not used to draw the graph.
Separate factor maps can be drawn with the s.corcircle (see Figure 2) and s.label functions:

> s.label(pca1$li)
Figure 4: Individuals factor map in a normed PCA.
Distance matrices
A duality diagram can also come from a distance matrix, if this matrix is Euclidean (i.e., if the distances in the matrix are the distances between some points in a Euclidean space). The ade4 package contains functions to compute dissimilarity matrices (dist.binary for binary data, and dist.prop for frequency data), test whether they are Euclidean (Gower and Legendre (1986)), and make them Euclidean (quasieuclid, lingoes, Lingoes (1971); cailliez, Cailliez (1983)). These functions are useful to ecologists who use the works of Legendre and Anderson (1999) and Legendre and Legendre (1998).
The Yanomama data set (Manly (1991)) contains three distance matrices between 19 villages of Yanomama Indians. The dudi.pco function can be used to compute a principal coordinates analysis (PCO, Gower (1966)) that gives a Euclidean representation of the 19 villages. This Euclidean representation allows comparison of the geographical, genetic and anthropometric distances.
> data(yanomama)
> gen <- quasieuclid(as.dist(yanomama$gen))
> geo <- quasieuclid(as.dist(yanomama$geo))
> ant <- quasieuclid(as.dist(yanomama$ant))
> geo1 <- dudi.pco(geo, scann = FALSE, nf = 3)
> gen1 <- dudi.pco(gen, scann = FALSE, nf = 3)
> ant1 <- dudi.pco(ant, scann = FALSE, nf = 3)
> par(mfrow = c(2, 2))
> scatter(geo1)
> scatter(gen1)
> scatter(ant1, posi = "bottom")
Figure 5: Comparison of the PCO analysis of the three distance matrices of the Yanomama data set: geographic, genetic and anthropometric distances.
R News ISSN 1609-3631
Vol 41 June 2004 8
Taking into account groups of individuals
In sites x species tables, rows correspond to sites, columns correspond to species, and the values are the number of individuals of species j found at site i. These tables can have many columns and cannot be used in a discriminant analysis. In this case, between-class analyses (between function) are a better alternative, and they can be used with any duality diagram. The between-class analysis of triplet (X, Q, D) for a given factor f is the analysis of the triplet (G, Q, Dw), where G is the table of the means of table X for the groups defined by f, and Dw is the diagonal matrix of group weights. For example, a between-class correspondence analysis (BCA) is very simply obtained after a correspondence analysis (CA):
> data(meaudret)
> coa1 <- dudi.coa(meaudret$fau, scannf = FALSE)
> bet1 <- between(coa1, meaudret$plan$sta,
+   scannf = FALSE)
> plot(bet1)
The meaudret$fau dataframe is an ecological table with 24 rows corresponding to six sampling sites along a small French stream (the Meaudret). These six sampling sites were sampled four times (spring, summer, winter and autumn), hence the 24 rows. The 13 columns correspond to 13 ephemeroptera species. The CA of this data table is done with the dudi.coa function, giving the coa1 duality diagram. The corresponding between-class analysis is done with the between function, considering the sites as classes (meaudret$plan$sta is a factor defining the classes). Therefore, this is a between-sites analysis, whose aim is to discriminate the sites, given the distribution of ephemeroptera species. This gives the bet1 duality diagram, and Figure 6 shows the graph obtained by plotting this object.
Figure 6: BCA plot. This is a composed plot, made of: 1- the species canonical weights (top left), 2- the species scores (middle left), 3- the eigenvalues bar chart (bottom left), 4- the plot of plain CA axes projected into BCA (bottom center), 5- the gravity centers of classes (bottom right), 6- the projection of the rows with ellipses and gravity center of classes (main graph).
Like between-class analyses, linear discriminant analysis (discrimin function) can be extended to any duality diagram (G, (XtDX)−, Dw), where (XtDX)− is a generalized inverse. This gives for example a correspondence discriminant analysis (Perrière et al. (1996), Perrière and Thioulouse (2002)), that can be computed with the discrimin.coa function.
Opposite to between-class analyses are within-class analyses, corresponding to diagrams (X − YDw−1YtDX, Q, D) (within functions). These analyses extend to any type of variables the Multiple Group Principal Component Analysis (MGPCA, Thorpe (1983a), Thorpe (1983b), Thorpe and Leamy (1983)). Furthermore, the within.coa function introduces the double within-class correspondence analysis (also named internal correspondence analysis, Cazes et al. (1988)).
Permutation tests
Permutation tests (also called Monte-Carlo tests, or randomization tests) can be used to assess the statistical significance of between-class analyses. Many permutation tests are available in the ade4 package, for example mantel.randtest, procuste.randtest, randtest.between, randtest.coinertia, RV.rtest and randtest.discrimin, and several of these tests are available both in R (mantel.rtest) and in C (mantel.randtest). The R version allows one to see how the computations are performed, and to write other tests easily, while the C version is needed for performance reasons.
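The logic shared by these tests can be sketched in a few lines of base R. The perm.test helper below is purely illustrative (it is not part of ade4): it compares the observed statistic to its distribution under random permutations of the grouping factor.

```r
## Generic Monte-Carlo (permutation) test: an illustrative base-R
## sketch, not the ade4 implementation. `stat` computes the test
## statistic from the data x and a grouping vector g.
perm.test <- function(x, g, stat, nrepet = 999) {
  obs <- stat(x, g)
  ## permuting the labels simulates the null hypothesis of no group effect
  sim <- replicate(nrepet, stat(x, sample(g)))
  ## count the observed value among the replicates, so that the
  ## smallest attainable p-value is 1 / (nrepet + 1)
  pvalue <- (sum(sim >= obs) + 1) / (nrepet + 1)
  list(obs = obs, sim = sim, pvalue = pvalue)
}
```

With nrepet = 999 the smallest attainable p-value is 1/1000, which matches the simulated p-value of 0.001 reported by randtest.between based on 999 replicates.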
The statistical significance of the BCA can be evaluated with the randtest.between function. By default, 999 permutations are simulated, and the resulting object (test1) can be plotted (Figure 7). The p-value is highly significant, which confirms the existence of differences between sampling sites. The plot shows that the observed value is very far to the right of the histogram of simulated values.
> test1 <- randtest.between(bet1)
> test1
Monte-Carlo test
Observation: 0.4292
Call: randtest.between(xtest = bet1)
Based on 999 replicates
Simulated p-value: 0.001
> plot(test1, main = "Between class inertia")
Figure 7: Histogram of the 999 simulated values of the randomization test of the bet1 BCA. The observed value is given by the vertical line, at the right of the histogram.
Conclusion
We have described only the most basic functions of the ade4 package, considering only the simplest one-table data analysis methods. Many other dudi methods are available in ade4, for example multiple correspondence analysis (dudi.acm), fuzzy correspondence analysis (dudi.fca), analysis of a mixture of numeric variables and factors (dudi.mix), non symmetric correspondence analysis (dudi.nsc), and decentered correspondence analysis (dudi.dec).
We are preparing a second paper, dealing with two-tables coupling methods, among which canonical correspondence analysis and redundancy analysis are the most frequently used in ecology (Legendre and Legendre (1998)). The ade4 package proposes an alternative to these methods, based on the co-inertia criterion (Dray et al. (2003)).
The third category of data analysis methods available in ade4 are K-tables analysis methods, that try to extract the stable part in a series of tables. These methods come from the STATIS strategy, Lavit et al. (1994) (statis and pta functions), or from the multiple coinertia strategy (mcoa function). The mfa and foucart functions perform two variants of K-tables analysis, and the STATICO method (function ktab.match2ktabs, Thioulouse et al. (2004)) allows one to extract the stable part of species-environment relationships, in time or in space.
Methods taking into account spatial constraints(multispati function) and phylogenetic constraints(phylog function) are under development
Bibliography
Cailliez, F. (1983). The analytical solution of the additive constant problem. Psychometrika, 48:305–310.

Cazes, P., Chessel, D., and Dolédec, S. (1988). L'analyse des correspondances internes d'un tableau partitionné: son usage en hydrobiologie. Revue de Statistique Appliquée, 36:39–54.

Chevenet, F., Dolédec, S., and Chessel, D. (1994). A fuzzy coding approach for the analysis of long-term ecological data. Freshwater Biology, 31:295–309.

Dolédec, S., Chessel, D., and Olivier, J. (1995). L'analyse des correspondances décentrée: application aux peuplements ichtyologiques du haut-Rhône. Bulletin Français de la Pêche et de la Pisciculture, 336:29–40.

Dray, S., Chessel, D., and Thioulouse, J. (2003). Co-inertia analysis and the linking of ecological tables. Ecology, 84:3078–3089.

Escoufier, Y. (1987). The duality diagram: a means of better practical applications. In Legendre, P. and Legendre, L., editors, Development in numerical ecology, pages 139–156. NATO Advanced Institute, Serie G. Springer Verlag, Berlin.

Gower, J. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53:325–338.

Gower, J. and Legendre, P. (1986). Metric and euclidean properties of dissimilarity coefficients. Journal of Classification, 3:5–48.

Greenacre, M. (1984). Theory and applications of correspondence analysis. Academic Press, London.

Hill, M. and Smith, A. (1976). Principal component analysis of taxonomic data with multi-state discrete characters. Taxon, 25:249–255.
Kiers, H. (1994). Simple structure in component analysis techniques for mixtures of qualitative and quantitative variables. Psychometrika, 56:197–212.

Kroonenberg, P. and Lombardo, R. (1999). Non-symmetric correspondence analysis: a tool for analysing contingency tables with a dependence structure. Multivariate Behavioral Research, 34:367–396.

Lavit, C., Escoufier, Y., Sabatier, R., and Traissac, P. (1994). The ACT (STATIS method). Computational Statistics and Data Analysis, 18:97–119.

Legendre, P. and Anderson, M. (1999). Distance-based redundancy analysis: testing multispecies responses in multifactorial ecological experiments. Ecological Monographs, 69:1–24.

Legendre, P. and Legendre, L. (1998). Numerical ecology. Elsevier Science BV, Amsterdam, 2nd english edition.

Lingoes, J. (1971). Some boundary conditions for a monotone analysis of symmetric matrices. Psychometrika, 36:195–203.

Manly, B. F. J. (1991). Randomization and Monte Carlo methods in biology. Chapman and Hall, London.

Perrière, G., Combet, C., Penel, S., Blanchet, C., Thioulouse, J., Geourjon, C., Grassot, J., Charavay, C., Gouy, M., Duret, L., and Deléage, G. (2003). Integrated databanks access and sequence/structure analysis services at the PBIL. Nucleic Acids Research, 31:3393–3399.

Perrière, G., Lobry, J., and Thioulouse, J. (1996). Correspondence discriminant analysis: a multivariate method for comparing classes of protein and nucleic acid sequences. Computer Applications in the Biosciences, 12:519–524.

Perrière, G. and Thioulouse, J. (2002). Use of correspondence discriminant analysis to predict the subcellular location of bacterial proteins. Computer Methods and Programs in Biomedicine, in press.

Tenenhaus, M. and Young, F. (1985). An analysis and synthesis of multiple correspondence analysis, optimal scaling, dual scaling, homogeneity analysis and other methods for quantifying categorical multivariate data. Psychometrika, 50:91–119.

Thioulouse, J., Chessel, D., Dolédec, S., and Olivier, J. (1997). ADE-4: a multivariate analysis and graphical display software. Statistics and Computing, 7:75–83.

Thioulouse, J., Simier, M., and Chessel, D. (2004). Simultaneous analysis of a sequence of paired ecological tables. Ecology, 85:272–283.

Thorpe, R. S. (1983a). A biometric study of the effects of growth on the analysis of geographic variation: tooth number in green geckos (Reptilia: Phelsuma). Journal of Zoology, 201:13–26.

Thorpe, R. S. (1983b). A review of the numerical methods for recognising and analysing racial differentiation. In Felsenstein, J., editor, Numerical Taxonomy, Series G, pages 404–423. Springer-Verlag, New York.

Thorpe, R. S. and Leamy, L. (1983). Morphometric studies in inbred house mice (Mus sp.): multivariate analysis of size and shape. Journal of Zoology, 199:421–432.
R News ISSN 1609-3631
Vol 41 June 2004 11
qcc: An R package for quality control charting and statistical process control
by Luca Scrucca
Introduction
The qcc package for the R statistical environment allows one to:
• plot Shewhart quality control charts for continuous, attribute and count data;
• plot Cusum and EWMA charts for continuous data;
• draw operating characteristic curves;
• perform process capability analyses;
• draw Pareto charts and cause-and-effect diagrams.
I started writing the package to provide the students in the class I teach a tool for learning the basic concepts of statistical quality control, at the introductory level of a book such as Montgomery (2000). Due to the intended final users, a free statistical environment was important, and R was a natural candidate.
The initial design, although it was written from scratch, reflected the set of functions available in S-Plus. Users of that environment will find some similarities, but also differences. In particular, during the development I added more tools and re-designed most of the library to meet the requirements I thought were important to satisfy.
Creating a qcc object
A qcc object is the basic building block to start with. This can be simply created invoking the qcc function, which in its simplest form requires two arguments: a data frame, a matrix or a vector containing the observed data, and a string value specifying the control chart to be computed. Standard Shewhart control charts are available. These are often classified according to the type of quality characteristic that they are supposed to monitor (see Table 1).
Control charts for continuous variables are usually based on several samples with observations collected at different time points. Each sample, or "rationale group", must be provided as a row of a data frame or a matrix. The function qcc.groups can be used to easily group a vector of data values based on a sample indicator. Unequal sample sizes are allowed.
Suppose we have extracted samples for a characteristic of interest from an ongoing production process:
> data(pistonrings)
> attach(pistonrings)
> dim(pistonrings)
[1] 200 3
> pistonrings
    diameter sample trial
1     74.030      1  TRUE
2     74.002      1  TRUE
3     74.019      1  TRUE
4     73.992      1  TRUE
5     74.008      1  TRUE
6     73.995      2  TRUE
7     73.992      2  TRUE
...
199   74.000     40 FALSE
200   74.020     40 FALSE
> diameter <- qcc.groups(diameter, sample)
> dim(diameter)
[1] 40 5
> diameter
     [,1]   [,2]   [,3]   [,4]   [,5]
1  74.030 74.002 74.019 73.992 74.008
2  73.995 73.992 74.001 74.011 74.004
...
40 74.010 74.005 74.029 74.000 74.020
Using the first 25 samples as training data, an X chart can be obtained as follows:
> obj <- qcc(diameter[1:25,], type="xbar")
By default a Shewhart chart is drawn (see Figure 1), and an object of class 'qcc' is returned. Summary statistics can be retrieved using the summary method (or simply by printing the object):
> summary(obj)
Call:
qcc(data = diameter[1:25, ], type = "xbar")

xbar chart for diameter[1:25, ]

Summary of group statistics:
  Min. 1st Qu. Median  Mean 3rd Qu.  Max.
 73.99   74.00  74.00 74.00   74.00 74.01

Group sample size:  5
Number of groups:  25
Center of group statistics:  74.00118
Standard deviation:  0.009887547

Control limits:
      LCL      UCL
 73.98791 74.01444
R News ISSN 1609-3631
Vol 41 June 2004 12
Control charts for variables:
  xbar      X chart. Sample means are plotted to control the mean level of a continuous process variable.
  xbar.one  X chart. Sample values from a one-at-a-time data process to control the mean level of a continuous process variable.
  R         R chart. Sample ranges are plotted to control the variability of a continuous process variable.
  S         S chart. Sample standard deviations are plotted to control the variability of a continuous process variable.

Control charts for attributes:
  p         p chart. The proportion of nonconforming units is plotted. Control limits are based on the binomial distribution.
  np        np chart. The number of nonconforming units is plotted. Control limits are based on the binomial distribution.
  c         c chart. The number of defectives per unit is plotted. This chart assumes that defects of the quality attribute are rare, and the control limits are computed based on the Poisson distribution.
  u         u chart. The average number of defectives per unit is plotted. The Poisson distribution is used to compute control limits, but unlike the c chart, this chart does not require a constant number of units.

Table 1: Shewhart control charts available in the qcc package.
The number of groups and their sizes are reported in this case, whereas a table is provided in the case of unequal sample sizes. Moreover, the center of group statistics (the overall mean for an X chart) and the within-group standard deviation of the process are returned.
The main goal of a control chart is to monitor a process. If special causes of variation are present, the process is said to be "out of control", and actions are to be taken to find, and possibly eliminate, such causes. A process is declared to be "in control" if all points charted lie randomly within the control limits. These are usually computed at ±3σ from the center. This default can be changed using the argument nsigmas (so called "american practice"), or by specifying the confidence level (so called "british practice") through the confidence.level argument.
Assuming a Gaussian distribution is appropriate for the statistic charted, the two practices are equivalent, since limits at ±3σ correspond to a two-tails probability of 0.0027 under a standard normal curve.
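This equivalence is easy to check numerically in base R:

```r
## Two-tails probability outside +/- 3 sigma under a standard normal
## curve: this is the 0.0027 figure quoted above
alpha <- 2 * pnorm(-3)
round(alpha, 4)  # 0.0027
## the equivalent "british practice" confidence level
conf.level <- 1 - alpha  # about 0.9973
```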
Instead of using standard calculations to compute the group statistics and the corresponding control limits (see Montgomery (2000), Wetherill and Brown (1991)), the center, the within-group standard deviation and the control limits may be specified directly by the user through the arguments center, std.dev and limits, respectively.
Figure 1: An X chart for samples data.
Shewhart control charts
A Shewhart chart is automatically plotted when an object of class 'qcc' is created, unless the qcc function is called with the argument plot=FALSE. The method responsible for plotting a control chart can, however, be invoked directly as follows:
> plot(obj)
giving the plot in Figure 1. This control chart has the center line (horizontal solid line), the upper and lower control limits (dashed lines), and the sample group statistics are drawn as points connected with lines. Unless we provide the argument add.stats=FALSE, at the bottom of the plot some summary statistics are shown, together with the number of points beyond control limits and the number of violating runs (a run has length 5 by default).
Once a process is considered to be "in-control", we may use the computed limits for monitoring new data sampled from the same ongoing process. For example,
> obj <- qcc(diameter[1:25,], type="xbar",
+            newdata=diameter[26:40,])
plots the X chart for both calibration data and new data (see Figure 2), but all the statistics and the control limits are solely based on the first 25 samples.
Figure 2: An X chart with calibration data and new data samples.
If only the new data need to be plotted, we may provide the argument chart.all=FALSE in the call to the qcc function, or through the plotting method:
> plot(obj, chart.all=FALSE)
Many other arguments may be provided in the call to the qcc function and to the corresponding plotting method, and we refer the reader to the on-line documentation for details (help(qcc)).
As reported on Table 1, an X chart for one-at-a-time data may be obtained specifying type="xbar.one". In this case, the data charted are simply the sample values, and the within-group standard deviation is estimated by moving ranges of k (by default 2) points.
Control charts for attributes mainly differ from the previous examples in that we need to provide sample sizes through the size argument (except for the c chart). For example, a p chart can be obtained as follows:
> data(orangejuice)
> attach(orangejuice)
> orangejuice
   sample  D size trial
1       1 12   50  TRUE
2       2 15   50  TRUE
3       3  8   50  TRUE
...
53     53  3   50 FALSE
54     54  5   50 FALSE
> obj2 <- qcc(D[trial], sizes=size[trial], type="p")
where we use the logical indicator variable trial to select the first 30 samples as calibration data.
Operating characteristic function
An operating characteristic (OC) curve provides information about the probability of not detecting a shift in the process. This is usually referred to as the type II error, that is, the probability of erroneously accepting a process as being "in control".
The OC curves can be easily obtained from an ob-ject of class lsquoqccrsquo
gt par(mfrow=c(12))
gt occurves(obj)
gt occurves(obj2)
The function oc.curves invisibly returns a matrix or a vector of probability values for the type II error. More arguments are available (see help(oc.curves)); in particular, identify=TRUE allows one to interactively identify values on the plot.
Figure 3: Operating characteristic curves. Left panel: OC curves for the xbar chart (probability of type II error vs. process shift in std. dev., for n = 1, 5, 10, 15, 20); right panel: OC curves for the p chart (probability of type II error vs. p).
Cusum charts
Cusum charts display how the group summary statistics deviate above or below the process center or target value, relative to the standard errors of the summary statistics. They are useful to detect small and permanent variation on the mean of the process.
The basic cusum chart implemented in the qcc package is only available for continuous variables at the moment. Assuming a 'qcc' object has been created, we can simply input the following code:
> cusum(obj)
and the resulting cusum chart is shown in Figure 4. This displays on a single plot both the positive deviations and the negative deviations from the target, which by default is set to the center of group statistics. If a different target for the process is required, this value may be provided through the argument target when creating the object of class 'qcc'. Two further aspects may need to be controlled. The argument decision.int can be used to specify the number of standard errors of the summary statistics at which the cumulative sum signals out of control; by default it is set at 5. The amount of shift in the process to be detected by the cusum chart, measured in standard errors of the summary statistics, is controlled by the argument se.shift, by default set at 1.
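The mechanics behind such a chart are the tabular CUSUM recursions: with standardized statistics zᵢ, a reference value k (half the shift to detect, so k = 0.5 for a 1-standard-error shift) and a decision interval h (5 above), the upper and lower sums accumulate any excess over k and signal when they exceed h. A Python sketch of this standard scheme (an illustration of the recursions, not qcc's implementation):

```python
def cusum_paths(z, k=0.5, h=5.0):
    """Tabular CUSUM of standardized group statistics z.

    Returns the upper path C+, the lower path C-, and the indices
    at which either path exceeds the decision interval h.
    """
    cpos, cneg, signals = [], [], []
    up = low = 0.0
    for i, zi in enumerate(z):
        up = max(0.0, zi - k + up)    # accumulate upward drift beyond k
        low = max(0.0, -zi - k + low) # accumulate downward drift beyond k
        cpos.append(up)
        cneg.append(low)
        if up > h or low > h:
            signals.append(i)
    return cpos, cneg, signals

# a persistent 2-standard-error upward shift
cpos, cneg, signals = cusum_paths([2.0] * 5)
```

A sustained small shift that a Shewhart chart would miss accumulates linearly here, which is why the cusum signals after a few groups.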
Figure 4: Cusum chart for diameter[1:25,] (calibration data) and diameter[26:40,] (new data). (Number of groups = 40, Target = 74.00118, StdDev = 0.009887547; decision boundaries (std. err.) = 5, shift detection (std. err.) = 1; number of points beyond boundaries = 4.)
EWMA charts
Exponentially weighted moving average (EWMA) charts smooth a series of data based on a moving average with weights which decay exponentially. As for the cusum chart, this plot can also be used to detect small and permanent variation on the mean of the process.
The EWMA chart depends on the smoothing parameter λ, which controls the weighting scheme applied. By default it is set at 0.2, but it can be modified using the argument lambda. Assuming again that a 'qcc' object has been created, an EWMA chart can be obtained as follows:
> ewma(obj)
and the resulting graph is shown in Figure 5.
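The EWMA statistic itself is a one-line recursion, zᵢ = λxᵢ + (1 − λ)zᵢ₋₁ with z₀ set to the target, and its exact control limits widen with i toward an asymptote. A Python sketch of the textbook formulas for individual values (qcc charts group summary statistics; this simplified version is for illustration only):

```python
import math

def ewma_chart(x, target, sigma, lam=0.2, nsigmas=3):
    """EWMA statistics and exact control limits for individual values x.

    z_i = lam*x_i + (1-lam)*z_{i-1}, z_0 = target.  The half-width grows
    with i toward nsigmas*sigma*sqrt(lam/(2-lam)).
    """
    z, limits = [], []
    prev = target
    for i in range(1, len(x) + 1):
        prev = lam * x[i - 1] + (1 - lam) * prev
        z.append(prev)
        half = nsigmas * sigma * math.sqrt(
            lam / (2 - lam) * (1 - (1 - lam) ** (2 * i)))
        limits.append((target - half, target + half))
    return z, limits

z, limits = ewma_chart([74.0] * 4, target=74.0, sigma=0.01)
```

The early limits are narrower than the asymptotic ones, which is the characteristic funnel shape visible at the left edge of Figure 5.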
Figure 5: EWMA chart for diameter[1:25,] (calibration data) and diameter[26:40,] (new data). (Number of groups = 40, Target = 74.00118, StdDev = 0.009887547; smoothing parameter = 0.2, control limits at 3*sigma.)
Process capability analysis
Process capability indices for a characteristic of interest from a continuous process can be obtained through the function process.capability. This takes as mandatory arguments an object of class 'qcc' for an "xbar" or "xbar.one" type, and spec.limits, a vector specifying the lower (LSL) and upper (USL) specification limits. For example, using the previously created 'qcc' object, we may simply use:
> process.capability(obj,
+    spec.limits=c(73.95, 74.05))
Process Capability Analysis

Call:
process.capability(object = obj,
    spec.limits = c(73.95, 74.05))

Number of obs = 125          Target = 74
       Center = 74.00118        LSL = 73.95
       StdDev = 0.009887547     USL = 74.05

Capability indices:

      Value    2.5%   97.5%
Cp    1.686   1.476   1.895
Cp_l  1.725   1.539   1.912
Cp_u  1.646   1.467   1.825
Cp_k  1.646   1.433   1.859
Cpm   1.674   1.465   1.882

Exp<LSL 0%    Obs<LSL 0%
Exp>USL 0%    Obs>USL 0%
The graph shown in Figure 6 is a histogram of data values, with vertical dotted lines for the upper and the lower specification limits. The target is shown as a vertical dashed line, and its value can be provided using the target argument in the call; if missing, the value from the 'qcc' object is used if available, otherwise the target is set at the middle value between the specification limits.
The dashed curve is a normal density for a Gaussian distribution matching the observed mean and standard deviation. The latter is estimated using the within-group standard deviation, unless a value is specified in the call using the std.dev argument.
The capability indices reported on the graph are also printed at the console, together with confidence intervals computed at the level specified by the confidence.level argument (by default set at 0.95).
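The point estimates in that table follow directly from the definitions Cp = (USL − LSL)/(6σ), Cp_l = (μ − LSL)/(3σ), Cp_u = (USL − μ)/(3σ) and Cp_k = min(Cp_l, Cp_u). A Python sketch that reproduces them from the printed summary values (Center = 74.00118, StdDev = 0.009887547); Cpm is omitted here because it additionally involves the target:

```python
def capability(mean, sd, lsl, usl):
    """Point estimates of the basic process capability indices."""
    cp = (usl - lsl) / (6 * sd)
    cp_l = (mean - lsl) / (3 * sd)
    cp_u = (usl - mean) / (3 * sd)
    return {"Cp": cp, "Cp_l": cp_l, "Cp_u": cp_u,
            "Cp_k": min(cp_l, cp_u)}

# values from the process.capability() summary above
idx = capability(74.00118, 0.009887547, lsl=73.95, usl=74.05)
```

Rounded to three decimals these match the Value column of the printed output (1.686, 1.725, 1.646, 1.646); the confidence intervals require the sampling distributions and are left to process.capability itself.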
Figure 6: Process capability analysis for diameter[1:25,]. (Number of obs = 125, Center = 74.00118, StdDev = 0.009887547; Target = 74, LSL = 73.95, USL = 74.05; Cp = 1.69, Cp_l = 1.73, Cp_u = 1.65, Cp_k = 1.65, Cpm = 1.67; Exp<LSL 0%, Exp>USL 0%, Obs<LSL 0%, Obs>USL 0%.)
A further display, called a "process capability sixpack plots", is also available. This is a graphical summary formed by plotting on the same graphical device:
* an X chart
* an R or S chart (if sample sizes > 10)
* a run chart
* a histogram with specification limits
* a normal Q-Q plot
* a capability plot
As an example, we may input the following code:
> process.capability.sixpack(obj,
+    spec.limits=c(73.95, 74.05),
+    target=74.01)
Pareto charts and cause-and-effect diagrams
A Pareto chart is a barplot where the categories are ordered using frequencies in non-increasing order, with a line added to show their cumulative sum. Pareto charts are useful to identify those factors that have the greatest cumulative effect on the system, and thus screen out the less significant factors in an analysis.
A simple Pareto chart may be drawn using the function pareto.chart, which in its simpler form requires a vector of values. For example:
> defect <- c(80, 27, 66, 94, 33)
> names(defect) <-
+   c("price code", "schedule date",
+     "supplier code", "contact num.",
+     "part num.")
> pareto.chart(defect)
The function returns a table of descriptive statistics and the Pareto chart shown in Figure 7. More arguments can be provided to control axes labels and limits, the main title and bar colors (see help(pareto.chart)).
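The table underlying a Pareto chart is simple to compute: sort the frequencies in non-increasing order and attach cumulative percentages. A Python sketch applied to the defect data above (an illustration of the tabulation, not pareto.chart's code):

```python
def pareto_table(counts):
    """Frequencies in non-increasing order with cumulative percentages,
    as a Pareto chart tabulates them."""
    items = sorted(counts.items(), key=lambda kv: -kv[1])
    total = sum(counts.values())
    cum = 0
    rows = []
    for name, freq in items:
        cum += freq
        rows.append((name, freq, 100.0 * cum / total))
    return rows

defect = {"price code": 80, "schedule date": 27, "supplier code": 66,
          "contact num.": 94, "part num.": 33}
rows = pareto_table(defect)
```

The first two categories ("contact num." and "price code") already account for well over half of the 300 defects, which is the screening message the cumulative line in Figure 7 conveys.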
Figure 7: Pareto chart for defect (frequency bars in decreasing order for contact num., price code, supplier code, part num. and schedule date, with a cumulative percentage line).
A cause-and-effect diagram, also called an Ishikawa diagram (or fish bone diagram), is used to associate multiple possible causes with a single effect. Thus, given a particular effect, the diagram is constructed to identify and organize possible causes for it.
A very basic implementation of this type of diagram is available in the qcc package. The function cause.and.effect requires at least two arguments: cause, a list of causes and branches providing descriptive labels, and effect, a string describing the effect. Other arguments are available, such as cex and font, which control, respectively, the character expansions and the fonts used for labeling branches, causes and the effect. The following example creates the diagram shown in Figure 8.
> cause.and.effect(
+   cause=list(Measurements=c("Microscopes",
+                             "Inspectors"),
+              Materials=c("Alloys",
+                          "Suppliers"),
+              Personnel=c("Supervisors",
+                          "Operators"),
+              Environment=c("Condensation",
+                            "Moisture"),
+              Methods=c("Brake", "Engager",
+                        "Angle"),
+              Machines=c("Speed", "Bits",
+                         "Sockets")),
+   effect="Surface Flaws")
Figure 8: Cause-and-effect diagram for the effect "Surface Flaws", with branches Measurements, Materials, Personnel, Environment, Methods and Machines.
Tuning and extending the package
Options to control some aspects of the qcc package can be set using the function qcc.options. If this is called with no arguments it retrieves the list of available options and their current values, otherwise it sets the required option. From a user's point of view, the most interesting options allow one to set: the plotting character and the color used to highlight points beyond control limits or violating runs; the run length, i.e. the maximum value of a run before signalling a point as out of control; the background color used to draw the figure and the margin color; and the character expansion used to draw the labels, the title and the statistics at the bottom of any plot. For details see help(qcc.options).
Another interesting feature is the possibility of easily extending the package by defining new control charts. Suppose we want to plot a p chart based on samples with different sizes. The resulting control chart will have non-constant control limits and thus may be difficult to read by some operators. One possible solution is to draw a standardized control chart. This can be accomplished by defining three new functions: one which computes and returns the group statistics to be charted (i.e. the z scores) and their center (zero in this case); another which returns the within-group standard deviation (one in this case); and finally a function which returns the control limits. These functions need to be called stats.type, sd.type and limits.type, respectively, for a new chart of type "type". The following code may be used to define a standardized p chart:
stats.p.std <- function(data, sizes)
{
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data)/sum(sizes)
  z <- (data/sizes - pbar)/sqrt(pbar*(1-pbar)/sizes)
  list(statistics = z, center = 0)
}

sd.p.std <- function(data, sizes) return(1)

limits.p.std <- function(center, std.dev, sizes, conf)
{
  if (conf >= 1)
    { lcl <- -conf
      ucl <- +conf }
  else
    { if (conf > 0 & conf < 1)
        { nsigmas <- qnorm(1 - (1 - conf)/2)
          lcl <- -nsigmas
          ucl <- +nsigmas }
      else stop("invalid conf argument") }
  limits <- matrix(c(lcl, ucl), ncol = 2)
  rownames(limits) <- rep("", length = nrow(limits))
  colnames(limits) <- c("LCL", "UCL")
  return(limits)
}
Then, we may source the above code and obtain the control charts in Figure 9 as follows:
# set unequal sample sizes
> n <- c(rep(50,5), rep(100,5), rep(25,5))
# generate randomly the number of successes
> x <- rbinom(length(n), n, 0.2)
> par(mfrow=c(1,2))
# plot the control chart with variable limits
> qcc(x, type="p", size=n)
# plot the standardized control chart
> qcc(x, type="p.std", size=n)
Figure 9: A p chart with non-constant control limits (left panel) and the corresponding standardized control chart (right panel). (Left: number of groups = 15, Center = 0.1931429, StdDev = 0.3947641, LCL and UCL variable; right: Center = 0, StdDev = 1, LCL = -3, UCL = 3; no points beyond limits or violating runs in either chart.)
Summary
In this paper we briefly describe the qcc package. This provides functions to draw basic Shewhart quality control charts for continuous, attribute and count data; corresponding operating characteristic curves are also implemented. Other statistical quality tools available are cusum and EWMA charts for continuous data, process capability analyses, Pareto charts and cause-and-effect diagrams.
Luca Scrucca
Dipartimento di Scienze Statistiche, Università degli Studi di Perugia, Italy
luca@stat.unipg.it
Least Squares Calculations in R

Timing different approaches

by Douglas Bates
Introduction
The S language encourages users to become programmers as they express new ideas and techniques of data analysis in S. Frequently the calculations in these techniques are expressed in terms of matrix operations. The subject of numerical linear algebra - how calculations with matrices can be carried out accurately and efficiently - is quite different from what many of us learn in linear algebra courses or in courses on linear statistical methods.
Numerical linear algebra is primarily based on decompositions of matrices, such as the LU decomposition, the QR decomposition, and the Cholesky decomposition, that are rarely discussed in linear algebra or statistics courses.
In this article we discuss one of the most common operations in statistical computing, calculating least squares estimates. We can write the problem mathematically as

    β̂ = arg min_β ‖y − Xβ‖²                (1)

where X is an n × p model matrix (p ≤ n), y is n-dimensional and β is p-dimensional. Most statistics texts state that the solution to (1) is

    β̂ = (X′X)⁻¹ X′y                        (2)

when X has full column rank (i.e. the columns of X are linearly independent).
Indeed, (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way. That is, they use code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.
General principles
A few general principles of numerical linear algebra are:
1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.

2. Use crossprod(X), not t(X) %*% X, to calculate X′X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X′y.

3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X′X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
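To make principle 3 concrete, here is a pure-Python toy of the Cholesky route for (1): form X′X and X′y, factor X′X = LL′, then solve the two triangular systems. This is an illustration only; in practice the Matrix package (or numpy/LAPACK) does this in compiled code:

```python
def cholesky(a):
    """Lower-triangular L with L L' = a, for a symmetric
    positive-definite matrix given as a list of rows."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = (a[i][i] - s) ** 0.5
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

def solve_lower(L, b):
    """Forward substitution: solve L y = b."""
    y = [0.0] * len(b)
    for i in range(len(b)):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    return y

def solve_upper_t(L, y):
    """Back substitution: solve L' x = y (L' is upper triangular)."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k]
                           for k in range(i + 1, n))) / L[i][i]
    return x

def lstsq_chol(X, y):
    """Least squares via the normal equations and a Cholesky factor:
    X'X beta = X'y  =>  L L' beta = X'y."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    L = cholesky(xtx)
    return solve_upper_t(L, solve_lower(L, xty))
```

No matrix is ever inverted: only one factorization and two triangular solves, which is exactly what the backsolve/forwardsolve idiom below does in R.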
Some timings on a large example
For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well:
> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850  712
> sys.gc.time(naive.sol <- solve(t(mmm) %*%
+     mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00
(The function sys.gc.time is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)
According to the principles above, it should be more effective to compute:
> sys.gc.time(cpod.sol <- solve(crossprod(mmm),
+     crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE
Timing results will vary between computers, but in most cases the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X′X the naive way:
> sys.gc.time(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00
because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.
However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result the solution of (1) is performed using a general linear system solver based on an LU decomposition, when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly, but the code is awkward:
> sys.gc.time(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sys.gc.time(chol.sol <- backsolve(ch,
+     forwardsolve(ch, crossprod(mmm, y),
+                  upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE
Least squares calculations with Matrix classes
The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition:
> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(Mat.sol <- solve(crossprod(mmg),
+     crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE
Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object so that it can be reused without recalculation:
> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all:
> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(sparse.sol <- solve(crossprod(mm),
+     crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE
The model matrix mm has class "cscMatrix", indicating that it is a compressed, sparse, column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric, sparse, column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful, the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case, the decomposition is so fast that it is difficult to determine the difference in the solution times:
> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
Epilogue
Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS), such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or Atlas (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.
Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.
Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A′B are generally faster than those in AB.
Notice that this means that calculating A′B in R as t(A) %*% B, instead of crossprod(A, B), causes A to be transposed twice: once in the calculation of t(A) and a second time in the inner loop of the matrix product. The crossprod function does not do any transposition of matrices - yet another reason to use crossprod.
Bibliography

J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).
D.M. Bates
U. of Wisconsin-Madison
bates@wisc.edu
Tools for interactively exploring R packages

by Jianhua Zhang and Robert Gentleman
Introduction
An R package has the required and perhaps additional subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations, both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer, and they are intended for exploring packages, examples, and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.
Description
Our current implementation is in the form of tcltk widgets, and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcltk. Once both tkWidgets and widgetTools have been loaded into the working session, the user has full access to these interactive tools.
pExplorer
pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from showing by the widget. If nothing is provided by the user, the default behavior is to open the first package in the library of locally installed R (.libPaths()), with subdirectories/files named "Meta" and "latex" excluded.
Figure 1 shows the widget invoked by typing pExplorer("base").
The following tasks can be performed throughthe interface
* Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

* Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

* Clicking the Browse button allows users to select a directory containing R packages, to achieve the same results as typing in a new path.

* Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.
Figure 1: A snapshot of the widget when pExplorer is called.
* Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

* Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

* Clicking the Try examples button invokes eExplorer, to be discussed next.
eExplorer
eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.
Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.
vExplorer
vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionalities provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of vignettes of the package.
Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget, as shown by Figure 4, appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:
* Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

* Clicking the View PDF button shows the pdf version of the vignette currently being explored.

* Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in the Results of Execution text box.

* Clicking the Export to R button exports the result of execution to the global environment of the current R session.
The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.
References

Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28-31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21-24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.
Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base").
Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().
Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.
The survival package

by Thomas Lumley
This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).
Overview
Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.
In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.
Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).
Specifying survival data
In the popular 'counting process' formulation of survival analysis, each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone, it may be omitted.
In data(veteran), time measures the time from the start of a lung cancer clinical trial, and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.
> library(survival)
> data(veteran)  # time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+
...
> # time from diagnosis to death
> with(veteran, Surv(diagtime*30,
+      diagtime*30+time, status))
 [1] ( 210, 282 ]  ( 150, 561 ]
 [3] (  90, 318 ]  ( 270, 396 ]
...
 [9] ( 540, 854 ]  ( 180, 280+]
The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.
One and two sample summaries
The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.
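The estimator behind survfit is the Kaplan-Meier product-limit estimate: at each distinct event time t with d events among n subjects still at risk, the survival curve is multiplied by (1 − d/n), and censored observations only reduce the risk set. A bare-bones Python sketch of that computation (survfit additionally handles weights, variances and confidence intervals; none of that is attempted here):

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function S(t).

    times: follow-up times; events: 1 if the event occurred, 0 if
    censored.  Returns (distinct event times, S just after each).
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    s = 1.0
    out_t, out_s = [], []
    i = 0
    while i < len(order):
        t = times[order[i]]
        d = n_t = 0
        while i < len(order) and times[order[i]] == t:
            d += events[order[i]]   # events at this time
            n_t += 1                # all subjects leaving the risk set
            i += 1
        if d > 0:
            s *= 1 - d / at_risk
            out_t.append(t)
            out_s.append(s)
        at_risk -= n_t
    return out_t, out_s

# deaths at t=1, 2, 4; one censored observation at t=3
t, s = kaplan_meier([1, 2, 3, 4], [1, 1, 0, 1])
```

In the small example the censored subject at t = 3 drops out of the risk set without producing a step, so the curve steps at t = 1, 2 and 4 only.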
Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.
> data(veteran)
> plot(survfit(Surv(time,status)~trt, data=veteran),
+      xlab="Years since randomisation",
+      xscale=365, ylab="% surviving", yscale=100,
+      col=c("forestgreen","blue"))
> survdiff(Surv(time,status)~trt, data=veteran)
Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

       N Observed Expected (O-E)^2/E (O-E)^2/V
trt=1 69       64     64.5   0.00388   0.00823
trt=2 68       64     63.5   0.00394   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928
Figure 1: Survival distributions for two lung cancer treatments (% surviving vs. years since randomisation, for the standard and new treatments).
Proportional hazards models
The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.
Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

    log h(t; z) = log h₀(t) + βz.

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h₀(t) unspecified, and computation is, if anything, easier than for parametric models.
A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.
> data(pbc)
> mayomodel <- coxph(Surv(time,status) ~ edtrt +
+        log(bili) + log(protime) +
+        age + platelet,
+        data=pbc, subset=trt>0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
      log(bili) + log(protime) + age + platelet,
      data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999

              se(coef)     z       p
edtrt         0.300321  3.43 0.00061
log(bili)     0.097771  9.73 0.00000
log(protime)  1.031908  2.80 0.00520
age           0.008489  4.18 0.00003
platelet      0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312
The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.
> plot(survfit(Surv(time, status) ~ edtrt,
               data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt +
          ratetable(edtrt = edtrt, bili = bili,
                    platelet = platelet, age = age,
                    protime = protime),
          data = pbc, subset = trt == -9,
          ratetable = mayomodel, cohort = TRUE),
        col = "purple")
The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.
Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.
Figure 2: Observed and predicted survival.
The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e., that β is constant over time. The cox.zph function provides formal tests and graphical diagnostics for proportional hazards:
> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796
# graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])
Figure 3: Testing proportional hazards (smoothed estimate of Beta(t) for edtrt plotted against time).
There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.
Additional features
Computation for the Cox model extends straightforwardly to situations where z can vary over time and where an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term, indicating which records belong to the same individual, and a strata() term, indicating subgroups that get their own unspecified baseline hazard function.
Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
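For concreteness, here is a hedged sketch (my own, not from the article) of a parametric fit to the same pbc data used above; survreg models log survival time with a Weibull error distribution by default:

```r
library(survival)
data(pbc)
# A sketch using the same predictors as the Cox model above;
# the Weibull is survreg's default error distribution.
wmodel <- survreg(Surv(time, status) ~ edtrt + log(bili) +
                    log(protime) + age + platelet,
                  data = pbc, subset = trt > 0)
summary(wmodel)
```

Unlike coxph, the coefficients here are on the log survival time scale, so a harmful predictor gets a negative sign rather than a positive one.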
More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.
The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of follow-up.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
useR! 2004
The R User Conference
by John Fox, McMaster University, Canada
More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee and local organizing committee; Bettina Grün was in charge of local arrangements.
The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.
A particularly interesting component of the program was the set of keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:
• Paul Murrell on grid graphics and programming
• Martin Mächler on good programming practice

• Luke Tierney on namespaces and byte compilation
• Kurt Hornik on packaging, documentation, and testing
• Friedrich Leisch on S4 classes and methods
• Peter Dalgaard on language interfaces, using .Call and .External
• Douglas Bates on multilevel models in R
• Brian D. Ripley on data mining and related topics
Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.
An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.
The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19 and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.
Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.
R Help Desk
Date and Time Classes in R
Gabor Grothendieck and Thomas Petzoldt
Introduction
R-1.9.0 and its contributed packages have a number of date/time (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:
• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).
• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).
• POSIX classes. POSIX classes refer to the two classes POSIXct and POSIXlt and their common super class POSIXt. These support times and dates, including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases and other applications that involve external communication with systems that also support dates, times, time zones and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970, GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIX classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
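As a small illustration (mine, not from the article), the internal representations described above can be inspected directly, assuming the chron package is installed:

```r
library(chron)
# noon on January 2nd, 1970, in two representations
dc <- chron("01/02/70", "12:00:00")
dp <- as.POSIXct("1970-01-02 12:00:00", tz = "GMT")
as.numeric(dc)                 # 1.5 days since January 1, 1970
unclass(dp)                    # 129600 seconds since January 1, 1970 GMT
names(unclass(as.POSIXlt(dp))) # the 9 components: sec, min, hour, ...
```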
When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
A full-page table is provided at the end of this article, which illustrates and compares idioms written in each of the three date/time systems above.
Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.
Other Applications
Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word "usually" was employed above.
SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).
Time Series
A common use of dates and datetimes is in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its) and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or time/date package to represent datetimes.
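As a brief sketch (mine, not from the article) of an irregular series, assuming the zoo package is installed:

```r
library(zoo)
# three observations at irregularly spaced Date-class dates
z <- zoo(c(1.2, 0.7, 1.9),
         as.Date(c("2004-01-05", "2004-01-07", "2004-01-12")))
z
```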
Input
By default, read.table will read in numeric data, such as 20040115, as numbers and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹
# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)
Avoiding Errors
The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.
With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:
# default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))

# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)

# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)

# if TRUE then year not displayed
options(chron.simplify = FALSE)
For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,
¹This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.
orig <- chron("01/01/60")
x <- 0:9  # days since 01/01/60
# chron dates with default origin
orig + x
Regarding POSIX classes, the user should be aware of the following:
• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:
dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)
This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.
• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general, caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.
• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.
• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
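The character-first recipes above can be sketched as follows (my illustration, not from the article):

```r
dp <- Sys.time()          # a POSIXct datetime
# Going through character avoids hidden time-zone/isdst adjustments
as.POSIXlt(format(dp))    # POSIXct -> POSIXlt
as.Date(format(dp))       # POSIXct -> Date
```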
Comparison Table
Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.
Bibliography
Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.
James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.
Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2), 8-11. URL http://CRAN.R-project.org/doc/Rnews/.
Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de
Table 1: Comparison Table

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for % codes.

now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class="Date")
  chron:   chron(0)
  POSIXct: structure(0, class=c("POSIXt","POSIXct"))

x days since origin (GMT)
  Date:    structure(floor(x), class="Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class=c("POSIXt","POSIXct"))

diff in days
  Date:    dt1-dt2
  chron:   dc1-dc2
  POSIXct: difftime(dp1, dp2, unit="day") [1]

time zone difference
  POSIXct: dp - as.POSIXct(format(dp), tz="GMT")

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt+1
  chron:   dc+1
  POSIXct: seq(dp0, length=2, by="DSTday")[2] [2]

previous day
  Date:    dt-1
  chron:   dc-1
  POSIXct: seq(dp0, length=2, by="-1 DSTday")[2] [3]

x days since date
  Date:    dt0+floor(x)
  chron:   dc0+x
  POSIXct: seq(dp0, length=2, by=paste(x,"DSTday"))[2] [4]

sequence
  Date:    dts <- seq(dt0, length=10, by="day")
  chron:   dcs <- seq(dc0, length=10)
  POSIXct: dps <- seq(dp0, length=10, by="DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz="GMT"), rnorm(10)) [5]

every 2nd week
  Date:    seq(dt0, length=3, by="2 week")
  chron:   dc0+seq(0, length=3, by=14)
  POSIXct: seq(dp0, length=3, by="2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc)-month.day.year(dc)$day+1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun=0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc)-3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output

yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Sat Jul 1, 1970
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input

"1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format="y-m-d")
  POSIXct: as.POSIXct(z)

"10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz="")
  to Date
    from chron:   as.Date(dates(dc))
    from POSIXct: as.Date(format(dp))
  to chron
    from Date:    chron(unclass(dt))
    from POSIXct: chron(format(dp,"%m/%d/%Y"), format(dp,"%H:%M:%S")) [6]
  to POSIXct
    from Date:    as.POSIXct(format(dt))
    from chron:   as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz="GMT")
  to Date
    from chron:   as.Date(dates(dc))
    from POSIXct: as.Date(dp)
  to chron
    from Date:    chron(unclass(dt))
    from POSIXct: as.chron(dp)
  to POSIXct
    from Date:    as.POSIXct(format(dt), tz="GMT")
    from chron:   as.POSIXct(dc)

[1] Can be reduced to dp1-dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp+24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp-24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp+24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz="GMT")).
Programmers' Niche
A simple class in S3 and S4
by Thomas Lumley
ROC curves
Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.
The ROC curve plots the true positive rate (sensitivity), Pr(T > c|D = 1), against the false positive rate (1 − specificity), Pr(T > c|D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).
I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.
Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±∞ and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.
drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type = "l")
}
It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.
There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number rather than just an object for which > and <= are defined.
drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  plot(mspec, sens, type = "l")
}
One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
Classes and methods
Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.
The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens = sens, mspec = mspec, test = TT,
               call = sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}
The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.
plot.ROC <- function(x, type = "b", null.line = TRUE,
                     xlab = "1-Specificity",
                     ylab = "Sensitivity",
                     main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
       xlab = xlab, ylab = ylab, ...)
  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}
A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:
lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}
A method for identify allows the cutpoint to be found for any point on the curve:
identify.ROC <- function(x, labels = NULL, ...,
                         digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}
An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
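One possible sketch of such a function (mine, not part of the article) integrates the empirical curve with the trapezoid rule; it assumes an object of class "ROC" as constructed above:

```r
AUC <- function(x) {
  # mspec and sens are the cumulative false and true positive rates;
  # sum trapezoid areas over successive false-positive-rate steps
  with(x, sum(diff(mspec) * (head(sens, -1) + tail(sens, -1))/2))
}
```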
Generalising the class
The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.
ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.
ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
S4 classes and methods
In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).
As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.
The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:
setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
    length(object@sens) == length(object@test)
  })
The first argument to setClass gives the name of the new class. The second describes the components ("slots") that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  new("ROC", sens = sens, mspec = mspec,
      test = TT, call = sys.call())
}
Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:
setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })
The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.
The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.
setMethod("plot", c("ROC", "missing"),
  function(x, y, type = "b", null.line = TRUE,
           xlab = "1-Specificity",
           ylab = "Sensitivity",
           main = NULL, ...) {
    par(pty = "s")
    plot(x@mspec, x@sens, type = type,
         xlab = xlab, ylab = ylab, ...)
    if (null.line)
      abline(0, 1, lty = 3)
    if (is.null(main))
      main <- x@call
    title(main = main)
  })
Creating a method for lines appears to work the same way:
setMethod("lines", "ROC",
  function(x, ...) {
    lines(x@mspec, x@sens, ...)
  })
In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before:
> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>
and after the setMethod call:
> lines
standardGeneric for "lines" defined from
package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x
shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object:
setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type = "response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far
               from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    new("ROC", sens = sens, mspec = mspec,
        test = TT, call = sys.call())
  })
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method:
setGeneric("identify", signature = c("x"))

setMethod("identify", "ROC",
  function(x, labels = NULL, ...,
           digits = 1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels = labels, ...)
  })
Discussion
Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.
This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R
by the R Core Team
New features in 1.9.1
• as.Date() now has a method for "POSIXlt" objects.
• mean() has a method for "difftime" objects, and so summary() works for such objects.
• legend() has a new argument pt.cex.
• plot.ts() has more arguments, particularly yax.flip.
• heatmap() has a new keep.dendro argument.
• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)
bull nls() now looks for variables and functions inits formula in the environment of the formulabefore the search path in the same way lm()etc look for variables in their formulae
User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.
Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

R News ISSN 1609-3631
Vol. 4/1, June 2004
Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook (see ?setHook).

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.
New features in 1.9.0
• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically those of get/assign to the environment with inherits = FALSE.
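A quick sketch of these semantics (written for illustration against a current R, not taken from the release notes):

```r
e <- new.env()
e$x <- 1          # equivalent to assign("x", 1, envir = e)
e[["y"]] <- 2     # only character indices; no partial matching
e$x               # 1
e[["y"]]          # 2
## Lookup behaves like get(..., inherits = FALSE): the enclosing
## environments of e are not searched.
exists("x", envir = e, inherits = FALSE)  # TRUE
```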
• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace() parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.
• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.
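For illustration (a sketch run against a current R rather than 1.9.0):

```r
d <- as.Date("2004-06-01")
d + 30                               # "2004-07-01": arithmetic is in days
format(d, "%d %B %Y")                # formatting, as for the date-time classes
seq(d, by = "week", length.out = 3)  # sequences of dates
```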
• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.
• New functions factorial() (defined as gamma(x+1)) and, for S-PLUS compatibility, lfactorial() (defined as lgamma(x+1)).
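For example (illustrative only):

```r
factorial(5)    # 120, computed as gamma(5 + 1)
lfactorial(10)  # log(10!), computed as lgamma(10 + 1)
## The log-scale version agrees with taking the log directly
all.equal(lfactorial(10), log(factorial(10)))  # TRUE
```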
• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.
• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).
• A new function mget() will retrieve multiple values from an environment.
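A small sketch of its use (illustrative, not from the release notes):

```r
e <- new.env()
assign("a", 1, envir = e)
assign("b", 2, envir = e)
## One call retrieves several bindings as a named list
mget(c("a", "b"), envir = e)   # list(a = 1, b = 2)
```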
• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)
• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).
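The compatibility changes above can be seen in a few one-liners (a sketch run against a current R, which retains these semantics):

```r
rep(numeric(0), length.out = 3)      # length 3 (NA-filled), not length 0
rep(1:2, each = 2, length.out = 6)   # 1 1 2 2 1 1: recycled, not NA-padded
rep(1:3, times = 2, length.out = 4)  # 1 2 3 1: times is ignored
```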
• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.
• strsplit() now has fixed and perl arguments, and split="" is optimized.
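The two new arguments in a quick sketch (illustrative only):

```r
## fixed = TRUE treats the split string literally, not as a regular expression
strsplit("a.b.c", ".", fixed = TRUE)[[1]]       # "a" "b" "c"
## perl = TRUE switches to PCRE regular expressions
strsplit("a1b22c", "[0-9]+", perl = TRUE)[[1]]  # "a" "b" "c"
```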
• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends(), getDepList() and installFoundDepends(). These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.
• Added as.list.environment to coerce environments to lists (efficiently).
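A short sketch of the coercion (illustrative, not from the release notes):

```r
e <- new.env()
e$a <- 1
e$b <- "two"
l <- as.list(e)   # dispatches to as.list.environment
sort(names(l))    # "a" "b": one list element per binding
```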
• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.
• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of a viewport tree (rather than just a viewport stack).

– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.

– Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of a viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use. NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

– Added just argument to grid.layout() to allow justification of the layout relative to the parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).
– Allowed the vp slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

  pushViewport(viewport(w=.5, h=.5, name="A"))
  grid.rect()
  pushViewport(viewport(w=.5, h=.5, name="B"))
  grid.rect(gp=gpar(col="grey"))
  upViewport(2)
  grid.rect(vp="A", gp=gpar(fill="red"))
  grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))
– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

  * grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

  * grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

  * the grobwidth and grobheight units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob "pointed to" is modified.

  In addition, the following features are now possible with grobs:

  * grobs now save()/load() like any normal R object.

  * many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

  * there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed, and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

  * there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

  * there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.
– Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for the viewport clip arg are now:

  "on" = clip to the viewport (was TRUE)
  "inherit" = clip to whatever parent says (was FALSE)
  "off" = turn off clipping

  Still accept logical values (and NA maps to "off").
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends().

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories, as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector), rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp(), and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.
• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(x, deriv=2) and psigamma(x, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.
• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blas=goto to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN
by Kurt Hornik

New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A and I criteria. Very large designs may be created. Experimental designs may be blocked, or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.
BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.
DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian multinomial probit model via Markov chain Monte Carlo. Along with the standard multinomial probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.
crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.
debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S functions (EVIS) by Alexander McNeil; R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an environment and software collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST requests. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-versus-within-chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrixes and data.frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers

rrcov Functions for robust location and scatter estimation and robust regression with high breakdown point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe algorithm using the method described in Schoenfeld, D. (2001), "A simple algorithm for designing group sequential clinical trials", Biometrics, 27, 972–974. By David A. Schoenfeld

setRNG Set reproducible random number generator in R and S. By Paul Gilbert

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha

supclust Methodology for supervised grouping of predictor variables. By Marcel Dettling and Martin Maechler

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree rings. By Marco Bascietto

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff

urn Functions for sampling without replacement (simulated urns). By Micah Altman

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns; ported to R by Nick Efthymiou

zoo A class with methods for totally ordered, indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck
Other changes
• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News
by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)
New supporting institutions
• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board: Douglas Bates and Paul Murrell

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage: http://www.R-project.org/

This newsletter is available online at http://CRAN.R-project.org/doc/Rnews/
Taking into account groups of individuals
In sites x species tables, rows correspond to sites, columns correspond to species, and the values are the number of individuals of species j found at site i. These tables can have many columns and cannot be used in a discriminant analysis. In this case, between-class analyses (between function) are a better alternative, and they can be used with any duality diagram. The between-class analysis of triplet (X, Q, D) for a given factor f is the analysis of the triplet (G, Q, Dw), where G is the table of the means of table X for the groups defined by f, and Dw is the diagonal matrix of group weights. For example, a between-class correspondence analysis (BCA) is very simply obtained after a correspondence analysis (CA):
> data(meaudret)
> coa1 <- dudi.coa(meaudret$fau, scannf = FALSE)
> bet1 <- between(coa1, meaudret$plan$sta,
+     scannf = FALSE)
> plot(bet1)
The meaudret$fau data frame is an ecological table with 24 rows corresponding to six sampling sites along a small French stream (the Meaudret). These six sampling sites were sampled four times (spring, summer, winter and autumn), hence the 24 rows. The 13 columns correspond to 13 ephemeroptera species. The CA of this data table is done with the dudi.coa function, giving the coa1 duality diagram. The corresponding between-class analysis is done with the between function, considering the sites as classes (meaudret$plan$sta is a factor defining the classes). Therefore, this is a between-sites analysis, whose aim is to discriminate the sites given the distribution of ephemeroptera species. This gives the bet1 duality diagram, and Figure 6 shows the graph obtained by plotting this object.
Figure 6: BCA plot. This is a composed plot, made of: 1- the species canonical weights (top left), 2- the species scores (middle left), 3- the eigenvalues bar chart (bottom left), 4- the plot of plain CA axes projected into BCA (bottom center), 5- the gravity centers of classes (bottom right), 6- the projection of the rows with ellipses and gravity center of classes (main graph).
Like between-class analyses, linear discriminant analysis (discrimin function) can be extended to any duality diagram (G, (X^t D X)^-, D_w), where (X^t D X)^- is a generalized inverse. This gives, for example, a correspondence discriminant analysis (Perrière et al. (1996), Perrière and Thioulouse (2002)), that can be computed with the discrimin.coa function.

Opposite to between-class analyses are within-class analyses, corresponding to diagrams (X - Y D_w^{-1} Y^t D X, Q, D) (within functions). These analyses extend to any type of variables the Multiple Group Principal Component Analysis (MGPCA, Thorpe (1983a), Thorpe (1983b), Thorpe and Leamy (1983)). Furthermore, the within.coa function introduces the double within-class correspondence analysis (also named internal correspondence analysis, Cazes et al. (1988)).
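The within-class idea can be sketched on the same meaudret data used above. This is only an illustrative sketch, not code from the article: it assumes the ade4 within function with the signature within(dudi, fac, scannf), and it assumes that meaudret$plan$dat is the factor coding the sampling dates (seasons).

```r
# Illustrative sketch (not from the article): a within-class CA that
# removes the between-date (seasonal) effect, leaving the site effect.
# Assumes meaudret$plan$dat is the season factor, as in ade4.
library(ade4)
data(meaudret)
coa1 <- dudi.coa(meaudret$fau, scannf = FALSE)
wit1 <- within(coa1, meaudret$plan$dat, scannf = FALSE)
plot(wit1)
```

The resulting duality diagram can then be plotted and tested exactly like a between-class analysis.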
Permutation tests
Permutation tests (also called Monte-Carlo tests, or randomization tests) can be used to assess the statistical significance of between-class analyses. Many permutation tests are available in the ade4 package, for example mantel.randtest, procuste.randtest, randtest.between, randtest.coinertia, RV.rtest, randtest.discrimin, and several of these tests are available both in the R (mantel.rtest) and in the C (mantel.randtest) programming language. The R version allows one to see how the computations are performed and to easily write other tests, while the C version is needed for performance reasons.
The statistical significance of the BCA can be evaluated with the randtest.between function. By default, 999 permutations are simulated, and the resulting object (test1) can be plotted (Figure 7). The p-value is highly significant, which confirms the existence of differences between sampling sites. The plot shows that the observed value is very far to the right of the histogram of simulated values.
> test1 <- randtest.between(bet1)
> test1
Monte-Carlo test
Observation: 0.4292
Call: randtest.between(xtest = bet1)
Based on 999 replicates
Simulated p-value: 0.001
> plot(test1, main = "Between class inertia")
Figure 7: Histogram of the 999 simulated values of the randomization test of the bet1 BCA. The observed value is given by the vertical line, at the right of the histogram.
Conclusion
We have described only the most basic functions of the ade4 package, considering only the simplest one-table data analysis methods. Many other dudi methods are available in ade4, for example multiple correspondence analysis (dudi.acm), fuzzy correspondence analysis (dudi.fca), analysis of a mixture of numeric variables and factors (dudi.mix), non-symmetric correspondence analysis (dudi.nsc), decentered correspondence analysis (dudi.dec).
We are preparing a second paper, dealing with two-tables coupling methods, among which canonical correspondence analysis and redundancy analysis are the most frequently used in ecology (Legendre and Legendre (1998)). The ade4 package proposes an alternative to these methods, based on the co-inertia criterion (Dray et al. (2003)).
The third category of data analysis methods available in ade4 are K-tables analysis methods, that try to extract the stable part in a series of tables. These methods come from the STATIS strategy (Lavit et al. (1994); statis and pta functions) or from the multiple coinertia strategy (mcoa function). The mfa and foucart functions perform two variants of K-tables analysis, and the STATICO method (function ktab.match2ktabs, Thioulouse et al. (2004)) allows one to extract the stable part of species-environment relationships, in time or in space.
Methods taking into account spatial constraints(multispati function) and phylogenetic constraints(phylog function) are under development
Bibliography
Cailliez, F. (1983). The analytical solution of the additive constant problem. Psychometrika, 48:305–310.

Cazes, P., Chessel, D., and Dolédec, S. (1988). L'analyse des correspondances internes d'un tableau partitionné: son usage en hydrobiologie. Revue de Statistique Appliquée, 36:39–54.

Chevenet, F., Dolédec, S., and Chessel, D. (1994). A fuzzy coding approach for the analysis of long-term ecological data. Freshwater Biology, 31:295–309.

Dolédec, S., Chessel, D., and Olivier, J. (1995). L'analyse des correspondances décentrée: application aux peuplements ichtyologiques du haut-Rhône. Bulletin Français de la Pêche et de la Pisciculture, 336:29–40.

Dray, S., Chessel, D., and Thioulouse, J. (2003). Co-inertia analysis and the linking of ecological tables. Ecology, 84:3078–3089.

Escoufier, Y. (1987). The duality diagramm: a means of better practical applications. In Legendre, P. and Legendre, L., editors, Development in numerical ecology, pages 139–156. NATO Advanced Institute, Serie G. Springer Verlag, Berlin.

Gower, J. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53:325–338.

Gower, J. and Legendre, P. (1986). Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 3:5–48.

Greenacre, M. (1984). Theory and applications of correspondence analysis. Academic Press, London.

Hill, M. and Smith, A. (1976). Principal component analysis of taxonomic data with multi-state discrete characters. Taxon, 25:249–255.
Kiers, H. (1994). Simple structure in component analysis techniques for mixtures of qualitative and quantitative variables. Psychometrika, 56:197–212.

Kroonenberg, P. and Lombardo, R. (1999). Non-symmetric correspondence analysis: a tool for analysing contingency tables with a dependence structure. Multivariate Behavioral Research, 34:367–396.

Lavit, C., Escoufier, Y., Sabatier, R., and Traissac, P. (1994). The ACT (STATIS method). Computational Statistics and Data Analysis, 18:97–119.

Legendre, P. and Anderson, M. (1999). Distance-based redundancy analysis: testing multispecies responses in multifactorial ecological experiments. Ecological Monographs, 69:1–24.

Legendre, P. and Legendre, L. (1998). Numerical ecology. Elsevier Science BV, Amsterdam, 2nd English edition.

Lingoes, J. (1971). Some boundary conditions for a monotone analysis of symmetric matrices. Psychometrika, 36:195–203.

Manly, B. F. J. (1991). Randomization and Monte Carlo methods in biology. Chapman and Hall, London.

Perrière, G., Combet, C., Penel, S., Blanchet, C., Thioulouse, J., Geourjon, C., Grassot, J., Charavay, C., Gouy, M., Duret, L., and Deléage, G. (2003). Integrated databanks access and sequence/structure analysis services at the PBIL. Nucleic Acids Research, 31:3393–3399.

Perrière, G., Lobry, J., and Thioulouse, J. (1996). Correspondence discriminant analysis: a multivariate method for comparing classes of protein and nucleic acid sequences. Computer Applications in the Biosciences, 12:519–524.

Perrière, G. and Thioulouse, J. (2002). Use of correspondence discriminant analysis to predict the subcellular location of bacterial proteins. Computer Methods and Programs in Biomedicine, in press.

Tenenhaus, M. and Young, F. (1985). An analysis and synthesis of multiple correspondence analysis, optimal scaling, dual scaling, homogeneity analysis and other methods for quantifying categorical multivariate data. Psychometrika, 50:91–119.

Thioulouse, J., Chessel, D., Dolédec, S., and Olivier, J. (1997). Ade-4: a multivariate analysis and graphical display software. Statistics and Computing, 7:75–83.

Thioulouse, J., Simier, M., and Chessel, D. (2004). Simultaneous analysis of a sequence of paired ecological tables. Ecology, 85:272–283.

Thorpe, R. S. (1983a). A biometric study of the effects of growth on the analysis of geographic variation: tooth number in green geckos (Reptilia: Phelsuma). Journal of Zoology, 201:13–26.

Thorpe, R. S. (1983b). A review of the numerical methods for recognising and analysing racial differentiation. In Felsenstein, J., editor, Numerical Taxonomy, Series G, pages 404–423. Springer-Verlag, New York.

Thorpe, R. S. and Leamy, L. (1983). Morphometric studies in inbred house mice (Mus sp.): multivariate analysis of size and shape. Journal of Zoology, 199:421–432.
qcc: An R package for quality control charting and statistical process control
by Luca Scrucca
Introduction
The qcc package for the R statistical environment allows the user to:

• plot Shewhart quality control charts for continuous, attribute and count data;

• plot Cusum and EWMA charts for continuous data;

• draw operating characteristic curves;

• perform process capability analyses;

• draw Pareto charts and cause-and-effect diagrams.
I started writing the package to provide the students in the class I teach a tool for learning the basic concepts of statistical quality control, at the introductory level of a book such as Montgomery (2000). Due to the intended final users, a free statistical environment was important, and R was a natural candidate.

The initial design, albeit written from scratch, reflected the set of functions available in S-Plus. Users of this last environment will find some similarities, but also differences. In particular, during the development I added more tools and re-designed most of the library to meet the requirements I thought were important to satisfy.
Creating a qcc object
A qcc object is the basic building block to start with. This can be simply created invoking the qcc function, which in its simplest form requires two arguments: a data frame, a matrix or a vector containing the observed data, and a string value specifying the control chart to be computed. Standard Shewhart control charts are available. These are often classified according to the type of quality characteristic that they are supposed to monitor (see Table 1).

Control charts for continuous variables are usually based on several samples with observations collected at different time points. Each sample or "rationale group" must be provided as a row of a data frame or a matrix. The function qcc.groups can be used to easily group a vector of data values based on a sample indicator. Unequal sample sizes are allowed.

Suppose we have extracted samples for a characteristic of interest from an ongoing production process:
> data(pistonrings)
> attach(pistonrings)
> dim(pistonrings)
[1] 200 3
> pistonrings
    diameter sample trial
1     74.030      1  TRUE
2     74.002      1  TRUE
3     74.019      1  TRUE
4     73.992      1  TRUE
5     74.008      1  TRUE
6     73.995      2  TRUE
7     73.992      2  TRUE
...
199   74.000     40 FALSE
200   74.020     40 FALSE
> diameter <- qcc.groups(diameter, sample)
> dim(diameter)
[1] 40 5
> diameter
     [,1]   [,2]   [,3]   [,4]   [,5]
1  74.030 74.002 74.019 73.992 74.008
2  73.995 73.992 74.001 74.011 74.004
...
40 74.010 74.005 74.029 74.000 74.020
Using the first 25 samples as training data, an X chart can be obtained as follows:

> obj <- qcc(diameter[1:25,], type="xbar")

By default a Shewhart chart is drawn (see Figure 1), and an object of class 'qcc' is returned. Summary statistics can be retrieved using the summary method (or simply by printing the object):
> summary(obj)
Call:
qcc(data = diameter[1:25, ], type = "xbar")

xbar chart for diameter[1:25, ]

Summary of group statistics:
  Min. 1st Qu. Median  Mean 3rd Qu.  Max.
 73.99   74.00  74.00 74.00   74.00 74.01

Group sample size: 5
Number of groups: 25
Center of group statistics: 74.00118
Standard deviation: 0.009887547

Control limits:
     LCL      UCL
73.98791 74.01444
type      Control charts for variables

xbar      X chart. Sample means are plotted to control the mean level of a continuous process variable.
xbar.one  X chart. Sample values from a one-at-time data process to control the mean level of a continuous process variable.
R         R chart. Sample ranges are plotted to control the variability of a continuous process variable.
S         S chart. Sample standard deviations are plotted to control the variability of a continuous process variable.

type      Control charts for attributes

p         p chart. The proportion of nonconforming units is plotted. Control limits are based on the binomial distribution.
np        np chart. The number of nonconforming units is plotted. Control limits are based on the binomial distribution.
c         c chart. The number of defectives per unit are plotted. This chart assumes that defects of the quality attribute are rare, and the control limits are computed based on the Poisson distribution.
u         u chart. The average number of defectives per unit is plotted. The Poisson distribution is used to compute control limits, but, unlike the c chart, this chart does not require a constant number of units.

Table 1: Shewhart control charts available in the qcc package.
The number of groups and their sizes are reported in this case, whereas a table is provided in the case of unequal sample sizes. Moreover, the center of group statistics (the overall mean for an X chart) and the within-group standard deviation of the process are returned.

The main goal of a control chart is to monitor a process. If special causes of variation are present, the process is said to be "out of control" and actions are to be taken to find, and possibly eliminate, such causes. A process is declared to be "in control" if all points charted lie randomly within the control limits. These are usually computed at ±3σs from the center. This default can be changed using the argument nsigmas (so called "american practice") or specifying the confidence level (so called "british practice") through the confidence.level argument.

Assuming a Gaussian distribution is appropriate for the statistic charted, the two practices are equivalent, since limits at ±3σs correspond to a two-tails probability of 0.0027 under a standard normal curve.
Instead of using standard calculations to compute the group statistics and the corresponding control limits (see Montgomery (2000), Wetherill and Brown (1991)), the center, the within-group standard deviation and the control limits may be specified directly by the user through the arguments center, std.dev and limits, respectively.
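These options can be sketched as follows. This is an illustrative sketch, not from the article: the specific argument values (2-sigma limits, 99% confidence level, and the user-supplied center and limits) are chosen only for demonstration.

```r
# Illustrative sketch of the qcc() arguments described above
# (argument values are made up for demonstration).
library(qcc)
data(pistonrings)
diameter <- qcc.groups(pistonrings$diameter, pistonrings$sample)

# "american practice": tighter 2-sigma control limits
q1 <- qcc(diameter[1:25, ], type = "xbar", nsigmas = 2)

# "british practice": limits from a 99% confidence level
q2 <- qcc(diameter[1:25, ], type = "xbar", confidence.level = 0.99)

# user-supplied center and control limits
q3 <- qcc(diameter[1:25, ], type = "xbar",
          center = 74, limits = c(73.99, 74.01))
```

Each call returns an object of class 'qcc' whose chart reflects the chosen limits.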
Figure 1: An X chart for samples data.
Shewhart control charts
A Shewhart chart is automatically plotted when an object of class 'qcc' is created, unless the qcc function is called with the argument plot=FALSE. The method responsible for plotting a control chart can, however, be invoked directly as follows:
gt plot(obj)
giving the plot in Figure 1. This control chart has the center line (horizontal solid line), the upper and lower control limits (dashed lines), and the sample group statistics are drawn as points connected with lines. Unless we provide the argument add.stats=FALSE, at the bottom of the plot some summary statistics are shown, together with the number of points beyond control limits and the number of violating runs (a run has length 5 by default).
Once a process is considered to be "in-control", we may use the computed limits for monitoring new data sampled from the same ongoing process. For example,
> obj <- qcc(diameter[1:25,], type="xbar",
+            newdata=diameter[26:40,])
plots the X chart for both calibration data and new data (see Figure 2), but all the statistics and the control limits are solely based on the first 25 samples.
Figure 2: An X chart with calibration data and new data samples.
If only the new data need to be plotted, we may provide the argument chart.all=FALSE in the call to the qcc function, or through the plotting method:

> plot(obj, chart.all=FALSE)
Many other arguments may be provided in the call to the qcc function and to the corresponding plotting method, and we refer the reader to the on-line documentation for details (help(qcc)).
As reported on Table 1, an X chart for one-at-time data may be obtained specifying type="xbar.one". In this case, the data charted are simply the sample values, and the within-group standard deviation is estimated by moving ranges of k (by default 2) points.
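A minimal sketch of this chart type (the measurement values below are hypothetical, invented for illustration only):

```r
# Illustrative sketch: individual-measurements chart (type = "xbar.one").
# The data vector is made up; the within-group standard deviation is
# estimated from moving ranges of 2 consecutive points by default.
library(qcc)
x <- c(33.75, 33.05, 34.00, 33.81, 33.46,
       34.02, 33.68, 33.27, 33.49, 33.20)  # hypothetical measurements
qcc(x, type = "xbar.one")
```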
Control charts for attributes mainly differ from the previous examples in that we need to provide sample sizes through the sizes argument (except for the c chart). For example, a p chart can be obtained as follows:
gt data(orangejuice)
gt attach(orangejuice)
> orangejuice
   sample  D size trial
1       1 12   50  TRUE
2       2 15   50  TRUE
3       3  8   50  TRUE
...
53     53  3   50 FALSE
54     54  5   50 FALSE
> obj2 <- qcc(D[trial], sizes=size[trial], type="p")
where we use the logical indicator variable trial toselect the first 30 samples as calibration data
Operating characteristic function
An operating characteristic (OC) curve provides information about the probability of not detecting a shift in the process. This is usually referred to as the type II error, that is, the probability of erroneously accepting a process as being "in control".
The OC curves can be easily obtained from an object of class 'qcc':

> par(mfrow=c(1,2))
> oc.curves(obj)
> oc.curves(obj2)
The function oc.curves invisibly returns a matrix or a vector of probability values for the type II error. More arguments are available (see help(oc.curves)); in particular, identify=TRUE allows one to interactively identify values on the plot.
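A short sketch of both features (assuming the obj object created earlier; the interactive call only makes sense at a live graphics device):

```r
# Sketch: capture the invisibly returned type II error probabilities,
# then re-draw the OC plot with interactive point identification.
library(qcc)
data(pistonrings)
diameter <- qcc.groups(pistonrings$diameter, pistonrings$sample)
obj <- qcc(diameter[1:25, ], type = "xbar", plot = FALSE)

beta <- oc.curves(obj)           # matrix of type II error probabilities
oc.curves(obj, identify = TRUE)  # click points to label them (interactive)
```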
Figure 3: Operating characteristic curves.
Cusum charts
Cusum charts display how the group summary statistics deviate above or below the process center or target value, relative to the standard errors of the summary statistics. They are useful to detect small and permanent variation on the mean of the process.

The basic cusum chart implemented in the qcc package is only available for continuous variables at the moment. Assuming a 'qcc' object has been created, we can simply input the following code:
gt cusum(obj)
and the resulting cusum chart is shown in Figure 4. This displays on a single plot both the positive deviations and the negative deviations from the target, which by default is set to the center of group statistics. If a different target for the process is required, this value may be provided through the argument target when creating the object of class 'qcc'. Two further aspects may need to be controlled. The argument decision.int can be used to specify the number of standard errors of the summary statistics at which the cumulative sum displays an out of control; by default it is set at 5. The amount of shift in the process to be detected by the cusum chart, measured in standard errors of the summary statistics, is controlled by the argument se.shift, by default set at 1.
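Those arguments can be sketched as follows. This is only an illustrative sketch: the argument values are made up, and the argument spellings (decision.int, se.shift) follow the article text, so they may differ in later versions of the package.

```r
# Illustrative sketch (values chosen for demonstration): a cusum chart
# with a tighter decision interval and sensitivity to smaller shifts.
library(qcc)
data(pistonrings)
diameter <- qcc.groups(pistonrings$diameter, pistonrings$sample)
obj <- qcc(diameter[1:25, ], type = "xbar", plot = FALSE)
cusum(obj, decision.int = 4, se.shift = 0.5)
```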
Figure 4: Cusum chart.
EWMA charts
Exponentially weighted moving average (EWMA) charts smooth a series of data based on a moving average with weights which decay exponentially. As for the cusum chart, also this plot can be used to detect small and permanent variation on the mean of the process.

The EWMA chart depends on the smoothing parameter λ, which controls the weighting scheme applied. By default it is set at 0.2, but it can be modified using the argument lambda. Assuming again that a 'qcc' object has been created, an EWMA chart can be obtained as follows:
gt ewma(obj)
and the resulting graph is shown in Figure 5.
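The lambda argument can be sketched as follows (the value 0.4 is chosen only for illustration, not taken from the article):

```r
# Illustrative sketch: an EWMA chart with a larger smoothing parameter,
# which weights recent observations more heavily than the 0.2 default.
library(qcc)
data(pistonrings)
diameter <- qcc.groups(pistonrings$diameter, pistonrings$sample)
obj <- qcc(diameter[1:25, ], type = "xbar", plot = FALSE)
ewma(obj, lambda = 0.4)
```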
Figure 5: EWMA chart.
Process capability analysis
Process capability indices for a characteristic of interest from a continuous process can be obtained through the function process.capability. This takes as mandatory arguments an object of class 'qcc' for a "xbar" or "xbar.one" type, and spec.limits, a vector specifying the lower (LSL) and upper (USL) specification limits. For example, using the previously created 'qcc' object, we may simply use:
> process.capability(obj,
+   spec.limits=c(73.95,74.05))

Process Capability Analysis

Call:
process.capability(object = obj,
    spec.limits = c(73.95, 74.05))

Number of obs = 125     Target = 74
Center = 74.00118       LSL = 73.95
StdDev = 0.009887547    USL = 74.05

Capability indices:

      Value    2.5%   97.5%
Cp    1.686   1.476   1.895
Cp_l  1.725   1.539   1.912
Cp_u  1.646   1.467   1.825
Cp_k  1.646   1.433   1.859
Cpm   1.674   1.465   1.882

Exp<LSL 0%   Obs<LSL 0%
Exp>USL 0%   Obs>USL 0%
The graph shown in Figure 6 is a histogram of data values, with vertical dotted lines for the upper and the lower specification limits. The target is shown as a vertical dashed line, and its value can be provided using the target argument in the call; if missing, the value from the 'qcc' object is used if available, otherwise the target is set at the middle value between the specification limits.
The dashed curve is a normal density for a Gaussian distribution matching the observed mean and standard deviation. The latter is estimated using the within-group standard deviation, unless a value is specified in the call using the std.dev argument.
The capability indices reported on the graph are also printed at console, together with confidence intervals computed at the level specified by the confidence.level argument (by default set at 0.95).
Figure 6: Process capability analysis.
A further display called ldquoprocess capability six-pack plotsrdquo is also available This is a graphical sum-mary formed by plotting on the same graphical de-vice
• an xbar chart
• an R or S chart (if sample sizes > 10)
• a run chart
• a histogram with specification limits
• a normal Q–Q plot
• a capability plot
As an example we may input the following code:

> process.capability.sixpack(obj,
+    spec.limits=c(73.95,74.05),
+    target=74.01)
Pareto charts and cause-and-effect diagrams
A Pareto chart is a barplot where the categories are ordered using frequencies in non-increasing order, with a line added to show their cumulative sum. Pareto charts are useful to identify those factors that have the greatest cumulative effect on the system, and thus screen out the less significant factors in an analysis.

R News ISSN 1609-3631
Vol. 4/1, June 2004
A simple Pareto chart may be drawn using the function pareto.chart, which in its simpler form requires a vector of values. For example:
> defect <- c(80, 27, 66, 94, 33)
> names(defect) <-
+   c("price code", "schedule date",
+     "supplier code", "contact num",
+     "part num")
> pareto.chart(defect)
The function returns a table of descriptive statistics and the Pareto chart shown in Figure 7. More arguments can be provided to control axes labels and limits, the main title and bar colors (see help(pareto.chart)).
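The descriptive statistics behind the chart are simply the frequencies sorted in decreasing order, together with their cumulative sums and percentages. A minimal sketch of that computation (the table layout here is illustrative, not qcc's exact output):

```r
## Build the Pareto table by hand: sort the frequencies in
## decreasing order, then add cumulative counts and percentages.
defect <- c("price code" = 80, "schedule date" = 27,
            "supplier code" = 66, "contact num" = 94,
            "part num" = 33)
freq <- sort(defect, decreasing = TRUE)
tab <- cbind(Frequency    = freq,
             Cum.Freq.    = cumsum(freq),
             Percentage   = 100 * freq / sum(freq),
             Cum.Percent. = 100 * cumsum(freq) / sum(freq))
round(tab, 2)
```

The cumulative percentage of the last category is necessarily 100, and the largest category ("contact num") drives the start of the cumulative line in Figure 7.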
Figure 7: Pareto chart
A cause-and-effect diagram, also called an Ishikawa diagram (or fish bone diagram), is used to associate multiple possible causes with a single effect. Thus, given a particular effect, the diagram is constructed to identify and organize the possible causes for it.
A very basic implementation of this type of diagram is available in the qcc package. The function cause.and.effect requires at least two arguments: cause, a list of causes and branches providing descriptive labels, and effect, a string describing the effect. Other arguments are available, such as cex and font, which control, respectively, the character expansions and the fonts used for labeling branches, causes and the effect. The following example creates the diagram shown in Figure 8.
> cause.and.effect(
+   cause=list(Measurements=c("Microscopes",
+                             "Inspectors"),
+              Materials=c("Alloys",
+                          "Suppliers"),
+              Personnel=c("Supervisors",
+                          "Operators"),
+              Environment=c("Condensation",
+                            "Moisture"),
+              Methods=c("Brake", "Engager",
+                        "Angle"),
+              Machines=c("Speed", "Bits",
+                         "Sockets")),
+   effect="Surface Flaws")
Figure 8: Cause-and-effect diagram
Tuning and extending the package
Options to control some aspects of the qcc package can be set using the function qcc.options. If this is called with no arguments it retrieves the list of available options and their current values; otherwise, it sets the required option. From a user's point of view, the most interesting options allow the user to set: the plotting character and the color used to highlight points beyond control limits or violating runs; the run length, i.e. the maximum value of a run before signalling a point as out of control; the background color used to draw the figure and the margin color; the character expansion used to draw the labels; the title and the statistics at the bottom of any plot. For details see help(qcc.options).
Another interesting feature is the possibility to easily extend the package by defining new control charts. Suppose we want to plot a p chart based on samples with different sizes. The resulting control chart will have non-constant control limits and thus may be difficult to read by some operators. One possible solution is to draw a standardized control chart. This can be accomplished by defining three new functions: one which computes and returns the group statistics to be charted (i.e. the z scores) and their center (zero in this case), another one which returns the within-group standard deviation (one in this case), and finally a function which returns the control limits. These functions need to be
called stats.type, sd.type and limits.type, respectively, for a new chart of type "type". The following code may be used to define a standardized p chart:
stats.p.std <- function(data, sizes)
{
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data)/sum(sizes)
  z <- (data/sizes - pbar)/sqrt(pbar*(1-pbar)/sizes)
  list(statistics = z, center = 0)
}

sd.p.std <- function(data, sizes) return(1)

limits.p.std <- function(center, std.dev, sizes, conf)
{
  if (conf >= 1) { lcl <- -conf
                   ucl <- +conf }
  else
  { if (conf > 0 & conf < 1)
    { nsigmas <- qnorm(1 - (1 - conf)/2)
      lcl <- -nsigmas
      ucl <- +nsigmas }
    else stop("invalid conf argument") }
  limits <- matrix(c(lcl, ucl), ncol = 2)
  rownames(limits) <- rep("", length = nrow(limits))
  colnames(limits) <- c("LCL", "UCL")
  return(limits)
}
Then, we may source the above code and obtain the control charts in Figure 9 as follows:
# set unequal sample sizes
> n <- c(rep(50,5), rep(100,5), rep(25,5))
# generate randomly the number of successes
> x <- rbinom(length(n), n, 0.2)
> par(mfrow=c(1,2))
# plot the control chart with variable limits
> qcc(x, type="p", sizes=n)
# plot the standardized control chart
> qcc(x, type="p.std", sizes=n)
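As a plain-R sanity check of the standardization (no qcc functions needed): if every sample shows exactly the pooled proportion of successes, the standardized statistics are all zero, i.e. every point sits on the chart's center line. The sample sizes below are illustrative:

```r
## Each sample has exactly the pooled proportion pbar of successes,
## so the z scores computed as in stats.p.std must all be zero.
sizes <- c(50, 100, 25)
pbar  <- 0.2
data  <- sizes * pbar                  # 10, 20 and 5 successes
z <- (data/sizes - pbar) / sqrt(pbar * (1 - pbar) / sizes)
stopifnot(all(abs(z) < 1e-12))         # all points on the center line
```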
Figure 9: A p chart with non-constant control limits (left panel) and the corresponding standardized control chart (right panel)
Summary
In this paper we briefly describe the qcc package. This provides functions to draw basic Shewhart quality control charts for continuous, attribute and count data; corresponding operating characteristic curves are also implemented. Other statistical quality tools available are Cusum and EWMA charts for continuous data, process capability analyses, Pareto charts and cause-and-effect diagrams.
References
Montgomery, D.C. (2000) Introduction to Statistical Quality Control, 4th ed. New York: John Wiley & Sons.

Wetherill, G.B. and Brown, D.W. (1991) Statistical Process Control. New York: Chapman & Hall.
Luca Scrucca
Dipartimento di Scienze Statistiche, Università degli Studi di Perugia, Italy
luca@stat.unipg.it
Least Squares Calculations in R
Timing different approaches
by Douglas Bates
Introduction
The S language encourages users to become programmers as they express new ideas and techniques of data analysis in S. Frequently the calculations in these techniques are expressed in terms of matrix operations. The subject of numerical linear algebra – how calculations with matrices can be carried out accurately and efficiently – is quite different from what many of us learn in linear algebra courses or in courses on linear statistical methods.

Numerical linear algebra is primarily based on decompositions of matrices, such as the LU decomposition, the QR decomposition, and the Cholesky decomposition, that are rarely discussed in linear algebra or statistics courses.
In this article we discuss one of the most common operations in statistical computing, calculating least squares estimates. We can write the problem mathematically as

    β̂ = arg min_β ‖y − Xβ‖²     (1)

where X is an n × p model matrix (p ≤ n), y is n-dimensional and β is p-dimensional. Most statistics texts state that the solution to (1) is

    β̂ = (X′X)⁻¹X′y     (2)

when X has full column rank (i.e. the columns of X are linearly independent).
Indeed (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way. That is, they use code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.
General principles
A few general principles of numerical linear algebra are:
1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.

2. Use crossprod(X), not t(X) %*% X, to calculate X′X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X′y.

3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X′X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
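Rules 1 and 2 are easy to check on a small simulated example (the model matrix and response below are arbitrary, generated only for illustration); both forms must return identical estimates:

```r
## Compare the naive normal-equations code with the recommended
## solve(crossprod(X), crossprod(X, y)) form on toy data.
set.seed(1)
X <- cbind(1, matrix(rnorm(100 * 3), 100, 3))  # 100 x 4 model matrix
y <- rnorm(100)
naive <- solve(t(X) %*% X) %*% t(X) %*% y      # inverts X'X explicitly
cpod  <- solve(crossprod(X), crossprod(X, y))  # solves the system directly
stopifnot(isTRUE(all.equal(naive, cpod)))      # same estimates
```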
Some timings on a large example
For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well.
> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850  712
> sysgc.time(naive.sol <- solve(t(mmm) %*%
+     mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00
(The function sysgc.time is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)
According to the principles above, it should be more effective to compute
> sysgc.time(cpod.sol <- solve(crossprod(mmm),
+     crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE
Timing results will vary between computers but in most cases the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X′X the naive way
> sysgc.time(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00
because the crossprod function applied to a singlematrix takes advantage of symmetry when calculat-ing the product
However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result the solution of (1) is performed using a general linear system solver based on an LU decomposition when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly but the code is awkward.
> sysgc.time(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sysgc.time(chol.sol <- backsolve(ch,
+     forwardsolve(ch, crossprod(mmm, y),
+         upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE
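The forwardsolve/backsolve pattern can be verified on a small simulated problem (toy data, not the article's mm matrix), comparing the Cholesky solution against R's QR-based least squares:

```r
## Solve the normal equations X'X b = X'y via the Cholesky factor:
## crossprod(X) = R'R with R upper triangular, so first solve
## R'u = X'y (forwardsolve with transpose), then R b = u (backsolve).
set.seed(2)
X <- cbind(1, matrix(rnorm(200 * 4), 200, 4))
y <- rnorm(200)
ch <- chol(crossprod(X))
chol.sol <- backsolve(ch,
    forwardsolve(ch, crossprod(X, y),
                 upper = TRUE, trans = TRUE))
qr.sol <- qr.coef(qr(X), y)   # reference answer from a QR decomposition
stopifnot(isTRUE(all.equal(drop(chol.sol), qr.sol,
                           check.attributes = FALSE)))
```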
Least squares calculations with Matrix classes
The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition.
> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sysgc.time(Mat.sol <- solve(crossprod(mmg),
+     crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE
Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object so that it can be reused without recalculation.
> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sysgc.time(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sysgc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all.
> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sysgc.time(sparse.sol <- solve(crossprod(mm),
+     crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE
The model matrix mm has class "cscMatrix", indicating that it is a compressed, sparse, column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric, sparse, column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful, the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case the decomposition is so fast that it is difficult to determine the difference in the solution times.
> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sysgc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sysgc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
Epilogue
Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS), such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or Atlas (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.
Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.
Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A′B are generally faster than those in AB.
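The column-major convention itself is easy to check from R (a small illustration, not part of the original timings):

```r
## matrix() fills column by column, so consecutive values of 1:4
## run down the first column before starting the second.
m <- matrix(1:4, nrow = 2)
m
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
stopifnot(m[2, 1] == 2, m[1, 2] == 3)
```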
Notice that this means that calculating A′B in R as t(A) %*% B instead of crossprod(A, B) causes A to be transposed twice: once in the calculation of t(A) and a second time in the inner loop of the matrix
product. The crossprod function does not do any transposition of matrices – yet another reason to use crossprod.
Bibliography
J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).
D.M. Bates
U. of Wisconsin-Madison
bates@wisc.edu
Tools for interactively exploring R packages

by Jianhua Zhang and Robert Gentleman
Introduction
An R package has the required and perhaps additional subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer and they are intended for exploring packages, examples, and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.
Description
Our current implementation is in the form of tcl/tk widgets, and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcl/tk. Once both tkWidgets and widgetTools have been loaded into the working session, the user has full access to these interactive tools.
pExplorer
pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from showing by the widget. The default behavior is to open the first package in the library of locally installed R (.libPaths()) with subdirectories/files named "Meta" and "latex" excluded, if nothing is provided by a user.
Figure 1 shows the widget invoked by typing pExplorer("base").
The following tasks can be performed through the interface:
• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

• Clicking the Browse button allows users to select a directory containing R packages, achieving the same results as typing in a new path.

• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.
Figure 1: A snapshot of the widget when pExplorer is called
• Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, to be discussed next.
eExplorer
eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.
Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.
vExplorer
vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionalities provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of vignettes of the package.

Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget, as shown by Figure 4, appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:
• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the pdf version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in Results of Execution.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.
The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.
References
Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28-31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21-24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.
Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base")
Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer()
Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked
The survival package

by Thomas Lumley
This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).
Overview
Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).
Specifying survival data
In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.

In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.
> library(survival)
> data(veteran)  ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+
## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
+       diagtime*30+time, status))
 [1] ( 210, 282 ]  ( 150, 561 ]
 [3] (  90, 318 ]  ( 270, 396 ]
 [9] ( 540, 854 ]  ( 180, 280+]
The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.
One and two sample summaries
The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.
> data(veteran)
> plot(survfit(Surv(time, status) ~ trt, data=veteran),
+      xlab="Years since randomisation",
+      xscale=365, ylab="% surviving", yscale=100,
+      col=c("forestgreen", "blue"))
> survdiff(Surv(time, status) ~ trt, data=veteran)

Call:
survdiff(formula = Surv(time, status) ~ trt,
    data = veteran)

       N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394

      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928
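The printed p-value is just the upper tail of a chi-squared distribution on 1 degree of freedom evaluated at the test statistic; it can be reproduced from the (O-E)^2/V entry shown above:

```r
## Reproduce the logrank p-value from the printed statistic.
## 0.00823 is the (O-E)^2/V value from the survdiff output.
p <- 1 - pchisq(0.00823, df = 1)
round(p, 3)
## [1] 0.928
```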
Figure 1: Survival distributions for two lung cancer treatments
Proportional hazards models
The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.
Writing h(t, z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

    log h(t, z) = log h0(t) + βz.
Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.
A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.
> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
+        log(bili) + log(protime) + age + platelet,
+        data=pbc, subset=trt>0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
    log(bili) + log(protime) + age + platelet,
    data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999

              se(coef)     z       p
edtrt         0.300321  3.43 0.00061
log(bili)     0.097771  9.73 0.00000
log(protime)  1.031908  2.80 0.00520
age           0.008489  4.18 0.00003
platelet      0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312
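The exp(coef) column is simply the exponentiated coefficient, interpretable as a hazard ratio (the multiplicative effect on the hazard of a one-unit increase in the predictor). It can be checked directly from the coef column above:

```r
## Recompute the hazard ratios from the printed coefficients.
coefs <- c(edtrt = 1.02980, "log(bili)" = 0.95100,
           "log(protime)" = 2.88544)
hr <- exp(coefs)
round(hr, 2)   # approximately 2.80, 2.59, 17.91, as printed
```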
The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.
> plot(survfit(Surv(time, status) ~ edtrt,
+      data=pbc, subset=trt==-9))
> lines(survexp(~ edtrt +
+      ratetable(edtrt=edtrt, bili=bili,
+          platelet=platelet, age=age,
+          protime=protime),
+      data=pbc, subset=trt==-9,
+      ratetable=mayomodel, cohort=TRUE),
+      col="purple")
The ratetable function in the model formula wrapsthe variables that are used to match the new sampleto the old model
Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.
Figure 2: Observed and predicted survival
The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The cox.zph function provides formal tests and graphical diagnostics for proportional hazards.
> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796
## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])
Figure 3: Testing proportional hazards
There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.
Additional features
Computation for the Cox model extends straightforwardly to situations where z can vary over time, and to situations where an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term, indicating which records belong to the same individual, and a strata() term, indicating subgroups that get their own unspecified baseline hazard function.

Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.

More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of follow-up.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
useR! 2004: The R User Conference

by John Fox
McMaster University, Canada
More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference – useR! 2004 – which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.

The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations – from bioinformatics to user interfaces, and finance to fights among lizards – reflects the wide range of applications supported by R.

A particularly interesting component of the program were keynote addresses by members of the R core team, aimed primarily at improving programming skills, and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference – the close relationship in R between use and development. The keynote addresses included:
• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice
• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on datamining and related topics
Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.
An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.
The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19 and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.
Everyone with whom I spoke, both at and after useR 2004, was very enthusiastic about the conference. We look forward to an even larger useR in 2006.
R Help Desk
Date and Time Classes in R
Gabor Grothendieck and Thomas Petzoldt
Introduction
R-1.9.0 and its contributed packages have a number of date/time (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:
• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).
• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Date/times in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).
• POSIX classes. POSIX classes refer to the two classes POSIXct and POSIXlt and their common super class POSIXt. These support times and dates, including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases, and other applications that involve external communication with systems that also support dates, times, time zones, and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct date/times are represented as seconds since January 1, 1970, GMT, while POSIXlt date/times are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIX classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX date/time objects using unclass(x), where x is of any POSIX date/time class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
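The differences in internal representation described above can be checked directly with unclass(). A minimal base-R sketch (the chron line is left as a comment, since chron is a contributed package that may not be installed):

```r
# Date: days since 1970-01-01
d <- as.Date("1970-01-03")
unclass(d)                                  # 2

# POSIXct: seconds since 1970-01-01 00:00:00 GMT
p <- as.POSIXct("1970-01-02 12:00:00", tz = "GMT")
unclass(p)                                  # 129600, i.e. 1.5 days

# chron: days, with times as fractions of a day
# library(chron); unclass(chron(1.5))       # noon on January 2nd, 1970
```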
A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three date/time systems above.
Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent date/times as years plus fraction of a year. The date and pastecs packages are not discussed in this article.
Other Applications
Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent date/times as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word "usually" was employed above.
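A worked sketch of the Windows Excel conversion just described (the serial number 38005 is a hypothetical example value, not from the article):

```r
x <- c(38005, 38005.25)           # Excel (Windows) date/time serial numbers
as.Date("1899-12-30") + floor(x)  # both elements give "2004-01-19"
```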
SPSS uses October 14, 1582 as the origin, thereby representing date/times as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such date/times automatically (Alzola and Harrell, 2002).
Time Series
A common use of dates and date/times is in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its), and zoo (in package zoo). ts uses a scheme of its own for dates. irts and its use POSIXct dates to represent date/times. zoo can use any date or time/date package to represent date/times.
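To illustrate ts's own scheme: a regular monthly series stores only a start and a frequency, and time() returns year fractions rather than objects of a date class. A minimal sketch:

```r
z <- ts(1:24, start = c(2000, 1), frequency = 12)  # Jan 2000 .. Dec 2001
time(z)[1:3]                  # 2000.000 2000.083 2000.167, year fractions
window(z, start = c(2000, 7), end = c(2000, 9))    # the Jul-Sep 2000 stretch
```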
Input
By default, read.table will read in numeric data, such as 20040115, as numbers and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹
# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y%m%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)
Avoiding Errors
The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.
With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:
# default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))
# if TRUE, abbreviates year to 2 digits
options(chron.year.abb = TRUE)
# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)
# if TRUE, then year not displayed
options(chron.simplify = FALSE)
For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion, there is really no reason to have to resort to non-standard origins. For example,
¹This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.
orig <- chron("01/01/60")
x <- 0:9         # days since 01/01/60
# chron dates with default origin
orig + x
Regarding POSIX classes, the user should be aware of the following:
• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year, and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:
dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)
This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the date/times as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.
• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with date/times. In some cases the R code has compensated for OS bugs, but in general, caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.
• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt date/times have an isdst component that indicates whether it is daylight savings time (isdst = 1) or not (isdst = 0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Date/times which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.
• conversion. Conversion between different date/time classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
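The "convert to character first" advice can be sketched as follows: going through format() fixes the wall-clock reading, so the conversion cannot silently shift by a time zone offset.

```r
p  <- as.POSIXct("2004-06-15 12:00:00")  # current time zone
# character first, then the target class:
lt <- as.POSIXlt(format(p))              # same wall-clock reading as p
d  <- as.Date(format(p))                 # "2004-06-15"
format(lt) == format(p)                  # TRUE
```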
Comparison Table
Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes, and to translate commands among the classes.
Bibliography
Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.
James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.
Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2):8-11. URL http://CRAN.R-project.org/doc/Rnews/.
Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de
In the following, each phrase is given with its idiom in each of the three systems (Date, chron, POSIXct).

now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz = "GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class = "Date")
  chron:   chron(0)
  POSIXct: structure(0, class = c("POSIXt", "POSIXct"))

x days since origin (GMT)
  Date:    structure(floor(x), class = "Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class = c("POSIXt", "POSIXct"))

diff in days
  Date:    dt1 - dt2
  chron:   dc1 - dc2
  POSIXct: difftime(dp1, dp2, unit = "day") [1]

time zone difference
  POSIXct: dp - as.POSIXct(format(dp, tz = "GMT"))

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt + 1
  chron:   dc + 1
  POSIXct: seq(dp0, length = 2, by = "DSTday")[2] [2]

previous day
  Date:    dt - 1
  chron:   dc - 1
  POSIXct: seq(dp0, length = 2, by = "-1 DSTday")[2] [3]

x days since date
  Date:    dt0 + floor(x)
  chron:   dc0 + x
  POSIXct: seq(dp0, length = 2, by = paste(x, "DSTday"))[2] [4]

sequence
  Date:    dts <- seq(dt0, length = 10, by = "day")
  chron:   dcs <- seq(dc0, length = 10)
  POSIXct: dps <- seq(dp0, length = 10, by = "DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz = "GMT"), rnorm(10)) [5]

every 2nd week
  Date:    seq(dt0, length = 3, by = "2 week")
  chron:   dc0 + seq(0, length = 3, by = 14)
  POSIXct: seq(dp0, length = 3, by = "2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc) - month.day.year(dc)$day + 1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun = 0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc) - 3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean for each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output: yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output: mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output: Sat Jul 1, 1970
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input: "1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format = "y-m-d")
  POSIXct: as.POSIXct(z)

Input: "10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz = "")
  to Date:    as.Date(dates(dc));  as.Date(format(dp))
  to chron:   chron(unclass(dt));  chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]
  to POSIXct: as.POSIXct(format(dt));  as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz = "GMT")
  to Date:    as.Date(dates(dc));  as.Date(dp)
  to chron:   chron(unclass(dt));  as.chron(dp)
  to POSIXct: as.POSIXct(format(dt), tz = "GMT");  as.POSIXct(dc)

Notes: z is a vector of characters and x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See strptime for the % codes.

[1] Can be reduced to dp1-dp2 if it is acceptable that date/times closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp+24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp-24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp+24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT date/times rather than current date/times suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz = "GMT")).

Table 1: Comparison Table
Programmers' Niche
A simple class in S3 and S4
by Thomas Lumley
ROC curves
Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.
The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1-specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).
I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.
Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±∞ and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.
drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
    function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
    function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type = "l")
}
It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.
There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.
drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  plot(mspec, sens, type = "l")
}
One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
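A small worked check of the table(-T, D) device, using toy vectors for T and D: sorting on -T makes cumsum() accumulate counts from the highest cutpoint downwards, so both rates are non-decreasing.

```r
T <- c(1, 2, 2, 3, 4)    # toy test results
D <- c(0, 0, 1, 1, 1)    # toy disease indicator
DD <- table(-T, D)       # rows run from the largest value of T downwards
sens  <- cumsum(DD[, 2]) / sum(DD[, 2])  # 1/3 2/3 1 1
mspec <- cumsum(DD[, 1]) / sum(DD[, 1])  # 0 0 1/2 1
```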
Classes and methods
Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.
The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec, test = TT,
               call = sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}
The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.
plot.ROC <- function(x, type = "b", null.line = TRUE,
                     xlab = "1-Specificity", ylab = "Sensitivity",
                     main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
       xlab = xlab, ylab = ylab, ...)
  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}
A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:
lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}
A method for identify allows the cutpoint to be found for any point on the curve:
identify.ROC <- function(x, labels = NULL, ...,
                         digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}
An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
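For readers who want to check their answer, here is one possible solution (a sketch, not from the original article): a trapezoidal-rule AUC for the list-based ROC objects above, after prepending the (0, 0) point that the cutpoints omit.

```r
# Trapezoidal-rule area under the curve of an ROC object.
# Assumes x has $sens and $mspec components as built by ROC() above.
AUC <- function(x) {
  sens  <- c(0, x$sens)    # add the (0, 0) corner of the curve
  mspec <- c(0, x$mspec)
  n <- length(sens)
  sum(diff(mspec) * (sens[-1] + sens[-n]) / 2)
}
```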
Generalising the class
The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.
ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.
ROC.glm <- function(T) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
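To see the recovery trick inside the method at work, here is a self-contained sketch with simulated data (the model and seed are purely illustrative, not from the article): fitted values plus response residuals reconstruct the 0/1 outcome, so disease is essentially integer-valued.

```r
set.seed(42)                        # illustrative simulation
xx  <- rnorm(200)
y   <- rbinom(200, 1, plogis(xx))
fit <- glm(y ~ xx, family = binomial)
# as in ROC.glm: fitted + response residuals, scaled by prior weights
disease <- (fitted(fit) + resid(fit, type = "response")) * weights(fit)
max(abs(disease - y))               # effectively 0: disease recovers y
```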
S4 classes and methods
In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).
As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.
The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:
setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })
The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  new("ROC", sens = sens, mspec = mspec,
      test = TT, call = sys.call())
}
Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:
setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })
The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.
The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.
setMethod("plot", c("ROC", "missing"),
  function(x, y, type = "b", null.line = TRUE,
           xlab = "1-Specificity",
           ylab = "Sensitivity",
           main = NULL, ...) {
    par(pty = "s")
    plot(x@mspec, x@sens, type = type,
         xlab = xlab, ylab = ylab, ...)
    if (null.line)
      abline(0, 1, lty = 3)
    if (is.null(main))
      main <- x@call
    title(main = main)
  })
Creating a method for lines appears to work the same way:
setMethod("lines", "ROC",
  function(x, ...) {
    lines(x@mspec, x@sens, ...)
  })
In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before
> lines
function (x, ...)
UseMethod("lines")
&lt;environment: namespace:graphics&gt;
and after the setMethod call
> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
&lt;environment: 0x2fc68a8&gt;
Methods may be defined for arguments: x

shows what happens. Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function, and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.
setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type = "response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    new("ROC", sens = sens, mspec = mspec,
        test = TT, call = sys.call())
  })
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.
setGeneric("identify", signature = c("x"))

setMethod("identify", "ROC",
  function(x, labels = NULL, ...,
           digits = 1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels = labels, ...)
  })
Discussion
Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.
This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R
by the R Core Team
New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects, and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.
User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.
Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics), or called as graphics::ps.options, or, better, set as a hook (see ?setHook).

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.
New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name, and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument, package, which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix, as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded, and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame, in the same way as subset and weights.
• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.
The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.
• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)
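An illustrative sketch (not from the original entry):

```r
head(letters, 3)   # "a" "b" "c"
tail(letters, 2)   # "y" "z"
head(mtcars)       # first six rows of a data frame
```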
• legend() has a new argument 'text.col'.
• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).
• A new function mget() will retrieve multiple values from an environment.
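For example (a minimal sketch, not part of the original entry):

```r
e <- new.env()
assign("a", 1, envir = e)
assign("b", 2, envir = e)
mget(c("a", "b"), envir = e)   # list(a = 1, b = 2)
```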
• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".
• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.
• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.
• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)
It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.
• ppr() now fully supports categorical explanatory variables.
ppr() is now interruptible at suitable places in the underlying FORTRAN code.
• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.
• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.
• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.
• read.table() now allows embedded newlines in quoted fields. (PR#4555)
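A small sketch of the new behaviour (illustrative, not from the original entry):

```r
tf <- tempfile()
writeLines(c('x,y', '"first\nsecond",1'), tf)   # the quoted field spans two lines
df <- read.csv(tf)
nrow(df)                                        # 1: the newline stays inside the field
```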
• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.
If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.
If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)
The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).
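The three compatibility changes above can be sketched as (illustrative, not from the original entry):

```r
rep(integer(0), length.out = 3)       # NA NA NA  (length 3, not length 0)
rep(1:2, each = 2, length.out = 6)    # 1 1 2 2 1 1  (recycled, not NA-padded)
rep(1:3, times = 2, length.out = 4)   # 1 2 3 1  (times is ignored)
```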
• rgb2hsv() is new, an R interface to the C API function with the same name.
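For example (a minimal sketch, not part of the original entry):

```r
rgb2hsv(255, 0, 0)   # pure red: h = 0, s = 1, v = 1
```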
• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.
• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.
• New function shQuote() to quote strings to be passed to OS shells.
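An illustrative sketch (not from the original entry); the default is Bourne-shell quoting:

```r
shQuote("my data file.txt")   # "'my data file.txt'"
shQuote("don't")              # embedded single quotes are escaped safely
```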
• sink() now has a split= argument to direct output to both the sink and the current output connection.
• split.screen() now works for multiple devices at once.
• On some OSes (including Windows and those using glibc), strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.
• strsplit() now has fixed and perl arguments and split = "" is optimized.
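For illustration (a sketch, not part of the original entry):

```r
strsplit("2004.06.01", ".", fixed = TRUE)[[1]]   # "2004" "06" "01"
strsplit("abc", "")[[1]]                         # "a" "b" "c"  (the optimized case)
```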
• subset() now allows a drop argument which is passed on to the indexing method for data frames.
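For example (an illustrative sketch, not from the original entry):

```r
df <- data.frame(x = 1:4, g = c("a", "a", "b", "b"))
subset(df, g == "a", select = x)                # a one-column data frame
subset(df, g == "a", select = x, drop = TRUE)   # the bare vector 1 2
```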
• termplot() has an option to smooth the partial residuals.
• varimax() and promax() add class "loadings" to their loadings component.
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)
• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.
• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".
• New functions in the tools package: pkgDepends(), getDepList() and installFoundDepends(). These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).
• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.
• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().
• Package stats4 exports S4 generics for AIC() and BIC().
• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.
• Added experimental support for conditionals in NAMESPACE files.
• Added as.list.environment to coerce environments to lists (efficiently).
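A quick sketch (illustrative, not part of the original entry); note that the element order of the result is not guaranteed:

```r
e <- new.env()
e$a <- 1
e$b <- "two"
l <- as.list(e)
l[order(names(l))]   # $a = 1, $b = "two"
```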
• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)
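For example (an illustrative sketch, not from the original entry):

```r
tab <- table(gender = c("f", "f", "m"), seen = c("yes", "no", "yes"))
addmargins(tab)   # appends a "Sum" row and a "Sum" column
```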
• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.
• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.
• Changes to package 'grid':
– Renamed push/pop.viewport() to push/popViewport().
– Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).
– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.
– Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport bundles that may be pushed at once (lists are pushed in parallel, stacks in series).
current.vpTree() returns the current viewport tree.
– Added vpPath() to allow specification of viewport path in downViewport() and seekViewport().
See ?viewports for an example of its use.
NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator :: in later versions).
– Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).
– Allowed the vp slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:
pushViewport(viewport(w=.5, h=.5, name="A"))
grid.rect()
pushViewport(viewport(w=.5, h=.5, name="B"))
grid.rect(gp=gpar(col="grey"))
upViewport(2)
grid.rect(vp="A", gp=gpar(fill="red"))
grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))
– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html
– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:
* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).
* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)
* the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.
In addition, the following features are now possible with grobs:
* grobs now save()/load() like any normal R object.
* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).
* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.
* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.
* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.
– Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).
– Allowed viewports to turn off clipping altogether. Possible settings for viewport clip arg are now:
"on" = clip to the viewport (was TRUE)
"inherit" = clip to whatever parent says (was FALSE)
"off" = turn off clipping
Still accept logical values (and NA maps to "off").
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.
• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.
• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.
• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.
• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.
• Vignette dependencies can now be checked and obtained via vignetteDepends().
• Option "repositories" to list URLs for package repositories added.
• package.description() has been replaced by packageDescription().
• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.
• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.
• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.
• hsv2rgb and rgb2hsv are newly in the C API.
• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.
• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.
• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.
• print.coefmat() is defunct, replaced by printCoefmat().
• codes() and codes<-() are defunct.
• anovalist.lm (replaced in 1.2.0) is now defunct.
• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.
• print.atomic() is defunct.
• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).
• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.
• La.eigen() is deprecated; now eigen() uses LAPACK by default.
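So plain eigen() now goes through LAPACK; an illustrative sketch (not part of the original entry):

```r
m <- matrix(c(2, 0, 0, 3), 2)
eigen(m)$values   # 3 2, in decreasing order, computed via LAPACK
```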
• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(x, deriv=2) and psigamma(x, deriv=3).
• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.
• package.contents() and package.description() have been deprecated.
• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.
The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.
The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.
• configure no longer lists bzip2 and PCRE as 'additional capabilities' as all builds of R have had them since 1.7.0.
• --with-blasgoto= to use K. Goto's optimized BLAS will now work.
The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN
by Kurt Hornik
New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination
follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.
DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian multinomial probit models via Markov chain Monte Carlo. Along with the standard multinomial probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: "The use of the language interface of R: two examples for modelling water flux and solute transport". By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check
given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. Energy-statistics concept based on a generalization of Newton's potential energy is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. By S original (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis; fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, and some statistics in linkage analysis and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
kernlab Kernel-based machine learning methods including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between versus within chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal and external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen, ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972-974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS 2002, 99, pg. 15869 and J. Comput. Chem. 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns, ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.
Other changes
• Packages mclust1998, netCDF, and serialize were moved from the main CRAN section to the Archive.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News
by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)
• Emanuele De Rinaldis (Italy)
• Norm Phillips (Canada)
New supporting institutions
• Breast Center at Baylor College of Medicine, Houston, Texas, USA
• Dana-Farber Cancer Institute, Boston, USA
• Department of Biostatistics, Vanderbilt University School of Medicine, USA
• Department of Statistics & Actuarial Science, University of Iowa, USA
• Division of Biostatistics, University of California, Berkeley, USA
• Norwegian Institute of Marine Research, Bergen, Norway
• TERRA Lab, University of Regina - Department of Geography, Canada
• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org
R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).
R Project Homepage:
http://www.R-project.org

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
version allows one to see how computations are performed and to easily write other tests, while the C version is needed for performance reasons.
The statistical significance of the BCA can be evaluated with the randtest.between function. By default, 999 permutations are simulated, and the resulting object (test1) can be plotted (Figure 7). The p-value is highly significant, which confirms the existence of differences between sampling sites. The plot shows that the observed value is very far to the right of the histogram of simulated values.
> test1 <- randtest.between(bet1)
> test1
Monte-Carlo test
Observation: 0.4292
Call: randtest.between(xtest = bet1)
Based on 999 replicates
Simulated p-value: 0.001
> plot(test1, main="Between class inertia")
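The simulated p-value printed above follows the usual Monte-Carlo convention, in which the observed statistic is counted among the permuted values; with 999 permutations the smallest attainable p-value is therefore 1/1000 = 0.001. A minimal sketch (the helper name and the simulated values here are illustrative, not part of ade4):

```r
# One-sided Monte-Carlo p-value: rank the observed statistic among the
# simulated ones, counting the observation itself as one replicate.
monte_carlo_p <- function(obs, sim) {
  (sum(sim >= obs) + 1) / (length(sim) + 1)
}

obs <- 0.4292                 # observed between-class inertia
sim <- runif(999, 0.1, 0.5)   # stand-in for 999 permuted inertias
monte_carlo_p(obs, sim)
```

When none of the 999 permuted statistics reaches the observed value, as in the output above, this returns 1/1000 = 0.001.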
Figure 7: Histogram of the 999 simulated values of the randomization test of the bet1 BCA. The observed value is given by the vertical line at the right of the histogram.
Conclusion
We have described only the most basic functions of the ade4 package, considering only the simplest one-table data analysis methods. Many other dudi methods are available in ade4, for example multiple correspondence analysis (dudi.acm), fuzzy correspondence analysis (dudi.fca), analysis of a mixture of numeric variables and factors (dudi.mix), non-symmetric correspondence analysis (dudi.nsc) and decentered correspondence analysis (dudi.dec).
We are preparing a second paper dealing with two-table coupling methods, among which canonical correspondence analysis and redundancy analysis are the most frequently used in ecology (Legendre and Legendre, 1998). The ade4 package proposes an alternative to these methods, based on the co-inertia criterion (Dray et al., 2003).
The third category of data analysis methods available in ade4 are K-tables analysis methods, which try to extract the stable part in a series of tables. These methods come from the STATIS strategy (Lavit et al., 1994; statis and pta functions) or from the multiple coinertia strategy (mcoa function). The mfa and foucart functions perform two variants of K-tables analysis, and the STATICO method (function ktab.match2ktabs, Thioulouse et al. (2004)) allows one to extract the stable part of species-environment relationships, in time or in space.
Methods taking into account spatial constraints (multispati function) and phylogenetic constraints (phylog function) are under development.
Bibliography
Cailliez, F. (1983). The analytical solution of the additive constant problem. Psychometrika, 48:305–310.

Cazes, P., Chessel, D. and Dolédec, S. (1988). L'analyse des correspondances internes d'un tableau partitionné : son usage en hydrobiologie. Revue de Statistique Appliquée, 36:39–54.

Chevenet, F., Dolédec, S. and Chessel, D. (1994). A fuzzy coding approach for the analysis of long-term ecological data. Freshwater Biology, 31:295–309.

Dolédec, S., Chessel, D. and Olivier, J. (1995). L'analyse des correspondances décentrée : application aux peuplements ichtyologiques du haut-Rhône. Bulletin Français de la Pêche et de la Pisciculture, 336:29–40.

Dray, S., Chessel, D. and Thioulouse, J. (2003). Co-inertia analysis and the linking of ecological tables. Ecology, 84:3078–3089.

Escoufier, Y. (1987). The duality diagramm: a means of better practical applications. In Legendre, P. and Legendre, L., editors, Development in numerical ecology, pages 139–156. NATO Advanced Institute, Serie G. Springer Verlag, Berlin.

Gower, J. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53:325–338.

Gower, J. and Legendre, P. (1986). Metric and euclidean properties of dissimilarity coefficients. Journal of Classification, 3:5–48.

Greenacre, M. (1984). Theory and applications of correspondence analysis. Academic Press, London.

Hill, M. and Smith, A. (1976). Principal component analysis of taxonomic data with multi-state discrete characters. Taxon, 25:249–255.
Kiers, H. (1994). Simple structure in component analysis techniques for mixtures of qualitative and quantitative variables. Psychometrika, 56:197–212.

Kroonenberg, P. and Lombardo, R. (1999). Nonsymmetric correspondence analysis: a tool for analysing contingency tables with a dependence structure. Multivariate Behavioral Research, 34:367–396.

Lavit, C., Escoufier, Y., Sabatier, R. and Traissac, P. (1994). The ACT (STATIS method). Computational Statistics and Data Analysis, 18:97–119.

Legendre, P. and Anderson, M. (1999). Distance-based redundancy analysis: testing multispecies responses in multifactorial ecological experiments. Ecological Monographs, 69:1–24.

Legendre, P. and Legendre, L. (1998). Numerical ecology. Elsevier Science BV, Amsterdam, 2nd english edition.

Lingoes, J. (1971). Some boundary conditions for a monotone analysis of symmetric matrices. Psychometrika, 36:195–203.

Manly, B. F. J. (1991). Randomization and Monte Carlo methods in biology. Chapman and Hall, London.

Perrière, G., Combet, C., Penel, S., Blanchet, C., Thioulouse, J., Geourjon, C., Grassot, J., Charavay, C., Gouy, M., Duret, L. and Deléage, G. (2003). Integrated databanks access and sequence/structure analysis services at the PBIL. Nucleic Acids Research, 31:3393–3399.

Perrière, G., Lobry, J. and Thioulouse, J. (1996). Correspondence discriminant analysis: a multivariate method for comparing classes of protein and nucleic acid sequences. Computer Applications in the Biosciences, 12:519–524.

Perrière, G. and Thioulouse, J. (2002). Use of correspondence discriminant analysis to predict the subcellular location of bacterial proteins. Computer Methods and Programs in Biomedicine, in press.

Tenenhaus, M. and Young, F. (1985). An analysis and synthesis of multiple correspondence analysis, optimal scaling, dual scaling, homogeneity analysis and other methods for quantifying categorical multivariate data. Psychometrika, 50:91–119.

Thioulouse, J., Chessel, D., Dolédec, S. and Olivier, J. (1997). ADE-4: a multivariate analysis and graphical display software. Statistics and Computing, 7:75–83.

Thioulouse, J., Simier, M. and Chessel, D. (2004). Simultaneous analysis of a sequence of paired ecological tables. Ecology, 85:272–283.

Thorpe, R. S. (1983a). A biometric study of the effects of growth on the analysis of geographic variation: tooth number in green geckos (Reptilia: Phelsuma). Journal of Zoology, 201:13–26.

Thorpe, R. S. (1983b). A review of the numerical methods for recognising and analysing racial differentiation. In Felsenstein, J., editor, Numerical Taxonomy, Series G, pages 404–423. Springer-Verlag, New York.

Thorpe, R. S. and Leamy, L. (1983). Morphometric studies in inbred house mice (Mus sp.): multivariate analysis of size and shape. Journal of Zoology, 199:421–432.
qcc: An R package for quality control charting and statistical process control
by Luca Scrucca
Introduction
The qcc package for the R statistical environment allows the user to:
• plot Shewhart quality control charts for continuous, attribute and count data;
• plot Cusum and EWMA charts for continuous data;
• draw operating characteristic curves;
• perform process capability analyses;
• draw Pareto charts and cause-and-effect diagrams.
I started writing the package to provide the students in the class I teach a tool for learning the basic concepts of statistical quality control, at the introductory level of a book such as Montgomery (2000). Due to the intended final users, a free statistical environment was important, and R was a natural candidate.
The initial design, albeit written from scratch, reflected the set of functions available in S-Plus. Users of that environment will find some similarities but also differences. In particular, during development I added more tools and re-designed most of the library to meet the requirements I thought were important to satisfy.
Creating a qcc object
A qcc object is the basic building block to start with. It can be created simply by invoking the qcc function, which in its simplest form requires two arguments: a data frame, a matrix or a vector containing the observed data, and a string value specifying the control chart to be computed. Standard Shewhart control charts are available. These are often classified according to the type of quality characteristic that they are supposed to monitor (see Table 1).
Control charts for continuous variables are usually based on several samples with observations collected at different time points. Each sample or "rational group" must be provided as a row of a data frame or a matrix. The function qcc.groups can be used to easily group a vector of data values based on a sample indicator. Unequal sample sizes are allowed.
Suppose we have extracted samples for a characteristic of interest from an ongoing production process:
> data(pistonrings)
> attach(pistonrings)
> dim(pistonrings)
[1] 200 3
> pistonrings
    diameter sample trial
1     74.030      1  TRUE
2     74.002      1  TRUE
3     74.019      1  TRUE
4     73.992      1  TRUE
5     74.008      1  TRUE
6     73.995      2  TRUE
7     73.992      2  TRUE
...
199   74.000     40 FALSE
200   74.020     40 FALSE

> diameter <- qcc.groups(diameter, sample)
> dim(diameter)
[1] 40 5
> diameter
     [,1]   [,2]   [,3]   [,4]   [,5]
1  74.030 74.002 74.019 73.992 74.008
2  73.995 73.992 74.001 74.011 74.004
...
40 74.010 74.005 74.029 74.000 74.020
Using the first 25 samples as training data, an X̄ chart can be obtained as follows:
> obj <- qcc(diameter[1:25,], type="xbar")
By default a Shewhart chart is drawn (see Figure 1) and an object of class 'qcc' is returned. Summary statistics can be retrieved using the summary method (or simply by printing the object):
> summary(obj)
Call:
qcc(data = diameter[1:25, ], type = "xbar")

xbar chart for diameter[1:25, ]

Summary of group statistics:
  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 73.99   74.00   74.00   74.00   74.00   74.01

Group sample size: 5
Number of groups: 25
Center of group statistics: 74.00118
Standard deviation: 0.009887547

Control limits:
      LCL      UCL
 73.98791 74.01444
Control charts for variables:
  xbar      X̄ chart. Sample means are plotted to control the mean level of a continuous process variable.
  xbar.one  X chart. Sample values from a one-at-a-time data process are plotted to control the mean level of a continuous process variable.
  R         R chart. Sample ranges are plotted to control the variability of a continuous process variable.
  S         S chart. Sample standard deviations are plotted to control the variability of a continuous process variable.

Control charts for attributes:
  p         p chart. The proportion of nonconforming units is plotted. Control limits are based on the binomial distribution.
  np        np chart. The number of nonconforming units is plotted. Control limits are based on the binomial distribution.
  c         c chart. The number of defectives per unit is plotted. This chart assumes that defects of the quality attribute are rare, and the control limits are computed based on the Poisson distribution.
  u         u chart. The average number of defectives per unit is plotted. The Poisson distribution is used to compute control limits, but unlike the c chart this chart does not require a constant number of units.

Table 1: Shewhart control charts available in the qcc package.
The number of groups and their sizes are reported in this case, whereas a table is provided in the case of unequal sample sizes. Moreover, the center of group statistics (the overall mean for an X̄ chart) and the within-group standard deviation of the process are returned.
The main goal of a control chart is to monitor a process. If special causes of variation are present, the process is said to be "out of control" and actions are to be taken to find, and possibly eliminate, such causes. A process is declared to be "in control" if all points charted lie randomly within the control limits. These are usually computed at ±3σ from the center. This default can be changed using the argument nsigmas (so-called "American practice") or by specifying the confidence level (so-called "British practice") through the confidence.level argument.
Assuming a Gaussian distribution is appropriate for the statistic charted, the two practices are equivalent, since limits at ±3σ correspond to a two-tails probability of 0.0027 under a standard normal curve.
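This equivalence is easy to check numerically (a quick sketch using the standard normal distribution and quantile functions):

```r
# Two-tailed probability outside +/- 3 sigma under a standard normal:
conf <- pnorm(3) - pnorm(-3)         # P(-3 < Z < 3), about 0.9973
round(1 - conf, 4)                   # two-tails probability, about 0.0027

# Conversely, the "British practice" confidence level 0.9973
# maps back to (essentially) 3 sigmas:
nsigmas <- qnorm(1 - (1 - 0.9973)/2)
round(nsigmas, 2)
```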
Instead of using standard calculations to compute the group statistics and the corresponding control limits (see Montgomery (2000), Wetherill and Brown (1991)), the center, the within-group standard deviation and the control limits may be specified directly by the user through the arguments center, std.dev and limits, respectively.
Figure 1: An X̄ chart for samples data.
Shewhart control charts
A Shewhart chart is automatically plotted when an object of class 'qcc' is created, unless the qcc function is called with the argument plot=FALSE. The method responsible for plotting a control chart can, however, be invoked directly as follows:
> plot(obj)
giving the plot in Figure 1. This control chart has the center line (horizontal solid line), the upper and lower control limits (dashed lines), and the sample group statistics drawn as points connected with lines. Unless we provide the argument add.stats=FALSE, some summary statistics are shown at the bottom of the plot, together with the number of points beyond control limits and the number of violating runs (a run has length 5 by default).
Once a process is considered to be "in-control", we may use the computed limits for monitoring new data sampled from the same ongoing process. For example,
> obj <- qcc(diameter[1:25,], type="xbar",
+            newdata=diameter[26:40,])
plots the X̄ chart for both calibration data and new data (see Figure 2), but all the statistics and the control limits are solely based on the first 25 samples.
[Figure 2 annotations: Number of groups = 40, Center = 74.00118, StdDev = 0.009887547, LCL = 73.98791, UCL = 74.01444, Number beyond limits = 3, Number violating runs = 1.]
Figure 2: An X̄ chart with calibration data and new data samples.
If only the new data need to be plotted, we may provide the argument chart.all=FALSE in the call to the qcc function, or through the plotting method:
> plot(obj, chart.all=FALSE)
Many other arguments may be provided in the call to the qcc function and to the corresponding plotting method, and we refer the reader to the on-line documentation for details (help(qcc)).
As reported in Table 1, an X chart for one-at-a-time data may be obtained by specifying type="xbar.one". In this case the data charted are simply the sample values, and the within-group standard deviation is estimated by moving ranges of k (by default 2) points.
Control charts for attributes mainly differ from the previous examples in that we need to provide sample sizes through the size argument (except for the c chart). For example, a p chart can be obtained as follows:
> data(orangejuice)
> attach(orangejuice)
> orangejuice
   sample  D size trial
1       1 12   50  TRUE
2       2 15   50  TRUE
3       3  8   50  TRUE
...
53     53  3   50 FALSE
54     54  5   50 FALSE

> obj2 <- qcc(D[trial], sizes=size[trial], type="p")
where we use the logical indicator variable trial toselect the first 30 samples as calibration data
Operating characteristic function
An operating characteristic (OC) curve provides information about the probability of not detecting a shift in the process. This is usually referred to as the type II error, that is, the probability of erroneously accepting a process as being "in control".
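For an X̄ chart with ±3σ limits this probability has a simple closed form: for a mean shift of δ process standard deviations and samples of size n, β = Φ(3 − δ√n) − Φ(−3 − δ√n) (a standard result, cf. Montgomery (2000)). A sketch of the quantity plotted by the OC curves (the function name here is illustrative, not part of qcc):

```r
# Probability of NOT signalling, on a single sample of size n, after the
# process mean shifts by delta process standard deviations (X-bar chart):
oc_beta <- function(delta, n, nsigmas = 3) {
  pnorm(nsigmas - delta * sqrt(n)) - pnorm(-nsigmas - delta * sqrt(n))
}

oc_beta(0, 5)   # no shift: about 0.9973, the in-control non-signal probability
oc_beta(1, 5)   # a 1-sigma shift is still missed most of the time with n = 5
```

Larger samples push the curve down, which is exactly the pattern visible across the n values in Figure 3.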
The OC curves can be easily obtained from an object of class 'qcc':
> par(mfrow=c(1,2))
> oc.curves(obj)
> oc.curves(obj2)
The function oc.curves invisibly returns a matrix or a vector of probability values for the type II error. More arguments are available (see help(oc.curves)); in particular, identify=TRUE allows one to interactively identify values on the plot.
[Figure 3, left panel: OC curves for the xbar chart, probability of type II error against the process shift (in std. deviations) for n = 1, 5, 10, 15, 20. Right panel: OC curves for the p chart, probability of type II error against p.]
Figure 3: Operating characteristic curves.
Cusum charts
Cusum charts display how the group summary statistics deviate above or below the process center or target value, relative to the standard errors of the summary statistics. They are useful for detecting small and permanent variation in the mean of the process.
The basic cusum chart implemented in the qcc package is only available for continuous variables at the moment. Assuming a 'qcc' object has been created, we can simply input the following code:
> cusum(obj)
and the resulting cusum chart is shown in Figure 4. This displays on a single plot both the positive deviations and the negative deviations from the target, which by default is set to the center of group statistics. If a different target for the process is required, this value may be provided through the argument target when creating the object of class 'qcc'. Two further aspects may need to be controlled. The argument decision.int can be used to specify the number of standard errors of the summary statistics at which the cumulative sum signals an out of control; by default it is set at 5. The amount of shift in the process to be detected by the cusum chart, measured in standard errors of the summary statistics, is controlled by the argument se.shift, by default set at 1.
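These arguments map onto the textbook cusum recursions. A minimal sketch, assuming the usual reference value k equal to half the shift to be detected (so se.shift = 1 gives k = 0.5), with a signal whenever either cumulative sum exceeds h = decision.int (the function name is illustrative, not qcc's internal code):

```r
# Upper and lower cusums of the standardized group statistics z:
#   C+_i = max(0, z_i - k + C+_{i-1}),  C-_i = max(0, -z_i - k + C-_{i-1})
cusum_sums <- function(z, k = 0.5) {
  cp <- cn <- numeric(length(z))
  up <- dn <- 0
  for (i in seq_along(z)) {
    up <- max(0, z[i] - k + up)    # accumulated deviations above target
    dn <- max(0, -z[i] - k + dn)   # accumulated deviations below target
    cp[i] <- up
    cn[i] <- dn
  }
  list(pos = cp, neg = cn)
}

cusum_sums(c(1, 1, 1))$pos   # grows by z - k = 0.5 per step
```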
[Figure 4 annotations: Number of groups = 40, Target = 74.00118, StdDev = 0.009887547, decision boundaries (std. err.) = 5, shift detection (std. err.) = 1, number of points beyond boundaries = 4.]
Figure 4: Cusum chart.
EWMA charts
Exponentially weighted moving average (EWMA) charts smooth a series of data based on a moving average with weights which decay exponentially. As with the cusum chart, this plot can be used to detect small and permanent variation in the mean of the process.
The EWMA chart depends on the smoothing parameter λ, which controls the weighting scheme applied. By default it is set at 0.2, but it can be modified using the argument lambda. Assuming again that a 'qcc' object has been created, an EWMA chart can be obtained as follows:
> ewma(obj)
and the resulting graph is shown in Figure 5
Figure 5: EWMA chart.
Process capability analysis
Process capability indices for a characteristic of interest from a continuous process can be obtained through the function process.capability. This takes as mandatory arguments an object of class 'qcc' of type "xbar" or "xbar.one", and spec.limits, a vector specifying the lower (LSL) and upper (USL) specification limits. For example, using the previously created 'qcc' object, we may simply use:
> process.capability(obj,
+   spec.limits=c(73.95, 74.05))

Process Capability Analysis

Call:
process.capability(object = obj,
    spec.limits = c(73.95, 74.05))

Number of obs = 125      Target = 74
Center = 74.00118        LSL = 73.95
StdDev = 0.009887547     USL = 74.05

Capability indices:

       Value    2.5%   97.5%
Cp     1.686   1.476   1.895
Cp_l   1.725   1.539   1.912
Cp_u   1.646   1.467   1.825
Cp_k   1.646   1.433   1.859
Cpm    1.674   1.465   1.882

Exp<LSL 0%   Obs<LSL 0%
Exp>USL 0%   Obs>USL 0%
The graph shown in Figure 6 is a histogram of data values, with vertical dotted lines for the upper and the lower specification limits. The target is shown as a vertical dashed line, and its value can be provided using the target argument in the call; if missing, the value from the 'qcc' object is used if available, otherwise the target is set at the middle value between the specification limits.
The dashed curve is a normal density for a Gaussian distribution matching the observed mean and standard deviation. The latter is estimated using the within-group standard deviation, unless a value is specified in the call using the std.dev argument.
The capability indices reported on the graph are also printed at the console, together with confidence intervals computed at the level specified by the confidence.level argument (by default set at 0.95).
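The point estimates in the output above follow the standard definitions, so they can be reproduced by hand from the center, the within-group standard deviation and the specification limits (a sketch; the confidence intervals are omitted here):

```r
# Reproducing the printed capability indices from their definitions:
LSL <- 73.95; USL <- 74.05
center <- 74.00118; sigma <- 0.009887547
Cp   <- (USL - LSL) / (6 * sigma)      # potential capability
Cp_u <- (USL - center) / (3 * sigma)   # upper one-sided index
Cp_l <- (center - LSL) / (3 * sigma)   # lower one-sided index
Cp_k <- min(Cp_u, Cp_l)                # actual capability
round(c(Cp = Cp, Cp_l = Cp_l, Cp_u = Cp_u, Cp_k = Cp_k), 3)
```

These agree with the Value column printed by process.capability.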
Figure 6: Process capability analysis.
A further display, called "process capability sixpack plots", is also available. This is a graphical summary formed by plotting on the same graphical device:
• an X̄ chart;
• an R or S chart (if sample sizes > 10);
• a run chart;
• a histogram with specification limits;
• a normal Q–Q plot;
• a capability plot.
As an example we may input the following code
> process.capability.sixpack(obj,
+   spec.limits=c(73.95, 74.05),
+   target=74.01)
Pareto charts and cause-and-effect diagrams
A Pareto chart is a barplot where the categories are ordered by frequency in non-increasing order, with a line added to show their cumulative sum. Pareto charts are useful to identify those factors that have the greatest cumulative effect on the system,
and thus screen out the less significant factors in an analysis.
A simple Pareto chart may be drawn using the function pareto.chart, which in its simplest form requires a vector of values. For example:
> defect <- c(80, 27, 66, 94, 33)
> names(defect) <-
+   c("price code", "schedule date",
+     "supplier code", "contact num.",
+     "part num.")
> pareto.chart(defect)
The function returns a table of descriptive statistics and the Pareto chart shown in Figure 7. More arguments can be provided to control axis labels and limits, the main title and bar colors (see help(pareto.chart)).
Figure 7: Pareto chart.
A cause-and-effect diagram, also called an Ishikawa diagram (or fishbone diagram), is used to associate multiple possible causes with a single effect. Thus, given a particular effect, the diagram is constructed to identify and organize possible causes for it.
A very basic implementation of this type of diagram is available in the qcc package. The function cause.and.effect requires at least two arguments: cause, a list of causes and branches providing descriptive labels, and effect, a string describing the effect. Other arguments are available, such as cex and font, which control, respectively, the character expansions and the fonts used for labeling branches, causes and the effect. The following example creates the diagram shown in Figure 8.
> cause.and.effect(
+   cause=list(Measurements=c("Microscopes",
+                             "Inspectors"),
+              Materials=c("Alloys",
+                          "Suppliers"),
+              Personnel=c("Supervisors",
+                          "Operators"),
+              Environment=c("Condensation",
+                            "Moisture"),
+              Methods=c("Brake", "Engager",
+                        "Angle"),
+              Machines=c("Speed", "Bits",
+                         "Sockets")),
+   effect="Surface Flaws")
Figure 8: Cause-and-effect diagram.
Tuning and extending the package
Options to control some aspects of the qcc package can be set using the function qcc.options. If this is called with no arguments it retrieves the list of available options and their current values; otherwise it sets the required option. From a user point of view, the most interesting options allow one to set: the plotting character and the color used to highlight points beyond control limits or violating runs; the run length, i.e. the maximum value of a run before signalling a point as out of control; the background color used to draw the figure and the margin color; and the character expansion used to draw the labels, the title and the statistics at the bottom of any plot. For details see help(qcc.options).
Another interesting feature is the possibility of easily extending the package by defining new control charts. Suppose we want to plot a p chart based on samples with different sizes. The resulting control chart will have non-constant control limits and thus may be difficult for some operators to read. One possible solution is to draw a standardized control chart. This can be accomplished by defining three new functions: one which computes and returns the group statistics to be charted (i.e. the z scores) and their center (zero in this case), another which returns the within-group standard deviation (one in this case), and finally a function which returns the control limits. These functions need to be
called stats.type, sd.type and limits.type, respectively, for a new chart of type "type". The following code may be used to define a standardized p chart:
stats.p.std <- function(data, sizes)
{
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data)/sum(sizes)
  z <- (data/sizes - pbar) / sqrt(pbar*(1-pbar)/sizes)
  list(statistics = z, center = 0)
}

sd.p.std <- function(data, sizes) return(1)

limits.p.std <- function(center, std.dev, sizes, conf)
{
  if (conf >= 1) {
    lcl <- -conf
    ucl <- +conf
  }
  else {
    if (conf > 0 & conf < 1) {
      nsigmas <- qnorm(1 - (1 - conf)/2)
      lcl <- -nsigmas
      ucl <- +nsigmas
    }
    else stop("invalid conf argument.")
  }
  limits <- matrix(c(lcl, ucl), ncol = 2)
  rownames(limits) <- rep("", length = nrow(limits))
  colnames(limits) <- c("LCL", "UCL")
  return(limits)
}
Then we may source the above code and obtain the control charts in Figure 9 as follows:
> # set unequal sample sizes
> n <- c(rep(50,5), rep(100,5), rep(25,5))
> # generate randomly the number of successes
> x <- rbinom(length(n), n, 0.2)
> par(mfrow=c(1,2))
> # plot the control chart with variable limits
> qcc(x, type="p", size=n)
> # plot the standardized control chart
> qcc(x, type="p.std", size=n)
[Figure 9 annotations, left panel (p chart for x): Number of groups = 15, Center = 0.1931429, StdDev = 0.3947641, LCL and UCL are variable, Number beyond limits = 0, Number violating runs = 0. Right panel (p.std chart for x): Number of groups = 15, Center = 0, StdDev = 1, LCL = -3, UCL = 3, Number beyond limits = 0, Number violating runs = 0.]
Figure 9: A p chart with non-constant control limits (left panel) and the corresponding standardized control chart (right panel).
Summary
In this paper we have briefly described the qcc package. This provides functions to draw basic Shewhart quality control charts for continuous, attribute and count data; corresponding operating characteristic curves are also implemented. Other statistical quality tools available are Cusum and EWMA charts for continuous data, process capability analyses, Pareto charts and cause-and-effect diagrams.
References
Montgomery, D.C. (2000) Introduction to Statistical Quality Control, 4th ed. New York: John Wiley & Sons.

Wetherill, G.B. and Brown, D.W. (1991) Statistical Process Control. New York: Chapman & Hall.
Luca Scrucca
Dipartimento di Scienze Statistiche, Università degli Studi di Perugia, Italy
luca@stat.unipg.it
Least Squares Calculations in R
Timing different approaches
by Douglas Bates
Introduction
The S language encourages users to become programmers as they express new ideas and techniques of data analysis in S. Frequently the calculations in these techniques are expressed in terms of matrix operations. The subject of numerical linear algebra — how calculations with matrices can be carried out accurately and efficiently — is quite different from what many of us learn in linear algebra courses or in courses on linear statistical methods.
Numerical linear algebra is primarily based on decompositions of matrices, such as the LU decomposition, the QR decomposition, and the Cholesky decomposition, that are rarely discussed in linear algebra or statistics courses.
In this article we discuss one of the most common operations in statistical computing, calculating least squares estimates. We can write the problem mathematically as

    β̂ = arg min_β ‖y − Xβ‖²    (1)
where X is an n × p model matrix (p ≤ n), y is n-dimensional and β is p-dimensional. Most statistics texts state that the solution to (1) is

    β̂ = (X′X)⁻¹ X′y    (2)
when X has full column rank (i.e. the columns of X are linearly independent).
Indeed (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way. That is, they use code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.
General principles
A few general principles of numerical linear algebra are:
1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.
2. Use crossprod(X), not t(X) %*% X, to calculate X′X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X′y.
3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X′X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
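As a small self-contained sketch (ours, not from the article; the simulated X and y are made-up data, not the mm and y objects used below), the first two principles can be checked on a toy problem:

```r
# Illustration of principles 1 and 2 on a small simulated problem.
set.seed(1)
X <- matrix(rnorm(100 * 5), 100, 5)
y <- rnorm(100)

naive  <- solve(t(X) %*% X) %*% t(X) %*% y      # invert, then multiply
better <- solve(crossprod(X), crossprod(X, y))  # solve one system directly

all.equal(as.vector(naive), as.vector(better))  # same estimates either way
```

Both forms give the same estimates up to numerical tolerance; the difference is in speed and stability, as the timings below show.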
Some timings on a large example
For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well:
> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850  712
> sys.gctime(naive.sol <- solve(t(mmm) %*%
+     mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00
(The function sys.gctime is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)
According to the principles above, it should be more effective to compute
> sys.gctime(cpod.sol <- solve(crossprod(mmm),
+     crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE
Timing results will vary between computers, but in most cases the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X′X the naive way:
> sys.gctime(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00
because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.
However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result the solution of (1) is performed using a general linear system solver based on an LU decomposition, when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly but the code is awkward:
> sys.gctime(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sys.gctime(chol.sol <- backsolve(ch,
+     forwardsolve(ch, crossprod(mmm, y),
+         upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE
R News, ISSN 1609-3631, Vol. 4/1, June 2004
Least squares calculations with Matrix classes
The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the poMatrix class use the Cholesky decomposition.
> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sys.gctime(Mat.sol <- solve(crossprod(mmg),
+     crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE
Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object so that it can be reused without recalculation.
> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sys.gctime(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sys.gctime(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all:
> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sys.gctime(sparse.sol <- solve(crossprod(mm),
+     crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE
The model matrix mm has class "cscMatrix", indicating that it is a compressed, sparse, column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric, sparse, column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful, the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case, the decomposition is so fast that it is difficult to determine the difference in the solution times.
> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sys.gctime(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sys.gctime(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
Epilogue
Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS), such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or Atlas (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.
Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.
Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A′B are generally faster than those in AB.
Notice that this means that calculating A′B in R as t(A) %*% B, instead of crossprod(A, B), causes A to be transposed twice: once in the calculation of t(A) and a second time in the inner loop of the matrix product. The crossprod function does not do any transposition of matrices - yet another reason to use crossprod.
Bibliography
J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).
D. M. Bates
U. of Wisconsin-Madison
bates@wisc.edu
Tools for interactively exploring R packages
by Jianhua Zhang and Robert Gentleman
Introduction
An R package has the required and perhaps additional subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer, and they are intended for exploring packages, examples, and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.
Description
Our current implementation is in the form of tcltk widgets, and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcltk. Once both tkWidgets and widgetTools have been loaded into the working session, the user has full access to these interactive tools.
pExplorer
pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from showing by the widget. The default behavior, if nothing is provided by a user, is to open the first package in the library of locally installed R packages (.libPaths()), with subdirectories/files named Meta and latex excluded.
Figure 1 shows the widget invoked by typing pExplorer("base").
The following tasks can be performed through the interface:
• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

• Clicking the Browse button allows users to select a directory containing R packages, to achieve the same results as typing in a new path.

• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.
Figure 1: A snapshot of the widget when pExplorer is called.
• Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, which is discussed next.
eExplorer
eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.
Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.
vExplorer
vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionalities provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of vignettes of the package.
Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget, as shown by Figure 4, appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:
• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the pdf version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in the Results of Execution window.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.
The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.
Bibliography
Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28-31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21-24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.
Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base").
Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().
Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.
The survival package
by Thomas Lumley
This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).
Overview
Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.
In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.
Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).
Specifying survival data
In the popular 'counting process' formulation of survival analysis, each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.
In data(veteran), time measures the time from the start of a lung cancer clinical trial, and status is 1 if the patient died, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.
> library(survival)
> data(veteran)  ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+
...
## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
+       diagtime*30 + time, status))
 [1] ( 210, 282 ]  ( 150, 561 ]
 [3] (  90, 318 ]  ( 270, 396 ]
...
 [9] ( 540, 854 ]  ( 180, 280+]
The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.
One and two sample summaries
The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.
Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.
> data(veteran)
> plot(survfit(Surv(time, status) ~ trt, data = veteran),
+     xlab = "Years since randomisation",
+     xscale = 365, ylab = "% surviving", yscale = 100,
+     col = c("forestgreen", "blue"))
> survdiff(Surv(time, status) ~ trt, data = veteran)
Call:
survdiff(formula = Surv(time, status) ~ trt,
    data = veteran)

      N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394

      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928
[Figure 1: Survival distributions for two lung cancer treatments. x-axis: years since randomisation; y-axis: % surviving; legend: Standard, New.]
Proportional hazards models
The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.
Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

h(t; z) = h0(t) e^{βz},

or equivalently, log h(t; z) = log h0(t) + βz.
Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.
A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.
> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
+       log(bili) + log(protime) + age + platelet,
+       data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
    log(bili) + log(protime) + age + platelet,
    data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999

              se(coef)     z       p
edtrt         0.300321  3.43 0.00061
log(bili)     0.097771  9.73 0.00000
log(protime)  1.031908  2.80 0.00520
age           0.008489  4.18 0.00003
platelet      0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312
The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.
> plot(survfit(Surv(time, status) ~ edtrt,
+       data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt +
+       ratetable(edtrt = edtrt, bili = bili,
+          platelet = platelet, age = age,
+          protime = protime),
+       data = pbc, subset = trt == -9,
+       ratetable = mayomodel, cohort = TRUE),
+    col = "purple")
The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.
Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.
[Figure 2: Observed and predicted survival. x-axis: days; y-axis: survival probability.]
The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e., that β is constant over time. The cox.zph function provides formal tests and graphical diagnostics for proportional hazards.
> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796
## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])
[Figure 3: Testing proportional hazards. Smoothed estimate of Beta(t) for edtrt against time.]
There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.
Additional features
Computation for the Cox model extends straightforwardly to situations where z can vary over time, and when an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.
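As a brief sketch (ours, not taken from the article), a strata() term with the veteran data used earlier gives each cell type its own baseline hazard while estimating a single treatment coefficient:

```r
library(survival)
data(veteran)
# Each level of celltype gets its own unspecified baseline hazard;
# only the treatment effect is estimated as a coefficient.
fit <- coxph(Surv(time, status) ~ trt + strata(celltype),
             data = veteran)
fit
```

A cluster() term would be used analogously when several records share one individual, as in multiple-event data.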
Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm, with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.
The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of follow-up.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
useR! 2004
The R User Conference
by John Fox, McMaster University, Canada
More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.
The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.
A particularly interesting component of the program were the keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:
• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice

• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on data mining and related topics
Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.
An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.
The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19, and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.
Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.
R Help Desk
Date and Time Classes in R
by Gabor Grothendieck and Thomas Petzoldt
Introduction
R-1.9.0 and its contributed packages have a number of datetime (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:
• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).
• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).
• POSIX classes. POSIX classes refer to the two classes POSIXct and POSIXlt and their common super class POSIXt. These support times and dates, including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases, and other applications that involve external communication with systems that also support dates, times, time zones, and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970 GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIX classes can be found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
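As a small base-R illustration (ours, not from the article) of the internal representations just described, Date counts days and POSIXct counts seconds from the same origin:

```r
# Date stores days since 1970-01-01; POSIXct stores seconds since
# 1970-01-01 00:00:00 GMT.
d  <- as.Date("1970-01-02")
ct <- as.POSIXct("1970-01-02 00:00:00", tz = "GMT")

unclass(d)   # 1 day after the origin
unclass(ct)  # 86400 seconds after the origin
```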
When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.
Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.
Other Applications
Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word usually was employed above.
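A quick check (ours, not from the article) of the Windows spreadsheet conversion, using a serial number we computed by hand:

```r
# Excel (Windows, 1900 date system) serial number 38139 corresponds
# to 2004-06-01; adding it to the 1899-12-30 origin recovers the date.
x <- 38139
as.Date("1899-12-30") + floor(x)  # "2004-06-01"
```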
SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).
Time Series
A common use of dates and datetimes are in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly, or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its), and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or datetime package to represent datetimes.
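For example (ours, not from the article), a monthly ts series carries only a numeric start and frequency rather than real dates:

```r
# A regular monthly series: time points are stored as year + fraction,
# not as Date or POSIXct values.
z <- ts(1:24, start = c(2004, 1), frequency = 12)
start(z)      # period (2004, 1), i.e., January 2004
frequency(z)  # 12
time(z)[2]    # 2004 + 1/12, i.e., February 2004
```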
Input
By default, read.table will read in numeric data, such as 20040115, as numbers and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹
# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")
# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
   as.is = 1:2)
chron(df$date, df$time)
Avoiding Errors
The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.
With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:
# default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))
# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)
# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)
# if TRUE then year not displayed
options(chron.simplify = FALSE)
For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,
¹This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.
orig <- chron("01/01/60")
x <- 0:9  # days since 01/01/60
# chron dates with default origin
orig + x
Regarding POSIX classes, the user should be aware of the following:
• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year, and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:
dp lt- seq(Systime() len=10 by=day)
plot(dp 110)
This does not use the current wall clock timefor plotting today and the next 9 days sinceplot treats the datetimes as relative to GMTThe x values that result will be off from the wallclock time by a number of hours equal to thedifference between the current time zone andGMT See the plot example in table of this ar-ticle for an illustration of how to handle thisSome functions accept a tz= argument allow-ing the user to specify the time zone explicitlybut there are cases where the tz= argument isignored Always check the function with tz=and tz=GMT to be sure that tz= is not beingignored If there is no tz= argument check care-fully which time zone the function is assuming
• OS. Some of the date calculations done by POSIX functions are passed off to the operating system and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.
• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst = 1) or not (isdst = 0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.
• conversion. Conversion between different date-time classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
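A minimal sketch of the convert-via-character recipe (the datetime value is illustrative):

```r
# Converting POSIXct to Date via character uses the displayed (local)
# representation, avoiding a surprise shift from time zone handling.
dp <- as.POSIXct("2004-06-15 23:30:00")  # interpreted in the local time zone
d1 <- as.Date(format(dp))                # the displayed date, here 2004-06-15
d1
```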
Comparison Table
Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.
Bibliography
Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2):8-11. URL http://CRAN.R-project.org/doc/Rnews/
Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de
Table 1: Comparison Table. Each entry gives the idiom for the Date, chron and POSIXct classes in turn.

now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz = "GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class = "Date")
  chron:   chron(0)
  POSIXct: structure(0, class = c("POSIXt", "POSIXct"))

x days since origin (GMT days)
  Date:    structure(floor(x), class = "Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class = c("POSIXt", "POSIXct"))

diff in days
  Date:    dt1 - dt2
  chron:   dc1 - dc2
  POSIXct: difftime(dp1, dp2, unit = "day") [1]

time zone difference
  POSIXct: dp - as.POSIXct(format(dp), tz = "GMT")

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt + 1
  chron:   dc + 1
  POSIXct: seq(dp0, length = 2, by = "DSTday")[2] [2]

previous day
  Date:    dt - 1
  chron:   dc - 1
  POSIXct: seq(dp0, length = 2, by = "-1 DSTday")[2] [3]

x days since date
  Date:    dt0 + floor(x)
  chron:   dc0 + x
  POSIXct: seq(dp0, length = 2, by = paste(x, "DSTday"))[2] [4]

sequence
  Date:    dts <- seq(dt0, length = 10, by = "day")
  chron:   dcs <- seq(dc0, length = 10)
  POSIXct: dps <- seq(dp0, length = 10, by = "DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz = "GMT"), rnorm(10)) [5]

every 2nd week
  Date:    seq(dt0, length = 3, by = "2 week")
  chron:   dc0 + seq(0, length = 3, by = 14)
  POSIXct: seq(dp0, length = 3, by = "2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc) - month.day.year(dc)$day + 1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun = 0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc) - 3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output: yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output: mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output: Sat Jul 1 1970
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input: "1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format = "y-m-d")
  POSIXct: as.POSIXct(z)

Input: "10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz = "")
  to Date:    as.Date(dates(dc)); as.Date(format(dp))
  to chron:   chron(unclass(dt)); chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]
  to POSIXct: as.POSIXct(format(dt)); as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz = "GMT")
  to Date:    as.Date(dates(dc)); as.Date(dp)
  to chron:   chron(unclass(dt)); as.chron(dp)
  to POSIXct: as.POSIXct(format(dt), tz = "GMT"); as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See strptime for % codes.

[1] Can be reduced to dp1 - dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp + 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp - 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp + 24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz = "GMT")).
Programmers' Niche: A simple class in S3 and S4

by Thomas Lumley
ROC curves
Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.
The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).
I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.
Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that plus and minus infinity and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.
drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
    function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
    function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type = "l")
}
It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.
drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  plot(mspec, sens, type = "l")
}
One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
Classes and methods
Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens = sens, mspec = mspec, test = TT,
    call = sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}
The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.
plot.ROC <- function(x, type = "b", null.line = TRUE,
  xlab = "1-Specificity", ylab = "Sensitivity",
  main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
    xlab = xlab, ylab = ylab, ...)
  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}
A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:
lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}
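As a usage sketch with synthetic data (the compact ROC constructor here repeats the article's definition so the example is self-contained; variable names are illustrative), two curves can be overlaid:

```r
# Compact restatement of the article's constructor and methods.
ROC <- function(T, D) {
  DD <- table(-T, D)
  rval <- list(sens = cumsum(DD[,2])/sum(DD[,2]),
               mspec = cumsum(DD[,1])/sum(DD[,1]),
               test = rev(sort(unique(T))), call = sys.call())
  class(rval) <- "ROC"
  rval
}
plot.ROC <- function(x, ...) plot(x$mspec, x$sens, type = "b", ...)
lines.ROC <- function(x, ...) lines(x$mspec, x$sens, ...)

set.seed(42)
T1 <- c(rnorm(50, mean = 1), rnorm(50, mean = 0))  # test values
D  <- rep(c(1, 0), each = 50)                      # disease indicator
r1 <- ROC(T1, D)
r2 <- ROC(T1 + rnorm(100, sd = 2), D)              # a noisier test
plot(r1)
lines(r2)
```

The noisier test produces a curve closer to the diagonal, illustrating weaker discrimination.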
A method for identify allows the cutpoint to be found for any point on the curve:
identify.ROC <- function(x, labels = NULL, ...,
  digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}
An AUC function to compute the area under theROC curve is left as an exercise for the reader
Generalising the class
The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.
ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens = sens, mspec = mspec,
    test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.
ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
      c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease*weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens = sens, mspec = mspec,
    test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:
setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
    test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })
The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  new("ROC", sens = sens, mspec = mspec,
    test = TT, call = sys.call())
}
Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:
setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })
The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.
setMethod("plot", c("ROC", "missing"),
  function(x, y, type = "b", null.line = TRUE,
    xlab = "1-Specificity",
    ylab = "Sensitivity",
    main = NULL, ...) {
    par(pty = "s")
    plot(x@mspec, x@sens, type = type,
      xlab = xlab, ylab = ylab, ...)
    if (null.line)
      abline(0, 1, lty = 3)
    if (is.null(main))
      main <- x@call
    title(main = main)
  })
Creating a method for lines appears to work the same way:
setMethod("lines", "ROC",
  function(x, ...) {
    lines(x@mspec, x@sens, ...)
  })
In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before
> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm except that new is used to create the ROC object.
setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T) {
    if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type = "response"))
    disease <- disease*weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    new("ROC", sens = sens, mspec = mspec,
      test = TT, call = sys.call())
  })
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.
setGeneric("identify", signature = c("x"))

setMethod("identify", "ROC",
  function(x, labels = NULL, ...,
    digits = 1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
      labels = labels, ...)
  })
Discussion
Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model is a linear model, more that it has a linear model (from the last iteration of iteratively reweighted least squares) associated with it.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R

by the R Core Team

New features in 1.9.1
• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.
User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.
Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or better, set as a hook; see ?setHook.
• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.
New features in 1.9.0
• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically those of get/assign to the environment with inherits=FALSE.
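A small sketch of these environment accessors (the names are illustrative):

```r
e <- new.env()
e$x <- 1          # equivalent to assign("x", 1, envir = e)
e[["y"]] <- 2     # character index only; no partial matching
e$x + e[["y"]]    # 3
```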
• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace.

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.
• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.
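A brief sketch of the new class (run under a current R; output of weekdays() depends on the locale):

```r
d <- as.Date("2004-06-15")
d + 7                # arithmetic in whole days: 2004-06-22
weekdays(d)          # day-of-week label, locale-dependent
format(d, "%Y-%m")   # "2004-06"
```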
• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package, which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).
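The definitions in terms of the gamma function are easy to check directly:

```r
factorial(5)                 # 120, i.e. gamma(5 + 1)
lfactorial(5)                # log(120), i.e. lgamma(5 + 1)
factorial(5) == gamma(6)     # TRUE
```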
• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.
• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.
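A small illustration of mget() (the environment and names are made up for the example):

```r
e <- new.env()
assign("a", 1, envir = e)
assign("b", 2, envir = e)
mget(c("a", "b"), envir = e)   # list(a = 1, b = 2)
```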
• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for S compatibility.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.
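A sketch of the intended use (the file name is illustrative; the quoting style depends on the platform, so no exact output is shown):

```r
fname <- "my file.txt"                 # a name containing a space
cmd <- paste("ls -l", shQuote(fname))  # safe to hand to a shell via system()
cmd
```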
• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split="" is optimized.
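The fixed= argument matters whenever the separator is a regular-expression metacharacter, as a quick sketch shows:

```r
strsplit("a.b.c", ".", fixed = TRUE)[[1]]  # "a" "b" "c"
strsplit("a.b.c", ".")[[1]]                # "." as a regexp matches any character
```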
• subset() now allows a drop argument, which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.
• Changes to package 'grid':

- Renamed push/pop.viewport() to push/popViewport().

- Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

- Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.

- Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport bundles that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

- Added vpPath() to allow specification of viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator :: in later versions).

- Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

- Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

pushViewport(viewport(w=.5, h=.5, name="A"))
grid.rect()
pushViewport(viewport(w=.5, h=.5, name="B"))
grid.rect(gp=gpar(col="grey"))
upViewport(2)
grid.rect(vp="A", gp=gpar(fill="red"))
grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))
ndash Added enginedisplaylist() functionThis allows the user to tell grid NOT touse the graphics engine display list andto handle ALL redraws using its own dis-play list (including redraws after deviceresizes and copies)This provides a way to avoid some of theproblems with resizing a device when youhave used gridconvert() or the grid-Base package or even base functions suchas legend()There is a document discussing the useof display lists in grid on the grid website httpwwwstataucklandacnz~paulgridgridhtml
– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

  * grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

  * grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

  * the grobwidth and grobheight units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

  In addition, the following features are now possible with grobs:

  * grobs now save()/load() like any normal R object.

  * many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

  * there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a childrenvp slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

  * there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

  * there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.
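The non-drawing *Grob() side of this API can be sketched as follows; this minimal example builds and edits grobs without opening a device (grob and child names are made up for illustration):

```r
library(grid)

# Build two named grobs and combine them into a gTree
box   <- rectGrob(name = "box")
label <- textGrob("hello", name = "label")
tree  <- gTree(children = gList(box, label), name = "tree")

# A gPath (here just a single grob name) identifies a child of the gTree;
# getGrob() extracts it, and editGrob() returns a modified copy
child <- getGrob(tree, "box")
tree2 <- editGrob(tree, gPath("box"), gp = gpar(fill = "grey"))
```

Because grobs are now plain copy-by-value R objects, tree2 is an independent copy; the original tree is untouched by the edit.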
– Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth", and "grobheight" unit objects, respectively).
– Allowed viewports to turn off clipping altogether. Possible settings for viewport 'clip' arg are now:

    "on"      = clip to the viewport (was TRUE)
    "inherit" = clip to whatever parent says (was FALSE)
    "off"     = turn off clipping

  Still accept logical values (and NA maps to "off").
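The three settings (and the logical shorthands) can be checked without drawing, since viewport() only constructs an object; a small sketch:

```r
library(grid)

# The three explicit clip settings
vp_on      <- viewport(clip = "on")       # clip to this viewport (was TRUE)
vp_inherit <- viewport(clip = "inherit")  # inherit parent's clipping (was FALSE)
vp_off     <- viewport(clip = "off")      # no clipping at all

# Logical values are still accepted; NA maps to "off"
vp_legacy  <- viewport(clip = NA)
```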
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends().

• Option 'repositories' to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.
• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(, deriv=2) and psigamma(, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

  The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

  The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities' as all builds of R have had them since 1.7.0.

• --with-blas-goto= to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN
by Kurt Hornik
New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination
follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in circular Statistics" (2001) S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check
given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object oriented implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S functions (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics — Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis; fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, and some statistics in linkage analysis and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between versus within chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory efficient refdata type that transparently encapsulates matrices and data.frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS 2002, 99, pg. 15869, and J. Comput. Chem. 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns; ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.
Other changes
• Packages mclust1998, netCDF, and serialize were moved from the main CRAN section to the Archive.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News
by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)
bull Emanuele De Rinaldis (Italy)
bull Norm Phillips (Canada)
New supporting institutions
• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232
USA

Editorial Board:
Douglas Bates and Paul Murrell.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
qcc: An R package for quality control charting and statistical process control
by Luca Scrucca
Introduction
The qcc package for the R statistical environment allows to:
• plot Shewhart quality control charts for continuous, attribute and count data;

• plot Cusum and EWMA charts for continuous data;

• draw operating characteristic curves;

• perform process capability analyses;

• draw Pareto charts and cause-and-effect diagrams.
I started writing the package to provide the students in the class I teach a tool for learning the basic concepts of statistical quality control, at the introductory level of a book such as Montgomery (2000). Due to the intended final users, a free statistical environment was important and R was a natural candidate.

The initial design, albeit it was written from scratch, reflected the set of functions available in S-Plus. Users of this last environment will find some similarities but also differences as well. In particular, during the development I added more tools and re-designed most of the library to meet the requirements I thought were important to satisfy.
Creating a qcc object
A qcc object is the basic building block to start with. This can be simply created invoking the qcc function, which in its simplest form requires two arguments: a data frame, a matrix or a vector containing the observed data, and a string value specifying the control chart to be computed. Standard Shewhart control charts are available. These are often classified according to the type of quality characteristic that they are supposed to monitor (see Table 1).

Control charts for continuous variables are usually based on several samples with observations collected at different time points. Each sample or "rationale group" must be provided as a row of a data frame or a matrix. The function qcc.groups can be used to easily group a vector of data values based on a sample indicator. Unequal sample sizes are allowed.

Suppose we have extracted samples for a characteristic of interest from an ongoing production process:
> data(pistonrings)
> attach(pistonrings)
> dim(pistonrings)
[1] 200 3
> pistonrings
    diameter sample trial
1     74.030      1  TRUE
2     74.002      1  TRUE
3     74.019      1  TRUE
4     73.992      1  TRUE
5     74.008      1  TRUE
6     73.995      2  TRUE
7     73.992      2  TRUE
...
199   74.000     40 FALSE
200   74.020     40 FALSE
> diameter <- qcc.groups(diameter, sample)
> dim(diameter)
[1] 40 5
> diameter
     [,1]   [,2]   [,3]   [,4]   [,5]
1  74.030 74.002 74.019 73.992 74.008
2  73.995 73.992 74.001 74.011 74.004
...
40 74.010 74.005 74.029 74.000 74.020
Using the first 25 samples as training data, an X chart can be obtained as follows:

> obj <- qcc(diameter[1:25,], type="xbar")
By default a Shewhart chart is drawn (see Figure 1) and an object of class 'qcc' is returned. Summary statistics can be retrieved using the summary method (or simply by printing the object):
> summary(obj)
Call:
qcc(data = diameter[1:25, ], type = "xbar")

xbar chart for diameter[1:25, ]

Summary of group statistics:
  Min. 1st Qu. Median  Mean 3rd Qu.  Max.
 73.99   74.00  74.00 74.00   74.00 74.01

Group sample size: 5
Number of groups: 25
Center of group statistics: 74.00118
Standard deviation: 0.009887547

Control limits:
     LCL     UCL
73.98791 74.01444
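The limits reported here follow the standard xbar-chart formula, center ± 3σ/√n; a sketch of the computation (not qcc's internal code; the inputs are the rounded values printed above, so the last digit may differ slightly):

```r
# Reproduce the control limits from the printed summary values:
# LCL/UCL = center -/+ 3 * sigma / sqrt(n)
center <- 74.00118      # center of group statistics
sigma  <- 0.009887547   # within-group standard deviation
n      <- 5             # group sample size

limits <- center + c(-1, 1) * 3 * sigma / sqrt(n)
round(limits, 5)  # close to the reported LCL = 73.98791, UCL = 74.01444
```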
type      Control charts for variables
xbar      X chart: Sample means are plotted to control the mean level of a continuous process variable.
xbar.one  X chart: Sample values from a one-at-a-time data process to control the mean level of a continuous process variable.
R         R chart: Sample ranges are plotted to control the variability of a continuous process variable.
S         S chart: Sample standard deviations are plotted to control the variability of a continuous process variable.

type      Control charts for attributes
p         p chart: The proportion of nonconforming units is plotted. Control limits are based on the binomial distribution.
np        np chart: The number of nonconforming units is plotted. Control limits are based on the binomial distribution.
c         c chart: The number of defectives per unit are plotted. This chart assumes that defects of the quality attribute are rare, and the control limits are computed based on the Poisson distribution.
u         u chart: The average number of defectives per unit is plotted. The Poisson distribution is used to compute control limits, but unlike the c chart, this chart does not require a constant number of units.

Table 1: Shewhart control charts available in the qcc package.
The number of groups and their sizes are reported in this case, whereas a table is provided in the case of unequal sample sizes. Moreover, the center of group statistics (the overall mean for an X̄ chart) and the within-group standard deviation of the process are returned.

The main goal of a control chart is to monitor a process. If special causes of variation are present, the process is said to be "out of control" and actions are to be taken to find and possibly eliminate such causes. A process is declared to be "in control" if all points charted lie randomly within the control limits. These are usually computed at ±3σ from the center. This default can be changed using the argument nsigmas (the so-called "American practice") or by specifying the confidence level (the so-called "British practice") through the confidence.level argument.

Assuming a Gaussian distribution is appropriate for the statistic charted, the two practices are equivalent, since limits at ±3σ correspond to a two-tailed probability of 0.0027 under a standard normal curve.
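The conversion between the two conventions is just a normal-quantile calculation; a minimal sketch in base R (no qcc needed):

```r
# "American practice": limits at +/- nsigmas standard deviations.
nsigmas <- 3

# Equivalent "British practice": the confidence level covered by those
# limits under a standard normal curve (two tails of 0.0027 in total).
conf <- 1 - 2 * pnorm(-nsigmas)   # 0.9973...

# Going the other way, recover the number of sigmas from a confidence level:
qnorm(1 - (1 - conf) / 2)         # = 3
```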
Instead of using standard calculations to compute the group statistics and the corresponding control limits (see Montgomery (2000); Wetherill and Brown (1991)), the center, the within-group standard deviation and the control limits may be specified directly by the user through the arguments center, std.dev and limits, respectively.
[Figure 1 about here: an X̄ chart for diameter[1:25, ], with center line, LCL and UCL. Number of groups = 25; Center = 74.00118; StdDev = 0.009887547; LCL = 73.98791; UCL = 74.01444; number beyond limits = 0; number violating runs = 0.]
Figure 1: An X̄ chart for the samples data.
Shewhart control charts
A Shewhart chart is automatically plotted when an object of class 'qcc' is created, unless the qcc function is called with the argument plot=FALSE. The method responsible for plotting a control chart can, however, be invoked directly as follows:

> plot(obj)

giving the plot in Figure 1. This control chart has the center line (horizontal solid line), the upper and lower control limits (dashed lines), and the sample group statistics drawn as points connected with lines. Unless we provide the argument add.stats=FALSE, some summary statistics are shown at the bottom of the plot, together with the number of points beyond control limits and the number of violating runs (a run has length 5 by default).
Once a process is considered to be "in control", we may use the computed limits for monitoring new data sampled from the same ongoing process. For example,

> obj <- qcc(diameter[1:25,], type="xbar",
+            newdata=diameter[26:40,])

plots the X̄ chart for both calibration data and new data (see Figure 2), but all the statistics and the control limits are solely based on the first 25 samples.
[Figure 2 about here: an X̄ chart with calibration data in diameter[1:25, ] and new data in diameter[26:40, ]. Number of groups = 40; Center = 74.00118; StdDev = 0.009887547; LCL = 73.98791; UCL = 74.01444; number beyond limits = 3; number violating runs = 1.]
Figure 2: An X̄ chart with calibration data and new data samples.
If only the new data need to be plotted, we may provide the argument chart.all=FALSE in the call to the qcc function, or through the plotting method:

> plot(obj, chart.all=FALSE)

Many other arguments may be provided in the call to the qcc function and to the corresponding plotting method, and we refer the reader to the on-line documentation for details (help(qcc)).
As reported in Table 1, an X chart for one-at-a-time data may be obtained by specifying type="xbar.one". In this case the data charted are simply the sample values, and the within-group standard deviation is estimated by moving ranges of k (by default 2) points.
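The moving-range estimator mentioned above can be sketched in base R. The unbiasing constant d2 ≈ 1.128 used below is the standard tabled value for ranges of 2 points (an assumption of this sketch, not a detail taken from the article):

```r
# Estimate the process standard deviation from one-at-a-time data
# using moving ranges of k = 2 consecutive points divided by d2.
sd.from.moving.ranges <- function(x, d2 = 1.128) {
  mr <- abs(diff(x))   # moving ranges of 2 consecutive values
  mean(mr) / d2        # approximately unbiased estimate of sigma
}

x <- c(74.030, 74.002, 74.019, 73.992, 74.008)  # first pistonrings sample
sd.from.moving.ranges(x)
```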
Control charts for attributes mainly differ from the previous examples in that we need to provide sample sizes through the sizes argument (except for the c chart). For example, a p chart can be obtained as follows:
> data(orangejuice)
> attach(orangejuice)
> orangejuice
   sample  D size trial
1       1 12   50  TRUE
2       2 15   50  TRUE
3       3  8   50  TRUE
...
53     53  3   50 FALSE
54     54  5   50 FALSE
> obj2 <- qcc(D[trial], sizes=size[trial], type="p")

where we use the logical indicator variable trial to select the first 30 samples as calibration data.
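For a p chart the statistic charted is the fraction D_i/n_i, with center p̄ = ΣD_i/Σn_i and limits p̄ ± 3·sqrt(p̄(1 − p̄)/n_i). A base-R sketch of this calculation (illustrative of the formula, not the qcc internals), using the first rows of the data above:

```r
# p chart center and 3-sigma limits computed from the definitions.
D <- c(12, 15, 8)   # nonconforming units per sample
n <- c(50, 50, 50)  # sample sizes

pbar <- sum(D) / sum(n)              # center line
phat <- D / n                        # charted proportions
se   <- sqrt(pbar * (1 - pbar) / n)  # binomial standard errors
lcl  <- pmax(0, pbar - 3 * se)       # limits truncated at 0 and 1
ucl  <- pmin(1, pbar + 3 * se)
cbind(phat, lcl, ucl)
```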
Operating characteristic function
An operating characteristic (OC) curve provides information about the probability of not detecting a shift in the process. This is usually referred to as the type II error, that is, the probability of erroneously accepting a process as being "in control".
The OC curves can be easily obtained from an object of class 'qcc':

> par(mfrow=c(1,2))
> oc.curves(obj)
> oc.curves(obj2)
The function oc.curves invisibly returns a matrix or a vector of probability values for the type II error. More arguments are available (see help(oc.curves)); in particular, identify=TRUE allows the user to interactively identify values on the plot.
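For an X̄ chart with ±3σ limits, the type II error for a mean shift of δ standard deviations has the textbook closed form β(δ) = Φ(3 − δ√n) − Φ(−3 − δ√n). A small base-R sketch of the curve that oc.curves draws (the formula is the standard one, not extracted from the package source):

```r
# Probability of NOT detecting a shift of delta*sigma on an X-bar chart
# with sample size n and +/- nsigmas control limits.
beta.xbar <- function(delta, n, nsigmas = 3)
  pnorm(nsigmas - delta * sqrt(n)) - pnorm(-nsigmas - delta * sqrt(n))

beta.xbar(0, n = 5)  # no shift: 0.9973, the in-control coverage
beta.xbar(2, n = 5)  # a 2-sigma shift is missed far less often
```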
[Figure 3 about here: OC curves for the xbar chart (probability of type II error versus process shift in std.dev units, for n = 1, 5, 10, 15, 20) and for the p chart (probability of type II error versus p).]
Figure 3: Operating characteristic curves.
Cusum charts
Cusum charts display how the group summary statistics deviate above or below the process center or target value, relative to the standard errors of the summary statistics. They are useful to detect small and permanent variation in the mean of the process.
The basic cusum chart implemented in the qcc package is only available for continuous variables at the moment. Assuming a 'qcc' object has been created, we can simply input the following code:

> cusum(obj)
and the resulting cusum chart is shown in Figure 4. This displays on a single plot both the positive deviations and the negative deviations from the target, which by default is set to the center of group statistics. If a different target for the process is required, this value may be provided through the argument target when creating the object of class 'qcc'. Two further aspects may need to be controlled. The argument decision.int can be used to specify the number of standard errors of the summary statistics at which the cumulative sum signals an out-of-control state; by default it is set at 5. The amount of shift in the process to be detected by the cusum chart, measured in standard errors of the summary statistics, is controlled by the argument se.shift, by default set at 1.
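The underlying recursions are the standard tabular cusum: with standardized group statistics z_i, reference value k = se.shift/2 and decision interval h = decision.int, the upper and lower sums are C⁺_i = max(0, C⁺_{i−1} + z_i − k) and C⁻_i = min(0, C⁻_{i−1} + z_i + k). A base-R sketch of this textbook form (qcc's exact internals may differ):

```r
# Tabular cusum of standardized statistics z, with reference value k
# (half the shift to detect) and decision interval h.
cusum.sums <- function(z, k = 0.5, h = 5) {
  cp <- cm <- numeric(length(z))
  up <- lo <- 0
  for (i in seq_along(z)) {
    up <- max(0, up + z[i] - k)  # accumulates upward shifts
    lo <- min(0, lo + z[i] + k)  # accumulates downward shifts
    cp[i] <- up; cm[i] <- lo
  }
  list(pos = cp, neg = cm, beyond = which(cp > h | cm < -h))
}

res <- cusum.sums(rep(1, 12))  # a persistent 1-sigma upward shift
res$beyond                     # signals once the sum exceeds h = 5
```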
[Figure 4 about here: a cusum chart for calibration data in diameter[1:25, ] and new data in diameter[26:40, ], showing cumulative sums above and below target with decision boundaries LDB and UDB. Number of groups = 40; Target = 74.00118; StdDev = 0.009887547; decision boundaries (std. err.) = 5; shift detection (std. err.) = 1; points beyond boundaries = 4.]
Figure 4: Cusum chart.
EWMA charts
Exponentially weighted moving average (EWMA) charts smooth a series of data based on a moving average with weights which decay exponentially. As with the cusum chart, this plot can also be used to detect small and permanent variation in the mean of the process.

The EWMA chart depends on the smoothing parameter λ, which controls the weighting scheme applied. By default it is set at 0.2, but it can be modified using the argument lambda. Assuming again that a 'qcc' object has been created, an EWMA chart can be obtained as follows:

> ewma(obj)

and the resulting graph is shown in Figure 5.
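The smoothing itself is the simple recursion y_i = λx_i + (1 − λ)y_{i−1}, started from the target. A base-R sketch (the recursion is standard; the charting details of ewma are omitted):

```r
# EWMA smoothing of a series x with smoothing parameter lambda,
# started from a target value y0.
ewma.smooth <- function(x, lambda = 0.2, y0 = mean(x)) {
  y <- numeric(length(x))
  prev <- y0
  for (i in seq_along(x)) {
    prev <- lambda * x[i] + (1 - lambda) * prev
    y[i] <- prev
  }
  y
}

ewma.smooth(c(10, 10, 10), y0 = 10)  # a constant series stays at 10
```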
[Figure 5 about here: an EWMA chart for calibration data in diameter[1:25, ] and new data in diameter[26:40, ]. Number of groups = 40; Target = 74.00118; StdDev = 0.009887547; smoothing parameter = 0.2; control limits at 3σ.]
Figure 5: EWMA chart.
Process capability analysis
Process capability indices for a characteristic of interest from a continuous process can be obtained through the function process.capability. This takes as mandatory arguments an object of class 'qcc' for an "xbar" or "xbar.one" type, and spec.limits, a vector specifying the lower (LSL) and upper (USL) specification limits. For example, using the previously created 'qcc' object, we may simply use:

> process.capability(obj,
+    spec.limits=c(73.95,74.05))
Process Capability Analysis

Call:
process.capability(object = obj,
    spec.limits = c(73.95, 74.05))

Number of obs = 125       Target = 74
      Center = 74.00118      LSL = 73.95
      StdDev = 0.009887547   USL = 74.05

Capability indices:

      Value   2.5%  97.5%
Cp    1.686  1.476  1.895
Cp_l  1.725  1.539  1.912
Cp_u  1.646  1.467  1.825
Cp_k  1.646  1.433  1.859
Cpm   1.674  1.465  1.882

Exp<LSL 0%   Obs<LSL 0%
Exp>USL 0%   Obs>USL 0%
The graph shown in Figure 6 is a histogram of data values, with vertical dotted lines for the upper and the lower specification limits. The target is shown as a vertical dashed line, and its value can be provided using the target argument in the call; if missing, the value from the 'qcc' object is used if available, otherwise the target is set at the middle value between the specification limits.

The dashed curve is a normal density for a Gaussian distribution matching the observed mean and standard deviation. The latter is estimated using the within-group standard deviation, unless a value is specified in the call using the std.dev argument.

The capability indices reported on the graph are also printed at the console, together with confidence intervals computed at the level specified by the confidence.level argument (by default set at 0.95).
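The basic indices can be reproduced directly from their definitions: Cp = (USL − LSL)/(6σ), Cp_l = (μ − LSL)/(3σ), Cp_u = (USL − μ)/(3σ) and Cp_k = min(Cp_l, Cp_u). A base-R check against the values printed above:

```r
# Capability indices computed from the definitions, using the center
# and within-group standard deviation reported by qcc above.
mu    <- 74.00118
sigma <- 0.009887547
LSL   <- 73.95
USL   <- 74.05

Cp   <- (USL - LSL) / (6 * sigma)
Cp_l <- (mu - LSL) / (3 * sigma)
Cp_u <- (USL - mu) / (3 * sigma)
Cp_k <- min(Cp_l, Cp_u)
round(c(Cp = Cp, Cp_l = Cp_l, Cp_u = Cp_u, Cp_k = Cp_k), 3)
# 1.686 1.725 1.646 1.646, matching the printed table
```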
[Figure 6 about here: process capability analysis for diameter[1:25, ]: a histogram of the data with LSL, USL and Target lines, a matching normal density, and the summary statistics and capability indices reported above.]
Figure 6: Process capability analysis.
A further display, called a "process capability sixpack plot", is also available. This is a graphical summary formed by plotting on the same graphical device:

• an X̄ chart

• an R or S chart (if sample sizes > 10)

• a run chart

• a histogram with specification limits

• a normal Q-Q plot

• a capability plot

As an example we may input the following code:

> process.capability.sixpack(obj,
+    spec.limits=c(73.95,74.05),
+    target=74.01)
Pareto charts and cause-and-effect diagrams
A Pareto chart is a barplot where the categories are ordered by frequency in non-increasing order, with a line added to show their cumulative sum. Pareto charts are useful to identify those factors that have the greatest cumulative effect on the system, and thus screen out the less significant factors in an analysis.
A simple Pareto chart may be drawn using the function pareto.chart, which in its simpler form requires a vector of values. For example:
> defect <- c(80, 27, 66, 94, 33)
> names(defect) <-
+   c("price code", "schedule date",
+     "supplier code", "contact num",
+     "part num")
> pareto.chart(defect)
The function returns a table of descriptive statistics and the Pareto chart shown in Figure 7. More arguments can be provided to control axes labels and limits, the main title and bar colors (see help(pareto.chart)).
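The ordering and cumulative percentages behind a Pareto chart are a one-liner in base R; a sketch using the defect counts above:

```r
# Sort the categories by decreasing frequency and compute the
# cumulative percentage line drawn on a Pareto chart.
defect <- c("price code" = 80, "schedule date" = 27,
            "supplier code" = 66, "contact num" = 94,
            "part num" = 33)
sorted  <- sort(defect, decreasing = TRUE)  # "contact num" comes first
cum.pct <- cumsum(sorted) / sum(sorted) * 100
round(cum.pct, 1)                           # last value is 100
```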
[Figure 7 about here: a Pareto chart for defect, with bars for contact num, price code, supplier code, part num and schedule date, a frequency axis (0 to 300) and a cumulative percentage axis (0 to 100%).]
Figure 7: Pareto chart.
A cause-and-effect diagram, also called an Ishikawa diagram (or fishbone diagram), is used to associate multiple possible causes with a single effect. Thus, given a particular effect, the diagram is constructed to identify and organize possible causes for it.

A very basic implementation of this type of diagram is available in the qcc package. The function cause.and.effect requires at least two arguments: cause, a list of causes and branches providing descriptive labels, and effect, a string describing the effect. Other arguments are available, such as cex and font, which control, respectively, the character expansion and the fonts used for labeling branches, causes and the effect. The following example creates the diagram shown in Figure 8.
> cause.and.effect(
+   cause=list(Measurements=c("Microscopes",
+                             "Inspectors"),
+              Materials=c("Alloys",
+                          "Suppliers"),
+              Personnel=c("Supervisors",
+                          "Operators"),
+              Environment=c("Condensation",
+                            "Moisture"),
+              Methods=c("Brake", "Engager",
+                        "Angle"),
+              Machines=c("Speed", "Bits",
+                         "Sockets")),
+   effect="Surface Flaws")
[Figure 8 about here: a cause-and-effect diagram for the effect "Surface Flaws", with branches Measurements (Microscopes, Inspectors), Materials (Alloys, Suppliers), Personnel (Supervisors, Operators), Environment (Condensation, Moisture), Methods (Brake, Engager, Angle) and Machines (Speed, Bits, Sockets).]
Figure 8: Cause-and-effect diagram.
Tuning and extending the package
Options to control some aspects of the qcc package can be set using the function qcc.options. If this is called with no arguments, it retrieves the list of available options and their current values; otherwise it sets the required option. From a user's point of view, the most interesting options allow the user to set: the plotting character and the color used to highlight points beyond control limits or violating runs; the run length, i.e. the maximum length of a run before a point is signalled as out of control; the background color used to draw the figure and the margin color; and the character expansion used to draw the labels, the title and the statistics at the bottom of any plot. For details see help(qcc.options).
Another interesting feature is the possibility to easily extend the package by defining new control charts. Suppose we want to plot a p chart based on samples with different sizes. The resulting control chart will have non-constant control limits, and thus may be difficult to read by some operators. One possible solution is to draw a standardized control chart. This can be accomplished by defining three new functions: one which computes and returns the group statistics to be charted (i.e. the z scores) and their center (zero in this case), another one which returns the within-group standard deviation (one in this case), and finally a function which returns the control limits. These functions need to be
called stats.type, sd.type and limits.type, respectively, for a new chart of type "type". The following code may be used to define a standardized p chart:
stats.p.std <- function(data, sizes)
{
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data)/sum(sizes)
  z <- (data/sizes - pbar)/sqrt(pbar*(1-pbar)/sizes)
  list(statistics = z, center = 0)
}

sd.p.std <- function(data, sizes) return(1)

limits.p.std <- function(center, std.dev, sizes, conf)
{
  if (conf >= 1)
    { lcl <- -conf
      ucl <- +conf }
  else
    { if (conf > 0 & conf < 1)
        { nsigmas <- qnorm(1 - (1 - conf)/2)
          lcl <- -nsigmas
          ucl <- +nsigmas }
      else stop("invalid conf argument.") }
  limits <- matrix(c(lcl, ucl), ncol = 2)
  rownames(limits) <- rep("", length = nrow(limits))
  colnames(limits) <- c("LCL", "UCL")
  return(limits)
}
Then we may source the above code and obtain the control charts in Figure 9 as follows:
# set unequal sample sizes
> n <- c(rep(50,5), rep(100,5), rep(25,5))
# generate randomly the number of successes
> x <- rbinom(length(n), n, 0.2)
> par(mfrow=c(1,2))
# plot the control chart with variable limits
> qcc(x, type="p", size=n)
# plot the standardized control chart
> qcc(x, type="p.std", size=n)
[Figure 9 about here: a p chart for x with variable control limits (15 groups; Center = 0.1931429; StdDev = 0.3947641; LCL and UCL variable) and the corresponding standardized p.std chart (Center = 0; StdDev = 1; LCL = -3; UCL = 3); no points beyond limits or violating runs in either chart.]
Figure 9: A p chart with non-constant control limits (left panel) and the corresponding standardized control chart (right panel).
Summary
In this paper we have briefly described the qcc package. This provides functions to draw basic Shewhart quality control charts for continuous, attribute and count data; corresponding operating characteristic curves are also implemented. Other statistical quality tools available are cusum and EWMA charts for continuous data, process capability analyses, Pareto charts and cause-and-effect diagrams.
References
Montgomery, D.C. (2000) Introduction to Statistical Quality Control, 4th ed. New York: John Wiley & Sons.

Wetherill, G.B. and Brown, D.W. (1991) Statistical Process Control. New York: Chapman & Hall.
Luca Scrucca
Dipartimento di Scienze Statistiche, Università degli Studi di Perugia, Italy
luca@stat.unipg.it
Least Squares Calculations in R

Timing different approaches
by Douglas Bates
Introduction
The S language encourages users to become programmers as they express new ideas and techniques of data analysis in S. Frequently the calculations in these techniques are expressed in terms of matrix operations. The subject of numerical linear algebra - how calculations with matrices can be carried out accurately and efficiently - is quite different from what many of us learn in linear algebra courses or in courses on linear statistical methods.

Numerical linear algebra is primarily based on decompositions of matrices, such as the LU decomposition, the QR decomposition, and the Cholesky decomposition, that are rarely discussed in linear algebra or statistics courses.
In this article we discuss one of the most common operations in statistical computing, calculating least squares estimates. We can write the problem mathematically as

    β̂ = argmin_β ‖y − Xβ‖²                  (1)
where X is an n × p model matrix (p ≤ n), y is n-dimensional and β is p-dimensional. Most statistics texts state that the solution to (1) is

    β̂ = (X′X)⁻¹ X′y                         (2)
when X has full column rank (i.e. the columns of X are linearly independent).

Indeed (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way. That is, they use code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.
General principles
A few general principles of numerical linear algebra are:

1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.

2. Use crossprod(X), not t(X) %*% X, to calculate X′X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X′y.

3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X′X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
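On a small example the approaches all agree; a base-R sketch comparing the naive formula, crossprod with solve, and an explicit Cholesky solve (synthetic data, purely illustrative):

```r
# Compare three ways of computing least squares estimates on a
# small synthetic problem; all three should agree numerically.
set.seed(1)
X <- cbind(1, rnorm(100), rnorm(100))  # n x p model matrix, full rank
y <- X %*% c(2, -1, 0.5) + rnorm(100)

naive <- solve(t(X) %*% X) %*% t(X) %*% y      # the form to avoid
cpod  <- solve(crossprod(X), crossprod(X, y))  # rules 1 and 2
ch    <- chol(crossprod(X))                    # rule 3: Cholesky
chl   <- backsolve(ch, forwardsolve(ch, crossprod(X, y),
                                    upper = TRUE, trans = TRUE))

all.equal(naive, cpod)  # TRUE
all.equal(naive, chl)   # TRUE
```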
Some timings on a large example
For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well:
> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850 712
> sys.gctime(naive.sol <- solve(t(mmm) %*%
+     mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00
(The function sys.gctime is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)
According to the principles above, it should be more effective to compute:
> sys.gctime(cpod.sol <- solve(crossprod(mmm),
+     crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE
Timing results will vary between computers, but in most cases the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X′X the naive way:
> sys.gctime(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00
because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.

However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result the solution of (1) is performed using a general linear system solver based on an LU decomposition, when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly, but the code is awkward:
> sys.gctime(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sys.gctime(chol.sol <- backsolve(ch,
+     forwardsolve(ch, crossprod(mmm, y),
+                  upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE
Least squares calculations with Matrix classes

The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition:
> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sys.gctime(Mat.sol <- solve(crossprod(mmg),
+     crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE
Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object, so that it can be reused without recalculation:
> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sys.gctime(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sys.gctime(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all:
> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sys.gctime(sparse.sol <- solve(crossprod(mm),
+     crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE
The model matrix mm has class "cscMatrix", indicating that it is a compressed, sparse, column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric, sparse, column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful, the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case the decomposition is so fast that it is difficult to determine the difference in the solution times:
> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sys.gctime(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sys.gctime(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
Epilogue
Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS), such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or Atlas (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.

Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.

Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A′B are generally faster than those in AB.

Notice that this means that calculating A′B in R as t(A) %*% B, instead of crossprod(A, B), causes A to be transposed twice: once in the calculation of t(A) and a second time in the inner loop of the matrix product. The crossprod function does not do any transposition of matrices - yet another reason to use crossprod.
Bibliography
J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).
D.M. Bates
U. of Wisconsin-Madison
bates@wisc.edu
Tools for interactively exploring R packages

by Jianhua Zhang and Robert Gentleman
Introduction
An R package has the required and perhaps additional subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer, and they are intended for exploring packages, examples, and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.
Description
Our current implementation is in the form of tcltk widgets, and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcltk. Once both tkWidgets and widgetTools have been loaded into the working session, the user has full access to these interactive tools.
pExplorer
pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from being shown by the widget. The default behavior is to open the first package in the library of locally installed R packages (.libPaths()), with subdirectories/files named "Meta" and "latex" excluded, if nothing is provided by a user.

Figure 1 shows the widget invoked by typing pExplorer("base").
The following tasks can be performed through the interface:
• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

• Clicking the Browse button allows users to select a directory containing R packages, to achieve the same results as typing in a new path.

• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.
Figure 1: A snapshot of the widget when pExplorer is called.
• Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just being clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, to be discussed next.
eExplorer
eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.

Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.
vExplorer
vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionalities provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of the vignettes of the package.

Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget, as shown by Figure 4, appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:
• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the PDF version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in the Results of Execution text box.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.
The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.
References
Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28–31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21–24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/
R News ISSN 1609-3631
Vol 41 June 2004 23
Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base").
Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().
Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.
The survival package

by Thomas Lumley
This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).
Overview
Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.
In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).
Specifying survival data
In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.
In data(veteran), time measures the time from the start of a lung cancer clinical trial, and status is 1 if the patient died at that time and 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.
> library(survival)
> data(veteran) ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+
...
> ## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
                     diagtime*30+time, status))
 [1] ( 210, 282 ] ( 150, 561 ]
 [3] (  90, 318 ] ( 270, 396 ]
...
 [9] ( 540, 854 ] ( 180, 280+]
The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.
One and two sample summaries
The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.
> data(veteran)
> plot(survfit(Surv(time, status)~trt, data=veteran),
       xlab="Years since randomisation",
       xscale=365, ylab="% surviving", yscale=100,
       col=c("forestgreen","blue"))
> survdiff(Surv(time, status)~trt, data=veteran)

Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

      N Observed Expected (O-E)^2/E (O-E)^2/V
trt=1 69       64     64.5   0.00388   0.00823
trt=2 68       64     63.5   0.00394   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928
Figure 1: Survival distributions for two lung cancer treatments. (x-axis: years since randomisation; y-axis: % surviving; curves labelled "Standard" and "New".)
Proportional hazards models
The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.
Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

    log h(t; z) = log h0(t) + βz.

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.
A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.
> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
                     log(bili) + log(protime) +
                     age + platelet,
                     data=pbc, subset=trt>0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
        log(bili) + log(protime) + age +
        platelet, data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999

              se(coef)     z       p
edtrt         0.300321  3.43 0.00061
log(bili)     0.097771  9.73 0.00000
log(protime)  1.031908  2.80 0.00520
age           0.008489  4.18 0.00003
platelet      0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312
The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.
> plot(survfit(Surv(time, status)~edtrt,
       data=pbc, subset=trt==-9))
> lines(survexp(~edtrt+
         ratetable(edtrt=edtrt, bili=bili,
                   platelet=platelet, age=age,
                   protime=protime),
         data=pbc, subset=trt==-9,
         ratetable=mayomodel, cohort=TRUE),
       col="purple")
The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.
Figure 2: Observed and predicted survival. (Predicted, purple, and observed, black, curves for the 106 non-trial patients; x-axis in days.)
The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e., that β is constant over time. The
cox.zph function provides formal tests and graphical diagnostics for proportional hazards.
> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

> ## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])
Figure 3: Testing proportional hazards. (Smoothed estimate of Beta(t) for edtrt plotted against time.)
There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.
Additional features
Computation for the Cox model extends straightforwardly to situations where z can vary over time, and where an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term, indicating which records belong to the same individual, and a strata() term, indicating subgroups that get their own unspecified baseline hazard function.
Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm, with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
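As a brief sketch of these two extensions (not from the original article, using the veteran data introduced earlier), strata() gives each cell type its own baseline hazard, and survreg() fits a parametric accelerated-failure-time model:

```r
library(survival)
data(veteran)

## Cox model with a separate (unspecified) baseline hazard per cell type
coxph(Surv(time, status) ~ trt + strata(celltype), data = veteran)

## Parametric Weibull model for (log) survival time
survreg(Surv(time, status) ~ trt, data = veteran, dist = "weibull")
```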
More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of follow-up.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
useR 2004: The R User Conference

by John Fox, McMaster University, Canada
More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis and David Meyer served as chairs of the conference, program committee and local organizing committee; Bettina Grün was in charge of local arrangements.
The conference program included nearly 100 presentations, many in keynote, plenary and semi-plenary sessions. The diversity of presentations (from bioinformatics to user interfaces, and finance to fights among lizards) reflects the wide range of applications supported by R.

A particularly interesting component of the program was the series of keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R, along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:
• Paul Murrell on grid graphics and programming
• Martin Mächler on good programming practice
• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on data mining and related topics
Slides for the keynote addresses, along with abstracts of other papers presented at useR 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.
An innovation of useR 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.
The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19, and a conference trip in the afternoon of May 22 to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR 2004, was very enthusiastic about the conference. We look forward to an even larger useR in 2006.
R Help Desk: Date and Time Classes in R

by Gabor Grothendieck and Thomas Petzoldt
Introduction
R-1.9.0 and its contributed packages have a number of datetime (i.e. date or date/time) classes. In particular, the following three classes are discussed in this article:
• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).
• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than datetime packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).

• POSIX classes. POSIX classes refer to the two classes POSIXct and POSIXlt and their common super class POSIXt. These support times and dates, including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases and other applications that involve external communication with systems that also support dates, times, time zones and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970, GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of
the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
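As a small illustration (not from the original article) of the three internal representations, using unclass() as suggested above; chron must be installed from CRAN:

```r
library(chron)

d  <- as.Date("1970-01-02")                          # days since 1970-01-01
dc <- chron("01/02/1970", "12:00:00")                # days + fraction of a day
dp <- as.POSIXct("1970-01-02 00:00:00", tz = "GMT")  # seconds since the origin

unclass(d)   # the number 1
unclass(dc)  # the number 1.5 (noon on January 2nd, 1970)
unclass(dp)  # the number 86400
```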
When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
A full page table is provided at the end of this ar-ticle which illustrates and compares idioms writtenin each of the three datetime systems above
Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.
Other Applications
Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word "usually" was employed above.
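A quick sketch of the Windows-spreadsheet conversion just described (the serial day numbers here are invented for illustration):

```r
library(chron)

x <- c(36526.5, 38000.25)         # hypothetical spreadsheet day numbers
as.Date("1899-12-30") + floor(x)  # Date class dates (time of day dropped)
chron("12/30/1899") + x           # chron datetimes, keeping the time of day
```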
SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).
Time Series
A common use of dates and datetimes is in time series. The ts class represents regular time series (i.e. equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its) and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or datetime package to represent datetimes.
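A brief sketch of the two situations, regular vs. irregular (assuming the contributed zoo package is installed):

```r
## regular monthly series: ts needs only a start period and a frequency
ts(rnorm(12), start = c(2004, 1), frequency = 12)

## irregular series: zoo indexes each observation by a Date (or other) value
library(zoo)
zoo(rnorm(3), as.Date(c("2004-06-01", "2004-06-03", "2004-06-10")))
```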
Input
By default, read.table will read in numeric data, such as 20040115, as numbers, and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.1
# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)
Avoiding Errors
The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.
With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:
# default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))

# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)

# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)

# if TRUE then year not displayed
options(chron.simplify = FALSE)
For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig+x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example:
1 This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/
orig <- chron("01/01/60")
x <- 0:9  # days since 01/01/60

# chron dates with default origin
orig + x
Regarding POSIX classes, the user should be aware of the following:
• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, the day of the week, the day of the month, the month of the year, the year and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

  dp <- seq(Sys.time(), len = 10, by = "day")
  plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.
• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general, caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.
• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.
• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
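The convert-to-character-first advice can be sketched as follows (the datetime is arbitrary, chosen for illustration):

```r
library(chron)

dp <- as.POSIXct("2004-06-01 23:30:00")  # in the current time zone

## going through format() keeps the displayed clock time
as.Date(format(dp))
chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S"))

## a direct conversion is interpreted in GMT and, for times near
## midnight, can land on a different calendar day
as.Date(dp)
```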
Comparison Table
Table 1 provides idioms written in each of the threeclasses discussed The reader can use this table toquickly access the commands for those phrases ineach of the three classes and to translate commandsamong the classes
Bibliography

Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2):8–11. URL http://CRAN.R-project.org/doc/Rnews/
Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de
Table 1: Comparison Table

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, dc are chron variables, and dp are POSIXct variables; variables ending in 0 are scalars, others are vectors. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for the % codes.

now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class="Date")
  chron:   chron(0)
  POSIXct: structure(0, class=c("POSIXt","POSIXct"))

x days since origin (GMT)
  Date:    structure(floor(x), class="Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class=c("POSIXt","POSIXct"))

difference in days
  Date:    dt1-dt2
  chron:   dc1-dc2
  POSIXct: difftime(dp1, dp2, unit="day") [1]

time zone difference
  POSIXct: dp - as.POSIXct(format(dp), tz="GMT")

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt+1
  chron:   dc+1
  POSIXct: seq(dp0, length=2, by="DSTday")[2] [2]

previous day
  Date:    dt-1
  chron:   dc-1
  POSIXct: seq(dp0, length=2, by="-1 DSTday")[2] [3]

x days since date
  Date:    dt0+floor(x)
  chron:   dc0+x
  POSIXct: seq(dp0, length=2, by=paste(x, "DSTday"))[2] [4]

sequence
  Date:    dts <- seq(dt0, length=10, by="day")
  chron:   dcs <- seq(dc0, length=10)
  POSIXct: dps <- seq(dp0, length=10, by="DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz="GMT"), rnorm(10)) [5]

every 2nd week
  Date:    seq(dt0, length=3, by="2 week")
  chron:   dc0 + seq(0, length=3, by=14)
  POSIXct: seq(dp0, length=3, by="2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc) - month.day.year(dc)$day + 1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun=0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc)-3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean for each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output: yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output: mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output: Sat Jul 1 1970
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input: "1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format="y-m-d")
  POSIXct: as.POSIXct(z)

Input: "10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz="")
  chron to Date:    as.Date(dates(dc))
  POSIXct to Date:  as.Date(format(dp))
  Date to chron:    chron(unclass(dt))
  POSIXct to chron: chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]
  Date to POSIXct:  as.POSIXct(format(dt))
  chron to POSIXct: as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz="GMT")
  chron to Date:    as.Date(dates(dc))
  POSIXct to Date:  as.Date(dp)
  Date to chron:    chron(unclass(dt))
  POSIXct to chron: as.chron(dp)
  Date to POSIXct:  as.POSIXct(format(dt), tz="GMT")
  chron to POSIXct: as.POSIXct(dc)

[1] Can be reduced to dp1-dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp+24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp-24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp+24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz="GMT")).
Programmers' Niche: A simple class in S3 and S4
by Thomas Lumley
ROC curves
Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).
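As a toy numerical check of the definition (data invented for illustration), the two rates at a single cutpoint c = 3.5 are:

```r
T <- c(1, 2, 3, 4, 5, 6)  # test results
D <- c(0, 0, 1, 0, 1, 1)  # disease indicator

sens <- sum(D[T > 3.5]) / sum(D)            # Pr(T > c | D = 1) = 2/3
fpr  <- sum((1 - D)[T > 3.5]) / sum(1 - D)  # Pr(T > c | D = 0) = 1/3
```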
I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.
Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±Inf and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.
drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type="l")
}
It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.
There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number rather than just an object for which > and <= are defined.
drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  plot(mspec, sens, type="l")
}
One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
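This monotonicity is easy to check numerically; a quick sketch with invented data (not from the original article):

```r
T <- c(1, 2, 3, 4, 5, 6)
D <- c(0, 0, 1, 0, 1, 1)

DD <- table(-T, D)                       # rows ordered by decreasing T
sens  <- cumsum(DD[,2]) / sum(DD[,2])    # true positive rates
mspec <- cumsum(DD[,1]) / sum(DD[,1])    # false positive rates

all(diff(sens) >= 0) && all(diff(mspec) >= 0)  # TRUE: both nondecreasing
```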
Classes and methods
Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec, test=TT,
               call=sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}
The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.
plot.ROC <- function(x, type="b", null.line=TRUE,
                     xlab="1-Specificity",
                     ylab="Sensitivity",
                     main=NULL, ...) {
  par(pty="s")
  plot(x$mspec, x$sens, type=type,
       xlab=xlab, ylab=ylab, ...)
  if (null.line)
    abline(0, 1, lty=3)
  if (is.null(main))
    main <- x$call
  title(main=main)
}
A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:
lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}
A method for identify allows the cutpoint to be found for any point on the curve:
identify.ROC <- function(x, labels=NULL, ...,
                         digits=1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels=labels, ...)
}
An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
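One possible solution to that exercise (a sketch, not the article's own) is the trapezoidal rule applied to the stored coordinates, padded with the (0,0) and (1,1) endpoints; it assumes the list structure created by ROC() above:

```r
AUC <- function(x) {
  mspec <- c(0, x$mspec, 1)
  sens  <- c(0, x$sens, 1)
  ## sum of trapezoid areas between successive (mspec, sens) points
  sum(diff(mspec) * (head(sens, -1) + tail(sens, -1)) / 2)
}
```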
Generalising the class
The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.
ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
               test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}
The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.
ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type="response"))
  disease <- disease*weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
               test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}
S4 classes and methods
In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).
As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.
The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:
setClass("ROC",
  representation(sens="numeric", mspec="numeric",
                 test="numeric", call="call"),
  validity=function(object) {
    length(object@sens)==length(object@mspec) &&
      length(object@sens)==length(object@test)
  })
The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new()
R News ISSN 1609-3631
Vol. 4/1, June 2004
function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function.
ROC <- function(T, D){
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  new("ROC", sens=sens, mspec=mspec,
      test=TT, call=sys.call())
}
Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code.
setMethod("show", "ROC",
  function(object){
    cat("ROC curve: ")
    print(object@call)
  })
The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.
The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a "..." argument for further graphical parameters.
setMethod("plot", c("ROC", "missing"),
  function(x, y, type="b", null.line=TRUE,
           xlab="1-Specificity",
           ylab="Sensitivity",
           main=NULL, ...){
    par(pty="s")
    plot(x@mspec, x@sens, type=type,
         xlab=xlab, ylab=ylab, ...)
    if(null.line)
      abline(0, 1, lty=3)
    if(is.null(main))
      main <- x@call
    title(main=main)
  })
Creating a method for lines appears to work the same way.
setMethod("lines", "ROC",
  function(x, ...)
    lines(x@mspec, x@sens, ...))
In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before
> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>
and after the setMethod call
> lines
standardGeneric for "lines" defined from
package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
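This coexistence can be verified interactively. In the sketch below the class name is invented purely for illustration; setting an S4 method on lines creates the implicit S4 generic while leaving S3 dispatch intact:

```r
library(methods)
library(graphics)

# Hypothetical S4 class, for illustration only
setClass("demoCurve", representation(x = "numeric", y = "numeric"))

# lines() has S3 methods but is not an S4 generic;
# setMethod() therefore creates an implicit S4 generic for it
setMethod("lines", "demoCurve",
          function(x, ...) lines(x@x, x@y, ...))

isGeneric("lines")        # the implicit S4 generic now exists
exists("lines.default")   # the S3 default method is still visible
```

Nothing is drawn here; the point is only that both method systems remain registered for the same function name.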
Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new() is used to create the ROC object.
setGeneric("ROC")
setMethod("ROC", c("glm", "missing"),
  function(T, D){
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type="response"))
    disease <- disease*weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    new("ROC", sens=sens, mspec=mspec,
        test=TT, call=sys.call())
  })
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.
setGeneric("identify", signature=c("x"))
setMethod("identify", "ROC",
  function(x, labels=NULL, ...,
           digits=1){
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels=labels, ...)
  })
Discussion
Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.
This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R
by the R Core Team
New features in 1.9.1
• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc look for variables in their formulae.
User-visible changes in 1.9.0
• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4 which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook; see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.
New features in 1.9.0
• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically those of get/assign to the environment with inherits=FALSE.
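A minimal sketch of these environment accessors (the variable names are invented):

```r
e <- new.env()
e$answer <- 42    # $<- assigns a binding into the environment
e[["answer"]]     # [[ retrieves it; only character indices, no partial matching

# The equivalent call using get(), with inherits=FALSE semantics
get("answer", envir = e, inherits = FALSE)
```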
• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models (with a warning).

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace() parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times) plus many utility functions similar to those for date-times. See ?Date.
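A brief sketch of the "Date" class (the date chosen is arbitrary):

```r
d <- as.Date("2004-06-30")
d + 7           # arithmetic is in whole days
weekdays(d)     # locale-dependent day name
as.numeric(d)   # internally, days since 1970-01-01
```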
• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial() defined as gamma(x+1) and, for S-PLUS compatibility, lfactorial() defined as lgamma(x+1).
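The gamma-function definitions can be checked directly; lfactorial() is useful when the factorial itself would overflow:

```r
factorial(5)       # gamma(5 + 1) = 120
lfactorial(52)     # log(52!), computed without overflow

# e.g. a binomial coefficient via log-factorials: choose(52, 5)
n_hands <- exp(lfactorial(52) - lfactorial(5) - lfactorial(47))
```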
• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.

• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.
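A minimal sketch of mget() (the environment and names are invented):

```r
e <- new.env()
assign("a", 1, envir = e)
assign("b", 2, envir = e)

# Retrieve several bindings at once, as a named list
vals <- mget(c("a", "b"), envir = e)
```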
• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).
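Two of the rep() compatibility changes above can be illustrated directly:

```r
# A 0-length input with length.out=n now yields a vector of length n
rep(numeric(0), length.out = 3)

# With both each and length.out, the result recycles rather than
# padding with NAs
rep(1:2, each = 2, length.out = 5)
```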
• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.
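A short sketch of shQuote() (the file name is invented); quoting protects spaces and embedded quotes before a string is handed to system():

```r
fname <- "my data.txt"                 # hypothetical file name with a space
cmd <- paste("ls -l", shQuote(fname))  # safe to pass to system()

shQuote("abc")    # simple strings are wrapped in single quotes
shQuote("don't")  # embedded single quotes are escaped for the shell
```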
• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc), strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split="" is optimized.
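The new strsplit() arguments can be sketched as follows:

```r
# fixed=TRUE treats the split string literally rather than as a regex
strsplit("a.b.c", ".", fixed = TRUE)[[1]]

# split="" (the optimized case) splits into individual characters
strsplit("abc", "")[[1]]
```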
• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.

• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts' defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and if available used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).
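A minimal sketch of the environment-to-list coercion (the bindings are invented):

```r
e <- new.env()
e$x <- 1
e$y <- "two"

# as.list() on an environment returns a named list of its bindings
contents <- as.list(e)
```

Note that the order of elements in the resulting list is not guaranteed.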
• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport(), and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.

– Added vpList(), vpStack(), vpTree(), and current.vpTree() to allow creation of viewport bundles that may be pushed at once (lists are pushed in parallel, stacks in series).

current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of viewport path in downViewport() and seekViewport().

See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator :: in later versions).

– Added "just" argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

– Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

  pushViewport(viewport(w=.5, h=.5, name="A"))
  grid.rect()
  pushViewport(viewport(w=.5, h=.5, name="B"))
  grid.rect(gp=gpar(col="grey"))
  upViewport(2)
  grid.rect(vp="A", gp=gpar(fill="red"))
  grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))
– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

* the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

In addition, the following features are now possible with grobs:

* grobs now save()/load() like any normal R object.

* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their "vp" slot.

* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

– Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth", and "grobheight" unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for viewport clip arg are now:

  "on" = clip to the viewport (was TRUE)
  "inherit" = clip to whatever parent says (was FALSE)
  "off" = turn off clipping

Still accept logical values (and NA maps to "off").
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file, when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories, as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(x, deriv=2) and psigamma(x, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blas=goto to use K. Goto's optimized BLAS will now work.
The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN
by Kurt Hornik
New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.
BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.
DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaged by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.
debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. Energy-statistics concept based on a generalization of Newton's potential energy is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. By S original (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing hua Zhao in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
R News ISSN 1609-3631
Vol. 4/1, June 2004
kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-chain versus within-chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal and external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.
pgam This work is aimed at extending a class of state space models for Poisson count data, so-called Poisson-Gamma models, towards a semiparametric specification. Just as in generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and a backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and for running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Provides an easy, straightforward way to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for robust location and scatter estimation and robust regression with high breakdown point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe algorithm using the method described in Schoenfeld D. (2001), "A Simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 57, 972–974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state and steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS 2002, 99, pg. 15869, and J. Comput. Chem. 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for supervised grouping of predictor variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier, with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (simulated urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.
Other changes
• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News
by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)
New supporting institutions
• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232
USA
Editorial Board: Douglas Bates and Paul Murrell.
Editor Programmer's Niche: Bill Venables
Editor Help Desk: Uwe Ligges
Email of editors and editorial board: firstname.lastname@R-project.org
R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions should go to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).
R Project Homepage:
http://www.R-project.org/
This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
qcc: An R package for quality control charting and statistical process control
by Luca Scrucca
Introduction
The qcc package for the R statistical environment allows the user to:
• plot Shewhart quality control charts for continuous, attribute and count data;

• plot Cusum and EWMA charts for continuous data;

• draw operating characteristic curves;

• perform process capability analyses;

• draw Pareto charts and cause-and-effect diagrams.
I started writing the package to provide the students in the class I teach a tool for learning the basic concepts of statistical quality control, at the introductory level of a book such as Montgomery (2000). Due to the intended final users, a free statistical environment was important, and R was a natural candidate.
The initial design, although written from scratch, reflected the set of functions available in S-Plus. Users of that environment will find some similarities but also differences. In particular, during development I added more tools and re-designed most of the library to meet the requirements I thought were important to satisfy.
Creating a qcc object
A qcc object is the basic building block to start with. It can be created simply by invoking the qcc function, which in its simplest form requires two arguments: a data frame, a matrix or a vector containing the observed data, and a string value specifying the control chart to be computed. Standard Shewhart control charts are available. These are often classified according to the type of quality characteristic that they are supposed to monitor (see Table 1).
Control charts for continuous variables are usually based on several samples with observations collected at different time points. Each sample or "rational group" must be provided as a row of a data frame or a matrix. The function qcc.groups can be used to easily group a vector of data values based on a sample indicator. Unequal sample sizes are allowed.
Suppose we have extracted samples for a characteristic of interest from an ongoing production process:
> data(pistonrings)
> attach(pistonrings)
> dim(pistonrings)
[1] 200 3
> pistonrings
    diameter sample trial
1     74.030      1  TRUE
2     74.002      1  TRUE
3     74.019      1  TRUE
4     73.992      1  TRUE
5     74.008      1  TRUE
6     73.995      2  TRUE
7     73.992      2  TRUE
...
199   74.000     40 FALSE
200   74.020     40 FALSE
> diameter <- qcc.groups(diameter, sample)
> dim(diameter)
[1] 40  5
> diameter
     [,1]   [,2]   [,3]   [,4]   [,5]
1  74.030 74.002 74.019 73.992 74.008
2  73.995 73.992 74.001 74.011 74.004
...
40 74.010 74.005 74.029 74.000 74.020
Using the first 25 samples as training data, an X chart can be obtained as follows:
> obj <- qcc(diameter[1:25,], type="xbar")
By default a Shewhart chart is drawn (see Figure 1) and an object of class 'qcc' is returned. Summary statistics can be retrieved using the summary method (or simply by printing the object):
> summary(obj)
Call:
qcc(data = diameter[1:25, ], type = "xbar")

xbar chart for diameter[1:25, ]

Summary of group statistics:
  Min. 1st Qu. Median  Mean 3rd Qu.  Max.
 73.99   74.00  74.00 74.00   74.00 74.01

Group sample size: 5
Number of groups: 25
Center of group statistics: 74.00118
Standard deviation: 0.009887547

Control limits:
      LCL      UCL
 73.98791 74.01444
type      Control charts for variables
xbar      X chart. Sample means are plotted to control the mean level of a continuous process variable.
xbar.one  X chart. Sample values from a one-at-a-time data process are plotted to control the mean level of a continuous process variable.
R         R chart. Sample ranges are plotted to control the variability of a continuous process variable.
S         S chart. Sample standard deviations are plotted to control the variability of a continuous process variable.

type      Control charts for attributes
p         p chart. The proportion of nonconforming units is plotted. Control limits are based on the binomial distribution.
np        np chart. The number of nonconforming units is plotted. Control limits are based on the binomial distribution.
c         c chart. The number of defectives per unit is plotted. This chart assumes that defects of the quality attribute are rare, and the control limits are computed based on the Poisson distribution.
u         u chart. The average number of defectives per unit is plotted. The Poisson distribution is used to compute control limits, but unlike the c chart this chart does not require a constant number of units.
Table 1: Shewhart control charts available in the qcc package.
The number of groups and their sizes are reported in this case, whereas a table is provided in the case of unequal sample sizes. Moreover, the center of group statistics (the overall mean for an X chart) and the within-group standard deviation of the process are returned.
The main goal of a control chart is to monitor a process. If special causes of variation are present, the process is said to be "out of control" and actions are to be taken to find, and possibly eliminate, such causes. A process is declared to be "in control" if all points charted lie randomly within the control limits. These are usually computed at ±3σ from the center. This default can be changed using the argument nsigmas (so-called "American practice") or by specifying the confidence level (so-called "British practice") through the confidence.level argument.
Assuming a Gaussian distribution is appropriate for the statistic charted, the two practices are equivalent, since limits at ±3σ correspond to a two-tailed probability of 0.0027 under a standard normal curve.
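This probability is easy to verify directly in R:

```r
## Two-tailed probability of exceeding 3 standard deviations
## under a standard normal distribution
2 * pnorm(-3)
## [1] 0.002699796
```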
Instead of using standard calculations to compute the group statistics and the corresponding control limits (see Montgomery (2000); Wetherill and Brown (1991)), the center, the within-group standard deviation and the control limits may be specified directly by the user through the arguments center, std.dev and limits, respectively.
Figure 1: An X chart for samples data.
Shewhart control charts
A Shewhart chart is automatically plotted when an object of class 'qcc' is created, unless the qcc function is called with the argument plot=FALSE. The method responsible for plotting a control chart can, however, be invoked directly as follows:
> plot(obj)
giving the plot in Figure 1. This control chart has the center line (horizontal solid line), the upper and lower control limits (dashed lines), and the sample group statistics drawn as points connected with lines. Unless we provide the argument add.stats=FALSE, some summary statistics are shown at the bottom of the plot, together with the number of points beyond control limits and the number of violating runs (a run has length 5 by default).
Once a process is considered to be "in control", we may use the computed limits for monitoring new data sampled from the same ongoing process. For example,
> obj <- qcc(diameter[1:25,], type="xbar",
+            newdata=diameter[26:40,])
plots the X chart for both calibration data and new data (see Figure 2), but all the statistics and the control limits are solely based on the first 25 samples.
Figure 2: An X chart with calibration data and new data samples.
If only the new data need to be plotted, we may provide the argument chart.all=FALSE in the call to the qcc function, or through the plotting method:
> plot(obj, chart.all=FALSE)
Many other arguments may be provided in the call to the qcc function and to the corresponding plotting method, and we refer the reader to the on-line documentation for details (help(qcc)).
As reported in Table 1, an X chart for one-at-a-time data may be obtained by specifying type="xbar.one". In this case the data charted are simply the sample values, and the within-group standard deviation is estimated by moving ranges of k (by default 2) points.
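As a minimal sketch (the data values below are arbitrary illustrations, not taken from the article's examples):

```r
## Sketch: charting individual observations with type="xbar.one";
## the data values here are arbitrary.
library(qcc)
x <- c(74.030, 74.002, 74.019, 73.992, 74.008,
       73.995, 73.992, 74.001, 74.011, 74.004)
qcc(x, type="xbar.one")
```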
Control charts for attributes mainly differ from the previous examples in that we need to provide sample sizes through the size argument (except for the c chart). For example, a p chart can be obtained as follows:
> data(orangejuice)
> attach(orangejuice)
> orangejuice
   sample  D size trial
1       1 12   50  TRUE
2       2 15   50  TRUE
3       3  8   50  TRUE
...
53     53  3   50 FALSE
54     54  5   50 FALSE
> obj2 <- qcc(D[trial], sizes=size[trial], type="p")
where we use the logical indicator variable trial to select the first 30 samples as calibration data.
Operating characteristic function
An operating characteristic (OC) curve provides information about the probability of not detecting a shift in the process. This is usually referred to as the type II error, that is, the probability of erroneously accepting a process as being "in control".
The OC curves can be easily obtained from an object of class 'qcc':

> par(mfrow=c(1,2))
> oc.curves(obj)
> oc.curves(obj2)
The function oc.curves invisibly returns a matrix or a vector of probability values for the type II error. More arguments are available (see help(oc.curves)); in particular, identify=TRUE allows the user to interactively identify values on the plot.
Figure 3: Operating characteristic curves for the xbar chart (left) and the p chart (right).
Cusum charts
Cusum charts display how the group summary statistics deviate above or below the process center or target value, relative to the standard error of the summary statistics. They are useful to detect small and permanent variation in the mean of the process.
The basic cusum chart implemented in the qcc package is only available for continuous variables at the moment. Assuming a 'qcc' object has been created, we can simply input the following code:
> cusum(obj)
and the resulting cusum chart is shown in Figure 4. This displays on a single plot both the positive deviations and the negative deviations from the target, which by default is set to the center of group statistics. If a different target for the process is required, this value may be provided through the argument target when creating the object of class 'qcc'. Two further aspects may need to be controlled. The argument decision.int can be used to specify the number of standard errors of the summary statistics at which the cumulative sum signals an out-of-control state; by default it is set at 5. The amount of shift in the process to be detected by the cusum chart, measured in standard errors of the summary statistics, is controlled by the argument se.shift, by default set at 1.
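For instance, assuming decision.int and se.shift can be passed to cusum directly, as the description above suggests, a chart tuned to detect a larger shift might be requested as follows (argument values chosen purely for illustration):

```r
## Sketch: 'obj' is the qcc object built earlier from the piston
## rings data; the argument values are illustrative only.
cusum(obj, decision.int = 4, se.shift = 1.5)
```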
Figure 4: Cusum chart.
EWMA charts
Exponentially weighted moving average (EWMA) charts smooth a series of data based on a moving average with weights which decay exponentially. As with the cusum chart, this plot can also be used to detect small and permanent variation in the mean of the process.
The EWMA chart depends on the smoothing parameter λ, which controls the weighting scheme applied. By default it is set at 0.2, but it can be modified using the argument lambda. Assuming again that a 'qcc' object has been created, an EWMA chart can be obtained as follows:
> ewma(obj)
and the resulting graph is shown in Figure 5.
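To give more weight to recent observations, a larger smoothing parameter could be tried, e.g. (value chosen purely for illustration):

```r
## Sketch: 'obj' is the qcc object from the earlier examples.
ewma(obj, lambda = 0.4)
```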
Figure 5: EWMA chart.
Process capability analysis
Process capability indices for a characteristic of interest from a continuous process can be obtained through the function process.capability. This takes as mandatory arguments an object of class 'qcc' of type "xbar" or "xbar.one", and spec.limits, a vector specifying the lower (LSL) and upper (USL) specification limits. For example, using the previously created 'qcc' object, we may simply use:
> process.capability(obj,
+            spec.limits=c(73.95,74.05))

Process Capability Analysis

Call:
process.capability(object = obj,
                   spec.limits = c(73.95, 74.05))

Number of obs = 125      Target = 74
Center = 74.00118        LSL = 73.95
StdDev = 0.009887547     USL = 74.05

Capability indices:

       Value   2.5%  97.5%
Cp     1.686  1.476  1.895
Cp_l   1.725  1.539  1.912
Cp_u   1.646  1.467  1.825
Cp_k   1.646  1.433  1.859
Cpm    1.674  1.465  1.882

Exp<LSL 0%   Obs<LSL 0%
Exp>USL 0%   Obs>USL 0%
The graph shown in Figure 6 is a histogram of data values, with vertical dotted lines for the upper and lower specification limits. The target is shown as a vertical dashed line, and its value can be provided using the target argument in the call; if missing, the value from the 'qcc' object is used if available, otherwise the target is set at the middle value between the specification limits.
The dashed curve is a normal density for a Gaussian distribution matching the observed mean and standard deviation. The latter is estimated using the within-group standard deviation, unless a value is specified in the call using the std.dev argument.
The capability indices reported on the graph are also printed at the console, together with confidence intervals computed at the level specified by the confidence.level argument (by default set at 0.95).
Figure 6: Process capability analysis.
A further display, called a "process capability sixpack plot", is also available. This is a graphical summary formed by plotting on the same graphical device:
• an X chart

• an R or S chart (if sample sizes > 10)

• a run chart

• a histogram with specification limits

• a normal Q-Q plot

• a capability plot
As an example we may input the following code:
> process.capability.sixpack(obj,
+            spec.limits=c(73.95,74.05),
+            target=74.01)
Pareto charts and cause-and-effect diagrams
A Pareto chart is a barplot where the categories are ordered by frequency in non-increasing order, with a line added to show their cumulative sum. Pareto charts are useful to identify those factors that have the greatest cumulative effect on the system,
and thus screen out the less significant factors in an analysis.
A simple Pareto chart may be drawn using the function pareto.chart, which in its simplest form requires a vector of values. For example:
> defect <- c(80, 27, 66, 94, 33)
> names(defect) <-
+   c("price code", "schedule date",
+     "supplier code", "contact num.",
+     "part num.")
> pareto.chart(defect)
The function returns a table of descriptive statistics and the Pareto chart shown in Figure 7. More arguments can be provided to control axes labels and limits, the main title and bar colors (see help(pareto.chart)).
Figure 7: Pareto chart.
A cause-and-effect diagram, also called an Ishikawa diagram (or fish bone diagram), is used to associate multiple possible causes with a single effect. Thus, given a particular effect, the diagram is constructed to identify and organize the possible causes for it.
A very basic implementation of this type of diagram is available in the qcc package. The function cause.and.effect requires at least two arguments: cause, a list of causes and branches providing descriptive labels, and effect, a string describing the effect. Other arguments are available, such as cex and font, which control, respectively, the character expansions and the fonts used for labeling branches, causes and the effect. The following example creates the diagram shown in Figure 8.
> cause.and.effect(
+   cause=list(Measurements=c("Microscopes",
+                             "Inspectors"),
+              Materials=c("Alloys",
+                          "Suppliers"),
+              Personnel=c("Supervisors",
+                          "Operators"),
+              Environment=c("Condensation",
+                            "Moisture"),
+              Methods=c("Brake", "Engager",
+                        "Angle"),
+              Machines=c("Speed", "Bits",
+                         "Sockets")),
+   effect="Surface Flaws")
Figure 8: Cause-and-effect diagram.
Tuning and extending the package
Options to control some aspects of the qcc package can be set using the function qcc.options. If this is called with no arguments it retrieves the list of available options and their current values; otherwise, it sets the required option. From a user point of view, the most interesting options allow the user to set: the plotting character and the color used to highlight points beyond control limits or violating runs; the run length, i.e. the maximum value of a run before a point is signalled as out of control; the background color used to draw the figure and the margin color; and the character expansion used to draw the labels, the title and the statistics at the bottom of any plot. For details see help(qcc.options).
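As a sketch, a session adjusting one of these settings might look like the following (the option name "run.length" follows the package documentation of this release and may differ in later versions):

```r
## Illustrative sketch only: "run.length" is assumed to be the
## option controlling the run-length threshold described above.
library(qcc)
qcc.options()                 # list all options and current values
qcc.options("run.length")     # query the current run length
qcc.options(run.length = 7)   # signal runs of length 7 instead of 5
```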
Another interesting feature is the possibility toeasily extend the package defining new controlcharts Suppose we want to plot a p chart basedon samples with different sizes The resulting con-trol chart will have nonndashconstant control limits andthus it may result difficult to read by some opera-tors One possible solution is to draw a standard-ized control chart This can be accomplished defin-ing three new functions one which computes andreturns the group statistics to be charted (ie the zscores) and their center (zero in this case) anotherone which returns the within-group standard devia-tion (one in this case) and finally a function whichreturns the control limits These functions need to be
R News ISSN 1609-3631
Vol. 4/1, June 2004
called stats.type, sd.type and limits.type, respectively, for a new chart of type "type". The following code may be used to define a standardized p chart:
stats.p.std <- function(data, sizes)
{
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data)/sum(sizes)
  z <- (data/sizes - pbar)/sqrt(pbar*(1-pbar)/sizes)
  list(statistics = z, center = 0)
}

sd.p.std <- function(data, sizes) return(1)

limits.p.std <- function(center, std.dev, sizes, conf)
{
  if (conf >= 1)
   { lcl <- -conf
     ucl <- +conf }
  else
   { if (conf > 0 & conf < 1)
      { nsigmas <- qnorm(1 - (1 - conf)/2)
        lcl <- -nsigmas
        ucl <- +nsigmas }
     else stop("invalid conf argument") }
  limits <- matrix(c(lcl, ucl), ncol = 2)
  rownames(limits) <- rep("", length = nrow(limits))
  colnames(limits) <- c("LCL", "UCL")
  return(limits)
}
Then we may source the above code and obtain the control charts in Figure 9 as follows:
# set unequal sample sizes
> n <- c(rep(50,5), rep(100,5), rep(25,5))
# generate randomly the number of successes
> x <- rbinom(length(n), n, 0.2)
> par(mfrow=c(1,2))
# plot the control chart with variable limits
> qcc(x, type="p", sizes=n)
# plot the standardized control chart
> qcc(x, type="p.std", sizes=n)
Figure 9: A p chart with non-constant control limits (left panel) and the corresponding standardized control chart (right panel).
Summary
In this paper we briefly describe the qcc package. This provides functions to draw basic Shewhart quality control charts for continuous, attribute and count data; corresponding operating characteristic curves are also implemented. Other statistical quality tools available are Cusum and EWMA charts for continuous data, process capability analyses, Pareto charts and cause-and-effect diagrams.
References
Montgomery, D.C. (2000) Introduction to Statistical Quality Control, 4th ed. New York: John Wiley & Sons.

Wetherill, G.B. and Brown, D.W. (1991) Statistical Process Control. New York: Chapman & Hall.
Luca Scrucca
Dipartimento di Scienze Statistiche, Università degli Studi di Perugia, Italy
luca@stat.unipg.it
Least Squares Calculations in R
Timing different approaches
by Douglas Bates
Introduction
The S language encourages users to become programmers as they express new ideas and techniques of data analysis in S. Frequently the calculations in these techniques are expressed in terms of matrix operations. The subject of numerical linear algebra (how calculations with matrices can be carried out accurately and efficiently) is quite different from what many of us learn in linear algebra courses or in courses on linear statistical methods.
Numerical linear algebra is primarily based on decompositions of matrices, such as the LU decomposition, the QR decomposition, and the Cholesky decomposition, that are rarely discussed in linear algebra or statistics courses.
In this article we discuss one of the most common operations in statistical computing: calculating least squares estimates. We can write the problem mathematically as

    β̂ = arg min_β ‖y − Xβ‖²        (1)

where X is an n × p model matrix (p ≤ n), y is n-dimensional and β is p-dimensional. Most statistics texts state that the solution to (1) is

    β̂ = (X′X)⁻¹ X′y                 (2)
when X has full column rank (i.e. the columns of X are linearly independent).
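The step from (1) to (2) is the usual normal-equations argument; a brief sketch, using the same notation as above:

```latex
% Setting the gradient of the residual sum of squares to zero:
\nabla_{\beta}\,\|y - X\beta\|^{2}
   = \nabla_{\beta}\,(y - X\beta)'(y - X\beta)
   = -2\,X'(y - X\beta) = 0
\quad\Longrightarrow\quad
X'X\hat{\beta} = X'y
\quad\Longrightarrow\quad
\hat{\beta} = (X'X)^{-1}X'y ,
```

where the inverse exists exactly when X has full column rank.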
Indeed (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way. That is, they use code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.
General principles
A few general principles of numerical linear algebra are:

1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.

2. Use crossprod(X), not t(X) %*% X, to calculate X′X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X′y.

3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X′X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
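The first two principles are easy to verify in base R; the following is a small sketch on an invented random problem:

```r
## Small random least squares problem (base R only)
set.seed(1)
X <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)
y <- rnorm(100)

## Principle 2: crossprod computes X'X and X'y without forming t(X)
stopifnot(all.equal(crossprod(X), t(X) %*% X))
stopifnot(all.equal(crossprod(X, y), t(X) %*% y))

## Principle 1: solve(A, b) gives the same answer as solve(A) %*% b,
## but without explicitly inverting A
A <- crossprod(X)
b <- crossprod(X, y)
stopifnot(all.equal(solve(A, b), solve(A) %*% b))
```

The stopifnot calls pass silently, confirming that the preferred forms agree numerically with the naive forms.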
Some timings on a large example
For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well:
> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850  712
> sys.gc.time(naive.sol <- solve(t(mmm) %*%
+     mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00
(The function sys.gc.time is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)
According to the principles above, it should be more effective to compute
> sys.gc.time(cpod.sol <- solve(crossprod(mmm),
+     crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE
Timing results will vary between computers but in most cases the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X′X the naive way
> sys.gc.time(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00
because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.
However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result the solution of (1) is performed using a general linear system solver based on an LU decomposition when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly but the code is awkward:
> sys.gc.time(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sys.gc.time(chol.sol <- backsolve(ch,
+     forwardsolve(ch, crossprod(mmm, y),
+         upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE
Least squares calculations with Matrix classes
The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition.
> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(Mat.sol <- solve(crossprod(mmg),
+     crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE
Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object so that it can be reused without recalculation.
> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all:
> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(sparse.sol <- solve(crossprod(mm),
+     crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE
The model matrix mm has class "cscMatrix", indicating that it is a compressed, sparse, column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric, sparse, column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case the decomposition is so fast that it is difficult to determine the difference in the solution times:
> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
Epilogue
Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS), such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or Atlas (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.
Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.
Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A′B are generally faster than those in AB.
Notice that this means that calculating A′B in R as t(A) %*% B, instead of crossprod(A, B), causes A to be transposed twice: once in the calculation of t(A) and a second time in the inner loop of the matrix product. The crossprod function does not do any transposition of matrices; yet another reason to use crossprod.
Bibliography
J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).
D. M. Bates
U. of Wisconsin-Madison
bates@wisc.edu
Tools for interactively exploring R packages
by Jianhua Zhang and Robert Gentleman
Introduction
An R package has the required and perhaps additional subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer and they are intended for exploring packages, examples and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.
Description
Our current implementation is in the form of tcltk widgets and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcltk. Once both tkWidgets and widgetTools have been loaded into the working session the user has full access to these interactive tools.
pExplorer
pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from showing by the widget. The default behavior is to open the first package in the library of locally installed R (.libPaths()) with subdirectories/files named "Meta" and "latex" excluded if nothing is provided by a user.
Figure 1 shows the widget invoked by typing pExplorer("base").
The following tasks can be performed through the interface:
• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

• Clicking the Browse button allows users to select a directory containing R packages to achieve the same results as typing in a new path.

• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.
Figure 1: A snapshot of the widget when pExplorer is called.
• Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just being clicked and the Back button just below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, to be discussed next.
eExplorer
eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files containing example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.
Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.
vExplorer
vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionality provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of the vignettes of the package.
Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget, as shown by Figure 4, appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:
• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the pdf version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in the Results of Execution text box.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.
The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.
References
Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28-31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21-24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.
Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base").
Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().
Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.
The survival package
by Thomas Lumley
This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).
Overview
Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.
In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.
Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).
Specifying survival data
In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.
In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.
> library(survival)
> data(veteran) # time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+
...
# time from diagnosis to death
> with(veteran, Surv(diagtime*30,
+                    diagtime*30+time, status))
 [1] ( 210, 282 ]  ( 150, 561 ]
 [3] (  90, 318 ]  ( 270, 396 ]
...
 [9] ( 540, 854 ]  ( 180, 280+]
The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.
One and two sample summaries
The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.
Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.
> data(veteran)
> plot(survfit(Surv(time, status) ~ trt, data = veteran),
+      xlab = "Years since randomisation",
+      xscale = 365, ylab = "% surviving",
+      yscale = 100, col = c("forestgreen", "blue"))
> survdiff(Surv(time, status) ~ trt, data = veteran)
Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

       N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394
      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928
Figure 1: Survival distributions for two lung cancer treatments.
Proportional hazards models
The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.
Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

    log h(t; z) = log h0(t) + β′z.
Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.
A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.
> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
+                    log(bili) + log(protime) +
+                    age + platelet,
+                    data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
      log(bili) + log(protime) + age + platelet,
      data = pbc, subset = trt > 0)

                  coef exp(coef)
edtrt          1.02980     2.800
log(bili)      0.95100     2.588
log(protime)   2.88544    17.911
age            0.03544     1.036
platelet      -0.00128     0.999

                se(coef)     z       p
edtrt           0.300321  3.43 0.00061
log(bili)       0.097771  9.73 0.00000
log(protime)    1.031908  2.80 0.00520
age             0.008489  4.18 0.00003
platelet        0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312
The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.
> plot(survfit(Surv(time, status) ~ edtrt,
+              data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt +
+          ratetable(edtrt = edtrt, bili = bili,
+                    platelet = platelet, age = age,
+                    protime = protime),
+          data = pbc, subset = trt == -9,
+          ratetable = mayomodel, cohort = TRUE),
+       col = "purple")
The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.
Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.
Figure 2: Observed and predicted survival.
The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The
cox.zph function provides formal tests and graphical diagnostics for proportional hazards:
> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

# graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])
Figure 3: Testing proportional hazards.
There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.
Additional features
Computation for the Cox model extends straightforwardly to situations where z can vary over time, and when an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.
Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
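As a sketch of survreg in action, the veteran data used earlier can be refitted with a parametric model (the Weibull distribution used here is simply survreg's default choice, not one suggested by the article):

```r
library(survival)
data(veteran)
## Accelerated failure time model for survival time vs. treatment
wfit <- survreg(Surv(time, status) ~ trt, data = veteran,
                dist = "weibull")
summary(wfit)
```

The summary reports the coefficients on the log-time scale together with the estimated scale parameter of the error distribution.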
More recent additions include penalised likeli-hood estimation for smoothing splines and for ran-dom effects models
The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of follow-up.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
useR! 2004
The R User Conference
by John Fox
McMaster University, Canada
More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference program committee and local organizing committee; Bettina Grün was in charge of local arrangements.
The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.
A particularly interesting component of the program were keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:
• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice
• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on data mining and related topics
Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.
An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.
The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19 and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.
Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.
R Help Desk
Date and Time Classes in R
by Gabor Grothendieck and Thomas Petzoldt
Introduction
R-1.9.0 and its contributed packages have a number of datetime (i.e. date or date/time) classes. In particular, the following three classes are discussed in this article:
• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in strptime (although strptime itself is not a Date method).
• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).
• POSIX classes. POSIX classes refer to the two classes POSIXct, POSIXlt and their common super class POSIXt. These support times and dates including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases and other applications that involve external communication with systems that also support dates, times, time zones and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970 GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
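The points above about the Date class can be illustrated with a few lines of base R:

```r
## Date objects are stored internally as days since January 1, 1970
stopifnot(as.numeric(as.Date("1970-01-02")) == 1)

## strptime-style percent codes work for conversion in and out
d <- as.Date("15/06/2004", format = "%d/%m/%Y")
stopifnot(d == as.Date("2004-06-15"))
format(d, "%d %B %Y")   # character output using percent codes
```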
A full page table is provided at the end of this ar-ticle which illustrates and compares idioms writtenin each of the three datetime systems above
Aside from the three primary date classes dis-cussed so far there is also the date package whichrepresents dates as days since January 1 1960 Thereare some date functions in the pastecs package thatrepresent datetimes as years plus fraction of a yearThe date and pastecs packages are not discussed inthis article
Other Applications
Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent date/times as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron respectively. It's possible to set Excel to use either origin, which is why the word usually was employed above.
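As a concrete sketch of the Windows-origin conversion (the serial numbers here are illustrative, not taken from the article):

```r
# Hypothetical Excel serial date-times from the Windows 1900 date system.
# Serial 39448 is January 1, 2008; the extra 0.5 is noon that day.
x <- c(39448, 39448.5)

# Date class discards the fractional (time-of-day) part
as.Date("1899-12-30") + floor(x)  # both elements are "2008-01-01"
```

The same x passed to chron("12/30/1899") + x would preserve the time of day, since chron keeps the fraction.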
SPSS uses October 14, 1582 as the origin, thereby representing date/times as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such date/times automatically (Alzola and Harrell, 2002).
Time Series
A common use of dates and date/times is in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its) and zoo (in package zoo). ts uses a scheme of its own for dates. irts and its use POSIXct dates to represent date/times. zoo can use any date or time/date package to represent date/times.
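For instance, a minimal regular monthly series in the ts scheme looks like this (the data values are made up):

```r
# A regular monthly series starting January 2004; ts labels observations
# with a numeric time index (2004.000, 2004.083, ...), not Date objects
z <- ts(c(5, 7, 6, 8), start = c(2004, 1), frequency = 12)
frequency(z)  # 12
time(z)       # numeric times, one twelfth apart
```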
Input
By default, read.table will read in numeric data, such as 20040115, as numbers and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹
# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library("chron")
df <- read.table("oxygen.txt", header = TRUE,
    as.is = 1:2)
chron(df$date, df$time)
Avoiding Errors
The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:
# default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))

# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)

# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)

# if TRUE then year not displayed
options(chron.simplify = FALSE)
For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,
¹This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.
orig <- chron("01/01/60")
x <- 0:9  # days since 01/01/60

# chron dates with default origin
orig + x
Regarding POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:
dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)
This does not use the current wall clock time for plotting today and the next 9 days since plot treats the date/times as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system and so may not work if the operating system has bugs with date/times. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.
• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt date/times have an isdst component that indicates whether it is daylight savings time (isdst = 1) or not (isdst = 0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Date/times which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.
• conversion. Conversion between different date/time classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
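A small sketch of the "convert to character first" recipe (the date value is illustrative; this assumes the chron package is installed):

```r
library(chron)

dp <- as.POSIXct("2004-06-15 12:00:00")

# A direct conversion may apply a time zone shift; going through
# character preserves the printed wall clock date and time.
dc <- chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S"))
dc  # 06/15/04 12:00:00 in chron's default display
```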
Comparison Table
Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.
Bibliography
Alzola, C. F. and Harrell, F. E. (2002). An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993). Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001). Date-time classes. R News, 1(2):8-11. URL http://CRAN.R-project.org/doc/Rnews/.
Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de
now
    Date:    Sys.Date()
    chron:   as.chron(as.POSIXct(format(Sys.time()), tz = "GMT"))
    POSIXct: Sys.time()

origin
    Date:    structure(0, class = "Date")
    chron:   chron(0)
    POSIXct: structure(0, class = c("POSIXt", "POSIXct"))

x days since origin (GMT)
    Date:    structure(floor(x), class = "Date")
    chron:   chron(x)
    POSIXct: structure(x*24*60*60, class = c("POSIXt", "POSIXct"))

difference in days
    Date:    dt1 - dt2
    chron:   dc1 - dc2
    POSIXct: difftime(dp1, dp2, unit = "day")  [1]

time zone difference
    POSIXct: dp - as.POSIXct(format(dp, tz = "GMT"))

compare
    Date:    dt1 > dt2
    chron:   dc1 > dc2
    POSIXct: dp1 > dp2

next day
    Date:    dt + 1
    chron:   dc + 1
    POSIXct: seq(dp0, length = 2, by = "DSTday")[2]  [2]

previous day
    Date:    dt - 1
    chron:   dc - 1
    POSIXct: seq(dp0, length = 2, by = "-1 DSTday")[2]  [3]

x days since date
    Date:    dt0 + floor(x)
    chron:   dc0 + x
    POSIXct: seq(dp0, length = 2, by = paste(x, "DSTday"))[2]  [4]

sequence
    Date:    dts <- seq(dt0, length = 10, by = "day")
    chron:   dcs <- seq(dc0, length = 10)
    POSIXct: dps <- seq(dp0, length = 10, by = "DSTday")

plot
    Date:    plot(dts, rnorm(10))
    chron:   plot(dcs, rnorm(10))
    POSIXct: plot(as.POSIXct(format(dps), tz = "GMT"), rnorm(10))  [5]

every 2nd week
    Date:    seq(dt0, length = 3, by = "2 week")
    chron:   dc0 + seq(0, length = 3, by = 14)
    POSIXct: seq(dp0, length = 3, by = "2 week")

first day of month
    Date:    as.Date(format(dt, "%Y-%m-01"))
    chron:   chron(dc) - month.day.year(dc)$day + 1
    POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
    Date:    as.numeric(format(dt, "%m"))
    chron:   months(dc)
    POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun = 0)
    Date:    as.numeric(format(dt, "%w"))
    chron:   as.numeric(dates(dc) - 3) %% 7
    POSIXct: as.numeric(format(dp, "%w"))

day of year
    Date:    as.numeric(format(dt, "%j"))
    chron:   as.numeric(format(as.Date(dc), "%j"))
    POSIXct: as.numeric(format(dp, "%j"))

mean for each month/year
    Date:    tapply(x, format(dt, "%Y-%m"), mean)
    chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
    POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
    Date:    tapply(x, format(dt, "%m"), mean)
    chron:   tapply(x, months(dc), mean)
    POSIXct: tapply(x, format(dp, "%m"), mean)

Output: yyyy-mm-dd
    Date:    format(dt)
    chron:   format(as.Date(dates(dc)))
    POSIXct: format(dp, "%Y-%m-%d")

Output: mm/dd/yy
    Date:    format(dt, "%m/%d/%y")
    chron:   format(dates(dc))
    POSIXct: format(dp, "%m/%d/%y")

Output: Sat Jul 1 1970
    Date:    format(dt, "%a %b %d %Y")
    chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
    POSIXct: format(dp, "%a %b %d %Y")

Input: "1970-10-15"
    Date:    as.Date(z)
    chron:   chron(z, format = "y-m-d")
    POSIXct: as.POSIXct(z)

Input: "10/15/1970"
    Date:    as.Date(z, "%m/%d/%Y")
    chron:   chron(z)
    POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz = "")
    to Date:    as.Date(dates(dc));  as.Date(format(dp))
    to chron:   chron(unclass(dt));  chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S"))  [6]
    to POSIXct: as.POSIXct(format(dt));  as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz = "GMT")
    to Date:    as.Date(dates(dc));  as.Date(dp)
    to chron:   chron(unclass(dt));  as.chron(dp)
    to POSIXct: as.POSIXct(format(dt), tz = "GMT");  as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for % codes.

[1] Can be reduced to dp1 - dp2 if it is acceptable that date/times closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp + 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp - 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp + 24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT date/times rather than current date/times suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz = "GMT")).

Table 1: Comparison Table
Programmers' Niche: A simple class in S3 and S4
by Thomas Lumley
ROC curves
Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that plus or minus infinity and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.
drawROC <- function(T, D) {
    cutpoints <- c(-Inf, sort(unique(T)), Inf)
    sens <- sapply(cutpoints,
        function(c) sum(D[T > c])/sum(D))
    spec <- sapply(cutpoints,
        function(c) sum((1-D)[T <= c]/sum(1-D)))
    plot(1-spec, sens, type = "l")
}
It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.
drawROC <- function(T, D) {
    DD <- table(-T, D)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    plot(mspec, sens, type = "l")
}
One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
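The monotonicity is easy to check on a toy data set (the numbers below are invented, not from the article):

```r
# Toy data: test values T and disease indicator D
T <- c(0.2, 0.4, 0.6, 0.8, 0.9)
D <- c(0, 0, 1, 0, 1)

# the same tabulation used inside the fast drawROC
DD <- table(-T, D)
sens  <- cumsum(DD[, 2])/sum(DD[, 2])
mspec <- cumsum(DD[, 1])/sum(DD[, 1])

# cumulative sums of non-negative counts are non-decreasing
all(diff(sens) >= 0)   # TRUE
all(diff(mspec) >= 0)  # TRUE
```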
Classes and methods
Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.
ROC <- function(T, D) {
    TT <- rev(sort(unique(T)))
    DD <- table(-T, D)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    rval <- list(sens = sens, mspec = mspec, test = TT,
        call = sys.call())
    class(rval) <- "ROC"
    rval
}

print.ROC <- function(x, ...) {
    cat("ROC curve: ")
    print(x$call)
}
The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.
plot.ROC <- function(x, type = "b", null.line = TRUE,
    xlab = "1-Specificity", ylab = "Sensitivity",
    main = NULL, ...) {
    par(pty = "s")
    plot(x$mspec, x$sens, type = type,
        xlab = xlab, ylab = ylab, ...)
    if (null.line)
        abline(0, 1, lty = 3)
    if (is.null(main))
        main <- x$call
    title(main = main)
}
A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:
lines.ROC <- function(x, ...) {
    lines(x$mspec, x$sens, ...)
}
A method for identify allows the cutpoint to be found for any point on the curve:
identify.ROC <- function(x, labels = NULL, ...,
    digits = 1) {
    if (is.null(labels))
        labels <- round(x$test, digits)
    identify(x$mspec, x$sens, labels = labels, ...)
}
An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
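One possible solution to the exercise (a sketch of my own, not code from the article) applies the trapezoidal rule to the stored coordinates, after closing the curve at the origin:

```r
# Trapezoidal-rule area under the curve for an object produced by ROC().
# Prepending (0, 0) closes the curve at the origin; the last stored
# point is always (1, 1).
AUC <- function(x) {
    sens  <- c(0, x$sens)
    mspec <- c(0, x$mspec)
    sum(diff(mspec) * (head(sens, -1) + tail(sens, -1))/2)
}
```

For a perfect classifier, whose curve passes through (0, 1), this returns 1.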
Generalising the class
The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.
ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
    TT <- rev(sort(unique(T)))
    DD <- table(-T, D)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    rval <- list(sens = sens, mspec = mspec,
        test = TT, call = sys.call())
    class(rval) <- "ROC"
    rval
}
The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.
ROC.glm <- function(T, ...) {
    if (!(T$family$family %in%
            c("binomial", "quasibinomial")))
        stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type = "response"))
    disease <- disease*weights(T)
    if (max(abs(disease %% 1)) > 0.01)
        warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    rval <- list(sens = sens, mspec = mspec,
        test = TT, call = sys.call())
    class(rval) <- "ROC"
    rval
}
S4 classes and methods
In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:
setClass("ROC",
    representation(sens = "numeric", mspec = "numeric",
        test = "numeric", call = "call"),
    validity = function(object) {
        length(object@sens) == length(object@mspec) &&
        length(object@sens) == length(object@test)
    })
The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:
ROC <- function(T, D) {
    TT <- rev(sort(unique(T)))
    DD <- table(-T, D)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    new("ROC", sens = sens, mspec = mspec,
        test = TT, call = sys.call())
}
Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:
setMethod("show", "ROC",
    function(object) {
        cat("ROC curve: ")
        print(object@call)
    })
The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.
setMethod("plot", c("ROC", "missing"),
    function(x, y, type = "b", null.line = TRUE,
        xlab = "1-Specificity",
        ylab = "Sensitivity",
        main = NULL, ...) {
        par(pty = "s")
        plot(x@mspec, x@sens, type = type,
            xlab = xlab, ylab = ylab, ...)
        if (null.line)
            abline(0, 1, lty = 3)
        if (is.null(main))
            main <- x@call
        title(main = main)
    })
Creating a method for lines appears to work the same way:
setMethod("lines", "ROC",
    function(x, ...)
        lines(x@mspec, x@sens, ...)
    )
In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before
> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>
and after the setMethod call
> lines
standardGeneric for "lines" defined from
package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm except that new is used to create the ROC object.
setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
    function(T, ...) {
        if (!(T$family$family %in%
                c("binomial", "quasibinomial")))
            stop("ROC curves for binomial glms only")
        test <- fitted(T)
        disease <- (test + resid(T, type = "response"))
        disease <- disease*weights(T)
        if (max(abs(disease %% 1)) > 0.01)
            warning("Y values suspiciously far from integers")
        TT <- rev(sort(unique(test)))
        DD <- table(-test, disease)
        sens <- cumsum(DD[, 2])/sum(DD[, 2])
        mspec <- cumsum(DD[, 1])/sum(DD[, 1])
        new("ROC", sens = sens, mspec = mspec,
            test = TT, call = sys.call())
    })
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.
setGeneric("identify", signature = c("x"))

setMethod("identify", "ROC",
    function(x, labels = NULL, ...,
        digits = 1) {
        if (is.null(labels))
            labels <- round(x@test, digits)
        identify(x@mspec, x@sens,
            labels = labels, ...)
    })
Discussion
Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R

by the R Core Team

New features in 1.9.1
• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc look for variables in their formulae.
User-visible changes in 1.9.0
• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook -- see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.
New features in 1.9.0
• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace() parallelling fixInNamespace.

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times) plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial() defined as gamma(x+1) and, for S-PLUS compatibility, lfactorial() defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.
• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments and split="" is optimized.

• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)
• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.
• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".
• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).
• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.
• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().
• Package stats4 exports S4 generics for AIC() and BIC().
• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.
• Added experimental support for conditionals in NAMESPACE files.
• Added as.list.environment to coerce environments to lists (efficiently).
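A minimal sketch of the new coercion:

```r
e <- new.env()
assign("x", 1, envir = e)
assign("y", "two", envir = e)
l <- as.list(e)    # dispatches to as.list.environment
sort(names(l))     # "x" "y"
l$x                # 1
```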
• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)
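For instance, with a small 2 x 2 table (the dimension names are made up for illustration):

```r
A <- as.table(matrix(1:4, nrow = 2,
                     dimnames = list(g = c("a", "b"),
                                     h = c("c", "d"))))
M <- addmargins(A)   # appends a "Sum" row and column by default
M["Sum", "Sum"]      # grand total: 10
```

Other summary functions than sum() can be supplied through the FUN argument.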
• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.
• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.
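A small sketch of this behaviour (the generic and method names here are invented for illustration):

```r
library(methods)
setGeneric("describe",
           function(x, label = "generic default") standardGeneric("describe"))
# The method supplies its own default for 'label'
setMethod("describe", "numeric",
          function(x, label = "numeric method") label)
describe(1)   # the method's default wins: "numeric method"
```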
• Changes to package 'grid':
– Renamed push/pop.viewport() to push/popViewport().
– Added upViewport(), downViewport(), and seekViewport() to allow creation and navigation of a viewport tree (rather than just a viewport stack).
– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.
– Added vpList(), vpStack(), vpTree(), and current.vpTree() to allow creation of viewport bundles that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.
– Added vpPath() to allow specification of a viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use. NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).
– Added a just argument to grid.layout() to allow justification of the layout relative to the parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).
– Allowed the vp slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

  pushViewport(viewport(w=.5, h=.5, name="A"))
  grid.rect()
  pushViewport(viewport(w=.5, h=.5, name="B"))
  grid.rect(gp=gpar(col="grey"))
  upViewport(2)
  grid.rect(vp="A", gp=gpar(fill="red"))
  grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))
– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html
– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:
* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).
* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)
* the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.
In addition, the following features are now possible with grobs:
* grobs now save()/load() like any normal R object.
* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).
* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a childrenvp slot which is a viewport which is pushed, and then "up"ed, before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.
* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.
* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit() and grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.
– Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth", and "grobheight" unit objects, respectively).
– Allowed viewports to turn off clipping altogether. Possible settings for the viewport clip arg are now:
"on" = clip to the viewport (was TRUE)
"inherit" = clip to whatever the parent says (was FALSE)
"off" = turn off clipping
Still accept logical values (and NA maps to "off").
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.
• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.
• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.
• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.
• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.
• Vignette dependencies can now be checked and obtained via vignetteDepends.
• Option "repositories" to list URLs for package repositories added.
• package.description() has been replaced by packageDescription().
• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.
• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.
• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.
• hsv2rgb and rgb2hsv are newly in the C API.
• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.
• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.
• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp(), and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.
• print.coefmat() is defunct, replaced by printCoefmat().
• codes() and codes<-() are defunct.
• anovalist.lm (replaced in 1.2.0) is now defunct.
• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.
• print.atomic() is defunct.
• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).
• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.
• La.eigen() is deprecated; now eigen() uses LAPACK by default.
• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(, deriv=2) and psigamma(, deriv=3).
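The replacement is direct; at x = 1 the known closed forms of the polygamma function can serve as a check:

```r
# psigamma(x, deriv = n) is the nth derivative of digamma(x)
psigamma(1, deriv = 2)   # psi''(1)  = -2 * zeta(3), about -2.404114
psigamma(1, deriv = 3)   # psi'''(1) = pi^4 / 15,   about  6.493939
```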
• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.
• package.contents() and package.description() have been deprecated.
• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.
The included PCRE sources have been updated to version 4.5, and PCRE >= 4.0 is now required if --with-pcre is used.
The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.
• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.
• --with-blasgoto= to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN
by Kurt Hornik
New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.
BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.
BsMD Bayes screening and model discrimination
follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.
DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.
GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.
HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.
Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.
MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.
MNP A publicly available R package that fits the Bayesian multinomial probit models via Markov chain Monte Carlo. Along with the standard multinomial probit model, it can also fit models with different choice sets for each observation, and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.
NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.
R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaged by Sibylle Sturtz and Uwe Ligges.
RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.
RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.
SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.
SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.
SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.
assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.
bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.
betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.
circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.
crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check
given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.
debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.
distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.
dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.
energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.
evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.
evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. By S original (EVIS) by Alexander McNeil; R port by Alec Stephenson.
fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.
fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.
fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.
fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.
fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.
gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues and with help from Kurt Hornik and Brian Ripley of the R core development team.
gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.
gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.
hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.
haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.
hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.
httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.
impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.
klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.
knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.
kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.
mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.
mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.
mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between versus within chain approach of Gelman and Rubin. By Gregory R. Warnes.
mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.
mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.
multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.
mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.
mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.
pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.
qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.
race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.
rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR, by Andrew Gelman).
ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data frames. By Jens Oehlschlägel.
regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.
rgl 3D visualization device (OpenGL). By Daniel Adler.
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.
rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.
rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.
sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.
seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972-974. By David A. Schoenfeld.
setRNG Set reproducible random number generator in R and S. By Paul Gilbert.
sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.
skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.
spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.
spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.
spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS 2002, 99, pg 15869, and J. Comput. Chem. 2003, 24, pg 1215. By Rajarshi Guha.
supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.
svmpath Computes the entire regularization path for the two-class SVM classifier, with essentially the same cost as a single SVM fit. By Trevor Hastie.
treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.
urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.
urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.
verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns; ported to R by Nick Efthymiou.
zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.
Other changes
• Packages mclust1998, netCDF, and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News
by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)
• Emanuele De Rinaldis (Italy)
• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA
• Dana-Farber Cancer Institute, Boston, USA
• Department of Biostatistics, Vanderbilt University School of Medicine, USA
• Department of Statistics & Actuarial Science, University of Iowa, USA
• Division of Biostatistics, University of California, Berkeley, USA
• Norwegian Institute of Marine Research, Bergen, Norway
• TERRA Lab, University of Regina, Department of Geography, Canada
• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
Control charts for variables (type):
  xbar      X̄ chart   Sample means are plotted to control the mean level of a continuous process variable.
  xbar.one  X chart    Sample values from a one-at-a-time data process to control the mean level of a continuous process variable.
  R         R chart    Sample ranges are plotted to control the variability of a continuous process variable.
  S         S chart    Sample standard deviations are plotted to control the variability of a continuous process variable.

Control charts for attributes (type):
  p         p chart    The proportion of nonconforming units is plotted. Control limits are based on the binomial distribution.
  np        np chart   The number of nonconforming units is plotted. Control limits are based on the binomial distribution.
  c         c chart    The number of defectives per unit are plotted. This chart assumes that defects of the quality attribute are rare, and the control limits are computed based on the Poisson distribution.
  u         u chart    The average number of defectives per unit is plotted. The Poisson distribution is used to compute control limits, but unlike the c chart this chart does not require a constant number of units.

Table 1: Shewhart control charts available in the qcc package.
The number of groups and their sizes are reported inthis case whereas a table is provided in the case ofunequal sample sizes Moreover the center of groupstatistics (the overall mean for an X chart) and thewithin-group standard deviation of the process arereturned
The main goal of a control chart is to monitora process If special causes of variation are presentthe process is said to be ldquoout of controlrdquo and actionsare to be taken to find and possibly eliminate suchcauses A process is declared to be ldquoin controlrdquo if allpoints charted lie randomly within the control limitsThese are usually computed at plusmn3σs from the cen-ter This default can be changed using the argumentnsigmas (so called ldquoamerican practicerdquo) or specify-ing the confidence level (so called ldquobritish practicerdquo)through the confidencelevel argument
Assuming a Gaussian distribution is appropriate for the statistic charted, the two practices are equivalent, since limits at ±3σ correspond to a two-tailed probability of 0.0027 under a standard normal curve.
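The stated equivalence is easy to verify numerically. The sketch below uses Python purely as a neutral calculator for the normal-distribution arithmetic (in R the same check is a one-liner with pnorm and qnorm); the function names are ours, not qcc's:

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Two-tailed probability outside +/- 3 sigma ("American practice")
two_tail = 2.0 * (1.0 - norm_cdf(3.0))
print(round(two_tail, 4))  # 0.0027

# Conversely, a confidence level of 1 - 0.0027 ("British practice")
# recovers limits at 3 sigmas; the inverse CDF is found by bisection.
def norm_quantile(p, lo=-10.0, hi=10.0):
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if norm_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

nsigmas = norm_quantile(1.0 - two_tail / 2.0)
print(round(nsigmas, 6))  # 3.0
```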
Instead of using standard calculations to compute the group statistics and the corresponding control limits (see Montgomery (2000), Wetherill and Brown (1991)), the center, the within-group standard deviation and the control limits may be specified directly by the user through the arguments center, std.dev and limits, respectively.
[Figure 1 about here: an X̄ chart for diameter[1:25, ], plotting the group summary statistics against group number, with the center line and the dashed LCL/UCL limits. Summary printed below the plot: number of groups = 25, center = 74.00118, StdDev = 0.009887547, LCL = 73.98791, UCL = 74.01444, number beyond limits = 0, number violating runs = 0.]
Figure 1: An X̄ chart for the samples data.
Shewhart control charts
A Shewhart chart is automatically plotted when an object of class 'qcc' is created, unless the qcc function is called with the argument plot=FALSE. The method responsible for plotting a control chart can, however, be invoked directly as follows:
> plot(obj)
giving the plot in Figure 1. This control chart shows the center line (horizontal solid line), the upper and lower control limits (dashed lines), and the sample group statistics drawn as points connected with lines. Unless we provide the argument add.stats=FALSE, some summary statistics are shown at the bottom of the plot, together with the number of points beyond control limits and the number of violating runs (a run has length 5 by default).
Once a process is considered to be "in-control", we may use the computed limits for monitoring new data sampled from the same ongoing process. For example,
> obj <- qcc(diameter[1:25,], type="xbar",
+            newdata=diameter[26:40,])
plots the X̄ chart for both calibration data and new data (see Figure 2), but all the statistics and the control limits are solely based on the first 25 samples.
[Figure 2 about here: an X̄ chart for diameter[1:25, ] (calibration data) and diameter[26:40, ] (new data). Summary: number of groups = 40, center = 74.00118, StdDev = 0.009887547, LCL = 73.98791, UCL = 74.01444, number beyond limits = 3, number violating runs = 1.]
Figure 2: An X̄ chart with calibration data and new data samples.
If only the new data need to be plotted, we may provide the argument chart.all=FALSE in the call to the qcc function, or through the plotting method:
> plot(obj, chart.all=FALSE)
Many other arguments may be provided in the call to the qcc function and to the corresponding plotting method, and we refer the reader to the on-line documentation for details (help(qcc)).
As reported in Table 1, an X chart for one-at-a-time data may be obtained by specifying type="xbar.one". In this case the data charted are simply the sample values, and the within-group standard deviation is estimated by moving ranges of k (by default 2) points.
Control charts for attributes mainly differ from the previous examples in that we need to provide sample sizes through the sizes argument (except for the c chart). For example, a p chart can be obtained as follows:
> data(orangejuice)
> attach(orangejuice)
> orangejuice
   sample  D size trial
1       1 12   50  TRUE
2       2 15   50  TRUE
3       3  8   50  TRUE
...
53     53  3   50 FALSE
54     54  5   50 FALSE

> obj2 <- qcc(D[trial], sizes=size[trial], type="p")
where we use the logical indicator variable trial to select the first 30 samples as calibration data.
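Table 1 says the p chart's limits come from the binomial distribution; concretely, the usual limits are p̄ ± nsigmas · sqrt(p̄(1 − p̄)/nᵢ), clipped to [0, 1]. A minimal sketch of that arithmetic follows (in Python, for illustration only; the function name is ours, not the package's):

```python
from math import sqrt

def p_chart_limits(D, sizes, nsigmas=3.0):
    """Center line and per-group control limits for a p chart.

    D     -- number of nonconforming units in each sample
    sizes -- the corresponding sample sizes (may vary by group)
    """
    pbar = sum(D) / sum(sizes)            # overall proportion = center line
    limits = []
    for n in sizes:
        se = sqrt(pbar * (1.0 - pbar) / n)
        lcl = max(0.0, pbar - nsigmas * se)   # proportions cannot be negative
        ucl = min(1.0, pbar + nsigmas * se)
        limits.append((lcl, ucl))
    return pbar, limits

# With equal sizes (as in the orangejuice calibration data, n = 50 per
# sample) the limits are constant across groups.
pbar, lims = p_chart_limits([12, 15, 8], [50, 50, 50])
print(round(pbar, 4))  # 0.2333
```

With unequal sizes the same formula yields the variable limits discussed later in the section on extending the package.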
Operating characteristic function
An operating characteristic (OC) curve provides information about the probability of not detecting a shift in the process. This is usually referred to as the type II error, that is, the probability of erroneously accepting a process as being "in control".
The OC curves can be easily obtained from an object of class 'qcc':
> par(mfrow=c(1,2))
> oc.curves(obj)
> oc.curves(obj2)
The function oc.curves invisibly returns a matrix or a vector of probability values for the type II error. More arguments are available (see help(oc.curves)); in particular, identify=TRUE allows one to interactively identify values on the plot.
[Figure 3 about here: OC curves for the xbar chart (probability of type II error against the process shift in StdDev units, for sample sizes n = 1, 5, 10, 15, 20) and for the p chart (probability of type II error against p).]
Figure 3: Operating characteristic curves.
Cusum charts
Cusum charts display how the group summary statistics deviate above or below the process center or target value, relative to the standard error of the summary statistics. They are useful to detect small and permanent variation in the mean of the process.
The basic cusum chart implemented in the qcc package is only available for continuous variables at the moment. Assuming a 'qcc' object has been created, we can simply input the following code:
> cusum(obj)
and the resulting cusum chart is shown in Figure 4. This displays on a single plot both the positive deviations and the negative deviations from the target, which by default is set to the center of group statistics. If a different target for the process is required, this value may be provided through the argument target when creating the object of class 'qcc'. Two further aspects may need to be controlled. The argument decision.int can be used to specify the number of standard errors of the summary statistics at which the cumulative sum displays an out-of-control signal; by default it is set at 5. The amount of shift in the process to be detected by the cusum chart, measured in standard errors of the summary statistics, is controlled by the argument se.shift, by default set at 1.
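The cumulative sums plotted here can be sketched with the standard tabular cusum recursion, in which k is the reference value (conventionally half the shift to be detected) and h the decision interval. The following Python sketch illustrates that textbook formulation under our own naming; it is not the qcc implementation:

```python
def tabular_cusum(z, k=0.5, h=5.0):
    """Upper and lower cumulative sums for standardized statistics z.

    k -- reference value (half the shift to detect, in std. errors)
    h -- decision interval; a sum beyond +/- h signals out of control
    """
    cplus, cminus = 0.0, 0.0
    out = []
    for zi in z:
        cplus = max(0.0, zi - k + cplus)     # deviations above target
        cminus = min(0.0, zi + k + cminus)   # deviations below target
        out.append((cplus, cminus, abs(cplus) > h or abs(cminus) > h))
    return out

# A sustained 1-std.err. upward shift accumulates k = 0.5 per group,
# so the decision boundary h = 5 is first crossed at the 11th group.
res = tabular_cusum([1.0] * 12)
print(res[-1])  # (6.0, 0.0, True)
```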
[Figure 4 about here: a cusum chart for diameter[1:25, ] (calibration data) and diameter[26:40, ] (new data), showing the cumulative sums above and below the target together with the lower and upper decision boundaries (LDB, UDB). Summary: number of groups = 40, target = 74.00118, StdDev = 0.009887547, decision boundaries (std. err.) = 5, shift detection (std. err.) = 1, number of points beyond boundaries = 4.]
Figure 4: Cusum chart.
EWMA charts
Exponentially weighted moving average (EWMA) charts smooth a series of data based on a moving average with weights which decay exponentially. As for the cusum chart, this plot can also be used to detect small and permanent variation in the mean of the process.
The EWMA chart depends on the smoothing parameter λ, which controls the weighting scheme applied. By default it is set at 0.2, but it can be modified using the argument lambda. Assuming again that a 'qcc' object has been created, an EWMA chart can be obtained as follows:
> ewma(obj)
and the resulting graph is shown in Figure 5.
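The statistic behind the chart is the recursion z_i = λ·x_i + (1 − λ)·z_{i−1}, with control limits based on its standard error. A hedged sketch of that textbook formula (Python, with our own naming, not the qcc code):

```python
from math import sqrt

def ewma(x, target, sigma, lam=0.2, nsigmas=3.0):
    """EWMA statistics z_i = lam*x_i + (1-lam)*z_{i-1}, with exact limits."""
    z, zs, limits = target, [], []
    for i, xi in enumerate(x, start=1):
        z = lam * xi + (1.0 - lam) * z
        zs.append(z)
        # exact (time-varying) control limits, widening toward the
        # asymptotic value sigma * sqrt(lam / (2 - lam))
        se = sigma * sqrt(lam / (2.0 - lam) * (1.0 - (1.0 - lam) ** (2 * i)))
        limits.append((target - nsigmas * se, target + nsigmas * se))
    return zs, limits

zs, lims = ewma([74.01, 74.00, 73.99], target=74.0, sigma=0.01)
print([round(v, 4) for v in zs])  # [74.002, 74.0016, 73.9993]
```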
[Figure 5 about here: an EWMA chart for diameter[1:25, ] (calibration data) and diameter[26:40, ] (new data). Summary: number of groups = 40, target = 74.00118, StdDev = 0.009887547, smoothing parameter = 0.2, control limits at 3*sigma.]
Figure 5: EWMA chart.
Process capability analysis
Process capability indices for a characteristic of interest from a continuous process can be obtained through the function process.capability. This takes as mandatory arguments an object of class 'qcc', for an "xbar" or "xbar.one" type, and spec.limits, a vector specifying the lower (LSL) and upper (USL) specification limits. For example, using the previously created 'qcc' object, we may simply use:
> process.capability(obj,
+    spec.limits=c(73.95,74.05))

Process Capability Analysis

Call:
process.capability(object = obj,
     spec.limits = c(73.95, 74.05))

Number of obs = 125      Target = 74
Center = 74.00118        LSL = 73.95
StdDev = 0.009887547     USL = 74.05

Capability indices:

       Value    2.5%   97.5%
Cp     1.686   1.476   1.895
Cp_l   1.725   1.539   1.912
Cp_u   1.646   1.467   1.825
Cp_k   1.646   1.433   1.859
Cpm    1.674   1.465   1.882

Exp<LSL 0%   Obs<LSL 0%
Exp>USL 0%   Obs>USL 0%
The graph shown in Figure 6 is a histogram of the data values, with vertical dotted lines for the upper and the lower specification limits. The target is shown as a vertical dashed line, and its value can be provided using the target argument in the call; if missing, the value from the 'qcc' object is used if available, otherwise the target is set at the middle value between the specification limits.
The dashed curve is a normal density for a Gaussian distribution matching the observed mean and standard deviation. The latter is estimated using the within-group standard deviation, unless a value is specified in the call using the std.dev argument.
The capability indices reported on the graph are also printed at the console, together with confidence intervals computed at the level specified by the confidence.level argument (by default set at 0.95).
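The indices follow the standard formulas Cp = (USL − LSL)/(6σ), Cp_l = (μ − LSL)/(3σ), Cp_u = (USL − μ)/(3σ) and Cp_k = min(Cp_l, Cp_u). As a check, the following sketch (Python, for illustration only) reproduces the printed values from the reported center and StdDev:

```python
def capability(center, stddev, lsl, usl):
    """Basic process capability indices from the standard formulas."""
    cp = (usl - lsl) / (6.0 * stddev)
    cp_l = (center - lsl) / (3.0 * stddev)
    cp_u = (usl - center) / (3.0 * stddev)
    return {"Cp": cp, "Cp_l": cp_l, "Cp_u": cp_u, "Cp_k": min(cp_l, cp_u)}

# Values reported by process.capability() above
ix = capability(center=74.00118, stddev=0.009887547, lsl=73.95, usl=74.05)
print({k: round(v, 3) for k, v in ix.items()})
# {'Cp': 1.686, 'Cp_l': 1.725, 'Cp_u': 1.646, 'Cp_k': 1.646}
```

(Cpm additionally involves the distance of the mean from the target, so it is omitted from this sketch.)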
[Figure 6 about here: process capability histogram for diameter[1:25, ] with LSL = 73.95, USL = 74.05 and target = 74; reported indices: Cp = 1.69, Cp_l = 1.73, Cp_u = 1.65, Cp_k = 1.65, Cpm = 1.67; Exp<LSL 0%, Exp>USL 0%, Obs<LSL 0%, Obs>USL 0%.]
Figure 6: Process capability analysis.
A further display, called a "process capability sixpack plot", is also available. This is a graphical summary formed by plotting on the same graphical device:
• an X̄ chart
• an R or S chart (if sample sizes > 10)
• a run chart
• a histogram with specification limits
• a normal Q–Q plot
• a capability plot
As an example, we may input the following code:
> process.capability.sixpack(obj,
+    spec.limits=c(73.95,74.05),
+    target=74.01)
Pareto charts and cause-and-effect diagrams
A Pareto chart is a barplot where the categories are ordered using frequencies in non-increasing order, with a line added to show their cumulative sum. Pareto charts are useful to identify those factors that have the greatest cumulative effect on the system,
and thus screen out the less significant factors in an analysis.
A simple Pareto chart may be drawn using the function pareto.chart, which in its simpler form requires a vector of values. For example:
> defect <- c(80, 27, 66, 94, 33)
> names(defect) <-
+   c("price code", "schedule date",
+     "supplier code", "contact num.",
+     "part num.")
> pareto.chart(defect)
The function returns a table of descriptive statistics and the Pareto chart shown in Figure 7. More arguments can be provided to control axes labels and limits, the main title and bar colors (see help(pareto.chart)).
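The table behind a Pareto chart is simple to compute: sort the categories by decreasing frequency and accumulate percentages. A minimal sketch (Python, with our own naming, for illustration only):

```python
def pareto_table(freq):
    """Sort categories by decreasing frequency, adding cumulative %."""
    total = sum(freq.values())
    rows, cum = [], 0.0
    for name, f in sorted(freq.items(), key=lambda kv: -kv[1]):
        cum += f
        rows.append((name, f, round(100.0 * cum / total, 2)))
    return rows

# same data as the pareto.chart(defect) example above
defect = {"price code": 80, "schedule date": 27, "supplier code": 66,
          "contact num.": 94, "part num.": 33}
for row in pareto_table(defect):
    print(row)
```

The cumulative-percentage column is what the line in the chart traces, ending at 100%.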
[Figure 7 about here: a Pareto chart for defect, with bars for contact num., price code, supplier code, part num. and schedule date in decreasing frequency order, a frequency axis, and a line showing the cumulative percentage.]
Figure 7: Pareto chart.
A cause-and-effect diagram, also called an Ishikawa diagram (or fish bone diagram), is used to associate multiple possible causes with a single effect. Thus, given a particular effect, the diagram is constructed to identify and organize possible causes for it.
A very basic implementation of this type of diagram is available in the qcc package. The function cause.and.effect requires at least two arguments: cause, a list of causes and branches providing descriptive labels, and effect, a string describing the effect. Other arguments are available, such as cex and font, which control, respectively, the character expansion and the fonts used for labeling branches, causes and the effect. The following example creates the diagram shown in Figure 8.
> cause.and.effect(
+   cause=list(Measurements=c("Microscopes",
+                             "Inspectors"),
+              Materials=c("Alloys",
+                          "Suppliers"),
+              Personnel=c("Supervisors",
+                          "Operators"),
+              Environment=c("Condensation",
+                            "Moisture"),
+              Methods=c("Brake", "Engager",
+                        "Angle"),
+              Machines=c("Speed", "Bits",
+                         "Sockets")),
+   effect="Surface Flaws")
[Figure 8 about here: a cause-and-effect diagram for the effect "Surface Flaws", with branches Measurements (Microscopes, Inspectors), Materials (Alloys, Suppliers), Personnel (Supervisors, Operators), Environment (Condensation, Moisture), Methods (Brake, Engager, Angle) and Machines (Speed, Bits, Sockets).]
Figure 8: Cause-and-effect diagram.
Tuning and extending the package
Options to control some aspects of the qcc package can be set using the function qcc.options. If this is called with no arguments it retrieves the list of available options and their current values, otherwise it sets the required option. From a user point of view, the most interesting options allow one to set: the plotting character and the color used to highlight points beyond control limits or violating runs; the run length, i.e. the maximum value of a run before signalling a point as out of control; the background color used to draw the figure and the margin color; and the character expansion used to draw the labels, the title and the statistics at the bottom of any plot. For details see help(qcc.options).
Another interesting feature is the possibility of easily extending the package by defining new control charts. Suppose we want to plot a p chart based on samples with different sizes. The resulting control chart will have non-constant control limits and thus may be difficult for some operators to read. One possible solution is to draw a standardized control chart. This can be accomplished by defining three new functions: one which computes and returns the group statistics to be charted (i.e. the z scores) and their center (zero in this case), another one which returns the within-group standard deviation (one in this case), and finally a function which returns the control limits. These functions need to be
called stats.type, sd.type and limits.type, respectively, for a new chart of type "type". The following code may be used to define a standardized p chart:
stats.p.std <- function(data, sizes)
{
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data)/sum(sizes)
  z <- (data/sizes - pbar)/sqrt(pbar*(1-pbar)/sizes)
  list(statistics = z, center = 0)
}

sd.p.std <- function(data, sizes) return(1)

limits.p.std <- function(center, std.dev, sizes, conf)
{
  if (conf >= 1)
    { lcl <- -conf
      ucl <- +conf }
  else
    { if (conf > 0 & conf < 1)
        { nsigmas <- qnorm(1 - (1 - conf)/2)
          lcl <- -nsigmas
          ucl <- +nsigmas }
      else stop("invalid conf argument.") }
  limits <- matrix(c(lcl, ucl), ncol = 2)
  rownames(limits) <- rep("", length = nrow(limits))
  colnames(limits) <- c("LCL", "UCL")
  return(limits)
}
Then we may source the above code and obtain the control charts in Figure 9 as follows:
# set unequal sample sizes
> n <- c(rep(50,5), rep(100,5), rep(25,5))
# generate randomly the number of successes
> x <- rbinom(length(n), n, 0.2)
> par(mfrow=c(1,2))
# plot the control chart with variable limits
> qcc(x, type="p", sizes=n)
# plot the standardized control chart
> qcc(x, type="p.std", sizes=n)
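The standardization performed by stats.p.std is just the binomial z-score z_i = (p̂_i − p̄)/sqrt(p̄(1 − p̄)/n_i), which has center 0 and unit within-group standard deviation whatever the sample sizes, so the limits can be drawn at a constant ±nsigmas. A quick check of that arithmetic (Python, for illustration only):

```python
from math import sqrt

def p_std_scores(D, sizes):
    """Standardized group statistics for a p chart with unequal sizes."""
    pbar = sum(D) / sum(sizes)
    return [(d / n - pbar) / sqrt(pbar * (1.0 - pbar) / n)
            for d, n in zip(D, sizes)]

# three groups with different sample sizes
z = p_std_scores([10, 22, 4], [50, 100, 25])
print([round(v, 3) for v in z])
```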
[Figure 9 about here: left panel, a p chart for x with variable control limits (number of groups = 15, center = 0.1931429, StdDev = 0.3947641, number beyond limits = 0, number violating runs = 0); right panel, the corresponding standardized p.std chart (center = 0, StdDev = 1, LCL = -3, UCL = 3, number beyond limits = 0, number violating runs = 0).]
Figure 9: A p chart with non-constant control limits (left panel) and the corresponding standardized control chart (right panel).
Summary
In this paper we briefly describe the qcc package. This provides functions to draw basic Shewhart quality control charts for continuous, attribute and count data; corresponding operating characteristic curves are also implemented. Other statistical quality tools available are Cusum and EWMA charts for continuous data, process capability analyses, Pareto charts and cause-and-effect diagrams.
References
Montgomery, D.C. (2000) Introduction to Statistical Quality Control, 4th ed. New York: John Wiley & Sons.

Wetherill, G.B. and Brown, D.W. (1991) Statistical Process Control. New York: Chapman & Hall.
Luca Scrucca
Dipartimento di Scienze Statistiche, Università degli Studi di Perugia, Italy
luca@stat.unipg.it
Least Squares Calculations in R

Timing different approaches
by Douglas Bates
Introduction
The S language encourages users to become programmers as they express new ideas and techniques of data analysis in S. Frequently the calculations in these techniques are expressed in terms of matrix operations. The subject of numerical linear algebra - how calculations with matrices can be carried out accurately and efficiently - is quite different from what many of us learn in linear algebra courses or in courses on linear statistical methods.
Numerical linear algebra is primarily based on decompositions of matrices, such as the LU decomposition, the QR decomposition, and the Cholesky decomposition, that are rarely discussed in linear algebra or statistics courses.
In this article we discuss one of the most common operations in statistical computing: calculating least squares estimates. We can write the problem mathematically as
    β̂ = arg min_β ‖y − Xβ‖²                (1)
where X is an n × p model matrix (p ≤ n), y is n-dimensional and β is p-dimensional. Most statistics texts state that the solution to (1) is
    β̂ = (X′X)⁻¹ X′y                        (2)
when X has full column rank (i.e. the columns of X are linearly independent).
Indeed, (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way, that is, they use code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.
General principles
A few general principles of numerical linear algebra are:
1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.

2. Use crossprod(X), not t(X) %*% X, to calculate X′X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X′y.

3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X′X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
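These principles combine in the Cholesky route to the normal equations: form X′X and X′y once, factor X′X = LL′, then solve two triangular systems instead of inverting anything. The following self-contained sketch (plain Python, purely to illustrate the algorithm, not the R code timed below) fits a small regression this way:

```python
from math import sqrt

def cholesky(A):
    """Lower-triangular L with L L' = A, for symmetric pos. definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (sqrt(A[i][i] - s) if i == j
                       else (A[i][j] - s) / L[j][j])
    return L

def solve_normal_eqns(X, y):
    """Least squares via X'X = L L', then forward- and back-substitution."""
    n, p = len(X), len(X[0])
    xtx = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(p)]
           for i in range(p)]
    xty = [sum(X[k][i] * y[k] for k in range(n)) for i in range(p)]
    L = cholesky(xtx)
    # forward solve L w = X'y
    w = [0.0] * p
    for i in range(p):
        w[i] = (xty[i] - sum(L[i][k] * w[k] for k in range(i))) / L[i][i]
    # back solve L' beta = w
    beta = [0.0] * p
    for i in reversed(range(p)):
        beta[i] = (w[i] - sum(L[k][i] * beta[k]
                              for k in range(i + 1, p))) / L[i][i]
    return beta

# fit y = b0 + b1*x through points lying exactly on y = 1 + 2x
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
print([round(b, 10) for b in solve_normal_eqns(X, y)])  # [1.0, 2.0]
```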
Some timings on a large example
For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well:
> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850  712
> sys.gc.time(naive.sol <- solve(t(mmm) %*%
+            mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00
(The function sys.gc.time is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)
According to the principles above, it should be more effective to compute
> sys.gc.time(cpod.sol <- solve(crossprod(mmm),
+            crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE
Timing results will vary between computers but, in most cases, the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X′X the naive way:
> sys.gc.time(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00
because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.
However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result, the solution of (1) is performed using a general linear system solver based on an LU decomposition, when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly but the code is awkward:
> sys.gc.time(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sys.gc.time(chol.sol <- backsolve(ch,
+   forwardsolve(ch, crossprod(mmm, y),
+                upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE
Least squares calculations with Matrix classes
The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition.
> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(Mat.sol <- solve(crossprod(mmg),
+            crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE
Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object so that it can be reused without recalculation:
> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all:
> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(sparse.sol <- solve(crossprod(mm),
+            crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE
The model matrix mm has class "cscMatrix", indicating that it is a compressed, sparse, column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric, sparse, column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful, the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case, the decomposition is so fast that it is difficult to determine the difference in the solution times.
> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
Epilogue
Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS), such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or Atlas (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.
Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.
Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A′B are generally faster than those in AB.
Notice that this means that calculating A′B in R as t(A) %*% B instead of crossprod(A, B) causes A to be transposed twice: once in the calculation of t(A) and a second time in the inner loop of the matrix product. The crossprod function does not do any transposition of matrices - yet another reason to use crossprod.
Bibliography
J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).
D.M. Bates
U. of Wisconsin-Madison
bates@wisc.edu
Tools for interactively exploring R packages

by Jianhua Zhang and Robert Gentleman
Introduction
An R package has the required and perhaps additional subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations, both within and outside R, interactive tools that allow users to view and or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer, and they are intended for exploring packages, examples, and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.
Description
Our current implementation is in the form of tcltk widgets, and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcltk. Once both tkWidgets and widgetTools have been loaded into the working session, the user has full access to these interactive tools.
pExplorer
pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and or subdirectories to be excluded from showing by the widget. The default behavior is to open the first package in the library of locally installed R (.libPaths()), with subdirectories/files named "Meta" and "latex" excluded, if nothing is provided by a user.
Figure 1 shows the widget invoked by typing pExplorer("base").
The following tasks can be performed throughthe interface
• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

• Clicking the Browse button allows users to select a directory containing R packages, achieving the same results as typing in a new path.

• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.
Figure 1: A snapshot of the widget when pExplorer is called.
• Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, to be discussed next.
eExplorer
eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.
Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.
vExplorer
vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionalities provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of the vignettes of the package.
Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget, as shown by Figure 4, appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:
• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the pdf version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in the Results of Execution text box.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.
The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.
References
Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28-31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21-24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.
Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base").
Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().
Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.
The survival package

by Thomas Lumley
This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).
Overview
Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.
In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.
Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).
Specifying survival data
In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.
In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.
> library(survival)
> data(veteran) ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+
...

## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
                     diagtime*30+time, status))
 [1] ( 210, 282 ]  ( 150, 561 ]
 [3] (  90, 318 ]  ( 270, 396 ]
...
 [9] ( 540, 854 ]  ( 180, 280+]
The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.
One and two sample summaries
The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.
Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the G-rho family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.
> data(veteran)
> plot(survfit(Surv(time, status) ~ trt, data = veteran),
       xlab = "Years since randomisation",
       xscale = 365, ylab = "% surviving", yscale = 100,
       col = c("forestgreen", "blue"))
> survdiff(Surv(time, status) ~ trt, data = veteran)
Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

      N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394
      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928
Figure 1: Survival distributions for two lung cancer treatments (x-axis: years since randomisation; y-axis: % surviving; curves: Standard and New).
Proportional hazards models
The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.
Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies
log h(t; z) = log h0(t) + beta*z.
Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.
A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.
> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
                       log(bili) + log(protime) +
                       age + platelet,
                     data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
        log(bili) + log(protime) + age + platelet,
      data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999
             se(coef)     z       p
edtrt        0.300321  3.43 0.00061
log(bili)    0.097771  9.73 0.00000
log(protime) 1.031908  2.80 0.00520
age          0.008489  4.18 0.00003
platelet     0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312
The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.
> plot(survfit(Surv(time, status) ~ edtrt,
               data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt +
                  ratetable(edtrt = edtrt, bili = bili,
                            platelet = platelet, age = age,
                            protime = protime),
                data = pbc, subset = trt == -9,
                ratetable = mayomodel, cohort = TRUE),
        col = "purple")
The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.
Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.
Figure 2: Observed and predicted survival.
The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e., that beta is constant over time. The
cox.zph function provides formal tests and graphical diagnostics for proportional hazards.
> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])
Figure 3: Testing proportional hazards (x-axis: time; y-axis: Beta(t) for edtrt).
There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of beta at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.
Additional features
Computation for the Cox model extends straightforwardly to situations where z can vary over time, and when an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.
Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
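As a brief sketch (not part of the original article), a parametric accelerated-failure-time model can be fitted with survreg(); the lung data set used here ships with the survival package:

```r
## Sketch: a Weibull accelerated-failure-time model with survreg().
## Assumes the survival package (and its lung data set) is installed.
library(survival)

fit <- survreg(Surv(time, status) ~ sex, data = lung,
               dist = "weibull")
## survreg models (log) survival time directly, so exp(coef)
## is interpretable as a multiplicative effect on survival time
exp(coef(fit))
```

The dist= argument selects among the parametric error distributions mentioned above (Weibull, lognormal, loglogistic, and others).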
More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.
The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of followup.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
useR! 2004: The R User Conference
by John Fox, McMaster University, Canada
More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference — useR! 2004 — which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.
The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations — from bioinformatics to user interfaces, and finance to fights among lizards — reflects the wide range of applications supported by R.
A particularly interesting component of the program were keynote addresses by members of the R core team, aimed primarily at improving programming skills, and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference — the close relationship in R between use and development. The keynote addresses included:
• Paul Murrell on grid graphics and programming
• Martin Mächler on good programming practice
• Luke Tierney on namespaces and byte compilation
• Kurt Hornik on packaging, documentation, and testing
• Friedrich Leisch on S4 classes and methods
• Peter Dalgaard on language interfaces, using .Call and .External
• Douglas Bates on multilevel models in R
• Brian D. Ripley on datamining and related topics
Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.
An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.
The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19, and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.
Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.
R Help Desk
Date and Time Classes in R
Gabor Grothendieck and Thomas Petzoldt
Introduction
R-1.9.0 and its contributed packages have a number of datetime (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:
• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).
• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).
• POSIX classes. POSIX classes refer to the two classes POSIXct, POSIXlt, and their common super class POSIXt. These support times and dates, including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases, and other applications that involve external communication with systems that also support dates, times, time zones, and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970, GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of
the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
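A minimal sketch (not part of the original article) of the internal representations just described, using only base R:

```r
## Date: days since January 1, 1970
d <- as.Date("1970-01-02")
unclass(d)                    # 1

## POSIXct: seconds since January 1, 1970 GMT
p <- as.POSIXct("1970-01-01 00:00:10", tz = "GMT")
unclass(p)                    # 10, plus a tzone attribute

## POSIXlt: a list of components (sec, min, hour, ..., isdst)
names(unclass(as.POSIXlt(p)))
```

Inspecting objects with unclass(x) in this way is often the quickest route to understanding how a datetime class behaves.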
When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.
Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.
Other Applications
Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word "usually" was employed above.
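The Windows-Excel conversion above can be checked against the well-known correspondence between Excel serial number 25569 and January 1, 1970 (a sketch with hypothetical values, not from the article):

```r
## Convert Windows-Excel serial day numbers to Date,
## using the 1899-12-30 origin discussed above
excel_serial <- c(25569, 25569.75)   # hypothetical spreadsheet values
as.Date("1899-12-30") + floor(excel_serial)
```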
SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).
Time Series
A common use of dates and datetimes are in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly, or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its), and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or time/date package to represent datetimes.
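A small sketch of the ts labelling scheme just mentioned (the irregular classes live in contributed packages and are not shown here):

```r
## A monthly regular series: dates are implied by start= and
## frequency=, not stored as one of the date classes above
x <- ts(rnorm(24), start = c(2000, 1), frequency = 12)
frequency(x)     # 12
time(x)[1:3]     # years plus fractions: 2000, 2000 + 1/12, ...
```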
Input
By default, read.table will read in numeric data, such as 20040115, as numbers and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹
# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)
Avoiding Errors
The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.
With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:
# default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))

# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)

# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)

# if TRUE then year not displayed
options(chron.simplify = FALSE)
For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,
¹This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data.
orig <- chron("01/01/60")
x <- 0:9  # days since 01/01/60

# chron dates with default origin
orig + x
Regarding POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year, and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:
dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)
This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general, caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.

• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames — POSIXct is preferred.

• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
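A sketch of the "convert to character first" advice in the POSIXlt and conversion points above (GMT is used throughout so the result does not depend on the local time zone):

```r
p <- as.POSIXct("2004-06-01 12:00:00", tz = "GMT")

## safer: go through character before changing the class,
## stating the time zone explicitly at each step
pl <- as.POSIXlt(format(p, tz = "GMT"), tz = "GMT")
pl$isdst     # 0: GMT never observes daylight savings time
```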
Comparison Table
Table 1 provides idioms written in each of the threeclasses discussed The reader can use this table toquickly access the commands for those phrases ineach of the three classes and to translate commandsamong the classes
Bibliography
Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2):8-11. URL http://CRAN.R-project.org/doc/Rnews/.
Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de
Table 1: Comparison table. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables; variables ending in 0 are scalars, others are vectors. z is a vector of characters and x is a vector of numbers. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for % codes.

now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz = "GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class = "Date")
  chron:   chron(0)
  POSIXct: structure(0, class = c("POSIXt", "POSIXct"))

x days since origin (GMT)
  Date:    structure(floor(x), class = "Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class = c("POSIXt", "POSIXct"))

difference in days
  Date:    dt1 - dt2
  chron:   dc1 - dc2
  POSIXct: difftime(dp1, dp2, unit = "day")  [1]

time zone difference
  POSIXct: dp - as.POSIXct(format(dp, tz = "GMT"))

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt + 1
  chron:   dc + 1
  POSIXct: seq(dp0, length = 2, by = "DSTday")[2]  [2]

previous day
  Date:    dt - 1
  chron:   dc - 1
  POSIXct: seq(dp0, length = 2, by = "-1 DSTday")[2]  [3]

x days since date
  Date:    dt0 + floor(x)
  chron:   dc0 + x
  POSIXct: seq(dp0, length = 2, by = paste(x, "DSTday"))[2]  [4]

sequence
  Date:    dts <- seq(dt0, length = 10, by = "day")
  chron:   dcs <- seq(dc0, length = 10)
  POSIXct: dps <- seq(dp0, length = 10, by = "DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz = "GMT"), rnorm(10))  [5]

every 2nd week
  Date:    seq(dt0, length = 3, by = "2 week")
  chron:   dc0 + seq(0, length = 3, by = 14)
  POSIXct: seq(dp0, length = 3, by = "2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc) - month.day.year(dc)$day + 1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun = 0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc) - 3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean for each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output, yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output, mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output, Sat Jul 1 1970
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input, "1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format = "y-m-d")
  POSIXct: as.POSIXct(z)

Input, "10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz = "")
  to Date:    as.Date(dates(dc)); as.Date(format(dp))
  to chron:   chron(unclass(dt));
              chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S"))  [6]
  to POSIXct: as.POSIXct(format(dt));
              as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz = "GMT")
  to Date:    as.Date(dates(dc)); as.Date(dp)
  to chron:   chron(unclass(dt)); as.chron(dp)
  to POSIXct: as.POSIXct(format(dt), tz = "GMT"); as.POSIXct(dc)

[1] Can be reduced to dp1 - dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp + 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp - 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp + 24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz = "GMT")).
Programmers' Niche
A simple class in S3 and S4
by Thomas Lumley
ROC curves
Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).
I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±Inf and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.
drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c]) / sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1 - D)[T <= c]) / sum(1 - D))
  plot(1 - spec, sens, type = "l")
}
It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.
There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.
drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  plot(mspec, sens, type = "l")
}
One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
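That monotonicity can be checked directly on toy data (a sketch, not part of the article):

```r
## The cumulative-sum formulation on made-up data
T <- c(1, 2, 3, 4, 5)
D <- c(0, 0, 1, 0, 1)
DD <- table(-T, D)
sens  <- cumsum(DD[, 2]) / sum(DD[, 2])
mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
## both rates are non-decreasing, so the curve is monotone
all(diff(sens) >= 0) && all(diff(mspec) >= 0)   # TRUE
```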
Classes and methods
Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec, test = TT,
               call = sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}
The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.
plot.ROC <- function(x, type = "b", null.line = TRUE,
                     xlab = "1-Specificity",
                     ylab = "Sensitivity",
                     main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
       xlab = xlab, ylab = ylab, ...)
  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}
A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:
lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}
A method for identify allows the cutpoint to be found for any point on the curve:
identify.ROC <- function(x, labels = NULL, ...,
                         digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}
An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
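For readers who want to check their answer, here is one possible sketch (not from the article) using the trapezoidal rule on the sens and mspec components stored in the list-based "ROC" objects above:

```r
## Trapezoidal area under the (mspec, sens) curve;
## (0, 0) is prepended to anchor the lower-left corner
AUC <- function(x) {
  sens  <- c(0, x$sens)
  mspec <- c(0, x$mspec)
  sum(diff(mspec) * (head(sens, -1) + tail(sens, -1)) / 2)
}

## perfectly separated toy data should give AUC = 1
T <- c(1, 2, 3, 4); D <- c(0, 0, 1, 1)
DD <- table(-T, D)
x <- list(sens  = cumsum(DD[, 2]) / sum(DD[, 2]),
          mspec = cumsum(DD[, 1]) / sum(DD[, 1]))
AUC(x)   # 1
```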
Generalising the class
The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.
ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.
ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
S4 classes and methods
In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:
setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })
The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new
function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  new("ROC", sens = sens, mspec = mspec,
      test = TT, call = sys.call())
}
Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:
setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })
The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.
The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.
setMethod("plot", c("ROC","missing"),
  function(x, y, type="b", null.line=TRUE,
           xlab="1-Specificity",
           ylab="Sensitivity",
           main=NULL, ...) {
    par(pty="s")
    plot(x@mspec, x@sens, type=type,
         xlab=xlab, ylab=ylab, ...)
    if(null.line)
      abline(0, 1, lty=3)
    if(is.null(main))
      main <- x@call
    title(main=main)
  })
Creating a method for lines appears to work the same way:
setMethod("lines", "ROC",
  function(x, ...)
    lines(x@mspec, x@sens, ...)
)
In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before:
> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>
and after the setMethod call:
> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x
shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm except that new is used to create the ROC object.
setGeneric("ROC")
setMethod("ROC", c("glm", "missing"),
  function(T, D) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type="response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    new("ROC", sens=sens, mspec=mspec,
        test=TT, call=sys.call())
  })
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.
setGeneric("identify", signature=c("x"))
setMethod("identify", "ROC",
  function(x, labels=NULL, ...,
           digits=1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels=labels, ...)
  })
Discussion
Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.
This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R
by the R Core Team
New features in 1.9.1
• as.Date() now has a method for "POSIXlt" objects.
• mean() has a method for "difftime" objects and so summary() works for such objects.
• legend() has a new argument pt.cex.
• plot.ts() has more arguments, particularly yax.flip.
• heatmap() has a new keep.dendro argument.
• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)
• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc look for variables in their formulae.
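Two of the 1.9.1 items above can be illustrated together; a minimal sketch:

```r
## as.Date() now has a method for "POSIXlt" objects
as.Date(as.POSIXlt("2004-06-01 12:00:00"))
## mean() now has a method for "difftime" objects
d <- as.Date("2004-06-30") - as.Date("2004-06-01")
mean(d)
```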
User-visible changes in 1.9.0
• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.
• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4 which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook – see ?setHook.
• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.
• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.
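The underscore change above can be shown with a minimal sketch:

```r
## '_' is now a legal character in syntactic names ...
my_total <- sum(1:10)
## ... and make.names() no longer translates it
make.names("col_1")   # still "col_1"
```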
New features in 1.9.0
• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.
• There are now print() and [ methods for "acf" objects.
• aov() will now handle singular Error() models with a warning.
• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).
• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.
• as.data.frame() now has a method for arrays.
• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.
• New function assignInNamespace() paralleling fixInNamespace().
• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.
• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).
• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.
• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.
• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)
• Deprecated() has a new argument package which is used in the warning message for non-base packages.
• The print() method for "difftime" objects now handles arrays.
• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.
• dist() has a new method to calculate Minkowski distances.
• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.
• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.
• New functions factorial() defined as gamma(x+1) and, for S-PLUS compatibility, lfactorial() defined as lgamma(x+1).
• findInterval(x, v) now allows +/-Inf values, and NAs in x.
• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.
• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.
• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.
• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)
• legend() has a new argument 'text.col'.
• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).
• A new function mget() will retrieve multiple values from an environment.
• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".
• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.
• POSIXct objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.
• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.
• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.
• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.
• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma, ...) now also work for x < 0.
• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.
• read.table() now allows embedded newlines in quoted fields. (PR#4555)
• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

If both each and length.out have been specified, it now recycles rather than fills with NAs for S compatibility.

If both times and length.out have been specified, times is now ignored for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).
• rgb2hsv() is new, an R interface to the C API function with the same name.
• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.
• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.
• New function shQuote() to quote strings to be passed to OS shells.
• sink() now has a split= argument to direct output to both the sink and the current output connection.
• split.screen() now works for multiple devices at once.
• On some OSes (including Windows and those using glibc), strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.
• strsplit() now has fixed and perl arguments, and split = "" is optimized.
• subset() now allows a drop argument which is passed on to the indexing method for data frames.
• termplot() has an option to smooth the partial residuals.
• varimax() and promax() add class "loadings" to their loadings component.
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)
• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.
• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".
• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).
• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.
• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().
• Package stats4 exports S4 generics for AIC() and BIC().
• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.
• Added experimental support for conditionals in NAMESPACE files.
• Added as.list.environment to coerce environments to lists (efficiently).
• New function addmargins() in the stats package to add marginal summaries to tables, e.g., row and column totals. (Based on a contribution by Bendix Carstensen.)
• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The "diamond" frames around edge labels are more nicely scaled horizontally.
• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.
• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport(), and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.

– Added vpList(), vpStack(), vpTree(), and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

– Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

– Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

pushViewport(viewport(w=.5, h=.5, name="A"))
grid.rect()
pushViewport(viewport(w=.5, h=.5, name="B"))
grid.rect(gp=gpar(col="grey"))
upViewport(2)
grid.rect(vp="A", gp=gpar(fill="red"))
grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))
– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

* the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob "pointed to" is modified.

In addition, the following features are now possible with grobs:

* grobs now save()/load() like any normal R object.

* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed, and then "up"ed, before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

– Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth", and "grobheight" unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for viewport clip arg are now:

"on" = clip to the viewport (was TRUE)
"inherit" = clip to whatever parent says (was FALSE)
"off" = turn off clipping

Still accept logical values (and NA maps to "off").
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.
• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.
• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.
• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.
• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.
• Vignette dependencies can now be checked and obtained via vignetteDepends.
• Option "repositories" to list URLs for package repositories added.
• package.description() has been replaced by packageDescription().
• R CMD INSTALL/build now skip Subversion's .svn directories, as well as CVS directories.
• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.
• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.
• hsv2rgb and rgb2hsv are newly in the C API.
• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.
• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.
• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp(), and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.
• print.coefmat() is defunct, replaced by printCoefmat().
• codes() and codes<-() are defunct.
• anovalist.lm (replaced in 1.2.0) is now defunct.
• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.
• print.atomic() is defunct.
• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).
• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.
• La.eigen() is deprecated; now eigen() uses LAPACK by default.
• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(, deriv=2) and psigamma(, deriv=3).
• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.
• package.contents() and package.description() have been deprecated.
• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.
• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.
• --with-blasgoto= to use K. Goto's optimized BLAS will now work.
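A few of the 1.9.0 additions listed above can be illustrated together; a minimal sketch using only functions documented in the list:

```r
factorial(5)                  # new, defined as gamma(x+1)
lfactorial(5)                 # S-PLUS compatible, lgamma(x+1)
e <- new.env()
e$a <- 1; e[["b"]] <- 2       # $<- and [[<- now apply to environments
mget(c("a", "b"), envir = e)  # new: retrieve multiple values at once
as.list(e)                    # new as.list.environment coercion
```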
The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN
by Kurt Hornik
New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked, or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.
DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.
debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. Energy-statistics concept based on a generalization of Newton's potential energy is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. By S original (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics — Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis; fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
kernlab Kernel-based machine learning methods including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik
klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges
knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey
kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko
mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa
mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou
mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between versus within chain approach of Gelman and Rubin. By Gregory R. Warnes
mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner
mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski
multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon
mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington
mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath
pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger
qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca
race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari
rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman)
ref Small package with functions for creating references, reading from and writing to references, and a memory efficient refdata type that transparently encapsulates matrixes and data.frames. By Jens Oehlschlägel
regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh
rgl 3D visualization device (OpenGL). By Daniel Adler
R News ISSN 1609-3631
Vol 41 June 2004 45
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini
rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers
rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov
sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis
seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld, D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972-974. By David A. Schoenfeld
setRNG Set reproducible random number generator in R and S. By Paul Gilbert
sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others
skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson
spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford
spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth
spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869 and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha
supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler
svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie
treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto
urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff
urn Functions for sampling without replacement (Simulated Urns). By Micah Altman
verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou
zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck
Other changes
• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R News ISSN 1609-3631
Vol 41 June 2004 46
R Foundation News
by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)
• Emanuele De Rinaldis (Italy)
• Norm Phillips (Canada)
New supporting institutions
• Breast Center at Baylor College of Medicine, Houston, Texas, USA
• Dana-Farber Cancer Institute, Boston, USA
• Department of Biostatistics, Vanderbilt University School of Medicine, USA
• Department of Statistics & Actuarial Science, University of Iowa, USA
• Division of Biostatistics, University of California, Berkeley, USA
• Norwegian Institute of Marine Research, Bergen, Norway
• TERRA Lab, University of Regina - Department of Geography, Canada
• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board: Douglas Bates and Paul Murrell

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: firstname.lastname@R-project.org
R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).
R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
The number of groups and their sizes are reported in this case, whereas a table is provided in the case of unequal sample sizes. Moreover, the center of group statistics (the overall mean for an X̄ chart) and the within-group standard deviation of the process are returned.
The main goal of a control chart is to monitor a process. If special causes of variation are present, the process is said to be "out of control" and actions are to be taken to find, and possibly eliminate, such causes. A process is declared to be "in control" if all points charted lie randomly within the control limits. These are usually computed at ±3σ's from the center. This default can be changed using the argument nsigmas (so called "american practice") or specifying the confidence level (so called "british practice") through the confidence.level argument.
Assuming a Gaussian distribution is appropriate for the statistic charted, the two practices are equivalent, since limits at ±3σ's correspond to a two-tails probability of 0.0027 under a standard normal curve.
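This equivalence is easy to check numerically in plain R (a quick sanity check, not part of the qcc package):

```r
# Two-sided tail probability outside +/- 3 sigma under a standard normal
p <- 2 * pnorm(-3)
round(p, 4)                   # 0.0027

# Conversely, the number of sigmas implied by a given confidence level
conf <- 0.9973
qnorm(1 - (1 - conf) / 2)     # approximately 3
```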
Instead of using standard calculations to compute the group statistics and the corresponding control limits (see Montgomery (2000), Wetherill and Brown (1991)), the center, the within-group standard deviation and the control limits may be specified directly by the user through the arguments center, std.dev and limits, respectively.
Figure 1: An X̄ chart for the samples data (diameter[1:25,]). The chart reports: Number of groups = 25; Center = 74.00118; StdDev = 0.009887547; LCL = 73.98791; UCL = 74.01444; Number beyond limits = 0; Number violating runs = 0.
Shewhart control charts
A Shewhart chart is automatically plotted when an object of class 'qcc' is created, unless the qcc function is called with the argument plot=FALSE. The method responsible for plotting a control chart can, however, be invoked directly as follows:
> plot(obj)
giving the plot in Figure 1. This control chart has the center line (horizontal solid line), the upper and lower control limits (dashed lines), and the sample group statistics drawn as points connected with lines. Unless we provide the argument add.stats=FALSE, some summary statistics are shown at the bottom of the plot, together with the number of points beyond control limits and the number of violating runs (a run has length 5 by default).
Once a process is considered to be "in-control", we may use the computed limits for monitoring new data sampled from the same ongoing process. For example,
> obj <- qcc(diameter[1:25,], type="xbar",
+            newdata=diameter[26:40,])
plot the X̄ chart for both calibration data and new data (see Figure 2), but all the statistics and the control limits are solely based on the first 25 samples.
Figure 2: An X̄ chart with calibration data (diameter[1:25,]) and new data (diameter[26:40,]). The chart reports: Number of groups = 40; Center = 74.00118; StdDev = 0.009887547; LCL = 73.98791; UCL = 74.01444; Number beyond limits = 3; Number violating runs = 1.
If only the new data need to be plotted, we may provide the argument chart.all=FALSE in the call to the qcc function, or through the plotting method:
> plot(obj, chart.all=FALSE)
Many other arguments may be provided in the call to the qcc function and to the corresponding plotting method, and we refer the reader to the on-line documentation for details (help(qcc)).
As reported on Table 1, an X̄ chart for one-at-time data may be obtained specifying type="xbar.one". In this case, the data charted are simply the sample values, and the within-group standard deviation is estimated by moving ranges of k (by default 2) points.
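For intuition, the moving-range estimate of the process standard deviation can be sketched in a few lines of plain R. Here sd.moving.range is a hypothetical helper written for this illustration (not a qcc function), and d2 = 1.128 is the usual unbiasing constant for ranges of two points:

```r
# Estimate sigma from one-at-a-time data: mean moving range of k = 2
# consecutive points, divided by the unbiasing constant d2
sd.moving.range <- function(x, d2 = 1.128) {
  mr <- abs(diff(x))   # moving ranges of 2 consecutive points
  mean(mr) / d2
}

set.seed(1)
x <- rnorm(50, mean = 74, sd = 0.01)
sd.moving.range(x)     # close to the true sd, 0.01
```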
Control charts for attributes mainly differ from the previous examples in that we need to provide sample sizes through the size argument (except for the c chart). For example, a p chart can be obtained as follows:
> data(orangejuice)
> attach(orangejuice)
R News ISSN 1609-3631
Vol 41 June 2004 14
> orangejuice
   sample  D size trial
1       1 12   50  TRUE
2       2 15   50  TRUE
3       3  8   50  TRUE
...
53     53  3   50 FALSE
54     54  5   50 FALSE
> obj2 <- qcc(D[trial], sizes=size[trial], type="p")
where we use the logical indicator variable trial to select the first 30 samples as calibration data.
Operating characteristic function
An operating characteristic (OC) curve provides information about the probability of not detecting a shift in the process. This is usually referred to as the type II error, that is, the probability of erroneously accepting a process as being "in control".
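For an X̄ chart with standard 3-sigma limits, this probability has a simple closed form, which can be sketched in plain R (an illustration of the idea, not the qcc implementation):

```r
# Probability of a type II error for an X-bar chart: the probability that
# the group mean stays within +/- nsigmas limits after the process mean
# shifts by `shift` process standard deviations, with n observations per group
beta.xbar <- function(shift, n, nsigmas = 3) {
  pnorm(nsigmas - shift * sqrt(n)) - pnorm(-nsigmas - shift * sqrt(n))
}

beta.xbar(shift = 1, n = 5)   # about 0.78
```

Larger shifts and larger group sizes both drive the type II error toward zero, which is exactly the behavior traced by the OC curves below.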
The OC curves can be easily obtained from an object of class 'qcc':
> par(mfrow=c(1,2))
> oc.curves(obj)
> oc.curves(obj2)
The function oc.curves invisibly returns a matrix or a vector of probabilities values for the type II error. More arguments are available (see help(oc.curves)); in particular, identify=TRUE allows to interactively identify values on the plot.
Figure 3: Operating characteristic curves. Left panel: OC curves for the X̄ chart, plotting the probability of a type II error against the process shift (in standard deviations) for n = 1, 5, 10, 15, 20. Right panel: OC curves for the p chart, plotting the probability of a type II error against p.
Cusum charts
Cusum charts display how the group summary statistics deviate above or below the process center or target value, relative to the standard error of the summary statistics. They are useful to detect small and permanent variation on the mean of the process.

The basic cusum chart implemented in the qcc package is only available for continuous variables at the moment. Assuming a 'qcc' object has been created, we can simply input the following code:
> cusum(obj)
and the resulting cusum chart is shown in Figure 4. This displays on a single plot both the positive deviations and the negative deviations from the target, which by default is set to the center of group statistics. If a different target for the process is required, this value may be provided through the argument target when creating the object of class 'qcc'. Two further aspects may need to be controlled. The argument decision.int can be used to specify the number of standard errors of the summary statistics at which the cumulative sum displays an out of control; by default it is set at 5. The amount of shift in the process to be detected by the cusum chart, measured in standard errors of the summary statistics, is controlled by the argument se.shift, by default set at 1.
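The logic of a decision-interval cusum can be sketched in plain R. Here cusum.sums is a hypothetical illustration (not the qcc internals): z holds the standardized group statistics, and the reference value k is conventionally half the shift to be detected (so k = 1/2 for se.shift = 1):

```r
# Tabular cusum recursions, in units of the standard error of the statistics:
# accumulate deviations above and below the target, resetting at zero
cusum.sums <- function(z, k = 1/2) {
  hi <- lo <- numeric(length(z))
  for (i in seq_along(z)) {
    prev.hi <- if (i == 1) 0 else hi[i - 1]
    prev.lo <- if (i == 1) 0 else lo[i - 1]
    hi[i] <- max(0, prev.hi + z[i] - k)   # deviations above target
    lo[i] <- max(0, prev.lo - z[i] - k)   # deviations below target
  }
  list(above = hi, below = lo)
}

z <- c(0.2, 1.5, 2.0, -0.3)
cusum.sums(z)$above   # 0, 1.0, 2.5, 1.7 (approximately)
```

An out of control signal is given when either cumulative sum crosses the decision interval (5 standard errors by default).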
Figure 4: Cusum chart for diameter[1:25,] (calibration data) and diameter[26:40,] (new data), plotting the cumulative sums above and below the target against group number, with upper and lower decision boundaries (UDB, LDB). The chart reports: Number of groups = 40; Target = 74.00118; StdDev = 0.009887547; Decision boundaries (std. err.) = 5; Shift detection (std. err.) = 1; Number of points beyond boundaries = 4.
EWMA charts
Exponentially weighted moving average (EWMA) charts smooth a series of data based on a moving average with weights which decay exponentially. As for the cusum chart, also this plot can be used to detect small and permanent variation on the mean of the process.

The EWMA chart depends on the smoothing parameter λ, which controls the weighting scheme applied. By default it is set at 0.2, but it can be modified using the argument lambda. Assuming again that a 'qcc' object has been created, an EWMA chart can be obtained as follows:
> ewma(obj)
and the resulting graph is shown in Figure 5.
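The exponential smoothing underlying the EWMA statistic can be sketched in plain R (an illustration of the recursion, not the qcc code; ewma.stat is a hypothetical helper):

```r
# EWMA recursion: z[i] = lambda * x[i] + (1 - lambda) * z[i-1],
# started from a given value (typically the process target)
ewma.stat <- function(x, lambda = 0.2, start = mean(x)) {
  z <- numeric(length(x))
  prev <- start
  for (i in seq_along(x)) {
    z[i] <- lambda * x[i] + (1 - lambda) * prev
    prev <- z[i]
  }
  z
}

ewma.stat(c(10, 12, 11, 13), lambda = 0.2, start = 10)
# 10, 10.4, 10.52, 11.016
```

With a small λ, each point carries little weight on its own, which is why the chart is sensitive to small but persistent shifts rather than isolated spikes.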
R News ISSN 1609-3631
Vol 41 June 2004 15
Figure 5: EWMA chart for diameter[1:25,] (calibration data) and diameter[26:40,] (new data), plotting the group summary statistics against group number. The chart reports: Number of groups = 40; Target = 74.00118; StdDev = 0.009887547; Smoothing parameter = 0.2; Control limits at 3σ.
Process capability analysis
Process capability indices for a characteristic of interest from a continuous process can be obtained through the function process.capability. This takes as mandatory arguments an object of class 'qcc' for a "xbar" or "xbar.one" type, and spec.limits, a vector specifying the lower (LSL) and upper (USL) specification limits. For example, using the previously created 'qcc' object, we may simply use:
> process.capability(obj,
+     spec.limits=c(73.95, 74.05))

Process Capability Analysis

Call:
process.capability(object = obj,
    spec.limits = c(73.95, 74.05))

Number of obs = 125      Target = 74
Center = 74.00118        LSL = 73.95
StdDev = 0.009887547     USL = 74.05

Capability indices:

       Value    2.5%   97.5%
Cp     1.686   1.476   1.895
Cp_l   1.725   1.539   1.912
Cp_u   1.646   1.467   1.825
Cp_k   1.646   1.433   1.859
Cpm    1.674   1.465   1.882

Exp<LSL 0%   Obs<LSL 0%
Exp>USL 0%   Obs>USL 0%
The graph shown in Figure 6 is an histogram of data values, with vertical dotted lines for the upper and the lower specification limits. The target is shown as a vertical dashed line, and its value can be provided using the target argument in the call; if missing, the value from the 'qcc' object is used if available, otherwise the target is set at the middle value between the specification limits.

The dashed curve is a normal density for a Gaussian distribution matching the observed mean and standard deviation. The latter is estimated using the within-group standard deviation, unless a value is specified in the call using the std.dev argument.
The capability indices reported on the graph are also printed at console, together with confidence intervals computed at the level specified by the confidence.level argument (by default set at 0.95).
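The printed indices follow from simple formulas. As a sketch, recomputing them from the values reported above (a check by hand, not the process.capability code):

```r
# Capability indices from the center, within-group sd and spec limits
center  <- 74.00118
std.dev <- 0.009887547
LSL <- 73.95
USL <- 74.05

Cp   <- (USL - LSL) / (6 * std.dev)     # potential capability
Cp.l <- (center - LSL) / (3 * std.dev)  # lower capability
Cp.u <- (USL - center) / (3 * std.dev)  # upper capability
Cp.k <- min(Cp.l, Cp.u)                 # actual capability

round(c(Cp = Cp, Cp_l = Cp.l, Cp_u = Cp.u, Cp_k = Cp.k), 3)
#    Cp  Cp_l  Cp_u  Cp_k
# 1.686 1.725 1.646 1.646
```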
Figure 6: Process capability analysis for diameter[1:25,]: a histogram of the data values with the target (dashed vertical line), the LSL and USL (dotted vertical lines), a matching normal density curve, and the capability indices with the expected and observed proportions beyond the specification limits.
A further display, called "process capability sixpack plots", is also available. This is a graphical summary formed by plotting on the same graphical device:
• a X̄ chart
• a R or S chart (if sample sizes > 10)
• a run chart
• a histogram with specification limits
• a normal Q-Q plot
• a capability plot
As an example, we may input the following code:
> process.capability.sixpack(obj,
+     spec.limits=c(73.95, 74.05),
+     target=74.01)
Pareto charts and cause-and-effect diagrams
A Pareto chart is a barplot where the categories are ordered using frequencies in non increasing order, with a line added to show their cumulative sum. Pareto charts are useful to identify those factors that have the greatest cumulative effect on the system,
and thus screen out the less significant factors in an analysis.
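The ordering and cumulative percentages behind a Pareto chart are easy to recompute by hand in plain R (an illustration only; the pareto.chart function below does this for you):

```r
# Sort categories by decreasing frequency, then accumulate percentages
defect <- c("price code" = 80, "schedule date" = 27,
            "supplier code" = 66, "contact num." = 94,
            "part num." = 33)
sorted  <- sort(defect, decreasing = TRUE)
cum.pct <- cumsum(sorted) / sum(sorted) * 100
round(cum.pct, 1)
#  contact num.  price code  supplier code  part num.  schedule date
#          31.3        58.0           80.0       91.0          100.0
```

Here the two largest categories already account for well over half of all defects, which is the kind of screening a Pareto chart makes visible at a glance.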
A simple Pareto chart may be drawn using the function pareto.chart, which in its simpler form requires a vector of values. For example:
> defect <- c(80, 27, 66, 94, 33)
> names(defect) <-
+   c("price code", "schedule date",
+     "supplier code", "contact num.",
+     "part num.")
> pareto.chart(defect)
The function returns a table of descriptive statistics and the Pareto chart shown in Figure 7. More arguments can be provided to control axes labels and limits, the main title and bar colors (see help(pareto.chart)).
Figure 7: Pareto chart for defect, with bars ordered by decreasing frequency (contact num., price code, supplier code, part num., schedule date) and a line showing the cumulative percentage.
A cause-and-effect diagram, also called an Ishikawa diagram (or fish bone diagram), is used to associate multiple possible causes with a single effect. Thus, given a particular effect, the diagram is constructed to identify and organize possible causes for it.

A very basic implementation of this type of diagram is available in the qcc package. The function cause.and.effect requires at least two arguments: cause, a list of causes and branches providing descriptive labels, and effect, a string describing the effect. Other arguments are available, such as cex and font, which control, respectively, the character expansions and the fonts used for labeling branches, causes and the effect. The following example creates the diagram shown in Figure 8.
> cause.and.effect(
+   cause=list(Measurements=c("Microscopes",
+                             "Inspectors"),
+              Materials=c("Alloys",
+                          "Suppliers"),
+              Personnel=c("Supervisors",
+                          "Operators"),
+              Environment=c("Condensation",
+                            "Moisture"),
+              Methods=c("Brake", "Engager",
+                        "Angle"),
+              Machines=c("Speed", "Bits",
+                         "Sockets")),
+   effect="Surface Flaws")
Figure 8: Cause-and-effect diagram for the effect "Surface Flaws", with branches Measurements (Microscopes, Inspectors), Materials (Alloys, Suppliers), Personnel (Supervisors, Operators), Environment (Condensation, Moisture), Methods (Brake, Engager, Angle) and Machines (Speed, Bits, Sockets).
Tuning and extending the package
Options to control some aspects of the qcc package can be set using the function qcc.options. If this is called with no arguments, it retrieves the list of available options and their current values; otherwise, it sets the required option. From a user point of view, the most interesting options allow to set: the plotting character and the color used to highlight points beyond control limits or violating runs; the run length, i.e. the maximum value of a run before to signal a point as out of control; the background color used to draw the figure and the margin color; the character expansion used to draw the labels, the title and the statistics at the bottom of any plot. For details see help(qcc.options).
Another interesting feature is the possibility to easily extend the package, defining new control charts. Suppose we want to plot a p chart based on samples with different sizes. The resulting control chart will have non-constant control limits and, thus, it may result difficult to read by some operators. One possible solution is to draw a standardized control chart. This can be accomplished defining three new functions: one which computes and returns the group statistics to be charted (i.e. the z scores) and their center (zero in this case), another one which returns the within-group standard deviation (one in this case), and finally a function which returns the control limits. These functions need to be called stats.type, sd.type and limits.type, respectively, for a new chart of type "type". The following code may be used to define a standardized p chart:
stats.p.std <- function(data, sizes)
{
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data)/sum(sizes)
  z <- (data/sizes - pbar)/sqrt(pbar*(1-pbar)/sizes)
  list(statistics = z, center = 0)
}

sd.p.std <- function(data, sizes) return(1)

limits.p.std <- function(center, std.dev, sizes, conf)
{
  if (conf >= 1)
  { lcl <- -conf
    ucl <- +conf
  }
  else
  { if (conf > 0 & conf < 1)
    { nsigmas <- qnorm(1 - (1 - conf)/2)
      lcl <- -nsigmas
      ucl <- +nsigmas
    }
    else stop("invalid 'conf' argument.")
  }
  limits <- matrix(c(lcl, ucl), ncol = 2)
  rownames(limits) <- rep("", length = nrow(limits))
  colnames(limits) <- c("LCL", "UCL")
  return(limits)
}
Then, we may source the above code and obtain the control charts in Figure 9 as follows:
# set unequal sample sizes
> n <- c(rep(50,5), rep(100,5), rep(25,5))
# generate randomly the number of successes
> x <- rbinom(length(n), n, 0.2)
> par(mfrow=c(1,2))
# plot the control chart with variable limits
> qcc(x, type="p", size=n)
# plot the standardized control chart
> qcc(x, type="p.std", size=n)
Figure 9: A p chart with non-constant control limits (left panel: Number of groups = 15; Center = 0.1931429; StdDev = 0.3947641; LCL and UCL variable; Number beyond limits = 0; Number violating runs = 0) and the corresponding standardized control chart (right panel: Center = 0; StdDev = 1; LCL = -3; UCL = 3; Number beyond limits = 0; Number violating runs = 0).
Summary
In this paper we briefly describe the qcc package. This provides functions to draw basic Shewhart quality control charts for continuous, attribute and count data; corresponding operating characteristic curves are also implemented. Other statistical quality tools available are Cusum and EWMA charts for continuous data, process capability analyses, Pareto charts and cause-and-effect diagrams.
References
Montgomery, D.C. (2000). Introduction to Statistical Quality Control, 4th ed. New York: John Wiley & Sons.

Wetherill, G.B. and Brown, D.W. (1991). Statistical Process Control. New York: Chapman & Hall.
Luca Scrucca
Dipartimento di Scienze Statistiche, Università degli Studi di Perugia, Italy
luca@stat.unipg.it
Least Squares Calculations in R
Timing different approaches
by Douglas Bates
Introduction
The S language encourages users to become programmers as they express new ideas and techniques of data analysis in S. Frequently the calculations in these techniques are expressed in terms of matrix operations. The subject of numerical linear algebra - how calculations with matrices can be carried out accurately and efficiently - is quite different from what many of us learn in linear algebra courses or in courses on linear statistical methods.
Numerical linear algebra is primarily based on decompositions of matrices, such as the LU decomposition, the QR decomposition, and the Cholesky decomposition, that are rarely discussed in linear algebra or statistics courses.
In this article we discuss one of the most common operations in statistical computing, calculating least squares estimates. We can write the problem mathematically as

    β̂ = arg min_β ‖y − Xβ‖²        (1)

where X is an n × p model matrix (p ≤ n), y is n-dimensional and β is p-dimensional. Most statistics texts state that the solution to (1) is

    β̂ = (X′X)⁻¹X′y        (2)

when X has full column rank (i.e. the columns of X are linearly independent).
Indeed (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way. That is, they use code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets, then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.
General principles
A few general principles of numerical linear algebra are:

1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.

2. Use crossprod(X), not t(X) %*% X, to calculate X′X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X′y.

3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X′X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
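These principles can be illustrated on a small simulated problem (a sketch for intuition only; the timings in the next section use a much larger, ill-conditioned example):

```r
# Compare the "textbook" translation of (2) with the recommended forms
set.seed(123)
X <- matrix(rnorm(200 * 5), 200, 5)   # full column rank model matrix
y <- rnorm(200)

naive  <- solve(t(X) %*% X) %*% t(X) %*% y       # literal translation of (2)
better <- solve(crossprod(X), crossprod(X, y))   # principles 1 and 2

ch <- chol(crossprod(X))                         # principle 3: Cholesky
chol.based <- backsolve(ch,
    forwardsolve(ch, crossprod(X, y),
                 upper = TRUE, trans = TRUE))

all.equal(naive, better)       # TRUE
all.equal(naive, chol.based)   # TRUE
```

All three agree numerically; the differences show up in speed and stability as the problem grows.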
Some timings on a large example
For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well:
> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850  712
> sys.gc.time(naive.sol <- solve(t(mmm) %*%
+     mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00
(The function sys.gc.time is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)
According to the principles above, it should be more effective to compute:
> sys.gc.time(cpod.sol <- solve(crossprod(mmm),
+     crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE
Timing results will vary between computers but, in most cases, the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X′X the naive way:
> sys.gc.time(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00
because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.

However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result, the solution of (1) is performed using a general linear system solver based on an LU decomposition, when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly, but the code is awkward:
> sys.gc.time(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sys.gc.time(chol.sol <- backsolve(ch,
+     forwardsolve(ch, crossprod(mmm, y),
+         upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE
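For reference, the two triangular solves in the code above implement the following factorization of the normal equations, with R here denoting the upper triangular factor returned by chol() (a matrix, not the language):

```latex
\begin{aligned}
X'X &= R'R, & R &\ \text{upper triangular},\\
R'z &= X'y, & &\text{the forwardsolve step},\\
R\hat\beta &= z, & &\text{the backsolve step}.
\end{aligned}
```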
R News ISSN 1609-3631
Vol. 4/1, June 2004
Least squares calculations with Matrix classes
The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition.
> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(Mat.sol <- solve(crossprod(mmg),
+     crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE
Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object so that it can be reused without recalculation:
> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all:
> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(sparse.sol <- solve(crossprod(mm),
+     crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE
The model matrix mm has class "cscMatrix", indicating that it is a compressed, sparse, column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric, sparse, column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful, the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case, the decomposition is so fast that it is difficult to determine the difference in the solution times:
> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
Epilogue
Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS), such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or Atlas (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.
Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.
Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A'B are generally faster than those in AB.
Notice that this means that calculating A'B in R as t(A) %*% B, instead of crossprod(A, B), causes A to be transposed twice: once in the calculation of t(A) and a second time in the inner loop of the matrix product. The crossprod function does not do any transposition of matrices: yet another reason to use crossprod.
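A quick way to see both the speed difference and the equality of the results (matrix sizes here are chosen arbitrarily for illustration; absolute timings will vary by machine and BLAS):

```r
## two conformable matrices with arbitrary dimensions
A <- matrix(rnorm(1000 * 200), 1000, 200)
B <- matrix(rnorm(1000 * 300), 1000, 300)
system.time(t(A) %*% B)       # transposes A explicitly
system.time(crossprod(A, B))  # computes t(A) %*% B without transposing
all.equal(t(A) %*% B, crossprod(A, B))
```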
Bibliography
J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.
K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.
R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.
R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.
R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).
D.M. Bates
U. of Wisconsin-Madison
bates@wisc.edu
Tools for interactively exploring R packages

by Jianhua Zhang and Robert Gentleman
Introduction
An R package has the required and perhaps additional subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer, and they are intended for exploring packages, examples, and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.
Description
Our current implementation is in the form of tcltk widgets, and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcltk. Once both tkWidgets and widgetTools have been loaded into the working session, the user has full access to these interactive tools.
pExplorer
pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from being shown by the widget. The default behavior is to open the first package in the library of locally installed R (.libPaths()), with subdirectories/files named "Meta" and "latex" excluded, if nothing is provided by a user.
Figure 1 shows the widget invoked by typing pExplorer("base").
The following tasks can be performed through the interface:
• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

• Clicking the Browse button allows users to select a directory containing R packages, to achieve the same results as typing in a new path.

• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.
Figure 1: A snapshot of the widget when pExplorer is called
• Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, which is discussed next.
eExplorer
eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.
Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.
vExplorer
vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionality provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of the vignettes of the package.
Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget as shown by Figure 4 appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:
• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the pdf version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in Results of Execution.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.
The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.
Bibliography
Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28-31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.
Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21-24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.
Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base")
Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer()
Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked
The survival package

by Thomas Lumley
This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).
Overview
Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.
In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.
Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).
Specifying survival data
In the popular 'counting process' formulation of survival analysis, each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.
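As a small sketch of the two forms (the values here are invented for illustration, not taken from any dataset in the article):

```r
library(survival)
## start, stop, event: full counting-process form
Surv(c(0, 0, 2), c(5, 8, 9), c(1, 0, 1))
## when the start variable is zero for everyone it may be omitted
Surv(c(5, 8, 9), c(1, 0, 1))
```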
In data(veteran), time measures the time from the start of a lung cancer clinical trial, and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.
> library(survival)
> data(veteran) ## time from randomisation to death
> with(veteran, Surv(time, status))
  [1]  72  411  228  126  118   10   82
  [8] 110  314  100+  42    8  144   25+
> ## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
+     diagtime*30 + time, status))
  [1] ( 210, 282 ]  ( 150, 561 ]
  [3] (  90, 318 ]  ( 270, 396 ]
  [9] ( 540, 854 ]  ( 180, 280+]
The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.
One and two sample summaries
The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.
Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.
> data(veteran)
> plot(survfit(Surv(time, status) ~ trt, data = veteran),
+     xlab = "Years since randomisation", xscale = 365,
+     ylab = "% surviving", yscale = 100,
+     col = c("forestgreen", "blue"))
> survdiff(Surv(time, status) ~ trt, data = veteran)
Call:
survdiff(formula = Surv(time, status) ~ trt,
    data = veteran)

       N Observed Expected (O-E)^2/E (O-E)^2/V
trt=1 69       64     64.5   0.00388   0.00823
trt=2 68       64     63.5   0.00394   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928
Figure 1: Survival distributions for two lung cancer treatments. (Axes: years since randomisation vs. % surviving; curves labelled Standard and New.)
Proportional hazards models
The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.
Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

    log h(t; z) = log h0(t) + βz.

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.
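A standard consequence of this model, worth spelling out: the ratio of hazards for two covariate values is constant in time and free of the baseline hazard, which is what "proportional hazards" means:

```latex
\frac{h(t; z_1)}{h(t; z_2)}
  = \frac{h_0(t)\,e^{\beta z_1}}{h_0(t)\,e^{\beta z_2}}
  = e^{\beta (z_1 - z_2)} .
```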
A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.
> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
+     log(bili) + log(protime) + age + platelet,
+     data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
    log(bili) + log(protime) + age + platelet,
    data = pbc, subset = trt > 0)

                 coef exp(coef) se(coef)     z       p
edtrt         1.02980     2.800 0.300321  3.43 0.00061
log(bili)     0.95100     2.588 0.097771  9.73 0.00000
log(protime)  2.88544    17.911 1.031908  2.80 0.00520
age           0.03544     1.036 0.008489  4.18 0.00003
platelet     -0.00128     0.999 0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312
The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.
> plot(survfit(Surv(time, status) ~ edtrt,
+     data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt +
+     ratetable(edtrt = edtrt, bili = bili,
+         platelet = platelet, age = age,
+         protime = protime),
+     data = pbc, subset = trt == -9,
+     ratetable = mayomodel, cohort = TRUE),
+     col = "purple")
The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.
Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.
Figure 2: Observed and predicted survival
The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e., that β is constant over time. The cox.zph function provides formal tests and graphical diagnostics for proportional hazards:
> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])
Figure 3: Testing proportional hazards. (Plot of Beta(t) for edtrt against time.)
There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.
Additional features
Computation for the Cox model extends straightforwardly to situations where z can vary over time and where an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual, and a strata() term indicating subgroups that get their own unspecified baseline hazard function.
Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
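A minimal parametric counterpart to the earlier comparison on the veteran data might look like this (a sketch only; the choice of a Weibull distribution is ours, not the article's):

```r
library(survival)
data(veteran)
## parametric model for log survival time
survreg(Surv(time, status) ~ trt, data = veteran,
        dist = "weibull")
```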
More recent additions include penalised likeli-hood estimation for smoothing splines and for ran-dom effects models
The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of follow-up.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
useR! 2004 - The R User Conference

by John Fox, McMaster University, Canada
More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.
The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.
A particularly interesting component of the program were keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:
• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice
• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on data mining and related topics
Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.
An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.
The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19 and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.
Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.

John Fox
McMaster University, Canada
jfox@mcmaster.ca
R Help Desk
Date and Time Classes in R

by Gabor Grothendieck and Thomas Petzoldt
Introduction
R-1.9.0 and its contributed packages have a number of datetime (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:
• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).
• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).
• POSIX classes. POSIX classes refer to the two classes POSIXct and POSIXlt, and their common super class POSIXt. These support times and dates including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases, and other applications that involve external communication with systems that also support dates, times, time zones, and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970, GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIX classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
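The internal representations described in the bullets above can be checked directly with unclass(); a small sketch (the POSIXct line assumes the GMT time zone is available):

```r
## Date: days since 1970-01-01
unclass(as.Date("1970-01-02"))            # 1
library(chron)
## chron: days, with times as fractions of a day
unclass(chron(1.5))                       # 1.5, i.e. noon on Jan 2, 1970
## POSIXct: seconds since 1970-01-01 GMT
unclass(as.POSIXct("1970-01-01 00:00:01", tz = "GMT"))  # 1
```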
When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
A full page table is provided at the end of this ar-ticle which illustrates and compares idioms writtenin each of the three datetime systems above
Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.
Other Applications
Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word usually was employed above.
SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).
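For example, a SAS datetime (seconds since January 1, 1960) can be converted directly with the origin= argument of as.POSIXct; the value here is invented for illustration:

```r
x <- 1401000000  # hypothetical SAS datetime value, in seconds
as.POSIXct(x, origin = "1960-01-01", tz = "GMT")
```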
Time Series
A common use of dates and datetimes is in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly, or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its), and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or date/time package to represent datetimes.
Input
By default, read.table will read in numeric data, such as 20040115, as numbers and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹
# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")
# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
   as.is = 1:2)
chron(df$date, df$time)
Avoiding Errors
The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.
With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:
# default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))
# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)
# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)
# if TRUE then year not displayed
options(chron.simplify = FALSE)
For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,
¹This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.
orig <- chron("01/01/60")
x <- 0:9  # days since 01/01/60
# chron dates with default origin
orig + x
Regarding POSIX classes, the user should be aware of the following:
• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year, and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:
dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)
This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.
• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.
• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.
• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
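A sketch of the "convert via character" idiom described in the bullets above, using an invented datetime:

```r
dp <- as.POSIXct("2004-06-01 12:00:00", tz = "GMT")
## POSIXct -> Date, going through character
as.Date(format(dp, tz = "GMT"))
library(chron)
## POSIXct -> chron, going through character
chron(format(dp, "%m/%d/%Y", tz = "GMT"),
      format(dp, "%H:%M:%S", tz = "GMT"))
```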
Comparison Table
Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.
Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de
R News ISSN 1609-3631
now
    Date:     Sys.Date()
    chron:    as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
    POSIXct:  Sys.time()

origin
    Date:     structure(0, class="Date")
    chron:    chron(0)
    POSIXct:  structure(0, class=c("POSIXt","POSIXct"))

x days since origin (GMT)
    Date:     structure(floor(x), class="Date")
    chron:    chron(x)
    POSIXct:  structure(x*24*60*60, class=c("POSIXt","POSIXct"))

diff in days
    Date:     dt1-dt2
    chron:    dc1-dc2
    POSIXct:  difftime(dp1, dp2, unit="day") [1]

time zone difference
    POSIXct:  dp-as.POSIXct(format(dp), tz="GMT")

compare
    Date:     dt1 > dt2
    chron:    dc1 > dc2
    POSIXct:  dp1 > dp2

next day
    Date:     dt+1
    chron:    dc+1
    POSIXct:  seq(dp0, length=2, by="DSTday")[2] [2]

previous day
    Date:     dt-1
    chron:    dc-1
    POSIXct:  seq(dp0, length=2, by="-1 DSTday")[2] [3]

x days since date
    Date:     dt0+floor(x)
    chron:    dc0+x
    POSIXct:  seq(dp0, length=2, by=paste(x, "DSTday"))[2] [4]

sequence
    Date:     dts <- seq(dt0, length=10, by="day")
    chron:    dcs <- seq(dc0, length=10)
    POSIXct:  dps <- seq(dp0, length=10, by="DSTday")

plot
    Date:     plot(dts, rnorm(10))
    chron:    plot(dcs, rnorm(10))
    POSIXct:  plot(as.POSIXct(format(dps), tz="GMT"), rnorm(10)) [5]

every 2nd week
    Date:     seq(dt0, length=3, by="2 week")
    chron:    dc0+seq(0, length=3, by=14)
    POSIXct:  seq(dp0, length=3, by="2 week")

first day of month
    Date:     as.Date(format(dt, "%Y-%m-01"))
    chron:    chron(dc)-month.day.year(dc)$day+1
    POSIXct:  as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
    Date:     as.numeric(format(dt, "%m"))
    chron:    months(dc)
    POSIXct:  as.numeric(format(dp, "%m"))

day of week (Sun=0)
    Date:     as.numeric(format(dt, "%w"))
    chron:    as.numeric(dates(dc)-3) %% 7
    POSIXct:  as.numeric(format(dp, "%w"))

day of year
    Date:     as.numeric(format(dt, "%j"))
    chron:    as.numeric(format(as.Date(dc), "%j"))
    POSIXct:  as.numeric(format(dp, "%j"))

mean for each month/year
    Date:     tapply(x, format(dt, "%Y-%m"), mean)
    chron:    tapply(x, format(as.Date(dc), "%Y-%m"), mean)
    POSIXct:  tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
    Date:     tapply(x, format(dt, "%m"), mean)
    chron:    tapply(x, months(dc), mean)
    POSIXct:  tapply(x, format(dp, "%m"), mean)

Output: yyyy-mm-dd
    Date:     format(dt)
    chron:    format(as.Date(dates(dc)))
    POSIXct:  format(dp, "%Y-%m-%d")

Output: mm/dd/yy
    Date:     format(dt, "%m/%d/%y")
    chron:    format(dates(dc))
    POSIXct:  format(dp, "%m/%d/%y")

Output: Sat Jul 1, 1970
    Date:     format(dt, "%a %b %d, %Y")
    chron:    format(as.Date(dates(dc)), "%a %b %d, %Y")
    POSIXct:  format(dp, "%a %b %d, %Y")

Input: "1970-10-15"
    Date:     as.Date(z)
    chron:    chron(z, format="y-m-d")
    POSIXct:  as.POSIXct(z)

Input: "10/15/1970"
    Date:     as.Date(z, "%m/%d/%Y")
    chron:    chron(z)
    POSIXct:  as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz="")
    to Date:     from chron: as.Date(dates(dc)); from POSIXct: as.Date(format(dp))
    to chron:    from Date: chron(unclass(dt)); from POSIXct: chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]
    to POSIXct:  from Date: as.POSIXct(format(dt)); from chron: as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz="GMT")
    to Date:     from chron: as.Date(dates(dc)); from POSIXct: as.Date(dp)
    to chron:    from Date: chron(unclass(dt)); from POSIXct: as.chron(dp)
    to POSIXct:  from Date: as.POSIXct(format(dt), tz="GMT"); from chron: as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See strptime for the % codes.

[1] Can be reduced to dp1-dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp+24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp-24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp+24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz="GMT")).

Table 1: Comparison Table
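A few of the Date-class idioms in Table 1 can be tried directly in base R (an illustrative check; the date is arbitrary):

```r
dt <- as.Date("2004-06-15")

first_of_month <- as.Date(format(dt, "%Y-%m-01"))  # 2004-06-01
month_num      <- as.numeric(format(dt, "%m"))     # 6
weekday        <- as.numeric(format(dt, "%w"))     # 2 (a Tuesday; Sun=0)
day_of_year    <- as.numeric(format(dt, "%j"))     # 167 (2004 is a leap year)
```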
Programmers' Niche: A simple class in S3 and S4
by Thomas Lumley
ROC curves
Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.
The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).
I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.
Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that -Inf, Inf and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.
drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1 - D)[T <= c])/sum(1 - D))
  plot(1 - spec, sens, type = "l")
}
It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.
There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.
drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  plot(mspec, sens, type = "l")
}
One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
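This can be checked numerically (a sketch with made-up data, not from the article):

```r
# With toy data, sensitivity and 1-specificity from the table(-T, D)
# construction are cumulative sums of non-negative counts, hence monotone.
T <- c(1, 2, 2, 3, 4)   # test results
D <- c(0, 0, 1, 1, 1)   # disease indicator
DD <- table(-T, D)
sens  <- cumsum(DD[, 2]) / sum(DD[, 2])
mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
all(diff(sens) >= 0) && all(diff(mspec) >= 0)  # TRUE
```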
Classes and methods
Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.
The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec, test = TT,
               call = sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}
The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.
plot.ROC <- function(x, type = "b", null.line = TRUE,
                     xlab = "1-Specificity", ylab = "Sensitivity",
                     main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
       xlab = xlab, ylab = ylab, ...)
  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}
A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:
lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}
A method for identify allows the cutpoint to be found for any point on the curve:
identify.ROC <- function(x, labels = NULL, ..., digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}
An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
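One possible solution sketch (my own illustration, not code from the article) applies the trapezoidal rule to the (mspec, sens) points, prepending the origin (0, 0) so the curve starts at the corner:

```r
# Trapezoidal-rule area under an ROC curve stored as a list with
# components sens and mspec, as built by the ROC() constructor.
AUC <- function(x) {
  fpr <- c(0, x$mspec)   # false positive rate, increasing
  tpr <- c(0, x$sens)    # true positive rate, increasing
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}

# Toy data: cases (D = 1) tend to score higher than controls (D = 0).
r <- local({
  T <- c(1, 2, 2, 3, 4)
  D <- c(0, 0, 1, 1, 1)
  DD <- table(-T, D)
  list(sens  = cumsum(DD[, 2]) / sum(DD[, 2]),
       mspec = cumsum(DD[, 1]) / sum(DD[, 1]))
})
AUC(r)  # 11/12, matching the Mann-Whitney count over case/control pairs
```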
Generalising the class
The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.
ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.
ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
S4 classes and methods
In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).
As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.
The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:
setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })
The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  new("ROC", sens = sens, mspec = mspec,
      test = TT, call = sys.call())
}
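To see the validity check at work, one can try to build an object with mismatched slot lengths (a self-contained sketch that repeats the class definition; the slot values are arbitrary):

```r
library(methods)

setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })

# Consistent lengths pass; mismatched lengths are rejected by new().
ok  <- new("ROC", sens = c(0, 1), mspec = c(0, 1),
           test = c(2, 1), call = quote(ROC()))
bad <- try(new("ROC", sens = c(0, 1), mspec = 0,
               test = c(2, 1), call = quote(ROC())),
           silent = TRUE)
inherits(bad, "try-error")  # TRUE
```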
Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:
setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })
The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.
The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.
setMethod("plot", c("ROC", "missing"),
  function(x, y, type = "b", null.line = TRUE,
           xlab = "1-Specificity",
           ylab = "Sensitivity",
           main = NULL, ...) {
    par(pty = "s")
    plot(x@mspec, x@sens, type = type,
         xlab = xlab, ylab = ylab, ...)
    if (null.line)
      abline(0, 1, lty = 3)
    if (is.null(main))
      main <- x@call
    title(main = main)
  })
Creating a method for lines appears to work the same way:
setMethod("lines", "ROC",
  function(x, ...) {
    lines(x@mspec, x@sens, ...)
  })
In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before
> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>
and after the setMethod call
> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x
shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.
setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type = "response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    new("ROC", sens = sens, mspec = mspec,
        test = TT, call = sys.call())
  })
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.
setGeneric("identify", signature = c("x"))

setMethod("identify", "ROC",
  function(x, labels = NULL, ..., digits = 1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens, labels = labels, ...)
  })
Discussion
Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.
This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R
by the R Core Team
New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.
User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

  Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

  Package mle has been moved to stats4 which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

  Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook -- see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.
New features in 1.9.0
• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.
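  For illustration (hypothetical values, not from the NEWS file):

```r
e <- new.env()
e$x <- 1:3            # $<- on an environment
e[["y"]] <- "abc"     # [[<- with a character index
e$value <- 10

e$x                   # 1 2 3
e[["y"]]              # "abc"
is.null(e$val)        # TRUE: unlike lists, no partial matching of names
```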
• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace.

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times) plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.
• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

  The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

  It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

  ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

  If both each and length.out have been specified, it now recycles rather than fills with NAs for S compatibility.

  If both times and length.out have been specified, times is now ignored for S compatibility. (Previously padding with NAs was used.)

  The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.
• strsplit() now has fixed and perl arguments and split="" is optimized.
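  For instance (an illustrative pair, not from the NEWS file):

```r
# fixed = TRUE treats the split string literally rather than as a regexp.
parts_fixed <- strsplit("a.b.c", ".", fixed = TRUE)[[1]]  # "a" "b" "c"
parts_re    <- strsplit("a.b.c", "\\.")[[1]]              # same, escaping the dot
```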
• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts' defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

  - Renamed push/pop.viewport() to push/popViewport().

  - Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

  - Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.

  - Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport bundles that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

  - Added vpPath() to allow specification of viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use.

    NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

  - Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

  - Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

      pushViewport(viewport(w=.5, h=.5, name="A"))
      grid.rect()
      pushViewport(viewport(w=.5, h=.5, name="B"))
      grid.rect(gp=gpar(col="grey"))
      upViewport(2)
      grid.rect(vp="A",
                gp=gpar(fill="red"))
      grid.rect(vp=vpPath("A", "B"),
                gp=gpar(fill="blue"))
  - Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies).

    This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend().

    There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

  - Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

    * grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

    * grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

    * the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

    In addition, the following features are now possible with grobs:

    * grobs now save()/load() like any normal R object.

    * many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

    * there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

    * there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

    * there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

  - Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).

  - Allowed viewports to turn off clipping altogether. Possible settings for viewport clip arg are now:

      "on"      = clip to the viewport (was TRUE)
      "inherit" = clip to whatever parent says (was FALSE)
      "off"     = turn off clipping

    Still accept logical values (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
R News ISSN 1609-3631
Vol 41 June 2004 41
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.
• Vignette dependencies can now be checked and obtained via vignetteDepends().
• Option "repositories" to list URLs for package repositories added.
• package.description() has been replaced by packageDescription().
• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.
• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.
• R_CheckUserInterrupt is now described in 'Writing R Extensions', and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.
• hsv2rgb and rgb2hsv are newly in the C API.
• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.
• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.
• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp(), and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.
• print.coefmat() is defunct, replaced by printCoefmat().
• codes() and codes<-() are defunct.
• anovalist.lm (replaced in 1.2.0) is now defunct.
• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.
• print.atomic() is defunct.
• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).
• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.
• La.eigen() is deprecated; now eigen() uses LAPACK by default.
• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(, deriv=2) and psigamma(, deriv=3).
• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.
• package.contents() and package.description() have been deprecated.
• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.
  The included PCRE sources have been updated to version 4.5, and PCRE >= 4.0 is now required if --with-pcre is used.
  The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.
• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.
• --with-blasgoto= to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN
by Kurt Hornik
New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked, or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination
follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian multinomial probit models via Markov chain Monte Carlo. Along with the standard multinomial probit model, it can also fit models with different choice sets for each observation, and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check
given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-chain versus within-chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal and external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory efficient refdata type that transparently encapsulates matrixes and data.frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.
Other changes
• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News
by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)
• Emanuele De Rinaldis (Italy)
• Norm Phillips (Canada)
New supporting institutions
• Breast Center at Baylor College of Medicine, Houston, Texas, USA
• Dana-Farber Cancer Institute, Boston, USA
• Department of Biostatistics, Vanderbilt University School of Medicine, USA
• Department of Statistics & Actuarial Science, University of Iowa, USA
• Division of Biostatistics, University of California, Berkeley, USA
• Norwegian Institute of Marine Research, Bergen, Norway
• TERRA Lab, University of Regina, Department of Geography, Canada
• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232
USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor, Programmer's Niche:
Bill Venables

Editor, Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).
R Project Homepage:
http://www.R-project.org

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
> orangejuice
   sample  D size trial
1       1 12   50  TRUE
2       2 15   50  TRUE
3       3  8   50  TRUE
...
53     53  3   50 FALSE
54     54  5   50 FALSE
> obj2 <- qcc(D[trial], sizes=size[trial], type="p")
where we use the logical indicator variable trial to select the first 30 samples as calibration data.
Operating characteristic function
An operating characteristic (OC) curve provides information about the probability of not detecting a shift in the process. This is usually referred to as the type II error, that is, the probability of erroneously accepting a process as being "in control".

The OC curves can be easily obtained from an object of class 'qcc':
> par(mfrow=c(1,2))
> oc.curves(obj)
> oc.curves(obj2)
The function oc.curves invisibly returns a matrix or a vector of probability values for the type II error. More arguments are available (see help(oc.curves)); in particular, identify=TRUE allows one to interactively identify values on the plot.
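The quantity plotted for an "xbar" chart can also be computed directly; a base-R sketch, assuming the usual 3-sigma control limits and a shift expressed in units of the process standard deviation (type2_error is a hypothetical helper name, not part of qcc):

```r
# Probability of NOT signaling (type II error) for an xbar chart with
# 3-sigma limits, when the mean shifts by 'shift' process sd's and
# groups have size n.
type2_error <- function(shift, n) {
  pnorm(3 - shift * sqrt(n)) - pnorm(-3 - shift * sqrt(n))
}

type2_error(0, 5)   # no shift: the usual in-control non-signal probability
type2_error(1, 5)   # a 1-sd shift with groups of size 5
```

Plotting type2_error over a grid of shifts reproduces the shape of one curve in Figure 3.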
Figure 3: Operating characteristic curves: probability of type II error versus process shift (in std. dev. units) for the xbar chart (n = 1, 5, 10, 15, 20), and versus p for the p chart.
Cusum charts
Cusum charts display how the group summary statistics deviate above or below the process center or target value, relative to the standard error of the summary statistics. They are useful to detect small and permanent variation on the mean of the process.

The basic cusum chart implemented in the qcc package is only available for continuous variables at the moment. Assuming a 'qcc' object has been created, we can simply input the following code:
gt cusum(obj)
and the resulting cusum chart is shown in Figure 4. This displays on a single plot both the positive deviations and the negative deviations from the target, which by default is set to the center of group statistics. If a different target for the process is required, this value may be provided through the argument target when creating the object of class 'qcc'. Two further aspects may need to be controlled. The argument decision.int can be used to specify the number of standard errors of the summary statistics at which the cumulative sum displays an out of control; by default it is set at 5. The amount of shift in the process to be detected by the cusum chart, measured in standard errors of the summary statistics, is controlled by the argument se.shift, by default set at 1.
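The statistic behind the chart is the standard tabular cusum recursion; a base-R sketch (cusum_stats is a hypothetical helper, the simulated z scores are made up; k is the reference value, conventionally half the shift to be detected, and h plays the role of decision.int):

```r
# Tabular cusum on standardized group statistics z:
# accumulate drift above and below the target separately.
cusum_stats <- function(z, k = 0.5) {
  pos <- neg <- numeric(length(z))
  for (t in seq_along(z)) {
    prev_p <- if (t == 1) 0 else pos[t - 1]
    prev_n <- if (t == 1) 0 else neg[t - 1]
    pos[t] <- max(0,  z[t] - k + prev_p)  # cumulative drift above target
    neg[t] <- max(0, -z[t] - k + prev_n)  # cumulative drift below target
  }
  list(pos = pos, neg = neg)
}

set.seed(1)
z  <- c(rnorm(20), rnorm(10, mean = 1))  # small shift after sample 20
cs <- cusum_stats(z)
h  <- 5
which(cs$pos > h)                        # samples signaling out of control
```

The one-sided sums stay near zero while the process is on target and grow once a persistent shift appears, which is why cusum charts pick up small shifts faster than a Shewhart chart.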
Figure 4: Cusum chart for diameter[1:25,] (calibration data) and diameter[26:40,] (new data). Number of groups = 40, target = 74.00118, std. dev. = 0.009887547; decision boundaries (std. err.) = 5, shift detection (std. err.) = 1; no. of points beyond boundaries = 4.
EWMA charts
Exponentially weighted moving average (EWMA) charts smooth a series of data based on a moving average with weights which decay exponentially. As for the cusum chart, this plot can also be used to detect small and permanent variation on the mean of the process.

The EWMA chart depends on the smoothing parameter λ, which controls the weighting scheme applied. By default it is set at 0.2, but it can be modified using the argument lambda. Assuming again that a 'qcc' object has been created, an EWMA chart can be obtained as follows:

> ewma(obj)

and the resulting graph is shown in Figure 5.
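The smoothing itself is a one-line recursion, z_t = λ x_t + (1 − λ) z_{t−1}, started at the target; a base-R sketch (ewma_smooth is a hypothetical helper and the data values are made up):

```r
# EWMA smoothing of a series x, starting the recursion at the target value.
ewma_smooth <- function(x, lambda = 0.2, target = mean(x)) {
  z <- numeric(length(x))
  prev <- target
  for (t in seq_along(x)) {
    z[t] <- lambda * x[t] + (1 - lambda) * prev
    prev <- z[t]
  }
  z
}

x <- c(74.01, 73.99, 74.00, 74.02, 73.98)
ewma_smooth(x, lambda = 0.2, target = 74)
```

Equivalently, the recursion is a first-order recursive filter: stats::filter(lambda * x, 1 - lambda, method = "recursive", init = target) gives the same values.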
Figure 5: EWMA chart for diameter[1:25,] (calibration data) and diameter[26:40,] (new data). Number of groups = 40, target = 74.00118, std. dev. = 0.009887547; smoothing parameter = 0.2, control limits at 3*sigma.
Process capability analysis
Process capability indices for a characteristic of interest from a continuous process can be obtained through the function process.capability. This takes as mandatory arguments an object of class 'qcc' of type "xbar" or "xbar.one", and spec.limits, a vector specifying the lower (LSL) and upper (USL) specification limits. For example, using the previously created 'qcc' object, we may simply use:

> process.capability(obj,
+    spec.limits=c(73.95,74.05))
Process Capability Analysis

Call:
process.capability(object = obj,
    spec.limits = c(73.95, 74.05))

Number of obs = 125          Target = 74
Center = 74.00118            LSL = 73.95
StdDev = 0.009887547         USL = 74.05

Capability indices:

       Value   2.5%  97.5%
Cp     1.686  1.476  1.895
Cp_l   1.725  1.539  1.912
Cp_u   1.646  1.467  1.825
Cp_k   1.646  1.433  1.859
Cpm    1.674  1.465  1.882

Exp<LSL 0%   Obs<LSL 0%
Exp>USL 0%   Obs>USL 0%
The graph shown in Figure 6 is a histogram of data values, with vertical dotted lines for the upper and the lower specification limits. The target is shown as a vertical dashed line, and its value can be provided using the target argument in the call; if missing, the value from the 'qcc' object is used if available, otherwise the target is set at the middle value between the specification limits.

The dashed curve is a normal density for a Gaussian distribution matching the observed mean and standard deviation. The latter is estimated using the within-group standard deviation, unless a value is specified in the call using the std.dev argument.

The capability indices reported on the graph are also printed at console, together with confidence intervals computed at the level specified by the confidence.level argument (by default set at 0.95).
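The indices are simple functions of the specification limits and the estimated center and standard deviation; the point estimates printed above can be reproduced in base R:

```r
LSL <- 73.95; USL <- 74.05; target <- 74
center <- 74.00118; sigma <- 0.009887547   # estimates from the output above

Cp   <- (USL - LSL) / (6 * sigma)          # potential capability
Cp_l <- (center - LSL) / (3 * sigma)       # lower one-sided index
Cp_u <- (USL - center) / (3 * sigma)       # upper one-sided index
Cp_k <- min(Cp_l, Cp_u)                    # actual capability
Cpm  <- Cp / sqrt(1 + ((center - target) / sigma)^2)  # Taguchi index

round(c(Cp = Cp, Cp_l = Cp_l, Cp_u = Cp_u, Cp_k = Cp_k, Cpm = Cpm), 3)
#    Cp  Cp_l  Cp_u  Cp_k   Cpm
# 1.686 1.725 1.646 1.646 1.674
```

The confidence intervals reported by process.capability additionally depend on the sample size and confidence.level, and are not reproduced here.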
Figure 6: Process capability analysis for diameter[1:25,]: histogram of the data with LSL = 73.95, Target = 74, USL = 74.05; Cp = 1.69, Cp_l = 1.73, Cp_u = 1.65, Cp_k = 1.65, Cpm = 1.67.
A further display, called "process capability sixpack plots", is also available. This is a graphical summary formed by plotting on the same graphical device:

• an X chart
• an R or S chart (if sample sizes > 10)
• a run chart
• a histogram with specification limits
• a normal Q–Q plot
• a capability plot
As an example, we may input the following code:

> process.capability.sixpack(obj,
+    spec.limits=c(73.95,74.05),
+    target=74.01)
Pareto charts and cause-and-effect diagrams

A Pareto chart is a barplot where the categories are ordered using frequencies in non-increasing order, with a line added to show their cumulative sum. Pareto charts are useful to identify those factors that have the greatest cumulative effect on the system,
R News ISSN 1609-3631
Vol 41 June 2004 16
and thus screen out the less significant factors in an analysis.
A simple Pareto chart may be drawn using thefunction paretochart which in its simpler form re-quires a vector of values For example
gt defect lt- c(80 27 66 94 33)
gt names(defect) lt-
+ c(price code schedule date
+ supplier code contact num
+ part num)
gt paretochart(defect)
The function returns a table of descriptive statistics and the Pareto chart shown in Figure 7. More arguments can be provided to control axes labels and limits, the main title and bar colors (see help(pareto.chart)).
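The descriptive statistics behind the chart amount to sorting the frequencies and accumulating percentages; a base-R sketch using the same defect counts:

```r
defect <- c("price code" = 80, "schedule date" = 27, "supplier code" = 66,
            "contact num." = 94, "part num." = 33)

sorted  <- sort(defect, decreasing = TRUE)      # categories by frequency
cum_pct <- cumsum(sorted) / sum(sorted) * 100   # cumulative percentage line

data.frame(Frequency = sorted, Cum.Percent = round(cum_pct, 1))
```

The cumulative percentage column is what pareto.chart draws as the line over the bars.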
Figure 7: Pareto chart for defect: categories in decreasing frequency order (contact num., price code, supplier code, part num., schedule date), with the cumulative percentage line.
A cause-and-effect diagram, also called an Ishikawa diagram (or fish bone diagram), is used to associate multiple possible causes with a single effect. Thus, given a particular effect, the diagram is constructed to identify and organize possible causes for it.

A very basic implementation of this type of diagram is available in the qcc package. The function cause.and.effect requires at least two arguments: cause, a list of causes and branches providing descriptive labels, and effect, a string describing the effect. Other arguments are available, such as cex and font, which control, respectively, the character expansions and the fonts used for labeling branches, causes and the effect. The following example creates the diagram shown in Figure 8.
> cause.and.effect(
+   cause=list(Measurements=c("Microscopes",
+                             "Inspectors"),
+              Materials=c("Alloys",
+                          "Suppliers"),
+              Personnel=c("Supervisors",
+                          "Operators"),
+              Environment=c("Condensation",
+                            "Moisture"),
+              Methods=c("Brake", "Engager",
+                        "Angle"),
+              Machines=c("Speed", "Bits",
+                         "Sockets")),
+   effect="Surface Flaws")
Figure 8: Cause-and-effect diagram.
Tuning and extending the package
Options to control some aspects of the qcc package can be set using the function qcc.options. If this is called with no arguments it retrieves the list of available options and their current values; otherwise it sets the required option. From a user point of view, the most interesting options allow the user to set: the plotting character and the color used to highlight points beyond control limits or violating runs; the run length, i.e. the maximum length of a run before signalling a point as out of control; the background color used to draw the figure and the margin color; and the character expansion used to draw the labels, the title and the statistics at the bottom of any plot. For details see help(qcc.options).
Another interesting feature is the possibility of easily extending the package by defining new control charts. Suppose we want to plot a p chart based on samples with different sizes. The resulting control chart will have non-constant control limits and thus may be difficult for some operators to read. One possible solution is to draw a standardized control chart. This can be accomplished by defining three new functions: one which computes and returns the group statistics to be charted (i.e. the z scores) and their center (zero in this case), another one which returns the within-group standard deviation (one in this case), and finally a function which returns the control limits. These functions need to be
called stats.type, sd.type and limits.type, respectively, for a new chart of type "type". The following code may be used to define a standardized p chart:
stats.p.std <- function(data, sizes)
{
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data)/sum(sizes)
  z <- (data/sizes - pbar)/sqrt(pbar*(1-pbar)/sizes)
  list(statistics = z, center = 0)
}

sd.p.std <- function(data, sizes) return(1)

limits.p.std <- function(center, std.dev, sizes, conf)
{
  if (conf >= 1)
  { lcl <- -conf
    ucl <- +conf }
  else
  { if (conf > 0 & conf < 1)
    { nsigmas <- qnorm(1 - (1 - conf)/2)
      lcl <- -nsigmas
      ucl <- +nsigmas }
    else stop("invalid conf argument.")
  }
  limits <- matrix(c(lcl, ucl), ncol = 2)
  rownames(limits) <- rep("", length = nrow(limits))
  colnames(limits) <- c("LCL", "UCL")
  return(limits)
}
Then we may source the above code and obtain the control charts in Figure 9 as follows:
# set unequal sample sizes
> n <- c(rep(50, 5), rep(100, 5), rep(25, 5))
# generate randomly the number of successes
> x <- rbinom(length(n), n, 0.2)
> par(mfrow = c(1, 2))
# plot the control chart with variable limits
> qcc(x, type = "p", size = n)
# plot the standardized control chart
> qcc(x, type = "p.std", size = n)
Figure 9: A p chart with non-constant control limits (left panel) and the corresponding standardized control chart (right panel).
Summary
In this paper we have briefly described the qcc package. This provides functions to draw basic Shewhart quality control charts for continuous, attribute and count data; corresponding operating characteristic curves are also implemented. Other statistical quality tools available are Cusum and EWMA charts for continuous data, process capability analyses, Pareto charts and cause-and-effect diagrams.
Luca Scrucca
Dipartimento di Scienze Statistiche, Università degli Studi di Perugia, Italy
luca@stat.unipg.it
Least Squares Calculations in R
Timing different approaches
by Douglas Bates
Introduction
The S language encourages users to become programmers as they express new ideas and techniques of data analysis in S. Frequently the calculations in these techniques are expressed in terms of matrix operations. The subject of numerical linear algebra, how calculations with matrices can be carried out accurately and efficiently, is quite different from what many of us learn in linear algebra courses or in courses on linear statistical methods.

Numerical linear algebra is primarily based on decompositions of matrices, such as the LU decomposition, the QR decomposition, and the Cholesky decomposition, that are rarely discussed in linear algebra or statistics courses.
In this article we discuss one of the most common operations in statistical computing, calculating least squares estimates. We can write the problem mathematically as

    β̂ = arg min_β ‖y − Xβ‖²    (1)
where X is an n × p model matrix (p ≤ n), y is n-dimensional and β is p-dimensional. Most statistics texts state that the solution to (1) is

    β̂ = (X′X)⁻¹ X′y    (2)

when X has full column rank (i.e. the columns of X are linearly independent).
Indeed (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way. That is, they use code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.
General principles
A few general principles of numerical linear algebra are:

1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.

2. Use crossprod(X), not t(X) %*% X, to calculate X′X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X′y.

3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X′X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
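These principles can be checked on a small simulated problem. The sketch below uses base R only; the data are made up for illustration and are not the large example analyzed next:

```r
set.seed(1)
n <- 200; p <- 4
X <- matrix(rnorm(n * p), n, p)  # random model matrix, full column rank
y <- rnorm(n)

## literal translation of (2): invert X'X, then multiply twice
naive <- solve(t(X) %*% X) %*% t(X) %*% y

## preferred: form X'X and X'y with crossprod, solve one system
better <- solve(crossprod(X), crossprod(X, y))

all.equal(as.vector(naive), as.vector(better))
```

Both forms give the same coefficients up to rounding; the crossprod form does less work and is numerically more stable.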
Some timings on a large example
For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well.
> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850  712
> sys.gc.time(naive.sol <- solve(t(mmm) %*%
+     mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00
(The function sys.gc.time is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)
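sys.gc.time is not part of R itself; a minimal sketch of such a wrapper, based purely on the description above (the name is the article's, the body is an assumption), might be:

```r
## Sketch of a gc-first timing wrapper: collect garbage before starting
## the timer so timings are not distorted by pending collections.
sys.gc.time <- function(expr) {
  gc(FALSE)          # run the garbage collector quietly
  system.time(expr)  # 'expr' is an unevaluated promise, forced here once
}
```

Because expr is passed as an unevaluated promise, the timed expression runs exactly once, inside system.time.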
According to the principles above, it should be more effective to compute
> sys.gc.time(cpod.sol <- solve(crossprod(mmm),
+     crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE
Timing results will vary between computers but in most cases the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X′X the naive way
> sys.gc.time(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00
because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.

However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result the solution of (1) is performed using a general linear system solver based on an LU decomposition when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly but the code is awkward:
> sys.gc.time(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sys.gc.time(chol.sol <- backsolve(ch,
+     forwardsolve(ch, crossprod(mmm, y),
+       upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE
Least squares calculations with Matrix classes
The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition.
> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(Mat.sol <- solve(crossprod(mmg),
+     crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE
Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object so that it can be reused without recalculation.
> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all.
> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(sparse.sol <- solve(crossprod(mm),
+     crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE
The model matrix mm has class "cscMatrix", indicating that it is a compressed sparse column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric sparse column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case the decomposition is so fast that it is difficult to determine the difference in the solution times.
> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
Epilogue
Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS) such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or Atlas (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.

Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.

Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A′B are generally faster than those in AB.
Notice that this means that calculating A′B in R as t(A) %*% B instead of crossprod(A, B) causes A to be transposed twice, once in the calculation of t(A) and a second time in the inner loop of the matrix product. The crossprod function does not do any transposition of matrices, yet another reason to use crossprod.
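A quick base-R check of the equivalence (the matrix sizes are arbitrary, and timings are machine dependent so none are shown here):

```r
set.seed(42)
A <- matrix(rnorm(300 * 200), 300, 200)
B <- matrix(rnorm(300 * 100), 300, 100)

## crossprod(A, B) computes A'B without explicitly forming t(A)
stopifnot(all.equal(crossprod(A, B), t(A) %*% B))

## time the two forms yourself; crossprod is typically faster
system.time(for (i in 1:20) invisible(t(A) %*% B))
system.time(for (i in 1:20) invisible(crossprod(A, B)))
```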
Bibliography
J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).
D.M. Bates
U. of Wisconsin-Madison
bates@wisc.edu
Tools for interactively exploring R packages
by Jianhua Zhang and Robert Gentleman
Introduction
An R package has the required and perhaps additional subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer and they are intended for exploring packages, examples and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.
Description
Our current implementation is in the form of tcltk widgets, and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcltk. Once both tkWidgets and widgetTools have been loaded into the working session the user has full access to these interactive tools.
pExplorer
pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from being shown by the widget. If nothing is provided by a user, the default behavior is to open the first package in the library of locally installed R (.libPaths()) with subdirectories/files named "Meta" and "latex" excluded.
Figure 1 shows the widget invoked by typing pExplorer("base").
The following tasks can be performed through the interface:

• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

• Clicking the Browse button allows users to select a directory containing R packages, to achieve the same results as typing in a new path.

• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.
Figure 1: A snapshot of the widget when pExplorer is called.
• Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, which is discussed next.
eExplorer
eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.

Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.
vExplorer
vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionalities provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of the vignettes of the package.

Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget as shown by Figure 4 appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:
• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the pdf version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in Results of Execution.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.
The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.
References
Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28-31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21-24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.
Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base").
Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().
Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.
The survival package
by Thomas Lumley
This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).
Overview
Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).
Specifying survival data
In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.
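As a small illustration of the counting process form (the follow-up intervals below are made up, assuming the survival package is available):

```r
library(survival)

## three hypothetical subjects observed over (start, stop] intervals;
## event = 1 if observation ended with the event, 0 if censored
d <- data.frame(start = c(0, 0, 2),
                stop  = c(5, 3, 9),
                event = c(1, 0, 1))
with(d, Surv(start, stop, event))
```

If every start time were zero, the start argument could be omitted, as noted above.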
In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.
> library(survival)
> data(veteran) ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+
...
## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
        diagtime*30 + time, status))
 [1] ( 210, 282 ]  ( 150, 561 ]
 [3] (  90, 318 ]  ( 270, 396 ]
...
 [9] ( 540, 854 ]  ( 180, 280+]
The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.
One and two sample summaries
The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and from the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.
> data(veteran)
> plot(survfit(Surv(time, status) ~ trt, data = veteran),
       xlab = "Years since randomisation",
       xscale = 365, ylab = "% surviving", yscale = 100,
       col = c("forestgreen", "blue"))
> survdiff(Surv(time, status) ~ trt, data = veteran)

Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

       N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394
      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928
Figure 1: Survival distributions for two lung cancer treatments.
Proportional hazards models
The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.

Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

    log h(t; z) = log h0(t) + βz

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.
A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.
> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
        log(bili) + log(protime) + age + platelet,
        data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
      log(bili) + log(protime) + age +
      platelet, data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999

             se(coef)     z       p
edtrt        0.300321  3.43 0.00061
log(bili)    0.097771  9.73 0.00000
log(protime) 1.031908  2.80 0.00520
age          0.008489  4.18 0.00003
platelet     0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312
The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.
> plot(survfit(Surv(time, status) ~ edtrt,
        data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt +
        ratetable(edtrt = edtrt, bili = bili,
          platelet = platelet, age = age,
          protime = protime),
        data = pbc, subset = trt == -9,
        ratetable = mayomodel, cohort = TRUE),
       col = "purple")
The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.
Figure 2: Observed and predicted survival.
The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The cox.zph function provides formal tests and graphical diagnostics for proportional hazards.
> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])
Figure 3: Testing proportional hazards.
There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.
Additional features
Computation for the Cox model extends straightforwardly to situations where z can vary over time, and when an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.
Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
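A sketch of these two modelling functions on the veteran data (the covariate choices below are arbitrary and purely illustrative):

```r
library(survival)
data(veteran)

## Cox model with a separate, unspecified baseline hazard per cell type
fit.cox <- coxph(Surv(time, status) ~ age + karno + strata(celltype),
                 data = veteran)

## parametric (Weibull) model for survival time via survreg
fit.wei <- survreg(Surv(time, status) ~ age + karno + celltype,
                   data = veteran, dist = "weibull")
```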
More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of followup.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
useR 2004
The R User Conference

by John Fox, McMaster University, Canada
More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.
The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.

A particularly interesting component of the program were the keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference, the close relationship in R between use and development. The keynote addresses included:
• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice
• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on data mining and related topics
Slides for the keynote addresses, along with abstracts of other papers presented at useR 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.
An innovation of useR 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.
The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19, and a conference trip in the afternoon of May 22 to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR 2004, was very enthusiastic about the conference. We look forward to an even larger useR in 2006.
R Help Desk
Date and Time Classes in R
Gabor Grothendieck and Thomas Petzoldt
Introduction
R-1.9.0 and its contributed packages have a number of datetime (i.e. date or date/time) classes. In particular, the following three classes are discussed in this article:
• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).
• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than datetime packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).
• POSIX classes. POSIX classes refer to the two classes POSIXct and POSIXlt and their common super class POSIXt. These support times and dates including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases and other applications that involve external communication with systems that also support dates, times, time zones and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct date/times are represented as seconds since January 1, 1970, GMT, while POSIXlt date/times are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of
R News ISSN 1609-3631
Vol. 4/1, June 2004
the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIX classes can be found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX date/time objects using unclass(x), where x is of any POSIX date/time class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
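The internal representations described above can be inspected directly with unclass(). A minimal sketch for the Date and POSIXct conventions (chron is omitted since it must be installed separately; the dates chosen are arbitrary):

```r
# Date counts days since 1970-01-01; POSIXct counts seconds since
# 1970-01-01 00:00:00 GMT.
d  <- as.Date("2004-06-15")
dp <- as.POSIXct("2004-06-15 12:00:00", tz = "GMT")

unclass(d)    # 12584 (days after the origin)
unclass(dp)   # 12584 days plus 12 hours, expressed in seconds

# moving between the classes: drop the time part
as.Date(format(dp, tz = "GMT"))
```
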
When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three date/time systems above.

Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent date/times as years plus fraction of a year. The date and pastecs packages are not discussed in this article.
Other Applications
Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent date/times as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word "usually" was employed above.
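A concrete sketch of the Windows-origin conversion (the serial numbers here are hypothetical values of the kind a spreadsheet export might contain; 36526 is 2000-01-01 in Excel's 1900 date system):

```r
# Excel (Windows origin) serial day numbers, e.g. from a spreadsheet export
x <- c(36526, 36526.5, 38000)

# date part as Date class; the fractional part of a day is discarded
as.Date("1899-12-30") + floor(x)
```
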
SPSS uses October 14, 1582 as the origin, thereby representing date/times as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such date/times automatically (Alzola and Harrell, 2002).
Time Series
A common use of dates and date/times is in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its) and zoo (in package zoo). ts uses a scheme of its own for dates. irts and its use POSIXct dates to represent date/times. zoo can use any date or time/date package to represent date/times.
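A minimal sketch of a regular series in the ts class (the data values are hypothetical):

```r
# two years of monthly data starting January 2003
z <- ts(rnorm(24), start = c(2003, 1), frequency = 12)

frequency(z)                   # 12
start(z)                       # c(2003, 1)
window(z, start = c(2004, 1))  # extract the second year only
```
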
Input
By default, read.table will read in numeric data, such as 20040115, as numbers and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹
# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y%m%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)
Avoiding Errors
The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:
# default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))
# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)
# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)
# if TRUE then year not displayed
options(chron.simplify = FALSE)
For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example:
¹ This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.
orig <- chron("01/01/60")
x <- 0:9  # days since 01/01/60
# chron dates with default origin
orig + x
Regarding POSIX classes, the user should be aware of the following:
• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the date/times as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.
• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with date/times. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.
• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt date/times have an isdst component that indicates whether it is daylight savings time (isdst = 1) or not (isdst = 0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Date/times which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.
• conversion. Conversion between different date/time classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
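The character-first idiom recommended above can be sketched as follows; going through format() pins down the displayed fields exactly, independent of any time-zone or daylight-savings arithmetic a direct conversion might apply (the date used is arbitrary, and GMT is chosen so the sketch is reproducible):

```r
# a POSIXct time, fixed in GMT
dp <- as.POSIXct("2004-07-01 10:00:00", tz = "GMT")

# direct conversion may shift the time by the DST/zone offset;
# converting through character keeps the fields as displayed
lt <- as.POSIXlt(format(dp, tz = "GMT"), tz = "GMT")
lt$hour    # 10
lt$isdst   # 0 (GMT has no daylight savings time)
```
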
Comparison Table
Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.
Bibliography
Alzola, C. F. and Harrell, F. E. (2002). An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf

James, D. A. and Pregibon, D. (1993). Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf

Ripley, B. D. and Hornik, K. (2001). Date-time classes. R News, 1(2), 8–11. URL http://CRAN.R-project.org/doc/Rnews/
Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de
now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz = "GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class = "Date")
  chron:   chron(0)
  POSIXct: structure(0, class = c("POSIXt", "POSIXct"))

x days since origin (GMT days for POSIXct)
  Date:    structure(floor(x), class = "Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class = c("POSIXt", "POSIXct"))

diff in days
  Date:    dt1 - dt2
  chron:   dc1 - dc2
  POSIXct: difftime(dp1, dp2, unit = "day") [1]

time zone difference
  POSIXct: dp - as.POSIXct(format(dp, tz = "GMT"))

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt + 1
  chron:   dc + 1
  POSIXct: seq(dp0, length = 2, by = "DSTday")[2] [2]

previous day
  Date:    dt - 1
  chron:   dc - 1
  POSIXct: seq(dp0, length = 2, by = "-1 DSTday")[2] [3]

x days since date
  Date:    dt0 + floor(x)
  chron:   dc0 + x
  POSIXct: seq(dp0, length = 2, by = paste(x, "DSTday"))[2] [4]

sequence
  Date:    dts <- seq(dt0, length = 10, by = "day")
  chron:   dcs <- seq(dc0, length = 10)
  POSIXct: dps <- seq(dp0, length = 10, by = "DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz = "GMT"), rnorm(10)) [5]

every 2nd week
  Date:    seq(dt0, length = 3, by = "2 week")
  chron:   dc0 + seq(0, length = 3, by = 14)
  POSIXct: seq(dp0, length = 3, by = "2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc) - month.day.year(dc)$day + 1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month, as number which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun = 0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc) - 3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean for each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output: yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output: mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output: Sat Jul 1 1970
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input: "1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format = "y-m-d")
  POSIXct: as.POSIXct(z)

Input: "10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz = "")
  chron to Date:    as.Date(dates(dc))
  POSIXct to Date:  as.Date(format(dp))
  Date to chron:    chron(unclass(dt))
  POSIXct to chron: chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]
  Date to POSIXct:  as.POSIXct(format(dt))
  chron to POSIXct: as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz = "GMT")
  chron to Date:    as.Date(dates(dc))
  POSIXct to Date:  as.Date(dp)
  Date to chron:    chron(unclass(dt))
  POSIXct to chron: as.chron(dp)
  Date to POSIXct:  as.POSIXct(format(dt), tz = "GMT")
  chron to POSIXct: as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See strptime for % codes.

[1] Can be reduced to dp1 - dp2 if it is acceptable that date/times closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp + 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp - 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp + 24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT date/times rather than current date/times suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz = "GMT")).

Table 1: Comparison table
Programmers' Niche
A simple class in S3 and S4
by Thomas Lumley
ROC curves
Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 − specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±∞ and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.
drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
    function(c) sum(D[T > c]) / sum(D))
  spec <- sapply(cutpoints,
    function(c) sum((1 - D)[T <= c]) / sum(1 - D))
  plot(1 - spec, sens, type = "l")
}
It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.
drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  plot(mspec, sens, type = "l")
}
One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
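The monotonicity claim can be checked directly. The following sketch (with simulated data, not from the article) builds the two cumulative sums and verifies that both are non-decreasing:

```r
set.seed(42)
D <- rep(0:1, each = 50)    # hypothetical disease indicator
T <- rnorm(100, mean = D)   # hypothetical test, higher when diseased

DD <- table(-T, D)          # -T so cutpoints run from high to low
sens  <- cumsum(DD[, 2]) / sum(DD[, 2])   # true positive rate
mspec <- cumsum(DD[, 1]) / sum(DD[, 1])   # false positive rate

stopifnot(!is.unsorted(sens), !is.unsorted(mspec))
```
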
Classes and methods
Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec, test = TT,
               call = sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}
The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.
plot.ROC <- function(x, type = "b", null.line = TRUE,
    xlab = "1-Specificity", ylab = "Sensitivity",
    main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
       xlab = xlab, ylab = ylab, ...)
  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}
A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:

lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}
A method for identify allows the cutpoint to be found for any point on the curve:

identify.ROC <- function(x, labels = NULL, ...,
    digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}
An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
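One possible solution to the exercise, using the trapezoidal rule; the function name AUC, the endpoint padding, and the example object are my choices, not from the article:

```r
AUC <- function(x) {
  # pad with the (0,0) and (1,1) endpoints, then apply the
  # trapezoidal rule to the (mspec, sens) points
  fp <- c(0, x$mspec, 1)
  tp <- c(0, x$sens, 1)
  sum(diff(fp) * (head(tp, -1) + tail(tp, -1)) / 2)
}

# hypothetical ROC object with three cutpoints
r <- list(mspec = c(0.2, 0.5, 0.8), sens = c(0.6, 0.9, 1))
AUC(r)  # 0.77
```
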
Generalising the class
The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.
ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.
ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
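The key step in ROC.glm, recovering the 0/1 response as fitted values plus response residuals, can be sketched on simulated data (variable names hypothetical, not from the article):

```r
set.seed(7)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(x))      # simulated binary outcome
fit <- glm(y ~ x, family = binomial)

# fitted + response residuals reconstructs the original response,
# since resid(fit, type = "response") is y - fitted(fit)
disease <- fitted(fit) + resid(fit, type = "response")
all.equal(as.vector(disease), as.vector(y))  # TRUE
```
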
S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:
setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })
The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
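A quick sketch of the validity check in action; the class name ROCdemo is chosen here to avoid clashing with the article's class:

```r
library(methods)

setClass("ROCdemo",
  representation(sens = "numeric", mspec = "numeric"),
  validity = function(object) {
    if (length(object@sens) == length(object@mspec)) TRUE
    else "sens and mspec lengths differ"
  })

ok  <- new("ROCdemo", sens = c(0.5, 1), mspec = c(0.2, 1))  # passes
bad <- try(new("ROCdemo", sens = 1, mspec = c(0.2, 1)),
           silent = TRUE)                                    # fails validity
inherits(bad, "try-error")  # TRUE
```
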
In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new
function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  new("ROC", sens = sens, mspec = mspec,
      test = TT, call = sys.call())
}
Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:
setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })
The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.
setMethod("plot", c("ROC", "missing"),
  function(x, y, type = "b", null.line = TRUE,
      xlab = "1-Specificity",
      ylab = "Sensitivity",
      main = NULL, ...) {
    par(pty = "s")
    plot(x@mspec, x@sens, type = type,
         xlab = xlab, ylab = ylab, ...)
    if (null.line)
      abline(0, 1, lty = 3)
    if (is.null(main))
      main <- x@call
    title(main = main)
  })
Creating a method for lines appears to work the same way:

setMethod("lines", "ROC",
  function(x, ...) {
    lines(x@mspec, x@sens, ...)
  })
In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before:

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call:

> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.
setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T, D) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type = "response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[, 2]) / sum(DD[, 2])
    mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
    new("ROC", sens = sens, mspec = mspec,
        test = TT, call = sys.call())
  })
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.
setGeneric("identify", signature = c("x"))

setMethod("identify", "ROC",
  function(x, labels = NULL, ...,
      digits = 1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels = labels, ...)
  })
Discussion
Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R
by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects, and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.
User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

  Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has
been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

  Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

  Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook – see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.
New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values; thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments, for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name, and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package, which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.
bull Functions grep() regexpr() sub() andgsub() now coerce their arguments to char-acter rather than give an error
The perl=TRUE argument now uses charactertables prepared for the locale currently in useeach time it is used rather than those of the Clocale
bull New functions head() and tail() in packagelsquoutilsrsquo (Based on a contribution by PatrickBurns)
bull legend() has a new argument rsquotextcolrsquo
bull methods(class=) now checks for a matchinggeneric and so no longer returns methods fornon-visible generics (and eliminates variousmismatches)
bull A new function mget() will retrieve multiplevalues from an environment
bull modelframe() methods for example those forlm and glm pass relevant parts of ontothe default method (This has long been docu-mented but not done) The default method isnow able to cope with model classes such aslqs and ppr
bull nls() and ppr() have a model argument to al-low the model frame to be returned as part ofthe fitted object
bull POSIXct objects can now have a tzone at-tribute that determines how they will be con-verted and printed This means that date-timeobjects which have a timezone specified willgenerally be regarded as in their original timezone
bull postscript() device output has been modi-fied to work around rounding errors in low-precision calculations in gs gt= 811 (PR5285which is not a bug in R)
It is now documented how to use other Com-puter Modern fonts for example italic ratherthan slanted
bull ppr() now fully supports categorical explana-tory variables
ppr() is now interruptible at suitable places inthe underlying FORTRAN code
bull princomp() now warns if both x and covmatare supplied and returns scores only if the cen-tring used is known
bull psigamma(x deriv=0) a new function gen-eralizes digamma() etc All these (psigammadigamma trigamma) now also work for x lt0
bull pchisq( ncp gt 0) and hence qchisq() nowwork with much higher values of ncp it has be-come much more accurate in the left tail
bull readtable() now allows embedded newlinesin quoted fields (PR4555)
bull repdefault(0-length-vector lengthout=n)now gives a vector of length n and not length0 for compatibility with S
If both each and lengthout have been speci-fied it now recycles rather than fills with NAsfor S compatibility
If both times and lengthout have been spec-ified times is now ignored for S compatibility(Previously padding with NAs was used)
The POSIXct and POSIXlt methods forrep() now pass on to the default method(as expected by PR5818)
bull rgb2hsv() is new an R interface the C APIfunction with the same name
bull User hooks can be set for onLoadlibrary detach and onUnload of pack-agesnamespaces see setHook
bull save() default arguments can now beset using option savedefaults whichis also used by saveimage() if optionsaveimagedefaults is not present
bull New function shQuote() to quote strings to bepassed to OS shells
• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc), strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split= is optimized.
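The new strsplit() arguments can be sketched as follows (not from the original text; behaviour as in current R):

```r
# fixed = TRUE treats the split string literally instead of as a regexp:
strsplit("a.b.c", ".", fixed = TRUE)[[1]]       # "a" "b" "c"

# without fixed, "." is a regexp matching any character, so escape it:
strsplit("a.b.c", "\\.")[[1]]                   # "a" "b" "c"

# perl = TRUE enables Perl-compatible regexps:
strsplit("a1b22c", "[0-9]+", perl = TRUE)[[1]]  # "a" "b" "c"
```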
• subset() now allows a drop argument which is passed on to the indexing method for data frames.
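A short sketch of the drop argument (the data frame here is a made-up example):

```r
df <- data.frame(x = 1:5, y = letters[1:5])

# By default a one-column result is still a data frame:
subset(df, x > 3, select = y)

# drop = TRUE is passed to the data-frame indexing method,
# so the single selected column comes back as a plain vector:
subset(df, x > 3, select = y, drop = TRUE)
```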
• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.
R News ISSN 1609-3631
Vol. 4/1, June 2004
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts' defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the '...' of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.
• Added as.list.environment to coerce environments to lists (efficiently).
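A one-line illustration of the new coercion (example environment made up here):

```r
# Coerce an environment's bindings to a named list:
e <- new.env()
assign("a", 1, envir = e)
assign("b", 2, envir = e)
sort(names(as.list(e)))   # "a" "b"
```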
• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)
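A small sketch of addmargins() on a made-up 2x2 table; by default the margin is a sum labelled "Sum":

```r
# Add row and column totals to a contingency table:
tab <- as.table(matrix(1:4, nrow = 2,
                       dimnames = list(g = c("g1", "g2"),
                                       h = c("h1", "h2"))))
addmargins(tab)   # extra "Sum" row and column
```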
• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.
• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport(), and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.

– Added vpList(), vpStack(), vpTree(), and current.vpTree() to allow creation of viewport bundles that may be pushed at once (lists are pushed in parallel; stacks in series). current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use. NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator :: in later versions).

– Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).
– Allowed the vp slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

    pushViewport(viewport(w=.5, h=.5, name="A"))
    grid.rect()
    pushViewport(viewport(w=.5, h=.5, name="B"))
    grid.rect(gp=gpar(col="grey"))
    upViewport(2)
    grid.rect(vp="A", gp=gpar(fill="red"))
    grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))
– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references; they are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

* all grobs now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

* the grobwidth and grobheight units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob "pointed to" is modified.

In addition, the following features are now possible with grobs:

* grobs now save()/load() like any normal R object.

* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.
– Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth", and "grobheight" unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for the viewport clip arg are now:

    "on" = clip to the viewport (was TRUE)
    "inherit" = clip to whatever parent says (was FALSE)
    "off" = turn off clipping

Still accept logical values (and NA maps to "off").
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp(), and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(x, deriv=2) and psigamma(x, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5, and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blas=goto to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN
by Kurt Hornik
New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.
BradleyTerry Specify and fit the Bradley-Terrymodel and structured versions By David Firth
BsMD Bayes screening and model discrimination
follow-up designs By Ernesto Barrios based onDaniel Meyerrsquos code
DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.
Icens Many functions for computing the NPMLE forcensored and truncated data By R Gentlemanand Alain Vandal
MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.
RScaLAPACK An interface to ScaLAPACK func-tions from R By Nagiza F Samatova SrikanthYoginath and David Bauer
RUnit R functions implementing a standard UnitTesting framework with additional code in-spection and report generation tools ByMatthias Burger Klaus Juenemann ThomasKoenig
SIN This package provides routines to perform SINmodel selection as described in Drton amp Perl-man (2004) The selected models are in theformat of the ggm package which allows inparticular parameter estimation in the selectedmodel By Mathias Drton
SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: "The use of the language interface of R: two examples for modelling water flux and solute transport". By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.
crossdes Contains functions for the constructionand randomization of balanced carryover bal-anced designs Contains functions to check
given designs for balance Also contains func-tions for simulation studies on the validityof two randomization procedures By MartinOliver Sailer
debug Debugger for R functions with code displaygraceful error recovery line-numbered condi-tional breakpoints access to exit code flowcontrol and full keyboard input By Mark VBravington
distr Object orientated implementation of distribu-tions and some additional functionality ByFlorian Camphausen Matthias Kohl PeterRuckdeschel Thomas Stabla
dynamicGraph Interactive graphical tool for ma-nipulating graphs By Jens Henrik Badsberg
energy E-statistics (energy) tests for comparing dis-tributions multivariate normality Poisson testmultivariate k-sample test for equal distribu-tions hierarchical clustering by e-distancesEnergy-statistics concept based on a general-ization of Newtonrsquos potential energy is due toGabor J Szekely By Maria L Rizzo and GaborJ Szekely
evdbayes Provides functions for the bayesian anal-ysis of extreme value models using MCMCmethods By Alec Stephenson
evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. S original (EVIS) by Alexander McNeil; R port by Alec Stephenson.

fBasics fBasics package from Rmetrics — Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.
fortunes R Fortunes By Achim Zeileis fortune con-tributions from Torsten Hothorn Peter Dal-gaard Uwe Ligges Kevin Wright
gap This is an integrated package for genetic data analysis of both population and family data. Currently it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.
gclus Orders panels in scatterplot matrices and par-allel coordinate displays by some merit indexPackage contains various indices of merit or-dering functions and enhanced versions ofpairs and parcoord which color panels accord-ing to their merit level By Catherine Hurley
gpls Classification using generalized partial leastsquares for two-group and multi-group (morethan 2 group) classification By Beiying Ding
hapassoc The following R functions are used forlikelihood inference of trait associations withhaplotypes and other covariates in generalizedlinear models The functions accommodate un-certain haplotype phase and can handle miss-ing genotypes at some SNPs By K Burkett BMcNeney J Graham
haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.
hett Functions for the fitting and summarizing ofheteroscedastic t-regression By Julian Taylor
httpRequest HTTP Request protocols Implementsthe GET POST and multipart POST request ByEryk Witold Wolski
impute Imputation for microarray data (currentlyKNN only) By Trevor Hastie Robert Tib-shirani Balasubramanian Narasimhan GilbertChu
kernlab Kernel-based machine learning methods in-cluding support vector machines By Alexan-dros Karatzoglou Alex Smola Achim ZeileisKurt Hornik
klaR Miscellaneous functions for classification andvisualization developed at the Department ofStatistics University of Dortmund By Chris-tian Roever Nils Raabe Karsten Luebke UweLigges
knncat This program scales categorical variables insuch a way as to make NN classification as ac-curate as possible It also handles continuousvariables and prior probabilities and does in-telligent variable selection and estimation of er-ror rates and the right number of NNrsquos By SamButtrey
kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.
mAr Estimation of multivariate AR models througha computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider2001) the procedure is of particular interest forhigh-dimensional data without missing valuessuch as geophysical fields By S M Barbosa
mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between- versus within-chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal and external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.
multinomRob overdispersed multinomial regres-sion using robust (LQD and tanh) estimationBy Walter R Mebane Jr Jasjeet Singh Sekhon
mvbutils Utilities for project organization editingand backup sourcing documentation (formaland informal) package preparation macrofunctions and miscellaneous utilities Neededby debug package By Mark V Bravington
mvpart Multivariate regression trees Original rpartby Terry M Therneau and Beth Atkinson Rport by Brian Ripley Some routines from ve-gan by Jari Oksanen Extensions and adapta-tions of rpart to mvpart by Glenn Dersquoath
pgam This work is aimed at extending a class ofstate space models for Poisson count dataso called Poisson-Gamma models towards asemiparametric specification Just like the gen-eralized additive models (GAM) cubic splinesare used for covariate smoothing The semi-parametric models are fitted by an iterativeprocess that combines maximization of likeli-hood and backfitting algorithm By Washing-ton Junger
qcc Shewhart quality control charts for continu-ous attribute and count data Cusum andEWMA charts Operating characteristic curvesProcess capability analysis Pareto chart andcause-and-effect chart By Luca Scrucca
race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data.frames. By Jens Oehlschlägel.
regress Functions to fit Gaussian linear model bymaximizing the residual log likelihood Thecovariance structure can be written as a lin-ear combination of known matrices Can beused for multivariate models and random ef-fects models Easy straight forward manner tospecify random effects models including ran-dom interactions By David Clifford Peter Mc-Cullagh
rgl 3D visualization device (OpenGL) By DanielAdler
rlecuyer Provides an interface to the C implemen-tation of the random number generator withmultiple independent streams developed byLrsquoEcuyer et al (2002) The main purpose ofthis package is to enable the use of this ran-dom number generator in parallel R applica-tions By Hana Sevcikova Tony Rossini
rpartpermutation Performs permutation tests ofrpart models By Daniel S Myers
rrcov Functions for Robust Location and Scat-ter Estimation and Robust Regression withHigh Breakdown Point (covMcd() ltsReg())Originally written for S-PLUS (fastlts andfastmcd) by Peter Rousseeuw amp Katrien vanDriessen ported to R adapted and packagedby Valentin Todorov
sandwich Model-robust standard error estimatorsfor time series and longitudinal data ByThomas Lumley Achim Zeileis
seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.
setRNG Set reproducible random number generatorin R and S By Paul Gilbert
sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.
spatialCovariance Functions that compute the spa-tial covariance matrix for the matern andpower classes of spatial models for data thatarise on rectangular units This code can alsobe used for the change of support problem andfor spatial data that arise on irregularly shapedregions like counties or zipcodes by laying afine grid of rectangles and aggregating the in-tegrals in a form of Riemann integration ByDavid Clifford
spc Evaluation of control charts by means of thezero-state steady-state ARL (Average RunLength) Setting up control charts for given in-control ARL and plotting of the related figuresThe control charts under consideration are one-and two-sided EWMA and CUSUM charts formonitoring the mean of normally distributedindependent data Other charts and parame-ters are in preparation Further SPC areas willbe covered as well (sampling plans capabilityindices ) By Sven Knoth
spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS 2002, 99, pg. 15869, and J. Comput. Chem. 2003, 24, pg. 1215. By Rajarshi Guha.
supclust Methodology for Supervised Grouping ofPredictor Variables By Marcel Dettling andMartin Maechler
svmpath Computes the entire regularization pathfor the two-class SVM classifier with essentialythe same cost as a single SVM fit By TrevorHastie
treeglia Stem analysis functions for volume incre-ment and carbon uptake assessment from tree-rings By Marco Bascietto
urca Unit root and cointegration tests encounteredin applied econometric analysis are imple-mented By Bernhard Pfaff
urn Functions for sampling without replacement(Simulated Urns) By Micah Altman
verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns; ported to R by Nick Efthymiou.
zoo A class with methods for totally ordered in-dexed observations such as irregular time se-ries By Achim Zeileis Gabor Grothendieck
Other changes
bull Packages mclust1998 netCDF and serializewere moved from the main CRAN section tothe Archive
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News
by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)
New supporting institutions
• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232
USA
Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org
R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).
R Project Homepage:
http://www.R-project.org/
This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
[Figure 5: EWMA chart for diameter[1:25,] (calibration data) and diameter[26:40,] (new data). Group summary statistics (vertical axis, approximately 73.99 to 74.02) plotted against group number 1-40. Number of groups = 40, Target = 74.00118, StdDev = 0.009887547, smoothing parameter = 0.2, control limits at 3*sigma.]
Process capability analysis
Process capability indices for a characteristic of interest from a continuous process can be obtained through the function process.capability. This takes as mandatory arguments an object of class 'qcc' for an "xbar" or "xbar.one" type chart, and spec.limits, a vector specifying the lower (LSL) and upper (USL) specification limits. For example, using the previously created 'qcc' object, we may simply use:
> process.capability(obj,
+    spec.limits=c(73.95,74.05))

Process Capability Analysis

Call:
process.capability(object = obj,
    spec.limits = c(73.95, 74.05))

Number of obs = 125    Target = 74
Center = 74.00118      LSL = 73.95
StdDev = 0.009887547   USL = 74.05

Capability indices:
       Value   2.5%  97.5%
Cp     1.686  1.476  1.895
Cp_l   1.725  1.539  1.912
Cp_u   1.646  1.467  1.825
Cp_k   1.646  1.433  1.859
Cpm    1.674  1.465  1.882

Exp<LSL 0%   Obs<LSL 0%
Exp>USL 0%   Obs>USL 0%
The graph shown in Figure 6 is a histogram of data values, with vertical dotted lines for the upper and the lower specification limits. The target is shown as a vertical dashed line and its value can be provided using the target argument in the call; if missing, the value from the 'qcc' object is used if available, otherwise the target is set at the middle value between the specification limits.
The dashed curve is a normal density for a Gaussian distribution matching the observed mean and standard deviation. The latter is estimated using the within-group standard deviation, unless a value is specified in the call using the std.dev argument.
The capability indices reported on the graph are also printed at the console, together with confidence intervals computed at the level specified by the confidence.level argument (by default set at 0.95).
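The point estimates printed above follow the standard process-capability definitions and can be checked by hand from the summary statistics (a plain-R sketch, not using qcc itself; Cpm is omitted since it additionally involves the target):

```r
## Standard capability indices computed directly from the summary
## statistics printed for the diameter data (not using qcc).
LSL <- 73.95; USL <- 74.05          # specification limits
center <- 74.00118                  # process center
std.dev <- 0.009887547              # within-group std. deviation

Cp   <- (USL - LSL) / (6 * std.dev)       # overall spread vs. tolerance
Cp.l <- (center - LSL) / (3 * std.dev)    # lower one-sided index
Cp.u <- (USL - center) / (3 * std.dev)    # upper one-sided index
Cp.k <- min(Cp.l, Cp.u)                   # worst one-sided index

round(c(Cp = Cp, Cp_l = Cp.l, Cp_u = Cp.u, Cp_k = Cp.k), 3)
##    Cp  Cp_l  Cp_u  Cp_k
## 1.686 1.725 1.646 1.646
```

These agree with the values reported by process.capability; the confidence intervals require the sampling distributions of the indices, which qcc computes internally.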
[Figure 6: Process capability analysis for diameter[1:25, ]. A histogram of the data with LSL = 73.95, USL = 74.05 and Target = 74 marked; number of obs = 125; center = 74.00118; StdDev = 0.009887547; Cp = 1.69; Cp_l = 1.73; Cp_u = 1.65; Cp_k = 1.65; Cpm = 1.67; Exp<LSL 0%; Exp>USL 0%; Obs<LSL 0%; Obs>USL 0%.]
A further display, called "process capability sixpack plots", is also available. This is a graphical summary formed by plotting on the same graphical device:
- an X chart
- an R or S chart (if sample sizes > 10)
- a run chart
- a histogram with specification limits
- a normal Q-Q plot
- a capability plot
As an example we may input the following code
> process.capability.sixpack(obj,
+    spec.limits=c(73.95,74.05),
+    target=74.01)
Pareto charts and cause-and-effect diagrams
A Pareto chart is a barplot where the categories are ordered using frequencies in non-increasing order, with a line added to show their cumulative sum. Pareto charts are useful to identify those factors that have the greatest cumulative effect on the system,
and thus screen out the less significant factors in an analysis.
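The ordering and cumulative-percentage step that pareto.chart automates can be sketched in a few lines of plain R, using the defect counts from the example that follows (a hypothetical re-implementation of the sorting step only, not of the chart itself):

```r
## Pareto ordering by hand: sort categories by decreasing frequency
## and accumulate their percentage contribution to the total.
defect <- c("price code" = 80, "schedule date" = 27,
            "supplier code" = 66, "contact num." = 94,
            "part num." = 33)
ord <- sort(defect, decreasing = TRUE)     # largest category first
cum.pct <- 100 * cumsum(ord) / sum(ord)    # cumulative percentage line
round(cum.pct, 1)
## contact num. (31.3), price code (58.0), supplier code (80.0),
## part num. (91.0), schedule date (100.0)
```

The first two categories alone account for over half of all defects, which is exactly the kind of screening the chart is designed to make visible.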
A simple Pareto chart may be drawn using the function pareto.chart, which in its simpler form requires a vector of values. For example:
> defect <- c(80, 27, 66, 94, 33)
> names(defect) <-
+   c("price code", "schedule date",
+     "supplier code", "contact num.",
+     "part num.")
> pareto.chart(defect)
The function returns a table of descriptive statistics and the Pareto chart shown in Figure 7. More arguments can be provided to control axes labels and limits, the main title and bar colors (see help(pareto.chart)).
[Figure 7: Pareto chart for defect. Bars in decreasing order of frequency (contact num., price code, supplier code, part num., schedule date), with the cumulative percentage line on a secondary axis.]
A cause-and-effect diagram, also called an Ishikawa diagram (or fish bone diagram), is used to associate multiple possible causes with a single effect. Thus, given a particular effect, the diagram is constructed to identify and organize possible causes for it.
A very basic implementation of this type of diagram is available in the qcc package. The function cause.and.effect requires at least two arguments: cause, a list of causes and branches providing descriptive labels, and effect, a string describing the effect. Other arguments are available, such as cex and font, which control, respectively, the character expansions and the fonts used for labeling branches, causes and the effect. The following example creates the diagram shown in Figure 8.
> cause.and.effect(
+   cause=list(Measurements=c("Microscopes",
+                             "Inspectors"),
+              Materials=c("Alloys",
+                          "Suppliers"),
+              Personnel=c("Supervisors",
+                          "Operators"),
+              Environment=c("Condensation",
+                            "Moisture"),
+              Methods=c("Brake", "Engager",
+                        "Angle"),
+              Machines=c("Speed", "Bits",
+                         "Sockets")),
+   effect="Surface Flaws")
[Figure 8: Cause-and-effect diagram for the effect "Surface Flaws", with branches Measurements (Microscopes, Inspectors), Materials (Alloys, Suppliers), Personnel (Supervisors, Operators), Environment (Condensation, Moisture), Methods (Brake, Engager, Angle) and Machines (Speed, Bits, Sockets).]
Tuning and extending the package
Options to control some aspects of the qcc package can be set using the function qcc.options. If this is called with no arguments it retrieves the list of available options and their current values; otherwise, it sets the required option. From a user point of view, the most interesting options allow the user to set: the plotting character and the color used to highlight points beyond control limits or violating runs; the run length, i.e. the maximum length of a run before signalling a point as out of control; the background color used to draw the figure and the margin color; and the character expansion used to draw the labels, the title and the statistics at the bottom of any plot. For details see help(qcc.options).
Another interesting feature is the possibility of easily extending the package by defining new control charts. Suppose we want to plot a p chart based on samples with different sizes. The resulting control chart will have non-constant control limits and thus may be difficult for some operators to read. One possible solution is to draw a standardized control chart. This can be accomplished by defining three new functions: one which computes and returns the group statistics to be charted (i.e. the z scores) and their center (zero in this case), another one which returns the within-group standard deviation (one in this case), and finally a function which returns the control limits. These functions need to be
called stats.type, sd.type and limits.type, respectively, for a new chart of type "type". The following code may be used to define a standardized p chart:
stats.p.std <- function(data, sizes)
{
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data)/sum(sizes)
  z <- (data/sizes - pbar)/sqrt(pbar*(1-pbar)/sizes)
  list(statistics = z, center = 0)
}

sd.p.std <- function(data, sizes) return(1)

limits.p.std <- function(center, std.dev, sizes, conf)
{
  if (conf >= 1) {
    lcl <- -conf
    ucl <- +conf
  }
  else {
    if (conf > 0 & conf < 1) {
      nsigmas <- qnorm(1 - (1 - conf)/2)
      lcl <- -nsigmas
      ucl <- +nsigmas
    }
    else stop("invalid conf argument")
  }
  limits <- matrix(c(lcl, ucl), ncol = 2)
  rownames(limits) <- rep("", length = nrow(limits))
  colnames(limits) <- c("LCL", "UCL")
  return(limits)
}
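A quick sanity check of the standardization, independent of qcc: if every sample sits exactly at the pooled proportion, the z scores should all be zero, i.e. the process looks perfectly "in control" on the standardized scale (a minimal sketch using made-up sample sizes):

```r
## Check the z-score standardization used for the standardized p chart:
## z = (p_i - pbar) / sqrt(pbar * (1 - pbar) / n_i)
sizes <- c(50, 100, 25)
data  <- 0.2 * sizes                    # every sample exactly at pbar = 0.2
pbar  <- sum(data) / sum(sizes)         # pooled proportion (0.2)
z     <- (data/sizes - pbar) / sqrt(pbar * (1 - pbar) / sizes)
all(abs(z) < 1e-12)                     # TRUE: all z scores are zero
```

Samples with proportions above (below) pbar get positive (negative) scores, scaled by their own sample size, which is what makes the constant limits at +/- 3 appropriate.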
Then, we may source the above code and obtain the control charts in Figure 9 as follows:
# set unequal sample sizes
> n <- c(rep(50,5), rep(100,5), rep(25,5))
# generate randomly the number of successes
> x <- rbinom(length(n), n, 0.2)
> par(mfrow=c(1,2))
# plot the control chart with variable limits
> qcc(x, type="p", sizes=n)
# plot the standardized control chart
> qcc(x, type="p.std", sizes=n)
[Figure 9: A p chart with non-constant control limits (left panel; number of groups = 15, center = 0.1931429, StdDev = 0.3947641, variable LCL and UCL) and the corresponding standardized control chart (right panel; center = 0, StdDev = 1, LCL = -3, UCL = 3). In both charts, number beyond limits = 0 and number violating runs = 0.]
Summary
In this paper we briefly describe the qcc package. This provides functions to draw basic Shewhart quality control charts for continuous, attribute and count data; corresponding operating characteristic curves are also implemented. Other statistical quality tools available are Cusum and EWMA charts for continuous data, process capability analyses, Pareto charts and cause-and-effect diagrams.
References
Montgomery, D.C. (2000) Introduction to Statistical Quality Control, 4th ed. New York: John Wiley & Sons.

Wetherill, G.B. and Brown, D.W. (1991) Statistical Process Control. New York: Chapman & Hall.
Luca Scrucca
Dipartimento di Scienze Statistiche, Università degli Studi di Perugia, Italy
luca@stat.unipg.it
Least Squares Calculations in R
Timing different approaches
by Douglas Bates
Introduction
The S language encourages users to become programmers as they express new ideas and techniques of data analysis in S. Frequently the calculations in these techniques are expressed in terms of matrix operations. The subject of numerical linear algebra - how calculations with matrices can be carried out accurately and efficiently - is quite different from what many of us learn in linear algebra courses or in courses on linear statistical methods.
Numerical linear algebra is primarily based on decompositions of matrices, such as the LU decomposition, the QR decomposition and the Cholesky decomposition, that are rarely discussed in linear algebra or statistics courses.
In this article we discuss one of the most common operations in statistical computing, calculating least squares estimates. We can write the problem mathematically as

    β̂ = arg min_β ||y − Xβ||²    (1)

where X is an n × p model matrix (p ≤ n), y is n-dimensional and β is p-dimensional. Most statistics texts state that the solution to (1) is

    β̂ = (X′X)⁻¹ X′y    (2)
when X has full column rank (i.e. the columns of X are linearly independent).
Indeed, (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way. That is, they use code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.
General principles
A few general principles of numerical linear algebraare
1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.
2. Use crossprod(X), not t(X) %*% X, to calculate X′X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X′y.
3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X′X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
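The three principles can be illustrated on a small simulated problem (a self-contained sketch; the large sparse example of Koenker and Ng (2003) is used for the timings below):

```r
## Three ways to compute least squares estimates on simulated data.
set.seed(1)
X <- matrix(rnorm(100 * 5), 100, 5)
y <- rnorm(100)

b.naive <- solve(t(X) %*% X) %*% t(X) %*% y       # literal use of (2)
b.solve <- solve(crossprod(X), crossprod(X, y))   # principles 1 and 2
R <- chol(crossprod(X))                           # principle 3: X'X = R'R
b.chol  <- backsolve(R, forwardsolve(R, crossprod(X, y),
                                     upper = TRUE, trans = TRUE))

all.equal(drop(b.naive), drop(b.solve))  # TRUE
all.equal(drop(b.naive), drop(b.chol))   # TRUE
```

All three agree numerically here; the differences show up in speed and in stability on large or ill-conditioned problems.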
Some timings on a large example
For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well:
> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850  712
> sys.gc.time(naive.sol <- solve(t(mmm) %*%
+     mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00
(The function sys.gc.time is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)
According to the principles above it should bemore effective to compute
> sys.gc.time(cpod.sol <- solve(crossprod(mmm),
+     crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE
Timing results will vary between computers but in most cases the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X′X the naive way
> sys.gc.time(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00
because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.
However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result, the solution of (1) is performed using a general linear system solver based on an LU decomposition, when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly but the code is awkward:
> sys.gc.time(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sys.gc.time(chol.sol <- backsolve(ch,
+     forwardsolve(ch, crossprod(mmm, y),
+                  upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE
Least squares calculations with Matrix classes
The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition.
> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(Mat.sol <- solve(crossprod(mmg),
+     crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE
Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object so that it can be reused without recalculation.
> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all:
> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(sparse.sol <- solve(crossprod(mm),
+     crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE
The model matrix mm has class "cscMatrix", indicating that it is a compressed, sparse, column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric, sparse, column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful, the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case, the decomposition is so fast that it is difficult to determine the difference in the solution times.
> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
Epilogue
Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS), such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or Atlas (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.
Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.
Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A′B are generally faster than those in AB.
Notice that this means that calculating A′B in R as t(A) %*% B instead of crossprod(A, B) causes A to be transposed twice: once in the calculation of t(A) and a second time in the inner loop of the matrix
product. The crossprod function does not do any transposition of matrices - yet another reason to use crossprod.
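The equivalence of the two forms (without the explicit transposition) is easy to verify on a tiny example:

```r
## crossprod(A, B) computes t(A) %*% B without ever forming t(A).
set.seed(2)
A <- matrix(rnorm(6), 2, 3)
B <- matrix(rnorm(4), 2, 2)
all.equal(crossprod(A, B), t(A) %*% B)  # TRUE: same 3 x 2 result
```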
Bibliography
J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).
D.M. Bates
U. of Wisconsin-Madison
bates@wisc.edu
Tools for interactively exploring R packages
by Jianhua Zhang and Robert Gentleman
Introduction
An R package has the required, and perhaps additional, subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations, both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer and they are intended for exploring packages, examples and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.
Description
Our current implementation is in the form of tcltk widgets, and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcltk. Once both tkWidgets and widgetTools have been loaded into the working session, the user has full access to these interactive tools.
pExplorer
pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from being shown by the widget. The default behavior is to open the first package in the library of locally installed R (.libPaths()), with subdirectories/files named "Meta" and "latex" excluded, if nothing is provided by a user.
Figure 1 shows the widget invoked by typing pExplorer("base").
The following tasks can be performed through the interface:
- Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

- Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

- Clicking the Browse button allows users to select a directory containing R packages, to achieve the same results as typing in a new path.

- Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.
Figure 1: A snapshot of the widget when pExplorer is called
- Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just being clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

- Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

- Clicking the Try examples button invokes eExplorer, to be discussed next.
eExplorer
eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.
Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.
vExplorer
vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionalities provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of the vignettes of the package.
Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget, as shown by Figure 4, appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:
- Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

- Clicking the View PDF button shows the pdf version of the vignette currently being explored.

- Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in the Results of Execution window.

- Clicking the Export to R button exports the result of execution to the global environment of the current R session.
The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.
References
Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28-31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21-24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.
Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base")
Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer()
Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked
The survival package
by Thomas Lumley
This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).
Overview
Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.
In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.
Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).
Specifying survival data
In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.
In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.
> library(survival)
> data(veteran)
> ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+
...
> ## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
+                    diagtime*30+time, status))
 [1] ( 210, 282 ] ( 150, 561 ]
 [3] (  90, 318 ] ( 270, 396 ]
...
 [9] ( 540, 854 ] ( 180, 280+]
The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.
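The printed forms are easy to reproduce on a toy example (assuming the survival package is available, as it is in a standard R installation; the three times here are made up for illustration):

```r
## A right-censored toy example: the third subject is still alive
## at day 100, so its entry prints with a trailing "+".
library(survival)
s <- Surv(time = c(72, 411, 100), event = c(1, 1, 0))
m <- as.matrix(s)      # underlying matrix with "time" and "status" columns
m[, "status"]          # 1 1 0: the third observation is censored
```

Internally a Surv object is just such a two-column (or, with a start time, three-column) matrix plus an attribute recording the censoring type.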
One and two sample summaries
The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.
Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.
> data(veteran)
> plot(survfit(Surv(time, status)~trt, data=veteran),
+      xlab="Years since randomisation",
+      xscale=365, ylab="% surviving", yscale=100,
+      col=c("forestgreen","blue"))
> survdiff(Surv(time, status)~trt, data=veteran)

Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

       N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394

      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928
[Figure 1: Survival distributions for two lung cancer treatments (Standard and New). x-axis: years since randomisation (0-2.5); y-axis: % surviving (0-100).]
Proportional hazards models
The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.
Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

    log h(t; z) = log h0(t) + βz.

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.
A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.
> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
+                    log(bili) + log(protime) +
+                    age + platelet,
+                    data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
      log(bili) + log(protime) + age + platelet,
      data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999

              se(coef)     z       p
edtrt         0.300321  3.43 0.00061
log(bili)     0.097771  9.73 0.00000
log(protime)  1.031908  2.80 0.00520
age           0.008489  4.18 0.00003
platelet      0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312
The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.
> plot(survfit(Surv(time, status) ~ edtrt,
+              data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt +
+       ratetable(edtrt = edtrt, bili = bili,
+                 platelet = platelet, age = age,
+                 protime = protime),
+       data = pbc, subset = trt == -9,
+       ratetable = mayomodel, cohort = TRUE),
+       col = "purple")
The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.
Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.
[Figure 2: Observed and predicted survival. Follow-up time 0-4000 days on the x-axis; survival probability 0.0-1.0 on the y-axis.]
The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The
cox.zph function provides formal tests and graphical diagnostics for proportional hazards:
> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796
> ## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])
[Figure 3: Testing proportional hazards - smoothed estimate of Beta(t) for edtrt against time, with pointwise confidence bands.]
There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.
Additional features
Computation for the Cox model extends straightforwardly to situations where z can vary over time, and when an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.
Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.
The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of followup.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
useR! 2004: The R User Conference
by John Fox, McMaster University, Canada
More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.
The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.
A particularly interesting component of the program was the set of keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:
• Paul Murrell on grid graphics and programming;
• Martin Mächler on good programming practice;
• Luke Tierney on namespaces and byte compilation;
• Kurt Hornik on packaging, documentation, and testing;
• Friedrich Leisch on S4 classes and methods;
• Peter Dalgaard on language interfaces, using .Call and .External;
• Douglas Bates on multilevel models in R;
• Brian D. Ripley on data mining and related topics.
Slides for the keynote addresses, along with abstracts of the other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.
An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.
The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19, and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.
Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.
R Help Desk
Date and Time Classes in R
Gabor Grothendieck and Thomas Petzoldt
Introduction
R 1.9.0 and its contributed packages have a number of datetime (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:
• Date. This is the newest R date class, just introduced into R 1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated, but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Date. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).
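The internal day count is easy to verify (a minimal sketch; the specific date is arbitrary):

```r
d <- as.Date("1970-01-11")
unclass(d)              # 10: days since January 1, 1970
d + 31                  # day arithmetic works directly
format(d, "%d/%m/%Y")   # percent codes as documented in ?strptime
```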
• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).
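For example, the day-plus-fraction representation can be inspected directly (a sketch assuming the CRAN package chron is installed):

```r
library(chron)      # contributed package; install from CRAN first
x <- chron(1.5)     # 1.5 days after the chron origin
x                   # noon on January 2nd, 1970
as.numeric(x)       # the internal representation: 1.5
```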
• POSIX classes. POSIX classes refer to the two classes POSIXct and POSIXlt and their common super class POSIXt. These support times and dates, including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases, and other applications that involve external communication with systems that also support dates, times, time zones, and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970, GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of
the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIX classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
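For instance (the times here are chosen purely for illustration):

```r
dp <- as.POSIXct("1970-01-02 00:00:00", tz = "GMT")
unclass(dp)                      # 86400: seconds since 1970-01-01 GMT
names(unclass(as.POSIXlt(dp)))   # sec, min, hour, mday, mon, year, ...
```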
When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.
Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.
Other Applications
Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word "usually" was employed above.
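A sketch of the Windows-origin conversion (the serial numbers here are made up):

```r
x <- c(25569.25, 25570.75)        # hypothetical Excel serial datetimes
as.Date("1899-12-30") + floor(x)  # "1970-01-01" "1970-01-02"
```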
SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).
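For data read without those helpers, the SAS convention can also be applied manually (a sketch; the value is made up and the origin is taken as GMT):

```r
x <- 315619200   # hypothetical SAS datetime: seconds since 1960-01-01
as.POSIXct(x, origin = "1960-01-01", tz = "GMT")
```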
Time Series
A common use of dates and datetimes is in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly, or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its), and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or time/date package to represent datetimes.
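A small sketch of an irregular series (assuming the CRAN package zoo is installed; the dates and values are arbitrary):

```r
library(zoo)   # contributed package; install from CRAN first
z <- zoo(c(2.1, 1.8, 2.5),
         as.Date(c("2004-01-05", "2004-01-12", "2004-01-19")))
index(z)       # the Date index
coredata(z)    # the data values
```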
Input
By default, read.table will read in numeric data such as 20040115 as numbers and will read non-numeric data such as 12/15/04 or 2004-12-15 as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹
# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")
# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)
Avoiding Errors
The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.
With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:
# default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))
# if TRUE, abbreviates year to 2 digits
options(chron.year.abb = TRUE)
# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)
# if TRUE then year not displayed
options(chron.simplify = FALSE)
For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example:
¹ This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.
orig <- chron("01/01/60")
x <- 0:9   # days since 01/01/60
# chron dates with default origin
orig + x
Regarding POSIX classes, the user should be aware of the following:
• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year, and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:
dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)
This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.
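For example, one instant rendered in two zones (a sketch; whether the "America/New_York" name is accepted is OS-dependent, as discussed under OS below):

```r
dp <- as.POSIXct("2004-06-15 12:00:00", tz = "GMT")
format(dp, tz = "GMT")               # "2004-06-15 12:00:00"
format(dp, tz = "America/New_York")  # same instant, different wall clock
```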
• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.
• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst = 1) or not (isdst = 0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.
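The character route can be sketched as follows (the dates are arbitrary; GMT sidesteps daylight savings issues entirely):

```r
dp <- as.POSIXct("2004-01-15 10:30:00", tz = "GMT")
lt <- as.POSIXlt(format(dp, tz = "GMT"), tz = "GMT")  # via character
lt$isdst   # 0: standard time, as expected for GMT
lt$hour    # 10
```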
• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
Comparison Table
Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.
Bibliography
Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2):8-11. URL http://CRAN.R-project.org/doc/Rnews/.
Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de
Table 1: Comparison table.

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for the % codes.

now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz = "GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class = "Date")
  chron:   chron(0)
  POSIXct: structure(0, class = c("POSIXt", "POSIXct"))

x days since origin (GMT days)
  Date:    structure(floor(x), class = "Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class = c("POSIXt", "POSIXct"))

diff in days
  Date:    dt1 - dt2
  chron:   dc1 - dc2
  POSIXct: difftime(dp1, dp2, unit = "day") [1]

time zone difference
  POSIXct: dp - as.POSIXct(format(dp, tz = "GMT"))

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt + 1
  chron:   dc + 1
  POSIXct: seq(dp0, length = 2, by = "DSTday")[2] [2]

previous day
  Date:    dt - 1
  chron:   dc - 1
  POSIXct: seq(dp0, length = 2, by = "-1 DSTday")[2] [3]

x days since date
  Date:    dt0 + floor(x)
  chron:   dc0 + x
  POSIXct: seq(dp0, length = 2, by = paste(x, "DSTday"))[2] [4]

sequence
  Date:    dts <- seq(dt0, length = 10, by = "day")
  chron:   dcs <- seq(dc0, length = 10)
  POSIXct: dps <- seq(dp0, length = 10, by = "DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz = "GMT"), rnorm(10)) [5]

every 2nd week
  Date:    seq(dt0, length = 3, by = "2 week")
  chron:   dc0 + seq(0, length = 3, by = 14)
  POSIXct: seq(dp0, length = 3, by = "2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc) - month.day.year(dc)$day + 1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun = 0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc) - 3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output: yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output: mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output: Sat Jul 1 1970
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input: "1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format = "y-m-d")
  POSIXct: as.POSIXct(z)

Input: "10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz = "")
  to Date:    as.Date(dates(dc));  as.Date(format(dp))
  to chron:   chron(unclass(dt));
              chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]
  to POSIXct: as.POSIXct(format(dt));
              as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz = "GMT")
  to Date:    as.Date(dates(dc));  as.Date(dp)
  to chron:   chron(unclass(dt));  as.chron(dp)
  to POSIXct: as.POSIXct(format(dt), tz = "GMT");  as.POSIXct(dc)

[1] Can be reduced to dp1 - dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp + 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp - 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp + 24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz = "GMT")).
Programmers' Niche
A simple class in S3 and S4
by Thomas Lumley
ROC curves
Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c|D = 1), against the false positive rate (1 - specificity), Pr(T > c|D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).
I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.
Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±∞ and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.
drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type = "l")
}
It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T == TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.
There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.
drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  plot(mspec, sens, type = "l")
}
One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
Classes and methods
Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.
The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec, test = TT,
               call = sys.call())
  class(rval) <- "ROC"
  rval
}
print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}
The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.
plot.ROC <- function(x, type = "b", null.line = TRUE,
                     xlab = "1-Specificity", ylab = "Sensitivity",
                     main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
       xlab = xlab, ylab = ylab, ...)
  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}
A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:
lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}
A method for identify allows the cutpoint to be found for any point on the curve:
identify.ROC <- function(x, labels = NULL, ..., digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}
An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
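One possible sketch of that exercise, using the trapezoidal rule (this assumes only the non-decreasing sens and mspec components described above; it is not the only reasonable implementation):

```r
## A possible AUC sketch: trapezoidal rule over the stored coordinates.
## x is any object with sens and mspec components as built by ROC().
AUC <- function(x) {
  fpr <- c(0, x$mspec, 1)
  tpr <- c(0, x$sens, 1)
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1))/2)
}
```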
Generalising the class
The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.
ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.
ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
S4 classes and methods
In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).
As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.
The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:
setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  }
)
The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new
function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  new("ROC", sens = sens, mspec = mspec,
      test = TT, call = sys.call())
}
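The validity check in action (the class definition is restated so the snippet stands alone; the slot values are made up):

```r
library(methods)
setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })

ok  <- new("ROC", sens = 0.9, mspec = 0.1, test = 2,
           call = quote(ROC(T, D)))          # passes the check
bad <- try(new("ROC", sens = c(0.2, 0.9), mspec = 0.1, test = 2,
               call = quote(ROC(T, D))), silent = TRUE)
inherits(bad, "try-error")                   # new() refused the object
```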
Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:
setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })
The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.
The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.
setMethod("plot", c("ROC", "missing"),
  function(x, y, type = "b", null.line = TRUE,
           xlab = "1-Specificity",
           ylab = "Sensitivity",
           main = NULL, ...) {
    par(pty = "s")
    plot(x@mspec, x@sens, type = type,
         xlab = xlab, ylab = ylab, ...)
    if (null.line)
      abline(0, 1, lty = 3)
    if (is.null(main))
      main <- x@call
    title(main = main)
  })
Creating a method for lines appears to work the same way:
setMethod("lines", "ROC",
  function(x, ...) {
    lines(x@mspec, x@sens, ...)
  })
In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before:
> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>
and after the setMethod call:
> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.
setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T, ...) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type = "response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    new("ROC", sens = sens, mspec = mspec,
        test = TT, call = sys.call())
  })
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.
setGeneric("identify", signature = c("x"))

setMethod("identify", "ROC",
  function(x, labels = NULL, ..., digits = 1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels = labels)
  })
Discussion
Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.
This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R
by the R Core Team
New features in 1.9.1
• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects, and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.
User-visible changes in 1.9.0
• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.
• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has
been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics), or called as graphics::ps.options, or, better, set as a hook; see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.
New features in 1.9.0
• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits = FALSE.
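For example (a quick sketch; the names are arbitrary):

```r
e <- new.env()
e$x <- 1:3     # same as assign("x", 1:3, envir = e)
e$x            # same as get("x", envir = e, inherits = FALSE)
e[["x"]]       # character index only; no partial matching
```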
• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values; thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).
bull The package argument to data() is no longerallowed to be a (unquoted) name and so can bea variable name or a quoted character string
• There is a new class 'Date' to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.
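For example (the dates are chosen arbitrarily):

```r
# Date objects store days since 1970-01-01 and support day arithmetic
d <- as.Date("2004-06-01")
d + 30                                 # "2004-07-01"
as.numeric(as.Date("2004-06-02") - d)  # difference of 1 day
format(d, "%d %B %Y")                  # locale-dependent formatting
```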
• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for difftime objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute 'out.attrs', and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.
• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).
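The gamma-based definitions can be checked directly:

```r
factorial(5)                        # 120, via gamma(5 + 1)
lfactorial(10)                      # log(10!), via lgamma(10 + 1)
all.equal(factorial(5), gamma(6))   # TRUE
```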
• findInterval(x, v) now allows +/-Inf values, and NAs in x.
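For instance (the breakpoint vector is illustrative):

```r
v <- c(1, 5, 10)                  # sorted breakpoints
findInterval(c(-Inf, 3, Inf), v)  # 0 1 3
findInterval(NA, v)               # NA
```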
• formula.default() now looks for a 'terms' component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.
• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.
• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)
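A quick sketch of their use:

```r
head(letters, 3)    # "a" "b" "c"
tail(letters, 2)    # "y" "z"
head(mtcars)        # first 6 rows of a data frame
```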
• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).
• A new function mget() will retrieve multiple values from an environment.
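For example (the variable names are illustrative):

```r
e <- new.env()
e$a <- 1
e$b <- 2
mget(c("a", "b"), envir = e)    # named list: list(a = 1, b = 2)
```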
• model.frame() methods, for example those for lm and glm, pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as lqs and ppr.

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• POSIXct objects can now have a 'tzone' attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)
• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The POSIXct and POSIXlt methods for rep() now pass ... on to the default method (as expected by PR#5818).
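The three S-compatibility rules above can be seen directly:

```r
rep(integer(0), length.out = 3)      # NA NA NA (length 3, not length 0)
rep(1:2, each = 2, length.out = 3)   # 1 1 2 (recycled, not NA-padded)
rep(1:2, times = 5, length.out = 4)  # 1 2 1 2 (times is ignored)
```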
• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces; see ?setHook.

• save() default arguments can now be set using option 'save.defaults', which is also used by save.image() if option 'save.image.defaults' is not present.
• New function shQuote() to quote strings to be passed to OS shells.
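For example (the filename is illustrative):

```r
f <- "my file.txt"
shQuote(f)                  # "'my file.txt'" (Bourne-shell style, the default)
shQuote(f, type = "cmd")    # "\"my file.txt\"" (Windows cmd.exe style)
# e.g. system(paste("ls -l", shQuote(f)))
```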
• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc), strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.
• strsplit() now has fixed and perl arguments, and split = "" is optimized.
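For instance:

```r
strsplit("a.b.c", ".", fixed = TRUE)[[1]]   # "a" "b" "c" ('.' taken literally)
strsplit("a1b22c", "[0-9]+")[[1]]           # "a" "b" "c" (regex split)
strsplit("abc", "")[[1]]                    # "a" "b" "c" (the optimized case)
```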
• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class 'loadings' to their loadings component.
• Model fits now add a 'dataClasses' attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes lm, mlm, glm and ppr, as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option 'X11fonts'.

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be re-installed.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.
• Added as.list.environment to coerce environments to lists (efficiently).
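For example:

```r
e <- new.env()
e$a <- 1
e$b <- "two"
l <- as.list(e)     # dispatches to as.list.environment
sort(names(l))      # "a" "b"
```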
• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.
• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.

– Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport 'bundles' that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

– Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).
– Allowed the vp slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

pushViewport(viewport(w=.5, h=.5, name="A"))
grid.rect()
pushViewport(viewport(w=.5, h=.5, name="B"))
grid.rect(gp=gpar(col="grey"))
upViewport(2)
grid.rect(vp="A", gp=gpar(fill="red"))
grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))
– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

* grobs all now have a 'name' slot. The grob name is used to uniquely identify a 'drawn' grob (i.e., a grob on the display list).

* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

* the grobwidth and grobheight units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob 'pointed to' is modified.

In addition, the following features are now possible with grobs:

* grobs now save()/load() like any normal R object.

* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a 'childrenvp' slot which is a viewport which is pushed, and then 'up'ed, before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit() and grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

– Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce 'strwidth', 'strheight', 'grobwidth' and 'grobheight' unit objects, respectively).
– Allowed viewports to turn off clipping altogether. Possible settings for viewport clip arg are now:

'on': clip to the viewport (was TRUE)
'inherit': clip to whatever parent says (was FALSE)
'off': turn off clipping

Still accept logical values (and NA maps to 'off').
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option 'repositories' to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories, as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.
• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(*, deriv=2) and psigamma(*, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.
• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities' as all builds of R have had them since 1.7.0.

• --with-blasgoto= to use K. Goto's optimized BLAS will now work.
The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN
by Kurt Hornik

New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.
DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences; title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.
crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.
debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object oriented implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. By S original (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, and some statistics in linkage analysis and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between versus within chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal/external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen, ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. By S original by Patrick J. Burns. Ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.
Other changes
• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R News ISSN 1609-3631
Vol 41 June 2004 46
R Foundation News
by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)
New supporting institutions
• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board: Douglas Bates and Paul Murrell

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: firstname.lastname@R-project.org
R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions go to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).
R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
- Editorial
- The Decision To Use R
-
- Preface
- Some Background on the Company
- Evaluating The Tools of the Trade - A Value Based Equation
- An Introduction To R
- A Continued Evolution
- Giving Back
- Thanks
-
- The ade4 package - I One-table methods
-
- Introduction
- The duality diagram class
- Distance matrices
- Taking into account groups of individuals
- Permutation tests
- Conclusion
-
- qcc: An R package for quality control charting and statistical process control
-
- Introduction
- Creating a qcc object
- Shewhart control charts
- Operating characteristic function
- Cusum charts
- EWMA charts
- Process capability analysis
- Pareto charts and cause-and-effect diagrams
- Tuning and extending the package
- Summary
- References
-
- Least Squares Calculations in R
- Tools for interactively exploring R packages
-
- References
-
- The survival package
-
- Overview
- Specifying survival data
- One and two sample summaries
- Proportional hazards models
- Additional features
-
- useR 2004
- R Help Desk
-
- Introduction
- Other Applications
- Time Series
- Input
- Avoiding Errors
- Comparison Table
-
- Programmer's Niche
-
- ROC curves
- Classes and methods
-
- Generalising the class
-
- S4 classes and method
- Discussion
-
- Changes in R
-
- New features in 1.9.1
- User-visible changes in 1.9.0
- New features in 1.9.0
-
- Changes on CRAN
-
- New contributed packages
- Other changes
-
- R Foundation News
-
- Donations
- New supporting institutions
- New supporting members
-
and thus screen out the less significant factors in an analysis.

A simple Pareto chart may be drawn using the function pareto.chart, which in its simpler form requires a vector of values. For example:
> defect <- c(80, 27, 66, 94, 33)
> names(defect) <-
+   c("price code", "schedule date",
+     "supplier code", "contact num",
+     "part num")
> pareto.chart(defect)
The function returns a table of descriptive statistics and the Pareto chart shown in Figure 7. More arguments can be provided to control axes labels and limits, the main title and bar colors (see help(pareto.chart)).
Figure 7: Pareto chart.
A cause-and-effect diagram, also called an Ishikawa diagram (or fish bone diagram), is used to associate multiple possible causes with a single effect. Thus, given a particular effect, the diagram is constructed to identify and organize the possible causes for it.

A very basic implementation of this type of diagram is available in the qcc package. The function cause.and.effect requires at least two arguments: cause, a list of causes and branches providing descriptive labels, and effect, a string describing the effect. Other arguments are available, such as cex and font, which control, respectively, the character expansion and the fonts used for labeling branches, causes and the effect. The following example creates the diagram shown in Figure 8.
> cause.and.effect(
+   cause = list(Measurements = c("Microscopes",
+                                 "Inspectors"),
+                Materials = c("Alloys",
+                              "Suppliers"),
+                Personnel = c("Supervisors",
+                              "Operators"),
+                Environment = c("Condensation",
+                                "Moisture"),
+                Methods = c("Brake", "Engager",
+                            "Angle"),
+                Machines = c("Speed", "Bits",
+                             "Sockets")),
+   effect = "Surface Flaws")
Figure 8: Cause-and-effect diagram.
Tuning and extending the package
Options to control some aspects of the qcc package can be set using the function qcc.options. If this is called with no arguments it retrieves the list of available options and their current values; otherwise it sets the required option. From a user's point of view, the most interesting options allow setting: the plotting character and the color used to highlight points beyond control limits or violating runs; the run length, i.e. the maximum length of a run before signalling a point as out of control; the background color used to draw the figure and the margin color; the character expansion used to draw the labels, the title and the statistics at the bottom of any plot. For details see help(qcc.options).
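As a brief illustration (not part of the original article), a session querying and setting an option might look as follows; the option name run.length is taken from the description above, but exact option names may differ between qcc versions, so check help(qcc.options).

```r
# Hypothetical qcc.options session; option names assumed from the text.
library(qcc)

opts <- qcc.options()            # no arguments: list all options and values
names(opts)                      # inspect which options are available

rl <- qcc.options("run.length")  # query a single option
qcc.options(run.length = 5)      # signal after a run of 5 points on one side
qcc.options(run.length = rl)     # restore the previous value
```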
Another interesting feature is the possibility of easily extending the package by defining new control charts. Suppose we want to plot a p chart based on samples with different sizes. The resulting control chart will have non-constant control limits and thus may be difficult for some operators to read. One possible solution is to draw a standardized control chart. This can be accomplished by defining three new functions: one that computes and returns the group statistics to be charted (i.e. the z scores) and their center (zero in this case), another that returns the within-group standard deviation (one in this case), and finally a function that returns the control limits. These functions need to be
called stats.type, sd.type and limits.type, respectively, for a new chart of type "type". The following code may be used to define a standardized p chart:
stats.p.std <- function(data, sizes)
{
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data)/sum(sizes)
  z <- (data/sizes - pbar)/sqrt(pbar*(1-pbar)/sizes)
  list(statistics = z, center = 0)
}

sd.p.std <- function(data, sizes) return(1)

limits.p.std <- function(center, std.dev, sizes, conf)
{
  if (conf >= 1)
  {
    lcl <- -conf
    ucl <- +conf
  }
  else
  {
    if (conf > 0 & conf < 1)
    {
      nsigmas <- qnorm(1 - (1 - conf)/2)
      lcl <- -nsigmas
      ucl <- +nsigmas
    }
    else stop("invalid conf argument.")
  }
  limits <- matrix(c(lcl, ucl), ncol = 2)
  rownames(limits) <- rep("", length = nrow(limits))
  colnames(limits) <- c("LCL", "UCL")
  return(limits)
}
Then we may source the above code and obtain the control charts in Figure 9 as follows:
> ## set unequal sample sizes
> n <- c(rep(50, 5), rep(100, 5), rep(25, 5))
> ## generate randomly the number of successes
> x <- rbinom(length(n), n, 0.2)
> par(mfrow = c(1, 2))
> ## plot the control chart with variable limits
> qcc(x, type = "p", size = n)
> ## plot the standardized control chart
> qcc(x, type = "p.std", size = n)
Figure 9: A p chart with non-constant control limits (left panel) and the corresponding standardized control chart (right panel).
Summary
In this paper we briefly describe the qcc package. This provides functions to draw basic Shewhart quality control charts for continuous, attribute and count data; corresponding operating characteristic curves are also implemented. Other statistical quality tools available are Cusum and EWMA charts for continuous data, process capability analyses, Pareto charts and cause-and-effect diagrams.
References
Montgomery, D.C. (2000) Introduction to Statistical Quality Control, 4th ed. New York: John Wiley & Sons.

Wetherill, G.B. and Brown, D.W. (1991) Statistical Process Control. New York: Chapman & Hall.
Luca Scrucca
Dipartimento di Scienze Statistiche, Università degli Studi di Perugia, Italy
luca@stat.unipg.it
Least Squares Calculations in R
Timing different approaches
by Douglas Bates
Introduction
The S language encourages users to become programmers as they express new ideas and techniques of data analysis in S. Frequently the calculations in these techniques are expressed in terms of matrix operations. The subject of numerical linear algebra, how calculations with matrices can be carried out accurately and efficiently, is quite different from what many of us learn in linear algebra courses or in courses on linear statistical methods.
Numerical linear algebra is primarily based on decompositions of matrices, such as the LU decomposition, the QR decomposition, and the Cholesky decomposition, that are rarely discussed in linear algebra or statistics courses.
In this article we discuss one of the most common operations in statistical computing: calculating least squares estimates. We can write the problem mathematically as

    β̂ = arg min_β ||y − Xβ||²    (1)
where X is an n × p model matrix (p ≤ n), y is n-dimensional and β is p-dimensional. Most statistics texts state that the solution to (1) is
    β̂ = (X′X)⁻¹ X′y    (2)
when X has full column rank (i.e. the columns of X are linearly independent).
Indeed (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way. That is, they use code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.
General principles
A few general principles of numerical linear algebraare
1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.

2. Use crossprod(X), not t(X) %*% X, to calculate X′X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X′y.

3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X′X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
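These three rules can be illustrated side by side on a small simulated problem (the data here are random placeholders, not from the article):

```r
# Compare the naive formula with the recommended idioms on random data.
set.seed(1)
X <- matrix(rnorm(200 * 5), 200, 5)
y <- rnorm(200)

naive  <- solve(t(X) %*% X) %*% t(X) %*% y       # breaks rules 1 and 2
better <- solve(crossprod(X), crossprod(X, y))   # rules 1 and 2
ch <- chol(crossprod(X))                         # rule 3: Cholesky factor
chol.sol <- backsolve(ch, forwardsolve(ch, crossprod(X, y),
                                       upper = TRUE, trans = TRUE))

all.equal(naive, better)           # the three solutions agree numerically
all.equal(c(chol.sol), c(naive))
```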
Some timings on a large example
For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well:
> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850  712
> sys.gctime(naive.sol <- solve(t(mmm) %*%
+     mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00
(The function sys.gctime is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)
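The helper described in the parenthetical above can be sketched as follows; this is an assumed reconstruction, not the author's actual definition (recent versions of R fold the same idea into system.time(expr, gcFirst = TRUE)):

```r
# Assumed sketch of sys.gctime: run the garbage collector, then time expr.
sys.gctime <- function(expr) {
  gc(verbose = FALSE)  # collect garbage first so timings are comparable
  system.time(expr)    # expr is a promise, forced between the time stamps
}

sys.gctime(sum(rnorm(1e6)))
```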
According to the principles above it should bemore effective to compute
> sys.gctime(cpod.sol <- solve(crossprod(mmm),
+     crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE
Timing results will vary between computers, but in most cases the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X′X the naive way:
> sys.gctime(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00
because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.
However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result the solution of (1) is performed using a general linear system solver based on an LU decomposition when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly but the code is awkward:
> sys.gctime(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sys.gctime(chol.sol <- backsolve(ch,
+     forwardsolve(ch, crossprod(mmm, y),
+         upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE
Least squares calculations with Matrix classes
The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition.
> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sys.gctime(Mat.sol <- solve(crossprod(mmg),
+     crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE
Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object so that it can be reused without recalculation.
> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sys.gctime(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sys.gctime(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all:
> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sys.gctime(sparse.sol <- solve(crossprod(mm),
+     crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE
The model matrix mm has class "cscMatrix", indicating that it is a compressed, sparse, column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric, sparse, column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful, the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case the decomposition is so fast that it is difficult to determine the difference in the solution times.
> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sys.gctime(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sys.gctime(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
Epilogue
Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS), such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or Atlas (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.
Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.
Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A′B are generally faster than those in AB.
Notice that this means that calculating A′B in R as t(A) %*% B instead of crossprod(A, B) causes A to be transposed twice: once in the calculation of t(A) and a second time in the inner loop of the matrix
product. The crossprod function does not do any transposition of matrices, yet another reason to use crossprod.
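A quick check of this equivalence on simulated matrices (not from the article):

```r
# crossprod(A, B) computes t(A) %*% B without materialising t(A).
set.seed(1)
A <- matrix(rnorm(300 * 40), 300, 40)
B <- matrix(rnorm(300 * 40), 300, 40)

all.equal(crossprod(A, B), t(A) %*% B)  # same A'B, up to rounding
all.equal(crossprod(A), t(A) %*% A)     # A'A, exploiting symmetry
```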
Bibliography
J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).
D.M. Bates
U. of Wisconsin-Madison
bates@wisc.edu
Tools for interactively exploring R packages
by Jianhua Zhang and Robert Gentleman
Introduction
An R package has the required and perhaps additional subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer and they are intended for exploring packages, examples, and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.
Description
Our current implementation is in the form of tcltk widgets, and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcltk. Once both tkWidgets and widgetTools have been loaded into the working session, the user has full access to these interactive tools.
pExplorer
pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from showing by the widget. The default behavior is to open the first package in the library of locally installed R (.libPaths()), with subdirectories/files named "Meta" and "latex" excluded, if nothing is provided by a user.
Figure 1 shows the widget invoked by typing pExplorer("base").
The following tasks can be performed throughthe interface
• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

• Clicking the Browse button allows users to select a directory containing R packages, achieving the same results as typing in a new path.

• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.
Figure 1: A snapshot of the widget when pExplorer is called.
• Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just being clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, to be discussed next.
eExplorer
eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.
Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.
vExplorer
vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionalities provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of the vignettes of the package.
Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget, as shown by Figure 4, appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:
• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the pdf version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in Results of Execution.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.
The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.
References
Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28-31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21-24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.
Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base").
Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().
Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.
The survival package
by Thomas Lumley
This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).
Overview
Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.
In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.
Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).
Specifying survival data
In the popular 'counting process' formulation of survival analysis, each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.
In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.
> library(survival)
> data(veteran)
> ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+
...
> ## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
+     diagtime*30 + time, status))
 [1] (210, 282 ]  (150, 561 ]
 [3] ( 90, 318 ]  (270, 396 ]
...
 [9] (540, 854 ]  (180, 280+]
One and two sample summaries
The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.
> data(veteran)
> plot(survfit(Surv(time, status) ~ trt, data = veteran),
+      xlab = "Years since randomisation", xscale = 365,
+      ylab = "% surviving", yscale = 100,
+      col = c("forestgreen", "blue"))
> survdiff(Surv(time, status) ~ trt, data = veteran)

Call:
survdiff(formula = Surv(time, status) ~ trt,
    data = veteran)

        N Observed Expected (O-E)^2/E (O-E)^2/V
trt=1  69       64     64.5   0.00388   0.00823
trt=2  68       64     63.5   0.00394   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928
Figure 1: Survival distributions for two lung cancer treatments.
Proportional hazards models
The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.
Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

    log h(t; z) = log h0(t) + βz.
Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.
A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.
> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
+     log(bili) + log(protime) + age + platelet,
+     data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
    log(bili) + log(protime) + age + platelet,
    data = pbc, subset = trt > 0)

                 coef exp(coef)  se(coef)     z       p
edtrt         1.02980     2.800  0.300321  3.43 0.00061
log(bili)     0.95100     2.588  0.097771  9.73 0.00000
log(protime)  2.88544    17.911  1.031908  2.80 0.00520
age           0.03544     1.036  0.008489  4.18 0.00003
platelet     -0.00128     0.999  0.000927 -1.38 0.17000

Likelihood ratio test=185 on 5 df, p=0  n= 312
The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.
> plot(survfit(Surv(time, status) ~ edtrt,
+      data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt +
+      ratetable(edtrt = edtrt, bili = bili,
+          platelet = platelet, age = age,
+          protime = protime),
+      data = pbc, subset = trt == -9,
+      ratetable = mayomodel, cohort = TRUE),
+      col = "purple")
The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.
Figure 2: Observed and predicted survival.
The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The
cox.zph function provides formal tests and graphical diagnostics for proportional hazards.
> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

> ## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])
[Plot: Beta(t) for edtrt against time]
Figure 3: Testing proportional hazards.
There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.
Additional features
Computation for the Cox model extends straightforwardly to situations where z can vary over time, and when an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.
Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.
The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of followup.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
useR! 2004: The R User Conference

by John Fox
McMaster University, Canada
More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.
The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces and finance to fights among lizards, reflects the wide range of applications supported by R.
A particularly interesting component of the program were keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:
• Paul Murrell on grid graphics and programming
• Martin Mächler on good programming practice
• Luke Tierney on namespaces and byte compilation
• Kurt Hornik on packaging, documentation, and testing
• Friedrich Leisch on S4 classes and methods
• Peter Dalgaard on language interfaces, using .Call and .External
• Douglas Bates on multilevel models in R
• Brian D. Ripley on data mining and related topics
Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.
An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.
The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19 and a conference trip in the afternoon of May 22 to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.
Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.
R Help Desk
Date and Time Classes in R
Gabor Grothendieck and Thomas Petzoldt
Introduction
R 1.9.0 and its contributed packages have a number of datetime (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:

• Date. This is the newest R date class, just introduced into R 1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).
• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).

• POSIX classes. POSIX classes refer to the two classes POSIXct and POSIXlt and their common super class POSIXt. These support times and dates including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases, and other applications that involve external communication with systems that also support dates, times, time zones, and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970, GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute.
POSIXt is a common superclass of the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIX classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.
Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.
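The internal representations mentioned above can be inspected directly. A small sketch (the date below is arbitrary; chron is omitted here so that no contributed package is needed):

```r
# Date counts days since 1970-01-01; POSIXct counts seconds since then (GMT)
d <- as.Date("2004-06-01")
unclass(d)                            # 12570 days since the origin

dp <- as.POSIXct("2004-06-01", tz = "GMT")
unclass(dp)                           # the same instant, in seconds
as.numeric(dp) == unclass(d) * 24 * 60 * 60   # TRUE
```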
Other Applications
Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron respectively. It's possible to set Excel to use either origin, which is why the word 'usually' was employed above.
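As a quick check of the Windows-origin recipe (the serial number below is made up):

```r
# Hypothetical Excel (Windows) serial number: days and fraction of a day
# since 1899-12-30; floor() drops the time-of-day part
x <- 38139.25
as.Date("1899-12-30") + floor(x)    # a Date in mid-2004
```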
SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).
Time Series
A common use of dates and datetimes is in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly, or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its), and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or datetime package to represent datetimes.
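For a regular series, ts alone suffices; the dates are implicit in the start and frequency (the values below are invented):

```r
# A monthly series beginning January 2002; no date class is involved
z <- ts(1:24, start = c(2002, 1), frequency = 12)
frequency(z)                    # 12
start(z)                        # 2002 1
window(z, start = c(2003, 1))   # subset by time, not by date
```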
Input
By default, read.table will read in numeric data, such as 20040115, as numbers and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹
# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)
Avoiding Errors
The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.
With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:
# default origin
options(chron.origin = c(month=1, day=1, year=1970))
# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)
# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)
# if TRUE then year not displayed
options(chron.simplify = FALSE)
For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig+x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example:
¹This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.
orig <- chron("01/01/60")
x <- 0:9   # days since 01/01/60
# chron dates with default origin
orig + x
Regarding POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year, and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:
dp <- seq(Sys.time(), len=10, by="day")
plot(dp, 1:10)
This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.
• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.

• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
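The via-character recipe can be sketched as follows (the time below is arbitrary):

```r
# Converting through character avoids surprises from time zones and
# from the isdst component of POSIXlt
dp <- as.POSIXct("2004-06-01 12:00:00")
dl <- as.POSIXlt(format(dp))   # character first, then POSIXlt
dl$hour                        # 12, as expected
```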
Comparison Table
Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.
Bibliography
Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2):8-11. URL http://CRAN.R-project.org/doc/Rnews/.
Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de
now:
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
  POSIXct: Sys.time()

origin:
  Date:    structure(0, class="Date")
  chron:   chron(0)
  POSIXct: structure(0, class=c("POSIXt","POSIXct"))

x days since origin (GMT):
  Date:    structure(floor(x), class="Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class=c("POSIXt","POSIXct"))

diff in days:
  Date:    dt1-dt2
  chron:   dc1-dc2
  POSIXct: difftime(dp1, dp2, unit="day") [1]

time zone difference:
  POSIXct: dp-as.POSIXct(format(dp), tz="GMT")

compare:
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day:
  Date:    dt+1
  chron:   dc+1
  POSIXct: seq(dp0, length=2, by="DSTday")[2] [2]

previous day:
  Date:    dt-1
  chron:   dc-1
  POSIXct: seq(dp0, length=2, by="-1 DSTday")[2] [3]

x days since date:
  Date:    dt0+floor(x)
  chron:   dc0+x
  POSIXct: seq(dp0, length=2, by=paste(x, "DSTday"))[2] [4]

sequence:
  Date:    dts <- seq(dt0, length=10, by="day")
  chron:   dcs <- seq(dc0, length=10)
  POSIXct: dps <- seq(dp0, length=10, by="DSTday")

plot:
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz="GMT"), rnorm(10)) [5]

every 2nd week:
  Date:    seq(dt0, length=3, by="2 week")
  chron:   dc0+seq(0, length=3, by=14)
  POSIXct: seq(dp0, length=3, by="2 week")

first day of month:
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc)-month.day.year(dc)$day+1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month (number which sorts):
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun=0):
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc)-3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year:
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean for each month/year:
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means:
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output, yyyy-mm-dd:
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output, mm/dd/yy:
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output, Sat Jul 1 1970:
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input, "1970-10-15":
  Date:    as.Date(z)
  chron:   chron(z, format="y-m-d")
  POSIXct: as.POSIXct(z)

Input, "10/15/1970":
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz=""):
  to Date:    from chron: as.Date(dates(dc));  from POSIXct: as.Date(format(dp))
  to chron:   from Date: chron(unclass(dt));  from POSIXct: chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]
  to POSIXct: from Date: as.POSIXct(format(dt));  from chron: as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz="GMT"):
  to Date:    from chron: as.Date(dates(dc));  from POSIXct: as.Date(dp)
  to chron:   from Date: chron(unclass(dt));  from POSIXct: as.chron(dp)
  to POSIXct: from Date: as.POSIXct(format(dt), tz="GMT");  from chron: as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for % codes.

[1] Can be reduced to dp1-dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp+24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp-24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp+24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz="GMT")).

Table 1: Comparison Table
Programmers' Niche
A simple class in S3 and S4
by Thomas Lumley
ROC curves
Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.
The ROC curve plots the true positive rate (sensitivity), Pr(T > c|D = 1), against the false positive rate (1 - specificity), Pr(T > c|D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).
I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.
Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±Inf and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.
drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type="l")
}
It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.
There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number rather than just an object for which > and <= are defined.
drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  plot(mspec, sens, type="l")
}
One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
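This can be checked directly on simulated data (the data below are invented): both coordinate vectors are cumulative sums of non-negative counts, so they can never decrease.

```r
# Monotonicity check for the table-based construction
set.seed(42)
T <- rnorm(200)
D <- rbinom(200, 1, plogis(2 * T))   # disease more likely for high T
DD <- table(-T, D)
sens  <- cumsum(DD[, 2]) / sum(DD[, 2])
mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
stopifnot(!is.unsorted(sens), !is.unsorted(mspec))
```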
Classes and methods
Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.
The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec, test=TT,
               call=sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}
The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.
plot.ROC <- function(x, type="b", null.line=TRUE,
                     xlab="1-Specificity",
                     ylab="Sensitivity",
                     main=NULL, ...) {
  par(pty="s")
  plot(x$mspec, x$sens, type=type,
       xlab=xlab, ylab=ylab, ...)
  if(null.line)
    abline(0, 1, lty=3)
  if(is.null(main))
    main <- x$call
  title(main=main)
}
A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:
lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}
A method for identify allows the cutpoint to be found for any point on the curve:
identify.ROC <- function(x, labels=NULL, ...,
                         digits=1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels=labels, ...)
}
An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
Generalising the class
The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.
ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
               test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}
The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.
ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type="response"))
  disease <- disease*weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
               test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}
S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).
As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.
The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:
setClass("ROC",
  representation(sens="numeric", mspec="numeric",
                 test="numeric", call="call"),
  validity=function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })
The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  new("ROC", sens=sens, mspec=mspec,
      test=TT, call=sys.call())
}
Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:
setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })
The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.
The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.
setMethod("plot", c("ROC", "missing"),
  function(x, y, type="b", null.line=TRUE,
           xlab="1-Specificity",
           ylab="Sensitivity",
           main=NULL, ...) {
    par(pty="s")
    plot(x@mspec, x@sens, type=type,
         xlab=xlab, ylab=ylab, ...)
    if(null.line)
      abline(0, 1, lty=3)
    if(is.null(main))
      main <- x@call
    title(main=main)
  })
Creating a method for lines appears to work the same way:
setMethod("lines", "ROC",
  function(x, ...) {
    lines(x@mspec, x@sens, ...)
  })
In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before
> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.
Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.
setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type="response"))
    disease <- disease*weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    new("ROC", sens=sens, mspec=mspec,
        test=TT, call=sys.call())
  })
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.
setGeneric("identify", signature=c("x"))

setMethod("identify", "ROC",
  function(x, labels=NULL, ...,
           digits=1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels=labels, ...)
  })
Discussion
Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.
This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model 'is' a linear model, more that it 'has' a linear model (from the last iteration of iteratively reweighted least squares) associated with it.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R

by the R Core Team

New features in 1.9.1
• as.Date() now has a method for POSIXlt objects.
• mean() has a method for difftime objects and so summary() works for such objects.
• legend() has a new argument pt.cex.
• plot.ts() has more arguments, particularly yax.flip.
• heatmap() has a new keep.dendro argument.
• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)
• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.
User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.
• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.
Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has
been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook – see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0
• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically those of get/assign to the environment with inherits=FALSE.
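A short sketch of these operations (the environment and variable names are invented):

```r
e <- new.env()
e$x <- 1          # $<- assigns into the environment
e[["y"]] <- 2     # [[<- does likewise
c(e$x, e[["y"]])  # access with $ and [[
## Only the environment itself is searched, as with inherits = FALSE:
exists("x", envir = e, inherits = FALSE)  # TRUE
```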
• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.
• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.
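For instance (dates chosen arbitrarily):

```r
d <- as.Date("2004-06-01")          # ISO-format character string
format(d + 30, "%Y-%m-%d")          # arithmetic is in whole days: "2004-07-01"
d - as.Date("2004-01-01")           # a "difftime" of 152 days
as.numeric(as.Date("1970-01-02"))   # internally, days since 1970-01-01: 1
```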
• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.
• New functions factorial() (defined as gamma(x+1)) and, for S-PLUS compatibility, lfactorial() (defined as lgamma(x+1)).
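A quick check of the stated definitions:

```r
factorial(5)                             # 120, i.e. gamma(5 + 1)
all.equal(factorial(5), gamma(6))        # TRUE by definition
lfactorial(100)                          # log(100!) without overflowing
all.equal(lfactorial(100), lgamma(101))  # TRUE
```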
• findInterval(x, v) now allows +/-Inf values, and NAs in x.
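For example, with an arbitrary sorted breakpoint vector v:

```r
v <- c(0, 1, 2)                          # breakpoints, must be sorted
res <- findInterval(c(-Inf, 0.5, Inf, NA), v)
res
# -Inf falls before v[1] (interval 0), 0.5 in [v[1], v[2]) (interval 1),
# Inf beyond v[3] (interval 3), and NA propagates: 0 1 3 NA
```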
• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.
• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.
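For example (names invented):

```r
e <- new.env()
assign("a", 1, envir = e)
assign("b", 2, envir = e)
vals <- mget(c("a", "b"), envir = e)  # one call instead of two get()s
vals                                  # a named list: list(a = 1, b = 2)
```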
• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.
• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)
• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).
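The S-compatibility changes above can be seen directly:

```r
x1 <- rep(numeric(0), length.out = 3)      # NA NA NA, not a 0-length vector
x2 <- rep(1:2, each = 2, length.out = 5)   # recycles past 'each': 1 1 2 2 1
x3 <- rep(1:2, times = 5, length.out = 4)  # 'times' is ignored: 1 2 1 2
```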
• rgb2hsv() is new: an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.
• On some OSes (including Windows and those using glibc), strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split = "" is optimized.
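For instance, the fixed argument avoids having to escape regular-expression metacharacters:

```r
strsplit("a.b.c", ".", fixed = TRUE)[[1]]  # "a" "b" "c"
strsplit("a.b.c", "\\.")[[1]]              # same result via an escaped regexp
strsplit("abc", "")[[1]]                   # the optimized split = "": "a" "b" "c"
```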
• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.
• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.
• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.

– Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport bundles that may be pushed at once (lists are pushed in parallel, stacks in series).

current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of viewport path in downViewport() and seekViewport().

See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator :: in later versions).

– Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).
– Allowed the vp slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

    pushViewport(viewport(w=.5, h=.5, name="A"))
    grid.rect()
    pushViewport(viewport(w=.5, h=.5, name="B"))
    grid.rect(gp=gpar(col="grey"))
    upViewport(2)
    grid.rect(vp="A", gp=gpar(fill="red"))
    grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))
– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

grobs all now have a name slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

the grobwidth and grobheight units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

In addition, the following features are now possible with grobs:

grobs now save()/load() like any normal R object.

many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a childrenvp slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

– Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).
– Allowed viewports to turn off clipping altogether. Possible settings for viewport clip arg are now:

"on": clip to the viewport (was TRUE)
"inherit": clip to whatever parent says (was FALSE)
"off": turn off clipping

Still accept logical values (and NA maps to "off").
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.
• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(x, deriv=2) and psigamma(x, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities' as all builds of R have had them since 1.7.0.

• --with-blas=goto to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN
by Kurt Hornik

New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.
BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination
follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaged by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check
given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. Energy-statistics concept based on a generalization of Newton's potential energy is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. By S original (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics; Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between versus within chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal/external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR by Andrew Gelman).

ref small package with functions for creating references, reading from and writing to references, and a memory efficient refdata type that transparently encapsulates matrixes and dataframes. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers

rrcov Functions for robust location and scatter estimation and robust regression with high breakdown point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe algorithm using the method described in Schoenfeld, D. (2001), "A Simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972-974. By David A. Schoenfeld

setRNG Set reproducible random number generator in R and S. By Paul Gilbert

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS 2002, 99, pg. 15869, and J. Comput. Chem. 2003, 24, pg. 1215. By Rajarshi Guha

supclust Methodology for supervised grouping of predictor variables. By Marcel Dettling and Martin Maechler

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff

urn Functions for sampling without replacement (simulated urns). By Micah Altman

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns; ported to R by Nick Efthymiou

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck
Other changes
- Packages mclust1998, netCDF, and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News

by Bettina Grün
Donations
- Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)
- Emanuele De Rinaldis (Italy)
- Norm Phillips (Canada)
New supporting institutions
- Breast Center at Baylor College of Medicine, Houston, Texas, USA
- Dana-Farber Cancer Institute, Boston, USA
- Department of Biostatistics, Vanderbilt University School of Medicine, USA
- Department of Statistics & Actuarial Science, University of Iowa, USA
- Division of Biostatistics, University of California, Berkeley, USA
- Norwegian Institute of Marine Research, Bergen, Norway
- TERRA Lab, University of Regina - Department of Geography, Canada
- ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232
USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
called stats.type, sd.type and limits.type, respectively, for a new chart of type "type". The following code may be used to define a standardized p chart:
stats.p.std <- function(data, sizes)
{
  data <- as.vector(data)
  sizes <- as.vector(sizes)
  pbar <- sum(data)/sum(sizes)
  z <- (data/sizes - pbar)/sqrt(pbar*(1-pbar)/sizes)
  list(statistics = z, center = 0)
}

sd.p.std <- function(data, sizes) return(1)

limits.p.std <- function(center, std.dev, sizes, conf)
{
  if (conf >= 1)
    { lcl <- -conf
      ucl <- +conf }
  else
    { if (conf > 0 & conf < 1)
        { nsigmas <- qnorm(1 - (1 - conf)/2)
          lcl <- -nsigmas
          ucl <- +nsigmas }
      else stop("invalid conf argument.") }
  limits <- matrix(c(lcl, ucl), ncol = 2)
  rownames(limits) <- rep("", length = nrow(limits))
  colnames(limits) <- c("LCL", "UCL")
  return(limits)
}
Then we may source the above code and obtain the control charts in Figure 9 as follows:

## set unequal sample sizes
> n <- c(rep(50,5), rep(100,5), rep(25,5))
## generate randomly the number of successes
> x <- rbinom(length(n), n, 0.2)
> par(mfrow=c(1,2))
## plot the control chart with variable limits
> qcc(x, type="p", sizes=n)
## plot the standardized control chart
> qcc(x, type="p.std", sizes=n)
Figure 9: A p chart with non-constant control limits (left panel) and the corresponding standardized control chart (right panel).
Summary
In this paper we briefly describe the qcc package. This provides functions to draw basic Shewhart quality control charts for continuous, attribute and count data; corresponding operating characteristic curves are also implemented. Other statistical quality tools available are Cusum and EWMA charts for continuous data, process capability analyses, Pareto charts and cause-and-effect diagrams.
References
Montgomery, D.C. (2000) Introduction to Statistical Quality Control, 4th ed. New York: John Wiley & Sons.

Wetherill, G.B. and Brown, D.W. (1991) Statistical Process Control. New York: Chapman & Hall.

Luca Scrucca
Dipartimento di Scienze Statistiche, Università degli Studi di Perugia, Italy
luca@stat.unipg.it
Least Squares Calculations in R

Timing different approaches
by Douglas Bates
Introduction
The S language encourages users to become programmers as they express new ideas and techniques of data analysis in S. Frequently the calculations in these techniques are expressed in terms of matrix operations. The subject of numerical linear algebra - how calculations with matrices can be carried out accurately and efficiently - is quite different from what many of us learn in linear algebra courses or in courses on linear statistical methods.

Numerical linear algebra is primarily based on decompositions of matrices, such as the LU decomposition, the QR decomposition, and the Cholesky decomposition, that are rarely discussed in linear algebra or statistics courses.
In this article we discuss one of the most common operations in statistical computing, calculating least squares estimates. We can write the problem mathematically as

    β̂ = arg min_β ||y − Xβ||²    (1)
where X is an n × p model matrix (p ≤ n), y is n-dimensional and β is p-dimensional. Most statistics texts state that the solution to (1) is

    β̂ = (X′X)⁻¹ X′y    (2)

when X has full column rank (i.e., the columns of X are linearly independent).
Indeed (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way. That is, they use code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.
General principles
A few general principles of numerical linear algebra are:

1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.

2. Use crossprod(X), not t(X) %*% X, to calculate X′X. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate X′y.

3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix X′X is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
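To make these principles concrete, here is a small simulated example (the matrix X, the response y, and the object names are invented for illustration, not taken from the article) comparing the naive translation of (2) with the solve/crossprod form; both give the same coefficients:

```r
## Small simulated least squares problem; X, y and the
## object names below are illustrative only.
set.seed(1)
X <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)
y <- rnorm(100)

## Naive translation of (2): inverts a matrix needlessly
b.naive <- solve(t(X) %*% X) %*% t(X) %*% y

## Principles 1 and 2: a single system solve, with
## crossprod computing X'X and X'y
b.good <- solve(crossprod(X), crossprod(X, y))

all.equal(as.vector(b.naive), as.vector(b.good))
```

On a problem this small the timing difference is negligible; the point is that the two forms agree numerically while the second avoids the explicit inverse.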
Some timings on a large example
For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well.
> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850  712
> sys.gc.time(naive.sol <- solve(t(mmm) %*%
+     mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00

(The function sys.gc.time is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)
According to the principles above, it should be more effective to compute

> sys.gc.time(cpod.sol <- solve(crossprod(mmm),
+     crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE
Timing results will vary between computers, but in most cases the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating X′X the naive way

> sys.gc.time(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00

because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.
However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result the solution of (1) is performed using a general linear system solver based on an LU decomposition when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly but the code is awkward

> sys.gc.time(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sys.gc.time(chol.sol <- backsolve(ch,
+     forwardsolve(ch, crossprod(mmm, y),
+         upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE
Least squares calculations with Matrix classes

The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition.

> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(Mat.sol <- solve(crossprod(mmg),
+     crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE
Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object so that it can be reused without recalculation.

> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all.

> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(sparse.sol <- solve(crossprod(mm),
+     crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE
The model matrix mm has class "cscMatrix", indicating that it is a compressed, sparse, column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric, sparse, column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case, the decomposition is so fast that it is difficult to determine the difference in the solution times.

> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
Epilogue
Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS), such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or Atlas (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.

Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.

Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A′B are generally faster than those in AB.
Notice that this means that calculating A′B in R as t(A) %*% B instead of crossprod(A, B) causes A to be transposed twice; once in the calculation of t(A) and a second time in the inner loop of the matrix product. The crossprod function does not do any transposition of matrices - yet another reason to use crossprod.
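A quick sanity check (with matrices A and B invented for illustration) confirms that the two-argument form of crossprod agrees with the explicit transpose form:

```r
## A and B are arbitrary conformable matrices, invented for
## illustration; crossprod(A, B) computes t(A) %*% B without
## forming t(A) explicitly.
set.seed(2)
A <- matrix(rnorm(300 * 4), nrow = 300, ncol = 4)
B <- matrix(rnorm(300 * 3), nrow = 300, ncol = 3)

p1 <- t(A) %*% B          # transposes A explicitly
p2 <- crossprod(A, B)     # no explicit transposition

all.equal(p1, p2)
```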
Bibliography

J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).

D.M. Bates
U. of Wisconsin-Madison
bates@wisc.edu
Tools for interactively exploring R packages

by Jianhua Zhang and Robert Gentleman
Introduction
An R package has the required and perhaps additional subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer and they are intended for exploring packages, examples and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.
Description
Our current implementation is in the form of tcltk widgets, and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcltk. Once both tkWidgets and widgetTools have been loaded into the working session the user has full access to these interactive tools.
pExplorer
pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from showing by the widget. The default behavior is to open the first package in the library of locally installed R (.libPaths()) with subdirectories/files named "Meta" and "latex" excluded if nothing is provided by a user.

Figure 1 shows the widget invoked by typing pExplorer("base").
The following tasks can be performed through the interface:

- Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

- Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

- Clicking the Browse button allows users to select a directory containing R packages to achieve the same results as typing in a new path.

- Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.
Figure 1: A snapshot of the widget when pExplorer is called
- Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just being clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

- Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

- Clicking the Try examples button invokes eExplorer, to be discussed next.
eExplorer
eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.

Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.
vExplorer
vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionalities provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of vignettes of the package.

Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget as shown by Figure 4 appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:

- Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

- Clicking the View PDF button shows the pdf version of the vignette currently being explored.

- Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in Results of Execution.

- Clicking the Export to R button exports the result of execution to the global environment of the current R session.

The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.
References

Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28-31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21-24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/
Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base")
Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer()
Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked
The survival package

by Thomas Lumley

This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).
Overview
Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).
Specifying survival data
In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.

In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.
> library(survival)
> data(veteran)
## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+
...
## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
+     diagtime*30 + time, status))
 [1] ( 210, 282 ]  ( 150, 561 ]
 [3] (  90, 318 ]  ( 270, 396 ]
...
 [9] ( 540, 854 ]  ( 180, 280+]
...

The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.
One and two sample summaries
The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.
> data(veteran)
> plot(survfit(Surv(time, status) ~ trt,
+       data = veteran),
+   xlab = "Years since randomisation",
+   xscale = 365, ylab = "% surviving",
+   yscale = 100,
+   col = c("forestgreen", "blue"))
> survdiff(Surv(time, status) ~ trt,
+   data = veteran)

Call:
survdiff(formula = Surv(time, status) ~ trt,
    data = veteran)

       N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394

      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928
Figure 1: Survival distributions for two lung cancer treatments
Proportional hazards models
The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.

Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

    log h(t; z) = log h0(t) + βz.

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.

A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.
> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
+       log(bili) + log(protime) + age + platelet,
+     data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
      log(bili) + log(protime) + age + platelet,
      data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999

             se(coef)     z       p
edtrt        0.300321  3.43 0.00061
log(bili)    0.097771  9.73 0.00000
log(protime) 1.031908  2.80 0.00520
age          0.008489  4.18 0.00003
platelet     0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312
The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.
> plot(survfit(Surv(time, status) ~ edtrt,
+     data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt +
+     ratetable(edtrt = edtrt, bili = bili,
+       platelet = platelet, age = age,
+       protime = protime),
+     data = pbc, subset = trt == -9,
+     ratetable = mayomodel, cohort = TRUE),
+   col = "purple")
The ratetable function in the model formula wrapsthe variables that are used to match the new sampleto the old model
Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.
Figure 2: Observed and predicted survival
The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The cox.zph function provides formal tests and graphical diagnostics for proportional hazards.

> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])
Figure 3: Testing proportional hazards

There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.
Additional features
Computation for the Cox model extends straightforwardly to situations where z can vary over time, and when an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.

Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
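As a rough sketch of a survreg call (the choice of covariates and of the Weibull distribution here is ours, not from the article), a parametric fit to the veteran data used above might look like:

```r
library(survival)
data(veteran)

## Weibull accelerated failure time model for survival time;
## the covariates (trt, karno) are chosen for illustration only
wfit <- survreg(Surv(time, status) ~ trt + karno,
                data = veteran, dist = "weibull")
summary(wfit)
```

Other choices for dist include "lognormal" and "loglogistic"; see ?survreg for the full list.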
More recent additions include penalised likeli-hood estimation for smoothing splines and for ran-dom effects models
The survival package also comes with standardmortality tables for the US and a few individualstates together with functions for comparing the sur-vival of a sample to that of a population and comput-ing person-years of followup
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
useR 2004
The R User Conference

by John Fox, McMaster University, Canada
More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.
The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.
A particularly interesting component of the program were keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:
• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice
• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on data mining and related topics
Slides for the keynote addresses, along with abstracts of other papers presented at useR 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.
An innovation of useR 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.
The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19 and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.
Everyone with whom I spoke, both at and after useR 2004, was very enthusiastic about the conference. We look forward to an even larger useR in 2006.
R Help Desk
Date and Time Classes in R

Gabor Grothendieck and Thomas Petzoldt
Introduction
R-1.9.0 and its contributed packages have a number of datetime (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:
• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).
• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).
• POSIX classes. POSIX classes refer to the two classes POSIXct and POSIXlt and their common super class POSIXt. These support times and dates, including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases, and other applications that involve external communication with systems that also support dates, times, time zones, and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970, GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of
the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIX classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
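As a small illustrative sketch of the internal representations just described (the particular dates are our own examples, not from the text):

```r
library(chron)  # contributed package; install from CRAN first

dt <- as.Date("1970-01-02")               # Date: days since 1970-01-01
dc <- chron("01/02/70", "12:00:00")       # chron: days + fraction of a day
dp <- as.POSIXct("1970-01-02 12:00:00",
                 tz = "GMT")              # POSIXct: seconds since origin, GMT

unclass(dt)  # 1
unclass(dc)  # 1.5 (noon on January 2nd, 1970)
unclass(dp)  # 129600 seconds
```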
When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.

Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.
Other Applications
Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It is possible to set Excel to use either origin, which is why the word "usually" was employed above.
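A minimal sketch of the Windows-origin conversion just described (the serial number 25569 is our own example, chosen because it happens to correspond to the Date origin):

```r
x <- c(25569, 25569.5, 26000)   # Excel (Windows) serial date-times

# Drop the fractional (time) part and shift by Excel's usual origin
d <- as.Date("1899-12-30") + floor(x)
d[1]  # "1970-01-01"
```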
SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).
Time Series
A common use of dates and datetimes are in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly, or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its), and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or time/date package to represent datetimes.
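To sketch the last point (a hedged example of ours using the zoo package, which must be installed from CRAN; the values and dates are invented):

```r
library(zoo)

# An irregular series indexed by Date class dates
z <- zoo(c(1.2, 3.4, 2.2),
         as.Date(c("2004-01-05", "2004-01-07", "2004-01-12")))
z
```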
Input
By default, read.table will read in numeric data, such as 20040115, as numbers and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹
# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y%m%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)
Avoiding Errors
The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:
# default origin
options(chron.origin = c(month=1, day=1, year=1970))

# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)

# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)

# if TRUE then year not displayed
options(chron.simplify = FALSE)
For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example:

¹This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.
orig <- chron("01/01/60")
x <- 0:9  # days since 01/01/60
# chron dates with default origin
orig + x
Regarding POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year, and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:
dp <- seq(Sys.time(), len=10, by="day")
plot(dp, 1:10)
This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.
• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.
• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.
• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
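A small sketch of the convert-to-character-first recipe (the particular datetime is our own example):

```r
dp <- as.POSIXct("2004-06-15 10:00:00")  # interpreted in the current time zone

# Going via character preserves the printed (local) date,
# rather than trusting a direct class-to-class conversion
d <- as.Date(format(dp))
d  # "2004-06-15"
```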
Comparison Table
Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.
Bibliography
Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2):8-11. URL http://CRAN.R-project.org/doc/Rnews/
Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de
now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class="Date")
  chron:   chron(0)
  POSIXct: structure(0, class=c("POSIXt","POSIXct"))

x days since origin (GMT days)
  Date:    structure(floor(x), class="Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class=c("POSIXt","POSIXct"))

difference in days
  Date:    dt1-dt2
  chron:   dc1-dc2
  POSIXct: difftime(dp1,dp2,unit="day") [1]

time zone difference
  POSIXct: dp-as.POSIXct(format(dp),tz="GMT")

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt+1
  chron:   dc+1
  POSIXct: seq(dp0,length=2,by="DSTday")[2] [2]

previous day
  Date:    dt-1
  chron:   dc-1
  POSIXct: seq(dp0,length=2,by="-1 DSTday")[2] [3]

x days since date
  Date:    dt0+floor(x)
  chron:   dc0+x
  POSIXct: seq(dp0,length=2,by=paste(x,"DSTday"))[2] [4]

sequence
  Date:    dts <- seq(dt0,length=10,by="day")
  chron:   dcs <- seq(dc0,length=10)
  POSIXct: dps <- seq(dp0,length=10,by="DSTday")

plot
  Date:    plot(dts,rnorm(10))
  chron:   plot(dcs,rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps),tz="GMT"),rnorm(10)) [5]

every 2nd week
  Date:    seq(dt0,length=3,by="2 week")
  chron:   dc0+seq(0,length=3,by=14)
  POSIXct: seq(dp0,length=3,by="2 week")

first day of month
  Date:    as.Date(format(dt,"%Y-%m-01"))
  chron:   chron(dc)-month.day.year(dc)$day+1
  POSIXct: as.POSIXct(format(dp,"%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt,"%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp,"%m"))

day of week (Sun=0)
  Date:    as.numeric(format(dt,"%w"))
  chron:   as.numeric(dates(dc)-3)%%7
  POSIXct: as.numeric(format(dp,"%w"))

day of year
  Date:    as.numeric(format(dt,"%j"))
  chron:   as.numeric(format(as.Date(dc),"%j"))
  POSIXct: as.numeric(format(dp,"%j"))

mean for each month/year
  Date:    tapply(x,format(dt,"%Y-%m"),mean)
  chron:   tapply(x,format(as.Date(dc),"%Y-%m"),mean)
  POSIXct: tapply(x,format(dp,"%Y-%m"),mean)

12 monthly means
  Date:    tapply(x,format(dt,"%m"),mean)
  chron:   tapply(x,months(dc),mean)
  POSIXct: tapply(x,format(dp,"%m"),mean)

Output

yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp,"%Y-%m-%d")

mm/dd/yy
  Date:    format(dt,"%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp,"%m/%d/%y")

Sat Jul 1, 1970
  Date:    format(dt,"%a %b %d, %Y")
  chron:   format(as.Date(dates(dc)),"%a %b %d, %Y")
  POSIXct: format(dp,"%a %b %d, %Y")

Input

"1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format="y-m-d")
  POSIXct: as.POSIXct(z)

"10/15/1970"
  Date:    as.Date(z,"%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z,"%m/%d/%Y"))

Conversion (tz="")

to Date
  chron:   as.Date(dates(dc))
  POSIXct: as.Date(format(dp))

to chron
  Date:    chron(unclass(dt))
  POSIXct: chron(format(dp,"%m/%d/%Y"),format(dp,"%H:%M:%S")) [6]

to POSIXct
  Date:    as.POSIXct(format(dt))
  chron:   as.POSIXct(paste(as.Date(dates(dc)),times(dc)%%1))

Conversion (tz="GMT")

to Date
  chron:   as.Date(dates(dc))
  POSIXct: as.Date(dp)

to chron
  Date:    chron(unclass(dt))
  POSIXct: as.chron(dp)

to POSIXct
  Date:    as.POSIXct(format(dt),tz="GMT")
  chron:   as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for the % codes.

[1] Can be reduced to dp1-dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp+24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp-24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp+24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps,rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp),tz="GMT")).

Table 1: Comparison table.
Programmers' Niche
A simple class in S3 and S4

by Thomas Lumley
ROC curves
Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).
I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±∞ and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.
drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
    function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
    function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type="l")
}
It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.
drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  plot(mspec, sens, type="l")
}
One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
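As a quick usage sketch (the simulated test values are ours; the function definition from above is repeated so the snippet stands alone):

```r
drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  plot(mspec, sens, type="l")
}

set.seed(42)
D <- rep(0:1, each = 50)    # disease indicator: 50 controls, 50 cases
T <- rnorm(100, mean = D)   # test values shifted upwards for cases
drawROC(T, D)
```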
Classes and methods
Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec, test=TT,
               call=sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}
The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.
plot.ROC <- function(x, type="b", null.line=TRUE,
                     xlab="1-Specificity", ylab="Sensitivity",
                     main=NULL, ...) {
  par(pty="s")
  plot(x$mspec, x$sens, type=type,
       xlab=xlab, ylab=ylab, ...)
  if (null.line)
    abline(0, 1, lty=3)
  if (is.null(main))
    main <- x$call
  title(main=main)
}
A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method.
lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}
A method for identify allows the cutpoint to be found for any point on the curve.
identify.ROC <- function(x, labels=NULL, ..., digits=1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels=labels, ...)
}
An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
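One possible sketch of that exercise (our own, not from the article) applies the trapezoidal rule to the stored coordinates, padding with the (0,0) and (1,1) endpoints:

```r
# Area under the ROC curve by the trapezoidal rule.
# x is assumed to be an ROC object as constructed above,
# with $mspec and $sens ordered along the curve.
AUC <- function(x) {
  mspec <- c(0, x$mspec, 1)
  sens <- c(0, x$sens, 1)
  # sum of trapezoid areas: width * average of the two heights
  sum(diff(mspec) * (head(sens, -1) + tail(sens, -1)) / 2)
}
```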
Generalising the class

The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.
ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
               test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}
The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.
ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type="response"))
  disease <- disease*weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
               test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}
S4 classes and methods

In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).
As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions.
setClass("ROC",
  representation(sens="numeric", mspec="numeric",
                 test="numeric", call="call"),
  validity=function(object) {
    length(object@sens) == length(object@mspec) &&
    length(object@sens) == length(object@test)
  })
The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new
function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  new("ROC", sens=sens, mspec=mspec,
      test=TT, call=sys.call())
}
Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code.
setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })
The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.
setMethod("plot", c("ROC", "missing"),
  function(x, y, type="b", null.line=TRUE,
           xlab="1-Specificity",
           ylab="Sensitivity",
           main=NULL, ...) {
    par(pty="s")
    plot(x@mspec, x@sens, type=type,
         xlab=xlab, ylab=ylab, ...)
    if (null.line)
      abline(0, 1, lty=3)
    if (is.null(main))
      main <- x@call
    title(main=main)
  })
Creating a method for lines appears to work the same way.
setMethod("lines", "ROC",
  function(x, ...) {
    lines(x@mspec, x@sens, ...)
  })
In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before
> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>
and after the setMethod call
> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not
affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.
setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type="response"))
    disease <- disease*weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    new("ROC", sens=sens, mspec=mspec,
        test=TT, call=sys.call())
  })
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.
setGeneric("identify", signature=c("x"))

setMethod("identify", "ROC",
  function(x, labels=NULL, ..., digits=1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens, labels=labels)
  })
Discussion
Creating a simple class and methods requires verysimilar code whether the S3 or S4 system is used anda similar incremental design strategy is possible TheS3 and S4 method system can coexist peacefully evenwhen S4 methods need to be defined for a functionthat already has S3 methods
This example has not used inheritance where theS3 and S4 systems differ more dramatically Judgingfrom the available examples of S4 classes inheritanceseems most useful in defining data structures ratherthan objects representing statistical calculations Thismay be because inheritance extends a class by creat-ing a special case but statisticians more often extenda class by creating a more general case Reusing codefrom say linear models in creating generalised lin-ear models is more an example of delegation than in-heritance It is not that a generalised linear modelis a linear model more that it has a linear model(from the last iteration of iteratively reweighted leastsquares) associated with it
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R

by the R Core Team
New features in 1.9.1

• as.Date() now has a method for POSIXlt objects.

• mean() has a method for difftime objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.
User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has
been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics), or called as graphics::ps.options, or, better, set as a hook (see ?setHook).

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.
New features in 190
bull $ $lt- [[ [[lt- can be applied to environ-ments Only character arguments are allowedand no partial matching is done The semanticsare basically that of getassign to the environ-ment with inherits=FALSE
• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.
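A brief illustrative sketch of the new class in use (not newsletter code):

```r
# The "Date" class stores dates without times; arithmetic is in days.
d <- as.Date("2004-06-01")
class(d)                              # "Date"
d + 30                                # "2004-07-01"
seq(d, by = "month", length.out = 3)  # monthly sequence of dates
weekdays(d)                           # day-of-week name (locale-dependent)
```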
• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs: they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).
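For example (an illustrative sketch of the definitions above):

```r
# factorial(x) is gamma(x+1); lfactorial(x) is lgamma(x+1).
factorial(5)                        # 120
all.equal(factorial(5), gamma(6))   # TRUE
all.equal(lfactorial(5), log(120))  # TRUE
```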
• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.
• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)
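A small sketch of the new helpers (illustrative only):

```r
# head()/tail() return the first or last parts of vectors, matrices,
# data frames, and similar objects.
head(1:10, 3)                  # 1 2 3
tail(letters, 2)               # "y" "z"
head(data.frame(x = 1:5), 2)   # first two rows
```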
• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.
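For instance (a sketch of the usage described above):

```r
# mget() looks up several names in an environment at once and
# returns a named list of the values.
e <- new.env()
assign("a", 1, envir = e)
assign("b", 2, envir = e)
mget(c("a", "b"), envir = e)   # list(a = 1, b = 2)
```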
• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

If both each and length.out have been specified, it now recycles, rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).
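The rep() changes above can be sketched as follows (illustrative code, not from the newsletter):

```r
# S-compatible rep() behaviour in 1.9.0.
rep(numeric(0), length.out = 3)      # length 3, not 0
rep(1:2, each = 2, length.out = 5)   # 1 1 2 2 1 (recycled, not NA-padded)
rep(1:3, times = 2, length.out = 4)  # 1 2 3 1 (times ignored)
```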
• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.
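For example (a sketch; sh-style quoting shown explicitly):

```r
# shQuote() quotes a string for use on an OS shell command line.
shQuote("abc", type = "sh")                      # "'abc'"
cmd <- paste("ls", shQuote("my file.txt", type = "sh"))
cmd                                              # ls 'my file.txt'
```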
• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split = "" is optimized.
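The fixed argument can be sketched as follows (illustrative):

```r
# With fixed = TRUE the split string is taken literally, not as a regex.
strsplit("a.b.c", ".", fixed = TRUE)[[1]]   # "a" "b" "c"
# Without fixed = TRUE, "." is a regex metacharacter matching any character.
```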
• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See setMethod.
• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport(), and seekViewport() to allow creation and navigation of a viewport tree (rather than just a viewport stack).

– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.

– Added vpList(), vpStack(), vpTree(), and current.vpTree() to allow creation of viewport bundles that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of a viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator :: in later versions).

– Added just argument to grid.layout() to allow justification of the layout relative to the parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).
– Allowed the vp slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

pushViewport(viewport(w=.5, h=.5, name="A"))
grid.rect()
pushViewport(viewport(w=.5, h=.5, name="B"))
grid.rect(gp=gpar(col="grey"))
upViewport(2)
grid.rect(vp="A", gp=gpar(fill="red"))
grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))
– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site: http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

* the grobwidth and grobheight units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

In addition, the following features are now possible with grobs:

* grobs now save()/load() like any normal R object.

* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a childrenvp slot, which is a viewport which is pushed, and then "up"ed, before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

– Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth", and "grobheight" unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for the viewport clip arg are now:

"on" = clip to the viewport (was TRUE)
"inherit" = clip to whatever the parent says (was FALSE)
"off" = turn off clipping

Still accept logical values (and NA maps to "off").
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories, as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector), rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions', and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp(), and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(x, deriv=2) and psigamma(x, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5, and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blas=goto to use K. Goto's optimized BLAS will now work.

The above lists only new features: see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN
by Kurt Hornik
New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.
DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit model via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation, and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows, in particular, parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: "The use of the language interface of R: two examples for modelling water flux and solute transport". By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.
crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.
debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S functions (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns, ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between- versus within-chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal and external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so-called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen, ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-PLUS times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns, ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.
Other changes
• Packages mclust1998, netCDF, and serialize were moved from the main CRAN section to the Archive.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R News ISSN 1609-3631
Vol. 4/1, June 2004
R Foundation News
by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)
New supporting institutions
• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA
Editorial Board:
Douglas Bates and Paul Murrell
Editor Programmer's Niche:
Bill Venables
Editor Help Desk:
Uwe Ligges
Email of editors and editorial board:
firstname.lastname@R-project.org
R News is a publication of the R Foundation for Statistical Computing. Communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).
R Project Homepage:
http://www.R-project.org/
This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
matically as

$$\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|^2 \quad (1)$$

where $X$ is an $n \times p$ model matrix ($p \le n$), $y$ is $n$-dimensional and $\beta$ is $p$-dimensional. Most statistics texts state that the solution to (1) is

$$\hat{\beta} = \left(X'X\right)^{-1} X'y \quad (2)$$

when $X$ has full column rank (i.e. the columns of $X$ are linearly independent).
Indeed (2) is a mathematically correct way of writing a least squares solution and, all too frequently, we see users of R calculating a least squares solution in exactly this way. That is, they use code like solve(t(X) %*% X) %*% t(X) %*% y. If the calculation is to be done only a few times on relatively small data sets then this is a reasonable way to do the calculation. However, it is rare that code gets used only for small data sets or only a few times. By following a few rules we can make this code faster and more stable.
General principles
A few general principles of numerical linear algebra are:
1. Don't invert a matrix when you only need to solve a single system of equations. That is, use solve(A, b) instead of solve(A) %*% b.

2. Use crossprod(X), not t(X) %*% X, to calculate $X'X$. Similarly, use crossprod(X, y), not t(X) %*% y, to calculate $X'y$.

3. If you have a matrix with a special form, consider using a decomposition suited to that form. Because the matrix $X'X$ is symmetric and positive semidefinite, its Cholesky decomposition can be used to solve systems of equations.
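The three rules can be checked on a small sketch (the simulated matrix and response below are invented purely for illustration; only base R functions are used):

```r
## Simulated example (hypothetical data) illustrating the principles above
set.seed(1)
X <- matrix(rnorm(200 * 5), 200, 5)   # 200 x 5 model matrix
y <- rnorm(200)

## Principle 2: crossprod() forms X'X and X'y without explicit transposes
A <- crossprod(X)                     # same value as t(X) %*% X
b <- crossprod(X, y)                  # same value as t(X) %*% y

## Principle 1: solve(A, b) rather than solve(A) %*% b
beta1 <- solve(A, b)
beta2 <- solve(A) %*% b               # works, but slower and less stable
stopifnot(all.equal(beta1, beta2))

## Principle 3: A = X'X is symmetric positive definite, so use Cholesky
R <- chol(A)                          # A = R'R with R upper triangular
beta3 <- backsolve(R, forwardsolve(R, b, upper.tri = TRUE,
                                   transpose = TRUE))
stopifnot(all.equal(drop(beta1), drop(beta3)))
```

On a toy problem like this all three routes agree to machine precision; the differences in speed and stability only become visible on large or ill-conditioned problems such as the one timed below.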
Some timings on a large example
For a large, ill-conditioned least squares problem, such as that described in Koenker and Ng (2003), the literal translation of (2) does not perform well:
> library(Matrix)
> data(mm, y)
> mmm = as(mm, "matrix")
> dim(mmm)
[1] 1850  712
> sys.gc.time(naive.sol <- solve(t(mmm) %*%
+                mmm) %*% t(mmm) %*% y)
[1] 3.64 0.18 4.11 0.00 0.00
(The function sys.gc.time is a modified version of system.time that calls gc() before initializing the timer, thereby providing more consistent timing results. The data object mm is a sparse matrix, as described below, and we must coerce it to the dense matrix mmm to perform this calculation.)
According to the principles above, it should be more effective to compute
> sys.gc.time(cpod.sol <- solve(crossprod(mmm),
+                crossprod(mmm, y)))
[1] 0.63 0.07 0.73 0.00 0.00
> all.equal(naive.sol, cpod.sol)
[1] TRUE
Timing results will vary between computers but in most cases the crossprod form of the calculation is at least four times as fast as the naive calculation. In fact, the entire crossprod solution is usually faster than calculating $X'X$ the naive way:
> sys.gc.time(t(mmm) %*% mmm)
[1] 0.83 0.03 0.86 0.00 0.00
because the crossprod function applied to a single matrix takes advantage of symmetry when calculating the product.
However, the value returned by crossprod does not retain the information that the product is symmetric (and positive semidefinite). As a result the solution of (1) is performed using a general linear system solver based on an LU decomposition when it would be faster, and more stable numerically, to use a Cholesky decomposition. The Cholesky decomposition could be used explicitly but the code is awkward:
> sys.gc.time(ch <- chol(crossprod(mmm)))
[1] 0.48 0.02 0.52 0.00 0.00
> sys.gc.time(chol.sol <- backsolve(ch,
+      forwardsolve(ch, crossprod(mmm, y),
+                   upper = TRUE, trans = TRUE)))
[1] 0.12 0.05 0.19 0.00 0.00
> all.equal(chol.sol, naive.sol)
[1] TRUE
Least squares calculations with Matrix classes
The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition.
> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(Mat.sol <- solve(crossprod(mmg),
+                crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE
Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object so that it can be reused without recalculation.
> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all:
> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sys.gc.time(sparse.sol <- solve(crossprod(mm),
+                crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE
The model matrix mm has class "cscMatrix", indicating that it is a compressed, sparse, column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric, sparse, column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful, the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case, the decomposition is so fast that it is difficult to determine the difference in the solution times.
> xpx = crossprod(mm)
> class(xpx)
[1] "sscMatrix"
attr(,"package")
[1] "Matrix"
> xpy = crossprod(mm, y)
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
> sys.gc.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
Epilogue
Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS), such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or Atlas (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.

Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.

Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation $A'B$ are generally faster than those in $AB$.

Notice that this means that calculating $A'B$ in R as t(A) %*% B instead of crossprod(A, B) causes A to be transposed twice, once in the calculation of t(A) and a second time in the inner loop of the matrix
product. The crossprod function does not do any transposition of matrices, yet another reason to use crossprod.
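The equivalence (though not the exact timings, which depend on the BLAS in use and the machine) is easy to verify; a minimal sketch with arbitrarily chosen matrix sizes:

```r
## Compare t(A) %*% B with crossprod(A, B): same result, but
## crossprod never materializes the transpose of A.
set.seed(1)
A <- matrix(rnorm(1000 * 200), 1000, 200)
B <- matrix(rnorm(1000 * 200), 1000, 200)

print(system.time(P1 <- t(A) %*% B))       # explicit transpose
print(system.time(P2 <- crossprod(A, B)))  # no transpose
stopifnot(all.equal(P1, P2))

## With a single argument, crossprod(A) also exploits the symmetry of A'A
stopifnot(all.equal(crossprod(A), t(A) %*% A))
```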
Bibliography
J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.

K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.

R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.

R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).
D. M. Bates
U. of Wisconsin-Madison
bates@wisc.edu
Tools for interactively exploring R packages
by Jianhua Zhang and Robert Gentleman
Introduction
An R package has the required and perhaps additional subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer and they are intended for exploring packages, examples and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.
Description
Our current implementation is in the form of tcltk widgets and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcltk. Once both tkWidgets and widgetTools have been loaded into the working session the user has full access to these interactive tools.
pExplorer
pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from showing by the widget. The default behavior is to open the first package in the library of locally installed R (.libPaths()) with subdirectories/files named "Meta" and "latex" excluded if nothing is provided by a user.
Figure 1 shows the widget invoked by typing pExplorer("base").
The following tasks can be performed through the interface:
• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.

• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.

• Clicking the Browse button allows users to select a directory containing R packages to achieve the same results as typing in a new path.

• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.
Figure 1: A snapshot of the widget when pExplorer is called
• Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just being clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, which is discussed next.
eExplorer
eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.
Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.
vExplorer
vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionalities provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of vignettes of the package.
Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget as shown by Figure 4 appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:
• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the pdf version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in Results of Execution.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.
The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.
References
Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28-31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21-24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.
Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base")
Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer()
Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked
The survival package
by Thomas Lumley
This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).
Overview
Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).
Specifying survival data
In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.
In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.
> library(survival)
> data(veteran) ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+
## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
+                    diagtime*30 + time, status))
 [1] ( 210, 282 ]  ( 150, 561 ]
 [3] (  90, 318 ]  ( 270, 396 ]
 [9] ( 540, 854 ]  ( 180, 280+]
The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.
One and two sample summaries
The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.
Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.
> data(veteran)
> plot(survfit(Surv(time, status) ~ trt, data = veteran),
+      xlab = "Years since randomisation",
+      xscale = 365, ylab = "% surviving", yscale = 100,
+      col = c("forestgreen", "blue"))
> survdiff(Surv(time, status) ~ trt, data = veteran)
Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

       N Observed Expected (O-E)^2/E (O-E)^2/V
trt=1 69       64     64.5   0.00388   0.00823
trt=2 68       64     63.5   0.00394   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928
Figure 1: Survival distributions for two lung cancer treatments (years since randomisation vs. % surviving; Standard and New treatment curves)
Proportional hazards models
The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.
Writing $h(t, z)$ for the hazard at time $t$ with predictor variables $Z = z$, the Cox model specifies

$$\log h(t, z) = \log h_0(t) + \beta z.$$

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving $h_0(t)$ unspecified, and computation is, if anything, easier than for parametric models.
A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.
> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
+                      log(bili) + log(protime) +
+                      age + platelet,
+                    data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
    log(bili) + log(protime) + age + platelet,
    data = pbc, subset = trt > 0)

                 coef exp(coef) se(coef)     z       p
edtrt         1.02980     2.800 0.300321  3.43 0.00061
log(bili)     0.95100     2.588 0.097771  9.73 0.00000
log(protime)  2.88544    17.911 1.031908  2.80 0.00520
age           0.03544     1.036 0.008489  4.18 0.00003
platelet     -0.00128     0.999 0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312
The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.
> plot(survfit(Surv(time, status) ~ edtrt,
+              data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt +
+        ratetable(edtrt = edtrt, bili = bili,
+                  platelet = platelet, age = age,
+                  protime = protime),
+        data = pbc, subset = trt == -9,
+        ratetable = mayomodel, cohort = TRUE),
+       col = "purple")
The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.
Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.
Figure 2: Observed and predicted survival
The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The
cox.zph function provides formal tests and graphical diagnostics for proportional hazards:
> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])
Figure 3: Testing proportional hazards (smoothed estimate of Beta(t) for edtrt against time)
There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.
Additional features
Computation for the Cox model extends straightforwardly to situations where z can vary over time and where an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.

Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.

More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of follow-up.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
useR 2004
The R User Conference

by John Fox, McMaster University, Canada
More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.
The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.
A particularly interesting component of the program was the series of keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:
• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice
• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on data mining and related topics
Slides for the keynote addresses, along with abstracts of other papers presented at useR 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.
An innovation of useR 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.
The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19 and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.
Everyone with whom I spoke, both at and after useR 2004, was very enthusiastic about the conference. We look forward to an even larger useR in 2006.
R Help Desk
Date and Time Classes in R
Gabor Grothendieck and Thomas Petzoldt
Introduction
R-1.9.0 and its contributed packages have a number of date/time (i.e. date or date/time) classes. In particular, the following three classes are discussed in this article:
• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).
• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).
• POSIX classes. POSIX classes refer to the two classes POSIXct and POSIXlt and their common super class POSIXt. These support times and dates including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases, and other applications that involve external communication with systems that also support dates, times, time zones, and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970 GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of
R News ISSN 1609-3631
Vol 41 June 2004 30
the two which is normally not accessed di-rectly by the user We shall focus mostly onPOSIXct here More information on the POSIXtclasses are found in DateTimeClasses and instrptime Additional insight can be obtainedby inspecting the internals of POSIX datetimeobjects using unclass(x) where x is of anyPOSIX datetime class (This works for Date andchron too) The POSIX classes are discussed inRipley and Hornik (2001)
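For example, unclassing exposes both representations (a sketch; the component values shown assume the GMT conversions given):

```r
dp <- as.POSIXct("1970-01-02 00:00:00", tz = "GMT")
unclass(dp)    # 86400 seconds since 1970-01-01 GMT

dlt <- as.POSIXlt(dp, tz = "GMT")
# a list of components: sec, min, hour, mday, mon, year, wday, yday, isdst
unclass(dlt)[c("mday", "mon", "year")]   # day 2, month 0 (= January), year 70 (= 1970)
```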
When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.
Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.
Other Applications
Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron respectively. It's possible to set Excel to use either origin, which is why the word "usually" was employed above.
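For instance, with a hypothetical vector of Windows-origin Excel serial numbers:

```r
x <- c(38139.0, 38139.5)            # Excel (Windows) values for 2004-06-01 00:00 and 12:00
as.Date("1899-12-30") + floor(x)    # both map to the Date 2004-06-01

library(chron)                      # chron also keeps the fractional time of day
chron("12/30/1899") + x
```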
SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).
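If one has such values as raw numbers rather than as a foreign file, the same origin-shifting idea used for spreadsheets works directly; a sketch for SAS-style datetimes (the values here are made up):

```r
x <- c(0, 86400)                           # seconds since 1960-01-01
as.POSIXct("1960-01-01", tz = "GMT") + x   # 1960-01-01 and 1960-01-02, GMT
```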
Time Series
A common use of dates and datetimes is in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its) and zoo (in package zoo). ts uses a scheme of its own for dates. irts and its use POSIXct dates to represent datetimes. zoo can use any date or timedate package to represent datetimes.
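A minimal ts sketch showing its own date scheme (years plus fractions, not a date class):

```r
z <- ts(rnorm(24), start = c(2003, 1), frequency = 12)  # monthly, from Jan 2003
time(z)[1:3]                   # years plus fractions: 2003.000, 2003.083, ...
window(z, start = c(2004, 1))  # subset using the same scheme
```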
Input
By default, read.table will read in numeric data such as 20040115 as numbers and will read non-numeric data such as 12/15/04 or 2004-12-15 as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹
# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y%m%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)
Avoiding Errors
The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.
With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:
# default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))

# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)

# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)

# if TRUE then year not displayed
options(chron.simplify = FALSE)
For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,
1. This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.
orig <- chron("01/01/60")
x <- 0:9   # days since 01/01/60
# chron dates with default origin
orig + x
Regarding POSIX classes, the user should be aware of the following:
• time zone: Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:
dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)
This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in Table 1 of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.
• OS: Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.
• POSIXlt: The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.
• conversion: Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
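As an illustration of the convert-through-character idiom (a sketch; the chron package must be loaded for the chron conversion):

```r
library(chron)
dp <- as.POSIXct("2004-06-01 12:00:00")   # in the current time zone
# formatting to character first pins down the clock time,
# so the time zone cannot shift it during conversion
chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S"))
as.POSIXct(format(dp), tz = "GMT")        # same clock time, now GMT
```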
Comparison Table
Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.
Bibliography
Alzola, C. F. and Harrell, F. E. (2002). An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993). Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001). Date-time classes. R News, 1(2):8-11. URL http://CRAN.R-project.org/doc/Rnews/.
Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de
Table 1: Comparison table.

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for the % codes.

now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz = "GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class = "Date")
  chron:   chron(0)
  POSIXct: structure(0, class = c("POSIXt", "POSIXct"))

x days since origin
  Date:    structure(floor(x), class = "Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class = c("POSIXt", "POSIXct"))  # GMT

difference in days
  Date:    dt1 - dt2
  chron:   dc1 - dc2
  POSIXct: difftime(dp1, dp2, unit = "day") [1]

time zone difference
  POSIXct: dp - as.POSIXct(format(dp), tz = "GMT")

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt + 1
  chron:   dc + 1
  POSIXct: seq(dp0, length = 2, by = "DSTday")[2] [2]

previous day
  Date:    dt - 1
  chron:   dc - 1
  POSIXct: seq(dp0, length = 2, by = "-1 DSTday")[2] [3]

x days since date
  Date:    dt0 + floor(x)
  chron:   dc0 + x
  POSIXct: seq(dp0, length = 2, by = paste(x, "DSTday"))[2] [4]

sequence
  Date:    dts <- seq(dt0, length = 10, by = "day")
  chron:   dcs <- seq(dc0, length = 10)
  POSIXct: dps <- seq(dp0, length = 10, by = "DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz = "GMT"), rnorm(10)) [5]

every 2nd week
  Date:    seq(dt0, length = 3, by = "2 week")
  chron:   dc0 + seq(0, length = 3, by = 14)
  POSIXct: seq(dp0, length = 3, by = "2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc) - month.day.year(dc)$day + 1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun = 0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc) - 3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean for each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output as yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output as mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output as "Sat Jul 1 1970"
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input "1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format = "y-m-d")
  POSIXct: as.POSIXct(z)

Input "10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz = "")
  chron to Date:     as.Date(dates(dc))
  POSIXct to Date:   as.Date(format(dp))
  Date to chron:     chron(unclass(dt))
  POSIXct to chron:  chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]
  Date to POSIXct:   as.POSIXct(format(dt))
  chron to POSIXct:  as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz = "GMT")
  chron to Date:     as.Date(dates(dc))
  POSIXct to Date:   as.Date(dp)
  Date to chron:     chron(unclass(dt))
  POSIXct to chron:  as.chron(dp)
  Date to POSIXct:   as.POSIXct(format(dt), tz = "GMT")
  chron to POSIXct:  as.POSIXct(dc)

[1] Can be reduced to dp1 - dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp + 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp - 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp + 24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz = "GMT")).
Programmers' Niche: A simple class in S3 and S4
by Thomas Lumley
ROC curves
Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.
The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 − specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).
I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.
Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±∞ and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.
drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type = "l")
}
It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.
There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.
drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  plot(mspec, sens, type = "l")
}
One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
Classes and methods
Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.
The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that it calls print.default, which spews the internal representation of the object all over the screen.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens = sens, mspec = mspec, test = TT,
               call = sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}
The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.
plot.ROC <- function(x, type = "b", null.line = TRUE,
                     xlab = "1-Specificity", ylab = "Sensitivity",
                     main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
       xlab = xlab, ylab = ylab, ...)
  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}
A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:
lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}
A method for identify allows the cutpoint to be found for any point on the curve:
identify.ROC <- function(x, labels = NULL, ..., digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}
An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
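One possible solution, sketched here with the trapezoidal rule on the sens and mspec components defined above (not part of the original exercise answer):

```r
AUC <- function(x) {
  # append the (0,0) and (1,1) endpoints, then apply the trapezoidal rule
  # to the (1-specificity, sensitivity) points, which are already sorted
  mspec <- c(0, x$mspec, 1)
  sens  <- c(0, x$sens, 1)
  sum(diff(mspec) * (sens[-1] + sens[-length(sens)])/2)
}
```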
Generalising the class
The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.
ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.
ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
S4 classes and methods
In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).
As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.
The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:
setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })
The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  new("ROC", sens = sens, mspec = mspec,
      test = TT, call = sys.call())
}
Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:
setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })
The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.
The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.
setMethod("plot", c("ROC", "missing"),
  function(x, y, type = "b", null.line = TRUE,
           xlab = "1-Specificity",
           ylab = "Sensitivity",
           main = NULL, ...) {
    par(pty = "s")
    plot(x@mspec, x@sens, type = type,
         xlab = xlab, ylab = ylab, ...)
    if (null.line)
      abline(0, 1, lty = 3)
    if (is.null(main))
      main <- x@call
    title(main = main)
  })
Creating a method for lines appears to work the same way:
setMethod("lines", "ROC",
  function(x, ...) {
    lines(x@mspec, x@sens, ...)
  })
In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before
> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>
and after the setMethod call
> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function, and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.
setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type = "response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    new("ROC", sens = sens, mspec = mspec,
        test = TT, call = sys.call())
  })
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.
setGeneric("identify", signature = c("x"))
setMethod("identify", "ROC",
  function(x, ..., labels = NULL, digits = 1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens, labels = labels, ...)
  })
Discussion
Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.
This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R
by the R Core Team

New features in 1.9.1
• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc look for variables in their formulae.
User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

  Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

  Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

  Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook -- see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.
New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace() parallelling fixInNamespace.

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial() defined as gamma(x+1) and, for S-PLUS compatibility, lfactorial() defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.
• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

  The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

  It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

  ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for S compatibility.

  If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

  If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

  The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split="" is optimized.

• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.
R News ISSN 1609-3631
Vol. 4/1, June 2004
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)
• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.
• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".
• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).
• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.
• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().
• Package stats4 exports S4 generics for AIC() and BIC().
• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.
• Added experimental support for conditionals in NAMESPACE files.
• Added as.list.environment to coerce environments to lists (efficiently).
• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)
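A minimal sketch of its use (example mine; the default summary function is sum):

```r
## add row and column totals to a two-way table
tab <- table(gear = mtcars$gear, cyl = mtcars$cyl)
tab.m <- addmargins(tab)
## the new "Sum" row/column hold the marginal totals
stopifnot(tab.m["Sum", "Sum"] == sum(tab),
          all(tab.m["Sum", colnames(tab)] == colSums(tab)))
```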
• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.
• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.
• Changes to package 'grid':
– Renamed push/pop.viewport() to push/popViewport().
– Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).
– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.
– Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.
– Added vpPath() to allow specification of viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use.
NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator :: in later versions).
– Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).
– Allowed the vp slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:
pushViewport(viewport(w=.5, h=.5, name="A"))
grid.rect()
pushViewport(viewport(w=.5, h=.5, name="B"))
grid.rect(gp=gpar(col="grey"))
upViewport(2)
grid.rect(vp="A", gp=gpar(fill="red"))
grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))
– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html
– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour the following changes were necessary:
grobs all now have a name slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).
grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)
the grobwidth and grobheight units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob "pointed to" is modified.
In addition, the following features are now possible with grobs:
grobs now save()/load() like any normal R object.
many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).
there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a childrenvp slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.
there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.
there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.
– Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).
– Allowed viewports to "turn off" clipping altogether. Possible settings for viewport clip arg are now:
"on": clip to the viewport (was TRUE)
"inherit": clip to whatever parent says (was FALSE)
"off": turn off clipping
Still accept logical values (and NA maps to "off").
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.
• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.
• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.
• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.
• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.
• Vignette dependencies can now be checked and obtained via vignetteDepends.
• Option "repositories" to list URLs for package repositories added.
• package.description() has been replaced by packageDescription().
• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.
• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.
• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.
• hsv2rgb and rgb2hsv are newly in the C API.
• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.
• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.
• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.
• print.coefmat() is defunct, replaced by printCoefmat().
• codes() and codes<-() are defunct.
• anovalist.lm (replaced in 1.2.0) is now defunct.
• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.
• print.atomic() is defunct.
• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).
• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.
• La.eigen() is deprecated; now eigen() uses LAPACK by default.
• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(x, deriv=2) and psigamma(x, deriv=3).
• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.
• package.contents() and package.description() have been deprecated.
• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.
The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.
The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.
• configure no longer lists bzip2 and PCRE as 'additional capabilities' as all builds of R have had them since 1.7.0.
• --with-blas="goto" to use K. Goto's optimized BLAS will now work.
The above lists only new features: see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN
by Kurt Hornik
New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.
BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.
BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.
DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.
GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.
HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.
Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.
MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.
MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.
NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.
R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.
RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.
RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.
SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.
SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.
SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.
assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.
bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Grün.
betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.
circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.
crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.
debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.
distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.
dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.
energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. Energy-statistics concept based on a generalization of Newton's potential energy is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.
evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.
evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S functions (EVIS) by Alexander McNeil; R port by Alec Stephenson.
fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.
fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.
fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.
fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.
fortunes R Fortunes. By Achim Zeileis, fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.
gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.
gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.
gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.
hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.
haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.
hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.
httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.
impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
kernlab Kernel-based machine learning methods including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.
klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.
knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.
kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.
mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values such as geophysical fields. By S. M. Barbosa.
mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.
mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-versus-within chain approach of Gelman and Rubin. By Gregory R. Warnes.
mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.
mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.
multinomRob overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.
mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by debug package. By Mark V. Bravington.
mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.
pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.
qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.
race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates are performed in parallel on different hosts. By Mauro Birattari.
rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).
ref small package with functions for creating references, reading from and writing to references, and a memory efficient refdata type that transparently encapsulates matrices and data frames. By Jens Oehlschlägel.
regress Functions to fit Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.
rgl 3D visualization device (OpenGL). By Daniel Adler.
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.
rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.
rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.
sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.
seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.
setRNG Set reproducible random number generator in R and S. By Paul Gilbert.
sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.
skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.
spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem and for spatial data that arise on irregularly shaped regions like counties or zipcodes by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.
spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.
spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.
supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.
svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.
treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.
urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.
urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.
verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.
zoo A class with methods for totally ordered indexed observations such as irregular time series. By Achim Zeileis, Gabor Grothendieck.
Other changes
• Packages mclust1998, netCDF, and serialize were moved from the main CRAN section to the Archive.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News
by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)
• Emanuele De Rinaldis (Italy)
• Norm Phillips (Canada)
New supporting institutions
• Breast Center at Baylor College of Medicine, Houston, Texas, USA
• Dana-Farber Cancer Institute, Boston, USA
• Department of Biostatistics, Vanderbilt University School of Medicine, USA
• Department of Statistics & Actuarial Science, University of Iowa, USA
• Division of Biostatistics, University of California, Berkeley, USA
• Norwegian Institute of Marine Research, Bergen, Norway
• TERRA Lab, University of Regina - Department of Geography, Canada
• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232
USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
Least squares calculations with Matrix classes
The Matrix package uses the S4 class system (Chambers, 1998) to retain information on the structure of matrices returned by intermediate calculations. A general matrix in dense storage, created by the Matrix function, has class "geMatrix". Its crossproduct has class "poMatrix". The solve methods for the "poMatrix" class use the Cholesky decomposition.
> mmg = as(mm, "geMatrix")
> class(crossprod(mmg))
[1] "poMatrix"
attr(,"package")
[1] "Matrix"
> system.time(Mat.sol <- solve(crossprod(mmg),
+     crossprod(mmg, y)))
[1] 0.46 0.04 0.50 0.00 0.00
> all.equal(naive.sol, as(Mat.sol, "matrix"))
[1] TRUE
Furthermore, any method that calculates a decomposition or factorization stores the resulting factorization with the original object so that it can be reused without recalculation.
> xpx = crossprod(mmg)
> xpy = crossprod(mmg, y)
> system.time(solve(xpx, xpy))
[1] 0.08 0.01 0.09 0.00 0.00
> system.time(solve(xpx, xpy))
[1] 0.01 0.00 0.01 0.00 0.00
The model matrix mm is sparse; that is, most of the elements of mm are zero. The Matrix package incorporates special methods for sparse matrices, which produce the fastest results of all.
> class(mm)
[1] "cscMatrix"
attr(,"package")
[1] "Matrix"
> sys.gctime(sparse.sol <- solve(crossprod(mm),
+            crossprod(mm, y)))
[1] 0.03 0.00 0.04 0.00 0.00
> all.equal(naive.sol, as(sparse.sol, "matrix"))
[1] TRUE
The model matrix mm has class "cscMatrix", indicating that it is a compressed, sparse, column-oriented matrix. This is the usual representation for sparse matrices in the Matrix package. The class of crossprod(mm) is "sscMatrix", indicating that it is a symmetric, sparse, column-oriented matrix. The solve methods for such matrices first attempt to form a Cholesky decomposition. If this is successful, the decomposition is retained as part of the object and can be reused in solving other systems based on this matrix. In this case the decomposition is so fast that it is difficult to determine the difference in the solution times.
gt xpx = crossprod(mm)
gt class(xpx)
[1] sscMatrixattr(package)[1] Matrix
gt xpy = crossprod(mm y)
gt sysgctime(solve(xpx xpy))
[1] 001 000 001 000 000
gt sysgctime(solve(xpx xpy))
[1] 001 000 001 000 000
Epilogue
Another technique for speeding up linear algebra calculations is to use highly optimized libraries of Basic Linear Algebra Subroutines (BLAS), such as Kazushige Goto's BLAS (Goto and van de Geijn, 2002) or Atlas (Whaley et al., 2001). As described in R Development Core Team (2004), on many platforms R can be configured to use these enhanced BLAS libraries.
Most of the fast BLAS implementations focus on the memory access patterns within the calculations. In particular, they structure the calculation so that on-processor "cache" memory is used efficiently in basic operations such as dgemm for matrix-matrix multiplication.
Goto and van de Geijn (2002) describe how they calculate the matrix product AB by retrieving and storing parts of the transpose of A. In the product it is the rows of A and the columns of B that will be accessed sequentially. In the Fortran column-major storage convention for matrices, which is what is used in R, elements in the same column are in adjacent storage locations. Thus the memory access patterns in the calculation A'B are generally faster than those in AB.
Notice that this means that calculating A'B in R as t(A) %*% B instead of crossprod(A, B) causes A to be transposed twice: once in the calculation of t(A) and a second time in the inner loop of the matrix
R News ISSN 1609-3631
product. The crossprod function does not do any transposition of matrices – yet another reason to use crossprod.
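As a rough sketch (not from the article; any speed difference depends on the BLAS in use), the two formulations can be compared directly, and in either case they must agree numerically:

```r
## Compare t(A) %*% B, which transposes A explicitly, with
## crossprod(A, B), which avoids the transposition.
set.seed(1)
A <- matrix(rnorm(400 * 300), 400, 300)
B <- matrix(rnorm(400 * 200), 400, 200)

t1 <- system.time(r1 <- t(A) %*% B)       # explicit transpose
t2 <- system.time(r2 <- crossprod(A, B))  # no transposition

stopifnot(all.equal(r1, r2))  # same result either way
```

The absolute timings here are illustrative only; the point is that crossprod computes the identical product without materialising t(A).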
Bibliography
J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.
K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.
R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.
R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.
R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3-35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).
D. M. Bates
U. of Wisconsin-Madison
bates@wisc.edu
Tools for interactively exploring R packages
by Jianhua Zhang and Robert Gentleman
Introduction
An R package has the required and perhaps additional subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command line operations both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer, and they are intended for exploring packages, examples, and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.
Description
Our current implementation is in the form of tcltk widgets, and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcltk. Once both tkWidgets and widgetTools have been loaded into the working session, the user has full access to these interactive tools.
pExplorer
pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from being shown by the widget. The default behavior is to open the first package in the library of locally installed R (.libPaths()), with subdirectories/files named "Meta" and "latex" excluded, if nothing is provided by a user.
Figure 1 shows the widget invoked by typing pExplorer("base").
The following tasks can be performed through the interface:
• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.
• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.
• Clicking the Browse button allows users to select a directory containing R packages, to achieve the same results as typing in a new path.
• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.
Figure 1: A snapshot of the widget when pExplorer is called
• Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.
• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.
• Clicking the Try examples button invokes eExplorer, which will be discussed next.
eExplorer
eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.
Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.
vExplorer
vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionalities provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of the vignettes of the package.
Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget, as shown by Figure 4, appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:
• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.
• Clicking the View PDF button shows the PDF version of the vignette currently being explored.
• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in the Results of Execution window.
• Clicking the Export to R button exports the result of execution to the global environment of the current R session.
The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.
References
Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28-31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.
Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21-24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.
Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base")
Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer()
Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked
The survival package
by Thomas Lumley
This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).
Overview
Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.
In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.
Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).
Specifying survival data
In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.
In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.
> library(survival)
> data(veteran)  ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+
...
> ## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
+                    diagtime*30 + time, status))
 [1] (210, 282]  (150, 561]
 [3] ( 90, 318]  (270, 396]
...
 [9] (540, 854]  (180, 280+]
The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.
One and two sample summaries
The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.
Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.
> data(veteran)
> plot(survfit(Surv(time, status) ~ trt, data = veteran),
+      xlab = "Years since randomisation", xscale = 365,
+      ylab = "% surviving", yscale = 100,
+      col = c("forestgreen", "blue"))
> survdiff(Surv(time, status) ~ trt, data = veteran)
Call:
survdiff(formula = Surv(time, status) ~ trt,
    data = veteran)

       N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394

      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928
Figure 1: Survival distributions for two lung cancer treatments ("Standard" and "New"; x-axis: years since randomisation, y-axis: % surviving).
Proportional hazards models
The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.
Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

  log h(t; z) = log h0(t) + βz.

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.
A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.
> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
+                    log(bili) + log(protime) +
+                    age + platelet,
+                    data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
    log(bili) + log(protime) + age + platelet,
    data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999

             se(coef)     z       p
edtrt        0.300321  3.43 0.00061
log(bili)    0.097771  9.73 0.00000
log(protime) 1.031908  2.80 0.00520
age          0.008489  4.18 0.00003
platelet     0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312
The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.
> plot(survfit(Surv(time, status) ~ edtrt,
+              data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt +
+          ratetable(edtrt = edtrt, bili = bili,
+                    platelet = platelet, age = age,
+                    protime = protime),
+        data = pbc, subset = trt == -9,
+        ratetable = mayomodel, cohort = TRUE),
+       col = "purple")
The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.
Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.
Figure 2: Observed and predicted survival.
The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The
cox.zph function provides formal tests and graphical diagnostics for proportional hazards.
> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])
Figure 3: Testing proportional hazards (smoothed estimate of Beta(t) for edtrt against time).
There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.
Additional features
Computation for the Cox model extends straightforwardly to situations where z can vary over time, and when an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual, and a strata() term indicating subgroups that get their own unspecified baseline hazard function.
Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
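As a sketch (not from the article), a Weibull accelerated-failure-time fit to the veteran data might look like this; the choice of predictors (trt and the Karnofsky score karno) is illustrative:

```r
## Parametric survival model with Weibull errors; "weibull" is
## one of the standard survreg distributions.
library(survival)
data(veteran)
wfit <- survreg(Surv(time, status) ~ trt + karno,
                data = veteran, dist = "weibull")
summary(wfit)
```

Unlike coxph, this models the (log) survival time directly, so coefficients are on the time scale rather than the hazard scale.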
More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.
The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of follow-up.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
useR 2004
The R User Conference
by John Fox, McMaster University, Canada (jfox@mcmaster.ca)
More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference — useR 2004 — which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.
The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations — from bioinformatics to user interfaces, and finance to fights among lizards — reflects the wide range of applications supported by R.
A particularly interesting component of the program was the set of keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference — the close relationship in R between use and development. The keynote addresses included:
• Paul Murrell on grid graphics and programming
• Martin Mächler on good programming practice
• Luke Tierney on namespaces and byte compilation
• Kurt Hornik on packaging, documentation, and testing
• Friedrich Leisch on S4 classes and methods
• Peter Dalgaard on language interfaces, using .Call and .External
• Douglas Bates on multilevel models in R
• Brian D. Ripley on data mining and related topics
Slides for the keynote addresses, along with abstracts of other papers presented at useR 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.
An innovation of useR 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.
The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19, and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.
Everyone with whom I spoke, both at and after useR 2004, was very enthusiastic about the conference. We look forward to an even larger useR in 2006.
R Help Desk
Date and Time Classes in R
Gabor Grothendieck and Thomas Petzoldt
Introduction
R-1.9.0 and its contributed packages have a number of datetime (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:
• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).
• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than datetime packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).
• POSIX classes. POSIX classes refer to the two classes POSIXct and POSIXlt and their common super class POSIXt. These support times and dates, including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases, and other applications that involve external communication with systems that also support dates, times, time zones, and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970, GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of
the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIX classes can be found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
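To make the comparison concrete, here is a small sketch (not from the article) holding the same calendar day in each class; the internal representations follow the origins described above:

```r
library(chron)  # contributed package; must be installed

d  <- as.Date("2004-06-01")                 # days since 1970-01-01
dc <- chron("06/01/2004")                   # days since 01/01/70
dp <- as.POSIXct("2004-06-01", tz = "GMT")  # seconds since 1970-01-01 GMT

as.numeric(d)   # 12570 (days)
as.numeric(dc)  # 12570 (days)
as.numeric(dp)  # 12570 * 24 * 60 * 60 (seconds)
```

All three print recognisably as the same day, while unclass() exposes the differing internals.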
A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.
Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.
Other Applications
Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word "usually" was employed above.
SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).
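As a worked check of the Windows Excel rule above (the serial number 36526.75 is a hypothetical cell value; with the 1899-12-30 origin it corresponds to 6 pm on 2000-01-01):

```r
x <- 36526.75                     # hypothetical Excel (Windows) value
as.Date("1899-12-30") + floor(x)  # "2000-01-01"

library(chron)                    # chron keeps the fractional day as a time
chron("12/30/1899") + x
```

The Date version discards the time of day via floor(); the chron version retains it as a fraction of a day.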
Time Series
A common use of dates and datetimes are in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly, or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its), and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or timedate package to represent datetimes.
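A minimal sketch (not from the article) of ts's own labelling scheme for a regular monthly series:

```r
## ts labels observations by (year, cycle) rather than by dates.
x <- ts(rnorm(24), start = c(2000, 1), frequency = 12)
frequency(x)  # 12
start(x)      # 2000 1
window(x, start = c(2000, 6), end = c(2000, 8))  # a mid-2000 slice
```

time(x) returns the labels as years plus fractions of a year (2000.000, 2000.083, ...), which is why irregular datetime-labelled series need one of the contributed classes instead.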
Input
By default, read.table will read in numeric data such as 20040115 as numbers and will read non-numeric data such as 12/15/04 or 2004-12-15 as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹
# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y%m%d")
# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)
Avoiding Errors
The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.
With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:
# default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))
# if TRUE, abbreviates year to 2 digits
options(chron.year.abb = TRUE)
# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)
# if TRUE then year not displayed
options(chron.simplify = FALSE)
For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,
¹This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.
orig <- chron("01/01/60")
x <- 0:9  # days since 01/01/60
# chron dates with default origin
orig + x
Regarding POSIX classes, the user should be aware of the following:
• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year, and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:
dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)
This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.
• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.
• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.
• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
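Following the advice above to convert via character when time zones matter, a safe round trip might look like this (a sketch, not from the article):

```r
dp <- as.POSIXct("2004-06-01 12:00:00", tz = "GMT")

## Going through character preserves the printed date/time,
## avoiding surprises from implicit time-zone conversion.
s   <- format(dp, tz = "GMT")
dlt <- as.POSIXlt(s, tz = "GMT")

stopifnot(format(dlt) == s)  # the printed date/time survives the trip
```

A direct as.POSIXlt(dp) without the tz= argument would instead interpret dp in the current time zone, which is exactly the kind of silent shift the character detour avoids.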
Comparison Table
Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.
Bibliography
Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.
James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.
Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2):8-11. URL http://CRAN.R-project.org/doc/Rnews/.
Gabor Grothendieck
ggrothendieck@myway.com
Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de
Table 1: Idioms for the Date, chron, and POSIXct classes. (Numbers in brackets mark footnotes.)

now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz = "GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class = "Date")
  chron:   chron(0)
  POSIXct: structure(0, class = c("POSIXt", "POSIXct"))

x days since origin (GMT)
  Date:    structure(floor(x), class = "Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class = c("POSIXt", "POSIXct"))

diff in days
  Date:    dt1 - dt2
  chron:   dc1 - dc2
  POSIXct: difftime(dp1, dp2, unit = "day") [1]

time zone difference
  POSIXct: dp - as.POSIXct(format(dp, tz = "GMT"))

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt + 1
  chron:   dc + 1
  POSIXct: seq(dp0, length = 2, by = "DSTday")[2] [2]

previous day
  Date:    dt - 1
  chron:   dc - 1
  POSIXct: seq(dp0, length = 2, by = "-1 DSTday")[2] [3]

x days since date
  Date:    dt0 + floor(x)
  chron:   dc0 + x
  POSIXct: seq(dp0, length = 2, by = paste(x, "DSTday"))[2] [4]

sequence
  Date:    dts <- seq(dt0, length = 10, by = "day")
  chron:   dcs <- seq(dc0, length = 10)
  POSIXct: dps <- seq(dp0, length = 10, by = "DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz = "GMT"), rnorm(10)) [5]

every 2nd week
  Date:    seq(dt0, length = 3, by = "2 week")
  chron:   dc0 + seq(0, length = 3, by = 14)
  POSIXct: seq(dp0, length = 3, by = "2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc) - month.day.year(dc)$day + 1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun = 0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc) - 3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output: yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output: mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output: Sat Jul 1 1970
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input: "1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format = "y-m-d")
  POSIXct: as.POSIXct(z)

Input: "10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz = "")
  chron to Date:    as.Date(dates(dc))
  POSIXct to Date:  as.Date(format(dp))
  Date to chron:    chron(unclass(dt))
  POSIXct to chron: chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]
  Date to POSIXct:  as.POSIXct(format(dt))
  chron to POSIXct: as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz = "GMT")
  chron to Date:    as.Date(dates(dc))
  POSIXct to Date:  as.Date(dp)
  Date to chron:    chron(unclass(dt))
  POSIXct to chron: as.chron(dp)
  Date to POSIXct:  as.POSIXct(format(dt), tz = "GMT")
  chron to POSIXct: as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See strptime for more codes.
1 Can
bere
duce
dto
dp1-dp2
ifit
sac
cept
able
that
date
tim
escl
oser
than
one
day
are
retu
rned
inho
urs
orot
her
unit
sra
ther
than
days
2 C
anbe
redu
ced
todp+246060
ifon
eho
urde
viat
ion
atti
me
chan
ges
betw
een
stan
dard
and
dayl
ight
savi
ngs
tim
ear
eac
cept
able
3 C
anbe
redu
ced
todp-246060
ifon
eho
urde
viat
ion
atti
me
chan
ges
betw
een
stan
dard
and
dayl
ight
savi
ngs
tim
ear
eac
cept
able
4 C
anbe
redu
ced
todp+246060x
ifon
eho
urde
viat
ion
acce
ptab
lew
hen
stra
ddlin
god
dnu
mbe
rof
stan
dard
day
light
savi
ngs
tim
ech
ange
s5 C
anbe
redu
ced
toplot(dpsrnorm(10))
ifpl
otti
ngpo
ints
atG
MT
date
tim
esra
ther
than
curr
entd
atet
ime
suffi
ces
6 An
equi
vale
ntal
tern
ativ
eisaschron(asPOSIXct(format(dp)tz=GMT))
Table 1 Comparison Table
Programmers' Niche: A simple class in S3 and S4
by Thomas Lumley
ROC curves
Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.
The ROC curve plots the true positive rate (sensitivity), Pr(T > c|D = 1), against the false positive rate (1 − specificity), Pr(T > c|D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).
I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.
Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±∞ and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.
drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
    function(c) sum(D[T > c]) / sum(D))
  spec <- sapply(cutpoints,
    function(c) sum((1 - D)[T <= c]) / sum(1 - D))
  plot(1 - spec, sens, type = "l")
}
It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.
There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number rather than just an object for which > and <= are defined.
drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  plot(mspec, sens, type = "l")
}
One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
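The cumulative-sum construction can be checked directly on simulated data (the simulation below is illustrative, not from the article): both rate vectors must come out non-decreasing.

```r
# Simulated biomarker: higher on average in cases (D = 1) than controls.
set.seed(42)
D <- rep(c(0, 1), each = 50)
T <- rnorm(100, mean = D)

DD <- table(-T, D)                     # rows ordered by decreasing T
sens <- cumsum(DD[, 2]) / sum(DD[, 2])
mspec <- cumsum(DD[, 1]) / sum(DD[, 1])

# cumulative sums of non-negative counts are non-decreasing
stopifnot(!is.unsorted(sens), !is.unsorted(mspec))
```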
Classes and methods
Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.
The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec, test = TT,
               call = sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}
The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.
plot.ROC <- function(x, type = "b", null.line = TRUE,
                     xlab = "1-Specificity", ylab = "Sensitivity",
                     main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
       xlab = xlab, ylab = ylab, ...)
  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}
A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:
lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}
A method for identify allows the cutpoint to be found for any point on the curve:
identify.ROC <- function(x, labels = NULL, ..., digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}
An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
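One possible solution (a sketch, not from the article): apply the trapezoidal rule to the stored (mspec, sens) coordinates, padded to the corners (0, 0) and (1, 1). The function name AUC and the padding are my choices for this sketch.

```r
# Hypothetical AUC for the S3 "ROC" objects above (trapezoidal rule).
AUC <- function(x) {
  mspec <- c(0, x$mspec, 1)   # pad the curve to the (0,0) and (1,1) corners
  sens  <- c(0, x$sens,  1)
  sum(diff(mspec) * (head(sens, -1) + tail(sens, -1)) / 2)
}

# A perfectly discriminating test has area 1:
r <- list(sens = c(0.5, 1, 1, 1), mspec = c(0, 0, 0.5, 1))
AUC(r)   # 1
```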
Generalising the class
The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.
ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.
ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
S4 classes and methods
In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).
As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.
The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions.
setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })
The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  new("ROC", sens = sens, mspec = mspec,
      test = TT, call = sys.call())
}
Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:
setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })
The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.
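For a minimal illustration of slot access outside the ROC example (the class name "point" here is made up for this sketch):

```r
library(methods)  # loaded by default in an interactive R session

# "point" is a throwaway class used only for this illustration.
setClass("point", representation(x = "numeric", y = "numeric"))
p <- new("point", x = 1, y = 2)
p@x   # slot access with @, analogous to $ for list components
</test>
```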
The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.
setMethod("plot", c("ROC", "missing"),
  function(x, y, type = "b", null.line = TRUE,
           xlab = "1-Specificity",
           ylab = "Sensitivity",
           main = NULL, ...) {
    par(pty = "s")
    plot(x@mspec, x@sens, type = type,
         xlab = xlab, ylab = ylab, ...)
    if (null.line)
      abline(0, 1, lty = 3)
    if (is.null(main))
      main <- x@call
    title(main = main)
  })
Creating a method for lines appears to work the same way:
setMethod("lines", "ROC",
  function(x, ...) {
    lines(x@mspec, x@sens, ...)
  })
In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before
> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>
and after the setMethod call
> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.
Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.
setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type = "response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[, 2]) / sum(DD[, 2])
    mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
    new("ROC", sens = sens, mspec = mspec,
        test = TT, call = sys.call())
  })
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.
setGeneric("identify", signature = c("x"))

setMethod("identify", "ROC",
  function(x, ..., labels = NULL, digits = 1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens, labels = labels, ...)
  })
Discussion
Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.
This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R
by the R Core Team

New features in 1.9.1
• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.
User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.
Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.
Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook; see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.
New features in 1.9.0
• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically those of get/assign to the environment with inherits=FALSE.
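A short sketch of what this allows:

```r
# Environment accessors: character indices only, no partial matching,
# no inheritance from enclosing environments.
e <- new.env()
e$x <- 1
e[["y"]] <- 2
e$x + e[["y"]]   # 3
```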
• There are now print() and "[" methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values; thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.
• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace().
• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name, and so can be a variable name or a quoted character string.
• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)
• Deprecated() has a new argument package, which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.
• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.
• New functions factorial(), defined as gamma(x+1), and for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).
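For example:

```r
factorial(5)     # 120, i.e. gamma(6)
lfactorial(5)    # log(120), computed via lgamma(6)
```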
• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.
• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.
The perl=TRUE argument now uses charactertables prepared for the locale currently in useeach time it is used rather than those of the Clocale
• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)
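For example:

```r
head(letters, 3)   # "a" "b" "c"
tail(letters, 2)   # "y" "z"
```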
• legend() has a new argument 'text.col'.
bull methods(class=) now checks for a matchinggeneric and so no longer returns methods fornon-visible generics (and eliminates variousmismatches)
bull A new function mget() will retrieve multiplevalues from an environment
• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".
• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.
• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.
• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.
• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).
• rgb2hsv() is new, an R interface to the C API function with the same name.
• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces; see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.
bull New function shQuote() to quote strings to bepassed to OS shells
bull sink() now has a split= argument to directoutput to both the sink and the current outputconnection
• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.
• strsplit() now has fixed and perl arguments, and split = "" is optimized.
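For example, fixed = TRUE makes the split string literal rather than a regular expression:

```r
strsplit("a.b.c", ".", fixed = TRUE)[[1]]   # "a" "b" "c"
strsplit("a.b.c", "\\.")[[1]]               # same result, via an escaped regexp
```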
• subset() now allows a drop argument, which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)
bull New command-line argument --max-ppsizeallows the size of the pointer protection stack tobe set higher than the previous limit of 10000
• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".
• New functions in the tools package: pkgDepends(), getDepList() and installFoundDepends(). These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).
bull The parsed contents of a NAMESPACE file arenow stored at installation and if available usedto speed loading the package so packages withnamespaces should be reinstalled
• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().
bull Package stats4 exports S4 generics for AIC()and BIC()
• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.
bull Added experimental support for conditionalsin NAMESPACE files
• Added as.list.environment to coerce environments to lists (efficiently).
• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)
• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.
• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of a viewport tree (rather than just a viewport stack).

– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.

– Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport bundles that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of a viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g., "vp1::vp2", but this is only advised for interactive use (in case I decide to change the separator :: in later versions).

– Added just argument to grid.layout() to allow justification of the layout relative to the parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).
– Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

pushViewport(viewport(w=.5, h=.5, name="A"))
grid.rect()
pushViewport(viewport(w=.5, h=.5, name="B"))
grid.rect(gp=gpar(col="grey"))
upViewport(2)
grid.rect(vp="A", gp=gpar(fill="red"))
grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))
– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html
– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour the following changes were necessary:

grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

In addition, the following features are now possible with grobs:

grobs now save()/load() like any normal R object.

many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.
– Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).
– Allowed viewports to turn off clipping altogether. Possible settings for the viewport clip arg are now:

"on": clip to the viewport (was TRUE)
"inherit": clip to whatever parent says (was FALSE)
"off": turn off clipping

Still accept logical values (and NA maps to "off").
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.
• undoc() in package 'tools' has a new default of 'use.values = NULL', which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.
• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends().

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories, as well as CVS directories.
• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector), rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions', and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.
bull The internal header nmathdpqh hasslightly improved macros R_DT_val() andR_DT_Cval() a new R_D_LExp() and improvedR_DT_log() and R_DT_Clog() this improvesaccuracy in several [dpq]-functions for extremearguments
bull printcoefmat() is defunct replaced byprintCoefmat()
bull codes() and codeslt-() are defunct
bull anovalistlm (replaced in 120) is now de-funct
bull glmfitnull() lmfitnull() andlmwfitnull() are defunct
bull printatomic() is defunct
bull The command-line arguments --nsize and--vsize are no longer recognized as synonymsfor --min-nsize and --min-vsize (which re-placed them in 120)
bull Unnecessary methods coefglm andfittedglm have been removed they wereeach identical to the default method
bull Laeigen() is deprecated now eigen() usesLAPACK by default
bull tetragamma() and pentagamma() are depre-cated since they are equivalent to psigamma(deriv=2) and psigamma( deriv=3)
• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.
• package.contents() and package.description() have been deprecated.
• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.
The included PCRE sources have been updated to version 4.5, and PCRE >= 4.0 is now required if --with-pcre is used.
The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.
• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.
• --with-blas=goto to use K. Goto's optimized BLAS will now work.
The above lists only new features: see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
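As a quick illustration (not part of the original NEWS file), the deprecated gamma-derivative functions map onto psigamma() as the text describes:

```r
## psigamma(x, deriv = n) is the n-th derivative of digamma(x):
## deriv = 0 gives digamma, deriv = 1 gives trigamma, and so on, so
## tetragamma(x) becomes psigamma(x, deriv = 2) and
## pentagamma(x) becomes psigamma(x, deriv = 3).
x <- c(0.5, 1, 2.5, 10)
tetra <- psigamma(x, deriv = 2)
penta <- psigamma(x, deriv = 3)

## Sanity check against the still-current special cases:
stopifnot(all.equal(psigamma(x, deriv = 0), digamma(x)),
          all.equal(psigamma(x, deriv = 1), trigamma(x)))
```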
Changes on CRAN
by Kurt Hornik
New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A and I criteria. Very large designs may be created. Experimental designs may be blocked, or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.
BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.
BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.
DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.
GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer and Korbinian Strimmer.
HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.
Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.
MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.
MNP A publicly available R package that fits the Bayesian multinomial probit models via Markov chain Monte Carlo. Along with the standard multinomial probit model, it can also fit models with different choice sets for each observation, and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.
NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.
R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.
RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath and David Bauer.
RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.
SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows, in particular, parameter estimation in the selected model. By Mathias Drton.
SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.
SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.
assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.
bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.
betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.
circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.
crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.
debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control and full keyboard input. By Mark V. Bravington.
distr Object-oriented implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.
dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.
energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.
evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.
evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S code (EVIS) by Alexander McNeil; R port by Alec Stephenson.
fBasics fBasics package from Rmetrics. Rmetrics is an environment and software collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.
fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.
fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.
fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.
fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.
gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, and some statistics in linkage analysis and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.
gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. The package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.
gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.
hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.
haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.
hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.
httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.
impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.
klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.
knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.
kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.
mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.
mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.
mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-versus-within-chain approach of Gelman and Rubin. By Gregory R. Warnes.
mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.
mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal/external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.
multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.
mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.
mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley; some routines from vegan by Jari Oksanen; extensions and adaptations of rpart to mvpart by Glenn De'ath.
pgam This work is aimed at extending a class of state space models for Poisson count data, so-called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.
qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.
race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.
rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).
ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data frames. By Jens Oehlschlägel.
regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Provides an easy, straightforward way to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.
rgl 3D visualization device (OpenGL). By Daniel Adler.
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.
rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.
rrcov Functions for robust location and scatter estimation and robust regression with high breakdown point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.
sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.
seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.
setRNG Set reproducible random number generator in R and S. By Paul Gilbert.
sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-PLUS times. By Martin Maechler and many others.
skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.
spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.
spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.
spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.
supclust Methodology for supervised grouping of predictor variables. By Marcel Dettling and Martin Maechler.
svmpath Computes the entire regularization path for the two-class SVM classifier, with essentially the same cost as a single SVM fit. By Trevor Hastie.
treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.
urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.
urn Functions for sampling without replacement (simulated urns). By Micah Altman.
verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.
zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.

Other changes

• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News
by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)
• Emanuele De Rinaldis (Italy)
• Norm Phillips (Canada)
New supporting institutions
• Breast Center at Baylor College of Medicine, Houston, Texas, USA
• Dana-Farber Cancer Institute, Boston, USA
• Department of Biostatistics, Vanderbilt University School of Medicine, USA
• Department of Statistics & Actuarial Science, University of Iowa, USA
• Division of Biostatistics, University of California, Berkeley, USA
• Norwegian Institute of Marine Research, Bergen, Norway
• TERRA Lab, University of Regina - Department of Geography, Canada
• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
product. The crossprod function does not do any transposition of matrices, yet another reason to use crossprod.
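To make the point concrete, here is a small check (not from the article) that crossprod() agrees with the explicit transpose-and-multiply form while never forming the transpose:

```r
## crossprod(X)    computes t(X) %*% X without explicitly transposing X;
## crossprod(X, y) computes t(X) %*% y.
set.seed(1)
X <- matrix(rnorm(20), nrow = 4, ncol = 5)
y <- matrix(rnorm(4), nrow = 4)

stopifnot(all.equal(crossprod(X),    t(X) %*% X),
          all.equal(crossprod(X, y), t(X) %*% y))
```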
Bibliography
J. M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4.
K. Goto and R. van de Geijn. On reducing TLB misses in matrix multiplication. Technical Report TR02-55, Department of Computer Sciences, U. of Texas at Austin, 2002.
R. Koenker and P. Ng. SparseM: A sparse matrix package for R. J. of Statistical Software, 8(6), 2003.
R Development Core Team. R Installation and Administration. R Foundation for Statistical Computing, Vienna, 2004. ISBN 3-900051-02-X.
R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1–2):3–35, 2001. Also available as University of Tennessee LAPACK Working Note 147, UT-CS-00-448, 2000 (www.netlib.org/lapack/lawns/lawn147.ps).
D. M. Bates
U. of Wisconsin-Madison
bates@wisc.edu
Tools for interactively exploring R packages
by Jianhua Zhang and Robert Gentleman
Introduction
An R package has the required, and perhaps additional, subdirectories containing files and information that may be of interest to users. Although these files can always be explored through command-line operations both within and outside R, interactive tools that allow users to view and/or operate on these files would be appealing to many users. Here we report on some widgets of this form that have been developed using the package tcltk. The functions are called pExplorer, eExplorer, and vExplorer and they are intended for exploring packages, examples, and vignettes, respectively. These tools were developed as part of the Bioconductor project and are available in the package tkWidgets from www.bioconductor.org. Automatic downloading and package maintenance is also available using the reposTools mechanisms, also from Bioconductor.
Description
Our current implementation is in the form of tcl/tk widgets, and we are currently adding similar functionality to the RGtk package, also available from the Bioconductor website. Users must have a functioning implementation of (and be configured for) tcl/tk. Once both tkWidgets and widgetTools have been loaded into the working session, the user has full access to these interactive tools.
pExplorer
pExplorer allows users to explore the contents of an R package. The widget can be invoked by providing a package name, a path to the directory containing the package, and the names of files and/or subdirectories to be excluded from showing by the widget. The default behavior is to open the first package in the library of locally installed R (.libPaths()), with subdirectories/files named Meta and latex excluded if nothing is provided by a user.

Figure 1 shows the widget invoked by typing pExplorer("base").
The following tasks can be performed through the interface:

• Typing in a valid path to a local directory containing R packages and then pressing the Enter key will update the name of the package being explored and show the contents of the first R package in the path in the Contents list box.
• Clicking the dropdown button for package path and selecting a path from the resulting dropdown list by double clicking will achieve the same results as typing in a new path.
• Clicking the Browse button allows users to select a directory containing R packages, achieving the same results as typing in a new path.
• Clicking the Select Package button allows users to pick a new package to explore, and the name and contents of the package being explored are updated.
Figure 1: A snapshot of the widget when pExplorer is called.
• Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just being clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.
• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.
• Clicking the Try examples button invokes eExplorer, to be discussed next.
eExplorer
eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.

Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.
vExplorer
vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionality provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of the vignettes of the package.

Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget, as shown by Figure 4, appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:
• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.
• Clicking the View PDF button shows the PDF version of the vignette currently being explored.
• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in the Results of Execution window.
• Clicking the Export to R button exports the result of execution to the global environment of the current R session.

The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.
References
Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28–31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.
Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21–24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.
Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base").
Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().
Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.
The survival package
by Thomas Lumley
This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).
Overview
Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).
Specifying survival data
In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a Surv object by the Surv() function. If the start variable is zero for everyone it may be omitted.

In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.
> library(survival)
> data(veteran)  ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+
...
## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
                     diagtime*30 + time, status))
 [1] ( 210, 282 ] ( 150, 561 ]
 [3] (  90, 318 ] ( 270, 396 ]
...
 [9] ( 540, 854 ] ( 180, 280+]
The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.
One and two sample summaries
The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.
Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.
> data(veteran)
> plot(survfit(Surv(time, status) ~ trt,
+              data=veteran),
+      xlab="Years since randomisation",
+      xscale=365.25, ylab="% surviving",
+      yscale=100,
+      col=c("forestgreen", "blue"))
> survdiff(Surv(time, status) ~ trt, data=veteran)
Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

       N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394
      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928
R News ISSN 1609-3631
Figure 1: Survival distributions for two lung cancer treatments (x axis: years since randomisation; y axis: % surviving; curves labelled Standard and New).
Proportional hazards models
The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.
Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

    log h(t; z) = log h0(t) + βz,

that is, h(t; z) = h0(t) e^(βz).
Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.
A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.
> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
+                      log(bili) + log(protime) +
+                      age + platelet,
+                    data=pbc, subset=trt>0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
        log(bili) + log(protime) + age + platelet,
      data = pbc, subset = trt > 0)

                  coef exp(coef)
edtrt          1.02980     2.800
log(bili)      0.95100     2.588
log(protime)   2.88544    17.911
age            0.03544     1.036
platelet      -0.00128     0.999
              se(coef)     z       p
edtrt         0.300321  3.43 0.00061
log(bili)     0.097771  9.73 0.00000
log(protime)  1.031908  2.80 0.00520
age           0.008489  4.18 0.00003
platelet      0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312
The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.
> plot(survfit(Surv(time, status) ~ edtrt,
+              data=pbc, subset=trt==-9))
> lines(survexp(~ edtrt +
+          ratetable(edtrt=edtrt, bili=bili,
+                    platelet=platelet, age=age,
+                    protime=protime),
+        data=pbc, subset=trt==-9,
+        ratetable=mayomodel, cohort=TRUE),
+       col="purple")
The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.
Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.
Figure 2: Observed and predicted survival (x axis: days, 0 to 4000; y axis: survival, 0 to 1).
The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The
cox.zph function provides formal tests and graphical diagnostics for proportional hazards.
> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])
Figure 3: Testing proportional hazards (smoothed estimate of Beta(t) for edtrt plotted against time).
There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.
Additional features
Computation for the Cox model extends straightforwardly to situations where z can vary over time, and when an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.
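A hedged sketch of such a formula (the data frame and all variable names here are hypothetical):

```r
library(survival)
# Multiple records per subject in counting-process form:
# cluster(id) requests robust variances across subjects,
# strata(centre) gives each centre its own baseline hazard.
fit <- coxph(Surv(start, stop, event) ~ treatment +
               cluster(id) + strata(centre),
             data = mydata)   # mydata is illustrative
```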
Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
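For instance, a parametric fit to the veteran data used earlier might look like this (a sketch; the choice of the Weibull distribution is illustrative):

```r
library(survival)
data(veteran)
# accelerated failure time model; dist= also accepts
# e.g. "lognormal" or "exponential"
survreg(Surv(time, status) ~ trt, data = veteran,
        dist = "weibull")
```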
More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.
The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of follow-up.
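As a sketch of the person-years machinery (the grouping variable is illustrative):

```r
library(survival)
data(pbc)
# person-years of follow-up in the pbc data, grouped by sex;
# scale=365.25 converts the day-based times to years
pyears(Surv(time, status) ~ sex, data = pbc, scale = 365.25)
```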
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
useR! 2004
The R User Conference
by John Fox, McMaster University, Canada
More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.
The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.
A particularly interesting component of the program were keynote addresses by members of the R core team, aimed primarily at improving programming skills, and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:
• Paul Murrell on grid graphics and programming
• Martin Mächler on good programming practice
• Luke Tierney on namespaces and byte compilation
• Kurt Hornik on packaging, documentation, and testing
• Friedrich Leisch on S4 classes and methods
• Peter Dalgaard on language interfaces, using .Call and .External
• Douglas Bates on multilevel models in R
• Brian D. Ripley on data mining and related topics
Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.
An innovation of useR! 2004 was the deployment of 'island' sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.
The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19, and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.
Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.

John Fox
McMaster University, Canada
jfox@mcmaster.ca
R Help Desk
Date and Time Classes in R
Gabor Grothendieck and Thomas Petzoldt
Introduction
R-1.9.0 and its contributed packages have a number of datetime (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:
• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).
• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).
• POSIX classes. POSIX classes refer to the two classes POSIXct, POSIXlt, and their common super class POSIXt. These support times and dates including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases, and other applications that involve external communication with systems that also support dates, times, time zones, and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970, GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
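A minimal sketch showing the same calendar day created in each of the three systems (the dates themselves are arbitrary):

```r
library(chron)
as.Date("2004-06-15")                         # Date class
chron("06/15/2004")                           # chron class
as.POSIXct("2004-06-15 12:00:00", tz="GMT")   # POSIXct class
```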
When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.
Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.
Other Applications
Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron respectively. It's possible to set Excel to use either origin, which is why the word 'usually' was employed above.
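A short sketch of the Windows-origin case (the numbers in x are made up):

```r
library(chron)
x <- c(38153.5, 38154.25)   # hypothetical Excel datetimes
as.Date("1899-12-30") + floor(x)   # Date class (dates only)
chron("12/30/1899") + x            # chron (dates and times)
```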
SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).
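The SAS case can also be handled by simple arithmetic; a sketch with a made-up value:

```r
library(chron)
x <- 1250000000                   # hypothetical SAS datetime,
                                  # seconds since 1960-01-01
chron("01/01/60") + x/(24*60*60)  # convert seconds to days
```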
Time Series
A common use of dates and datetimes are in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its), and zoo (in package zoo). ts uses a scheme of its own for dates. irts and its use POSIXct dates to represent datetimes. zoo can use any date or timedate package to represent datetimes.
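As a sketch, an irregular series in zoo indexed by Date class dates (the values and dates are made up):

```r
library(zoo)
# three observations at unevenly spaced dates
z <- zoo(rnorm(3),
         as.Date(c("2004-01-05", "2004-01-07",
                   "2004-01-12")))
z
```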
Input
By default, read.table will read in numeric data, such as 20040115, as numbers and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹
# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)
Avoiding Errors
The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.
With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:
# default origin
options(chron.origin = c(month=1, day=1, year=1970))

# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)

# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)

# if TRUE then year not displayed
options(chron.simplify = FALSE)
For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,
¹This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data.
orig <- chron("01/01/60")
x <- 0:9  # days since 01/01/60

# chron dates with default origin
orig + x
Regarding POSIX classes, the user should be aware of the following:
• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year, and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:
dp <- seq(Sys.time(), len=10, by="day")
plot(dp, 1:10)
This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.
• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.
• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.
• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
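A minimal sketch of the convert-to-character-first recipe (the datetime is arbitrary):

```r
dp <- as.POSIXct("2004-06-15 10:00:00")
# going through character keeps the printed date-time intact,
# which a direct conversion may not across time zones
as.POSIXlt(format(dp))
```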
Comparison Table
Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.
Bibliography
Alzola, C. F. and Harrell, F. E. (2002). An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.
James, D. A. and Pregibon, D. (1993). Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.
Ripley, B. D. and Hornik, K. (2001). Date-time classes. R News, 1(2):8-11. URL http://CRAN.R-project.org/doc/Rnews/.
Gabor Grothendieck
ggrothendieck@myway.com
Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de
now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class="Date")
  chron:   chron(0)
  POSIXct: structure(0, class=c("POSIXt","POSIXct"))

x days since origin (GMT days)
  Date:    structure(floor(x), class="Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class=c("POSIXt","POSIXct"))

diff in days
  Date:    dt1-dt2
  chron:   dc1-dc2
  POSIXct: difftime(dp1, dp2, unit="day") [1]

time zone difference
  POSIXct: dp-as.POSIXct(format(dp, tz="GMT"))

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt+1
  chron:   dc+1
  POSIXct: seq(dp0, length=2, by="DSTday")[2] [2]

previous day
  Date:    dt-1
  chron:   dc-1
  POSIXct: seq(dp0, length=2, by="-1 DSTday")[2] [3]

x days since date
  Date:    dt0+floor(x)
  chron:   dc0+x
  POSIXct: seq(dp0, length=2, by=paste(x, "DSTday"))[2] [4]

sequence
  Date:    dts <- seq(dt0, length=10, by="day")
  chron:   dcs <- seq(dc0, length=10)
  POSIXct: dps <- seq(dp0, length=10, by="DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz="GMT"), rnorm(10)) [5]

every 2nd week
  Date:    seq(dt0, length=3, by="2 week")
  chron:   dc0 + seq(0, length=3, by=14)
  POSIXct: seq(dp0, length=3, by="2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc) - month.day.year(dc)$day + 1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun=0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc)-3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output: yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output: mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output: Sat Jul 1 1970
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input: "1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format="y-m-d")
  POSIXct: as.POSIXct(z)

Input: "10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz="")
  to Date from chron:     as.Date(dates(dc))
  to Date from POSIXct:   as.Date(format(dp))
  to chron from Date:     chron(unclass(dt))
  to chron from POSIXct:  chron(format(dp, "%m/%d/%Y"),
                                format(dp, "%H:%M:%S")) [6]
  to POSIXct from Date:   as.POSIXct(format(dt))
  to POSIXct from chron:  as.POSIXct(paste(as.Date(dates(dc)),
                                           times(dc) %% 1))

Conversion (tz="GMT")
  to Date from chron:     as.Date(dates(dc))
  to Date from POSIXct:   as.Date(dp)
  to chron from Date:     chron(unclass(dt))
  to chron from POSIXct:  as.chron(dp)
  to POSIXct from Date:   as.POSIXct(format(dt), tz="GMT")
  to POSIXct from chron:  as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; the others are vectors. Expressions involving chron dates assume all chron options are set at their default values. See ?strptime for the % codes.

[1] Can be reduced to dp1-dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp+24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp-24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp+24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz="GMT")).

Table 1: Comparison table.
Programmers' Niche
A simple class in S3 and S4
by Thomas Lumley
ROC curves
Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.
The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1-specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).
I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.
Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±Inf and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.
drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type="l")
}
It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.
There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.
drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  plot(mspec, sens, type="l")
}
One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
Classes and methods
Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.
The first thing a new class needs is a print method. The print function is generic; when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec, test=TT,
               call=sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}
The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.
plot.ROC <- function(x, type="b", null.line=TRUE,
                     xlab="1-Specificity",
                     ylab="Sensitivity",
                     main=NULL, ...) {
  par(pty="s")
  plot(x$mspec, x$sens, type=type,
       xlab=xlab, ylab=ylab, ...)
  if(null.line)
    abline(0, 1, lty=3)
  if(is.null(main))
    main <- x$call
  title(main=main)
}
A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:
lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}
A method for identify allows the cutpoint to be found for any point on the curve:
identify.ROC <- function(x, labels=NULL, ...,
                         digits=1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels=labels, ...)
}
An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
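One possible sketch, not part of the article's code: since the object stores the cumulative rates, the trapezoidal rule gives the area under the curve.

```r
# AUC from an object as returned by the ROC() constructor
# defined above; a sketch using the trapezoidal rule
AUC <- function(x) {
  with(x, sum(diff(mspec) *
                (head(sens, -1) + tail(sens, -1))/2))
}
```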
Generalising the class
The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.
ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
               test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}
The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.
ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type="response"))
  disease <- disease*weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously
  far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
               test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}
S4 classes and methods
In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).
As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.
The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:
setClass("ROC",
  representation(sens="numeric", mspec="numeric",
                 test="numeric", call="call"),
  validity=function(object) {
    length(object@sens) == length(object@mspec) &&
    length(object@sens) == length(object@test)
  })
The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new
function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  new("ROC", sens=sens, mspec=mspec,
      test=TT, call=sys.call())
}
Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:
setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })
The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.
The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC","missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.
setMethod("plot", c("ROC","missing"),
  function(x, y, type="b", null.line=TRUE,
           xlab="1-Specificity",
           ylab="Sensitivity",
           main=NULL, ...) {
    par(pty="s")
    plot(x@mspec, x@sens, type=type,
         xlab=xlab, ylab=ylab, ...)
    if(null.line)
      abline(0, 1, lty=3)
    if(is.null(main))
      main <- x@call
    title(main=main)
  })
Creating a method for lines appears to work the same way:
setMethod("lines", "ROC",
  function(x, ...) {
    lines(x@mspec, x@sens, ...)
  })
In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before
> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>
and after the setMethod call
> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
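To avoid relying on this implicit creation, one can test for an existing generic first. A minimal sketch (isGeneric() is not called in the article itself):

```r
## Check whether lines is already an S4 generic before defining
## methods on it; if not, create the generic explicitly.
library(methods)

if (!isGeneric("lines"))
    setGeneric("lines")   # explicit, rather than implicit via setMethod

isGeneric("lines")        # TRUE: graphics::lines is now the default method
```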
Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.
setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
    function(T, ...) {
        if (!(T$family$family %in%
              c("binomial", "quasibinomial")))
            stop("ROC curves for binomial glms only")
        test <- fitted(T)
        disease <- (test + resid(T, type = "response"))
        disease <- disease * weights(T)
        if (max(abs(disease %% 1)) > 0.01)
            warning("Y values suspiciously far from integers")
        TT <- rev(sort(unique(test)))
        DD <- table(-test, disease)
        sens <- cumsum(DD[, 2])/sum(DD[, 2])
        mspec <- cumsum(DD[, 1])/sum(DD[, 1])
        new("ROC", sens = sens, mspec = mspec,
            test = TT, call = sys.call())
    })
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.
setGeneric("identify", signature = c("x"))

setMethod("identify", "ROC",
    function(x, labels = NULL, ...,
             digits = 1) {
        if (is.null(labels))
            labels <- round(x@test, digits)
        identify(x@mspec, x@sens,
                 labels = labels, ...)
    })
Discussion
Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.
This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R
by the R Core Team
New features in 1.9.1
• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.
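For example, the mean() change above means arithmetic like the following now works directly on "difftime" objects (dates chosen arbitrarily):

```r
## mean() on a "difftime" object (new in 1.9.1), so summary() works too.
d <- as.Date(c("2004-06-01", "2004-06-11")) - as.Date("2004-05-31")
mean(d)     # Time difference of 6 days
summary(d)
```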
User-visible changes in 1.9.0
• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.
• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook; see ?setHook.
• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.
• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.
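The underscore change above is easy to demonstrate; in R >= 1.9.0 both lines below are valid (in very old versions the first would have parsed as an assignment):

```r
## Underscore as an ordinary name character (1.9.0 change above).
my_var <- 10
make.names("a_b")   # "a_b"; pre-1.9.0 versions returned "a.b"
```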
New features in 1.9.0
• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.
R News ISSN 1609-3631
• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma, ...) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split="" is optimized.

• subset() now allows a drop argument, which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.
• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.

– Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator :: in later versions).

– Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

– Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

    pushViewport(viewport(w=.5, h=.5, name="A"))
    grid.rect()
    pushViewport(viewport(w=.5, h=.5, name="B"))
    grid.rect(gp=gpar(col="grey"))
    upViewport(2)
    grid.rect(vp="A",
              gp=gpar(fill="red"))
    grid.rect(vp=vpPath("A", "B"),
              gp=gpar(fill="blue"))
– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

* the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

In addition, the following features are now possible with grobs:

* grobs now save()/load() like any normal R object.

* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

– Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for viewport clip arg are now:

    "on"      clip to the viewport (was TRUE)
    "inherit" clip to whatever parent says (was FALSE)
    "off"     turn off clipping

Still accept logical values (and NA maps to "off").
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(*, deriv=2) and psigamma(*, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1 and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blas=goto to use K. Goto's optimized BLAS will now work.
The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
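A few of the 1.9.0 additions above can be sketched together in one toy session (names and values are arbitrary):

```r
## $ and [[ on environments, mget(), and the new factorial().
e <- new.env()
e$a <- 1                         # $<- now works on environments
e[["b"]] <- 2                    # so does [[<-
mget(c("a", "b"), envir = e)     # new mget(): list(a = 1, b = 2)
factorial(5)                     # defined as gamma(x+1): 120
```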
Changes on CRAN
by Kurt Hornik
New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model, and structured versions. By David Firth.
BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.
DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.
crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.
debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. Energy-statistics concept based on a generalization of Newton's potential energy is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. S original (EVIS) by Alexander McNeil; R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
kernlab Kernel-based machine learning methods including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik
klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges
knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey
kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko
mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa
mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou
mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-versus-within-chain approach of Gelman and Rubin. By Gregory R. Warnes
mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner
mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski
multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon
mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington
mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath
pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger
qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca
race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates are performed in parallel on different hosts. By Mauro Birattari
rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman)
ref Small package with functions for creating references, reading from and writing to references, and a memory efficient refdata type that transparently encapsulates matrices and data.frames. By Jens Oehlschlägel
regress Functions to fit Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh
rgl 3D visualization device (OpenGL). By Daniel Adler
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini
rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers
rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen, ported to R, adapted and packaged by Valentin Todorov
sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis
seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972-974. By David A. Schoenfeld
setRNG Set reproducible random number generator in R and S. By Paul Gilbert
sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others
skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson
spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford
spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth
spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg 15869, and J. Comput. Chem., 2003, 24, pg 1215. By Rajarshi Guha
supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler
svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie
treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto
urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff
urn Functions for sampling without replacement (Simulated Urns). By Micah Altman
verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou
zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck
Other changes
• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News
by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)
• Emanuele De Rinaldis (Italy)
• Norm Phillips (Canada)
New supporting institutions
• Breast Center at Baylor College of Medicine, Houston, Texas, USA
• Dana-Farber Cancer Institute, Boston, USA
• Department of Biostatistics, Vanderbilt University School of Medicine, USA
• Department of Statistics & Actuarial Science, University of Iowa, USA
• Division of Biostatistics, University of California, Berkeley, USA
• Norwegian Institute of Marine Research, Bergen, Norway
• TERRA Lab, University of Regina - Department of Geography, Canada
• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board: Douglas Bates and Paul Murrell

Editor Programmers' Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: firstname.lastname@R-project.org
R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).
R Project Homepage: http://www.R-project.org/

This newsletter is available online at http://CRAN.R-project.org/doc/Rnews/
- Editorial
- The Decision To Use R
  - Preface
  - Some Background on the Company
  - Evaluating The Tools of the Trade - A Value Based Equation
  - An Introduction To R
  - A Continued Evolution
  - Giving Back
  - Thanks
- The ade4 package - I: One-table methods
  - Introduction
  - The duality diagram class
  - Distance matrices
  - Taking into account groups of individuals
  - Permutation tests
  - Conclusion
- qcc: An R package for quality control charting and statistical process control
  - Introduction
  - Creating a qcc object
  - Shewhart control charts
  - Operating characteristic function
  - Cusum charts
  - EWMA charts
  - Process capability analysis
  - Pareto charts and cause-and-effect diagrams
  - Tuning and extending the package
  - Summary
  - References
- Least Squares Calculations in R
- Tools for interactively exploring R packages
  - References
- The survival package
  - Overview
  - Specifying survival data
  - One and two sample summaries
  - Proportional hazards models
  - Additional features
- useR! 2004
- R Help Desk
  - Introduction
  - Other Applications
  - Time Series
  - Input
  - Avoiding Errors
  - Comparison Table
- Programmers' Niche
  - ROC curves
  - Classes and methods
    - Generalising the class
  - S4 classes and methods
  - Discussion
- Changes in R
  - New features in 1.9.1
  - User-visible changes in 1.9.0
  - New features in 1.9.0
- Changes on CRAN
  - New contributed packages
  - Other changes
- R Foundation News
  - Donations
  - New supporting institutions
  - New supporting members
Figure 1: A snapshot of the widget when pExplorer is called.
• Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just being clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.
• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.
• Clicking the Try examples button invokes eExplorer, to be discussed next.
eExplorer
eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.
Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.
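These browsers can be launched directly from an R session. A minimal sketch, assuming the Bioconductor tkWidgets package (the home of pExplorer, eExplorer and vExplorer) is installed and R has working Tcl/Tk support:

```r
## Hypothetical session; requires the Bioconductor package
## tkWidgets and Tcl/Tk support in R
library(tkWidgets)

pExplorer("base")  # browse the contents of the 'base' package
eExplorer("base")  # browse and execute its example code
vExplorer()        # list all installed packages with vignettes
```

Each call pops up the corresponding widget described in the text; the package name argument can be any installed package.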
vExplorer
vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionality provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of the vignettes of the package.
Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget, as shown by Figure 4, appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:
• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.
• Clicking the View PDF button shows the PDF version of the vignette currently being explored.
• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in Results of Execution.
• Clicking the Export to R button exports the result of execution to the global environment of the current R session.
The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.
References
Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28-31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.
Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21-24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.
Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base").
Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().
Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.
The survival package
by Thomas Lumley
This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).
Overview
Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.
In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.
Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).
Specifying survival data
In the popular 'counting process' formulation of survival analysis, each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a Surv object by the Surv() function. If the start variable is zero for everyone it may be omitted.
In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.
> library(survival)
> data(veteran)  ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+
...
> ## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
                     diagtime*30+time,
                     status))
 [1] ( 210, 282 ] ( 150, 561 ]
 [3] (  90, 318 ] ( 270, 396 ]
...
 [9] ( 540, 854 ] ( 180, 280+]
The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.
One and two sample summaries
The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a survfit object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.
Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.
> data(veteran)
> plot(survfit(Surv(time, status)~trt, data=veteran),
       xlab="Years since randomisation",
       xscale=365, ylab="% surviving", yscale=100,
       col=c("forestgreen","blue"))
> survdiff(Surv(time, status)~trt, data=veteran)
Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

       N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394
      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928
Figure 1: Survival distributions for two lung cancer treatments. (Axes: years since randomisation vs. % surviving; groups: Standard, New.)
Proportional hazards models
The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.
Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

  log h(t; z) = log h0(t) + βz.

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.
A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.
> data(pbc)
> mayomodel <- coxph(Surv(time, status)~edtrt+
                     log(bili)+log(protime)+
                     age+platelet,
                     data=pbc, subset=trt>0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
      log(bili) + log(protime) + age + platelet,
      data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999
             se(coef)     z       p
edtrt        0.300321  3.43 0.00061
log(bili)    0.097771  9.73 0.00000
log(protime) 1.031908  2.80 0.00520
age          0.008489  4.18 0.00003
platelet     0.000927 -1.38 0.17000

Likelihood ratio test=185 on 5 df, p=0  n= 312
The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.
> plot(survfit(Surv(time, status)~edtrt,
               data=pbc, subset=trt==-9))
> lines(survexp(~edtrt+
                ratetable(edtrt=edtrt, bili=bili,
                          platelet=platelet, age=age,
                          protime=protime),
                data=pbc, subset=trt==-9,
                ratetable=mayomodel, cohort=TRUE),
        col="purple")
The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.
Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.
Figure 2: Observed and predicted survival.
The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e., that β is constant over time. The cox.zph function provides formal tests and graphical diagnostics for proportional hazards.
> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

> ## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])
Figure 3: Testing proportional hazards. (Smoothed estimate of Beta(t) for edtrt against time.)
There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.
Additional features
Computation for the Cox model extends straightforwardly to situations where z can vary over time, and when an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.
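A sketch of how these terms might appear in a model formula; the data frame mydata and its variables (id, centre, treatment, and the start/stop/event triple) are hypothetical, not drawn from the examples in this article:

```r
library(survival)
## Hypothetical counting-process data: repeated records per
## subject (id), with a separate baseline hazard per centre
fit <- coxph(Surv(start, stop, event) ~ treatment +
               cluster(id) + strata(centre),
             data = mydata)
```

Here cluster(id) produces robust standard errors that account for correlation within subjects, while strata(centre) fits an unspecified baseline hazard separately for each centre.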
Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
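For instance, a parametric Weibull fit to the veteran data used earlier in this article might look like the following (a sketch, not part of the original article):

```r
library(survival)
data(veteran)
## Weibull accelerated-failure-time model for time to death
wfit <- survreg(Surv(time, status) ~ trt,
                data = veteran, dist = "weibull")
summary(wfit)
```

The dist= argument selects among the available parametric error distributions (e.g., "weibull", "lognormal", "loglogistic").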
More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.
The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of followup.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
useR! 2004: The R User Conference
by John Fox, McMaster University, Canada
More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.
The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.
A particularly interesting component of the program were the keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:
• Paul Murrell on grid graphics and programming
• Martin Mächler on good programming practice
• Luke Tierney on namespaces and byte compilation
• Kurt Hornik on packaging, documentation, and testing
• Friedrich Leisch on S4 classes and methods
• Peter Dalgaard on language interfaces, using .Call and .External
• Douglas Bates on multilevel models in R
• Brian D. Ripley on data mining and related topics
Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.
An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.
The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19, and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.
Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.
R Help Desk
Date and Time Classes in R
Gabor Grothendieck and Thomas Petzoldt
Introduction
R-1.9.0 and its contributed packages have a number of datetime (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:
• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).
• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).
• POSIX classes. POSIX classes refer to the two classes POSIXct, POSIXlt, and their common super class POSIXt. These support times and dates, including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases, and other applications that involve external communication with systems that also support dates, times, time zones, and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970, GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of
the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
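The internal representations described above can be checked directly with unclass(); a small sketch:

```r
## Date: days since 1970-01-01
unclass(as.Date("1970-01-02"))   # 1

## chron: days since 1970-01-01, fractions of days for times
library(chron)                   # contributed package, from CRAN
unclass(chron("01/02/70") + 0.5) # 1.5 = noon on January 2nd, 1970

## POSIXct: seconds since 1970-01-01 GMT
unclass(as.POSIXct("1970-01-01 00:01:00", tz = "GMT"))  # 60
```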
When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.
Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.
Other Applications
Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word "usually" was employed above.
SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).
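The spreadsheet conversions above can be collected into a short sketch; x here is a hypothetical vector of Excel serial date numbers, not real data:

```r
library(chron)
x <- c(38300.0, 38300.25)  # hypothetical Excel serial dates

## Excel on Windows: origin December 30, 1899
as.Date("1899-12-30") + floor(x)
chron("12/30/1899") + x

## Excel on a Mac: origin January 1, 1904
as.Date("1904-01-01") + floor(x)
chron("01/01/1904") + x
```

Note that the Date versions discard the time-of-day fraction via floor(), while the chron versions retain it as a fraction of a day.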
Time Series
A common use of dates and datetimes are in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its), and zoo (in package zoo). ts uses a scheme of its own for dates. irts and its use POSIXct dates to represent datetimes. zoo can use any date or time/date package to represent datetimes.
Input
By default, read.table will read in numeric data such as 20040115 as numbers and will read non-numeric data such as 12/15/04 or 2004-12-15 as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹
# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)
Avoiding Errors
The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.
With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:
# default origin
options(chron.origin = c(month=1, day=1, year=1970))
# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)
# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)
# if TRUE then year not displayed
options(chron.simplify = FALSE)
For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,
1 This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.
R News ISSN 1609-3631
Vol. 4/1, June 2004, page 31
orig <- chron("01/01/60")
x <- 0:9  # days since 01/01/60
# chron dates with default origin
orig + x
Regarding POSIX classes, the user should be aware of the following:
• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:
dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)
This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in Table 1 of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.
• OS. Some of the date calculations done by POSIX functions are passed off to the operating system and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.
• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst = 1) or not (isdst = 0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.
• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
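The "convert to character first" advice can be sketched as follows; all time zones are stated explicitly so the round trip is unambiguous (the datetime is a made-up example):

```r
# A POSIXct datetime in a known time zone.
dp <- as.POSIXct("2004-06-15 12:00:00", tz = "GMT")

# POSIXct -> Date via character, rather than converting directly:
as.Date(format(dp, tz = "GMT"))                       # "2004-06-15"

# POSIXct -> POSIXlt via character; isdst is well defined (0 in GMT):
as.POSIXlt(format(dp, tz = "GMT"), tz = "GMT")$isdst  # 0
```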
Comparison Table
Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.
Bibliography
Alzola, C. F. and Harrell, F. E. (2002). An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.
James, D. A. and Pregibon, D. (1993). Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.
Ripley, B. D. and Hornik, K. (2001). Date-time classes. R News, 1(2):8-11. URL http://CRAN.R-project.org/doc/Rnews/.
Gabor Grothendieck
ggrothendieck@myway.com
Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de
now
    Date:     Sys.Date()
    chron:    as.chron(as.POSIXct(format(Sys.time()), tz = "GMT"))
    POSIXct:  Sys.time()

origin
    Date:     structure(0, class = "Date")
    chron:    chron(0)
    POSIXct:  structure(0, class = c("POSIXt", "POSIXct"))

x days since origin (GMT)
    Date:     structure(floor(x), class = "Date")
    chron:    chron(x)
    POSIXct:  structure(x*24*60*60, class = c("POSIXt", "POSIXct"))

diff in days
    Date:     dt1 - dt2
    chron:    dc1 - dc2
    POSIXct:  difftime(dp1, dp2, unit = "day") [1]

time zone difference
    POSIXct:  dp - as.POSIXct(format(dp), tz = "GMT")

compare
    Date:     dt1 > dt2
    chron:    dc1 > dc2
    POSIXct:  dp1 > dp2

next day
    Date:     dt + 1
    chron:    dc + 1
    POSIXct:  seq(dp0, length = 2, by = "DSTday")[2] [2]

previous day
    Date:     dt - 1
    chron:    dc - 1
    POSIXct:  seq(dp0, length = 2, by = "-1 DSTday")[2] [3]

x days since date
    Date:     dt0 + floor(x)
    chron:    dc0 + x
    POSIXct:  seq(dp0, length = 2, by = paste(x, "DSTday"))[2] [4]

sequence
    Date:     dts <- seq(dt0, length = 10, by = "day")
    chron:    dcs <- seq(dc0, length = 10)
    POSIXct:  dps <- seq(dp0, length = 10, by = "DSTday")

plot
    Date:     plot(dts, rnorm(10))
    chron:    plot(dcs, rnorm(10))
    POSIXct:  plot(as.POSIXct(format(dps), tz = "GMT"), rnorm(10)) [5]

every 2nd week
    Date:     seq(dt0, length = 3, by = "2 week")
    chron:    dc0 + seq(0, length = 3, by = 14)
    POSIXct:  seq(dp0, length = 3, by = "2 week")

first day of month
    Date:     as.Date(format(dt, "%Y-%m-01"))
    chron:    chron(dc) - month.day.year(dc)$day + 1
    POSIXct:  as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
    Date:     as.numeric(format(dt, "%m"))
    chron:    months(dc)
    POSIXct:  as.numeric(format(dp, "%m"))

day of week (Sun = 0)
    Date:     as.numeric(format(dt, "%w"))
    chron:    as.numeric(dates(dc) - 3) %% 7
    POSIXct:  as.numeric(format(dp, "%w"))

day of year
    Date:     as.numeric(format(dt, "%j"))
    chron:    as.numeric(format(as.Date(dc), "%j"))
    POSIXct:  as.numeric(format(dp, "%j"))

mean each month/year
    Date:     tapply(x, format(dt, "%Y-%m"), mean)
    chron:    tapply(x, format(as.Date(dc), "%Y-%m"), mean)
    POSIXct:  tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
    Date:     tapply(x, format(dt, "%m"), mean)
    chron:    tapply(x, months(dc), mean)
    POSIXct:  tapply(x, format(dp, "%m"), mean)

Output, "yyyy-mm-dd"
    Date:     format(dt)
    chron:    format(as.Date(dates(dc)))
    POSIXct:  format(dp, "%Y-%m-%d")

Output, "mm/dd/yy"
    Date:     format(dt, "%m/%d/%y")
    chron:    format(dates(dc))
    POSIXct:  format(dp, "%m/%d/%y")

Output, "Sat Jul 1 1970"
    Date:     format(dt, "%a %b %d %Y")
    chron:    format(as.Date(dates(dc)), "%a %b %d %Y")
    POSIXct:  format(dp, "%a %b %d %Y")

Input, "1970-10-15"
    Date:     as.Date(z)
    chron:    chron(z, format = "y-m-d")
    POSIXct:  as.POSIXct(z)

Input, "10/15/1970"
    Date:     as.Date(z, "%m/%d/%Y")
    chron:    chron(z)
    POSIXct:  as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz = "")
    to Date:     from chron:   as.Date(dates(dc))
                 from POSIXct: as.Date(format(dp))
    to chron:    from Date:    chron(unclass(dt))
                 from POSIXct: chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]
    to POSIXct:  from Date:    as.POSIXct(format(dt))
                 from chron:   as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz = "GMT")
    to Date:     from chron:   as.Date(dates(dc))
                 from POSIXct: as.Date(dp)
    to chron:    from Date:    chron(unclass(dt))
                 from POSIXct: as.chron(dp)
    to POSIXct:  from Date:    as.POSIXct(format(dt), tz = "GMT")
                 from chron:   as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See strptime for the % codes.

[1] Can be reduced to dp1 - dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp + 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp - 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp + 24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz = "GMT")).

Table 1: Comparison Table
Programmers' Niche: A simple class in S3 and S4
by Thomas Lumley
ROC curves
Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.
The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).
I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.
Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that -Inf, Inf and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.
drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
    function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
    function(c) sum((1 - D)[T <= c])/sum(1 - D))
  plot(1 - spec, sens, type = "l")
}
It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T == TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.
There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.
drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  plot(mspec, sens, type = "l")
}
One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
Classes and methods
Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.
The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec, test = TT,
    call = sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}
The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.
plot.ROC <- function(x, type = "b", null.line = TRUE,
    xlab = "1 - Specificity", ylab = "Sensitivity",
    main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
    xlab = xlab, ylab = ylab, ...)
  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}
A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:
lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}
A method for identify allows the cutpoint to be found for any point on the curve:
identify.ROC <- function(x, labels = NULL, ...,
    digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}
An AUC function to compute the area under the ROC curve is left as an exercise for the reader.
Generalising the class
The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.
ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
    test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.
ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
      c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously
 far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
    test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
S4 classes and methods
In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).
As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.
The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:
setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
    test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
    length(object@sens) == length(object@test)
  })
The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new
function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  new("ROC", sens = sens, mspec = mspec,
    test = TT, call = sys.call())
}
Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:
setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })
The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.
The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.
setMethod("plot", c("ROC", "missing"),
  function(x, y, type = "b", null.line = TRUE,
      xlab = "1 - Specificity",
      ylab = "Sensitivity",
      main = NULL, ...) {
    par(pty = "s")
    plot(x@mspec, x@sens, type = type,
      xlab = xlab, ylab = ylab, ...)
    if (null.line)
      abline(0, 1, lty = 3)
    if (is.null(main))
      main <- x@call
    title(main = main)
  })
Creating a method for lines appears to work the same way:
setMethod("lines", "ROC",
  function(x, ...)
    lines(x@mspec, x@sens, ...))
In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before
> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>
and after the setMethod call
> lines
standardGeneric for "lines" defined from
package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.
Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.
setGeneric("ROC")
setMethod("ROC", c("glm", "missing"),
  function(T, ...) {
    if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type = "response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far
 from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    new("ROC", sens = sens, mspec = mspec,
      test = TT, call = sys.call())
  })
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.
setGeneric("identify", signature = c("x"))
setMethod("identify", "ROC",
  function(x, labels = NULL, ...,
      digits = 1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
      labels = labels, ...)
  })
Discussion
Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.
This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R
by the R Core Team
New features in 1.9.1
• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.
User-visible changes in 1.9.0
• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.
Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has
been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook; see setHook.
• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.
• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.
New features in 1.9.0
• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace.

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package, which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.
• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for S compatibility.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments and split="" is optimized.

• subset() now allows a drop argument, which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See setMethod.

• Changes to package 'grid':

  - Renamed push/pop.viewport() to push/popViewport().

  - Added upViewport(), downViewport(), and seekViewport() to allow creation and navigation of a viewport tree (rather than just a viewport stack).

  - Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.

  - Added vpList(), vpStack(), vpTree(), and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

  - Added vpPath() to allow specification of a viewport path in downViewport() and seekViewport(). See viewports for an example of its use.

    NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator :: in later versions).

  - Added just argument to grid.layout() to allow justification of the layout relative to the parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

  - Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:
      pushViewport(viewport(w=.5, h=.5, name="A"))
      grid.rect()
      pushViewport(viewport(w=.5, h=.5, name="B"))
      grid.rect(gp=gpar(col="grey"))
      upViewport(2)
      grid.rect(vp="A", gp=gpar(fill="red"))
      grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))
ndash Added enginedisplaylist() functionThis allows the user to tell grid NOT touse the graphics engine display list andto handle ALL redraws using its own dis-play list (including redraws after deviceresizes and copies)This provides a way to avoid some of theproblems with resizing a device when youhave used gridconvert() or the grid-Base package or even base functions suchas legend()There is a document discussing the useof display lists in grid on the grid website httpwwwstataucklandacnz~paulgridgridhtml
– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:
    * grobs all now have a name slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).
    * grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)
    * the grobwidth and grobheight units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.
  In addition, the following features are now possible with grobs:
    * grobs now save()/load() like any normal R object.
    * many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).
    * there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a childrenvp slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.
    * there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.
    * there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.
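A minimal sketch of the grob-centred workflow (the grob names here are invented):

```r
library(grid)

## Build a scene as grobs, without drawing anything yet
box   <- rectGrob(gp = gpar(fill = "grey90"), name = "box")
label <- textGrob("hello", name = "label")
scene <- gTree(children = gList(box, label), name = "scene")

grid.newpage()
grid.draw(scene)

## Edit a child of the drawn gTree via a gPath of grob names
grid.edit(gPath("scene", "label"), label = "goodbye")
```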
– Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).
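For example (a sketch):

```r
library(grid)

u <- stringWidth("a string")  ## same as unit(1, "strwidth", "a string")
r <- rectGrob(name = "r")
w <- grobWidth(r)             ## same as unit(1, "grobwidth", r)
```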
– Allowed viewports to turn off clipping altogether. Possible settings for the viewport clip arg are now:
    "on": clip to the viewport (was TRUE)
    "inherit": clip to whatever the parent says (was FALSE)
    "off": turn clipping off altogether
  Still accept logical values (and NA maps to "off").
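A sketch of the new "off" setting alongside "on":

```r
library(grid)

grid.newpage()
pushViewport(viewport(width = 0.5, height = 0.5, clip = "on"))
grid.circle(r = 0.7)                          ## clipped at the viewport edge
pushViewport(viewport(clip = "off"))
grid.circle(r = 0.7, gp = gpar(col = "red"))  ## clipping off: drawn in full
popViewport(2)
```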
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.
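For instance, to reproduce interactively what R CMD check now does for a help page (the topic here is chosen arbitrarily):

```r
## Runs the examples for ?sample with the default RNGkind
## and set.seed(1) first, matching R CMD check behaviour
example("sample", setRNG = TRUE)
```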
• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.
• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.
• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.
• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.
• Vignette dependencies can now be checked and obtained via vignetteDepends().
• Option 'repositories' to list URLs for package repositories added.
• package.description() has been replaced by packageDescription().
• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.
• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.
• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.
• hsv2rgb and rgb2hsv are newly in the C API.
• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.
• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.
• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp(), and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.
• print.coefmat() is defunct, replaced by printCoefmat().
• codes() and codes<-() are defunct.
• anovalist.lm (replaced in 1.2.0) is now defunct.
• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.
• print.atomic() is defunct.
• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).
• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.
• La.eigen() is deprecated; now eigen() uses LAPACK by default.
• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(, deriv=2) and psigamma(, deriv=3).
• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.
• package.contents() and package.description() have been deprecated.
• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.
  The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.
  The included zlib sources have been updated to 1.2.1 and this is now required if --with-zlib is used.
• configure no longer lists bzip2 and PCRE as 'additional capabilities' as all builds of R have had them since 1.7.0.
• --with-blas=goto to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN
by Kurt Hornik
New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.
DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian multinomial probit model via Markov chain Monte Carlo. Along with the standard multinomial probit model, it can also fit models with different choice sets for each observation, and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath and David Bauer.

RUnit R functions implementing a standard unit testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.
debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S code (EVIS) by Alexander McNeil; R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an environment and software collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others; see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.
fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.
fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, and some statistics in linkage analysis and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. The package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
kernlab Kernel-based machine learning methods including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between- versus within-chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal and external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so-called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data.frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for robust location and scatter estimation and robust regression with high breakdown point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe algorithm using the method described in Schoenfeld, D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS 2002, 99, pg. 15869, and J. Comput. Chem. 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for supervised grouping of predictor variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (simulated urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.
Other changes
• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News
by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)
• Emanuele De Rinaldis (Italy)
• Norm Phillips (Canada)
New supporting institutions
• Breast Center at Baylor College of Medicine, Houston, Texas, USA
• Dana-Farber Cancer Institute, Boston, USA
• Department of Biostatistics, Vanderbilt University School of Medicine, USA
• Department of Statistics & Actuarial Science, University of Iowa, USA
• Division of Biostatistics, University of California, Berkeley, USA
• Norwegian Institute of Marine Research, Bergen, Norway
• TERRA Lab, University of Regina - Department of Geography, Canada
• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board: Douglas Bates and Paul Murrell.

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing. Communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions go to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage: http://www.R-project.org/

This newsletter is available online at http://CRAN.R-project.org/doc/Rnews/
• Double clicking an item in the Contents list box results in different responses, depending on the nature of the item being clicked. If the item is a subdirectory, the Contents will be updated with the contents of the subdirectory corresponding to the item just clicked, and the Back button right below the list box will be activated. If the item is a file, the contents of the file will be displayed in the Display window text box.

• Clicking the Back button allows users to go back to the parent of the subdirectory whose contents are being displayed in the Contents list box.

• Clicking the Try examples button invokes eExplorer, to be discussed next.
eExplorer
eExplorer allows users to explore the example code from all the help files for the specified package. When invoked by passing a valid package name, eExplorer will display the names of the files for example code stored in the R-ex subdirectory in the Example code chunk list box, as shown by Figure 2.

Clicking an item in the list displays the code example in the R Source Code text box for users to view. Users may view the help file containing the example code by clicking the View Help button, execute the example code and have the output of the execution displayed in the Display window by clicking the Execute Code button, or export the result of execution (if R objects have been created as the result of execution) to the global environment of the current R session by clicking the Export to R button.
vExplorer
vExplorer allows users to explore the vignettes for a given R package by taking advantage of the functionalities provided by Sweave (Leisch, 2002, 2003). When vExplorer is invoked without a package name, as shown in Figure 3, the names of all the installed R packages containing vignettes will be shown in the Package List list box. When vExplorer is invoked with a package name, the Vignette List list box will be populated with the names of the vignettes of the package.

Clicking a package name in the list box shows the names of all the vignettes of the package in the Vignette List list box. When a vignette name is clicked, a new widget as shown by Figure 4 appears that allows users to perform the following tasks if the selected vignette contains any executable code chunks:
• Clicking an active button in the Code Chunk text box displays the code chunk in the R Source Code text box for users to view.

• Clicking the View PDF button shows the PDF version of the vignette currently being explored.

• Clicking the Execute Code button executes the code shown in the text box and displays the results of the execution in the Results of Execution text box.

• Clicking the Export to R button exports the result of execution to the global environment of the current R session.

The evaluation model for code chunks in a vignette is that they are evaluated sequentially. In order to ensure sequential evaluation, and to help guide the user in terms of which code chunk is active, we inactivate all buttons except the one associated with the first unevaluated code chunk.
References

Friedrich Leisch. Sweave, part I: Mixing R and LaTeX. R News, 2(3):28–31, December 2002. URL http://CRAN.R-project.org/doc/Rnews/.

Friedrich Leisch. Sweave, part II: Package vignettes. R News, 3(2):21–24, October 2003. URL http://CRAN.R-project.org/doc/Rnews/.
Figure 2: A snapshot of the widget when eExplorer is called by typing eExplorer("base")
Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer()
Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked
The survival package
by Thomas Lumley

This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).
Overview
Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).
Specifying survival data
In the popular lsquocounting processrsquo formulation of sur-vival analysis each record in a dataset has three vari-ables describing the survival time A start variablespecifies when observation begins a stop variablespecifies when it ends and an event variable is 1 ifobservation ended with an event and 0 if it endedwithout seeing the event These variables are bun-dled into a Surv object by the Surv() functionIf the start variable is zero for everyone it may beomitted
In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.
> library(survival)
> data(veteran)  ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118  10  82
 [8] 110  314  100+  42   8  144  25+
...

## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
                     diagtime*30 + time, status))
 [1] ( 210, 282 ]  ( 150, 561 ]
 [3] (  90, 318 ]  ( 270, 396 ]
...
 [9] ( 540, 854 ]  ( 180, 280+]
The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.
One and two sample summaries
The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.
Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and from the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.
> data(veteran)
> plot(survfit(Surv(time, status) ~ trt,
               data = veteran),
       xlab = "Years since randomisation",
       xscale = 365.25, ylab = "% surviving",
       yscale = 100,
       col = c("forestgreen", "blue"))
> survdiff(Surv(time, status) ~ trt, data = veteran)
Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

       N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394

      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928
[Figure: Kaplan-Meier curves for the Standard and New treatments; per cent surviving (0-100) against years since randomisation (0-2.5)]

Figure 1: Survival distributions for two lung cancer treatments
Proportional hazards models
The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.
Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

    log h(t; z) = log h0(t) + βz.
Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.
A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time of the study there was no effective treatment. The model is based on data from 312 patients in a randomised trial.
> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
                     log(bili) + log(protime) +
                     age + platelet,
                     data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
    log(bili) + log(protime) + age + platelet,
    data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999

             se(coef)     z       p
edtrt        0.300321  3.43 0.00061
log(bili)    0.097771  9.73 0.00000
log(protime) 1.031908  2.80 0.00520
age          0.008489  4.18 0.00003
platelet     0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312
The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.
> plot(survfit(Surv(time, status) ~ edtrt,
               data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt +
           ratetable(edtrt = edtrt, bili = bili,
                     platelet = platelet, age = age,
                     protime = protime),
           data = pbc, subset = trt == -9,
           ratetable = mayomodel, cohort = TRUE),
        col = "purple")
The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.
Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.
[Figure: observed and predicted survival curves; survival probability (0.0-1.0) against time in days (0-4000)]

Figure 2: Observed and predicted survival
The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The
cox.zph function provides formal tests and graphical diagnostics for proportional hazards.
> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796
## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])
[Figure: smoothed estimate of Beta(t) for edtrt against time]

Figure 3: Testing proportional hazards
There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.
Additional features
Computation for the Cox model extends straightforwardly to situations where z can vary over time, and to when an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual, and a strata() term indicating subgroups that get their own unspecified baseline hazard function.
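As an illustration (the dataset and model choice here are mine, not the article's), both terms can be sketched with the bladder cancer recurrence data that ships with the survival package:

```r
## Sketch: cluster() and strata() terms in coxph(),
## using the bladder recurrent-events data from survival
library(survival)
data(bladder)

## cluster(id): records sharing an id come from one subject,
## so coxph reports robust (sandwich) standard errors
fit.cl <- coxph(Surv(stop, event) ~ rx + number + size +
                cluster(id), data = bladder)

## strata(rx): each treatment arm gets its own unspecified
## baseline hazard instead of a single rx coefficient
fit.st <- coxph(Surv(stop, event) ~ number + size +
                strata(rx), data = bladder)
```
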
Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
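For instance (the particular model is my choice, not the article's), a Weibull fit to the veteran data might look like:

```r
## Sketch: a parametric Weibull accelerated-failure-time model
library(survival)
data(veteran)
wfit <- survreg(Surv(time, status) ~ trt + karno,
                data = veteran, dist = "weibull")
## coefficients are on the log survival-time scale
coef(wfit)
```
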
More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.
The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of follow-up.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
useR 2004
The R User Conference

by John Fox, McMaster University, Canada
More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.
The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.
A particularly interesting component of the program was the series of keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:
• Paul Murrell on grid graphics and programming
• Martin Mächler on good programming practice
• Luke Tierney on namespaces and byte compilation
• Kurt Hornik on packaging, documentation, and testing
• Friedrich Leisch on S4 classes and methods
• Peter Dalgaard on language interfaces, using .Call and .External
• Douglas Bates on multilevel models in R
• Brian D. Ripley on data mining and related topics
Slides for the keynote addresses, along with abstracts of other papers presented at useR 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.
An innovation of useR 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.
The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19 and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.
Everyone with whom I spoke, both at and after useR 2004, was very enthusiastic about the conference. We look forward to an even larger useR in 2006.
R Help Desk
Date and Time Classes in R
Gabor Grothendieck and Thomas Petzoldt
Introduction
R-1.9.0 and its contributed packages have a number of datetime (i.e. date or date/time) classes. In particular, the following three classes are discussed in this article:
• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).
• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).
• POSIX classes. POSIX classes refer to the two classes POSIXct and POSIXlt and their common super class POSIXt. These support times and dates including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases, and other applications that involve external communication with systems that also support dates, times, time zones, and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970, GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of
the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIX classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
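The internal representations above can be checked directly with unclass() and as.numeric(); a small sketch (the example values are mine):

```r
## Date: days since 1970-01-01
unclass(as.Date("1970-01-03"))    # 2

## POSIXct: seconds since 1970-01-01 00:00:00 GMT
as.numeric(as.POSIXct("1970-01-02 00:00:00", tz = "GMT"))  # 86400

## chron stores days since 01/01/70 plus a fractional day, so
## (with the chron package loaded)
## as.numeric(chron(dates = "01/02/70", times = "12:00:00")) is 1.5
```
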
When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.
Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.
Other Applications
Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron respectively. It's possible to set Excel to use either origin, which is why the word 'usually' was employed above.
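A quick sketch with assumed serial numbers (36526 is 2000-01-01 in Excel's Windows 1900 date system):

```r
## Windows-Excel serial dates to Date (example values are mine)
x <- c(36526, 36526.5)            # 2000-01-01, and noon that day
as.Date("1899-12-30") + floor(x)  # both give "2000-01-01"
```
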
SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).
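For example (the value below is assumed, not from the article), a SAS datetime can be converted by supplying the 1960 origin:

```r
## SAS datetimes: seconds since 1960-01-01 (example value is mine)
x <- 86400                                        # one day of seconds
as.POSIXct(x, origin = "1960-01-01", tz = "GMT")  # 1960-01-02 GMT
```
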
Time Series
A common use of dates and datetimes is in time series. The ts class represents regular time series (i.e. equally spaced points) and is mostly used if the frequency is monthly, quarterly, or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its), and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or datetime package to represent datetimes.
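A minimal ts sketch (the values are made up) showing the frequency-based labelling scheme:

```r
## a regular monthly series: points are labelled by fractional
## years derived from start= and frequency=, not by a date class
z <- ts(rnorm(24), start = c(2002, 1), frequency = 12)
frequency(z)   # 12
start(z)       # 2002 1
```
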
Input
By default, read.table will read in numeric data such as 20040115 as numbers and will read non-numeric data such as 12/15/04 or 2004-12-15 as factors. In either case one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹
# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)
Avoiding Errors
The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.
With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:
# default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))

# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)

# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)

# if TRUE then year not displayed
options(chron.simplify = FALSE)
For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,
¹ This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.
orig <- chron("01/01/60")
x <- 0:9   # days since 01/01/60

# chron dates with default origin
orig + x
Regarding POSIX classes, the user should be aware of the following:
• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year, and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:
dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)
This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.
• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.
• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.
• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
Comparison Table
Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.
Bibliography
Alzola, C. F. and Harrell, F. E. (2002). An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf

James, D. A. and Pregibon, D. (1993). Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf

Ripley, B. D. and Hornik, K. (2001). Date-time classes. R News, 1(2):8-11. URL http://CRAN.R-project.org/doc/Rnews/
Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de
R News ISSN 1609-3631
Table 1: Comparison Table

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for % codes.

now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz = "GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class = "Date")
  chron:   chron(0)
  POSIXct: structure(0, class = c("POSIXt", "POSIXct"))

x days since origin
  Date:    structure(floor(x), class = "Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class = c("POSIXt", "POSIXct"))  # GMT days

diff in days
  Date:    dt1 - dt2
  chron:   dc1 - dc2
  POSIXct: difftime(dp1, dp2, unit = "day")  [1]

time zone difference
  POSIXct: dp - as.POSIXct(format(dp), tz = "GMT")

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt + 1
  chron:   dc + 1
  POSIXct: seq(dp0, length = 2, by = "DSTday")[2]  [2]

previous day
  Date:    dt - 1
  chron:   dc - 1
  POSIXct: seq(dp0, length = 2, by = "-1 DSTday")[2]  [3]

x days since date
  Date:    dt0 + floor(x)
  chron:   dc0 + x
  POSIXct: seq(dp0, length = 2, by = paste(x, "DSTday"))[2]  [4]

sequence
  Date:    dts <- seq(dt0, length = 10, by = "day")
  chron:   dcs <- seq(dc0, length = 10)
  POSIXct: dps <- seq(dp0, length = 10, by = "DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz = "GMT"), rnorm(10))  [5]

every 2nd week
  Date:    seq(dt0, length = 3, by = "2 week")
  chron:   dc0 + seq(0, length = 3, by = 14)
  POSIXct: seq(dp0, length = 3, by = "2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc) - month.day.year(dc)$day + 1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun = 0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc) - 3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output: yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output: mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output: Sat Jul 1 1970
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input: "1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format = "y-m-d")
  POSIXct: as.POSIXct(z)

Input: "10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz = "")
  to Date:    as.Date(dates(dc)) [from chron]; as.Date(format(dp)) [from POSIXct]
  to chron:   chron(unclass(dt)) [from Date]; chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [from POSIXct]  [6]
  to POSIXct: as.POSIXct(format(dt)) [from Date]; as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1)) [from chron]

Conversion (tz = "GMT")
  to Date:    as.Date(dates(dc)) [from chron]; as.Date(dp) [from POSIXct]
  to chron:   chron(unclass(dt)) [from Date]; as.chron(dp) [from POSIXct]
  to POSIXct: as.POSIXct(format(dt), tz = "GMT") [from Date]; as.POSIXct(dc) [from chron]

[1] Can be reduced to dp1 - dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp + 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp - 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp + 24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz = "GMT")).
Programmers' Niche
A simple class in S3 and S4
by Thomas Lumley
ROC curves
Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.
The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).
I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.
Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±∞ and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.
drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c]) / sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1 - D)[T <= c]) / sum(1 - D))
  plot(1 - spec, sens, type = "l")
}
It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.
There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.
drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  plot(mspec, sens, type = "l")
}
One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
Classes and methods
Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.
The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}
The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.
plot.ROC <- function(x, type = "b", null.line = TRUE,
                     xlab = "1-Specificity",
                     ylab = "Sensitivity",
                     main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
       xlab = xlab, ylab = ylab, ...)
  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}
A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:
lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}
A method for identify allows the cutpoint to be found for any point on the curve:
identify.ROC <- function(x, labels = NULL, ...,
                         digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}
An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
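One possible solution (mine, not the article's) applies the trapezoidal rule to the sens and mspec components, after anchoring the curve at the origin:

```r
## Sketch of an AUC function for the S3 "ROC" objects above:
## trapezoidal rule over (mspec, sens), which both increase to 1
AUC <- function(x) {
  fpr <- c(0, x$mspec)
  tpr <- c(0, x$sens)
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}

## a perfectly discriminating test has area 1
AUC(list(sens = c(1, 1), mspec = c(0, 1)))   # 1
```
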
Generalising the class
The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.
ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.
ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
S4 classes and methods
In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).
As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.
The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:
setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })
The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new
function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  new("ROC", sens = sens, mspec = mspec,
      test = TT, call = sys.call())
}
Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:
setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })
The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.
The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.
setMethod("plot", c("ROC", "missing"),
  function(x, y, type = "b", null.line = TRUE,
           xlab = "1-Specificity",
           ylab = "Sensitivity",
           main = NULL, ...) {
    par(pty = "s")
    plot(x@mspec, x@sens, type = type,
         xlab = xlab, ylab = ylab, ...)
    if (null.line)
      abline(0, 1, lty = 3)
    if (is.null(main))
      main <- x@call
    title(main = main)
  })
Creating a method for lines appears to work the same way:
setMethod("lines", "ROC",
  function(x, ...)
    lines(x@mspec, x@sens, ...)
  )
In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before
> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>
and after the setMethod call
> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens. Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
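Whether a function is already an S4 generic can be checked with isGeneric() from the methods package; a brief sketch of the sequence described above:

```r
library(methods)
isGeneric("lines")   # FALSE in a fresh session: no S4 generic yet
setGeneric("lines")  # states the intent explicitly, as suggested above
isGeneric("lines")   # TRUE: an S4 generic now exists for lines
```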
Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.
setGeneric("ROC")
setMethod("ROC", c("glm", "missing"),
  function(T, D) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type="response"))
    disease <- disease*weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    new("ROC", sens=sens, mspec=mspec,
        test=TT, call=sys.call())
  })
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.
setGeneric("identify", signature=c("x"))
setMethod("identify", "ROC",
  function(x, labels=NULL, ...,
           digits=1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels=labels, ...)
  })
Discussion
Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.
This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R
by the R Core Team
New features in 1.9.1
• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.
User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.
• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

  Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

  Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

  Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook: see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.
New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.
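For illustration, a minimal sketch of the new accessors:

```r
e <- new.env()
e$x <- 42            # like assign("x", 42, envir = e)
e[["x"]]             # like get("x", envir = e, inherits = FALSE)
e[["y"]] <- "hello"  # character indices only; no partial matching
```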
• There are now print() and "[" methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace() parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name, and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.
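For illustration, a minimal sketch of the new class:

```r
d <- as.Date("2004-06-01")
class(d)      # "Date"
d + 30        # date arithmetic works in whole days
weekdays(d)   # utility functions similar to the date-time ones
```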
• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.
• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

  The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.
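For illustration (a minimal sketch):

```r
e <- new.env()
assign("a", 1, envir = e)
assign("b", 2, envir = e)
mget(c("a", "b"), envir = e)   # a named list: list(a = 1, b = 2)
```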
• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

  It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

  ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma, ...) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for S compatibility.

  If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

  If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

  The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new: an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.
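For illustration, a small sketch (the default is Bourne-shell style single quoting):

```r
## Protect a filename from word-splitting by the shell
cmd <- paste("ls", shQuote("file name with spaces"))
cmd
```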
• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split="" is optimized.
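For illustration: with fixed=TRUE the split string is taken literally rather than as a regular expression.

```r
strsplit("a.b.c", ".", fixed = TRUE)[[1]]   # "a" "b" "c"
strsplit("a.b.c", "\\.")[[1]]               # same result, via an escaped regexp
```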
• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.
• Changes to package 'grid':

  – Renamed push/pop.viewport() to push/popViewport().

  – Added upViewport(), downViewport(), and seekViewport() to allow creation and navigation of a viewport tree (rather than just a viewport stack).

  – Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.

  – Added vpList(), vpStack(), vpTree(), and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

  – Added vpPath() to allow specification of a viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use.

    NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

  – Added just argument to grid.layout() to allow justification of the layout relative to the parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

  – Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

      pushViewport(viewport(w=.5, h=.5, name="A"))
      grid.rect()
      pushViewport(viewport(w=.5, h=.5, name="B"))
      grid.rect(gp=gpar(col="grey"))
      upViewport(2)
      grid.rect(vp="A", gp=gpar(fill="red"))
      grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))
  – Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

  – Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

    * grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

    * grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

    * the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob "pointed to" is modified.

    In addition, the following features are now possible with grobs:

    * grobs now save()/load() like any normal R object.

    * many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

    * there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

    * there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

    * there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

  – Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth", and "grobheight" unit objects, respectively).

  – Allowed viewports to turn off clipping altogether. Possible settings for the viewport clip arg are now:

      "on" = clip to the viewport (was TRUE)
      "inherit" = clip to whatever parent says (was FALSE)
      "off" = turn off clipping

    Still accept logical values (and NA maps to "off").
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file, when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(*, deriv=2) and psigamma(*, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

  The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

  The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities' as all builds of R have had them since 1.7.0.

• --with-blas=goto to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN
by Kurt Hornik
New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.
BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.
BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.
DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaged by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001) S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.
debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.
fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.
fortunes R Fortunes. By Achim Zeileis, fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing hua Zhao, in collaboration with other colleagues and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. The package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.
kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.
mAr Estimation of multivariate AR models througha computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider2001) the procedure is of particular interest forhigh-dimensional data without missing valuessuch as geophysical fields By S M Barbosa
mathgraph Simple tools for constructing and ma-nipulating objects of class mathgraph fromthe book ldquoS Poetryrdquo available at httpwwwburns-statcompagesspoetryhtml ByOriginal S code by Patrick J Burns Ported toR by Nick Efthymiou
mcgibbsit Provides an implementation of Warnesamp Rafteryrsquos MCGibbsit run-length diagnos-tic for a set of (not-necessarily independent)MCMC sampers It combines the estimateerror-bounding approach of Raftery and Lewiswith evaluate between verses within chain ap-proach of Gelman and Rubin By Gregory RWarnes
mixreg Fits mixtures of one-variable regressions(which has been described as doing ANCOVAwhen you donrsquot know the levels) By RolfTurner
mscalib Calibration and filtering methods for cal-ibration of mass spectrometric peptide masslists Includes methods for internal externalcalibration of mass lists Provides methods forfiltering chemical noise and peptide contami-nants By Eryk Witold Wolski
multinomRob overdispersed multinomial regres-sion using robust (LQD and tanh) estimationBy Walter R Mebane Jr Jasjeet Singh Sekhon
mvbutils Utilities for project organization editingand backup sourcing documentation (formaland informal) package preparation macrofunctions and miscellaneous utilities Neededby debug package By Mark V Bravington
mvpart Multivariate regression trees Original rpartby Terry M Therneau and Beth Atkinson Rport by Brian Ripley Some routines from ve-gan by Jari Oksanen Extensions and adapta-tions of rpart to mvpart by Glenn Dersquoath
pgam This work is aimed at extending a class ofstate space models for Poisson count dataso called Poisson-Gamma models towards asemiparametric specification Just like the gen-eralized additive models (GAM) cubic splinesare used for covariate smoothing The semi-parametric models are fitted by an iterativeprocess that combines maximization of likeli-hood and backfitting algorithm By Washing-ton Junger
qcc Shewhart quality control charts for continu-ous attribute and count data Cusum andEWMA charts Operating characteristic curvesProcess capability analysis Pareto chart andcause-and-effect chart By Luca Scrucca
race Implementation of some racing methods for theempirical selection of the best If the R pack-age rpvm is installed (and if PVM is availableproperly configured and initialized) the eval-uation of the candidates are performed in par-allel on different hosts By Mauro Birattari
rbugs Functions to prepare files needed for run-ning BUGS in batch-mode and running BUGSfrom R Support for Linux systems with Wineis emphasized By Jun Yan (with part ofthe code modified from lsquobugsRrsquo httpwwwstatcolumbiaedu~gelmanbugsR by An-drew Gelman)
ref small package with functions for creating ref-erences reading from and writing ro refer-ences and a memory efficient refdata typethat transparently encapsulates matrixes anddataframes By Jens Oehlschlaumlgel
regress Functions to fit Gaussian linear model bymaximizing the residual log likelihood Thecovariance structure can be written as a lin-ear combination of known matrices Can beused for multivariate models and random ef-fects models Easy straight forward manner tospecify random effects models including ran-dom interactions By David Clifford Peter Mc-Cullagh
rgl 3D visualization device (OpenGL) By DanielAdler
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld

setRNG Set reproducible random number generator in R and S. By Paul Gilbert

sfsmisc Useful utilities ['goodies'] from Seminar für Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS 2002, 99, pg 15869, and J. Comput. Chem. 2003, 24, pg 1215. By Rajarshi Guha

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler

svmpath Computes the entire regularization path for the two-class SVM classifier, with essentially the same cost as a single SVM fit. By Trevor Hastie

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck
Other changes
- Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News
by Bettina Grün
Donations
- Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)
- Emanuele De Rinaldis (Italy)
- Norm Phillips (Canada)
New supporting institutions
- Breast Center at Baylor College of Medicine, Houston, Texas, USA
- Dana-Farber Cancer Institute, Boston, USA
- Department of Biostatistics, Vanderbilt University School of Medicine, USA
- Department of Statistics & Actuarial Science, University of Iowa, USA
- Division of Biostatistics, University of California, Berkeley, USA
- Norwegian Institute of Marine Research, Bergen, Norway
- TERRA Lab, University of Regina - Department of Geography, Canada
- ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA
Editorial Board: Douglas Bates and Paul Murrell.
Editor, Programmer's Niche: Bill Venables.
Editor, Help Desk: Uwe Ligges.
Email of editors and editorial board: firstname.lastname@R-project.org
R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).
R Project Homepage: http://www.R-project.org/
This newsletter is available online at http://CRAN.R-project.org/doc/Rnews/
Figure 2: A snapshot of the widget when pExplorer is called by typing eExplorer("base").

Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().

Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.
Vol 41 June 2004 26
The survival package
by Thomas Lumley
This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).
Overview
Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.
In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.
Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).
Specifying survival data
In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a Surv object by the Surv() function. If the start variable is zero for everyone it may be omitted.
In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.
> library(survival)
> data(veteran)  ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+

## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
                     diagtime*30+time,
                     status))
 [1] ( 210, 282 ]  ( 150, 561 ]
 [3] (  90, 318 ]  ( 270, 396 ]
 [9] ( 540, 854 ]  ( 180, 280+]
The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.
One and two sample summaries
The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a survfit object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.
Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.
> data(veteran)
> plot(survfit(Surv(time, status)~trt, data=veteran),
       xlab="Years since randomisation",
       xscale=365, ylab="% surviving", yscale=100,
       col=c("forestgreen","blue"))
> survdiff(Surv(time, status)~trt, data=veteran)

Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

       N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394

      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928
Figure 1: Survival distributions for two lung cancer treatments (Standard and New; x-axis: years since randomisation, y-axis: % surviving).
Proportional hazards models
The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.
Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

    log h(t; z) = log h0(t) + βz
Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.
A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.
> data(pbc)
> mayomodel <- coxph(Surv(time, status)~edtrt+
                       log(bili)+log(protime)+
                       age+platelet,
                     data=pbc, subset=trt>0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
        log(bili) + log(protime) + age + platelet,
      data = pbc, subset = trt > 0)

                  coef exp(coef)
edtrt          1.02980     2.800
log(bili)      0.95100     2.588
log(protime)   2.88544    17.911
age            0.03544     1.036
platelet      -0.00128     0.999

              se(coef)     z       p
edtrt         0.300321  3.43 0.00061
log(bili)     0.097771  9.73 0.00000
log(protime)  1.031908  2.80 0.00520
age           0.008489  4.18 0.00003
platelet      0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312
The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.
> plot(survfit(Surv(time, status)~edtrt,
               data=pbc, subset=trt==-9))
> lines(survexp(~edtrt+
                 ratetable(edtrt=edtrt, bili=bili,
                           platelet=platelet, age=age,
                           protime=protime),
               data=pbc, subset=trt==-9,
               ratetable=mayomodel, cohort=TRUE),
        col="purple")
The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.
Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.
Figure 2: Observed and predicted survival.
The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The
cox.zph function provides formal tests and graphical diagnostics for proportional hazards.
> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])
Figure 3: Testing proportional hazards (smoothed estimate of Beta(t) for edtrt against time).
There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.
Additional features
Computation for the Cox model extends straightforwardly to situations where z can vary over time, and when an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.
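For illustration, a minimal sketch of such a formula, using the recurrent-events bladder2 data shipped with survival (the particular model choice here is ours, not one endorsed by the article):

```r
library(survival)
data(bladder2)   ## recurrent bladder-cancer events; one row per at-risk interval
## strata(enum): separate baseline hazard per event number;
## cluster(id): robust variance acknowledging repeated records per patient
coxph(Surv(start, stop, event) ~ rx + strata(enum) + cluster(id),
      data = bladder2)
```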
Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
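A sketch of a parametric fit on the same veteran data (the Weibull distribution is just one illustrative choice):

```r
library(survival)
data(veteran)
## accelerated-failure-time model for log survival time
survreg(Surv(time, status) ~ trt, data = veteran, dist = "weibull")
```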
More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.
The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of follow-up.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
useR 2004
The R User Conference
by John Fox, McMaster University, Canada
More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis and David Meyer served as chairs of the conference, program committee and local organizing committee; Bettina Grün was in charge of local arrangements.
The conference program included nearly 100 presentations, many in keynote, plenary and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.
A particularly interesting component of the program were keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:
- Paul Murrell on grid graphics and programming
- Martin Mächler on good programming practice
- Luke Tierney on namespaces and byte compilation
- Kurt Hornik on packaging, documentation and testing
- Friedrich Leisch on S4 classes and methods
- Peter Dalgaard on language interfaces, using .Call and .External
- Douglas Bates on multilevel models in R
- Brian D. Ripley on data mining and related topics
Slides for the keynote addresses, along with abstracts of other papers presented at useR 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.
An innovation of useR 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.
The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19 and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.
Everyone with whom I spoke, both at and after useR 2004, was very enthusiastic about the conference. We look forward to an even larger useR in 2006.
R Help Desk
Date and Time Classes in R
Gabor Grothendieck and Thomas Petzoldt
Introduction
R-1.9.0 and its contributed packages have a number of date/time (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:
- Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).
- chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Date/times in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).
- POSIX classes. POSIX classes refer to the two classes POSIXct, POSIXlt and their common super class POSIXt. These support times and dates, including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases and other applications that involve external communication with systems that also support dates, times, time zones and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct date/times are represented as seconds since January 1, 1970, GMT, while POSIXlt date/times are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of
the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX date/time objects using unclass(x), where x is of any POSIX date/time class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
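The internal representations described above can be inspected directly with unclass(); a small sketch (the chron package must be installed, and the particular dates are chosen only for illustration):

```r
library(chron)
d  <- as.Date("1970-01-02")                       ## Date: days since 1970-01-01
dc <- chron(1.5)                                  ## chron: days + fraction of a day
dp <- as.POSIXct("1970-01-02 12:00:00", tz="GMT") ## POSIXct: seconds since the origin
unclass(d)   ## 1
unclass(dc)  ## 1.5 -- noon on January 2nd, 1970
unclass(dp)  ## 129600 = 1.5 * 24*60*60
```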
When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three date/time systems above.
Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent date/times as years plus fraction of a year. The date and pastecs packages are not discussed in this article.
Other Applications
Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent date/times as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron respectively. It's possible to set Excel to use either origin, which is why the word usually was employed above.
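For instance, a minimal sketch of the Windows-Excel case (the serial values here are invented):

```r
x <- c(1, 1.5, 32.25)               ## hypothetical Excel date-time serial numbers
as.Date("1899-12-30") + floor(x)    ## "1899-12-31" "1899-12-31" "1900-01-31"
```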
SPSS uses October 14, 1582 as the origin, thereby representing date/times as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such date/times automatically (Alzola and Harrell, 2002).
Time Series
A common use of dates and date/times are in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its) and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent date/times; zoo can use any date or time/date package to represent date/times.
Input
By default, read.table will read in numeric data such as 20040115 as numbers and will read non-numeric data such as 12/15/04 or 2004-12-15 as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹
## date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")

## first two cols in format mm/dd/yy hh:mm:ss
## Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)
Avoiding Errors
The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.
With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:
## default origin
options(chron.origin = c(month=1, day=1, year=1970))
## if TRUE, abbreviates year to 2 digits
options(chron.year.abb = TRUE)
## function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)
## if TRUE, then year not displayed
options(chron.simplify = FALSE)
For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig+x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,
¹ This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.
orig <- chron("01/01/60")
x <- 0:9   ## days since 01/01/60
## chron dates with default origin
orig + x
Regarding POSIX classes, the user should be aware of the following:
- time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:
dp <- seq(Sys.time(), len=10, by="day")
plot(dp, 1:10)
This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the date/times as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.
- OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with date/times. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.
- POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt date/times have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Date/times which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.
• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
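As a small illustration of the convert-to-character advice above (a sketch; the particular date is arbitrary and not from the article):

```r
## Going through character keeps the printed date intact, whereas a
## direct conversion interprets the underlying seconds in whatever
## time zone the conversion assumes.
dp <- as.POSIXct("1970-10-15 23:00:00", tz = "GMT")

d.direct <- as.Date(dp)          # time zone dependent in general
d.char   <- as.Date(format(dp))  # always matches the printed date
```

Comparing d.direct and d.char after changing the tz= argument is a quick way to spot a conversion that silently crosses midnight.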
Comparison Table
Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.
Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de
R News ISSN 1609-3631
Vol. 4/1, June 2004

Table 1: Comparison Table

now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class="Date")
  chron:   chron(0)
  POSIXct: structure(0, class=c("POSIXt","POSIXct"))

x days since origin, GMT
  Date:    structure(floor(x), class="Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class=c("POSIXt","POSIXct"))

diff in days
  Date:    dt1-dt2
  chron:   dc1-dc2
  POSIXct: difftime(dp1, dp2, unit="day") [1]

time zone difference
  POSIXct: dp-as.POSIXct(format(dp, tz="GMT"))

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt+1
  chron:   dc+1
  POSIXct: seq(dp0, length=2, by="DSTday")[2] [2]

previous day
  Date:    dt-1
  chron:   dc-1
  POSIXct: seq(dp0, length=2, by="-1 DSTday")[2] [3]

x days since date
  Date:    dt0+floor(x)
  chron:   dc0+x
  POSIXct: seq(dp0, length=2, by=paste(x, "DSTday"))[2] [4]

sequence
  Date:    dts <- seq(dt0, length=10, by="day")
  chron:   dcs <- seq(dc0, length=10)
  POSIXct: dps <- seq(dp0, length=10, by="DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz="GMT"), rnorm(10)) [5]

every 2nd week
  Date:    seq(dt0, length=3, by="2 week")
  chron:   dc0+seq(0, length=3, by=14)
  POSIXct: seq(dp0, length=3, by="2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc)-month.day.year(dc)$day+1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun=0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc)-3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output Sat Jul 1, 1970
  Date:    format(dt, "%a %b %d, %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d, %Y")
  POSIXct: format(dp, "%a %b %d, %Y")

Input "1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format="y-m-d")
  POSIXct: as.POSIXct(z)

Input "10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz="")
  to Date:    as.Date(dates(dc)); as.Date(format(dp))
  to chron:   chron(unclass(dt)); chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]
  to POSIXct: as.POSIXct(format(dt)); as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz="GMT")
  to Date:    as.Date(dates(dc)); as.Date(dp)
  to chron:   chron(unclass(dt)); as.chron(dp)
  to POSIXct: as.POSIXct(format(dt), tz="GMT"); as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables and variables beginning with dp are POSIXct variables; variables ending in 0 are scalars, others are vectors. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for more codes.

[1] Can be reduced to dp1-dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp+24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp-24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp+24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz="GMT")).
Programmers' Niche: A simple class in S3 and S4
by Thomas Lumley
ROC curves
Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.
The ROC curve plots the true positive rate (sensitivity), Pr(T > c|D = 1), against the false positive rate (1 - specificity), Pr(T > c|D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).
I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.
Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that plus/minus infinity and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.
drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type="l")
}
It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.
There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.
drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  plot(mspec, sens, type="l")
}
One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
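The agreement of the two implementations can be checked numerically; the following sketch (simulated data, not from the article) confirms that both produce the same set of sensitivity values:

```r
## T is a numeric test result, D a 0/1 disease indicator
set.seed(1)
T <- round(rnorm(100), 1)
D <- rbinom(100, 1, plogis(T))

## definition-based version
cutpoints <- c(-Inf, sort(unique(T)), Inf)
sens1 <- sapply(cutpoints, function(c) sum(D[T > c])/sum(D))

## tabulation-based version
DD <- table(-T, D)
sens2 <- unname(cumsum(DD[, 2])/sum(DD[, 2]))
```

The two vectors index the cutpoints differently (sens1 includes the infinite endpoints), but the sets of values coincide.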
Classes and methods
Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.
The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens=sens, mspec=mspec, test=TT,
               call=sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}
The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.
plot.ROC <- function(x, type="b", null.line=TRUE,
                     xlab="1-Specificity", ylab="Sensitivity",
                     main=NULL, ...) {
  par(pty="s")
  plot(x$mspec, x$sens, type=type,
       xlab=xlab, ylab=ylab, ...)
  if (null.line)
    abline(0, 1, lty=3)
  if (is.null(main))
    main <- x$call
  title(main=main)
}
A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:
lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}
A method for identify allows the cutpoint to be found for any point on the curve:
identify.ROC <- function(x, labels=NULL, ..., digits=1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels=labels, ...)
}
An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
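One possible solution to that exercise, sketched here as a trapezoid-rule sum over the stored points (the AUC name and the anchoring at (0, 0) are choices of this sketch, not code from the article):

```r
## Area under the ROC curve by the trapezoid rule, using the
## sens and mspec components stored by the ROC() constructor.
AUC <- function(x) {
  sens <- c(0, x$sens)    # anchor the curve at (0, 0)
  mspec <- c(0, x$mspec)
  n <- length(sens)
  sum(diff(mspec) * (sens[-1] + sens[-n])/2)
}
```

For a test that separates the groups perfectly this returns 1; for a useless test it is about 0.5.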
Generalising the class
The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.
ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens=sens, mspec=mspec,
               test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}
The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable:
ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type="response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens=sens, mspec=mspec,
               test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}
S4 classes and methods
In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).
As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.
The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:
setClass("ROC",
  representation(sens="numeric", mspec="numeric",
                 test="numeric", call="call"),
  validity=function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })
The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
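To see the validity check at work, one can deliberately violate it (a sketch; this version returns a message string on failure, which is a documented alternative to the bare logical used above):

```r
library(methods)

## The ROC class as in the article, with a message-string validity
setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    if (length(object@sens) == length(object@mspec) &&
        length(object@sens) == length(object@test)) TRUE
    else "sens, mspec and test must have equal lengths"
  })

ok  <- new("ROC", sens = c(0, 1), mspec = c(0, 1),
           test = c(2, 1), call = quote(ROC()))
bad <- try(new("ROC", sens = c(0, 1), mspec = 0,
               test = c(2, 1), call = quote(ROC())),
           silent = TRUE)   # new() signals a validity error
```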
In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  new("ROC", sens=sens, mspec=mspec,
      test=TT, call=sys.call())
}
Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:
setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })
The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.
The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.
setMethod("plot", c("ROC", "missing"),
  function(x, y, type="b", null.line=TRUE,
           xlab="1-Specificity",
           ylab="Sensitivity",
           main=NULL, ...) {
    par(pty="s")
    plot(x@mspec, x@sens, type=type,
         xlab=xlab, ylab=ylab, ...)
    if (null.line)
      abline(0, 1, lty=3)
    if (is.null(main))
      main <- x@call
    title(main=main)
  })
Creating a method for lines appears to work the same way:
setMethod("lines", "ROC",
  function(x, ...) {
    lines(x@mspec, x@sens, ...)
  })
In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function, and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.
setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T, ...) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type="response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    new("ROC", sens=sens, mspec=mspec,
        test=TT, call=sys.call())
  })
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.
setGeneric("identify", signature=c("x"))

setMethod("identify", "ROC",
  function(x, labels=NULL, ..., digits=1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels=labels, ...)
  })
Discussion
Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.
This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R

by the R Core Team
New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects, and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.
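For instance, the new mean() method for "difftime" objects (second item above) can be sketched as:

```r
## mean() and hence summary() now work on difftime objects
d <- as.Date(c("2004-06-01", "2004-06-11", "2004-06-21"))
gaps <- diff(d)     # time differences of 10 and 10 days
m <- mean(gaps)     # a difftime of 10 days
```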
User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

  Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

  Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

  Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics), or called as graphics::ps.options, or, better, set as a hook; see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.
New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace() parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos() and atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.
• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

  The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

  It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

  ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

  If both each and length.out have been specified, it now recycles rather than fills with NAs for S compatibility.

  If both times and length.out have been specified, times is now ignored for S compatibility. (Previously padding with NAs was used.)

  The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split="" is optimized.

• subset() now allows a drop argument, which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g., row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

  - Renamed push/popviewport() to push/popViewport().

  - Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

  - Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.

  - Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport bundles that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

  - Added vpPath() to allow specification of viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use.

    NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

  - Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

  - Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

      pushViewport(viewport(w=.5, h=.5, name="A"))
      grid.rect()
      pushViewport(viewport(w=.5, h=.5, name="B"))
      grid.rect(gp=gpar(col="grey"))
      upViewport(2)
      grid.rect(vp="A", gp=gpar(fill="red"))
      grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))
ndash Added enginedisplaylist() functionThis allows the user to tell grid NOT touse the graphics engine display list andto handle ALL redraws using its own dis-play list (including redraws after deviceresizes and copies)This provides a way to avoid some of theproblems with resizing a device when youhave used gridconvert() or the grid-Base package or even base functions suchas legend()There is a document discussing the useof display lists in grid on the grid website httpwwwstataucklandacnz~paulgridgridhtml
ndash Changed the implementation of grob ob-jects They are no longer implementedas external references They are nowregular R objects which copy-by-valueThis means that they can be savedloadedlike normal R objects In order to retainsome existing grob behaviour the follow-ing changes were necessary
grobs all now have a name slot Thegrob name is used to uniquely iden-tify a drawn grob (ie a grob on thedisplay list)
gridedit() and gridpack() nowtake a grob name as the firs argumentinstead of a grob (Actually they takea gPath see below)
the grobwidth and grobheightunits take either a grob OR a grobname (actually a gPath see below)Only in the latter case will the unitbe updated if the grob pointed to ismodified
In addition the following features arenow possible with grobs
grobs now save()load() like anynormal R object
many grid() functions now have aGrob() counterpart The grid()version is used for its side-effectof drawing something or modifyingsomething which has been drawn theGrob() version is used for its re-turn value which is a grob Thismakes it more convenient to just workwith grob objects without producingany graphical output (by using theGrob() functions)
there is a gTree object (derived fromgrob) which is a grob that canhave children A gTree also has achildrenvp slot which is a view-port which is pushed and then uped
before the children are drawn this al-lows the children of a gTree to placethemselves somewhere in the view-ports specified in the childrenvp byhaving a vpPath in their vp slot
there is a gPath object which is essen-tially a concatenation of grob namesThis is used to specify the child of (achild of ) a gTree
there is a new API for creat-ingaccessingmodifying grob ob-jects gridadd() gridremove()gridedit() gridget() (and theirGrob() counterparts can be used toadd remove edit or extract a grobor the child of a gTree NOTE thenew gridedit() API is incompati-ble with the previous version
ndash Added stringWidth() stringHeight()grobWidth() and grobHeight() conve-nience functions (they produce strwidthstrheight grobwidth and grobheightunit objects respectively)
ndash Allowed viewports to turn off clipping al-together Possible settings for viewportclip arg are now
on clip to the viewport (was TRUE)inherit clip to whatever parent says
(was FALSE)off turn off clipping
Still accept logical values (and NA mapsto off)
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL', which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests' field for packages that are used only in examples or vignettes.
R News ISSN 1609-3631
Vol 41 June 2004 41
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option repositories to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument, which is a function pointer that provides access to character strings (such as the names vector), rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp(), and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.
• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(*, deriv=2) and psigamma(*, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.
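Migration from the deprecated gamma-derivative functions is direct; a minimal sketch (the value 2.5 is an arbitrary example):

```r
x <- 2.5
## tetragamma(x) and pentagamma(x) are now written as:
psigamma(x, deriv = 2)
psigamma(x, deriv = 3)
```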
• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

  The included PCRE sources have been updated to version 4.5, and PCRE >= 4.0 is now required if --with-pcre is used.

  The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blasgoto= to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN

by Kurt Hornik
New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked, or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.
BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.
BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.
DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.
GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks, GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer and Korbinian Strimmer.
HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.
Icens Many functions for computing the NPMLE forcensored and truncated data By R Gentlemanand Alain Vandal
MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.
MNP A publicly available R package that fitsthe Bayesian Multinomial Probit models viaMarkov chain Monte Carlo Along with thestandard Multinomial Probit model it can alsofit models with different choice sets for each ob-servation and complete or partial ordering ofall the available alternatives The estimation isbased on the efficient marginal data augmen-tation algorithm that is developed by Imai andvan Dyk (2004) By Kosuke Imai Jordan VanceDavid A van Dyk
NADA Contains methods described by Dennis RHelsel in his book ldquoNondetects And Data Anal-ysis Statistics for Censored EnvironmentalDatardquo By Lopaka Lee
R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.
RScaLAPACK An interface to ScaLAPACK func-tions from R By Nagiza F Samatova SrikanthYoginath and David Bauer
RUnit R functions implementing a standard UnitTesting framework with additional code in-spection and report generation tools ByMatthias Burger Klaus Juenemann ThomasKoenig
SIN This package provides routines to perform SINmodel selection as described in Drton amp Perl-man (2004) The selected models are in theformat of the ggm package which allows inparticular parameter estimation in the selectedmodel By Mathias Drton
SoPhy SWMS_2D interface Submission to Comput-ers and Geosciences title The use of the lan-guage interface of R two examples for mod-elling water flux and solute transport By Mar-tin Schlather Bernd Huwe
SparseLogReg Some functions wrapping the sparselogistic regression code by S K Shevade and SS Keerthi originally intended for microarray-based gene selection By Michael T Maderwith C code from S K Shevade and S SKeerthi
assist ASSIST; see the manual. By Yuedong Wang and Chunlei Ke.
bayesmix Bayesian mixture models of univariateGaussian distributions using JAGS By BettinaGruen
betareg Beta regression for modeling rates and pro-portions By Alexandre de Bustamante Simas
circular Circular Statistics from ldquoTopics in circularStatisticsrdquo (2001) S Rao Jammalamadaka andA SenGupta World Scientific By Ulric LundClaudio Agostinelli
crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.
debug Debugger for R functions with code displaygraceful error recovery line-numbered condi-tional breakpoints access to exit code flowcontrol and full keyboard input By Mark VBravington
distr Object orientated implementation of distribu-tions and some additional functionality ByFlorian Camphausen Matthias Kohl PeterRuckdeschel Thomas Stabla
dynamicGraph Interactive graphical tool for ma-nipulating graphs By Jens Henrik Badsberg
energy E-statistics (energy) tests for comparing dis-tributions multivariate normality Poisson testmultivariate k-sample test for equal distribu-tions hierarchical clustering by e-distancesEnergy-statistics concept based on a general-ization of Newtonrsquos potential energy is due toGabor J Szekely By Maria L Rizzo and GaborJ Szekely
evdbayes Provides functions for the bayesian anal-ysis of extreme value models using MCMCmethods By Alec Stephenson
evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. S original (EVIS) by Alexander McNeil; R port by Alec Stephenson.
fBasics fBasics package from Rmetrics mdash Rmetricsis an Environment and Software Collection forteaching ldquoFinancial Engineering and Compu-tational Financerdquo By Diethelm Wuertz andmany others see the SOURCE file
fExtremes fExtremes package from Rmetrics ByDiethelm Wuertz and many others see theSOURCE file
fOptions fOptions package from Rmetrics ByDiethelm Wuertz and many others see theSOURCE file
fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.
fortunes R Fortunes By Achim Zeileis fortune con-tributions from Torsten Hothorn Peter Dal-gaard Uwe Ligges Kevin Wright
gap This is an integrated package for genetic dataanalysis of both population and family dataCurrently it contains functions for samplesize calculations of both population-based andfamily-based designs probability of familialdisease aggregation kinship calculation somestatistics in linkage analysis and associationanalysis involving one or more genetic mark-ers including haplotype analysis In the futureit will incorporate programs for path and seg-regation analyses as well as other statistics inlinkage and association analyses By Jing huaZhao in collaboration with other colleaguesand with help from Kurt Hornik and Brian Rip-ley of the R core development team
gclus Orders panels in scatterplot matrices and par-allel coordinate displays by some merit indexPackage contains various indices of merit or-dering functions and enhanced versions ofpairs and parcoord which color panels accord-ing to their merit level By Catherine Hurley
gpls Classification using generalized partial leastsquares for two-group and multi-group (morethan 2 group) classification By Beiying Ding
hapassoc The following R functions are used forlikelihood inference of trait associations withhaplotypes and other covariates in generalizedlinear models The functions accommodate un-certain haplotype phase and can handle miss-ing genotypes at some SNPs By K Burkett BMcNeney J Graham
haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.
hett Functions for the fitting and summarizing ofheteroscedastic t-regression By Julian Taylor
httpRequest HTTP Request protocols Implementsthe GET POST and multipart POST request ByEryk Witold Wolski
impute Imputation for microarray data (currentlyKNN only) By Trevor Hastie Robert Tib-shirani Balasubramanian Narasimhan GilbertChu
kernlab Kernel-based machine learning methods in-cluding support vector machines By Alexan-dros Karatzoglou Alex Smola Achim ZeileisKurt Hornik
klaR Miscellaneous functions for classification andvisualization developed at the Department ofStatistics University of Dortmund By Chris-tian Roever Nils Raabe Karsten Luebke UweLigges
knncat This program scales categorical variables insuch a way as to make NN classification as ac-curate as possible It also handles continuousvariables and prior probabilities and does in-telligent variable selection and estimation of er-ror rates and the right number of NNrsquos By SamButtrey
kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.
mAr Estimation of multivariate AR models througha computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider2001) the procedure is of particular interest forhigh-dimensional data without missing valuessuch as geophysical fields By S M Barbosa
mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.
mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-chain versus within-chain approach of Gelman and Rubin. By Gregory R. Warnes.
mixreg Fits mixtures of one-variable regressions(which has been described as doing ANCOVAwhen you donrsquot know the levels) By RolfTurner
mscalib Calibration and filtering methods for cal-ibration of mass spectrometric peptide masslists Includes methods for internal externalcalibration of mass lists Provides methods forfiltering chemical noise and peptide contami-nants By Eryk Witold Wolski
multinomRob overdispersed multinomial regres-sion using robust (LQD and tanh) estimationBy Walter R Mebane Jr Jasjeet Singh Sekhon
mvbutils Utilities for project organization editingand backup sourcing documentation (formaland informal) package preparation macrofunctions and miscellaneous utilities Neededby debug package By Mark V Bravington
mvpart Multivariate regression trees Original rpartby Terry M Therneau and Beth Atkinson Rport by Brian Ripley Some routines from ve-gan by Jari Oksanen Extensions and adapta-tions of rpart to mvpart by Glenn Dersquoath
pgam This work is aimed at extending a class ofstate space models for Poisson count dataso called Poisson-Gamma models towards asemiparametric specification Just like the gen-eralized additive models (GAM) cubic splinesare used for covariate smoothing The semi-parametric models are fitted by an iterativeprocess that combines maximization of likeli-hood and backfitting algorithm By Washing-ton Junger
qcc Shewhart quality control charts for continu-ous attribute and count data Cusum andEWMA charts Operating characteristic curvesProcess capability analysis Pareto chart andcause-and-effect chart By Luca Scrucca
race Implementation of some racing methods for theempirical selection of the best If the R pack-age rpvm is installed (and if PVM is availableproperly configured and initialized) the eval-uation of the candidates are performed in par-allel on different hosts By Mauro Birattari
rbugs Functions to prepare files needed for running BUGS in batch-mode, and for running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).
ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data.frames. By Jens Oehlschlägel.
regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Provides an easy, straightforward way to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.
rgl 3D visualization device (OpenGL) By DanielAdler
rlecuyer Provides an interface to the C implemen-tation of the random number generator withmultiple independent streams developed byLrsquoEcuyer et al (2002) The main purpose ofthis package is to enable the use of this ran-dom number generator in parallel R applica-tions By Hana Sevcikova Tony Rossini
rpartpermutation Performs permutation tests ofrpart models By Daniel S Myers
rrcov Functions for Robust Location and Scat-ter Estimation and Robust Regression withHigh Breakdown Point (covMcd() ltsReg())Originally written for S-PLUS (fastlts andfastmcd) by Peter Rousseeuw amp Katrien vanDriessen ported to R adapted and packagedby Valentin Todorov
sandwich Model-robust standard error estimatorsfor time series and longitudinal data ByThomas Lumley Achim Zeileis
seqmon A program that computes the probabilityof crossing sequential boundaries in a clinicaltrial It implements the Armitage-McPhersonand Rowe Algorithm using the method de-scribed in Schoenfeld D (2001) ldquo A simple Al-gorithm for Designing Group Sequential Clini-cal Trialsrdquo Biometrics 27 972ndash974 By David ASchoenfeld
setRNG Set reproducible random number generatorin R and S By Paul Gilbert
sfsmisc Useful utilities [lsquogoodiesrsquo] from Seminarfuer Statistik ETH Zurich many ported fromS-plus times By Martin Maechler and manyothers
skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.
spatialCovariance Functions that compute the spa-tial covariance matrix for the matern andpower classes of spatial models for data thatarise on rectangular units This code can alsobe used for the change of support problem andfor spatial data that arise on irregularly shapedregions like counties or zipcodes by laying afine grid of rectangles and aggregating the in-tegrals in a form of Riemann integration ByDavid Clifford
spc Evaluation of control charts by means of thezero-state steady-state ARL (Average RunLength) Setting up control charts for given in-control ARL and plotting of the related figuresThe control charts under consideration are one-and two-sided EWMA and CUSUM charts formonitoring the mean of normally distributedindependent data Other charts and parame-ters are in preparation Further SPC areas willbe covered as well (sampling plans capabilityindices ) By Sven Knoth
spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.
supclust Methodology for Supervised Grouping ofPredictor Variables By Marcel Dettling andMartin Maechler
svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.
treeglia Stem analysis functions for volume incre-ment and carbon uptake assessment from tree-rings By Marco Bascietto
urca Unit root and cointegration tests encounteredin applied econometric analysis are imple-mented By Bernhard Pfaff
urn Functions for sampling without replacement(Simulated Urns) By Micah Altman
verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns; ported to R by Nick Efthymiou.
zoo A class with methods for totally ordered in-dexed observations such as irregular time se-ries By Achim Zeileis Gabor Grothendieck
Other changes
• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News

by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)
New supporting institutions
• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA
Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges
Email of editors and editorial board:
firstname.lastname@R-project.org
R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions go to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).
R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
- Editorial
- The Decision To Use R
  - Preface
  - Some Background on the Company
  - Evaluating The Tools of the Trade - A Value Based Equation
  - An Introduction To R
  - A Continued Evolution
  - Giving Back
  - Thanks
- The ade4 package - I: One-table methods
  - Introduction
  - The duality diagram class
  - Distance matrices
  - Taking into account groups of individuals
  - Permutation tests
  - Conclusion
- qcc: An R package for quality control charting and statistical process control
  - Introduction
  - Creating a qcc object
  - Shewhart control charts
  - Operating characteristic function
  - Cusum charts
  - EWMA charts
  - Process capability analysis
  - Pareto charts and cause-and-effect diagrams
  - Tuning and extending the package
  - Summary
  - References
- Least Squares Calculations in R
- Tools for interactively exploring R packages
  - References
- The survival package
  - Overview
  - Specifying survival data
  - One and two sample summaries
  - Proportional hazards models
  - Additional features
- useR 2004
- R Help Desk
  - Introduction
  - Other Applications
  - Time Series
  - Input
  - Avoiding Errors
  - Comparison Table
- Programmer's Niche
  - ROC curves
  - Classes and methods
    - Generalising the class
  - S4 classes and methods
  - Discussion
- Changes in R
  - New features in 1.9.1
  - User-visible changes in 1.9.0
  - New features in 1.9.0
- Changes on CRAN
  - New contributed packages
  - Other changes
- R Foundation News
  - Donations
  - New supporting institutions
  - New supporting members
Figure 3: A snapshot of the widget when vExplorer is called by typing vExplorer().

Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.
The survival package

by Thomas Lumley

This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).
Overview
Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).
Specifying survival data
In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a Surv object by the Surv() function. If the start variable is zero for everyone it may be omitted.

In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time and 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.
> library(survival)
> data(veteran)  # time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+

# time from diagnosis to death
> with(veteran, Surv(diagtime*30,
                     diagtime*30+time, status))
 [1] ( 210, 282 ] ( 150, 561 ]
 [3] (  90, 318 ] ( 270, 396 ]
 [9] ( 540, 854 ] ( 180, 280+]
The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.
One and two sample summaries
The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a survfit object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.
> data(veteran)
> plot(survfit(Surv(time, status) ~ trt, data = veteran),
       xlab = "Years since randomisation",
       xscale = 365, ylab = "% surviving", yscale = 100,
       col = c("forestgreen", "blue"))
> survdiff(Surv(time, status) ~ trt, data = veteran)

Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

       N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394
      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928
Figure 1: Survival distributions for two lung cancer treatments (Standard and New); percent surviving versus years since randomisation.
Proportional hazards models
The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.

Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

    log h(t; z) = log h0(t) + βz,

equivalently, h(t; z) = h0(t) exp(βz).
Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.

A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.
> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
                     log(bili) + log(protime) +
                     age + platelet,
                     data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
      log(bili) + log(protime) + age + platelet,
      data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999

              se(coef)     z       p
edtrt         0.300321  3.43 0.00061
log(bili)     0.097771  9.73 0.00000
log(protime)  1.031908  2.80 0.00520
age           0.008489  4.18 0.00003
platelet      0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312
The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.
> plot(survfit(Surv(time, status) ~ edtrt,
               data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt +
         ratetable(edtrt = edtrt, bili = bili,
                   platelet = platelet, age = age,
                   protime = protime),
         data = pbc, subset = trt == -9,
         ratetable = mayomodel, cohort = TRUE),
        col = "purple")
The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.
Figure 2: Observed and predicted survival.
The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e., that β is constant over time. The cox.zph function provides formal tests and graphical diagnostics for proportional hazards.
> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

# graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])
Figure 3: Testing proportional hazards; Beta(t) for edtrt plotted against time.
There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.
Additional features
Computation for the Cox model extends straightforwardly to situations where z can vary over time and where an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.
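As an illustrative sketch of these terms (the dataset and variable names here are hypothetical, not from the article), a fit with (start, stop] records, a stratum term, and clustering on subject might look like:

```r
library(survival)
## Hypothetical data: one record per at-risk interval, possibly
## several records per subject (id); each sex stratum gets its own
## unspecified baseline hazard, and cluster(id) gives robust
## standard errors that allow for within-subject correlation.
fit <- coxph(Surv(start, stop, event) ~ treatment +
             strata(sex) + cluster(id),
             data = mydata)
```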
Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.
The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of followup.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
useR! 2004
The R User Conference
by John Fox, McMaster University, Canada
More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.
The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.
A particularly interesting component of the program was the set of keynote addresses by members of the R Core Team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:
• Paul Murrell on grid graphics and programming;
• Martin Mächler on good programming practice;
• Luke Tierney on namespaces and byte compilation;
• Kurt Hornik on packaging, documentation, and testing;
• Friedrich Leisch on S4 classes and methods;
• Peter Dalgaard on language interfaces, using .Call and .External;
• Douglas Bates on multilevel models in R;
• Brian D. Ripley on data mining and related topics.
Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.
An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.
The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19, and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.
Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.

John Fox
McMaster University, Canada
jfox@mcmaster.ca
R Help Desk
Date and Time Classes in R
Gabor Grothendieck and Thomas Petzoldt
Introduction
R 1.9.0 and its contributed packages have a number of date/time (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:
• Date. This is the newest R date class, just introduced into R 1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).
• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).
• POSIX classes. POSIX classes refer to the two classes POSIXct, POSIXlt and their common super class POSIXt. These support times and dates including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases, and other applications that involve external communication with systems that also support dates, times, time zones, and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970, GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of
R News ISSN 1609-3631
Vol 41 June 2004 30
the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIX classes can be found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
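These internal representations are easy to verify with unclass() or as.numeric(); the following sketch uses base R for Date and POSIXct, and assumes the contributed chron package is installed for the last two lines:

```r
unclass(as.Date("1970-01-02"))            # 1: days since January 1, 1970

dp <- as.POSIXct("1970-01-02", tz = "GMT")
unclass(dp)                               # 86400: seconds since the origin, GMT

library(chron)                            # contributed package, must be installed
as.numeric(chron("01/02/70", "12:00:00")) # 1.5: noon on January 2nd, 1970
```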
When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three date/time systems above.
Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.
Other Applications
Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word "usually" was employed above.
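For instance, assuming the usual Windows origin, a couple of Excel serial numbers convert as follows (the serial values here are made up for illustration):

```r
x <- c(1, 36526.25)                  # Excel serial day numbers, Windows origin
as.Date("1899-12-30") + floor(x)     # "1899-12-31" "2000-01-01"
```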
SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).
Time Series
A common use of dates and datetimes is in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its), and zoo (in package zoo). ts uses a scheme of its own for dates. irts and its use POSIXct dates to represent datetimes. zoo can use any date or timedate package to represent datetimes.
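As a small base-R illustration, a regular monthly series needs no explicit dates; ts() stores just the start and the frequency:

```r
## Monthly series starting January 2004
z <- ts(rnorm(12), start = c(2004, 1), frequency = 12)
time(z)[1:3]   # 2004.000 2004.083 2004.167: years plus fraction of a year
```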
Input
By default, read.table will read in numeric data, such as 20040115, as numbers and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table (see note 1 below).
# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)
Avoiding Errors
The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.
With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:
# default origin
options(chron.origin = c(month=1, day=1, year=1970))

# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)

# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)

# if TRUE then year not displayed
options(chron.simplify = FALSE)
For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig+x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,
Note 1: This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.
orig <- chron("01/01/60")
x <- 0:9    # days since 01/01/60

# chron dates with default origin
orig + x
Regarding POSIX classes, the user should be aware of the following:
• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year, and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:
dp <- seq(Sys.time(), len=10, by="day")
plot(dp, 1:10)
This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.
• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general, caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.
• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.
• conversion. Conversion between different date/time classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
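A minimal sketch of the character-first recipe from the last two points (base R only; the exact printed result depends on your local time zone, so none is shown):

```r
dp <- as.POSIXct("2004-06-15 12:00:00", tz = "GMT")

## Direct coercion keeps the same instant but may shift the printed clock
## time; going through character keeps the printed time and re-interprets
## it in the current time zone.
as.POSIXlt(format(dp))
```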
Comparison Table
Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes, and to translate commands among the classes.
Bibliography
Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2):8-11. URL http://CRAN.R-project.org/doc/Rnews/.
Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de
For each task, the Date, chron and POSIXct idioms are listed in turn.

now
    Date:    Sys.Date()
    chron:   as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
    POSIXct: Sys.time()

origin
    Date:    structure(0, class="Date")
    chron:   chron(0)
    POSIXct: structure(0, class=c("POSIXt","POSIXct"))

x days since origin (GMT)
    Date:    structure(floor(x), class="Date")
    chron:   chron(x)
    POSIXct: structure(x*24*60*60, class=c("POSIXt","POSIXct"))

diff in days
    Date:    dt1-dt2
    chron:   dc1-dc2
    POSIXct: difftime(dp1, dp2, unit="day")  [1]

time zone difference
    POSIXct: dp-as.POSIXct(format(dp, tz="GMT"))

compare
    Date:    dt1 > dt2
    chron:   dc1 > dc2
    POSIXct: dp1 > dp2

next day
    Date:    dt+1
    chron:   dc+1
    POSIXct: seq(dp0, length=2, by="DSTday")[2]  [2]

previous day
    Date:    dt-1
    chron:   dc-1
    POSIXct: seq(dp0, length=2, by="-1 DSTday")[2]  [3]

x days since date
    Date:    dt0+floor(x)
    chron:   dc0+x
    POSIXct: seq(dp0, length=2, by=paste(x, "DSTday"))[2]  [4]

sequence
    Date:    dts <- seq(dt0, length=10, by="day")
    chron:   dcs <- seq(dc0, length=10)
    POSIXct: dps <- seq(dp0, length=10, by="DSTday")

plot
    Date:    plot(dts, rnorm(10))
    chron:   plot(dcs, rnorm(10))
    POSIXct: plot(as.POSIXct(format(dps), tz="GMT"), rnorm(10))  [5]

every 2nd week
    Date:    seq(dt0, length=3, by="2 week")
    chron:   dc0+seq(0, length=3, by=14)
    POSIXct: seq(dp0, length=3, by="2 week")

first day of month
    Date:    as.Date(format(dt, "%Y-%m-01"))
    chron:   chron(dc)-month.day.year(dc)$day+1
    POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
    Date:    as.numeric(format(dt, "%m"))
    chron:   months(dc)
    POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun=0)
    Date:    as.numeric(format(dt, "%w"))
    chron:   as.numeric(dates(dc)-3)%%7
    POSIXct: as.numeric(format(dp, "%w"))

day of year
    Date:    as.numeric(format(dt, "%j"))
    chron:   as.numeric(format(as.Date(dc), "%j"))
    POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
    Date:    tapply(x, format(dt, "%Y-%m"), mean)
    chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
    POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
    Date:    tapply(x, format(dt, "%m"), mean)
    chron:   tapply(x, months(dc), mean)
    POSIXct: tapply(x, format(dp, "%m"), mean)

Output

yyyy-mm-dd
    Date:    format(dt)
    chron:   format(as.Date(dates(dc)))
    POSIXct: format(dp, "%Y-%m-%d")

mm/dd/yy
    Date:    format(dt, "%m/%d/%y")
    chron:   format(dates(dc))
    POSIXct: format(dp, "%m/%d/%y")

Sat Jul 1, 1970
    Date:    format(dt, "%a %b %d %Y")
    chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
    POSIXct: format(dp, "%a %b %d %Y")

Input

"1970-10-15"
    Date:    as.Date(z)
    chron:   chron(z, format="y-m-d")
    POSIXct: as.POSIXct(z)

"10/15/1970"
    Date:    as.Date(z, "%m/%d/%Y")
    chron:   chron(z)
    POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz="")

to Date
    chron:   as.Date(dates(dc))
    POSIXct: as.Date(format(dp))

to chron
    Date:    chron(unclass(dt))
    POSIXct: chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S"))  [6]

to POSIXct
    Date:    as.POSIXct(format(dt))
    chron:   as.POSIXct(paste(as.Date(dates(dc)), times(dc)%%1))

Conversion (tz="GMT")

to Date
    chron:   as.Date(dates(dc))
    POSIXct: as.Date(dp)

to chron
    Date:    chron(unclass(dt))
    POSIXct: as.chron(dp)

to POSIXct
    Date:    as.POSIXct(format(dt), tz="GMT")
    chron:   as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See strptime for % codes.

[1] Can be reduced to dp1-dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp+24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp-24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp+24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz="GMT")).

Table 1: Comparison Table
Programmers' Niche
A simple class in S3 and S4
by Thomas Lumley
ROC curves
Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.
The ROC curve plots the true positive rate (sensitivity), Pr(T > c|D = 1), against the false positive rate (1 - specificity), Pr(T > c|D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).
I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.
Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±∞ and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.
drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type="l")
}
It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.
There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.
drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  plot(mspec, sens, type="l")
}
One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
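A quick check of drawROC on toy data (the numbers are arbitrary, chosen only to produce a small step curve):

```r
T <- c(0.1, 0.35, 0.4, 0.8)   # test results
D <- c(0,   1,    0,   1)     # disease indicator
drawROC(T, D)                 # draws a step-shaped ROC curve
```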
Classes and methods
Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.
The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec, test=TT,
               call=sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}
The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.
plot.ROC <- function(x, type="b", null.line=TRUE,
                     xlab="1-Specificity", ylab="Sensitivity",
                     main=NULL, ...) {
  par(pty="s")
  plot(x$mspec, x$sens, type=type,
       xlab=xlab, ylab=ylab, ...)
  if(null.line)
    abline(0, 1, lty=3)
  if(is.null(main))
    main <- x$call
  title(main=main)
}
A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:
lines.ROC <- function(x, ...)
  lines(x$mspec, x$sens, ...)
A method for identify allows the cutpoint to be found for any point on the curve:
identify.ROC <- function(x, labels=NULL, ..., digits=1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels=labels, ...)
}
An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
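One possible sketch of such an AUC function, using the trapezoidal rule; this is my own implementation rather than the author's, and it assumes the sens and mspec components created by the ROC constructor above (both in increasing order along the curve):

```r
AUC <- function(x) {
  ## pad with (0, 0) so the curve starts at the origin, then apply
  ## the trapezoidal rule to the (1-specificity, sensitivity) points
  mspec <- c(0, x$mspec)
  sens  <- c(0, x$sens)
  sum(diff(mspec) * (head(sens, -1) + tail(sens, -1))/2)
}
```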
Generalising the class
The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.
ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
               test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}
The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.
ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type="response"))
  disease <- disease*weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
               test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}
S4 classes and methods
In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).
As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.
The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:
setClass("ROC",
  representation(sens="numeric", mspec="numeric",
                 test="numeric", call="call"),
  validity=function(object) {
    length(object@sens) == length(object@mspec) &&
    length(object@sens) == length(object@test)
  })
The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new
function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  new("ROC", sens=sens, mspec=mspec,
      test=TT, call=sys.call())
}
Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:
setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })
The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.
The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.
setMethod("plot", c("ROC", "missing"),
  function(x, y, type="b", null.line=TRUE,
           xlab="1-Specificity",
           ylab="Sensitivity",
           main=NULL, ...) {
    par(pty="s")
    plot(x@mspec, x@sens, type=type,
         xlab=xlab, ylab=ylab, ...)
    if(null.line)
      abline(0, 1, lty=3)
    if(is.null(main))
      main <- x@call
    title(main=main)
  })
Creating a method for lines appears to work the same way:
setMethod("lines", "ROC",
  function(x, ...)
    lines(x@mspec, x@sens, ...))
In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before
> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>
and after the setMethod call
> lines
standardGeneric for "lines" defined from
package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.
Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.
setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type="response"))
    disease <- disease*weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    new("ROC", sens=sens, mspec=mspec,
        test=TT, call=sys.call())
  })
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.
setGeneric("identify", signature=c("x"))

setMethod("identify", "ROC",
  function(x, labels=NULL, ..., digits=1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels=labels, ...)
  })
Discussion
Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.
This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R
by the R Core Team
New features in 1.9.1
• as.Date() now has a method for "POSIXlt" objects.
• mean() has a method for "difftime" objects, and so summary() works for such objects.
• legend() has a new argument pt.cex.
• plot.ts() has more arguments, particularly yax.flip.
• heatmap() has a new keep.dendro argument.
• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)
• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.
User-visible changes in 1.9.0
• Underscore _ is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.
• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

  Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has
been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

  Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

  Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook (see ?setHook).
• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.
• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.
New features in 1.9.0
bull $ $lt- [[ [[lt- can be applied to environ-ments Only character arguments are allowedand no partial matching is done The semanticsare basically that of getassign to the environ-ment with inherits=FALSE
• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values; thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments, for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), paralleling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.
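A minimal sketch of computing contour data without touching a graphics device (the grid and level below are arbitrary choices for the example):

```r
# Evaluate z = x^2 + y^2 on a 25 x 25 grid over [0, 1] x [0, 1] (the default x/y)
z <- outer(seq(0, 1, length.out = 25), seq(0, 1, length.out = 25),
           function(x, y) x^2 + y^2)

# Compute the contour at level 0.5; nothing is drawn
cl <- contourLines(z = z, levels = 0.5)
cl[[1]]$level    # 0.5
cl[[1]]$x        # x coordinates of the polyline, ready for further computation
```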
• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name, and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.
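A brief sketch of the new class (the particular date is just an example):

```r
d <- as.Date("2004-06-01")

d + 30                                # day-based arithmetic: "2004-07-01"
format(d, "%Y-%m-%d")                 # formatting as for date-times
seq(d, by = "month", length.out = 3)  # "2004-06-01" "2004-07-01" "2004-08-01"
```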
• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package, which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).
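For example:

```r
factorial(5)        # 120, computed as gamma(5 + 1)
lfactorial(100)     # log(100!) without overflow, via lgamma(101)
exp(lfactorial(5))  # 120 again (up to floating point)
```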
• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame, in the same way as subset and weights.
• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)
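For instance:

```r
head(letters, 3)   # "a" "b" "c"
tail(letters, 2)   # "y" "z"
head(mtcars)       # first 6 rows of the data frame
tail(mtcars, 1)    # last row, keeping the original row name
```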
• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.
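A quick sketch (the variable names are made up for the example):

```r
e <- new.env()
assign("a", 1, envir = e)
assign("b", 2, envir = e)

# One call instead of a get() loop; returns a named list
mget(c("a", "b"), envir = e)   # list(a = 1, b = 2)
```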
• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for S compatibility.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.
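A small sketch (the file name is hypothetical; requesting type = "sh" explicitly makes the output independent of the platform default):

```r
f <- "report June 2004.txt"

# POSIX shell quoting with single quotes
shQuote(f, type = "sh")                     # "'report June 2004.txt'"

# Safe to paste into a command line
cmd <- paste("wc -l", shQuote(f, type = "sh"))
```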
• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split = "" is optimized.
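For example:

```r
# fixed = TRUE treats the split string literally, not as a regular expression
strsplit("a.b.c", ".", fixed = TRUE)[[1]]   # "a" "b" "c"

# Without fixed, "." is a regexp matching any character; escape it instead
strsplit("a.b.c", "\\.")[[1]]               # same result via an escaped dot

# split = "" (split into single characters) is the optimized case
strsplit("abc", "")[[1]]                    # "a" "b" "c"
```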
• subset() now allows a drop argument, which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).
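A one-line illustration (the environment contents are made up for the example; note that element order in the result is not guaranteed):

```r
e <- new.env()
e$x <- 1
e$y <- "two"

l <- as.list(e)      # named list with components x and y
l[order(names(l))]   # list(x = 1, y = "two")
```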
• New function addmargins() in the stats package to add marginal summaries to tables, e.g., row and column totals. (Based on a contribution by Bendix Carstensen.)
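A small sketch with the default summary (sums); the table contents are made up for the example:

```r
tab <- as.table(matrix(1:4, 2, dimnames = list(row = c("r1", "r2"),
                                               col = c("c1", "c2"))))

m <- addmargins(tab)   # adds a "Sum" row and a "Sum" column
m["Sum", "Sum"]        # 10, the grand total
```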
• "dendrogram" edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport(), and seekViewport() to allow creation and navigation of a viewport tree (rather than just a viewport stack).

– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.

– Added vpList(), vpStack(), vpTree(), and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series).

current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of a viewport path in downViewport() and seekViewport().

See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

– Added just argument to grid.layout() to allow justification of the layout relative to the parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).
– Allowed the vp slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

    pushViewport(viewport(w=.5, h=.5, name="A"))
    grid.rect()
    pushViewport(viewport(w=.5, h=.5, name="B"))
    grid.rect(gp=gpar(col="grey"))
    upViewport(2)
    grid.rect(vp="A", gp=gpar(fill="red"))
    grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))
– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

* the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob "pointed to" is modified.

In addition, the following features are now possible with grobs:

* grobs now save()/load() like any normal R object.

* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot, which is a viewport that is pushed, and then "up"ed, before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

– Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth", and "grobheight" unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for the viewport clip arg are now:

"on": clip to the viewport (was TRUE)
"inherit": clip to whatever parent says (was FALSE)
"off": turn off clipping

Logical values are still accepted (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL', which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories, as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument, which is a function pointer that provides access to character strings (such as the names vector), rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions', and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp(), and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(x, deriv=2) and psigamma(x, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5, and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blas="goto" to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN

by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination
follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit model via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check
given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. By S original (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST requests. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between versus within chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal and external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and the backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.
rlecuyer Provides an interface to the C implemen-tation of the random number generator withmultiple independent streams developed byLrsquoEcuyer et al (2002) The main purpose ofthis package is to enable the use of this ran-dom number generator in parallel R applica-tions By Hana Sevcikova Tony Rossini
rpartpermutation Performs permutation tests ofrpart models By Daniel S Myers
rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen, ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972-974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem and for spatial data that arise on irregularly shaped regions like counties or zipcodes by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS 2002, 99, pg 15869 and J. Comput. Chem. 2003, 24, pg 1215. By Rajarshi Guha.

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns, ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.
Other changes
• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R News ISSN 1609-3631
Vol 41 June 2004 46
R Foundation News

by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)
New supporting institutions
• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232
USA

Editorial Board: Douglas Bates and Paul Murrell.

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage: http://www.R-project.org/

This newsletter is available online at http://CRAN.R-project.org/doc/Rnews/
Figure 4: A snapshot of the widget when Biobase.Rnw of the Biobase package was clicked.
The survival package

by Thomas Lumley
This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).
Overview
Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).
Specifying survival data
In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.

In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.
> library(survival)
> data(veteran)  ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+

## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
                     diagtime*30 + time, status))
 [1] ( 210, 282 ] ( 150, 561 ]
 [3] (  90, 318 ] ( 270, 396 ]
 [9] ( 540, 854 ] ( 180, 280+]
The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.
One and two sample summaries
The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.
> data(veteran)
> plot(survfit(Surv(time, status) ~ trt, data = veteran),
       xlab = "Years since randomisation", xscale = 365,
       ylab = "% surviving", yscale = 100,
       col = c("forestgreen", "blue"))
> survdiff(Surv(time, status) ~ trt, data = veteran)
Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

       N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394
      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928
Figure 1: Survival distributions for two lung cancer treatments (Standard and New; % surviving against years since randomisation).
Proportional hazards models
The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.

Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

    h(t; z) = h0(t) e^{βz},

that is, log h(t; z) = log h0(t) + βz.

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.

A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.
> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
                     log(bili) + log(protime) +
                     age + platelet,
                     data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
      log(bili) + log(protime) + age + platelet,
      data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999
              se(coef)     z       p
edtrt        0.300321  3.43 0.00061
log(bili)    0.097771  9.73 0.00000
log(protime) 1.031908  2.80 0.00520
age          0.008489  4.18 0.00003
platelet     0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312
The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.
> plot(survfit(Surv(time, status) ~ edtrt,
       data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt +
         ratetable(edtrt = edtrt, bili = bili,
                   platelet = platelet, age = age,
                   protime = protime),
         data = pbc, subset = trt == -9,
         ratetable = mayomodel, cohort = TRUE),
       col = "purple")
The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.
Figure 2: Observed and predicted survival.
The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The cox.zph function provides formal tests and graphical diagnostics for proportional hazards.

> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796
## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])
Figure 3: Testing proportional hazards (Beta(t) for edtrt against time).
There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.
Additional features

Computation for the Cox model extends straightforwardly to situations where z can vary over time, and when an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.

Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
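As a sketch of what such a parametric fit looks like, here is a Weibull accelerated failure time model on the veteran data used earlier (the covariates are chosen only for illustration; the Weibull is survreg's default distribution):

```r
library(survival)
data(veteran)

## accelerated failure time model: log survival time is linear in the
## covariates, with Weibull-distributed survival times
wmodel <- survreg(Surv(time, status) ~ trt + karno,
                  data = veteran, dist = "weibull")
summary(wmodel)
```

Unlike coxph, survreg estimates the whole baseline distribution, so predict() can return predicted survival times directly.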
More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of follow-up.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
useR! 2004
The R User Conference

by John Fox
McMaster University, Canada

More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference — useR! 2004 — which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference program committee and local organizing committee; Bettina Grün was in charge of local arrangements.
The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations — from bioinformatics to user interfaces, and finance to fights among lizards — reflects the wide range of applications supported by R.

A particularly interesting component of the program were the keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference — the close relationship in R between use and development. The keynote addresses included:

• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice
• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation, and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on datamining and related topics
Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.
An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.
The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19 and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.
R Help Desk
Date and Time Classes in R

by Gabor Grothendieck and Thomas Petzoldt
Introduction
R-1.9.0 and its contributed packages have a number of datetime (i.e. date or date/time) classes. In particular, the following three classes are discussed in this article:
• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).

• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).

• POSIX classes. POSIX classes refer to the two classes POSIXct, POSIXlt and their common super class POSIXt. These support times and dates including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases and other applications that involve external communication with systems that also support dates, times, time zones and daylight/standard times. They are also useful for certain applications such as world currency markets where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970, GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of
the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
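These internal representations can be checked directly with unclass(); a small sketch (the dates are chosen arbitrarily):

```r
## Date: days since January 1, 1970
unclass(as.Date("1970-01-02"))                       # 1

## POSIXct: seconds since January 1, 1970 GMT
unclass(as.POSIXct("1970-01-01 01:00", tz = "GMT"))  # 3600

## POSIXlt: a list of components (sec, min, hour, mday, mon, year, ...)
names(unclass(as.POSIXlt("1970-01-01 01:00", tz = "GMT")))
```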
When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.

A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.

Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.
Other Applications
Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word usually was employed above.
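The Windows-origin rule can be checked on a made-up serial number (37987.5 here is a hypothetical spreadsheet value, i.e. noon on the 37987th day after December 30, 1899):

```r
x <- 37987.5   # hypothetical spreadsheet value: days since 1899-12-30

## Date class: discard the fractional time-of-day part
as.Date("1899-12-30") + floor(x)   # "2004-01-01"

## the analogous chron conversion, chron("12/30/1899") + x,
## keeps the 12:00 time of day in the fractional part
```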
SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).
Time Series
A common use of dates and datetimes is in time series. The ts class represents regular time series (i.e. equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its) and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or datetime package to represent datetimes.
Input
By default, read.table will read in numeric data, such as 20040115, as numbers and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹

## date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")

## first two cols in format mm/dd/yy hh:mm:ss
## Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)
Avoiding Errors
The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:

## default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))

## if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)

## function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)

## if TRUE then year not displayed
options(chron.simplify = FALSE)

For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,
¹ This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.
orig <- chron("01/01/60")
x <- 0:9   ## days since 01/01/60

## chron dates with default origin
orig + x
Regarding POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.

• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.

• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.

• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
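The character round trip recommended in the last two points can be sketched as follows (the date is arbitrary):

```r
dp <- as.POSIXct("2004-06-15 12:00:00")   # current local time zone

## risky: a direct class conversion may guess isdst and shift by an hour
direct <- as.POSIXlt(dp)

## safer: format() renders the local wall-clock time as character,
## and re-parsing that character string fixes the components unambiguously
safe <- as.POSIXlt(format(dp))

format(safe)   # "2004-06-15 12:00:00"
```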
Comparison Table
Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.
Bibliography
Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In: Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1 (2), 8-11. URL http://CRAN.R-project.org/doc/Rnews/.
Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de
now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class="Date")
  chron:   chron(0)
  POSIXct: structure(0, class=c("POSIXt","POSIXct"))

x days since origin (GMT days)
  Date:    structure(floor(x), class="Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class=c("POSIXt","POSIXct"))

diff in days
  Date:    dt1-dt2
  chron:   dc1-dc2
  POSIXct: difftime(dp1, dp2, unit="day") [1]

time zone difference
  POSIXct: dp - as.POSIXct(format(dp), tz="GMT")

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt+1
  chron:   dc+1
  POSIXct: seq(dp0, length=2, by="DSTday")[2] [2]

previous day
  Date:    dt-1
  chron:   dc-1
  POSIXct: seq(dp0, length=2, by="-1 DSTday")[2] [3]

x days since date
  Date:    dt0+floor(x)
  chron:   dc0+x
  POSIXct: seq(dp0, length=2, by=paste(x, "DSTday"))[2] [4]

sequence
  Date:    dts <- seq(dt0, length=10, by="day")
  chron:   dcs <- seq(dc0, length=10)
  POSIXct: dps <- seq(dp0, length=10, by="DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz="GMT"), rnorm(10)) [5]

every 2nd week
  Date:    seq(dt0, length=3, by="2 week")
  chron:   dc0+seq(0, length=3, by=14)
  POSIXct: seq(dp0, length=3, by="2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc)-month.day.year(dc)$day+1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun=0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc)-3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output

yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Sat Jul 1 1970
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input

"1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format="y-m-d")
  POSIXct: as.POSIXct(z)

"10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz="")
  to Date:    as.Date(dates(dc)); as.Date(format(dp))
  to chron:   chron(unclass(dt));
              chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]
  to POSIXct: as.POSIXct(format(dt));
              as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz="GMT")
  to Date:    as.Date(dates(dc)); as.Date(dp)
  to chron:   chron(unclass(dt)); as.chron(dp)
  to POSIXct: as.POSIXct(format(dt), tz="GMT"); as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See strptime for % codes.

[1] Can be reduced to dp1-dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp+24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp-24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp+24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz="GMT")).

Table 1: Comparison Table
Programmers' Niche
A simple class in S3 and S4

by Thomas Lumley
ROC curves
Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1-specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) and controls (D = 0).
I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±∞ and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.
drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
    function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
    function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type = "l")
}
It would usually be bad style to use T and c asvariable names because of confusion with the S-compatible T==TRUE and the built-in function c() Inthis case the benefit of preserving standard notationis arguably worth the potential confusion
There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.
drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  plot(mspec, sens, type = "l")
}
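As a sanity check (my own, not from the article), the coordinates computed both ways can be compared on simulated data, without plotting: the table-based points are the naive points traversed in reverse, minus the duplicated (0, 0) corner.

```r
## Compare the naive and table-based ROC coordinates on fake data.
set.seed(3)
D <- rep(0:1, each = 30)              # 30 controls, 30 cases
T <- rnorm(60, mean = D)              # cases tend to score higher

## naive version: one pass per cutpoint
cutpoints <- c(-Inf, sort(unique(T)), Inf)
sens1  <- sapply(cutpoints, function(c) sum(D[T > c]) / sum(D))
mspec1 <- sapply(cutpoints,
                 function(c) sum((1 - D)[T > c]) / sum(1 - D))

## table-based version: cumulative sums over the two-way table
DD <- table(-T, D)
sens2  <- unname(cumsum(DD[, 2]) / sum(DD[, 2]))
mspec2 <- unname(cumsum(DD[, 1]) / sum(DD[, 1]))

## reverse the naive points and drop the two (0, 0) entries
stopifnot(all.equal(rev(sens1)[-(1:2)], sens2),
          all.equal(rev(mspec1)[-(1:2)], mspec2))
```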
One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
Classes and methods
Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.
The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec, test = TT,
               call = sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}
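To see the dispatch in action, here is a brief usage sketch (the simulated data and variable names are mine, not from the article; the constructor and print method are repeated so the snippet runs on its own):

```r
## Constructor and print method as defined in the text.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec, test = TT,
               call = sys.call())
  class(rval) <- "ROC"
  rval
}
print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}

## Simulated biomarker: 50 controls (D = 0) and 50 cases (D = 1).
set.seed(1)
D <- rep(0:1, each = 50)
T <- rnorm(100, mean = D)

r <- ROC(T, D)
print(r)          # dispatches to print.ROC and shows the call
```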
The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.
plot.ROC <- function(x, type = "b", null.line = TRUE,
                     xlab = "1-Specificity", ylab = "Sensitivity",
                     main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
       xlab = xlab, ylab = ylab, ...)
  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}
A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:
lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}
A method for identify allows the cutpoint to be found for any point on the curve:
identify.ROC <- function(x, labels = NULL, ..., digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}
An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
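One way to do that exercise is a sketch of my own using the trapezoidal rule, assuming the $mspec and $sens components built by the ROC() constructor above:

```r
## Area under the ROC curve by the trapezoidal rule.
## Assumes the list structure built by ROC(): $mspec (false positive
## rate) and $sens (true positive rate), both non-decreasing.
AUC <- function(x) {
  fpr <- c(0, x$mspec, 1)    # pad so the curve spans [0, 1]
  tpr <- c(0, x$sens, 1)
  n <- length(fpr)
  sum(diff(fpr) * (tpr[-1] + tpr[-n]) / 2)
}

## A perfect test has AUC 1; a useless one has AUC 0.5.
perfect <- list(mspec = c(0, 0.5, 1), sens = c(1, 1, 1))
useless <- list(mspec = c(0.5, 1), sens = c(0.5, 1))
AUC(perfect)   # 1
AUC(useless)   # 0.5
```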
Generalising the class
The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.
ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.
ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
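Trying the method out on my own simulated logistic-regression example (the generic and a slightly condensed ROC.glm, with the integer check omitted, are repeated so the snippet is self-contained):

```r
ROC <- function(T, ...) UseMethod("ROC")

## Condensed version of ROC.glm from the text (integer check omitted).
ROC.glm <- function(T, ...) {
  if (!(T$family$family %in% c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response")) * weights(T)
  DD <- table(-test, disease)
  rval <- list(sens = cumsum(DD[, 2]) / sum(DD[, 2]),
               mspec = cumsum(DD[, 1]) / sum(DD[, 1]),
               test = rev(sort(unique(test))), call = sys.call())
  class(rval) <- "ROC"
  rval
}

set.seed(2)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(2 * x))    # strong predictor
fit <- glm(y ~ x, family = binomial)
r <- ROC(fit)                          # dispatches to ROC.glm
class(r)                               # "ROC"
```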
S4 classes and methods
In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).
As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.
The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions.
setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })
The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new
function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  new("ROC", sens = sens, mspec = mspec,
      test = TT, call = sys.call())
}
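The payoff of registering the class is that new() enforces the structure. A small demonstration of my own (the setClass() call is repeated so the snippet stands alone):

```r
library(methods)

setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })

## Consistent slot lengths: accepted.
ok <- new("ROC", sens = c(0.5, 1), mspec = c(0.2, 1),
          test = c(2, 1), call = quote(ROC(T, D)))

## Mismatched lengths: new() signals an error via the validity check.
bad <- try(new("ROC", sens = 1, mspec = c(0.2, 1),
               test = c(2, 1), call = quote(ROC(T, D))),
           silent = TRUE)
inherits(bad, "try-error")   # TRUE
```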
Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:
setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })
The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.
The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.
setMethod("plot", c("ROC", "missing"),
  function(x, y, type = "b", null.line = TRUE,
           xlab = "1-Specificity",
           ylab = "Sensitivity",
           main = NULL, ...) {
    par(pty = "s")
    plot(x@mspec, x@sens, type = type,
         xlab = xlab, ylab = ylab, ...)
    if (null.line)
      abline(0, 1, lty = 3)
    if (is.null(main))
      main <- x@call
    title(main = main)
  })
Creating a method for lines appears to work the same way:
setMethod("lines", "ROC",
  function(x, ...)
    lines(x@mspec, x@sens, ...)
  )
In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before
> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>
and after the setMethod call
> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x
shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines: calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.
setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type = "response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    new("ROC", sens = sens, mspec = mspec,
        test = TT, call = sys.call())
  })
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.
setGeneric("identify", signature = c("x"))

setMethod("identify", "ROC",
  function(x, labels = NULL, ..., digits = 1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels = labels, ...)
  })
Discussion
Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.
This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R

by the R Core Team
New features in 1.9.1
• as.Date() now has a method for "POSIXlt" objects.
• mean() has a method for "difftime" objects and so summary() works for such objects.
• legend() has a new argument 'pt.cex'.
• plot.ts() has more arguments, particularly 'yax.flip'.
• heatmap() has a new 'keep.dendro' argument.
• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)
• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc look for variables in their formulae.
User-visible changes in 1.9.0
• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.
• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has
been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.
Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.
Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook – see ?setHook.
• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.
• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.
New features in 1.9.0
• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.
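For example (an illustration of mine, not from the NEWS file):

```r
e <- new.env()
e$x <- 1                 # same as assign("x", 1, envir = e)
e[["y"]] <- 2
e$x + e[["y"]]           # 3; no partial matching of names is done
```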
• There are now print() and '[' methods for "acf" objects.
• aov() will now handle singular Error() models with a warning.
• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).
• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.
• as.data.frame() now has a method for arrays.
• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.
• New function assignInNamespace() parallelling fixInNamespace().
• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.
• D(), deriv(), etc now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).
• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.
• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Dates.
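For example (my illustration, not from the NEWS file):

```r
d <- as.Date("2004-06-01")
d + 30                             # whole-day arithmetic: "2004-07-01"
seq(d, by = "month", length.out = 3)
format(d, "%Y-%m-%d")              # "2004-06-01"
```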
• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)
• .Deprecated() has a new argument 'package' which is used in the warning message for non-base packages.
• The print() method for "difftime" objects now handles arrays.
• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.
• dist() has a new method to calculate Minkowski distances.
• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.
• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.
• New functions factorial() defined as gamma(x+1) and, for S-PLUS compatibility, lfactorial() defined as lgamma(x+1).
• findInterval(x, v) now allows +/-Inf values, and NAs in x.
• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.
• glm() arguments 'etastart' and 'mustart' are now evaluated via the model frame in the same way as 'subset' and 'weights'.
• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.
The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.
• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)
• legend() has a new argument 'text.col'.
• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).
• A new function mget() will retrieve multiple values from an environment.
• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".
• nls() and ppr() have a 'model' argument to allow the model frame to be returned as part of the fitted object.
• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.
• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.
• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.
• princomp() now warns if both 'x' and 'covmat' are supplied, and returns scores only if the centring used is known.
• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.
• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.
• read.table() now allows embedded newlines in quoted fields. (PR#4555)
• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

If both 'each' and 'length.out' have been specified, it now recycles rather than fills with NAs for S compatibility.

If both 'times' and 'length.out' have been specified, 'times' is now ignored for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).
• rgb2hsv() is new, an R interface to the C API function with the same name.
• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.
• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.
• New function shQuote() to quote strings to be passed to OS shells.
• sink() now has a split= argument to direct output to both the sink and the current output connection.
• split.screen() now works for multiple devices at once.
• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.
• strsplit() now has 'fixed' and 'perl' arguments and split="" is optimized.
• subset() now allows a 'drop' argument which is passed on to the indexing method for data frames.
• termplot() has an option to smooth the partial residuals.
• varimax() and promax() add class "loadings" to their loadings component.
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)
• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.
• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".
• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).
• The parsed contents of a NAMESPACE file are now stored at installation and if available used to speed loading the package, so packages with namespaces should be reinstalled.
• Argument 'asp', although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().
• Package stats4 exports S4 generics for AIC() and BIC().
• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.
• Added experimental support for conditionals in NAMESPACE files.
• Added as.list.environment to coerce environments to lists (efficiently).
• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)
• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.
• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.
• Changes to package 'grid':
– Renamed push/pop.viewport() to push/popViewport().
– Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).
– Added 'id' and 'id.lengths' arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.
– Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series).

current.vpTree() returns the current viewport tree.
– Added vpPath() to allow specification of viewport path in downViewport() and seekViewport().

See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator :: in later versions).
– Added 'just' argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).
– Allowed the 'vp' slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

pushViewport(viewport(w=.5, h=.5, name="A"))
grid.rect()
pushViewport(viewport(w=.5, h=.5, name="B"))
grid.rect(gp=gpar(col="grey"))
upViewport(2)
grid.rect(vp="A", gp=gpar(fill="red"))
grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))
– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html
– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

* the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob "pointed to" is modified.

In addition, the following features are now possible with grobs:

* grobs now save()/load() like any normal R object.

* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.
– Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).
– Allowed viewports to turn off clipping altogether. Possible settings for viewport 'clip' arg are now:

"on" = clip to the viewport (was TRUE)
"inherit" = clip to whatever parent says (was FALSE)
"off" = turn off clipping

Still accept logical values (and NA maps to "off").
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.
• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.
• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.
• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.
• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.
bull Vignette dependencies can now be checkedand obtained via vignetteDepends
• Option "repositories" to list URLs for package repositories added.
• package.description() has been replaced by packageDescription().
• R CMD INSTALL/build now skip Subversion's .svn directories, as well as CVS directories.
• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.
• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.
bull hsv2rgb and rgb2hsv are newly in the C API
• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.
• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.
• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.
• print.coefmat() is defunct, replaced by printCoefmat().
• codes() and codes<-() are defunct.
• anovalist.lm (replaced in 1.2.0) is now defunct.
• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.
• print.atomic() is defunct.
• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).
• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.
• La.eigen() is deprecated; now eigen() uses LAPACK by default.
• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(x, deriv=2) and psigamma(x, deriv=3).
• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.
• package.contents() and package.description() have been deprecated.
• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.
• configure no longer lists bzip2 and PCRE as 'additional capabilities' as all builds of R have had them since 1.7.0.
• --with-blas="goto" to use K. Goto's optimized BLAS will now work.
The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN

by Kurt Hornik
New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.
BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.
BsMD Bayes screening and model discrimination
follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.
DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.
GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.
HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.
Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.
MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.
MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.
NADA Contains methods described by Dennis RHelsel in his book ldquoNondetects And Data Anal-ysis Statistics for Censored EnvironmentalDatardquo By Lopaka Lee
R2WinBUGS Using this package it is possible tocall a BUGS model summarize inferences andconvergence in a table and graph and savethe simulations in arrays for easy access inR By originally written by Andrew Gelmanchanges and packaged by Sibylle Sturtz andUwe Ligges
RScaLAPACK An interface to ScaLAPACK func-tions from R By Nagiza F Samatova SrikanthYoginath and David Bauer
RUnit R functions implementing a standard UnitTesting framework with additional code in-spection and report generation tools ByMatthias Burger Klaus Juenemann ThomasKoenig
SIN This package provides routines to perform SINmodel selection as described in Drton amp Perl-man (2004) The selected models are in theformat of the ggm package which allows inparticular parameter estimation in the selectedmodel By Mathias Drton
SoPhy SWMS_2D interface Submission to Comput-ers and Geosciences title The use of the lan-guage interface of R two examples for mod-elling water flux and solute transport By Mar-tin Schlather Bernd Huwe
SparseLogReg Some functions wrapping the sparselogistic regression code by S K Shevade and SS Keerthi originally intended for microarray-based gene selection By Michael T Maderwith C code from S K Shevade and S SKeerthi
assist ASSIST see manual By Yuedong Wang andChunlei Ke
bayesmix Bayesian mixture models of univariateGaussian distributions using JAGS By BettinaGruen
betareg Beta regression for modeling rates and pro-portions By Alexandre de Bustamante Simas
circular Circular Statistics from ldquoTopics in circularStatisticsrdquo (2001) S Rao Jammalamadaka andA SenGupta World Scientific By Ulric LundClaudio Agostinelli
crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

R News ISSN 1609-3631
Vol. 4/1, June 2004
debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.
distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.
dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.
energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.
evdbayes Provides functions for the Bayesian analysis of extreme value models using MCMC methods. By Alec Stephenson.
evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S functions (EVIS) by Alexander McNeil; R port by Alec Stephenson.
fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.
fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.
fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.
fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.
fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.
gap This is an integrated package for genetic data analysis of both population and family data. Currently it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, and some statistics in linkage analysis and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.
gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.
gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.
hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.
haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.
hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.
httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.
impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.
klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.
knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.
kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.
mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.
mathgraph Simple tools for constructing and manipulating objects of class mathgraph, from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou.
mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between versus within chain approach of Gelman and Rubin. By Gregory R. Warnes.
mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.
mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.
multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.
mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by debug package. By Mark V. Bravington.
mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.
pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.
qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.
race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.
rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).
ref small package with functions for creating references, reading from and writing to references, and a memory efficient refdata type that transparently encapsulates matrixes and dataframes. By Jens Oehlschlägel.
regress Functions to fit Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.
rgl 3D visualization device (OpenGL). By Daniel Adler.
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.
rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.
rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.
sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.
seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972-974. By David A. Schoenfeld.
setRNG Set reproducible random number generator in R and S. By Paul Gilbert.
sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.
skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.
spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.
spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.
spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS 2002, 99, pg. 15869 and J. Comput. Chem. 2003, 24, pg. 1215. By Rajarshi Guha.
supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.
svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.
treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.
urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.
urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.
verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou.
zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.
Other changes
• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News
by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)
• Emanuele De Rinaldis (Italy)
• Norm Phillips (Canada)
New supporting institutions
• Breast Center at Baylor College of Medicine, Houston, Texas, USA
• Dana-Farber Cancer Institute, Boston, USA
• Department of Biostatistics, Vanderbilt University School of Medicine, USA
• Department of Statistics & Actuarial Science, University of Iowa, USA
• Division of Biostatistics, University of California, Berkeley, USA
• Norwegian Institute of Marine Research, Bergen, Norway
• TERRA Lab, University of Regina - Department of Geography, Canada
• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Danmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232
USA
Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges
Email of editors and editorial board:
firstname.lastname@R-project.org
R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).
R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
The survival package
by Thomas Lumley
This is another in our series of articles describing the recommended packages that come with the R distribution. The survival package is a port of code written by T. M. Therneau, of the Mayo Clinic, code that has also been part of S-PLUS for many years. Porting the survival package in 1997 was my introduction to R programming. It exposed quite a few bugs in R and incompatibilities with S (some of which, on reflection, might have been better left incompatible).
Overview
Survival analysis, also called event history analysis in the social sciences and reliability analysis in engineering, is concerned with the time until an event occurs. The complication that makes survival analysis interesting is that individuals are often observed for only part of the relevant time period. They may be recruited well after time zero, and the study may end with only a small proportion of events having been seen.

In medical statistics the most common survival analyses are estimation of the distribution of time to the event, testing for equality of two distributions, and regression modelling of the rate at which events occur. These are all covered by the survival package.

Additional tools for regression modelling are in the Design and eha packages, and the muhaz package allows estimation of the hazard function (the analogue in survival analysis of density estimation).
Specifying survival data
In the popular 'counting process' formulation of survival analysis each record in a dataset has three variables describing the survival time. A start variable specifies when observation begins, a stop variable specifies when it ends, and an event variable is 1 if observation ended with an event and 0 if it ended without seeing the event. These variables are bundled into a "Surv" object by the Surv() function. If the start variable is zero for everyone it may be omitted.

In data(veteran), time measures the time from the start of a lung cancer clinical trial and status is 1 if the patient died at time, 0 if follow-up ended with the patient still alive. In addition, diagtime gives the time from initial diagnosis until entry into the clinical trial.
> library(survival)
> data(veteran)  ## time from randomisation to death
> with(veteran, Surv(time, status))
 [1]  72  411  228  126  118   10   82
 [8] 110  314  100+  42    8  144   25+

## time from diagnosis to death
> with(veteran, Surv(diagtime*30,
                     diagtime*30+time, status))
 [1] ( 210, 282 ] ( 150, 561 ]
 [3] (  90, 318 ] ( 270, 396 ]
 [9] ( 540, 854 ] ( 180, 280+]
The first Surv object prints the observation time, with a + to indicate that the event is still to come. The second object prints the start and end of observation, again using a + to indicate end of follow-up before an event.
One and two sample summaries
The survfit() function estimates the survival distribution for a single group or multiple groups. It produces a "survfit" object with a plot() method. Figure 1 illustrates some of the options for producing more attractive graphs.

Two (or more) survival distributions can be compared with survdiff, which implements the logrank test (and the Gρ family of weighted logrank tests). In this example it is clear from the graphs and the tests that the new treatment (trt=2) and the standard treatment (trt=1) were equally ineffective.
> data(veteran)
> plot(survfit(Surv(time, status) ~ trt, data=veteran),
       xlab="Years since randomisation", xscale=365,
       ylab="% surviving", yscale=100,
       col=c("forestgreen", "blue"))
> survdiff(Surv(time, status) ~ trt, data=veteran)

Call:
survdiff(formula = Surv(time, status) ~ trt,
         data = veteran)

       N Observed Expected (O-E)^2/E
trt=1 69       64     64.5   0.00388
trt=2 68       64     63.5   0.00394
      (O-E)^2/V
trt=1   0.00823
trt=2   0.00823

 Chisq= 0  on 1 degrees of freedom, p= 0.928
Figure 1: Survival distributions for two lung cancer treatments (x-axis: years since randomisation; y-axis: % surviving; legend: Standard, New).
Proportional hazards models
The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.

Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

    log h(t; z) = log h0(t) + βz.

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.
A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.
> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
        log(bili) + log(protime) + age + platelet,
        data=pbc, subset=trt>0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
      log(bili) + log(protime) + age + platelet,
      data = pbc, subset = trt > 0)

                  coef exp(coef)
edtrt          1.02980     2.800
log(bili)      0.95100     2.588
log(protime)   2.88544    17.911
age            0.03544     1.036
platelet      -0.00128     0.999
              se(coef)     z       p
edtrt         0.300321  3.43 0.00061
log(bili)     0.097771  9.73 0.00000
log(protime)  1.031908  2.80 0.00520
age           0.008489  4.18 0.00003
platelet      0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312
The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.
> plot(survfit(Surv(time, status) ~ edtrt,
       data=pbc, subset=trt==-9))
> lines(survexp(~ edtrt +
       ratetable(edtrt=edtrt, bili=bili,
         platelet=platelet, age=age,
         protime=protime),
       data=pbc, subset=trt==-9,
       ratetable=mayomodel, cohort=TRUE),
     col="purple")
The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.

Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.
Figure 2: Observed and predicted survival.
The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The cox.zph function provides formal tests and graphical diagnostics for proportional hazards.
> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796

## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])
Figure 3: Testing proportional hazards (smoothed estimate of β(t) for edtrt over time).
There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.
Additional features
Computation for the Cox model extends straightforwardly to situations where z can vary over time, and when an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual and a strata() term indicating subgroups that get their own unspecified baseline hazard function.

Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm, with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.

More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of followup.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
useR 2004 - The R User Conference
by John Fox, McMaster University, Canada
More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis and David Meyer served as chairs of the conference, program committee and local organizing committee; Bettina Grün was in charge of local arrangements.

The conference program included nearly 100 presentations, many in keynote, plenary and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces and finance to fights among lizards, reflects the wide range of applications supported by R.

A particularly interesting component of the program were keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:
• Paul Murrell on grid graphics and programming

• Martin Mächler on good programming practice
• Luke Tierney on namespaces and byte compilation

• Kurt Hornik on packaging, documentation and testing

• Friedrich Leisch on S4 classes and methods

• Peter Dalgaard on language interfaces, using .Call and .External

• Douglas Bates on multilevel models in R

• Brian D. Ripley on data mining and related topics
Slides for the keynote addresses, along with abstracts of other papers presented at useR 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.
An innovation of useR 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.
The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19, and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR 2004, was very enthusiastic about the conference. We look forward to an even larger useR in 2006.
R Help Desk
Date and Time Classes in R

Gabor Grothendieck and Thomas Petzoldt
Introduction
R-1.9.0 and its contributed packages have a number of date/time (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:
• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).

• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Date/times in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).

• POSIX classes. POSIX classes refer to the two classes POSIXct and POSIXlt and their common super class POSIXt. These support times and dates including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases and other applications that involve external communication with systems that also support dates, times, time zones and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct date/times are represented as seconds since January 1, 1970, GMT, while POSIXlt date/times are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of
the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX date/time objects using unclass(x), where x is of any POSIX date/time class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.
Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.
Other Applications
Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron respectively. It's possible to set Excel to use either origin, which is why the word "usually" was employed above.
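As a small check of the Windows origin (the serial number below is chosen for illustration; the fractional part, which carries the time of day, is dropped by floor):

```r
x <- 38352.5                       # Excel (Windows) serial: noon on 2004-12-31
as.Date("1899-12-30") + floor(x)   # "2004-12-31"
```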
SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).
Time Series
A common use of dates and datetimes is in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its) and zoo (in package zoo). ts uses a scheme of its own for dates. irts and its use POSIXct dates to represent datetimes. zoo can use any date or timedate package to represent datetimes.
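For instance, a regular monthly series stores its dates on ts's own numeric scale rather than in one of the date classes above (a small sketch with simulated data):

```r
# monthly series starting January 2004
z <- ts(rnorm(12), start = c(2004, 1), frequency = 12)
time(z)   # ts's own numeric scheme: 2004.000, 2004.083, ...
```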
Input
By default, read.table will read in numeric data such as 20040115 as numbers and will read non-numeric data such as 12/15/04 or 2004-12-15 as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.1
# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y%m%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
   as.is = 1:2)
chron(df$date, df$time)
Avoiding Errors
The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.
With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:
# default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))

# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)

# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)

# if TRUE then year not displayed
options(chron.simplify = FALSE)
For the same reason, not only should the global chron origin not be changed but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,
1 This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.
orig <- chron("01/01/60")
x <- 0:9  # days since 01/01/60
# chron dates with default origin
orig + x
Regarding POSIX classes, the user should be aware of the following:
• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:
dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)
This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in Table 1 of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.
• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs; but in general, caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.
• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst = 1) or not (isdst = 0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.
• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
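A minimal base-R sketch of this convert-via-character recipe (the timestamp is arbitrary; dp is a hypothetical local-time value):

```r
dp <- as.POSIXct("2004-06-15 12:00:00")   # POSIXct in the local time zone
# a direct conversion can shift the clock time by the local offset from GMT;
# converting to character first preserves the printed clock time
as.POSIXlt(format(dp), tz = "GMT")        # still 12:00:00, now labelled GMT
```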
Comparison Table
Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.
Bibliography
Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2):8-11. URL http://CRAN.R-project.org/doc/Rnews/.

Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de
Table 1: Comparison Table

now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz = "GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class = "Date")
  chron:   chron(0)
  POSIXct: structure(0, class = c("POSIXt", "POSIXct"))

x days since origin
  Date:    structure(floor(x), class = "Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class = c("POSIXt", "POSIXct"))  (GMT)

diff in days
  Date:    dt1 - dt2
  chron:   dc1 - dc2
  POSIXct: difftime(dp1, dp2, unit = "day") [1]

time zone difference
  POSIXct: dp - as.POSIXct(format(dp, tz = "GMT"))

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt + 1
  chron:   dc + 1
  POSIXct: seq(dp0, length = 2, by = "DSTday")[2] [2]

previous day
  Date:    dt - 1
  chron:   dc - 1
  POSIXct: seq(dp0, length = 2, by = "-1 DSTday")[2] [3]

x days since date
  Date:    dt0 + floor(x)
  chron:   dc0 + x
  POSIXct: seq(dp0, length = 2, by = paste(x, "DSTday"))[2] [4]

sequence
  Date:    dts <- seq(dt0, length = 10, by = "day")
  chron:   dcs <- seq(dc0, length = 10)
  POSIXct: dps <- seq(dp0, length = 10, by = "DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz = "GMT"), rnorm(10)) [5]

every 2nd week
  Date:    seq(dt0, length = 3, by = "2 week")
  chron:   dc0 + seq(0, length = 3, by = 14)
  POSIXct: seq(dp0, length = 3, by = "2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc) - month.day.year(dc)$day + 1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun = 0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc) - 3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output

yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Sat Jul 1 1970
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input

"1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format = "y-m-d")
  POSIXct: as.POSIXct(z)

"10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz = "")

to Date
  from chron:   as.Date(dates(dc))
  from POSIXct: as.Date(format(dp))

to chron
  from Date:    chron(unclass(dt))
  from POSIXct: chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]

to POSIXct
  from Date:    as.POSIXct(format(dt))
  from chron:   as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz = "GMT")

to Date
  from chron:   as.Date(dates(dc))
  from POSIXct: as.Date(dp)

to chron
  from Date:    chron(unclass(dt))
  from POSIXct: as.chron(dp)

to POSIXct
  from Date:    as.POSIXct(format(dt), tz = "GMT")
  from chron:   as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for the % codes.

[1] Can be reduced to dp1 - dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp + 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp - 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp + 24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz = "GMT")).
Programmers' Niche: A simple class in S3 and S4
by Thomas Lumley
ROC curves
Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.
The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).
I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.
Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±Inf and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.
drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
    function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
    function(c) sum((1 - D)[T <= c])/sum(1 - D))
  plot(1 - spec, sens, type = "l")
}
It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.
There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.
drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  plot(mspec, sens, type = "l")
}
One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
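This is easy to confirm numerically; the following self-contained sketch repeats the table-based computation on simulated data (variable names and parameters are illustrative):

```r
set.seed(1)
D <- rbinom(50, 1, 0.5)          # hypothetical disease indicator
T <- rnorm(50, mean = D)         # hypothetical test, higher when diseased

DD <- table(-T, D)
sens  <- cumsum(DD[, 2])/sum(DD[, 2])
mspec <- cumsum(DD[, 1])/sum(DD[, 1])

# cumulative sums of non-negative counts are non-decreasing
stopifnot(!is.unsorted(sens), !is.unsorted(mspec))
```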
Classes and methods
Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.
The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec, test = TT,
    call = sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}
The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.
plot.ROC <- function(x, type = "b", null.line = TRUE,
    xlab = "1-Specificity", ylab = "Sensitivity",
    main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
    xlab = xlab, ylab = ylab, ...)
  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}
A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:
lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}
A method for identify allows the cutpoint to be found for any point on the curve:
identify.ROC <- function(x, labels = NULL, ...,
    digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}
An AUC function to compute the area under theROC curve is left as an exercise for the reader
Generalising the class
The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.
ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
    test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.
ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
      c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
    test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
S4 classes and methods
In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).
As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.
The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:
setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
    test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })
The first argument to setClass gives the name of the new class. The second describes the components ("slots") that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new
function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  new("ROC", sens = sens, mspec = mspec,
    test = TT, call = sys.call())
}
Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:
setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })
The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.
The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.
setMethod("plot", c("ROC", "missing"),
  function(x, y, type = "b", null.line = TRUE,
      xlab = "1-Specificity",
      ylab = "Sensitivity",
      main = NULL, ...) {
    par(pty = "s")
    plot(x@mspec, x@sens, type = type,
      xlab = xlab, ylab = ylab, ...)
    if (null.line)
      abline(0, 1, lty = 3)
    if (is.null(main))
      main <- x@call
    title(main = main)
  })
Creating a method for lines appears to work the same way:
setMethod("lines", "ROC",
  function(x, ...) {
    lines(x@mspec, x@sens, ...)
  })
In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before,
> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>
and after the setMethod call
> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x
shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.
setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T, ...) {
    if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type = "response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    new("ROC", sens = sens, mspec = mspec,
      test = TT, call = sys.call())
  })
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.
setGeneric("identify", signature = c("x"))

setMethod("identify", "ROC",
  function(x, labels = NULL, ...,
      digits = 1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
      labels = labels, ...)
  })
Discussion
Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.
This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R
by the R Core Team

New features in 1.9.1
• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc look for variables in their formulae.
User-visible changes in 1.9.0
• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has
been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook; see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.
New features in 1.9.0
• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits = FALSE.

• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace.

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.
• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv = 0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out = n) now gives a vector of length n and not length 0, for compatibility with S.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split="" is optimized.

• subset() now allows a drop argument, which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.
• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().
• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.
• Added as.list.environment to coerce environments to lists (efficiently).
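An added one-liner showing the coercion (not from the original entry):

```r
e <- new.env()
e$a <- 1
e$b <- 2
l <- as.list(e)   # dispatches to the as.list.environment method
sort(names(l))    # "a" "b"
```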
• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)
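An added sketch of the default behaviour, which appends "Sum" margins:

```r
tab <- table(c(1, 1, 2), c("x", "y", "y"))
m <- addmargins(tab)   # adds a "Sum" row and a "Sum" column
m["Sum", "Sum"]        # 3, the grand total
```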
• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The "diamond" frames around edge labels are more nicely scaled horizontally.
• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.
• Changes to package 'grid':

  – Renamed push/pop.viewport() to push/popViewport().

  – Added upViewport(), downViewport(), and seekViewport() to allow creation and navigation of a viewport tree (rather than just a viewport stack).

  – Added "id" and "id.lengths" arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.

  – Added vpList(), vpStack(), vpTree(), and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.
  – Added vpPath() to allow specification of a viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use.

    NOTE: it is also possible to specify a path directly, e.g., something like "vp1::vp2", but this is only advised for interactive use (in case I decide to change the separator :: in later versions).
  – Added "just" argument to grid.layout() to allow justification of the layout relative to the parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).
  – Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

      pushViewport(viewport(w=.5, h=.5, name="A"))
      grid.rect()
      pushViewport(viewport(w=.5, h=.5, name="B"))
      grid.rect(gp=gpar(col="grey"))
      upViewport(2)
      grid.rect(vp="A", gp=gpar(fill="red"))
      grid.rect(vp=vpPath("A", "B"),
                gp=gpar(fill="blue"))
  – Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies).

    This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend().

    There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html
  – Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

    * grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

    * grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

    * the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob "pointed to" is modified.
    In addition, the following features are now possible with grobs:

    * grobs now save()/load() like any normal R object.

    * many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).
    * there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

    * there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

    * there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

  – Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth", and "grobheight" unit objects, respectively).
  – Allowed viewports to turn off clipping altogether. Possible settings for the viewport clip argument are now:

      "on"      = clip to the viewport (was TRUE)
      "inherit" = clip to whatever parent says (was FALSE)
      "off"     = turn off clipping

    Logical values are still accepted (and NA maps to "off").
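An added minimal sketch of the new "off" setting (not from the original entry): with clipping off, drawing may extend beyond the viewport's own extent.

```r
library(grid)

grid.newpage()
## clip = "off": drawing in this viewport is not clipped at all
pushViewport(viewport(width = 0.5, height = 0.5, clip = "off"))
grid.rect(width = 1.2)   # wider than the viewport; drawn in full
popViewport()
```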
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.
• undoc() in package 'tools' has a new default of 'use.values = NULL', which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option "repositories" to list URLs for package repositories added.
• package.description() has been replaced by packageDescription().
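An added one-liner showing the replacement function, which parses a package's DESCRIPTION file:

```r
d <- packageDescription("stats")  # list-like object of DESCRIPTION fields
d$Package                         # "stats"
d$License
```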
• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector), rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp(), and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.
• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.
• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(., deriv=2) and psigamma(., deriv=3).
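An added sketch of the replacement: psigamma(x, deriv = n) generalizes digamma/trigamma to higher derivatives.

```r
## deriv = 1 matches trigamma(); deriv = 2 is what tetragamma() computed,
## and deriv = 3 is what pentagamma() computed
all.equal(psigamma(2.5, deriv = 1), trigamma(2.5))  # TRUE
t4 <- psigamma(2.5, deriv = 2)                      # replaces tetragamma(2.5)
```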
• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.
• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

  The included PCRE sources have been updated to version 4.5, and PCRE >= 4.0 is now required if --with-pcre is used.

  The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blasgoto= to use K. Goto's optimized BLAS will now work.
The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN
by Kurt Hornik
New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.
GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian multinomial probit models via Markov chain Monte Carlo. Along with the standard multinomial probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.
NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Grün.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001) by S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.
debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object oriented implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S code (EVIS) by Alexander McNeil; R port by Alec Stephenson.
fBasics fBasics package from Rmetrics — Rmetrics is an environment and software collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others; see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.
gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-chain versus within-chain approach of Gelman and Rubin. By Gregory R. Warnes.
mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley. Some routines from vegan by Jari Oksanen; extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory efficient refdata type that transparently encapsulates matrices and data.frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for robust location and scatter estimation and robust regression with high breakdown point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.
spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS 2002, 99, pg. 15869, and J. Comput. Chem. 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for supervised grouping of predictor variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (simulated urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.
Other changes
• Packages mclust1998, netCDF, and serialize were moved from the main CRAN section to the Archive.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News
by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina, Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board: Douglas Bates and Paul Murrell.

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
[Figure 1: Survival distributions for two lung cancer treatments. x-axis: years since randomisation; y-axis: percent surviving; curves labelled Standard and New.]
Proportional hazards models
The mainstay of survival analysis in the medical world is the Cox proportional hazards model and its extensions. This expresses the hazard (or rate) of events as an unspecified baseline hazard function multiplied by a function of the predictor variables.
Writing h(t; z) for the hazard at time t with predictor variables Z = z, the Cox model specifies

    log h(t; z) = log h0(t) + βz,

equivalently h(t; z) = h0(t) e^(βz).

Somewhat unusually for a semiparametric model, there is very little loss of efficiency by leaving h0(t) unspecified, and computation is, if anything, easier than for parametric models.
A standard example of the Cox model is one constructed at the Mayo Clinic to predict survival in patients with primary biliary cirrhosis, a rare liver disease. This disease is now treated by liver transplantation, but at the time there was no effective treatment. The model is based on data from 312 patients in a randomised trial.
> data(pbc)
> mayomodel <- coxph(Surv(time, status) ~ edtrt +
+                    log(bili) + log(protime) +
+                    age + platelet,
+                    data = pbc, subset = trt > 0)
> mayomodel
Call:
coxph(formula = Surv(time, status) ~ edtrt +
    log(bili) + log(protime) + age + platelet,
    data = pbc, subset = trt > 0)

                 coef exp(coef)
edtrt         1.02980     2.800
log(bili)     0.95100     2.588
log(protime)  2.88544    17.911
age           0.03544     1.036
platelet     -0.00128     0.999

             se(coef)     z       p
edtrt        0.300321  3.43 0.00061
log(bili)    0.097771  9.73 0.00000
log(protime) 1.031908  2.80 0.00520
age          0.008489  4.18 0.00003
platelet     0.000927 -1.38 0.17000

Likelihood ratio test=185  on 5 df, p=0  n= 312
The survexp function can be used to compare predictions from a proportional hazards model to actual survival. Here the comparison is for 106 patients who did not participate in the randomised trial. They are divided into two groups based on whether they had edema (fluid accumulation in tissues), an important risk factor.
> plot(survfit(Surv(time, status) ~ edtrt,
               data = pbc, subset = trt == -9))
> lines(survexp(~ edtrt + ratetable(edtrt = edtrt,
                 bili = bili, platelet = platelet,
                 age = age, protime = protime),
               data = pbc, subset = trt == -9,
               ratetable = mayomodel, cohort = TRUE),
        col = "purple")
The ratetable function in the model formula wraps the variables that are used to match the new sample to the old model.
Figure 2 shows the comparison of predicted survival (purple) and observed survival (black) in these 106 patients. The fit is quite good, especially as people who do and do not participate in a clinical trial are often quite different in many ways.
[Figure 2 plot: observed and model-predicted survival curves, x-axis 0 to 4000 days, y-axis 0.0 to 1.0.]
Figure 2: Observed and predicted survival
The main assumption of the proportional hazards model is that hazards for different groups are in fact proportional, i.e. that β is constant over time. The
R News ISSN 1609-3631
cox.zph function provides formal tests and graphical diagnostics for proportional hazards.
> cox.zph(mayomodel)
                 rho chisq      p
edtrt        -0.1602 3.411 0.0648
log(bili)     0.1507 2.696 0.1006
log(protime) -0.1646 2.710 0.0997
age          -0.0708 0.542 0.4617
platelet     -0.0435 0.243 0.6221
GLOBAL            NA 9.850 0.0796
## graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])
[Figure 3 plot: smoothed estimate of Beta(t) for edtrt against time, x-axis "Time", y-axis "Beta(t) for edtrt".]
Figure 3: Testing proportional hazards
There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.
Additional features
Computation for the Cox model extends straightforwardly to situations where z can vary over time and when an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual, and a strata() term indicating subgroups that get their own unspecified baseline hazard function.
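As an illustration of the strata() term (a sketch of my own, not from the article), one could give each edema group its own baseline hazard in the Mayo data; with repeated records per subject, a cluster() term naming a hypothetical subject identifier would be added in the same way:

```r
library(survival)
data(pbc)

## Separate, unspecified baseline hazards for each edema group.
## Given an id variable, "+ cluster(id)" would request robust
## variances that group records by subject (id is hypothetical here).
fit <- coxph(Surv(time, status) ~ log(bili) + age + strata(edtrt),
             data = pbc, subset = trt > 0)
fit
```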
Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.
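A minimal parametric counterpart to the earlier Cox fit might look like this (my example, not the article's; the choice of predictors and of a Weibull distribution is arbitrary):

```r
library(survival)
data(pbc)

## Weibull accelerated-failure model for (log) survival time
weifit <- survreg(Surv(time, status) ~ edtrt + log(bili),
                  data = pbc, subset = trt > 0,
                  dist = "weibull")
summary(weifit)
```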
More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.
The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of follow-up.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
useR! 2004
The R User Conference
by John Fox
McMaster University, Canada
More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR! 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis, and David Meyer served as chairs of the conference, program committee, and local organizing committee; Bettina Grün was in charge of local arrangements.
The conference program included nearly 100 presentations, many in keynote, plenary, and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.
A particularly interesting component of the program was the series of keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:
- Paul Murrell on grid graphics and programming
- Martin Mächler on good programming practice
- Luke Tierney on namespaces and byte compilation
- Kurt Hornik on packaging, documentation, and testing
- Friedrich Leisch on S4 classes and methods
- Peter Dalgaard on language interfaces, using .Call and .External
- Douglas Bates on multilevel models in R
- Brian D. Ripley on data mining and related topics
Slides for the keynote addresses, along with abstracts of other papers presented at useR! 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html
An innovation of useR! 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.
The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19 and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.
Everyone with whom I spoke, both at and after useR! 2004, was very enthusiastic about the conference. We look forward to an even larger useR! in 2006.
R Help Desk
Date and Time Classes in R
Gabor Grothendieck and Thomas Petzoldt
Introduction
R-1.9.0 and its contributed packages have a number of datetime (i.e., date or date/time) classes. In particular, the following three classes are discussed in this article:
- Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).
- chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).
- POSIX classes. POSIX classes refer to the two classes POSIXct, POSIXlt and their common super class POSIXt. These support times and dates including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases and other applications that involve external communication with systems that also support dates, times, time zones and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970, GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of
the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
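The internal representations described above can be inspected directly; for instance (a sketch of my own, not from the article):

```r
library(chron)

d  <- as.Date("1970-01-02")           # Date: days since 1970-01-01
dc <- chron("01/02/70", "12:00:00")   # chron: days plus fraction of a day
ct <- as.POSIXct("1970-01-01 00:00:30",
                 tz = "GMT")          # POSIXct: seconds since the origin

unclass(d)    # 1
unclass(dc)   # 1.5 (noon on January 2nd, 1970)
unclass(ct)   # 30, with a "tzone" attribute
```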
When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.
Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.
Other Applications
Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word "usually" was employed above.
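For example (my own sketch; the serial numbers are made up, not taken from a real spreadsheet):

```r
library(chron)

x <- c(38139.25, 38140.5)   # hypothetical Excel (Windows) serial numbers

## convert to Date class dates
as.Date("1899-12-30") + floor(x)

## convert to chron, retaining the times of day
chron("12/30/1899") + x
```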
SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).
Time Series
A common use of dates and datetimes is in time series. The ts class represents regular time series (i.e.,
equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its) and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or timedate package to represent datetimes.
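A minimal sketch (my example, not from the article) of an irregular series in zoo, indexed by Date class dates:

```r
library(zoo)

## five observations on irregularly spaced dates
z <- zoo(c(1.2, 0.8, 1.5, 0.9, 1.1),
         as.Date(c("2004-01-02", "2004-01-05", "2004-01-06",
                   "2004-01-12", "2004-01-15")))
z
```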
Input
By default, read.table will read in numeric data, such as 20040115, as numbers and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.[1]
## date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")

## first two cols in format mm/dd/yy hh:mm:ss
## Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)
Avoiding Errors
The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.
With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:
## default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))

## if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)

## function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)

## if TRUE then year not displayed
options(chron.simplify = FALSE)
For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example:
[1] This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/
orig <- chron("01/01/60")
x <- 0:9   # days since 01/01/60
## chron dates with default origin
orig + x
Regarding POSIX classes, the user should be aware of the following:
- time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:
dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)
This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.
- OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general, caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone, "", and Greenwich Mean Time, "GMT", can be assumed to be available, but other time zones cannot be assumed to be available across platforms.
- POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst = 1) or not (isdst = 0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.
- conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
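To see the time-zone dependence concretely (a sketch of my own, not from the article), the same POSIXct instant formats differently under different tz= values:

```r
p <- as.POSIXct("2004-06-15 12:00:00", tz = "GMT")

format(p, tz = "GMT")       # "2004-06-15 12:00:00"
format(p, tz = "EST5EDT")   # the same instant, printed as US Eastern time
```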
Comparison Table
Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes, and to translate commands among the classes.
Bibliography
Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2), 8-11. URL http://CRAN.R-project.org/doc/Rnews/
Gabor Grothendieck
ggrothendieck@myway.com
Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de
now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz = "GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class = "Date")
  chron:   chron(0)
  POSIXct: structure(0, class = c("POSIXt", "POSIXct"))

x days since origin
  Date:    structure(floor(x), class = "Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class = c("POSIXt", "POSIXct"))  # GMT days

diff in days
  Date:    dt1 - dt2
  chron:   dc1 - dc2
  POSIXct: difftime(dp1, dp2, unit = "day")  [1]

time zone difference
  POSIXct: dp - as.POSIXct(format(dp), tz = "GMT")

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt + 1
  chron:   dc + 1
  POSIXct: seq(dp0, length = 2, by = "DSTday")[2]  [2]

previous day
  Date:    dt - 1
  chron:   dc - 1
  POSIXct: seq(dp0, length = 2, by = "-1 DSTday")[2]  [3]

x days since date
  Date:    dt0 + floor(x)
  chron:   dc0 + x
  POSIXct: seq(dp0, length = 2, by = paste(x, "DSTday"))[2]  [4]

sequence
  Date:    dts <- seq(dt0, length = 10, by = "day")
  chron:   dcs <- seq(dc0, length = 10)
  POSIXct: dps <- seq(dp0, length = 10, by = "DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz = "GMT"), rnorm(10))  [5]

every 2nd week
  Date:    seq(dt0, length = 3, by = "2 week")
  chron:   dc0 + seq(0, length = 3, by = 14)
  POSIXct: seq(dp0, length = 3, by = "2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc) - month.day.year(dc)$day + 1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun = 0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc) - 3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output ("yyyy-mm-dd")
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output ("mm/dd/yy")
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output ("Sat Jul 1 1970")
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input ("1970-10-15")
  Date:    as.Date(z)
  chron:   chron(z, format = "y-m-d")
  POSIXct: as.POSIXct(z)

Input ("10/15/1970")
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz = "")
  to Date:
    from chron:   as.Date(dates(dc))
    from POSIXct: as.Date(format(dp))
  to chron:
    from Date:    chron(unclass(dt))
    from POSIXct: chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S"))  [6]
  to POSIXct:
    from Date:    as.POSIXct(format(dt))
    from chron:   as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz = "GMT")
  to Date:
    from chron:   as.Date(dates(dc))
    from POSIXct: as.Date(dp)
  to chron:
    from Date:    chron(unclass(dt))
    from POSIXct: as.chron(dp)
  to POSIXct:
    from Date:    as.POSIXct(format(dt), tz = "GMT")
    from chron:   as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables; variables ending in 0 are scalars, others are vectors. Expressions involving chron dates assume all chron options are set at default values. See ?strptime for the % codes.

[1] Can be reduced to dp1 - dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp + 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp - 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp + 24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz = "GMT")).

Table 1: Comparison Table
Programmers' Niche
A simple class in S3 and S4
by Thomas Lumley
ROC curves
Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.
The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).
I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.
Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±Inf and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.
drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c]) / sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1 - D)[T <= c]) / sum(1 - D))
  plot(1 - spec, sens, type = "l")
}
It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T == TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.
There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.
drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  plot(mspec, sens, type = "l")
}
One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
Classes and methods
Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.
The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec, test = TT,
               call = sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}
The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
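A toy illustration of this dispatch on the class attribute (my example, not the article's):

```r
## a generic and two methods, dispatched on the class attribute
quack <- function(x, ...) UseMethod("quack")
quack.duck <- function(x, ...) cat("quack!\n")
quack.default <- function(x, ...) stop("this object does not quack")

## any list becomes a "duck" simply by declaring it so;
## R never verifies that it really can quack
d <- structure(list(), class = "duck")
quack(d)    # dispatches to quack.duck
```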
Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.
plot.ROC <- function(x, type = "b", null.line = TRUE,
                     xlab = "1-Specificity",
                     ylab = "Sensitivity",
                     main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
       xlab = xlab, ylab = ylab, ...)
  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}
A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:
lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}
A method for identify allows the cutpoint to be found for any point on the curve:
identify.ROC <- function(x, labels = NULL, ...,
                         digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}
An AUC function to compute the area under theROC curve is left as an exercise for the reader
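One possible solution to that exercise (my sketch, not code from the article), using the trapezoidal rule on the sens and mspec components of the ROC objects defined above:

```r
AUC <- function(x) {
  ## pad with (0, 0) so the curve starts at the origin;
  ## sens and mspec are non-decreasing and end at 1
  sens  <- c(0, x$sens)
  mspec <- c(0, x$mspec)
  ## trapezoidal area under sens as a function of mspec
  sum(diff(mspec) * (sens[-1] + sens[-length(sens)]) / 2)
}
```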
Generalising the class
The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.
ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.
ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
S4 classes and methods
In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).
As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.
The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions.
setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })
The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
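A stricter validity function along those lines might look like this (a sketch under my own assumptions, not code from the article):

```r
validROC <- function(object) {
  if (length(object@sens) != length(object@mspec) ||
      length(object@sens) != length(object@test))
    return("sens, mspec and test must have equal lengths")
  if (any(object@sens < 0 | object@sens > 1) ||
      any(object@mspec < 0 | object@mspec > 1))
    return("rates must lie between 0 and 1")
  if (is.unsorted(object@sens) || is.unsorted(object@mspec))
    return("ROC coordinates must be non-decreasing")
  TRUE
}
```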
In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new
function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  new("ROC", sens = sens, mspec = mspec,
      test = TT, call = sys.call())
}
Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code.
setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })
The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.
The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.
setMethod("plot", c("ROC", "missing"),
  function(x, y, type = "b", null.line = TRUE,
           xlab = "1-Specificity",
           ylab = "Sensitivity",
           main = NULL, ...) {
    par(pty = "s")
    plot(x@mspec, x@sens, type = type,
         xlab = xlab, ylab = ylab, ...)
    if (null.line)
      abline(0, 1, lty = 3)
    if (is.null(main))
      main <- x@call
    title(main = main)
  })
Creating a method for lines appears to work the same way.
setMethod("lines", "ROC",
  function(x, ...) {
    lines(x@mspec, x@sens, ...)
  })
In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before
> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>
and after the setMethod call
> lines
standardGeneric for "lines" defined from
package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x
shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.
setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T, ...) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type = "response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[, 2]) / sum(DD[, 2])
    mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
    new("ROC", sens = sens, mspec = mspec,
        test = TT, call = sys.call())
  })
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.
setGeneric("identify", signature = c("x"))

setMethod("identify", "ROC",
  function(x, labels = NULL, ...,
           digits = 1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels = labels, ...)
  })
Discussion
Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.
This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model is a linear model; rather, it has a linear model (from the last iteration of iteratively reweighted least squares) associated with it.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R
by the R Core Team

New features in 1.9.1
• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument 'pt.cex'.

• plot.ts() has more arguments, particularly 'yax.flip'.

• heatmap() has a new 'keep.dendro' argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc look for variables in their formulae.
User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.
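For illustration (not from the original newsletter), the new underscore behaviour when run in R 1.9.0 or later:

```r
## Underscores are now syntactically valid in names and are
## left alone by make.names():
make.names("my_var")   # stays "my_var"; older R changed it to "my.var"
x_1 <- 21              # a legal name from 1.9.0 onwards
x_1 * 2
```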
• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has
been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook – see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.
New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.
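A small illustration of the new environment indexing (an example added here, not from the original text):

```r
## $ and [[ on environments behave like get/assign with inherits=FALSE:
e <- new.env()
e$x <- 1            # same effect as assign("x", 1, envir = e)
e[["y"]] <- 2
e$x + e[["y"]]      # 3
```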
• There are now print() and '[' methods for "acf" objects.

• aov() will now handle singular Error() models with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace().
• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The 'package' argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times) plus many utility functions similar to those for date-times. See ?Date.
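For illustration, the "Date" class in use (an example added here, not from the original text):

```r
## Dates without times, with day-based arithmetic:
d <- as.Date("2004-06-01")
class(d)                      # "Date"
d + 30                        # 30 days later
as.Date("2004-06-15") - d     # difference in days
```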
• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument 'package' which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.
• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.
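For illustration (an example added here, not from the original text), the Minkowski method with exponent p:

```r
## Minkowski distance; p = 2 reduces to Euclidean, p = 1 to Manhattan:
x <- rbind(c(0, 0), c(3, 4))
dist(x, method = "minkowski", p = 2)   # 5
dist(x, method = "minkowski", p = 1)   # 7
```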
• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).
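A quick check of the definitions just given (an example added here, not from the original text):

```r
## factorial() and lfactorial() are thin wrappers around the gamma family:
factorial(5)                            # 120, computed as gamma(6)
all.equal(lfactorial(10), lgamma(11))   # log(10!) via lgamma
```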
• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments 'etastart' and 'mustart' are now evaluated via the model frame in the same way as 'subset' and 'weights'.
• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.
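For illustration (an example added here, not from the original text):

```r
## mget() looks up several names at once and returns a named list:
e <- new.env()
e$a <- 1
e$b <- 2
mget(c("a", "b"), envir = e)   # list(a = 1, b = 2)
```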
• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a 'model' argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.
• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both 'x' and 'covmat' are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.
• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

If both 'each' and 'length.out' have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both 'times' and 'length.out' have been specified, 'times' is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).
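The revised rep() semantics can be illustrated directly (an example added here, not from the original text):

```r
## 'length.out' on a 0-length vector now pads with NAs to the requested length:
rep(numeric(0), length.out = 3)       # NA NA NA, not a 0-length vector

## With both 'each' and 'length.out', rep() recycles instead of NA-padding:
rep(1:2, each = 2, length.out = 6)    # 1 1 2 2 1 1

## With both 'times' and 'length.out', 'times' is ignored:
rep(1:2, times = 5, length.out = 3)   # 1 2 1
```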
• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.
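For illustration (an example added here, not from the original text):

```r
## shQuote() wraps a string in quotes that are safe to pass to a shell:
shQuote("hello world")                    # "'hello world'"
cmd <- paste("echo", shQuote("don't panic"))   # embedded quote is escaped
```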
• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has 'fixed' and 'perl' arguments, and split="" is optimized.
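For illustration (an example added here, not from the original text):

```r
## fixed = TRUE treats the split pattern literally, not as a regexp:
strsplit("a.b.c", ".", fixed = TRUE)[[1]]   # "a" "b" "c"

## split = "" breaks a string into single characters:
strsplit("abc", "")[[1]]                    # "a" "b" "c"
```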
• subset() now allows a 'drop' argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument 'asp', although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().
• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)
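For illustration (an example added here, not from the original text):

```r
## addmargins() appends marginal summaries (sums by default) to a table:
tab <- table(gear = mtcars$gear, cyl = mtcars$cyl)
addmargins(tab)   # adds a "Sum" row and a "Sum" column
```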
• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.
• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport(), and seekViewport() to allow creation and navigation of a viewport tree (rather than just a viewport stack).

– Added 'id' and 'id.lengths' arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.

– Added vpList(), vpStack(), vpTree(), and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series).

current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of a viewport path in downViewport() and seekViewport().

See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

– Added 'just' argument to grid.layout() to allow justification of the layout relative to the parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).
– Allowed the 'vp' slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

    pushViewport(viewport(w=.5, h=.5, name="A"))
    grid.rect()
    pushViewport(viewport(w=.5, h=.5, name="B"))
    grid.rect(gp=gpar(col="grey"))
    upViewport(2)
    grid.rect(vp="A", gp=gpar(fill="red"))
    grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))
– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies).

This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend().

There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

* the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob "pointed to" is modified.
In addition, the following features are now possible with grobs:

* grobs now save()/load() like any normal R object.

* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot, which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

– Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth", and "grobheight" unit objects, respectively).
– Allowed viewports to turn off clipping altogether. Possible settings for the viewport 'clip' arg are now:

    "on"      clip to the viewport (was TRUE)
    "inherit" clip to whatever parent says (was FALSE)
    "off"     turn off clipping

Logical values are still accepted (and NA maps to "off").
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the 'Depends' fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option 'repositories' to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories, as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions', and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp(), and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.
• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(x, deriv=2) and psigamma(x, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.
• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blasgoto= to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN
by Kurt Hornik
New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination
follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.
NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences; title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.
assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check
given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object oriented implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S (EVIS) by Alexander McNeil, R port by Alec Stephenson.
fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues and with help from Kurt Hornik and Brian Ripley of the R core development team.
gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
kernlab Kernel-based machine learning methods including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between versus within chain approach of Gelman and Rubin. By Gregory R. Warnes.
mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.
race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.
R News ISSN 1609-3631
Vol 41 June 2004 45
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini
rpart.permutation Performs permutation tests of rpart models. By Daniel S Myers
rrcov Functions for robust location and scatter estimation and robust regression with high breakdown point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen, ported to R, adapted and packaged by Valentin Todorov
sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis
seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972-974. By David A Schoenfeld
setRNG Set reproducible random number generator in R and S. By Paul Gilbert
sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others
skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson
spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem and for spatial data that arise on irregularly shaped regions like counties or zipcodes by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford
spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth
spe Implements stochastic proximity embedding as described by Agrafiotis et al in PNAS 2002, 99, pg 15869 and J Comput Chem 2003, 24, pg 1215. By Rajarshi Guha
supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler
svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie
treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto
urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff
urn Functions for sampling without replacement (Simulated Urns). By Micah Altman
verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J Burns. Ported to R by Nick Efthymiou
zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck
Other changes
• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R News ISSN 1609-3631
Vol 41 June 2004 46
R Foundation News
by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)
• Emanuele De Rinaldis (Italy)
• Norm Phillips (Canada)
New supporting institutions
• Breast Center at Baylor College of Medicine, Houston, Texas, USA
• Dana-Farber Cancer Institute, Boston, USA
• Department of Biostatistics, Vanderbilt University School of Medicine, USA
• Department of Statistics & Actuarial Science, University of Iowa, USA
• Division of Biostatistics, University of California, Berkeley, USA
• Norwegian Institute of Marine Research, Bergen, Norway
• TERRA Lab, University of Regina - Department of Geography, Canada
• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K Hackenberger (Croatia)
O Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G Schimek (Austria)
R Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
The cox.zph function provides formal tests and graphical diagnostics for proportional hazards:
> cox.zph(mayomodel)
                  rho chisq      p
edtrt         -0.1602 3.411 0.0648
log(bili)      0.1507 2.696 0.1006
log(protime)  -0.1646 2.710 0.0997
age           -0.0708 0.542 0.4617
platelet      -0.0435 0.243 0.6221
GLOBAL             NA 9.850 0.0796

# graph for variable 1 (edtrt)
> plot(cox.zph(mayomodel)[1])
[Figure: smoothed estimate of Beta(t) for edtrt plotted against time]
Figure 3: Testing proportional hazards
There is a suggestion that edtrt and log(protime) may have non-proportional hazards, and Figure 3 confirms this for edtrt. The curve shows an estimate of β at each point in time, obtained by smoothing the residuals on the graph. It appears that edtrt is quite strongly predictive for the first couple of years but has little predictive power after that.
Additional features
Computation for the Cox model extends straightforwardly to situations where z can vary over time, and to situations where an individual can experience multiple events of the same or different types. Interpretation of the parameters becomes substantially trickier, of course. The coxph() function allows formulas to contain a cluster() term indicating which records belong to the same individual, and a strata() term indicating subgroups that get their own unspecified baseline hazard function.
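These two terms can be sketched on simulated data; the data frame and variable names below are invented for illustration and are not from the article.

```r
library(survival)   # recommended package, ships with R
## Invented illustration: 40 subjects in two centres
set.seed(1)
d <- data.frame(time   = rexp(40),
                event  = rbinom(40, 1, 0.7),
                trt    = rep(0:1, 20),
                id     = 1:40,
                centre = rep(c("A", "B"), each = 20))
## cluster(id) requests robust standard errors for correlated records;
## strata(centre) gives each centre its own unspecified baseline hazard
fit <- coxph(Surv(time, event) ~ trt + cluster(id) + strata(centre),
             data = d)
```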
Complementing coxph is survreg, which fits linear models for the mean of survival time or its logarithm, with various parametric error distributions. The parametric models allow more flexible forms of censoring than does the Cox model.

More recent additions include penalised likelihood estimation for smoothing splines and for random effects models.

The survival package also comes with standard mortality tables for the US and a few individual states, together with functions for comparing the survival of a sample to that of a population and computing person-years of followup.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
useR 2004
The R User Conference
by John Fox
McMaster University, Canada
jfox@mcmaster.ca
More than 200 R users and developers converged on the Technische Universität Wien for the first R users' conference, useR 2004, which took place in Vienna between May 20 and May 22. The conference was organized by the Austrian Association for Statistical Computing (AASC) and sponsored by the R Foundation for Statistical Computing, the Austrian Science Foundation (FWF), and MedAnalytics (http://www.medanalytics.com/). Torsten Hothorn, Achim Zeileis and David Meyer served as chairs of the conference program committee and local organizing committee; Bettina Grün was in charge of local arrangements.

The conference program included nearly 100 presentations, many in keynote, plenary and semi-plenary sessions. The diversity of presentations, from bioinformatics to user interfaces, and finance to fights among lizards, reflects the wide range of applications supported by R.

A particularly interesting component of the program was the series of keynote addresses by members of the R core team, aimed primarily at improving programming skills and describing key features of R along with new and imminent developments. These talks reflected a major theme of the conference: the close relationship in R between use and development. The keynote addresses included:
• Paul Murrell on grid graphics and programming
• Martin Mächler on good programming practice
• Luke Tierney on namespaces and byte compilation
• Kurt Hornik on packaging, documentation and testing
• Friedrich Leisch on S4 classes and methods
• Peter Dalgaard on language interfaces, using .Call and .External
• Douglas Bates on multilevel models in R
• Brian D Ripley on data mining and related topics
Slides for the keynote addresses, along with abstracts of other papers presented at useR 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html.
An innovation of useR 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.

The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19 and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.

Everyone with whom I spoke, both at and after useR 2004, was very enthusiastic about the conference. We look forward to an even larger useR in 2006.
R Help Desk
Date and Time Classes in R
by Gabor Grothendieck and Thomas Petzoldt
Introduction
R-1.9.0 and its contributed packages have a number of date/time (i.e. date or date/time) classes. In particular, the following three classes are discussed in this article:

• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated, but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).

• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Date/times in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).
• POSIX classes. POSIX classes refer to the two classes POSIXct, POSIXlt and their common super class POSIXt. These support times and dates including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases and other applications that involve external communication with systems that also support dates, times, time zones and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct date/times are represented as seconds since January 1, 1970 GMT, while POSIXlt date/times are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX date/time objects using unclass(x), where x is of any POSIX date/time class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
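As a quick sketch of the three internal representations (the date and values below are our own, not from the article):

```r
d <- as.Date("2004-06-15")
unclass(d)                                  # 12584 days since 1970-01-01
dp <- as.POSIXct("2004-06-15", tz = "GMT")
unclass(dp)                                 # 12584 * 24*60*60 seconds since 1970 GMT
if (require(chron)) {                       # chron is a contributed package
  dc <- chron("06/15/2004")
  unclass(dc)                               # also days since 1970-01-01
}
```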
When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.

A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three date/time systems above.

Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent date/times as years plus fraction of a year. The date and pastecs packages are not discussed in this article.
Other Applications
Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent date/times as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word usually was employed above.

SPSS uses October 14, 1582 as the origin, thereby representing date/times as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such date/times automatically (Alzola and Harrell, 2002).
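For example, a SAS-style date/time, stored as seconds since the 1960 origin, can be converted by supplying that origin explicitly (the numeric value below is invented for illustration):

```r
## Seconds since 1960-01-01 00:00:00 GMT, as a SAS datetime would store it
sas_seconds <- 365 * 24 * 60 * 60
as.POSIXct(sas_seconds, origin = "1960-01-01", tz = "GMT")
# 1960 was a leap year, so 365 days lands on 1960-12-31, not 1961-01-01
```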
Time Series
A common use of dates and date/times is in time series. The ts class represents regular time series (i.e. equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its) and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent date/times; zoo can use any date or date/time package to represent date/times.
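A minimal sketch of an irregular series in zoo (the values and dates are invented; zoo is a contributed package and must be installed first):

```r
if (require(zoo)) {
  z <- zoo(c(1.2, 0.8, 1.5),
           as.Date(c("2004-01-05", "2004-01-07", "2004-01-12")))
  z     # observations indexed by Date, unequally spaced
}
```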
Input
By default, read.table will read in numeric data, such as 20040115, as numbers and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹
# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y-%m-%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is = in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)
Avoiding Errors
The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.

With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:
# default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))
# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)
# function to map 2 digit year to 4 digits
options(chron.year.expand = year.expand)
# if TRUE then year not displayed
options(chron.simplify = FALSE)
For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example:
¹This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.
orig <- chron("01/01/60")
x <- 0:9          # days since 01/01/60
# chron dates with default origin
orig + x
Regarding POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:
dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)
This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the date/times as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.
• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with date/times. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.
• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt date/times have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Date/times which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.

• conversion. Conversion between different date/time classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
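A small check along these lines (our own sketch, not from the article): converting through character with an explicit tz= pins down what clock time a conversion sees.

```r
dp <- as.POSIXct("2004-06-15 12:00:00", tz = "GMT")
format(dp, tz = "GMT")        # "2004-06-15 12:00:00"
## Round-tripping via character keeps the intended clock time
as.POSIXlt(format(dp, tz = "GMT"), tz = "GMT")$hour    # 12
```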
Comparison Table
Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes, and to translate commands among the classes.
Bibliography
Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.

James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.

Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1(2):8-11. URL http://CRAN.R-project.org/doc/Rnews/.
Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de
Table 1: Comparison Table

now
    Date:     Sys.Date()
    chron:    as.chron(as.POSIXct(format(Sys.time()), tz = "GMT"))
    POSIXct:  Sys.time()

origin
    Date:     structure(0, class = "Date")
    chron:    chron(0)
    POSIXct:  structure(0, class = c("POSIXt", "POSIXct"))

x days since origin (GMT days)
    Date:     structure(floor(x), class = "Date")
    chron:    chron(x)
    POSIXct:  structure(x*24*60*60, class = c("POSIXt", "POSIXct"))

diff in days
    Date:     dt1 - dt2
    chron:    dc1 - dc2
    POSIXct:  difftime(dp1, dp2, unit = "day")   (note 1)

time zone difference
    POSIXct:  dp - as.POSIXct(format(dp), tz = "GMT")

compare
    Date:     dt1 > dt2
    chron:    dc1 > dc2
    POSIXct:  dp1 > dp2

next day
    Date:     dt + 1
    chron:    dc + 1
    POSIXct:  seq(dp0, length = 2, by = "DSTday")[2]   (note 2)

previous day
    Date:     dt - 1
    chron:    dc - 1
    POSIXct:  seq(dp0, length = 2, by = "-1 DSTday")[2]   (note 3)

x days since date
    Date:     dt0 + floor(x)
    chron:    dc0 + x
    POSIXct:  seq(dp0, length = 2, by = paste(x, "DSTday"))[2]   (note 4)

sequence
    Date:     dts <- seq(dt0, length = 10, by = "day")
    chron:    dcs <- seq(dc0, length = 10)
    POSIXct:  dps <- seq(dp0, length = 10, by = "DSTday")

plot
    Date:     plot(dts, rnorm(10))
    chron:    plot(dcs, rnorm(10))
    POSIXct:  plot(as.POSIXct(format(dps), tz = "GMT"), rnorm(10))   (note 5)

every 2nd week
    Date:     seq(dt0, length = 3, by = "2 week")
    chron:    dc0 + seq(0, length = 3, by = 14)
    POSIXct:  seq(dp0, length = 3, by = "2 week")

first day of month
    Date:     as.Date(format(dt, "%Y-%m-01"))
    chron:    chron(dc) - month.day.year(dc)$day + 1
    POSIXct:  as.POSIXct(format(dp, "%Y-%m-01"))

month (which sorts)
    Date:     as.numeric(format(dt, "%m"))
    chron:    months(dc)
    POSIXct:  as.numeric(format(dp, "%m"))

day of week (Sun = 0)
    Date:     as.numeric(format(dt, "%w"))
    chron:    as.numeric(dates(dc) - 3) %% 7
    POSIXct:  as.numeric(format(dp, "%w"))

day of year
    Date:     as.numeric(format(dt, "%j"))
    chron:    as.numeric(format(as.Date(dc), "%j"))
    POSIXct:  as.numeric(format(dp, "%j"))

mean for each month/year
    Date:     tapply(x, format(dt, "%Y-%m"), mean)
    chron:    tapply(x, format(as.Date(dc), "%Y-%m"), mean)
    POSIXct:  tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
    Date:     tapply(x, format(dt, "%m"), mean)
    chron:    tapply(x, months(dc), mean)
    POSIXct:  tapply(x, format(dp, "%m"), mean)

Output: yyyy-mm-dd
    Date:     format(dt)
    chron:    format(as.Date(dates(dc)))
    POSIXct:  format(dp, "%Y-%m-%d")

Output: mm/dd/yy
    Date:     format(dt, "%m/%d/%y")
    chron:    format(dates(dc))
    POSIXct:  format(dp, "%m/%d/%y")

Output: Sat Jul 1 1970
    Date:     format(dt, "%a %b %d %Y")
    chron:    format(as.Date(dates(dc)), "%a %b %d %Y")
    POSIXct:  format(dp, "%a %b %d %Y")

Input: "1970-10-15"
    Date:     as.Date(z)
    chron:    chron(z, format = "y-m-d")
    POSIXct:  as.POSIXct(z)

Input: "10/15/1970"
    Date:     as.Date(z, "%m/%d/%Y")
    chron:    chron(z)
    POSIXct:  as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz = "")
    to Date:     as.Date(dates(dc));  as.Date(format(dp))
    to chron:    chron(unclass(dt));  chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S"))   (note 6)
    to POSIXct:  as.POSIXct(format(dt));  as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz = "GMT")
    to Date:     as.Date(dates(dc));  as.Date(dp)
    to chron:    chron(unclass(dt));  as.chron(dp)
    to POSIXct:  as.POSIXct(format(dt), tz = "GMT");  as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See strptime for % codes.

1. Can be reduced to dp1 - dp2 if it is acceptable that date/times closer than one day are returned in hours or other units rather than days.
2. Can be reduced to dp + 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
3. Can be reduced to dp - 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
4. Can be reduced to dp + 24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
5. Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT date/times rather than current date/times suffices.
6. An equivalent alternative is as.chron(as.POSIXct(format(dp), tz = "GMT")).
Programmers' Niche
A simple class in S3 and S4
by Thomas Lumley
ROC curves
Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±∞ and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.
drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type = "l")
}
It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.
drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  plot(mspec, sens, type = "l")
}
One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
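The monotonicity is easy to check on toy data (our own sketch, not from the article):

```r
set.seed(42)
T <- c(rnorm(50, mean = 1), rnorm(50))   # cases tend to score higher
D <- rep(1:0, each = 50)                 # disease indicator
DD <- table(-T, D)
sens  <- cumsum(DD[, "1"]) / sum(DD[, "1"])
mspec <- cumsum(DD[, "0"]) / sum(DD[, "0"])
stopifnot(!is.unsorted(sens), !is.unsorted(mspec))   # both non-decreasing
```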
Classes and methods
Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}
The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
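A toy sketch of that convention (the names here are invented):

```r
quack <- function(x, ...) UseMethod("quack")
quack.duck <- function(x, ...) "quack!"
d <- structure(list(), class = "duck")
quack(d)    # dispatches to quack.duck
```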
Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.
plot.ROC <- function(x, type = "b", null.line = TRUE,
                     xlab = "1-Specificity",
                     ylab = "Sensitivity",
                     main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
       xlab = xlab, ylab = ylab, ...)
  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}
A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:
lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}
A method for identify allows the cutpoint to be found for any point on the curve:
identify.ROC <- function(x, labels = NULL, ...,
                         digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}
An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
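One possible solution (our own sketch, not the author's): since sens and mspec trace the curve, the trapezoidal rule gives the area, with padding by 0 and 1 to close the curve at its endpoints.

```r
## Assumes an object with $sens and $mspec as constructed by ROC() above
AUC <- function(x) {
  mspec <- c(0, x$mspec, 1)
  sens  <- c(0, x$sens, 1)
  sum(diff(mspec) * (head(sens, -1) + tail(sens, -1)) / 2)
}
```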
Generalising the class
The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.
ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.
ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[, 2])/sum(DD[, 2])
  mspec <- cumsum(DD[, 1])/sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
S4 classes and methods
In the new S4 class system provided by the meth-ods package classes and methods must be regis-tered This ensures that an object has the componentsrequired by its class and avoids the ambiguities ofthe S3 system For example it is not possible to tellfrom the name that ttestformula is a method forttest while tdataframe is a method for t (andttestcluster in the Design package is neither)
As a price for this additional clarity the S4 sys-tem takes a little more planning and can be clumsywhen a class needs to have components that are onlysometimes present
The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions.
setClass("ROC",
  representation(sens="numeric", mspec="numeric",
                 test="numeric", call="call"),
  validity=function(object) {
      length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })
The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
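To see the validity check at work (an illustrative snippet of my own, not from the article), an attempt to create an object whose slot lengths disagree is rejected when the object is created:

```r
## Hypothetical check: new() runs the validity function, so mismatched
## slot lengths (test has 3 values, sens and mspec have 2) raise an error.
bad <- try(new("ROC", sens=c(.5, 1), mspec=c(.5, 1),
               test=c(1, 2, 3), call=quote(ROC())), silent=TRUE)
inherits(bad, "try-error")
```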
In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present, and runs any validity checks. Here is the translation of the ROC function.

R News ISSN 1609-3631, Vol. 4/1, June 2004
ROC <- function(T, D)
{
    TT <- rev(sort(unique(T)))
    DD <- table(-T, D)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    new("ROC", sens=sens, mspec=mspec,
        test=TT, call=sys.call())
}
Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code.
setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })
The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.
The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.
setMethod("plot", c("ROC", "missing"),
  function(x, y, type="b", nullline=TRUE,
           xlab="1-Specificity",
           ylab="Sensitivity",
           main=NULL, ...) {
    par(pty="s")
    plot(x@mspec, x@sens, type=type,
         xlab=xlab, ylab=ylab, ...)
    if (nullline)
        abline(0, 1, lty=3)
    if (is.null(main))
        main <- x@call
    title(main=main)
  })
Creating a method for lines appears to work the same way:
setMethod("lines", "ROC",
  function(x, ...)
    lines(x@mspec, x@sens, ...)
)
In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.
setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T, D) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
        stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type="response"))
    disease <- disease*weights(T)
    if (max(abs(disease %% 1)) > 0.01)
        warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    new("ROC", sens=sens, mspec=mspec,
        test=TT, call=sys.call())
  })
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.
setGeneric("identify", signature=c("x"))

setMethod("identify", "ROC",
  function(x, labels=NULL,
           ..., digits=1) {
    if (is.null(labels))
        labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels=labels, ...)
  })
Discussion
Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.
This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R
by the R Core Team
New features in 1.9.1
• as.Date() now has a method for "POSIXlt" objects.
• mean() has a method for "difftime" objects and so summary() works for such objects.
• legend() has a new argument pt.cex.
• plot.ts() has more arguments, particularly yax.flip.
• heatmap() has a new keep.dendro argument.
• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)
• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.
User-visible changes in 1.9.0
• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.
• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.
  Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.
  Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.
  Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook; see ?setHook.
• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.
• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.
New features in 1.9.0
• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.
• There are now print() and [ methods for "acf" objects.
• aov() will now handle singular Error() models with a warning.
• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).
• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.
• as.data.frame() now has a method for arrays.
• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.
• New function assignInNamespace(), parallelling fixInNamespace().
• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.
• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).
• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.
• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.
• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)
• Deprecated() has a new argument package, which is used in the warning message for non-base packages.
• The print() method for "difftime" objects now handles arrays.
• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.
• dist() has a new method to calculate Minkowski distances.
• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.
• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.
• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).
• findInterval(x, v) now allows +-Inf values, and NAs in x.
• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.
• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.
• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.
  The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.
• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)
• legend() has a new argument 'text.col'.
• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).
• A new function mget() will retrieve multiple values from an environment.
• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".
• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.
• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.
• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)
  It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.
• ppr() now fully supports categorical explanatory variables.
  ppr() is now interruptible at suitable places in the underlying FORTRAN code.
• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.
• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.
• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.
• read.table() now allows embedded newlines in quoted fields. (PR#4555)
• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.
  If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.
  If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)
  The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).
• rgb2hsv() is new, an R interface to the C API function with the same name.
• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.
• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.
• New function shQuote() to quote strings to be passed to OS shells.
• sink() now has a split= argument to direct output to both the sink and the current output connection.
• split.screen() now works for multiple devices at once.
• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.
• strsplit() now has fixed and perl arguments, and split="" is optimized.
• subset() now allows a drop argument, which is passed on to the indexing method for data frames.
• termplot() has an option to smooth the partial residuals.
• varimax() and promax() add class "loadings" to their loadings component.
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)
• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.
• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".
• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).
• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.
• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().
• Package stats4 exports S4 generics for AIC() and BIC().
• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.
• Added experimental support for conditionals in NAMESPACE files.
• Added as.list.environment to coerce environments to lists (efficiently).
• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)
• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.
• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.
• Changes to package 'grid':
  – Renamed push/pop.viewport() to push/popViewport().
  – Added upViewport(), downViewport(), and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).
  – Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.
  – Added vpList(), vpStack(), vpTree(), and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.
  – Added vpPath() to allow specification of viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use.
    NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).
  – Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).
  – Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:
      pushViewport(viewport(w=.5, h=.5, name="A"))
      grid.rect()
      pushViewport(viewport(w=.5, h=.5, name="B"))
      grid.rect(gp=gpar(col="grey"))
      upViewport(2)
      grid.rect(vp="A", gp=gpar(fill="red"))
      grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))
  – Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html
  – Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:
      * grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).
      * grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)
      * the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.
    In addition, the following features are now possible with grobs:
      * grobs now save()/load() like any normal R object.
      * many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).
      * there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.
      * there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.
      * there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.
  – Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth", and "grobheight" unit objects, respectively).
  – Allowed viewports to turn off clipping altogether. Possible settings for viewport clip arg are now:
      "on" = clip to the viewport (was TRUE)
      "inherit" = clip to whatever parent says (was FALSE)
      "off" = turn off clipping
    Still accept logical values (and NA maps to "off").
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.
• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.
• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.
• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file, when testing an installed package.
• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.
• Vignette dependencies can now be checked and obtained via vignetteDepends.
• Option 'repositories' to list URLs for package repositories added.
• package.description() has been replaced by packageDescription().
• R CMD INSTALL/build now skip Subversion's .svn directories, as well as CVS directories.
• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector), rather than assuming these are passed in.
• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.
• hsv2rgb and rgb2hsv are newly in the C API.
• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.
• The type used for vector lengths is now R_len_t, rather than int, to allow for a future change.
• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.
• print.coefmat() is defunct, replaced by printCoefmat().
• codes() and codes<-() are defunct.
• anovalist.lm (replaced in 1.2.0) is now defunct.
• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.
• print.atomic() is defunct.
• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).
• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.
• La.eigen() is deprecated; now eigen() uses LAPACK by default.
• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(*, deriv=2) and psigamma(*, deriv=3).
• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.
• package.contents() and package.description() have been deprecated.
• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.
  The included PCRE sources have been updated to version 4.5, and PCRE >= 4.0 is now required if --with-pcre is used.
  The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.
• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.
• --with-blasgoto= to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN
by Kurt Hornik

New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.
DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian multinomial probit models via Markov chain Monte Carlo. Along with the standard multinomial probit model, it can also fit models with different choice sets for each observation, and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaged by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.
debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. Energy-statistics concept based on a generalization of Newton's potential energy is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. S original (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others; see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing hua Zhao, in collaboration with other colleagues and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
R News ISSN 1609-3631
Vol 41 June 2004 44
kernlab Kernel-based machine learning methods including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik
klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges
knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey
kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close with contributions from Igor Zurbenko
mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa
mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. By Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou
mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between versus within chain approach of Gelman and Rubin. By Gregory R. Warnes
mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner
mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski
multinomRob overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon
mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by debug package. By Mark V. Bravington
mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath
pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger
qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca
race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates are performed in parallel on different hosts. By Mauro Birattari
rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman)
ref small package with functions for creating references, reading from and writing to references, and a memory efficient refdata type that transparently encapsulates matrices and data.frames. By Jens Oehlschlägel
regress Functions to fit Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh
rgl 3D visualization device (OpenGL). By Daniel Adler
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini
rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers
rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen, ported to R, adapted and packaged by Valentin Todorov
sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis
seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972-974. By David A. Schoenfeld
setRNG Set reproducible random number generator in R and S. By Paul Gilbert
sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others
skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson
spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford
spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth
spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg 15869, and J. Comput. Chem., 2003, 24, pg 1215. By Rajarshi Guha
supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler
svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie
treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto
urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff
urn Functions for sampling without replacement (Simulated Urns). By Micah Altman
verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. By S Original by Patrick J. Burns. Ported to R by Nick Efthymiou
zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck
Other changes
• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News
by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)
• Emanuele De Rinaldis (Italy)
• Norm Phillips (Canada)
New supporting institutions
• Breast Center at Baylor College of Medicine, Houston, Texas, USA
• Dana-Farber Cancer Institute, Boston, USA
• Department of Biostatistics, Vanderbilt University School of Medicine, USA
• Department of Statistics & Actuarial Science, University of Iowa, USA
• Division of Biostatistics, University of California, Berkeley, USA
• Norwegian Institute of Marine Research, Bergen, Norway
• TERRA Lab, University of Regina - Department of Geography, Canada
• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232
USA

Editorial Board: Douglas Bates and Paul Murrell

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage: http://www.R-project.org/

This newsletter is available online at http://CRAN.R-project.org/doc/Rnews/
• Luke Tierney on namespaces and byte compilation
• Kurt Hornik on packaging, documentation and testing
• Friedrich Leisch on S4 classes and methods
• Peter Dalgaard on language interfaces using .Call and .External
• Douglas Bates on multilevel models in R
• Brian D. Ripley on datamining and related topics
Slides for the keynote addresses, along with abstracts of other papers presented at useR 2004, are available at the conference web site, http://www.ci.tuwien.ac.at/Conferences/useR-2004/program.html
An innovation of useR 2004 was the deployment of "island" sessions, in place of the more traditional poster sessions, in which presenters with common interests made brief presentations followed by general discussion and demonstration. Several island sessions were conducted in parallel in the late afternoon on May 20 and 21, and conference participants were able to circulate among these sessions.
The conference sessions were augmented by a lively social program, including a pre-conference reception on the evening of May 19, and a conference trip, in the afternoon of May 22, to the picturesque Wachau valley on the Danube river. The trip was punctuated by a stop at the historic baroque Melk Abbey, and culminated in a memorable dinner at a traditional 'Heuriger' restaurant. Of course, much lively discussion took place at informal lunches, dinners, and pub sessions, and some of us took advantage of spring in Vienna to transform ourselves into tourists for a few days before and after the conference.
Everyone with whom I spoke, both at and after useR 2004, was very enthusiastic about the conference. We look forward to an even larger useR in 2006.
R Help Desk
Date and Time Classes in R
Gabor Grothendieck and Thomas Petzoldt
Introduction
R-1.9.0 and its contributed packages have a number of datetime (i.e. date or date/time) classes. In particular, the following three classes are discussed in this article:
• Date. This is the newest R date class, just introduced into R-1.9.0. It supports dates without times. Eliminating times simplifies dates substantially, since not only are times themselves eliminated but the potential complications of time zones and daylight savings time vs. standard time need not be considered either. Date has an interface similar to the POSIX classes discussed below, making it easy to move between them. It is part of the R base package. Dates in the Date class are represented internally as days since January 1, 1970. More information on Date can be found in ?Dates. The percent codes used in character conversion with Date class objects can be found in ?strptime (although strptime itself is not a Date method).
• chron. Package chron provides dates and times. There are no time zones or notion of daylight vs. standard time in chron, which makes it simpler to use for most purposes than date/time packages that do employ time zones. It is a contributed package available on CRAN, so it must be installed and loaded prior to use. Datetimes in the chron package are represented internally as days since January 1, 1970, with times represented by fractions of days. Thus 1.5 would be the internal representation of noon on January 2nd, 1970. The chron package was developed at Bell Labs and is discussed in James and Pregibon (1993).
• POSIX classes. POSIX classes refer to the two classes POSIXct, POSIXlt and their common super class POSIXt. These support times and dates including time zones and standard vs. daylight savings time. They are primarily useful for time stamps on operating system files, data bases and other applications that involve external communication with systems that also support dates, times, time zones and daylight/standard times. They are also useful for certain applications, such as world currency markets, where time zones are important. We shall refer to these classes collectively as the POSIX classes. POSIXct datetimes are represented as seconds since January 1, 1970, GMT, while POSIXlt datetimes are represented by a list of 9 components plus an optional tzone attribute. POSIXt is a common superclass of
the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron, and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
A full page table is provided at the end of this ar-ticle which illustrates and compares idioms writtenin each of the three datetime systems above
Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.
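The internal representations described above can be checked directly with unclass(); a minimal sketch (the chron line assumes the chron package is installed):

```r
# Date: days since January 1, 1970
unclass(as.Date("1970-01-11"))                        # 10

# chron: days plus fraction of a day
library(chron)
unclass(chron("01/02/70", "12:00:00"))                # 1.5 (noon, Jan 2nd 1970)

# POSIXct: seconds since January 1, 1970 GMT
unclass(as.POSIXct("1970-01-01 00:01:00", tz="GMT"))  # 60
```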
Other Applications
Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word usually was employed above.
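The Windows origin conversion can be sketched as follows (the serial numbers here are invented for illustration; the chron line assumes the chron package is installed):

```r
library(chron)

# hypothetical Excel (Windows) serial numbers: midnight and noon of one day
x <- c(38352.0, 38352.5)

# Date class keeps only the day part
as.Date("1899-12-30") + floor(x)   # "2004-12-31" "2004-12-31"

# chron retains the time as a fraction of a day
chron("12/30/1899") + x
```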
SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).
Time Series
A common use of dates and datetimes are in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its) and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or timedate package to represent datetimes.
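As an illustration of the last point, a zoo series can be indexed directly by Date class dates (a minimal sketch; assumes the zoo package is installed, and the dates and values are invented for the example):

```r
library(zoo)

# three irregularly spaced observations indexed by Date
d <- as.Date(c("2004-01-05", "2004-01-07", "2004-01-12"))
z <- zoo(c(1.2, 0.9, 1.5), d)

z          # printed with its date index
time(z)    # recover the Date index
```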
Input
By default, read.table will read in numeric data, such as 20040115, as numbers and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹
# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y%m%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)
Avoiding Errors
The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.
With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:
# default origin
options(chron.origin = c(month=1, day=1, year=1970))
# if TRUE abbreviates year to 2 digits
options(chron.year.abb = TRUE)
# function to map 2 digit year to 4 digits
options(chron.year.expand = "year.expand")
# if TRUE then year not displayed
options(chron.simplify = FALSE)
For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,
¹ This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/
orig <- chron("01/01/60")
x <- 0:9  # days since 01/01/60
# chron dates with default origin
orig + x
Regarding POSIX classes, the user should be aware of the following:
• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:
dp <- seq(Sys.time(), len=10, by="day")
plot(dp, 1:10)
This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.
• OS. Some of the date calculations done by POSIX functions are passed off to the operating system and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.
• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.
• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
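The difference between the direct and the via-character routes can be sketched as follows (the time used here is invented for illustration; the direct result depends on the local time zone):

```r
dp <- as.POSIXct("1970-01-01 23:00:00")  # local time

# direct conversion treats dp as GMT, so the date may shift
as.Date(dp)

# conversion via character keeps the local representation
as.Date(format(dp))   # "1970-01-01"
```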
Comparison Table
Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for a given task in each of the three classes and to translate commands among the classes.
Bibliography
Alzola, C. F. and Harrell, F. E. (2002): An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf
James, D. A. and Pregibon, D. (1993): Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf
Ripley, B. D. and Hornik, K. (2001): Date-time classes. R News, 1 (2), 8-11. URL http://CRAN.R-project.org/doc/Rnews/
Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de
now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class="Date")
  chron:   chron(0)
  POSIXct: structure(0, class=c("POSIXt","POSIXct"))

x days since origin
  Date:    structure(floor(x), class="Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class=c("POSIXt","POSIXct"))  (GMT days)

diff in days
  Date:    dt1-dt2
  chron:   dc1-dc2
  POSIXct: difftime(dp1, dp2, unit="day") [1]

time zone difference
  POSIXct: dp - as.POSIXct(format(dp, tz="GMT"))

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt+1
  chron:   dc+1
  POSIXct: seq(dp0, length=2, by="DSTday")[2] [2]

previous day
  Date:    dt-1
  chron:   dc-1
  POSIXct: seq(dp0, length=2, by="-1 DSTday")[2] [3]

x days since date
  Date:    dt0+floor(x)
  chron:   dc0+x
  POSIXct: seq(dp0, length=2, by=paste(x, "DSTday"))[2] [4]

sequence
  Date:    dts <- seq(dt0, length=10, by="day")
  chron:   dcs <- seq(dc0, length=10)
  POSIXct: dps <- seq(dp0, length=10, by="DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz="GMT"), rnorm(10)) [5]

every 2nd week
  Date:    seq(dt0, length=3, by="2 week")
  chron:   dc0 + seq(0, length=3, by=14)
  POSIXct: seq(dp0, length=3, by="2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc) - month.day.year(dc)$day + 1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun=0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc)-3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output: yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output: mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output: Sat Jul 1 1970
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input: "1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format="y-m-d")
  POSIXct: as.POSIXct(z)

Input: "10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz="")
  to Date:    as.Date(dates(dc)) ; as.Date(format(dp))
  to chron:   chron(unclass(dt)) ; chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]
  to POSIXct: as.POSIXct(format(dt)) ; as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz="GMT")
  to Date:    as.Date(dates(dc)) ; as.Date(dp)
  to chron:   chron(unclass(dt)) ; as.chron(dp)
  to POSIXct: as.POSIXct(format(dt), tz="GMT") ; as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See strptime for % codes.
[1] Can be reduced to dp1-dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp+24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp-24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp+24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz="GMT")).

Table 1: Comparison table
Programmers' Niche
A simple class in S3 and S4
by Thomas Lumley
ROC curves
Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.
The ROC curve plots the true positive rate (sensitivity), Pr(T > c|D = 1), against the false positive rate (1-specificity), Pr(T > c|D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).
I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.
Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±Inf and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.
drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type="l")
}
It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.
There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.
drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  plot(mspec, sens, type="l")
}
One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
Classes and methods
Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.
The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec, test=TT,
               call=sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}
The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.
plot.ROC <- function(x, type="b", null.line=TRUE,
                     xlab="1-Specificity", ylab="Sensitivity",
                     main=NULL, ...) {
    par(pty="s")
    plot(x$mspec, x$sens, type=type,
         xlab=xlab, ylab=ylab, ...)
    if (null.line)
        abline(0, 1, lty=3)
    if (is.null(main))
        main <- x$call
    title(main=main)
}

R News ISSN 1609-3631
Vol. 4/1, June 2004
A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:
lines.ROC <- function(x, ...) {
    lines(x$mspec, x$sens, ...)
}
A method for identify allows the cutpoint to be found for any point on the curve:
identify.ROC <- function(x, labels=NULL, ..., digits=1) {
    if (is.null(labels))
        labels <- round(x$test, digits)
    identify(x$mspec, x$sens, labels=labels, ...)
}
An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
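For concreteness, here is one possible solution (a hypothetical sketch, not from the article): trapezoidal integration of the curve stored in the S3 object. The ROC constructor from the text is restated so the sketch is self-contained.

```r
## Constructor from the text, restated for a self-contained example.
ROC <- function(T, D) {
    TT <- rev(sort(unique(T)))
    DD <- table(-T, D)
    sens <- cumsum(DD[, 2]) / sum(DD[, 2])
    mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
    rval <- list(sens = sens, mspec = mspec, test = TT,
                 call = sys.call())
    class(rval) <- "ROC"
    rval
}

## Hypothetical AUC: trapezoidal area under the (mspec, sens) curve.
AUC <- function(x) {
    fpr <- c(0, x$mspec)   # prepend the (0, 0) corner of the curve
    tpr <- c(0, x$sens)
    sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}

## A perfectly separating test should give an area of 1:
a <- AUC(ROC(c(1, 2, 3, 4), c(0, 0, 1, 1)))
```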
Generalising the class
The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function:
ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
    TT <- rev(sort(unique(T)))
    DD <- table(-T, D)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    rval <- list(sens=sens, mspec=mspec,
                 test=TT, call=sys.call())
    class(rval) <- "ROC"
    rval
}
The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable:
ROC.glm <- function(T, ...) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
        stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type="response"))
    disease <- disease*weights(T)
    if (max(abs(disease %% 1)) > 0.01)
        warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    rval <- list(sens=sens, mspec=mspec,
                 test=TT, call=sys.call())
    class(rval) <- "ROC"
    rval
}
S4 classes and methods
In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:
setClass("ROC",
         representation(sens="numeric", mspec="numeric",
                        test="numeric", call="call"),
         validity=function(object) {
             length(object@sens) == length(object@mspec) &&
             length(object@sens) == length(object@test)
         })
The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:
ROC <- function(T, D) {
    TT <- rev(sort(unique(T)))
    DD <- table(-T, D)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    new("ROC", sens=sens, mspec=mspec,
        test=TT, call=sys.call())
}
Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:
setMethod("show", "ROC",
          function(object) {
              cat("ROC curve: ")
              print(object@call)
          })
The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC","missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.
setMethod("plot", c("ROC", "missing"),
          function(x, y, type="b", null.line=TRUE,
                   xlab="1-Specificity",
                   ylab="Sensitivity",
                   main=NULL, ...) {
              par(pty="s")
              plot(x@mspec, x@sens, type=type,
                   xlab=xlab, ylab=ylab, ...)
              if (null.line)
                  abline(0, 1, lty=3)
              if (is.null(main))
                  main <- x@call
              title(main=main)
          })
Creating a method for lines appears to work the same way:
setMethod("lines", "ROC",
          function(x, ...)
              lines(x@mspec, x@sens, ...)
          )
In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object:
setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
          function(T) {
              if (!(T$family$family %in%
                    c("binomial", "quasibinomial")))
                  stop("ROC curves for binomial glms only")
              test <- fitted(T)
              disease <- (test + resid(T, type="response"))
              disease <- disease*weights(T)
              if (max(abs(disease %% 1)) > 0.01)
                  warning("Y values suspiciously far from integers")
              TT <- rev(sort(unique(test)))
              DD <- table(-test, disease)
              sens <- cumsum(DD[,2])/sum(DD[,2])
              mspec <- cumsum(DD[,1])/sum(DD[,1])
              new("ROC", sens=sens, mspec=mspec,
                  test=TT, call=sys.call())
          })
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method:
setGeneric("identify", signature=c("x"))

setMethod("identify", "ROC",
          function(x, labels=NULL, ..., digits=1) {
              if (is.null(labels))
                  labels <- round(x@test, digits)
              identify(x@mspec, x@sens,
                       labels=labels, ...)
          })
Discussion
Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R

by the R Core Team
New features in 1.9.1
• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects, and so summary() works for such objects.

• legend() has a new argument 'pt.cex'.

• plot.ts() has more arguments, particularly 'yax.flip'.

• heatmap() has a new 'keep.dendro' argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.
User-visible changes in 1.9.0
• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

  Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

  Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

  Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook -- see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.
New features in 1.9.0
• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.
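As a brief illustration of these accessors (the variable names here are invented for the example, not from the NEWS item):

```r
## Hypothetical example of the environment accessors described above.
e <- new.env()
e$x <- 1          # like assign("x", 1, envir = e)
e[["y"]] <- 2
v <- e$x          # like get("x", envir = e, inherits = FALSE)
```

Note that, as for lists, `$` on an environment returns NULL for a missing name rather than signalling an error, but no partial matching of names is done.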
• There are now print() and '[' methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).
• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument 'package' which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial() (defined as gamma(x+1)) and, for S-PLUS compatibility, lfactorial() (defined as lgamma(x+1)).

• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments 'etastart' and 'mustart' are now evaluated via the model frame in the same way as 'subset' and 'weights'.
• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

  The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a 'model' argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

  It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

  ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both 'x' and 'covmat' are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

  If both 'each' and 'length.out' have been specified, it now recycles rather than fills with NAs, for S compatibility.

  If both 'times' and 'length.out' have been specified, 'times' is now ignored, for S compatibility. (Previously padding with NAs was used.)

  The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc), strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has 'fixed' and 'perl' arguments, and split="" is optimized.

• subset() now allows a 'drop' argument, which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends(), getDepList() and installFoundDepends(). These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument 'asp', although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment() to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g., row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.
• Changes to package 'grid':

  – Renamed push/pop.viewport() to push/popViewport().

  – Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

  – Added 'id' and 'id.lengths' arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.

  – Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

  – Added vpPath() to allow specification of a viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use.

    NOTE: it is also possible to specify a path directly, e.g., something like "vp1::vp2", but this is only advised for interactive use (in case I decide to change the separator :: in later versions).

  – Added 'just' argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).
  – Allowed the 'vp' slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

      pushViewport(viewport(w=.5, h=.5, name="A"))
      grid.rect()
      pushViewport(viewport(w=.5, h=.5, name="B"))
      grid.rect(gp=gpar(col="grey"))
      upViewport(2)
      grid.rect(vp="A",
                gp=gpar(fill="red"))
      grid.rect(vp=vpPath("A", "B"),
                gp=gpar(fill="blue"))
  – Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

  – Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

    * grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

    * grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

    * the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

    In addition, the following features are now possible with grobs:

    * grobs now save()/load() like any normal R object.

    * many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

    * there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot, which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

    * there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

    * there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

  – Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).

  – Allowed viewports to turn off clipping altogether. Possible settings for the viewport 'clip' arg are now:

      "on"       clip to the viewport (was TRUE)
      "inherit"  clip to whatever parent says (was FALSE)
      "off"      turn off clipping

    Still accept logical values (and NA maps to "off").
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends().

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories, as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector), rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions', and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.
• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(, deriv=2) and psigamma(, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

  The included PCRE sources have been updated to version 4.5, and PCRE >= 4.0 is now required if --with-pcre is used.

  The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blas=goto, to use K. Goto's optimized BLAS, will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN

by Kurt Hornik

New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.
GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation, and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.
NADA Contains methods described by Dennis RHelsel in his book ldquoNondetects And Data Anal-ysis Statistics for Censored EnvironmentalDatardquo By Lopaka Lee
R2WinBUGS Using this package it is possible tocall a BUGS model summarize inferences andconvergence in a table and graph and savethe simulations in arrays for easy access inR By originally written by Andrew Gelmanchanges and packaged by Sibylle Sturtz andUwe Ligges
RScaLAPACK An interface to ScaLAPACK func-tions from R By Nagiza F Samatova SrikanthYoginath and David Bauer
RUnit R functions implementing a standard UnitTesting framework with additional code in-spection and report generation tools ByMatthias Burger Klaus Juenemann ThomasKoenig
SIN This package provides routines to perform SINmodel selection as described in Drton amp Perl-man (2004) The selected models are in theformat of the ggm package which allows inparticular parameter estimation in the selectedmodel By Mathias Drton
SoPhy SWMS_2D interface Submission to Comput-ers and Geosciences title The use of the lan-guage interface of R two examples for mod-elling water flux and solute transport By Mar-tin Schlather Bernd Huwe
SparseLogReg Some functions wrapping the sparselogistic regression code by S K Shevade and SS Keerthi originally intended for microarray-based gene selection By Michael T Maderwith C code from S K Shevade and S SKeerthi
assist ASSIST see manual By Yuedong Wang andChunlei Ke
bayesmix Bayesian mixture models of univariateGaussian distributions using JAGS By BettinaGruen
betareg Beta regression for modeling rates and pro-portions By Alexandre de Bustamante Simas
circular Circular Statistics from ldquoTopics in circularStatisticsrdquo (2001) S Rao Jammalamadaka andA SenGupta World Scientific By Ulric LundClaudio Agostinelli
crossdes Contains functions for the constructionand randomization of balanced carryover bal-anced designs Contains functions to check
R News ISSN 1609-3631
Vol 41 June 2004 43
given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.
debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.
distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.
dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.
energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.
evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.
evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S functions (EVIS) by Alexander McNeil; R port by Alec Stephenson.
fBasics fBasics package from Rmetrics. Rmetrics is an environment and software collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others; see the SOURCE file.
fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.
fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.
fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.
fortunes R Fortunes. By Achim Zeileis; fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.
gap This is an integrated package for genetic data analysis of both population and family data. Currently it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, and some statistics in linkage analysis and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.
gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. The package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.
gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.
hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.
haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.
hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.
httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.
impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
kernlab Kernel-based machine learning methods including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.
klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.
knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.
kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.
mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.
mathgraph Simple tools for constructing and manipulating objects of class mathgraph, from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.
mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between versus within chain approach of Gelman and Rubin. By Gregory R. Warnes.
mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.
mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.
multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.
mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.
mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley; some routines from vegan by Jari Oksanen; extensions and adaptations of rpart to mvpart by Glenn De'ath.
pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.
qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.
race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.
rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).
ref Small package with functions for creating references, reading from and writing to references, and a memory efficient refdata type that transparently encapsulates matrixes and data.frames. By Jens Oehlschlägel.
regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.
rgl 3D visualization device (OpenGL). By Daniel Adler.
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.
rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.
rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.
sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.
seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld, D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972-974. By David A. Schoenfeld.
setRNG Set reproducible random number generator in R and S. By Paul Gilbert.
sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.
skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.
spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.
spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.
spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS 2002, 99, pg. 15869, and J. Comput. Chem. 2003, 24, pg. 1215. By Rajarshi Guha.
supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.
svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.
treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.
urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.
urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.
verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.
zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.
Other changes
• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News
by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)
• Emanuele De Rinaldis (Italy)
• Norm Phillips (Canada)
New supporting institutions
• Breast Center at Baylor College of Medicine, Houston, Texas, USA
• Dana-Farber Cancer Institute, Boston, USA
• Department of Biostatistics, Vanderbilt University School of Medicine, USA
• Department of Statistics & Actuarial Science, University of Iowa, USA
• Division of Biostatistics, University of California, Berkeley, USA
• Norwegian Institute of Marine Research, Bergen, Norway
• TERRA Lab, University of Regina - Department of Geography, Canada
• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
the two, which is normally not accessed directly by the user. We shall focus mostly on POSIXct here. More information on the POSIXt classes is found in ?DateTimeClasses and in ?strptime. Additional insight can be obtained by inspecting the internals of POSIX datetime objects using unclass(x), where x is of any POSIX datetime class. (This works for Date and chron too.) The POSIX classes are discussed in Ripley and Hornik (2001).
When considering which class to use, always choose the least complex class that will support the application. That is, use Date if possible, otherwise use chron and otherwise use the POSIX classes. Such a strategy will greatly reduce the potential for error and increase the reliability of your application.
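As a minimal sketch of that progression (the dates used here are illustrative, not from the article):

```r
# Least complex class first: a pure calendar date
d <- as.Date("2004-06-15")

# Only when a time of day and a time zone truly matter, move up to POSIXct
dp <- as.POSIXct("2004-06-15 10:30:00", tz = "GMT")

format(d)               # "2004-06-15"
format(dp, tz = "GMT")  # "2004-06-15 10:30:00"
```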
A full page table is provided at the end of this article which illustrates and compares idioms written in each of the three datetime systems above.
Aside from the three primary date classes discussed so far, there is also the date package, which represents dates as days since January 1, 1960. There are some date functions in the pastecs package that represent datetimes as years plus fraction of a year. The date and pastecs packages are not discussed in this article.
Other Applications
Spreadsheets like Microsoft Excel on a Windows PC or OpenOffice.org represent datetimes as days and fraction of days since December 30, 1899 (usually). If x is a vector of such numbers, then as.Date("1899-12-30") + floor(x) will give a vector of Date class dates with respect to Date's origin. Similarly, chron("12/30/1899") + x will give chron dates relative to chron's origin. Excel on a Mac usually represents dates as days and fraction of days since January 1, 1904, so as.Date("1904-01-01") + floor(x) and chron("01/01/1904") + x convert vectors of numbers representing such dates to Date and chron, respectively. It's possible to set Excel to use either origin, which is why the word "usually" was employed above.
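The Windows-origin rule above can be checked on small serial numbers (a sketch; the serial values are made up):

```r
# Excel (Windows) serial numbers count days since 1899-12-30,
# so 1.5 is noon on 1899-12-31 and 2.25 is early on 1900-01-01
x <- c(1.5, 2.25)
as.Date("1899-12-30") + floor(x)
# [1] "1899-12-31" "1900-01-01"
```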
SPSS uses October 14, 1582 as the origin, thereby representing datetimes as seconds since the beginning of the Gregorian calendar. SAS uses seconds since January 1, 1960. spss.get and sas.get in package Hmisc can handle such datetimes automatically (Alzola and Harrell, 2002).
Time Series
A common use of dates and datetimes are in time series. The ts class represents regular time series (i.e., equally spaced points) and is mostly used if the frequency is monthly, quarterly or yearly. (Other series are labelled numerically rather than using dates.) Irregular time series can be handled by classes irts (in package tseries), its (in package its) and zoo (in package zoo). ts uses a scheme of its own for dates; irts and its use POSIXct dates to represent datetimes; zoo can use any date or datetime package to represent datetimes.
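To illustrate the ts convention just described, a regular monthly series needs only a start and a frequency; its time axis is year plus fraction of a year rather than a date class (a sketch):

```r
# 24 monthly observations starting January 2003
z <- ts(1:24, start = c(2003, 1), frequency = 12)

start(z)    # c(2003, 1)
time(z)[2]  # 2003 + 1/12, i.e. February 2003 as a numeric label
```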
Input
By default, read.table will read in numeric data, such as 20040115, as numbers and will read non-numeric data, such as 12/15/04 or 2004-12-15, as factors. In either case, one should convert the data to character class using as.character. In the second (non-numeric) case one could alternately use the as.is= argument to read.table to prevent the conversion from character to factor within read.table.¹
# date col in all numeric format yyyymmdd
df <- read.table("laketemp.txt", header = TRUE)
as.Date(as.character(df$date), "%Y%m%d")

# first two cols in format mm/dd/yy hh:mm:ss
# Note as.is= in read.table to force character
library(chron)
df <- read.table("oxygen.txt", header = TRUE,
                 as.is = 1:2)
chron(df$date, df$time)
Avoiding Errors
The easiest way to avoid errors is to use the least complex class consistent with your application, as discussed in the Introduction.
With chron, in order to avoid conflicts between multiple applications, it is recommended that the user does not change the default settings of the four chron options:
# default origin
options(chron.origin = c(month = 1, day = 1, year = 1970))
# if TRUE, abbreviates year to 2 digits
options(chron.year.abb = TRUE)
# function to map 2 digit year to 4 digits
options(chron.year.expand = "year.expand")
# if TRUE, then year not displayed
options(chron.simplify = FALSE)
For the same reason, not only should the global chron origin not be changed, but the per-variable origin should not be changed either. If one has a numeric vector of data x representing days since chron date orig, then orig + x represents that data as chron dates relative to the default origin. With such a simple conversion there is really no reason to have to resort to non-standard origins. For example,
¹ This example uses data sets found at http://www.tu-dresden.de/fghhihb/petzoldt/modlim/data/.
orig <- chron("01/01/60")
x <- 0:9  # days since 01/01/60
# chron dates with default origin
orig + x
Regarding POSIX classes, the user should be aware of the following:
• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:
dp <- seq(Sys.time(), len = 10, by = "day")
plot(dp, 1:10)
This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in the table of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.
• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone "" and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.
• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst = 1) or not (isdst = 0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.
• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
Comparison Table
Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes and to translate commands among the classes.
Bibliography
Alzola, C. F. and Harrell, F. E. (2002). An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf.
James, D. A. and Pregibon, D. (1993). Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf.
Ripley, B. D. and Hornik, K. (2001). Date-time classes. R News, 1(2):8-11. URL http://CRAN.R-project.org/doc/Rnews/.
Gabor Grothendieck
ggrothendieck@myway.com
Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de
Table 1: Comparison Table

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, those beginning with dc are chron variables, and those beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at their default values. See ?strptime for the % codes.

now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz = "GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class = "Date")
  chron:   chron(0)
  POSIXct: structure(0, class = c("POSIXt", "POSIXct"))

x days since origin (GMT)
  Date:    structure(floor(x), class = "Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class = c("POSIXt", "POSIXct"))

diff in days
  Date:    dt1 - dt2
  chron:   dc1 - dc2
  POSIXct: difftime(dp1, dp2, unit = "day") [1]

time zone difference
  POSIXct: dp - as.POSIXct(format(dp, tz = "GMT"))

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt + 1
  chron:   dc + 1
  POSIXct: seq(dp0, length = 2, by = "DSTday")[2] [2]

previous day
  Date:    dt - 1
  chron:   dc - 1
  POSIXct: seq(dp0, length = 2, by = "-1 DSTday")[2] [3]

x days since date
  Date:    dt0 + floor(x)
  chron:   dc0 + x
  POSIXct: seq(dp0, length = 2, by = paste(x, "DSTday"))[2] [4]

sequence
  Date:    dts <- seq(dt0, length = 10, by = "day")
  chron:   dcs <- seq(dc0, length = 10)
  POSIXct: dps <- seq(dp0, length = 10, by = "DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz = "GMT"), rnorm(10)) [5]

every 2nd week
  Date:    seq(dt0, length = 3, by = "2 week")
  chron:   dc0 + seq(0, length = 3, by = 14)
  POSIXct: seq(dp0, length = 3, by = "2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc) - month.day.year(dc)$day + 1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun = 0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc) - 3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output: yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output: mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output: Sat Jul 1 1970
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input: "1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format = "y-m-d")
  POSIXct: as.POSIXct(z)

Input: "10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz = "")
  to Date:    as.Date(dates(dc)); as.Date(format(dp))
  to chron:   chron(unclass(dt)); chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]
  to POSIXct: as.POSIXct(format(dt)); as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz = "GMT")
  to Date:    as.Date(dates(dc)); as.Date(dp)
  to chron:   chron(unclass(dt)); as.chron(dp)
  to POSIXct: as.POSIXct(format(dt), tz = "GMT"); as.POSIXct(dc)

[1] Can be reduced to dp1 - dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp + 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp - 24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp + 24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz = "GMT")).
Programmers' Niche
A simple class in S3 and S4
by Thomas Lumley
ROC curves
Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.
The ROC curve plots the true positive rate (sensitivity) Pr(T > c | D = 1) against the false positive rate (1 - specificity) Pr(T > c | D = 0) for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).
I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.
Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±∞ and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.
drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type = "l")
}
It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.
There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number rather than just an object for which > and <= are defined.
drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  plot(mspec, sens, type = "l")
}
One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
Classes and methods
Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.
The first thing a new class needs is a print method. The print function is generic; when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens = sens, mspec = mspec, test = TT,
               call = sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}
The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.
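The "duck" remark can be made concrete with a toy generic (not from the article; the quack functions are hypothetical), showing that S3 dispatch trusts the class attribute unconditionally:

```r
quack <- function(x) UseMethod("quack")
quack.duck <- function(x) "Quack!"

# Any object labelled "duck" is assumed to quack,
# whatever its actual contents
d <- structure(list(), class = "duck")
quack(d)  # "Quack!"
```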
Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.
plot.ROC <- function(x, type = "b", null.line = TRUE,
                     xlab = "1-Specificity", ylab = "Sensitivity",
                     main = NULL, ...) {
  par(pty = "s")
  plot(x$mspec, x$sens, type = type,
       xlab = xlab, ylab = ylab, ...)
  if (null.line)
    abline(0, 1, lty = 3)
  if (is.null(main))
    main <- x$call
  title(main = main)
}
A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:
lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}
A method for identify allows the cutpoint to be found for any point on the curve:
identify.ROC <- function(x, labels = NULL, ...,
                         digits = 1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}
An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
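One possible answer to that exercise (a sketch, not the author's solution) is the trapezoidal rule applied to the stored (mspec, sens) points, closing the curve at (0, 0):

```r
# Area under the ROC curve by the trapezoidal rule.
# Works on anything with $sens and $mspec components
# ordered as produced by ROC() above.
AUC <- function(x) {
  mspec <- c(0, x$mspec)
  sens  <- c(0, x$sens)
  # sum of trapezoid areas: width times average height
  sum(diff(mspec) * (head(sens, -1) + tail(sens, -1)) / 2)
}
```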
Generalising the class
The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.
ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
               test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}
The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.
ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type="response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
               test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}
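To see the generic in action, a hypothetical usage sketch (the data frame mydata and its columns disease and marker are invented for illustration):

```r
## Hypothetical data: binary outcome 'disease', predictor 'marker'.
fit <- glm(disease ~ marker, family = binomial, data = mydata)
plot(ROC(fit))  # dispatches to ROC.glm, then to plot.ROC
lines(ROC(mydata$marker, mydata$disease))  # ROC.default, for comparison
```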
S4 classes and methods
In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).
As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.
The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:
setClass("ROC",
  representation(sens="numeric", mspec="numeric",
                 test="numeric", call="call"),
  validity=function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })
The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
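A quick way to see the validity check fire is to construct an inconsistent object with new() (a sketch; the exact error text depends on the R version):

```r
## Mismatched slot lengths should fail the validity check.
bad <- try(new("ROC", sens = c(0.5, 1), mspec = 0.5,
               test = c(2, 1), call = quote(ROC(T, D))))
inherits(bad, "try-error")  # TRUE: no object was created
```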
In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  new("ROC", sens=sens, mspec=mspec,
      test=TT, call=sys.call())
}
Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:
setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })
The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.
The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.
setMethod("plot", c("ROC", "missing"),
  function(x, y, type="b", null.line=TRUE,
           xlab="1-Specificity",
           ylab="Sensitivity",
           main=NULL, ...) {
    par(pty="s")
    plot(x@mspec, x@sens, type=type,
         xlab=xlab, ylab=ylab, ...)
    if(null.line)
      abline(0, 1, lty=3)
    if(is.null(main))
      main <- x@call
    title(main=main)
  })
Creating a method for lines appears to work the same way:
setMethod("lines", "ROC",
  function(x, ...)
    lines(x@mspec, x@sens, ...))
In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before
> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>
and after the setMethod call
> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.
setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type="response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    new("ROC", sens=sens, mspec=mspec,
        test=TT, call=sys.call())
  })
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.
setGeneric("identify", signature=c("x"))

setMethod("identify", "ROC",
  function(x, labels=NULL, ..., digits=1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels=labels, ...)
  })
Discussion
Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.
This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R
by the R Core Team

New features in 1.9.1
• as.Date() now has a method for "POSIXlt" objects.
• mean() has a method for "difftime" objects and so summary() works for such objects.
• legend() has a new argument pt.cex.
• plot.ts() has more arguments, particularly yax.flip.
• heatmap() has a new keep.dendro argument.
• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)
• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc look for variables in their formulae.
User-visible changes in 1.9.0
• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.
• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.
Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.
Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.
Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook – see ?setHook.
• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.
• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0
• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.
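For example (a small sketch of the new behaviour):

```r
e <- new.env()
e$x <- 1       # same as assign("x", 1, envir = e)
e$x            # 1, like get("x", envir = e, inherits = FALSE)
e[["x"]] <- 2  # [[<- works the same way
e[["x"]]       # 2
```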
• There are now print() and [ methods for "acf" objects.
• aov() will now handle singular Error() models with a warning.
• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).
• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.
• as.data.frame() now has a method for arrays.
• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.
• New function assignInNamespace(), parallelling fixInNamespace.
• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.
• D(), deriv(), etc now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).
• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.
• There is a new class "Date" to represent dates (without times) plus many utility functions similar to those for date-times. See ?Date.
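A brief sketch of the new class:

```r
d <- as.Date("2004-06-01")
d + 30                 # day-based arithmetic: "2004-07-01"
weekdays(d)            # "Tuesday" in an English locale
format(d, "%d %B %Y")  # formatting, as for date-times
```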
• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)
• .Deprecated() has a new argument package which is used in the warning message for non-base packages.
• The print() method for "difftime" objects now handles arrays.
• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.
• dist() has a new method to calculate Minkowski distances.
• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.
• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.
• New functions factorial() defined as gamma(x+1), and for S-PLUS compatibility, lfactorial() defined as lgamma(x+1).
• findInterval(x, v) now allows +/-Inf values, and NAs in x.
• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.
• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.
• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.
The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.
• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)
• legend() has a new argument 'text.col'.
• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).
• A new function mget() will retrieve multiple values from an environment.
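For example:

```r
e <- new.env()
assign("a", 1, envir = e)
assign("b", 2, envir = e)
mget(c("a", "b"), envir = e)  # list(a = 1, b = 2)
```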
• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".
• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.
• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.
• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)
It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.
• ppr() now fully supports categorical explanatory variables.
ppr() is now interruptible at suitable places in the underlying FORTRAN code.
• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.
• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.
• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.
• read.table() now allows embedded newlines in quoted fields. (PR#4555)
• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.
If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.
If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)
The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).
• rgb2hsv() is new, an R interface to the C API function with the same name.
• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.
• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.
• New function shQuote() to quote strings to be passed to OS shells.
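A sketch of its use (the file name is invented):

```r
## Quote a filename so spaces and quotes survive the shell.
f <- "file name with 'quotes'"
cmd <- paste("ls -l", shQuote(f))
## cmd can now be passed safely to system(cmd)
```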
• sink() now has a split= argument to direct output to both the sink and the current output connection.
• split.screen() now works for multiple devices at once.
• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.
• strsplit() now has fixed and perl arguments, and split="" is optimized.
• subset() now allows a drop argument which is passed on to the indexing method for data frames.
• termplot() has an option to smooth the partial residuals.
• varimax() and promax() add class "loadings" to their loadings component.
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)
• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.
• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".
• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).
• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.
• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().
• Package stats4 exports S4 generics for AIC() and BIC().
• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.
• Added experimental support for conditionals in NAMESPACE files.
• Added as.list.environment to coerce environments to lists (efficiently).
• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)
• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.
• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.
• Changes to package 'grid':
– Renamed push/pop.viewport() to push/popViewport().
– Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).
– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.
– Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.
– Added vpPath() to allow specification of viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use.
NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).
– Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).
– Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:
    pushViewport(viewport(w=.5, h=.5, name="A"))
    grid.rect()
    pushViewport(viewport(w=.5, h=.5, name="B"))
    grid.rect(gp=gpar(col="grey"))
    upViewport(2)
    grid.rect(vp="A", gp=gpar(fill="red"))
    grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))
– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html
– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:
* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).
* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)
* the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob "pointed to" is modified.
In addition, the following features are now possible with grobs:
* grobs now save()/load() like any normal R object.
* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).
* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.
* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.
* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.
– Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).
– Allowed viewports to turn off clipping altogether. Possible settings for viewport clip arg are now:
    "on": clip to the viewport (was TRUE)
    "inherit": clip to whatever parent says (was FALSE)
    "off": turn off clipping
Still accept logical values (and NA maps to "off").
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.
• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.
• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.
• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.
• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.
• Vignette dependencies can now be checked and obtained via vignetteDepends.
• Option "repositories" to list URLs for package repositories added.
• package.description() has been replaced by packageDescription().
• R CMD INSTALL/build now skip Subversion's .svn directories, as well as CVS directories.
• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.
• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.
• hsv2rgb and rgb2hsv are newly in the C API.
• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.
• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.
• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.
• print.coefmat() is defunct, replaced by printCoefmat().
• codes() and codes<-() are defunct.
• anovalist.lm (replaced in 1.2.0) is now defunct.
• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.
• print.atomic() is defunct.
• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).
• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.
• La.eigen() is deprecated; now eigen() uses LAPACK by default.
• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(, deriv=2) and psigamma(, deriv=3).
• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.
• package.contents() and package.description() have been deprecated.
• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.
The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.
The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.
• configure no longer lists bzip2 and PCRE as 'additional capabilities' as all builds of R have had them since 1.7.0.
• --with-blasgoto= to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN
by Kurt Hornik
New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.
DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library, Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001) S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.
crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.
debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. Energy-statistics concept based on a generalization of Newton's potential energy is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. S original (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.
fSeries fOptions package from Rmetrics By Di-ethelm Wuertz and many others see theSOURCE file
fortunes R Fortunes By Achim Zeileis fortune con-tributions from Torsten Hothorn Peter Dal-gaard Uwe Ligges Kevin Wright
gap This is an integrated package for genetic dataanalysis of both population and family dataCurrently it contains functions for samplesize calculations of both population-based andfamily-based designs probability of familialdisease aggregation kinship calculation somestatistics in linkage analysis and associationanalysis involving one or more genetic mark-ers including haplotype analysis In the futureit will incorporate programs for path and seg-regation analyses as well as other statistics inlinkage and association analyses By Jing huaZhao in collaboration with other colleaguesand with help from Kurt Hornik and Brian Rip-ley of the R core development team
gclus Orders panels in scatterplot matrices and par-allel coordinate displays by some merit indexPackage contains various indices of merit or-dering functions and enhanced versions ofpairs and parcoord which color panels accord-ing to their merit level By Catherine Hurley
gpls Classification using generalized partial leastsquares for two-group and multi-group (morethan 2 group) classification By Beiying Ding
hapassoc The following R functions are used forlikelihood inference of trait associations withhaplotypes and other covariates in generalizedlinear models The functions accommodate un-certain haplotype phase and can handle miss-ing genotypes at some SNPs By K Burkett BMcNeney J Graham
haplostats Haplo Stats is a suite of S-PLUSR rou-tines for the analysis of indirectly measuredhaplotypes The statistical methods assumethat all subjects are unrelated and that haplo-types are ambiguous (due to unknown link-age phase of the genetic markers) The ge-netic markers are assumed to be codominant(ie one-to-one correspondence between theirgenotypes and their phenotypes) and so we re-fer to the measurements of genetic markers asgenotypes The main functions in Haplo Statsare haploem haploglm and haploscoreBy Jason P Sinnwell and Daniel J Schaid
hett Functions for the fitting and summarizing ofheteroscedastic t-regression By Julian Taylor
httpRequest HTTP Request protocols Implementsthe GET POST and multipart POST request ByEryk Witold Wolski
impute Imputation for microarray data (currentlyKNN only) By Trevor Hastie Robert Tib-shirani Balasubramanian Narasimhan GilbertChu
kernlab Kernel-based machine learning methods including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.
klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.
knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.
kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.
mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.
mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou.
mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-chain versus within-chain approach of Gelman and Rubin. By Gregory R. Warnes.
mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.
mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.
multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.
mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by debug package. By Mark V. Bravington.
mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.
pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.
qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.
race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.
rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).
ref Small package with functions for creating references, reading from and writing to references, and a memory efficient refdata type that transparently encapsulates matrixes and dataframes. By Jens Oehlschlägel.
regress Functions to fit Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.
rgl 3D visualization device (OpenGL). By Daniel Adler.
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.
rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.
rrcov Functions for robust location and scatter estimation and robust regression with high breakdown point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen, ported to R, adapted and packaged by Valentin Todorov.
sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.
seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972-974. By David A. Schoenfeld.
setRNG Set reproducible random number generator in R and S. By Paul Gilbert.
sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.
skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.
spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.
spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.
spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS 2002, 99, pg. 15869, and J. Comput. Chem. 2003, 24, pg. 1215. By Rajarshi Guha.
supclust Methodology for supervised grouping of predictor variables. By Marcel Dettling and Martin Maechler.
svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.
treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.
urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.
urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.
verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns. Ported to R by Nick Efthymiou.
zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.
Other changes
• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News
by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)
• Emanuele De Rinaldis (Italy)
• Norm Phillips (Canada)
New supporting institutions
• Breast Center at Baylor College of Medicine, Houston, Texas, USA
• Dana-Farber Cancer Institute, Boston, USA
• Department of Biostatistics, Vanderbilt University School of Medicine, USA
• Department of Statistics & Actuarial Science, University of Iowa, USA
• Division of Biostatistics, University of California, Berkeley, USA
• Norwegian Institute of Marine Research, Bergen, Norway
• TERRA Lab, University of Regina - Department of Geography, Canada
• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org
R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
orig <- chron("01/01/60")
x <- 0:9 # days since 01/01/60
# chron dates with default origin
orig + x
Regarding POSIX classes, the user should be aware of the following:

• time zone. Know which time zone each function that is being passed POSIX dates is assuming. Note that the time of day, day of the week, the day of the month, the month of the year, the year and other date quantities can potentially all differ depending on the time zone and whether it is daylight or standard time. The user must keep track of which time zone each function that is being used is assuming. For example, consider:

dp <- seq(Sys.time(), len=10, by="day")
plot(dp, 1:10)

This does not use the current wall clock time for plotting today and the next 9 days, since plot treats the datetimes as relative to GMT. The x values that result will be off from the wall clock time by a number of hours equal to the difference between the current time zone and GMT. See the plot example in Table 1 of this article for an illustration of how to handle this. Some functions accept a tz= argument, allowing the user to specify the time zone explicitly, but there are cases where the tz= argument is ignored. Always check the function with tz="" and tz="GMT" to be sure that tz= is not being ignored. If there is no tz= argument, check carefully which time zone the function is assuming.
• OS. Some of the date calculations done by POSIX functions are passed off to the operating system, and so may not work if the operating system has bugs with datetimes. In some cases the R code has compensated for OS bugs, but in general, caveat emptor. Another consequence is that different operating systems will accept different time zones. Normally the current time zone and Greenwich Mean Time "GMT" can be assumed to be available, but other time zones cannot be assumed to be available across platforms.
• POSIXlt. The tzone attribute on POSIXlt times is ignored, so it is safer to use POSIXct than POSIXlt when performing arithmetic or other manipulations that may depend on time zones. Also, POSIXlt datetimes have an isdst component that indicates whether it is daylight savings time (isdst=1) or not (isdst=0). When converting from another class, isdst may not be set as expected. Converting to character first and then to POSIXlt is safer than a direct conversion. Datetimes which are one hour off are the typical symptom of this last problem. Note that POSIXlt should not be used in data frames; POSIXct is preferred.

• conversion. Conversion between different datetime classes and POSIX classes may not use the time zone expected. Convert to character first to be sure. The conversion recipes given in the accompanying table should work as shown.
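As a minimal sketch of the character-first recipe above (the datetime value is arbitrary, and the chron package is assumed to be installed):

```r
library(chron)

# a POSIXct datetime in the local time zone
dp <- as.POSIXct("2004-06-01 12:00:00")

# a direct coercion may shift the clock time by the local GMT offset;
# converting via character preserves the printed date and time
dc <- chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S"))
```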
Comparison Table

Table 1 provides idioms written in each of the three classes discussed. The reader can use this table to quickly access the commands for those phrases in each of the three classes, and to translate commands among the classes.
Bibliography
Alzola, C. F. and Harrell, F. E. (2002). An Introduction to S and the Hmisc and Design Libraries. URL http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RS/sintro.pdf

James, D. A. and Pregibon, D. (1993). Chronological objects for data analysis. In Proceedings of the 25th Symposium of the Interface, San Diego. URL http://cm.bell-labs.com/cm/ms/departments/sia/dj/papers/chron.pdf

Ripley, B. D. and Hornik, K. (2001). Date-time classes. R News, 1(2):8-11. URL http://CRAN.R-project.org/doc/Rnews/
Gabor Grothendieck
ggrothendieck@myway.com

Thomas Petzoldt
petzoldt@rcs.urz.tu-dresden.de
now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class="Date")
  chron:   chron(0)
  POSIXct: structure(0, class=c("POSIXt","POSIXct"))

x days since origin (GMT)
  Date:    structure(floor(x), class="Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class=c("POSIXt","POSIXct"))

difference in days
  Date:    dt1-dt2
  chron:   dc1-dc2
  POSIXct: difftime(dp1, dp2, unit="day") [1]

time zone difference
  POSIXct: dp-as.POSIXct(format(dp), tz="GMT")

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt+1
  chron:   dc+1
  POSIXct: seq(dp0, length=2, by="DSTday")[2] [2]

previous day
  Date:    dt-1
  chron:   dc-1
  POSIXct: seq(dp0, length=2, by="-1 DSTday")[2] [3]

x days since date
  Date:    dt0+floor(x)
  chron:   dc0+x
  POSIXct: seq(dp0, length=2, by=paste(x, "DSTday"))[2] [4]

sequence
  Date:    dts <- seq(dt0, length=10, by="day")
  chron:   dcs <- seq(dc0, length=10)
  POSIXct: dps <- seq(dp0, length=10, by="DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz="GMT"), rnorm(10)) [5]

every 2nd week
  Date:    seq(dt0, length=3, by="2 week")
  chron:   dc0+seq(0, length=3, by=14)
  POSIXct: seq(dp0, length=3, by="2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc)-month.day.year(dc)$day+1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month (which sorts)
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun=0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc)-3)%%7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output: yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output: mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output: Sat Jul 1 1970
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input: "1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format="y-m-d")
  POSIXct: as.POSIXct(z)

Input: "10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz=""), to Date
  chron:   as.Date(dates(dc))
  POSIXct: as.Date(format(dp))

Conversion (tz=""), to chron
  Date:    chron(unclass(dt))
  POSIXct: chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]

Conversion (tz=""), to POSIXct
  Date:    as.POSIXct(format(dt))
  chron:   as.POSIXct(paste(as.Date(dates(dc)), times(dc)%%1))

Conversion (tz="GMT"), to Date
  chron:   as.Date(dates(dc))
  POSIXct: as.Date(dp)

Conversion (tz="GMT"), to chron
  Date:    chron(unclass(dt))
  POSIXct: as.chron(dp)

Conversion (tz="GMT"), to POSIXct
  Date:    as.POSIXct(format(dt), tz="GMT")
  chron:   as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See strptime for % codes.

[1] Can be reduced to dp1-dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp+24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp-24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp+24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz="GMT")).

Table 1: Comparison Table
Programmers' Niche: A simple class in S3 and S4
by Thomas Lumley
ROC curves
Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1 - specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±∞ and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.
drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
    function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
    function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type="l")
}
It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.
drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  plot(mspec, sens, type="l")
}
One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
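A quick numerical check of the monotonicity claim, on toy data invented for illustration:

```r
# toy test results and disease indicators (invented)
T <- c(1, 2, 2, 3, 4, 5)
D <- c(0, 0, 1, 0, 1, 1)

DD <- table(-T, D)
sens <- cumsum(DD[,2])/sum(DD[,2])
mspec <- cumsum(DD[,1])/sum(DD[,1])

# both rates are cumulative sums of non-negative counts,
# so neither can ever decrease
all(diff(sens) >= 0) && all(diff(mspec) >= 0)  # TRUE
```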
Classes and methods
Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec, test=TT,
    call=sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}
The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.
plot.ROC <- function(x, type="b", null.line=TRUE,
    xlab="1-Specificity", ylab="Sensitivity",
    main=NULL, ...) {
  par(pty="s")
  plot(x$mspec, x$sens, type=type,
    xlab=xlab, ylab=ylab, ...)
  if(null.line)
    abline(0, 1, lty=3)
  if(is.null(main))
    main <- x$call
  title(main=main)
}
A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:

lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}

A method for identify allows the cutpoint to be found for any point on the curve:

identify.ROC <- function(x, labels=NULL, ...,
    digits=1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels=labels, ...)
}
An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
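One possible solution (not from the article) is a trapezoidal-rule sketch; it assumes the sens and mspec components are stored in increasing order of mspec, as the ROC constructor above produces them:

```r
# hypothetical AUC generic and method for the S3 "ROC" class
AUC <- function(x, ...) UseMethod("AUC")

AUC.ROC <- function(x, ...) {
  # anchor the curve at (0, 0); the constructor already ends at (1, 1)
  sens <- c(0, x$sens)
  mspec <- c(0, x$mspec)
  # area under the curve by the trapezoidal rule
  sum(diff(mspec) * (head(sens, -1) + tail(sens, -1)) / 2)
}
```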
Generalising the class
The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.
ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
    test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}
The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.
ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
      c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type="response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
    test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}
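A hypothetical usage sketch (the data, model and variable names are invented for illustration):

```r
# simulated data: a marker that is informative about disease status
set.seed(1)
marker <- rnorm(200)
disease <- rbinom(200, 1, plogis(2 * marker))

fit <- glm(disease ~ marker, family = binomial)
r <- ROC(fit)   # dispatches to ROC.glm
plot(r)         # dispatches to plot.ROC
```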
S4 classes and methods
In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test, while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:
setClass("ROC",
  representation(sens="numeric", mspec="numeric",
    test="numeric", call="call"),
  validity=function(object) {
    length(object@sens) == length(object@mspec) &&
    length(object@sens) == length(object@test)
  })
The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  new("ROC", sens=sens, mspec=mspec,
    test=TT, call=sys.call())
}
Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:
setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })
The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC","missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.
setMethod("plot", c("ROC","missing"),
  function(x, y, type="b", null.line=TRUE,
      xlab="1-Specificity", ylab="Sensitivity",
      main=NULL, ...) {
    par(pty="s")
    plot(x@mspec, x@sens, type=type,
      xlab=xlab, ylab=ylab, ...)
    if(null.line)
      abline(0, 1, lty=3)
    if(is.null(main))
      main <- x@call
    title(main=main)
  })
Creating a method for lines appears to work the same way:
setMethod("lines", "ROC",
  function(x, ...) {
    lines(x@mspec, x@sens, ...)
  })
In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function, and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object.
setGeneric("ROC")

setMethod("ROC", c("glm","missing"),
  function(T) {
    if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type="response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    new("ROC", sens=sens, mspec=mspec,
      test=TT, call=sys.call())
  })
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.
setGeneric("identify", signature=c("x"))
setMethod("identify", "ROC",
   function(x, labels=NULL,
            ..., digits=1) {
      if (is.null(labels))
         labels <- round(x@test, digits)
      identify(x@mspec, x@sens,
               labels=labels, ...)
   })
Discussion
Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.
This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model; more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.
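The "has a" relationship can be sketched directly. In this illustrative example (the IRLSState class and coefOf generic are invented, not part of the ROC code) the object stores a fitted linear model in a slot and delegates coefficient extraction to it, rather than inheriting from a linear-model class:

```r
library(methods)

## Illustrative delegation: the object *has* a linear model in a slot
## (typed "ANY" here, since "lm" is an S3 class) and forwards to it.
setClass("IRLSState", representation(lmfit = "ANY", iter = "numeric"))

setGeneric("coefOf", function(object) standardGeneric("coefOf"))
setMethod("coefOf", "IRLSState",
          function(object) coef(object@lmfit))  # delegate to the stored fit

state <- new("IRLSState",
             lmfit = lm(dist ~ speed, data = cars),
             iter = 1)
coefOf(state)   # the coefficients of the wrapped lm
```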
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R
by the R Core Team
New features in 1.9.1
• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.
User-visible changes in 1.9.0
• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.
Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook – see ?setHook.
• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.
New features in 1.9.0
• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.
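A short sketch of these accessors (the variable names are arbitrary):

```r
## $ and [[ on environments, with exact-match semantics
e <- new.env()
e$alpha <- 1          # $<- on an environment
e[["beta"]] <- 2      # [[<- on an environment
e$alpha               # 1
e[["beta"]]           # 2
is.null(e$alph)       # TRUE: no partial matching, unlike lists
```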
• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), paralleling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos() and atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name, and so can be a variable name or a quoted character string.
• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.
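For example, a minimal illustration of the new class:

```r
## Dates without times: arithmetic and formatting work as for date-times
d <- as.Date("2004-06-01")
d + 30             # 30 days later: "2004-07-01"
weekdays(d)        # day-of-week label (locale-dependent)
as.numeric(d)      # days since 1970-01-01
```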
• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.
• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).
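A quick check of the stated definitions:

```r
factorial(5)                           # 120
lfactorial(5)                          # log(120)
all.equal(factorial(5), gamma(5 + 1))  # TRUE, by definition
```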
• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.
• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).
• A new function mget() will retrieve multiple values from an environment.
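For example:

```r
## mget() looks up several names at once and returns a named list
e <- new.env()
e$a <- 1
e$b <- 2
mget(c("a", "b"), envir = e)   # list(a = 1, b = 2)
```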
• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for S compatibility.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).
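The S-compatibility behaviour described above can be seen directly:

```r
rep(numeric(0), length.out = 3)      # length 3 (all NA), not length 0
rep(1:2, each = 2, length.out = 5)   # recycles: 1 1 2 2 1
rep(1:3, times = 2, length.out = 4)  # times ignored: 1 2 3 1
```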
• rgb2hsv() is new, an R interface to the C API function with the same name.
• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.
• strsplit() now has fixed and perl arguments, and split="" is optimized.
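The fixed argument avoids regular-expression interpretation of the split string:

```r
strsplit("a.b.c", ".")[[1]]                 # "." is a regexp: every character matches
strsplit("a.b.c", ".", fixed = TRUE)[[1]]   # literal dot: "a" "b" "c"
```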
• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)
• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.
• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.
• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport(), and seekViewport() to allow creation and navigation of a viewport tree (rather than just a viewport stack).

– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.

– Added vpList(), vpStack(), vpTree(), and current.vpTree() to allow creation of viewport bundles that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of a viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use. NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

– Added just argument to grid.layout() to allow justification of the layout relative to the parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).
– Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

pushViewport(viewport(w=.5, h=.5, name="A"))
grid.rect()
pushViewport(viewport(w=.5, h=.5, name="B"))
grid.rect(gp=gpar(col="grey"))
upViewport(2)
grid.rect(vp="A", gp=gpar(fill="red"))
grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))
– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site: http://www.stat.auckland.ac.nz/~paul/grid/grid.html
– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

* the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

In addition, the following features are now possible with grobs:

* grobs now save()/load() like any normal R object.

* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.
– Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth", and "grobheight" unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for the viewport clip arg are now:

"on": clip to the viewport (was TRUE)
"inherit": clip to whatever the parent says (was FALSE)
"off": turn off clipping

Still accept logical values (and NA maps to "off").
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends().

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.
• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(*, deriv=2) and psigamma(*, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.
• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities' as all builds of R have had them since 1.7.0.

• --with-blasgoto= to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN
by Kurt Hornik
New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.
BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.
BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.
DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.
R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.
RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.
crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.
debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S code (EVIS) by Alexander McNeil; R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others; see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.
fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.
fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between versus within chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal and external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and the backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and for running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.
R News ISSN 1609-3631
Vol. 4/1, June 2004, page 45
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini
rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers
rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen, ported to R, adapted and packaged by Valentin Todorov
sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis
seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld
setRNG Set reproducible random number generator in R and S. By Paul Gilbert
sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others
skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson
spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford
spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth
spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha
supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler
svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie
treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto
urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff
urn Functions for sampling without replacement (Simulated Urns). By Micah Altman
verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou
zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck
Other changes
- Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News
by Bettina Grün
Donations
- Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)
- Emanuele De Rinaldis (Italy)
- Norm Phillips (Canada)
New supporting institutions
- Breast Center at Baylor College of Medicine, Houston, Texas, USA
- Dana-Farber Cancer Institute, Boston, USA
- Department of Biostatistics, Vanderbilt University School of Medicine, USA
- Department of Statistics & Actuarial Science, University of Iowa, USA
- Division of Biostatistics, University of California, Berkeley, USA
- Norwegian Institute of Marine Research, Bergen, Norway
- TERRA Lab, University of Regina - Department of Geography, Canada
- ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA
Editorial Board: Douglas Bates and Paul Murrell
Editor Programmer's Niche: Bill Venables
Editor Help Desk: Uwe Ligges
Email of editors and editorial board: firstname.lastname@R-project.org
R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).
R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
Table 1: Comparison table

now
  Date:    Sys.Date()
  chron:   as.chron(as.POSIXct(format(Sys.time()), tz="GMT"))
  POSIXct: Sys.time()

origin
  Date:    structure(0, class="Date")
  chron:   chron(0)
  POSIXct: structure(0, class=c("POSIXt","POSIXct"))

x days since origin
  Date:    structure(floor(x), class="Date")
  chron:   chron(x)
  POSIXct: structure(x*24*60*60, class=c("POSIXt","POSIXct")) (GMT)

difference in days
  Date:    dt1-dt2
  chron:   dc1-dc2
  POSIXct: difftime(dp1, dp2, unit="day") [1]

time zone difference
  POSIXct: dp-as.POSIXct(format(dp, tz="GMT"))

compare
  Date:    dt1 > dt2
  chron:   dc1 > dc2
  POSIXct: dp1 > dp2

next day
  Date:    dt+1
  chron:   dc+1
  POSIXct: seq(dp0, length=2, by="DSTday")[2] [2]

previous day
  Date:    dt-1
  chron:   dc-1
  POSIXct: seq(dp0, length=2, by="-1 DSTday")[2] [3]

x days since date
  Date:    dt0+floor(x)
  chron:   dc0+x
  POSIXct: seq(dp0, length=2, by=paste(x, "DSTday"))[2] [4]

sequence
  Date:    dts <- seq(dt0, length=10, by="day")
  chron:   dcs <- seq(dc0, length=10)
  POSIXct: dps <- seq(dp0, length=10, by="DSTday")

plot
  Date:    plot(dts, rnorm(10))
  chron:   plot(dcs, rnorm(10))
  POSIXct: plot(as.POSIXct(format(dps), tz="GMT"), rnorm(10)) [5]

every 2nd week
  Date:    seq(dt0, length=3, by="2 week")
  chron:   dc0+seq(0, length=3, by=14)
  POSIXct: seq(dp0, length=3, by="2 week")

first day of month
  Date:    as.Date(format(dt, "%Y-%m-01"))
  chron:   chron(dc)-month.day.year(dc)$day+1
  POSIXct: as.POSIXct(format(dp, "%Y-%m-01"))

month which sorts
  Date:    as.numeric(format(dt, "%m"))
  chron:   months(dc)
  POSIXct: as.numeric(format(dp, "%m"))

day of week (Sun=0)
  Date:    as.numeric(format(dt, "%w"))
  chron:   as.numeric(dates(dc)-3) %% 7
  POSIXct: as.numeric(format(dp, "%w"))

day of year
  Date:    as.numeric(format(dt, "%j"))
  chron:   as.numeric(format(as.Date(dc), "%j"))
  POSIXct: as.numeric(format(dp, "%j"))

mean each month/year
  Date:    tapply(x, format(dt, "%Y-%m"), mean)
  chron:   tapply(x, format(as.Date(dc), "%Y-%m"), mean)
  POSIXct: tapply(x, format(dp, "%Y-%m"), mean)

12 monthly means
  Date:    tapply(x, format(dt, "%m"), mean)
  chron:   tapply(x, months(dc), mean)
  POSIXct: tapply(x, format(dp, "%m"), mean)

Output yyyy-mm-dd
  Date:    format(dt)
  chron:   format(as.Date(dates(dc)))
  POSIXct: format(dp, "%Y-%m-%d")

Output mm/dd/yy
  Date:    format(dt, "%m/%d/%y")
  chron:   format(dates(dc))
  POSIXct: format(dp, "%m/%d/%y")

Output Sat Jul 1 1970
  Date:    format(dt, "%a %b %d %Y")
  chron:   format(as.Date(dates(dc)), "%a %b %d %Y")
  POSIXct: format(dp, "%a %b %d %Y")

Input "1970-10-15"
  Date:    as.Date(z)
  chron:   chron(z, format="y-m-d")
  POSIXct: as.POSIXct(z)

Input "10/15/1970"
  Date:    as.Date(z, "%m/%d/%Y")
  chron:   chron(z)
  POSIXct: as.POSIXct(strptime(z, "%m/%d/%Y"))

Conversion (tz="")
  to Date:    as.Date(dates(dc));  as.Date(format(dp))
  to chron:   chron(unclass(dt));  chron(format(dp, "%m/%d/%Y"), format(dp, "%H:%M:%S")) [6]
  to POSIXct: as.POSIXct(format(dt));  as.POSIXct(paste(as.Date(dates(dc)), times(dc) %% 1))

Conversion (tz="GMT")
  to Date:    as.Date(dates(dc));  as.Date(dp)
  to chron:   chron(unclass(dt));  as.chron(dp)
  to POSIXct: as.POSIXct(format(dt), tz="GMT");  as.POSIXct(dc)

Notes: z is a vector of characters; x is a vector of numbers. Variables beginning with dt are Date variables, variables beginning with dc are chron variables, and variables beginning with dp are POSIXct variables. Variables ending in 0 are scalars; others are vectors. Expressions involving chron dates assume all chron options are set at default values. See strptime for % codes.

[1] Can be reduced to dp1-dp2 if it is acceptable that datetimes closer than one day are returned in hours or other units rather than days.
[2] Can be reduced to dp+24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[3] Can be reduced to dp-24*60*60 if a one hour deviation at time changes between standard and daylight savings time is acceptable.
[4] Can be reduced to dp+24*60*60*x if a one hour deviation is acceptable when straddling an odd number of standard/daylight savings time changes.
[5] Can be reduced to plot(dps, rnorm(10)) if plotting points at GMT datetimes rather than current datetimes suffices.
[6] An equivalent alternative is as.chron(as.POSIXct(format(dp), tz="GMT")).
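A few of the table's Date and POSIXct idioms can be run end to end to see them in action (a small illustrative session of my own, not part of the original table; the chron column is omitted so that only base R is needed):

```r
# dt is a Date and dp a POSIXct, following the table's naming.
dt <- as.Date("1970-10-15")
dp <- as.POSIXct("1970-10-15", tz = "GMT")

structure(0, class = "Date")        # origin: "1970-01-01"
format(dt, "%m/%d/%y")              # output mm/dd/yy: "10/15/70"
as.numeric(format(dt, "%j"))        # day of year: 288
as.Date(format(dt, "%Y-%m-01"))     # first day of month: "1970-10-01"
seq(dt, length = 3, by = "2 week")  # every 2nd week
as.POSIXct(format(dt), tz = "GMT")  # conversion Date -> POSIXct (tz="GMT")
```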
Programmers' Niche: A simple class in S3 and S4
by Thomas Lumley
ROC curves
Receiver Operating Characteristic curves (ROC curves) display the ability of an ordinal variable to discriminate between two groups. Invented by radio engineers, they are perhaps most used today in describing medical diagnostic tests. Suppose T is the result of a diagnostic test, and D is an indicator variable for the presence of a disease.

The ROC curve plots the true positive rate (sensitivity), Pr(T > c | D = 1), against the false positive rate (1-specificity), Pr(T > c | D = 0), for all values of c. Because the probabilities are conditional on disease status, the ROC curve can be estimated from either an ordinary prospective sample or from separate samples of cases (D = 1) or controls (D = 0).

I will start out with a simple function to draw ROC curves, improve it, and then use it as the basis of S3 and S4 classes.

Simply coding up the definition of the ROC curve gives a function that is efficient enough for most practical purposes. The one necessary optimisation is to realise that ±Inf and the unique values of T are the only cutpoints needed. Note the use of sapply to avoid introducing an index variable for elements of cutpoints.
drawROC <- function(T, D) {
  cutpoints <- c(-Inf, sort(unique(T)), Inf)
  sens <- sapply(cutpoints,
                 function(c) sum(D[T > c])/sum(D))
  spec <- sapply(cutpoints,
                 function(c) sum((1-D)[T <= c])/sum(1-D))
  plot(1-spec, sens, type="l")
}
It would usually be bad style to use T and c as variable names, because of confusion with the S-compatible T==TRUE and the built-in function c(). In this case the benefit of preserving standard notation is arguably worth the potential confusion.

There is a relatively simple optimisation of the function that increases the speed substantially, though at the cost of requiring T to be a number, rather than just an object for which > and <= are defined.
drawROC <- function(T, D) {
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  plot(mspec, sens, type="l")
}
One pedagogical virtue of this code is that it makes it obvious that the ROC curve must be monotone: the true positive and false positive rates are cumulative sums of non-negative numbers.
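The monotonicity claim can be checked directly on simulated data (a made-up example of my own, not from the column; the marker T is simply larger on average in the diseased group):

```r
# Hypothetical data: a marker T that is larger on average
# when the disease indicator D equals 1.
set.seed(42)
D <- rep(c(0, 1), each = 100)
T <- rnorm(200, mean = D)

# The table/cumsum trick from the text: rows of DD correspond to
# the unique values of -T, so the cumulative sums sweep the
# cutpoint downwards through the data.
DD <- table(-T, D)
sens  <- cumsum(DD[, 2]) / sum(DD[, 2])   # true positive rate
mspec <- cumsum(DD[, 1]) / sum(DD[, 1])   # false positive rate

# Cumulative sums of non-negative counts are non-decreasing:
all(diff(sens) >= 0)    # TRUE
all(diff(mspec) >= 0)   # TRUE
```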
Classes and methods
Creating an S3 class for ROC curves is an easy incremental step. The computational and graphical parts of drawROC are separated, and an object of class "ROC" is created simply by setting the class attribute of the result.

The first thing a new class needs is a print method. The print function is generic: when given an object of class "ROC" it automatically looks for a print.ROC function to take care of the work. Failing that, it calls print.default, which spews the internal representation of the object all over the screen.
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec, test=TT,
               call=sys.call())
  class(rval) <- "ROC"
  rval
}

print.ROC <- function(x, ...) {
  cat("ROC curve: ")
  print(x$call)
}
The programmer is responsible for ensuring that the object has all the properties assumed by functions that use ROC objects. If an object has class "duck", R will assume it can look.duck, walk.duck and quack.duck.

Since the main purpose of ROC curves is to be graphed, a plot method is also needed. As with print, plot will look for a plot.ROC method when handed an object of class "ROC" to plot.
plot.ROC <- function(x, type="b", null.line=TRUE,
                     xlab="1-Specificity", ylab="Sensitivity",
                     main=NULL, ...) {
  par(pty="s")
  plot(x$mspec, x$sens, type=type,
       xlab=xlab, ylab=ylab, ...)
  if (null.line)
    abline(0, 1, lty=3)
  if (is.null(main))
    main <- x$call
  title(main=main)
}
A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:
lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}
A method for identify allows the cutpoint to be found for any point on the curve:
identify.ROC <- function(x, labels=NULL, ..., digits=1) {
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels=labels, ...)
}
An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
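One possible solution to that exercise, sketched with the trapezoidal rule (the name AUC is my own choice, and I assume only the sens and mspec components stored by the ROC constructor above; this is not code from the original column):

```r
# Trapezoidal-rule area under an ROC curve, reading the sens and
# mspec components produced by the ROC() constructor in the text.
AUC <- function(x) {
  mspec <- c(0, x$mspec, 1)   # anchor the curve at (0,0) ...
  sens  <- c(0, x$sens, 1)    # ... and at (1,1)
  sum(diff(mspec) * (head(sens, -1) + tail(sens, -1)) / 2)
}

# A perfectly separating test has AUC 1:
perfect <- list(sens = c(0.5, 1, 1, 1), mspec = c(0, 0, 0.5, 1))
AUC(perfect)   # 1
```

Because only those two list components are read, the same call works unchanged on an object returned by ROC().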
Generalising the class
The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function.
ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
               test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}
The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable.
ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type="response"))
  disease <- disease*weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  rval <- list(sens=sens, mspec=mspec,
               test=TT, call=sys.call())
  class(rval) <- "ROC"
  rval
}
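The trick of adding the response residuals back to the fitted values deserves a one-line check: for a glm, response residuals are defined as the observed response minus the fitted values, so the sum recovers the original 0/1 outcome exactly (a small self-contained illustration with made-up data, not from the column):

```r
# Made-up data: binary outcome depending on one covariate.
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(x))
fit <- glm(y ~ x, family = binomial)

# fitted values + response residuals == observed response
disease <- fitted(fit) + resid(fit, type = "response")
max(abs(disease - y))   # effectively zero
```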
S4 classes and methods
In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).

As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.

The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:
setClass("ROC",
  representation(sens="numeric", mspec="numeric",
                 test="numeric", call="call"),
  validity=function(object) {
    length(object@sens) == length(object@mspec) &&
    length(object@sens) == length(object@test)
  })
The first argument to setClass gives the name of the new class. The second describes the components (slots) that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.

In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new
function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  new("ROC", sens=sens, mspec=mspec,
      test=TT, call=sys.call())
}
Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:
setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })
The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.
setMethod("plot", c("ROC", "missing"),
  function(x, y, type="b", null.line=TRUE,
           xlab="1-Specificity",
           ylab="Sensitivity",
           main=NULL, ...) {
    par(pty="s")
    plot(x@mspec, x@sens, type=type,
         xlab=xlab, ylab=ylab, ...)
    if (null.line)
      abline(0, 1, lty=3)
    if (is.null(main))
      main <- x@call
    title(main=main)
  })
Creating a method for lines appears to work the same way:
setMethod("lines", "ROC",
  function(x, ...) {
    lines(x@mspec, x@sens, ...)
  })
In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function, and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object:
setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T, ...) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type="response"))
    disease <- disease*weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    new("ROC", sens=sens, mspec=mspec,
        test=TT, call=sys.call())
  })
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.
setGeneric("identify", signature=c("x"))

setMethod("identify", "ROC",
  function(x, labels=NULL, ..., digits=1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens, labels=labels, ...)
  })
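To see the validity check earn its keep, the class definition above can be exercised directly (a sketch of my own; the slot values are arbitrary, and quote(ROC()) stands in for a real constructor call):

```r
library(methods)

setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
    length(object@sens) == length(object@test)
  })

# Matching slot lengths pass the validity check ...
ok <- new("ROC", sens = c(0.5, 1), mspec = c(0, 0.5),
          test = c(2, 1), call = quote(ROC()))

# ... while mismatched lengths are rejected by new() itself:
bad <- try(new("ROC", sens = 1, mspec = c(0, 1),
               test = 1, call = quote(ROC())), silent = TRUE)
inherits(bad, "try-error")   # TRUE
```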
Discussion
Creating a simple class and methods requires verysimilar code whether the S3 or S4 system is used anda similar incremental design strategy is possible TheS3 and S4 method system can coexist peacefully evenwhen S4 methods need to be defined for a functionthat already has S3 methods
This example has not used inheritance where theS3 and S4 systems differ more dramatically Judgingfrom the available examples of S4 classes inheritanceseems most useful in defining data structures ratherthan objects representing statistical calculations Thismay be because inheritance extends a class by creat-ing a special case but statisticians more often extenda class by creating a more general case Reusing codefrom say linear models in creating generalised lin-ear models is more an example of delegation than in-heritance It is not that a generalised linear modelis a linear model more that it has a linear model(from the last iteration of iteratively reweighted leastsquares) associated with it
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R
by the R Core Team

New features in 1.9.1
- as.Date() now has a method for "POSIXlt" objects.
- mean() has a method for "difftime" objects, and so summary() works for such objects.
- legend() has a new argument pt.cex.
- plot.ts() has more arguments, particularly yax.flip.
- heatmap() has a new keep.dendro argument.
- The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)
- nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.
User-visible changes in 1.9.0

- Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.
- Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

  Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has
been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

  Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

  Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook -- see setHook.
- There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.
- A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.
New features in 1.9.0
- $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.
- There are now print() and [ methods for "acf" objects.
- aov() will now handle singular Error() models, with a warning.
- arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).
- array() and matrix() now allow 0-length 'data' arguments for compatibility with S.
- as.data.frame() now has a method for arrays.
- as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.
- New function assignInNamespace() paralleling fixInNamespace().
- There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.
- D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).
- The package argument to data() is no longer allowed to be a (unquoted) name, and so can be a variable name or a quoted character string.
- There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.
- Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)
- .Deprecated() has a new argument package, which is used in the warning message for non-base packages.
- The print() method for "difftime" objects now handles arrays.
- dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.
- dist() has a new method to calculate Minkowski distances.
- expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.
- factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.
- New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).
- findInterval(x, v) now allows +/-Inf values, and NAs in x.
- formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.
- glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.
bull Functions grep() regexpr() sub() andgsub() now coerce their arguments to char-acter rather than give an error
The perl=TRUE argument now uses charactertables prepared for the locale currently in useeach time it is used rather than those of the Clocale
bull New functions head() and tail() in packagelsquoutilsrsquo (Based on a contribution by PatrickBurns)
bull legend() has a new argument rsquotextcolrsquo
bull methods(class=) now checks for a matchinggeneric and so no longer returns methods fornon-visible generics (and eliminates variousmismatches)
bull A new function mget() will retrieve multiplevalues from an environment
bull modelframe() methods for example those forlm and glm pass relevant parts of ontothe default method (This has long been docu-mented but not done) The default method isnow able to cope with model classes such aslqs and ppr
bull nls() and ppr() have a model argument to al-low the model frame to be returned as part ofthe fitted object
bull POSIXct objects can now have a tzone at-tribute that determines how they will be con-verted and printed This means that date-timeobjects which have a timezone specified willgenerally be regarded as in their original timezone
bull postscript() device output has been modi-fied to work around rounding errors in low-precision calculations in gs gt= 811 (PR5285which is not a bug in R)
It is now documented how to use other Com-puter Modern fonts for example italic ratherthan slanted
bull ppr() now fully supports categorical explana-tory variables
ppr() is now interruptible at suitable places inthe underlying FORTRAN code
bull princomp() now warns if both x and covmatare supplied and returns scores only if the cen-tring used is known
bull psigamma(x deriv=0) a new function gen-eralizes digamma() etc All these (psigammadigamma trigamma) now also work for x lt0
• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.
• read.table() now allows embedded newlines in quoted fields. (PR#4555)
• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.
If both 'each' and 'length.out' have been specified, it now recycles rather than fills with NAs, for S compatibility.
If both 'times' and 'length.out' have been specified, 'times' is now ignored, for S compatibility. (Previously padding with NAs was used.)
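The new recycling behaviour can be seen in a few small calls (a sketch):

```r
rep(integer(0), length.out = 3)      # NA NA NA: length 3, not length 0
rep(1:2, each = 2, length.out = 6)   # 1 1 2 2 1 1: recycled, not NA-padded
rep(1:2, times = 3, length.out = 3)  # 1 2 1: times is ignored
```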
The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).
• rgb2hsv() is new, an R interface to the C API function with the same name.
• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.
• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.
• New function shQuote() to quote strings to be passed to OS shells.
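For example (a sketch; the quoting style shown is the Unix one):

```r
f <- "my data.txt"
shQuote(f)                          # "'my data.txt'"
cmd <- paste("ls -l", shQuote(f))   # the filename is now safe to hand to system()
```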
• sink() now has a split= argument to direct output to both the sink and the current output connection.
• split.screen() now works for multiple devices at once.
• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.
• strsplit() now has 'fixed' and 'perl' arguments, and split = "" is optimized.
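The difference between the splitting modes (a sketch):

```r
strsplit("a.b.c", ".", fixed = TRUE)     # "a" "b" "c": "." taken literally
strsplit("a1b22c", "[0-9]+")             # "a" "b" "c": split as a regular expression
strsplit("a1b22c", "\\d+", perl = TRUE)  # same result via Perl-compatible regexps
```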
• subset() now allows a 'drop' argument which is passed on to the indexing method for data frames.
• termplot() has an option to smooth the partial residuals.
• varimax() and promax() add class "loadings" to their loadings component.
R News ISSN 1609-3631
Vol. 4/1, June 2004
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)
• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.
• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".
• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).
• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.
• Argument 'asp', although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().
• Package stats4 exports S4 generics for AIC() and BIC().
• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.
• Added experimental support for conditionals in NAMESPACE files.
• Added as.list.environment to coerce environments to lists (efficiently).
• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)
• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.
• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.
• Changes to package 'grid':
– Renamed push/pop.viewport() to push/popViewport().
– Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of a viewport tree (rather than just a viewport stack).
– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.
– Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport bundles that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.
– Added vpPath() to allow specification of a viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use.
NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).
– Added a 'just' argument to grid.layout() to allow justification of the layout relative to the parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).
– Allowed the 'vp' slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:
    pushViewport(viewport(w=.5, h=.5, name="A"))
    grid.rect()
    pushViewport(viewport(w=.5, h=.5, name="B"))
    grid.rect(gp=gpar(col="grey"))
    upViewport(2)
    grid.rect(vp="A", gp=gpar(fill="red"))
    grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))
– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html
– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:
* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).
* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)
* the grobwidth and grobheight units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.
In addition, the following features are now possible with grobs:
* grobs now save()/load() like any normal R object.
* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).
* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.
* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.
* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.
– Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).
– Allowed viewports to turn off clipping altogether. Possible settings for the viewport 'clip' arg are now:
  "on" = clip to the viewport (was TRUE)
  "inherit" = clip to whatever parent says (was FALSE)
  "off" = turn off clipping
Still accept logical values (and NA maps to "off").
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.
• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.
• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.
• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.
• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.
• Vignette dependencies can now be checked and obtained via vignetteDepends.
• Option 'repositories' to list URLs for package repositories added.
• package.description() has been replaced by packageDescription().
• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.
• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector), rather than assuming these are passed in.
• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.
• hsv2rgb and rgb2hsv are newly in the C API.
• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.
• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.
• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.
• print.coefmat() is defunct, replaced by printCoefmat().
• codes() and codes<-() are defunct.
• anovalist.lm (replaced in 1.2.0) is now defunct.
• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.
• print.atomic() is defunct.
• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).
• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.
• La.eigen() is deprecated; now eigen() uses LAPACK by default.
• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(x, deriv=2) and psigamma(x, deriv=3).
• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.
• package.contents() and package.description() have been deprecated.
• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.
The included PCRE sources have been updated to version 4.5, and PCRE >= 4.0 is now required if --with-pcre is used.
The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.
• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.
• --with-blas-goto= to use K. Goto's optimized BLAS will now work.
The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN
by Kurt Hornik
New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked, or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.
BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.
BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.
DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.
GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer and Korbinian Strimmer.
HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.
Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.
MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.
MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.
NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.
R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.
RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath and David Bauer.
RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.
SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.
SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.
SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.
assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.
bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Grün.
betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.
circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.
crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.
debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.
distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.
dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.
energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.
evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.
evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. S original (EVIS) by Alexander McNeil; R port by Alec Stephenson.
fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.
fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.
fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.
fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.
fortunes R Fortunes. By Achim Zeileis; fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.
gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.
gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.
gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.
hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.
haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.
hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.
httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.
impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
kernlab Kernel-based machine learning methods including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.
klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.
knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.
kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.
mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.
mathgraph Simple tools for constructing and manipulating objects of class mathgraph, from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.
mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between- versus within-chain approach of Gelman and Rubin. By Gregory R. Warnes.
mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.
mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.
multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.
mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by debug package. By Mark V. Bravington.
mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.
pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.
qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.
race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates are performed in parallel on different hosts. By Mauro Birattari.
rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).
ref Small package with functions for creating references, reading from and writing to references, and a memory efficient refdata type that transparently encapsulates matrixes and data.frames. By Jens Oehlschlägel.
regress Functions to fit Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.
rgl 3D visualization device (OpenGL). By Daniel Adler.
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.
rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.
rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.
sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.
seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.
setRNG Set reproducible random number generator in R and S. By Paul Gilbert.
sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.
skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.
spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.
spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.
spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS 2002, 99, pg 15869, and J. Comput. Chem. 2003, 24, pg 1215. By Rajarshi Guha.
supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.
svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.
treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.
urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.
urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.
verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns; ported to R by Nick Efthymiou.
zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.
Other changes
• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News
by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)
• Emanuele De Rinaldis (Italy)
• Norm Phillips (Canada)
New supporting institutions
• Breast Center at Baylor College of Medicine, Houston, Texas, USA
• Dana-Farber Cancer Institute, Boston, USA
• Department of Biostatistics, Vanderbilt University School of Medicine, USA
• Department of Statistics & Actuarial Science, University of Iowa, USA
• Division of Biostatistics, University of California, Berkeley, USA
• Norwegian Institute of Marine Research, Bergen, Norway
• TERRA Lab, University of Regina - Department of Geography, Canada
• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).
R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
Contents:
- Editorial
- The Decision To Use R
  - Preface
  - Some Background on the Company
  - Evaluating The Tools of the Trade - A Value Based Equation
  - An Introduction To R
  - A Continued Evolution
  - Giving Back
  - Thanks
- The ade4 package - I: One-table methods
  - Introduction
  - The duality diagram class
  - Distance matrices
  - Taking into account groups of individuals
  - Permutation tests
  - Conclusion
- qcc: An R package for quality control charting and statistical process control
  - Introduction
  - Creating a qcc object
  - Shewhart control charts
  - Operating characteristic function
  - Cusum charts
  - EWMA charts
  - Process capability analysis
  - Pareto charts and cause-and-effect diagrams
  - Tuning and extending the package
  - Summary
  - References
- Least Squares Calculations in R
- Tools for interactively exploring R packages
  - References
- The survival package
  - Overview
  - Specifying survival data
  - One and two sample summaries
  - Proportional hazards models
  - Additional features
- useR! 2004
- R Help Desk
  - Introduction
  - Other Applications
  - Time Series
  - Input
  - Avoiding Errors
  - Comparison Table
- Programmers' Niche
  - ROC curves
  - Classes and methods
    - Generalising the class
  - S4 classes and methods
  - Discussion
- Changes in R
  - New features in 1.9.1
  - User-visible changes in 1.9.0
  - New features in 1.9.0
- Changes on CRAN
  - New contributed packages
  - Other changes
- R Foundation News
  - Donations
  - New supporting institutions
  - New supporting members
Vol 41 June 2004 33
Programmersrsquo NicheA simple class in S3 and S4
by Thomas Lumley
ROC curves
Receiver Operating Characteristic curves (ROCcurves) display the ability of an ordinal variable todiscriminate between two groups Invented by radioengineers they are perhaps most used today in de-scribing medical diagnostic tests Suppose T is theresult of a diagnostic test and D is an indicator vari-able for the presence of a disease
The ROC curve plots the true positive rate (sen-sitivity) Pr(T gt c|D = 1) against the false positiverate (1minusspecificity) Pr(T gt c|D = 0) for all valuesof c Because the probabilities are conditional on dis-ease status the ROC curve can be estimated from ei-ther an ordinary prospective sample or from separatesamples of cases (D = 1) or controls (D = 0)
I will start out with a simple function to drawROC curves improve it and then use it as the basisof S3 and S4 classes
Simply coding up the definition of the ROC curvegives a function that is efficient enough for mostpractical pruposes The one necessary optimisationis to realise that plusmninfin and the the unique values ofT are the only cutpoints needed Note the use ofsapply to avoid introducing an index variable for el-ements of cutpoints
drawROC lt-function(TD)
cutpointslt-c(-Inf sort(unique(T)) Inf)
senslt-sapply(cutpoints
function(c) sum(D[Tgtc])sum(D))
speclt-sapply(cutpoints
function(c) sum((1-D)[Tlt=c]sum(1-D)))
plot(1-spec sens type=l)
It would usually be bad style to use T and c asvariable names because of confusion with the S-compatible T==TRUE and the built-in function c() Inthis case the benefit of preserving standard notationis arguably worth the potential confusion
There is a relatively simple optimisation ofthe function that increases the speed substantiallythough at the cost of requiring T to be a numberrather than just an object for which gt and lt= are de-fined
drawROClt-function(TD)
DDlt-table(-TD)
senslt-cumsum(DD[2])sum(DD[2])
mspeclt-cumsum(DD[1])sum(DD[1])
plot( mspecsens type=l)
One pedagogical virtue of this code is that itmakes it obvious that the ROC curve must be mono-tone the true positive and false positive rates are cu-mulative sums of non-negative numbers
Classes and methods
Creating an S3 class for ROC curves is an easy in-cremental step The computational and graphicalparts of drawROC are separated and an object of classROC is created simply by setting the class at-tribute of the result
The first thing a new class needs is a printmethod The print function is generic when givenan object of class ROC it automatically looks for aprintROC function to take care of the work Failingthat it calls printdefault which spews the inter-nal representation of the object all over the screen
ROClt-function(TD)
TTlt-rev(sort(unique(T)))
DDlt-table(-TD)
senslt-cumsum(DD[2])sum(DD[2])
mspeclt-cumsum(DD[1])sum(DD[1])
rvallt-list(sens=sens mspec=mspec test=TT
call=syscall())
class(rval)lt-ROC
rval
printROClt-function(x)
cat(ROC curve )
print(x$call)
The programmer is responsible for ensuring thatthe object has all the properties assumed by func-tions that use ROC objects If an object has classduck R will assume it can lookduck walkduckand quackduck
Since the main purpose of ROC curves is to begraphed a plot method is also needed As withprint plot will look for a plotROC method whenhanded an object of class ROC to plot
plotROClt-function(x type=b nullline=TRUE
xlab=1-Specificity ylab=Sensitivity
main=NULL )
par(pty=s)
plot(x$mspec x$sens type=type
xlab=xlab ylab=ylab )
R News ISSN 1609-3631
Vol 41 June 2004 34
if(nullline)
abline(0 1 lty=3)
if(isnull(main))
mainlt-x$call
title(main=main)
A lines method allows ROC curves to be graphedon the same plot and compared It is a stripped-down version of the plot method
linesROClt-function(x)
lines(x$mspec x$sens )
A method for identify allows the cutpoint to befound for any point on the curve
identifyROClt-function(x labels=NULL
digits=1)
if (isnull(labels))
labelslt-round(x$testdigits)
identify(x$mspec x$sens labels=labels)
An AUC function to compute the area under theROC curve is left as an exercise for the reader
Generalising the class
The test variable T in an ROC curve may be a modelprediction rather than a single biomarker and ROCcurves have been used to summarise the discrimina-tory power of logistic regression and survival mod-els A generic constructor function would allowmethods for single variables and for models Herethe default method is the same as the original ROCfunction
ROC lt- function(T) UseMethod(ROC)
ROCdefaultlt-function(TD)
TTlt-rev(sort(unique(T)))
DDlt-table(-TD)
senslt-cumsum(DD[2])sum(DD[2])
mspeclt-cumsum(DD[1])sum(DD[1])
rvallt-list(sens=sens mspec=mspec
test=TTcall=syscall())
class(rval)lt-ROC
rval
The ROCglm method extracts the fitted valuesfrom a binomial regression model and uses them asthe test variable
ROCglmlt-function(T)
if ((T$family$family in
c(binomial quasibinomial)))
stop(ROC curves for binomial glms only)
testlt-fitted(T)
diseaselt-(test+resid(T type=response))
diseaselt-diseaseweights(T)
if (max(abs(disease 1))gt001)
warning(Y values suspiciously
far from integers)
TTlt-rev(sort(unique(test)))
DDlt-table(-testdisease)
senslt-cumsum(DD[2])sum(DD[2])
mspeclt-cumsum(DD[1])sum(DD[1])
rvallt-list(sens=sens mspec=mspec
test=TTcall=syscall())
class(rval)lt-ROC
rval
S4 classes and method
In the new S4 class system provided by the meth-ods package classes and methods must be regis-tered This ensures that an object has the componentsrequired by its class and avoids the ambiguities ofthe S3 system For example it is not possible to tellfrom the name that ttestformula is a method forttest while tdataframe is a method for t (andttestcluster in the Design package is neither)
As a price for this additional clarity the S4 sys-tem takes a little more planning and can be clumsywhen a class needs to have components that are onlysometimes present
The definition of the ROC class is very similar tothe calls used to create the ROC objects in the S3 func-tions
setClass(ROC
representation(sens=numericmspec=numeric
test=numericcall=call)
validity=function(object)
length(objectsens)==length(objectmspec) ampamp
length(objectsens)==length(objecttest)
)
The first argument to setClass gives the name ofthe new class The second describes the components(slots) that contain the data Optional arguments in-clude a validity check In this case the ROC curvecontains three numeric vectors and a copy of the callthat created it The validity check makes sure that thelengths of the three vectors agree It could also checkthat the vectors were appropriately ordered or thatthe true and false positive rates were between 0 and1
In contrast to the S3 class system the S4 systemrequires all creation of objects to be done by the new
R News ISSN 1609-3631
Vol 41 June 2004 35
function This checks that the correct componentsare present and runs any validity checks Here is thetranslation of the ROC function
ROClt-function(TD)
TTlt-rev(sort(unique(T)))
DDlt-table(-TD)
senslt-cumsum(DD[2])sum(DD[2])
mspeclt-cumsum(DD[1])sum(DD[1])
new(ROC sens=sens mspec=mspec
test=TTcall=syscall())
Rather than print S4 objects use show for displayHere is a translation of the print method from the S3version of the code
setMethod(showROC
function(object)
cat(ROC curve )
print(objectcall)
)
The first argument of setMethod is the name ofthe generic function (show) The second is the sig-nature which specifies when the method should becalled The signature is a vector of class names inthis case a single name since the generic functionshow has only one argument The third argument isthe method itself In this example it is an anonymousfunction it could also be a named function The nota-tion objectcall is used to access the slots definedin setClass in the same way that $ is used for listcomponents The notation should be used only inmethods although this is not enforced in current ver-sions of R
The plot method shows some of the added flex-ibility of the S4 method system The generic func-tion plot has two arguments and the chosen methodcan be based on the classes of either or both Inthis case a method is needed for the case where thefirst argument is an ROC object and there is no sec-ond argument so the signature of the method isc(ROCmissing) In addition to the two argu-ments x and y of the generic function the methodhas all the arguments needed to customize the plotincluding a argument for further graphical pa-rameters
setMethod(plot c(ROCmissing)
function(x y type=b nullline=TRUE
xlab=1-Specificity
ylab=Sensitivity
main=NULL )
par(pty=s)
plot(xmspec xsens type=type
xlab=xlab ylab=ylab )
if(nullline)
abline(01lty=3)
if(isnull(main))
mainlt-xcall
title(main=main)
)
Creating a method for lines appears to work thesame way
setMethod(linesROC
function(x)
lines(xmspec xsens)
)
In fact things are more complicated There is no S4generic function for lines (unlike plot and show)When an S4 method is set on a function that is not al-ready an S4 generic a generic function is created Ifyou knew that lines was not already an S4 genericit would be good style to include a specific call tosetGeneric to make clear what is happening Look-ing at the function lines before
gt linesfunction (x )UseMethod(lines)ltenvironment namespacegraphicsgt
and after the setMethod call
gt linesstandardGeneric for lines defined frompackage graphics
function (x )standardGeneric(lines)ltenvironment 0x2fc68a8gtMethods may be defined for arguments x
shows what happensNote that the creation of this S4 generic does not
affect the workings of S3 methods for lines Callingmethods(lines) will still list the S3 methods andcalling getMethods(lines) will list both S3 and S4methods
Adding a method for creating ROC curves frombinomial generalised linear models provides an ex-ample of setGeneric Calling setGeneric creates ageneric ROC function and makes the existing functionthe default method The code is the same as the S3ROCglm except that new is used to create the ROCobject
setGeneric(ROC)
setMethod(ROCc(glmmissing)
function(T)
if ((T$family$family in
c(binomial quasibinomial)))
stop(ROC curves for binomial glms only)
R News ISSN 1609-3631
Vol 41 June 2004 36
testlt-fitted(T)
diseaselt-(test+resid(Ttype=response))
diseaselt-diseaseweights(T)
if (max(abs(disease 1))gt001)
warning(Y values suspiciously far
from integers)
TTlt-rev(sort(unique(test)))
DDlt-table(-testdisease)
senslt-cumsum(DD[2])sum(DD[2])
mspeclt-cumsum(DD[1])sum(DD[1])
new(ROCsens=sens mspec=mspec
test=TTcall=syscall())
)
Finally a method for identify shows one addi-tional feature of setGeneric The signature argu-ment to setGeneric specifies which arguments arepermitted in the signature of a method and thusare used for method dispatch Method dispatch foridentify will be based only on the first argumentwhich saves having to specify a missing secondargument in the method
setGeneric(identifysignature=c(x))
setMethod(identify ROC
function(x labels=NULL
digits=1)
if (isnull(labels))
labelslt-round(xtest digits)
identify(xmspec xsens
labels=labels)
)
Discussion
Creating a simple class and methods requires verysimilar code whether the S3 or S4 system is used anda similar incremental design strategy is possible TheS3 and S4 method system can coexist peacefully evenwhen S4 methods need to be defined for a functionthat already has S3 methods
This example has not used inheritance where theS3 and S4 systems differ more dramatically Judgingfrom the available examples of S4 classes inheritanceseems most useful in defining data structures ratherthan objects representing statistical calculations Thismay be because inheritance extends a class by creat-ing a special case but statisticians more often extenda class by creating a more general case Reusing codefrom say linear models in creating generalised lin-ear models is more an example of delegation than in-heritance It is not that a generalised linear modelis a linear model more that it has a linear model(from the last iteration of iteratively reweighted leastsquares) associated with it
Thomas LumleyDepartment of BiostatisticsUniversity of Washington Seattle
Changes in Rby the R Core Team
New features in 191
bull asDate() now has a method for POSIXlt ob-jects
bull mean() has a method for difftime objects andso summary() works for such objects
bull legend() has a new argument ptcex
bull plotts() has more arguments particularlyyaxflip
bull heatmap() has a new keepdendro argument
bull The default barplot method now handles vec-tors and 1-d arrays (eg obtained by table())the same and uses grey instead of heat colorpalettes in these cases (Also fixes PR6776)
bull nls() now looks for variables and functions inits formula in the environment of the formulabefore the search path in the same way lm()etc look for variables in their formulae
User-visible changes in 190
bull Underscore _ is now allowed in syntacticallyvalid names and makenames() no longerchanges underscores Very old code that makesuse of underscore for assignment may nowgive confusing error messages
bull Package rsquobasersquo has been split into packagesrsquobasersquo rsquographicsrsquo rsquostatsrsquo and rsquoutilsrsquo All four areloaded in a default installation but the separa-tion allows a rsquolean and meanrsquo version of R to beused for tasks such as building indices
Packages ctest eda modreg mva nls stepfunand ts have been merged into stats and lqs has
R News ISSN 1609-3631
Vol 41 June 2004 37
been returned to MASS In all cases a stub hasbeen left that will issue a warning and ensurethat the appropriate new home is loaded Allthe time series datasets have been moved topackage stats Sweave has been moved to utils
Package mle has been moved to stats4 whichwill become the central place for statistical S4classes and methods distributed with base RPackage mle remains as a stub
Users may notice that code in Rprofile is runwith only the new base loaded and so func-tions may now not be found For exam-ple psoptions(horizontal = TRUE) shouldbe preceded by library(graphics) or calledas graphicspsoptions or better set as ahook ndash see setHook
bull There has been a concerted effort to speed upthe startup of an R session it now takes about23rds of the time of 181
bull A warning is issued at startup in a UTF-8 lo-cale as currently R only supports single-byteencodings
New features in 190
bull $ $lt- [[ [[lt- can be applied to environ-ments Only character arguments are allowedand no partial matching is done The semanticsare basically that of getassign to the environ-ment with inherits=FALSE
bull There are now print() and [ methods for acfobjects
bull aov() will now handle singular Error() mod-els with a warning
bull arima() allows models with no free parame-ters to be fitted (to find log-likelihood and AICvalues thanks to Rob Hyndman)
bull array() and matrix() now allow 0-lengthlsquodatarsquo arguments for compatibility with S
bull asdataframe() now has a method for ar-rays
bull asmatrixdataframe() now coerces an all-logical data frame to a logical matrix
bull New function assignInNamespace() paral-lelling fixInNamespace
bull There is a new function contourLines() toproduce contour lines (but not draw anything)This makes the CRAN package clines (with itsclines() function) redundant
bull D() deriv() etc now also differentiateasin() acos() atan() (thanks to a contri-bution of Kasper Kristensen)
bull The package argument to data() is no longerallowed to be a (unquoted) name and so can bea variable name or a quoted character string
bull There is a new class Date to represent dates(without times) plus many utility functionssimilar to those for date-times See Date
bull Deparsing (including using dump() anddput()) an integer vector now wraps it inasinteger() so it will be source()d correctly(Related to PR4361)
bull Deprecated() has a new argument packagewhich is used in the warning message for non-base packages
bull The print() method for difftime objects nowhandles arrays
bull dircreate() is now an internal function(rather than a call to mkdir) on Unix as well ason Windows There is now an option to sup-press warnings from mkdir which may or maynot have been wanted
bull dist() has a new method to calculateMinkowski distances
bull expandgrid() returns appropriate array di-mensions and dimnames in the attributeoutattrs and this is used by thepredict() method for loess to return a suit-able array
bull factanal() loess() and princomp() now ex-plicitly check for numerical inputs they mighthave silently coded factor variables in formu-lae
bull New functions factorial() defined asgamma(x+1) and for S-PLUS compatibilitylfactorial() defined as lgamma(x+1)
bull findInterval(x v) now allows +-Inf val-ues and NAs in x
bull formuladefault() now looks for a termscomponent before a formula argument in thesaved call the component will have lsquorsquo ex-panded and probably will have the original en-vironment set as its environment And what itdoes is now documented
bull glm() arguments etastart and mustart arenow evaluated via the model frame in the sameway as subset and weights
R News ISSN 1609-3631
Vol 41 June 2004 38
bull Functions grep() regexpr() sub() andgsub() now coerce their arguments to char-acter rather than give an error
The perl=TRUE argument now uses charactertables prepared for the locale currently in useeach time it is used rather than those of the Clocale
bull New functions head() and tail() in packagelsquoutilsrsquo (Based on a contribution by PatrickBurns)
bull legend() has a new argument rsquotextcolrsquo
bull methods(class=) now checks for a matchinggeneric and so no longer returns methods fornon-visible generics (and eliminates variousmismatches)
bull A new function mget() will retrieve multiplevalues from an environment
bull modelframe() methods for example those forlm and glm pass relevant parts of ontothe default method (This has long been docu-mented but not done) The default method isnow able to cope with model classes such aslqs and ppr
bull nls() and ppr() have a model argument to al-low the model frame to be returned as part ofthe fitted object
bull POSIXct objects can now have a tzone at-tribute that determines how they will be con-verted and printed This means that date-timeobjects which have a timezone specified willgenerally be regarded as in their original timezone
bull postscript() device output has been modi-fied to work around rounding errors in low-precision calculations in gs gt= 811 (PR5285which is not a bug in R)
It is now documented how to use other Com-puter Modern fonts for example italic ratherthan slanted
bull ppr() now fully supports categorical explana-tory variables
ppr() is now interruptible at suitable places inthe underlying FORTRAN code
bull princomp() now warns if both x and covmatare supplied and returns scores only if the cen-tring used is known
bull psigamma(x deriv=0) a new function gen-eralizes digamma() etc All these (psigammadigamma trigamma) now also work for x lt0
bull pchisq( ncp gt 0) and hence qchisq() nowwork with much higher values of ncp it has be-come much more accurate in the left tail
bull readtable() now allows embedded newlinesin quoted fields (PR4555)
bull repdefault(0-length-vector lengthout=n)now gives a vector of length n and not length0 for compatibility with S
If both each and lengthout have been speci-fied it now recycles rather than fills with NAsfor S compatibility
If both times and lengthout have been spec-ified times is now ignored for S compatibility(Previously padding with NAs was used)
The POSIXct and POSIXlt methods forrep() now pass on to the default method(as expected by PR5818)
bull rgb2hsv() is new an R interface the C APIfunction with the same name
bull User hooks can be set for onLoadlibrary detach and onUnload of pack-agesnamespaces see setHook
bull save() default arguments can now beset using option savedefaults whichis also used by saveimage() if optionsaveimagedefaults is not present
bull New function shQuote() to quote strings to bepassed to OS shells
bull sink() now has a split= argument to directoutput to both the sink and the current outputconnection
bull splitscreen() now works for multiple de-vices at once
bull On some OSes (including Windows and thoseusing glibc) strptime() did not validate datescorrectly so we have added extra code to do soHowever this cannot correct scanning errorsin the OSrsquos strptime (although we have beenable to work around these on Windows) Someexamples are now tested for during configura-tion
bull strsplit() now has fixed and perl argu-ments and split= is optimized
bull subset() now allows a drop argument whichis passed on to the indexing method for dataframes
bull termplot() has an option to smooth the partialresiduals
bull varimax() and promax() add class loadingsto their loadings component
R News ISSN 1609-3631
Vol 41 June 2004 39
bull Model fits now add a dataClasses attributeto the terms which can be used to check thatthe variables supplied for prediction are of thesame type as those used for fitting (It is cur-rently used by predict() methods for classeslm mlm glm and ppr as well asmethods in packages MASS rpart and tree)
bull New command-line argument --max-ppsizeallows the size of the pointer protection stack tobe set higher than the previous limit of 10000
bull The fonts on an X11() device (also jpeg() andpng() on Unix) can be specified by a new ar-gument lsquofontsrsquo defaulting to the value of a newoption X11fonts
bull New functions in the tools pack-age pkgDepends getDepList andinstallFoundDepends These provide func-tionality for assessing dependencies and theavailability of them (either locally or from on-line repositories)
bull The parsed contents of a NAMESPACE file arenow stored at installation and if available usedto speed loading the package so packages withnamespaces should be reinstalled
bull Argument asp although not a graphics param-eter is accepted in the of graphics functionswithout a warning It now works as expectedin contour()
bull Package stats4 exports S4 generics for AIC()and BIC()
bull The Mac OS X version now produces an Rframework for easier linking of R into otherprograms As a result Rapp is now relocat-able
bull Added experimental support for conditionalsin NAMESPACE files
bull Added aslistenvironment to coerce envi-ronments to lists (efficiently)
bull New function addmargins() in the stats pack-age to add marginal summaries to tables egrow and column totals (Based on a contribu-tion by Bendix Carstensen)
bull dendrogam edge and node labelscan now be expressions (to be plot-ted via statsplotNode called fromplotdendrogram) The diamond framesaround edge labels are more nicely scaled hor-izontally
bull Methods defined in the methods package cannow include default expressions for argu-ments If these arguments are missing in
the call the defaults in the selected methodwill override a default in the generic SeesetMethod
bull Changes to package rsquogridrsquo
ndash Renamed pushpopviewport() topushpopViewport()
ndash Added upViewport() downViewport()and seekViewport() to allow creationand navigation of viewport tree (ratherthan just viewport stack)
ndash Added id and idlengths arguments togridpolygon() to allow multiple poly-gons within single gridpolygon() call
ndash Added vpList() vpStack() vpTree()and currentvpTree() to allow creationof viewport bundles that may be pushedat once (lists are pushed in parallel stacksin series)
currentvpTree() returns the currentviewport tree
ndash Added vpPath() to allow specificationof viewport path in downViewport() andseekViewport()
See viewports for an example of its use
NOTE it is also possible to specify a pathdirectly eg something like vp1vp2but this is only advised for interactive use(in case I decide to change the separator in later versions)
ndash Added just argument to gridlayout()to allow justification of layout relative toparent viewport IF the layout is not thesame size as the viewport Therersquos an ex-ample in help(gridlayout)
ndash Allowed the vp slot in a grob to bea viewport name or a vpPath Theinterpretation of these new alterna-tives is to call downViewport() withthe name or vpPath before drawingthe grob and upViewport() the appro-priate amount after drawing the grobHerersquos an example of the possible usagepushViewport(viewport(w=5 h=5
name=A))gridrect()pushViewport(viewport(w=5 h=5
name=B))gridrect(gp=gpar(col=grey))upViewport(2)gridrect(vp=A
gp=gpar(fill=red))gridrect(vp=vpPath(A B)
gp=gpar(fill=blue))
R News ISSN 1609-3631
Vol 41 June 2004 40
ndash Added enginedisplaylist() functionThis allows the user to tell grid NOT touse the graphics engine display list andto handle ALL redraws using its own dis-play list (including redraws after deviceresizes and copies)This provides a way to avoid some of theproblems with resizing a device when youhave used gridconvert() or the grid-Base package or even base functions suchas legend()There is a document discussing the useof display lists in grid on the grid website httpwwwstataucklandacnz~paulgridgridhtml
ndash Changed the implementation of grob ob-jects They are no longer implementedas external references They are nowregular R objects which copy-by-valueThis means that they can be savedloadedlike normal R objects In order to retainsome existing grob behaviour the follow-ing changes were necessary
grobs all now have a name slot Thegrob name is used to uniquely iden-tify a drawn grob (ie a grob on thedisplay list)
gridedit() and gridpack() nowtake a grob name as the firs argumentinstead of a grob (Actually they takea gPath see below)
the grobwidth and grobheightunits take either a grob OR a grobname (actually a gPath see below)Only in the latter case will the unitbe updated if the grob pointed to ismodified
In addition the following features arenow possible with grobs
grobs now save()load() like anynormal R object
many grid() functions now have aGrob() counterpart The grid()version is used for its side-effectof drawing something or modifyingsomething which has been drawn theGrob() version is used for its re-turn value which is a grob Thismakes it more convenient to just workwith grob objects without producingany graphical output (by using theGrob() functions)
there is a gTree object (derived fromgrob) which is a grob that canhave children A gTree also has achildrenvp slot which is a view-port which is pushed and then uped
before the children are drawn this al-lows the children of a gTree to placethemselves somewhere in the view-ports specified in the childrenvp byhaving a vpPath in their vp slot
there is a gPath object which is essen-tially a concatenation of grob namesThis is used to specify the child of (achild of ) a gTree
there is a new API for creat-ingaccessingmodifying grob ob-jects gridadd() gridremove()gridedit() gridget() (and theirGrob() counterparts can be used toadd remove edit or extract a grobor the child of a gTree NOTE thenew gridedit() API is incompati-ble with the previous version
ndash Added stringWidth() stringHeight()grobWidth() and grobHeight() conve-nience functions (they produce strwidthstrheight grobwidth and grobheightunit objects respectively)
ndash Allowed viewports to turn off clipping al-together Possible settings for viewportclip arg are now
on clip to the viewport (was TRUE)inherit clip to whatever parent says
(was FALSE)off turn off clipping
Still accept logical values (and NA mapsto off)
bull R CMD check now runs the (Rd) examples withdefault RNGkind (uniform amp normal) andcodesetseed(1) example( setRNG = TRUE)does the same
bull undoc() in package lsquotoolsrsquo has a new default oflsquousevalues = NULLrsquo which produces a warn-ing whenever the default values of functionarguments differ between documentation andcode Note that this affects R CMD check aswell
bull Testing examples via massage-examplespl (asused by R CMD check) now restores the searchpath after every help file
bull checkS3methods() in package rsquotoolsrsquo nowalso looks for generics in the loaded names-pacespackages listed in the Depends fields ofthe packagersquos DESCRIPTION file when testingan installed package
bull The DESCRIPTION file of packages may con-tain a rsquoSuggestsrsquo field for packages that areused only in examples or vignettes
R News ISSN 1609-3631
Vol 41 June 2004 41
bull Added an option to packagedependencies()to handle the rsquoSuggestsrsquo levels of dependencies
bull Vignette dependencies can now be checkedand obtained via vignetteDepends
bull Option repositories to list URLs for pack-age repositories added
bull packagedescription() has been replaced bypackageDescription()
bull R CMD INSTALLbuild now skip Subversionrsquossvn directories as well as CVS directories
bull arraySubscript and vectorSubscript take anew argument which is a function pointer thatprovides access to character strings (such as thenames vector) rather than assuming these arepassed in
bull R_CheckUserInterrupt is now described inlsquoWriting R Extensionsrsquo and there is a newequivalent subroutine rchkusr for calling fromFORTRAN code
bull hsv2rgb and rgb2hsv are newly in the C API
bull Salloc and Srealloc are provided in Sh aswrappers for S_alloc and S_realloc sincecurrent S versions use these forms
bull The type used for vector lengths is nowR_len_t rather than int to allow for a futurechange
bull The internal header nmathdpqh hasslightly improved macros R_DT_val() andR_DT_Cval() a new R_D_LExp() and improvedR_DT_log() and R_DT_Clog() this improvesaccuracy in several [dpq]-functions for extremearguments
bull printcoefmat() is defunct replaced byprintCoefmat()
bull codes() and codeslt-() are defunct
bull anovalistlm (replaced in 120) is now de-funct
bull glmfitnull() lmfitnull() andlmwfitnull() are defunct
bull printatomic() is defunct
bull The command-line arguments --nsize and--vsize are no longer recognized as synonymsfor --min-nsize and --min-vsize (which re-placed them in 120)
bull Unnecessary methods coefglm andfittedglm have been removed they wereeach identical to the default method
bull Laeigen() is deprecated now eigen() usesLAPACK by default
bull tetragamma() and pentagamma() are depre-cated since they are equivalent to psigamma(deriv=2) and psigamma( deriv=3)
bull LTRUELFALSE in Rmathh have been removedthey were deprecated in 120
bull packagecontents() and packagedescription()have been deprecated
bull The defaults for configure are now--without-zlib --without-bzlib --without-pcre
The included PCRE sources have been updatedto version 45 and PCRE gt= 40 is now requiredif --with-pcre is used
The included zlib sources have been updated to121 and this is now required if --with-zlibis used
bull configure no longer lists bzip2 and PCRE aslsquoadditional capabilitiesrsquo as all builds of R havehad them since 170
bull --with-blasgoto= to use K Gotorsquos optimizedBLAS will now work
The above lists only new features see the lsquoNEWSrsquofile in the R distribution or on the R homepage for alist of bug fixes
Changes on CRANby Kurt Hornik
New contributed packages
AlgDesign Algorithmic experimental designs Cal-culates exact and approximate theory experi-mental designs for D A and I criteria Verylarge designs may be created Experimental de-
signs may be blocked or blocked designs cre-ated from a candidate list using several crite-ria The blocking can be done when whole andwithin plot factors interact By Bob Wheeler
BradleyTerry Specify and fit the Bradley-Terrymodel and structured versions By David Firth
BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

R News ISSN 1609-3631
Vol. 4/1, June 2004
DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.
RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in circular Statistics" (2001) S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.
crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check
given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.
debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S functions (EVIS) by Alexander McNeil; R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others; see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.
fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, and some statistics in linkage analysis and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. The package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
kernlab Kernel-based machine learning methods including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-versus-within-chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal and external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.
mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972-974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.
spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.
Other changes
• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News

by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)
• Emanuele De Rinaldis (Italy)
• Norm Phillips (Canada)
New supporting institutions
• Breast Center at Baylor College of Medicine, Houston, Texas, USA
• Dana-Farber Cancer Institute, Boston, USA
• Department of Biostatistics, Vanderbilt University School of Medicine, USA
• Department of Statistics & Actuarial Science, University of Iowa, USA
• Division of Biostatistics, University of California, Berkeley, USA
• Norwegian Institute of Marine Research, Bergen, Norway
• TERRA Lab, University of Regina - Department of Geography, Canada
• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges
Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
    if (null.line)
      abline(0, 1, lty = 3)
    if (is.null(main))
      main <- x$call
    title(main = main)
  }
A lines method allows ROC curves to be graphed on the same plot and compared. It is a stripped-down version of the plot method:
lines.ROC <- function(x, ...) {
  lines(x$mspec, x$sens, ...)
}
A method for identify allows the cutpoint to be found for any point on the curve:
identify.ROC <- function(x, labels = NULL, ..., digits = 1)
{
  if (is.null(labels))
    labels <- round(x$test, digits)
  identify(x$mspec, x$sens, labels = labels, ...)
}
An AUC function, to compute the area under the ROC curve, is left as an exercise for the reader.
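One possible sketch of such a function (not from the article, which leaves it as an exercise) applies the trapezoidal rule to the sens and mspec components created by the ROC constructor above:

```r
# Hypothetical AUC sketch: trapezoidal integration of sensitivity
# against 1 - specificity, padding the curve to span [0, 1].
AUC <- function(x) {
  mspec <- c(0, x$mspec, 1)
  sens  <- c(0, x$sens, 1)
  heights <- (sens[-1] + sens[-length(sens)]) / 2
  sum(diff(mspec) * heights)
}
```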
Generalising the class
The test variable T in an ROC curve may be a model prediction rather than a single biomarker, and ROC curves have been used to summarise the discriminatory power of logistic regression and survival models. A generic constructor function would allow methods for single variables and for models. Here the default method is the same as the original ROC function:
ROC <- function(T, ...) UseMethod("ROC")

ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
The ROC.glm method extracts the fitted values from a binomial regression model and uses them as the test variable:
ROC.glm <- function(T, ...) {
  if (!(T$family$family %in%
        c("binomial", "quasibinomial")))
    stop("ROC curves for binomial glms only")
  test <- fitted(T)
  disease <- (test + resid(T, type = "response"))
  disease <- disease * weights(T)
  if (max(abs(disease %% 1)) > 0.01)
    warning("Y values suspiciously far from integers")
  TT <- rev(sort(unique(test)))
  DD <- table(-test, disease)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec,
               test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
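A brief usage sketch with simulated data (not from the article; it repeats the S3 constructor above so the chunk stands alone):

```r
# Simulated example of the S3 ROC class; ROC() and ROC.default()
# are as defined in the article.
ROC <- function(T, ...) UseMethod("ROC")
ROC.default <- function(T, D, ...) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  rval <- list(sens = sens, mspec = mspec, test = TT, call = sys.call())
  class(rval) <- "ROC"
  rval
}
set.seed(42)
D <- rbinom(200, 1, 0.4)    # simulated disease indicator
T <- rnorm(200, mean = D)   # test variable, larger when diseased
r <- ROC(T, D)              # dispatches to ROC.default
```

With the plot and lines methods from the article loaded, plot(r) then draws the curve with its dotted null line.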
S4 classes and methods
In the new S4 class system provided by the methods package, classes and methods must be registered. This ensures that an object has the components required by its class, and avoids the ambiguities of the S3 system. For example, it is not possible to tell from the name that t.test.formula is a method for t.test while t.data.frame is a method for t (and t.test.cluster in the Design package is neither).
As a price for this additional clarity, the S4 system takes a little more planning, and can be clumsy when a class needs to have components that are only sometimes present.
The definition of the ROC class is very similar to the calls used to create the ROC objects in the S3 functions:
setClass("ROC",
  representation(sens = "numeric", mspec = "numeric",
                 test = "numeric", call = "call"),
  validity = function(object) {
    length(object@sens) == length(object@mspec) &&
      length(object@sens) == length(object@test)
  })
The first argument to setClass gives the name of the new class. The second describes the components ("slots") that contain the data. Optional arguments include a validity check. In this case the ROC curve contains three numeric vectors and a copy of the call that created it. The validity check makes sure that the lengths of the three vectors agree. It could also check that the vectors were appropriately ordered, or that the true and false positive rates were between 0 and 1.
In contrast to the S3 class system, the S4 system requires all creation of objects to be done by the new
function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[, 2]) / sum(DD[, 2])
  mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
  new("ROC", sens = sens, mspec = mspec,
      test = TT, call = sys.call())
}
Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:
setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })
The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.
The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.
setMethod("plot", c("ROC", "missing"),
  function(x, y, type = "b", null.line = TRUE,
           xlab = "1-Specificity", ylab = "Sensitivity",
           main = NULL, ...) {
    par(pty = "s")
    plot(x@mspec, x@sens, type = type,
         xlab = xlab, ylab = ylab, ...)
    if (null.line)
      abline(0, 1, lty = 3)
    if (is.null(main))
      main <- x@call
    title(main = main)
  })
Creating a method for lines appears to work the same way:
setMethod("lines", "ROC",
  function(x, ...)
    lines(x@mspec, x@sens, ...))
In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before:
> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>
and after the setMethod call:
> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x
shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.
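The "good style" alternative mentioned above would declare the generic explicitly before registering the method; a sketch (repeating the class definition from earlier so the chunk stands alone):

```r
library(methods)
# Define the class so the method signature is meaningful.
setClass("ROC",
         representation(sens = "numeric", mspec = "numeric",
                        test = "numeric", call = "call"))
# Explicitly promote lines() to an S4 generic: graphics::lines
# becomes the default method, and the intent is visible in the code.
setGeneric("lines")
setMethod("lines", "ROC",
          function(x, ...) lines(x@mspec, x@sens, ...))
```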
Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function and makes the existing function the default method. The code is the same as the S3 ROC.glm, except that new is used to create the ROC object:
setGeneric("ROC")

setMethod("ROC", c("glm", "missing"),
  function(T) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type = "response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[, 2]) / sum(DD[, 2])
    mspec <- cumsum(DD[, 1]) / sum(DD[, 1])
    new("ROC", sens = sens, mspec = mspec,
        test = TT, call = sys.call())
  })
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method:
setGeneric("identify", signature = c("x"))

setMethod("identify", "ROC",
  function(x, labels = NULL, ..., digits = 1)
  {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens, labels = labels, ...)
  })
Discussion
Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.
This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R

by the R Core Team
New features in 1.9.1
• as.Date() now has a method for "POSIXlt" objects.
• mean() has a method for "difftime" objects and so summary() works for such objects.
• legend() has a new argument pt.cex.
• plot.ts() has more arguments, particularly yax.flip.
• heatmap() has a new keep.dendro argument.
• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)
• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc look for variables in their formulae.
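The first two 1.9.1 items can be seen in a short session (a sketch; requires R >= 1.9.1):

```r
# as.Date() now accepts "POSIXlt" objects ...
lt <- as.POSIXlt("2004-06-15 12:00:00", tz = "GMT")
d <- as.Date(lt)
# ... and mean() (hence summary()) works on "difftime" objects.
gaps <- diff(as.Date(c("2004-06-01", "2004-06-08", "2004-06-11")))
mean(gaps)      # average gap between the three dates
summary(gaps)
```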
User-visible changes in 1.9.0
• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.
• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has
been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.
Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or better, set as a hook; see ?setHook.
• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.
• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.
New features in 1.9.0
bull $ $lt- [[ [[lt- can be applied to environ-ments Only character arguments are allowedand no partial matching is done The semanticsare basically that of getassign to the environ-ment with inherits=FALSE
• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values; thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments, for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name, and so can be a variable name or a quoted character string.
• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.
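For illustration, a few of the Date utilities in standard use (a sketch, not from the newsletter itself):

```r
d <- as.Date("2004-06-01")
d + 30                            # arithmetic is in whole days: "2004-07-01"
format(d, "%d %b %Y")             # formatted output (month name is locale-dependent)
seq(d, by = "week", length.out = 3)  # regular date sequences
```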
• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument, package, which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).
• findInterval(x, v) now allows +/-Inf values, and NAs in x.
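A quick illustration of the extended inputs (not from the NEWS file):

```r
v <- c(0, 2, 4)
# -Inf falls below the first breakpoint, NA propagates, 10 lands past the last
findInterval(c(-Inf, 1.5, NA, 10), v)   # 0 1 NA 3
```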
• formula.default() now looks for a "terms" component before a formula argument in the saved call; the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.
• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)
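For example (standard usage, shown here for illustration):

```r
head(letters, 3)   # first 3 elements: "a" "b" "c"
tail(letters, 2)   # last 2 elements:  "y" "z"
head(mtcars)       # for data frames: the first 6 rows
```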
• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.
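A short sketch of mget() (illustrative, not from the newsletter):

```r
e <- new.env()
e$a <- 1
e$b <- 2
mget(c("a", "b"), envir = e)   # a named list: list(a = 1, b = 2)
```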
• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• POSIXct objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)
• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).
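The new S-compatible behaviour can be seen directly (illustration, not from the NEWS file):

```r
rep(1:3, each = 2, length.out = 4)   # 1 1 2 2: recycled, not padded with NAs
rep(numeric(0), length.out = 3)      # NA NA NA: length 3, no longer length 0
```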
• rgb2hsv() is new: an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.
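For example, quoting a filename with embedded spaces before building a shell command (a sketch; the wc command is just an illustration):

```r
f <- "my data.csv"
cmd <- paste("wc -l", shQuote(f))   # "wc -l 'my data.csv'"
shQuote("don't", type = "sh")       # embedded single quotes are escaped too
```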
• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split = "" is optimized.
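The fixed argument avoids having to escape regular-expression metacharacters (illustration, not from the newsletter):

```r
strsplit("a.b.c", ".", fixed = TRUE)[[1]]   # "a" "b" "c" (literal dot)
strsplit("abc", "")[[1]]                    # "a" "b" "c" (the optimized split = "")
```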
• subset() now allows a drop argument, which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)
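For example, adding row and column totals to a two-way table (an illustrative sketch using the built-in mtcars data, not from the newsletter):

```r
tab <- table(cyl = mtcars$cyl, gear = mtcars$gear)
addmargins(tab)   # appends a "Sum" row and a "Sum" column
```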
• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.
• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of a viewport tree (rather than just a viewport stack).

– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.

– Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport bundles that may be pushed at once (lists are pushed in parallel, stacks in series).

current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of a viewport path in downViewport() and seekViewport().

See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

– Added a just argument to grid.layout() to allow justification of the layout relative to the parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).
– Allowed the vp slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

  pushViewport(viewport(w=.5, h=.5, name="A"))
  grid.rect()
  pushViewport(viewport(w=.5, h=.5, name="B"))
  grid.rect(gp=gpar(col="grey"))
  upViewport(2)
  grid.rect(vp="A", gp=gpar(fill="red"))
  grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))
– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

* the grobwidth and grobheight units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

In addition, the following features are now possible with grobs:

* grobs now save()/load() like any normal R object.

* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a childrenvp slot which is a viewport which is pushed and then "up"ped before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

– Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for the viewport clip argument are now:

"on": clip to the viewport (was TRUE)
"inherit": clip to whatever the parent says (was FALSE)
"off": turn off clipping

Still accept logical values (and NA maps to "off").
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL', which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories, as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector), rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp(), and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(*, deriv=2) and psigamma(*, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5, and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blas=goto to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN
by Kurt Hornik
New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked, or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.
DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks, GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian multinomial probit model via Markov chain Monte Carlo. Along with the standard multinomial probit model, it can also fit models with different choice sets for each observation, and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows, in particular, parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.
crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.
debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S (EVIS) by Alexander McNeil, R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an environment and software collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others; see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, and some statistics in linkage analysis and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing hua Zhao, in collaboration with other colleagues and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. The package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between- versus within-chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal and external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so-called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and the backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrixes and data.frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.
rlecuyer Provides an interface to the C implemen-tation of the random number generator withmultiple independent streams developed byLrsquoEcuyer et al (2002) The main purpose ofthis package is to enable the use of this ran-dom number generator in parallel R applica-tions By Hana Sevcikova Tony Rossini
rpartpermutation Performs permutation tests ofrpart models By Daniel S Myers
rrcov Functions for Robust Location and Scat-ter Estimation and Robust Regression withHigh Breakdown Point (covMcd() ltsReg())Originally written for S-PLUS (fastlts andfastmcd) by Peter Rousseeuw amp Katrien vanDriessen ported to R adapted and packagedby Valentin Todorov
sandwich Model-robust standard error estimatorsfor time series and longitudinal data ByThomas Lumley Achim Zeileis
seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.
setRNG Set reproducible random number generator in R and S. By Paul Gilbert.
sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.
skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.
spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.
spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.
spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.
supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.
svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.
treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.
urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.
urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.
verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns. Ported to R by Nick Efthymiou.
zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.
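As an illustrative sketch (not from the package announcement), an irregular daily series can be created and queried with the zoo() constructor and its accessors:

```r
## Minimal sketch of zoo usage: an irregularly spaced series
## indexed by "Date" objects.
library(zoo)
dates <- as.Date(c("2004-06-01", "2004-06-03", "2004-06-07"))
z <- zoo(c(1.2, 0.7, 1.5), order.by = dates)
coredata(z)  # the observations
index(z)     # the (irregular) time index
```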
Other changes
• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R News ISSN 1609-3631
Vol. 4/1, June 2004
R Foundation News
by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)
• Emanuele De Rinaldis (Italy)
• Norm Phillips (Canada)
New supporting institutions
• Breast Center at Baylor College of Medicine, Houston, Texas, USA
• Dana-Farber Cancer Institute, Boston, USA
• Department of Biostatistics, Vanderbilt University School of Medicine, USA
• Department of Statistics & Actuarial Science, University of Iowa, USA
• Division of Biostatistics, University of California, Berkeley, USA
• Norwegian Institute of Marine Research, Bergen, Norway
• TERRA Lab, University of Regina - Department of Geography, Canada
• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232
USA

Editorial Board: Douglas Bates and Paul Murrell.

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage: http://www.R-project.org/

This newsletter is available online at http://CRAN.R-project.org/doc/Rnews/
function. This checks that the correct components are present and runs any validity checks. Here is the translation of the ROC function:
ROC <- function(T, D) {
  TT <- rev(sort(unique(T)))
  DD <- table(-T, D)
  sens <- cumsum(DD[,2])/sum(DD[,2])
  mspec <- cumsum(DD[,1])/sum(DD[,1])
  new("ROC", sens=sens, mspec=mspec,
      test=TT, call=sys.call())
}
Rather than print, S4 objects use show for display. Here is a translation of the print method from the S3 version of the code:
setMethod("show", "ROC",
  function(object) {
    cat("ROC curve: ")
    print(object@call)
  })
The first argument of setMethod is the name of the generic function (show). The second is the signature, which specifies when the method should be called. The signature is a vector of class names, in this case a single name, since the generic function show has only one argument. The third argument is the method itself. In this example it is an anonymous function; it could also be a named function. The notation object@call is used to access the slots defined in setClass, in the same way that $ is used for list components. The @ notation should be used only in methods, although this is not enforced in current versions of R.

The plot method shows some of the added flexibility of the S4 method system. The generic function plot has two arguments, and the chosen method can be based on the classes of either or both. In this case a method is needed for the case where the first argument is an ROC object and there is no second argument, so the signature of the method is c("ROC", "missing"). In addition to the two arguments x and y of the generic function, the method has all the arguments needed to customize the plot, including a ... argument for further graphical parameters.
setMethod("plot", c("ROC", "missing"),
  function(x, y, type="b", null.line=TRUE,
           xlab="1-Specificity",
           ylab="Sensitivity",
           main=NULL, ...) {
    par(pty="s")
    plot(x@mspec, x@sens, type=type,
         xlab=xlab, ylab=ylab, ...)
    if(null.line)
      abline(0, 1, lty=3)
    if(is.null(main))
      main <- x@call
    title(main=main)
  })
Creating a method for lines appears to work the same way:
setMethod("lines", "ROC",
  function(x, ...)
    lines(x@mspec, x@sens, ...))
In fact, things are more complicated. There is no S4 generic function for lines (unlike plot and show). When an S4 method is set on a function that is not already an S4 generic, a generic function is created. If you knew that lines was not already an S4 generic, it would be good style to include a specific call to setGeneric, to make clear what is happening. Looking at the function lines before

> lines
function (x, ...)
UseMethod("lines")
<environment: namespace:graphics>

and after the setMethod call

> lines
standardGeneric for "lines" defined from package "graphics"

function (x, ...)
standardGeneric("lines")
<environment: 0x2fc68a8>
Methods may be defined for arguments: x

shows what happens.

Note that the creation of this S4 generic does not affect the workings of S3 methods for lines. Calling methods("lines") will still list the S3 methods, and calling getMethods("lines") will list both S3 and S4 methods.

Adding a method for creating ROC curves from binomial generalised linear models provides an example of setGeneric. Calling setGeneric creates a generic ROC function, and makes the existing function the default method. The code is the same as the S3 ROC.glm except that new is used to create the ROC object.
setGeneric("ROC")
setMethod("ROC", c("glm", "missing"),
  function(T) {
    if (!(T$family$family %in%
          c("binomial", "quasibinomial")))
      stop("ROC curves for binomial glms only")
    test <- fitted(T)
    disease <- (test + resid(T, type = "response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
      warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[,2])/sum(DD[,2])
    mspec <- cumsum(DD[,1])/sum(DD[,1])
    new("ROC", sens = sens, mspec = mspec,
        test = TT, call = sys.call())
  })
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a "missing" second argument in the method.

setGeneric("identify", signature = c("x"))
setMethod("identify", "ROC",
  function(x, labels = NULL, ...,
           digits = 1) {
    if (is.null(labels))
      labels <- round(x@test, digits)
    identify(x@mspec, x@sens,
             labels = labels, ...)
  })
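Putting the pieces together, a short session might look like the following sketch (assuming the setClass("ROC", ...) definition from earlier in the article has been run; the simulated data are illustrative only):

```r
## Sketch of using the S4 ROC class and methods defined above.
set.seed(42)
T <- rnorm(100)                      # marker values
D <- rbinom(100, 1, plogis(2 * T))   # simulated disease status
r <- ROC(T, D)
r                                    # dispatches to the show() method
plot(r)                              # the c("ROC", "missing") plot method
lines(ROC(T + rnorm(100), D))        # overlay a noisier marker
```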
Discussion
Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.

Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R
by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects and so summary() works for such objects.
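For instance (an illustrative sketch, not from the NEWS file):

```r
## mean() on a "difftime" object now works directly
d <- as.difftime(c(30, 45, 60), units = "mins")
mean(d)     # a difftime of 45 minutes
summary(d)
```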
• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.

User-visible changes in 1.9.0

• Underscore _ is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.

• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has
been returned to MASS. In all cases, a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook; see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.

New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.
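A brief sketch of the new environment accessors:

```r
## $ and [[ on environments behave like get/assign
## with inherits = FALSE
e <- new.env()
e$x <- 1                                  # like assign("x", 1, envir = e)
e[["x"]]                                  # like get("x", envir = e): 1
exists("x", envir = e, inherits = FALSE)  # TRUE
```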
• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values, thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace() paralleling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times) plus many utility functions similar to those for date-times. See ?Date.
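A small sketch of the new class (character output shown for an English locale):

```r
## "Date" objects support arithmetic, formatting and extraction
d <- as.Date("2004-06-30")
d + 1                    # "2004-07-01"
format(d, "%d %B %Y")    # "30 June 2004"
weekdays(d)              # "Wednesday"
```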
• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.

• New functions factorial() defined as gamma(x+1) and, for S-PLUS compatibility, lfactorial() defined as lgamma(x+1).
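For example:

```r
## factorial() is gamma(x+1), so it also accepts non-integer x
factorial(5)                        # 120
all.equal(factorial(5), gamma(6))   # TRUE
lfactorial(100)                     # log(100!), without overflow
```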
• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call; the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.
• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.

• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)

• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.
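For instance:

```r
## mget() fetches several bindings in one call,
## returning a named list
a <- 1; b <- 2
vals <- mget(c("a", "b"), envir = environment())
vals$a  # 1
vals$b  # 2
```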
• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for S compatibility.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).

• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.
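For example:

```r
## shQuote() protects shell metacharacters in file names
f <- "my file.txt"
shQuote(f)                         # "'my file.txt'"
cmd <- paste("ls -l", shQuote(f))
## system(cmd) would then see the name as a single argument
```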
• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments and split="" is optimized.
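For instance:

```r
## fixed = TRUE treats the split string literally
## instead of as a regular expression
strsplit("a.b.c", ".", fixed = TRUE)[[1]]  # "a" "b" "c"
## without fixed, "." is a regexp matching any character,
## so every piece is empty
strsplit("a.b.c", ".")[[1]]
```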
• subset() now allows a drop argument which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.

• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport(), and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.

– Added vpList(), vpStack(), vpTree(), and current.vpTree() to allow creation of viewport bundles that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator :: in later versions).

– Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

– Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

pushViewport(viewport(w=.5, h=.5, name="A"))
grid.rect()
pushViewport(viewport(w=.5, h=.5, name="B"))
grid.rect(gp=gpar(col="grey"))
upViewport(2)
grid.rect(vp="A", gp=gpar(fill="red"))
grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))
– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

* the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

In addition, the following features are now possible with grobs:

* grobs now save()/load() like any normal R object.

* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

– Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth", and "grobheight" unit objects, respectively).

– Allowed viewports to turn off clipping altogether. Possible settings for viewport clip arg are now:

"on": clip to the viewport (was TRUE)
"inherit": clip to whatever parent says (was FALSE)
"off": turn off clipping

Still accept logical values (and NA maps to "off").

• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option 'repositories' to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories, as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(*, deriv=2) and psigamma(*, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities' as all builds of R have had them since 1.7.0.

• --with-blas="goto" to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN
by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination
follow-up designs By Ernesto Barrios based onDaniel Meyerrsquos code
DCluster A set of functions for the detection of spa-tial clusters of disease using count data Boot-strap is used to estimate sampling distributionsof statistics By Virgilio Goacutemez-Rubio JuanFerraacutendiz Antonio Loacutepez
GeneTS A package for analyzing multiple gene ex-pression time series data Currently GeneTSimplements methods for cell cycle analy-sis and for inferring large sparse graphi-cal Gaussian models For plotting the in-ferred genetic networks GeneTS requires thegraph and Rgraphviz packages (availablefrom wwwbioconductororg) By Konstanti-nos Fokianos Juliane Schaefer and KorbinianStrimmer
HighProbability Provides a simple fast reliable so-lution to the multiple testing problem Given avector of p-values or achieved significance lev-els computed using standard frequentist infer-ence HighProbability determines which onesare low enough that their alternative hypothe-ses can be considered highly probable The p-value vector may be determined using exist-ing R functions such as ttest wilcoxtestcortest or sample HighProbability can beused to detect differential gene expression andto solve other problems involving a large num-ber of hypothesis tests By David R Bickel
Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences; title: "The use of the language interface of R: two examples for modelling water flux and solute transport". By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.
crossdes Contains functions for the construction and randomization of carryover balanced designs, and functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.
debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S functions (EVIS) by Alexander McNeil; R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an environment and software collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others; see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, and some statistics in linkage analysis and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. The package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph, from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between versus within chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so-called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data.frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972-974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns; ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.
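As a sketch of the kind of irregularly indexed series zoo handles (this example is not from the newsletter; it assumes the zoo package is installed, and the dates and values are made up for illustration):

```r
library(zoo)

# An irregularly spaced daily series, indexed by "Date" objects.
z <- zoo(c(1.2, 2.5, 3.1),
         as.Date(c("2004-06-01", "2004-06-03", "2004-06-10")))
print(z)

# Standard generics respect the ordered index; this drops the first observation.
print(window(z, start = as.Date("2004-06-02")))
```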
Other changes
• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News
by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balázs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232
USA

Editorial Board: Douglas Bates and Paul Murrell.

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
R News ISSN 1609-3631
    test <- fitted(T)
    disease <- (test + resid(T, type = "response"))
    disease <- disease * weights(T)
    if (max(abs(disease %% 1)) > 0.01)
        warning("Y values suspiciously far from integers")
    TT <- rev(sort(unique(test)))
    DD <- table(-test, disease)
    sens <- cumsum(DD[, 2])/sum(DD[, 2])
    mspec <- cumsum(DD[, 1])/sum(DD[, 1])
    new("ROC", sens = sens, mspec = mspec,
        test = TT, call = sys.call())
})
Finally, a method for identify shows one additional feature of setGeneric. The signature argument to setGeneric specifies which arguments are permitted in the signature of a method, and thus are used for method dispatch. Method dispatch for identify will be based only on the first argument, which saves having to specify a missing second argument in the method.
setGeneric("identify", signature = c("x"))
setMethod("identify", "ROC",
    function(x, ..., labels = NULL,
             digits = 1) {
        if (is.null(labels))
            labels <- round(x@test, digits)
        identify(x@mspec, x@sens,
                 labels = labels, ...)
})
Discussion
Creating a simple class and methods requires very similar code whether the S3 or S4 system is used, and a similar incremental design strategy is possible. The S3 and S4 method systems can coexist peacefully, even when S4 methods need to be defined for a function that already has S3 methods.

This example has not used inheritance, where the S3 and S4 systems differ more dramatically. Judging from the available examples of S4 classes, inheritance seems most useful in defining data structures, rather than objects representing statistical calculations. This may be because inheritance extends a class by creating a special case, but statisticians more often extend a class by creating a more general case. Reusing code from, say, linear models in creating generalised linear models is more an example of delegation than inheritance. It is not that a generalised linear model "is" a linear model, more that it "has" a linear model (from the last iteration of iteratively reweighted least squares) associated with it.
Thomas Lumley
Department of Biostatistics
University of Washington, Seattle
Changes in R
by the R Core Team

New features in 1.9.1

• as.Date() now has a method for "POSIXlt" objects.

• mean() has a method for "difftime" objects, and so summary() works for such objects.

• legend() has a new argument pt.cex.

• plot.ts() has more arguments, particularly yax.flip.

• heatmap() has a new keep.dendro argument.

• The default barplot method now handles vectors and 1-d arrays (e.g., obtained by table()) the same, and uses grey instead of heat color palettes in these cases. (Also fixes PR#6776.)

• nls() now looks for variables and functions in its formula in the environment of the formula before the search path, in the same way lm() etc. look for variables in their formulae.
User-visible changes in 1.9.0

• Underscore '_' is now allowed in syntactically valid names, and make.names() no longer changes underscores. Very old code that makes use of underscore for assignment may now give confusing error messages.
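A quick sketch of the new naming rules (this example is not from the newsletter; it assumes R 1.9.0 or later):

```r
## Underscore is now an ordinary name character, not an assignment operator.
my_value <- 42
print(my_value)

## make.names() now leaves underscores alone.
print(make.names("col_1"))   # "col_1"
```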
• Package 'base' has been split into packages 'base', 'graphics', 'stats' and 'utils'. All four are loaded in a default installation, but the separation allows a 'lean and mean' version of R to be used for tasks such as building indices.

Packages ctest, eda, modreg, mva, nls, stepfun and ts have been merged into stats, and lqs has been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics), or called as graphics::ps.options, or, better, set as a hook; see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.
New features in 1.9.0

• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.
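For example, the new environment subsetting behaves like get()/assign() with inherits=FALSE (a sketch, not from the newsletter; the variable names are illustrative):

```r
e <- new.env()
e$x <- 10          # same as assign("x", 10, envir = e, inherits = FALSE)
print(e$x)         # 10
print(e[["x"]])    # 10
e[["y"]] <- 2 * e$x
print(e$y)         # 20
```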
• There are now print() and [ methods for "acf" objects.

• aov() will now handle singular Error() models, with a warning.

• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values; thanks to Rob Hyndman).

• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.

• as.data.frame() now has a method for arrays.

• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.

• New function assignInNamespace(), parallelling fixInNamespace().

• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.

• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).

• The package argument to data() is no longer allowed to be a (unquoted) name, and so can be a variable name or a quoted character string.

• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.

• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)

• .Deprecated() has a new argument package, which is used in the warning message for non-base packages.

• The print() method for "difftime" objects now handles arrays.

• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.

• dist() has a new method to calculate Minkowski distances.

• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.

• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.
• New functions factorial(), defined as gamma(x+1), and for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).
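The definitions can be checked directly (a sketch, not from the newsletter):

```r
print(factorial(5))      # 120
print(lfactorial(5))     # log(120), about 4.787

# factorial(x) is gamma(x+1); lfactorial(x) is its logarithm.
stopifnot(all.equal(factorial(5), gamma(6)))
stopifnot(all.equal(lfactorial(5), log(factorial(5))))
```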
• findInterval(x, v) now allows +/-Inf values, and NAs in x.

• formula.default() now looks for a "terms" component before a formula argument in the saved call; the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.

• glm() arguments etastart and mustart are now evaluated via the model frame, in the same way as subset and weights.
• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.
• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)
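For example (a sketch, not from the newsletter; with no second argument both functions default to six elements):

```r
x <- 1:10
print(head(x, 3))    # 1 2 3
print(tail(x, 3))    # 8 9 10
print(head(letters)) # "a" "b" "c" "d" "e" "f"
```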
• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.

• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... onto the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• "POSIXct" objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.

• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma, ...) now also work for x < 0.

• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)

• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for S compatibility.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).
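The S-compatible rep() behaviour can be illustrated as follows (a sketch, not from the newsletter):

```r
## A zero-length vector recycled to the requested length gives NAs, not length 0.
print(rep(integer(0), length.out = 3))     # NA NA NA

## With both 'each' and 'length.out', values recycle instead of padding with NA.
print(rep(1:2, each = 2, length.out = 5))  # 1 1 2 2 1
```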
• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.

• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc) strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.
• strsplit() now has fixed and perl arguments, and split = "" is optimized.
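For example (a sketch, not from the newsletter; fixed = TRUE treats the split string literally rather than as a regular expression):

```r
print(strsplit("a.b.c", ".", fixed = TRUE))  # splits on the literal dot: "a" "b" "c"
print(strsplit("abc", ""))                   # the optimized split = "" case: "a" "b" "c"
```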
• subset() now allows a drop argument, which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).

• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)

• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.
• Changes to package 'grid':

– Renamed push/pop.viewport() to push/popViewport().

– Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within single grid.polygon() call.

– Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series).

current.vpTree() returns the current viewport tree.

– Added vpPath() to allow specification of viewport path in downViewport() and seekViewport().

See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

– Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).

– Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

    pushViewport(viewport(w=.5, h=.5, name="A"))
    grid.rect()
    pushViewport(viewport(w=.5, h=.5, name="B"))
    grid.rect(gp=gpar(col="grey"))
    upViewport(2)
    grid.rect(vp="A", gp=gpar(fill="red"))
    grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))
ndash Added enginedisplaylist() functionThis allows the user to tell grid NOT touse the graphics engine display list andto handle ALL redraws using its own dis-play list (including redraws after deviceresizes and copies)This provides a way to avoid some of theproblems with resizing a device when youhave used gridconvert() or the grid-Base package or even base functions suchas legend()There is a document discussing the useof display lists in grid on the grid website httpwwwstataucklandacnz~paulgridgridhtml
– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:
* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).
* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)
* the grobwidth and grobheight units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.
In addition, the following features are now possible with grobs:
* grobs now save()/load() like any normal R object.
* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).
* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.
* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.
* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.
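A short sketch of working with grobs and the gTree/gPath machinery (the grob names "box", "label", and "panel" are invented for the example):

```r
library(grid)
grid.newpage()
# Build grobs without drawing anything, using the *Grob() constructors...
r <- rectGrob(gp = gpar(fill = "lightgrey"), name = "box")
t <- textGrob("hello", name = "label")
# ...collect them as children of a gTree...
g <- gTree(children = gList(r, t), name = "panel")
# ...and draw the whole tree at once
grid.draw(g)
# Edit a child of the drawn gTree via a gPath of grob names
grid.edit(gPath("panel", "label"), gp = gpar(col = "red"))
```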
– Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth", and "grobheight" unit objects, respectively).
– Allowed viewports to turn off clipping altogether. Possible settings for the viewport clip arg are now:
"on" = clip to the viewport (was TRUE)
"inherit" = clip to whatever parent says (was FALSE)
"off" = turn off clipping
Still accept logical values (and NA maps to "off").
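A sketch of the clip settings in use (the circle radius is chosen so that the difference is visible):

```r
library(grid)
grid.newpage()
pushViewport(viewport(width = 0.5, height = 0.5, clip = "on"))
grid.circle(r = 0.7)   # clipped at the viewport edges
popViewport()
pushViewport(viewport(width = 0.5, height = 0.5, clip = "off"))
grid.circle(r = 0.7)   # drawn in full; clipping is off entirely
popViewport()
```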
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.
• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.
• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.
• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.
• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.
• Vignette dependencies can now be checked and obtained via vignetteDepends().
• Option 'repositories' to list URLs for package repositories added.
• package.description() has been replaced by packageDescription().
• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.
• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.
• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.
• hsv2rgb and rgb2hsv are newly in the C API.
• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.
• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.
• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.
• print.coefmat() is defunct, replaced by printCoefmat().
• codes() and codes<-() are defunct.
• anovalist.lm (replaced in 1.2.0) is now defunct.
• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.
• print.atomic() is defunct.
• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).
• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.
• La.eigen() is deprecated; now eigen() uses LAPACK by default.
• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(x, deriv=2) and psigamma(x, deriv=3).
• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.
• package.contents() and package.description() have been deprecated.
• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.
The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.
The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.
• configure no longer lists bzip2 and PCRE as 'additional capabilities' as all builds of R have had them since 1.7.0.
• --with-blas='goto' to use K. Goto's optimized BLAS will now work.
The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN
by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.

BsMD Bayes screening and model discrimination
follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.

DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001) S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.

crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check
given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.

debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object oriented implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. S original (EVIS) by Alexander McNeil; R port by Alec Stephenson.

fBasics fBasics package from Rmetrics — Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis; fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between versus within chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory efficient refdata type that transparently encapsulates matrixes and data.frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns. Ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.

Other changes

• Packages mclust1998, netCDF, and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News
by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)
bull Emanuele De Rinaldis (Italy)
bull Norm Phillips (Canada)
New supporting institutions
bull Breast Center at Baylor College of MedicineHouston Texas USA
bull Dana-Farber Cancer Institute Boston USA
bull Department of Biostatistics Vanderbilt Univer-sity School of Medicine USA
bull Department of Statistics amp Actuarial ScienceUniversity of Iowa USA
bull Division of Biostatistics University of Califor-nia Berkeley USA
bull Norwegian Institute of Marine ResearchBergen Norway
bull TERRA Lab University of Regina - Depart-ment of Geography Canada
bull ViaLactia Biosciences (NZ) Ltd AucklandNew Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232
USA

Editorial Board:
Douglas Bates and Paul Murrell.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
been returned to MASS. In all cases a stub has been left that will issue a warning and ensure that the appropriate new home is loaded. All the time series datasets have been moved to package stats. Sweave has been moved to utils.

Package mle has been moved to stats4, which will become the central place for statistical S4 classes and methods distributed with base R. Package mle remains as a stub.

Users may notice that code in .Rprofile is run with only the new base loaded, and so functions may now not be found. For example, ps.options(horizontal = TRUE) should be preceded by library(graphics) or called as graphics::ps.options, or, better, set as a hook – see ?setHook.

• There has been a concerted effort to speed up the startup of an R session: it now takes about 2/3rds of the time of 1.8.1.

• A warning is issued at startup in a UTF-8 locale, as currently R only supports single-byte encodings.
New features in 1.9.0
• $, $<-, [[, [[<- can be applied to environments. Only character arguments are allowed and no partial matching is done. The semantics are basically that of get/assign to the environment with inherits=FALSE.
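A quick sketch of the new environment semantics:

```r
# $ and [[ on environments behave like get()/assign() with inherits = FALSE
e <- new.env()
e$x <- 1
e[["y"]] <- 2
s <- e$x + e[["y"]]       # 3
# No partial matching: e$xl is NULL unless "xl" itself exists in e
partial <- is.null(e$xl)  # TRUE
```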
• There are now print() and [ methods for acf objects.
• aov() will now handle singular Error() models, with a warning.
• arima() allows models with no free parameters to be fitted (to find log-likelihood and AIC values; thanks to Rob Hyndman).
• array() and matrix() now allow 0-length 'data' arguments for compatibility with S.
• as.data.frame() now has a method for arrays.
• as.matrix.data.frame() now coerces an all-logical data frame to a logical matrix.
• New function assignInNamespace() paralleling fixInNamespace().
• There is a new function contourLines() to produce contour lines (but not draw anything). This makes the CRAN package clines (with its clines() function) redundant.
• D(), deriv(), etc. now also differentiate asin(), acos(), atan() (thanks to a contribution of Kasper Kristensen).
• The package argument to data() is no longer allowed to be a (unquoted) name and so can be a variable name or a quoted character string.
• There is a new class "Date" to represent dates (without times), plus many utility functions similar to those for date-times. See ?Date.
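A brief sketch of the "Date" class (the date values are made up for illustration):

```r
d <- as.Date("2004-06-01")
d30 <- d + 30                              # arithmetic is in days: "2004-07-01"
dseq <- seq(d, by = "week", length.out = 3)  # utilities parallel the date-time ones
wd <- weekdays(d)                          # locale-dependent day name
```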
• Deparsing (including using dump() and dput()) an integer vector now wraps it in as.integer() so it will be source()d correctly. (Related to PR#4361.)
• Deprecated() has a new argument package which is used in the warning message for non-base packages.
• The print() method for "difftime" objects now handles arrays.
• dir.create() is now an internal function (rather than a call to mkdir) on Unix as well as on Windows. There is now an option to suppress warnings from mkdir, which may or may not have been wanted.
• dist() has a new method to calculate Minkowski distances.
• expand.grid() returns appropriate array dimensions and dimnames in the attribute "out.attrs", and this is used by the predict() method for loess to return a suitable array.
• factanal(), loess() and princomp() now explicitly check for numerical inputs; they might have silently coded factor variables in formulae.
• New functions factorial(), defined as gamma(x+1), and, for S-PLUS compatibility, lfactorial(), defined as lgamma(x+1).
• findInterval(x, v) now allows +/-Inf values, and NAs in x.
• formula.default() now looks for a "terms" component before a formula argument in the saved call: the component will have '.' expanded and probably will have the original environment set as its environment. And what it does is now documented.
• glm() arguments etastart and mustart are now evaluated via the model frame in the same way as subset and weights.
bull Functions grep() regexpr() sub() andgsub() now coerce their arguments to char-acter rather than give an error
The perl=TRUE argument now uses charactertables prepared for the locale currently in useeach time it is used rather than those of the Clocale
bull New functions head() and tail() in packagelsquoutilsrsquo (Based on a contribution by PatrickBurns)
bull legend() has a new argument rsquotextcolrsquo
bull methods(class=) now checks for a matchinggeneric and so no longer returns methods fornon-visible generics (and eliminates variousmismatches)
bull A new function mget() will retrieve multiplevalues from an environment
bull modelframe() methods for example those forlm and glm pass relevant parts of ontothe default method (This has long been docu-mented but not done) The default method isnow able to cope with model classes such aslqs and ppr
bull nls() and ppr() have a model argument to al-low the model frame to be returned as part ofthe fitted object
bull POSIXct objects can now have a tzone at-tribute that determines how they will be con-verted and printed This means that date-timeobjects which have a timezone specified willgenerally be regarded as in their original timezone
• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)
It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.
• ppr() now fully supports categorical explanatory variables.
ppr() is now interruptible at suitable places in the underlying FORTRAN code.
• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.
• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma) now also work for x < 0.
• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.
• read.table() now allows embedded newlines in quoted fields. (PR#4555)
• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for S compatibility.
If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.
If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously padding with NAs was used.)
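The rep() compatibility changes above can be seen directly (a sketch of the new behaviour):

```r
rep(numeric(0), length.out = 3)      # NA NA NA: length 3, no longer length 0
rep(1:2, each = 2, length.out = 6)   # 1 1 2 2 1 1: recycles instead of padding with NA
rep(1:2, times = 5, length.out = 3)  # 1 2 1: times is ignored
```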
The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).
• rgb2hsv() is new, an R interface to the C API function with the same name.
• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.
• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.
• New function shQuote() to quote strings to be passed to OS shells.
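shQuote() makes it safer to build shell commands from file names containing spaces or quotes; for example (an illustrative file name, quoting shown for Unix-alike shells):

```r
f <- "my data.txt"
shQuote(f)                           # "'my data.txt'" on Unix-alikes
system(paste("wc -l", shQuote(f)))   # the space no longer splits the argument
```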
• sink() now has a split= argument to direct output to both the sink and the current output connection.
• split.screen() now works for multiple devices at once.
• On some OSes (including Windows and those using glibc), strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.
• strsplit() now has fixed and perl arguments, and split="" is optimized.
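The fixed argument avoids surprises when the split string contains regular-expression metacharacters, as in this small sketch:

```r
strsplit("a.b.c", ".", fixed = TRUE)[[1]]   # "a" "b" "c"
strsplit("a.b.c", ".")[[1]]                 # "." is a regexp matching any character,
                                            # so only empty strings remain
```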
• subset() now allows a drop argument which is passed on to the indexing method for data frames.
• termplot() has an option to smooth the partial residuals.
• varimax() and promax() add class "loadings" to their loadings component.
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)
• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.
• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts' defaulting to the value of a new option "X11fonts".
• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).
• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.
• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().
• Package stats4 exports S4 generics for AIC() and BIC().
• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.
• Added experimental support for conditionals in NAMESPACE files.
• Added as.list.environment to coerce environments to lists (efficiently).
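For example (an illustrative sketch with made-up variable names):

```r
e <- new.env()
assign("x", 1, envir = e)
assign("y", "two", envir = e)
as.list(e)   # a named list with components x and y (order is not guaranteed)
```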
• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)
• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.
• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.
• Changes to package 'grid':
– Renamed push/popviewport() to push/popViewport().
– Added upViewport(), downViewport(), and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).
– Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.
– Added vpList(), vpStack(), vpTree(), and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series).
current.vpTree() returns the current viewport tree.
– Added vpPath() to allow specification of viewport path in downViewport() and seekViewport().
See ?viewports for an example of its use.
NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).
– Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).
– Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:
  pushViewport(viewport(w=.5, h=.5, name="A"))
  grid.rect()
  pushViewport(viewport(w=.5, h=.5, name="B"))
  grid.rect(gp=gpar(col="grey"))
  upViewport(2)
  grid.rect(vp="A", gp=gpar(fill="red"))
  grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))
– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html
– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:
* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).
* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)
* the grobwidth and grobheight units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob "pointed to" is modified.
In addition, the following features are now possible with grobs:
* grobs now save()/load() like any normal R object.
* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).
* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a "childrenvp" slot which is a viewport which is pushed, and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.
* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.
* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.
– Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth", and "grobheight" unit objects, respectively).
– Allowed viewports to turn off clipping altogether. Possible settings for the viewport clip arg are now:
  "on" = clip to the viewport (was TRUE)
  "inherit" = clip to whatever parent says (was FALSE)
  "off" = turn off clipping
Still accept logical values (and NA maps to "off").
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.
• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.
• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.
• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.
• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.
• Vignette dependencies can now be checked and obtained via vignetteDepends.
• Option "repositories" to list URLs for package repositories added.
• package.description() has been replaced by packageDescription().
• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.
• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector), rather than assuming these are passed in.
• R_CheckUserInterrupt is now described in 'Writing R Extensions', and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.
• hsv2rgb and rgb2hsv are newly in the C API.
• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.
• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.
• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.
• print.coefmat() is defunct, replaced by printCoefmat().
• codes() and codes<-() are defunct.
• anovalist.lm (replaced in 1.2.0) is now defunct.
• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.
• print.atomic() is defunct.
• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).
• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.
• La.eigen() is deprecated; now eigen() uses LAPACK by default.
• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(x, deriv=2) and psigamma(x, deriv=3).
• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.
• package.contents() and package.description() have been deprecated.
• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.
The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.
The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.
• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.
• --with-blasgoto= to use K. Goto's optimized BLAS will now work.
The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN
by Kurt Hornik
New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.
BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.
BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.
DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.
GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer and Korbinian Strimmer.
HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.
Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.
MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.
MNP A publicly available R package that fits the Bayesian multinomial probit model via Markov chain Monte Carlo. Along with the standard multinomial probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.
NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.
R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.
RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath and David Bauer.
RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.
SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.
SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: "The use of the language interface of R: two examples for modelling water flux and solute transport". By Martin Schlather, Bernd Huwe.
SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.
assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.
bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.
betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.
circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.
crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.
debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control and full keyboard input. By Mark V. Bravington.
distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.
dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.
energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.
evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.
evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S functions (EVIS) by Alexander McNeil, R port by Alec Stephenson.
fBasics fBasics package from Rmetrics. Rmetrics is an environment and software collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others; see the SOURCE file.
fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.
fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.
fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.
fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.
gap This is an integrated package for genetic data analysis of both population and family data. Currently it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, and some statistics in linkage analysis and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team.
gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. The package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.
gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.
hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.
haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.
hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.
httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.
impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.
klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.
knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.
kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.
mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.
mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou.
mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-chain versus within-chain approach of Gelman and Rubin. By Gregory R. Warnes.
mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.
mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal and external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.
multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.
mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.
mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.
pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.
qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.
race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.
rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR, by Andrew Gelman).
ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data frames. By Jens Oehlschlägel.
regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Provides an easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.
rgl 3D visualization device (OpenGL). By Daniel Adler.
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.
rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.
rrcov Functions for robust location and scatter estimation and robust regression with high breakdown point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.
sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.
seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld, D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972-974. By David A. Schoenfeld.
setRNG Set reproducible random number generator in R and S. By Paul Gilbert.
sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.
skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.
spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.
spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.
spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS 2002, 99, pg. 15869, and J. Comput. Chem. 2003, 24, pg. 1215. By Rajarshi Guha.
supclust Methodology for supervised grouping of predictor variables. By Marcel Dettling and Martin Maechler.
svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.
treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.
urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.
urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.
verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou.
zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.

Other changes

• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News
by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)
• Emanuele De Rinaldis (Italy)
• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA
• Dana-Farber Cancer Institute, Boston, USA
• Department of Biostatistics, Vanderbilt University School of Medicine, USA
• Department of Statistics & Actuarial Science, University of Iowa, USA
• Division of Biostatistics, University of California, Berkeley, USA
• Norwegian Institute of Marine Research, Bergen, Norway
• TERRA Lab, University of Regina, Department of Geography, Canada
• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at

Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board:
Douglas Bates and Paul Murrell.

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
- Editorial
- The Decision To Use R
-
- Preface
- Some Background on the Company
- Evaluating The Tools of the Trade - A Value Based Equation
- An Introduction To R
- A Continued Evolution
- Giving Back
- Thanks
-
- The ade4 package - I One-table methods
-
- Introduction
- The duality diagram class
- Distance matrices
- Taking into account groups of individuals
- Permutation tests
- Conclusion
-
- qcc An R package for quality control charting and statistical process control
-
- Introduction
- Creating a qcc object
- Shewhart control charts
- Operating characteristic function
- Cusum charts
- EWMA charts
- Process capability analysis
- Pareto charts and cause-and-effect diagrams
- Tuning and extending the package
- Summary
- References
-
- Least Squares Calculations in R
- Tools for interactively exploring R packages
-
- References
-
- The survival package
-
- Overview
- Specifying survival data
- One and two sample summaries
- Proportional hazards models
- Additional features
-
- useR 2004
- R Help Desk
-
- Introduction
- Other Applications
- Time Series
- Input
- Avoiding Errors
- Comparison Table
-
- Programmer's Niche
-
- ROC curves
- Classes and methods
-
- Generalising the class
-
- S4 classes and methods
- Discussion
-
- Changes in R
-
- New features in 1.9.1
- User-visible changes in 1.9.0
- New features in 1.9.0
-
- Changes on CRAN
-
- New contributed packages
- Other changes
-
- R Foundation News
-
- Donations
- New supporting institutions
- New supporting members
-
Vol. 4/1, June 2004
• Functions grep(), regexpr(), sub() and gsub() now coerce their arguments to character, rather than give an error.

The perl=TRUE argument now uses character tables prepared for the locale currently in use each time it is used, rather than those of the C locale.
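A quick sketch of the coercion described above (my own illustration, not from the newsletter; the behaviour still holds in current R):

```r
## grep()/sub() now coerce non-character arguments with as.character()
## instead of signalling an error.
x <- c(10, 2, 32)  # a numeric vector, not character
grep(2, x)         # pattern 2 is used as "2": matches "2" and "32"
sub(2, 9, x)       # the first "2" in each element is replaced by "9"
```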
• New functions head() and tail() in package 'utils'. (Based on a contribution by Patrick Burns.)
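For example (my own illustration; these functions work on vectors and data frames alike):

```r
head(1:10, 3)     # first three elements
tail(letters, 2)  # last two elements
head(mtcars)      # for a data frame: the first six rows
```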
• legend() has a new argument 'text.col'.

• methods(class=) now checks for a matching generic, and so no longer returns methods for non-visible generics (and eliminates various mismatches).

• A new function mget() will retrieve multiple values from an environment.
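A minimal sketch of mget() (my own example, not from the newsletter):

```r
e <- new.env()
assign("a", 1, envir = e)
assign("b", 2, envir = e)
mget(c("a", "b"), envir = e)  # a named list with both values at once
```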
• model.frame() methods, for example those for "lm" and "glm", pass relevant parts of ... on to the default method. (This has long been documented but not done.) The default method is now able to cope with model classes such as "lqs" and "ppr".

• nls() and ppr() have a model argument to allow the model frame to be returned as part of the fitted object.

• POSIXct objects can now have a "tzone" attribute that determines how they will be converted and printed. This means that date-time objects which have a timezone specified will generally be regarded as in their original time zone.
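A small sketch of the "tzone" attribute (my own example; current R behaves the same way):

```r
x <- as.POSIXct("2004-06-01 12:00:00", tz = "UTC")
attr(x, "tzone")  # the time zone travels with the object
format(x)         # formatted in its own zone, not the session's
```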
• postscript() device output has been modified to work around rounding errors in low-precision calculations in gs >= 8.11. (PR#5285, which is not a bug in R.)

It is now documented how to use other Computer Modern fonts, for example italic rather than slanted.

• ppr() now fully supports categorical explanatory variables.

ppr() is now interruptible at suitable places in the underlying FORTRAN code.

• princomp() now warns if both x and covmat are supplied, and returns scores only if the centring used is known.

• psigamma(x, deriv=0), a new function, generalizes digamma() etc. All these (psigamma, digamma, trigamma, ...) now also work for x < 0.
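For instance (my own illustration of the generalization described above):

```r
psigamma(2, deriv = 0)  # generalizes digamma()
psigamma(2, deriv = 1)  # ... and trigamma()
digamma(-1.5)           # negative non-integer arguments now work too
```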
• pchisq(*, ncp > 0) and hence qchisq() now work with much higher values of ncp; it has become much more accurate in the left tail.

• read.table() now allows embedded newlines in quoted fields. (PR#4555)
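A minimal sketch of a quoted field containing a newline (my own example, using a text connection):

```r
con <- textConnection('x,y\n1,"line one\nline two"')
d <- read.table(con, header = TRUE, sep = ",")
close(con)
nrow(d)  # one record: the newline stays inside the quoted field
```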
• rep.default(0-length-vector, length.out=n) now gives a vector of length n and not length 0, for compatibility with S.

If both each and length.out have been specified, it now recycles rather than fills with NAs, for S compatibility.

If both times and length.out have been specified, times is now ignored, for S compatibility. (Previously, padding with NAs was used.)

The "POSIXct" and "POSIXlt" methods for rep() now pass ... on to the default method (as expected by PR#5818).
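The three compatibility changes above can be sketched as follows (my own examples, as current R evaluates them):

```r
rep(numeric(0), length.out = 3)      # length 3 (NA-filled), not length 0
rep(1:3, each = 2, length.out = 8)   # recycles: 1 1 2 2 3 3 1 1
rep(1:2, times = 3, length.out = 5)  # times is ignored: 1 2 1 2 1
```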
• rgb2hsv() is new, an R interface to the C API function with the same name.

• User hooks can be set for onLoad, library, detach and onUnload of packages/namespaces: see ?setHook.

• save() default arguments can now be set using option "save.defaults", which is also used by save.image() if option "save.image.defaults" is not present.

• New function shQuote() to quote strings to be passed to OS shells.
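For example (my own illustration; shQuote() supports both POSIX and Windows quoting styles):

```r
shQuote("hello world")        # 'hello world'  (POSIX sh style)
shQuote("don't")              # embedded quote escaped as '\''
shQuote("a b", type = "cmd")  # "a b"  (Windows cmd style)
```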
• sink() now has a split= argument to direct output to both the sink and the current output connection.

• split.screen() now works for multiple devices at once.

• On some OSes (including Windows and those using glibc), strptime() did not validate dates correctly, so we have added extra code to do so. However, this cannot correct scanning errors in the OS's strptime (although we have been able to work around these on Windows). Some examples are now tested for during configuration.

• strsplit() now has fixed and perl arguments, and split = "" is optimized.
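The fixed argument matters whenever the separator contains regexp metacharacters (my own example):

```r
strsplit("a.b.c", ".", fixed = TRUE)[[1]]  # "a" "b" "c"
strsplit("a.b.c", ".")[[1]]                # without fixed, "." is a regexp
                                           # and matches every character
```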
• subset() now allows a drop argument, which is passed on to the indexing method for data frames.

• termplot() has an option to smooth the partial residuals.

• varimax() and promax() add class "loadings" to their loadings component.
R News ISSN 1609-3631
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).
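For example (my own sketch of the coercion):

```r
e <- new.env()
e$a <- 1
e$b <- "two"
as.list(e)  # a list with components a and b
```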
• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)
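A minimal sketch (my own example; by default addmargins() appends sums):

```r
tab <- as.table(matrix(1:4, nrow = 2,
                       dimnames = list(g = c("a", "b"), h = c("x", "y"))))
addmargins(tab)  # adds a "Sum" row and column; grand total in the corner
```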
• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.
• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.
• Changes to package 'grid':

- Renamed push/pop.viewport() to push/popViewport().

- Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of viewport tree (rather than just viewport stack).

- Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.

- Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport "bundles" that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

- Added vpPath() to allow specification of a viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use.

NOTE: it is also possible to specify a path directly, e.g., something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

- Added just argument to grid.layout() to allow justification of layout relative to parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).
- Allowed the "vp" slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

pushViewport(viewport(w=.5, h=.5, name="A"))
grid.rect()
pushViewport(viewport(w=.5, h=.5, name="B"))
grid.rect(gp=gpar(col="grey"))
upViewport(2)
grid.rect(vp="A", gp=gpar(fill="red"))
grid.rect(vp=vpPath("A", "B"), gp=gpar(fill="blue"))
- Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html
- Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

* grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

* the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.
In addition, the following features are now possible with grobs:

* grobs now save()/load() like any normal R object.

* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).

* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a childrenvp slot, which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.
- Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).

- Allowed viewports to "turn off" clipping altogether. Possible settings for the viewport clip argument are now:

"on": clip to the viewport (was TRUE)
"inherit": clip to whatever parent says (was FALSE)
"off": turn off clipping

Still accept logical values (and NA maps to "off").
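The three clip settings above can be sketched as follows (my own example, not from the newsletter, drawing to an off-screen pdf() device so it runs non-interactively):

```r
library(grid)
f <- tempfile(fileext = ".pdf")
pdf(f)  # draw off-screen
pushViewport(viewport(width = 0.5, height = 0.5, clip = "on"))
grid.circle(r = 0.7)  # clipped to this viewport
pushViewport(viewport(clip = "inherit"))
grid.circle(r = 0.8)  # still clipped to the parent's clip region
pushViewport(viewport(clip = "off"))
grid.circle(r = 1.2)  # drawn without any clipping at all
popViewport(3)
dev.off()
```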
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option 'repositories' to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories, as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(, deriv=2) and psigamma(, deriv=3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blas="goto" to use K. Goto's optimized BLAS will now work.

The above lists only new features: see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN
by Kurt Hornik
New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.
BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.
BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.
DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian multinomial probit models via Markov chain Monte Carlo. Along with the standard multinomial probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.
NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.
R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.
RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.
crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.
debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S code (EVIS) by Alexander McNeil; R port by Alec Stephenson.

fBasics fBasics package from Rmetrics. Rmetrics is an environment and software collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.
fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file.
fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, and some statistics in linkage analysis and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
kernlab Kernel-based machine learning methods including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.
kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.
mAr Estimation of multivariate AR models through a computationally efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.
mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between versus within chain approach of Gelman and Rubin. By Gregory R. Warnes.
mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.

race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).
ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.
rgl 3D visualization device (OpenGL). By Daniel Adler.
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.
rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.
rrcov Functions for robust location and scatter estimation and robust regression with high breakdown point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972-974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the Matérn and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS 2002, 99, pg. 15869, and J. Comput. Chem. 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for supervised grouping of predictor variables. By Marcel Dettling and Martin Maechler.
svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.
treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (simulated urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.
Other changes
• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R News ISSN 1609-3631
Vol 41 June 2004 46
R Foundation Newsby Bettina Gruumln
Donations
bull Dipartimento di Scienze Statistiche e Matem-atiche di Palermo (Italy)
bull Emanuele De Rinaldis (Italy)
bull Norm Phillips (Canada)
New supporting institutions
bull Breast Center at Baylor College of MedicineHouston Texas USA
bull Dana-Farber Cancer Institute Boston USA
bull Department of Biostatistics Vanderbilt Univer-sity School of Medicine USA
bull Department of Statistics amp Actuarial ScienceUniversity of Iowa USA
bull Division of Biostatistics University of Califor-nia Berkeley USA
bull Norwegian Institute of Marine ResearchBergen Norway
bull TERRA Lab University of Regina - Depart-ment of Geography Canada
bull ViaLactia Biosciences (NZ) Ltd AucklandNew Zealand
New supporting members
Andrej Blejec (Slovenia)Javier de las Rivas (Spain)Sandrine Dudoit (USA)Werner Engl (Austria)Balazs Gyoumlrffy (Germany)Branimir K Hackenberger (Croatia)O Maurice Haynes (USA)Kai Hendry (UK)Arne Henningsen (Germany)Wolfgang Huber (Germany)Marjatta Kari (Finland)Osmo Kari (Finland)Peter Lauf (Germany)Andy Liaw (USA)Mu-Yen Lin (Taiwan)Jeffrey Lins (Danmark)James W MacDonald (USA)Marlene Muumlller (Germany)Gillian Raab (UK)Roland Rau (Germany)Michael G Schimek (Austria)R Woodrow Setzer (USA)Thomas Telkamp (Netherlands)Klaus Thul (Taiwan)Shusaku Tsumoto (Japan)Kevin Wright (USA)
Bettina GruumlnTechnische Universitaumlt Wien AustriaBettinaGruencituwienacat
Editor-in-ChiefThomas LumleyDepartment of BiostatisticsUniversity of WashingtonSeattle WA 98195-7232USA
Editorial BoardDouglas Bates and Paul Murrell
Editor Programmerrsquos NicheBill Venables
Editor Help DeskUwe Ligges
Email of editors and editorial boardfirstnamelastname R-projectorg
R News is a publication of the R Foundation for Sta-tistical Computing communications regarding thispublication should be addressed to the editors Allarticles are copyrighted by the respective authorsPlease send submissions to regular columns to therespective column editor all other submissions tothe editor-in-chief or another member of the edi-torial board (more detailed submission instructionscan be found on the R homepage)
R Project HomepagehttpwwwR-projectorg
This newsletter is available online athttpCRANR-projectorgdocRnews
R News ISSN 1609-3631
- Editorial
- The Decision To Use R
  - Preface
  - Some Background on the Company
  - Evaluating The Tools of the Trade - A Value Based Equation
  - An Introduction To R
  - A Continued Evolution
  - Giving Back
  - Thanks
- The ade4 package - I: One-table methods
  - Introduction
  - The duality diagram class
  - Distance matrices
  - Taking into account groups of individuals
  - Permutation tests
  - Conclusion
- qcc: An R package for quality control charting and statistical process control
  - Introduction
  - Creating a qcc object
  - Shewhart control charts
  - Operating characteristic function
  - Cusum charts
  - EWMA charts
  - Process capability analysis
  - Pareto charts and cause-and-effect diagrams
  - Tuning and extending the package
  - Summary
  - References
- Least Squares Calculations in R
- Tools for interactively exploring R packages
  - References
- The survival package
  - Overview
  - Specifying survival data
  - One and two sample summaries
  - Proportional hazards models
  - Additional features
- useR! 2004
- R Help Desk
  - Introduction
  - Other Applications
  - Time Series
  - Input
  - Avoiding Errors
  - Comparison Table
- Programmer's Niche
  - ROC curves
  - Classes and methods
  - Generalising the class
  - S4 classes and methods
  - Discussion
- Changes in R
  - New features in 1.9.1
  - User-visible changes in 1.9.0
  - New features in 1.9.0
- Changes on CRAN
  - New contributed packages
  - Other changes
- R Foundation News
  - Donations
  - New supporting institutions
  - New supporting members
Vol. 4/1, June 2004
• Model fits now add a "dataClasses" attribute to the terms, which can be used to check that the variables supplied for prediction are of the same type as those used for fitting. (It is currently used by predict() methods for classes "lm", "mlm", "glm" and "ppr", as well as methods in packages MASS, rpart and tree.)

• New command-line argument --max-ppsize allows the size of the pointer protection stack to be set higher than the previous limit of 10000.

• The fonts on an X11() device (also jpeg() and png() on Unix) can be specified by a new argument 'fonts', defaulting to the value of a new option "X11fonts".

• New functions in the tools package: pkgDepends, getDepList and installFoundDepends. These provide functionality for assessing dependencies and the availability of them (either locally or from online repositories).

• The parsed contents of a NAMESPACE file are now stored at installation and, if available, used to speed loading the package, so packages with namespaces should be reinstalled.

• Argument asp, although not a graphics parameter, is accepted in the ... of graphics functions without a warning. It now works as expected in contour().

• Package stats4 exports S4 generics for AIC() and BIC().

• The Mac OS X version now produces an R framework for easier linking of R into other programs. As a result, R.app is now relocatable.

• Added experimental support for conditionals in NAMESPACE files.

• Added as.list.environment to coerce environments to lists (efficiently).
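A minimal sketch of the new coercion (the variable names are purely illustrative):

```r
e <- new.env()
assign("a", 1, envir = e)
assign("b", 2, envir = e)
as.list(e)   # a named list with components "a" and "b"
```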
• New function addmargins() in the stats package to add marginal summaries to tables, e.g. row and column totals. (Based on a contribution by Bendix Carstensen.)
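As a quick illustration (the data set and factors are chosen arbitrarily), addmargins() appends the marginal sums as an extra row and column labelled "Sum":

```r
tab <- with(mtcars, table(gear, cyl))
addmargins(tab)              # sum margins over both dimensions
addmargins(tab, margin = 1)  # only a row of column totals
```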
• dendrogram edge and node labels can now be expressions (to be plotted via stats:::plotNode, called from plot.dendrogram). The diamond frames around edge labels are more nicely scaled horizontally.

• Methods defined in the methods package can now include default expressions for arguments. If these arguments are missing in the call, the defaults in the selected method will override a default in the generic. See ?setMethod.
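A small sketch of the behaviour described above (the generic and its defaults are made up for illustration):

```r
library(methods)
setGeneric("msg", function(x, prefix = "generic") standardGeneric("msg"))
setMethod("msg", "numeric",
          function(x, prefix = "numeric") paste(prefix, x))
msg(1)   # prefix is missing, so the method's default "numeric" wins
```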
• Changes to package 'grid':

  – Renamed push/pop.viewport() to push/popViewport().

  – Added upViewport(), downViewport() and seekViewport() to allow creation and navigation of a viewport tree (rather than just a viewport stack).

  – Added id and id.lengths arguments to grid.polygon() to allow multiple polygons within a single grid.polygon() call.

  – Added vpList(), vpStack(), vpTree() and current.vpTree() to allow creation of viewport bundles that may be pushed at once (lists are pushed in parallel, stacks in series). current.vpTree() returns the current viewport tree.

  – Added vpPath() to allow specification of a viewport path in downViewport() and seekViewport(). See ?viewports for an example of its use. NOTE: it is also possible to specify a path directly, e.g. something like vp1::vp2, but this is only advised for interactive use (in case I decide to change the separator in later versions).

  – Added just argument to grid.layout() to allow justification of the layout relative to the parent viewport IF the layout is not the same size as the viewport. There's an example in help(grid.layout).
  – Allowed the vp slot in a grob to be a viewport name or a vpPath. The interpretation of these new alternatives is to call downViewport() with the name or vpPath before drawing the grob, and upViewport() the appropriate amount after drawing the grob. Here's an example of the possible usage:

      pushViewport(viewport(w = .5, h = .5, name = "A"))
      grid.rect()
      pushViewport(viewport(w = .5, h = .5, name = "B"))
      grid.rect(gp = gpar(col = "grey"))
      upViewport(2)
      grid.rect(vp = "A", gp = gpar(fill = "red"))
      grid.rect(vp = vpPath("A", "B"), gp = gpar(fill = "blue"))
  – Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html

  – Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:

    * grobs all now have a "name" slot. The grob name is used to uniquely identify a "drawn" grob (i.e., a grob on the display list).

    * grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)

    * the "grobwidth" and "grobheight" units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.

  In addition, the following features are now possible with grobs:

    * grobs now save()/load() like any normal R object.

    * many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).
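For instance (a minimal sketch), rectGrob() builds the object that grid.rect() would draw, without touching the device:

```r
library(grid)
r <- rectGrob(width = 0.5, height = 0.5,
              gp = gpar(fill = "grey"))  # no drawing happens here
grid.newpage()
grid.draw(r)                             # drawn only on request
```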
    * there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a childrenvp slot which is a viewport which is pushed, and then "up"ed, before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.

    * there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.

    * there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.

  – Added stringWidth(), stringHeight(), grobWidth() and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth" and "grobheight" unit objects, respectively).

  – Allowed viewports to turn off clipping altogether. Possible settings for the viewport clip argument are now:

      "on"      clip to the viewport (was TRUE)
      "inherit" clip to whatever parent says (was FALSE)
      "off"     turn off clipping

    Still accept logical values (and NA maps to "off").
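A short sketch of the new "off" setting (the geometry is chosen arbitrarily):

```r
library(grid)
grid.newpage()
pushViewport(viewport(width = 0.5, height = 0.5, clip = "on"))
grid.circle(r = 0.7)   # circle is clipped at the viewport edges
popViewport()
pushViewport(viewport(width = 0.5, height = 0.5, clip = "off"))
grid.circle(r = 0.7)   # same circle, drawn in full
popViewport()
```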
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.

• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.

• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.

• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.

• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.

• Vignette dependencies can now be checked and obtained via vignetteDepends.

• Option "repositories" to list URLs for package repositories added.

• package.description() has been replaced by packageDescription().

• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.

• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.

• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.

• hsv2rgb and rgb2hsv are newly in the C API.

• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.

• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.

• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp() and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.

• print.coefmat() is defunct, replaced by printCoefmat().

• codes() and codes<-() are defunct.

• anovalist.lm (replaced in 1.2.0) is now defunct.

• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.

• print.atomic() is defunct.

• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).

• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.

• La.eigen() is deprecated; now eigen() uses LAPACK by default.

• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(x, deriv = 2) and psigamma(x, deriv = 3).

• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.

• package.contents() and package.description() have been deprecated.

• The defaults for configure are now --without-zlib --without-bzlib --without-pcre. The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used. The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities' as all builds of R have had them since 1.7.0.

• --with-blas=goto to use K. Goto's optimized BLAS will now work.

The above lists only new features: see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN
by Kurt Hornik
New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth
BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code
DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer and Korbinian Strimmer

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn

MNP A publicly available R package that fits the Bayesian multinomial probit model via Markov chain Monte Carlo. Along with the standard multinomial probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath and David Bauer

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas

circular Circular Statistics, from "Topics in Circular Statistics" (2001) S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli
crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer
debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington

distr Object oriented implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. Original S functions (EVIS) by Alexander McNeil, R port by Alec Stephenson

fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file

fortunes R Fortunes. By Achim Zeileis, fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu
kernlab Kernel-based machine learning methods including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-chain versus within-chain approach of Gelman and Rubin. By Gregory R. Warnes

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by debug package. By Mark V. Bravington

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and backfitting algorithm. By Washington Junger

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca
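A minimal usage sketch (the pistonrings data set and the qcc.groups() helper ship with the package, per its documentation; treat the exact calls as illustrative):

```r
library(qcc)
data(pistonrings)
diameter <- with(pistonrings, qcc.groups(diameter, sample))
qcc(diameter[1:25, ], type = "xbar")  # Shewhart xbar chart with control limits
```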
race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman)

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data frames. By Jens Oehlschlägel

regress Functions to fit Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh

rgl 3D visualization device (OpenGL). By Daniel Adler
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen, ported to R, adapted and packaged by Valentin Todorov

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972-974. By David A. Schoenfeld

setRNG Set reproducible random number generator in R and S. By Paul Gilbert

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson
spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem and for spatial data that arise on irregularly shaped regions like counties or zipcodes by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869 and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns; ported to R by Nick Efthymiou

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck
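For example (the dates and values below are invented), an irregular daily series keeps its time index attached to the data:

```r
library(zoo)
z <- zoo(c(1.2, 0.8, 1.5),
         as.Date(c("2004-06-01", "2004-06-03", "2004-06-10")))
z                      # observations carry their (irregular) index
merge(z, lag(z, -1))   # align a series with its lagged version
```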
Other changes

• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News
by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)
New supporting institutions
• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina GruumlnTechnische Universitaumlt Wien AustriaBettinaGruencituwienacat
Editor-in-ChiefThomas LumleyDepartment of BiostatisticsUniversity of WashingtonSeattle WA 98195-7232USA
Editorial BoardDouglas Bates and Paul Murrell
Editor Programmerrsquos NicheBill Venables
Editor Help DeskUwe Ligges
Email of editors and editorial boardfirstnamelastname R-projectorg
R News is a publication of the R Foundation for Sta-tistical Computing communications regarding thispublication should be addressed to the editors Allarticles are copyrighted by the respective authorsPlease send submissions to regular columns to therespective column editor all other submissions tothe editor-in-chief or another member of the edi-torial board (more detailed submission instructionscan be found on the R homepage)
R Project HomepagehttpwwwR-projectorg
This newsletter is available online athttpCRANR-projectorgdocRnews
R News ISSN 1609-3631
- Editorial
- The Decision To Use R
-
- Preface
- Some Background on the Company
- Evaluating The Tools of the Trade - A Value Based Equation
- An Introduction To R
- A Continued Evolution
- Giving Back
- Thanks
-
- The ade4 package - I One-table methods
-
- Introduction
- The duality diagram class
- Distance matrices
- Taking into account groups of individuals
- Permutation tests
- Conclusion
-
- qcc An R package for quality control charting and statistical process control
-
- Introduction
- Creating a qcc object
- Shewhart control charts
- Operating characteristic function
- Cusum charts
- EWMA charts
- Process capability analysis
- Pareto charts and cause-and-effect diagrams
- Tuning and extending the package
- Summary
- References
-
- Least Squares Calculations in R
- Tools for interactively exploring R packages
-
- References
-
- The survival package
-
- Overview
- Specifying survival data
- One and two sample summaries
- Proportional hazards models
- Additional features
-
- useR 2004
- R Help Desk
-
- Introduction
- Other Applications
- Time Series
- Input
- Avoiding Errors
- Comparison Table
-
- Programmers Niche
-
- ROC curves
- Classes and methods
-
- Generalising the class
-
- S4 classes and method
- Discussion
-
- Changes in R
-
- New features in 191
- User-visible changes in 190
- New features in 190
-
- Changes on CRAN
-
- New contributed packages
- Other changes
-
- R Foundation News
-
- Donations
- New supporting institutions
- New supporting members
-
Vol. 4/1, June 2004
– Added engine.display.list() function. This allows the user to tell grid NOT to use the graphics engine display list and to handle ALL redraws using its own display list (including redraws after device resizes and copies). This provides a way to avoid some of the problems with resizing a device when you have used grid.convert(), or the gridBase package, or even base functions such as legend(). There is a document discussing the use of display lists in grid on the grid web site http://www.stat.auckland.ac.nz/~paul/grid/grid.html
– Changed the implementation of grob objects. They are no longer implemented as external references. They are now regular R objects which copy-by-value. This means that they can be saved/loaded like normal R objects. In order to retain some existing grob behaviour, the following changes were necessary:
* grobs all now have a name slot. The grob name is used to uniquely identify a drawn grob (i.e., a grob on the display list).
* grid.edit() and grid.pack() now take a grob name as the first argument instead of a grob. (Actually, they take a gPath; see below.)
* the grobwidth and grobheight units take either a grob OR a grob name (actually a gPath; see below). Only in the latter case will the unit be updated if the grob pointed to is modified.
In addition, the following features are now possible with grobs:
* grobs now save()/load() like any normal R object.
* many grid.*() functions now have a *Grob() counterpart. The grid.*() version is used for its side-effect of drawing something or modifying something which has been drawn; the *Grob() version is used for its return value, which is a grob. This makes it more convenient to just work with grob objects without producing any graphical output (by using the *Grob() functions).
* there is a gTree object (derived from grob), which is a grob that can have children. A gTree also has a childrenvp slot which is a viewport which is pushed and then "up"ed before the children are drawn; this allows the children of a gTree to place themselves somewhere in the viewports specified in the childrenvp by having a vpPath in their vp slot.
* there is a gPath object, which is essentially a concatenation of grob names. This is used to specify the child of (a child of ...) a gTree.
* there is a new API for creating/accessing/modifying grob objects: grid.add(), grid.remove(), grid.edit(), grid.get() (and their *Grob() counterparts) can be used to add, remove, edit, or extract a grob or the child of a gTree. NOTE: the new grid.edit() API is incompatible with the previous version.
– Added stringWidth(), stringHeight(), grobWidth(), and grobHeight() convenience functions (they produce "strwidth", "strheight", "grobwidth", and "grobheight" unit objects, respectively).
– Allowed viewports to turn off clipping altogether. Possible settings for the viewport clip arg are now:
  "on" = clip to the viewport (was TRUE)
  "inherit" = clip to whatever parent says (was FALSE)
  "off" = turn off clipping
Logical values are still accepted (and NA maps to "off").
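The grob-name API and the new clipping settings described above can be sketched in a few lines; a minimal sketch, with grob and viewport parameters invented for illustration (assumes grid as shipped with R 1.9.x):

```r
library(grid)

# Draw a rectangle with an explicit name, so the drawn grob
# can be addressed on the display list later.
grid.rect(width = 0.5, height = 0.5, name = "box")

# Edit the drawn grob by name rather than by reference.
grid.edit("box", gp = gpar(fill = "grey"))

# The *Grob() counterpart returns a grob without drawing it.
r <- rectGrob(width = 0.25, height = 0.25, name = "inner")

# A gTree holds children; a gPath addresses a child of a gTree.
grid.draw(gTree(children = gList(r), name = "tree"))
grid.edit(gPath("tree", "inner"), gp = gpar(col = "red"))

# Viewports can now switch clipping off entirely.
pushViewport(viewport(width = 0.5, height = 0.5, clip = "on"))
grid.circle(r = 0.7)                  # clipped to the viewport
pushViewport(viewport(clip = "off"))  # drawing escapes clipping
grid.circle(r = 0.7)
popViewport(2)
```

The sketch only illustrates the calling conventions (names, gPaths, and the clip argument); the actual output is a device-dependent drawing.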
• R CMD check now runs the (Rd) examples with default RNGkind (uniform & normal) and set.seed(1). example(*, setRNG = TRUE) does the same.
• undoc() in package 'tools' has a new default of 'use.values = NULL' which produces a warning whenever the default values of function arguments differ between documentation and code. Note that this affects R CMD check as well.
• Testing examples via massage-examples.pl (as used by R CMD check) now restores the search path after every help file.
• checkS3methods() in package 'tools' now also looks for generics in the loaded namespaces/packages listed in the Depends fields of the package's DESCRIPTION file when testing an installed package.
• The DESCRIPTION file of packages may contain a 'Suggests:' field for packages that are used only in examples or vignettes.
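A hypothetical DESCRIPTION fragment using this field might look like the following (package name, version, and suggested package are invented for illustration):

```
Package: mypkg
Version: 0.1-0
Title: An Example Package
Depends: R (>= 1.9.0)
Suggests: lattice
Description: Uses lattice only in its examples.
```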
R News ISSN 1609-3631
• Added an option to package.dependencies() to handle the 'Suggests' levels of dependencies.
• Vignette dependencies can now be checked and obtained via vignetteDepends().
• Option 'repositories' to list URLs for package repositories added.
• package.description() has been replaced by packageDescription().
• R CMD INSTALL/build now skip Subversion's .svn directories as well as CVS directories.
• arraySubscript and vectorSubscript take a new argument which is a function pointer that provides access to character strings (such as the names vector) rather than assuming these are passed in.
• R_CheckUserInterrupt is now described in 'Writing R Extensions' and there is a new equivalent subroutine rchkusr for calling from FORTRAN code.
• hsv2rgb and rgb2hsv are newly in the C API.
• Salloc and Srealloc are provided in S.h as wrappers for S_alloc and S_realloc, since current S versions use these forms.
• The type used for vector lengths is now R_len_t rather than int, to allow for a future change.
• The internal header nmath/dpq.h has slightly improved macros R_DT_val() and R_DT_Cval(), a new R_D_LExp(), and improved R_DT_log() and R_DT_Clog(); this improves accuracy in several [dpq]-functions for extreme arguments.
• print.coefmat() is defunct, replaced by printCoefmat().
• codes() and codes<-() are defunct.
• anovalist.lm (replaced in 1.2.0) is now defunct.
• glm.fit.null(), lm.fit.null() and lm.wfit.null() are defunct.
• print.atomic() is defunct.
• The command-line arguments --nsize and --vsize are no longer recognized as synonyms for --min-nsize and --min-vsize (which replaced them in 1.2.0).
• Unnecessary methods coef.glm and fitted.glm have been removed: they were each identical to the default method.
• La.eigen() is deprecated; now eigen() uses LAPACK by default.
• tetragamma() and pentagamma() are deprecated, since they are equivalent to psigamma(x, deriv = 2) and psigamma(x, deriv = 3).
• LTRUE/LFALSE in Rmath.h have been removed: they were deprecated in 1.2.0.
• package.contents() and package.description() have been deprecated.
• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.
The included PCRE sources have been updated to version 4.5 and PCRE >= 4.0 is now required if --with-pcre is used.
The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.
• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.
• --with-blas="goto" to use K. Goto's optimized BLAS will now work.
The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
Changes on CRAN
by Kurt Hornik
New contributed packages
AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler
BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth
BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code
DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer and Korbinian Strimmer

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel
Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn

MNP A publicly available R package that fits the Bayesian Multinomial Probit models via Markov chain Monte Carlo. Along with the standard Multinomial Probit model, it can also fit models with different choice sets for each observation and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee
R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer

RUnit R functions implementing a standard Unit Testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi
assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas

circular Circular Statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli
crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer
debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. S original (EVIS) by Alexander McNeil, R port by Alec Stephenson
fBasics fBasics package from Rmetrics. Rmetrics is an Environment and Software Collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others, see the SOURCE file

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others, see the SOURCE file

fortunes R Fortunes. By Achim Zeileis, fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright
gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, and some statistics in linkage analysis and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing hua Zhao in collaboration with other colleagues, and with help from Kurt Hornik and Brian Ripley of the R core development team

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are: haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu
kernlab Kernel-based machine learning methods including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa
mathgraph Simple tools for constructing and manipulating objects of class mathgraph, from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns. Ported to R by Nick Efthymiou

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-versus-within chain approach of Gelman and Rubin. By Gregory R. Warnes

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal and external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson, R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath

pgam This work is aimed at extending a class of state space models for Poisson count data, so-called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and a backfitting algorithm. By Washington Junger
qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca
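As a brief sketch of the kind of usage the package supports (the data are simulated here, and the parameter values are invented for illustration; qcc() and its type argument are as documented in the package):

```r
library(qcc)

# 20 samples of size 5 from a nominally in-control process
# (simulated data, purely for illustration).
x <- matrix(rnorm(100, mean = 10, sd = 1), nrow = 20)

# Shewhart xbar chart of the sample means.
q <- qcc(x, type = "xbar")
summary(q)
```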
race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari

rbugs Functions to prepare files needed for running BUGS in batch-mode, and for running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman)

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrixes and data.frames. By Jens Oehlschlägel

regress Functions to fit a Gaussian linear model by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Provides an easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh

rgl 3D visualization device (OpenGL). By Daniel Adler
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers

rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis

seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe Algorithm using the method described in Schoenfeld, D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld

setRNG Set reproducible random number generator in R and S. By Paul Gilbert

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik ETH Zurich, many ported from S-plus times. By Martin Maechler and many others

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford

spc Evaluation of control charts by means of the zero-state and steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha

supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler

svmpath Computes the entire regularization path for the two-class SVM classifier, with essentially the same cost as a single SVM fit. By Trevor Hastie

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto
urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff

urn Functions for sampling without replacement (Simulated Urns). By Micah Altman

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns. Ported to R by Nick Efthymiou

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck
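A minimal sketch of the kind of object zoo provides (the dates and values are invented for illustration; zoo() and its order.by argument are as documented in the package):

```r
library(zoo)

# An irregular daily series: five observations at arbitrary dates.
z <- zoo(rnorm(5), order.by = as.Date("2004-06-01") + c(0, 2, 3, 7, 11))

# Standard methods work on the ordered index, e.g.:
plot(z)
```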
Other changes
• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News
by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)
• Emanuele De Rinaldis (Italy)
• Norm Phillips (Canada)
New supporting institutions
• Breast Center at Baylor College of Medicine, Houston, Texas, USA
• Dana-Farber Cancer Institute, Boston, USA
• Department of Biostatistics, Vanderbilt University School of Medicine, USA
• Department of Statistics & Actuarial Science, University of Iowa, USA
• Division of Biostatistics, University of California, Berkeley, USA
• Norwegian Institute of Marine Research, Bergen, Norway
• TERRA Lab, University of Regina, Department of Geography, Canada
• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board:
Douglas Bates and Paul Murrell

Editor Programmer's Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Email of editors and editorial board:
firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
- Editorial
- The Decision To Use R
  - Preface
  - Some Background on the Company
  - Evaluating The Tools of the Trade - A Value Based Equation
  - An Introduction To R
  - A Continued Evolution
  - Giving Back
  - Thanks
- The ade4 package - I: One-table methods
  - Introduction
  - The duality diagram class
  - Distance matrices
  - Taking into account groups of individuals
  - Permutation tests
  - Conclusion
- qcc: An R package for quality control charting and statistical process control
  - Introduction
  - Creating a qcc object
  - Shewhart control charts
  - Operating characteristic function
  - Cusum charts
  - EWMA charts
  - Process capability analysis
  - Pareto charts and cause-and-effect diagrams
  - Tuning and extending the package
  - Summary
  - References
- Least Squares Calculations in R
- Tools for interactively exploring R packages
  - References
- The survival package
  - Overview
  - Specifying survival data
  - One and two sample summaries
  - Proportional hazards models
  - Additional features
- useR! 2004
- R Help Desk
  - Introduction
  - Other Applications
  - Time Series
  - Input
  - Avoiding Errors
  - Comparison Table
- Programmer's Niche
  - ROC curves
  - Classes and methods
    - Generalising the class
  - S4 classes and methods
  - Discussion
- Changes in R
  - New features in 1.9.1
  - User-visible changes in 1.9.0
  - New features in 1.9.0
- Changes on CRAN
  - New contributed packages
  - Other changes
- R Foundation News
  - Donations
  - New supporting institutions
  - New supporting members
Vol 41 June 2004 41
bull Added an option to packagedependencies()to handle the rsquoSuggestsrsquo levels of dependencies
bull Vignette dependencies can now be checkedand obtained via vignetteDepends
bull Option repositories to list URLs for pack-age repositories added
bull packagedescription() has been replaced bypackageDescription()
bull R CMD INSTALLbuild now skip Subversionrsquossvn directories as well as CVS directories
bull arraySubscript and vectorSubscript take anew argument which is a function pointer thatprovides access to character strings (such as thenames vector) rather than assuming these arepassed in
bull R_CheckUserInterrupt is now described inlsquoWriting R Extensionsrsquo and there is a newequivalent subroutine rchkusr for calling fromFORTRAN code
bull hsv2rgb and rgb2hsv are newly in the C API
bull Salloc and Srealloc are provided in Sh aswrappers for S_alloc and S_realloc sincecurrent S versions use these forms
bull The type used for vector lengths is nowR_len_t rather than int to allow for a futurechange
bull The internal header nmathdpqh hasslightly improved macros R_DT_val() andR_DT_Cval() a new R_D_LExp() and improvedR_DT_log() and R_DT_Clog() this improvesaccuracy in several [dpq]-functions for extremearguments
bull printcoefmat() is defunct replaced byprintCoefmat()
bull codes() and codeslt-() are defunct
bull anovalistlm (replaced in 120) is now de-funct
bull glmfitnull() lmfitnull() andlmwfitnull() are defunct
bull printatomic() is defunct
bull The command-line arguments --nsize and--vsize are no longer recognized as synonymsfor --min-nsize and --min-vsize (which re-placed them in 120)
bull Unnecessary methods coefglm andfittedglm have been removed they wereeach identical to the default method
bull Laeigen() is deprecated now eigen() usesLAPACK by default
bull tetragamma() and pentagamma() are depre-cated since they are equivalent to psigamma(deriv=2) and psigamma( deriv=3)
bull LTRUELFALSE in Rmathh have been removedthey were deprecated in 120
bull packagecontents() and packagedescription()have been deprecated
• The defaults for configure are now --without-zlib --without-bzlib --without-pcre.

The included PCRE sources have been updated to version 4.5, and PCRE >= 4.0 is now required if --with-pcre is used.

The included zlib sources have been updated to 1.2.1, and this is now required if --with-zlib is used.

• configure no longer lists bzip2 and PCRE as 'additional capabilities', as all builds of R have had them since 1.7.0.

• --with-blas=goto to use K. Goto's optimized BLAS will now work.

The above lists only new features; see the 'NEWS' file in the R distribution or on the R homepage for a list of bug fixes.
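Several of the renamings and deprecations above can be checked directly at the R prompt; a minimal base-R sketch (no contributed packages needed; the printed title depends on your installation):

```r
## packageDescription() replaces the defunct package.description():
packageDescription("stats")$Title

## psigamma(x, deriv = n) generalizes the polygamma family, which is why
## tetragamma() and pentagamma() are deprecated; for low orders it agrees
## with digamma() and trigamma():
x <- 2.5
stopifnot(all.equal(psigamma(x, deriv = 0), digamma(x)),
          all.equal(psigamma(x, deriv = 1), trigamma(x)))

## eigen() now uses LAPACK by default, making La.eigen() redundant:
eigen(matrix(c(2, 1, 1, 2), 2, 2))$values  # eigenvalues 3 and 1
```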
Changes on CRAN
by Kurt Hornik

New contributed packages

AlgDesign Algorithmic experimental designs. Calculates exact and approximate theory experimental designs for D, A, and I criteria. Very large designs may be created. Experimental designs may be blocked or blocked designs created from a candidate list, using several criteria. The blocking can be done when whole and within plot factors interact. By Bob Wheeler.

BradleyTerry Specify and fit the Bradley-Terry model and structured versions. By David Firth.
BsMD Bayes screening and model discrimination follow-up designs. By Ernesto Barrios, based on Daniel Meyer's code.
DCluster A set of functions for the detection of spatial clusters of disease using count data. Bootstrap is used to estimate sampling distributions of statistics. By Virgilio Gómez-Rubio, Juan Ferrándiz, Antonio López.

GeneTS A package for analyzing multiple gene expression time series data. Currently, GeneTS implements methods for cell cycle analysis and for inferring large sparse graphical Gaussian models. For plotting the inferred genetic networks GeneTS requires the graph and Rgraphviz packages (available from www.bioconductor.org). By Konstantinos Fokianos, Juliane Schaefer, and Korbinian Strimmer.

HighProbability Provides a simple, fast, reliable solution to the multiple testing problem. Given a vector of p-values or achieved significance levels computed using standard frequentist inference, HighProbability determines which ones are low enough that their alternative hypotheses can be considered highly probable. The p-value vector may be determined using existing R functions such as t.test, wilcox.test, cor.test, or sample. HighProbability can be used to detect differential gene expression and to solve other problems involving a large number of hypothesis tests. By David R. Bickel.

Icens Many functions for computing the NPMLE for censored and truncated data. By R. Gentleman and Alain Vandal.

MCMCpack This package contains functions for posterior simulation for a number of statistical models. All simulation is done in compiled C++ written in the Scythe Statistical Library Version 0.4. All models return coda mcmc objects that can then be summarized using coda functions or the coda menu interface. The package also contains some useful utility functions, including some additional PDFs and pseudo-random number generators for statistical distributions. By Andrew D. Martin and Kevin M. Quinn.

MNP A publicly available R package that fits the Bayesian multinomial probit models via Markov chain Monte Carlo. Along with the standard multinomial probit model, it can also fit models with different choice sets for each observation, and complete or partial ordering of all the available alternatives. The estimation is based on the efficient marginal data augmentation algorithm that is developed by Imai and van Dyk (2004). By Kosuke Imai, Jordan Vance, David A. van Dyk.

NADA Contains methods described by Dennis R. Helsel in his book "Nondetects And Data Analysis: Statistics for Censored Environmental Data". By Lopaka Lee.

R2WinBUGS Using this package, it is possible to call a BUGS model, summarize inferences and convergence in a table and graph, and save the simulations in arrays for easy access in R. Originally written by Andrew Gelman; changes and packaging by Sibylle Sturtz and Uwe Ligges.

RScaLAPACK An interface to ScaLAPACK functions from R. By Nagiza F. Samatova, Srikanth Yoginath, and David Bauer.

RUnit R functions implementing a standard unit testing framework, with additional code inspection and report generation tools. By Matthias Burger, Klaus Juenemann, Thomas Koenig.

SIN This package provides routines to perform SIN model selection as described in Drton & Perlman (2004). The selected models are in the format of the ggm package, which allows in particular parameter estimation in the selected model. By Mathias Drton.

SoPhy SWMS_2D interface. Submission to Computers and Geosciences, title: The use of the language interface of R: two examples for modelling water flux and solute transport. By Martin Schlather, Bernd Huwe.

SparseLogReg Some functions wrapping the sparse logistic regression code by S. K. Shevade and S. S. Keerthi, originally intended for microarray-based gene selection. By Michael T. Mader, with C code from S. K. Shevade and S. S. Keerthi.

assist ASSIST, see manual. By Yuedong Wang and Chunlei Ke.

bayesmix Bayesian mixture models of univariate Gaussian distributions using JAGS. By Bettina Gruen.

betareg Beta regression for modeling rates and proportions. By Alexandre de Bustamante Simas.

circular Circular statistics, from "Topics in Circular Statistics" (2001), S. Rao Jammalamadaka and A. SenGupta, World Scientific. By Ulric Lund, Claudio Agostinelli.
crossdes Contains functions for the construction and randomization of balanced carryover balanced designs. Contains functions to check given designs for balance. Also contains functions for simulation studies on the validity of two randomization procedures. By Martin Oliver Sailer.
debug Debugger for R functions, with code display, graceful error recovery, line-numbered conditional breakpoints, access to exit code, flow control, and full keyboard input. By Mark V. Bravington.

distr Object orientated implementation of distributions and some additional functionality. By Florian Camphausen, Matthias Kohl, Peter Ruckdeschel, Thomas Stabla.

dynamicGraph Interactive graphical tool for manipulating graphs. By Jens Henrik Badsberg.

energy E-statistics (energy) tests for comparing distributions: multivariate normality, Poisson test, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely. By Maria L. Rizzo and Gabor J. Szekely.

evdbayes Provides functions for the Bayesian analysis of extreme value models, using MCMC methods. By Alec Stephenson.

evir Functions for extreme value theory, which may be divided into the following groups: exploratory data analysis, block maxima, peaks over thresholds (univariate and bivariate), point processes, gev/gpd distributions. S original (EVIS) by Alexander McNeil; R port by Alec Stephenson.

fBasics fBasics package from Rmetrics — Rmetrics is an environment and software collection for teaching "Financial Engineering and Computational Finance". By Diethelm Wuertz and many others; see the SOURCE file.

fExtremes fExtremes package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fOptions fOptions package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fSeries fSeries package from Rmetrics. By Diethelm Wuertz and many others; see the SOURCE file.

fortunes R Fortunes. By Achim Zeileis, with fortune contributions from Torsten Hothorn, Peter Dalgaard, Uwe Ligges, Kevin Wright.

gap This is an integrated package for genetic data analysis of both population and family data. Currently, it contains functions for sample size calculations of both population-based and family-based designs, probability of familial disease aggregation, kinship calculation, some statistics in linkage analysis, and association analysis involving one or more genetic markers, including haplotype analysis. In the future it will incorporate programs for path and segregation analyses, as well as other statistics in linkage and association analyses. By Jing Hua Zhao, in collaboration with other colleagues and with help from Kurt Hornik and Brian Ripley of the R core development team.

gclus Orders panels in scatterplot matrices and parallel coordinate displays by some merit index. Package contains various indices of merit, ordering functions, and enhanced versions of pairs and parcoord which color panels according to their merit level. By Catherine Hurley.

gpls Classification using generalized partial least squares for two-group and multi-group (more than 2 group) classification. By Beiying Ding.

hapassoc The following R functions are used for likelihood inference of trait associations with haplotypes and other covariates in generalized linear models. The functions accommodate uncertain haplotype phase and can handle missing genotypes at some SNPs. By K. Burkett, B. McNeney, J. Graham.

haplo.stats Haplo Stats is a suite of S-PLUS/R routines for the analysis of indirectly measured haplotypes. The statistical methods assume that all subjects are unrelated and that haplotypes are ambiguous (due to unknown linkage phase of the genetic markers). The genetic markers are assumed to be codominant (i.e., one-to-one correspondence between their genotypes and their phenotypes), and so we refer to the measurements of genetic markers as genotypes. The main functions in Haplo Stats are haplo.em, haplo.glm and haplo.score. By Jason P. Sinnwell and Daniel J. Schaid.

hett Functions for the fitting and summarizing of heteroscedastic t-regression. By Julian Taylor.

httpRequest HTTP Request protocols. Implements the GET, POST and multipart POST request. By Eryk Witold Wolski.

impute Imputation for microarray data (currently KNN only). By Trevor Hastie, Robert Tibshirani, Balasubramanian Narasimhan, Gilbert Chu.
kernlab Kernel-based machine learning methods, including support vector machines. By Alexandros Karatzoglou, Alex Smola, Achim Zeileis, Kurt Hornik.

klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges.

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey.

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko.

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa.

mathgraph Simple tools for constructing and manipulating objects of class mathgraph from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou.

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between- versus within-chain approach of Gelman and Rubin. By Gregory R. Warnes.

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner.

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal, external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski.

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon.

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington.

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley. Some routines from vegan by Jari Oksanen. Extensions and adaptations of rpart to mvpart by Glenn De'ath.

pgam This work is aimed at extending a class of state space models for Poisson count data, so called Poisson-Gamma models, towards a semiparametric specification. Just like the generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of the likelihood and the backfitting algorithm. By Washington Junger.

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca.
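To give a flavor of the charting interface, here is a minimal sketch, assuming the qcc package is installed; the 20 samples of size 5 and the variable name `diameter` are invented for illustration:

```r
library(qcc)                       # assumed installed from CRAN
set.seed(1)
## 20 rational subgroups (rows) of 5 measurements each
diameter <- matrix(rnorm(100, mean = 10, sd = 0.1), ncol = 5)
q <- qcc(diameter, type = "xbar")  # Shewhart xbar chart, drawn on the current device
summary(q)                         # center line, control limits, any violations
```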
race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured, and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari.

rbugs Functions to prepare files needed for running BUGS in batch-mode, and running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman).

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrixes and data.frames. By Jens Oehlschlägel.

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh.

rgl 3D visualization device (OpenGL). By Daniel Adler.
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini.

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers.

rrcov Functions for robust location and scatter estimation and robust regression with high breakdown point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.
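A minimal sketch of what sandwich provides, assuming the package is installed; the heteroscedastic regression data are simulated for illustration:

```r
library(sandwich)                  # assumed installed from CRAN
set.seed(1)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100) * (1 + abs(x))  # errors with non-constant variance
fm <- lm(y ~ x)
sqrt(diag(sandwich(fm)))           # model-robust (sandwich) standard errors
sqrt(diag(vcov(fm)))               # classical standard errors, for comparison
```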
seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe algorithm using the method described in Schoenfeld, D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.

setRNG Set reproducible random number generator in R and S. By Paul Gilbert.

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL, and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869, and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.

supclust Methodology for supervised grouping of predictor variables. By Marcel Dettling and Martin Maechler.

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.

urn Functions for sampling without replacement (simulated urns). By Micah Altman.

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns; ported to R by Nick Efthymiou.

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.
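To give the flavor of the zoo class, a minimal sketch, assuming the zoo package is installed; the two irregular daily series are invented for illustration:

```r
library(zoo)                                  # assumed installed from CRAN
## two irregular series indexed by dates
z1 <- zoo(c(1.1, 2.0, 2.8), as.Date("2004-06-01") + c(0, 3, 7))
z2 <- zoo(c(5.0, 5.5),      as.Date("2004-06-01") + c(3, 7))
merge(z1, z2)                                 # aligned on the union of the two indexes
window(z1, start = as.Date("2004-06-04"))     # index-based subsetting
```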
Other changes

• Packages mclust1998, netCDF, and serialize were moved from the main CRAN section to the Archive.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News
by Bettina Grün

Donations

• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)

New supporting institutions

• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina, Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand

New supporting members

Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)

Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board: Douglas Bates and Paul Murrell.

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
R News ISSN 1609-3631
Vol 41 June 2004 42
follow-up designs By Ernesto Barrios based onDaniel Meyerrsquos code
DCluster A set of functions for the detection of spa-tial clusters of disease using count data Boot-strap is used to estimate sampling distributionsof statistics By Virgilio Goacutemez-Rubio JuanFerraacutendiz Antonio Loacutepez
GeneTS A package for analyzing multiple gene ex-pression time series data Currently GeneTSimplements methods for cell cycle analy-sis and for inferring large sparse graphi-cal Gaussian models For plotting the in-ferred genetic networks GeneTS requires thegraph and Rgraphviz packages (availablefrom wwwbioconductororg) By Konstanti-nos Fokianos Juliane Schaefer and KorbinianStrimmer
HighProbability Provides a simple fast reliable so-lution to the multiple testing problem Given avector of p-values or achieved significance lev-els computed using standard frequentist infer-ence HighProbability determines which onesare low enough that their alternative hypothe-ses can be considered highly probable The p-value vector may be determined using exist-ing R functions such as ttest wilcoxtestcortest or sample HighProbability can beused to detect differential gene expression andto solve other problems involving a large num-ber of hypothesis tests By David R Bickel
Icens Many functions for computing the NPMLE forcensored and truncated data By R Gentlemanand Alain Vandal
MCMCpack This package contains functions forposterior simulation for a number of statisti-cal models All simulation is done in com-piled C++ written in the Scythe Statistical Li-brary Version 04 All models return codamcmc objects that can then be summarized us-ing coda functions or the coda menu interfaceThe package also contains some useful utilityfunctions including some additional PDFs andpseudo-random number generators for statis-tical distributions By Andrew D Martin andKevin M Quinn
MNP A publicly available R package that fitsthe Bayesian Multinomial Probit models viaMarkov chain Monte Carlo Along with thestandard Multinomial Probit model it can alsofit models with different choice sets for each ob-servation and complete or partial ordering ofall the available alternatives The estimation isbased on the efficient marginal data augmen-tation algorithm that is developed by Imai andvan Dyk (2004) By Kosuke Imai Jordan VanceDavid A van Dyk
NADA Contains methods described by Dennis RHelsel in his book ldquoNondetects And Data Anal-ysis Statistics for Censored EnvironmentalDatardquo By Lopaka Lee
R2WinBUGS Using this package it is possible tocall a BUGS model summarize inferences andconvergence in a table and graph and savethe simulations in arrays for easy access inR By originally written by Andrew Gelmanchanges and packaged by Sibylle Sturtz andUwe Ligges
RScaLAPACK An interface to ScaLAPACK func-tions from R By Nagiza F Samatova SrikanthYoginath and David Bauer
RUnit R functions implementing a standard UnitTesting framework with additional code in-spection and report generation tools ByMatthias Burger Klaus Juenemann ThomasKoenig
SIN This package provides routines to perform SINmodel selection as described in Drton amp Perl-man (2004) The selected models are in theformat of the ggm package which allows inparticular parameter estimation in the selectedmodel By Mathias Drton
SoPhy SWMS_2D interface Submission to Comput-ers and Geosciences title The use of the lan-guage interface of R two examples for mod-elling water flux and solute transport By Mar-tin Schlather Bernd Huwe
SparseLogReg Some functions wrapping the sparselogistic regression code by S K Shevade and SS Keerthi originally intended for microarray-based gene selection By Michael T Maderwith C code from S K Shevade and S SKeerthi
assist ASSIST see manual By Yuedong Wang andChunlei Ke
bayesmix Bayesian mixture models of univariateGaussian distributions using JAGS By BettinaGruen
betareg Beta regression for modeling rates and pro-portions By Alexandre de Bustamante Simas
circular Circular Statistics from ldquoTopics in circularStatisticsrdquo (2001) S Rao Jammalamadaka andA SenGupta World Scientific By Ulric LundClaudio Agostinelli
crossdes Contains functions for the constructionand randomization of balanced carryover bal-anced designs Contains functions to check
R News ISSN 1609-3631
Vol 41 June 2004 43
given designs for balance Also contains func-tions for simulation studies on the validityof two randomization procedures By MartinOliver Sailer
debug Debugger for R functions with code displaygraceful error recovery line-numbered condi-tional breakpoints access to exit code flowcontrol and full keyboard input By Mark VBravington
distr Object orientated implementation of distribu-tions and some additional functionality ByFlorian Camphausen Matthias Kohl PeterRuckdeschel Thomas Stabla
dynamicGraph Interactive graphical tool for ma-nipulating graphs By Jens Henrik Badsberg
energy E-statistics (energy) tests for comparing dis-tributions multivariate normality Poisson testmultivariate k-sample test for equal distribu-tions hierarchical clustering by e-distancesEnergy-statistics concept based on a general-ization of Newtonrsquos potential energy is due toGabor J Szekely By Maria L Rizzo and GaborJ Szekely
evdbayes Provides functions for the bayesian anal-ysis of extreme value models using MCMCmethods By Alec Stephenson
evir Functions for extreme value theory which maybe divided into the following groups ex-ploratory data analysis block maxima peaksover thresholds (univariate and bivariate)point processes gevgpd distributions By Soriginal (EVIS) by Alexander McNeil R port byAlec Stephenson
fBasics fBasics package from Rmetrics mdash Rmetricsis an Environment and Software Collection forteaching ldquoFinancial Engineering and Compu-tational Financerdquo By Diethelm Wuertz andmany others see the SOURCE file
fExtremes fExtremes package from Rmetrics ByDiethelm Wuertz and many others see theSOURCE file
fOptions fOptions package from Rmetrics ByDiethelm Wuertz and many others see theSOURCE file
fSeries fOptions package from Rmetrics By Di-ethelm Wuertz and many others see theSOURCE file
fortunes R Fortunes By Achim Zeileis fortune con-tributions from Torsten Hothorn Peter Dal-gaard Uwe Ligges Kevin Wright
gap This is an integrated package for genetic dataanalysis of both population and family dataCurrently it contains functions for samplesize calculations of both population-based andfamily-based designs probability of familialdisease aggregation kinship calculation somestatistics in linkage analysis and associationanalysis involving one or more genetic mark-ers including haplotype analysis In the futureit will incorporate programs for path and seg-regation analyses as well as other statistics inlinkage and association analyses By Jing huaZhao in collaboration with other colleaguesand with help from Kurt Hornik and Brian Rip-ley of the R core development team
gclus Orders panels in scatterplot matrices and par-allel coordinate displays by some merit indexPackage contains various indices of merit or-dering functions and enhanced versions ofpairs and parcoord which color panels accord-ing to their merit level By Catherine Hurley
gpls Classification using generalized partial leastsquares for two-group and multi-group (morethan 2 group) classification By Beiying Ding
hapassoc The following R functions are used forlikelihood inference of trait associations withhaplotypes and other covariates in generalizedlinear models The functions accommodate un-certain haplotype phase and can handle miss-ing genotypes at some SNPs By K Burkett BMcNeney J Graham
haplostats Haplo Stats is a suite of S-PLUSR rou-tines for the analysis of indirectly measuredhaplotypes The statistical methods assumethat all subjects are unrelated and that haplo-types are ambiguous (due to unknown link-age phase of the genetic markers) The ge-netic markers are assumed to be codominant(ie one-to-one correspondence between theirgenotypes and their phenotypes) and so we re-fer to the measurements of genetic markers asgenotypes The main functions in Haplo Statsare haploem haploglm and haploscoreBy Jason P Sinnwell and Daniel J Schaid
hett Functions for the fitting and summarizing ofheteroscedastic t-regression By Julian Taylor
httpRequest HTTP Request protocols Implementsthe GET POST and multipart POST request ByEryk Witold Wolski
impute Imputation for microarray data (currentlyKNN only) By Trevor Hastie Robert Tib-shirani Balasubramanian Narasimhan GilbertChu
R News ISSN 1609-3631
Vol 41 June 2004 44
kernlab Kernel-based machine learning methods in-cluding support vector machines By Alexan-dros Karatzoglou Alex Smola Achim ZeileisKurt Hornik
klaR Miscellaneous functions for classification and visualization developed at the Department of Statistics, University of Dortmund. By Christian Roever, Nils Raabe, Karsten Luebke, Uwe Ligges

knncat This program scales categorical variables in such a way as to make NN classification as accurate as possible. It also handles continuous variables and prior probabilities, and does intelligent variable selection and estimation of error rates and the right number of NN's. By Sam Buttrey

kza Kolmogorov-Zurbenko Adaptive filter for locating change points in a time series. By Brian Close, with contributions from Igor Zurbenko

mAr Estimation of multivariate AR models through a computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider, 2001); the procedure is of particular interest for high-dimensional data without missing values, such as geophysical fields. By S. M. Barbosa

mathgraph Simple tools for constructing and manipulating objects of class mathgraph, from the book "S Poetry", available at http://www.burns-stat.com/pages/spoetry.html. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou

mcgibbsit Provides an implementation of Warnes & Raftery's MCGibbsit run-length diagnostic for a set of (not-necessarily independent) MCMC samplers. It combines the estimate error-bounding approach of Raftery and Lewis with the between-versus-within-chain approach of Gelman and Rubin. By Gregory R. Warnes

mixreg Fits mixtures of one-variable regressions (which has been described as doing ANCOVA when you don't know the levels). By Rolf Turner

mscalib Calibration and filtering methods for calibration of mass spectrometric peptide mass lists. Includes methods for internal and external calibration of mass lists. Provides methods for filtering chemical noise and peptide contaminants. By Eryk Witold Wolski

multinomRob Overdispersed multinomial regression using robust (LQD and tanh) estimation. By Walter R. Mebane, Jr., Jasjeet Singh Sekhon

mvbutils Utilities for project organization, editing and backup, sourcing, documentation (formal and informal), package preparation, macro functions, and miscellaneous utilities. Needed by the debug package. By Mark V. Bravington

mvpart Multivariate regression trees. Original rpart by Terry M. Therneau and Beth Atkinson; R port by Brian Ripley; some routines from vegan by Jari Oksanen; extensions and adaptations of rpart to mvpart by Glenn De'ath

pgam This work is aimed at extending a class of state space models for Poisson count data, so-called Poisson-Gamma models, towards a semiparametric specification. Just like generalized additive models (GAM), cubic splines are used for covariate smoothing. The semiparametric models are fitted by an iterative process that combines maximization of likelihood and a backfitting algorithm. By Washington Junger

qcc Shewhart quality control charts for continuous, attribute and count data. Cusum and EWMA charts. Operating characteristic curves. Process capability analysis. Pareto chart and cause-and-effect chart. By Luca Scrucca
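As a brief illustration of the kind of chart the qcc entry describes, here is a minimal sketch (not from the newsletter) assuming the package's qcc() constructor with a type argument and entirely simulated measurement data:

```r
## Hedged sketch of qcc usage (assumed API: qcc(data, type, ...)).
## Data are simulated: 20 subgroups of 5 diameter measurements each.
library(qcc)

set.seed(1)
diameters <- matrix(rnorm(100, mean = 10, sd = 0.2), nrow = 20)

## An X-bar chart: one row per subgroup; qcc computes the center line,
## the control limits, and flags any out-of-control points.
q <- qcc(diameters, type = "xbar")
summary(q)
```

The same constructor is what the article's longer qcc tutorial builds on; other chart types ("R", "p", "c", ...) are selected the same way.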
race Implementation of some racing methods for the empirical selection of the best. If the R package rpvm is installed (and if PVM is available, properly configured and initialized), the evaluation of the candidates is performed in parallel on different hosts. By Mauro Birattari

rbugs Functions to prepare files needed for running BUGS in batch mode, and for running BUGS from R. Support for Linux systems with Wine is emphasized. By Jun Yan (with part of the code modified from 'bugs.R', http://www.stat.columbia.edu/~gelman/bugsR/, by Andrew Gelman)

ref Small package with functions for creating references, reading from and writing to references, and a memory-efficient refdata type that transparently encapsulates matrices and data frames. By Jens Oehlschlägel

regress Functions to fit Gaussian linear models by maximizing the residual log likelihood. The covariance structure can be written as a linear combination of known matrices. Can be used for multivariate models and random effects models. Easy, straightforward manner to specify random effects models, including random interactions. By David Clifford, Peter McCullagh

rgl 3D visualization device (OpenGL). By Daniel Adler
R News ISSN 1609-3631
Vol. 4/1, June 2004, page 45
rlecuyer Provides an interface to the C implementation of the random number generator with multiple independent streams developed by L'Ecuyer et al. (2002). The main purpose of this package is to enable the use of this random number generator in parallel R applications. By Hana Sevcikova, Tony Rossini

rpart.permutation Performs permutation tests of rpart models. By Daniel S. Myers

rrcov Functions for robust location and scatter estimation and robust regression with high breakdown point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov

sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis
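For readers new to the sandwich package, a minimal sketch (assuming its vcovHC() heteroskedasticity-consistent estimator, applied to an ordinary lm() fit on simulated data):

```r
## Hedged sketch: model-robust (sandwich/HC) standard errors for lm().
## Data are simulated with deliberately non-constant error variance.
library(sandwich)

set.seed(42)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100, sd = 0.5 + abs(x))
fit <- lm(y ~ x)

vcovHC(fit)              # heteroskedasticity-consistent covariance matrix
sqrt(diag(vcovHC(fit)))  # robust standard errors for (Intercept) and x
```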
seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe algorithm using the method described in Schoenfeld, D. (2001), "A simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld

setRNG Set reproducible random number generator in R and S. By Paul Gilbert

sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others

skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson

spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change-of-support problem and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford

spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth

spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg 15869 and J. Comput. Chem., 2003, 24, pg 1215. By Rajarshi Guha

supclust Methodology for supervised grouping of predictor variables. By Marcel Dettling and Martin Maechler

svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie

treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto

urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff

urn Functions for sampling without replacement (simulated urns). By Micah Altman

verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. Original S code by Patrick J. Burns; ported to R by Nick Efthymiou

zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck
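To make the zoo entry concrete, a small sketch (assuming the package's zoo() constructor and its index()/coredata() accessors; the values and dates are invented) of an irregular daily series:

```r
## Hedged sketch of a zoo object: observations totally ordered by an
## irregular Date index.
library(zoo)

z <- zoo(c(1.2, 1.5, 1.4),
         order.by = as.Date(c("2004-01-03", "2004-01-07", "2004-01-20")))

index(z)     # the ordered Date index
coredata(z)  # the plain observations
mean(z)      # ordinary summaries work on the values
```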
Other changes
• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R Foundation News
by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)

• Emanuele De Rinaldis (Italy)

• Norm Phillips (Canada)
New supporting institutions
• Breast Center at Baylor College of Medicine, Houston, Texas, USA

• Dana-Farber Cancer Institute, Boston, USA

• Department of Biostatistics, Vanderbilt University School of Medicine, USA

• Department of Statistics & Actuarial Science, University of Iowa, USA

• Division of Biostatistics, University of California, Berkeley, USA

• Norwegian Institute of Marine Research, Bergen, Norway

• TERRA Lab, University of Regina - Department of Geography, Canada

• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA

Editorial Board: Douglas Bates and Paul Murrell

Editor Programmer's Niche: Bill Venables

Editor Help Desk: Uwe Ligges

Email of editors and editorial board: firstname.lastname@R-project.org

R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor, all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).

R Project Homepage: http://www.R-project.org/

This newsletter is available online at http://CRAN.R-project.org/doc/Rnews/
- Editorial
- The Decision To Use R
  - Preface
  - Some Background on the Company
  - Evaluating The Tools of the Trade - A Value Based Equation
  - An Introduction To R
  - A Continued Evolution
  - Giving Back
  - Thanks
- The ade4 package - I: One-table methods
  - Introduction
  - The duality diagram class
  - Distance matrices
  - Taking into account groups of individuals
  - Permutation tests
  - Conclusion
- qcc: An R package for quality control charting and statistical process control
  - Introduction
  - Creating a qcc object
  - Shewhart control charts
  - Operating characteristic function
  - Cusum charts
  - EWMA charts
  - Process capability analysis
  - Pareto charts and cause-and-effect diagrams
  - Tuning and extending the package
  - Summary
  - References
- Least Squares Calculations in R
- Tools for interactively exploring R packages
  - References
- The survival package
  - Overview
  - Specifying survival data
  - One and two sample summaries
  - Proportional hazards models
  - Additional features
- useR! 2004
- R Help Desk
  - Introduction
  - Other Applications
  - Time Series
  - Input
  - Avoiding Errors
  - Comparison Table
- Programmer's Niche
  - ROC curves
  - Classes and methods
    - Generalising the class
  - S4 classes and methods
  - Discussion
- Changes in R
  - New features in 1.9.1
  - User-visible changes in 1.9.0
  - New features in 1.9.0
- Changes on CRAN
  - New contributed packages
  - Other changes
- R Foundation News
  - Donations
  - New supporting institutions
  - New supporting members
Vol 41 June 2004 44
kernlab Kernel-based machine learning methods in-cluding support vector machines By Alexan-dros Karatzoglou Alex Smola Achim ZeileisKurt Hornik
klaR Miscellaneous functions for classification andvisualization developed at the Department ofStatistics University of Dortmund By Chris-tian Roever Nils Raabe Karsten Luebke UweLigges
knncat This program scales categorical variables insuch a way as to make NN classification as ac-curate as possible It also handles continuousvariables and prior probabilities and does in-telligent variable selection and estimation of er-ror rates and the right number of NNrsquos By SamButtrey
kza Kolmogorov-Zurbenko Adpative filter for locat-ing change points in a time series By BrianClose with contributions from Igor Zurbenko
mAr Estimation of multivariate AR models througha computationally-efficient stepwise least-squares algorithm (Neumaier and Schneider2001) the procedure is of particular interest forhigh-dimensional data without missing valuessuch as geophysical fields By S M Barbosa
mathgraph Simple tools for constructing and ma-nipulating objects of class mathgraph fromthe book ldquoS Poetryrdquo available at httpwwwburns-statcompagesspoetryhtml ByOriginal S code by Patrick J Burns Ported toR by Nick Efthymiou
mcgibbsit Provides an implementation of Warnesamp Rafteryrsquos MCGibbsit run-length diagnos-tic for a set of (not-necessarily independent)MCMC sampers It combines the estimateerror-bounding approach of Raftery and Lewiswith evaluate between verses within chain ap-proach of Gelman and Rubin By Gregory RWarnes
mixreg Fits mixtures of one-variable regressions(which has been described as doing ANCOVAwhen you donrsquot know the levels) By RolfTurner
mscalib Calibration and filtering methods for cal-ibration of mass spectrometric peptide masslists Includes methods for internal externalcalibration of mass lists Provides methods forfiltering chemical noise and peptide contami-nants By Eryk Witold Wolski
multinomRob overdispersed multinomial regres-sion using robust (LQD and tanh) estimationBy Walter R Mebane Jr Jasjeet Singh Sekhon
mvbutils Utilities for project organization editingand backup sourcing documentation (formaland informal) package preparation macrofunctions and miscellaneous utilities Neededby debug package By Mark V Bravington
mvpart Multivariate regression trees Original rpartby Terry M Therneau and Beth Atkinson Rport by Brian Ripley Some routines from ve-gan by Jari Oksanen Extensions and adapta-tions of rpart to mvpart by Glenn Dersquoath
pgam This work is aimed at extending a class ofstate space models for Poisson count dataso called Poisson-Gamma models towards asemiparametric specification Just like the gen-eralized additive models (GAM) cubic splinesare used for covariate smoothing The semi-parametric models are fitted by an iterativeprocess that combines maximization of likeli-hood and backfitting algorithm By Washing-ton Junger
qcc Shewhart quality control charts for continu-ous attribute and count data Cusum andEWMA charts Operating characteristic curvesProcess capability analysis Pareto chart andcause-and-effect chart By Luca Scrucca
race Implementation of some racing methods for theempirical selection of the best If the R pack-age rpvm is installed (and if PVM is availableproperly configured and initialized) the eval-uation of the candidates are performed in par-allel on different hosts By Mauro Birattari
rbugs Functions to prepare files needed for run-ning BUGS in batch-mode and running BUGSfrom R Support for Linux systems with Wineis emphasized By Jun Yan (with part ofthe code modified from lsquobugsRrsquo httpwwwstatcolumbiaedu~gelmanbugsR by An-drew Gelman)
ref small package with functions for creating ref-erences reading from and writing ro refer-ences and a memory efficient refdata typethat transparently encapsulates matrixes anddataframes By Jens Oehlschlaumlgel
regress Functions to fit Gaussian linear model bymaximizing the residual log likelihood Thecovariance structure can be written as a lin-ear combination of known matrices Can beused for multivariate models and random ef-fects models Easy straight forward manner tospecify random effects models including ran-dom interactions By David Clifford Peter Mc-Cullagh
rgl 3D visualization device (OpenGL) By DanielAdler
R News ISSN 1609-3631
Vol 41 June 2004 45
rlecuyer Provides an interface to the C implemen-tation of the random number generator withmultiple independent streams developed byLrsquoEcuyer et al (2002) The main purpose ofthis package is to enable the use of this ran-dom number generator in parallel R applica-tions By Hana Sevcikova Tony Rossini
rpartpermutation Performs permutation tests ofrpart models By Daniel S Myers
rrcov Functions for Robust Location and Scatter Estimation and Robust Regression with High Breakdown Point (covMcd(), ltsReg()). Originally written for S-PLUS (fastlts and fastmcd) by Peter Rousseeuw & Katrien van Driessen; ported to R, adapted and packaged by Valentin Todorov.
sandwich Model-robust standard error estimators for time series and longitudinal data. By Thomas Lumley, Achim Zeileis.
seqmon A program that computes the probability of crossing sequential boundaries in a clinical trial. It implements the Armitage-McPherson and Rowe algorithm using the method described in Schoenfeld, D. (2001), "A Simple Algorithm for Designing Group Sequential Clinical Trials", Biometrics 27, 972–974. By David A. Schoenfeld.
setRNG Set reproducible random number generator in R and S. By Paul Gilbert.
sfsmisc Useful utilities ['goodies'] from Seminar fuer Statistik, ETH Zurich, many ported from S-plus times. By Martin Maechler and many others.
skewt Density, distribution function, quantile function and random generation for the skewed t distribution of Fernandez and Steel. By Robert King, with contributions from Emily Anderson.
spatialCovariance Functions that compute the spatial covariance matrix for the matern and power classes of spatial models, for data that arise on rectangular units. This code can also be used for the change of support problem, and for spatial data that arise on irregularly shaped regions like counties or zipcodes, by laying a fine grid of rectangles and aggregating the integrals in a form of Riemann integration. By David Clifford.
spc Evaluation of control charts by means of the zero-state, steady-state ARL (Average Run Length). Setting up control charts for given in-control ARL and plotting of the related figures. The control charts under consideration are one- and two-sided EWMA and CUSUM charts for monitoring the mean of normally distributed independent data. Other charts and parameters are in preparation. Further SPC areas will be covered as well (sampling plans, capability indices, ...). By Sven Knoth.
spe Implements stochastic proximity embedding as described by Agrafiotis et al. in PNAS, 2002, 99, pg. 15869 and J. Comput. Chem., 2003, 24, pg. 1215. By Rajarshi Guha.
supclust Methodology for Supervised Grouping of Predictor Variables. By Marcel Dettling and Martin Maechler.
svmpath Computes the entire regularization path for the two-class SVM classifier with essentially the same cost as a single SVM fit. By Trevor Hastie.
treeglia Stem analysis functions for volume increment and carbon uptake assessment from tree-rings. By Marco Bascietto.
urca Unit root and cointegration tests encountered in applied econometric analysis are implemented. By Bernhard Pfaff.
urn Functions for sampling without replacement (Simulated Urns). By Micah Altman.
verify This small package contains simple tools for constructing and manipulating objects of class verify. These are used when regression-testing R software. S original by Patrick J. Burns; ported to R by Nick Efthymiou.
zoo A class with methods for totally ordered indexed observations, such as irregular time series. By Achim Zeileis, Gabor Grothendieck.
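As a minimal sketch of what such a totally ordered indexed series looks like in practice (assuming zoo has been installed from CRAN; the dates and values are hypothetical illustrative data):

```r
library(zoo)

# an irregular time series: observations indexed by arbitrary dates
d <- as.Date(c("2004-01-05", "2004-01-07", "2004-01-12"))
z <- zoo(c(1.2, 3.4, 2.1), order.by = d)

window(z, start = as.Date("2004-01-06"))  # subset by the time index
merge(z, lag(z, -1))                      # align a series with its lag
```

Because methods such as merge() align observations on the index rather than on position, irregular series can be combined without manual bookkeeping.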
Other changes
• Packages mclust1998, netCDF and serialize were moved from the main CRAN section to the Archive.
Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Kurt.Hornik@R-project.org
R News ISSN 1609-3631
Vol. 4/1, June 2004, page 46
R Foundation News
by Bettina Grün
Donations
• Dipartimento di Scienze Statistiche e Matematiche di Palermo (Italy)
• Emanuele De Rinaldis (Italy)
• Norm Phillips (Canada)
New supporting institutions
• Breast Center at Baylor College of Medicine, Houston, Texas, USA
• Dana-Farber Cancer Institute, Boston, USA
• Department of Biostatistics, Vanderbilt University School of Medicine, USA
• Department of Statistics & Actuarial Science, University of Iowa, USA
• Division of Biostatistics, University of California, Berkeley, USA
• Norwegian Institute of Marine Research, Bergen, Norway
• TERRA Lab, University of Regina, Department of Geography, Canada
• ViaLactia Biosciences (NZ) Ltd, Auckland, New Zealand
New supporting members
Andrej Blejec (Slovenia)
Javier de las Rivas (Spain)
Sandrine Dudoit (USA)
Werner Engl (Austria)
Balazs Györffy (Germany)
Branimir K. Hackenberger (Croatia)
O. Maurice Haynes (USA)
Kai Hendry (UK)
Arne Henningsen (Germany)
Wolfgang Huber (Germany)
Marjatta Kari (Finland)
Osmo Kari (Finland)
Peter Lauf (Germany)
Andy Liaw (USA)
Mu-Yen Lin (Taiwan)
Jeffrey Lins (Denmark)
James W. MacDonald (USA)
Marlene Müller (Germany)
Gillian Raab (UK)
Roland Rau (Germany)
Michael G. Schimek (Austria)
R. Woodrow Setzer (USA)
Thomas Telkamp (Netherlands)
Klaus Thul (Taiwan)
Shusaku Tsumoto (Japan)
Kevin Wright (USA)
Bettina Grün
Technische Universität Wien, Austria
Bettina.Gruen@ci.tuwien.ac.at
Editor-in-Chief:
Thomas Lumley
Department of Biostatistics
University of Washington
Seattle, WA 98195-7232, USA
Editorial Board: Douglas Bates and Paul Murrell
Editor Programmer's Niche: Bill Venables
Editor Help Desk: Uwe Ligges
Email of editors and editorial board: firstname.lastname@R-project.org
R News is a publication of the R Foundation for Statistical Computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to regular columns to the respective column editor; all other submissions to the editor-in-chief or another member of the editorial board (more detailed submission instructions can be found on the R homepage).
R Project Homepage: http://www.R-project.org/

This newsletter is available online at http://CRAN.R-project.org/doc/Rnews/
- Editorial
- The Decision To Use R
  - Preface
  - Some Background on the Company
  - Evaluating The Tools of the Trade - A Value Based Equation
  - An Introduction To R
  - A Continued Evolution
  - Giving Back
  - Thanks
- The ade4 package - I: One-table methods
  - Introduction
  - The duality diagram class
  - Distance matrices
  - Taking into account groups of individuals
  - Permutation tests
  - Conclusion
- qcc: An R package for quality control charting and statistical process control
  - Introduction
  - Creating a qcc object
  - Shewhart control charts
  - Operating characteristic function
  - Cusum charts
  - EWMA charts
  - Process capability analysis
  - Pareto charts and cause-and-effect diagrams
  - Tuning and extending the package
  - Summary
  - References
- Least Squares Calculations in R
- Tools for interactively exploring R packages
  - References
- The survival package
  - Overview
  - Specifying survival data
  - One and two sample summaries
  - Proportional hazards models
  - Additional features
- useR 2004
- R Help Desk
  - Introduction
  - Other Applications
  - Time Series
  - Input
  - Avoiding Errors
  - Comparison Table
- Programmer's Niche
  - ROC curves
  - Classes and methods
    - Generalising the class
  - S4 classes and methods
  - Discussion
- Changes in R
  - New features in 1.9.1
  - User-visible changes in 1.9.0
  - New features in 1.9.0
- Changes on CRAN
  - New contributed packages
  - Other changes
- R Foundation News
  - Donations
  - New supporting institutions
  - New supporting members