
News
The Newsletter of the R Project
Volume 2/1, March 2002

Editorial
by Kurt Hornik

Welcome to the fourth issue of R News, the newsletter of the R project for statistical computing.

This is the first special issue of R News, with an emphasis on applying R in medical statistics. It is special in many ways: first, it took longer to prepare than we had originally anticipated (which typically seems to happen for special issues). Second, the newsletter file is rather large as there are many exciting images; in fact, in trying to keep the size reasonable, in some cases images are only included at a lower resolution, with the “real stuff” available from the respective authors’ web pages. And finally, articles are fewer but longer: as many of the applications described are based on recent advances in medical technology, we felt that extra space with background information was warranted.

The main focus in this issue is on using R in the analysis of genome data. Here, several exciting initiatives have recently been started, and R has the potential to become a standard tool in this area. The multitude of research questions calls for a flexible computing environment with seamless integration with databases and web content and, of course, state-of-the-art statistical methods: R can do all of that. Robert Gentleman and Vince Carey have started the Bioconductor project, a collaborative effort to provide common infrastructure for the analysis of genome data. The packages developed by this project (about 20 are currently under way) make heavy use of the S4 formal classes and methods (which are now also available in R) introduced in the “Green Book” by John Chambers, and in fact provide the first large-scale project employing the new S object system.

R 1.4.0 was released on Dec 19, 2001, making it the “Lord of the Rings” release of R. Its new features are described in “Changes in R”. The number of R packages distributed via CRAN’s main section alone now exceeds 150 (!!!); “Changes on CRAN” briefly presents the 26 most recent ones. The series of introductory articles on each recommended package continues with “Reading Foreign Files”.

Developers of R packages should consider submitting their work to the Journal of Statistical Software (JSS, http://www.jstatsoft.org/), which has become the ‘Statistical Software’ section of the Journal of Computational and Graphical Statistics (JCGS). JSS provides peer reviews of code and publishes user manuals alongside, and hence ideally complements R News as a platform for disseminating news about exciting statistical software.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Technische Universität Wien, Austria
Kurt.Hornik@R-project.org

Contents of this issue:

Editorial . . . . . . . . . . . . . . . . . . . . . . 1
Reading Foreign Files . . . . . . . . . . . . . . 2
Maximally Selected Rank Statistics in R . . . . 3
Quality Control and Early Diagnostics for cDNA Microarrays . . . . . . . . . . . . . . 6
Bioconductor . . . . . . . . . . . . . . . . . . . 11
AnalyzeFMRI: An R package for the exploration and analysis of MRI and fMRI datasets . . 17
Using R for the Analysis of DNA Microarray Data . . . . . . . . . . . . . . . . . . . . . . 24
Changes in R 1.4.0 . . . . . . . . . . . . . . . . 32
Changes in R 1.4.1 . . . . . . . . . . . . . . . . 39
Changes on CRAN . . . . . . . . . . . . . . . . 40


Vol. 2/1, March 2002 2

Reading Foreign Files
by Duncan Murdoch

One of the first tasks in any statistical analysis is getting the data into a usable form. When your client gives you notebooks filled with observations or a pile of surveys and you control the data entry, then you can choose the form to be something convenient. In R, that usually means some form of text file, either comma- or tab-delimited. Through the RODBC and various SQL packages R can also import data directly from many databases.
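As a quick illustration (not from the article; the file name and toy data are invented), a tab-delimited text file can be round-tripped with write.table() and read.table():

```r
## Toy example: write and re-read a tab-delimited text file.
tf <- tempfile(fileext = ".txt")          # invented scratch file
d  <- data.frame(id = 1:3, weight = c(71.2, 65.0, 80.4))
write.table(d, tf, sep = "\t", row.names = FALSE)
d2 <- read.table(tf, header = TRUE, sep = "\t")
stopifnot(all.equal(d2$weight, d$weight)) # the round trip is lossless
```

Delimited text is the least fragile interchange format, which is why it is the default recommendation whenever you control the data entry.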

However, sometimes you receive data that has already been entered into some other statistical package and stored in a proprietary format. In that case you have a couple of choices. You might be able to write programs in that package to export the data in an R-friendly format, but often you don't have access to it or familiarity with it, and it would be best if R could read the foreign file directly. The foreign package is designed for exactly this situation. It has code to read files from Minitab, S, S-PLUS, SAS, SPSS and Stata. In this article I'll describe how to import data from all of these packages. I wrote the S and S-PLUS parts, so I'll describe those in most detail first. Other credits for the foreign package go to Doug Bates, Saikat DebRoy, and Thomas Lumley.

For more information on data import and export, I recommend reading the fine manual “R Data Import/Export” which comes with the R distribution.

Reading S and S-PLUS files

Both S and S-PLUS usually store objects (which, as most R users know, are very similar to R objects) in individual files in a data directory, as opposed to the single workspace files that R uses. Sometimes the file has the same name as the object, sometimes it has a constructed name. The mapping between object name and file name is complicated and version dependent. In S-PLUS 2000 for Windows, a catalogue file in the data directory contains a list of object names and corresponding filenames. Other versions use different mechanisms. Sometimes (e.g., in library sections) all objects are stored in one big file; at present the foreign package cannot read these.

The foreign function read.S can be used to read the individual binary files. I have tested it mainly on S-PLUS 2000 for Windows, but it is based on code that worked on earlier versions as well. It may not work at all on later versions. (This is a continuing problem with reading foreign files. The writers of those foreign applications frequently change the formats of their files, and in many cases, the file formats are unpublished. When the format changes, foreign stops working until someone works out the new format and modifies the code.)

read.S takes just one argument: the filename to read from. It's up to you to figure out which file contains the data you want. Because S and R are so similar, read.S is surprisingly successful. It can generally read simple vectors and data frames, and can occasionally read functions. It almost always has trouble reading formulas, because those are stored differently in S and R.

For example, I have an S-PLUS data frame to import. At the beginning of the session, S-PLUS printed a header giving the path of its data directory, so that's the place to go looking for the object: but there's no file there by that name. The catalogue file in the data directory has lines near the end which give the correct filename. Reading that file with read.S reproduces the data frame in R; only the printing format is different.

Of course, since I have access to S-PLUS to do this, it would have been much easier to export the data frame as a text file and read it into R using read.table. I could also have exported it using data.dump and restored it using foreign's data.restore, but that has the same limitations as read.S.

R News ISSN 1609-3631


Reading SAS files

SAS stores its files in a number of different formats, depending on the platform (PC, Unix, etc.), version, etc. The read.xport function can read the SAS transport (XPORT) format. How you produce one of these files depends on which version of SAS you're using. In older versions this was done with a dedicated procedure; newer versions use a “library engine” called xport to create them. From the SAS technical note referenced in the read.xport help page, a line like

  libname xxx xport 'xxx.dat';

creates a library named xxx which lives in the physical file 'xxx.dat'. New datasets created in this library, e.g. with

  data xxx.mydata;
    set mydata;

will be saved in the XPORT format.

To read them, use the read.xport function. For example, read.xport("xxx.dat") will return a list of data frames, one per dataset in the exported library. If there is only a single dataset, it will be returned directly, not in a list. The lookup.xport function gives information about the contents of the library.

Reading Minitab files

Minitab also stores its data files in several different formats. The foreign package can only read the “portable” format, the MTP files. In its current incarnation, only numeric data types are supported; in particular, dataset descriptions cannot be read.

The read.mtp function is used to read the files. It puts each column, matrix or constant into a separate component of a list.

Reading SPSS and Stata files

The foreign functions read.spss and read.dta can read the binary file formats of the SPSS and Stata packages respectively. The former reads data into a list, while the latter reads it into a data frame. The write.dta function is unique in the foreign package in being able to export data in the Stata binary format.
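As a hedged sketch (the data frame here is invented), write.dta() and read.dta() can round-trip a data frame through the Stata binary format:

```r
library(foreign)                          # ships with R as a recommended package
tf <- tempfile(fileext = ".dta")          # invented scratch file
d  <- data.frame(x = 1:4, y = c(2.5, 3.5, 1.0, 0.5))
write.dta(d, tf)                          # export in Stata binary format
d2 <- read.dta(tf)                        # ...and read it back as a data frame
stopifnot(nrow(d2) == 4)
```

A round trip like this is also a cheap self-test of the transfer path, in the spirit of the caveats below.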

Caveats and conclusions

It’s likely that all of the functions in the foreign package have limitations, but only some of them are documented. It’s best to be very careful when transferring data from one package to another. If you can, use two different methods of transfer and compare the results; calculate summary statistics before and after the transfer; do anything you can to ensure that the data that arrives in R is the data that left the other package. If it’s not, you’ll be analyzing programming bugs instead of the data that you want to see.

But this is true of any data entry exercise: errors introduced in data entry aren’t of much interest to your client!

Duncan Murdoch
University of Western Ontario

Maximally Selected Rank Statistics in R
by Torsten Hothorn and Berthold Lausen

Introduction

The determination of two groups of observations with respect to a simple cutpoint of a predictor is a common problem in medical statistics. For example, the distinction of a low and high risk group of patients is of special interest. The selection of a cutpoint in the predictor leads to a multiple testing problem, cf. Figure 1. This has to be taken into account when the effect of the selected cutpoint is evaluated. Maximally selected rank statistics can be used for estimation as well as evaluation of a simple cutpoint model. We show how this problem can be treated with the maxstat package and illustrate the usage of the package by gene expression profiling data.

Maximally selected rank statistics

The functional relationship between a quantitative or ordered predictor X and a quantitative, ordered or censored response Y is unknown. As a simple model one can assume that an unknown cutpoint μ in X determines two groups of observations regarding the response Y: the first group with X-values less than or equal to μ and the second group with X-values greater than μ. A measure of the difference between the two groups with respect to μ is the absolute value of an appropriately standardized two-sample linear rank statistic of the responses. We give a short overview and follow the notation in Lausen and Schumacher (1992).

The hypothesis of independence of X and Y can be formulated as

    H0 : P(Y ≤ y | X ≤ μ) = P(Y ≤ y | X > μ)

for all y and μ ∈ ℝ. This hypothesis can be tested as


follows. For every reasonable cutpoint μ in X (e.g., cutpoints that provide a reasonable sample size in both groups), the absolute value of the standardized two-sample linear rank statistic |Sμ| is computed. The maximum of the standardized statistics

    M = max_μ |Sμ|

over all possible cutpoints is used as a test statistic for the hypothesis of independence above. The cutpoint in X that provides the best separation of the responses into two groups, i.e., where the standardized statistics take their maximum, is used as an estimate of the unknown cutpoint.
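The definitions above can be sketched directly in base R. The following toy computation (synthetic data, not package code) evaluates a standardized Wilcoxon rank-sum statistic at each candidate cutpoint and takes the maximum:

```r
set.seed(1)
x <- runif(40)                       # synthetic predictor
y <- rnorm(40) + (x > 0.6)           # synthetic response with a shift at 0.6
r <- rank(y); N <- length(y)
cuts <- quantile(x, probs = seq(0.1, 0.9, by = 0.05))
stat <- sapply(cuts, function(mu) {
  n1 <- sum(x <= mu); n2 <- N - n1
  S  <- sum(r[x <= mu])              # rank sum of the first group
  ## standardize: subtract the mean, divide by the standard deviation
  abs(S - n1 * (N + 1) / 2) / sqrt(n1 * n2 * (N + 1) / 12)
})
M      <- max(stat)                  # test statistic M
cutest <- cuts[which.max(stat)]      # estimated cutpoint
```

The maxstat package replaces this naive loop with the exact and asymptotic P-value approximations discussed next.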

Several approximations for the distribution of the maximum of the standardized statistics Sμ have been suggested. Lausen and Schumacher (1992) show that the limiting distribution is the distribution of the supremum of the absolute value of a standardized Brownian bridge, and consequently the approximation of Miller and Siegmund (1982) can be used. An approximation based on an improved Bonferroni inequality is given by Lausen et al. (1994). For small sample sizes, Hothorn and Lausen (2001) derive a lower bound on the distribution function based on the exact distribution of simple linear rank statistics. The algorithm by Streitberg and Röhmel (1986) is used for the computations. The exact distribution of a maximally selected Gauß statistic can be computed using the algorithms by Genz (1992). Because simple linear rank statistics are asymptotically normal, the results can be applied to approximate the distribution of maximally selected rank statistics (see Hothorn and Lausen, 2001).

The maxstat package

The package maxstat implements both cutpoint estimation and the test procedure above with several P-value approximations, as well as plotting of the empirical process of the standardized statistics. It depends on the packages exactRankTests for the computation of the distribution of linear rank statistics (Hothorn, 2001) and mvtnorm for the computation of the multivariate normal distribution (Hothorn et al., 2001). All packages are available at CRAN. The generic method maxstat.test provides a formula interface for the specification of predictor and response. An object of class "maxtest" is returned; print and plot methods are available for inspection of the results.

Gene expression profiling

The distinction of two types of diffuse large B-cell lymphoma by gene expression profiling is studied by Alizadeh et al. (2000). Hothorn and Lausen (2001) suggest the mean gene expression (MGE) as a quantitative factor for the discrimination between two groups of patients with respect to overall survival time. The dataset DLBCL is included in the package. The maximally selected log-rank statistic for cutpoints between the 10% and 90% quantile of MGE, using the upper bound of the P-value by Hothorn and Lausen (2001), can be computed by

R> data(DLBCL)
R> maxstat.test(Surv(time, cens) ~ MGE,
     data=DLBCL, smethod="LogRank",
     pmethod="HL")

     Maximally selected LogRank statistics
     using HL

data:  Surv(time, cens) by MGE
M = 3.171, p-value = 0.022
sample estimates:
estimated cutpoint
             0.186

For censored responses, the formula interface is similar to the one used in package survival: time specifies the time until an event and cens is the status indicator (0/1). For quantitative responses y, the formula is of the form ‘y ~ x’. Currently it is not possible to specify more than one predictor x. maxstat.test allows the selection of the statistic to be used: "Gauss", "Wilcoxon", "Median", "NormalQuantil" and "LogRank" are available. pmethod defines which kind of P-value approximation is computed: "Lau92" means the limiting distribution, "Lau94" the approximation based on the improved Bonferroni inequality, "exactGauss" the distribution of a maximally selected Gauß statistic and "HL" is the upper bound of the P-value by Hothorn and Lausen (2001). All implemented approximations are known to be conservative and therefore their minimum P-value is available by choosing pmethod="min".

[Figure: absolute standardized log-rank statistic ("using Lau94") plotted against mean gene expression.]

Figure 1: Absolute standardized log-rank statistics and significance bound based on the improved Bonferroni inequality.


For the overall survival time, the estimated cutpoint is 0.186 mean gene expression, and the maximum of the log-rank statistics is M = 3.171. The probability that, under the null hypothesis, the maximally selected log-rank statistic is greater than M = 3.171 is less than 0.022. The empirical process of the standardized statistics together with the α-quantile of the null distribution can be plotted using the plot method.

R> data(DLBCL)
R> mod <-
     maxstat.test(Surv(time, cens) ~ MGE,
       data=DLBCL, smethod="LogRank",
       pmethod="Lau94", alpha=0.05)
R> plot(mod, xlab="Mean gene expression")

If the significance level alpha is specified, the corresponding quantile is computed and drawn as a horizontal red line. The estimated cutpoint is plotted as a vertical dashed line, see Figure 1.

The difference in overall survival time between the two groups determined by a cutpoint of 0.186 mean gene expression is plotted in Figure 2. No event was observed for patients with mean gene expression greater than 0.186.

[Figure: Kaplan-Meier curves, probability against survival time in months, for the groups with mean gene expression > 0.186 and ≤ 0.186.]

Figure 2: Kaplan-Meier curves of two groups of DLBCL patients separated by the cutpoint 0.186 mean gene expression.

Summary

The package maxstat provides a user-friendly interface and implements standard methods as well as recent suggestions for the approximation of the null distribution of maximally selected rank statistics.

Acknowledgement. Support from Deutsche Forschungsgemeinschaft SFB 539-A4 is gratefully acknowledged.

Bibliography

Ash A. Alizadeh, Michael B. Eisen, R. E. Davis, et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403:503-511, 2000.

Alan Genz. Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics, 1:141-149, 1992.

Torsten Hothorn. On exact rank tests in R. R News, 1(1):11-12, 2001.

Torsten Hothorn, Frank Bretz, and Alan Genz. On multivariate t and Gauss probabilities in R. R News, 1(2):27-29, 2001.

Torsten Hothorn and Berthold Lausen. On the exact distribution of maximally selected rank statistics. Preprint, Universität Erlangen-Nürnberg, submitted, 2001.

Berthold Lausen, Wilhelm Sauerbrei, and Martin Schumacher. Classification and regression trees (CART) used for the exploration of prognostic factors measured on different scales. In P. Dirschedl and R. Ostermann, editors, Computational Statistics, pages 483-496, Heidelberg, 1994. Physica-Verlag.

Berthold Lausen and Martin Schumacher. Maximally selected rank statistics. Biometrics, 48:73-85, 1992.

Rupert Miller and David Siegmund. Maximally selected chi square statistics. Biometrics, 38:1011-1016, 1982.

Bernd Streitberg and Joachim Röhmel. Exact distributions for permutations and rank tests: An introduction to some recently published algorithms. Statistical Software Newsletters, 12(1):10-17, 1986.

Torsten Hothorn and Berthold Lausen
Friedrich-Alexander-Universität Erlangen-Nürnberg,
Institut für Medizininformatik, Biometrie und Epidemiologie,
Waldstraße 6, D-91054 Erlangen


Quality Control and Early Diagnostics for cDNA Microarrays
by Günther Sawitzki

We present a case study of a simple application of R to analyze microarray data. As is often the case, most of the tools are nearly available in R, with only minor adjustments needed. Some more sophisticated steps are needed in the finer issues, but these details can only be mentioned in passing.

Quality control for microarrays

Microarrays are a recent technology in biological and biochemical research. The basic idea is: you want to assess the abundance of some component of a sample. You prepare a more or less well defined probe which binds chemically to this specific component. You bring probe and sample into contact, measure the amount of bound component and infer the amount in the original sample.

So far, this is a classical technique in science. But technology has advanced and allows high throughput. Instead of applying samples to probes one by one, we can spot many probes on a chip (a microscope slide or some other carrier) and apply the sample to all of them simultaneously and under identical conditions. Depending on the mechanics and adhesion conditions, today we can spot 10,000–50,000 probes on a chip in well-separated positions. So one sample (or a mixture of some samples) can give information about the response for many probes. These are not i.i.d. observations: the probes are selected and placed on the chip by our choice, and responses as well as errors usually are correlated. The data have to be considered as high dimensional response vectors.

Various approaches to analyzing this kind of data are discussed in the literature. We will concentrate on base level data analysis. Of course any serious analysis does specify diagnostic checks for validating its underlying assumptions. Or, to put it the other way, an approach without diagnostics for its underlying assumptions can hardly be considered serious. The relevant assumptions to be checked depend on the statistical methods applied, and we face the usual problem that any residual diagnostic is influenced by the choice of methods used to define fit and hence residuals. However there are some elementary assumptions to check which are relevant to all methods. If you prefer, you can name it quality control.

We use R for quality control of microarrays. We simplify and specialize the problem to keep the presentation short. So for now the aim is to implement an early diagnostic to support quality control for microarrays in a routine laboratory environment: something like a scatterplot matrix, but adapted to the problem at hand.

Some background on DNA hybridization

Proteins are chemical components of interest in biochemical research. But these are very delicate to handle, and specific probes have to be developed individually protein by protein. As a substitute, the genes controlling the protein production can be studied. The genes are encoded in DNA, and in the synthesis process these are transcribed to RNA which then enters the protein production. DNA and transcribed RNA come in complementary pairs, and the complementary pairs bind specifically (hybridization). So for each specific DNA segment the complementary chain gives a convenient probe.

In old-fashioned litmus paper, the probe on the paper provides colour as a visible indicator. In hybridization experiments, we have to add an indicator. In DNA/RNA hybridization experiments the DNA is used as a probe. The indicator now is associated with the sample: the RNA sample is transcribed to cDNA (which is more stable) and marked, using dyes or a radioactive marker. For simplicity, we will concentrate on colour markers. After probe and sample have been hybridized in contact and unbound material has been washed off, the amount of observable marker will reflect the amount of bound sample.

Data is collected by scanning. The raw data are intensities by scan pixel, either for an isolated scan channel (frequency band) or for a series of channels. Conventionally these data are pre-processed by image analysis software. Spots are identified by segmentation, and an intensity level is reported for each spot. Additionally, estimates for the local background intensities are reported.

In our specific case, our partner is the Molecular Genome Analysis Department of the German cancer research center (DKFZ). Cancer research has various aspects. We simplify strongly in this presentation and consider the restricted question as to which genes are significantly more (or significantly less) active in cancer cells, in comparison to non-cancerous “normal” cells. Which genes are relevant in cancer? Thinking in terms of classical experiments leads to taking normal and tumor cell samples from each individual and applying a paired comparison. Detecting differentially expressed genes is a very different


challenge from classical comparison. Gene expression analysis is a search problem in a high dimensional structured data set, while classical paired comparison deals with independent pairs of samples in a univariate context. But at its core, gene expression analysis benefits from scores which may be motivated by a classical paired comparison.

Statistical routine immediately asks for factors and variance components. Of course there will be a major variation between persons. But there will also be a major contribution from the chip and hybridizations: the amount of sample provided, the sample preparations, hybridization conditions (e.g., temperature) and the peculiarities of the scanning process which is needed to determine the marker intensity by spot, among others, are prone to vary from chip to chip. To control these factors or variance components, sometimes a paired design may be used by applying tumor and normal tissue samples on the same chip, using different markers (e.g., green and red dye). This is the basic design in the fundamental experiments.

The challenge is to find the genes that are relevant. The typical experiment has a large number of probes with a high variation in response between samples from different individuals. But only a small number of genes is expected to be differentially expressed (“many genes, few differentials”).

To get an idea of how to evaluate the data, we can think of the structure we would use in a linear gaussian model to check for differences in the foreground (fg). We would implement the background (bg) as a nuisance covariable and in principle use a statistic based upon something like

    T = Y_tumor − Y_normal

where, e.g.,

    Y_tumor  = ln(Y_fg tumor − Y_bg tumor)
    Y_normal = ln(Y_fg normal − Y_bg normal)

given or taken some transformation and scaling. Using log intensities is a crude first approach. Finding the appropriate transformations and scaling is a topic of its own, see e.g. Huber et al. (2002). The next non-trivial step is combining spot by spot information to gene information, taking into account background and unknown transformations. Details give an even more complex picture: background correction and transformation may be partly integrated in image processing, or may use information from other spots. But there is an idea of the general structure, and together with statistical folklore it suggests how to do an early diagnostic.

We hope that, up to choice of scale, T covers the essential information of the experiment. We hope we can ignore all other possible covariates and for each spot we can concentrate on Y_fg tumor, Y_bg tumor, Y_fg normal, Y_bg normal as sources of information. Up to choice of scale, (Y_fg tumor − Y_fg normal) represents the raw “effect”, and a scatter plot of Y_fg tumor vs Y_fg normal is the obvious raw graphic tool.

Unfortunately this tool is stale. Since tumor and normal sample may have different markers (or may be placed on different slides in single dye experiments) there may be unknown transformations involved in going from our target, the amount of bound probe specific sample, to the reported intensities Y. Since it is the most direct raw plot of the effect, the scatterplot is worth being inspected. But more considerations may be necessary.
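The crude score and raw scatterplot described above can be sketched in a few lines; the intensity values here are invented for illustration:

```r
fg.tumor  <- c(1200, 830, 4100, 560)   # invented per-spot intensities
bg.tumor  <- c( 110,  95,  130, 105)
fg.normal <- c( 980, 870, 1500, 540)
bg.normal <- c( 100,  90,  120, 100)
y.tumor  <- log(fg.tumor  - bg.tumor)  # background-corrected log intensities
y.normal <- log(fg.normal - bg.normal)
score <- y.tumor - y.normal            # per-spot raw "effect"
plot(fg.normal, fg.tumor, log = "xy")  # the obvious raw graphic tool
```

In real data the background can exceed the foreground for weak spots, so the naive log difference already fails there; this is one reason why the choice of transformation is a topic of its own.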

Diagnostics may be sharper if we know what we are looking for. We want diagnostics that highlight the effect, and we want diagnostics that warn against covariates or violation of critical assumptions. For now, we pick out one example: spatial variation. As with all data collected from a physical carrier, position on the carrier may be an important covariate. We do hope that this is not the case, so that we can omit it from our model. But before running into artifacts, we should check for spatial homogeneity.

Diagnostic for spatial effects

For illustration, we use a paired comparison between tumor and normal samples. After segmentation and image pre-processing we have some intensity information Yfg tumor, Ybg tumor, Yfg normal, Ybg normal per spot. A cheap first step is to visualize these for each component by position (row and column) on the chip. With R, the immediate idea is to organize each of these vectors as a matrix with dimensions corresponding to the physical chip layout and to use image().

As usual, some fine tuning is needed. As has been discussed many times, S and R use different concepts of coordinate orientation for matrices and plots, and hence the image appears rotated. Second, we want an aspect ratio corresponding to the chip geometry, while image() follows the layout of R's graphics regions. A small wrapper around image() is used as a convenient solution for these details (and offers some additional general services which may be helpful when representing a matrix graphically).
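The orientation fix can be sketched as follows; `chipImage` is a hypothetical stand-in for the article's wrapper (its real name is not recoverable from this scan):

```r
## Hypothetical helper: plot a matrix so the displayed image
## matches the physical chip layout, with matrix row 1 at the top.
chipImage <- function(m, ...) {
  nr <- nrow(m)
  nc <- ncol(m)
  ## image() draws z[i, j] at (x[i], y[j]) with y increasing
  ## upwards; transposing the row-reversed matrix undoes the
  ## apparent rotation.
  z <- t(m[nr:1, , drop = FALSE])
  image(x = seq_len(nc), y = seq_len(nr), z = z,
        asp = 1,                  # square cells, as on the chip
        axes = FALSE, xlab = "", ylab = "", ...)
  invisible(z)
}

m <- matrix(1:6, nrow = 2)        # a tiny 2 x 3 "chip"
z <- chipImage(m)
```

With this convention the plotted picture has the same row/column orientation as the physical carrier, which makes it directly comparable to the raw scanned image.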

Using red and green colour palettes for a paired comparison experiment gives a graphical representation which separates the four information channels and is directly comparable to the raw scanned image. An example of these images is on the StatLab Heidelberg web pages.

The next step is to enhance accessibility of the information. The measured marker intensity is only an indirect indicator of the gene activity. The relation is influenced by many factors, such as sample amount, variation between chips, and scanner settings. Internal scanner calibration in the scan machine (which may automatically adjust between scans) and settings of

R News ISSN 1609-3631


Vol. 2/1, March 2002 8

the image processing software are notorious sources of (possibly nonlinear) distortions. So we want to make a paired comparison, but each of both sides undergoes an unknown transformation. Here specific details of the experiment come to help. We cannot assume that both samples in a pair undergo the same transformation. But while we do not know the details of the transformations, we hope that at least each is guaranteed to be monotone.

At least for experiments with "many genes - few differentials", this gives one of the rare situations where we see a blessing of high dimensions, not a curse. The many spots scanned under comparable conditions provide ranks for the measured intensity which are a stable indicator of the rank of the original activity of each probe. So we apply the image wrapper to the ranks, where the ranks are taken separately for the four components and within each scan run.
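As a sketch (the function name is ours, not the article's), the rank transformation for one channel, and the invariance property that makes it robust:

```r
## Rank-transform one information channel, keeping the chip layout.
rankChannel <- function(m) {
  matrix(rank(m, ties.method = "average"), nrow = nrow(m))
}

## Ranks are invariant under strictly monotone transformations,
## so an unknown but monotone scanner calibration does not change
## the rank image:
m  <- matrix(c(3, 1, 4, 1.5, 9, 2.6), nrow = 2)
r1 <- rankChannel(m)
r2 <- rankChannel(log(m))   # unknown monotone distortion
```

Taking the ranks separately per component and per scan run keeps channels with different (unknown) calibrations comparable.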

The rest is psychology. Green and red are commonly used as dyes in these experiments or in presentations, but these are not the best choice for visualization. If it comes to judging quantitative differences, both colour scales are full of pitfalls. Instead we use a colour palette going from blue to yellow, with more lightness in the middle value range.

To add some sugar, background is compared between the channels representing tumor and normal. If we want a paired comparison, background may be ignorable if it is of the same order for both, because it balances in differences. But if we have spots for which the background values differ drastically, background correction may be critical. If these points come in spatial clusters, a lot more work needs to be done in the analysis. To draw attention to this, spots with extremely high values compared to their counterpart are highlighted. This highlighting is implemented as an overlay. Since image() has the facility to leave undefined values as background, it is enough to apply the wrapper again with the uncritical points marked as NA, using a fixed colour.
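A sketch of the palette and of the NA overlay idea (the palette endpoints and the "drastic difference" criterion are our assumptions, not the article's exact choices):

```r
## Blue-to-yellow palette with a light middle value range.
pal <- colorRampPalette(c("blue", "white", "yellow"))(64)

## Toy background channels for a 3 x 4 block of spots.
set.seed(1)
bgTumor  <- matrix(rexp(12, rate = 1/100), nrow = 3)
bgNormal <- matrix(rexp(12, rate = 1/100), nrow = 3)

## Hypothetical criterion: more than a three-fold difference
## between the background channels counts as "critical".
critical <- abs(log(bgTumor / bgNormal)) > log(3)

## Uncritical points become NA, so a second image() call with
## add = TRUE paints only the critical spots in a fixed colour:
overlay <- ifelse(critical, 1, NA)
## image(overlay, col = "red", add = TRUE)
```

Because image() leaves NA cells untouched, the overlay highlights the critical spots without repainting the rest of the picture.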

Wrapping it up

Rank transformation and image generation are wrapped up in a single procedure showchip. Since the ranks cover all order information, but lose the original scale, marginal scatterplots are provided as well, on a fixed common logarithmic scale.

Misadjustment in the scanning is a notorious problem. Estimated densities over all spots within one scan run are provided for the four information items, together with the gamma correction exponent which would be needed to align the medians.

If all conditions were fixed or only one data set were used, this would be sufficient. The target environment however is the laboratory front end where the chip scanning is done as the chips come in. Experimental setup (including chip layout) and sampling protocols are prone to vary. Passing the details as single parameters is error-prone, and passing the data repeatedly is forbidding for efficiency reasons due to the size of the data per case.

The relevant information is bundled instead. Borrowing ideas from relational databases, for each series of experiments one master list is kept which holds references to tables or lists describing details of the experiment. These details range from a description of the geometry, over a representation of the actual experimental design, to various lists which describe the association between spots and corresponding probes. S always had facilities for a limited form of object oriented programming, since functions are first class members in S. Besides data slots, the master list has list elements which are functions. Using enclosure techniques as described in Gentleman and Ihaka (2000), it can extract and cache information from the details. The proper data are kept in separate tables, and methods of the master list can invoke showchip and other procedures with a minimum of parameter passing, while guaranteeing a consistency which would be endangered if global variables were used.
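The bundling idea can be sketched with a closure (the names and file format here are ours, for illustration only):

```r
## A master "list with methods": data references plus functions
## sharing one enclosing environment (cf. Gentleman and Ihaka, 2000).
newExperiment <- function(layoutFile) {
  layout <- NULL                       # private cache
  getLayout <- function() {
    if (is.null(layout))
      layout <<- read.table(layoutFile, header = TRUE)
    layout                             # cached after the first read
  }
  list(layoutFile = layoutFile, getLayout = getLayout)
}
```

Because getLayout lives in the environment created by the newExperiment call, repeated invocations reuse the cached table instead of re-reading the (potentially large) detail file, and no global variables are needed.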

From the user perspective, a typical session may introduce a new line of experiments. The structure of the session is

chip <- newchip("new project name")
## creates a new descriptor object from
## defaults for standard settings, e.g.
## a base set of methods
chip$describelayout("some description file")
## amends the settings with details,
## e.g. the number of print blocks
## and the job lists,
## plate and print head geometry
chip$show()
## the current experiment descriptor is
## shown or saved to a new project file

Once an experimental layout has been specified, it is attached to some specific data set as

chip$setdata("some data file")
## associates the data file with the
## current experiment setup

Of course it would be preferable to set up a reference from the data to the experiment descriptor. But we have to take into account that the data may be used in other evaluation software; so we should not add elements to the low-level data if we can avoid it.

When the association has been set up, application is done as

chip$showchip(<some chip identifier>)
## data for a slice of chip/chips is shown


This is a poor man's version of object-oriented programming in R. For a full-grown model of object-oriented programming in R, see Chambers and Lang (2001).

Looking at the output

We cannot expect a spatially uniform distribution of the signals over the spots. The probes do not fall at random on the chip, but they are placed, and the placement may reflect some strategy or some tradition. In the sample output (Figure 1), we see a gradient from top to bottom. If we look closer, there may be a separation between upper and lower part. In fact these mirror the source of the genes spotted here. This particular chip is designed for investigations on kidney cancer. About half of the genes come from a "clone library" of kidney related genes, the others from genes generally expected to be cancer related. This feature is apparent in all chips from this special data series. The immediate warning to note is: it might be appropriate to use a model which takes this into account as a factor. On some lines, the spots do not carry an active probe. These unused spots provide an excellent possibility to study the null distribution. The vertical band structure corresponds to the plates in which the probes are provided; the square blocks come from characteristics of the print head. These design factors are well known, and need to be taken into account in a formal analysis.

Of course these known features can be taken into account in an adapted version of showchip, and all this information is accessible in the lists mentioned above. In a true application this would be used to specialize showchip. For now let us have another look at the unadapted plot shown in Figure 1.

Background is more diffuse than foreground: good, since we expect some smooth spatial variation of the background, while we expect a grainy structure in the foreground reflecting the spot probes. High background spots show a left-right antisymmetry. This is a possible problem in this sample pair if it were to go unnoticed (it is an isolated problem on this one chip specimen).

As in the foreground, there is some top/bottom gradient. This had good reason for the foreground signal. But for the background, there is no special asymmetry. As this feature runs through all data of this series of experiments, it seems to indicate a systematic effect (possibly a definition of "background" in the image processing software which picks up too much foreground signal).

Conclusion

showchip is used for the first steps in data processing. It works on minimal assumptions and is used to give some first pilot information. For any evaluation strategy, it can be adapted easily if the critical statistics can be identified at spot level. For a fixed evaluation strategy, it can then be fine-tuned to include information on the spatial distribution of the fit and residual contributions of each spot. This is where the real application lies, once proper models are fixed, or while discussing model alternatives for array data analysis.

An outline of how showchip is integrated in a general rank-based analysis strategy is given in Sawitzki (2001).

P.S. And of course there is one spot out.

Acknowledgements

This work has been done in cooperation with the Molecular Genome Analysis Group of the German cancer research center (DKFZ).

Thanks to Wolfgang Huber and Holger Sültmann for background information and fruitful discussions.

A video showing the experiments step by step is accessible from the StatLab Heidelberg web pages and available as DVD on CD from the DKFZ address.

Bibliography

J. M. Chambers and D. T. Lang. Object-oriented programming in R. R News, 1(3):17-19, September 2001.

R. Gentleman and R. Ihaka. Lexical scope and statistical computing. Journal of Computational and Graphical Statistics, 9(3):491-508, September 2000.

W. Huber, A. von Heydebreck, H. Sültmann, A. Poustka, and M. Vingron. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. (Submitted to Bioinformatics), January 2002.

G. Sawitzki. Microarrays: Two short reviews and some perspectives. Technical report, StatLab Heidelberg, 2001. (Göttingen Lectures) Göttingen/Heidelberg 2001.

Günther Sawitzki
StatLab Heidelberg


[Figure 1 here: eight diagnostic panels for chip l34-u09632vene.txt: rank images rk(fg.green), rk(bg.green), rk(fg.red) and rk(bg.red); scatterplots fg.green vs bg.green (# fg < bg: 191) and fg.red vs bg.red (# fg < bg: 171); densities of the log intensities (fg gamma ratio: 1.052); and scatterplots bg.red vs bg.green and fg.red (TV) vs fg.green (TC).]

Figure 1: Output of showchip. This example compares tumor center material with tumor progression front.


Bioconductor
Open source bioinformatics using R

by Robert Gentleman and Vincent Carey

Introduction

One of the interesting statistical challenges of the early 21st century is the analysis of genomic data. This field is already producing large complex data sets that hold the promise of answering important questions about how different organisms function. Of particular interest is the ability to study diseases at a new level of resolution.

While the rewards appear substantial, there are large hurdles to overcome. The size and complexity of the data may prevent many statisticians from getting involved. The need to understand a reasonable (or unreasonable) amount of biology in order to be effective will also be a barrier to entry for some.

We have recently begun a new initiative to develop software tools to make the analysis of these data sets easier, and we hope to substantially lower the barrier to entry for statisticians. The project is called Bioconductor, http://www.bioconductor.org. We chose the name to be suggestive of cooperation, and we very much hope that it will be a collaborative project developed in much the same spirit as R.

In this article we will consider three of the Bioconductor packages and the problems that they were designed to address. We will concentrate on the analysis of DNA microarray data. This has been the focus of our initial efforts. We hope to expand our offerings to cover a much wider set of problems and data in the future. The problems that we will consider are:

- data complexity: the basic strategy here is to combine object-oriented programming with appropriate data structures. The package is Biobase.

- utilizing biological meta data: the basic strategy here is to use databases and hash tables to link symbols. The package is annotate.

- gene selection: given a set of expression data (generally thousands of genes), how do we select a small subset that might be of more interest? This is equivalent to ranking genes or groups of genes in some order. The package is genefilter.

- documentation: as we build a collection of interdependent but conceptually distinct packages, how do we ensure durable interoperation? We describe a vignette documentation concept, implemented using Sweave (a new function in R contributed by F. Leisch).

An overview of this area, provided by the European Bioinformatics Institute, may be found on the EBI web site (http://www.ebi.ac.uk). Also, a recent issue of the Journal of the American Medical Association (November 14, 2001) has a useful series of pedagogic and interpretive papers.

Handling data complexity

Most microarray experiments consist of some number of samples, usually less than 100, on which expression of messenger RNA (mRNA) has been measured. While there are substantial interesting statistical problems involving the design and processing of the raw data, we will concentrate on data that have been processed to the point that we have comparable expression data for a set of genes for each of our samples. For simplicity of exposition we will consider a clinical setting where the samples correspond to patients.

Our primary genomic data resource then consists of an expression array which has several thousand rows (representing genes) and columns representing patients. In addition we have data on the patients, such as age, gender, disease status, duration of symptoms and so on. These data are conceptually and often administratively distinct from the expression data (which may be managed in a completely different database). We will use the term phenotypic data to refer to this resource, acknowledging that it may include information on demographics or therapeutic conditions that are not really about phenotype.

What are efficient approaches for coordinating access to genomic and phenotypic data as described here? It is possible to couple the genomic and phenotypic data for each patient as a row in a single data frame, but this seems unnatural. We recognize that the general content and structure of expression data and phenotype data may evolve in different ways as biological technology and research directions change, and that the corresponding types of information will seldom be treated in the symmetric fashion implied by storing them in a common data frame. Our approach to data structure design for this problem preserves the conceptual independence of these distinct information types and exploits the formal methods and classes in the package methods, which was developed by J. M. Chambers.

Class exprSet and its methods

The methods package provides a substantial basis for object-oriented programming in R. Object-oriented programming has long been a tool for dealing with complexity. If we can find a suitable representation for our data then we can think of it as an object. This object is then passed to various routines that are designed to deal with the special structure in the objects.

An object has slots that represent its different components. To coordinate access to the genomic and phenotypic data generated in a typical microarray experiment, we defined a class of objects called exprSet. An instance of this class has the following slots:

exprs An array that contains the estimated expression values, with columns representing samples and rows representing genes.

se.exprs An array of the same size as exprs containing estimated standard errors. This may be NULL.

phenoData An instance of the phenoData class which contains the sample level variables for this experiment. This class is described subsequently.

description A character variable describing the experiment. This will probably change to some form of documentation object when and if we adopt a more standard mechanism for documentation.

annotation This is a character string indicating the name of the annotation data that can be used for this exprSet.

notes A character string for notes regarding the analysis or for any other purpose.

Once we have defined the class we may create instances of it. A particular data set stored in this format would be called an instance of the class. While we should technically always say something like: x is an instance of the exprSet class, we will often simply say that x is an exprSet.

We have implemented a number of methods for the exprSet and phenoData classes. Most of these are preliminary and some will evolve as our understanding and the data analytic processing involved mature. The reader is directed to the documentation in our packages for definitive descriptions.

We would like to say a few words here about the subsetting methods. By implementing special subsetting methods we are able to ensure that the data remain correctly aligned. So, if x is an instance of an exprSet, we may ask for x[, 1:5]. The second index in this subscripting expression refers to tissue samples. The value of this expression is an exprSet whose contents are restricted to the first five samples in x. The new exprSet's exprs element contains the first five columns of x's exprs array. The new exprSet's se.exprs array is similarly restricted. But the phenoData must also have a subset made, and for it we want the first five rows of x's phenoData data frame, since it is in the more standard format where columns are variables and rows are cases.

Notice that by representing our data with a formal class we have removed a great deal of the complexity associated with the data. We believe that this representation relieves the data analyst of complex tasks of coordination and checking of diverse data resources and makes working with exprSets much simpler than working with the raw data.

We have also defined the $ operator to work on exprSets. This operator selects variables of the appropriate name from the phenoData component of the exprSet.

The phenoData class was designed to hold the phenotypic (or sample level) data. Instances of this class have the following slots:

pData This is a data.frame with samples as the rows and the phenotypic variables as the columns.

varLabels This is a list with one element per phenotypic variable. Names of list elements are the variable names; list element values are character strings with brief textual descriptions of the associated variables.

We find it very helpful to keep descriptions of the variables associated with the data themselves. Such a construct could be helpful for all data.frames.
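The design can be sketched with a toy pair of S4 classes; this illustrates the idea and is not Biobase's actual code:

```r
library(methods)

## Minimal stand-ins for the phenoData and exprSet classes.
setClass("miniPhenoData",
         representation(pData = "data.frame", varLabels = "list"))
setClass("miniExprSet",
         representation(exprs = "matrix", phenoData = "miniPhenoData"))

## Subsetting on samples keeps the expression columns and the
## phenotype rows aligned.
setMethod("[", "miniExprSet", function(x, i, j, ..., drop = FALSE) {
  pd <- x@phenoData
  pd@pData <- pd@pData[j, , drop = FALSE]
  new("miniExprSet",
      exprs = x@exprs[, j, drop = FALSE],
      phenoData = pd)
})

## `$` looks a variable up in the phenotype data.
setMethod("$", "miniExprSet", function(x, name) x@phenoData@pData[[name]])

es <- new("miniExprSet",
          exprs = matrix(1:20, nrow = 4),        # 4 genes x 5 samples
          phenoData = new("miniPhenoData",
                          pData = data.frame(sex = c("f", "m", "f", "m", "f")),
                          varLabels = list(sex = "patient sex")))
sub <- es[, 1:2]
```

One subscript operation now restricts both information types at once, which is the coordination task the formal class takes away from the analyst.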

Example

The golubEsets package includes exprSets embodying the leukemia data of the celebrated Science paper of T. Golub and colleagues (Science 286:531-537, 1999). Upon attaching the golubTrain element of the package, we perform the following operations.

Show the data. This gives a concise report.

> golubTrain
Expression Set (exprSet) with
 7129 genes
 38 samples
phenoData object with 11 variables and 38 cases
 varLabels
 Samples: Sample index
 ALL.AML: Factor, indicating ALL or AML
 BM.PB: Factor, sample from marrow or peripheral blood
 T.B.cell: Factor, T cell or B cell leukemia
 FAB: Factor, FAB classification
 Date: Date sample obtained
 Gender: Factor, gender of patient
 pctBlasts: pct of cells that are blasts
 Treatment: response to treatment
 PS: Prediction strength
 Source: Source of sample


Subset expression values. This gives a matrix. Note the gene annotation, in Affymetrix ID format.

> exprs(golubTrain)[1:4, 1:3]
                [,1] [,2] [,3]
AFFX-BioB-5_at  -214 -139  -76
AFFX-BioB-M_at  -153  -73  -49
AFFX-BioB-3_at   -58   -1 -307
AFFX-BioC-5_at    88  283  309

Tabulate stratum membership. This exploits the redefined $ operator.

> table(golubTrain$ALL.AML)
ALL AML
 27  11

Restrict to the ALL patients. We obtain the dimensions after restriction to patients whose phenoData indicates that they have acute lymphocytic leukemia.

> dim(exprs(golubTrain[, golubTrain$ALL.AML == "ALL"]))
[1] 7129   27

Other tools for working with exprSets are provided, including methods for iterating over genes with function application, and sampling from patients in an exprSet (with or without replacement). The latter method returns an exprSet.

Annotation

Relating the genes to various biological data resources is essential. While there is much to be learned and many of the biological databases are very incomplete, it is nevertheless essential to start employing these data. Again our approach is quite simple, but it has proven effective, and we are able both to analyse the data we encounter and to provide resources to others.

Our approach has been to divide the process into two components. One process is the building and collation of annotation data from public databases, and the other process is the extraction and formatting of the collated data for analysis reporting. Data analysts using Bioconductor will typically simply obtain a set of annotation tables for the chip data they are using. Given the appropriate tables they can select genes according to certain conditions such as functional group, biological process or chromosomal location.

The process of building annotation data sets is carried out by the AnnBuilder package. We distribute this, but most users will not want to build their own annotation. They will want instead to have access to specific annotation for their chip. AnnBuilder relies on several R packages (available from CRAN), including D. Temple Lang's XML and T. Keitt's RPgSQL.

Typically the annotation data available are a complete set for all genomes or a complete set for a specific genome, such as yeast or humans. Since this is typically much larger than what is currently available on any microarray, and in the interest of performance, we have found that producing annotation collections at the level of a chip has been quite successful.

The data analyst simply needs to ensure that they have an annotation collection suitable for their chip. A number of these are available from the Bioconductor web page, and given the appropriate reference data we can provide custom-made collections within a few days.

Currently these collections arrive as a set of comma separated files. The first entry in each row is the identification name or label for the gene. This should correspond to the row names of the exprs value in the exprSet that is being analysed. The annotation filenames should have their first five letters identical to the value in the annotation slot of the exprSet, since this is used to align the two. This package will become more object oriented in the near future and the distribution mechanism will change.

Affymetrix Inc. produces several chips for the human genome. The most popular in current use are the U95v2 chips (with version A being used most often). The probes that are arrayed here have Affymetrix identifiers. We provide functions that map the Affymetrix identifiers to a number of other identifiers, such as the LocusLink value and the GenBank value. In addition we have mapped most probes to their Gene Ontology (GO) values (http://www.geneontology.org/). GO is an attempt to provide standardized descriptions of the biological relevance of genes in the following three categories:

- Biological process

- Cellular component

- Molecular function

Within a category the set of terms forms a directed acyclic graph. Genes typically get a specific value for each of these three categories. We trace each gene to the top (root node) and report the last three nodes in its path to the root of the tree. These top three nodes provide general groupings that may be used to examine the data for related patterns of expression.
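As a toy sketch of the reporting step (the terms below are a tiny hand-made fragment, not the real GO graph, and we follow a single parent chain rather than the full DAG):

```r
## One parent per term, heavily simplified;
## GO:0008150 ("biological_process") is the root.
parents <- c("GO:0006082" = "GO:0044237",
             "GO:0044237" = "GO:0008152",
             "GO:0008152" = "GO:0008150")

pathToRoot <- function(term, parents) {
  path <- term
  while (term %in% names(parents)) {
    term <- parents[[term]]
    path <- c(path, term)
  }
  path
}

## Report the last n nodes on the way to the root: the general
## groupings used to look for related patterns of expression.
topNodes <- function(term, parents, n = 3) {
  tail(pathToRoot(term, parents), n)
}
```

In the real ontology a term can have several parents, so an implementation has to choose among (or enumerate) the possible paths.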

Additional data, such as chromosomal location, is also available. As data become available we will be adding them to our offerings. Readers with knowledge of data that we have not included are invited to submit this information to us.

The annotation files are read into R as environments and used as if they were hash tables. Typically the name they are stored under is the identifier (but that does not need to be the case). The


value can then be accessed using get. Since we often want to get or put multiple values (if we have 100 genes, we would like to get the chromosome location for all of them), there are functions multiget and multiput.
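The pattern can be sketched as follows; the first mapping (J04423) is the one quoted later in this article, and the second is our assumption that the companion BioB probe targets the same E. coli bioB sequence:

```r
## An annotation table as an environment used like a hash table,
## keyed by probe identifier.
gbEnv <- new.env(hash = TRUE)
assign("AFFX-BioB-5_at", "J04423", envir = gbEnv)
assign("AFFX-BioB-M_at", "J04423", envir = gbEnv)  # same bioB entry (assumed)

one  <- get("AFFX-BioB-5_at", envir = gbEnv)       # single lookup
## multiget-style access: look many identifiers up at once
many <- mget(c("AFFX-BioB-5_at", "AFFX-BioB-M_at"), envir = gbEnv)
```

Environment lookups are hashed, so this scales well to the thousands of probes on a chip.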

Often researchers are interested in finding out more about their genes. For example, they would like to look at the different resources at NCBI. We can easily produce HTML pages with active links to the different online resources. ll.htmlpage is an example of a function designed to provide links to the LocusLink web page for a list of genes.

It is also possible, using connections and the XML package, to open http connections and read the data from the web sites directly. It could then be processed using other R tools. For example, if abstracts or full-text searchable articles were available, these could be downloaded and searched for relevant terms.

The remainder of this section involves reference to Web sites and R functionality that may change in the future. This portion of the project and the underlying data are in a state of near constant change. An up-to-date version of the material in this section, along with relevant scripts and data images, will be maintained on the Bioconductor web site (http://www.bioconductor.org). If you have problems with any of the commands described below, please consult this URL.

To illustrate how one can get started analyzing publicly distributed expression data with R, the following commands lead to a matrix numExprs containing the Golub training data, and a character vector of Affymetrix expression tags, affyTags:

golubURL <- url(paste("http://www-genome.wi.mit.edu/",
    "mpr/data_set_ALL_AML_train.txt",
    sep = ""), "r")
golMtx <- scan(golubURL, sep = "\t", what = "")
## fix columns mismatched due to a short header line
golMtx <- c(golMtx[1:77], "", golMtx[-(1:77)])
golTrans <- t(matrix(golMtx, nrow = 78))
affyTags <- golTrans[-1, 2]
chExprs <- golTrans[-1, seq(3, 78, 2)]
numExprs <- t(apply(chExprs, 1, as.numeric))

Let’s see what the Affymetrix tags look like:

> t(t(affyTags[c(1, 6, 11)]))
     [,1]
[1,] "AFFX-BioB-5_at"
[2,] "AFFX-DapX-5_at"
[3,] "AFFX-ThrX-M_at"

Using the ������package of Bioconductor, we canmap these tags to GenBank identifiers:

� ����,�������/

� ����,�� ��/

?;&�� ������,/ B �� &� ���

�&����,�''D���F',��8 �8��/G8

�5=?;&����������/

The result is a list of GenBank identifiers:

$"AFFX-BioB-5_at"
[1] "J04423"

$"AFFX-DapX-5_at"
[1] "L38424"

$"AFFX-ThrX-M_at"
[1] "X04603"

Upon submitting the tag "J04423" to GenBank, we find that this gene is from E. coli, and is related to biotin biosynthesis.

A long range objective of the Bioconductor project is to facilitate the production and collation of interpretive biological information of this sort at any stage of the process of filtering and modeling expression data.

Gene filtering

In this section we consider the process of selecting genes that are associated with clinical variables of interest. The package genefilter is one approach to this problem. We think of the selection process as the sequential application of filters. If a gene passes all of the filters then we deem it interesting and select it.

For example, genes are potentially interesting if they show some variation across the samples that we have. They are potentially interesting if they have estimated expression levels that are high enough for us to believe that the gene is actually expressed. These are both examples of non-specific filters; we have not made any reference to a covariate. Covariate-based filters specify conditions of magnitude or variation in expression, within or between strata defined by covariates, that must be satisfied by a gene in order that it be retained for subsequent filtering and analysis.

Closures for non-specific filters

A filter function will usually have some context-specific parameters to define its behavior. To make it easy to associate the correct settings with a filter function we have used R's lexical scoping rules. Here is the function kOverA that filters genes by requiring that at least k of the samples have expression values larger than A.

kOverA <- function(k, A = 100, na.rm = TRUE) {
    function(x) {
        if (na.rm)
            x <- x[!is.na(x)]
        sum(x > A) >= k
    }
}

To define a filter that requires a minimum of five samples to have expression level at least 150, we simply evaluate the expression

R News ISSN 1609-3631


Vol. 2/1, March 2002 15

ff1 <- kOverA(5, 150)

Now ff1 is a function, since that is what kOverA returns. ff1 takes a single argument, x, which will be a vector of gene expressions (one expression level for each sample) and evaluates the body of the inner function in kOverA. In that body k, A and na.rm are unbound symbols. They will obtain their values as those that we specified when we created the closure (the coupling of the function body with its enclosing environment) ff1. If eset is an exprSet, the result of apply(exprs(eset), 1, ff1) is a logical vector with gth element TRUE if gene g (with expression levels stored in row g of the exprs slot of eset) has expression level at least 150 in 5 of the samples, and FALSE otherwise.

Another more interesting non-specific filter is the gap filter. It selects genes using a function that has three parameters. The first is the gap, the second is the IQR and the third is a proportion. To apply this filter the (positive) expression levels for the gene are sorted and the proportion indicated are removed from both tails. If in the remainder of the data there is a gap between adjacent values of at least the amount specified by the gap parameter then the gene will pass the filter. Otherwise, if the IQR based on all values is larger than the specified IQR the gene passes the filter. The purpose of this filter is to select genes that have some variation in their expression values and hence might have something interesting to say.
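The gap filter just described can be sketched as a closure in the same style as kOverA (an illustration of the algorithm, not the genefilter package's exact implementation; the names gapFilterSketch, gap, iqr and prop are ours):

```r
## Sketch of the gap filter: trim `prop` from each tail of the sorted
## expression values; pass if a gap of at least `gap` separates two
## adjacent remaining values, or if the IQR of all values exceeds `iqr`.
gapFilterSketch <- function(gap, iqr, prop) {
  function(x) {
    x <- sort(x[!is.na(x)])
    n <- length(x)
    trimmed <- x[(floor(n * prop) + 1):ceiling(n * (1 - prop))]
    if (max(diff(trimmed)) >= gap)
      return(TRUE)
    IQR(x) > iqr
  }
}
```

A gene whose trimmed values jump from a low group to a high group passes on the gap criterion; a flat gene fails both tests.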

Covariate-dependent filters

An advantage of this method is that any statistical test can be implemented as a filter. Functions implementing covariate-dependent filters are built through closures, have the same appearance as simpler non-specific filters, and are apply'd to expression level data in precisely the same way as described above.

Here is a filter-building function that uses the partial likelihood ratio to test for an association between a censored survival time and gene expression levels.

coxfilter <- function(surt, cens, p)
{
    autoload("coxph", "survival")
    function(x) {
        srvd <- try(coxph(Surv(surt, cens) ~ x))
        if (inherits(srvd, "try-error"))
            return(FALSE)
        ltest <- -2 * (srvd$loglik[1] - srvd$loglik[2])
        pv <- 1 - pchisq(ltest, 1)
        if (pv < p)
            return(TRUE)
        return(FALSE)
    }
}

To use this method, we must create the closure by establishing bindings for the survival time, censoring indicator, and significance level for declaring an association. Ordinarily, the survival data will be captured in the phenoData slot of an expression set, say eset, so we will see something like

sfun <- coxfilter(eset$time, eset$cens, 0.05)

Now apply(exprs(eset), 1, sfun) is a logical vector with element TRUE at index g whenever gene g is associated with survival, and FALSE otherwise.

We have provided other simple filters such as a t-test filter and an ANOVA filter. One can simply set up the test and require that the p-value of the test when applied to the appropriate covariate (from the phenoData slot) is smaller than some prespecified value.
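Such a filter might be sketched as follows (our own illustration; the genefilter package's t-test filter differs in its details, and the names ttestFilter and grp are ours):

```r
## A covariate-based filter: keep a gene when a two-sample t test on the
## groups defined by `grp` (e.g., taken from phenoData) is significant.
ttestFilter <- function(grp, p = 0.01) {
  function(x) t.test(x ~ grp)$p.value < p
}
```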

In one designed experiment we used testing within gene-specific non-linear mixed effects models to measure the association between gene expression and a continuous outcome. The amount of programming required was minimal. Filtering in this simple way provides us with a very natural paradigm and tool with which to carry out gene selection. Note however, that in some situations we will need to deal with more than one gene at a time.

Equipped with the lexical scoping rules, we have separated the processes of generic filter specification (filter building routines supplied in the genefilter package), selection of detailed filter parameters (binding of parameters when the closure is obtained from the filter builder), and application of filters over large numbers of genes. For the latter process we have given examples of "manual" use of apply, but more efficient tools for sequential filtering are supplied through the filterfun and genefilter functions in the package.
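The idea behind those functions can be sketched in a few lines of base R (a simplified stand-in with hypothetical names; the real filterfun and genefilter are more general):

```r
## Combine several filter closures into one filter that passes a gene
## only if every component filter passes, then apply it over the rows.
filterfunSketch <- function(...) {
  flist <- list(...)
  function(x) {
    for (f in flist)
      if (!f(x)) return(FALSE)   # fail fast: must pass every filter
    TRUE
  }
}
kOverA <- function(k, A = 100) function(x) sum(x > A) >= k
ff  <- filterfunSketch(kOverA(2, 5), function(x) diff(range(x)) > 1)
mat <- rbind(g1 = c(6, 7, 8), g2 = c(1, 1, 1), g3 = c(6, 6, 6))
apply(mat, 1, ff)   # g1 passes; g2 fails both; g3 fails the range filter
```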

A word of warning is in order for those who have not worked extensively with function closures. It is easy to forget that the function is a closure and has some variables captured, and seemingly correct results can be obtained if you are not careful. The alternative is to specify all parameters and strata at filtering time, but we find that this makes the filtering code much more complicated and less intuitive.

Documentation

We have embarked on the production of packages and protocols to facilitate greater involvement of statisticians in bioinformatics, and to support smoother collaboration between statisticians, biologists and programmers when working with genomic data. Documentation of data structures, methods, and analysis procedures must be correct, thorough, and accessible if this project is to succeed.

R has a number of unique and highly effective protocols for the creation (prompt) and use of documents at the function and package level that facilitate simple interactive demonstration of function behaviors (the example function, which applies to help topics) and unit testing of functions and packages at build time (the testing carried out by R CMD check).

We have introduced two new documentation protocols as we grow the Bioconductor project.

promptClass

A revised promptClass function has been developed which produces a template that illustrates instantiation of classes (using the new function) and searches through the search list to obtain information on formal methods that apply to the class being documented. Thus the creator of a new class is motivated to describe not only the structure of the class being created, but also the family of methods that may be used to work with this class.

The vignette protocol

The intent of the R CMD check unit testing protocols is to establish that the elements of a package function in accordance with the description in the manual pages. The Bioconductor project has two additional unit testing requirements. First, packages must work together in well-defined ways: examples involving multiple packages and diverse data structures must interoperate correctly and continuously while evolving. Second, users will require 'higher-level' views of programming processes than are typically afforded by package man pages. Users will need to be taken through the steps of data structure creation, search-list extension, and then through the details of any particular data processing or inference procedure, with considerable narrative guidance.

Sweave allows integrated session narration, shell monitoring, and graphics incorporation in documents that are specified in LaTeX and are renderable using PDF. We propose that Sweave documents (with suffix '.Rnw') and PDF compilations of them be kept in the 'inst/doc' subdirectory of packages related to the Bioconductor project.
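A minimal Sweave source file might look as follows (an illustrative sketch; the chunk name and layout are our own choices, not a Bioconductor requirement):

```latex
\documentclass{article}
\begin{document}
\section*{A toy vignette}
Code chunks are evaluated by Sweave and their output is woven
into the \LaTeX{} source before it is rendered to PDF:
<<summarize, echo=TRUE>>=
x <- rnorm(100)
summary(x)
@
\end{document}
```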

Conclusions

Analytical computing for bioinformatics has been characterized to date by the proliferation of separate binaries or scripts that implement various analysis methods in very easy to use formats (e.g., Microsoft Excel modules, Java applets). This paradigm has a number of drawbacks, including frequent reliance upon idiosyncratic data input and output formats, high costs of establishing interoperability, difficulties of incorporating new methodological insights into existing programs, minimal infrastructure to verify procedure portability, and non-standard documentation.

The Bioconductor project is based on the award-winning interactive programming language S, as deployed in the open-source statistical computing environment R. An interactive programming language provides an ideal platform for prototyping implementations of new ideas and comparing the new approaches to well-established older approaches. The language basis confers well-defined paths to extension and interoperation of any routines. Procedures that are too intensive to work routinely in the interactive environment may be coded in any higher-level language and loaded dynamically in R for higher performance with no change to the user interface. The object-oriented programming facilities provide access to state-of-the-art methods in data structure and software design that help to control code complexity and reduce obstacles to smooth interoperation of diverse procedures. The intersystems interfacing initiative at www.omegahat.org provides tools to permit smooth interoperation of Bioconductor routines with completely foreign systems such as relational databases, browsers, or other virtual machines.

The role of R in bioinformatics research has been substantial, with contributions already present on CRAN (sma, GeneSOM, etc.) and its use in many projects for functional genomics. It is our hope that the Bioconductor project will aid in the federation of programming efforts to support progress in many different areas of biology and clinical research. Those who are interested in this project should visit our web site and consider subscribing to the bioconductor mailing list.

Acknowledgment

Vincent Carey's work on Bioconductor is supported in part by US National Heart Lung and Blood Institute Grant HL66795, Innate Immunity in Heart Lung and Blood Disease.

Robert Gentleman's work is supported in part by NIH/NCI Grant 2P30 CA06516-38.

Robert Gentleman
DFCI
rgentlem@jimmy.harvard.edu

Vincent Carey
Channing Lab
stvjc@channing.harvard.edu


AnalyzeFMRI: An R package for the exploration and analysis of MRI and fMRI datasets

by Jonathan Marchini

AnalyzeFMRI is a developing package for the exploration and analysis of large MRI and fMRI datasets. In this article we give a short introduction to MRI and fMRI and describe how we have used R when working with these large datasets. We describe the current version of the package using examples, describe our own approach for fMRI analysis and outline our plans for future functionality in the package.

MRI and fMRI

Magnetic Resonance Imaging (MRI) is a non-invasive medical imaging technique that provides images that represent slices of (brain) tissue. Essentially, measurements occur within each slice on a grid of cube-like volume elements (voxels). These measurements reflect the chemical and magnetic properties of the tissue in each small voxel and result in structural MRI images that show detailed contrast between different tissue types. Variations of the imaging parameters sensitize the images to different chemical and physical properties of interest.

The relationship between functional MRI (fMRI) and structural MRI is analogous to the relationship between still photography and movies. In simple terms fMRI is fast, repeated structural MRI and can be carried out in such a way as to be sensitive to local changes in levels of brain activity. Increases in local brain activity increase the local levels of blood oxygen. This in turn causes the measured signal to increase. Thus the images produced are said to be Blood Oxygenation Level Dependent (BOLD). Through the BOLD mechanism we can use fMRI datasets to localize specific areas or networks of the brain that are responsible for cognitive functions of interest. In this fashion fMRI has revolutionized the area of neuroscience over the last 10 years. An excellent introduction to this area is given by Matthews et al. (2001).

In a typical fMRI experiment the subject is repeatedly imaged whilst performing tasks or receiving certain stimuli designed to elicit the cognitive function of interest. Traditionally, tasks/stimuli are applied in alternating blocks of 20–30 seconds. More recently, 'event-related' designs in which the stimulus is applied for short bursts in a stochastic manner have become popular.

In those areas of the brain that are activated by the tasks/stimuli we will observe that the voxel time series are correlated with the design of the experiment (see Figure 1). Focus usually centers on delineating those areas (clusters of voxels) of the brain that exhibit a significant response. Most often we will wish to make inferences using a group of subjects.

Figure 1: A real voxel time series (solid line) in an area of the brain activated by an alternating 'OFF/ON' visual stimulus (dotted line).

Figure 2: A functional MRI image

The resolution of fMRI datasets is very good. Typically datasets obtained using this technique consist of whole brain scans (≈ 20 images) repeated every 2–3 seconds. Usually each image consists of 64×64 voxels of size (4mm)³ (see Figure 2).

In summary, an fMRI dataset consists of a 3D grid of voxels, each containing a time series of measurements that reflect brain activity. Out of roughly 15,000 voxels that lie inside the brain we wish to identify those that were activated.

Using R for fMRI datasets

In order to analyze fMRI experiments we need to be able to manipulate and visualize the large amounts of data in a relatively easy fashion. This in turn allows us to implement and evaluate existing approaches and develop new methods of analysis. The analysis will often involve several computationally intensive steps and so it is also important to keep an eye on the efficiency of the code.

The high level environment that R provides has allowed us to implement and test our methods quickly and reliably. During this process we have used profiling facilities to identify areas of our code that can be speeded up (Venables, 2001) and where necessary we have written C and Fortran code to achieve this (Venables and Ripley, 2000). We have often found that the memory overheads of carrying out all calculation within R can be high. For this reason we have often used R simply to pass file names and analysis parameters to C code. All file I/O and computations are carried out in C before writing the results to a new data file. In addition we have found it useful to write simple Graphical User Interfaces (GUIs) using the tcltk package. This allows analyses to be set up and run without recourse to command line functions.

The ANALYZE format and I/O

The ANALYZE image format is a general medical image format developed by the Mayo Clinic¹ that is commonly used within the brain imaging community. The format consists of an image file ('.img') together with a header file ('.hdr'). The header file details the binary storage type of the image file, dimensions of the dataset and certain imaging parameters used during acquisition. In addition the endianness of the image and header files can be detected by reading the first 4 bytes of the header file, as they contain the constant header file size (348 bytes). The size of a typical fMRI dataset stored in this format is 30Mb.
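The endianness check just described can be sketched as follows (a hypothetical helper written for illustration, not part of the package's documented interface):

```r
## The first 4 bytes of an ANALYZE header hold the header size, which is
## always 348, so reading them under one byte order and checking the
## result reveals how the file was written.
analyzeEndian <- function(hdrfile) {
  con <- file(hdrfile, "rb")
  on.exit(close(con))
  size <- readBin(con, "integer", n = 1, size = 4, endian = "little")
  if (size == 348) "little" else "big"
}
```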

The package provides read and write capabilities for the ANALYZE format. For example, a simple summary of the dataset is available.

> f.analyze.file.summary("example.img")

File name: example.img
Data Dimension: 4-D
X dimension: 64
Y dimension: 64
Z dimension: 21
Time dimension: 100 time points
Voxel dimensions: 4 mm x 4 mm x 4 mm
Data type: signed short
(16 bits per voxel)

Functions of the form f.read.analyze.* allow whole or parts of the dataset to be read into R, i.e.,

> a <- f.read.analyze.volume("example.img")
> dim(a)
[1] 64 64 21 100

The function f.write.analyze allows an array to be written into a new ANALYZE format file, i.e.,

> a <- array(rnorm(8000), dim = c(20, 20, 20))
> f.write.analyze(a, file = "test", size = "float")

In addition, f.basic.hdr.list.create and f.write.list.to.hdr allow the user to store custom information in the header files.

Exploratory Data Analysis

fMRI datasets are large and can contain significant noise structure and imaging artifacts. The quality of datasets can vary widely from scanner to scanner and even from day to day on the same scanner. For these reasons it is important to regularly 'eyeball' the datasets to examine their quality. The package provides two methods of examining the structure present in each fMRI dataset.

The first is a simple spectral summary of the dataset produced using f.spectral.summary. This function calculates the periodogram of each voxel time series normalized by the median periodogram ordinate. A plot is produced that shows the quantiles of the normalized periodogram ordinates at each frequency. This provides a fast look at an fMRI dataset to identify any artifacts that reside at single frequencies. Figure 3 shows the results of applying this function to a real fMRI dataset. The spike at frequency 18 is consistent with the frequency of the periodic stimulus of the experiment. The spike at frequency 60 highlights the existence of a Nyquist ghost image artifact that can occur in fMRI datasets.

> f.spectral.summary(file = "example.img", mask.file = "example.mask.img", ret.flag = FALSE)

[1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21]
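The normalization performed by this function can be sketched on a toy data matrix (our own illustration of the idea, not the package's code):

```r
## Periodogram of each voxel series, normalized by its median ordinate;
## quantiles across voxels then summarize each frequency.
set.seed(2)
vox <- matrix(rnorm(50 * 64), nrow = 50)        # 50 scans, 64 toy voxels
P   <- apply(vox, 2, function(y)
             spec.pgram(y, plot = FALSE, taper = 0)$spec)
Pn  <- sweep(P, 2, apply(P, 2, median), "/")    # normalize per voxel
q   <- apply(Pn, 1, quantile, probs = c(0.5, 0.9, 0.99))
```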

¹ http://www.mayo.edu/bir/PDF/ANALYZE75.pdf



Figure 3: Spectral summary plot of an fMRI dataset.

We provide a GUI that allows visualization of individual voxel time-series, images of specific slices and functional volumes, movies through time of specific slices and spectral summary plots. This GUI is invoked through the function f.analyzeFMRI.gui.

Secondly, we have applied Independent Component Analysis (ICA) to fMRI datasets and found that the extracted components are extremely useful in uncovering otherwise hidden structure. In ICA the data matrix X is considered to be a linear combination of non-Gaussian (independent) components, i.e., X = SA where the columns of S contain the independent components and A is a linear mixing matrix. In short ICA attempts to 'un-mix' the data by estimating an un-mixing matrix W where XW = S.

Under this generative model the measured 'signals' in X will tend to be 'more Gaussian' than the source components (in S) due to the Central Limit Theorem. Thus, in order to extract the independent components/sources we search for an un-mixing matrix W that maximizes the non-Gaussianity of the sources. It is useful to note that in the absence of a generative model for the data the algorithm used is equivalent to Projection Pursuit. We use the FastICA algorithm (Hyvärinen, 1999) implemented in the package fastICA to estimate the sources (see that package for more details).
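The notation can be checked on a toy example where the mixing matrix is known (in practice W must be estimated, e.g. with fastICA; here we simply invert A):

```r
## X = S A with known sources S and mixing matrix A; the ideal
## un-mixing matrix W = A^{-1} recovers the sources, X W = S.
set.seed(42)
S <- cbind(runif(200) - 0.5, sin(seq(0, 20, length.out = 200)))
A <- matrix(c(1, 0.6, 0.4, 1), 2, 2)
X <- S %*% A
W <- solve(A)
```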

For fMRI data we concatenate the dataset into a data matrix so that each row of X represents one voxel time series. Using this formulation allows us to estimate spatially independent sources that occur in the dataset. The function f.ica.fmri calls C code that reads the data directly from the ANALYZE image file, carries out the ICA decomposition and returns the results into R. We also provide a GUI written using tcltk that allows a fast way of visualizing the results of the ICA decomposition. This GUI is invoked through the function f.ica.fmri.gui (see Figure 4).

Figure 5 shows one of the estimated components (columns of S) plotted in its spatial configuration together with its associated weighting through time (row of A). This particular component shows the high frequency Nyquist ghosting artifact suggested by the spectral summary plot.

Figure 4: The GUI for spatial ICA for fMRI datasets.

Analysis

The most widely used strategy for the analysis of fMRI experiments is carried out using a two-stage approach. In the first stage a linear model with correlated errors is used for each individual voxel time-series Y. That is,

Y = Xβ + ε,   ε ~ N(0, σ²V)   (1)

where the design matrix X contains terms that model the BOLD response to the stimuli and non-linear trends that are often observed in fMRI voxel time-series.

Once this model has been fitted at each voxel, summary statistics (typically t and F statistics) are calculated that reflect the components of the response under study. For example, we might calculate an F statistic that reflects the variance component attributable to the BOLD response. In the context of a group study the statistic at each voxel is calculated from the model fits of all the subjects. These statistic values can then be plotted spatially as a statistic image (see Figure 8). The second stage then focuses on the analysis of the statistic image in order to identify those areas of the brain that were activated by the stimuli.

Time-series modelling

In general, the linear model (1) used at each voxel will contain three main components: (i) non-linear trends, (ii) the BOLD response to the stimulus, and (iii) auto-correlated noise.



Figure 5: An extracted spatial ICA component showing a high frequency Nyquist ghosting artifact.

Trends are commonly modelled using polynomial, cosine or spline basis terms. Alternatively the trends can be removed from the data before analysis by applying an appropriate filter/smoother. We have found a Gaussian weighted running lines smoother reliably removes the non-linear trends (Friedman, 1984). For efficiency we avoid multiple calls to smoothing functions (such as supsmu) by initially calculating a smoothing matrix S and applying it at each voxel.
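The smoothing-matrix device works for any linear smoother; here is a sketch with a Gaussian kernel smoother standing in for the Gaussian weighted running lines smoother (illustrative bandwidth):

```r
## Build the smoothing matrix S once; S %*% y is then the smoothed
## version of any voxel series y, so detrending is a matrix multiply.
n  <- 100
tt <- 1:n
W  <- dnorm(outer(tt, tt, "-"), sd = 5)   # Gaussian weights in time
S  <- W / rowSums(W)                      # rows sum to one
y  <- sin(tt / 10) + rnorm(n, sd = 0.1)   # toy voxel series with a trend
detrended <- y - S %*% y
```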

The nature of the BOLD response implies that in areas of activation we will observe a delayed and blurred version of the stimulus design. This is commonly modelled through the convolution of the stimulus design x(t)² with a Haemodynamic Response Function (HRF), h(t). That is,

BOLD(t) = Σ_s h(t − s) x(s) = (h ∗ x)(t)   (2)

Suggested forms for the HRF include discretized Poisson, Gamma and Gaussian density functions. To allow for spatial variation in the HRF at each voxel a small set of basis HRFs can be used that vary in the amount they blur and delay the stimulus design (Friston et al., 1995). Each HRF is convolved with the design to make one column of the design matrix X.

² Normally a vector of 1s and 0s that specify the times when the stimulus is 'ON' and 'OFF' respectively.
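Equation (2) can be sketched directly (illustrative block design and Gamma HRF parameters; scan spacing is taken as one time unit):

```r
## Convolve a boxcar stimulus x(t) with a discretized Gamma-density HRF
## h(t) to obtain the expected BOLD regressor, one column of X.
stim <- rep(rep(c(0, 1), each = 10), times = 5)   # OFF/ON blocks
hrf  <- dgamma(0:15, shape = 4, rate = 1)         # delayed, blurred kernel
bold <- convolve(stim, rev(hrf), type = "open")   # (h * x)(t)
bold <- bold[seq_along(stim)]                     # align to scan times
```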

The final component in the linear model is the auto-correlated noise term. The auto-correlation structure varies considerably throughout the brain and the chosen model should reflect this fact. Suggested models include AR(p) (Bullmore et al., 1996), ARMA models (Locascio et al., 1997) and non-parametric spectral density estimates (Lange and Zeger, 1997). Alternatively, Worsley and Friston (1995) suggest imposing a known correlation structure on the model (using a low-pass filter) before using OLS to fit the model at each voxel. The authors point out that this scheme works for periodic designs³ but is inefficient for more interesting event-related designs.

In general the chosen noise model should be flexible enough to capture the spatially varying levels of correlation throughout the brain. The model should provide well-calibrated levels of significance for estimated responses and be resistant to commonly occurring image artifacts.

A spectral domain approach

For periodic designs the specification of the HRF and noise model at each voxel time-series is greatly simplified by working in the spectral domain (Marchini and Ripley, 2000). The simplest example of a periodic design involves c repeats of a block which consists of b volumes scanned during baseline stimulation, followed by b volumes scanned during a stimulus/task of interest. For such designs the majority of the power of the BOLD response will lie at just one frequency ωc = 2πc/n and will result in a spike at that frequency in the periodogram of the voxel time-series, i.e., the spike at the 18th frequency in Figure 6.

Figure 6: Periodogram and spectral density estimate (frequency in cycles per second).

The asymptotic distribution of the periodogram ordinate I(ωc) is given by

I(ωc) ∼ ½ f(ωc) χ²₂,   (3)

where f(ω) represents the non-normalized spectral density of the (second order stationary) noise process. This suggests that a test for an unusually large component at frequency ωc can be constructed through the use of the statistic Rc where

Rc = I(ωc) / f̂(ωc),   (4)

where f̂(ωc) is an estimate of the non-normalized spectral density. If we can obtain a sufficiently accurate estimate of f(ωc) then Rc will have a standard exponential distribution.

Spectral density estimation

Lange and Zeger (1997) used a rectangular smoothing kernel in both space and frequency to estimate f(ω). This approach tends to produce biased estimates at boundaries between obviously different spectral densities.

We use a smoothing spline to estimate the spectral density from the log-periodogram of the voxel time-series (Wahba, 1980). One advantage of working on the log scale is that large high frequency artifacts are down-weighted, which avoids the biased estimates exhibited by other methods.
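The estimate can be sketched for a single series (a simplified illustration; bias corrections for the log-periodogram and the spatial smoothing step discussed below are omitted):

```r
## Fit a smoothing spline to the log-periodogram and exponentiate to get
## an estimate of the spectral density f(omega) of a toy AR(1) series.
set.seed(1)
y    <- arima.sim(list(ar = 0.5), n = 256)
pg   <- spec.pgram(y, plot = FALSE, taper = 0, detrend = FALSE)
fit  <- smooth.spline(pg$freq, log(pg$spec))
fhat <- exp(predict(fit, pg$freq)$y)
```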

To improve the estimation we use a spatially anisotropic filter that only smooths spectral density estimates that are similar. Specifically we use a multivariate normal approximation to the difference between two independent⁴ spectral density estimates. This allows us to decide when two estimates are close enough that they can be smoothed. We have also extended this approach to the case of event-related designs by working with the Fourier transform of the time-series as opposed to just the periodogram (Marchini, 2001).

Calibration

One nice property of this method is that we can calculate Rj at all the other frequencies at which we know there to be no response. Under the null model of no activation all these statistics should have the same distribution and can be used to self-calibrate the method.

Figure 7 illustrates the results of our calibration experiments. We applied the method to 6 null datasets⁵ with (blue lines) and without (red lines) spatial smoothing. We compared the distribution of the statistic values at both the design frequency (dotted lines) and at a range of other frequencies (solid lines) to the theoretical exponential distribution. The statistic values are shown on a −log₁₀(p-value) scale to focus on the tails of the distribution. The line at 45° represents exact agreement with the theoretical distribution. This plot shows how spatial smoothing provides a more accurate estimate of f(ω). This method of calibration also allows us to choose the smoothing spline parameter in an objective fashion.

³ When the columns of the design matrix are close to being eigenvectors of the low-pass filter.
⁴ The approximation works well even with the spatial correlation that exists between estimates.
⁵ Datasets in which there is no activation because no stimulus was applied to the subject.

Figure 7: Calibration results. Empirical versus theoretical −log₁₀(p-value).

Figure 8: p-value image using spline smoothing spectral density estimation (thresholds from 10⁻⁴ to 10⁻¹⁰).

Figures 8 and 9 show statistic images produced by using a smoothing spline and an AR(1) model (Bullmore et al., 1996) for the spectral density at each voxel. The images are shown on a p-value scale, thresholded to show only values below 10⁻⁴ and overlaid onto an image of the slice. The spline smoothing approach shows activation just in the visual cortex. The AR(1) approach identifies the same areas but with additional amounts of false positive activation, indicating that this approach is poorly calibrated.

Figure 9: p-value image using AR(1) spectral density estimation (thresholds from 10⁻⁴ to 10⁻¹⁰).

We have implemented our approach using a mixture of R, C and FORTRAN code. We are currently working on incorporating this method into the AnalyzeFMRI package.

Identifying areas of activation

Many different methods are available for the analysis of fMRI statistic images. The most widely used approaches are based on thresholding the statistic map (Worsley and Friston, 1995). The level of the threshold is chosen to protect against false positive activation under the null hypothesis of no activation anywhere in the image. Results from Random Field theory are used to set the appropriate threshold.

This approach tends to attract a lot of criticism. In some experiments we know the effect will occur somewhere and we want to estimate the location of the response rather than use a hypothesis test to see if it is there. Also, to validate assumptions required by the Random Field theory results the datasets are significantly spatially smoothed. This tends to blur the good resolution fMRI affords.

Despite these criticisms this approach has been very successful in answering the questions that neuroscientists want to ask. In addition, for group studies, the resolution is limited by structural variability between individuals. Thus using a small amount of smoothing doesn't matter and can in fact help detection of a response.

More attractive approaches are based on mixture models for levels of activation (Everitt and Bullmore, 1999; Hartvig and Jensen, 2000). It has been suggested that such an approach provides a less arbitrary way of delineating areas of activation, although (so far) they have only been used with a 0–1 loss function. It will be interesting and important to investigate the effect of varying the loss function.

More recently it has been suggested that thresholds be defined by controlling the False Discovery Rate (FDR) (Genovese et al., 2001). This approach is more focussed on modelling the areas of activation as it is based on controlling the number of false positive tests compared to all positive tests. In this way the method shares a property of mixture model approaches in that the threshold adapts to the properties of the statistic image.

We are currently in the process of implementing these approaches in the AnalyzeFMRI package.

Future plans

The long-term goal of the AnalyzeFMRI package is to provide a comprehensive software package for the analysis of fMRI and MRI images. We aim to take full advantage of the existing (and future) graphics facilities of R by providing fast interfaces to widely used medical image formats and several graphical exploratory tools for fMRI and MRI datasets. In this way we hope to provide a valuable resource for statisticians working in this field, removing the need to replicate coding effort and allowing researchers new to the field to quickly familiarize themselves with many of the current methods of analysis.

Bibliography

E. Bullmore, M. Brammer, S. C. R. Williams, S. Rabe-Hesketh, N. Janot, A. David, J. Mellers, R. Howard, and P. Sham. Statistical methods of estimation and inference for functional MR image analysis. Magnetic Resonance in Medicine, 35:261–277, 1996.

B. S. Everitt and E. D. Bullmore. Mixture model mapping of brain activation in functional magnetic resonance images. Human Brain Mapping, 7:1–14, 1999.

J. H. Friedman. A variable span scatterplot smoother. Technical Report 5, Laboratory for Computational Statistics, Stanford University, 1984.

K. Friston, C. Frith, R. Frackowiak, and R. Turner. Characterising evoked hemodynamics with fMRI. NeuroImage, 2:157–165, 1995.

C. R. Genovese, N. A. Lazar, and T. Nichols. Thresholding of statistical maps in functional neuroimaging using the false discovery rate. Technical report, Department of Statistics, Carnegie Mellon University, 2001.

N. V. Hartvig and J. L. Jensen. Spatial mixture modelling of fMRI data. Human Brain Mapping, 11:233–248, 2000.

A. Hyvärinen. Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10:626–634, 1999.

N. Lange and S. L. Zeger. Non-linear time series analysis for human brain mapping by functional magnetic resonance imaging. Applied Statistics, 46:1–29, 1997.

J. L. Locascio, P. L. Jennings, C. I. Moore, and S. Corkin. Time series analysis in the time domain and resampling methods for studies of functional magnetic resonance brain imaging. Human Brain Mapping, 5:168–193, 1997.

J. L. Marchini. Statistical Issues in the Analysis of MRI Brain Images. PhD thesis, Department of Statistics, Oxford University, UK, 2001.

J. L. Marchini and B. D. Ripley. A new statistical approach to detecting activation in functional MRI. NeuroImage, 12:366–380, 2000.

P. M. Matthews, P. Jezzard, and S. M. Smith, editors. Functional Magnetic Resonance Imaging of the Brain: Methods for Neuroscience. Oxford University Press, 2001.

W. N. Venables. Programmer's niche. R News, 1(1):27–30, 2001.

W. N. Venables and B. D. Ripley. S Programming. Springer, New York, 2000.

G. Wahba. Automatic smoothing of the log periodogram. Journal of the American Statistical Association, 75:122–132, 1980.

K. Worsley and K. Friston. Analysis of fMRI time series revisited – again. NeuroImage, 2:173–181, 1995.

Jonathan L. Marchini
University of Oxford, UK

R News ISSN 1609-3631


Vol. 2/1, March 2002 24

Using R for the Analysis of DNA Microarray Data

by Sandrine Dudoit, Yee Hwa Yang, and Ben Bolstad

Overview

Microarray experiments generate large and complex multivariate datasets. Careful statistical design and analysis are essential to improve the efficiency and reliability of microarray experiments, from the early design and pre-processing stages to higher-level analyses. Access to an efficient, portable, and distributed statistical computing environment is a related and equally critical aspect of the analysis of gene expression data. As part of the Bioconductor project, we are working on the development of R packages implementing statistical methods for the design and analysis of DNA microarray experiments. An early effort is found in the sma (Statistics for Microarray Analysis) package, which we wrote in the Fall of 2000. This small package was initially built to allow the sharing of software with our collaborators, both statisticians and biologists, for pre-processing tasks such as normalization and diagnostic plots. We are in the process of expanding the scope of this package by implementing new methods for microarray experimental design, normalization, estimation, multiple testing, cluster analysis, and discriminant analysis.

This short article begins with a brief introduction to the biology and technology of DNA microarray experiments. Next, we illustrate some of the main steps in the analysis of microarray gene expression data, using the apo AI experiment as a case study. Additional details can be found in Dudoit et al. (2002b). Although we focus primarily on two-color cDNA microarrays, a number of the proposed methods and implementations extend to other platforms as well (e.g., Affymetrix oligonucleotide chips, nylon membrane arrays). Finally, we discuss ongoing revisions and extensions to sma as part of the Bioconductor project. This project aims more generally to produce an open source and open development computing environment for biologists and statisticians working on the analysis of genomic data. More information on Bioconductor can be found at http://www.bioconductor.org.

Background on DNA microarrays

The ever-increasing rate at which genomes are being sequenced has opened a new area of genome research, functional genomics, which is concerned with assigning biological function to DNA sequences.

With the availability of the DNA sequences of many genomes (e.g., the yeast Saccharomyces cerevisiae, the round worm Caenorhabditis elegans, the fruit fly Drosophila melanogaster, and numerous bacteria) and the recent release of the first draft of the human genome, an essential and formidable task is to define the role of each gene and elucidate interactions between sets of genes. Innovative approaches, such as the cDNA and oligonucleotide microarray technologies, have been developed to exploit DNA sequence data and yield information about gene expression levels for entire genomes.

Different aspects of gene expression can be studied using microarrays, such as expression at the transcription or translation level, subcellular localization of gene products, and protein binding sites on DNA. To date, attention has focused primarily on expression at the transcription stage, i.e., on mRNA or transcript levels. There are several types of microarray systems, including the two-color cDNA microarrays developed in the Brown and Botstein labs at Stanford and the high-density oligonucleotide chips from the Affymetrix company; the brief description below focuses on the former.

cDNA microarrays consist of thousands of individual DNA sequences printed in a high-density array on a glass microscope slide using a robotic printer or arrayer. The relative abundance of these spotted DNA sequences in two DNA or RNA samples may be assessed by monitoring the differential hybridization of the two samples to the sequences on the array. For mRNA samples, the two samples or targets are reverse-transcribed into cDNA, labeled using different fluorescent dyes (usually a red-fluorescent dye, Cyanine 5 or Cy5, and a green-fluorescent dye, Cyanine 3 or Cy3), then mixed in equal proportions and hybridized with the arrayed DNA sequences or probes (following the definition of probe and target adopted in "The Chipping Forecast", a January 1999 supplement to Nature Genetics). After this competitive hybridization, the slides are imaged using a scanner and fluorescence measurements are made separately for each dye at each spot on the array. The ratio of the red and green fluorescence intensities for each spot is indicative of the relative abundance of the corresponding DNA probe in the two nucleic acid target samples. See Brown and Botstein (1999) for a more detailed introduction to the biology and technology of cDNA microarrays.

DNA microarray experiments raise numerous statistical questions, in fields as diverse as experimental design, image analysis, multiple testing, cluster analysis, and discriminant analysis. The main


steps in the statistical design and analysis of a microarray experiment are summarized in Figure 1. Each step in this process depends critically on the availability of an efficient, portable, and distributed statistical computing environment.

Figure 1: Main steps in the statistical design and analysis of a DNA microarray experiment (solid boxes).

Sample microarray dataset

We will illustrate some of the main steps in the analysis of a microarray experiment using gene expression data from the apo AI experiment described in detail in Callow et al. (2000). Apolipoprotein AI (apo AI) is a gene known to play a pivotal role in HDL metabolism, and mice with the apo AI gene knocked out have very low HDL cholesterol levels. The goal of this experiment was to identify genes with altered expression in the livers of apo AI knock-out mice compared to inbred control mice. The treatment group consisted of eight mice with the apo AI gene knocked out and the control group consisted of eight control C57Bl/6 mice. For each of these sixteen mice, target cDNA was obtained from mRNA by reverse transcription and labeled using a red-fluorescent dye, Cy5. The reference sample used in all hybridizations was prepared by pooling cDNA from the eight control mice and was labeled with a green-fluorescent dye, Cy3. Target cDNA was hybridized to microarrays containing 6,384 DNA probes, including 257 related to lipid metabolism. Microarrays were printed using 4 × 4 print-tips and are thus partitioned into a 4 × 4 grid matrix (the terms grid, sector, and print-tip group are used interchangeably in the microarray literature). Each grid consists of a 19 × 21 spot matrix that was printed with a single print-tip.
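The layout arithmetic above can be verified directly (a one-line check, shown in Python, with the numbers taken from the text):

```python
grids = 4 * 4             # 4 x 4 grid matrix of print-tip groups
spots_per_grid = 19 * 21  # each grid is a 19 x 21 spot matrix
print(grids * spots_per_grid)  # 6384, the stated number of DNA probes
```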

Raw images and background corrected Cy3 and Cy5 fluorescence intensities for all sixteen hybridizations are available from the authors' web site. Fluorescence intensity data from processed images for six of the sixteen hybridizations (three knock-out mice and three control mice) are also included in the sma package and can be accessed with the command data(MouseArray). Figure 2 displays an RGB overlay of the Cy3 and Cy5 images for one of the sixteen arrays (files 1230ko1G.tif.marray and 1230ko1R.tif.marray, corresponding to knock-out mouse 1, i.e., the mouse4 array in the dataset MouseArray). Analyses described below were performed using sma version 0.5.7 and R version 1.3.1.

Figure 2: RGB overlay of the Cy3 and Cy5 images for knock-out mouse 1, i.e., the mouse4 array in the dataset MouseArray.

Pre-processing

Image analysis

The raw data from a microarray experiment are the image files produced by the scanner; these are typically pairs of 16-bit tagged image file format (TIFF) files, one for each fluorescent dye (the images for the apo AI experiment are about 2 MB in size; more recent images produced from higher resolution scans are around 10 to 20 MB). Image analysis is required to extract foreground and background fluorescence intensity measurements for each spotted DNA sequence. Widely used commercial image processing software packages include GenePix for the Axon scanner and ImaGene from BioDiscovery; freely available software includes ScanAlyze. Image processing is beyond the scope of this article, and the reader is referred to Yang et al. (2002a) for a detailed discussion of microarray image analysis and additional references.

For the apo AI experiment, each of the sixteen hybridizations produced a pair of 16-bit images. These images were processed using the software package Spot (Buckley, 2000). Spot is built on R and is a specialized version of another R package called VOIR, which is being developed by the CSIRO Image Analysis Group. The segmentation component, based on a seeded region growing algorithm, places no restriction on the size or shape of the spots. The background adjustment method relies on morphological


opening to generate an image of the estimated background intensity for the entire slide. The results of an analysis by Spot of microarray images are returned as a data frame and can thus immediately be displayed, manipulated, and analyzed in a number of ways within R. The rows of this data frame correspond to the spots on the array and the columns to different spot statistics: e.g., grid row and column coordinates, spot row and column coordinates, red and green foreground intensities, red and green background intensities for different background adjustment methods, spot area, and perimeter. The six data frames mouse1, ..., mouse6 in MouseArray contain Spot output for three control mice and three knock-out mice, respectively. In what follows, we use R and G to denote the background corrected red and green fluorescence intensities of a particular spot, and M to denote the corresponding base-2 log-ratio, log2(R/G).

Reading in data

In general, microarray image processing results are stored in ASCII files and can be loaded into R using specialized functions like read.spot and read.genepix for Spot and GenePix output, respectively. These functions are simply wrapper functions around read.table, which take into account the specific file format for output from different image processing software packages.

We have written a number of initialization functions to manipulate the image analysis output for batches of arrays, i.e., collections of slides with the same spot layout. The function init.grid is an interactive function for specifying the dimensions of the spot and grid matrices. These parameters are determined by the printing layout of the array, and are used for the spatial representation of spot statistics in the function plot.spatial and the within-print-tip-group normalization procedure implemented in stat.ma. The function init.data is an interactive function which creates a data structure for multi-slide microarray experiments. The data structure is a list of four matrices, storing raw red and green foreground intensities, and red and green background intensities. The rows of the matrices correspond to spotted DNA sequences and the columns to hybridizations (i.e., arrays or slides). The function also allows the user to add hybridization data to an existing list. The following commands may be used to extract red and green fluorescence intensities from the image analysis output for the six arrays in MouseArray, and compute intensity log-ratios and average log-intensities.

data(MouseArray)

mouse.setup <- init.grid()

mouse.data <- init.data()

mouse.MA <- stat.ma(mouse.data, mouse.setup,
                    norm = "n")

Eventually, we would like to retrieve the image analysis results directly from a laboratory database. In addition, to allow more general and systematic representation and manipulation of microarray data, we are working on the definition of new classes of R objects for microarrays using the class/method mechanism in the methods package (cf. section on ongoing projects).

Diagnostic plots

Before proceeding to normalization or any higher-level analysis, it is instructive to look at diagnostic plots of spot statistics, such as red and green foreground and background log-intensities, intensity log-ratio, area, etc.

Spatial plots of spot statistics. The sma function plot.spatial creates an image of shades of gray or colors that represents the values of a statistic for each spot on the array. This function can be used to explore whether there are any spatial effects in the data, for example, print-tip or cover-slip effects. The commands

M <- stat.ma(mouse.data, mouse.setup,
             norm = "n")$M[, 4]

plot.spatial(M, mouse.setup, crit1 = 1)

produce an image of the pre-normalization log-ratios M for the mouse4 array. To display only the spots having the highest and lowest 5% pre-normalization log-ratios M, set the argument crit1 to 0.05. Figure 3 clearly indicates that the lowest row of grids has high red intensities compared to green, and thus suggests the existence of spatially-dependent dye biases.
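Displaying only the extreme fraction of log-ratios amounts to simple quantile masking; a sketch in Python with hypothetical values (the function name and toy data are made up, and this is not the sma implementation):

```python
def extreme_spots(m, frac=0.05):
    """Indices of spots whose log-ratio lies in the top or bottom
    `frac` fraction of all values: the spots a spatial plot would
    highlight when only extremes are requested."""
    srt = sorted(range(len(m)), key=lambda i: m[i])
    k = max(1, int(len(m) * frac))
    return sorted(srt[:k] + srt[-k:])

# Twenty toy log-ratios; two spots are clearly aberrant.
m = [0.1, -2.5, 0.3, 1.8, 0.0, -0.2, 2.1, -1.9, 0.05, 0.4,
     0.2, -0.1, 0.15, -0.3, 0.25, 0.35, -0.05, 0.12, 0.22, -0.15]
print(extreme_spots(m))   # the most extreme low and high spots
```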

Figure 3: Image of the pre-normalization intensity log-ratios for the mouse4 array (plot.spatial output).

Boxplots. Boxplots of spot statistics by plate, sector, or slide can also be useful to identify spot or hybridization artifacts. The boxplots in Figure 4 again suggest the existence of spatially-dependent dye biases.



Figure 4: Boxplots by sector of the pre-normalization intensity log-ratios for the mouse4 array. The pairs (i, j), i, j = 1, ..., 4, refer to sector or grid coordinates in the 4 × 4 grid matrix shown in Figure 2.

Figure 5: Pre-normalization MA-plot for the mouse4 array, highlighting spots with the highest and lowest 0.5% log-ratios (plot.mva output).

MA-plots. Single-slide expression data are typically displayed by plotting the log-intensity log2 R in the red channel vs. the log-intensity log2 G in the green channel. Such plots tend to give an unrealistic sense of concordance between the red and green intensities and can mask interesting features of the data. We thus prefer to plot the intensity log-ratio M = log2(R/G) vs. the mean log-intensity A = (1/2) log2(RG). An MA-plot amounts to a 45° counterclockwise rotation of the (log2 G, log2 R)-coordinate system, followed by scaling of the coordinates. It is thus another representation of the (R, G) data in terms of the log-ratios M, which directly measure differences between the red and green channels and are the quantities of interest to most investigators. We have found MA-plots to be more revealing than their log2 R vs. log2 G counterparts in terms of identifying spot artifacts and for normalization purposes (Dudoit et al., 2002b; Yang et al., 2001, 2002b). For the apo AI experiment,

plot.mva(mouse.data, mouse.setup, norm = "n",
         image.id = 4, crit1 = 0.005, pch = 20)

title("Mouse 4")

produces an MA-plot for the mouse4 array, highlighting spots with the highest and lowest 0.5% log-ratios using the corresponding row number in the data frame (Figure 5).
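The (A, M) coordinates are an invertible change of basis (log2 R = A + M/2, log2 G = A - M/2), so nothing is lost relative to plotting log2 R vs. log2 G. A quick numerical check, written in Python for illustration:

```python
from math import log2

def ma_transform(R, G):
    """Map background-corrected intensities (R, G) to MA coordinates:
    M = log2(R/G), A = (1/2) * log2(R*G)."""
    return 0.5 * log2(R * G), log2(R / G)      # (A, M)

def ma_inverse(A, M):
    """Invert the transform: log2 R = A + M/2, log2 G = A - M/2."""
    return 2 ** (A + M / 2), 2 ** (A - M / 2)  # (R, G)

A, M = ma_transform(1000.0, 250.0)   # M = 2: red is 4x the green
R, G = ma_inverse(A, M)              # recovers (1000, 250) up to rounding
```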

Normalization

The purpose of normalization is to identify and remove sources of systematic variation, other than differential expression, in the measured fluorescence intensities (e.g., different labeling efficiencies and scanning properties of the Cy3 and Cy5 dyes; different scanning parameters, such as PMT settings; print-tip, spatial, or plate effects). It is necessary to normalize the fluorescence intensities before any analysis which involves comparing expression levels within or between slides (e.g., clustering, multiple testing). The need for normalization can be seen most clearly in self-self experiments, in which two identical mRNA samples are labeled with different dyes and hybridized to the same slide (Dudoit et al., 2002b). Although there is no differential expression and one expects the red and green intensities to be equal, the red intensities often tend to be lower than the green intensities. Furthermore, the imbalance in the red and green intensities is usually not constant across the spots within and between arrays, and can vary according to overall spot intensity, location on the array, plate origin, and possibly other variables. Figure 6 displays the pre-normalization MA-plot for the mouse4 array, with the sixteen lowess fits for each of the print-tip-groups (using a smoother span f = 0.3 for the lowess function). The plot was produced using the function plot.print.tip.lowess:

col <- rep(1:4, rep(4, 4))

lty <- rep(1:4, 4)

labs <- paste("(", rep(1:4, rep(4, 4)), ",",
              rep(1:4, 4), ")", sep = "")

plot.print.tip.lowess(mouse.data, mouse.setup,
    norm = "n", image.id = 4, pch = 20, lty = lty,
    col = col, f = 0.3)

legend(12, 2, legend = labs[order(lty, col)],
    col = col[order(lty, col)],
    lty = lty[order(lty, col)], ncol = 4)

title("Mouse 4")

Figure 6 illustrates the non-linear dependence of the log-ratio M on the overall spot intensity A, and thus suggests that an intensity or A-dependent normalization method is preferable to a global one (e.g., median normalization). Also, four lowess curves clearly stand out from the remaining twelve curves, suggesting strong print-tip or spatial effects. The four curves correspond to the last row of grids in the 4 × 4 grid matrix.


Figure 6: Pre-normalization MA-plot for the mouse4 array, with the lowess fits for individual print-tips or sectors. Different colors are used to represent lowess curves for print-tips from different rows, and different line types are used to represent lowess curves for print-tips from different columns (plot.print.tip.lowess output).

We have developed location and scale normalization methods which correct for intensity and spatial dye biases using robust local regression (Yang et al., 2001, 2002b). These procedures are implemented in the sma function stat.ma, which uses the R function lowess to perform robust local regression of the log-ratio M on spot intensity A. Five methods are currently available for normalizing the log-ratios; the method is specified using the norm argument of the function stat.ma. These five options are: "n" for no normalization between the two channels; "m" for global median normalization, which sets the median of the intensity log-ratios M to zero; "l" for global lowess normalization, i.e., regression of M on A for all spots; "p" for within-print-tip-group lowess normalization; and "s" for scaled within-print-tip-group normalization. For the "s" option, scaling is done by the median absolute deviation (MAD) of the lowess location normalized log-ratios for each print-tip-group. The code

mouse.lratio <-
    stat.ma(mouse.data, mouse.setup, norm = "p")

performs within-print-tip-group lowess location normalization for the six slides in MouseArray, and returns normalized log-ratios M and average log-intensities A. For the apo AI experiment, only a small proportion of the spots were expected to vary in intensity between the two channels; normalization was thus performed using all 6,384 probes. The function returns a list with two components: M, a matrix of normalized log-ratios, and A, a matrix of average log-intensities. The rows of these matrices correspond to spotted DNA sequences and the columns to the different slides.
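Two of the normalization options have simple closed forms that are easy to demonstrate: a Python sketch of global median location normalization (the idea behind the median option) and MAD scaling (the idea behind the scaled option). This is illustrative toy code, not the sma implementation:

```python
def median(xs):
    """Sample median (kept dependency-free for the sketch)."""
    s = sorted(xs)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

def median_normalize(m):
    """Subtract the median log-ratio so normalized M has median zero."""
    c = median(m)
    return [x - c for x in m]

def mad_scale(m):
    """Center, then divide by the median absolute deviation (MAD)
    so different groups share a common scale."""
    c = median(m)
    mad = median([abs(x - c) for x in m])
    return [(x - c) / mad for x in m]

# Toy log-ratios for one print-tip group (made-up numbers).
m = [0.9, 1.1, 1.0, 1.4, 0.6, 1.2]
print(median(median_normalize(m)))   # 0.0 after centering
```

In sma the location step is a lowess fit in A rather than a constant shift, but the centering-and-scaling structure is the same.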

Image analysis and normalization are the two main pre-processing steps required in any microarray experiment. For our purpose, normalized microarray data consist primarily of pairs (M, A) of intensity log-ratios and average log-intensities for each spot in each of several slides. In addition to these basic statistics, we are planning on deriving spot and slide quality statistics and incorporating these in further analyses (cf. section on ongoing projects).

Main statistical analysis

The phrase "main statistical analysis" refers to the application of statistical methodology to normalized intensity data in order to answer the biological question for which the microarray experiment was designed. Appropriate statistical methods depend largely on the question of interest to the investigator and in general can span the entire field of statistics. For model organisms such as mice or yeast, biological questions of interest might include the identification of differentially expressed genes in factorial experiments studying the simultaneous gene expression response to factors such as drug treatment, cell type, spatial location of tissues, and time. In human cancer microarray studies, one may be interested in the discovery of new tumor subclasses, or in the identification of genes that are good predictors of clinical outcomes such as survival or response to treatment. There is clearly no general "next step" in the analysis of microarray data, and the implementation of statistical methodology will require the development of question specific R packages (cf. section on ongoing projects). Next, we discuss differential expression, an important and common question in microarray experiments.

Differential expression

Differentially expressed genes are genes whose expression levels are associated with a response or covariate of interest. In general, the covariates could be either polytomous (e.g., treatment/control status, cell type, drug type) or continuous (e.g., dose of a drug, time). Similarly, the responses could be either polytomous or continuous, for example, tumor type, response to treatment, or censored survival time in clinical applications of microarrays. An approach to the identification of differentially expressed genes is to compute, for each gene, a statistic assessing the strength of the association between the expression levels and responses or covariates. In such one-gene-at-a-time approaches, genes are ranked based on these statistics and a subset is typically selected for further biological verification. This gene subset may be chosen based on either the number of genes that can be conveniently followed up (and subject matter knowledge), or statistical significance as described in the next section. The package sma contains a number of functions for computing and plotting frequentist and Bayesian statistics for each gene in a microarray experiment.
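The one-gene-at-a-time strategy is easy to make concrete; a self-contained Python sketch that ranks genes by a Welch two-sample t-statistic (toy data, not the sma or multtest code):

```python
from math import sqrt

def t2_stat(x, y):
    """Welch two-sample t-statistic for one gene."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    return (mx - my) / sqrt(vx / nx + vy / ny)

def rank_genes(M, ctrl, trt):
    """Rank genes (rows of the log-ratio matrix M) by |t|, largest
    first: the one-gene-at-a-time ranking described in the text."""
    stats = [t2_stat([row[j] for j in trt], [row[j] for j in ctrl])
             for row in M]
    return sorted(range(len(M)), key=lambda g: -abs(stats[g]))

# Toy matrix: 3 genes x 6 slides, columns 0-2 control, 3-5 knock-out.
M = [[0.1, -0.1, 0.0,  2.1,  1.9,  2.0],   # strongly up in knock-outs
     [0.0,  0.2, 0.1,  0.1, -0.1,  0.0],   # unchanged
     [0.2,  0.0, 0.1, -0.9, -1.1, -1.0]]   # down in knock-outs
print(rank_genes(M, ctrl=[0, 1, 2], trt=[3, 4, 5]))   # [0, 2, 1]
```

The top of such a ranking is then the candidate list handed over for biological verification.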

Single-slide methods. Single-slide experiments aim to compare transcript abundance in two mRNA samples, the red and green labeled mRNA samples hybridized to the same slide. The sma functions stat.Chen, stat.Newton, and stat.ChurSap implement, respectively, the single-slide methods of Chen et al. (1997), Newton et al. (2001), and Sapir and Churchill (2000). These methods are based on assumed parametric models for the (R, G) intensity pairs (e.g., Gamma or Gaussian models). The resulting model-dependent rules amount to drawing two curves in the (R, G)-plane and calling a gene differentially expressed if its (R, G) measured intensities fall outside the region between the two curves. The above three functions can be applied individually, with options for producing MA-plots showing the model cut-offs. Alternatively, the three methods can be compared graphically as in Figure 7, by calling plot.single.slide(mouse.data, mouse.setup, norm = "p", image.id = 4). This produces an MA-plot, print-tip-group lowess location normalized, for the mouse4 array, with cut-offs from each of the three single-slide methods. Black curves correspond to 1:1, 1:10, and 1:100 log-odds ratios for the Newton et al. method; dashed lines are the 95% and 99% confidence limits from the Chen et al. method; dotted lines are the 95% and 99% posterior probability cut-offs from the Sapir and Churchill method.

Figure 7: MA-plot with cut-offs for three single-slide methods applied to the mouse4 array (plot.single.slide output).

Multiple-slide methods. Multiple-slide experiments typically involve the comparison of transcript abundance in two or more types of mRNA samples hybridized to different slides. sma functions for multiple-slide analyses include stat.bayesian, which computes odds of differential expression under a Bayesian model for gene expression, and stat.t2, which computes two-sample t-statistics for comparing the expression response in two groups of mRNA samples. The multiple testing package multtest contains additional functions for computing test statistics (paired t-statistics, F-statistics, etc.) and associated permutation adjusted p-values for each gene on the array (see next section). For the apo AI data, the following commands implement two possible approaches for comparing expression levels in the livers of knock-out and control mice.

## Two-sample t-statistics, comparing the M values
## of the knock-out mice with those of the
## control mice
## (data frames mouse1, ..., mouse6)

cl <- c(rep(0, 3), rep(1, 3))

mouse.t2 <- stat.t2(mouse.lratio, cl)

plot.t2(mouse.t2, "KO vs control t-statistics")

## Second approach:
## Bayesian odds of differential expression for
## the knock-out mice relative to the pooled
## control (data frames mouse4, mouse5, mouse6)

mouse.bayesian <-
    stat.bayesian(M = mouse.lratio$M[, 4:6])

plot(mouse.bayesian$Xprep$Mbar,
     mouse.bayesian$lods)

Multiple testing

The biological question of differential expression can be restated as a problem in multiple hypothesis testing: the simultaneous test, for each gene, of the null hypothesis of no association between the expression levels and the responses or covariates. As a typical microarray experiment measures expression levels for thousands of genes simultaneously, we are faced with an extreme form of multiple testing when assessing the statistical significance of the results.

The multtest package implements a number of resampling-based multiple testing procedures. It includes procedures for controlling the family-wise Type I error rate (FWER): Bonferroni, Hochberg, Holm, Šidák, and the Westfall and Young minP and maxT procedures. The Westfall and Young procedures take into account the joint distribution of the test statistics and are thus in general less conservative than the other four procedures. The multtest package also includes procedures for controlling the false discovery rate (FDR): the Benjamini and Hochberg and Benjamini and Yekutieli step-up procedures (see Dudoit et al. (2002a) for a review of multiple testing procedures and complete references). These procedures are implemented for tests based on t-statistics, F-statistics, paired t-statistics, block F-statistics, and Wilcoxon statistics. The results of the procedures are summarized using adjusted p-values, which reflect for each gene the overall experiment Type I error rate when genes with a smaller p-value are declared differentially expressed. The p-values may be obtained either from the nominal distribution of the test statistics or by permutation.
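The simplest of these adjustments have closed forms; a Python sketch of Bonferroni and Holm adjusted p-values (the resampling-based minP and maxT adjustments additionally need the joint permutation distribution of the statistics, which a few lines cannot capture):

```python
def bonferroni_adjust(pvals):
    """Bonferroni adjusted p-values: m * p, capped at 1."""
    m = len(pvals)
    return [min(1.0, m * p) for p in pvals]

def holm_adjust(pvals):
    """Holm step-down adjusted p-values: multiply the k-th smallest
    p-value by (m - k), k = 0, ..., m - 1, take successive maxima to
    enforce monotonicity, cap at 1, and report in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adj, running = [0.0] * m, 0.0
    for k, i in enumerate(order):
        running = max(running, (m - k) * pvals[i])
        adj[i] = min(1.0, running)
    return adj

p = [0.002, 0.6, 0.01, 0.04]
# Holm is never larger than Bonferroni, hence less conservative.
print(bonferroni_adjust(p))
print(holm_adjust(p))
```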

For the apo AI experiment, differentially expressed genes between the eight knock-out and eight


control mice were identified by computing two-sample Welch t-statistics for each gene. In order to assess the statistical significance of the results, four multiple testing procedures (Holm, Westfall and Young maxT, Benjamini and Hochberg FDR, Benjamini and Yekutieli FDR) were considered, and unadjusted and adjusted p-values were estimated based on all (16 choose 8) = 12,870 possible permutations of the knock-out/control labels. Figure 8 displays plots of the ordered adjusted p-values for these four multiple testing procedures. The functions mt.maxT, mt.rawp2adjp, and mt.plot from the multtest package were used to compute and display adjusted p-values.
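The permutation count quoted above, and the way a permutation p-value is formed for one gene, can be sketched as follows (in Python; two groups of three rather than eight, and a mean difference in place of the Welch t-statistic, to keep the enumeration small):

```python
from itertools import combinations
from math import comb   # Python 3.8+

# Number of ways to relabel 16 mice as 8 knock-out / 8 control:
assert comb(16, 8) == 12870

def perm_pvalue(x, y, stat):
    """Exact permutation p-value for one gene: the fraction of
    relabelings whose |statistic| is at least the observed value."""
    pooled = x + y
    n, k = len(pooled), len(x)
    obs = abs(stat(x, y))
    hits = total = 0
    for idx in combinations(range(n), k):
        g1 = [pooled[i] for i in idx]
        g2 = [pooled[i] for i in range(n) if i not in idx]
        hits += abs(stat(g1, g2)) >= obs
        total += 1
    return hits / total

mean_diff = lambda a, b: sum(a) / len(a) - sum(b) / len(b)
p = perm_pvalue([2.1, 1.9, 2.0], [0.1, -0.1, 0.0], mean_diff)
print(p)   # 2 of the 20 relabelings are as extreme: p = 0.1
```

With eight mice per group the same enumeration runs over all 12,870 relabelings, which is what makes the permutation null distribution exact rather than approximate here.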

[Figure 8 shows curves of sorted adjusted p-values (vertical axis, 0.0 to 1.0) against the number of rejected hypotheses (horizontal axis, 0 to 50) for the raw p-values and the Holm, maxT, BH, and BY procedures.]

Figure 8: Plot of sorted adjusted p-values for the apo AI experiment (multtest package, mt.plot output).

Biological verification

In the apo AI experiment, eight spotted DNA sequences clearly stood out from the remaining sequences and had maxT adjusted p-values less than 0.01. These eight sequences correspond to only four distinct genes: apo AI (3 copies), apo CIII (2 copies), sterol C5 desaturase (2 copies), and a novel EST (1 copy). All changes were confirmed by real-time quantitative PCR (RT-PCR) as described in Callow et al. (2000). The presence of apo AI among the differentially expressed genes is to be expected, as this is the gene that was knocked out in the treatment mice. The apo CIII gene, also associated with lipoprotein metabolism, is located very close to the apo AI locus; Callow et al. (2000) showed that the down-regulation of apo CIII was actually due to genetic polymorphism rather than lack of apo AI. The presence of apo AI and apo CIII among the differentially expressed genes thus provides a check of the statistical method, if not a biologically interesting finding. Sterol C5 desaturase is an enzyme which catalyzes one of the terminal steps in cholesterol synthesis, and the novel EST shares sequence similarity to a family of ATPases. Note that the application of single-slide methods to array data from individual knock-out mice generally produced a large number of false positives, while at the same time missing some of the confirmed genes.

Ongoing projects

The sma package represents a preliminary and limited effort toward the goal of providing a statistical computing environment for the analysis of microarray data. Modifications and extensions to sma will be done within the Bioconductor project (http://www.bioconductor.org). The Bioconductor project aims more generally to produce an open source and open development computing environment for biologists and statisticians working on the analysis of genomic data. Important changes include the use of the class/method mechanism in the R methods package to allow more general and systematic representation and manipulation of microarray data (Bioconductor Biobase package). In addition, we are in the process of splitting and expanding the sma package into the following task-specific packages.

Experimental design (MarrayDesign)

This package will include functions for: setting up a design matrix for a given experimental design (this design matrix can be used to efficiently combine data across slides with functions similar to lm or glm); searching for optimal designs; and creating graphical representations of the design choices.

Spot quality (MarrayQual)

Building upon our previous work on image analysis and normalization, we plan to devise spot and slide quality statistics and incorporate these statistics in further analyses of the fluorescence intensity data, by weighting the contribution of each spot based on its quality.

Normalization (MarrayNorm)

This package contains functions for diagnostic plots and normalization procedures based on robust local regression. Normalization procedures were extended to accommodate a broader class of dye biases and the use of control sequences spotted on the array and possibly spiked into the target cDNA samples.

Cluster analysis (MarrayClust)

R already has an extensive collection of clustering procedures in packages such as cluster, mva, and GeneSOM. Building upon these existing packages, we are implementing new methods which are useful for the clustering of microarray data, genes or mRNA samples. These include resampling-based methods for


estimating the number of clusters, for improving the accuracy of a clustering procedure, and for assessing the accuracy of cluster assignments for individual observations (Fridlyand and Dudoit, 2001).

Class prediction (MarrayPred)

This package builds upon existing R packages such as class and rpart and contains functions for bagging and boosting classifiers. The functions also return estimates of misclassification rates and measures assessing the confidence of predictions for individual observations. Resampling-based procedures for variable selection, as described in Breiman (1999), will also be implemented in the package.

Affymetrix chips (affy)

Low-level analysis of Affymetrix data, such as normalization and calculation of expression estimates, is being handled by the affy package.

The packages MarrayNorm, affy, and MarrayPred are closest to being released. Required developments to increase the effectiveness of these R packages within a biological context include the following.

Links to biological databases: An important aspect of the analysis of microarray experiments is the seamless access to biological information resources such as the National Center for Biotechnology Information (NCBI) Entrez system (http://www.ncbi.nlm.nih.gov/Entrez/) or the Gene Ontology (GO) Consortium (http://www.geneontology.org). The Bioconductor annotate package developed by Robert Gentleman already provides browser access to LocusLink and GenBank.

GUI: Not all biologist users will feel comfortable with the command-line R environment. To accommodate a broad class of users and allow easy access to the statistical methodology, the computing environment should provide a graphical user interface.

Documentation: We envision two main classes of users for these packages: biologists performing microarray experiments, and researchers involved in the development of statistical methodology for microarray experiments. The former will primarily be users of a standard set of functions from these packages. They will need guidance in deciding which statistical approaches and packages may be appropriate for their experiments, in choosing among the various options provided by the functions, and in correctly interpreting the results of their visualizations and statistical analyses. Extended tutorials, with step-by-step analyses of microarray experiments, will be provided. The Sweave package, to be included in R 1.5, will be used for automatic report generation mixing text and R code. Researchers in the second group will likely be interested in writing their own functions and packages, in addition to using existing functions. The Bioconductor project will provide a framework for integrating their efforts.
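A Sweave source file mixes LaTeX text with noweb-style R code chunks; a minimal hypothetical '.Rnw' fragment (chunk name and contents invented for illustration):

```
Analysis of the normalized log-ratios follows.

<<summary, echo=TRUE>>=
M <- rnorm(100)   # placeholder for normalized log-ratios
summary(M)
@
```

Running Sweave on the file replaces each chunk with its echoed code and output, producing a LaTeX report.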

Thanks to Natalie Thorne, Ingrid Lönnstedt, and Jessica Mar for their contributions to the sma package.

Bibliography

L. Breiman. Random forests - random features. Technical Report 567, Department of Statistics, U.C. Berkeley, 1999.

P. O. Brown and D. Botstein. Exploring the new world of the genome with DNA microarrays. In The Chipping Forecast, volume 21, pages 33–37. Supplement to Nature Genetics, January 1999.

M. J. Buckley. The Spot user's guide. CSIRO Mathematical and Information Sciences, August 2000.

M. J. Callow, S. Dudoit, E. L. Gong, T. P. Speed, and E. M. Rubin. Microarray expression profiling identifies genes with altered expression in HDL deficient mice. Genome Research, 10(12):2022–2029, 2000.

Y. Chen, E. R. Dougherty, and M. L. Bittner. Ratio-based decisions and the quantitative analysis of cDNA microarray images. Journal of Biomedical Optics, 2:364–374, 1997.

S. Dudoit, J. P. Shaffer, and J. C. Boldrick. Multiple hypothesis testing in microarray experiments. Submitted, 2002a.

S. Dudoit, Y. H. Yang, M. J. Callow, and T. P. Speed. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica, 12(1), 2002b.

J. Fridlyand and S. Dudoit. Applications of resampling methods to estimate the number of clusters and to improve the accuracy of a clustering method. Technical Report 600, Department of Statistics, U.C. Berkeley, September 2001.

M. A. Newton, C. M. Kendziorski, C. S. Richmond, F. R. Blattner, and K. W. Tsui. On differential variability of expression ratios: Improving statistical inference about gene expression changes from microarray data. Journal of Computational Biology, 8:37–52, 2001.

M. Sapir and G. A. Churchill. Estimating the posterior probability of differential gene expression from microarray data. Poster, The Jackson Laboratory, 2000.

Y. H. Yang, M. J. Buckley, S. Dudoit, and T. P. Speed. Comparison of methods for image analysis on cDNA microarray data. Journal of Computational and Graphical Statistics, 11(1), 2002a.

Y. H. Yang, S. Dudoit, P. Luu, D. M. Lin, V. Peng, J. Ngai, and T. P. Speed. Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research, 30(4), 2002b.

Y. H. Yang, S. Dudoit, P. Luu, and T. P. Speed. Normalization for cDNA microarray data. In Michael L. Bittner, Yidong Chen, Andreas N. Dorsel, and Edward R. Dougherty, editors, Microarrays: Optical Technologies and Informatics, volume 4266 of Proceedings of SPIE, May 2001.

Sandrine Dudoit
University of California, Berkeley
http://www.stat.berkeley.edu/~sandrine
sandrine@stat.berkeley.edu

Yee Hwa (Jean) Yang
University of California, Berkeley
yeehwa@stat.berkeley.edu

Ben Bolstad
University of California, Berkeley
bolstad@stat.berkeley.edu

Changes in R 1.4.0

by the R Core Team

The 'NEWS' file of the R sources, and hence this article, have been reorganized into several sections (user-visible changes, new features, deprecated & defunct, documentation changes, utilities, C-level facilities, bug fixes) in order to provide a better overview of all the changes in R 1.4.0.

User-visible changes

This is a new section to highlight changes in behaviour, which may be given in more detail in the following sections. Many bug fixes are also user-visible changes.

• The default save format has been changed, so saved workspaces and objects cannot (by default) be read in earlier versions of R.

• The number of bins selected by default in a histogram uses the correct version of Sturges' formula and will usually be one larger.
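Sturges' formula chooses about ceiling(log2(n)) + 1 bins for n observations; in current versions of R the helper nclass.Sturges() implements it:

```r
n <- 1000
ceiling(log2(n)) + 1       # 11 bins under the corrected formula
nclass.Sturges(rnorm(n))   # the rule hist() applies by default
```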

• data.frame() no longer converts logical arguments to factors (following S4 rather than S3).

• read.table() has new arguments 'nrows' and 'colClasses'. If the latter is NA (the default), conversion is attempted to logical, integer, numeric or complex, not just to numeric.

• model.matrix() treats logical variables as factors with levels c("FALSE", "TRUE") (rather than as 0-1 valued numerical variables). This makes R compatible with all S versions.

• Transparency is now supported on most graphics devices. This means that using par("bg"), for example in legend(), will by default give a transparent rather than opaque background.

• [dpqr]gamma now has third argument 'rate' for S-compatibility (and for compatibility with exponentials). Calls which use positional matching may need to be altered.
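Since rate = 1/scale, the change matters only for positional calls; a small sketch:

```r
# These two calls describe the same Gamma(shape = 3, scale = 2) density
dgamma(2, shape = 3, rate = 0.5)
dgamma(2, shape = 3, scale = 2)
# A positional third argument is now the rate:
# dgamma(2, 3, 0.5) means rate = 0.5, i.e. scale = 2
```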

• The meaning of 'spar = 0' in smooth.spline() has changed.

• substring() and substring<-() do nothing silently on a character vector of length 0, rather than generating an error. This is consistent with other functions and with S.

• For compatibility with S4, any arithmetic operation using a zero-length vector has a zero-length result. (This was already true for logical operations, which were compatible with S4 rather than S3.)
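The recycling-to-nothing behaviour is easy to check interactively:

```r
1 + numeric(0)           # numeric(0): a zero-length result, not an error
length(numeric(0) * 2)   # 0
logical(0) & TRUE        # logical(0), as before
```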

• codoc() and undoc() have been moved to the new package tools.

• The name of the site profile now defaults to 'R_HOME/etc/Rprofile'.

• The startup process for setting environment variables now first searches for a site environment file (given by the environment variable R_ENVIRON if set, or 'R_HOME/etc/Renviron.site' if not), and then for a user '.Renviron' file in the current or the user's home directory.

� Former � �GB ��*� . &� must now be� �GB ��*������� . &�.

• The default methods for La.svd() and La.eigen() have changed, and so there may be sign changes in singular/eigen vectors, including in cancor(), cmdscale(), factanal(), princomp(), and varimax().

New features

• Transparency is now supported on most graphics devices. Internally, colors include an alpha channel for opacity, but at present there is only visible support for transparent/opaque. The new color '"transparent"' (or 'NA' or '"NA"') is transparent, and is the default background color for most devices. Those devices (postscript, XFig, PDF, Windows metafile and printer) that previously treated 'bg = "white"' as transparent now have '"transparent"' as the default and will actually print '"white"'. (NB: you may have 'bg = "white"' saved in .PostScript.Options in your workspace.)

• A package methods has been added, containing formal classes and methods ("S4" methods), implementing the description in the book "Programming with Data". See ?Methods and the references there for more information.

– In support of this, the '@' operator has been added to the grammar.

– Method dispatch for formal methods (the standardGeneric() function) is now a primitive. Aside from efficiency issues, this allows S3-style generics to also have formal methods (not really recommended in the long run, but it should at least work). The C-level dispatch is now implemented for primitives that use either DispatchGroup or DispatchOrEval internally.

– A version of the function plot() in the methods package has arguments x and y, to allow methods for either or both. See ?getMethods for examples of such methods.

– The methods package now uses C-level code (from within DispatchOrEval) to dispatch any methods defined for primitive functions. As with S3-style methods, methods can only be defined if the first argument satisfies isObject() (not strictly required for formal methods, but imposed for now for simplicity and efficiency).

• Changes to the tcltk package:

– New interface for accessing Tcl variables, effectively making the R representations lexically scoped. The old form is being deprecated.

– Callbacks can now be expressions, with slightly unorthodox semantics. In particular, this allows bindings to contain 'break' expressions (this is necessary to bind code to e.g. Alt-x without having the key combination also insert an 'x' in a text widget).

– A bunch of file handling and dialog functions (previously only available via tkcmd()) have been added.

• The '?' operator is now an actual function. It can be used (as always) as a unary operator ('?plot') and the grammar now allows it as a binary operator, planned to allow differentiating documentation on the same name but different type ('class?matrix', for example). So far, no such documentation exists.

• New methods AIC.default() and logLik.glm(), also fixing the AIC method for glm objects.

• axis.POSIXct() allows the label date/times to be specified via the new 'at' argument.

• arrows() now allows 'length = 0' (and draws no arrowheads).

� Modifications to the access functions for moreconsistency with S: arguments ‘name’, ‘pos’and ‘where’ are more flexible in �����,� ����, �����, *��, �;������, ���&���and ���.

� Three new primitive functions have beenadded to base: ��#*��, �;�3���#*��,and ���&� ��������. The first twoare support routines for �*�� and�*J��� in package methods. The third re-places ��������&�� in the functions �����,� ����, and friends.

• barplot() now respects an inline 'cex.axis' argument and has a separate 'cex.names' argument so names and the numeric axis labels can be scaled separately. Also, graphics parameters intended for axis() such as 'las' can now be used.

• Shading by lines added to functions barplot(), hist(), legend(), pie(), polygon() and rect().


• bxp() has a 'show.names' argument allowing labels on a single boxplot; it, and hence boxplot(), now makes use of 'pch', 'cex', and 'bg' for outlier points().

bxp() and boxplot() also have an argument 'outline' to suppress outlier drawing, S-PLUS compatibly.

• New capabilities() options '"GNOME"' and '"IEEE754"'.

� New function ����*���, a wrapper for��*��� /������ provided for compatibilitywith S-Plus.

� ������ �� is now generic.

• cor.test() in package ctest now also gives an asymptotic confidence interval for the Pearson product moment correlation coefficient.

• data(), demo() and library() now also return the information about available data sets, demos or packages. Similarly, help.search() returns its results.

• density() allows 'bw' or 'width' to specify a rule to choose the bandwidth, and rules '"nrd0"' (the previous default), '"nrd"', '"ucv"', '"bcv"', '"SJ-ste"' and '"SJ-dpi"' are supplied (based on functions in package MASS).
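The rule names are passed as strings in place of a numeric bandwidth; a small sketch using two of them:

```r
x <- rnorm(200)
d.default <- density(x)                 # default rule "nrd0"
d.sj      <- density(x, bw = "SJ-ste")  # Sheather-Jones, solve-the-equation
c(d.default$bw, d.sj$bw)                # the bandwidths actually chosen
```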

� ��� ����*�� now has a default method,used for classes ‘"lm"’ and ‘"glm"’.

• New argument 'cacheOK' to download.file() to request cache flushing.

All methods for download.file() do tilde-expansion on the path name.

The internal download.file() etc. now allow URLs of the form 'ftp://user@site/' and 'ftp://user:pass@site/'.

• duplicated() and unique() are now generic functions with methods for data frames (as well as atomic vectors).

• factanal() and princomp() use napredict() on their scores, so 'na.action = na.exclude' is supported.

• Function getNativeSymbolInfo() returns details about a native routine, potentially including its address, the library in which it is located, the interface by which it can be called, and the number of parameters.

• Functions such as help() which perform library or package index searches now use NULL as the default for their 'lib.loc' argument so that missingness can be propagated more easily. The default corresponds to all currently known libraries, as before.

� Added function ��*�� ������.

• hist.default() allows 'breaks' to specify a rule to choose the number of classes, and rules '"Sturges"' (the previous default), '"Scott"' and '"FD"' (Freedman-Diaconis) are supplied (based on package MASS).
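As with density(), the rule name is passed as a string; for example:

```r
x <- rnorm(500)
h1 <- hist(x, breaks = "Sturges", plot = FALSE)
h2 <- hist(x, breaks = "FD", plot = FALSE)   # Freedman-Diaconis
c(length(h1$breaks), length(h2$breaks))      # FD typically picks more bins
```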

• Function identical(), a fast and reliable way to test for exact equality of two objects.
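Unlike '==', identical() compares whole objects, including type and attributes:

```r
identical(1L, 1)               # FALSE: integer versus double
1L == 1                        # TRUE: '==' coerces before comparing
identical(c(a = 1), c(b = 1))  # FALSE: attributes such as names count too
```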

• New generic function is.na<-(), from S4. This is by default equivalent to 'x[value] <- NA' but may differ, e.g., for factors where '"NA"' is a level.
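The replacement form reads naturally as "set these positions to missing":

```r
x <- c(10, 20, 30, 40)
is.na(x) <- c(2, 4)   # mark positions 2 and 4 as missing
x                     # 10 NA 30 NA
```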

� ��xxx reached through ��K� are nowgeneric.

• La.eigen() and La.svd() have new default methods using later (and often much faster) LAPACK routines. The difference is most noticeable on systems with optimized BLAS libraries.

� *������� is now generic.

• New function .libPaths() for getting or setting the paths to the library trees R knows about. This is still stored in '.lib.loc', which however should no longer be accessed directly.

• Using lm/glm/... with 'data' a matrix rather than a data frame now gives a specific error message.

• loess(), lqs(), nls() and ppr() use the standard NA-handling and so support 'na.action = na.exclude'.

� ��*��;��� now has a ‘tol’ argument to bepassed to �*&���.

• mean() has a 'data.frame' method applying mean column-by-column. When applied to non-numeric data, mean() now returns 'NA' rather than a confusing error message (for compatibility with S4). Logicals are still coerced to numeric.
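Note that the data-frame method for mean() was removed in much later versions of R, so the portable idioms for column means are colMeans() or sapply():

```r
df <- data.frame(a = 1:4, b = c(2, 4, 6, 8))
colMeans(df)                      # a = 2.5, b = 5
sapply(df, mean)                  # same result, one column at a time
mean(c(TRUE, FALSE, TRUE, TRUE))  # logicals coerce to numeric: 0.75
```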

• The formula interface to mosaicplot() now allows a contingency table as data argument.

• new.env() is now internal and allows you to set hashing. Also, parent.env() and parent.env<-() are included to provide direct access to setting and retrieving environments.

• Function nsl() to look up IP addresses of hosts: intended as a way to test for internet connectivity.


� "���, �;�����, ������ and �������� meth-ods for time series objects moved from packagets to package base.

• New option 'download.file.method' can be used to set the default method for download.file() and functions which use it such as update.packages().

• order() and sort.list() now implement 'na.last = FALSE' and 'na.last = NA'.

• Started work on a new package management system: packageStatus() and friends.

� ����� has a new ‘method’ argument allowing‘method = print’.

• png(), jpeg() and bitmap() devices now have a 'bg' argument to set the background color: useful to set '"transparent"' on png().

• Changes to the postscript() device:

– The symbol font can now be set on a postscript() device, and support has been added for using Computer Modern Type 1 fonts (including for symbols). (Contributed by Brian D'Urso.)

– There is now support for URW font families: this will give access to more characters and more appropriate metrics on PostScript devices using URW fonts (such as ghostscript).

– %%IncludeResource comments have been added to the output. (Contributed by Brian D'Urso.)

� � �������� �� now predicts on ‘newdata’containing NAs.

• princomp() now has a formula interface.

• readChar() now returns what is available if fewer characters than requested are on the file.

• readline() allows up to 256 chars for the prompt.

• read.table(), scan() and count.fields() have a new argument 'comment.char', default '#', that can be used to start comments on a line.

• New function reg.finalizer() to provide an R interface to finalization.

• reshape() extends reshapeLong() and reshapeWide(), which are deprecated.

• rle() now returns a classed object, has a print method and an inverse.

• Changes to save() and friends:

– save() now takes an 'envir' argument for specifying where items to be saved are to be found.

– A new default format for saved workspaces has been introduced. This format provides support for some new internal data types, produces smaller save files when saving code, and provides a basis for a more flexible serialization mechanism.

– Modified 'save' internals to improve performance when saving large collections of code.

– save() and save.image() now take a 'version' argument to specify the workspace file-format version to use. The version used from R 0.99.0 to 1.3.1 is version 1. The new default format is version 2. load() can read a version 2 saved workspace if it is compressed.

– save() and save.image() now take a 'compress' argument to specify that the saved image should be written using the zlib compression facilities.

– save.image() now takes an argument 'ascii'.

– save.image() now takes an argument 'safe'. If 'TRUE', the default, a temporary file is used for creating the saved workspace. The temporary file is renamed if the save succeeds. This preserves an existing workspace if the save fails, but at the cost of using extra disk space during the save.

– save.image() default arguments can be specified in the 'save.image.defaults' option. These specifications are used when save.image() is called from q() or GUI analogs.

• scan() allows unlimited (by R) lengths of input lines, instead of a limit of 8190 chars.

• smooth.spline() has a new 'control.spar' argument and returns 'lambda' besides 'spar'. 'spar = 0' is now valid and allows one to go more closely towards interpolation (lambda near 0) than before. This also fixes smooth.spline() behaviour when 'df' is close to n - 2. Better error messages in several situations.

Note that spar = 0 is no longer the default and no longer entails cross-validation.

• stars() has been enhanced; a new 'mar' argument uses smaller margins by default; further arguments 'nrow' and 'ncol' as in S-PLUS, 'frame.plot', 'flip.labels', 'lty' and explicit 'main', 'sub', 'xlab' and 'ylab'. Note that 'colors' has been replaced by 'col.segments' and there is a new 'col.stars'.

� �� now returns the locations invisibly.

• step() is now closer to stepAIC() and so handles a wider range of objects (but stepAIC [in MASS] is still more powerful).

• symbols() now has automatic 'xlab' and 'ylab' and a 'main' argument, which eliminates an incorrect warning. It better checks wrongly scaled arguments.

• Sys.setlocale() now issues a warning if it fails.

• An enhanced function type.convert() is now a documented function, rather than just internal to read.table().

• warning() allows multiple arguments, following S4's style.

• New function with() for evaluating expressions in environments constructed from data.

• Unix X11() devices can now have a canvas color set, which can help to distinguish plotting '"white"' from plotting '"transparent"'.

• On Unix, X11(), png() and jpeg() now give informative warnings if they fail to open the device.

• The startup processing now interprets escapes in the values of environment variables set in 'R_HOME/etc/Renviron' in a similar way to most shells.

• The operator '=' is now allowed as an assignment operator in the grammar, for consistency with other languages, including recent versions of S-PLUS. Assignments with '=' are basically allowed only at top-level and in braced or parenthesized expressions, to make famous errors such as 'if(x = 0) ...' illegal in the grammar. (There is a plan to gradually eliminate the underscore as an assignment operator in future versions of R.)

• Finalizers can be registered to be run on system exit for both reachable and unreachable objects.

• Integer addition, subtraction, and multiplication now return NAs on overflow and issue a warning.

• Printing factors with both level '"NA"' and missing values uses '<NA>' for the missing values to distinguish them.

• Added an experimental interface for locking environments and individual bindings. Also added support for "active bindings" that link a variable to a function (useful for example for linking an R variable to an internal C global).

• The GNOME interface now has separate colours for input and output text (like the Windows GUI). These can be modified via the properties dialogue.

• Output from the GNOME console is block-buffered for increased speed.

• The GNOME console inherits standard emacs-style keyboard shortcuts from the GtkText widget for cursor motion, editing and selection. These have been modified to allow for the prompt at the beginning of the command line.

• One can register R functions and C routines to be called at the end of the successful evaluation of each top-level expression, for example to perform auto-saves, update displays, etc. See addTaskCallback() and taskCallbackManager().

Deprecated & defunct

• .Alias has been removed from all R sources and deprecated.

• reshapeLong() and reshapeWide() are deprecated in favour of reshape().

• Previously deprecated functions, including read.table.url(), scan.url(), and source.url(), are defunct. Method '"socket"' for download.file() no longer exists.

Documentation changes

• Writing R Extensions has a new chapter on generic/method functions.

Utilities

• New package tools for package development and administration tools, containing the QA tools checkFF(), codoc() and undoc() previously in package base, as well as the following new ones:

– checkAssignFuns() for checking whether the final argument of assignment functions in a package is named 'value'.


– checkDocArgs() for checking whether all arguments shown in \usage of Rd files are documented in the corresponding \arguments.

– checkMethods() for checking whether all methods defined in a package have all arguments of their generic.

– checkTnF() for finding expressions containing the symbols 'T' and 'F'.

• R CMD Rd2dvi has more convenient defaults for its output file.

• R CMD check now also fully checks the 'Depends' field in the package 'DESCRIPTION' file. It also tests for syntax errors in the R code, whether all methods in the code have all arguments of the corresponding generic, for arguments shown in \usage but not documented in \arguments, and whether assignment functions have their final argument named 'value'.

C-level facilities

• arraySubscript and vectorSubscript are now available to package users. All "array-like" packages can use a standard method for calculating subscripts.

• The C routine type2symbol, similar to type2str, returns a symbol corresponding to the type supplied as an argument.

• The macro SHLIB_EXT now includes the '.', e.g. '".so"' or '".dll"', since the Mac uses "Lib" without a '.'.

• New FORTRAN entry points rwarn and rexit for warnings and error exits from compiled Fortran code.

• A new serialization mechanism is available that can be used to serialize R objects to connections or to strings. This mechanism is used for the version 2 save format. For now, only an internal C interface is available.

• R_tryEval added for evaluating expressions from C code with errors handled but guaranteed to return to the calling C routine. This is used in embedding R in other applications and languages.

• Support for attach()'ing user-defined tables of variables is available and accessed via the RObjectTables package, currently at http://www.omegahat.org/RObjectTables.

Bug fixes

� Fixed ‘������ ��!��������(���� !�� !’ to detectinstances of � �� at the very start of a line.

• Fixed Pearson residuals for glms with non-canonical links. Fixed them again for weights (PR#1175).

• Fixed an inconsistency in the evaluation context for on.exit expressions between explicit calls to 'return' and falling-off-the-end returns.

• The code in model.matrix.default() handling contrasts was assuming a response was present, and so without a response was failing to record the contrasts for the first variable if it was a factor.

� ������&�� could get the time base wrong insome cases.

• file.append() was opening all files in text mode: this mattered on Windows and classic Macintosh. (PR#1085)

� �AC J� � now works for factor �.

• substr<-() was misbehaving if the replacement was too short.

• The version of '00Index.html' generated when building R or installing packages had an incorrect link to the style sheet. The version used by help.start() was correct. (PR#1090)

� ������ now gives character (not factorcodes) as rownames. (PR#1092)

• plot.POSIXct() and plot.POSIXlt() now respect the 'xaxt' parameter.

• It is now possible to predict from an intercept-only model: previously model.matrix.default() objected to a 0-column model frame.

� ��!"�H$�� was not setting the right classes in1.3.x.

• cor(x, y, use = "all.obs") <= 1 is now guaranteed, which ensures that sqrt(1 - r^2) is always OK in cor.test(). (PR#1099)

� ��&��*��� had a missing ‘drop=FALSE’ andso failed for some intercept-less models.

• predict.arima0() now accepts vector as well as matrix 'newxreg' arguments.

� �;�����B�� now works for 0-columndataframes. This fixes PR#1102.

• plot(1:100, log = "y") now works.

• Method '"gnudoit"' of bug.report() was incorrectly documented as '"gnuclient"'. (PR#1108)


� saving with ‘ascii=TRUE’ mangled backslashes. (PR#1115)

� � ��B� and others now add a gap appropriately. (PR#1101)

� *��-���*��� now uses the correct ‘"df"’ (nlme legacy code).

� �*��<**#����������� works again, and closes all ����� diversions.

� �����0��.+����+� works again.

� �������;� was (accidentally) returning the result invisibly.

� �!"�H$���+5<+� (or ..lt) now work; hence, �� ���GB **.%�@E� now works with dataframes containing POSIXt date columns.

� ������ �8PO=Q/� and similar ones do not segfault anymore but duly report allocation errors.

� �?�=B =B /� now works (PR#1133).

� ����3����� got it wrong if the ‘"i"’ factor was not sorted (the function is now deprecated since ������ is there, but the bug still needed fixing ...)

� PR#757 was fixed incorrectly, causing improper subsetting of ‘pch’ etc. in �*����� ��*��.

� library() no longer removes environments of functions that are not defined in the top-level package scope. Also, packages loaded by require() when sourcing package code are now visible in the remaining source evaluations.

� ������ J� & now works (again) for ‘"dist"’ objects d. (PR#1129)

� Workarounds for problems with incompletely specified date-times in strptime() which were seen only on glibc-based systems (PR#1155).

� � �� �� was returning the wrong rotation matrix. (PR#1146)

� The [pqr]signrank and [pqr]wilcox functions failed to check that memory had been allocated (PR#1149), and had (often large) memory leaks if interrupted. They can now be interrupted on Windows and MacOS and don’t leak memory.

� �������� ���=�) is now ��5<B 5<�, not 5<.

� round(x, digits) for digits < 0 always gives an integral answer. Previously it might not due to rounding errors in fround. (PR#1138/9)

� Several memory leaks on interrupting functions have been circumvented. Functions lqs() and svd() can now be interrupted on Windows and MacOS.

� ������ was finding incorrect breakpoints from irregularly-spaced midpoints. (PR#1160)

� Use fuzz in the 2-sample Kolmogorov-Smirnov test in package ctest to avoid rounding errors (PR#1004, follow-up).

� Use exact Hodges-Lehmann estimators for the Wilcoxon tests in package ctest (PR#1150).

� Arithmetic which coerced types could lose the class information; for example, ‘table - real’ had a class attribute but was not treated as a classed object.

� Internal ftp client could crash R under error conditions such as failing to parse the URL.

� Internal clipping code for circles could attempt to allocate a vector of length -1 (related to PR#1174).

� The hash function used internally in ������, ���?���� and ���*������� was very inefficient for integers stored as numeric, on little-endian chips. It was failing to hash the imaginary part of complex numbers.

� ������ no longer tries to truncate on opening in modes including ‘"w"’. (Caused the fifo example to fail on HP-UX.)

� Output over 1024 characters was discarded from the GNOME console.

� ���� now correctly warns about clipped values also for logarithmic axes and has a ‘quiet’ argument for suppressing these (PR#1188).

� ����*��� � �����*� was not correctly handling a ‘contrasts.arg’ which did not supply a full set of contrasts (PR#1187).

� The ‘width’ argument of density() was only compatible with S for a Gaussian kernel: now it is compatible in all cases.

� The ;������ C code had a transcription error from the original Fortran which led to a small deviation from the intended distribution. (PR#1190)

� ����B B ���.=� was wrong if � was +/-Inf.

� Subsetting grouping factors gave incorrect degrees of freedom for some tests in package ctest. (PR#1124)

� � �������� had a memory leak.

� ?;���=�8'B =�/9O>7/B =�='� was (incorrectly) 3e-308. (PR#1201)




� Fixed alignment problem in ‘ ��’ on Irix. (PR#1002, 1026)

� �*��� failed on null binomial models. (PR#1216)

� La.svd() with ‘nu’ = 0 or ‘nv’ = 0 could fail as the matrix passed to DGESVD was not of dimension at least one (it was a vector).

� Rownames in ‘xcoef’ and ‘ycoef’ of cancor() were wrong if ‘x’ or ‘y’ was rank-deficient.

� lqs() could give warnings if there was an exact fit. (PR#1184)

� �&�� didn’t find free-floating variables for E � �� terms when called from inside another function.

� write.table() failed if asked to quote a numerical matrix with no row names. (PR#1219)

� *�� �� GB GB �.=� now returns the mean, �;����� GB GB � �;./� gives 0 (PR#1218).
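One of the fixes above concerns rounding errors in fround, the C routine behind base R's round(). Assuming the garbled name is indeed round() (an inference from the reference to fround), a minimal sketch of the intended behaviour for negative ‘digits’:

```r
## Hypothetical illustration (function name inferred from the
## mention of fround): for digits < 0, round() rounds to a power
## of ten and so should always give an integral answer.
x <- 12345.6
r <- round(x, digits = -2)  # round to the nearest hundred
r                           # 12300
r == as.integer(r)          # TRUE: no residual rounding error
```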

Changes in R 1.4.1

by the R Core Team

Bug fixes

� �����*���*��� . F<-�E� now always gives an immediate error message if a line is incomplete. (As requested in PR#1210)

� ����;*��� is no longer very slow in processing comments: moved to C code and fewer lines checked.

� type.convert() could give stack imbalance warnings if used with ‘as.is = TRUE’.

� predict.mlm ignored newdata (PR#1226) and also offsets.

� ������������ was inadvertently changed in 1.4.0 so that it would evaluate the requested test, but not display the result.

� � ��*� . %�@E� (the default) now works as documented (and as S does). Previously it only scaled the maximum to 1. (PR#1230)

� �= J� ���� ��� . =�; ����� � ��=A=B=C� and ����� � ��=AB =C� now work.

� plot() of multiple time series with ‘plot.type = "single"’ was computing ‘ylim’ from the first series only.

� �*������� has a new ‘xpd = � �+ ��+�’ argument which by default does clipping (of the horizontal lines) as desired (‘xpd = NA’ was used before, erroneously in most cases).

� � �������������*������B �� �& . /� now works.

� �������0�� failed when its argument is a structure/matrix. (PR#1238)

� ���4������� returns NULL when ‘optional=TRUE’ as promised in the documentation.

� setMethod() allows ‘...’ to be one of the arguments omitted in the method definition (but so far there is no check for ‘...’ being missing).

� Allow round() to work again on very large numbers (introduced in fixing PR#1138). (PR#1254)

� ‘ �������!��’ is now accepted by a C++ compiler.

� type.convert() was failing to detect integer overflow.

� ����� ��� was defaulting to foreground colour (black) fills rather than background (as used in 1.3.1 and earlier). Now background is used, but be aware that as from 1.4.0 this may be transparent.

� La.eigen(m, only.values = TRUE) does not segfault anymore in one branch (PR#1262).

� cut() now produces correct default labels even when ‘include.lowest = TRUE’ (PR#1263).

� ��� ��*���� works properly with a response.

� ����*��GB � . /� now works properly.

� Options ‘by = "month"’ and ‘by = "year"’ to seq.POSIXt() will always take account of changes to/from daylight saving time: this was not working on some platforms.

� �*��������**�� now accepts all the arguments of �*������� (it could be called from �*����� with arguments it did not accept), and is now documented.

� ��&�����;����/�B �� . %�@E� now works.

� � �������*���;����B ����� . %�@E� was failing if the fit involved an offset.




� detach() on ‘"package:base"’ would crash R. (PR#1271)

� � ��� or ��� 0 on a ���&�� object with no terms, no names on the response and ‘intercept = FALSE’ (which is not sensible) would give an error.

� seek() on file connections was ignoring the ‘origin’ argument.

� Fixed new environment handling in library() to avoid forcing promises created by delay().

� ��=�� could leak memory: now released via ���� ����.

� ? ������? BG� now keeps the names of ? D? .

� ���==H��� �� no longer fails on data indexes not generated by ������ (PR#1274).
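Two of the fixes above mention the arguments ‘include.lowest = TRUE’ (PR#1263) and ‘by = "month"’. Assuming the garbled function names are cut() and seq() on POSIXct values (an inference from those arguments, not stated explicitly in the garbled source), a short sketch of the corrected behaviour in a current R:

```r
## Inferred example: cut() default labels should use a closed
## bracket for the lowest break when include.lowest = TRUE.
f <- cut(0:4, breaks = c(0, 2, 4), include.lowest = TRUE)
levels(f)               # "[0,2]" "(2,4]"

## Inferred example: stepping a date-time sequence by calendar
## months (GMT here, so no DST complications in this sketch).
d <- seq(as.POSIXct("2002-01-01", tz = "GMT"), by = "month",
         length.out = 3)
format(d, "%Y-%m-%d")   # "2002-01-01" "2002-02-01" "2002-03-01"
```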

Changes on CRAN

by Kurt Hornik and Friedrich Leisch

CRAN packages

The following extension packages from ‘����������*’ were added since the last newsletter.

Bhat Functions for MLE, MCMC, CIs (originally in Fortran). By E. Georg Luebeck.

CircStats Circular Statistics, from ‘Topics in Circular Statistics’ by S. Rao Jammalamadaka and A. SenGupta, World Scientific (2001). S original by Ulric Lund, R port by Claudio Agostinelli.

ROracle Oracle Database Interface driver for R. Uses the ProC/C++ embedded SQL. By David A. James and Jake Luciani.

RQuantLib The RQuantLib package provides access to (some of) the QuantLib functions from within R. It is currently limited to some option pricing and analysis functions. The QuantLib project aims to provide a comprehensive software framework for quantitative finance. The goal is to provide a standard free/open source library to quantitative analysts and developers for modeling, trading, and risk management of financial assets. By Dirk Eddelbuettel for the R interface, and the QuantLib group for QuantLib (http://quantlib.org).

RSQLite Database Interface R driver for SQLite. Embeds the SQLite database engine in R. By David A. James.

RadioSonde RadioSonde is a collection of programs for reading and plotting SKEW-T, log p diagrams and wind profiles for data collected by radiosondes (the typical weather balloon-borne instrument). By Tim Hoar, Eric Gilleland, and Doug Nychka.

agce Contains some simple functions for the analysis of growth curve experiments. By Raphael Gottardo.

aws Contains R functions to perform the adaptive weights smoothing (AWS) procedure described in Polzehl and Spokoiny (2000), Adaptive weights smoothing with applications to image restoration, Journal of the Royal Statistical Society, Ser. B, 62, 2, 335–354. By Joerg Polzehl.

combinat Routines for combinatorics. By Scott Chasalow.

deldir Calculates the Delaunay triangulation and the Dirichlet or Voronoi tessellation (with respect to the entire plane) of a planar point set. By Rolf Turner.

dr Functions, methods, and datasets for fitting dimension reduction regression, including pHd and inverse regression methods SIR and SAVE. These methods are described, for example, in R. D. Cook (1998), Regression Graphics, Wiley, New York. Also included is code for computing permutation tests of dimension. By Sanford Weisberg.

emplik Empirical likelihood ratio for means, quantiles, and hazards from possibly right-censored data. By Mai Zhou and Art Owen.

evd Extends simulation, distribution, quantile and density functions to univariate, bivariate and (for simulation) multivariate parametric extreme value distributions, and provides fitting functions which calculate maximum likelihood estimates for univariate and bivariate models. By Alec Stephenson.

g.data Create and maintain delayed-data packages (DDP’s). Data stored in a DDP are available on demand, but do not take up memory until requested. You attach a DDP with g.data.attach(), then read from it and assign to it in a manner similar to S-Plus, except that you must run g.data.save() to actually commit to disk. By David Brahm.




geoRglm Functions for inference in generalised linear spatial models. By Ole F. Christensen and Paulo J. Ribeiro Jr.

grid A rewrite of the graphics layout capabilities, plus some support for interaction. By Paul Murrell.

hdf5 Interface to the NCSA HDF5 library. By Marcus G. Daniels.

ifs Iterated Function Systems distribution function estimator. By S. M. Iacus.

lasso2 Routines and documentation for solving regression problems while imposing an L1 constraint on the estimates, based on the algorithm of Osborne et al. (1998). By Justin Lokhorst, Bill Venables and Berwin Turlach; first port to R by Martin Maechler.

lattice Implementation of Trellis Graphics. By Deepayan Sarkar.

moc Fits a variety of mixture models for multivariate observations with user-defined distributions and curves. By Bernard Boulerice.

pastecs Regulation, decomposition and analysis of space-time series. By Frederic Ibanez, Philippe Grosjean & Michele Etienne.

pear Package for estimating periodic autoregressive models. Also includes methods for plotting periodic time series data. S original by A. I. McLeod, R port by Mehmet Balcilar.

qtl Analysis of experimental crosses to identify genes (called quantitative trait loci, QTLs) contributing to variation in quantitative traits. By Karl W Broman, with ideas from Gary Churchill and Saunak Sen and contributions from Hao Wu.

spatstat Data analysis and modelling of two-dimensional point patterns, including multitype points and spatial covariates. By Adrian Baddeley and Rolf Turner.

spsarlm Functions for estimating spatial simultaneous autoregressive (SAR) models. By Roger Bivand.

New country mirrors

We now also have CRAN country mirrors in Brazil (thanks to Paulo Justiniano Ribeiro Jr, �� �;�� ��*���� �����) and in Germany.

New submission email

The email address for submissions to CRAN is now cran@r-project.org (the old address no longer works). Uploads still go to ftp://cran.r-project.org/incoming/.

Kurt Hornik
Wirtschaftsuniversität Wien, Austria
Technische Universität Wien, Austria
Kurt.Hornik@R-project.org

Friedrich Leisch
Technische Universität Wien, Austria
Friedrich.Leisch@R-project.org

Editors:
Kurt Hornik & Friedrich Leisch
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien
Wiedner Hauptstraße 8-10/1071
A-1040 Wien, Austria

Editor Programmer’s Niche:
Bill Venables

Editorial Board:
Douglas Bates, John Chambers, Peter Dalgaard, Robert Gentleman, Stefano Iacus, Ross Ihaka, Thomas Lumley, Martin Maechler, Guido Masarotto, Paul Murrell, Brian Ripley, Duncan Temple Lang and Luke Tierney.

R News is a publication of the R project for statistical computing; communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors. Please send submissions to the Programmer’s Niche column to Bill Venables, all other submissions to Kurt Hornik or Friedrich Leisch (more detailed submission instructions can be found on the R homepage).

R Project Homepage:
http://www.R-project.org/

Email of editors and editorial board:
firstname.lastname@R-project.org

This newsletter is available online at
http://CRAN.R-project.org/doc/Rnews/
