
  • Software Tutorial for Microarray Meta-analysis

    by line

    August 29, 2012

  • Contents

    1 Introduction
        1.1 Background
        1.2 How to get it
        1.3 Reporting bugs and update

    2 Getting Started
        2.1 Graphical Interface Introduction
        2.2 Prepare the data
        2.3 Load Data and Preprocessing
            2.3.1 Load Data
            2.3.2 Preprocessing
        2.4 Meta Analysis
            2.4.1 MetaQC
            2.4.2 MetaDE
            2.4.3 MetaPath

    3 Example
        3.1 Load data
        3.2 Preprocessing
        3.3 MetaQC
        3.4 MetaDE
        3.5 MetaPath

  • 1 Introduction

    1.1 Background

    Microarrays play an important role in genomic studies. With the rapid development of high-throughput genomic technology, combining multiple studies has become an important way to increase statistical power. Many researchers have come up with methods for these problems, and some of the methods have been released as packages in languages such as R. They remain hard for experimental scientists to use, however, because working in these languages is complicated and is not the focus of many researchers [1]. This paper therefore presents a user-friendly GUI that implements the metaOmics package. It is designed to be easy to use and to produce results that are easy to understand.

    1.2 How to get it

    The software can be downloaded from http://www.biostat.pitt.edu/bioinfo/software.htm

    1.3 Reporting bugs and update

    Some settings may feel uncomfortable or inconvenient to use. If you find any, please let the author know so that they can be improved in a future version. Suggestions and feedback are sincerely appreciated. If you find any bugs, please contact the author at [email protected] so that they can be fixed in the next version. Thank you for your contribution.


  • 2 Getting Started

    2.1 Graphical Interface Introduction

    The GUI consists of four parts, all visible when the software opens.

    • Studies information. This field keeps track of the input data, including study name, gene size, sample size, platform name and the year in which the data were generated. Keep in mind that the gene size may change if the user merges or filters the data.

    • Console window. This window reports what the software is currently doing. If the user clicks a button that requires another action first, this window shows a warning message.

    • Load data panel. This panel lets the user load data from the local computer. After loading and preprocessing, the data are ready for the meta-analysis panel.

    • Meta-analysis panel. Three analysis tools are currently available in this panel: metaQC, metaDE and metaPath. The user can visualize and save the results.

    2.2 Prepare the data

    Data should be arranged as a matrix in which each row represents a gene and each column represents a sample. Proper gene names or probe IDs and sample names are also required. See Figure 2.1.

    Figure 2.1: two-class data. (a) matched data (b) unmatched data

    Figure 2.2: survival data

    At present only .txt and .csv files are accepted; .xls support may be added in the future. A convenient way to prepare the data is to arrange them in Excel and save them as txt or csv. Different data types have slightly different formats.

    For two-class, multiple-class and continuous data, the first row should contain the sample names and the second row the class labels.

    For survival data, the first row should contain the sample names, the second row the time and the third row the censoring status. See Figure 2.2.

    For matched data, the first column should contain the gene symbols. See Figure 2.1a.

    For unmatched data, the first column should contain the probe names and the second column the gene symbols. See Figure 2.1b.

    The rest of the file should be the expression matrix of the study. Missing values should be marked as NA, following the usual convention.
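    The layout described above can be sketched in a few lines of Python. This is only an illustration, not the reader used by the software: the function name, the placeholder text in the top-left cells ("gene", "label") and the toy values are assumptions, since the tutorial's figures do not survive in this transcript.

```python
import csv
import io


def read_two_class_matched(handle, delimiter="\t"):
    """Parse a matched two-class study.

    Row 1 holds the sample names, row 2 the class labels, and every
    following row one gene, with the gene symbol in the first column.
    "NA" marks a missing value and is returned as None.
    """
    rows = list(csv.reader(handle, delimiter=delimiter))
    samples = rows[0][1:]  # skip the cell above the gene column
    labels = rows[1][1:]
    expression = {}
    for row in rows[2:]:
        if not row:
            continue  # ignore blank lines
        gene, values = row[0], row[1:]
        expression[gene] = [None if v == "NA" else float(v) for v in values]
    return samples, labels, expression


# Hypothetical toy study: 3 samples, 2 genes, one missing value.
example = (
    "gene\tS1\tS2\tS3\n"
    "label\t0\t0\t1\n"
    "TP53\t7.1\tNA\t6.8\n"
    "MYC\t5.0\t5.4\t9.2\n"
)
samples, labels, expression = read_two_class_matched(io.StringIO(example))
```

    For a csv file the same function can be called with delimiter=",".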

    2.3 Load Data and Preprocessing

    2.3.1 Load Data

    • File type: two file types can be read into the software, txt and csv.

    • Data type: four data types are supported: two class, multiple class, Continue and Survival.

    • Logged: if the microarray data have already been log-transformed, select this checkbox. If not, leave it unselected.

    • Matched: if the probe IDs have not yet been mapped to unique gene symbols, leave the Matched checkbox unselected. For each gene symbol with multiple probe IDs, the software will then keep the probe with the greatest IQR. If you have already mapped the probe IDs to gene symbols and every study has unique gene names, select this checkbox.

    • Add studies: select the data files and click Open. Each dataset should be in its own file, and all files must be in the same directory.


    • Add info (optional): additional information about each dataset can be added, namely the platform and the year in which each dataset was generated. The information file has one line per study. Each line consists of tab-delimited tokens: the first is the study name (author name), the second is the platform and the third is the year.

    • Confirm: after adding studies and info, click the Confirm button to read the data into the software. A large dataset may take a while to load.
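    The probe selection described under the Matched option (keep, for each gene symbol, the probe with the greatest IQR) can be sketched as follows. This is an assumed illustration rather than the software's own code: the function names and toy probes are hypothetical, the quartile convention may differ from the one the software uses, and missing values are not handled.

```python
import statistics


def iqr(values):
    """Interquartile range (Q3 - Q1) of a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def collapse_probes_by_iqr(probes):
    """Keep one probe per gene symbol: the one with the largest IQR.

    probes: list of (probe_id, gene_symbol, expression_values).
    Returns {gene_symbol: (probe_id, expression_values)}.
    """
    best = {}
    for probe_id, gene, values in probes:
        score = iqr(values)
        if gene not in best or score > best[gene][0]:
            best[gene] = (score, probe_id, values)
    return {gene: (pid, vals) for gene, (_, pid, vals) in best.items()}


# Hypothetical probes: two map to TP53, one to MYC.
probes = [
    ("p1", "TP53", [5.0, 5.1, 5.0, 5.2]),
    ("p2", "TP53", [4.0, 6.0, 8.0, 10.0]),
    ("p3", "MYC", [7.0, 7.5, 8.0, 8.5]),
]
collapsed = collapse_probes_by_iqr(probes)
```

    In this toy input, p2 varies far more across samples than p1, so it is the probe kept for TP53.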

    2.3.2 Preprocessing

    • Merge: after you click the Merge button, only the genes shared by all studies are kept; all other genes are removed.

    • Filter: the user can filter out unexpressed genes by their mean value and uninformative genes by their standard deviation. Both the mean and the standard deviation filter thresholds must be specified.

    • KNN imputation: if the datasets contain missing values, run KNN imputation before using any further functions.

    After this step, the data have been loaded into the software and pre-processing is finished. You can now use the meta-analysis sections.
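    As a rough illustration of the Merge step above, the sketch below keeps only the genes present in every study. The data structure (a dictionary of studies, each mapping gene symbols to expression values) and the function name are assumptions made for this example; the software's internal representation is not described in this tutorial.

```python
def merge_studies(studies):
    """Keep only the genes shared by all studies.

    studies: {study_name: {gene_symbol: expression_values}}; must be non-empty.
    Returns the same structure restricted to the shared genes,
    with genes in alphabetical order.
    """
    shared = set.intersection(*(set(genes) for genes in studies.values()))
    return {
        name: {gene: genes[gene] for gene in sorted(shared)}
        for name, genes in studies.items()
    }


# Hypothetical studies with partly overlapping genes.
studies = {
    "A": {"TP53": [1.0], "MYC": [2.0], "EGFR": [3.0]},
    "B": {"MYC": [4.0], "TP53": [5.0], "KRAS": [6.0]},
}
merged = merge_studies(studies)
```

    After merging, every study has the same gene list, which is why the gene size shown in the studies information field can shrink.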

    2.4 Meta Analysis

    At present the software contains only three meta-analysis tools; more may be added in the future. This tutorial shows how to use all three, but users may use any one or more of them as needed.

    2.4.1 MetaQC

    MetaQC [3] uses six quantitative quality control measures to determine the rank of each study. A high rank score means that the study is probably irrelevant to the other studies.

    • Pathway database for EQCp: the user needs to specify the pathway database used for EQCp. Several databases are available, such as GO, Biocarta, KEGG and Reactome, all of which can be downloaded from MSigDB (http://www.broadinstitute.org/gsea/msigdb/index.jsp). Users can also load their own pathway database by clicking Load; it must be a gmt file. Either the default databases or a database loaded from the local PC can be used, but not both at the same time.

    • Pathway information for CQCp and AQCp: the user needs to specify the pathway database used for CQCp and AQCp. The default databases and the way to load a user's own database are the same as in the previous item.

    • Number of top pathways for EQCp: the number of top pathways used in the EQC calculation. For good performance, set this to a reasonably small number.

    • B for EQC: the number of permutations used in the EQC calculation.

    • pval cut: p-value cutoff for the AQC calculation.

    • Pval adjust: whether to use the B-H adjustment [7].

    After setting these parameters, click the metaQC button. The calculation may take a while if the number of permutations B is large or the datasets are big. When it finishes, a QC result table pops up with the six quality scores, together with a PCA plot, which can be saved.

    Once you have decided which studies are of poor quality, you can delete them and load the data again. The information window will then be updated.
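    For readers unfamiliar with the B-H adjustment [7] offered by the Pval adjust option, a minimal sketch of the standard Benjamini-Hochberg step-up procedure follows. The function name and the toy p-values are assumptions for illustration; the software relies on its underlying R packages, not on this code.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of p * m / rank over all ranks at or above the current one.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values.
adjusted = bh_adjust([0.01, 0.04, 0.03, 0.005])
```

    Comparing the adjusted values with the chosen cutoff (for example 0.05) controls the false discovery rate at that level.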

    2.4.2 MetaDE

    In the MetaDE panel, the user can perform a meta-analysis of differential expression (DE) across multiple studies. Several individual-study tests and meta-analysis tests are available, depending on the input data type. The software helps users identify the differentially expressed genes and provides detailed information about them. The p-value of each individual study and the meta-analysis result are also given. The user can save the result and, by controlling the false discovery rate, generate a heatmap of the differentially expressed genes.

    • Individual test: there are three options: regular t-test, modified t-test and paired t-test. For the paired t-test, samples must be correctly labeled.

    • Individual tail: the user ought to specify which tail to use. The default is abs.

    • Meta-test: the options are maxP, minP, roP, Fisher [6], AW [4], Stouffer [8], SR, PR, minMCC, FEM, REM and randProd [9], along with their OC corrections. If OC correction is selected, the user does not need to specify the tail, because the software compares the results of both tails. Some methods have an asymptotic result and some do not. For methods without an asymptotic result, the user has to use the permutation method and specify the number of permutations. Methods with an asymptotic result can also be run with the permutation method. If roP is selected, rth must be specified; it is a number between 1 and the number of studies.

    • Meta analysis: click to start the metaDE analysis. A table pops up listing all the shared genes with their p-value in each individual study, the meta p-value and the meta q-value. The user can sort the genes by any p-value or q-value column, and can click on a gene of interest to get detailed information about it (the information comes from Bioconductor org.Hs.eg.db, http://www.bioconductor.org/packages/2.10/data/annotation/html/org.Hs.eg.db.html).

    • Save as file: the user can save the result as a csv or txt file.

    • Heatmap: the user can generate a heatmap of the DE genes at a given false discovery rate. The heatmap can also be saved.
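    To make four of the meta-test options above concrete, the sketch below combines one gene's p-values across studies with Fisher, minP, maxP and roP. It uses the textbook closed forms that hold when the studies are independent and the p-values are uniform under the null hypothesis; it is an illustration under those assumptions, not the MetaDE implementation, and the function names are hypothetical. P-values must be greater than 0 for the Fisher statistic.

```python
import math


def fisher(pvalues):
    """Fisher's method: -2 * sum(log p) is chi-square with 2k degrees of freedom."""
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # half of the Fisher statistic
    # Upper tail of a chi-square with an even number of degrees of freedom (2k).
    return math.exp(-half_stat) * sum(
        half_stat**i / math.factorial(i) for i in range(k)
    )


def rop(pvalues, r):
    """r-th ordered p-value: its null distribution is Beta(r, k - r + 1)."""
    k = len(pvalues)
    p_r = sorted(pvalues)[r - 1]
    # Beta(r, k - r + 1) CDF written as a binomial tail sum.
    return sum(
        math.comb(k, j) * p_r**j * (1 - p_r) ** (k - j) for j in range(r, k + 1)
    )


def min_p(pvalues):
    """minP is roP with r = 1."""
    return rop(pvalues, 1)


def max_p(pvalues):
    """maxP is roP with r = number of studies."""
    return rop(pvalues, len(pvalues))


# Hypothetical p-values of one gene in three studies.
study_p = [0.1, 0.2, 0.3]
combined = {
    "Fisher": fisher(study_p),
    "minP": min_p(study_p),
    "maxP": max_p(study_p),
    "roP(r=2)": rop(study_p, 2),
}
```

    The sketch also shows why rth must lie between 1 and the number of studies: r = 1 reproduces minP and r equal to the number of studies reproduces maxP.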

    2.4.3 MetaPath

    Pathway enrichment is an important way to check whether the discovered DE genes are reasonable. The MetaPath [5] section provides three Meta-Analysis Pathway Enrichment (MAPE) tools, based on the gene level (MAPE_G), the pathway level (MAPE_P) and a hybrid of both levels (MAPE_I). For users with multiple studies, this section makes it easy to detect pathways and visualize the result.

    • Pathway database: first the user has to specify which pathway database to use. There are 4 default pathway databases, the same as in metaQC. Users can also load their own pathway database from gmt files. As before, the pathway database can be chosen by only one of these methods.

    • Permutation type: the user needs to choose the permutation type (by gene or by sample).

    • Meta test: the user needs to specify the meta test method. Only 4 methods are available here (maxP, minP, roP, Fisher).

    • Pathway gene size range: the user needs to give the range of the number of genes a pathway may contain. The software then filters out pathways with fewer genes than the minimum size or more genes than the maximum size.

    • Number of permutation: the number of permutations used during pathway detection.

    • Meta pathway: click this button to start MAPE_G, MAPE_P and MAPE_I. It may take a while if the number of permutations is large or the input dataset is big.

    • Qvalue.cut: the user can retrieve the significant pathways from the three methods by controlling the false discovery rate.

    • Plot: the user can visualize the pathways under the given false discovery rate. The plot can also be saved.
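    The gmt format and the pathway gene size range mentioned above can be illustrated with a short sketch. In a gmt file each line is tab-delimited: pathway name, a description, then the member genes. The function names and the toy pathways are assumptions; whether the software counts all genes in a pathway or only those present in the data is not stated in this tutorial, and the sketch counts all listed genes.

```python
def parse_gmt(lines):
    """Read gmt lines into {pathway_name: set_of_genes}."""
    pathways = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # need a name, a description and at least one gene
        name = fields[0]
        pathways[name] = {gene for gene in fields[2:] if gene}
    return pathways


def filter_by_size(pathways, min_size, max_size):
    """Drop pathways with fewer than min_size or more than max_size genes."""
    return {
        name: genes
        for name, genes in pathways.items()
        if min_size <= len(genes) <= max_size
    }


# Hypothetical gmt content with two pathways.
gmt_lines = [
    "PATH_A\tna\tTP53\tMYC\tEGFR\n",
    "PATH_B\tna\tKRAS\n",
]
pathways = parse_gmt(gmt_lines)
kept = filter_by_size(pathways, min_size=2, max_size=500)
```

    A real file can be passed in the same way, for example by handing an open file object to parse_gmt.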


  • 3 Example

    The software ships with example data that show how to use it, and the examples are easy to follow. Double-click metaOmics.exe to open the software. It may take a couple of seconds to load the required R packages. See Figure 3.1.

    The default load folder contains 9 prostate cancer datasets as txt files. Their probe IDs have been matched to gene symbols and the expression data have not been log2-transformed. The data have two classes: benign tumors are labelled 0 and localized tumors are labelled 1.

    3.1 Load data

    First, in the corresponding combo boxes, set the file type to txt and the data type to two class. Select the Matched checkbox, because the data are matched, and leave the Logged checkbox unselected, because the data are not log2-transformed.

    Click the "add studies" button (see Figure 3.2) and an open-file dialog pops up. The default folder contains 9 studies. Select them all and click "open". See Figure 3.3a.

    If you are not satisfied with what you have loaded, you can delete studies from the studies information table by clicking the "delete studies" button. Make sure that all the studies in the information table are in the same directory.

    External information about the studies can be added by clicking the "add info" button. This is optional, and the requirements are explained in Section 2.3.1.

    After these steps, click the "confirm" button and the software reads in the data. This may take a while if your dataset is big. After confirmation, the console window reports "load complete". The studies information window is also updated to show the current study names, the gene size of each study, the sample size and the external information. See Figure 3.4. You have now finished loading the data and can move on to the pre-processing section.

    Figure 3.1: Open the metaOmics software

    Figure 3.2: add studies

    Figure 3.3: add studies and information dialog. (a) add studies dialog (b) add information dialog

    Figure 3.4: after confirmation

    3.2 Preprocessing

    In this section you do some pre-processing before the meta-analysis.

    First, merge the data by clicking the "merge" button. All studies then contain the same set of genes, namely the genes common to all studies. The studies information window is updated accordingly.

    Second, filter out unexpressed genes by their mean value and uninformative genes by their standard deviation. In the example, 50% of genes are filtered out by mean value and afterwards 50% of genes are filtered out by standard deviation. These percentages are entered in the "filter by mean(%)" and "filter by sd(%)" text fields. Click the "filter" button to finish the filtering. See Figure 3.5.

    Finally, run KNN imputation in case there are missing values. Missing values are not allowed in metaQC and metaPath. In metaDE, some statistical tests do allow missing values, but for convenience the user needs to run KNN imputation before any further action. After clicking this button, pre-processing is done and the user can go on to the three meta-analysis sections.
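    The two-stage filter used in this example can be sketched for a single study as follows. This is an assumed illustration: it drops the lowest fraction of genes by mean and then the lowest fraction of the remaining genes by standard deviation, whereas the tutorial does not state how the software ranks genes across several studies. The function names and toy values are hypothetical, and missing values are not handled.

```python
import statistics


def drop_lowest_fraction(expression, score, fraction):
    """Remove the given fraction of genes with the lowest score."""
    ranked = sorted(expression, key=lambda gene: score(expression[gene]))
    n_drop = int(len(ranked) * fraction)
    return {gene: expression[gene] for gene in ranked[n_drop:]}


def filter_genes(expression, mean_fraction, sd_fraction):
    """Filter by mean first, then filter the remaining genes by standard deviation."""
    kept = drop_lowest_fraction(expression, statistics.mean, mean_fraction)
    return drop_lowest_fraction(kept, statistics.stdev, sd_fraction)


# Hypothetical study with four genes and three samples each.
expression = {
    "A": [1, 1, 1],
    "B": [2, 2, 2],
    "C": [8, 8, 8],
    "D": [5, 9, 13],
}
filtered = filter_genes(expression, mean_fraction=0.5, sd_fraction=0.5)
```

    With both fractions set to 0.5, as in the example, a quarter of the genes in this toy study survive the two filters.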

    3.3 MetaQC

    This section lets users perform metaQC, which includes six quality control measures. EQCp, CQCp and AQCp additionally require the user to specify external pathway information. In the example, a combination of the default pathway databases is selected. See Figure 3.6. The number of pathways for EQC is set to 5. To run metaQC quickly in the example, the permutation number B is set to 100. Pvalcut is set to 5%. For more details about these parameters, see Section 2.4.1. After specifying the parameters, click the "metaQC" button. Progress is shown in the console window. When the process finishes, two dialogs pop up. One shows the rank scores of the metaQC result, from which the user can decide which studies are of bad quality (see Figure 3.7a). The other is a PCA plot of the six quality scores, which can be saved as a file on the local PC (see Figure 3.7b).

    The Nanni and Dhanasekaran studies have high metaQC rank scores and relatively small sample sizes, which is evidence that these two studies correlate poorly with the other studies. It is reasonable to remove them to achieve a more consistent result. Select the two studies in the studies information window and click the "delete" button (see Figure 3.8a), then confirm again and merge the data. This time you will probably find more shared genes across the studies, which also indicates that the remaining studies correlate better. In the example we again filter out 50% of genes as unexpressed by mean value and 50% as uninformative by standard deviation, and then run KNN imputation again. The data selected by metaQC are then ready for further meta-analysis (see Figure 3.8b).

    Figure 3.5: filter

    Figure 3.6: metaQC panel

    Figure 3.7: metaQC result. (a) metaQC 6 quality scores and rank (b) PCA plot of the metaQC result

    Figure 3.8: metaQC selections and reload data. (a) remove studies with poor metaQC performance (b) reload data

    Figure 3.9: metaDE analysis parameter selection

    3.4 MetaDE

    After the data have been loaded, or reloaded after metaQC, and pre-processed, it is time to perform the metaDE analysis. In the example, the individual test is set to "regt", the individual tail to "abs" and the meta test to roP; the asymptotic checkbox is selected and rth is set to 5 (see Figure 3.9).

    Start the metaDE analysis by clicking the "meta analysis" button. After a while, a dialog pops up listing all the shared genes with their p-value in each individual study, the meta p-value and the meta q-value. The user can sort the genes alphabetically or by the significance of any of the p-values (see Figure 3.10a). To see more information about a particular gene, click on the gene name and a dialog with detailed gene information pops up (see Figure 3.10b).

    Figure 3.10: metaDE analysis result. (a) sort metaDE analysis result (b) get detailed information about a particular gene

    By clicking the "save as file" button, the user can save the metaDE analysis result as a txt or csv file. In the example, with the false discovery rate controlled at 5%, clicking the "heatmap" button generates the expression heatmap of all the differentially expressed genes (see Figure 3.11).

    3.5 MetaPath

    In the example, the parameters are set as in Figure 3.12. The default pathway database KEGG is chosen. The permutation type is set to gene and the meta statistic to maxP. The lower limit on the number of genes in a pathway is 5 and the upper limit is 500. To get the result quickly, the permutation number is set to 500. Click the "meta pathway" button to perform MAPE_G, MAPE_P and MAPE_I separately. When it finishes, the user can click the "qvalue cutoff" button to get the significant pathways under a given q-value threshold. A commonly used threshold is 0.05. In our example, however, no pathway is detected under that threshold. If the q-value cutoff is 1, all the pathways are listed with their p-value in each individual study and the three MAPE results (see Figure 3.13a). The user can also click "plot" to see the significant pathways below the given q-value cutoff (see Figure 3.13b).

    Figure 3.11: differentially expressed genes' heatmap

    Figure 3.12: metaPath panel

    Figure 3.13: meta pathway result. (a) metaPath result (b) metaPath plot

  • Bibliography

    [1] George C. Tseng, Debashis Ghosh and Eleanor Feingold. Comprehensive literature review and statistical consideration for microarray meta-analysis. Nucleic Acids Research, 2012, Vol. 40, No. 9, 3785-3799.

    [2] Xingbin Wang, Dongwan D. Kang, Kui Shen, Chi Song, Shuya Lu, Lun-Ching Chang, Serena G. Liao, Zhiguang Huo, Shaowu Tang, Naftali Kaminski, Etienne Sibille, Yan Lin, Jia Li, and George C. Tseng. An R package Suite for Microarray Meta-analysis in Quality Control, Differentially Expressed Gene Analysis and Pathway Enrichment Detection.

    [3] Dongwan D. Kang, Etienne Sibille, Naftali Kaminski, and George C. Tseng. (2012) MetaQC: Objective Quality Control and Inclusion/Exclusion Criteria for Genomic Meta-Analysis. Nucleic Acids Research. 40(2):e15.

    [4] Li, J. and Tseng, G.C. An adaptively weighted statistic for detecting differential gene expression when combining multiple transcriptomic studies. Annals of Applied Statistics. Accepted.

    [5] Kui Shen and George C. Tseng. (2010) Meta-analysis for pathway enrichment analysis when combining multiple microarray studies. Bioinformatics. 26:1316-1323.

    [6] Fisher, R. Combining independent tests of significance. American Statistician, 2(5):30, 1948.

    [7] Benjamini, Y. and Hochberg, Y. Controlling the False Discovery Rate - a Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society Series B-Methodological, 57, 289-300, 1995.

    [8] Stouffer, S., Suchman, E., DeVinney, L., Star, S., and Williams, J. The American Soldier, Volume I: Adjustment during Army Life. Princeton University Press, 1949.

    [9] Hong, F., et al. (2006) RankProd: a bioconductor package for detecting differentially expressed genes in meta-analysis. Bioinformatics, 22, 2825-2827.
