1 Correlating PPI node degree with SNP counts Michael Grobe (This work supported in part by: Research Technologies Indiana University)

Post on 22-Dec-2015



Correlating PPI node degree with SNP counts

Michael Grobe

(This work supported in part by: Research Technologies, Indiana University)


Do PPI nodes of high degree “have” more or fewer SNPs?

Are hubs more or less susceptible to SNPs over evolutionary time?

If so, why?

If not, why not?


Hypothesis: The degrees of genes in a PPI network will correlate inversely with their SNP count.

This hypothesis will be tested using (parts of) the following data resources:

- dbSNP from NCBI,

- the Disease Gene Network data collected by Rual, et al., Stelzl, et al., and Goh, et al., and

- several other NCBI resources.


dbSNP

dbSNP is a large relational database maintained by the National Center for Biotechnology Information (NCBI) on a Microsoft SQLServer. (dbSNP seems to be misnamed.)

NCBI provides several public interfaces to dbSNP:

- a web-based interface for public use: http://www.ncbi.nlm.nih.gov/SNP/

- a set of web-accessible CGI scripts and (SOAP-based) Web Services, known as the Entrez eUtils, and

- an FTP repository of the data exported from the MS SQLServer.

NCBI does NOT provide an interface for submitting SQL commands directly to the SQLServer.

However, IUSM downloads the dbSNP data from the NCBI FTP repository and loads it into a local MS SQLServer, where it is available for use via JDBC; UITS makes it available via Web pages and (SOAP-based) Web Services.


UITS maintains (on a DB2 database management system) a collection of data resources called the Centralized Life Sciences Data (CLSD) service that incorporates dbSNP via “data federation”.

dbSNP can be accessed via CLSD at

http://discern.uits.iu.edu:8421/access/index.html

and also via a SOAP-based interface to CLSD at

http://discern.uits.iu.edu:8421/axis/CLSDservice.jws?wsdl

dbSNP can also be accessed via JDBC, or through a direct JAX-RPC interface, if necessary.

CLSD is described in detail at http://rac.uits.iu.edu/clsd/


List of CLSD data resources

BIND -- Pathways, Gene interactions
ENZYME -- Enzyme nomenclature
ePCR -- ePCR results of UniSTS vs Homo sapiens
SGD -- Saccharomyces Genome Database
DGN -- The Disease Gene Network data from Goh, et al. (Provisional)
KEGG data sources:
  + LIGAND -- Pathways, Reactions, & Compounds
  + PATHWAY -- Pathway map coordinates
NCBI data sources:
  + LocusLink -- Genetic Loci (retained for archival use)
  + UniGene -- Gene clusters

Federated data sources, where the data is stored:
* at the originating site:
  + NCBI Nucleotide -- Nucleotide sequences
  + NCBI PubMed -- Journal abstracts
* on local (mirror) servers external to CLSD but housed at IU:
  + BLAST -- Basic Local Alignment Search Tool (mirrored at IU by UITS)
    * Nucleotide data: NT
    * Protein data: NR and Swiss-Prot
  + dbSNP -- Single Nucleotide Polymorphisms (mirrored at IU by IUSM)


dbSNP is a relatively complex database. It includes about 300 tables for each species, and the separate species tables share about 80 additional tables.

dbSNP is also rather large: the Shared, Human, and Mouse catalogs (circa early 2008) fill around 150 GB and hold about 3 billion rows (of which about 2.8 billion are in dbSNP128_human).

New versions come out every 6 months or so. This study uses Build 128, although Build 129 was announced quite recently.

The tutorial “Using dbSNP via SQL queries” describes the structure and use of dbSNP via SQL.


The DGN PPI network

The Disease Gene Network data within CLSD includes 3 networks:

- a network of diseases that are “connected” when they involve the same gene,

- a network of 1777 genes that are “connected” when they are implicated in the same disease, and

- a Protein Protein Interaction (PPI) network built from networks defined by two different groups: Rual, et al. and Stelzl, et al.

The PPI network is defined in the table called PPI_RUAL_STELZL; it has 7533 unique genes and 22,052 edges (in a half-matrix form).

A companion table, PPI_GENES, lists every gene in the PPI network.

The PPI network was traversed to construct a list of shortest paths from each node to each other node: PPI_SHORTEST_PATH_LENGTHS. This is a kind of transitive closure and contains about 53 M records.
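The traversal described above can be sketched as a breadth-first search. This is only an illustration, not the procedure actually used to build PPI_SHORTEST_PATH_LENGTHS: the function names and the toy edge list are hypothetical, and the sketch assumes that "half-matrix form" means each undirected edge is stored once.

```python
from collections import defaultdict, deque

def build_adjacency(half_matrix_edges):
    """Expand edges stored once per pair into a symmetric adjacency map."""
    adjacency = defaultdict(set)
    for a, b in half_matrix_edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency

def shortest_path_lengths(adjacency, source):
    """Breadth-first search: hop count from source to every reachable node."""
    lengths = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in lengths:
                lengths[neighbor] = lengths[node] + 1
                queue.append(neighbor)
    return lengths

# Hypothetical toy network; the IDs are placeholders, not real gene IDs.
toy_edges = [(1, 2), (2, 3), (3, 4), (2, 5)]
toy_adjacency = build_adjacency(toy_edges)
toy_lengths = shortest_path_lengths(toy_adjacency, 1)
```

Running the search once from every node yields one record per (source, target) pair, which is consistent with the roughly 53 M records reported for a network of 7533 genes.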


SNPContigLocusID

The main dbSNP table used in this project is SNPContigLocusID, which contains information about the genes associated with each SNP.

The Build 128 version of SNPContigLocusID contains 13,129,868 rows (though about half of them specify “NW_” contig accessions and were ignored).

Here is a query that retrieves the records for 2 SNPs (among many others) that appear within, or close to, the coding region for JAK3.

select * from b126_SNPContigLocusId_36_1 where snp_id in ( 3212724, 3212755 )


Query results:

Note that both of these SNPs have several records; SNP ID is NOT a key. SNPs may even map to different chromosomes!


Here is a table of the Function Class (FXN_CLASS) codes.


Number of (NT_) SNPs in each SNP function class

select fxn_class, count(*)
from dbSNP128_human.b128_SNPContigLocusId_36_2
where contig_acc like 'NT_%'   -- so not all 13 M rows will appear
group by fxn_class
order by fxn_class

FXN_CLASS   Count
3           78797
6           6008473
8           192868
13          168608
15          166205
41          2753
42          98053
44          15848
53          144123
55          27990
73          645
75          483


Get gene IDs, symbols, and SNP counts

The following query uses both DGN and dbSNP data to get a list of gene IDs, their symbols, and the number of SNPs associated with each gene:

select a.locus_id, b.locus_symbol, snp_counter
from (select locus_id, count(*) as snp_counter
      from dbsnp128_human.b128_SNPContigLocusId_36_2
      where contig_acc like 'NT_%'
        and locus_id in (select gene_id from disease_gene_net.ppi_genes)
      group by locus_id) as a
join (select distinct locus_id, locus_symbol
      from dbsnp128_human.b128_SNPContigLocusId_36_2) as b
  on b.locus_id = a.locus_id
order by snp_counter desc


Gene IDs, symbols, and SNP counts

Here is a list of PPI genes with the top 100 SNP counts:

Gene ID Symbol    SNPs    Gene ID Symbol    SNPs    Gene ID Symbol    SNPs    Gene ID Symbol    SNPs
1756    DMD       60069   2104    ESRRG     7647    8224    SYN3      5495    1002    CDH4      4331
5799    PTPRN2    30328   1956    EGFR      7522    9577    BRE       5455    5884    RAD17     4286
26047   CNTNAP2   21661   9369    NRXN3     7088    8997    KALRN     5387    23254   KIAA1026  4275
5071    PARK2     19464   9215    LARGE     7046    4212    MEIS2     5271    817     CAMK2D    4207
5789    PTPRD     15867   56899   ANKS1B    7044    600     DAB1      5252    5602    MAPK10    4189
1305    COL13A1   15719   672     BRCA1     7024    351     APP       5196    8464    SUPT3H    4135
9734    HDAC9     14461   8618    CADPS     6798    2887    GRB10     5190    84570   COL25A1   4101
8379    MAD1L1    12441   6938    TCF12     6772    93986   FOXP2     5184    10142   AKAP9     4072
5152    PDE9A     12052   10207   INADL     6721    800     CALD1     5073    10466   COG5      4018
9586    CREB5     11921   1837    DTNA      6678    3119    HLA-DQB1  5039    64754   SMYD3     3988
2917    GRM7      11738   3084    NRG1      6392    659     PDE4DIP   4999    7492    ARID1B    3981
5649    RELN      11200   1896    EDA       6350    10580   SORBS1    4872    27133   KCNH5     3910
1523    CUTL1     9956    23345   SYNE1     6343    273     AMPH      4844    1390    CREM      3891
221935  SDK1      9194    4638    MYLK      6301    2066    ERBB4     4736    6262    RYR2      3879
9223    MAGI1     9091    9378    NRXN1     6157    6095    RORA      4644    8038    ADAM12    3874
23085   ERC1      9046    5558    PRIM2     5995    79109   MAPKAP1   4643    1501    CTNND2    3836
23236   PLCB1     8905    4897    NRCAM     5928    57509   MTUS1     4603    89797   NAV2      3798
1129    CHRM2     8895    2898    GRIK2     5877    4915    NTRK2     4562    10659   CUGBP2    3786
2272    FHIT      8366    9844    ELMO1     5835    1730    DIAPH2    4481    31      ACACA     3755
2918    GRM8      8128    3784    KCNQ1     5777    7518    XRCC4     4434    11214   AKAP13    3751
2139    EYA2      8091    6660    SOX5      5736    27185   DISC1     4413    1301    COL11A1   3736
29119   CTNNA3    8089    1740    DLG2      5616    1010    CDH12     4370    7273    TTN       3733
6487    ST3GAL3   8077    1630    DCC       5606    55714   ODZ3      4370    1838    DTNB      3698
53616   ADAM22    8006    5592    PRKG1     5574    6091    ROBO1     4346    7399    USH2A     3694
5890    RAD51L1   7786    3123    HLA-DRB1  5533    2895    GRID2     4332


PPI node degree

Here is a query that uses PPI_SHORTEST_PATH_LENGTHS to get degree for each node:

select source, count(*) as degree
from disease_gene_net.PPI_SHORTEST_PATH_LENGTHS
where length = 1
  and source in ( select gene_id from DISEASE_GENE_NET.PPI_GENES )
group by source
order by degree
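For readers without access to the CLSD tables, the same degree count can be sketched in Python. The row layout (source, target, length) is an assumption, since only the source and length columns appear in the query, and the toy rows are hypothetical.

```python
from collections import Counter

def node_degrees(path_rows, ppi_genes):
    """Count length-1 rows per source, restricted to genes listed in ppi_genes."""
    degrees = Counter()
    for source, _target, length in path_rows:
        if length == 1 and source in ppi_genes:
            degrees[source] += 1
    return degrees

# Hypothetical rows: each pair appears in both directions, as in a full
# node-to-node path table; lengths above 1 are ignored by the degree count.
toy_rows = [
    (1, 2, 1), (2, 1, 1),
    (2, 3, 1), (3, 2, 1),
    (1, 3, 2), (3, 1, 2),
]
toy_degrees = node_degrees(toy_rows, {1, 2, 3})
```

A node's degree is simply the number of other nodes at distance 1, which is why the shortest-path table can stand in for the edge table here.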


PPI node degree

Here is a query using that closure to get gene counts for each degree:

select degree, count(*)
from (select source, count(*) as degree
      from disease_gene_net.PPI_SHORTEST_PATH_LENGTHS
      where length = 1
        and source in ( select gene_id from DISEASE_GENE_NET.PPI_GENES )
      group by source) as a
group by degree
order by degree
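The nested grouping above amounts to counting how often each degree value occurs. A minimal Python sketch of that second step follows; the gene IDs and degrees are hypothetical placeholders, not values from the PPI tables.

```python
from collections import Counter

def degree_distribution(degrees):
    """Return sorted (degree, gene_count) pairs from a {gene_id: degree} map."""
    return sorted(Counter(degrees.values()).items())

# Hypothetical per-gene degrees.
toy_degrees = {101: 1, 102: 3, 103: 1, 104: 2, 105: 1}
toy_distribution = degree_distribution(toy_degrees)
```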


Degree and gene count for all genes in the PPI net:

Degree  Genes   Degree  Genes   Degree  Genes   Degree  Genes
1       2267    23      24      46      3       76      4
2       1217    24      16      47      2       77      1
3       849     25      15      48      3       78      4
4       589     26      16      49      4       79      3
5       465     27      19      50      8       80      1
6       343     28      13      51      6       82      1
7       248     29      12      53      1       83      1
8       198     30      13      54      2       84      1
9       176     31      12      55      3       87      1
10      145     32      13      56      2       89      2
11      119     33      8       57      2       94      1
12      99      34      7       58      4       95      1
13      106     35      7       59      4       97      1
14      88      36      5       60      4       99      1
15      59      37      7       62      4       103     2
16      52      38      9       63      1       105     1
17      58      39      2       64      1       118     1
18      34      40      3       65      1       123     1
19      38      41      7       67      1       124     1
20      23      42      3       69      1       129     1
21      26      43      4       73      1       151     1
22      21      44      1       75      2       153     1
23      24      45      3       76      4       176     1


To get PPI gene IDs, symbols, and degrees:

select b.locus_id, b.locus_symbol, degree
from (select source, count(*) as degree
      from disease_gene_net.PPI_SHORTEST_PATH_LENGTHS
      where length = 1
        and source in ( select gene_id from DISEASE_GENE_NET.PPI_GENES )
      group by source) as a
join (select distinct locus_id, locus_symbol
      from dbsnp128_human.b128_SNPContigLocusId_36_2) as b
  on b.locus_id = a.source


PPI gene IDs, symbols, and their degrees (top 100 degree values):

Gene ID Symbol    Degree  Gene ID Symbol    Degree  Gene ID Symbol    Degree  Gene ID Symbol    Degree
7157    TP53      176     4089    SMAD4     78      2547    XRCC6     59      5879    RAC1      50
5829    PXN       153     7431    VIM       78      6667    SP1       59      6303    SAT1      50
2885    GRB2      151     7704    ZBTB16    78      57473   ZNF512B   59      6774    STAT3     50
11007   CCDC85B   129     9094    UNC119    77      4067    LYN       58      8648    NCOA1     50
7186    TRAF2     124     672     BRCA1     76      5111    PCNA      58      26994   RNF11     50
7414    VCL       123     1499    CTNNB1    76      7534    YWHAZ     58      55660   PRPF40A   50
4087    SMAD2     118     1915    EEF1A1    76      55729   ATF7IP    58      25      ABL1      49
2130    EWSR1     105     3065    HDAC1     76      5777    PTPN6     57      3064    HD        49
4088    SMAD3     103     6498    SKIL      75      5970    RELA      57      4035    LRP1      49
4093    SMAD9     103     9869    SETDB1    75      6256    RXRA      56      10241   CALCOCO2  49
1956    EGFR      99      83755   KRTAP4-1  73      6908    TBP       56      4790    NFKB1     48
4188    MDFI      97      7329    UBE2I     69      596     BCL2      55      5894    RAF1      48
6714    SRC       95      5359    PLSCR1    67      1400    CRMP1     55      10399   GNB2L1    48
55791   C1orf103  94      5781    PTPN11    65      7088    TLE1      55      5371    PML       47
2534    FYN       89      2908    NR3C1     64      3320    HSP90AA1  54      7917    BAT3      47
3725    JUN       89      1742    DLG4      63      10524   HTATIP    54      801     CALM1     46
5295    PIK3R1    87      1107    CHD3      62      3717    JAK2      53      5578    PRKCA     46
2099    ESR1      84      1937    EEF1G     62      351     APP       51      8655    DYNLL1    46
5925    RB1       83      4086    SMAD1     62      3932    LCK       51      1051    CEBPB     45
7094    TLN1      82      57562   KIAA1377  62      5594    MAPK1     51      2185    PTK2B     45
367     AR        80      998     CDC42     60      5747    PTK2      51      4609    MYC       45
1387    CREBBP    79      4110    MAGEA11   60      9513    FXR2      51      11030   RBPMS     44
5764    PTN       79      5335    PLCG1     60      11161   C14orf1   51      857     CAV1      43
6464    SHC1      79      10980   COPS6     60      867     CBL       50      5300    PIN1      43
2033    EP300     78      2335    FN1       59      3866    KRT15     50


Get SNP counts and degree values for each gene in the PPI:

select locus_id, degree, snp_counter
from (select locus_id, count(*) as snp_counter
      from dbsnp128_human.b128_SNPContigLocusId_36_2
      where contig_acc like 'NT_%'
        and ( fxn_class = 41 or fxn_class = 42 or fxn_class = 44 )
      group by locus_id) as a
join (select source, count(*) as degree
      from disease_gene_net.PPI_SHORTEST_PATH_LENGTHS
      where length = 1
        and source in ( select gene_id from DISEASE_GENE_NET.PPI_GENES )
      group by source) as b
  on source = locus_id
order by degree
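The join in this query is an inner join, so a gene appears in the output only if it has both a degree and at least one SNP in the selected function classes. A small Python sketch of that behavior, using hypothetical gene IDs and counts:

```python
def join_degree_and_snp_counts(snp_counts, degrees):
    """Inner join on gene ID: (gene_id, degree, snp_count) rows sorted by degree."""
    shared = snp_counts.keys() & degrees.keys()
    rows = [(gene, degrees[gene], snp_counts[gene]) for gene in shared]
    return sorted(rows, key=lambda row: (row[1], row[0]))

# Hypothetical inputs: gene 3 has a degree but no SNPs in the selected
# classes, and gene 4 has SNPs but is not in the PPI network.
toy_snp_counts = {1: 10, 2: 4, 4: 7}
toy_degrees = {1: 3, 2: 1, 3: 5}
toy_joined = join_degree_and_snp_counts(toy_snp_counts, toy_degrees)
```

Genes 3 and 4 drop out of the result. This may be one reason the number of genes differs from one SNP class to another in the result tables, although the source does not say so explicitly.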


Initial results:

The previous query was used to derive correlations between degree values and SNP counts per gene for every gene in the PPI network:

SNP Class    Genes   Mean Degree   Mean SNPs   Correlation
All          7403    5.9           428         0.046
41, 42, 44   6569    6.0           8.5         0.062
Not 6        7397    5.9           55          0.094
13, 15       7383    5.9           18          0.054
6            7174    5.9           348         0.041

(Note that a few observations were omitted due to using the mer counting script for non-mer work.)
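The slides do not say which correlation coefficient was computed. Assuming it was the usual Pearson product-moment coefficient, a correlation between degree and SNP count could be computed as in the following sketch; the function name and the toy data are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 long")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(spread_x * spread_y)

# Hypothetical data showing a perfectly inverse relationship, the pattern
# the hypothesis predicts (the reported values are all close to zero).
toy_degrees = [1, 2, 3, 4]
toy_snp_counts = [8, 6, 4, 2]
toy_r = pearson(toy_degrees, toy_snp_counts)
```

A value near -1 would support the hypothesis, a value near 0 indicates no linear relationship, and a value near +1 would indicate the opposite trend.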


More initial results:

The same approach was used to derive correlations for the 1195 or so disease genes that also appear in the PPI net:

SNP Class    Genes   Mean Degree   Mean SNPs   Correlation
All          1193    7.5           592         0.086
41, 42, 44   1121    7.5           14.9        0.089
Not 6        1193    7.5           82.7        0.117
13, 15       1193    7.5           22.7        0.049
6            1161    7.5           523         0.041

(Note that a few observations were omitted due to using the mer counting script for non-mer work.)


Perhaps a correlation can be found as a function of mer counts?

That is, perhaps:

- “DNA bases in the gene per SNP”, or

- “RNA bases in the gene transcript per SNP”, or

- “amino acids in the protein product per SNP”

will correlate with degree, especially for certain SNP classes?

Testing these claims requires gene, mRNA transcript, and/or protein product lengths (and maybe intron lengths).

Note that the SNPContigLocusId table includes pointers to mRNA and protein records, and includes the NCBI UIDs for each record.

Scripts (get-mRNA-lengths.pl and get-protein-lengths.pl) were written to access the mRNA and protein contig data from NCBI and to count base pairs or amino acids, respectively.
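The original scripts were written in Perl and are not reproduced here. As a rough illustration of the counting step only, the following Python sketch counts residues (bases or amino acids) per record in FASTA-formatted text; the function name and the sample records are hypothetical.

```python
def fasta_lengths(fasta_text):
    """Return {header: residue_count} for each record in FASTA-formatted text."""
    lengths = {}
    header = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; everything up to the next '>' belongs to it.
            header = line[1:]
            lengths[header] = 0
        elif header is not None:
            lengths[header] += len(line)
    return lengths

# Hypothetical records, not real NCBI sequences.
toy_fasta = """>toy_mrna hypothetical
ACGTACGT
ACG
>toy_protein hypothetical
MKV
"""
toy_lengths = fasta_lengths(toy_fasta)
```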


Scripts to download mer (base and aa) data

The libwww-perl (LWP) module was used to interact with the NCBI eUtils that were mentioned earlier and are documented in “Using the NCBI eUtilities via CGI” at

http://mypage.iu.edu/~dgrobe/entrez-dogma.html

DNA lengths were obtained using a service at

http://discern.uits.iu.edu:8421/view-sequences.html

called “Get NCBI sequences for genes or specified regions” that will fetch gene FASTA records given gene names and/or NCBI UIDs.

NCBI asks users to limit access to one request every 3 seconds during off-peak hours and one every 15 seconds otherwise. As a result, these runs took over 24 hours.
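A request limit like the one above can be enforced with a simple throttle. The sketch below is a Python illustration rather than the LWP-based code used in the study: fetch_all, fetch_one, and the injectable clock and sleep parameters are hypothetical names, and the fetch function is a stand-in that makes no network calls.

```python
import time

def fetch_all(ids, fetch_one, min_interval_seconds=3.0,
              clock=time.monotonic, sleep=time.sleep):
    """Call fetch_one(id) for each id, starting calls at least min_interval_seconds apart."""
    results = {}
    last_start = None
    for item_id in ids:
        if last_start is not None:
            wait = min_interval_seconds - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        results[item_id] = fetch_one(item_id)
    return results

# Demonstration with a fake clock so that nothing actually sleeps.
fake_now = [0.0]
naps = []

def fake_sleep(seconds):
    naps.append(seconds)
    fake_now[0] += seconds

toy_results = fetch_all(["a", "b", "c"], lambda item_id: item_id.upper(),
                        min_interval_seconds=3.0,
                        clock=lambda: fake_now[0], sleep=fake_sleep)
```

Passing the clock and sleep functions as parameters keeps the throttle testable; in real use the defaults (time.monotonic and time.sleep) apply, and min_interval_seconds would be set to 3 or 15 depending on the time of day.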


The resulting “mer files” have roughly the following sizes:

- DNA length records: 22259

- mRNA length records: 32400

- Protein length records: 23803

There are frequently multiple mRNA and protein records for a gene; mean lengths were computed for each gene by downstream scripts.

A script (get-gene-mRNA-SNPs-mers-per-SNP.pl) was written to compute mean lengths and perform correlations on the mer data.
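The per-gene averaging and the "mers per SNP" ratio can be sketched as follows. This is a Python illustration under stated assumptions, not the Perl script named above: the function names and toy values are hypothetical, and skipping genes with no SNPs (to avoid dividing by zero) is an assumption about how such genes are handled.

```python
from collections import defaultdict

def mean_length_per_gene(length_records):
    """Average the lengths of all records (e.g. transcripts) belonging to each gene."""
    totals = defaultdict(lambda: [0, 0])  # gene_id -> [sum of lengths, record count]
    for gene_id, length in length_records:
        totals[gene_id][0] += length
        totals[gene_id][1] += 1
    return {gene: total / count for gene, (total, count) in totals.items()}

def mers_per_snp(mean_lengths, snp_counts):
    """Divide each gene's mean length by its SNP count; genes with no SNPs are skipped."""
    return {
        gene: mean_lengths[gene] / snp_counts[gene]
        for gene in mean_lengths
        if snp_counts.get(gene, 0) > 0
    }

# Hypothetical data: gene 1 has two transcripts, gene 2 has no SNPs.
toy_records = [(1, 1000), (1, 3000), (2, 500)]
toy_means = mean_length_per_gene(toy_records)
toy_ratio = mers_per_snp(toy_means, {1: 40, 2: 0})
```

The resulting per-gene ratios would then be correlated with node degree in the same way as the raw SNP counts.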


Here are correlations between node degree and mRNA bases per SNP:

This table is for ALL PPI genes showing SNPs in the specified function class:

Class        Genes   Mean Degree   Bases per SNP   Correlation
Not 6        7406    5.9           96              -0.032
All          7412    5.9           428             -0.046
41, 42, 44   6576    6.0           922             0.001

Note that the correlation between base count and SNP count was: -0.21.


Here are correlations between node degree and DNA bases per SNP:

This table is for ALL PPI genes showing SNPs in the specified function class:

Class   Genes   Mean Degree   Bases per SNP   Correlation
6       7174    5.9           348             -0.039
All     7403    5.9           198             -0.033

Note that the correlations between base count and SNP count were -0.097 and -0.12.


Conclusion

This study found no meaningful relationship between SNP count and PPI node degree, or between measures of mer counts per SNP and node degree: every reported correlation involving node degree has a magnitude below 0.12.


Discussion

Are cell networks so robust that variation which “should” normally disrupt functioning gets over-ridden?

If so, how?

Are there parallel/redundant pathways for important processes?

Are non-parallel pathways constructed to minimize the effects of variation?

Do chaperone proteins (like HSP90) help make variant proteins safe for use within the cell (à la Whitesell and Lindquist)? (Note: around 20% of HSP-connected genes appear in the list of 100 genes (< 2%) with the highest degree.)

Would hub genes within Reaction networks (as opposed to PPI networks) show SNP counts that correlate with their degree?

Would PPIs composed only of co-located proteins display node degree-SNP count correlations?

Do lethal genes show fewer SNPs?


References

The dbSNP Build Process http://www.ncbi.nlm.nih.gov/books/bookres.fcgi/helpsnpfaq/Build.pdf

Using dbSNP via SQL queries http://mypage.iu.edu/~dgrobe/dbSNP/using-dbSNP-via-SQL.html

Using the relational and eUtils interfaces to dbSNP http://mypage.iu.edu/~dgrobe/dbSNP/using-dbSNP-at-IU.html

Using the NCBI eUtilities via CGI http://mypage.iu.edu/~dgrobe/entrez-dogma.html

Kwang-Il Goh, Michael E. Cusick, David Valle, Barton Childs, Marc Vidal, and Albert-Laszlo Barabasi, The human disease network, PNAS, May 22, 2007, vol. 104, no. 21, 8685. http://www.pnas.org/content/104/21/8685.abstract

Get contents of tables related to the Goh (2007) paper http://discern.uits.iu.edu:8421/show-a-DISEASE_GENE_NET-Table.html

Whitesell, Luke, and Susan L. Lindquist, HSP90 and the chaperoning of cancer, Nat Rev Cancer, 2005;5(10):761-772.