Anti-Learning
Adam Kowalczyk
Statistical Machine Learning
NICTA, Canberra ([email protected])
National ICT Australia Limited is funded and supported by:
Overview
• Anti-learning
  – Elevated XOR
• Natural data
  – Predicting Chemo-Radio-Therapy (CRT) response for Oesophageal Cancer
  – Classifying Aryl Hydrocarbon Receptor genes
• Synthetic data
  – High dimensional mimicry
• Conclusions
• Appendix: A Theory of Anti-learning
  – Perfect anti-learning
  – Class-symmetric kernels
Definition of anti-learning
[Figure: two panels, each comparing training accuracy, random guessing accuracy and off-training accuracy; caption: "Systematically:"]
Anti-learning in Low Dimensions
[Figure: elevated XOR points in three dimensions (axes x, y, z), labelled +1 and -1; panels: Anti-Learning, Learning]
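The low-dimensional effect can be reproduced with a small sketch. The coordinates, the elevation EPS and the centroid-based linear classifier below are illustrative assumptions, since the slide's exact construction is not recoverable from the transcript. The classifier separates all four points when trained on all of them, yet misclassifies every held-out point in leave-one-out testing.

```python
# Elevated XOR sketch: coordinates, EPS and the centroid classifier are
# illustrative assumptions, not taken from the slide.
EPS = 0.1  # small elevation along z; assumed value

# (x, y, z) points with labels: XOR layout in the x-y plane, class encoded in z.
DATA = [
    ((1.0, 0.0, EPS), +1),
    ((-1.0, 0.0, EPS), +1),
    ((0.0, 1.0, -EPS), -1),
    ((0.0, -1.0, -EPS), -1),
]


def centroid(points):
    """Component-wise mean of a non-empty list of 3-D points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def train(samples):
    """Return a linear scorer f(p) = w.(p - midpoint) built from class centroids."""
    pos = centroid([p for p, label in samples if label > 0])
    neg = centroid([p for p, label in samples if label < 0])
    w = tuple(a - b for a, b in zip(pos, neg))
    mid = tuple((a + b) / 2 for a, b in zip(pos, neg))
    return lambda p: sum(wk * (pk - mk) for wk, pk, mk in zip(w, p, mid))


def accuracy(scorer, samples):
    """Fraction of samples whose score sign matches the label."""
    hits = sum(1 for p, label in samples if scorer(p) * label > 0)
    return hits / len(samples)


# Training accuracy: fit on all four points, evaluate on the same points.
train_acc = accuracy(train(DATA), DATA)

# Off-training accuracy: leave one point out, fit on the other three, test on it.
loo_hits = 0.0
for i, held_out in enumerate(DATA):
    rest = DATA[:i] + DATA[i + 1:]
    loo_hits += accuracy(train(rest), [held_out])
loo_acc = loo_hits / len(DATA)

print(train_acc, loo_acc)
```

With three training points the class centroids are dominated by the x-y (XOR) layout rather than the small z elevation, which is why the held-out point lands on the wrong side.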
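Evaluation Measure
• Area under Receiver Operating Characteristic (AROC)
[Figure: ROC curve of a classifier f with threshold θ; horizontal axis: False Positive (0 to 1); vertical axis: True Positive (0 to 1); the area under the curve is AROC(f)]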
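AROC can be computed without drawing the curve, since it equals the fraction of (positive, negative) pairs that the scores order correctly. The sketch below uses that pair-counting form; the function name `aroc` and the convention that labels greater than zero are positive are assumptions made for illustration.

```python
def aroc(scores, labels):
    """Area under the ROC curve as the fraction of correctly ordered
    (positive, negative) pairs; tied pairs count as one half."""
    pos = [s for s, y in zip(scores, labels) if y > 0]
    neg = [s for s, y in zip(scores, labels) if y <= 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    total = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                total += 1.0
            elif p == q:
                total += 0.5
    return total / (len(pos) * len(neg))


perfect = aroc([0.9, 0.8, 0.2, 0.1], [1, 1, -1, -1])    # positives ranked on top
inverted = aroc([0.9, 0.8, 0.2, 0.1], [-1, -1, 1, 1])   # positives ranked at the bottom
tied = aroc([0.5, 0.5], [1, -1])                        # uninformative scores
```

An AROC of 0.5 corresponds to random ranking, values above it to learning, and values below it to the anti-learning mode discussed in these slides.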
Learning and anti-learning mode of supervised classification
[Figure: ROC diagrams (axis labels TP and FN, range 0 to 1) for training and test data in the Learning and Anti-learning modes; reference level Random: AROC = 0.5]
Anti-learning in Cancer Genomics
From Oesophageal Cancer to machine learning challenge
Learning and anti-learning mode of supervised classification
[Figure: OesSCC with SVM; vertical axis: AROC (0.0 to 100.0); horizontal axis: number of genes (1 to 100000, log scale); series: LOO, 80:20, 66:33, 50:50, with error bars]
[Figure: OesAdeno with SVM; vertical axis: AROC (0.0 to 100.0); horizontal axis: number of genes (1 to 100000, log scale); series: LOO, 80:20, 66:33, 50:50, with error bars]
Anti-learning in Classification of Genes in Yeast
Training Gene Activity Class
#     Gene     Class
1     YDR439W  change
2     YHR051W  change
…     …        …
38    YKL181W  change
39    YLR368W  control
…     …        …
84    YFL061W  control
85    YDR388W  nc
…     …        …
3017  YFL039C  nc
3018  YAL015C  nc
Abstract 10894548
There are about 800 genes in Saccharomyces cerevisiae whose transcription is cell-cycle regulated. Some of these form clusters of co-regulated genes. The 'CLB2' cluster contains 33 genes whose transcription peaks early in mitosis, including CLB1, CLB2, SWI5, ACE2, CDC5, CDC20 and other genes important for mitosis. Here we find that the genes in this cluster lose their cell cycle regulation in a mutant that lacks two forkhead transcription factors, Fkh1 and Fkh2. Fkh2 protein is associated with the promoters of CLB2, SWI5 and other genes of the cluster. These results indicate that Fkh proteins are transcription factors for the CLB2 cluster. The fkh1 fkh2 mutant also displays aberrant regulation of the 'SIC1' cluster, whose member genes are expressed in the M-G1 interval and are involved in mitotic exit. This aberrant regulation may be due to aberrant expression of the transcription factors Swi5 and Ace2, which are members of the CLB2 cluster and controllers of the SIC1 cluster. Thus, a cascade of transcription factors operates late in the cell cycle. Finally, the fkh1 fkh2 mutant displays a constitutive pseudohyphal morphology, indicating that Fkh1 and Fkh2 may help control the switch to this mode of growth.
KDD’02 task: identification of Aryl Hydrocarbon Receptor genes (AHR data)
Gene Abstracts
#       Gene     Abstract ID
1       YML034W  10734188
2       YML034W  10894548
3       YHR051W  207698
4       YHR051W  10449761
5       YHR051W  1324416
…       …        …
16,955  YLR337C  7968536
16,956  YBR202W  10649446
16,957  YBR202W  9852095
16,958  YBR202W  9335335
16,959  YDL248W  8832390
Gene Interactions
#      Gene     Gene
1      YNL331C  YNL331C
2      YCR088W  YFL039C
3      YCR088W  YDR388W
4      YCR088W  YNL138W
5      YER045C  YMR308C
…      …        …
2,119  YER03    YPL192C
2,120  YLR277C  YKR002W
2,121  YLR277C  YPR107C
2,122  YPR107C  YKR002W
2,123  YBR046C  YDR103W
Gene function
#       Gene       Function Hierarchy
1       YHR051W    respiration|ENERGY
2       YHR051W    mitochondrion|SUBCELLULAR LOCALISATION
3       YHR124W    meiosis|cell cycle|CELL CYCLE AND DNA PROCESSING
4       YKL181W    amino acid biosynthesis|amino acid metabolism|METABOLISM
8       YKL181W    budding, cell polarity and filament formation|fungal cell differentiation|cell differentiation|CELL FATE
…       …          …
22,528  YFL061W    nitrogen and sulfur utilization|nitrogen and sulfur metabolism|METABOLISM
22,529  YJL047C-A  UNCLASSIFIED PROTEINS
22,530  YDL176W    UNCLASSIFIED PROTEINS
22,531  YAL015C    DNA repair|DNA recombination and DNA repair|DNA processing|CELL CYCLE AND DNA PROCESSING
22,532  YAL015C    stress response|CELL RESCUE, DEFENSE AND VIRULENCE
Protein Class
#      Gene     Protein Hierarchy
1      YHR205W  AGC group|Protein Kinases
2      YGR080W  Protein Kinases
3      YLL055W  Major facilitator superfamily proteins (MFS)
4      YKL173W  GTP-binding proteins involved in protein synthesis|GTP-binding proteins
5      YKL157W  Proteases
…      …        …
2,064  YFL037W  other GTP-binding proteins|GTP-binding proteins
2,065  YDL192W  ARF|small GTP-binding proteins (RAS superfamily)|GTP-binding proteins
2,066  YNL102W  associated subunits|DNA-directed DNA polymerases|Polymerases
2,067  YJR001W  Major facilitator superfamily proteins (MFS)
2,068  YIR022W  Proteases
Gene localization
#      Gene     Localisation Hierarchy
1      YHR051W  mitochondrial inner membrane|mitochondria
2      YHL020C  nucleus
3      YGR072W  cytoplasm
4      YGR072W  nucleus
5      YGR218W  cytoplasm
…      …        …
5,144  YLR191W  peroxisomal membrane|peroxisome
5,145  YMR065W  spindle pole body|cytoskeleton
5,146  YMR065W  ER membrane|ER
5,147  YAL015C  nucleus
5,148  YAL015C  mitochondria
Test Gene List
#     Gene
1     YDR228C
2     YHR051W
…     YJL154C
…     YKL181W
…     YLR368W
…     …
1488  YFL039C
1489  YAL015C
Gene Aliases
#      Gene     Aliases
1      YML034W  SRC1
2      YHR051W  COX6
3      YKL181W  PRP1 PRS1
4      YHR124W  NDT80
5      YGR072W  UPF3 SUA6
…      …        …
4,045  YLR19C   HCR1
4,046  YLR265C  LIF2 NEJ1
4,047  YGL113W  SLD3
4,048  YLR087C  CSF1
4,049  YAL015C  NTG1 FUN33
Anti-learning in AHR-data set from KDD Cup 2002
Average of 100 trials; random splits: training: test = 66% : 34%
KDD Cup 2002
Yeast Gene Regulation Prediction Task
http://www.biostat.wisc.edu/~craven/kddcup/task2.ppt
Vogel - AI Insight
[Figure legend: change; change or control]
Single class SVM
38/84 training examples
1.3/2.8% of data used in ~14,000 dimensions
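The evaluation protocol named on this slide (100 trials of random 66% : 34% training/test splits, averaged) can be sketched as follows. The helper names, the fixed seed and the unstratified shuffle are assumptions; the slide does not say whether the splits were stratified by class. The `evaluate` callback stands in for training a classifier on one part and scoring it on the other.

```python
import random


def random_split(items, train_fraction, rng):
    """Shuffle a copy of items and cut it into (train, test) parts."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def average_over_trials(items, evaluate, trials=100, train_fraction=0.66, seed=0):
    """Mean of evaluate(train, test) over repeated random splits."""
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        train, test = random_split(items, train_fraction, rng)
        results.append(evaluate(train, test))
    return sum(results) / len(results)


# Usage with a placeholder evaluate() that only reports the test share.
genes = list(range(3018))  # 3018 = number of genes in the training table above
test_share = average_over_trials(genes, lambda tr, te: len(te) / len(genes))

# A single split keeps every item exactly once.
tr, te = random_split(range(10), 0.66, random.Random(1))
```

In practice `evaluate` would return the test AROC of a model fitted on `train`, so that the average reflects off-training performance only.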
Anti-learning in High Dimensional Approximation (Mimicry)
Paradox of High Dimensional Mimicry
high dimensional features
• If detection is based on a large number of features, and
• the imposters are samples from a distribution whose marginals perfectly match the distribution of individual features for a finite genuine sample, then
• the imposters are perfectly detectable by ML filters in the anti-learning mode
Mimicry in High Dimensional Spaces
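Genuine sample:
$X_i = (X_i^{(1)}, \ldots, X_i^{(d)}) \in \mathbb{R}^d, \quad i = 1, \ldots, n_X$
$X_j^{(k)} \sim D(\mu^{(k)}, \sigma^{(k)}), \quad k = 1, \ldots, d$
Estimates from a subset $E \subset \{1, \ldots, n_X\}$, for $k = 1, \ldots, d$:
$\hat\mu^{(k)} = \frac{1}{|n_E|} \sum_{i \in E} X_i^{(k)}, \quad (\hat\sigma^{(k)})^2 = \frac{1}{|n_E|} \sum_{i \in E} (X_i^{(k)} - \hat\mu^{(k)})^2$
Imposters:
$Y_i = (Y_i^{(1)}, \ldots, Y_i^{(d)}) \in \mathbb{R}^d, \quad i = 1, \ldots, n_Y$
$Y_j^{(k)} \sim D(\hat\mu^{(k)}, \hat\sigma^{(k)}), \quad k = 1, \ldots, d$
Labelled data:
$Z = (X \times \{\pm 1\}) \cup (Y \times \{\mp 1\}) \subset \mathbb{R}^d \times \{-1, +1\}$
Classifier and evaluation:
$f = \mathrm{Alg}(T, \mathrm{param}) : \mathbb{R}^d \to \mathbb{R}, \quad T \subset Z$
$\mathrm{AROC}(f, Z \setminus T)$ or $\mathrm{ACC}(f, Z \setminus T)$
Settings:
$n_X = n_Y = 100, \quad d = 5000, 10000, 100000$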
Quality of mimicry
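Average over 50 repeats of an independent test
$X_j^{(k)} \sim N(\mu^{(k)}, \sigma^{(k)}), \quad \mu^{(k)} \sim U(-5, 5), \quad \sigma^{(k)} \sim U(0.5, 1.5), \quad k = 1, \ldots, d$
$Y_j^{(k)} \sim N(\hat\mu^{(k)}, \hat\sigma^{(k)})$
[Figure: two panels, d = 1000 and d = 5000; horizontal axis: |n_E| / |n_X|; diagram labels: X, E, T, Y]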
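The data generation behind this experiment can be sketched as below, assuming the reading in which genuine features are Gaussian with means drawn from U(-5, 5) and standard deviations from U(0.5, 1.5), and imposters are drawn from Gaussians fitted feature by feature to a finite subset E of the genuine sample. The function names, the seed, the reduced dimension d = 50 and the size of E are illustrative assumptions. The classifier and the AROC evaluation are left out.

```python
import math
import random


def make_genuine(n_x, d, rng):
    """Genuine sample: feature k is N(mu_k, sigma_k) with
    mu_k ~ U(-5, 5) and sigma_k ~ U(0.5, 1.5)."""
    mu = [rng.uniform(-5.0, 5.0) for _ in range(d)]
    sigma = [rng.uniform(0.5, 1.5) for _ in range(d)]
    return [[rng.gauss(mu[k], sigma[k]) for k in range(d)] for _ in range(n_x)]


def estimate_marginals(sample):
    """Per-feature mean and standard deviation of a finite sample E."""
    n, d = len(sample), len(sample[0])
    mu_hat = [sum(row[k] for row in sample) / n for k in range(d)]
    sigma_hat = [
        math.sqrt(sum((row[k] - mu_hat[k]) ** 2 for row in sample) / n)
        for k in range(d)
    ]
    return mu_hat, sigma_hat


def make_imposters(n_y, mu_hat, sigma_hat, rng):
    """Imposters: each feature drawn independently from N(mu_hat_k, sigma_hat_k)."""
    d = len(mu_hat)
    return [[rng.gauss(mu_hat[k], sigma_hat[k]) for k in range(d)] for _ in range(n_y)]


rng = random.Random(0)                      # fixed seed, assumed for repeatability
X = make_genuine(n_x=100, d=50, rng=rng)    # d kept small here; the slides use thousands
E = X[:50]                                  # subset seen by the imitator (size assumed)
mu_hat, sigma_hat = estimate_marginals(E)
Y = make_imposters(n_y=100, mu_hat=mu_hat, sigma_hat=sigma_hat, rng=rng)
```

Each imposter matches the marginals estimated from E but ignores any dependence between features, which is the property the paradox above relies on.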
Formal result
[Statement lost in extraction; diagram labels: X, E, T, Y]
Proof idea 1: Geometry of the mimicry data
Key Lemma:
[Equation lost in extraction]
[Figure: simplices S_2, S_3, S_4, ..., S_n]
Proof idea 2:
[Figure (three animation builds) relating the spaces R^d and R^n; formulas not recoverable from the transcript]
Proof idea 3: kernel matrix
[Figure: kernel matrix [k_ij] in block form, with rows and columns grouped by n_X, n_E and n_Y; the diagonal blocks shown are identity matrices]
Proof idea 4
[Figure and formulas not recoverable; the surviving fragments refer to a kernel machine f(z) with kernel k(z_i, z) and bias b trained on T, and to constants on the parts of X, E and Y inside and outside T]
Theory of anti-learning
(x_i, y_i)
Hadamard Matrix
[Figure: 4 × 4 Hadamard matrix with entries ±1 (signs lost in extraction)]
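A Hadamard matrix gives a compact example of perfect anti-learning. The sketch below is one possible construction, not necessarily the one used on the slides: it builds a Sylvester Hadamard matrix, takes the last column as the class label and the remaining columns as features (both choices are assumptions), and scores points with a centroid-based linear classifier. Because distinct rows are orthogonal, the linear kernel between two different points equals -y_i * y_j, so points of the same class are less similar than points of opposite classes.

```python
def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def aroc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y > 0]
    neg = [s for s, y in zip(scores, labels) if y < 0]
    pairs = [(p, q) for p in pos for q in neg]
    return sum(1.0 if p > q else 0.5 if p == q else 0.0 for p, q in pairs) / len(pairs)


def centroid_scorer(xs, ys):
    """Linear scorer w.x with w = mean(positives) - mean(negatives)."""
    d = len(xs[0])
    pos = [x for x, y in zip(xs, ys) if y > 0]
    neg = [x for x, y in zip(xs, ys) if y < 0]
    w = [sum(x[k] for x in pos) / len(pos) - sum(x[k] for x in neg) / len(neg)
         for k in range(d)]
    return lambda x: sum(wk * xk for wk, xk in zip(w, x))


H = hadamard(8)
labels = [row[-1] for row in H]     # last column used as the class label (assumption)
points = [row[:-1] for row in H]    # remaining columns used as features

train_idx, test_idx = range(0, 4), range(4, 8)   # both halves contain both classes
f = centroid_scorer([points[i] for i in train_idx], [labels[i] for i in train_idx])

train_aroc = aroc([f(points[i]) for i in train_idx], [labels[i] for i in train_idx])
test_aroc = aroc([f(points[i]) for i in test_idx], [labels[i] for i in test_idx])

# Gram matrix of the linear kernel: off-diagonal entries equal -y_i * y_j.
gram = [[sum(a * b for a, b in zip(p, q)) for q in points] for p in points]

print(train_aroc, test_aroc)
```

On this split the classifier ranks the training points perfectly and the held-out points in exactly the reverse order, so negating the score would turn anti-learning into learning.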
CS-kernels
Perfect learning/anti-learning for CS-kernels
Kowalczyk & Chapelle, ALT’ 05
[Figure: ROC curves, true positive vs false positive (both 0 to 1); Train: ROC_T; Test: ROC_{S-T}]
Perfect anti-learning theorem
Kowalczyk & Smola, Conditions for Anti-Learning
Anti-learning in classification of Hadamard dataset
Kowalczyk & Smola, Conditions for Anti-Learning
AHR data set from KDD Cup’02
Kowalczyk, Smola, submitted
Kowalczyk & Smola, Conditions for Anti-Learning
From Anti-learning to learning: Class Symmetric (CS) kernel case
Kowalczyk & Chapelle, ALT’ 05
Perfect anti-learning: i.i.d. learning curve
[Figure: AROC (mean ± std) vs n_samples / n, for n_samples i.i.d. samples drawn from the perfect anti-learning set S; n = 100, nRand = 1000; reference level: random]
More is not necessarily better!
Conclusions
• Statistics and machine learning are indispensable components of the forthcoming revolution in medical diagnostics based on genomic profiling
• High dimensionality of the data poses new challenges, pushing statistical techniques into uncharted waters
• Challenges of biological data can stimulate novel directions of machine learning research
Acknowledgements
• Telstra
  – Bhavani Raskutti
• Peter MacCallum Cancer Centre
  – David Bowtell
  – Cuong Duong
  – Wayne Phillips
• MPI
  – Cheng Soon Ong
  – Olivier Chapelle
• NICTA
  – Alex Smola