final_f01
TRANSCRIPT
(downloaded 7/28/2019)
3. (T/F 2 points) In HMMs, if there are at least two distinct most likely hidden state sequences and the two state sequences cross in the middle (share a single state at an intermediate time point), then there are at least four most likely state sequences.
4. (T/F 2 points) One advantage of Boosting is that it does not overfit.
5. (T/F 2 points) Support vector machines are resistant to outliers, i.e., very noisy examples drawn from a different distribution.
6. (T/F 2 points) Active learning can substantially reduce the number of training examples that we need.
Problem 2
Consider two classifiers: 1) an SVM with a quadratic (second order polynomial) kernel function and 2) an unconstrained mixture of two Gaussians model, one Gaussian per class label. These classifiers try to map examples in R2 to binary labels. We assume that the problem is separable, no slack penalties are added to the SVM classifier, and that we have sufficiently many training examples to estimate the covariance matrices of the two Gaussian components.
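For concreteness, the second order polynomial kernel in classifier 1 is typically taken to be K(x, z) = (1 + x·z)^2; a minimal sketch (the test vectors below are made up for illustration):

```python
import numpy as np

def quadratic_kernel(x, z):
    # Second-order polynomial kernel K(x, z) = (1 + x . z)^2, equivalent to an
    # inner product in the feature space of all monomials up to degree 2.
    return (1.0 + np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(quadratic_kernel(x, z))  # (1 + (-1.5))^2 = 0.25
```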
1. (T/F 2 points) The two classifiers have the same VC-dimension.
2. (4 points) Suppose we evaluated the structural risk minimization score for the two classifiers. The score is the bound on the expected loss of the classifier, when the classifier is estimated on the basis of n training examples. Which of the two classifiers might yield the better (lower) score? Provide a brief justification.
Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare (http://ocw.mit.edu/), Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].
3. (4 points) Suppose now that we regularize the estimation of the covariance matrices for the mixture of two Gaussians. In other words, we would estimate each class conditional covariance matrix according to

Σreg = n/(n + n′) Σ̂ + n′/(n + n′) S    (1)

where n is the number of training examples, Σ̂ is the unregularized estimate of the covariance matrix (sample covariance matrix of the examples in one class), S is our prior covariance matrix (same for both classes), and n′ the equivalent sample size that we can use to balance between the prior and the data.

In computing the VC-dimension of a classifier, we can choose the set of points that we try to shatter. In particular, we can scale any k points by a large factor and use the resulting set of points for shattering. In light of this, would you expect our regularization to change the VC-dimension? Why or why not?
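Eq. (1) is straightforward to sketch in code; the function and variable names below (e.g. `n_prime` for the equivalent sample size n′) are ours, chosen for illustration:

```python
import numpy as np

def regularized_cov(X, S, n_prime):
    # Eq. (1): Sigma_reg = n/(n + n') * Sigma_hat + n'/(n + n') * S,
    # where Sigma_hat is the sample (ML) covariance of the examples in one class
    # and S is the prior covariance matrix shared by both classes.
    n = X.shape[0]
    sigma_hat = np.cov(X, rowvar=False, bias=True)  # ML sample covariance
    return (n / (n + n_prime)) * sigma_hat + (n_prime / (n + n_prime)) * S

X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
S = 2.0 * np.eye(2)
# n' = 0 recovers the unregularized estimate; large n' pulls the estimate toward S
print(regularized_cov(X, S, 0.0))
```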
4. (T/F 2 points) Regularization in the above sense would improve the structural risk minimization score for the mixture of two Gaussians.
Problem 3
The problem here is to predict the identity of a person based on a single handwritten character. The observed characters (one of a, b, or c) are transformed into binary 8 by 8 pixel images. There are four different people we need to identify on the basis of such characters. To do this, we have a training set of about 200 examples, where each example consists of a binary 8x8 image and the identity of the person it belongs to. You can assume that the overall number of occurrences of each person and each character in the training set is roughly balanced. We would like to use a mixture of experts architecture to solve this problem.
1. (2 points) How might the experts be useful? Suggest what task each expert might solve.
2. (4 points) Draw a graphical model that describes the mixture of experts architecture in this context. Indicate what the variables are and the number of values that they can take. Shade any nodes corresponding to variables that are always observed.
3. (4 points) Before implementing the mixture of experts architecture, we need to know the parametric form of the conditional probabilities in your graphical model. Provide a reasonable specification of the relevant conditional probabilities to the extent that you could ask your classmate to implement the classifier.
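One reasonable specification (a sketch under our own assumptions, not the official answer) uses a softmax gating network over experts and a softmax classifier per expert, both linear in the 64 binary pixels. All names and sizes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 64 binary pixels, 3 experts (one per character a/b/c is
# one natural choice), and 4 people to identify.
n_pixels, n_experts, n_people = 64, 3, 4

V = rng.normal(size=(n_experts, n_pixels + 1))            # gating network weights
W = rng.normal(size=(n_experts, n_people, n_pixels + 1))  # one softmax classifier per expert

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def predict(x):
    # P(person | image x) = sum_j P(expert j | x) * P(person | x, expert j)
    xb = np.append(x, 1.0)                                # pixels plus a bias term
    gate = softmax(V @ xb)                                # gating distribution: (n_experts,)
    experts = np.array([softmax(W[j] @ xb) for j in range(n_experts)])
    return gate @ experts                                 # mixture distribution over people

p = predict(rng.integers(0, 2, size=n_pixels).astype(float))
# p is a proper probability distribution over the four people
```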
4. (4 points) So we implemented your method, ran the estimation algorithm once, and measured the test performance. The method was unfortunately performing at a chance level. Provide two possible explanations for this. (there may be multiple correct answers here)
5. (3 points) Would we get anything reasonable out of the estimation if, initially, all the experts were identical while the parameters of the gating network would be chosen randomly? By reasonable we mean training performance. Provide a brief justification.
6. (3 points) Would we get anything reasonable out of the estimation if now the gating network is initially set so that it assigns each training example uniformly to all experts but the experts themselves are initialized with random parameter values? Again, reasonable refers to the training error. Provide a brief justification.
Problem 4
Consider the following pair of observed sequences:

Sequence 1 (st): A A T T G G C C A A T T G G C C ...
Sequence 2 (xt): 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 ...
Position t:      0 1 2 3 4 ...

where we assume that the pattern (highlighted with the spaces) will continue forever. Let st ∈ {A, G, T, C}, t = 0, 1, 2, . . . denote the variables associated with the first sequence, and xt ∈ {1, 2}, t = 0, 1, 2, . . . the variables characterizing the second sequence. So, for example, given the sequences above, the observed values for these variables are s0 = A, s1 = A, s2 = T, . . ., and, similarly, x0 = 1, x1 = 1, x2 = 2, . . .
1. (4 points) If we use a simple first order homogeneous Markov model to predict the first sequence (values for st only), what is the maximum likelihood solution that we would find? In the transition diagram below, please draw the relevant transitions and the associated probabilities (this should not require much calculation).
[Transition diagram for question 1: states A, T, C, G at time t-1 (st-1) and at time t (st); draw the transitions and their probabilities here.]
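The ML transition estimates asked for in question 1 are just normalized pair counts; a quick sketch over a few periods of Sequence 1 (in the infinite-sequence limit every allowed transition probability is 1/2; the only finite-sample discrepancy here is at the period boundaries out of C):

```python
from collections import Counter

seq = "AATTGGCC" * 4  # a few periods of Sequence 1

# ML estimate of a first-order homogeneous Markov chain:
# count consecutive pairs and normalize per preceding symbol.
pairs = Counter(zip(seq, seq[1:]))
totals = Counter(seq[:-1])
P = {(prev, nxt): c / totals[prev] for (prev, nxt), c in pairs.items()}

print(P[("A", "A")], P[("A", "T")])  # 0.5 0.5: each state stays put or advances
```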
2. (T/F 2 points) The resulting first order Markov model is ergodic.
3. (4 points) To improve the Markov model a bit we would like to define a graphical model that predicts the value of st on the basis of the previous observed values st-1, st-2, . . . (looking as far back as needed). The model parameters/structure are assumed to remain the same if we shift the model one step. In other words, it is the same graphical model that predicts st on the basis of st-1, st-2, . . . as the model that predicts st-1 on the basis of st-2, st-3, . . .. In the graph below, draw the minimum number of arrows that are needed to predict the first observed sequence perfectly (disregarding the first few symbols in the sequence). Since we slide the model along the sequence, you can draw the arrows only for st.
[Graph for question 3: nodes . . . , st-4, st-3, st-2, st-1, st; draw the arrows into st here.]
4. Now, to incorporate the second observation sequence, we will use a standard hidden Markov model:
[HMM graph: hidden chain . . . , st-4, st-3, st-2, st-1, st with an arrow from each hidden state to its observation . . . , xt-4, xt-3, xt-2, xt-1, xt.]
where again st ∈ {A, G, T, C} and xt ∈ {1, 2}. We will estimate the parameters of this HMM in two different ways.

(I) Treat the pair of observed sequences (st, xt) (given above) as complete observations of the variables in the model and estimate the parameters in the maximum likelihood sense. The initial state distribution P0(s0) is set according to the overall frequency of symbols in the first observed sequence (uniform).

(II) Use only the second observed sequence (xt) in estimating the parameters, again in the maximum likelihood sense. The initial state distribution is again uniform across the four symbols.

We assume that both estimation processes will be successful relative to their criteria.
a) (3 points) What are the observation probabilities P(x|s) (x ∈ {1, 2}, s ∈ {A, G, T, C}) resulting from the first estimation approach? (should not require much calculation)
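With complete observations, ML emission probabilities are co-occurrence counts divided by state counts; a sketch with a few periods of the two sequences hard-coded from above:

```python
from collections import Counter

s_seq = "AATTGGCC" * 4  # Sequence 1
x_seq = "11221122" * 4  # Sequence 2, aligned position by position

joint = Counter(zip(s_seq, x_seq))  # counts of (s_t, x_t) pairs
marg = Counter(s_seq)               # counts of each hidden state
P_x_given_s = {(s, x): c / marg[s] for (s, x), c in joint.items()}

print(P_x_given_s)  # each state emits a single symbol with probability 1
```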
b) (3 points) Which estimation approach is likely to yield a more accurate model over the second observed sequence (xt)? Briefly explain why.
5. Consider now the two HMMs resulting from using each of the estimation approaches (approaches I and II above). These HMMs are estimated on the basis of the pair of observed sequences given above. We'd like to evaluate the probability that these two models assign to a new (different) observation sequence 1 2 1 2, i.e., x0 = 1, x1 = 2, x2 = 1, x3 = 2. For the first model, for which we have some idea about what the st variables will capture, we also want to know the associated most likely hidden state sequence. (these should not require much calculation)
a) (2 points) What is the probability that the first model (approach I) assigns to this new sequence of observations?
b) (2 points) What is the probability that the second model (approach II) gives to the new sequence of observations?
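Such probabilities can be checked with a generic forward pass. The matrices below encode a "stay or advance with probability 1/2, deterministic emissions" model, which is our assumption for what approach I would produce, not the official solution; for approach II one would plug in its ML parameters instead:

```python
import numpy as np

def sequence_prob(p0, A, B, obs):
    # Forward algorithm: P(x_0 .. x_T) under an HMM with initial distribution
    # p0 (K,), transition matrix A (K, K) with A[i, j] = P(s'=j | s=i),
    # and emission matrix B (K, M); obs is a list of observation indices.
    alpha = p0 * B[:, obs[0]]
    for x in obs[1:]:
        alpha = (alpha @ A) * B[:, x]
    return alpha.sum()

# Hypothetical approach-I model: states ordered A, T, G, C; each state stays
# put or advances with probability 1/2 and emits one symbol deterministically.
p0 = np.full(4, 0.25)
A = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.5, 0.5, 0.0],
              [0.0, 0.0, 0.5, 0.5],
              [0.5, 0.0, 0.0, 0.5]])
B = np.array([[1.0, 0.0],   # A emits 1
              [0.0, 1.0],   # T emits 2
              [1.0, 0.0],   # G emits 1
              [0.0, 1.0]])  # C emits 2
print(sequence_prob(p0, A, B, [0, 1, 0, 1]))  # P(x = 1,2,1,2): 0.0625
```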
c) (2 points) What is the most likely hidden state sequence in the first model (from approach I) associated with the new observed sequence?
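A compact Viterbi sketch can check such decodings. As before, the matrices encode the "stay or advance with probability 1/2, deterministic emissions" model we assume approach I would produce (an illustration, not the official answer); note that this observation sequence has tied maximizers, so which one is returned depends on argmax tie-breaking:

```python
import numpy as np

def viterbi(p0, A, B, obs):
    # Most likely hidden state sequence; no log-space needed for short sequences.
    delta = p0 * B[:, obs[0]]
    backptr = []
    for x in obs[1:]:
        scores = delta[:, None] * A        # scores[i, j]: best path ending in i, then i -> j
        backptr.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) * B[:, x]
    path = [int(delta.argmax())]
    for bp in reversed(backptr):
        path.append(int(bp[path[-1]]))
    return path[::-1]

states = "ATGC"
p0 = np.full(4, 0.25)
A = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.5, 0.5, 0.0],
              [0.0, 0.0, 0.5, 0.5],
              [0.5, 0.0, 0.0, 0.5]])
B = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])

decoded = "".join(states[i] for i in viterbi(p0, A, B, [0, 1, 0, 1]))
print(decoded)  # ATGC and GCAT are tied at probability 1/32
```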
6. (4 points) Finally, let's assume that we observe only the second sequence (xt) (the same sequence as given above). In building a graphical model over this sequence we are no longer limiting ourselves to HMMs. However, we only consider models whose structure/parameters remain the same as we slide along the sequence. The variables st are included as before as they might come in handy as hidden variables in predicting the observed sequence.
a) In the figure below, draw the arrows that any reasonable model selection criterion would find given an unlimited supply of the observed sequence xt, xt+1, . . .. You only need to draw the arrows for the last pair of variables in the graphs, i.e., (st, xt).
[Graph for part a): hidden nodes . . . , st-4, st-3, st-2, st-1, st and observed nodes . . . , xt-4, xt-3, xt-2, xt-1, xt; draw the arrows for (st, xt).]
b) Given only a small number of observations, the model selection criterion might select a different model. In the figure below, indicate a possible alternate model that any reasonable model selection criterion would find given only a few examples. You only need to draw the arrows for the last pair of variables in the graphs, i.e., (st, xt).
[Graph for part b): hidden nodes . . . , st-4, st-3, st-2, st-1, st and observed nodes . . . , xt-4, xt-3, xt-2, xt-1, xt; draw the arrows for (st, xt).]
[Figure 1 graph: decision makers A and B connected through context variables x1, . . . , x6; the network structure is not recoverable from this transcript.]

Figure 1: Decision makers A and B and their context

Problem 5
The Bayesian network in Figure 1 claims to model how two people, call them A and B, make decisions in different contexts. The context is specified by the setting of the binary context variables x1, x2, . . . , x6. The values of these variables are not known unless we specifically ask for such values.
1. (6 points) We are interested in finding out what information we'd have to acquire to ensure that A and B will make their decisions independently from one another. Specify the smallest set of context variables whose instantiation would render A and B independent. Briefly explain your reasoning (there may be more than one strategy for arriving at the same decision).
2. (T/F 2 points) We can in general achieve independence with less information, i.e., we don't have to fully instantiate the selected context variables but provide some evidence about their values.
3. (4 points) Could your choice of the minimal set of context variables change if we also provided you with the actual probability values associated with the dependencies in the graph? Provide a brief justification.
Additional set of figures

[Duplicate copies of the Problem 4 transition diagram and graphical model figures.]