final_f01: 6.867 Machine Learning (Fall 2006) final exam


3. (T/F, 2 points) In HMMs, if there are at least two distinct most likely hidden state sequences and the two state sequences cross in the middle (share a single state at an intermediate time point), then there are at least four most likely state sequences. (A small splicing sketch follows question 6 below.)

4. (T/F, 2 points) One advantage of boosting is that it does not overfit.

5. (T/F, 2 points) Support vector machines are resistant to outliers, i.e., very noisy examples drawn from a different distribution.

6. (T/F, 2 points) Active learning can substantially reduce the number of training examples that we need.
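An aside on question 3: the claim rests on a splicing argument. If two most likely paths share a state at time t, the Markov factorization through that state lets you swap their prefixes, and the two spliced paths' probabilities multiply out to the same product as the originals. A minimal numerical check of that identity, with made-up HMM parameters (everything here is hypothetical, not from the exam):

```python
import numpy as np

rng = np.random.default_rng(0)

# A random 2-state HMM over binary observations (hypothetical numbers).
n_states, T_len = 2, 5
pi = rng.dirichlet(np.ones(n_states))                 # initial distribution
A = rng.dirichlet(np.ones(n_states), size=n_states)   # A[i, j] = P(s_{t+1}=j | s_t=i)
B = rng.dirichlet(np.ones(2), size=n_states)          # B[i, o] = P(x_t=o | s_t=i)
obs = rng.integers(0, 2, size=T_len)

def joint(path):
    """P(states = path, observations = obs) under the HMM."""
    p = pi[path[0]] * B[path[0], obs[0]]
    for t in range(1, T_len):
        p *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
    return p

# Two paths that differ before and after t = 2 but share state 0 there.
p, q, t = (0, 1, 0, 1, 0), (1, 1, 0, 0, 1), 2
splice1 = p[:t] + q[t:]
splice2 = q[:t] + p[t:]

# Factorization through the shared state: P(p) P(q) = P(splice1) P(splice2).
print(np.isclose(joint(p) * joint(q), joint(splice1) * joint(splice2)))  # True
```

If p and q both attain the maximum probability, neither splice can exceed it, yet the splices' product equals the product of two maxima, so each splice attains the maximum too: at least four most likely sequences.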

Problem 2

Consider two classifiers: 1) an SVM with a quadratic (second-order polynomial) kernel function and 2) an unconstrained mixture of two Gaussians model, one Gaussian per class label. These classifiers try to map examples in R^2 to binary labels. We assume that the problem is separable, no slack penalties are added to the SVM classifier, and that we have sufficiently many training examples to estimate the covariance matrices of the two Gaussian components.
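For concreteness, both classifiers are easy to instantiate; a sketch assuming scikit-learn, with a very large C standing in for the no-slack (hard-margin) SVM and QDA playing the role of the class-conditional two-Gaussian model (the data are made up):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(0)

# Toy separable data in R^2: one Gaussian blob per class (hypothetical numbers).
X = np.vstack([rng.normal([-2, 0], 0.5, size=(50, 2)),
               rng.normal([+2, 0], 0.5, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# 1) SVM with a quadratic kernel; a huge C approximates the no-slack setting.
svm = SVC(kernel="poly", degree=2, C=1e6).fit(X, y)

# 2) One full-covariance Gaussian per class; the resulting Bayes decision
#    rule is exactly a quadratic boundary in R^2, i.e. QDA.
qda = QuadraticDiscriminantAnalysis().fit(X, y)

print(svm.score(X, y), qda.score(X, y))  # both separate this toy set
```

Both models realize quadratic decision boundaries in R^2, which is what question 1 asks you to compare.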

1. (T/F, 2 points) The two classifiers have the same VC-dimension.

2. (4 points) Suppose we evaluated the structural risk minimization score for the two classifiers. The score is the bound on the expected loss of the classifier, when the classifier is estimated on the basis of n training examples. Which of the two classifiers might yield the better (lower) score? Provide a brief justification.


3. (4 points) Suppose now that we regularize the estimation of the covariance matrices for the mixture of two Gaussians. In other words, we would estimate each class-conditional covariance matrix according to

    Σ̂_reg = (n / (n + n')) Σ̂ + (n' / (n + n')) S        (1)

where n is the number of training examples, Σ̂ is the unregularized estimate of the covariance matrix (the sample covariance matrix of the examples in one class), S is our prior covariance matrix (the same for both classes), and n' is the equivalent sample size that we can use to balance between the prior and the data.

In computing the VC-dimension of a classifier, we can choose the set of points that we try to shatter. In particular, we can scale any k points by a large factor and use the resulting set of points for shattering. In light of this, would you expect our regularization to change the VC-dimension? Why or why not?
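A minimal numpy sketch of the estimator in equation (1); the function name and the choice of the biased (maximum-likelihood) sample covariance are my assumptions:

```python
import numpy as np

def regularized_covariance(X, S, n_prior):
    """Equation (1): shrink the sample covariance of one class toward S.

    X        -- (n, d) array of examples from one class
    S        -- (d, d) prior covariance matrix (shared across classes)
    n_prior  -- the equivalent sample size n' trading off prior vs. data
    """
    n = X.shape[0]
    sigma_hat = np.cov(X, rowvar=False, bias=True)  # ML sample covariance
    w = n / (n + n_prior)                           # weights n/(n+n'), n'/(n+n')
    return w * sigma_hat + (1 - w) * S

# Hypothetical usage: shrink toward the identity with n' = 10.
rng = np.random.default_rng(0)
print(regularized_covariance(rng.normal(size=(20, 2)), np.eye(2), n_prior=10))
```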

4. (T/F, 2 points) Regularization in the above sense would improve the structural risk minimization score for the mixture of two Gaussians.


Problem 3

The problem here is to predict the identity of a person based on a single handwritten character. The observed characters (one of a, b, or c) are transformed into binary 8-by-8 pixel images. There are four different people we need to identify on the basis of such characters. To do this, we have a training set of about 200 examples, where each example consists of a binary 8x8 image and the identity of the person it belongs to. You can assume that the overall number of occurrences of each person and each character in the training set is roughly balanced. We would like to use a mixture-of-experts architecture to solve this problem.
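One plausible reading of the architecture (my assumption, not the exam's intended answer) is a gating network that guesses which character the image shows and one expert per character that predicts the writer's identity. A toy forward pass with hypothetical linear-softmax parameterizations:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical shapes: 3 experts (one per character a/b/c), 64 = 8x8 binary
# pixels, 4 people to identify.
n_experts, n_pixels, n_people = 3, 64, 4
V = rng.normal(size=(n_experts, n_pixels))             # gating parameters
W = rng.normal(size=(n_experts, n_people, n_pixels))   # one classifier per expert

def predict(x):
    """P(person | image) = sum_e P(expert e | image) P(person | image, e)."""
    gate = softmax(V @ x)              # which character does the image look like?
    experts = softmax(W @ x, axis=-1)  # each expert's distribution over people
    return gate @ experts              # mix the experts with the gate

x = rng.integers(0, 2, size=n_pixels).astype(float)  # a flattened binary image
print(predict(x))  # a distribution over the 4 identities
```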

1. (2 points) How might the experts be useful? Suggest what task each expert might solve.

2. (4 points) Draw a graphical model that describes the mixture-of-experts architecture in this context. Indicate what the variables are and the number of values that they can take. Shade any nodes corresponding to variables that are always observed.


3. (4 points) Before implementing the mixture-of-experts architecture, we need to know the parametric form of the conditional probabilities in your graphical model. Provide a reasonable specification of the relevant conditional probabilities, to the extent that you could ask your classmate to implement the classifier.

4. (4 points) So we implemented your method, ran the estimation algorithm once, and measured the test performance. The method was unfortunately performing at a chance level. Provide two possible explanations for this. (There may be multiple correct answers here.)

5. (3 points) Would we get anything reasonable out of the estimation if, initially, all the experts were identical while the parameters of the gating network were chosen randomly? By reasonable we mean training performance. Provide a brief justification.


6. (3 points) Would we get anything reasonable out of the estimation if now the gating network is initially set so that it assigns each training example uniformly to all experts but the experts themselves are initialized with random parameter values? Again, reasonable refers to the training error. Provide a brief justification.

Problem 4

Consider the following pair of observed sequences:

Sequence 1 (st): A A T T G G C C A A T T G G C C ...
Sequence 2 (xt): 1 1 2 2 1 1 2 2 1 1 2 2 1 1 2 2 ...
Position t:      0 1 2 3 4 ...

where we assume that the pattern (highlighted with the spaces) will continue forever. Let st ∈ {A, G, T, C}, t = 0, 1, 2, . . . denote the variables associated with the first sequence, and xt ∈ {1, 2}, t = 0, 1, 2, . . . the variables characterizing the second sequence. So, for example, given the sequences above, the observed values for these variables are s0 = A, s1 = A, s2 = T, . . ., and, similarly, x0 = 1, x1 = 1, x2 = 2, . . ..

1. (4 points) If we use a simple first-order homogeneous Markov model to predict the first sequence (values for st only), what is the maximum likelihood solution that we would find? In the transition diagram below, please draw the relevant transitions and the associated probabilities. (This should not require much calculation; a small counting sketch follows the diagram.)

[Transition diagram: the states A, T, C, G at time t-1 and at time t; draw the transitions and their probabilities.]
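As a sanity check on question 1, the maximum-likelihood transition probabilities are just normalized transition counts; a small sketch over a few repetitions of the pattern:

```python
from collections import Counter, defaultdict

seq = "AATTGGCC" * 4   # a few periods of the observed pattern

counts = defaultdict(Counter)
for prev, nxt in zip(seq, seq[1:]):
    counts[prev][nxt] += 1

# Maximum likelihood: normalize the counts out of each state.
for prev in "ATGC":
    total = sum(counts[prev].values())
    print(prev, "->", {nxt: c / total for nxt, c in counts[prev].items()})
# In the infinite sequence each symbol repeats itself half the time and
# advances to the next pattern symbol half the time, e.g. A -> {A: 1/2, T: 1/2};
# the small deviations above are end-of-string boundary effects.
```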


2. (T/F, 2 points) The resulting first-order Markov model is ergodic.

3. (4 points) To improve the Markov model a bit, we would like to define a graphical model that predicts the value of st on the basis of the previous observed values st-1, st-2, . . . (looking as far back as needed). The model parameters/structure are assumed to remain the same if we shift the model one step. In other words, it is the same graphical model that predicts st on the basis of st-1, st-2, . . . as the model that predicts st-1 on the basis of st-2, st-3, . . .. In the graph below, draw the minimum number of arrows that are needed to predict the first observed sequence perfectly (disregarding the first few symbols in the sequence). Since we slide the model along the sequence, you can draw the arrows only for st. (A brute-force order check follows the graph below.)

[Figure: nodes st, st-1, st-2, st-3, st-4, . . .; draw the arrows into st.]
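A brute-force sketch of how far back a sliding model has to look before the pattern becomes perfectly predictable (the exam only asks for the arrows; the history lengths probed here are the code's only assumption):

```python
seq = "AATTGGCC" * 8

def predictable(order):
    """True if the previous `order` symbols always determine the next one."""
    mapping = {}
    for i in range(order, len(seq)):
        ctx = seq[i - order:i]
        if mapping.setdefault(ctx, seq[i]) != seq[i]:
            return False
    return True

for k in (1, 2, 3):
    print(k, predictable(k))
# 1 False, 2 True: each pair AA, AT, TT, TG, GG, GC, CC, CA has a unique
# successor, while a single symbol (e.g. A) does not.
```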

4. Now, to incorporate the second observation sequence, we will use a standard hidden Markov model:

[Figure: standard HMM with hidden chain st-4 → · · · → st and observations xt-4, . . . , xt.]

where again st ∈ {A, G, T, C} and xt ∈ {1, 2}. We will estimate the parameters of this HMM in two different ways.

(I) Treat the pair of observed sequences (st, xt) (given above) as complete observations of the variables in the model and estimate the parameters in the maximum likelihood sense. The initial state distribution P0(s0) is set according to the overall frequency of symbols in the first observed sequence (uniform).

(II) Use only the second observed sequence (xt) in estimating the parameters, again in the maximum likelihood sense. The initial state distribution is again uniform across the four symbols.

We assume that both estimation processes will be successful relative to their criteria.
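Under approach (I) everything is observed, so the maximum-likelihood estimates are plain ratios of counts over the paired sequences. A sketch of the emission counting relevant to part a) below:

```python
from collections import Counter, defaultdict

s_seq = "AATTGGCC" * 4   # hidden-state sequence, fully observed here
x_seq = "11221122" * 4   # paired observation sequence

emission = defaultdict(Counter)
for s, x in zip(s_seq, x_seq):
    emission[s][x] += 1

# P(x | s) = count(s, x) / count(s): deterministic for this data.
for s in "AGTC":
    total = sum(emission[s].values())
    print(s, {x: c / total for x, c in emission[s].items()})
# A and G always co-occur with 1; T and C always co-occur with 2.
```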


a) (3 points) What are the observation probabilities P(x|s) (x ∈ {1, 2}, s ∈ {A, G, T, C}) resulting from the first estimation approach? (This should not require much calculation.)

b) (3 points) Which estimation approach is likely to yield a more accurate model over the second observed sequence (xt)? Briefly explain why.

5. Consider now the two HMMs resulting from using each of the estimation approaches (approaches I and II above). These HMMs are estimated on the basis of the pair of observed sequences given above. We'd like to evaluate the probability that these two models assign to a new (different) observation sequence 1 2 1 2, i.e., x0 = 1, x1 = 2, x2 = 1, x3 = 2. For the first model, for which we have some idea about what the st variables will capture, we also want to know the associated most likely hidden state sequence. (These should not require much calculation; a forward/Viterbi sketch follows part c) below.)

a) (2 points) What is the probability that the first model (approach I) assigns to this new sequence of observations?

b) (2 points) What is the probability that the second model (approach II) gives to the new sequence of observations?


c) (2 points) What is the most likely hidden state sequence in the first model (from approach I) associated with the new observed sequence?
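A sketch of the forward and Viterbi recursions for the new sequence 1 2 1 2 under the approach-I model; reading the transition and emission tables directly off the counting above is my shortcut, not part of the exam:

```python
import numpy as np

states = "ATGC"
# Approach-I estimates: each state repeats or advances with probability 1/2;
# emissions are deterministic (A, G -> 1 and T, C -> 2); uniform start.
succ = {"A": "T", "T": "G", "G": "C", "C": "A"}
trans = np.zeros((4, 4))
for i, s in enumerate(states):
    trans[i, i] = 0.5
    trans[i, states.index(succ[s])] = 0.5
emit = {s: ({1: 1.0, 2: 0.0} if s in "AG" else {1: 0.0, 2: 1.0}) for s in states}
pi = np.full(4, 0.25)
obs = [1, 2, 1, 2]

def e(x):
    return np.array([emit[s][x] for s in states])

# Forward algorithm: alpha[i] = P(x_0..x_t, s_t = i).
alpha = pi * e(obs[0])
for x in obs[1:]:
    alpha = (alpha @ trans) * e(x)
print("P(1 2 1 2) =", alpha.sum())           # 0.0625 under this model

# Viterbi: the same recursion with max instead of sum, plus back-pointers.
delta, back = pi * e(obs[0]), []
for x in obs[1:]:
    scores = delta[:, None] * trans           # scores[i, j]: best path via i -> j
    back.append(scores.argmax(axis=0))
    delta = scores.max(axis=0) * e(x)
path = [int(delta.argmax())]
for bp in reversed(back):
    path.append(int(bp[path[-1]]))
print("".join(states[i] for i in reversed(path)))  # GCAT (ATGC ties it)
```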

6. (4 points) Finally, let's assume that we observe only the second sequence (xt) (the same sequence as given above). In building a graphical model over this sequence we are no longer limiting ourselves to HMMs. However, we only consider models whose structure/parameters remain the same as we slide along the sequence. The variables st are included as before as they might come in handy as hidden variables in predicting the observed sequence.

a) In the figure below, draw the arrows that any reasonable model selection criterion would find given an unlimited supply of the observed sequence xt, xt+1, . . .. You only need to draw the arrows for the last pair of variables in the graphs, i.e., (st, xt).

[Figure: nodes st-4, . . . , st and xt-4, . . . , xt; draw the arrows for (st, xt).]

b) Given only a small number of observations, the model selection criterion might select a different model. In the figure below, indicate a possible alternate model that any reasonable model selection criterion would find given only a few examples. You only need to draw the arrows for the last pair of variables in the graphs, i.e., (st, xt).

[Figure: nodes st-4, . . . , st and xt-4, . . . , xt; draw the arrows for (st, xt).]


[Figure 1: Decision makers A and B and their context; a Bayesian network over A, B, and the context variables x1, . . . , x6.]

Problem 5

The Bayesian network in Figure 1 claims to model how two people, call them A and B, make decisions in different contexts. The context is specified by the setting of the binary context variables x1, x2, . . . , x6. The values of these variables are not known unless we specifically ask for such values.

1. (6 points) We are interested in finding out what information we'd have to acquire to ensure that A and B will make their decisions independently from one another. Specify the smallest set of context variables whose instantiation would render A and B independent. Briefly explain your reasoning. (There may be more than one strategy for arriving at the same decision.)

2. (T/F, 2 points) We can in general achieve independence with less information, i.e., we don't have to fully instantiate the selected context variables but provide some evidence about their values.
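The transcript does not preserve the edges of Figure 1, so the edge list below is entirely made up; the sketch only shows how questions like 1 can be checked mechanically with the standard moralized-ancestral-graph test for d-separation:

```python
import itertools

# Hypothetical stand-in for Figure 1 (the real edges are in the exam figure).
edges = [("x1", "A"), ("x2", "A"), ("x3", "A"),
         ("x3", "B"), ("x4", "B"), ("x5", "B"),
         ("A", "x6"), ("B", "x6")]

def parents(v):
    return {u for u, w in edges if w == v}

def ancestors(vs):
    out, frontier = set(vs), set(vs)
    while frontier:
        frontier = {p for v in frontier for p in parents(v)} - out
        out |= frontier
    return out

def d_separated(a, b, z):
    """Check a independent of b given z via the moralized ancestral graph."""
    keep = ancestors({a, b} | z)
    undirected = {frozenset(e) for e in edges if set(e) <= keep}
    for v in keep:  # moralize: connect every pair of co-parents
        for p, q in itertools.combinations(sorted(parents(v) & keep), 2):
            undirected.add(frozenset((p, q)))
    seen, stack = {a} | set(z), [a]   # conditioned nodes block traversal
    while stack:
        v = stack.pop()
        for edge in undirected:
            if v in edge:
                w = next(iter(edge - {v}))
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
    return b not in seen

print(d_separated("A", "B", set()))    # False: x3 is a shared parent here
print(d_separated("A", "B", {"x3"}))   # True for this made-up graph
```

The collider at x6 is handled automatically: x6 drops out of the ancestral graph unless it is conditioned on.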


3. (4 points) Could your choice of the minimal set of context variables change if we also provided you with the actual probability values associated with the dependencies in the graph? Provide a brief justification.

Additional set of figures

[Duplicate copies of the transition diagram, the chain over st, . . . , st-4, and the HMM figure from Problem 4.]

Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare (http://ocw.mit.edu/), Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].
