
  • Web Mining
    Giuseppe Attardi
    [includes slides borrowed from C. Manning]

  • Types of Mining
    Web Usage Mining
      Access logs, query logs, etc.
      For improving performance (caching)
      For understanding user needs or preferences
    Web Content Mining
      For extracting information or knowledge

  • Content Mining
    Web Search
    Link Analysis
    Categorization
    Clustering
    Knowledge Extraction
    Ontology
    Semantic Web
    Question Answering

  • Text Categorization

  • Categorization
    Problem: given a universe of objects and a pre-defined set of classes (categories), assign each object to its correct class.

  • Overview
    Definition of Text Categorization
    Techniques
      Naïve Bayes
      Decision Trees
      Maximum Entropy Modeling
      k-Nearest Neighbor Classification
      SVM

  • Text Categorization
    Classification (= Categorization): task of assigning objects to classes or categories.
    Text categorization: task of classifying the topic or theme of a document.

  • Statistical Classification
    Training set of objects
    Data representation model
    Model class
    Training procedure
    Evaluation

  • Training Set of Objects
    A set of objects, each labeled by one or more classes.
    Example from Reuters:

    5-MAR-1987 09:22:57.75   earn   usa
    NORD RESOURCES CORP 4TH QTR NET
    DAYTON, Ohio, March 5 -
    Shr 19 cts vs 13 cts
    Net 2,656,000 vs 1,712,000
    Revs 15.4 mln vs 9,443,000
    Avg shrs 14.1 mln vs 12.6 mln
    Shr 98 cts vs 77 cts
    Net 13.8 mln vs 8,928,000
    Revs 58.8 mln vs 48.5 mln
    Avg shrs 14.0 mln vs 11.6 mln
    NOTE: Shr figures adjusted for 3-for-2 split paid Feb 6, 1987.
    Reuter

  • Data Representation Model
    The training set is encoded via a data representation model.
    Typically, each object in the training set is represented by a pair (x, c), where:
      x: a vector of measurements
      c: class label

  • Data Representation
    For text categorization: use words that are frequent in earnings documents.
    The 20 most representative words include: vs, mln, cts, loss, &, 000, profit, dlrs, pct, etc.
    Each document j is represented as a vector x_j = (x_1j, ..., x_kj), where

      x_ij = round(10 * (1 + log tf_ij) / (1 + log l_j))   if tf_ij > 0, and 0 otherwise

    and tf_ij is the number of occurrences of word i in document j and l_j is the length of document j.

    Example weights for one document:
      vs 5, mln 5, cts 3, ; 3, & 3, 000 4, loss 0, profit 0, dlrs 3, pct 0, is 0, s 0, that 0, net 3, lt 2, at 0
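
    A minimal sketch of this weighting, assuming the rounded log-scaled form shown above; the feature list and the example text are invented for illustration:

      import math
      from collections import Counter

      FEATURES = ["vs", "mln", "cts", "000", "loss", "profit", "dlrs", "pct", "net"]

      def document_vector(tokens, features=FEATURES):
          """x_ij = round(10 * (1 + log tf_ij) / (1 + log l_j)), or 0 if the word is absent."""
          tf = Counter(tokens)               # raw term frequencies
          length = len(tokens)               # document length l_j
          vec = []
          for w in features:
              if tf[w] == 0:
                  vec.append(0)
              else:
                  vec.append(round(10 * (1 + math.log(tf[w])) / (1 + math.log(length))))
          return vec

      # Example: a tiny earnings-style snippet
      print(document_vector("shr 19 cts vs 13 cts net 2,656,000 vs 1,712,000".split()))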

  • Model Class and Training Procedure
    Model class: a parameterized family of classifiers
      e.g. a model class for binary classification: g(x) = w·x + w0
      if g(x) > 0, choose class c1, else c2
    Training procedure: algorithm to select one classifier from this family
      i.e., to select proper parameter values (e.g. w, w0)

  • Evaluation
    Borrowing from IR, NLP systems are evaluated by precision, recall, etc.
    Example: for text categorization, given a set of documents of which a subset is in a particular category (say, earnings), the system classifies some other subset of the documents as belonging to the earnings category.
    The results of the system are compared with the actual results as follows:

  • Evaluation Measures
    With tp, fp, fn, tn the cells of the contingency table (true/false positives/negatives):
      Precision = tp / (tp + fp)
      Recall = tp / (tp + fn)
      Accuracy = (tp + tn) / (tp + fp + fn + tn)
      Error = (fp + fn) / (tp + fp + fn + tn)

  • Evaluation of Text Categorization
    Macro-averaging: compute an evaluation measure for each contingency table separately and average over categories; gives equal weight to each category.

      macro-averaged precision = (1/|C|) Σ_c precision_c

    Micro-averaging: make a single contingency table for all categories by summing the scores in each cell, then compute the evaluation measure for the whole table; gives equal weight to each object.

      micro-averaged precision = Σ_c tp_c / Σ_c (tp_c + fp_c)
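
    A small sketch contrasting the two averages on invented per-category counts:

      # tp = true positives, fp = false positives, per category
      counts = {
          "earn":  {"tp": 90, "fp": 10},
          "grain": {"tp": 5,  "fp": 5},
      }

      def macro_precision(counts):
          precs = [c["tp"] / (c["tp"] + c["fp"]) for c in counts.values()]
          return sum(precs) / len(precs)          # equal weight per category

      def micro_precision(counts):
          tp = sum(c["tp"] for c in counts.values())
          fp = sum(c["fp"] for c in counts.values())
          return tp / (tp + fp)                   # equal weight per object

      print(macro_precision(counts))  # 0.70
      print(micro_precision(counts))  # ~0.86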

  • Classification Techniques
    Naïve Bayes
    Decision Trees
    Maximum Entropy Modeling
    Support Vector Machines
    k-Nearest Neighbor

  • Bayesian Classifiers

  • Bayesian Methods
    Learning and classification methods based on probability theory.
    Bayes' theorem plays a critical role in probabilistic learning and classification.
    Build a generative model that approximates how data is produced.
    Uses the prior probability of each category given no information about an item.
    Categorization produces a posterior probability distribution over the possible categories given a description of an item.

  • Notation
    P(A): probability of an event A
    P(A, B): probability of both events A and B
    P(A | B): conditional probability of event A given that an event B has occurred

  • Bayes' Rule
    P(h | D) = P(D | h) P(h) / P(D)
    P(h): prior probability; P(h | D): posterior probability

  • Maximum a Posteriori Hypothesis
    h_MAP = argmax_h P(h | D) = argmax_h P(D | h) P(h)

  • Maximum Likelihood Hypothesis
    If all hypotheses are a priori equally likely, we need only consider the P(D | h) term:
    h_ML = argmax_h P(D | h)

  • Naïve Bayes Classifiers
    Task: classify a new instance based on a tuple of attribute values (x1, x2, ..., xn):
    c_MAP = argmax_{cj in C} P(cj | x1, x2, ..., xn)

  • Naïve Bayes Classifier: Assumptions
    P(cj): can be estimated from the frequency of classes in the training examples.
    P(x1, x2, ..., xn | cj): O(|X|^n |C|) parameters; could only be estimated if a very, very large number of training examples was available.
    Conditional Independence Assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities.

  • The Naïve Bayes Classifier
    Conditional Independence Assumption: features are independent of each other given the class:
    P(x1, ..., xn | cj) = P(x1 | cj) · P(x2 | cj) · ... · P(xn | cj)

  • Learning the Model
    Common practice: maximum likelihood - simply use the frequencies in the data:
      P̂(cj) = N(C = cj) / N
      P̂(xi | cj) = N(Xi = xi, C = cj) / N(C = cj)

  • Problem with Maximum Likelihood
    What if we have seen no training cases where the patient had no flu and muscle aches?
    Then the maximum-likelihood estimate of that conditional probability is zero, and zero probabilities cannot be conditioned away, no matter the other evidence!

  • Smoothing to Avoid Overfitting
    Add-one (Laplace) smoothing:
      P̂(x_i,k | cj) = (N(Xi = x_i,k, C = cj) + 1) / (N(C = cj) + k),   where k is the number of values of Xi
    Somewhat more subtle version:
      P̂(x_i,k | cj) = (N(Xi = x_i,k, C = cj) + m·p_i,k) / (N(C = cj) + m)
      where p_i,k is the overall fraction in the data where Xi = x_i,k, and m controls the extent of smoothing

  • Text Classification Using Naïve Bayes: Basic Method
    Attributes are text positions, values are words.
    The Naive Bayes assumption is clearly violated, and there are still too many possibilities.
    Assume that classification is independent of the positions of the words: use the same parameters for each position.

  • Text Classification Algorithms: Learning
    From the training corpus, extract the Vocabulary.
    Calculate the required P(cj) and P(xk | cj) terms. For each cj in C do:
      docsj ← subset of documents for which the target class is cj
      Textj ← single document containing all docsj
      for each word xk in Vocabulary:
        nk ← number of occurrences of xk in Textj
        P(xk | cj) ← (nk + 1) / (n + |Vocabulary|), with n the total number of word positions in Textj

  • Text Classification Algorithms: Classifying
    positions ← all word positions in the current document which contain tokens found in Vocabulary
    Return cNB, where
      cNB = argmax_{cj in C} P(cj) Π_{i in positions} P(xi | cj)

  • Naive Bayes Time Complexity
    Training time: O(|D| Ld + |C||V|), where Ld is the average length of a document in D.
      Assumes V and all Di, ni, and nij are pre-computed in O(|D| Ld) time during one pass through all of the data.
      Generally just O(|D| Ld), since usually |C||V| < |D| Ld.
    Test time: O(|C| Lt), where Lt is the average length of a test document.
    Very efficient overall: linearly proportional to the time needed to just read in all the data.

  • Underflow Prevention
    Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.
    Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.
    The class with the highest final un-normalized log probability score is still the most probable.
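
    A minimal multinomial Naive Bayes sketch tying together the learning and classification steps above, with add-one smoothing and summed log probabilities to avoid underflow; the toy training documents are invented:

      import math
      from collections import Counter, defaultdict

      train = [
          ("net profit rose vs year ago".split(), "earn"),
          ("quarterly loss cts vs cts".split(), "earn"),
          ("wheat and corn exports fell".split(), "grain"),
      ]

      def train_nb(data):
          class_docs, class_words = Counter(), defaultdict(Counter)
          vocab = set()
          for words, c in data:
              class_docs[c] += 1          # document counts per class, for P(c)
              class_words[c].update(words)
              vocab.update(words)
          return class_docs, class_words, vocab

      def classify(words, class_docs, class_words, vocab):
          n_docs = sum(class_docs.values())
          best, best_score = None, float("-inf")
          for c in class_docs:
              score = math.log(class_docs[c] / n_docs)           # log P(c)
              total = sum(class_words[c].values())
              for w in words:
                  if w in vocab:                                 # ignore unseen words
                      score += math.log((class_words[c][w] + 1) / (total + len(vocab)))
              if score > best_score:
                  best, best_score = c, score
          return best

      model = train_nb(train)
      print(classify("profit vs year".split(), *model))   # -> 'earn'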

  • Naïve Bayes Posterior Probabilities
    Classification results of naïve Bayes (the class with maximum posterior probability) are usually fairly accurate.
    However, due to the inadequacy of the conditional independence assumption, the actual posterior-probability numerical estimates are not: output probabilities are generally very close to 0 or 1.

  • Two Models
    Model 1: Multivariate binomial
      One feature Xw for each word in the dictionary
      Xw = true in document d if w appears in d
      Naive Bayes assumption: given the document's topic, appearance of one word in the document tells us nothing about the chances that another word appears

  • Two Models
    Model 2: Multinomial
      One feature Xi for each word position in the document; the feature's values are all words in the dictionary
      The value of Xi is the word in position i
      Naïve Bayes assumption: given the document's topic, the word in one position tells us nothing about the value of words in other positions
      Second assumption: word appearance does not depend on position
        P(Xi = w | c) = P(Xj = w | c)  for all positions i, j, word w, and class c

  • Parameter Estimation
    Binomial model: fraction of documents of topic cj in which word w appears
      P̂(Xw = t | cj) = (number of documents of topic cj containing w) / (number of documents of topic cj)
    Multinomial model: fraction of times in which word w appears across all documents of topic cj
      P̂(Xi = w | cj) = (occurrences of w in the mega-document of topic cj) / (total word occurrences in that mega-document)
      i.e. create a mega-document for topic cj by concatenating all documents on this topic, and use the frequency of w in the mega-document

  • Feature Selection via Mutual Information
    We might not want to use all words, but just reliable, good discriminators.
    In the training set, choose the k words which best discriminate the categories.
    One way is in terms of the mutual information between the word indicator and the category indicator, computed for each word w and each category c:

      I(w, c) = Σ_{e_w ∈ {0,1}} Σ_{e_c ∈ {0,1}} P(e_w, e_c) log [ P(e_w, e_c) / (P(e_w) P(e_c)) ]

  • Feature Selection via MI (2)
    For each category we build a list of the k most discriminating terms.
    For example (on 20 Newsgroups):
      sci.electronics: circuit, voltage, amp, ground, copy, battery, electronics, cooling, ...
      rec.autos: car, cars, engine, ford, dealer, mustang, oil, collision, autos, tires, toyota, ...
    Greedy: does not account for correlations between terms.
    In general, feature selection is necessary for binomial NB, but not for multinomial NB.
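
    A sketch of scoring a single word for a single category by mutual information between the indicators "word occurs in the document" and "document belongs to the category"; the counts are invented:

      import math

      def mutual_information(n11, n10, n01, n00):
          """n11: docs in the category containing the word, n10: docs outside the
          category containing the word, n01/n00: the same without the word."""
          n = n11 + n10 + n01 + n00
          mi = 0.0
          for n_wc, p_w, p_c in [
              (n11, n11 + n10, n11 + n01),
              (n10, n11 + n10, n10 + n00),
              (n01, n01 + n00, n11 + n01),
              (n00, n01 + n00, n10 + n00),
          ]:
              if n_wc:
                  mi += (n_wc / n) * math.log2(n * n_wc / (p_w * p_c))
          return mi

      # Keep the k words with the highest score for each category.
      print(mutual_information(n11=30, n10=70, n01=20, n00=880))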

  • Evaluating Categorization
    Evaluation must be done on test data that are independent of the training data (usually a disjoint set of instances).
    Classification accuracy: c/n, where n is the total number of test instances and c is the number of test instances correctly classified by the system.
    Results can vary based on sampling error due to different training and test sets.
    Average results over multiple training and test sets (splits of the overall data) for the best results.

  • Example: AutoYahoo!
    Classify 13,589 Yahoo! webpages in the Science subtree into 95 different topics (hierarchy depth 2).

  • Sample Learning Curve (Yahoo Science Data)

  • Importance of Conditional Independence
    Assume a domain with 20 binary (true/false) attributes A1, ..., A20, and two classes c1 and c2.
    Goal: for any case A = A1, ..., A20 estimate P(A, ci).
    A) No independence assumptions:
      Computation of 2^21 parameters (one for each combination of values)!
      The training database will not be that large!
      Huge memory requirements / processing time.
      Error prone (small sample error).
    B) Strongest conditional independence assumptions (all attributes independent given the class) = Naive Bayes:
      P(A, ci) = P(A1, ci) P(A2, ci) ... P(A20, ci)
      Computation of 20 · 2 · 2 = 80 parameters.
      Space and time efficient.
      Robust estimations.
      What if the conditional independence assumptions do not hold?
    C) More relaxed independence assumptions: a tradeoff between A) and B).

  • Conditions for Optimality of Naïve Bayes
    Fact: sometimes NB performs well even if the conditional independence assumptions are badly violated.
    Questions: WHY? And WHEN?
    Hint: classification is about predicting the correct class label and NOT about accurately estimating probabilities.
    Answer: assume two classes c1 and c2, and a new case A arrives. NB will classify A to c1 if P(A, c1) > P(A, c2). Despite the big error in estimating the probabilities, the classification is still correct.

    Correct estimation implies accurate prediction, but accurate prediction does NOT imply correct estimation.

  • Naïve Bayes is Not So Naïve
    First and second place in the KDD-CUP 97 competition, among 16 (then) state-of-the-art algorithms.
      Goal: financial services industry direct mail response prediction model - predict if the recipient of mail will actually respond to the advertisement; 750,000 records.
    Robust to irrelevant features: irrelevant features cancel each other without affecting results; decision trees and nearest-neighbor methods can instead suffer heavily from this.
    Very good in domains with many equally important features: decision trees suffer from fragmentation in such cases, especially with little data.
    A good dependable baseline for text classification (but not the best)!
    Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes optimal classifier for the problem.
    Very fast: learning with one pass over the data; testing linear in the number of attributes and the document collection size.
    Low storage requirements.
    Handles missing values.

  • Naïve Bayes Drawbacks
    Doesn't model higher-order interactions.
      Typical example: chess end games, where each move completely changes the context for the next move (C4.5: 99.5% accuracy; NB: 87% accuracy).
      What if you have BOTH higher-order interactions AND few training data?
    Doesn't model features that do not equally contribute to distinguishing the classes.
      If only a few features mostly determine the class, additional features usually decrease the accuracy, because NB gives the same weight to all features.

  • Decision Trees

  • Decision Trees
    Example: decision whether to assign documents to the category "earnings".
    node 1: 7681 articles, P(c|n1) = 0.300, split on cts at value 2
      cts < 2 → node 2: 5977 articles, P(c|n2) = 0.116, split on net at value 1
        net < 1 → node 3: 5436 articles, P(c|n3) = 0.050
        net ≥ 1 → node 4: 541 articles, P(c|n4) = 0.649
      cts ≥ 2 → node 5: 1704 articles, P(c|n5) = 0.943, split on vs at value 2
        vs < 2 → node 6: 301 articles, P(c|n6) = 0.694
        vs ≥ 2 → node 7: 1403 articles, P(c|n7) = 0.996

  • Decision Trees - Training Procedure (1)
    Growing a tree with training data
      splitting criterion: for finding the feature and its value on which to split, e.g. maximum information gain
      stopping criterion: determines when to stop splitting, e.g. all elements at a node have the same category
    Pruning it back to a reasonable size
      to avoid overfitting the training set (e.g. dlrs and pct in just one document)
      to optimize performance

  • Maximum Information Gain
    Information gain: H(t) - H(t|a) = H(t) - (pL H(tL) + pR H(tR)), where:
      a is the attribute we split on
      t is the distribution of the node we split
      pL and pR are the fractions of elements passed on to the left and right nodes
      tL and tR are the distributions of the left and right nodes
    Choose the attribute which maximizes the information gain.

    Example:
      H(n1) = -0.3 log(0.3) - 0.7 log(0.7) = 0.881
      H(n2) = 0.518,  H(n5) = 0.315
      H(n1) - H(n1 | cts) = 0.881 - (5977/7681) · 0.518 - (1704/7681) · 0.315 = 0.408
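
    A short check of the information-gain example above:

      import math

      def entropy(p):
          """Binary entropy of a node whose positive-class probability is p."""
          if p in (0.0, 1.0):
              return 0.0
          return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

      h_parent = entropy(0.300)                     # node 1: 0.881
      h_left   = entropy(0.116)                     # node 2: 0.518
      h_right  = entropy(0.943)                     # node 5: 0.315
      gain = h_parent - (5977 / 7681) * h_left - (1704 / 7681) * h_right
      print(round(h_parent, 3), round(gain, 3))     # 0.881 0.408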

  • Decision Trees - Pruning (1)
    At each step, drop the node considered least helpful.
    Find the best tree using validation on a validation set
      validation set: portion of the training data held out from training
    Find the best tree using cross-validation:
      I. Determine the optimal tree size
        1. Divide the training data into N partitions
        2. Grow using N-1 partitions, and prune using the held-out partition
        3. Repeat step 2 N times
        4. Take the average pruned tree size as the optimal tree size
      II. Train using the total training data, and prune back to the optimal size

  • Decision Trees - Pruning (2)
    Effect of pruning on accuracy: optimal performance on the test set is obtained when 951 nodes are pruned.

  • Decision Trees - Summary
    Useful for non-trivial classification tasks (for simple problems, use simpler methods).
    Tend to split the training set into smaller and smaller subsets, which may lead to poor generalization:
      not enough data for reliable prediction
      accidental regularities
    Volatile: a very different model can result from slightly different data.
    Can be interpreted easily:
      easy to trace the path
      easy to debug one's code
      easy to understand a new domain

  • Maximum Entropy Modeling

  • Maximum Entropy Modeling
    The model with maximum entropy of all the models that satisfy the constraints
      desire to preserve as much uncertainty as possible
    Model class: log-linear model
    Training procedure: generalized iterative scaling

  • MaxEntropy: example data

  • MaxEnt: example predictions

  • Maximum Entropy Modeling (2)
    Model class: log-linear model

      p(x, c) = (1/Z) Π_{i=1..K} α_i^{f_i(x, c)}

    α_i: weight for the i-th feature
    Z: normalizing constant

    Class of a new document x: compute p(x, 0) and p(x, 1), and choose the class label with the greater probability.

  • Maximum Entropy Modeling (3)
    Training procedure: generalized iterative scaling (GIS).
    Expected value of f_i under p:

      E_p f_i = Σ_{x,c} p(x, c) f_i(x, c)

    The maximum entropy distribution p* satisfies E_{p*} f_i = E_{p̃} f_i, where p̃ is the empirical distribution.

    Algorithm:
      1. initialize α_i^(1); compute E_{p̃} f_i; n = 1
      2. compute p^(n)(x, c) for each training datum
      3. compute E_{p^(n)} f_i
      4. update α_i^(n+1)
      5. if converged, stop; otherwise n = n+1, go to 2
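
    A toy GIS sketch for a conditional model p(c|x) with two invented binary features; a slack feature would normally pad each example so that the feature counts sum to the constant C (omitted here for brevity):

      import math

      CLASSES = ["earn", "other"]
      def features(x, c):                      # f_i(x, c): indicator features
          return [1.0 if ("cts" in x and c == "earn") else 0.0,
                  1.0 if ("the" in x and c == "other") else 0.0]

      data = [("shr 19 cts vs 13 cts".split(), "earn"),
              ("talks with the government".split(), "other")]
      K, N, C = 2, len(data), 1.0

      def p_cond(x, alphas):
          scores = {c: math.prod(a ** f for a, f in zip(alphas, features(x, c))) for c in CLASSES}
          Z = sum(scores.values())
          return {c: s / Z for c, s in scores.items()}

      alphas = [1.0] * K
      emp = [sum(features(x, c)[i] for x, c in data) / N for i in range(K)]   # E~[f_i]
      for _ in range(100):
          model = [sum(p_cond(x, alphas)[c] * features(x, c)[i]
                       for x, _ in data for c in CLASSES) / N
                   for i in range(K)]                                          # E_p[f_i]
          alphas = [a * (e / m) ** (1.0 / C) for a, e, m in zip(alphas, emp, model)]

      print([round(a, 2) for a in alphas], p_cond("net 5 cts".split(), alphas))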

  • Maximum Entropy Modeling (4)
    Define a (K+1)-th feature for the constraint that the sum of the f_i is equal to a constant C:

      f_{K+1}(x, c) = C - Σ_{i=1..K} f_i(x, c)

    The expected value of f_i under p is defined as

      E_p f_i = Σ_{x,c} p(x, c) f_i(x, c)

    The expected value for the empirical distribution is computed as

      E_{p̃} f_i = (1/N) Σ_{j=1..N} f_i(x_j, c_j)

    The expected value under p is approximately computed as

      E_p f_i ≈ (1/N) Σ_{j=1..N} Σ_c p(c | x_j) f_i(x_j, c)

  • GIS Algorithm (full)
    Initialize {α_i^(1)}.

  • Maximum Entropy Modeling (6)
    Application to text categorization: trained on 9603 articles, 500 iterations; test result: 88.6% accuracy.

  • Maximum Entropy Modeling (7)
    Shortcomings of MEM:
      restricted to binary features
      low performance in some situations
      computationally expensive: slow convergence
      the lack of smoothing can cause problems
    Strengths of MEM:
      can specify all possible relevant information
      complex features can be defined
      can use heterogeneous and weighted features
      an integrated framework for feature selection and classification
      a very large number of features can be reduced to a manageable size during the training procedure

  • Vector Space Classifiers

  • Vector Space Representation
    Each document is a vector, with one component for each term (= word).
    Normalize to unit length.
    Properties of the vector space:
      terms are axes
      n docs live in this space
      even with stemming, there may be 10,000+ dimensions, or even 1,000,000+

  • Classification Using Vector Spaces
    Each training doc is a point (vector) labeled by its class.
    Similarity hypothesis: docs of the same class form a contiguous region of space; or: similar documents are usually in the same class.
    Define surfaces to delineate classes in the space.

  • Classes in a Vector Space
    (Figure: Government, Science, Arts regions.) Is the similarity hypothesis true in general?

  • Given a Test Document
    Figure out which region it lies in.
    Assign the corresponding class.

  • Test Document = Government
    (Figure: the test document falls in the Government region among Government, Science and Arts.)

  • k-Nearest Neighbor

  • k-Nearest Neighbor Classification
    To classify document d into class c:
      define the k-neighborhood N as the k nearest neighbors of d
      count the number of documents l in N that belong to c
      estimate P(c|d) as l/k

  • Example: k = 6 (6NN)
    (Figure: Government, Science, Arts; what is P(science | test doc)?)

  • kNN Learning Algorithm
    Learning is just storing the representations of the training examples in D.
    Testing instance x:
      compute the similarity between x and all examples in D
      assign x the category of the most similar example in D
    Does not explicitly compute a generalization or category prototypes.
    Also called: case-based learning, memory-based learning, lazy learning.
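
    A minimal kNN sketch over unit-length term vectors with cosine similarity, following the algorithm above; documents and labels are invented:

      import math
      from collections import Counter

      def unit_vector(tokens):
          tf = Counter(tokens)
          norm = math.sqrt(sum(v * v for v in tf.values()))
          return {w: v / norm for w, v in tf.items()}

      def cosine(a, b):
          return sum(w_a * b.get(t, 0.0) for t, w_a in a.items())

      def knn_classify(doc, training, k=3):
          v = unit_vector(doc)
          sims = sorted(((cosine(v, unit_vector(d)), c) for d, c in training), reverse=True)
          votes = Counter(c for _, c in sims[:k])
          return votes.most_common(1)[0][0]         # estimate argmax_c P(c|d) = l/k

      training = [
          ("tax cut parliament vote".split(), "Government"),
          ("election minister policy".split(), "Government"),
          ("quantum physics experiment".split(), "Science"),
          ("museum painting exhibition".split(), "Arts"),
      ]
      print(knn_classify("parliament policy vote".split(), training))   # Government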

  • kNN Is Close to Optimal
    Cover and Hart 1967: asymptotically, the error rate of 1-nearest-neighbor classification is less than twice the Bayes rate [the error rate of a classifier knowing the model that generated the data].
    In particular, the asymptotic error rate is 0 if the Bayes rate is 0.
    Assume the query point coincides with a training point: both the query point and the training point contribute error, giving 2 times the Bayes rate.

  • kNN with Inverted Index
    Naively finding nearest neighbors requires a linear search through the |D| documents in the collection.
    But determining the k nearest neighbors is the same as determining the k best retrievals using the test document as a query to a database of training documents.
    Use standard vector space inverted index methods to find the k nearest neighbors.
    Testing time: O(B |Vt|), where B is the average number of training documents in which a test-document word appears; typically B << |D|.

  • kNN: Discussion
    No feature selection necessary.
    Scales well with a large number of classes: no need to train n classifiers for n classes.
    Classes can influence each other: small changes to one class can have a ripple effect.
    Scores can be hard to convert to probabilities.
    No training necessary (actually, perhaps not true: data editing, etc.).

  • kNN vs. Naive Bayes
    Bias/variance tradeoff: variance ≈ capacity.
    kNN has high variance and low bias (infinite memory).
    NB has low variance and high bias (the decision surface has to be linear - a hyperplane, see later).
    Consider: is an object a tree? (Burges)
      Too much capacity/variance, low bias: a botanist who memorizes - will always say no to a new object (e.g., different number of leaves).
      Not enough capacity/variance, high bias: a lazy botanist - says yes if the object is green.
    We want the middle ground.

  • kNN vs. Naïve Bayes
    Bias/variance tradeoff: variance ≈ capacity.
    kNN has high variance and low bias.
    Regression has low variance and high bias.
    Consider: is an object a tree? (Burges)
      Too much capacity/variance, low bias: a botanist who memorizes - will always say no to a new object (e.g., number of leaves).
      Not enough capacity/variance, high bias: a lazy botanist - says yes if the object is green.

  • kNN: Discussion
    Classification time is linear in the size of the training set.
    No feature selection necessary.
    Scales well with a large number of classes: no need to train n classifiers for n classes.
    Classes can influence each other: small changes to one class can have a ripple effect.
    Scores can be hard to convert to probabilities.
    No training necessary (actually: not true - why?).

  • Binary Classification
    Consider 2-class problems.
    How do we define (and find) the separating surface?
    How do we test which region a test doc is in?

  • Separation by Hyperplanes
    Assume linear separability for now:
      in 2 dimensions, we can separate by a line
      in higher dimensions, we need hyperplanes
    Can find a separating hyperplane by linear programming (e.g. the perceptron):
      the separator can be expressed as ax + by = c

  • Linear Programming / Perceptron
    Find a, b, c, such that
      ax + by ≥ c for red points
      ax + by ≤ c for blue points

  • Relationship to Naïve Bayes?
    Find a, b, c, such that
      ax + by ≥ c for red points
      ax + by ≤ c for blue points

  • Linear Classifiers
    Many common text classifiers are linear classifiers.
    Despite this similarity, there are large performance differences.
    For separable problems, there is an infinite number of separating hyperplanes: which one do you choose?
    What to do for non-separable problems?

  • Which Hyperplane?
    In general, lots of possible solutions for a, b, c.

  • Which Hyperplane?
    Lots of possible solutions for a, b, c.
    Some methods find a separating hyperplane, but not the optimal one (e.g., perceptron).
    Most methods find an optimal separating hyperplane.
    Which points should influence optimality?
      All points: linear regression, Naïve Bayes
      Only difficult points close to the decision boundary: support vector machines, logistic regression (kind of)

  • Hyperplane: Example
    Class: "interest" (as in interest rate)
    Example features of a linear classifier (SVM), weight wi and term ti:
      0.70 prime        -0.71 dlrs
      0.67 rate         -0.35 world
      0.63 interest     -0.33 sees
      0.60 rates        -0.25 year
      0.46 discount     -0.24 group
      0.43 bundesbank   -0.24 dlr

  • More Than Two Classes
    One-of classification: each document belongs to exactly one class.
      How do we compose separating surfaces into regions?
    Any-of or multiclass classification: for n classes, decompose into n binary problems.
    Vector space classifiers for one-of classification:
      use a set of binary classifiers
      centroid classification
      k-nearest-neighbor classification

  • Composing Surfaces: Issues?

  • Set of Binary Classifiers: Any-of
    Build a separator between each class and its complementary set (docs from all other classes).
    Given a test doc, evaluate it for membership in each class.
    Apply the decision criterion of the classifiers independently. Done.

  • Set of Binary Classifiers: One-of
    Build a separator between each class and its complementary set (docs from all other classes).
    Given a test doc, evaluate it for membership in each class.
    Assign the document to the class with:
      maximum score, or
      maximum confidence, or
      maximum probability
    Why is this different from multiclass/any-of classification?

  • Negative Examples
    Formulate as above, except that negative examples for a class are added to its complementary set.

  • Centroid Classification
    Given the training docs for a class, compute their centroid.
    Now we have a centroid for each class.
    Given a query doc, assign it to the class whose centroid is nearest.

  • Example
    (Figure: centroids of the Government, Science and Arts classes.)

  • Support Vector Machines

  • Which Hyperplane?
    In general, lots of possible solutions for a, b, c.
    The Support Vector Machine (SVM) finds an optimal solution.

  • Support Vector Machine (SVM)
    SVMs maximize the margin around the separating hyperplane (a.k.a. large margin classifiers).
    The decision function is fully specified by a subset of the training samples, the support vectors.
    Quadratic programming problem.
    State-of-the-art text classification method.

  • Geometric Margin
    The distance from an example xi to the separator is r = yi (w·xi + b) / ||w||.
    Examples closest to the hyperplane are support vectors.
    The margin of the separator is the width of separation between the support vectors of the classes.

  • Maximum Margin: Formalization
    w: hyperplane normal
    xi: data point i
    yi: class of data point i (+1 or -1)

    Constrained optimization formalization:
      xi · w + b ≥ +1   for yi = +1
      xi · w + b ≤ -1   for yi = -1
    maximize the margin: 2/||w||

  • Quadratic Programming
    One can show that the hyperplane w with maximum margin is:

      w = Σ_i α_i y_i x_i

    α_i: Lagrange multipliers
    xi: data point i
    yi: class of data point i (+1 or -1)
    where the α_i are the solution to maximizing

      Σ_i α_i - ½ Σ_{i,j} α_i α_j y_i y_j (x_i · x_j)   subject to α_i ≥ 0 and Σ_i α_i y_i = 0

    Most α_i will be zero.
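
    A sketch of this formulation using scikit-learn (assumed available; any SVM/QP package would do), which exposes the support vectors and the dual coefficients α_i y_i of the solution:

      from sklearn.svm import SVC

      X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],      # class -1
           [2.0, 2.0], [2.0, 3.0], [3.0, 2.0]]      # class +1
      y = [-1, -1, -1, +1, +1, +1]

      clf = SVC(kernel="linear", C=10.0).fit(X, y)
      print(clf.support_vectors_)            # the few points with non-zero alpha_i
      print(clf.dual_coef_)                  # alpha_i * y_i for those points
      print(clf.coef_, clf.intercept_)       # w = sum_i alpha_i y_i x_i, and b
      print(clf.predict([[2.5, 2.5], [0.5, 0.5]]))   # -> [ 1, -1]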

  • Building an SVM Classifier
    Now we know how to build a separator for two linearly separable classes.
    What about classes whose exemplary docs are not linearly separable?

  • Not Linearly Separable
    Find a line that penalizes points on the wrong side.

  • Penalizing Bad Points
    Define a distance for each point with respect to the separator ax + by = c:
      (ax + by) - c for red points
      c - (ax + by) for blue points
    Negative for bad points.

  • Solve Quadratic Program
    The solution gives the separator between the two classes: a choice of a, b.
    Given a new point (x, y), we can score its proximity to each class: evaluate ax + by.
    Set a confidence threshold.

  • Non-linear SVMs
    Datasets that are linearly separable (with some noise) work out great.
    But what are we going to do if the dataset is just too hard?
    How about mapping the data to a higher-dimensional space (e.g. x ↦ (x, x²))?

  • Non-linear SVMs: Feature Spaces
    General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:  Φ: x → φ(x)

  • The Kernel Trick
    The linear classifier relies on an inner product between vectors: K(xi, xj) = xi·xj.
    If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the inner product becomes K(xi, xj) = φ(xi)·φ(xj).
    A kernel function is a function that corresponds to an inner product in some expanded feature space.
    Example: 2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xi·xj)².
    We need to show that K(xi, xj) = φ(xi)·φ(xj):
      K(xi, xj) = (1 + xi·xj)²
                = 1 + xi1²xj1² + 2 xi1xj1 xi2xj2 + xi2²xj2² + 2 xi1xj1 + 2 xi2xj2
                = [1, xi1², √2 xi1xi2, xi2², √2 xi1, √2 xi2] · [1, xj1², √2 xj1xj2, xj2², √2 xj1, √2 xj2]
                = φ(xi)·φ(xj),   where φ(x) = [1, x1², √2 x1x2, x2², √2 x1, √2 x2]
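
    A numeric check of the identity above: the polynomial kernel value equals the inner product of the explicitly mapped vectors:

      import math

      def phi(x):
          x1, x2 = x
          r2 = math.sqrt(2.0)
          return [1.0, x1 * x1, r2 * x1 * x2, x2 * x2, r2 * x1, r2 * x2]

      def K(x, z):
          return (1.0 + x[0] * z[0] + x[1] * z[1]) ** 2

      x, z = (1.0, 2.0), (3.0, -1.0)
      lhs = K(x, z)
      rhs = sum(a * b for a, b in zip(phi(x), phi(z)))
      print(lhs, rhs)   # both 4.0: the kernel never builds phi explicitly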

  • Kernels
    Why use kernels?
      Make a non-separable problem separable
      Map data into a better representational space
    Common kernels:
      Linear
      Polynomial: K(x, z) = (1 + x·z)^d
      Radial basis function: K(x, z) = exp(-||x - z||² / (2σ²)) (infinite-dimensional feature space)

  • Evaluation: Classic Reuters Data Set
    Most (over)used data set: 21578 documents; 9603 training, 3299 test articles (ModApte split); 118 categories.
    An article can be in more than one category; learn 118 binary category distinctions.
    Average document: about 90 types, 200 tokens.
    Average number of classes assigned: 1.24 for docs with at least one category.
    Only about 10 out of 118 categories are large. Common categories (#train, #test):
      Earn (2877, 1087), Acquisitions (1650, 179), Money-fx (538, 179), Grain (433, 149), Crude (389, 189),
      Trade (369, 119), Interest (347, 131), Ship (197, 89), Wheat (212, 71), Corn (182, 56)

  • Performance of SVM
    SVMs are seen as the best-performing method by many.
    The statistical significance of most results is not clear.
    There are many methods that perform about as well as SVM.
      Example: regularized regression (Zhang & Oles)
    Example of a comparison study: Yang & Liu

  • Dumais et al. 1998: Reuters - Accuracy

    Note: all of these runs are for the training data (9603 documents) split into train-train (7147) and train-test.


    Summary (accuracy):

    category      Findsim  NBayes  Trees  LinearSVM
    earn           92.9%    95.9%   97.8%   98.2%
    acq            64.7%    87.8%   89.7%   93.0%
    money-fx       46.7%    56.6%   66.2%   72.7%
    grain          67.5%    78.8%   85.0%   93.2%
    crude          70.1%    79.5%   85.0%   88.8%
    trade          65.1%    63.9%   72.5%   75.2%
    interest       63.4%    64.9%   67.1%   76.7%
    ship           49.2%    85.4%   74.2%   70.7%
    wheat          68.9%    69.7%   92.5%   90.5%
    corn           48.2%    65.3%   91.8%   91.8%
    Avg Top 10     64.6%    81.5%   88.4%   91.2%
    Avg All Cat    61.7%    75.2%   n/a     86.2%

    Updated SVM results (new SVM data, 2/19/98) change the LinearSVM column only slightly (e.g. acq 92.8%, ship 78.0%), giving Avg Top 10 = 91.4% and Avg All Cat = 86.4%.

    Comparison with Joachims (Thorsten), micro-averaged accuracy:

                         NBayes  Rocchio  C4.5   k-NN   SVM-poly  SVM-rbf
    top 10 categories     77.1    82.6    82.8   86.1    89.9      90.7
    90 categories         73.4    78.7    78.9   82.0    85.6      86.3

    Microsoft runs, micro-averaged accuracy:

                         NBayes  BayesNets  Trees  Findsim  LinearSVM
    top 10 categories     81.5%   85.0%     88.4%   64.6%    91.2%
    all categories        75.2%   80.0%     n/a     61.7%    86.2%

    Training times (sec per category, averaged): NBayes 8, BayesNets 146, Trees 145, Findsim ≈ 0.
    Run times (msec per document, excluding feature extraction, which Findsim includes): Findsim ~0.2, NBayes 0.6, BayesNets (est.) 1.2, Trees 0.3-0.9.

  • Yang&Liu: SVM vs Other Methods

  • Results for Kernels (Joachims)

  • Confusion Matrix
    In a perfect classification, only the diagonal has non-zero entries.
    Rows: actual class; columns: class assigned by the classifier.
    Entry (i, j) = 53 means that 53 of the docs actually in class i were put in class j by the classifier.

  • SVM Summary
    Choose the hyperplane based on support vectors: a support vector is a critical point close to the decision boundary.
    (Degree-1) SVMs are linear classifiers.
    Kernels: a powerful and elegant way to define a similarity metric.
    Perhaps the best-performing text classifier, but there are other methods that perform about as well, such as regularized logistic regression (Zhang & Oles 2001).
    Partly popular due to the availability of SVMlight (accurate, fast, and free for research); now lots of software: libsvm, TinySVM, ...
    Comparative evaluation of methods.
    Real world: exploit domain-specific structure!

  • Resources
    Manning and Schütze. Foundations of Statistical Natural Language Processing, Chapter 16. MIT Press.
    Christopher J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition, 1998.
    S. T. Dumais. Using SVMs for text categorization. IEEE Intelligent Systems, 13(4), Jul/Aug 1998.
    S. T. Dumais, J. Platt, D. Heckerman and M. Sahami. Inductive learning algorithms and representations for text categorization. Proceedings of CIKM 98, pp. 148-155, 1998.
    Yiming Yang and Xin Liu. A re-examination of text categorization methods. 22nd Annual International SIGIR, 1999.
    Tong Zhang and Frank J. Oles. Text Categorization Based on Regularized Linear Classification Methods. Information Retrieval 4(1): 5-31, 2001.
    Trevor Hastie, Robert Tibshirani and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag, New York.
    T. Joachims. Learning to Classify Text using Support Vector Machines. Kluwer, 2002.

  • Question Answering

  • Question Answering from Open-Domain Text
    Search engines (IR) return a list of (possibly) relevant documents.
    Users still have to dig through the returned list to find the answer.
    QA: give the user a (short) answer to their question, perhaps supported by evidence.

  • People Do Ask Questions
    Examples from the AltaVista query log:
      who invented surf music?
      how to make stink bombs
      where are the snowdens of yesteryear?
      which english translation of the bible is used in official catholic liturgies?
      how to do clayart
      how to copy psx
      how tall is the sears tower?
    Examples from the Excite query log (12/1999):
      how can i find someone in texas
      where can i find information on puritan religion?
      what are the 7 wonders of the world
      how can i eliminate stress
      What vacuum cleaner does Consumers Guide recommend
    Questions make up around 12-15% of query logs.

  • The Google Answer #1
    Include question words (why, who, etc.) in the stop-list and do standard IR.
    Sometimes this (sort of) works:
      Question: Who was the prime minister of Australia during the Great Depression?
      Answer: James Scullin (Labor), 1929-31.

    (Top results: a page about Curtin, the WW II Labor Prime Minister, from which the answer can be deduced; another page about Curtin that lacks the answer; a page about Chifley, Labor Prime Minister, from which the answer can be deduced.)

  • But Often It Doesn't
    Question: How much money did IBM spend on advertising in 2002?
    Answer: I dunno, but I'd like to...

  • The Google Answer #2
    Take the question and try to find it as a string on the web; return the next sentence on that web page as the answer.
    Works brilliantly if this exact question appears as a FAQ question, etc.
    Works lousily most of the time.
    Reminiscent of the line about monkeys and typewriters producing Shakespeare.
    But a slightly more sophisticated version of this approach has been revived in recent years with considerable success.

  • AskJeeves
    AskJeeves was the most hyped example of question answering.
    It has basically given up now: just web search, except when there are factoid answers of the sort MSN also does.
    It largely did pattern matching to match your question to their own knowledge base of questions.
    If that works, you get the human-curated answers to that known question; if that fails, it falls back to regular web search.
    A potentially interesting middle ground, but a fairly weak shadow of real QA.

  • Online QA Examples
    LCC: http://www.languagecomputer.com/demos/question_answering/index.html
    AnswerBus, an open-domain question answering system: www.answerbus.com
    Ionaut: http://www.ionaut.com:8400/
    EasyAsk, AnswerLogic, AnswerFriend, Start, Quasm, Mulder, Webclopedia, etc.

  • Question Answering at TREC
    Question answering competition at TREC.
    Until 2004 it consisted of answering a set of 500 fact-based questions, e.g. "When was Mozart born?".
    For the first three years, systems were allowed to return 5 ranked answer snippets (50/250 bytes) for each question.
      IR-style Mean Reciprocal Rank (MRR) scoring: 1, 0.5, 0.33, 0.25, 0.2, 0 for an answer at rank 1, 2, 3, 4, 5, 6+.
      Mainly Named Entity answers (person, place, date, ...).
    From 2002, systems are only allowed to return a single exact answer, and the notion of confidence has been introduced.

  • The TREC Document Collection
    The retrieval collection uses news articles from the following sources:
      AP newswire, 1998-2000
      New York Times newswire, 1998-2000
      Xinhua News Agency newswire, 1996-2000
    In total there are 1,033,461 documents in the collection: 3 GB of text.
    Too much to handle entirely using NLP techniques, so the systems usually consist of an initial information retrieval phase followed by more advanced processing.
    Many supplement this text with use of the web and other knowledge bases.

  • Sample TREC Questions
    1. Who is the author of the book "The Iron Lady: A Biography of Margaret Thatcher"?
    2. What was the monetary value of the Nobel Peace Prize in 1989?
    3. What does the Peugeot company manufacture?
    4. How much did Mercury spend on advertising in 1993?
    5. What is the name of the managing director of Apricot Computer?
    6. Why did David Koresh ask the FBI for a word processor?
    7. What debts did Qintex group leave?
    8. What is the name of the rare neurological disease with symptoms such as: involuntary movements (tics), swearing, and incoherent vocalizations (grunts, shouts, etc.)?

  • Top Performing Systems
    Currently the best-performing systems at TREC can answer approximately 60-80% of the questions.
    Approaches and successes have varied a fair deal:
      knowledge-rich approaches, using a vast array of NLP techniques - notably Harabagiu, Moldovan et al. (SMU/UTD/LCC)
      the AskMSR system stressed how much could be achieved by very simple methods with enough text
      a middle ground is to use a large collection of surface matching patterns (Insight Software)

  • TREC 2000 Q&A Track
    693 fact-based, short-answer questions; either short (50 bytes) or long (250 bytes) answers.
    ~3 GB of newspaper/newswire text (AP, WSJ, SJMN, FT, LAT, FBIS).
    Score: MRR (penalizes the second answer).
    Questions: 186 from Encarta, 314 seeded from Excite logs, 193 syntactic variants of 54 originals.

  • Sample Questions

  • Question Types
    Class 1 - A: single datum or list of items; C: who, when, where, how (old, much, large)
    Class 2 - A: multi-sentence; C: extract from multiple sentences
    Class 3 - A: across several texts; C: comparative/contrastive
    Class 4 - A: an analysis of retrieved information; C: synthesized coherently from several retrieved fragments
    Class 5 - A: result of reasoning; C: word/domain knowledge and common sense reasoning

  • Question subtypes

  • TREC 2000 Results
    Knowledge-rich approach: Falcon

  • Falcon: Architecture
    Question Processing: Collins parser + NE extraction, question taxonomy, question expansion, WordNet; produces the question semantic form, the expected answer type and the question logical form.
    Paragraph Processing: paragraph index, paragraph filtering; produces answer paragraphs.
    Answer Processing: Collins parser + NE extraction, coreference resolution, abduction filter; produces the answer semantic form, the answer logical form and the answer.

  • Question Parse
    "Who was the first Russian astronaut to walk in space"
    (Parse tree with POS tags WP VBD DT JJ NNP ... and NP, PP, VP, S constituents.)

  • Question Semantic Form
    Concepts: astronaut, walk, space, Russian, first; answer type: PERSON.
    Question logical form: first(x) ∧ astronaut(x) ∧ Russian(x) ∧ space(z) ∧ walk(y, z, x) ∧ PERSON(x)

  • Expected Answer Type
    Question: What is the size of Argentina?
    "size" is mapped through WordNet to "dimension", giving the expected answer type QUANTITY.

  • Questions About Definitions
    Special patterns:
      What {is|are} <term>?
      What is the definition of <term>?
      Who {is|was|are|were} <person>?
    Answer patterns:
      <term> {is|are} ...
      <term>, {a|an|the} ...
      <term> - ...

  • Question Taxonomy
    Classes include: Reason, Number, Manner, Location (Country, City, Province, Continent), Organization, Product, Language, Mammal, Reptile, Currency, Nationality, Game, Speed, Degree, Dimension, Rate, Duration, Percentage, Count.

  • Question Expansion
    Morphological variants: invented -> inventor
    Lexical variants: killer -> assassin, far -> distance
    Semantic variants: like -> prefer

  • Indexing for Q/A
    Alternatives:
      IR techniques
      parse texts and derive conceptual indexes
    Falcon uses paragraph indexing:
      vector space plus proximity
      returns weights used for abduction

  • Abduction to Justify Answers
    Backchaining proofs from questions.
    Axioms:
      logical form of the answer
      world knowledge (WordNet)
      coreference resolution in the answer text
    Example:
      Q: When was the internal combustion engine invented?
      A: "The first internal-combustion engine was built in 1867"
      invent -> create_mentally -> create -> build
    Effectiveness:
      14% improvement
      filters 121 erroneous answers (of 692)
      takes 60% of question processing time

  • TREC 2001 Results
    Surface approach: Insight Software

  • TREC 2001: No NLP
    Best system from Insight Software, using surface patterns.
    AskMSR uses a Web Mining approach, retrieving suggestions from Web searches.

  • Insight Software: Surface Patterns Approach
    Best at TREC 2001: 0.68 MRR.
    Use of characteristic phrases. For "When was <NAME> born?", typical answers are:
      "Mozart was born in 1756."
      "Gandhi (1869-1948)..."
    This suggests phrases (regular expressions) like:
      "<NAME> was born in <BIRTHDATE>"
      "<NAME> (<BIRTHDATE> -"
    The use of regular expressions can help locate the correct answer.

  • Use Pattern Learning
    Example:
      "The great composer Mozart (1756-1791) achieved fame at a young age"
      "Mozart (1756-1791) was a genius"
      "The whole world would always be indebted to the great music of Mozart (1756-1791)"
    The longest matching substring for all 3 sentences is "Mozart (1756-1791)".
    A suffix tree would extract "Mozart (1756-1791)" as an output, with a score of 3.
    Reminiscent of Information Extraction pattern learning.

  • Pattern Learning (cont.)
    Repeat with different examples of the same question type: Gandhi 1869, Newton 1642, etc.
    Some patterns learned for BIRTHDATE:
      a. born in <ANSWER>, <NAME>
      b. <NAME> was born on <ANSWER>,
      c. <NAME> (<ANSWER> -
      d. <NAME> (<ANSWER> - )

  • Experiments
    6 different question types from the Webclopedia QA Typology (Hovy et al., 2002a):
      BIRTHDATE, LOCATION, INVENTOR, DISCOVERER, DEFINITION, WHY-FAMOUS

  • Experiments: Pattern Precision
    BIRTHDATE:
      1.0   <NAME> ( <ANSWER> - )
      0.85  <NAME> was born on <ANSWER>,
      0.6   <NAME> was born in <ANSWER>
      0.59  <NAME> was born <ANSWER>
      0.53  <ANSWER> <NAME> was born
      0.50  - <NAME> ( <ANSWER>
      0.36  <NAME> ( <ANSWER> -
    DEFINITION:
      1.0   <TERM> and related <ANSWER>
      1.0   form of <ANSWER>, <TERM>
      0.94  as <TERM>, <ANSWER> and
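
    A sketch of applying such learned BIRTHDATE patterns in precision order; the regular expressions are this note's own approximations of the patterns listed above:

      import re

      PATTERNS = [
          (1.00, r"{name}\s*\(\s*(\d{{4}})\s*-\s*\d{{4}}\s*\)"),   # Mozart (1756-1791)
          (0.85, r"{name} was born on ([^,.]+)"),
          (0.60, r"{name} was born in (\d{{4}})"),
      ]

      def extract_birthdate(name, sentences):
          for precision, template in PATTERNS:          # try high-precision patterns first
              regex = re.compile(template.format(name=re.escape(name)))
              for s in sentences:
                  m = regex.search(s)
                  if m:
                      return m.group(1), precision
          return None, 0.0

      sentences = ["The great composer Mozart (1756-1791) achieved fame at a young age."]
      print(extract_birthdate("Mozart", sentences))     # ('1756', 1.0)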

  • Shortcomings & Extensions
    Need for POS and/or semantic types:
      "Where are the Rocky Mountains?"
      "Denver's new airport, topped with white fiberglass cones in imitation of the Rocky Mountains in the background, continues to lie empty..."
    A Named Entity tagger and/or an ontology could enable the system to determine that "background" is not a location name.

  • Shortcomings ... (cont.)
    Long-distance dependencies:
      "Where is London?"
      "London, which has one of the most busiest airports in the world, lies on the banks of the river Thames"
    would require a pattern like "<QUESTION>, (<any words>)*, lies on <ANSWER>".
    The abundance and variety of Web data helps the system find an instance of its patterns without losing answers to long-distance dependencies.

  • Shortcomings... (cont.)
    The system currently has only one anchor word, so it doesn't work for question types requiring multiple words from the question to be in the answer:
      "In which county does the city of Long Beach lie?"
      "Long Beach is situated in Los Angeles County"
      required pattern: <QUESTION TERM> is situated in <ANSWER>
    Did not use case:
      "What is a micron?"
      "...a spokesman for Micron, a maker of semiconductors, said SIMMs are..."
      If "Micron" had been capitalized in the question, this would be a perfect answer.

  • AskMSR: Web Mining
    Web Question Answering: Is More Always Better?
    Dumais, Banko, Brill, Lin, Ng (Microsoft, MIT, Berkeley)

    Q: Where is the Louvre located?
    Want "Paris" or "France" or "75058 Paris Cedex 01" or a map.
    Don't just want URLs.

  • AskMSR: Details
    (Pipeline of 5 steps, detailed on the following slides.)

  • Step 1: Rewrite Queries
    Intuition: the user's question is often syntactically quite close to sentences that contain the answer.
      Where is the Louvre Museum located? -> The Louvre Museum is located in Paris.
      Who created the character of Scrooge? -> Charles Dickens created the character of Scrooge.

  • Query Rewriting
    Classify the question into seven categories:
      Who is/was/are/were...?
      When is/did/will/are/were...?
      Where is/are/were...?
    a. Category-specific transformation rules, e.g. for "Where" questions, move "is" to all possible locations:
      "Where is the Louvre Museum located" ->
        "is the Louvre Museum located" / "the is Louvre Museum located" / "the Louvre is Museum located" /
        "the Louvre Museum is located" / "the Louvre Museum located is"
    b. Expected answer datatype (e.g. Date, Person, Location, ...):
      When was the French Revolution? -> DATE
    Hand-crafted classification/rewrite/datatype rules.
    Nonsense? But who cares: it's only a few more queries to Google.
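
    A sketch of the "move is" rewrite described in rule (a); the rules here are illustrative, not the system's actual ones:

      def rewrite_is_question(question):
          words = question.rstrip("?").split()
          if words[0].lower() not in {"where", "who", "when", "what"} or "is" not in words:
              return [" ".join(words)]
          rest = [w for w in words[1:] if w.lower() != "is"]   # drop wh-word and "is"
          rewrites = []
          for i in range(len(rest) + 1):                       # insert "is" at every position
              rewrites.append(" ".join(rest[:i] + ["is"] + rest[i:]))
          return rewrites

      for q in rewrite_is_question("Where is the Louvre Museum located?"):
          print(q)
      # "is the Louvre Museum located", "the is Louvre Museum located", ...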

  • Step 2: Query Search Engine
    Send all rewrites to a Web search engine.
    Retrieve the top N answers.
    For speed, rely just on the search engine's snippets, not the full text of the actual document.

  • Step 3: Mining N-Grams
    Simple: enumerate all N-grams (N = 1, 2, 3, say) in all retrieved snippets.
    Use a hash table and other fancy footwork to make this efficient.
    Weight of an N-gram: occurrence count, each occurrence weighted by the reliability (weight) of the rewrite that fetched the document.
    Example: Who created the character of Scrooge?
      Dickens 117, Christmas Carol 78, Charles Dickens 75, Disney 72, Carl Banks 54, A Christmas 41, Christmas Carol 45, Uncle 31
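
    A sketch of the n-gram mining step, weighting each occurrence by the reliability of the rewrite that fetched the snippet; snippets and weights are invented:

      from collections import Counter

      def mine_ngrams(snippets, max_n=3):
          scores = Counter()
          for text, rewrite_weight in snippets:
              tokens = text.lower().split()
              for n in range(1, max_n + 1):
                  for i in range(len(tokens) - n + 1):
                      scores[" ".join(tokens[i:i + n])] += rewrite_weight
          return scores

      snippets = [
          ("Charles Dickens created the character of Scrooge", 5),   # exact rewrite matched
          ("Scrooge appears in A Christmas Carol by Dickens", 1),
          ("Dickens wrote A Christmas Carol", 1),
      ]
      for ngram, score in mine_ngrams(snippets).most_common(5):
          print(score, ngram)   # "dickens" and "charles dickens" rank near the top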

  • Step 4: Filtering N-Grams
    Each question type is associated with one or more data-type filters (regular expressions):
      When -> Date, Where -> Location, Who -> Person, What -> ...
    Boost the score of N-grams that match the regexp; lower the score of N-grams that don't.
    Details omitted from the paper.

  • Step 5: Tiling the Answers
    Candidate n-grams with scores: "Charles Dickens" 20, "Dickens" 15, "Mr Charles" 10.
    Tile the highest-scoring n-gram: merge overlapping n-grams and discard the old ones, giving "Mr Charles Dickens" with score 45.
    Repeat until no more overlap.
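
    A sketch of the tiling step, greedily merging overlapping n-grams and summing their scores (overlap here means a word-level suffix of one candidate equals a prefix of the other):

      def overlap_merge(a, b):
          """Return a+b merged on the longest suffix/prefix word overlap, or None."""
          wa, wb = a.split(), b.split()
          for k in range(min(len(wa), len(wb)), 0, -1):
              if wa[-k:] == wb[:k]:
                  return " ".join(wa + wb[k:])
          return None

      def tile(candidates):
          cands = dict(candidates)                  # n-gram -> score
          merged = True
          while merged:
              merged = False
              best = max(cands, key=cands.get)      # highest-scoring n-gram
              for other in list(cands):
                  if other == best:
                      continue
                  tiled = overlap_merge(best, other) or overlap_merge(other, best)
                  if tiled:
                      cands[tiled] = cands.pop(best) + cands.pop(other)
                      merged = True
                      break
          return max(cands.items(), key=lambda kv: kv[1])

      print(tile({"Charles Dickens": 20, "Dickens": 15, "Mr Charles": 10}))
      # ('Mr Charles Dickens', 45)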

  • AskMSR Results
    Standard TREC contest test-bed: ~1M documents; 900 questions.
    The technique doesn't do too well on it (though it would have placed in the top 9 of ~30 participants!): MRR = 0.262, i.e., the right answer is ranked about #4-#5 on average.
    Why? Because it relies on the enormity of the Web!
    Using the Web as a whole, not just TREC's 1M documents: MRR = 0.42, i.e., on average, the right answer is ranked about #2-#3.

  • AskMSR Issues
    In many scenarios (e.g., monitoring an individual's email) we only have a small set of documents.
    Works best (or only) for Trivial-Pursuit-style fact-based questions.
    Limited/brittle repertoire of:
      question categories
      answer data types/filters
      query rewriting rules

  • PiQASso: Pisa Question Answering System
    "Computers are useless, they can only give answers" - Pablo Picasso

  • PiQASso Architecture
    Question analysis: MiniPar parsing, WNSense, question classification; produces keywords and the expected answer type.
    Document collection: sentence splitter, indexer, query formulation/expansion (with WordNet).
    Answer analysis: MiniPar parsing of retrieved paragraphs, answer type matching, relation matching, answer scoring, popularity ranking; if no answer is found, loop back to query expansion.

  • Question Analysis
    Example: What metal has the highest melting point?
    1. Parsing: the NL question is parsed.
    2. Keyword extraction: POS tags are used to select the search keywords (metal, highest, melting, point).
    3. Answer type detection: the expected answer type (SUBSTANCE) is determined by applying heuristic rules to the dependency tree.
    4. Relation extraction: additional relations are inferred and the answer entity is identified.

  • Answer Analysis
    Example: "Tungsten is a very dense material and has the highest melting point of any metal."
    1. Parsing: retrieved paragraphs are parsed.
    2. Answer type check: paragraphs not containing an entity of the expected type (SUBSTANCE) are discarded.
    3. Relation extraction: dependency relations are extracted from the MiniPar output.
    4. Matching distance: the matching distance between word relations in the question and the answer is computed.
    5. Distance filtering: paragraphs that are too distant are filtered out.
    6. Popularity ranking: a popularity rank is used to weight distances. Answer: Tungsten.

  • Match Distance Between Question and Answer
    Analyze relations between corresponding words, considering:
      the number of matching words in the question and in the answer
      the distance between words (e.g. moon - satellite)
      relation types (e.g. words in the question related by subj while the matching words in the answer are related by pred)

  • http://medialab.di.unipi.it/askpiqasso.html

  • TREC 2004
    Questions: factoid (230) and list (56), organized in series (65), plus definition questions.
      Where was Franz Kafka born?
      When was he born?
      What books did he author?
    Question series provide context.

  • TREC 2004 Results (factoid)
    Knowledge-rich approach: PowerAnswer2

  • PowerAnswer2 Architecture
    Question processing: context insertion, strategy selection, keyword selection, keyword alternations/expansions.
    Passage retrieval, followed by lexical matching and semantic matching.
    Answer detection, answer ranking and answer formulation, supported by a QA NER, a parser and the COGEX theorem prover.

  • COGEX Theorem Prover
    Pipeline: semantic parser, axiom building, semantic calculus, logic prover, answer ranking.
    Axioms: linguistic axioms, world knowledge axioms, WordNet axioms.
    Linguistic axioms improve overall accuracy by 12%.

  • QA: Linguistic + Web mining

  • Saarland: Linguistic + Web
    Match the question with a syntactic pattern and the surface structure of a possible answer.
    Search the Web; parse, tag and match the results.
    Use NER to identify the answer.

  • Example
    Question: When did Amtrak begin operations?
    pattern: When, did, NP, Verb, NP|PP
    target: ref(3), ref(4)->past, ref(5), in, NP
    target: in, NP, [,], ref(3), ref(4)->past, ref(5)
    answerType: NP|PP
    Weights: Date = 4, year = 3, other = 1

  • Web Search
    Queries:
      "Amtrak began operations in"
      "In ... Amtrak began operations"
    Results:
      "Amtrak began operations in 1971"
      "through 1971, when Amtrak began operations in the state"
    Matches: 1971: 4; the state: 1

  • Effectiveness

  • QA without linguistic knowledge

  • Lexiclone: Approach
    Find the answer in a smaller and more specific database.
    Extract a few predicative definitions from the text summary.
    Find in AQUAINT a sentence with the same predicatives as those found.
    Accuracy: 0.63 (second best score in 2003).

  • Lexiclone: Summary
    "Most don't live in the city": 3 nouns, 3 verbs, 4 adjectives.
    Build all triads: "city live most", "most live city", "city live in", "most live in", ...

  • QA: role of Named Entities

  • Lexical Terms Extraction as Input to IR
    Questions are approximated by sets of unrelated words (lexical terms).
    Similar to bag-of-words IR models, but choose nominal non-stop words and verbs.

    Questions (from the TREC QA track) and lexical terms:
      Q002: What was the monetary value of the Nobel Peace Prize in 1989? -> monetary, value, Nobel, Peace, Prize
      Q003: What does the Peugeot company manufacture? -> Peugeot, company, manufacture
      Q004: How much did Mercury spend on advertising in 1993? -> Mercury, spend, advertising, 1993

  • Extracting Answers for Factoid Questions: NER
    In TREC 2003 the LCC QA system extracted 289 correct answers for factoid questions; a Named Entity Recognizer was responsible for 234 of them.
    Names are classified into classes matched to questions.

  • NE-driven QA
    The results of the past 5 TREC evaluations of QA systems indicate that current state-of-the-art QA is largely based on high-accuracy recognition of Named Entities:
      precision of recognition
      coverage of name classes
      mapping into concept hierarchies
      participation in semantic relations (e.g. predicate-argument structures or frame semantics)

  • Not All Problems Are Solved Yet!
    Where do lobsters like to live? - on a Canadian airline
    Where are zebras most likely found? - near dumps / in the dictionary
    Why can't ostriches fly? - Because of American economic sanctions
    What's the population of Mexico? - Three
    What can trigger an allergic reaction? - ...something that can trigger an allergic reaction

  • References
    Michele Banko, Eric Brill, Susan Dumais, Jimmy Lin. AskMSR: Question Answering Using the Worldwide Web. In Proceedings of the 2002 AAAI Symposium on Mining Answers from Text and Knowledge Bases, March 2002. http://www.ai.mit.edu/people/jimmylin/publications/Banko-etal-AAAI02.pdf
    Susan Dumais, Michele Banko, Eric Brill, Jimmy Lin, and Andrew Ng. Web Question Answering: Is More Always Better? http://research.microsoft.com/~sdumais/SIGIR2002-QA-Submit-Conf.pdf
    D. Ravichandran and E. H. Hovy. Learning Surface Patterns for a Question Answering System. ACL conference, July 2002.

  • References
    S. Harabagiu, D. Moldovan, M. Paşca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Gîrju, V. Rus and P. Morărescu. FALCON: Boosting Knowledge for Answer Engines. The Ninth Text REtrieval Conference (TREC 9), 2000.
    Marius Paşca and Sanda Harabagiu. High Performance Question/Answering. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR-2001), September 2001, New Orleans LA, pages 366-374.
    L. Hirschman, M. Light, E. Breck and J. Burger. Deep Read: A Reading Comprehension System. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, 1999.
    C. Kwok, O. Etzioni and D. Weld. Scaling Question Answering to the Web. ACM Transactions on Information Systems, Vol. 19, No. 3, July 2001, pages 242-262.
    M. Light, G. Mann, E. Riloff and E. Breck. Analyses for Elucidating Current Question Answering Technology. Journal of Natural Language Engineering, Vol. 7, No. 4 (2001).

  • Dependency Parsing

  • Parsing in QA
    Top systems in TREC 2005 perform parsing of queries and answer paragraphs.
    Some use a specially built parser.
    Parsers are slow: ~1 min/sentence.

  • Constituent Parsing
    Requires a grammar: CFG, PCFG, unification grammar.
    Produces a phrase-structure parse tree.
    Example: "Rolls-Royce Inc. said it expects its sales to remain steady" (tree with NP, VP, ADJP, S constituents).

  • Dependency Tree
    Word-word dependency relations.
    Far easier to understand and to annotate.
    Example: "Rolls-Royce Inc. said it expects its sales to remain steady"

  • Parsing as Classification
    Inductive dependency parsing.
    Parsing based on Shift/Reduce actions.
    Learn from an annotated corpus which action to perform.

  • Parser Actions
    Actions: Shift, Left, Right, applied to the "top" and "next" tokens.
    Example (Italian, "I saw a girl with glasses"): Ho/VER:aux visto/VER:pper una/DET ragazza/NOM con/PRE gli/DET occhiali/NOM ./POS

  • Parser State
    The parser state is a quadruple <S, I, T, A>, where:
      S is a stack of partially processed tokens
      I is a list of (remaining) input tokens
      T is a stack of temporary tokens
      A is the arc relation for the dependency graph

    (w, r, h) ∈ A represents an arc w -> h, tagged with dependency r.
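
    A toy sketch of a state <S, I, T, A> manipulated by Shift/Left/Right actions; the action semantics below follow one common convention and are an assumption here, so the parser in these slides may define them differently (T is omitted since it only matters for the non-projective actions):

      def shift(S, I, A):
          S.append(I.pop(0))                     # move the next input token onto the stack

      def left(S, I, A, r="dep"):
          w, h = S.pop(), I[0]                   # top of stack depends on the next token
          A.add((w, r, h))

      def right(S, I, A, r="dep"):
          w, h = I.pop(0), S[-1]                 # next token depends on the top of the stack
          A.add((w, r, h))
          I.insert(0, S.pop())                   # re-expose the head for further attachment

      # Toy run on "Ho visto una ragazza"; arcs are (dependent, relation, head):
      S, I, A = [], ["Ho", "visto", "una", "ragazza"], set()
      shift(S, I, A)                             # S=[Ho]         I=[visto, una, ragazza]
      left(S, I, A, "aux")                       # Ho -> visto
      shift(S, I, A)                             # S=[visto]      I=[una, ragazza]
      shift(S, I, A)                             # S=[visto, una] I=[ragazza]
      left(S, I, A, "det")                       # una -> ragazza
      right(S, I, A, "obj")                      # ragazza -> visto
      print(A)   # {('Ho','aux','visto'), ('una','det','ragazza'), ('ragazza','obj','visto')}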

  • Parser Actions

  • Learning Features

  • Learning Event
    Example (Italian) tokens with POS tags: Sosteneva/VER che/PRO le/DET leggi/NOM anti/ADV ,/PON Serbia/NOM erano/VER discusse/ADJ che/PRO, divided into left context, target nodes and right context.
    Extracted features: (-3, W, che), (-3, P, PRO), (-2, W, leggi), (-2, P, NOM), (-2, M, P), (-2, W, P), (+2, W, ","), (+2, P, PON)

  • Projectivity
    A dependency tree is projective iff every arc is projective.
    An arc wi -> wk is projective iff, for every j with i < j < k or i > j > k, wi ->* wj.

  • Non-Projective
    Example (Czech): "Většinu těchto přístrojů lze také používat nejen jako fax, ale ..." ("Most of these devices can also be used not only as a fax, but ...")

  • Actions for non-projective arcs

  • CoNLL-X Shared Task Results

  • Future Directions
    Opinion extraction: finding opinions (positive/negative); Blog track in TREC 2006.
    Intent analysis: determine author intent, such as problem (description, solution), agreement (assent, dissent), preference (likes, dislikes), statement (claim, denial).

  • References
    S. Chakrabarti. Mining the Web. Morgan Kaufmann, 2004.
    G. Attardi. Experiments with a Multilanguage Non-projective Dependency Parser. CoNLL-X, 2006.
    H. Yamada, Y. Matsumoto. Statistical Dependency Analysis with Support Vector Machines. In Proc. IWPT, 2003.

    Note: it is a difficult area in which to evaluate the performance of systems, as different people have different ideas as to what constitutes an answer, especially if the systems are trying to return exact answers.

    The reason for wanting only exact answers to be returned is to get systems to accurately pinpoint the answer within the text. Systems can then provide as much context as the user requests, but to return context you need to know the extent of the answer within the text, i.e. exactly what piece of text constitutes just the answer and nothing else.