
Information and Telecommunication Technology Center (ITTC)
University of Kansas

SmartXAutofill
Intelligent Data Entry Assistant for XML Documents

Danico Lee
April 7, 2005


Background – XML Technology

XML is a mark-up language for data representation and data exchange
Characteristics and advantages of XML:
  Users from different professions can define their own tags and attribute names
  Allows people in the same field to exchange data and information
  XML document structures can be nested to any level of complexity
  An XML document can contain an optional description of its grammar for performing structural validation
XML is important in today’s high-volume data-collection environments
  As of 10/26/2004, 1,556,009 people worldwide were using a single XML application tool
  Lots of software applications for XML, e.g. SOAP, XML Spy, Microsoft Office 2003
  Major companies are putting their data in XML
  Many professional groups have developed their own XML ontologies, e.g. OMF for meteorologists and CML for chemists


XML Document - Example
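The sample document on this slide was an image and is not part of the transcript. As a stand-in, here is a minimal sketch of a nested XML document and how its fields could be listed. It assumes a hypothetical seminar announcement: the element names "title", "place", "room" and "bldg" echo the demo slide, while the "seminar" and "speaker" elements, all values, and the helper `leaf_fields` are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: element names echo the demo slide, values are invented.
SAMPLE = """
<seminar>
  <title>Intelligent Data Entry</title>
  <speaker></speaker>
  <place>
    <room>129</room>
    <bldg>Main Hall</bldg>
  </place>
</seminar>
"""

def leaf_fields(xml_text):
    """Return {path: text} for every leaf element; empty leaves map to ''."""
    root = ET.fromstring(xml_text.strip())
    fields = {}

    def walk(node, path):
        children = list(node)
        if not children:
            fields[path] = (node.text or "").strip()
        for child in children:
            walk(child, f"{path}/{child.tag}")

    walk(root, root.tag)
    return fields

fields = leaf_fields(SAMPLE)
# Fields that still need a value, i.e. candidates for a suggestion.
empty = [path for path, value in fields.items() if not value]
print(fields)
print(empty)
```

Flattening the tree into path/value pairs is only one possible representation; it keeps nested elements such as "place/room" distinguishable while giving a classifier a simple lookup table.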


Background - Autofill Technology

Currently 70 million workers, or 59% of working adults in the U.S., complete forms on a regular basis
Data entry process is tedious, error-prone, time-consuming and person-power intensive
Most businesses continue to process almost 80% of their forms manually (according to Verity Inc.)
Autofill and Auto-complete technologies ease the burden of data entry by automatically predicting and suggesting values for empty data fields
Problems with current autofill technologies:
  Require a perfect match with the historical data, e.g. AOL
  Or require previously stored templates, e.g. Roboform
  Mostly for web-based forms
  Can only handle simple data, e.g. name and address in online shopping forms; no support for complex XML structures
  Inaccurate


Motivation

XML is the primary standard of data representation and data exchange
Most businesses continue to process almost 80% of their forms manually
Data entry process for XML documents is tedious, error-prone, time-consuming and person-power intensive
Current software tools for XML only simplify the implementation process
  Information for XML documents still needs to be manually entered
Previous software tools for assisting data entry
  Inaccurate
  Do not support complex XML grammars


Approach

Our goal: reduce the burden on the user by automating the data entry into XML documents
SmartXAutofill - an intelligent data entry assistant for predicting and automating inputs for XML documents based on the contents of historical document collections in the same XML domain
Incorporates an ensemble classifier that integrates multiple internal classification algorithms into a single architecture
Each internal classifier uses approximate techniques from Machine Learning to predict and suggest a value for an empty XML field
  Approximate match: predict the empty node values by relating the values in a historical collection of XML documents to the values in a partially filled document, e.g. probabilistically
  Very different from current autofill systems, which require a perfect match between the incomplete document and the values of stored documents
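To make the contrast with exact-match autofill concrete, here is a minimal sketch of one possible approximate technique: a nearest-neighbour suggestor that ranks historical documents by how many already-filled fields they share with the partial document. The slides do not say how the internal classifiers work, so the flat dict representation, the overlap similarity, and the names `suggest_knn`, `k` and `top` are all assumptions for illustration.

```python
from collections import Counter

def suggest_knn(history, partial, target, k=3, top=3):
    """Suggest values for `target` from the k most similar historical documents.

    history: list of dicts mapping field name -> value (hypothetical flat form)
    partial: dict of the fields the user has filled in so far
    """
    def overlap(doc):
        # Similarity = number of filled fields whose values agree.
        return sum(1 for field, value in partial.items() if doc.get(field) == value)

    candidates = [doc for doc in history if doc.get(target)]
    # sorted() is stable, so equally similar documents keep their historical order.
    nearest = sorted(candidates, key=overlap, reverse=True)[:k]
    votes = Counter(doc[target] for doc in nearest)
    return [value for value, _ in votes.most_common(top)]

# Invented example data.
history = [
    {"speaker": "Lee", "room": "129", "bldg": "North"},
    {"speaker": "Lee", "room": "129", "bldg": "North"},
    {"speaker": "Kim", "room": "246", "bldg": "South"},
]
# No historical document matches this one exactly, yet a suggestion is still possible.
partial = {"speaker": "Lee", "room": "130"}
suggestions = suggest_knn(history, partial, "bldg", k=2)
print(suggestions)
```

An exact-match system would return nothing here because the room number differs from every stored document; the overlap score still finds the two closest documents and proposes their building.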


Overview

1. User enters data into an XML form and moves cursor to an empty field
2. SmartXAutofill examines the data entered
3. SmartXAutofill examines the historical XML collection
4. Machine Learning algorithms predict what the data value should be
5. Weighting System learns and improves from past performance by rewarding algorithms that make correct predictions
6. Voting System forms a consensus decision
7. SmartXAutofill returns one or more suggestions for the current field
8. User selects one of the SmartXAutofill suggestions or enters another value


Underlying Technology – Ensemble Learning

Problem: impossible to predict which classification algorithm will work best for what type of document
Solution: Ensemble classifier
  A collection of a number of classification algorithms; each classifier provides predictions for the value of an XML node
  Learns which individual algorithms provide better predictive accuracy for different XML domains and for different nodes in the XML documents in these domains
  Adapts itself to the specific XML collection, and performs better than any individual predictive algorithm
Boosting is one of the most widely used ensemble methods


Underlying Technology – Ensemble Learning (cont’d)

Our ensemble boosts the internal classifiers based on their past performance by weighting the individual classifiers
  Previous work in boosting combined the same type of classifier, learned by the same methodology, but trained on different examples
  Our ensemble combines different types of classifiers into an integrated classification framework
Extra feature: the collection of XML documents used for prediction is constrained by a “time window”
  Only the N latest documents are used
  N is defined by the user
  Allows the system to adapt itself to the type of documents being entered recently
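The time window can be sketched with a bounded queue. This is a minimal illustration, assuming documents are flat dicts and that the window simply drops the oldest document once N is exceeded; the class name `TimeWindow` and its methods are hypothetical, not taken from SmartXAutofill.

```python
from collections import deque

class TimeWindow:
    """Keep only the N most recently added documents for prediction."""

    def __init__(self, n):
        # A deque with maxlen discards the oldest item automatically.
        self.docs = deque(maxlen=n)

    def add(self, doc):
        self.docs.append(doc)

    def values(self, field):
        """Historical values of `field` inside the window, oldest first."""
        return [doc[field] for doc in self.docs if doc.get(field)]

# Invented example: with N = 2 the first document falls out of the window.
window = TimeWindow(n=2)
for bldg in ["South", "North", "East"]:
    window.add({"bldg": bldg})
recent = window.values("bldg")
print(recent)
```

A small N makes suggestions follow the most recent documents closely; a large N behaves more like using the full historical collection.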


Ensemble Weighted Voting Example

Three classifiers provide three suggestions each
All classifiers have the same weight initially
Classifier weights are modified based on their performance for different nodes in the XML domain
Classifier A makes three suggestions:
  the top one receives a rank value of 3
  the second one of 2
  the third one of 1
Rank values are multiplied by the weight of the classifier and then normalized by the sum of the weights of all the classifiers
The suggestion with the highest score is the one selected by the ensemble and presented to the user
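The voting rule above can be written out as a short sketch. It follows the slide for the rank values (3, 2, 1) and the normalization by the sum of all weights; adding up the contributions when several classifiers propose the same value is an assumption, as are the function name `weighted_vote` and the example data.

```python
from collections import defaultdict

def weighted_vote(suggestions, weights):
    """Combine ranked suggestions from several classifiers.

    suggestions: {classifier: [best, second, third]}
    weights:     {classifier: weight}
    Returns (value, score) pairs sorted from highest to lowest score.
    """
    total_weight = sum(weights.values())
    scores = defaultdict(float)
    for name, ranked in suggestions.items():
        for position, value in enumerate(ranked):
            rank_value = len(ranked) - position  # 3, 2, 1 for three suggestions
            # Assumption: contributions for the same value are summed.
            scores[value] += rank_value * weights[name] / total_weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Invented example: three classifiers, equal initial weights.
suggestions = {
    "A": ["North", "South", "East"],
    "B": ["South", "North", "West"],
    "C": ["South", "East", "North"],
}
weights = {"A": 1.0, "B": 1.0, "C": 1.0}
ranking = weighted_vote(suggestions, weights)
print(ranking[0])
```

With equal weights, "South" collects rank values 2 + 3 + 3 = 8 and scores 8/3, ahead of "North" with 3 + 2 + 1 = 6 and a score of 2. Raising the weight of classifier A would shift the outcome toward its top choice.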


SmartXAutofill Demo

Editor for the element “Title”
Drop-down box containing the best suggested values
Pop-up menu for adding new elements
Editor for the element “place”, which has two child elements, “room” and “bldg”
Node Information displays data about the currently selected element
Suggestion Information displays the top-ranked suggestions from each suggestor for the currently selected element
Voting Information displays a bar for each possible suggestion - colored components show contribution of each suggestor in the vote
Weight Information displays historical accuracy of each suggestor for the currently selected element


Testing Approach

To span the size and complexity dimensions, XML document data were collected from 11 domains
  APAIS, BioMed, CALL, iProClass, PSD, NASA, NREF, SPROT, UniProf, UWM, and WSU
  Size ranged from around 50 to 5000 documents
  Between 20 and 420 nodes per document
Document collections were randomly separated into two sets: seed (10% of the collection or 100 documents) and training collections
  Seed collection - historical information for making predictions
  Training collection - trained the ensemble by modifying its weights based on the accuracy of the suggestions
Continuously trained the learning component and tested the system
  Documents were randomly selected and all nodes were suggested in random order
  Documents from the training collection were added to the seed after being used
  Note: a classifier does not make a suggestion for a particular field if there is no historical data for it or if every previous value for the field was unique, e.g. abstracts of papers
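The test procedure can be sketched as a loop that suggests, scores, and then grows the seed. This is a rough illustration under several assumptions: documents are flat dicts, a simple most-frequent-value suggestor stands in for the real classifiers, and a classifier's weight for a field is taken to be its observed accuracy so far. The slides do not give the actual weight update rule, and the names `freq_suggestor` and `run_trial` are hypothetical.

```python
import random
from collections import Counter

def freq_suggestor(seed, field):
    """Stand-in classifier: most common historical value for the field."""
    values = [doc[field] for doc in seed if doc.get(field)]
    # No suggestion without history, or when every previous value was unique.
    if not values or (len(values) > 1 and len(set(values)) == len(values)):
        return None
    return Counter(values).most_common(1)[0][0]

def run_trial(seed, training, classifiers, rng):
    """Suggest every node of every training document, then grow the seed."""
    seed = list(seed)
    hits = Counter()   # (classifier, field) -> correct suggestions
    tries = Counter()  # (classifier, field) -> suggestions made
    docs = list(training)
    rng.shuffle(docs)  # documents are taken in random order
    for doc in docs:
        fields = list(doc)
        rng.shuffle(fields)  # nodes are suggested in random order
        for field in fields:
            for name, suggest in classifiers.items():
                guess = suggest(seed, field)
                if guess is None:
                    continue
                tries[name, field] += 1
                hits[name, field] += int(guess == doc[field])
        seed.append(doc)  # a used document becomes part of the history
    # Assumed weight rule: observed accuracy per classifier and field.
    return {key: hits[key] / tries[key] for key in tries}

# Invented example data.
seed = [{"bldg": "North"}, {"bldg": "North"}]
training = [{"bldg": "North"}, {"bldg": "South"}]
weights = run_trial(seed, training, {"freq": freq_suggestor}, random.Random(0))
print(weights)
```

Because each training document is added to the seed only after it has been suggested, every prediction is scored on a document the classifiers have not yet seen.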


Test Results for Different Domains

Accuracy (%)

XML Domain   No. of Documents   Naïve Bayes   KNN     Freq    Recency   Ensemble
APAIS        200                51.67         50.44   51.43   52.08     52.96
BioMed       2716               3.91          8.85    11.01   11.18     12.47
CALLS        271                99.81         99.81   99.81   99.45     99.83
iProClass    449                37.45         40.93   37.15   39.34     46.06
NASA         2335               23.07         22.23   24.99   27.04     30.43
NREF         450                56.89         66.79   64.08   69.28     70.47
PSD          450                23.41         30.84   30.25   30.4      35.87
SPROT        4900               11.2          19.38   17.66   19.35     23.8
UMW          450                37.94         32.9    38.41   52.66     54.13
UniProf      192                46.37         48.13   46.37   30.59     49.84
WSU          353                45.24         36.89   37.98   55.11     57.71


Weights of selected XML nodes from iProClass domain

NodeId   Naïve Bayes   KNN      Freq     Recency
6        0.4146        0.5805   0.4829   0.7724
7        0.6878        0.7528   0.5772   0.8423
13       0.1404        0.1066   0.1184   0.0880
14       0.0100        0.0100   0.0186   0.0100
22       0.0449        0.1124   0.0375   0.0637
28       0.2543        0.1890   0.2474   0.0825


Test Result Discussion

Different classification algorithms perform better for different domains
Ensemble classifier performed at least as well as the best performing internal classification algorithm for a domain
Different classifiers are preferred for different nodes
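The first two points can be checked directly against the accuracy figures in the results table. The sketch below copies those figures and computes, for each domain, which internal classifier scored highest and whether the ensemble matched or beat it; the variable and function names are illustrative only.

```python
# Accuracy (%) per domain, copied from the results table, in the order
# (Naive Bayes, KNN, Freq, Recency, Ensemble).
NAMES = ("Naive Bayes", "KNN", "Freq", "Recency")
RESULTS = {
    "APAIS":     (51.67, 50.44, 51.43, 52.08, 52.96),
    "BioMed":    (3.91, 8.85, 11.01, 11.18, 12.47),
    "CALLS":     (99.81, 99.81, 99.81, 99.45, 99.83),
    "iProClass": (37.45, 40.93, 37.15, 39.34, 46.06),
    "NASA":      (23.07, 22.23, 24.99, 27.04, 30.43),
    "NREF":      (56.89, 66.79, 64.08, 69.28, 70.47),
    "PSD":       (23.41, 30.84, 30.25, 30.4, 35.87),
    "SPROT":     (11.2, 19.38, 17.66, 19.35, 23.8),
    "UMW":       (37.94, 32.9, 38.41, 52.66, 54.13),
    "UniProf":   (46.37, 48.13, 46.37, 30.59, 49.84),
    "WSU":       (45.24, 36.89, 37.98, 55.11, 57.71),
}

def best_individual(row):
    """Names of the internal classifiers that reach the top accuracy in a row."""
    top = max(row[:4])
    # zip stops after the four classifier names, so the ensemble column is skipped.
    return [name for name, score in zip(NAMES, row) if score == top]

winners = {domain: best_individual(row) for domain, row in RESULTS.items()}
ensemble_never_worse = all(row[4] >= max(row[:4]) for row in RESULTS.values())

for domain, names in winners.items():
    print(domain, names)
print(ensemble_never_worse)
```

In these figures KNN is the top internal classifier for iProClass, PSD, SPROT and UniProf, Recency leads in most of the remaining domains, and the ensemble column is the highest value in every row.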


Our Technology - SmartXAutofill

First methodology proven to intelligently predict, suggest and autofill data for XML documents
Learns and adapts itself to any XML domain without the need for custom algorithms
“Time window” allows the technology to adapt itself to the particular set of XML documents being filled at that time
Speeds up the data entry process for XML documents from 20% to 99%