International Journal of Web Services Research, 3(2), 33-60, April-June 2006
Copyright © 2006, Idea Group Inc. Copying or distributing in print or electronic forms without written permission of Idea Group Inc. is prohibited.

XWRAPComposer: A Multi-Page Data Extraction Service

Ling Liu, Georgia Institute of Technology, USA
Jianjun Zhang, Georgia Institute of Technology, USA
Wei Han, IBM Research, Almaden Research Center, USA
Calton Pu, Georgia Institute of Technology, USA
James Caverlee, Georgia Institute of Technology, USA
Sungkeun Park, Georgia Institute of Technology, USA
Terence Critchlow, Lawrence Livermore National Laboratory, USA
David Buttler, Lawrence Livermore National Laboratory, USA
Matthew Coleman, Lawrence Livermore National Laboratory, USA

ABSTRACT

We present a service-oriented architecture and a set of techniques for developing wrapper code generators, including the methodology of designing an effective wrapper program construction facility and a concrete implementation, called XWRAPComposer. Our wrapper generation framework has two unique design goals. First, we explicitly separate tasks of building wrappers that are specific to a Web service from the tasks that are repetitive for any service, thus the code can be generated as a wrapper library component and reused automatically by the wrapper generator system. Second, we use inductive learning algorithms that derive information flow and data extraction patterns by reasoning about sample pages or sample specifications. More importantly, we design a declarative rule-based script language for multi-page information extraction, encouraging a clean separation of the information extraction semantics from the information flow control and execution logic of wrapper programs. We implement these design principles with the development of the XWRAPComposer toolkit, which can semi-automatically generate WSDL-enabled wrapper programs. We illustrate the problems and challenges of multi-page data extraction in the context of bioinformatics applications and evaluate the design and development of XWRAPComposer through our experiences of integrating various BLAST services.

Keywords: code generator; data extraction; service oriented architecture; Web services

INTRODUCTION

With the wide deployment of Web service technology, the Internet and the World Wide Web (Web) have become the most popular means for disseminating both business and scientific data from a variety of disciplines. For example, a vast and growing amount of life sciences data resides in specialized Bioinformatics data sources, many of which are accessible online with specialized query processing capabilities. Concretely, the Molecular Biology Database Collection currently holds over 500 data sources (DBCAT, 1999), not even including the many tools that analyze the information contained therein. Bioinformatics data sources over the Internet have a wide range of query processing capabilities. Typically, many Web-based sources allow only limited types of selection queries. To compound the problem, data from one source often must be combined with data from other sources to provide scientists with the information they need.

Motivating Scenario

In the Bioinformatics and Bioengineering domain, many biologists currently use a variety of tools, such as DNA microarrays, to discover how DNA and the proteins it encodes may allow an organism to respond to various stress conditions such as exposure to environmental mutagens (Quandt, Frech, Karas, Wingender, & Werner, 1995; Altschul et al., 1997; DBCAT, 1999). One way to accomplish this task is for genomics researchers to identify genes that react in the desired way, and then develop models that capture the common elements. Such a model is then used to identify previously unidentified genes that may respond in a similar fashion based on the common elements. Figure 1 illustrates a workflow that a genomics researcher has created to gather the data required for this analysis. This type of workflow differs significantly from traditional workflows, as it is iteratively generated to discover the correct process with a small set of data as the initial input. At each step the researcher selects and extracts the part of the output data that is useful for his genomic analysis in the next step, and determines which services should be used in the next step of his data collection process. Once the workflow is constructed, the genomic researcher uses it as the data collection pattern to collect large quantities of data and perform large-scale genomic analysis. Concretely, Figure 1 shows a pattern of a promoter model where the data collection is performed in eight steps using possibly eight or more Bioinformatics data sources through service-oriented computing interfaces.

In Step (1), microarrays containing the genes of interest are produced and exposed to different levels of a specific mutagen in the wet lab, usually in a time-dependent manner.

In Step (2), gene expression changes are measured and clustered using some computational tools (e.g., Clusfavor (Peterson, 2002)), such that genes that changed significantly in a micro-array analysis experiment are identified and clustered. The representative genes from the Clusfavor analysis will be used as the input for the next data collection step. Typically, the researcher must choose from a wide variety of tools available for this task, either manually based on his past experience or using a Web service selection facility. Each tool offers specific advantages in terms of its ability to analyze the microarray data, and each requires a different method of execution.

In Step (3), the full sequence of each of the representative genes chosen in the second step is retrieved from gene banks.

In Step (4), each gene sequence retrieved in Step (3) will be submitted to a gene matching service, such as the NCBI Blast Web service, that will return homologs (other genes with similar sequences). The returned sequences will be further examined to find promoter sequences. Again, there are several services that provide gene similarity matching, many of which specialize in a particular species, such as ACEdb (Stein & Thierry-Mieg, 1999).

Once related sequences are discovered, approximately 1000-5000 bases of the DNA sequence around the alignment are extracted to capture the promoter regulatory elements — the region of a gene where RNA polymerase can bind and begin transcription to create the proteins that regulate cell function. In Step (5), these promoter sequences are identified and analyzed using specific tools, such as Mat-Inspector (Peterson, 2002), TRANSFAC, TRRD, or COMPEL (Quandt et al., 1995), to find the common transcription binding factors. To extract specific data, such as portions of a DNA sequence, returned by the sources, the data needs to be converted into a well-known format, such as XML, and post-processed in order to extract just the portions that are relevant for the next step.

In Step (6), regulatory profiles are then compared across each gene in the cluster to delineate common response elements that can be fed into the promoter model generator to create a promoter model in Step (7). Once the model is created, it can be used to search gene databases to find other candidate genes relevant to the study in Step (8), which starts a new iteration where these genes are fed back into this general workflow to refine and expand the promoter model until the genomic researcher is satisfied with the result. The collection of genes found in this iterative process will be presented as the final results of this complex data analysis task.

It is important to point out that each of these steps requires service selection, automated data extraction, and service composition and integration. Choosing the appropriate source depends on the content, capabilities, and load of the source, as well as the trustworthiness of the source. Some sites have much stricter standards on the quality of the data that they admit, while others publish information as soon as it is available. Depending on the current needs of a particular researcher, different types of sites may be more appropriate to query. In addition to selecting a capable and trustworthy source, there are significant issues in extracting data from the sites. Most sites have custom query interfaces and return results through a series of HTML pages. For example, NCBI BLAST (Basic Local Alignment Search Tool) (Altschul et al., 1997) requires the user to take three or four steps in order to retrieve the matching sequence homologs. First, a gene sequence must be submitted through an HTML form. Users may then optionally select the format in which the returned data should be represented. Then, a series of delay pages are shown while the service calculates the final answer. Once the answer is computed, a page listing the related sequence IDs and their alignment information is presented. The full homolog sequence is available by following a link from each alignment. Just to retrieve one set of similar sequences from this tool requires a significant amount of human effort in following each link, extracting the 1000-5000 bases of the DNA sequence around the alignment, and integrating the data from each extraction to form the final result of one BLAST search.
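To make the mechanics concrete, the sketch below shows the kind of fetch-and-poll loop a wrapper must automate for this interaction. The endpoint URL, parameter names, and delay-page marker are illustrative assumptions, not the actual NCBI interface, and extractRequestId stands in for generated extraction code.

```java
import java.io.*;
import java.net.*;

// Minimal sketch of the fetch-and-poll interaction a BLAST wrapper must
// automate. The URL and parameter names are illustrative placeholders.
public class BlastPollingSketch {
    static String fetch(String urlString) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(urlString).openConnection();
        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        StringBuilder page = new StringBuilder();
        for (String line; (line = in.readLine()) != null; ) {
            page.append(line).append('\n');
        }
        in.close();
        return page.toString();
    }

    public static void main(String[] args) throws Exception {
        // Step 1: submit the sequence; the response page carries a request ID.
        String responsePage = fetch("http://blast.example.org/submit?seq=" + args[0]);
        String requestId = extractRequestId(responsePage); // hypothetical helper

        // Steps 2-3: re-ask for the result, waiting while delay pages come back.
        String resultPage;
        while (true) {
            resultPage = fetch("http://blast.example.org/result?rid=" + requestId);
            if (!resultPage.contains("still running")) break; // illustrative marker
            Thread.sleep(10_000); // back off before polling again
        }
        // Step 4: resultPage is now the summary page; detail pages hang off its links.
    }

    static String extractRequestId(String page) {
        return ""; // placeholder for generated extraction logic
    }
}
```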

Figure 1. Example workflow for developing a promoter model


Challenges of Data Extraction and Data Integration

The extraordinary growth of service-oriented computing has been fueled by the enhanced ability to make a growing amount of information available through the Web. This brings good news and bad news.

The good news is that Web services provide a standard invocation interface for remote service calls, and the bulk of useful and valuable information is designed and published in a human browsing format (HTML or XML). The bad news is that these "human-oriented" Web pages returned by Web services are difficult for programs to capture and extract information of interest from automatically, and to fuse and integrate data from multiple autonomous and yet heterogeneous data producer services. Moreover, different Web services use different and evolving custom data formats.

A popular approach to handle this problem is to write data wrappers that encapsulate the access to Web sources and automate the information extraction tasks on behalf of humans. A wrapper is a software program specialized to a single data source or a single Web service (e.g., a Web site), which converts the source documents and queries from the source data model to another, usually more structured, data model (Liu, Pu, & Han, 1999).
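In code terms, a wrapper is a small adapter behind a uniform, structured interface. A minimal sketch of the idea follows; the interface and names are ours for illustration, not part of XWRAPComposer's API.

```java
import org.w3c.dom.Document;

// Conceptual sketch only: a wrapper hides one source's query syntax and
// page layout behind a uniform, structured interface.
public interface SourceWrapper {
    // Translate a structured query into the source's native request,
    // fetch the answer page(s), and return the extracted data as XML.
    Document query(String structuredQuery) throws WrapperException;
}

class WrapperException extends Exception {
    WrapperException(String message, Throwable cause) { super(message, cause); }
}
```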

Several projects have implemented hand-coded wrappers for a variety of sources (Haas, Kossmann, Wimmers, & Yan, 1997; Bayardo, Jr. et al., 1997; Li et al., 1997; Knoblock et al., 1998). However, manually writing such a wrapper and making it robust is costly due to the irregularity, heterogeneity, and frequent updates of the Web sites and the data presentation formats they use. Hand-coding wrappers becomes a major pain when data integration applications are interested in integrating new data sources or frequently changing Web sources. We observe that, with a good design methodology, only a relatively small part of the wrapper code deals with source-specific details, and the rest of the code is either common among wrappers or can be expressed in a higher-level, more structured fashion. There are a number of challenging issues in automating the wrapper code generation process.

First, most Web pages are HTML or XML documents, which are semi-structured text files annotated with various HTML presentation tags. Due to the frequent changes in the presentation style of HTML documents, the lack of semantic description of their information content, and the difficulty of making all applications in one domain use the same XML schema, it is hard to identify the content of interest using common pattern recognition technology such as the string regular expression specifications used in LEX and YACC.

Second, wrappers for Web sources should be robust and adaptive in the presence of changes in both the presentation style and the information content of the Web pages. It is expected that the wrappers generated by wrapper generation systems will have lower maintenance overhead than handcrafted wrappers in the face of unexpected changes.

Third, wrappers often serve as interface programs that pass the extracted Web data to application-specific information broker agents or information integration mediators for more sophisticated data analysis and manipulation. Thus it is desirable to provide a wrapper interface language that is simple, self-describing, and yet powerful enough for extracting and capturing information from most Web pages. In scientific computing domains such as bioinformatics and bioengineering, information extraction over multiple different pages imposes additional challenges for wrapper code generation systems due to the varying correlation of the pages involved. The correlation can be either horizontal, when grouping data from homogeneous documents (such as multiple result pages from a single search), or vertical, when joining data from heterogeneous but related documents (a series of pages containing information about a specific topic). Furthermore, the correlation can be extended into a graph of workflows as described in Figure 1.

Therefore, there is an increasing demand for automated wrapper code generation systems to incorporate a multi-page information extraction service. A multi-page wrapper not only enriches the capability of wrappers to extract information of interest but also increases the sophistication of wrapper code generation.

Surprisingly, almost all existing wrappers generated by application code generators (DISL Group, Georgia Institute of Technology, 2000; Sahuguet & Azavant, 1999; Baumgartner, Flesca, & Gottlob, 2001) are single-page wrappers, in the sense that the wrapper program responds to a keyword query by analyzing only the page immediately returned. Most wrappers cannot follow the links within this page to continue the information extraction from other linked pages, unless separate queries are issued to locate those linked pages.

Bearing all these issues in mind, we develop a code generation framework for building a semi-automated wrapper code generation system that can generate wrappers capable of extracting information from multiple inter-linked Web documents, and we implement this framework in XWRAPComposer, a toolkit for semi-automatically generating Java wrapper programs that can collect and extract data from multiple inter-linked pages automatically. XWRAPComposer has three unique features with regard to supporting multi-page data extraction.

First, we introduce an interface, an outerface, and a composer script for each wrapper program we generate. By encoding wrapper developers' knowledge in the Interface Specification, Outerface Specification, and Composer Script, XWRAPComposer integrates single-page wrapper programs into a composite wrapper capable of extracting information across multiple inter-linked pages from one service provider.

Second, XWRAPComposer transforms the multi-page information extraction problem into an integration problem over multiple single-page data extraction results, and utilizes the composer script to interconnect a sequence of single-page data extraction results, offering flexible execution choices to address the diverse needs of different users. It generates platform-independent Java code that can be executed locally on a user's machine. It also provides a WSDL-plugin module to allow users to produce WSDL-enabled wrappers as Web services (W3C, 2003).

Third but not least, XWRAPComposer supports micro-workflow management, such as intermediate information flow or result auditing. We demonstrate this capability by integrating XWRAPComposer and its generated wrappers with process modeling tools such as Ptolemy (Berkeley, 2003), allowing users to interactively manage the different components of a wrapper and the interactions between them. In the following sections, we first give an overview of the XWRAPComposer system architecture, then describe some important design and development efforts, using the motivating scenario described in this section as our application environment. Finally, we describe the status of the XWRAPComposer system development and discuss future work.

The Design Framework

Multi-page wrapper code generation is a complex process, and it is not reasonable, either from a logical point of view or from an implementation point of view, to consider the construction process as occurring in one single step. For this reason, we partition the wrapper construction process into a series of subprocesses called phases, as shown in Figure 2. A phase is a logically cohesive operation that takes as input one representation of the source document and produces as output another representation. XWRAPComposer wrapper generation goes through six phases to construct and release a Java wrapper. Tasks within a phase run concurrently using a synchronized queue; each task runs in its own thread. For example, we decided to run the task of fetching a remote document and the task of repairing the bad formatting of the fetched document in two concurrent, synchronized threads in a single pass over the source document. The task of generating a syntactic-token parse tree from an HTML document requires the entire document as input; thus, it cannot be done in the same pass as the remote document fetching and the syntax repair. Similar analysis applies to the other tasks such as code generation, testing, and packaging.

The interaction and information exchange between any two of the phases is performed through communication with the bookkeeping and the error handling routines. The bookkeeping routine of the wrapper generator collects information about all the data objects that appear in the retrieved source document, keeps track of the names used by the program, and records essential information about each. For example, a wrapper needs to know how many arguments a tag expects, and whether an element represents a string or an integer. The data structure used to record this information is called a symbol table. The error handler is designed for detecting and reporting errors in the fetched source document. The error messages should allow a wrapper developer to determine exactly where the errors have occurred. Errors can be encountered at virtually all the phases of a wrapper. Whenever a phase of the wrapper discovers an error, it must report the error to the error handler, which issues an appropriate diagnostic message. Once the error has been noted, the wrapper must modify the input to the phase detecting the error, so that the latter can continue processing its input, looking for subsequent errors. Good error handling is difficult because certain errors can mask subsequent errors. Other errors, if not properly handled, can spawn an avalanche of spurious errors. Techniques for error recovery are beyond the scope of this paper.
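A bookkeeping structure of this kind might be sketched as follows; this is a simplification, and the field names are ours rather than XWRAPComposer's.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the bookkeeping symbol table: one entry per data
// object observed in the retrieved document. Field names are illustrative.
class SymbolEntry {
    String name;       // name used by the generated program
    String type;       // e.g., "string" or "int"
    int argumentCount; // how many arguments a tag expects, if applicable
}

class SymbolTable {
    private final Map<String, SymbolEntry> entries = new HashMap<>();

    void record(SymbolEntry e) { entries.put(e.name, e); }
    SymbolEntry lookup(String name) { return entries.get(name); }
}
```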

Figure 2 presents an architecture sketch of the XWRAPComposer system. The system architecture of XWRAPComposer consists of four major components: (1) Remote Connection and Source-specific Parser; (2) Multi-page Data Extraction; (3) Code Generation and Packaging; and (4) Debugging and Release. Other components include the GUI interface, bookkeeping, and error handling. The GUI interface allows wrapper developers to interactively specify the workflow of the multi-page data extraction, the request-respond flow control rules, and the cross-page data extraction rules.

Figure 2. XWRAPComposer system architecture

Remote Connection and Source-specific Parser is the first component. It prepares and sets up the environment for the information extraction process by performing the following three tasks. First, it accepts a URL selected and entered by the XWRAPComposer user, issues an HTTP request to the remote service provider identified by the given URL, and fetches the corresponding Web document (the so-called page object). During this process, XWRAPComposer learns the search interface and the remote service invocation procedure in the background and generates a set of rules that describe the list of interface functions and parameters as well as how they are used to fetch a remote document from a given Web source. The list of interface functions includes the declarations of the standard library routines for establishing the network connection, issuing an HTTP request to the remote Web server through an HTTP GET or HTTP POST method, and fetching the corresponding Web page. Other desirable functions include building the correct URL to access the given service and passing the correct parameters, and handling redirection, failures, or authorization if necessary. Second, it cleans up bad HTML tags and syntactical errors using an XWRAPComposer plugin such as HTML TIDY (Raggett, 1999; W3C, 1999). Third, it transforms the retrieved page object into a parse tree, or so-called syntactic token tree. This page object will be used as a sample for XWRAPComposer to interact with the user, to learn and derive the important information extraction rules, and to identify the list of linked pages the user is interested in extracting information from in conjunction with this page. In addition, all wrappers generated by XWRAPComposer use streaming mode instead of blocking mode; that is, the wrapper reads the Web page one block at a time. An interface specification is created in this phase.
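As an illustration of the second and third tasks, the fragment below fetches a page, repairs it with the JTidy library (one implementation of HTML TIDY), and produces a DOM tree that can serve as the parse tree. This is our sketch, not the generated wrapper code itself.

```java
import java.io.InputStream;
import java.net.URL;
import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

// Sketch: fetch a page, repair bad HTML with JTidy, and obtain a DOM tree
// usable as the syntactic token tree.
public class FetchAndClean {
    public static Document fetchParsed(String url) throws Exception {
        try (InputStream in = new URL(url).openStream()) {
            Tidy tidy = new Tidy();
            tidy.setQuiet(true);         // suppress progress output
            tidy.setShowWarnings(false); // tolerate the usual malformed HTML
            return tidy.parseDOM(in, null); // null: no cleaned-up output stream
        }
    }
}
```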

Multi-page Data Extraction is the second component, which is responsible for deriving the information flow control logic and the multi-page extraction logic. Both are represented in the form of rules. The former describes the flow control logic of the targeted service in responding to a service request, and the latter describes how to extract the information content of interest from the answer page and the linked pages of interest. XWRAPComposer performs the multi-page information extraction task in four steps: (1) specify the structure of the retrieved document (page object) in a declarative extraction rule language; (2) identify the interesting regions of the main page object and generate information extraction rules for this page; (3) identify the list of URLs referenced in the extracted regions of the main page; and (4) generate information extraction rules for each of the pages linked from the interesting regions of the main page object. We perform the single-page data extraction process using the XWRAPElite (DISL Group, Georgia Institute of Technology, 2000) toolkit, a single-page data extraction service developed by the XWRAP team at Georgia Tech. At the end of this phase, XWRAPComposer produces two specifications: an outerface specification that describes the output format of the extraction result, and a composer script that describes both the information flow control patterns and the multi-page data extraction patterns.

Code Generation and Packaging is the third component, which generates the wrapper program code by applying three sets of rules about the target service produced in the first two phases: (1) the search and remote invocation rules; (2) the request-respond flow control rules; and (3) the information extraction rules. A key technique in our implementation is the smart encoding of these three types of semantic knowledge in the form of an active XML-template format. The code generator interprets the XML-template rules by linking each executable component with the corresponding rule sets. The code generator also produces the XML representation of the retrieved sample page object as a by-product.

Debugging and Release is the fourth component and the final phase of the multi-page wrapping process. It allows the user to enter a set of alternative service requests to the same service provider to debug the generated wrapper program by running XWRAPComposer's code debugging module. For each page object obtained, the debugging module will automatically go through the syntactic structure normalization to rule out syntactic errors, and through the flow control and information extraction steps to check whether new or updated flow control rules or data extraction rules should be included. In addition, the debug-monitoring window will pop up to allow the user to browse the debug report. Whenever an update to any of the three sets of rules occurs, the debugging module will run the code generator to create a new version of the wrapper program. Once the user is satisfied with the test results, he or she may invoke the release step to obtain the release version of the wrapper program, including assigning the version release number and packaging the wrapper program with application plug-ins and user manual into a compressed tar file.

The XWRAPComposer wrapper generator takes the following three inputs: an interface specification, an outerface specification, and a composer script, and compiles them into a Java wrapper program, which can be further extended into either a multi-page data extraction Web service (with a WSDL specification) or a Ptolemy wrapper actor, which can be used for large-scale data integration.

In the next section, we focus our discussion primarily on the multi-page data extraction component of XWRAPComposer, and provide a walkthrough example to illustrate the multi-page extraction process, including a brief description of the wrapping interface and remote invocation component as the necessary preprocessing step for information extraction, and a short summary of code generation as the postprocessing step for the multi-page extraction.

EXAMPLE WALKTHROUGH

Before describing the detailed techniques used in designing multi-page data extraction services, we first present a walkthrough of XWRAPComposer using the motivating example introduced earlier.

Recall the workflow presented in Figure 1, where a biologist first uses a program called Clusfavor to cluster genes that have changed significantly in a micro-array analysis experiment. After extracting all gene IDs from the Clusfavor result, he feeds them into the NCBI Blast service, which searches all related sequences over a variety of data sources. The returned sequences will be further examined to find promoter sequences. Let us focus on the NCBI BLAST service. Figure 3 shows the workflow of how a BLAST service request to NCBI will be served. It consists of four steps: (1) the BLAST response step presents the user with a request ID; (2) the BLAST delay step presents the user with the time delay for the result; (3) the BLAST Summary step presents the user with an overview of all gene IDs that match well with the given gene sequence ID; and finally, (4) the BLAST Detail step shows, for each gene ID listed in the summary page, the full sequence detail. The goal is to extract approximately 1000-5000 bases of the DNA sequence around the alignment to capture the promoter regulatory elements, the region of a gene where RNA polymerase can bind and begin transcription to create the proteins that regulate cell function.

Figure 3. Scientific data integration example scenario

Figure 4 illustrates a typical BLAST query using the NCBI service (NCBI, 2003). A BLAST query involves five steps. The first step is to feed a gene sequence into the text entry of the query interface. Due to the time complexity of a BLAST search, the NCBI service provider typically returns a response page with a request ID and a first estimate of the waiting time for each BLAST search. The biologist may later ask NCBI for the BLAST results using the request ID (Step 2); the NCBI service presents a delay page if the BLAST search is not completed and results are not yet ready to display (Step 3). Once the BLAST results are delivered, they are displayed in a BLAST summary page, which contains a summary of all genes matching the search query condition. Each of the matching genes provides a link to the NCBI BLAST Detail page (Step 4). If the gene ID used for the BLAST query is incorrect or NCBI does not provide BLAST service for the given gene ID, an error page is displayed. If the summary page does not include the detailed information that the biologist is interested in, he has to visit each detail page (Step 5) through the URLs embedded in the summary page.

Figure 4. Multipage query with an NCBI Web site

A critical challenge in providing system-level support for scientists to achieve such complex data integration tasks is the problem of locating, accessing, and fusing information from a rapidly growing, heterogeneous, and distributed collection of data sources available on the Web. This is a complex search problem for two reasons. First, as the example in Figure 3 shows, scientists today have much more complex data collection requirements than ordinary surfers on the Web. They often want to collect a set of data from a sequence of searches over a large selection of heterogeneous data sources, and the data selected from one search step often forms the filter condition for the next search step, turning a keyword-based query into a sophisticated search and information extraction workflow. Second, such complex workflows are manually performed daily by scientists or data collection lab researchers (computer science specialists). Automating such complex search and data collection workflows presents three major challenges.

• Different service providers use different request-respond flow control logics to present the answer pages to search queries.

• Cross-page data extraction has more complex extraction logic than a single-page extraction system. In addition, different applications require different sets of data to be extracted by the cross-page data extraction engine. Typically, only portions of one page and the links that lead the extraction to the next page need to be extracted.

• Data items extracted from multiple inter-linked pages need to be associated with semantically meaningful naming conventions. Thus, mechanisms that can incorporate the knowledge of the domain scientists who issued such cross-page extraction jobs are critical.

There are several ways to design an NCBI BLAST wrapper. First, we can develop two wrappers, one for the NCBI BLAST Summary and one for the NCBI BLAST Detail. The NCBI BLAST Summary wrapper can be integrated with the NCBI BLAST Detail wrapper by service composition. In this approach, we need to capture the request-respond flow control through flow control logic in the composer script of the NCBI Summary wrapper.

The outerface specification of the NCBI Summary wrapper consists of the general overview of the given gene ID and the list of gene IDs that are relevant to the given gene ID. The NCBI BLAST Detail wrapper needs to extract approximately 1000-5000 bases of the DNA sequence around the alignment. The composite NCBI BLAST wrapper will be composed of the NCBI Summary wrapper and a list of executions of the NCBI BLAST Detail wrapper. In the next section we describe the XWRAPComposer design using this example.

MULTI-PAGE DATA EXTRACTION SERVICE

We have developed a methodology and a framework for extracting information from multiple pages connected via Web page links. The main idea is to separate what to extract from how to extract it, and to distinguish information extraction logic from request-respond flow control logic. The control logic describes the different ways in which a service request (query) could be answered by a given service provider. The data extraction logic describes the cross-page extraction steps, including what information is important to extract at each page and how such information is used as a complex filter in the next search and extraction step.

We use the interface description to specify the necessary input objects for wrapping the target service and the outerface description to describe what should be extracted and presented as the final result by the wrapper program. We design and develop an XWRAPComposer script language (a set of functional constructs) to describe the request-respond flow control logic and the multi-page data extraction logic. The script language also implements the output alignment and tagging of extracted data items based on the outerface specification.

The compilation process of XWRAPComposer includes generating code based on three sets of rules: (1) the remote connection and interface rules; (2) the request-respond flow control logic and multi-page extraction logic outlined in the composer script; and (3) the correct output alignment and semantically meaningful tagging based on the outerface specification.

Interface and Outerface Specification

The interface specification describes the schema of the data that the wrapper takes as input. It defines the source location and the service request (query) interface for the wrapper to be generated. The outerface specification describes the schema of the result that the wrapper outputs. It defines the type and structure of the objects extracted. The composer script consists of two sets of rule-based scripts. The request-respond flow control script describes the alternative ways in which the target service may respond to a remote service request, including result not found, multiple results found, single result found, or server errors. The multi-page data extraction script describes (1) the extraction rules for the main page, (2) the extraction rules for each of the interesting pages linked from the main page, and (3) the rules on how to glue the single-page data extraction components together. XWRAPComposer's scripting language has domain-specific plugins to facilitate the incorporation of domain-dependent correlations between the fragments of information extracted and the domain-specific tagging scheme. Each wrapper generated by XWRAPComposer is associated with an interface specification, an outerface description, and a composer script.

The design of the XWRAPComposer Interface and Outerface Specification serves two important objectives. First, it eases the use of XWRAPComposer wrappers as external services by any data integration application. Second, it facilitates the XWRAPComposer wrapper code generation system in generating Java code. Therefore, some components of the specification may not be directly useful for the users of these wrappers. In the first release of the XWRAPComposer implementation, we describe the input and output schemas of a multi-page (composite) wrapper in XML Schema and use the two XML schemas as the interface and outerface specifications. Concretely, the interface specification describes the wrapper name and which data provider's service needs to be wrapped by giving the source URL and other related information. The outerface specification describes what data items should be extracted and produced by the wrapper and the semantically meaningful names to be used to tag those data items. Figure 5 shows a fragment of the interface and outerface description of an example NCBI BLAST Summary wrapper (LDRD Team, 2004).

Multi-Page Data Extraction Script

The XWRAPComposer multi-page data extraction service generates a composer script for each wrapper it creates. Each composer script usually contains three types of root commands: document retrieval, data extraction, and post-processing. The document retrieval commands construct a file request or an HTTP request and fetch the document.

The data extraction commands specify the detailed instructions on how to extract information from the fetched document. The post-processing commands allow adding semantic filters to make the extracted results conform to the outerface specification.

The general usage of commands is as follows:

```
Generate <object id> :: <command name> (<input id>) {
    Set <property1 name> { <value> } [more value]
    Set <property2 name> { <value> }
    /* if the command is data extraction */
    [extraction code]
}
```

Here <object id> is the ID of the output object of the command, and <input id> is the ID of the input object. Both input and output objects are XML nodes. For example, FetchDocument returns the content of a Web page, which is a text node in XML. Each command specifies a set of built-in properties. A <value> can be a string value, enclosed in a pair of quotes, such as "this is a string value", or an XPath expression, enclosed in a pair of brackets, such as [detailLink/text()]@<xpathroot>. The value of <xpathroot> should be either an <input id> or an <object id> generated by a previous command.

If the command is used for data extraction, such as ExtractLink and ExtractContent, the detailed extraction logic needs to be specified. The main command type for the extraction script is the grab functions. XWRAPComposer also provides miscellaneous commands for request-respond flow control, process management, and Boolean comparison.
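Resolving a property value such as [detailLink/text()]@<xpathroot> against its xpathroot object amounts to a standard XPath evaluation. A sketch using the JAXP XPath API follows; this is our illustration, not the toolkit's internal code.

```java
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;

// Sketch: how a bracketed script value like [detailLink/text()] could be
// resolved at run time against the node named by <xpathroot>.
public class XPathValueResolver {
    private final XPath xpath = XPathFactory.newInstance().newXPath();

    // root: the XML node named by <xpathroot>; expr: the bracketed XPath.
    public String resolve(Node root, String expr) throws Exception {
        return xpath.evaluate(expr, root); // returns the string value
    }
}
```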

In order to output XML data more flexibly, an XSL style sheet may be applied to any XML object using the ApplyStyleSheet command. Table 1 shows the list of commands that are currently supported in the first release of the XWRAPComposer toolkit (DISL Group, Georgia Institute of Technology, 2003).
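Under the hood, an ApplyStyleSheet command corresponds to a standard XSLT transformation. A minimal JAXP sketch follows; it is ours, not the toolkit's implementation.

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Node;
import java.io.File;
import java.io.StringWriter;

// Sketch of what ApplyStyleSheet boils down to: run an XSL style sheet
// over an XML object and capture the transformed output as a string.
public class ApplyStyleSheetSketch {
    public static String apply(Node xmlObject, File styleSheet) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                                          .newTransformer(new StreamSource(styleSheet));
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(xmlObject), new StreamResult(out));
        return out.toString();
    }
}
```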


```xml
<XCwrapper name="XC BlastN Summary"
           sourceURL="http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?PAGE=Nucleotides">
  <interface>
    <!-- input schema in XML Schema -->
    <xsd:element name="input">
      <xsd:complexType>
        <xsd:sequence>
          <xsd:element name="select db" type="string"/>
          <xsd:element name="query sequence" type="string"/>
        </xsd:sequence>
      </xsd:complexType>
    </xsd:element>
  </interface>
  <outerface>
    <!-- output schema in XML Schema -->
    <xsd:element name="resultDoc">
      <xsd:complexType>
        <xsd:element name="output">
          <xsd:complexType>
            <xsd:choice minOccurs="0" maxOccurs="unbounded">
              <xsd:element name="homolog">
                <xsd:complexType>
                  <xsd:sequence>
                    <xsd:element name="geneid" type="string"/>
                    <xsd:element name="description" type="string"/>
                    <xsd:element name="length" type="int"/>
                    <xsd:element name="score" type="string"/>
                    <xsd:element name="expect" type="string"/>
                    <xsd:element name="identities" type="string"/>
                    <xsd:element name="strand" type="string"/>
                    <xsd:element name="link" type="string"/>
                    <xsd:element name="beginMatch" type="int"/>
                    <xsd:element name="endMatch" type="int"/>
                    <xsd:element name="alignment" type="string"/>
                  </xsd:sequence>
                </xsd:complexType>
              </xsd:element>
            </xsd:choice>
          </xsd:complexType>
        </xsd:element>
        <xsd:attribute name="docLocation" type="string"/>
        <xsd:attribute name="docType" type="string"/>
        <xsd:attribute name="createdBy" type="string"/>
        <xsd:attribute name="creationDate" type="string"/>
      </xsd:complexType>
    </xsd:element>
  </outerface>
</XCwrapper>
```

Figure 5. Example of interface and outerface specification — NCBI Summary



Figure 6 gives an extraction script example for the NCBI Summary wrapper. Given a full sequence as the input, we first construct an NCBI Blast search URL based on the NCBI Blast interface description. The script fragment Set variable { [text()] } indicates that the sequence is in the input with the XPath text(). The first script command, FetchDocument, retrieves the NCBI Blast response page that contains a request ID. We extract the ID and construct the URL of the search results from the main page object. The control-flow command while...do... periodically invokes the second FetchDocument to retrieve the result page until the results are delivered. Finally we use GrabXWRAPEliteData to extract useful data from the main result page. We use the command ExtractLink to locate each of the linked pages of interest from the main page object, and use the command ExtractContent to invoke the XWRAPElite single-page data extraction service to extract useful data from each linked page. Due to space restrictions, we omit the concrete techniques used in XWRAPComposer for single-page data extraction and refer readers to Buttler, Liu, and Pu (2001) and Wei (2003) for further detail.

Figure 6. Extraction script example for NCBI Summary
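In plain Java terms, the script's ExtractLink/ExtractContent steps correspond to the loop sketched below; this is our paraphrase of the generated control flow, and all helper names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Paraphrase of the multi-page control flow driven by the composer script:
// locate the links of interest in the main result page, then run single-page
// extraction on each linked page. All helper methods are hypothetical stubs.
public class MultiPageLoopSketch {
    public List<String> extractAll(String summaryPage) throws Exception {
        List<String> extracted = new ArrayList<>();
        // ExtractLink: locate each linked page of interest in the main page.
        for (String detailUrl : extractLinks(summaryPage)) {
            String detailPage = fetchDocument(detailUrl); // FetchDocument
            extracted.add(extractContent(detailPage));    // ExtractContent
        }
        return extracted;
    }

    List<String> extractLinks(String page) { return new ArrayList<>(); }
    String fetchDocument(String url) { return ""; }
    String extractContent(String page) { return ""; }
}
```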

Table 1. Supported XWRAPComposer extraction root commands

Code Generation

XWRAPComposer generates its wrapper programs in two steps. First, it reads the user-specified interface, outerface, and composer script, and generates an XWRAPComposer wrapper, which contains the Java source code, an executable Java program, and a set of configuration files. The configuration files include the input and output schemas obtained from the interface and outerface specifications of the wrapper, and the resource files used in the data extraction phase, such as XSLT files. Concretely, the code generation phase consists of three main functions, as shown in Figure 2. The code generation process starts with reading the interface specification and generating the code for search interface construction, followed by generating the remote invocation method to establish the remote connection.

Then, the code generator generates the Java code that implements the request-respond flow control logic described in the composer script. For each possible request-respond state, the code for parsing the corresponding respond page is generated. Furthermore, based on the extraction logic specified for each of the possible respond pages, the code generator can generate the data extraction code fragment for each respond page and generate the glue code to compose the list of single-page data extraction code fragments into a multi-page data extraction routine.

The third functional component generates debugging and release code to support an iterative process of testing, fixing bugs, repackaging, and release. An XWRAPComposer user may feed a series of input pages to the debugging and release module to debug the wrapper program generated by XWRAPComposer. For each input page, the debugging module will automatically go through search interface construction, remote connection establishment, document parsing, and multi-page data extraction to check whether the expected output (specified in the outerface description) is returned. Once the user is satisfied with the test results, he or she may choose to release the generated wrapper program, which contains the Java source code, configuration files, the release version number, the required jar files (Java executables), and the user manual.

Execution Model of an XWRAPComposer Wrapper

A typical XWRAPComposer wrapper consists of the following five basic functional modules.

The Search Interface module accepts the user input through the protocols defined by the user, such as a SOAP request in the Web service scenario. It constructs the service request (query command) and parameter list that will be forwarded to the wrapped target service. Consider the NCBI BLAST wrapper: its search interface accepts the gene sequence and other parameters, such as alignment precision, from the input specification file or GUI interface. It composes the HTTP POST command, which will be used to execute the query.

The Remote Invocation module accepts the service request (query command) and parameters generated by the search interface and converts them into a query acceptable by the wrapped target service. The query can be an HTTP POST command, an FTP GET command, or an RPC call. The remote invocation module interacts with the wrapped target service following the remote connection protocol defined by the wrapped target service and the communication procedure defined by the configuration file. The query result page will be forwarded to the parser for preprocessing before entering the multi-page data extraction module.
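For an HTTP POST target, converting the parameter list into the source's native query is mostly a matter of URL encoding. A sketch follows; the endpoint and parameter names are illustrative assumptions.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

// Sketch of the remote invocation step for an HTTP POST service: encode
// the parameters produced by the search interface and send the request.
// The endpoint and parameter names are illustrative only.
public class RemoteInvocationSketch {
    public static HttpURLConnection post(String endpoint, String sequence) throws Exception {
        String body = "QUERY=" + URLEncoder.encode(sequence, "UTF-8")
                    + "&DATABASE=" + URLEncoder.encode("nr", "UTF-8");
        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes("UTF-8"));
        }
        return conn; // caller reads the respond page from conn.getInputStream()
    }
}
```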

The Page Parser translates the result page received from the remote invocation module into a token tree structure, filters out uninteresting information such as advertisements from Web pages, and converts the received document into a standard format such as HTML or XML. In addition to building a token-based parse tree, the page parser should incorporate the domain-specific knowledge about the page encoded in the composer script to facilitate the data extraction process. For multi-page wrappers, the page parser will parse the main respond page based on its extraction rules and locate the list of linked pages of interest. For each of the linked pages of interest, the parser triggers the remote invocation module to fetch the actual page and parses the page based on its corresponding extraction rules.

The Information Extraction module processes each of the parsed documents passed from the parser and extracts the objects of interest defined by the outerface specification. It uses the domain-specific knowledge about the pages of interest, encoded in the composer extraction script, to guide the concrete multi-page data extraction process. For each extracted data object, the XML tagging procedure is applied to assign a tag name to the object based on the tagging rules encoded in the composer script.
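The tagging procedure itself is straightforward DOM construction. A sketch of assigning an outerface tag name to one extracted value follows; the element names echo the outerface in Figure 5, and the value shown is illustrative.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Sketch of the XML tagging procedure: wrap an extracted value in the
// element name dictated by the outerface specification.
public class TaggingSketch {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                                             .newDocumentBuilder().newDocument();
        Element homolog = doc.createElement("homolog");
        Element geneId = doc.createElement("geneid");
        geneId.setTextContent("example-gene-id"); // extracted value goes here
        homolog.appendChild(geneId);
        doc.appendChild(homolog);
    }
}
```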

The Output Packaging and Delivery module merges the output from the information extraction module and packages it into the final result format defined by the outerface specification. Then it delivers the data package to the user who initiated the execution of the wrapper program.

The first prototype of the XWRAPComposer system is written in Java. Wrappers generated by XWRAPComposer are also coded in Java. In our first prototype implementation, the five components execute sequentially: a component starts execution only after the previous component finishes. The next extension of the XWRAPComposer code generation system is to introduce parallel execution among these five components. Parallel execution improves performance, but it also incurs higher implementation complexity.

Figure 7 and Figure 8 demonstrate two XWRAPComposer wrappers and their mini-workflow structure. The GUI interface is developed using Ptolemy (Berkeley, 2003), a process modeling tool. Each wrapper can be used as a Ptolemy actor (see the left menu on the screen shot) and is composed of four steps: StartWrapping initializes all the environment parameters and triggers ReadInputFile to read a gene ID from a specified input file. The gene ID is then sent to the NCBISummary Wrapper actor, which performs the wrapping function upon receiving a BLAST service request with the given gene ID, and returns the set of IDs of related genes as results. The last step is XMLDisplay, which pops up a window to present the wrapping results. Figure 9 and Figure 10 show the results of the NCBI BLAST Summary wrapper and the NCBI BLAST Detail wrapper, respectively.

Figure 7. NCBI BLAST Summary wrapper

Figure 8. NCBI BLAST Detail wrapper

Figure 9. Ptolemy wrapper actor result example — NCBI BLAST Summary

Figure 10. Ptolemy wrapper actor result example — NCBI BLAST Detail

WSDL-Enabled Wrappers

XWRAPComposer is developed with two objectives in mind. First, we want to generate wrapper programs that can be used on the command line or embedded in an application system as a wrapper procedure. This approach provides end users with the flexibility of customizing their systems by using XWRAPComposer wrapper programs as building blocks.

However, end users then have to use the Java programming language for their system implementation, because the generated XWRAPComposer wrapper programs are in Java. To free the end user from the reliance on a chosen programming language like Java, we want XWRAPComposer to be able to generate WSDL-enabled wrappers that allow each wrapper program to be used as a Web service (W3C, 2002), which is our second objective. We chose Web services because the technology has been proposed and successfully adopted by many systems for providing platform-independent and programming-language-independent service access. End users can implement their client applications with full flexibility as long as their systems can access our server using the SOAP protocol. Our discussion so far has focused on the first objective. In this section we briefly describe how to generate WSDL-enabled wrappers.

In order to enable XWRAPComposer to generate WSDL-enabled wrapper services, we add two extensions to the XWRAPComposer wrapper generation system. First, we encapsulate an XWRAPComposer wrapper into a general Web service servlet. The servlet automatically extracts the input from a SOAP request, feeds it into the wrapper, and inserts the wrapping results into a SOAP envelope before sending them back to the user. In this sense, XWRAPComposer wrappers work as service providers to end users; when they interact with wrapped data sources, they act as clients of those services. Second, to ease the implementation and deployment of XWRAPComposer wrappers as Web services, we incorporate a WSDL generator to automatically generate the Web service description by binding the wrapper's interface and outerface with the servlet configuration. Figure 11 shows the extensions added to XWRAPComposer to produce wrappers as WSDL Web services.
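As an illustration of the first extension, here is a minimal sketch of such a Web service servlet written against the standard servlet and SAAJ APIs. The element name wrapperResult, the body-extraction convention, and the invokeWrapper() stand-in are assumptions for illustration, not the actual generated servlet.

import java.io.*;
import javax.servlet.http.*;
import javax.xml.soap.*;

// Sketch of the general Web service servlet: unpack the SOAP request,
// invoke the wrapper, and return the wrapping result in a SOAP envelope.
public class WrapperServlet extends HttpServlet {
    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        try {
            MessageFactory mf = MessageFactory.newInstance();
            SOAPMessage request = mf.createMessage(null, req.getInputStream());

            // Extract the wrapper input from the SOAP body (assumed here
            // to be the text content of the first body element).
            String input = request.getSOAPBody().getFirstChild().getTextContent();

            String resultXml = invokeWrapper(input);

            // Insert the wrapping result into a SOAP envelope and reply.
            SOAPMessage response = mf.createMessage();
            response.getSOAPBody()
                    .addChildElement("wrapperResult")
                    .addTextNode(resultXml);
            resp.setContentType("text/xml");
            response.writeTo(resp.getOutputStream());
        } catch (SOAPException e) {
            throw new IOException("SOAP processing failed: " + e);
        }
    }

    // Stand-in for the call into the generated wrapper (hypothetical API).
    private String invokeWrapper(String input) {
        return "<result input=\"" + input + "\"/>";
    }
}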

Wrapper Program Repository

As part of the XWRAPComposer effort, we have designed and developed an online wrapper generation and registration system to assist the use of XWRAPComposer wrappers and to simplify wrapper generation and management overhead. All wrappers generated by XWRAPComposer can be registered directly in our online wrapper repository. A snapshot of this repository is shown in Figure 12.

Figure 11. Web-service enabled wrappers

Figure 12. XWRAPComposer online wrapper repository

Consider the first wrapper shown in Figure 12. The target service provider is http://fugu.hgmp.mrc.ac.uk/blast/, which provides a standard BLAST interface. After obtaining the XWRAPComposer wrapper source code and jar file, the user can upload this wrapper through an online registration interface, available at http://disl.cc.gatech.edu/ldrdscript/html/registerwrapper.htm. One can download the generated wrapper source code directly by clicking on the wrapper code column of the corresponding target service provider. Using the XWRAPComposer library and the composer scripts that we released, this wrapper source code can be compiled on the user's local machine and executed as a command-line Java application. A user can also use our online wrapper execution interface to execute each registered wrapper either as a servlet or as a Web service. An example of an online execution result is given in Figure 13. All XWRAPComposer wrappers for BLASTN services present a uniform interface to end users, which facilitates the large-scale integration of multiple BLASTN services.

Figure 13. XWRAPComposer wrapper execution result: An example

RELATED WORK

The very nature of scientific research and discovery leads to the continuous creation of information that is new in content, representation, or both. Despite efforts to fit molecular biology information into standard formats and repositories such as the PDB (Protein Data Bank) and NCBI, the number of databases and their content have kept growing, pushing the envelope of standardization efforts such as mmCIF (Westbrook & Bourne, 2000). Providing integrated and uniform access to these databases has been a serious research challenge. Several efforts (Critchlow, Fidelis, Ganesh, Musick, & Slezak, 2000; Davidson et al., 1999; Goble et al., 2001; Haas et al., 2001; McGinnis, 1998; Siepel et al., 2001) have sought to alleviate the interoperability issue by translating queries from a uniform query language into the native query capabilities supported by the individual data sources. Typically, these previous efforts address the interoperability problem from a digital library point of view; that is, they treat individual databases as well-known sources of existing information. While they provide a valuable service, due to the growing rate of scientific discovery, an increasing amount of new information (the kind of hot-off-the-bench information that scientists would be most interested in) falls outside the capability of these previous interoperability systems or services.

Wrappers have been developed either manually or with software assistance, and used as components of agent-based systems, sophisticated query tools, and general mediator-based information integration systems (Wiederhold, 1992; Liu & Pu, 1997; Liu, Pu, & Lee, 1996). For instance, the most documented information mediator systems (e.g., Ariadne (Knoblock et al., 1998), CQ (Liu, Pu, & Tang, 1999; Liu et al., 1998), Internet Softbots (Kushmerick, Weld, & Doorenbos, 1997), TSIMMIS (Garcia-Molina et al., 1997; Hammer et al., 1997), and Araneus (Atzeni, Mecca, & Merialdo, 1997)) all assume a pre-wrapped set of Web sources. However, developing and maintaining wrappers by hand is labor intensive and error-prone, due to technical difficulties such as undocumented HTML/XML tags and subtle variations in content (small to human perception, but difficult for programs).

Tova Milo and Sagit Zohar (Milo & Zohar, 1998) use schema matching to simplify wrapper generation when both the source schema and the result schema are available. They observe that in many cases the schema of the data in the source system is very similar to the result schema; in such cases, much of the translation work can be done automatically based on the schema similarity. They define a middleware schema, and each data source to be used in their system needs a mapping of its data and schema to (or from) the middleware format. They develop an algorithm to match and translate the objects in the source with the objects in the result by comparing the two instances of the middleware schema. Since most Web pages are still in schema-less HTML, schema matching does not apply directly. However, as more XML information appears on the Web, this approach will speed up wrapper generation, since XML documents carry schema information.

The SoftBot work (Kushmerick, 1997) developed a wrapper generation system using inductive learning techniques. Several generic wrapper classes with adjustable parameters are predefined in the wrapper generation system. Each wrapper class can extract information from one document pattern. Wrapper developers highlight interesting sections in many sample documents, and a machine learning algorithm then adjusts those parameters to find a combination of wrapper classes that extracts the highlighted sections correctly. If such a combination is not available, the algorithm returns the best combination with the fewest mistakes. The developers can either correct the best combination manually or add more wrapper classes fitting new patterns to find a completely correct combination.

NoDoSE also adopts the inductive learning technique. Using a GUI, the user hierarchically decomposes a plain text file, outlining its regions of interest and then describing their semantics. The task is expedited by a mining component that attempts to infer the grammar of the file from the information the user has identified so far.

XWRAPComposer is different from those systems in three aspects. First, we explicitly separate the tasks of building wrappers that are specific to a Web service from the tasks that are repetitive for any service, so the repetitive code can be generated as a wrapper library component and reused automatically by the wrapper generator system. Second, we use inductive learning algorithms that derive information flow and data extraction patterns by reasoning about sample pages or sample specifications. Most importantly, we design a declarative rule-based script language for multi-page information extraction, encouraging a clean separation of the information extraction semantics from the information flow control and execution logic of wrapper programs.

CONCLUSION

Both enterprise systems and science and engineering integration applications require gathering information from multiple, heterogeneous information services. Although Web service technology such as WSDL, SOAP, and UDDI has provided a standardized remote invocation interface, other types of heterogeneity remain in terms of query capability, content structure, and content delivery logic, due to the inherent diversity of different services.

A popular approach to handling such heterogeneity is to use wrappers as mediators that automate the collection and extraction of data from multiple, diverse data providers.

We have described a service-oriented framework for the development of wrapper code generators and a concrete implementation, called XWRAPComposer, to evaluate our framework in the context of bioinformatics applications. Three unique features distinguish XWRAPComposer from existing wrapper development approaches. First, XWRAPComposer is designed to enable multi-stage and multi-page data extraction. Second, XWRAPComposer is the only wrapper generation system that promotes the separation of information extraction logic from request-response flow control logic, allowing a higher level of robustness against changes in the service provider's Web site design or infrastructure. Third, XWRAPComposer provides a user-friendly plug-and-play interface, allowing seamless incorporation of external services and continuously changing service interfaces and data formats.

The XWRAPComposer project continues along three dimensions. First, we are interested in extending XWRAPComposer's code generation capability to allow wrappers to be generated for a wide variety of complex data sources. Second, we are interested in exploring data provenance techniques for large-scale data integration; in particular, in extracting data provenance information from the vast amount of data content provided by many scientific data service providers. We believe that data provenance information is critical for facilitating the scientific data integration process and improving scientific data integration quality. Third but not least, we are working on providing a user-friendly GUI to support interactive specification of the interface, outerface, and composer script.

ACKNOWLEDGMENTS

This work is a joint effort between the Georgia Institute of Technology team led by Ling Liu and the LLNL team led by Terence Critchlow. The work performed by the authors from Georgia Tech was partially sponsored by DoE SciDAC, LLNL LDRD, NSF CSR, an IBM faculty award, and an IBM SUR grant. The work by the authors from LLNL was performed under the auspices of the U.S. Department of Energy by the University of California, Lawrence Livermore National Laboratory, under contract No. W-7405-ENG-48, UCRL-JRNL-218270. Any opinions, findings, and conclusions or recommendations expressed in the project material are those of the authors and do not necessarily reflect the views of the sponsors.

REFERENCES

Altschul, S., Madden, T., Schaffer, A., Zhang, J., Zhang, Z., Miller, W., et al. (1997). Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Research, 25, 3389-3402.

Atzeni, P., Mecca, G., & Merialdo, P. (1997). Semi-structured and structured data in the Web: Going back and forth. Proceedings of the ACM SIGMOD Workshop on Management of Semi-Structured Data, Tucson, Arizona.

Baumgartner, R., Flesca, S., & Gottlob, G. (2001). Visual Web information extraction with Lixto. Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), Rome, Italy.

Bayardo, R. J., Jr., Bohrer, W., Brice, R., Cichocki, A., Fowler, J., Helal, A., et al. (1997). Semantic integration of information in open and dynamic environments. Proceedings of the ACM SIGMOD Conference, Tucson, Arizona.

Berkeley. (2003). Ptolemy group in EECS. Retrieved from http://ptolemy.eecs.berkeley.edu/

Buttler, D., Liu, L., & Pu, C. (2001). A fully automated object extraction system for the World Wide Web. Proceedings of the 2001 International Conference on Distributed Computing Systems (ICDCS), Phoenix, Arizona.

Critchlow, T., Fidelis, K., Ganesh, M., Musick, R., & Slezak, T. (2000). DataFoundry: Information management for scientific data. IEEE Transactions on Information Technology in Biomedicine, 4(1), 52-57.

Davidson, S., Buneman, O., Crabtree, J., Tannen, V., Overton, G., & Wong, L. (1999). BioKleisli: Integrating biomedical data and analysis packages. In S. Letovsky (Ed.), Bioinformatics: Databases and systems (pp. 201-211). Norwell, MA: Kluwer Academic.

DBCAT. (1999). The public catalog of databases. Retrieved from http://www.infobiogen.fr/services/dbcat

DISL Group, Georgia Institute of Technology. (2000). XWRAP Elite Project. Retrieved from http://www.cc.gatech.edu/projects/disl/XWRAPElite

DISL Group, Georgia Institute of Technology. (2003). XWRAPComposer. Retrieved from http://www.cc.gatech.edu/projects/disl/XWRAPComposer/

Garcia-Molina, H., Papakonstantinou, Y., Quass, D., Rajaraman, A., Sagiv, Y., Ullman, J. D., et al. (1997). The TSIMMIS approach to mediation: Data models and languages. Journal of Intelligent Information Systems, 8(2), 117-132.

Goble, C. A., Stevens, R., Ng, G., Bechhofer, S., Paton, N., Baker, P. G., et al. (2001). Transparent access to multiple bioinformatics information sources. IBM Systems Journal, 40(2), 532-551.

Haas, L., Kossmann, D., Wimmers, E., & Yan, J. (1997). Optimizing queries across diverse data sources. Proceedings of the 23rd International Conference on Very Large Databases (VLDB), Athens, Greece.

Haas, L., Schwarz, P., Kodali, P., Kotlar, E., Rice, J., & Swope, W. (2001). DiscoveryLink: A system for integrated access to life sciences data sources. IBM Systems Journal, 40(2), 489-511.

Hammer, J., Brennig, M., Garcia-Molina, H., Nesterov, S., Vassalos, V., & Yerneni, R. (1997). Template-based wrappers in the TSIMMIS system. Proceedings of the ACM SIGMOD Workshop on Management of Semi-Structured Data, Tucson, Arizona.

Knoblock, C. A., Minton, S., Ambite, J. L., Ashish, P. J. M. N., Muslea, I., Philpot, A. G., et al. (1998). Modeling Web sources for information integration. Proceedings of the Fifteenth National Conference on Artificial Intelligence, Madison, WI.

Kushmerick, N., Weld, D. S., & Doorenbos, R. (1997). Wrapper induction for information extraction. Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Nagoya, Japan.

Kushmerick, N. (1997). Wrapper induction for information extraction. Unpublished doctoral dissertation, University of Washington.

LDRD Team. (2004). LDRD Project. Retrieved from http://www.cc.gatech.edu/projects/disl/LDRD

Li, C., Yerneni, R., Vassalos, V., Garcia-Molina, H., Papakonstantinou, Y., Ullman, J., et al. (1997). Capability based mediation in TSIMMIS. Proceedings of the ACM SIGMOD Conference, Tucson, Arizona.

Liu, L., & Pu, C. (1997). An adaptive object-oriented approach to integration and access of heterogeneous information sources. Distributed and Parallel Databases: An International Journal, 5(2), 167-205.

Liu, L., Pu, C., & Han, W. (1999). XWrap: An extensible wrapper construction system for Internet information sources. Technical report.

Liu, L., Pu, C., & Lee, Y. (1996). An adaptive approach to query mediation across heterogeneous databases. Proceedings of the International Conference on Cooperative Information Systems, Brussels, Belgium.

Liu, L., Pu, C., & Tang, W. (1999). Continual queries for Internet-scale event-driven information delivery. IEEE Transactions on Knowledge and Data Engineering, 11(4), 610-628.

Liu, L., Pu, C., Tang, W., Biggs, J., Buttler, D., Han, W., et al. (1998). CQ: A personalized update monitoring toolkit. Proceedings of the ACM SIGMOD Conference, Seattle, Washington.


McGinnis, S. (1998). GenBank user services, National Center for Biotechnology Information (NCBI), National Library of Medicine, US National Institutes of Health. Personal communication.

Milo, T., & Zohar, S. (1998). Using schema matching to simplify heterogeneous data translation. Proceedings of the 24th International Conference on Very Large Data Bases (VLDB), New York.

NCBI. (2003). National Center for Biotechnology Information. Retrieved from http://www.ncbi.nlm.nih.gov/BLAST/

Peterson, L. (2002). CLUSFAVOR. Baylor College of Medicine. Retrieved from http://mbcr.bcm.tmc.edu/genepi/

Quandt, K., Frech, K., Karas, H., Wingender, E., & Werner, T. (1995). MatInd and MatInspector: New fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucleic Acids Research, 23, 4878-4884.

Raggett, D. (1999). Clean up your Web pages with HTML TIDY. Retrieved from http://www.w3.org/People/Raggett/tidy/

Sahuguet, A., & Azavant, F. (1999). WysiWyg Web Wrapper Factory (W4F). Proceedings of the World Wide Web (WWW) Conference, Orlando, Florida.

Siepel, A. C., Tolopko, A. N., Farmer, A. D., Steadman, P. A., Schilkey, F. D., Perry, B., et al. (2001). An integration platform for heterogeneous bioinformatics software components. IBM Systems Journal, 40(2), 570-591.

Stein, L. D., & Thierry-Mieg, J. (1999). Scriptable access to the Caenorhabditis elegans genome sequence and other ACEDB databases. Genome Research, 8, 1308-1315.

W3C. (1999). Reformulating HTML in XML. Retrieved from http://www.w3.org/TR/WD-html-in-xml/

W3C. (2002). Web services. Retrieved from http://www.w3c.org/2002/ws/

W3C. (2003). Web Services Description Language (WSDL) version 1.2 part 1: Core language. Retrieved from http://www.w3c.org/TR/wsdl12/

Wei, H. (2003). Wrapper application generation for the Semantic Web: An XWRAP approach. Unpublished doctoral dissertation, Georgia Institute of Technology.

Westbrook, J., & Bourne, P. (2000). STAR/mmCIF: An extensive ontology for macromolecular structure and beyond. Bioinformatics, 16(2), 159-168.

Wiederhold, G. (1992). Mediators in the architecture of future information systems. IEEE Computer, 25(3), 38-49.

ENDNOTE

1 A block here refers to a line of 256 characters or a transfer unit defined implicitly by the HTTP protocol.

Ling Liu is an associate professor at the College of Computing, Georgia Institute of Technology. She directs the research programs in the Distributed Data Intensive Systems Lab (DiSL), examining research issues and technical challenges in building large-scale distributed computing systems that can grow without limits. Dr. Liu and the DiSL research group have been working on various aspects of distributed data-intensive systems, ranging from decentralized overlay networks, exemplified by peer-to-peer computing and data grid computing, to mobile computing systems and location-based services, sensor network computing, and enterprise computing systems. She has published over 150 international journal and conference articles. Her research group has produced a number of software systems that are either open source or directly accessible online, among which the most popular are WebCQ and XWRAPElite. Most of Dr. Liu's current research projects are sponsored by NSF, DoE, DARPA, IBM, and HP. She is on the editorial board of several international journals, such as IEEE Transactions on Knowledge and Data Engineering, International Journal of Very Large Database Systems (VLDBJ), and International Journal of Web Services Research. She has chaired a number of conferences as a PC chair, a vice PC chair, or a general chair, including the IEEE International Conference on Data Engineering (ICDE 2004, ICDE 2006, ICDE 2007), the IEEE International Conference on Distributed Computing Systems (ICDCS 2006), and the IEEE International Conference on Web Services (ICWS 2004).

Jianjun Zhang is currently a PhD student in the College of Computing at the Georgia Institute of Technology. His research interests include Web services and wide-area networked computing systems and applications. Jianjun Zhang received his ME and BS from the Department of Computer Science, Wuhan University, China (1999 and 1996, respectively). His PhD dissertation research is focused on efficient and reliable multicast techniques in large-scale networked computing systems.

Wei Han received his PhD in 2003 from the College of Computing, Georgia Institute of Technology. His PhD research was on automated wrapper code generation. Wei Han received his BS from Tsinghua University, China (1997) and his MS in computer science and engineering from the Oregon Graduate Institute (1999). He now works for IBM Research, Almaden Research Center.

Calton Pu holds the position of professor and John P. Imlay, Jr. Chair in Software at the Georgia Institute of Technology. He has published extensively in operating systems, transaction processing, and Internet data management. His current research is mainly sponsored by NSF, DARPA, the Department of Energy, HP, and IBM. He received his PhD in computer science from the University of Washington in 1986. He is a member of ACM, a senior member of IEEE, and a fellow of AAAS.

James Caverlee is currently a PhD student in the College of Computing at the Georgia Institute of Technology. His research interests are focused on Web information management and retrieval, Web services, and Web-based data-intensive systems and applications. James graduated magna cum laude from Duke University in 1996 with a BA in economics. He received an MS in engineering-economic systems & operations research in 2000 and an MS in computer science in 2001, both from Stanford University. His PhD dissertation research is focused on Web information retrieval and Web data mining, including efficient and spam-resilient Web search algorithms.

Sungkeun Park was a master's student in the College of Computing at the Georgia Institute of Technology and a member of the Distributed Data Intensive Systems Lab (DiSL). Sungkeun has worked on reputation trust in peer-to-peer systems and Web-service-enabled wrapper code generation systems.

Terence Critchlow is the team lead for the BioEncyclopedia effort within the Biodefense Knowledge Center (BKC) at the Lawrence Livermore National Laboratory (LLNL). In this capacity, Dr. Critchlow is leading a large team of researchers and developers in an effort to integrate relevant biodefense information into a consistent environment for use by BKC analysts. Prior to working with the BKC, Dr. Critchlow led several research projects focusing on improving scientists' interactions with large data sets. He obtained a PhD in computer science from the University of Utah in 1997 and has been a member of the Center for Applied Scientific Computing at LLNL ever since.
David Buttler received his PhD in 2003 from the College of Computing, Georgia Institute of Technology. He has been working in the Data Science Group, Center for Applied Scientific Computing, Lawrence Livermore National Laboratory (LLNL) since his PhD graduation. His research interests are in information management for distributed data-intensive systems. In particular, he is interested in information discovery, update monitoring, and source selection. Dr. Buttler earned a BSc in computer science from the University of Alberta in 1998 and a BS in mathematics from Andrews University in 1995.

Dr. Matthew Coleman is currently a project leader in the Biomedical Division within Biosciences at Lawrence Livermore National Laboratory (LLNL), working on understanding ionizing radiation effects by comparing gene expression and proteomic changes in exposed cells. His technical training is in molecular biology; he received his BA from the University of Massachusetts in 1987 and his PhD in biophysical studies of proteins from Boston University in 1997. Dr. Coleman then completed a post-doctoral fellowship at LLNL developing and applying genomic technologies. He has authored over 50 publications in peer-reviewed journals, published abstracts, and book chapters covering a diverse breadth of molecular and biochemical biology as well as bioinformatics. He is the key domain scientist of the DoE SciDAC project and the XWRAPComposer project.


APPENDIX

XWRAPComposer Extraction Script Commands

The first release of XWRAPComposer supports seven categories of commands, listed as follows.

(1) Document Retrieval Commands

We support the following document retrieval commands in the first release of the XWRAPComposer toolkit.

ConstructHTTPQuery

This command constructs an HTTP query that contains three components: URL, queryString, and HTTP method. It has four properties: URL, queryString, httpMethod, and vars. The first three properties are templates with placeholders ("$$"); the last property is a list of strings to replace the placeholders.
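The exact script syntax appears in the original article's figures; as an illustration of the substitution semantics, the following Java sketch fills "$$" placeholders from the vars list, left to right. The queryString template here is invented for this example.

import java.util.*;

// Illustrative sketch of ConstructHTTPQuery placeholder substitution:
// each "$$" in a template is replaced by the next string in vars.
public class ConstructHttpQueryDemo {
    static String fill(String template, Iterator<String> vars) {
        StringBuilder out = new StringBuilder(template);
        int pos;
        while ((pos = out.indexOf("$$")) >= 0 && vars.hasNext()) {
            out.replace(pos, pos + 2, vars.next());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Iterator<String> vars = Arrays.asList("NM_000546").iterator();
        String url = fill("http://www.ncbi.nlm.nih.gov/BLAST/", vars);
        String queryString = fill("QUERY=$$&PROGRAM=blastn", vars); // invented template
        String httpMethod = fill("post", vars); // a template with no placeholders
        System.out.println(httpMethod + " " + url + "?" + queryString);
    }
}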




ConstructFileQuery

This command constructs a file request, which contains only one property, the file name.

FetchDocument

This command takes a file request or an HTTP request as input and returns the content of the file or the Web page. It does not have any properties.

(2) Data Extraction

Currently we support two types of data extraction commands: one extracts links and the other extracts content.

ExtractLink

This command extracts an HTTP request from the input. It usually needs to extract the URL, queryString, and httpMethod. If queryString and httpMethod are not extracted, default values are used: the default httpMethod is "get" and the default queryString is the empty string.

ExtractContent

This command extracts content from the input, which may contain all kinds of data. The extraction method is specified with grab functions and XML construction commands, as well as other commands.


(3) Grab Functions

Grab functions are designed to facilitate text parsing and string analysis during the extraction process. We support the following four grab functions.

GrabSubstring

Assuming the input is a text node, this command extracts a substring using two properties, beginMatch and endMatch. It returns the string between beginMatch and endMatch. If there are multiple matching strings, only the first one is chosen.
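A minimal Java sketch of these semantics, with an invented input string:

// Return the text between the first beginMatch and the next endMatch;
// any later matches are ignored, per the GrabSubstring description.
public class GrabSubstringDemo {
    static String grabSubstring(String input, String beginMatch, String endMatch) {
        int begin = input.indexOf(beginMatch);
        if (begin < 0) return null;                 // beginMatch not found
        int start = begin + beginMatch.length();
        int end = input.indexOf(endMatch, start);
        if (end < 0) return null;                   // endMatch not found
        return input.substring(start, end);         // first result string only
    }

    public static void main(String[] args) {
        String text = "Score = 123 bits, Expect = 2e-30, Score = 99 bits";
        System.out.println(grabSubstring(text, "Score = ", " bits")); // prints 123
    }
}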

GrabXWrapEliteData

This command applies an XWRAPElite wrapper to the input. The input should be a text node that represents the content of a Web page. The properties are generated by the XWRAPElite toolkit and can be modified for fine-tuning.

GrabCommaDelimitedText

This command extracts comma-delimited data into XML. It has the following properties; a Java sketch of how they interact follows the list.

• LineDelimiters: The delimiters that separate the data into rows. The default is the system line separator.
• Delimiters: The delimiters that separate a row into a list of cells. The default is a comma.
• StopStrings: Strings that will be ignored.
• Filters: Contains two subproperties, minColCount and maxColCount; all rows whose column counts fall outside the range [minColCount, maxColCount] (inclusive) are filtered out.
• RowOutput: Specifies how to output the tabular data of each row in XML.
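The sketch below illustrates how these properties interact; the input data and the fixed <row>/<cell> output shape are invented here, since the real RowOutput property controls the XML layout.

// Illustrative sketch: split into rows and cells, apply the column-count
// filter, and emit one XML element per surviving row.
public class GrabCommaDelimitedTextDemo {
    public static void main(String[] args) {
        String input = "gi|4507, TP53, 2e-30\nheader line\ngi|1234, BRCA1, 1e-12";
        int minColCount = 3, maxColCount = 3;                 // Filters
        StringBuilder xml = new StringBuilder("<rows>\n");
        for (String row : input.split("\n")) {                // LineDelimiters
            String[] cells = row.split(",");                  // Delimiters
            if (cells.length < minColCount || cells.length > maxColCount)
                continue;                                     // filtered out
            xml.append("  <row>");
            for (String cell : cells)
                xml.append("<cell>").append(cell.trim()).append("</cell>");
            xml.append("</row>\n");
        }
        System.out.println(xml.append("</rows>"));
    }
}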

GrabConsecutiveLines

This command extracts consecutive text lines from the input. It has three properties: beginMatch, endMatch, and matchingMethod. The default matching method matches the first string of the beginning line and of the ending line against the beginMatch and endMatch properties. In some cases a domain-specific matching method is needed; external functions can then be used, as in the NCBiDetail example.
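A sketch of the default matching method, under the assumption that the matched beginning and ending lines are themselves included in the result:

import java.util.*;

// Collect lines from the first line whose first token equals beginMatch
// through the first later line whose first token equals endMatch.
public class GrabConsecutiveLinesDemo {
    static List<String> grabLines(String input, String beginMatch, String endMatch) {
        List<String> result = new ArrayList<>();
        boolean inside = false;
        for (String line : input.split("\n")) {
            String firstToken = line.trim().split("\\s+", 2)[0];
            if (!inside && firstToken.equals(beginMatch)) inside = true;
            if (inside) result.add(line);
            if (inside && firstToken.equals(endMatch)) break;
        }
        return result;
    }

    public static void main(String[] args) {
        String page = "Header\nQuery= NM_000546\nLength=2512\nEnd of section\nFooter";
        System.out.println(grabLines(page, "Query=", "End"));
    }
}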

(4) Boolean Comparisons

ContainSubstring

This command returns a Boolean value indicating whether the input contains a given substring. It has one property, compSubstring.


For example, a ContainSubstring command can check whether the text value of answerPage contains the string "This page will automatically updated in".

(5) Control Flow

While ... Do ...

This command checks the conditions in the while clause and, while they are true, repeats the do clause. The while clause contains Boolean commands, and the do clause contains other extraction-related commands.
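Combined with ContainSubstring and the Sleep command described under Process Management below, this construct expresses the polling pattern used for BLAST result pages. A hedged Java sketch of that pattern, with an invented URL and a simple page fetch standing in for FetchDocument:

import java.io.*;
import java.net.*;

// Poll the answer page: while it still contains the "please wait" text
// (a ContainSubstring condition), sleep and fetch the document again.
public class PollingDemo {
    static String fetchDocument(String url) throws IOException {
        StringBuilder page = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream()))) {
            String line;
            while ((line = in.readLine()) != null) page.append(line).append('\n');
        }
        return page.toString();
    }

    public static void main(String[] args) throws Exception {
        String url = "http://www.example.org/blast-result"; // hypothetical
        String page = fetchDocument(url);
        while (page.contains("This page will automatically updated in")) {
            Thread.sleep(5000);          // Sleep "5000" milliseconds
            page = fetchDocument(url);   // FetchDocument again
        }
        System.out.println("Result page ready (" + page.length() + " characters)");
    }
}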

(6) Post Processing

ApplyStylesheet

This command applies a stylesheet to the input XML. It has one property, StyleSheetFile.
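The effect corresponds to a standard XSLT transformation; a minimal sketch using the JAXP API, with invented file names:

import java.io.File;
import javax.xml.transform.*;
import javax.xml.transform.stream.*;

// Apply the StyleSheetFile to the input XML and write the transformed result.
public class ApplyStylesheetDemo {
    public static void main(String[] args) throws TransformerException {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new File("outerface.xsl")));
        t.transform(new StreamSource(new File("result.xml")),
                    new StreamResult(new File("result-final.xml")));
    }
}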

(7) Process Management

Sleep

This command pauses the process for a specified amount of time.

Usage: Sleep "<number of milliseconds>"