
Framework for Efficient Indexing and Searching of Scientific Metadata

Chaitali Gupta #1, Madhusudhan Govindaraju #2

# Department of Computer Science, SUNY Binghamton, P.O. Box 6000, Binghamton, NY 13902-6000, USA

1 [email protected], 2 [email protected]

Abstract — A seamless and intuitive data reduction capability for the vast amount of scientific metadata generated by experiments is critical to ensure effective use of the data by domain specific scientists. The portal environments and scientific gateways currently used by scientists provide search capability that is limited to the pre-defined pull-down menus and conditions set in the portal interface. Currently, data reduction can only be effectively achieved by scientists who have developed expertise in dealing with complex and disparate query languages. A common theme in our discussions with scientists is that data reduction capability, similar to web search in terms of ease-of-use, scalability, and freshness/accuracy of results, is a critical need that can greatly enhance the productivity and quality of scientific research. Most existing search tools are designed for exact string matching, but such matches are highly unlikely given the nature of metadata produced by instruments and a user's inability to recall exact numbers to search in very large datasets. This paper presents research to locate metadata of interest within a range of values. To meet this goal, we leverage the use of XML in metadata description for scientific datasets, specifically the NeXus datasets generated by the SNS scientists. We have designed a scalable indexing structure for processing data reduction queries. Web semantics and ontology based methodologies are also employed to provide an elegant, intuitive, and powerful free-form query based data reduction interface to end users.

I. INTRODUCTION

A key requirement for domain scientists is to provide seamless data reduction capability for the vast amount of scientific metadata being generated by scientific instruments such as the Spallation Neutron Source [1]. Many of the datasets generated by the experiments at SNS are in the NeXus format [3], which allows for the creation of XML based metadata files. These metadata files can be created during the experiment, annotated by scientists at a later time, or automatically generated from the datasets. An important requirement for scientists is to abstract away the fundamental complexity of XML based scientific metadata, and to provide end users with an elegant, intuitive, and simple yet powerful free-form query based data reduction framework that capitalizes on this common data format. An effective data reduction framework requires an efficient mechanism to generate an index based on the metadata for the large number of scientific data files. It also requires efficient tagging of files with metadata and a feedback mechanism that intelligently updates the metadata, so that data reduction becomes more refined and accurate and insights from the vast amount of scientific data are more easily realized by domain specific scientists.

Grid portal environments in use today provide scientists with data reduction capability that is limited to search via pre-defined pull-down menus and conditions set in the portal interface. Domain scientists are required either to manually search web pages for the specific resources they need, or to write programs to interact with the often complex and disparate Grid middleware software stacks in use. An efficient data reduction capability for large scale scientific metadata generated by SNS, HFIR and others, similar to web search in terms of ease-of-use, scalability, and freshness/accuracy of results, is a critical need that can greatly enhance the productivity and quality of scientific work.

II. MOTIVATION FOR XML BASED METADATA SEARCH

XML based scientific metadata, its programming paradigms, and specialized query languages such as SPARQL [4] have steadily grown in complexity, and the technical knowledge required to manage them has become difficult for domain scientists to master. Just as most computer users today do not have to write programs, domain scientists should be shielded from the low-level details of XML syntax and structure. The integration of Natural Language Processing (NLP) and Information Retrieval (IR) technologies in web search engines has made it possible for end-users to easily and effectively obtain information stored in billions of web pages. Users need neither professional programming expertise nor technical knowledge of the structure and format in which web pages are stored by search engine servers. However, as there is little context to the information that is indexed and searched via web search engines, they typically return multiple links to the end user. A key difference for search over XML based scientific metadata is that, unlike web search engines that return multiple web pages, domain scientists require the exact data information in response to their queries. A fast and scalable metadata based data reduction system is critical for making such information easily accessible. To meet this requirement, we have built ontology models for the various concepts [19] in our specific search domain using semantic and ontology technologies such as RDF/OWL [2, 5] and WordNet [6].

III. SCOPE OF FREE-FORM QUERIES

Free-form queries in our framework are expressed in plain English, so scientists do not need to learn any formal query syntax, just as in web search. Scientists can express search constraints using natural-language-like specifications. Our work adapts several techniques from Information Retrieval and the Semantic Web to enable context-rich free-form queries. The problem of processing and acting upon arbitrary English is an extremely challenging research topic being actively addressed in the AI community. To serve a scientific query, however, it suffices for our system to understand a limited form of English whose vocabulary is based on scientific terminology. We expect scientists to use our system and refine their queries based on feedback from our framework on how each query was transformed and processed, and on the results they obtain. Examples of the kinds of queries targeted by our framework include "Instrument data files with temperature values between 155 and 213", "atmospheric data using CCSM model for 2006", "status of job ID 117", and "all BSS instrument data conducted on SNS by PI Miller last week".
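
As a minimal illustration of this scope, the sketch below (in Python, with hypothetical names; the framework's actual pipeline uses NLP and semantic techniques rather than a regular expression) shows the kind of structured constraint such a free-form query is reduced to before it reaches the index:

```python
import re

def extract_range(query):
    """Pull a (field, low, high) constraint out of a free-form query such as
    'Instrument data files with temperature values between 155 and 213'.
    Illustrative only: not the paper's NLP/semantic matcher."""
    m = re.search(r"(\w+)\s+values?\s+between\s+(\d+)\s+and\s+(\d+)", query)
    if m:
        return m.group(1), int(m.group(2)), int(m.group(3))
    return None

print(extract_range(
    "Instrument data files with temperature values between 155 and 213"))
# ('temperature', 155, 213)
```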

IV. SCALABLE INDEXING SYSTEM ARCHITECTURE

The primary objective of any XML indexing technique is to reduce the time and space complexity of queries over XML-based scientific metadata. Without an indexing scheme, data reduction would require scanning every file in the repository, which is unacceptable: it demands extensive computation for each query and significantly increases the response time of every request. Web search engines have pre-determined criteria for deciding the depth of the indexing algorithm and the size of the stored index. Scientific metadata are stored in a flat or hierarchical file structure, unlike linked web documents, so index depth is not a relevant criterion; instead, the index is required to capture the information in all scientific metadata in the repository. However, similar to web search engines, the indexing program needs to be flexible enough to run at pre-determined times and also on demand, when a large experiment has completed and data reduction must be performed immediately. Another similarity with web search is the need for an inverted index to process user queries, map query terms to relevant documents, and rank the results.

The aim of our indexing mechanism is to uniquely identify each element in a scientific metadata document whose structure is specified by an XSD document, as is the case for the NeXus files in the target scientific domain. Our numbering mechanism is adapted from the Dewey [7] labeling scheme. At every level of the tree representation of the XSD file, each element is given a numbering code, starting from 1 at each level. As the root element is always the only element at its level and has no ancestors, it is always denoted by the code "1". The code of each element is generated from its parent's code and its assigned number under its parent element. The advantage of this coding scheme is that each code also contains the codes of its ancestors. The uniqueness of element codes also ensures that elements with similar names, but different ancestry, remain distinguishable. This feature is required because scientific metadata often have identical element names in different sub-trees.
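
A minimal sketch of such a Dewey-style labeling, assuming Python and illustrative names (`assign_codes` is not from the paper):

```python
import xml.etree.ElementTree as ET

def assign_codes(root):
    """Assign a Dewey-style code to every element in an XML/XSD tree.

    The root always receives "1"; each child is numbered 1, 2, 3, ...
    under its parent, so a code such as "1.2.1" embeds the full ancestry
    and distinguishes same-named elements in different sub-trees.
    """
    codes = {"1": root}
    stack = [(root, "1")]
    while stack:
        node, code = stack.pop()
        for i, child in enumerate(node, start=1):
            child_code = f"{code}.{i}"
            codes[child_code] = child
            stack.append((child, child_code))
    return codes

# Two leaf elements named "temperature" in different sub-trees
# receive distinct codes ("1.1.1" and "1.2.1").
tree = ET.fromstring(
    "<entry><sample><temperature/></sample>"
    "<monitor><temperature/></monitor></entry>"
)
for code, elem in sorted(assign_codes(tree).items()):
    print(code, elem.tag)
```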

As the first step of the index generation process, the XSD file for the given scientific metadata is parsed using the XPP XML parser [8] and each element is assigned a unique code. The codes are then stored in a code repository for the unique identification of each element. For each leaf element in the XML data files (conforming to the XSD), a key-value pair is created, which we denote the inverted index. The key is the data value of the leaf element in the XML file; the value is the set of ids of all files containing it, since the same leaf-element value may be present in multiple XML files. Index files are generated only for leaf nodes to avoid redundancy, because our numbering scheme lets us easily refer to the child elements of any non-leaf node. When a free-form user query is received, the code for a keyword present in the query is retrieved, the corresponding index file is selected, and the inverted index object is loaded into memory to serve the query. As the indexes are stored in separate index files, the search space for the keywords in a query is substantially reduced. An inverted index is created for each element value present in all the XML metadata instance files, which results in significantly faster query processing.
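
The following sketch (again Python, with hypothetical file ids; the actual system persists per-element index files rather than a single in-memory dictionary) illustrates the key-value structure and how a range constraint can be served from it:

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

def build_inverted_index(files):
    """Map each leaf-element value to the set of file ids containing it.

    `files` is a dict of {file_id: xml_string}; only leaf elements are
    indexed, since non-leaf elements are reachable via the numbering codes.
    """
    index = defaultdict(set)
    for file_id, xml_text in files.items():
        for elem in ET.fromstring(xml_text).iter():
            if len(elem) == 0 and elem.text and elem.text.strip():
                index[elem.text.strip()].add(file_id)
    return index

files = {
    "run_001.xml": "<entry><temperature>155</temperature></entry>",
    "run_002.xml": "<entry><temperature>213</temperature></entry>",
}
index = build_inverted_index(files)

# A range constraint such as "between 155 and 213" is served by scanning
# the numeric keys, not by re-parsing every file in the repository.
hits = {fid for value, fids in index.items()
        if value.isdigit() and 155 <= int(value) <= 213
        for fid in fids}
print(sorted(hits))  # ['run_001.xml', 'run_002.xml']
```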

V. INDEXING PERFORMANCE

We present a performance analysis of our indexing scheme and query processing based on datasets generated from metadata files used by SNS scientists. The motivation for these tests is to exercise memory constraints while processing index files and to quantify the performance gains and limits on a standard desktop machine. We compare our system against Apache SOLR [9], an open-source, high-performance, full-text search engine built on top of the Apache Lucene Java search library [10]. SOLR provides a faceted search capability that combines free-text search with structured querying. Apache SOLR returns XML snippets containing the queried terms as result values, whereas the requirement for metadata search is to determine the XML files in which the metadata content resides.

Fig. 1 compares index generation time in our system with XML parsing time as the number of metadata files increases. It can be observed that XML parsing does not scale as well as index generation for large numbers of XML metadata documents. As a result, we tested the indexing mechanism on 10,000 files so that the tests could focus solely on indexing performance; experiments beyond that size would also exercise system I/O constraints, along with the scalability of the XML processor itself, which are beyond the scope of this paper. We currently use the XML Pull Parser [8], which has been reported to have a small memory footprint and to be efficient for XML documents whose structure is known [11]. Note that index generation is a one-time process for a given set of XML metadata instance documents. For applications that need frequent index generation due to dynamic metadata updates, one option for reducing XML parsing time in future work is the Piccolo parser [12], which outperforms XPP in some use-case scenarios [11].

Fig. 1. Index generation time and XML parsing time in our system expectedly increase with the number of XML metadata instance files

We compare the index size of our system with SOLR's for varying similarity of XML metadata content in Fig. 2. We show the changes in index file sizes in our system for 0% and 100% similar XML metadata and compare them with SOLR, because these two cases provide insight into the worst-case and best-case index sizes of the two systems as content similarity varies. We note from Fig. 2 that the index sizes of both SOLR and our system decrease as content similarity increases. In SOLR, the index size for fully random datasets (0% similar, or equivalently 100% random, metadata content) increases from approximately 1.9 MB to 24 MB, as opposed to 1.15 MB to 9.9 MB in our system, for dataset sizes of 7.4 MB to 43 MB respectively. For 100% similar metadata content, the index size of SOLR varies from 0.7 MB to 19.4 MB, as opposed to 0.14 MB to 1.35 MB in our system. The SOLR index thus compresses about 19.2% from completely random (0% similar) to identical (100% similar) data for approximately 43 MB of metadata, compared to 86.3% in our system. The larger index size in SOLR is due to the fact that SOLR stores the word position and frequency of each indexed element/attribute value within the XML metadata files. Although term frequencies are useful for phrase or proximity based search, we can easily derive this information from the codes of the XML elements, making our index size much smaller than SOLR's.
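
The compression percentages follow directly from the rounded index sizes quoted above (the small discrepancy between the 86.4% computed here and the reported 86.3% reflects rounding of the reported sizes):

$$\frac{24 - 19.4}{24} \approx 19.2\% \;\;\text{(SOLR)}, \qquad \frac{9.9 - 1.35}{9.9} \approx 86.4\% \;\;\text{(our system)}$$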

Fig. 2. Comparison of the index size of our system with SOLR for increasing percentages of similar data. As the number of XML metadata files grows, the decrease in index size with increasing data similarity is more pronounced in our system than in SOLR

VI. RELATED WORK

Related work on XML indexing techniques can be broadly categorized as: (i) structure-based, (ii) sequence-based, (iii) numbering-based, and (iv) keyword-based.

1) Structure-based: This technique is primarily based on the bi-similarity concept, in which the nodes of XML documents are grouped by structural similarity. These structural summaries reflect the underlying database structure. For example, Milo et al. [13] use template indexes, or T-indexes, to process queries consisting of path expressions. By defining regular expressions over a general path of the form [P]x [P]y, called a path template, regular path expressions can be accounted for in a query. T-indexes can be quite expensive structures, but their cost can be considerably reduced under specific conditions. Qun et al. [14] introduce the D(k)-index, also based on the concept of bi-similarity. This indexing technique derives adaptive structural summaries from the XML data that serve as indexes for evaluating path expressions; the index adjusts its structure according to the current query load, and its inherently dynamic nature makes update operations efficient. Wang et al. [18] propose diverse optimization methods geared towards improving XML processing efficiency with the disk-based F&B index, which is memory intensive, inefficient for large documents, and not in itself a scalable solution; processing is also significantly improved by a better, locality-based clustering technique.

2) Sequence-based: In this indexing technique, both the XML data in the database and the user queries are transformed into sequences, with a one-to-one mapping between trees and sequences. Matching between the datasets and the queries is performed using a subsequence matching technique followed by refinement stages. For example, Rao et al. [15] propose PRIX, an indexing scheme based on Prüfer sequences, in which XML documents and search patterns are first transformed into Prüfer sequences. Note that matching is performed through refinement stages, and the size of the sequence can grow considerably with the size of the tree being processed.
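
PRIX actually builds extended Prüfer sequences over tag labels; the classical construction below (a Python sketch, not PRIX's implementation) conveys the one-to-one tree-to-sequence mapping the technique relies on:

```python
def prufer_sequence(adj):
    """Classical Prüfer encoding: repeatedly remove the smallest-numbered
    leaf and record its neighbor. A labeled tree on n nodes maps
    one-to-one to a sequence of length n - 2."""
    adj = {u: set(vs) for u, vs in adj.items()}  # defensive copy
    seq = []
    for _ in range(len(adj) - 2):
        leaf = min(u for u, vs in adj.items() if len(vs) == 1)
        neighbor = next(iter(adj[leaf]))
        seq.append(neighbor)
        adj[neighbor].discard(leaf)
        del adj[leaf]
    return seq

# Tree with edges 1-2, 1-3, 3-4 encodes to the Prüfer sequence [1, 3].
print(prufer_sequence({1: {2, 3}, 2: {1}, 3: {1, 4}, 4: {3}}))
```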

3) Numbering-based: In this technique, XML nodes are mapped to coordinates in a two-dimensional plane based on their pre-order and post-order ranks. XACC [16] is designed to support the evaluation of XPath queries on relational databases; it proposes a database indexing structure that maps all elements and attributes onto a two-dimensional plane, where the X coordinate of a node is its pre-order rank and the Y coordinate is its post-order rank. The main advantage of this indexing scheme is that XPath traversal can start from arbitrary context nodes due to its two-dimensional structure. The technique is especially useful for regular path expressions and is able to support all XPath axes.
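
A brief sketch of the pre/post-order plane (Python with a simple dict-based tree, illustrative rather than XACC's relational implementation): a node v is a descendant of u exactly when pre(u) < pre(v) and post(v) < post(u), which is why XPath axes reduce to region queries on the plane.

```python
def prepost_ranks(tree, root):
    """Assign (pre-order, post-order) ranks to every node of `tree`,
    given as {node: [children]}."""
    pre, post, counter = {}, {}, [0, 0]

    def visit(node):
        pre[node] = counter[0]; counter[0] += 1      # rank on entry
        for child in tree.get(node, []):
            visit(child)
        post[node] = counter[1]; counter[1] += 1     # rank on exit

    visit(root)
    return {n: (pre[n], post[n]) for n in pre}

print(prepost_ranks({"a": ["b", "c"], "b": ["d"]}, "a"))
# {'a': (0, 3), 'b': (1, 1), 'd': (2, 0), 'c': (3, 2)}
```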

4) Keyword-based: This indexing structure enables keyword search on XML data. The technique is useful when users have no knowledge of the structure of the stored data or of the query language. Hristidis et al. [17] propose XKeyword, which provides keyword proximity search in structured and semi-structured databases. XML data is modeled as a labeled graph, where the edges correspond to element/sub-element relationships and IDREF pointers. The XML elements are grouped into target objects that are stored in connection relations, and redundant connection relations are used to improve the performance of top-k keyword queries. Query execution is optimized to offer fast response times.

Our indexing and search system is complementary to these approaches, as it focuses solely on matching free-form queries to the correct data files. Depending on the structure of the metadata, the appropriate scheme can be incorporated to augment the algorithms and data structures we have developed.

VII. CONCLUSION

The semantic framework, along with the indexing scheme, has the potential to provide a simple yet powerful interface for grid and e-science domain scientists to query and obtain the specific subset of files they are interested in. With our framework, scientists do not have to learn complex software stacks or formal query languages. As the vocabulary is restricted to keywords commonly used in the domain, novel matchmaking algorithms can achieve high accuracy. We provide the ease-of-use of popular Web search engines, along with the ability to retrieve information relevant to user queries in the scientific domain. We have designed an indexing scheme tailored for use with a web-search-like user interface, and the size of the index files we generate scales well as the size of the XML metadata increases.

VIII. FUTURE WORK

In future work, we will modify the indexing data structures to optimize key storage by combining frequently occurring or well-known ranges into a single key value. For the much larger datasets we expect to gain access to in the future, the Numbering Code Table and Value Table can be distributed across the storage infrastructure and accessed using the MapReduce processing model.

We also plan to incorporate the effects of disk I/O and memory size when testing indexing on terabyte-scale data. We will conduct benchmark studies to determine the XML parser best suited to parsing large numbers of XML files for faster indexing, and we will develop parallel algorithms for query processing with various independent constraints.

REFERENCES

[1] Spallation Neutron Source (SNS). http://neutrons.ornl.gov/aboutsns/aboutsns.shtml.
[2] RDF. http://www.w3.org/RDF/.
[3] "NeXus Data Format" Web Page. Available: http://www.nexusformat.org/.
[4] SPARQL. http://www.w3.org/TR/rdf-sparql-query/.
[5] "OWL Web Ontology Language Overview" Web Page. Available: http://www.w3.org/TR/owl-features/.
[6] G. A. Miller, "WordNet: A Lexical Database for English", in Comm. ACM, 1995.
[7] Dewey. http://frank.mtsu.edu/~vvesper/dewey2.htm.
[8] XML Pull Parser. http://www.xmlpull.org/.
[9] Apache SOLR. http://lucene.apache.org/solr/.
[10] Apache Lucene. http://lucene.apache.org/java/docs/index.html.
[11] M. R. Head, M. Govindaraju, R. van Engelen, W. Zhang, "Benchmarking XML Processors for Applications in Grid Web Services", in Proceedings of SC|06 (Supercomputing): International Conference for High Performance Computing, Networking, and Storage, Tampa, FL, November 2006.
[12] Piccolo XML Parser for Java. http://piccolo.sourceforge.net/.
[13] T. Milo, D. Suciu, "Index Structures for Path Expressions", in Proceedings of the 7th International Conference on Database Theory (ICDT '99), Jerusalem, Israel, January 1999.
[14] C. Qun, A. Lim, K. W. Ong, "D(k)-Index: An Adaptive Structural Summary for Graph-Structured Data", in Proceedings of the 2003 ACM SIGMOD International Conference, San Diego, California, June 2003.
[15] P. Rao, B. Moon, "PRIX: Indexing and Querying XML Using Prüfer Sequences", in Proceedings of the 20th International Conference on Data Engineering (ICDE 2004), Boston, 2004.
[16] T. Grust, "Accelerating XPath Location Steps", in ACM SIGMOD Conference, 2002.
[17] V. Hristidis, Y. Papakonstantinou, A. Balmin, "Keyword Proximity Search on XML Graphs", in ICDE, 2003.
[18] W. Wang, H. Wang, H. Lu, H. Jiang, X. Lin, J. Li, "Efficient Processing of XML Path Queries Using the Disk-based F&B Index", in Proceedings of the 31st International Conference on Very Large Data Bases (VLDB 2005), Norway, 2005.
[19] C. Gupta, R. Bhowmik, M. Govindaraju, "Semantic Framework for Free-Form Search of Grid Resources and Services", in Proceedings of the 10th IEEE/ACM International Conference on Grid Computing (GRID 2009), Banff, Canada, October 2009.
