2010 Third International Symposium on Information Science and Engineering (ISISE), Shanghai, China, 2010.12.24-2010.12.26
Research on Dynamic Index of Streamline Dynamic Successive Trees
Lin Huo
Department of Computer and Electronic Information, Guangxi University, Nanning, China
Yinglu Tan
Department of Computer and Electronic Information, Guangxi University, Nanning, China
Abstract: Based on Streamline Dynamic Successive Trees (SDST), this paper proposes a dynamic index update method whose update size is at the character level. A document-level index update algorithm and a character-level index update algorithm are built on SDST, and the performance of the two algorithms is compared. The efficiency of the character-level update method is verified by experiment; the results show that the character-level update method performs better than the document-level one.
Key Words: dynamic index; update size; character-level
INTRODUCTION
Dynamic update is an important research and application direction in full-text index search technology. With the rapid development of information technology, full-text search has been widely used in web search, digital libraries, application software help systems, e-mail search and so on. In these applications, new data continually emerge to replace old data, so information must be inserted, deleted and modified frequently. Dynamic index update is therefore a research topic of practical significance.
RELATED WORK
Paper [1] describes the four index update methods commonly used in retrieval systems: in-place, re-merge, re-build, and hybrid. It evaluated the first three, and the results show that re-merge is always better than re-build and in-place; if memory is large enough to cache the documents, re-merge achieves the fastest update rate.
The inverted index file structure has been improved to overcome the problem of removing documents [2]: a deletion list is added to the structure. When a document is deleted, its document ID is recorded in the deletion list, which is then used at query time to filter deleted documents out of the result set. Building on that, Büttcher et al. [3] proposed a further improvement in which deleted documents are removed directly during query processing, so that removed documents are not queried again and query efficiency improves.
Another approach uses a main index and an additional index to achieve updates [4]; to improve performance, each update is confined to the additional index. In this approach, the main index is rebuilt periodically, has a high capacity, and stores most of the index data, while the additional index is updated in real time, has a small capacity, and handles document insertion and removal; a modification is treated as a removal followed by an insertion.
Third International Symposium on Information Science and Engineering
978-0-7695-4360-4/10 $26.00 © 2010 IEEE
DOI 10.1109/ISISE.2010.35
Literature [5] proposed a merge-based index update approach built on a dynamic balancing tree. The dynamic balancing tree controls the merging so that indexes of similar size are merged, which not only reduces the overhead of index merging but also maintains good performance during indexing and query processing.
DYNAMIC INDEX DESIGN
From the discussion of the relevant literature, we summarize the two main factors that affect index update. First, the number of index I/O operations: each I/O operation takes time, so I/O increases the cost of an index update and directly reduces update efficiency if it is too frequent. Second, the index data structure, which determines how much of the index library is affected by an update and so indirectly affects update efficiency.
Previous studies of index update have usually concentrated on the first factor and paid less attention to the second. This section focuses on the second factor: we propose a character-level dynamic index update method based on Streamline Dynamic Successive Trees (SDST), which extends the SDST structure into a highly dynamic one.
A. SDST Index Structure
In the SDST, each index entry is a sub-tree, and the index library is a forest made up of these sub-trees. Each sub-tree in the forest includes the following information:
Root: the word located at the root.
Leaves: the successors of the root, i.e. the words located at the leaves.
DocID: the document to which the sub-tree belongs.
Successive Position (SP): the position of the current leaf in the document.
Relative Position (RP): the position of the successor of the current leaf in the document.
Leaf information (LI): a DocID, SP and RP together form one leaf information; SP and RP take incrementally increasing integers in the interval [1, +∞).
Leaf information set: a collection of one or more leaf information items, recorded as LI.
In the SDST structure, a root connects to one or more leaves, and each leaf connects to a leaf information set. The logical structure of SDST is shown in Figure 1.
Figure 1 The logical structure of SDST
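The forest described above can be sketched as a nested mapping from root to leaf to leaf information set. This is only an illustrative sketch, not the authors' implementation: it assumes that every character is treated as a word, that SP is the 1-based position of the leaf in the document, that RP is the position of the character following the leaf, and that 0 marks "no successor". The names `LeafInfo` and `build_sdst` are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class LeafInfo(NamedTuple):
    doc_id: int
    sp: int  # assumed: 1-based position of the leaf in the document
    rp: int  # assumed: position of the leaf's successor (0 = none)


def build_sdst(docs):
    """Build a forest: root -> leaf -> list of LeafInfo (hypothetical layout)."""
    forest = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        # Every adjacent pair (root, leaf) contributes one leaf information.
        for i in range(len(text) - 1):
            root, leaf = text[i], text[i + 1]
            sp = i + 2                            # 1-based position of the leaf
            rp = sp + 1 if sp < len(text) else 0  # next position, if any
            forest[root][leaf].append(LeafInfo(doc_id, sp, rp))
    return forest


forest = build_sdst({2: "acbbda"})
for root in sorted(forest):
    print(root, dict(forest[root]))
```

For the text "acbbda" this yields four sub-trees (roots a, b, c, d); the root b has two leaves, b and d, because "b" is followed once by "b" and once by "d".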
B. Dynamic Design
To facilitate the description of the dynamic index update, the paper first gives a relevant definition: Update size. The area of the index library affected by one index update is known as the update size.
According to the update size, index updates can be classified as index-library-level, document-level and character-level. In one index update, if the whole index library is affected, the update size is index library level (IL); if the entire contents of the documents that include the update section are affected, the update size is document level (DL); if only part of the contents of the documents that include the update section is affected, the update size is character level (CL). Considering the update size and the features of SDST, we propose an update method whose update size is at the character level. During an update, it generates sub-trees for the update sections and inserts them into the forest. Finally, it modifies the values of SP and RP in the sub-trees affected by the update sections. During updating, SP and RP take decreasing integers in the interval (-∞, -1].
The following example describes how the SDST index file is updated by the character-level update method. Example: take the text “acbbda” with docID 2. After “dac” is inserted after the third character, the text becomes “acbdacbda”; Figure 2 shows the logical structure of SDST after this CL insertion. If instead “bbd” is deleted after the second character of the original text, the text becomes “aca”; Figure 3 shows the logical structure of SDST after this CL deletion.
Figure 2 The logical structure of SDST with CL adding
Figure 3 The logical structure of SDST with CL deleting
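The example can be traced with a small sketch. It rests on one possible reading of the method, stated here as an assumption: positions act as identifiers chained by a "next" link (the role RP plays), original characters carry positions 1, 2, 3, ..., and characters added by an update receive fresh negative positions -1, -2, ..., so an update only relinks the neighbours of the update point. The class `PositionChain` and its methods are hypothetical and leave out the forest itself.

```python
class PositionChain:
    """Characters of one document, linked by position ids (hypothetical)."""

    def __init__(self, text):
        # Original characters get ids 1..n; id 0 is a head sentinel.
        self.char = {i + 1: ch for i, ch in enumerate(text)}
        self.nxt = {i: i + 1 for i in range(len(text))}
        self.nxt[len(text)] = None
        self.next_new_id = -1  # ids handed out by updates: -1, -2, ...

    def ids(self):
        """Yield position ids in reading order."""
        cur = self.nxt[0]
        while cur is not None:
            yield cur
            cur = self.nxt[cur]

    def text(self):
        return "".join(self.char[i] for i in self.ids())

    def _id_at(self, offset):
        """Id of the offset-th character in reading order (0 = sentinel)."""
        cur = 0
        for _ in range(offset):
            cur = self.nxt[cur]
        return cur

    def insert_after(self, offset, part):
        prev = self._id_at(offset)
        tail = self.nxt[prev]
        for ch in part:
            new_id = self.next_new_id
            self.next_new_id -= 1
            self.char[new_id] = ch
            self.nxt[prev] = new_id  # relink only around the update point
            prev = new_id
        self.nxt[prev] = tail

    def delete_after(self, offset, length):
        prev = self._id_at(offset)
        cur = self.nxt[prev]
        for _ in range(length):
            following = self.nxt.pop(cur)
            del self.char[cur]
            cur = following
        self.nxt[prev] = cur


added = PositionChain("acbbda")
added.insert_after(3, "dac")
print(added.text(), list(added.ids()))

removed = PositionChain("acbbda")
removed.delete_after(2, 3)
print(removed.text(), list(removed.ids()))
```

In this sketch the positions of the untouched characters never change: the insertion yields the id sequence 1, 2, 3, -1, -2, -3, 4, 5, 6, and the deletion leaves 1, 2, 6.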
DYNAMIC UPDATE ALGORITHM
Of the three update sizes, the index-library-level update is equivalent to rebuilding the index, so this paper does not discuss its algorithm or performance. This section describes the document-level and character-level update algorithms in turn.
A. SDST_DL update algorithm
updateDocWithSDST(docID)
Begin
  for each leafList in rootList do
    for each LeafNodeInfoList in leafList do
      for each LeafNodeInfo in LeafNodeInfoList do
        if (LeafNodeInfo.docID = docID)
          LeafNodeInfoList.remove(LeafNodeInfo);
        end if
      end for
    end for
  end for
  for each term in new document do
    if (rootList.containsKey(root))
      if (leafList.containsKey(leaf))
        leafNodeInfo.modifyLeafNodeInfo(docID,sp,rp)
      else
        leafNodeInfo=LeafNodeInfo(docID,sp,rp);
        leafList.addLeafList(leaf,leafNodeInfo);
      end if
    else
      leafNodeInfo=LeafNodeInfo(docID,sp,rp);
      leafList.addLeafList(leaf,leafNodeInfo);
      rootList.addRootList(root,leafList);
    end if
  end for
End
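The two phases of the document-level update (remove every leaf information of the old document, then index the new document) can be written as a runnable sketch. It is not the authors' code: the nested-dictionary forest, the tuple layout `(doc_id, sp, rp)` and the helper names `new_forest`, `index_document` and `update_doc` are assumptions made for illustration, with SP and RP assigned as simple 1-based positions.

```python
from collections import defaultdict


def new_forest():
    # Hypothetical layout: root -> leaf -> list of (doc_id, sp, rp).
    return defaultdict(lambda: defaultdict(list))


def index_document(forest, doc_id, text):
    """Insert one document; each adjacent character pair is (root, leaf)."""
    for i in range(len(text) - 1):
        sp = i + 2                            # assumed 1-based leaf position
        rp = sp + 1 if sp < len(text) else 0  # 0 = no successor (assumption)
        forest[text[i]][text[i + 1]].append((doc_id, sp, rp))


def update_doc(forest, doc_id, new_text):
    """Document-level update: remove the old version, then re-insert."""
    # Phase 1: the three nested traversals (roots, leaves, leaf information).
    for leaf_list in forest.values():
        for infos in leaf_list.values():
            infos[:] = [li for li in infos if li[0] != doc_id]
    # Phase 2: index the new version of the document from scratch.
    index_document(forest, doc_id, new_text)


forest = new_forest()
index_document(forest, 1, "ab")
index_document(forest, 2, "acbbda")
update_doc(forest, 2, "aca")
print(dict(forest["a"]))
```

Phase 1 visits every root, leaf and leaf information in the forest even though only one document changes, which is the cost the complexity analysis attributes to document-level update. The sketch leaves emptied leaf lists in place rather than pruning them.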
B. SDST_CL update algorithm
updatePartDocWithIF(part)
Begin
  writeTmpState(positionInfo);
  if isAdd
    insertPart(part,positionInfo);
  else
    if isDelete
      deletePart(part,positionInfo);
    else
      modifyPart(part, positionInfo);
    end if
  end if
End
PERFORMANCE ANALYSIS
In this section we compare the document-level and character-level update algorithms on both SDST and the inverted file (IF), i.e. the four algorithms SDST_DL, IF_DL, SDST_CL and IF_CL. To simplify the analysis, only the key steps are considered.
A. Symbol Definition
M: the number of roots from the character set Σ that occur in the index library.
N: the number of words in the new documents.
U: an update part.
R: the number of update parts.
m: the number of documents in the index library.
n: the number of words in U.
B. Analysis of Document-level Update Algorithm
The essence of DL is to insert a new document into the index library after removing the old one, so the complexity of a DL algorithm equals the complexity of removal plus insertion. In the SDST_DL algorithm, removing a document consists of three nested layers. The first layer traverses each root in the root list; the number of operations is the total number of root nodes, M. The second layer traverses each leaf in the leaf list. From the structure of SDST, a word that appears in a leaf also appears in a root and vice versa, so the number of leaf nodes is at most M, and the number of operations in this layer is M. The third layer traverses each leaf information in the leaf information list; as the length of the leaf information list equals the number of documents, the number of operations in this layer is m. Removal therefore costs O(M * M * m). SDST_DL insertion is regarded as creating the index for the new documents; its number of operations is the number of words in the new documents, N, so its complexity is O(N). From the above analysis, the complexity of the SDST_DL algorithm is O(M * M * m + N).
In the IF_DL algorithm, removing a document consists of two nested traversals. First, each word in the vocabulary is traversed; the number of operations is the total number of words, M. Second, the node information is traversed; the number of operations is the total number of node information items, m. Therefore, the complexity of removing documents is O(M * m). IF_DL insertion is regarded as creating the index for the new documents, so its complexity is the same as that of index creation, O(N). Hence the complexity of the IF_DL algorithm is O(M * m + N). When the number of new documents is small, m and N are small and M is constant, so O(M * M * m + N) > O(M * m + N): the document-level update of IF is a little more efficient than that of SDST. As the number of new documents increases, the growth of m becomes smaller while the growth of N becomes larger; with M constant, O(M * M * m + N) ≈ O(M * m + N). Therefore, the efficiency of SDST_DL is then as good as that of IF_DL.
C. Analysis of Character-level Update Algorithm
The essence of the CL algorithm is first to record the context of each update point in a temporary state table, and then to process the update part U, which is inserted into or deleted from the document. The complexity of the CL algorithm is the sum of the recording and processing costs.
Recording is performed R times, once for each update part U, so its complexity is O(R). Processing is also performed R times, but each operation is linear in n, the number of words in U, so the complexity of processing is O(R * n). Therefore, the complexity of the CL update algorithm is O(R + R * n). R, the number of update parts, is much smaller than m, the number of documents in the index library, and n, the number of words in U, is also far smaller than N, the number of words in the new documents. We can conclude that the CL update algorithm is much more efficient than DL, because O(R + R * n) << O(M * m + N) < O(M * M * m + N).
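To get a feel for the gap between the three cost expressions, they can be evaluated for some sample sizes. The values of M, m, N, R and n below are purely hypothetical, chosen only to show orders of magnitude; they are not taken from the experiments, and the function names are illustrative.

```python
def cost_sdst_dl(M, m, N):
    """Operation count behind O(M * M * m + N)."""
    return M * M * m + N


def cost_if_dl(M, m, N):
    """Operation count behind O(M * m + N)."""
    return M * m + N


def cost_cl(R, n):
    """Operation count behind O(R + R * n)."""
    return R + R * n


# Hypothetical sizes: 3000 distinct roots, 500 documents, 10000 words in
# the new documents, 5 update parts of 20 words each.
M, m, N = 3000, 500, 10_000
R, n = 5, 20

print(cost_cl(R, n))           # 105
print(cost_if_dl(M, m, N))     # 1510000
print(cost_sdst_dl(M, m, N))   # 4500010000
```

With these assumed sizes the character-level count is several orders of magnitude below both document-level counts, matching the ordering stated above.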
EXPERIMENTAL RESULT
Five Chinese text data sets from Skynet (Table 1, Table 2) are used to verify the dynamic performance of SDST. The experimental environment has a 3.0 GHz processor and 2 GB of memory. The programs are written in Java, with Eclipse 3.4 as the development tool. The results show that the CL update algorithm performs far better than the DL one regardless of which index structure, IF or SDST, is used. In Figure 4 and Figure 5, IF_DL means document-level update on IF, SDST_DL means document-level update on SDST, IF_CL means character-level update on IF, and SDST_CL means character-level update on SDST.
TABLE 1 TEXT SET OF EXPERIMENTATION
Text set     U-1   U-2   U-3   U-4   U-5
Size (MB)    0.5   1.0   1.5   2.0   2.5
Text count   100   200   300   400   500
Figure 4 Time cost with the index size fixed and the number of documents increasing
Figure 4 shows the experimental result when the index size (IS) is fixed at 6 MB while the number of documents to be updated (UDN) increases. From Figure 4, we can see that the update time cost of these 3 algorithms grows linearly with UDN. However, the growth rate of CL is much lower than that of DL, which is consistent with the conclusions of Section 5.3 (Analysis of Character-level Update Algorithm).
TABLE 2 TEXT SET OF EXPERIMENTATION
Text set     TC-1   TC-2   TC-3   TC-4   TC-5
Size (MB)    2.0    4.0    6.0    8.0    10.0
Text count   100    200    300    400    500
Figure 5 Time cost with the index size increasing and the number of documents fixed
Figure 5 shows the test result when the number of updated documents is fixed at 60 while the index size increases. As shown in Figure 5, the update time cost of IF_DL and SDST_DL grows linearly but slowly with IS. This is because the increase of IS makes M approach its maximum, the total number of characters in the character set Σ, while m increases slowly. UDN is unchanged, so N is also stable. The update time cost of IF_CL and SDST_CL is relatively stable because UDN is fixed, which keeps R and n relatively unchanged. This is consistent with the O(R * n) complexity derived in the previous section. The experimental results in Figure 4 and Figure 5 are further evidence that SDST is more dynamic than IF and that the CL update algorithm is more efficient than the DL one.
CONCLUSION
This paper presents an update method whose update size is at the character level, and extends SDST into a highly dynamic index structure. We also apply this method to IF, which is widely used. Experimental results show that the method is efficient in an actual system and significantly reduces the update time cost. The CL update algorithm has broad application prospects, especially for databases that are large and must be updated frequently. The method can be further improved so that it can be used to track documents from draft to final version.
REFERENCES
[1] Lester N, Zobel J, Williams H E. In-place versus re-build versus re-merge: index maintenance strategies for text retrieval systems[C]. Proc of ACSC 2004, Darlinghurst, Australia: Australian Computer Society, 2004: 15-23
[2] Chiueh T, Huang L. Efficient real-time index updates in text retrieval systems[R]. New York: Stony Brook, 1998
[3] Büttcher S, Clarke C. Indexing time vs. query time trade-offs in dynamic information retrieval systems[C]. Proc of CIKM 2005, New York: ACM, 2005
[4] Guo Qijuan. Research on Full-Text Index Model Based on Full-Text Database[D]. China University of Petroleum, 2007
[5] Guo Ruijie, Cheng Xueqi, Xu Hongbo, Wang Bin, Ding Guodong. A Fast On-Line Index Construction Method Based on Dynamic Balancing Tree[J]. Journal of Computer Research and Development, 2008, (10)
[6] Li Dong, Shi Xiaodong. An On-Line Updates Inverted Index Strategy Which Supports High Efficient Searching[J]. Journal of The China Society for Scientific and Technical Information, 2006, (01)
[7] L. Page, S. Brin. The anatomy of a large-scale hypertextual web search engine[C]. Proceedings of the 7th Intl. WWW Conf., 1998: 107-117
[8] Lipyeow Lim, Min Wang, Sriram Padmanabhan. Dynamic Maintenance of Web Indexes Using Landmarks[C]. WWW2003, May 20-24, 2003, Budapest, Hungary, ACM 1-58113-680-3/03/0005
[9] Jidong Chen, Xiaofeng Meng. Update-efficient indexing of moving objects in road networks[J]. GeoInformatica, Volume 13, Number 4, 2009: 397-424
[10] Laurynas Biveinis, Simonas Šaltenis. Towards efficient main-memory use for optimum tree index update[J]. Volume 1, Issue 2, August, 2008: 1617-1622