a novel technique for information retrieval based on cloud

IPASJ International Journal of Information Technology (IIJIT) Web Site: http://www.ipasj.org/IIJIT/IIJIT.htm

A Publisher for Research Motivation ........ Email: [email protected] Volume 1, Issue 1, June 2013 ISSN 2321-5976


ABSTRACT
The Information Retrieval (IR) algorithm appears deceptively simple when viewed purely in terms of its description. However, implementing the IR algorithm is quite difficult, particularly when it must satisfy specific structural requirements. In this research, the Information Retrieval algorithm is developed using the MapReduce mechanism to retrieve information in a Cloud computing environment. The MapReduce algorithm was developed by Google for experimental evaluations. In the present study, the algorithm reports results in terms of the number of buckets required to generate the output from a large chunk of information in Cloud computing. The algorithm is part of a complete Business Intelligence tool to be implemented, with the results to be delivered for a Cloud computing architecture.
Keywords: IR algorithm, Cloud Computing, MapReduce, Business Intelligence, Name nodes, Data nodes, Main Server, Secondary Server, Database Server.

1. INTRODUCTION
Cloud computing is evolving as a new paradigm for highly scalable, fault-tolerant, and adaptable computing on large clusters of computers. Cloud architectures offer highly available storage and compute capacity through distribution and replication. As a developing technology, Cloud computing is expected to reshape information retrieval procedures in the near future. A typical cloud application would have a data owner outsourcing data services to a cloud, where the data is stored in a keyword-value form, and users may retrieve the data with several keywords [1]. For this reason, the MapReduce mechanism is well suited to designing and implementing the IR algorithm. Importantly, Cloud architectures also adapt to changing requirements by dynamically provisioning new (virtualized) compute or storage nodes [2]. In addition, numerous services and dynamically scalable virtualized resources are added to the cloud [3] at nearly every instant, and Cloud computing makes the resources universally available with greater flexibility [4]. Improvements in information services, including information retrieval, are now necessary because of the rise of virtualized resources in the cloud [4]. All cloud resources are distributed, whereas current search engines such as Yahoo, Google, and MSN are centralized systems [5]. Centralized systems suffer from several drawbacks, including limited scalability, frequent server failures, and information retrieval issues, as mentioned by [6]. Document virtualization has also become popular over the past few years [7]. Existing distributed IR models are unable to search inside a virtualized physical node with multiple virtual systems running in parallel in the form of a grid. [5] proposed a distributed IR model to address the accurate and fast allocation of required information, but several issues remain unsolved. A modified IR model that works efficiently with virtualized resources is the need of the hour [4]. This paper is an attempt to design the IR algorithm using the MapReduce mechanism. The algorithm is verified and the simulated results are evaluated based on the following criteria:

A Novel Technique for Information Retrieval based on Cloud Computing

Dr. Sanjay Mishra, Dr. Arun Tiwari

Assistant Professor, Department of Computer Science and Engineering,

Amity University, Dubai, UAE




1) The algorithm takes the number of Search requests as input.
2) The algorithm then breaks the Search requests into the number of chunks required for information retrieval from the public cloud.
3) Based on these two assumptions, the algorithm performs the mapping and determines the number of buckets required to perform the reduce function of the algorithm.

Thus, the main aim of the algorithm is to manage the number of buckets (packets) required to complete the MapReduce algorithm without any obstruction. The algorithm (as presented in Annexure A) is tested on a large number of requests based on different chunks of information. The rest of the paper is organized as follows: Section 2 explains the MapReduce mechanism. Section 3 elaborates on the Cloud computing architecture in detail. Section 4 outlines the basic considerations for the IR algorithm using the MapReduce mechanism. Section 5 describes the IR algorithm and summarizes the various functions used in the actual Java code. Section 6 illustrates the outcomes of the code execution. Section 7 details the inferences and recommendations based on the experimentation. The paper also includes Annexure A, which contains the Java code snippet for the IR algorithm.
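The three criteria listed in this section can be illustrated with a minimal Java sketch. It is not the Annexure A code: the class name `BucketPlanner`, its method names, and the assumption that every bucket holds a fixed maximum number of requests (so the bucket count is the ceiling of requests divided by bucket size) are hypothetical choices made only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not taken from the paper's Annexure A.
class BucketPlanner {

    // Assumption: each bucket holds at most bucketSize requests,
    // so the bucket count is ceil(requests / bucketSize).
    static int bucketsNeeded(int requests, int bucketSize) {
        if (requests < 0 || bucketSize <= 0) {
            throw new IllegalArgumentException("requests must be >= 0 and bucketSize > 0");
        }
        return (requests + bucketSize - 1) / bucketSize;
    }

    // Breaks request ids 0..requests-1 into consecutive chunks.
    // Each entry is a half-open range {start, end}.
    static List<int[]> chunkRanges(int requests, int bucketSize) {
        List<int[]> ranges = new ArrayList<>();
        int buckets = bucketsNeeded(requests, bucketSize);
        for (int b = 0; b < buckets; b++) {
            int start = b * bucketSize;
            int end = Math.min(start + bucketSize, requests);
            ranges.add(new int[] {start, end});
        }
        return ranges;
    }

    public static void main(String[] args) {
        System.out.println(bucketsNeeded(5000, 1000)); // prints 5
        for (int[] range : chunkRanges(2500, 1000)) {
            System.out.println(range[0] + ".." + range[1]);
        }
    }
}
```

Under this assumption the last bucket may be only partly filled, which is why the ranges are computed with `Math.min`.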

2. MAP REDUCE MECHANISM
The concept of MapReduce was introduced by Google in 2004 and is the backbone of many large-scale data computations. MapReduce is essentially a divide-and-conquer algorithm that breaks a problem into small parts and processes them in parallel to achieve efficient computation on a larger data set. The MapReduce mechanism consists of two steps:

1. Map
2. Reduce

Map: In the Map step, the main node takes the input, partitions it into smaller sub-problems, and distributes them to data nodes. A data node may do this again in turn, leading to a multi-level tree structure. The data node processes the smaller problem and passes the response back to its main node.
Reduce: In the Reduce step, the main node collects the responses to all the sub-problems and merges them in some way to form the output, that is, the answer to the problem it was originally trying to solve. The overall structure of the MapReduce mechanism is shown in Figure 1:

Figure 1: Map Reduce structure
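The Map and Reduce steps described in this section can be sketched on a single machine as follows. This is only an illustration under stated assumptions: the class `MiniMapReduce`, its methods, and the word-counting task are hypothetical and do not come from the paper, and the "data nodes" are simulated by ordinary method calls rather than separate servers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process illustration of the Map and Reduce steps.
class MiniMapReduce {

    // Map step: one simulated data node turns its chunk into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String chunk) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : chunk.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Reduce step: the main node merges the responses of all sub-problems by key.
    static Map<String, Integer> reduce(List<List<Map.Entry<String, Integer>>> responses) {
        Map<String, Integer> totals = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> response : responses) {
            for (Map.Entry<String, Integer> pair : response) {
                totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    // The main node partitions the input (here: one chunk per list element),
    // collects each mapped response, then reduces them.
    static Map<String, Integer> run(List<String> chunks) {
        List<List<Map.Entry<String, Integer>>> responses = new ArrayList<>();
        for (String chunk : chunks) {
            responses.add(map(chunk));
        }
        return reduce(responses);
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("cloud data cloud", "data node")));
        // prints {cloud=2, data=2, node=1}
    }
}
```

In a real cluster each call to `map` would run on a different data node; the sequential loop in `run` only stands in for that distribution.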




3. CLOUD COMPUTING ARCHITECTURE
The cloud computing architecture used for the experiment includes three different types of servers, namely:

1) Main Server
2) Secondary Server
3) Database Server

The cloud architecture has both master nodes and slave nodes. In this implementation, a main server is one that receives client requests and handles them. The master node resides in the main server and the slave nodes in the secondary servers. Search requests are forwarded to the MapReduce algorithm on the main server. MapReduce takes care of the searching and indexing procedure by initiating a large number of Map and Reduce processes. Once the MapReduce process for a particular search key is completed, it returns the output value to the main server and in turn to the client. The complete architecture is shown in Figure 2.

Figure 2: Implementation of the Information Retrieval (IR) algorithm in a Cloud computing environment

As shown in Figure 2, the information required by the client is sent to the Main Server. For simplicity, the main server is termed the Name node and stores the metadata about the information. The metadata includes the size of the file, the actual location of the file, and block locations, among others. Each piece of information (file) is replicated across a number of Secondary Servers, referred to as Data nodes. Data nodes are responsible for tracking the data from the data centers. The complete functionality of the MapReduce algorithm operates as follows:

1) The client requests hit the Main node.
2) The Main node has the MapReduce algorithm in place and does the task of mapping. In a nutshell, the Name node keeps track of the complete file directory structure and the placement of chunks. Thus the Name node is the central control point for the entire system. To read a file, the client API calculates the chunk index based on the offset of the file pointer and makes a request to the Name node. The Name node replies with the Data nodes that hold a copy of that chunk. From this point, the client contacts the Data node directly without going through the Name node.
3) The client pushes its changes to all Data nodes, and the change is stored in a buffer of each Data node. Once the changes are buffered at all Data nodes, the client sends a "commit" request and receives a response about the success.
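The read path in step 2 can be sketched as below. The paper does not state a chunk size or any API, so the 64 MB `CHUNK_SIZE`, the class `NameNodeSketch`, and its methods are assumptions for illustration; the sketch also tracks the chunks of a single file only.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical Name node that maps chunk indexes of one file to Data nodes.
class NameNodeSketch {

    // Assumed chunk size; the paper does not specify one.
    static final long CHUNK_SIZE = 64L * 1024 * 1024;

    // Metadata: chunk index -> names of the Data nodes holding a replica.
    private final Map<Long, List<String>> chunkLocations = new HashMap<>();

    void registerChunk(long chunkIndex, List<String> dataNodes) {
        chunkLocations.put(chunkIndex, List.copyOf(dataNodes));
    }

    // Client side: derive the chunk index from the file pointer offset.
    static long chunkIndexFor(long fileOffset) {
        if (fileOffset < 0) {
            throw new IllegalArgumentException("offset must be non-negative");
        }
        return fileOffset / CHUNK_SIZE;
    }

    // Name node side: reply with the Data nodes that hold that chunk.
    // An empty list means the chunk is unknown.
    List<String> locate(long fileOffset) {
        return chunkLocations.getOrDefault(chunkIndexFor(fileOffset), List.of());
    }

    public static void main(String[] args) {
        NameNodeSketch nameNode = new NameNodeSketch();
        nameNode.registerChunk(0L, List.of("dataNode1", "dataNode2"));
        nameNode.registerChunk(1L, List.of("dataNode3"));
        // The client would now contact one of these Data nodes directly.
        System.out.println(nameNode.locate(CHUNK_SIZE + 10));
    }
}
```

Only the lookup goes through the Name node; the data transfer itself would happen between the client and the returned Data node.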




The preceding three steps are shown in Figure 3.

Figure 3: Operational steps of the IR algorithm using MapReduce in a Cloud computing environment

After completion of the three steps stated above, all modifications of chunk distribution and data changes are written to an operation log file at the Name node. This log file maintains an ordered list of operations, which is essential for the Name node to recover its view after a crash. The Name node also maintains its persistent state by regularly check-pointing to a file.
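Recovery from a checkpoint plus an ordered operation log can be sketched as follows. The log entry format (a chunk id and the Data node it was placed on), the class `OperationLogSketch`, and the in-memory list standing in for the log file are all hypothetical; the paper does not describe these details.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the Name node's operation log.
class OperationLogSketch {

    // One logged operation: chunk chunkId is now placed on dataNode.
    static final class Operation {
        final String chunkId;
        final String dataNode;

        Operation(String chunkId, String dataNode) {
            this.chunkId = chunkId;
            this.dataNode = dataNode;
        }
    }

    // Operations are kept in the order they were appended.
    private final List<Operation> log = new ArrayList<>();

    void append(String chunkId, String dataNode) {
        log.add(new Operation(chunkId, dataNode));
    }

    // Recovery: start from the last checkpoint and re-apply every logged
    // operation in order, so later operations override earlier ones.
    Map<String, String> recover(Map<String, String> checkpoint) {
        Map<String, String> view = new HashMap<>(checkpoint);
        for (Operation operation : log) {
            view.put(operation.chunkId, operation.dataNode);
        }
        return view;
    }

    public static void main(String[] args) {
        OperationLogSketch nameNodeLog = new OperationLogSketch();
        nameNodeLog.append("chunk1", "dataNode1");
        nameNodeLog.append("chunk1", "dataNode2"); // chunk1 moved later
        System.out.println(nameNodeLog.recover(Map.of("chunk0", "dataNode9")));
    }
}
```

The ordering matters: replaying the two `chunk1` entries in reverse would restore a stale placement.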

4. IR ALGORITHM WITH AND WITHOUT MAPREDUCE MECHANISM
As the study conducted in this research is a comparative analysis of the performance of the IR algorithm with and without the MapReduce mechanism, this section of the paper elaborates the flowcharts of the implementation of both algorithms in detail.
4.1 Flowchart of the IR algorithm without the MapReduce mechanism
The IR algorithm implementation without MapReduce works in three steps:

a) The requests are broken into a number of parts.
b) Each of these parts is processed in sequential order at different data centers and the response is sent back to the main server.
c) The main server, which hosts the IR algorithm, joins each of the responses and sends the result back to the user.

Figure 4: IR algorithm without MapReduce mechanism
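Steps (a) to (c) of the implementation without MapReduce can be sketched as a plain sequential loop. The class `SequentialIr`, the part size, and the placeholder `process` method (which simply tags each query with `hit:`) are hypothetical; the actual retrieval logic is not given in this section.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sequential baseline: split, process one part at a time, join.
class SequentialIr {

    // Step (a): break the requests into parts of at most partSize entries.
    static List<List<String>> split(List<String> requests, int partSize) {
        if (partSize <= 0) {
            throw new IllegalArgumentException("partSize must be positive");
        }
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < requests.size(); i += partSize) {
            int end = Math.min(i + partSize, requests.size());
            parts.add(new ArrayList<>(requests.subList(i, end)));
        }
        return parts;
    }

    // Step (b): placeholder for the work done at a data center.
    static List<String> process(List<String> part) {
        List<String> responses = new ArrayList<>();
        for (String query : part) {
            responses.add("hit:" + query);
        }
        return responses;
    }

    // Step (c): the main server joins the responses in order.
    static List<String> handle(List<String> requests, int partSize) {
        List<String> joined = new ArrayList<>();
        for (List<String> part : split(requests, partSize)) {
            joined.addAll(process(part)); // parts are handled one after another
        }
        return joined;
    }

    public static void main(String[] args) {
        System.out.println(handle(List.of("cloud", "map", "reduce"), 2));
        // prints [hit:cloud, hit:map, hit:reduce]
    }
}
```

Because each part waits for the previous one, the total time in this sketch grows with the number of parts, which is the behavior the comparison in this section is concerned with.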




4.2 Flowchart of the IR algorithm via the MapReduce mechanism
In this section, the IR algorithm using the MapReduce implementation for the cloud computing environment is developed and executed. The proposed algorithm is used in the IR algorithm to retrieve results from the World Wide Web, and the outcomes presented in the next section show that the MapReduce mechanism can be used to improve the speed of information search. The proposed algorithm is an iterative technique that makes use of three methods, namely map(), reduce(), and combine(), in the main server, to present the results. Categorization is used to retrieve and order the results according to the user's choice to customize the search.

Figure 5: IR algorithm with MapReduce mechanism
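One possible reading of how map(), combine(), and reduce() cooperate is sketched below. The section names the three methods but does not define them, so their signatures here are assumptions: `map` counts occurrences of a search key in one document, `combine` pre-aggregates the counts of one chunk locally (in the style of a Hadoop combiner), and `reduce` merges the partial results. The class `IrMapReduceSketch` and the use of a parallel stream to simulate concurrent data nodes are likewise hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of map(), combine() and reduce() for a keyword search.
class IrMapReduceSketch {

    // map(): occurrences of the (lower-case) key in a single document.
    static int map(String document, String key) {
        int count = 0;
        for (String word : document.toLowerCase().split("\\s+")) {
            if (word.equals(key)) {
                count++;
            }
        }
        return count;
    }

    // combine(): local pre-aggregation over one chunk of documents,
    // so only one number per chunk travels back to the main server.
    static int combine(List<String> chunk, String key) {
        int local = 0;
        for (String document : chunk) {
            local += map(document, key);
        }
        return local;
    }

    // reduce(): the main server merges the partial counts.
    static int reduce(List<Integer> partials) {
        int total = 0;
        for (int partial : partials) {
            total += partial;
        }
        return total;
    }

    // Chunks are processed concurrently; the parallel stream stands in
    // for data nodes working at the same time.
    static int search(List<List<String>> chunks, String key) {
        String normalizedKey = key.toLowerCase();
        List<Integer> partials = chunks.parallelStream()
                .map(chunk -> combine(chunk, normalizedKey))
                .collect(Collectors.toList());
        return reduce(partials);
    }

    public static void main(String[] args) {
        List<List<String>> chunks = List.of(
                List.of("cloud ir cloud", "map reduce"),
                List.of("Cloud computing"));
        System.out.println(search(chunks, "cloud")); // prints 3
    }
}
```

The combine step is optional for correctness in this sketch; it only reduces how much data the reduce step has to merge.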

5. RESULTS
The results of the complete experiment are presented in this section of the paper. A few important points to note here are:
1) The experiment is conducted between 5000 and 20000 requests/s.
2) The experiments report results for a pool of four bucket sizes: 1000, 2000, 3000, and 4000.

Table 1: Comparative study of the IR algorithm with and without the MapReduce mechanism. For each bucket size, "Without" is the response time without the MapReduce algorithm (in s) and "Via" is the response time via the MapReduce algorithm (in s).

Number of     Bucket Size=1000     Bucket Size=2000     Bucket Size=3000     Bucket Size=4000
Requests/s    Without   Via        Without   Via        Without   Via        Without   Via
5000          54213     53893      53922     53893      53922     53893      54798     53893
6000          66923     64883      65126     64893      65126     64893      66098     64882
7000          77343     75893      75898     75882      75898     75882      79876     75882
8000          87903     86893      87961     86893      87961     86882      87762     86893
9000          99123     97893      98956     97871      98956     97893      99877     97893
10000         129924    108894     119862    108894     119862    108894     110872    108872
11000         148264    120894     130700    120894     130700    120833     139813    120871
12000         159924    132894     149879    132894     149879    132894     158090    132894
13000         163434    144894     166973    144846     166973    144894     179898    144894
14000         156894    156894     176756    156894     176756    156894     190232    156882
15000         163268    168894     185683    168894     185683    168894     209873    168882
16000         192876    180894     192876    180894     192876    180894     219098    180894
17000         208734    192894     208342    192894     208342    192846     239098    192894
18000         229869    204894     238672    204894     238672    204894     248803    204882
19000         250980    216894     237803    216894     237803    216894     270892    216894
20000         277987    228894     249800    228894     249800    228894     308767    228894
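As a reading aid for Table 1, the relative reduction in response time for one cell pair can be computed as (without - via) / without × 100. The sketch below is not part of the original experiment: the class `ResponseTimeGain` is hypothetical, and the sample values are copied from the Bucket Size=1000 columns of Table 1, assuming the first value of each pair is the time without MapReduce.

```java
// Hypothetical helper for comparing a pair of Table 1 values.
class ResponseTimeGain {

    // Percentage by which the MapReduce response time is lower than the
    // response time without MapReduce; negative if MapReduce was slower.
    static double reductionPercent(long withoutMapReduce, long viaMapReduce) {
        if (withoutMapReduce <= 0) {
            throw new IllegalArgumentException("baseline time must be positive");
        }
        return 100.0 * (withoutMapReduce - viaMapReduce) / withoutMapReduce;
    }

    public static void main(String[] args) {
        // Bucket Size=1000 values from Table 1.
        System.out.printf("5000 requests/s:  %.2f%%%n", reductionPercent(54213, 53893));
        System.out.printf("20000 requests/s: %.2f%%%n", reductionPercent(277987, 228894));
    }
}
```

Applied row by row, this makes it easier to see how the gap between the two approaches changes as the number of requests grows.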

6. EVALUATING THE PERFORMANCE
Different sets of requests were submitted, each of a different size, and the MapReduce jobs were executed in single-node clusters. The corresponding execution times were calculated, and the conclusion from running the experiment was that running MapReduce in clusters is by far the more effective approach for a large volume of requests. The study leads to two clear results:
• In a cloud environment, the MapReduce structure increases throughput efficiency for a large number of requests. In contrast, one would not necessarily see such an increase in throughput in a non-cloud system.
• When the data set is small, MapReduce does not yield a substantial increase in throughput in a cloud system.
Therefore, consider a combination of MapReduce-style parallel processing when planning to process a large number of requests in the cloud system.

References
[1] Bordogna, G. & Pasi, G., "A fuzzy linguistic approach generalizing Boolean information retrieval: a model and its evaluation", Journal of the American Society for Information Science, 44(2), 1993, pp. 70-82
[2] Belew, R., "Adaptive information retrieval", Proceedings of the Twelfth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, 1989, pp. 11-20
[3] Blair, D.C. & Maron, M.E., "An evaluation of retrieval effectiveness for a full text document-retrieval system", Communications of the ACM, 28(3), 1985, pp. 289-299
[4] Bookstein, A., "Probability and fuzzy-set applications to information retrieval", Annual Review of Information Science and Technology, 20, 1985, pp. 117-151
[5] Chen, H. & Dhar, V., "Cognitive process as a basis for intelligent retrieval systems design", Information Processing and Management, 27, 1991, pp. 405-432
[6] Goldberg, D.E., "Genetic Algorithms in Search, Optimization and Machine Learning", Reading, MA: Addison-Wesley, 1989
[7] Gordon, M.D., "Probabilistic and genetic algorithms for document retrieval", Communications of the ACM, 31(10), 1988, pp. 1208-1218
[8] Gordon, M.D., "User-based document clustering by redescribing subject descriptions with a genetic algorithm", Journal of the American Society for Information Science, 42, 1991, pp. 311-322
[9] Harman, D., "An experimental study of factors important in document ranking", in Proceedings of the ACM SIGIR, 1986, pp. 186-193
[10] Holland, J.H., "Adaptation in Natural and Artificial Systems", Ann Arbor: The University of Michigan Press, 1975
