

Medical Content Based Image Retrieval by Using the HADOOP Framework

Said Jai-Andaloussi, Abdeljalil Elabdouli, Abdelmajid Chaffai, Nabil Madrane, Abderrahim Sekkaki

Abstract— Most medical images are now digitized and stored in large image databases, and retrieving the desired images becomes a challenge. In this paper, we address the challenge of content based image retrieval by applying the MapReduce distributed computing model and the HDFS storage model. Two methods are used to characterize the content of images: the first is the BEMD-GGD method (Bidimensional Empirical Mode Decomposition with Generalized Gaussian Density functions) and the second is the BEMD-HHT method (Bidimensional Empirical Mode Decomposition with the Huang-Hilbert Transform, HHT). To measure the similarity between images we compute the distance between their signatures, using the Kullback-Leibler Divergence (KLD) to compare BEMD-GGD signatures and the Euclidean distance to compare HHT signatures. Through experiments on the DDSM mammography image database, we confirm that the results are promising; this work has allowed us to verify the feasibility and efficiency of applying CBIR to large medical image databases.

I. INTRODUCTION

Nowadays, medical imaging systems produce more and more digitized images in all medical fields. Most of these images are stored in image databases, and there is great interest in using them for diagnostic and clinical decisions such as case-based reasoning [1]. The purpose is to retrieve desired images from a large image database using only the numerical content of the images. A CBIR (Content-Based Image Retrieval) system is one of the possible solutions to effectively manage image databases [2]. Furthermore, fast access to such a huge database requires an efficient computing model. The Hadoop framework is one such solution, based on the MapReduce [3] distributed computing model. Lately, MapReduce has emerged as one of the most widely used parallel computing platforms for processing data on terabyte and petabyte scales. Google, Amazon, and Facebook are among the biggest users of the MapReduce programming model, and it has recently been adopted by several universities. It allows distributed processing of data-intensive computations over many machines. In CBIR systems, requests (the system inputs) are images and answers (outputs/results) are all the similar images in the database. A typical CBIR system can be decomposed into three steps: first, the characteristic features of each image in the database are extracted and used to index the images; second, the feature vector of a query image is computed; and third, the feature vector of the query image is compared to those of the images in the database.

Said Jai-Andaloussi, Nabil Madrane and Abderrahim Sekkaki are with LIAD Lab, Casablanca, Kingdom of Morocco; [email protected]

Said Jai-Andaloussi, Abdeljalil Elabdouli, Abdelmajid Chaffai, Nabil Madrane and Abderrahim Sekkaki are with the Faculty of Science Ain-chok, Casablanca, Kingdom of Morocco.

For the definition and extraction of image characteristic features, many methods have been proposed, including image segmentation and image characterization using the wavelet transform and Gabor filter banks [4, 5]. In this work, we use the MapReduce computing model to extract image features by applying BEMD-GGD and BEMD-HHT [2]; we then write the feature files into HBase [6] (HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable). The Kullback-Leibler divergence (KLD) and the Euclidean distance are used to compute the similarity between image features.

The rest of the paper is organized as follows: Section II-A describes the database we used for evaluation. In Section II-B we present the components of the Hadoop framework. Section II-C describes the BEMD, BEMD-GGD and BEMD-HHT methods. In Section II-D we present the similarity measures. Section III describes the architecture of the CBIR system based on the Hadoop framework, and results are given in Section IV. We end with a discussion and conclusion in Section V.

II. MATERIAL AND METHODS

A. DDSM Mammography database

The DDSM project [7] is a collaborative effort involving the Massachusetts General Hospital, the University of South Florida and Sandia National Laboratories. The database contains approximately 2,500 patient files. Each patient file includes two images of each breast (4 images per patient, 10,000 images in total), along with some associated patient information (age at time of study, ACR breast density rating) and image information (scanner, spatial resolution). Images have a definition of 2,000 by 5,000 pixels. The database is classified into 3 levels of diagnosis ('normal', 'benign' or 'cancer'). An example of an image series is given in Figure 1.

Fig. 1. Image series from a mammography study


B. Hadoop Framework

Hadoop is a distributed master-slave architecture (a model of communication in which one process, called the master, has control over one or more other processes, called slaves) that consists of the Hadoop Distributed File System (HDFS) for storage and MapReduce for computational capabilities. Traits intrinsic to Hadoop are data partitioning and parallel computation of large datasets. Its storage and computational capabilities scale with the addition of hosts to a Hadoop cluster, and can reach volumes in the petabytes on clusters with thousands of hosts [8].

1) MapReduce: MapReduce is a batch-based, distributed computing framework modeled after Google's MapReduce paper (http://research.google.com/archive/mapreduce.html). It allows you to parallelize work over a large amount of raw data. MapReduce decomposes work submitted by a client into small parallelized map and reduce tasks, as shown in Figure 2 (taken from [8]). The map and reduce constructs used in MapReduce are borrowed from those found in the Lisp functional programming language, and use a shared-nothing model (a distributed computing concept in which each node is independent and self-sufficient) to remove any parallel-execution interdependencies that could add unwanted synchronization points.
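As an illustration of these map and reduce constructs, the following minimal sketch (using the standard org.apache.hadoop.mapreduce API) counts the images of a hypothetical tab-separated catalogue "imageId<TAB>diagnosisClass" per diagnosis class; it is a toy job for exposition only, not part of the retrieval system.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ClassCount {

      // Map: one catalogue line "imageId<TAB>diagnosisClass" -> (diagnosisClass, 1)
      public static class ClassMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          String[] fields = value.toString().split("\t");
          if (fields.length == 2) {
            ctx.write(new Text(fields[1]), ONE);          // e.g. ("cancer", 1)
          }
        }
      }

      // Reduce: sum the ones emitted for each diagnosis class
      public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) sum += v.get();
          ctx.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "class-count");
        job.setJarByClass(ClassCount.class);
        job.setMapperClass(ClassMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input catalogue in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }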

2) HDFS: HDFS is the storage component of Hadoop. It is a distributed filesystem modeled after the Google File System (GFS, http://research.google.com/archive/gfs.html). HDFS is optimized for high throughput and works best when reading and writing large files (gigabytes and larger). To support this throughput, HDFS uses unusually large (for a filesystem) block sizes and data-locality optimizations to reduce network input/output (I/O). Scalability and availability are also key traits of HDFS, achieved in part through data replication and fault tolerance. HDFS can create, move, delete and rename files like traditional file systems, but the method of storage differs because it involves two actors: the NameNode and the DataNode. A DataNode stores data in the Hadoop File System, while the NameNode is the centerpiece of an HDFS file system: it keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept.
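The file operations mentioned above (create, move, delete, rename) are exposed through Hadoop's FileSystem client API, with the NameNode resolved from the cluster configuration. A minimal sketch, assuming a configured cluster and hypothetical file paths:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsOps {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // NameNode taken from fs.defaultFS

        Path local = new Path("mammo_0001.png");                 // hypothetical local image
        Path remote = new Path("/ddsm/images/mammo_0001.png");   // hypothetical HDFS path

        fs.copyFromLocalFile(local, remote);                          // write into HDFS
        fs.rename(remote, new Path("/ddsm/archive/mammo_0001.png")); // move/rename
        fs.delete(new Path("/ddsm/tmp"), true);                       // recursive delete
        fs.close();
      }
    }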

C. Numerical image characterization: signatures

The BEMD [9, 10] is an adaptive decomposition which decomposes any image into a set of functions denoted BIMFs (Bidimensional Intrinsic Mode Functions) and a residue; these BIMFs are obtained by means of an algorithm called the sifting process [11]. This decomposition allows local features (phase, frequency) of the input image to be extracted. In this work, we describe the image by generating a numerical signature based on the BIMF contents [12, 13].

The usual approach used in CBIR systems to characterize an image in a generic way is to define a global representation of the whole image, or to compute statistical parameters such as the co-occurrence matrix or a Gabor filter bank [5].


Fig. 2. A client submitting a job to MapReduce

Fig. 3. Extraction process of the image signature using BEMD-GGD and BEMD-HHT

These parameters represent the signature, or index, of the image. The most widely used technique is to build image signatures based on the information content of color histograms. In this work, we use the Bidimensional Empirical Mode Decomposition (BEMD), the Generalized Gaussian Density function (GGD) and the Huang-Hilbert transform to generate the image signature.

1) BEMD-GGD signature: The generalized Gaussian law is derived from the normal law and is parameterized by:
• α: a scale factor, corresponding to the standard deviation of the classical Gaussian law;
• β: a shape parameter.
The density of the law is thus defined as

p(x; \alpha, \beta) = \frac{\beta}{2\alpha\,\Gamma(1/\beta)}\, e^{-(|x|/\alpha)^{\beta}}   (1)

where Γ(·) is the gamma function,


\Gamma(z) = \int_{0}^{\infty} e^{-t}\, t^{z-1}\, dt, \quad z > 0.

We propose to characterize images by the couples (α, β), determined using a maximum likelihood estimator (α̂, β̂) of the distribution law of the coefficients of each BIMF in the BEMD decomposition [12]. The image signature vector is formed by the set of couples (α̂, β̂) derived from each BIMF, together with the histogram of the residue.
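As an illustration, the sketch below estimates (α̂, β̂) for one BIMF by moment matching, a common alternative to the maximum-likelihood estimator used in this work; it relies on the identities E|X| = αΓ(2/β)/Γ(1/β) and E[X²] = α²Γ(3/β)/Γ(1/β), and assumes Apache Commons Math is available for the log-gamma function.

    import org.apache.commons.math3.special.Gamma;

    // Moment-matching estimate of the GGD parameters (alpha, beta) of one BIMF.
    // Sketch only: the paper uses maximum likelihood; this is a simpler illustration.
    public final class GgdMomentEstimator {

      // r(beta) = Gamma(2/beta)^2 / (Gamma(1/beta) * Gamma(3/beta)) = (E|X|)^2 / E[X^2]
      private static double ratio(double beta) {
        return Math.exp(2 * Gamma.logGamma(2 / beta)
            - Gamma.logGamma(1 / beta) - Gamma.logGamma(3 / beta));
      }

      // coeffs: flattened BIMF coefficients; returns {alphaHat, betaHat}
      public static double[] estimate(double[] coeffs) {
        double m1 = 0, m2 = 0;
        for (double x : coeffs) { m1 += Math.abs(x); m2 += x * x; }
        m1 /= coeffs.length;
        m2 /= coeffs.length;
        double target = (m1 * m1) / m2;          // empirical (E|X|)^2 / E[X^2]

        // r(beta) increases with beta, so solve r(beta) = target by bisection.
        double lo = 0.1, hi = 5.0;
        for (int i = 0; i < 60; i++) {
          double mid = 0.5 * (lo + hi);
          if (ratio(mid) < target) lo = mid; else hi = mid;
        }
        double beta = 0.5 * (lo + hi);
        // E|X| = alpha * Gamma(2/beta) / Gamma(1/beta)
        double alpha = m1 * Math.exp(Gamma.logGamma(1 / beta) - Gamma.logGamma(2 / beta));
        return new double[] {alpha, beta};
      }
    }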

2) BEMD-HHT signature: In the second method we apply the Huang-Hilbert transform [11] to each BIMF and extract information from the transformed BIMFs. Consider an analytic signal z(t) (equation (2)) of a real signal s(t); the imaginary part of z(t) is equal to the Hilbert transform of the real part (equation (3)):

z(t) = s(t) + i\, y(t)   (2)

y(t) = \mathcal{H}[s(t)] = \mathrm{p.v.}\ \frac{1}{\pi} \int_{-\infty}^{+\infty} \frac{s(\tau)}{t-\tau}\, d\tau   (3)

where p.v. denotes the Cauchy principal value. We propose to characterize the image using the statistical features (mean, standard deviation) extracted from the amplitude matrix A, phase matrix θ and instantaneous frequency matrix W of each BIMF [13]. Figure 3 shows the extraction process of the image signature using BEMD-GGD and BEMD-HHT.
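A minimal sketch of assembling the six BEMD-HHT statistics of one BIMF, assuming the amplitude, phase and instantaneous-frequency matrices have already been obtained from the Hilbert transform:

    // Six BEMD-HHT statistics of one BIMF: mean and standard deviation of the
    // amplitude (A), phase (theta) and instantaneous-frequency (W) matrices.
    public final class HhtStats {

      static double[] meanAndStd(double[][] m) {
        double sum = 0, sumSq = 0;
        int n = 0;
        for (double[] row : m) {
          for (double v : row) { sum += v; sumSq += v * v; n++; }
        }
        double mean = sum / n;
        double var = sumSq / n - mean * mean;
        return new double[] {mean, Math.sqrt(Math.max(var, 0))};
      }

      // Concatenates the statistics of A, theta and W into one 6-component vector:
      // {muA, muTheta, muW, sigmaA, sigmaTheta, sigmaW}
      static double[] signature(double[][] a, double[][] theta, double[][] w) {
        double[] sA = meanAndStd(a), sT = meanAndStd(theta), sW = meanAndStd(w);
        return new double[] {sA[0], sT[0], sW[0], sA[1], sT[1], sW[1]};
      }
    }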

D. Distance

1) BEMD-GGD similarity: To compute the similarity distance between two BIMFs (modeled by generalized Gaussian densities), the Kullback-Leibler distance is used, following [14] (see equation (4)).

\mathrm{KLD}\big(p(X;\theta_q)\,\|\,p(X;\theta_i)\big) = \int p(X;\theta_q)\, \log \frac{p(X;\theta_q)}{p(X;\theta_i)}\, dx   (4)

The distance between two images I and J is the sum of the weighted distances between their BIMFs (see equation (5)):

D(I,J) = \sum_{k=1}^{K} \mathrm{KLD}\big(p(X;\alpha_k^I,\beta_k^I),\, p(X;\alpha_k^J,\beta_k^J)\big)   (5)
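Between two generalized Gaussian densities, the integral in equation (4) admits a well-known closed form, so no numerical integration is needed when comparing signatures. A sketch, assuming each image signature is stored as one {α, β} pair per BIMF and using Apache Commons Math for the log-gamma function:

    import org.apache.commons.math3.special.Gamma;

    // Closed-form KLD between two generalized Gaussians p(.; a1, b1) and p(.; a2, b2).
    public final class GgdKld {

      public static double kld(double a1, double b1, double a2, double b2) {
        double logTerm = Math.log(b1 / b2) + Math.log(a2 / a1)
            + Gamma.logGamma(1 / b2) - Gamma.logGamma(1 / b1);
        double powTerm = Math.pow(a1 / a2, b2)
            * Math.exp(Gamma.logGamma((b2 + 1) / b1) - Gamma.logGamma(1 / b1));
        return logTerm + powTerm - 1 / b1;
      }

      // Equation (5): sum of the per-BIMF divergences between images I and J,
      // where ggdI[k] = {alpha, beta} of the k-th BIMF of image I.
      public static double imageDistance(double[][] ggdI, double[][] ggdJ) {
        double d = 0;
        for (int k = 0; k < ggdI.length; k++) {
          d += kld(ggdI[k][0], ggdI[k][1], ggdJ[k][0], ggdJ[k][1]);
        }
        return d;
      }
    }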

2) BEMD-Hilbert similarity: The distance between two images I and J is the sum of the weighted distances between their BIMFs (see equation (6)):

D(I,J) = \sum_{k=1}^{K} \lambda_k\, d(\mathrm{BIMF}_k^I, \mathrm{BIMF}_k^J)   (6)

where the λ_k are adjustment weights and each BIMF_k is represented by its feature vector. The distance between two BIMFs at the same level of decomposition is defined as:

d(\mathrm{BIMF}^I,\mathrm{BIMF}^J) = \left|\frac{\mu_A^I-\mu_A^J}{\alpha(\mu_A)}\right| + \left|\frac{\mu_\theta^I-\mu_\theta^J}{\alpha(\mu_\theta)}\right| + \left|\frac{\mu_W^I-\mu_W^J}{\alpha(\mu_W)}\right| + \left|\frac{\sigma_A^I-\sigma_A^J}{\alpha(\sigma_A)}\right| + \left|\frac{\sigma_\theta^I-\sigma_\theta^J}{\alpha(\sigma_\theta)}\right| + \left|\frac{\sigma_W^I-\sigma_W^J}{\alpha(\sigma_W)}\right|   (7)

α(µ) and α(σ) are the standard deviations of the respectivefeatures over the entire database, and are used to normalizethe individual feature components.
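A sketch of equations (6) and (7), assuming each BIMF signature is the 6-component vector {μA, μθ, μW, σA, σθ, σW} produced as in the earlier sketch, and that the normalization factors α(·) have been precomputed over the whole database:

    // Distances of equations (6) and (7) over BEMD-HHT signatures.
    public final class HhtDistance {

      // Equation (7): normalized L1 distance between two BIMF signatures.
      // norm[i] is the standard deviation of feature i over the entire database.
      static double bimfDistance(double[] sigI, double[] sigJ, double[] norm) {
        double d = 0;
        for (int i = 0; i < sigI.length; i++) {
          d += Math.abs((sigI[i] - sigJ[i]) / norm[i]);
        }
        return d;
      }

      // Equation (6): weighted sum over the K BIMFs of images I and J.
      static double imageDistance(double[][] imgI, double[][] imgJ,
                                  double[][] norm, double[] lambda) {
        double d = 0;
        for (int k = 0; k < imgI.length; k++) {
          d += lambda[k] * bimfDistance(imgI[k], imgJ[k], norm[k]);
        }
        return d;
      }
    }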

III. ARCHITECTURE OF THE CBIR SYSTEM BASED ON THE HADOOP FRAMEWORK

Content based image retrieval (CBIR) is composed of two phases: 1) an offline phase and 2) an online phase. In the offline phase, the signature vector is computed for each image in the database and stored. In the online phase, the query is constructed by computing the signature vector of the input image; the query signature is then compared with the signatures of the images in the database.

A. Offline phase: applying MapReduce to the extraction of the image signature

MapReduce is known for its ability to handle large amounts of data. In this work, we use the open source distributed computing framework Hadoop and its implementation of the MapReduce model to extract the feature vectors of images. The implementation of distributed feature extraction and image storage is shown in figure 4. Storage is the foundation of a CBIR system: given the amount of image data produced daily by medical services, retrieving and processing these images requires considerable computation time, so parallel processing is necessary. For this reason, we adopt the MapReduce computing model to extract the visual features of images and then write the features and image files into HBase. HBase partitions the key space; each partition is called a Table, and each table declares one or more column families, which define the storage properties for an arbitrary set of columns [6]. The table in figure 5 shows the structure of our HBase table: the row key is the image ID and the column families are "file" and "features". The labels "source" and "class" are added under the family "file", representing the source image and the class of the image (the DDSM database is classified into 3 levels of diagnosis: 'normal', 'benign' or 'cancer'). Under the family "features", the labels "feature BEMD-GGD Alpha", "feature BEMD-GGD Beta", "feature BEMD-HHT mean", "feature BEMD-HHT standard deviation", "feature BEMD-HHT phase" and "feature BEMD-residue histogram" are added, representing the features extracted using the BEMD-GGD and BEMD-HHT methods.


Fig. 4. Offline phase: applying MapReduce to the extraction of the image signature

Fig. 5. Table 1: Structure of the HBase table for image feature storage
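A sketch of how one row of this table could be written with the HBase Java client (1.x+ API). The table name "images", the row-key format and the serialization of the feature vectors into byte arrays are assumptions of the example, not prescribed by the system:

    import java.io.IOException;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public final class SignatureStore {

      // Writes one image and its feature vectors into the HBase layout of Fig. 5.
      public static void store(Connection conn, String imageId, byte[] imageBytes,
                               String diagnosisClass, byte[] ggdAlpha, byte[] ggdBeta,
                               byte[] hhtStats) throws IOException {
        try (Table table = conn.getTable(TableName.valueOf("images"))) {
          Put put = new Put(Bytes.toBytes(imageId));                 // row key = image ID
          // "file" family: the source image and its diagnosis class
          put.addColumn(Bytes.toBytes("file"), Bytes.toBytes("source"), imageBytes);
          put.addColumn(Bytes.toBytes("file"), Bytes.toBytes("class"),
                        Bytes.toBytes(diagnosisClass));
          // "features" family: serialized BEMD-GGD and BEMD-HHT signature vectors
          put.addColumn(Bytes.toBytes("features"), Bytes.toBytes("BEMD-GGD_Alpha"), ggdAlpha);
          put.addColumn(Bytes.toBytes("features"), Bytes.toBytes("BEMD-GGD_Beta"), ggdBeta);
          put.addColumn(Bytes.toBytes("features"), Bytes.toBytes("BEMD-HHT_stats"), hhtStats);
          table.put(put);
        }
      }
    }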

B. Online phase: applying MapReduce to image retrieval

In the figure below (Fig. 6), we describe the online retrieval phase. It is divided into 7 steps (a sketch of the map and reduce functions for steps 4 and 5 is given after the list):

Fig. 6. Online phase: applying MapReduce to image retrieval

1) The user sends a query image to the SCL; the image is stored temporarily in HDFS.
2) A MapReduce job is run to extract the features of the query image.
3) The query image features are stored in HDFS.
4) The similarity/distance between the feature vector of the query image in HDFS and those of the target images in HBase is computed.
5) A reduce task collects and combines the results from all the map functions.
6) The reducer stores the result in HDFS.
7) The result is sent to the user.
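The sketch below illustrates steps 4 and 5 under simplifying assumptions: the query signature is passed through the job configuration, the stored signatures are read as tab-separated text records exported from HBase, distances are Euclidean (the BEMD-HHT case), and a single reducer keeps the 20 smallest distances.

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Step 4: each map task compares the query signature against a slice of the database.
    class DistanceMapper extends Mapper<Object, Text, DoubleWritable, Text> {
      private double[] query;

      @Override
      protected void setup(Context ctx) {
        // Hypothetical: the query signature is passed as "v1,v2,..." in the configuration.
        String[] parts = ctx.getConfiguration().get("cbir.query.signature").split(",");
        query = new double[parts.length];
        for (int i = 0; i < parts.length; i++) query[i] = Double.parseDouble(parts[i]);
      }

      @Override
      protected void map(Object key, Text value, Context ctx)
          throws IOException, InterruptedException {
        // Input record (simplified): "imageId<TAB>v1,v2,...,vn" exported from HBase.
        String[] fields = value.toString().split("\t");
        String[] parts = fields[1].split(",");
        double dist = 0;
        for (int i = 0; i < parts.length; i++) {   // Euclidean distance
          double diff = query[i] - Double.parseDouble(parts[i]);
          dist += diff * diff;
        }
        ctx.write(new DoubleWritable(Math.sqrt(dist)), new Text(fields[0]));
      }
    }

    // Steps 5-6: a single reducer sees the distances in ascending key order
    // and keeps the 20 best matches; its output is written to HDFS.
    class TopKReducer extends Reducer<DoubleWritable, Text, Text, DoubleWritable> {
      private int emitted = 0;

      @Override
      protected void reduce(DoubleWritable dist, Iterable<Text> ids, Context ctx)
          throws IOException, InterruptedException {
        for (Text id : ids) {
          if (emitted++ >= 20) return;
          ctx.write(new Text(id.toString()), new DoubleWritable(dist.get()));
        }
      }
    }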

IV. RESULTS

The method is tested on the DDSM database (see II-A). We measure the mean precision at 20, which is the ratio between the number of pertinent images retrieved and the total number of images retrieved. The principle of our retrieval method is as follows (a sketch of the precision computation is given after the list):

• Each image in the database is used as a query image.
• The algorithm finds the twenty images of the database closest to the query image.
• Precision is computed for this query.
• Finally, we compute the mean precision over all queries.
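A sketch of this evaluation, under the assumption that a retrieved image is counted as pertinent when it carries the same diagnosis class as the query image:

    // Mean precision at 20: each image is a query; count how many of its 20
    // nearest neighbours share its diagnosis class, then average over all queries.
    public final class MeanPrecision {

      // queryClass[q]: diagnosis class of query q
      // top20Classes[q]: diagnosis classes of the 20 images retrieved for query q
      public static double meanPrecisionAt20(String[] queryClass, String[][] top20Classes) {
        double sum = 0;
        for (int q = 0; q < queryClass.length; q++) {
          int relevant = 0;
          for (String c : top20Classes[q]) {
            if (c.equals(queryClass[q])) relevant++;
          }
          sum += relevant / 20.0;        // precision at 20 for this query
        }
        return sum / queryClass.length;  // mean over all queries
      }
    }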

In the image retrieval performance tests, we compared the local (single-machine) method with the parallel method based on the Hadoop framework. A plot of the time consumed to retrieve images in the parallel and local settings is given in figure 7; the horizontal axis represents the size of the image database and the vertical axis the retrieval time (in milliseconds). We can see that when the image data set is small (100 < size < 1000), local image retrieval takes less time: it performs better than the MapReduce-based retrieval because MapReduce first needs to set up the cluster nodes, split the source files, etc. As the data size increases (1000 < size < 3000), the parallel and local retrieval show similar processing capability. When the data set is even larger (size > 6000), the time consumed by MapReduce retrieval is less than that consumed by local retrieval. These results show that using the MapReduce model in CBIR is very suitable for large image databases and improves efficiency significantly.


Fig. 7. Comparison of the image retrieval methods (local and MapReduce) in terms of time consumed

V. CONCLUSIONS

In this paper, we applied the Hadoop distributed computing environment to content based image retrieval. To do so, we proposed two methods to characterize the numerical content of medical images: the first method is BEMD-GGD and the second is BEMD-HHT. The Hadoop framework is used to store images and their features in the column-oriented database HBase, and the MapReduce computing model is used to improve the performance of image retrieval over massive image data. Furthermore, we compared the proposed method with the local method: image retrieval based on the MapReduce distributed computing model is more efficient when the target image data set is large. Our method still needs to be validated on larger image databases such as a PACS (Picture Archiving and Communication System). In the near future, we believe that applying this approach to PACS will change medical diagnosis aid, since we will be able to index the amount of data stored in a PACS in a few seconds.

REFERENCES

[1] G. Quellec, M. Lamard, L. Bekri, G. Cazuguel, B. Cochener, C. Roux, "Recherche de cas médicaux multimodaux à l'aide d'arbres de décision", IRBM, 2008;29(1):35-43.

[2] S. Jai-Andaloussi, M. Lamard, G. Cazuguel, H. Tairi, M. Meknassi, B. Cochener, C. Roux, "Content Based Medical Image Retrieval based on BEMD: Optimization of a similarity metric", 2010 Annual International Conference of the IEEE EMBS, Buenos Aires, Argentina, 2010, pp. 3069-3072.

[3] J. Dean, S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", OSDI, 2004.

[4] M. Lamard, G. Cazuguel, G. Quellec, L. Bekri, C. Roux, B. Cochener, "Content Based Image Retrieval based on Wavelet Transform coefficients distribution", in Proceedings of the 29th Annual International Conference of the IEEE EMBS, Lyon, France, August 23-26, 2007.

[5] B. S. Manjunath, P. Wu, S. Newsam, H. D. Shin, "A texture descriptor for browsing and similarity retrieval", Journal of Signal Processing: Image Communication, 16(1-2):33-43, 2000.

[6] J. Zhang, X. Liu, J. Luo, B. Lang, "DIRS: Distributed Image Retrieval System Based on MapReduce", ICPCA, Lanzhou, China, 1-3 Dec. 2010.

[7] M. Heath, K. Bowyer, D. Kopans et al., "Current status of the digital database for screening mammography", Digital Mammography, Kluwer Academic Publishers, pp. 457-460, 1998.

[8] A. Holmes, "Hadoop in Practice", Manning, October 2012.

[9] J. C. Nunes, Y. Bouaoune, E. Deléchelle, O. Niang, Ph. Bunel, "Image analysis by bidimensional empirical mode decomposition", Image and Vision Computing, 2003; Vol. 21:1019-1026.

[10] S. M. A. Bhuiyan, R. R. Adhami, J. F. Khan, "A novel approach of fast and adaptive bidimensional empirical mode decomposition", EURASIP Journal on Advances in Signal Processing, Vol. 2008.

[11] N. E. Huang et al., "The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis", Proc. Roy. Soc. London A, Vol. 454, pp. 903-995, 1998.

[12] S. Jai-Andaloussi, M. Lamard, G. Cazuguel, H. Tairi, M. Meknassi, B. Cochener, C. Roux, "Content Based Medical Image Retrieval Based on BEMD: use of Generalized Gaussian Density to model BIMFs coefficients", GVIP, 2010;10(2):29-38.

[13] S. Jai-Andaloussi, M. Lamard, G. Cazuguel, H. Tairi, M. Meknassi, B. Cochener, C. Roux, "Recherche d'images médicales par leur contenu numérique : utilisation de signatures construites à partir de la BEMD et de la transformée de Hilbert", JDTIC'09, 16-18 July 2009, Rabat, Morocco.

[14] G. Van de Wouwer, P. Scheunders, D. Van Dyck, "Statistical texture characterization from discrete wavelet representations", IEEE Trans. Image Processing, 1999; vol. 8:592-598.