
Journal of Emerging Trends in Computing and Information Sciences, Vol. 3, No. 4, April 2012, ISSN 2079-8407

©2009-2012 CIS Journal. All rights reserved.

http://www.cisjournal.org


Cloud Hadoop Map Reduce For Remote Sensing Image Analysis

Mohamed H. Almeer
Computer Science & Engineering Department, Qatar University, Doha, QATAR
[email protected], [email protected]

ABSTRACT
Image processing algorithms related to remote sensing have been tested and run on the Hadoop MapReduce parallel platform, using an experimental 112-core high-performance cloud computing system situated in the Environmental Studies Center at Qatar University. Although there has been considerable research on using the Hadoop platform for image processing rather than for its original purpose of text processing, it had not yet been demonstrated that Hadoop can handle high volumes of image files successfully. We therefore investigate Hadoop for image processing using eight practical image processing algorithms. We extend Hadoop's file handling to treat a whole TIFF image file as a single unit by extending the input formats that Hadoop uses, and we then apply the same approach to other image formats such as JPEG, BMP, and GIF. Experiments show that the method is scalable and efficient in processing multiple large images used mostly for remote sensing applications, and that the difference between the single-PC runtime and the Hadoop runtime is clearly noticeable.

Keywords: Cloud, Hadoop, HPC, Image processing, Map reduce.

1. INTRODUCTION

The remote sensing community has recognized the challenge of processing large and complex satellite datasets to derive customized products, and several efforts have been made in recent years to incorporate high-performance computing models. This study examines recent advances in distributed computing, as embodied in the MapReduce programming model, and extends them to the processing of remote sensing images collected over Qatar. We make two contributions: we alleviate the scarcity of parallel algorithms for processing large numbers of images by using the parallel Hadoop MapReduce framework, and we implement the approach for Qatari remote sensing images. Our research seeks an efficient programming method for customized processing within the Hadoop MapReduce framework and determines how it can be implemented. Performance tests for processing large archives of Landsat images were performed with Hadoop. The following eight image analysis algorithms were used: Sobel filtering, image resizing, image format conversion, auto contrasting, image sharpening, picture embedding, text embedding, and image quality inspection. The findings demonstrate that MapReduce has potential for processing large-scale remotely sensed images and solving more complex geospatial problems.

Current approaches to image processing handle a small number of images in a sequential manner. Such processing loads can almost fit on a single computer equipped with a relatively small memory. Still, considerably more disk space is needed to store the large-scale image repository that usually results from satellite-collected data.

Large-scale TIFF images of the Qatar Peninsula taken from satellites are used in this image processing approach. Other image formats such as BMP, GIF, and JPEG can also be handled by the Hadoop Java programming model developed; that is, the model can operate on those formats as well as on the original TIFF. The developed approach can also convert formats other than TIFF to JPEG.

2. RELATED WORKS

Kocakulak and Temizel [3] used Hadoop and MapReduce to perform ballistic image analysis, which requires a large database of images to be compared with an unknown image. They used a correlation method to compare the library of images with the unknown image. The process imposed a high computational demand, but its processing time was reduced dramatically when 14 computational nodes were utilized. Li et al. [4] used a parallel clustering algorithm with Hadoop and MapReduce, intended to reduce the time taken by clustering algorithms when applied to a large number of satellite images. The process starts by assigning each pixel to its nearest cluster and then recalculates all cluster centers on the basis of every pixel in each cluster set. Another clustering of remote sensing images was reported by Lv et al. [2], but a parallel K-means approach was used: objects with similar spectral values were clustered together without any prior knowledge. The parallel Hadoop MapReduce environment was helpful here, since the execution of this algorithm is both time- and memory-consuming.


Golpayegani and Halem [5] used a Hadoop MapReduce framework to operate on a large number of Aqua satellite images collected by the AIRS instruments; a gridding of satellite data is performed using this parallel approach. In all these cases, the MapReduce-based parallel algorithm showed increasing superiority over a single-machine implementation as the volume of image data grew, and with higher-performance hardware the advantage of MapReduce became even more pronounced.

Research on image processing on top of the Hadoop environment, a relatively new field as far as satellite imagery is concerned, has so far produced only a few successful approaches. We therefore take eight techniques from the ordinary discipline of image processing and apply them to remote sensing data on Qatar, using Java programs implemented in a Hadoop MapReduce environment within a cluster of computing machines.

3. HADOOP - HDFS - MAPREDUCE

Hadoop is an Apache software project that includes challenging subprojects such as the MapReduce implementation and the Hadoop Distributed File System (HDFS), which is similar to the main Google File System implementation.

In our study, as in others, the MapReduce programming model works closely with this distributed file system. HDFS is a highly fault-tolerant distributed file system that can store a large number of very large files on cluster nodes. MapReduce is built on top of HDFS but is independent of it.

HDFS is an open-source implementation of the Google File System (GFS). Although it appears as an ordinary file system, its storage is actually distributed among data nodes in different clusters. HDFS follows a master-slave architecture: the master node (NameNode) provides data service and access permission to the slave nodes, while the slaves (DataNodes) serve as the storage for HDFS. Large files are distributed and further divided among multiple DataNodes, and the map jobs located on the nodes operate on their local copies of the data. The NameNode stores only the metadata and log information, while data transfer to and from HDFS is done through the Hadoop API.

MapReduce is a parallel programming model for processing large amounts of data on cluster computers with unreliable and weak communication links. It is based on the scale-out principle, which involves clustering a large number of desktop computers. The main point of using MapReduce is to move computation to the data nodes, rather than bring data to the computation nodes, and thus fully exploit data locality. The code that divides work, exerts control, and merges output in MapReduce is entirely hidden from the application user inside the framework. In fact, most parallel applications can be implemented in MapReduce as long as synchronized and shared global states are not required.
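As noted above, data moves in and out of HDFS through the Hadoop API. The following is a minimal sketch of that interaction for a Hadoop 0.20-era client; the NameNode address, the port, and the file paths are illustrative assumptions, not values taken from the paper.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode:9000");   // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);

        // Write a byte buffer into HDFS; the NameNode records only the metadata,
        // while the bytes themselves are stored on the DataNodes.
        byte[] data = "sample image bytes".getBytes("UTF-8");
        FSDataOutputStream out = fs.create(new Path("/images/sample.tif"));
        out.write(data);
        out.close();

        // Read the same file back through the API.
        FSDataInputStream in = fs.open(new Path("/images/sample.tif"));
        byte[] buf = new byte[data.length];
        in.readFully(buf);
        in.close();
        fs.close();
    }
}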

MapReduce performs the computation in two stages: the map stage and then the reduce stage. The data are split into sets of key-value pairs whose instances are processed in parallel by the map stage, with a degree of parallelism that matches the number of nodes dedicated as slaves. This process generates intermediate key-value pairs that are temporary and are later directed to the reduce stage. Within the map stage or the reduce stage, processing is conducted in parallel; the map and reduce stages themselves occur sequentially, with the reduce stage starting once the map stage finishes.

3.1 Non-Hadoop Approach

Images are currently processed in an ordinary sequential way: a program loads image after image, processing each image alone before writing the newly processed image to a storage device. Generally, very ordinary tools are used, such as those found in Photoshop, and many ordinary C and Java programs can be downloaded from the Internet or easily developed to perform such image processing tasks. Most of these tools run on a single computer with a Windows operating system. Although batch processing is available in these single-processor programs, processing remains limited by the capabilities of one machine. We therefore need a new parallel approach to work effectively on massed image data.

3.2 Hadoop Approach to Image Processing

To process a large number of images effectively, we use the Hadoop HDFS to store a large amount of remote sensing image data and MapReduce to process it in parallel. This approach offers three advantages: 1) images can be stored and accessed in parallel on a very large scale, 2) image filtering and other processing can be performed effectively, and 3) MapReduce can be customized to support image formats such as TIFF. A file is visited pixel by pixel but accessed whole, as one record. The main attraction of this project is the distributed processing of large satellite images using a MapReduce model to solve the problem in parallel. In past decades, the remote sensing data acquired by satellites have greatly assisted a wide range of applications that study the surface of the Earth.


Moreover, these data are available in the form of 2D images, which makes them easy to manipulate with computers.

Remote sensing image processing consists in the use of various algorithms for extracting information from images and analyzing complex scenes. In this study, we focus on spatial transformations such as contrast enhancement, brightness correction, and edge detection, which operate on the image data space. The common feature of such transformations is that new pixel values are generated by mathematical operations on the current pixel and its surrounding ones. The spatial transformations used here are local transforms computed within a small neighborhood of a pixel.

4. SOFTWARE IMPLEMENTATION

In this project, we do not intend to use both mappers and reducers simultaneously. We simply use each map job to read an image file, process it, and finally write the processed image file back into an HDFS storage area. The name of the image file is considered the Key and its byte content the Value. The output pair directed to the reducer job is not used, because reducers are not used at all: everything we need can be accomplished in the map phase alone. This scheme is applied to seven of the eight image processing algorithms used in this research; the exception is the algorithm that detects image quality. There, the Key and the Value of the Map input remain the same, but the output Key is used exclusively and is assigned the name of the corrupted image file. Reducers in this exception collect all the file names that have undergone defect detection and group them into a single output file stored in the HDFS output storage area.
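To make this map-only scheme concrete, here is a minimal job-driver sketch written against the old Hadoop 0.20 mapred API that the paper's map signature implies. The class names ImageFilterJob and ImageFilterMapper and the image.output.dir property are assumptions for illustration, not the paper's exact code; WholeFileInputFormat is the custom input format described in Section 4.1.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NullOutputFormat;

public class ImageFilterJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ImageFilterJob.class);
        conf.setJobName("remote-sensing-image-filter");

        // Each map task receives one whole image file as a single record
        // (WholeFileInputFormat is described in Section 4.1).
        conf.setInputFormat(WholeFileInputFormat.class);

        // The mapper writes its processed images to HDFS itself,
        // so the job produces no conventional output.
        conf.setOutputFormat(NullOutputFormat.class);
        conf.setMapperClass(ImageFilterMapper.class);
        conf.setNumReduceTasks(0);   // map-only: reducers are not used

        FileInputFormat.setInputPaths(conf, new Path(args[0]));  // HDFS input directory
        conf.set("image.output.dir", args[1]);                   // assumed property read by the mapper

        JobClient.runJob(conf);
    }
}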

To customize the MapReduce framework, which is essentially designed for text processing, some changes to the Java classes are inevitable for image processing. The first thing to consider when using Hadoop for image processing algorithms is how to customize its data types, classes, and functions to read, write, and process binary image files in the way familiar from sequential Java programs running on a PC. We do this by adapting the Hadoop input handling to carry images in the BytesWritable class instead of the text-oriented Text class, and by stopping the splitting of input files into chunks of data.

4.1 Whole-File Input Class and Splitting

In Hadoop, we use a WholeFileInputFormat class that allows us to treat an image as a whole file without splitting it. This is a critical modification, since the contents of a file must be processed by exactly one Map instance, so each map must have access to the full contents of its file. This is achieved by changing the RecordReader class that delivers the file contents as the value of the record. The change can be seen in the WholeFileInputFormat class, which defines a format in which the keys are not used (represented by NullWritable) and the values are the file contents (represented by BytesWritable instances). The alteration defines two methods: isSplitable() is overridden to return false, ensuring that input files are never split, and getRecordReader() returns a custom implementation of the usual RecordReader class. While the InputFormat is responsible for creating the input splits and dividing them into records, the InputSplit is a Java interface representing the data to be processed by one Mapper. The RecordReader reads an input split and divides it into records; the record size equals the size of the split specified by the InputSplit, which also specifies the key and value for every record. Splitting of image files is not acceptable, because it would break a whole image into binary chunks of pixels; splitting the binary content of a single image would cause a loss of image features, leading to noise or a poor-quality representation. We therefore enforce that files are processed in parallel but each as a single record. Controlling the splitting is important so that a single Mapper processes a single image file; this is implemented by overriding the isSplitable() method to return false, ensuring that each file is handled by one Mapper only and is represented by one record only.
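A minimal sketch of the WholeFileInputFormat customization just described, written against the old (org.apache.hadoop.mapred) API. The helper class name WholeFileRecordReader and its field layout are assumptions for the sketch, but the overridden isSplitable() and getRecordReader() follow the description above.

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;                        // never split an image file
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
    }
}

class WholeFileRecordReader implements RecordReader<NullWritable, BytesWritable> {
    private final FileSplit split;
    private final JobConf conf;
    private boolean processed = false;

    WholeFileRecordReader(FileSplit split, JobConf conf) {
        this.split = split;
        this.conf = conf;
    }

    public NullWritable createKey()    { return NullWritable.get(); }
    public BytesWritable createValue() { return new BytesWritable(); }
    public long getPos()               { return processed ? split.getLength() : 0; }
    public float getProgress()         { return processed ? 1.0f : 0.0f; }
    public void close()                { }

    // Deliver the entire file as a single (key, value) record.
    public boolean next(NullWritable key, BytesWritable value) throws IOException {
        if (processed) return false;
        byte[] contents = new byte[(int) split.getLength()];
        Path file = split.getPath();
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(file);
            IOUtils.readFully(in, contents, 0, contents.length);
            value.set(contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        processed = true;
        return true;
    }
}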

Each mapper is now guaranteed to read a single image file, process it, and write the newly processed image back into HDFS.

4.2 Mappers

The mapper containing our image processing algorithm is applied to every input image, represented by a key-value pair, to generate an intermediate image and store it back into HDFS. The key of the Mapper is NullWritable (null) and the value is of BytesWritable type, which ensures that images are read into every mapper one record at a time. All the processing algorithms are carried out in the Mapper, and no reducer is required. A typical Map function for one of the image filtering algorithms is listed in Table 1.
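As a companion to the step list in Table 1 below, here is a minimal sketch of such a map function, assuming the old mapred API and a 3 x 3 sharpening kernel. The class name ImageFilterMapper, the kernel coefficients, the image.output.dir property, and the use of ImageIO for decoding and encoding are illustrative assumptions (the paper's own code uses the JPEGImageEncoder described in Section 4.3), not the paper's exact implementation.

import java.awt.image.BufferedImage;
import java.awt.image.BufferedImageOp;
import java.awt.image.ConvolveOp;
import java.awt.image.Kernel;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.imageio.ImageIO;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ImageFilterMapper extends MapReduceBase
        implements Mapper<NullWritable, BytesWritable, Text, Text> {

    private JobConf conf;

    @Override
    public void configure(JobConf job) {
        this.conf = job;
    }

    public void map(NullWritable key, BytesWritable value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Decode the image bytes delivered as one whole-file record
        // (format support depends on the installed ImageIO readers).
        BufferedImage src = ImageIO.read(
                new ByteArrayInputStream(value.getBytes(), 0, value.getLength()));

        // 3 x 3 sharpening kernel (illustrative coefficients).
        float[] sharpen = {
                 0.0f, -1.0f,  0.0f,
                -1.0f,  5.0f, -1.0f,
                 0.0f, -1.0f,  0.0f };
        BufferedImageOp op = new ConvolveOp(new Kernel(3, 3, sharpen),
                                            ConvolveOp.EDGE_NO_OP, null);
        BufferedImage dst = op.filter(src, null);

        // Derive an output name from the current input file ("map.input.file" is
        // set by the old API for file splits) and write the JPEG straight into HDFS.
        String name = new Path(conf.get("map.input.file")).getName();
        Path outFile = new Path(conf.get("image.output.dir", "/output"),  // assumed property
                                name + ".jpg");
        FileSystem fs = outFile.getFileSystem(conf);
        FSDataOutputStream out = fs.create(outFile);
        try {
            ImageIO.write(dst, "jpeg", out);
        } finally {
            out.close();
        }
        // The job is map-only, so nothing is emitted to the OutputCollector.
    }
}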


Table 1: Map Function

Input: public void map(NullWritable key, BytesWritable value, OutputCollector<Text, Text> output, Reporter reporter)
Output: (Null, Null)

• Set the input path and the output path.
• Set the input format to the WholeFileInputFormat class and the output format to the NullOutputFormat class.
• The image is read from the value passed to the Mapper; the Mapper reads the image bytes into a ByteArrayInputStream.
• Read the source BufferedImage using ImageIO.read.
• Create an FSDataOutputStream instance variable to write images directly to the file system.
• Create the destination BufferedImage with height and width equal to those of the source.
• Create a Kernel instance, specify a 3 × 3 matrix size, and set an array of floats as the kernel values.
• Create a BufferedImageOp instance and assign the kernel to a ConvolveOp.
• Call the filter method of BufferedImageOp to apply the filter to the image.
• Encode the destination image in JPEG format using the JPEG encoder.

• Store the image in the HDFS.

4.3 Reducers

The algorithm that tests image quality is newly developed in our experimentation and has a different map-reduce implementation. The Mapper task here is to obtain the names of processed images and pass them through a key-value pair to the Reducers. The key-value pair generated by each Mapper is Null together with the name of the file in which a degree of content corruption has been detected and evaluated. The Reducer collects the file names and joins them in a single output file. The following other classes were also altered:

• RecordReader: The RecordReader is another class that has been customized for image file access. It is responsible for loading the data from its source and converting it into (key, value) pairs suitable for Mapper processing. By altering it to read a whole input image as one record passed to the Mapper, we ensure that the image content is not split as is usually done for text files.

• BufferedImage: The BufferedImage is introduced here as a way to hold an image file and make its pixels accessible. It is a subclass that describes an Image with an accessible buffer of image data; a BufferedImage comprises a ColorModel and a Raster of image data. To process an image, a BufferedImage must be used to read it.

• JPEGImageEncoder: The JPEGImageEncoder is used extensively to encode buffers of image data as JPEG data streams, the preferred output format for storing images. Users of this interface provide image data in a Raster or a BufferedImage, set the necessary parameters in the JPEGEncodeParam object, and open the OutputStream that is the destination of the encoded JPEG stream. The JPEGImageEncoder interface can encode image data as abbreviated JPEG data streams written to the OutputStream supplied to the encoder.
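A brief sketch of how such JPEG encoding might look with the legacy com.sun.image.codec.jpeg API available on the Java 6 runtime used in this work; on newer JDKs, javax.imageio.ImageIO.write(image, "jpeg", out) is the usual substitute. The wrapper class JpegWriter and the quality parameter are illustrative, not the paper's code.

import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.OutputStream;
import com.sun.image.codec.jpeg.JPEGCodec;
import com.sun.image.codec.jpeg.JPEGEncodeParam;
import com.sun.image.codec.jpeg.JPEGImageEncoder;

public class JpegWriter {
    // Encode 'image' as a JPEG stream onto 'out' with the given quality (0.0 to 1.0).
    public static void encode(BufferedImage image, OutputStream out, float quality)
            throws IOException {
        JPEGImageEncoder encoder = JPEGCodec.createJPEGEncoder(out);
        JPEGEncodeParam param = JPEGCodec.getDefaultJPEGEncodeParam(image);
        param.setQuality(quality, true);   // true = force a baseline-compatible stream
        encoder.encode(image, param);
    }
}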

5. IMAGE PROCESSING ALGORITHMS USED IN REMOTE SENSING

5.1 Color Sharpening

Sharpening is one of the most frequently used transformations applied to an image for enhancement purposes, and we use ordinary sharpening methods here to bring out image details that were not apparent before. Image sharpening enhances both the intensity and the edges of an image in order to improve the perceived image.

The sharpening code is implemented using convolution, an operation that computes each destination pixel by multiplying the source pixel and its neighbors by a convolution kernel. The kernel is a linear operator that describes how a specified pixel and its surrounding pixels affect the value computed in the destination image of a filtering operation. Specifically, the kernel used here is a 3 × 3 matrix of floating-point numbers. When the convolution is performed, this 3 × 3 matrix is used as a sliding mask over the pixels of the source image. To compute the result of the convolution for a pixel located at coordinates (x, y) in the source image, the center of the kernel is positioned at these coordinates, and the kernel values are multiplied by the corresponding color values in the source image.

5.2 Sobel Filter

The Sobel edge filter detects edges by applying a horizontal and a vertical filter in sequence; both filters are applied to the image and their results are combined to form the final output. Edges characterize boundaries and are therefore fundamental in image processing. Edges in images are areas with strong intensity contrasts, and detecting them significantly reduces the amount of data and filters out useless information while preserving the important structural properties of an image.
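The following is a minimal, illustrative Java sketch of the Sobel filtering described in this section, applied to a grayscale BufferedImage. The kernels are the standard Gx/Gy pair given in Eqs. (1)-(2) below; the class name SobelFilter and the border handling are assumptions for the sketch, not the paper's code.

import java.awt.image.BufferedImage;

public class SobelFilter {
    private static final int[][] GX = { {-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1} };
    private static final int[][] GY = { { 1, 2, 1}, { 0, 0, 0}, {-1,-2,-1} };

    public static BufferedImage apply(BufferedImage src) {
        int w = src.getWidth(), h = src.getHeight();
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        for (int y = 1; y < h - 1; y++) {
            for (int x = 1; x < w - 1; x++) {
                int gx = 0, gy = 0;
                // Slide the two 3x3 masks over the neighborhood of (x, y).
                for (int j = -1; j <= 1; j++) {
                    for (int i = -1; i <= 1; i++) {
                        int gray = src.getRGB(x + i, y + j) & 0xFF;  // one channel as gray level
                        gx += GX[j + 1][i + 1] * gray;
                        gy += GY[j + 1][i + 1] * gray;
                    }
                }
                // Gradient magnitude becomes the output brightness, clamped to 8 bits.
                int mag = (int) Math.min(255, Math.sqrt(gx * gx + gy * gy));
                dst.setRGB(x, y, (0xFF << 24) | (mag << 16) | (mag << 8) | mag);
            }
        }
        return dst;
    }
}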


The Sobel operator applies a 2-D spatial gradient measurement to an image to find the approximate gradient magnitude at each point of an input grayscale image. Its discrete form can be represented by a pair of 3 × 3 convolution kernels, one estimating the gradient in the x-direction (columns) and the other in the y-direction (rows), as given in Eq. (1). The Gx kernel highlights edges in the horizontal direction, while the Gy kernel highlights edges in the vertical direction. The overall magnitude |G| of the two outputs, given in Eq. (2), responds to edges in both directions and is taken as the brightness value of the output image.

$$G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix}, \qquad G_y = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix} \tag{1}$$

$$|G| = \sqrt{G_x^2 + G_y^2} \tag{2}$$

In detail, the kernel mask is slid over an area of the input image, changes the value of the pixel at its center, and is then shifted one pixel to the right; it continues toward the right until it reaches the end of the row and then starts at the beginning of the next row. The center of the mask is placed over the pixel to be manipulated, and the (i, j) offsets address the neighboring pixels so that the multiplications can be performed. Taking the magnitude of both kernel responses yields an output that detects edges in both directions.

5.3 Auto-Contrast

Automatic contrast adjustment (auto-contrast) is a point operation that modifies pixel intensities with respect to the range of values in an image. This is done by mapping the current darkest and brightest pixels to the lowest and highest available intensity values, respectively, and distributing the intermediate values linearly.

Let alow and ahigh be the lowest and highest pixel values found in the current image, whose full intensity range is [amin, amax]. We first map the smallest pixel value alow to zero, then increase the contrast by the factor (amax − amin)/(ahigh − alow), and finally shift to the target range by adding amin. The resulting auto-contrast mapping is given by Eq. (3):

$$f_{\mathrm{ac}}(a) = a_{\mathrm{min}} + (a - a_{\mathrm{low}}) \cdot \frac{a_{\mathrm{max}} - a_{\mathrm{min}}}{a_{\mathrm{high}} - a_{\mathrm{low}}} \tag{3}$$

This holds provided that ahigh ≠ alow; for an 8-bit image, amin = 0 and amax = 255.
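A minimal sketch of the auto-contrast point operation of Eq. (3) for an 8-bit grayscale image (amin = 0, amax = 255); the two-pass structure and the class name AutoContrast are illustrative, not the paper's code.

import java.awt.image.BufferedImage;

public class AutoContrast {
    public static BufferedImage apply(BufferedImage src) {
        int w = src.getWidth(), h = src.getHeight();

        // Pass 1: find the darkest (aLow) and brightest (aHigh) pixel values.
        int aLow = 255, aHigh = 0;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int a = src.getRGB(x, y) & 0xFF;
                if (a < aLow)  aLow = a;
                if (a > aHigh) aHigh = a;
            }
        }
        if (aHigh == aLow) return src;   // flat image: nothing to stretch

        // Pass 2: linearly map [aLow, aHigh] onto the full range [0, 255] per Eq. (3).
        double scale = 255.0 / (aHigh - aLow);
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int a = src.getRGB(x, y) & 0xFF;
                int v = (int) Math.round((a - aLow) * scale);   // amin = 0, so no extra shift
                dst.setRGB(x, y, (0xFF << 24) | (v << 16) | (v << 8) | v);
            }
        }
        return dst;
    }
}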

The aim of this operation is to brighten the darkest areas of satellite images by a certain factor. For example, in the captured satellite images the sea appears as a very dark color and its details are not obvious; applying this process therefore helps greatly to reveal those details by increasing their brightness.

6. EXPERIMENTATION

6.1 Environment Setup

Qatar University has recently acquired a dedicated cloud infrastructure called the QU Cloud. It runs on a 14-blade IBM BladeCenter cluster with 112 physical cores and 7 TB of storage, managed by advanced cloud software (IBM Cloud 1.4) with a Tivoli provisioning manager that allows cloud resources to be provisioned through software. Our experiments were performed on this cluster equipped with Hadoop. The project was provisioned with one NameNode and eight DataNodes. The NameNode was configured with two 2.5-GHz CPUs, 2 GB of RAM, and 130 GB of storage.

Each DataNode was configured with two 2.5-GHz CPUs, 2 GB of RAM, and 130 GB of disk storage, and all computing nodes were connected by a gigabit switch. Red Hat Linux 5.3, Hadoop 0.20.1, and Java 1.6.0_6 were installed on both the NameNode and the DataNodes. Table 2 shows the master-slave hardware configuration, Table 3 the single-machine configuration, and Tables 4 and 5 the cluster hardware and software configurations, respectively.

Table 2: Master/Slave Configuration

Name     Number   Details
Master   1        2 × 2.5 GHz CPU node, 2 GB RAM, 130 GB disk space
Slaves   8        2 × 2.5 GHz CPU node, 2 GB RAM, 130 GB disk space

Table 3: Single Machine Configuration

PC                   Details
Personal Computer    Intel Core 2 Duo CPU, 2.27 GHz, 4 GB RAM, 500 GB total storage; Operating System: Windows Vista


Table 4: Cluster Hardware Configuration

Name         Number   Details
Name Node    1        2 × 2.5 GHz CPU node, 2 GB RAM
Data Nodes   8        2 × 2.5 GHz CPU node, 2 GB RAM
Network      –        1 Gbit switch connecting all nodes
Storage      9        130 GB per node

Table 5: Software Configuration for Cluster Computers

Name                Version   Details
Hadoop              0.20.1    Installed on each node of the cluster
Xen Red Hat Linux   5.2       Pre-configured with Java and the Hadoop SDKs
IBM Java SDK        6         Used in programming the image processing algorithms

6.2 Data Source

The data source was real color satellite imagery of Qatar supplied by the Environmental Studies Center at Qatar University. The images were subsequently resized so that the Java runtime, with its limited heap size, could open them successfully. The collection of remote sensing images taken from Landsat satellites consists of 200 image files, each with a resolution of 16669 × 8335 pixels in TIF format. The spatial resolution is 30.0 meters per pixel, and only a small part of each image was used in our experiments. The developed Java program reads the JPEG, TIFF, GIF, and BMP standard image formats. Although JPEG and GIF are the most commonly used formats, TIF is the standard universal format for high-quality images and is used for archiving important images. Most satellite images are in TIF format because TIFF images are stored at full quality; hence, TIFF images are very large to upload, resize, and process.

When processing finishes, every image is encoded in JPEG format by calling the JPEG encoder to perform the required compression, and each image is then saved as a JPEG file. The JPEG format is used where maximum image quality is not necessary (it suffices, for example, to demonstrate the processing here); JPEG compression reduces images to a size suitable for processing and non-archival purposes, as needed in this study.

7. EXPERIMENTAL RESULTS & ANALYSIS

In this section, we test the performance of several parallel image filtering algorithms applied to remote sensing images available at the Environmental Studies Center. We mainly consider how time consumption grows with the volume of data, and we show the difference in run time between a single-PC implementation and the parallel Hadoop implementation. Tables 3-5 list the configurations of the cluster and single-PC computing platforms used to obtain the experimental results. Each experiment was repeated three times and the average reading was recorded. We implemented image sharpening, Sobel filtering, and auto-contrasting in Java; the programs ran well when the input images were not too large. Figures 1-3 show run time against file size for the two configurations, namely the single machine and the cloud cluster. Three different image processing algorithms were used for experimentation, and different timings were recorded because each algorithm has a different numerical processing overhead. These image processing algorithms were chosen for the Hadoop project because of their frequent use in remote sensing and their diverse computational loads. We paid more attention to processing time than to the number of images processed, and we considered elapsed time the sole performance measure. We compared the times taken by the single PC and the cluster to observe the speedup factors and then analyzed the results.
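Here speedup is understood in the usual sense as the ratio of elapsed times for the same workload (an assumption, since the formula is not stated explicitly in the text):

$$\text{speedup} = \frac{T_{\text{single PC}}}{T_{\text{Hadoop cluster}}}$$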


Figure 1: Auto-contrast chart.

Figure 2: Sobel filtering chart.

Figure 3: Sharpening chart.

Figure 4: Sample image output for auto-contrast, Sobel filtering, and sharpening (Khor Al Udaid Beach).

8. CONCLUSION AND FUTURE WORK

In this paper, we presented a case study of parallel processing of remote sensing images in TIF format using the Hadoop MapReduce framework. The experimental results show that typical image processing algorithms can be parallelized effectively, with acceptable run times, when applied to remote sensing images. A large number of images cannot be processed efficiently in the customary sequential manner; although originally designed for text processing, Hadoop MapReduce installed on a parallel cluster proved suitable for processing TIF images in large quantities. For organizations that hold massive image collections and need to process them in various ways, our implementation proved efficient and successful. In particular, the auto-contrast algorithm reached a sixfold speedup when run on the cluster rather than on a single PC, the sharpening algorithm an eightfold reduction in run time, and the Sobel filtering algorithm a sevenfold reduction.


We have observed that this parallel Hadoop implementation is better suited to large data volumes than to computationally intensive applications. In future work, we may focus on different image sources and on algorithms of a more computationally intensive nature.

REFERENCES

[1] G. Fox, X. Qiu, S. Beason, J. Choi, J. Ekanayake, T. Gunarathne, M. Rho, H. Tang, N. Devadasan, and G. Liu, "Biomedical case studies in data intensive computing," in Cloud Computing (LNCS vol. 5931), M. G. Jaatun, G. Zhao, and C. Rong, Eds. Berlin: Springer-Verlag, 2009, pp. 2-18.

[2] Z. Lv, Y. Hu, H. Zhong, J. Wu, B. Li, and H. Zhao, "Parallel K-means clustering of remote sensing images based on MapReduce," in Proc. 2010 Int. Conf. Web Information Systems and Mining (WISM '10), 2010, pp. 162-170.

[3] H. Kocakulak and T. T. Temizel, "A Hadoop solution for ballistic image analysis and recognition," in 2011 Int. Conf. High Performance Computing and Simulation (HPCS), Istanbul, 2011, pp. 836-842.

[4] B. Li, H. Zhao, and Z. H. Lv, "Parallel ISODATA clustering of remote sensing images based on MapReduce," in 2010 Int. Conf. Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC), Huangshan, 2010, pp. 380-383.

[5] N. Golpayegani and M. Halem, "Cloud computing for satellite data processing on high end compute clusters," in Proc. 2009 IEEE Int. Conf. Cloud Computing (CLOUD '09), Bangalore, 2009, pp. 88-92.

[6] T. Dalman, T. Döernemann, E. Juhnke, M. Weitzel, M. Smith, W. Wiechert, K. Nöh, and B. Freisleben, "Metabolic flux analysis in the cloud," in Proc. IEEE 6th Int. Conf. e-Science, Brisbane, 2010, pp. 57-64.

[7] Hadoop. http://hadoop.apache.org/.

[8] MapReduce. http://hadoop.apache.org/mapreduce/.

[9] HDFS. http://hadoop.apache.org/hdfs/.

[10] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51, no. 1, pp. 107-114, 2008.