
International Research Journal of Computer Science (IRJCS) ISSN: 2393-9842 Issue 1, Volume 2 (January 2015) www.irjcs.com

© 2015, IRJCS. All Rights Reserved.

Classification of Single-Food Images by Combining Local HSV-AKAZE Features and Global Features

Yusuke Kajiwara * Munehiro Nakamura Haruhiko Kimura

Ritsumeikan University, Kanazawa Institute University, Kanazawa University

Abstract— This paper presents a system for assisting nutrition management for solitary elderly persons. Since dealing with diseases is one of the important issues for solitary elderly persons, managing their health in daily life has received attention in recent years. As preprocessing for a nutrition management system for solitary elderly persons, systems that classify the category of a food image have been proposed. However, classifying food images is still challenging because of the variety of their shapes and colors. To improve classification performance, we propose three regions of interest extracted by HSV-AKAZE. The three regions are used to extract various local features, such as AKAZE, HSV-AKAZE, and color information, which enhance classification performance. Evaluation experiments on 2000 food images in 50 categories have shown that the classification accuracy increased by 8% compared with the existing system.

Keywords— HSV-AKAZE, Machine Learning, Solitary Elderly Persons, ROI, Single Food Image

I. INTRODUCTION

Because solitary elderly persons have weaker chewing, swallowing, bowel peristalsis, and sense of taste than young people, they often suffer from undernutrition due to a lack of protein and energy. Undernutrition is one of the geriatric syndromes; it reduces muscle strength, body fat, and immune strength, and can even lead to a bedridden state and long-term care [1]. To address this problem, systems that assist nutritional management for solitary elderly persons are needed.

To develop such systems, methods for classifying food images have been proposed. The existing methods can be divided into two approaches. One focuses on extracting structural features of food images. For example, Yang et al. [2] proposed pairwise local feature distributions that capture combinations of ingredients such as meat and vegetables. Zong et al. [3] proposed a method that learns structural features extracted by the Scale-Invariant Feature Transform (SIFT) [6] and local binary patterns (LBP) [7]. Joutou et al. [4], [5] proposed a method that combines multiple kernels learned on features extracted by SIFT, CSIFT [8], Gabor filters [11], and Histograms of Oriented Gradients (HOG) [10]. However, classifying food images is still challenging because images from different categories often have similar overall color and shape.

This paper is organized as follows. Section II explains how features are extracted from food images and how they are learned. Section III describes the evaluation environment, in which the proposed method is compared with a method that does not extract the three regions, using food images in 50 categories. Section IV presents the experimental results, Section V discusses them, and Section VI gives the conclusion and future work.

II. PROPOSED METHODS

In general, the overall color and shape of a food image depend on the positions of its ingredients and sauces. To learn the color and shape of food images effectively, we propose a method that extracts local features from three regions located near the boundaries between ingredients and sauces. First, the food image is divided into a hue image, a saturation image, and a value image. Second, HSV-AKAZE [9] is used to extract the three regions from these images. Finally, machine learning models are trained on the various features extracted from each region. In HSV-AKAZE, the color format is converted from RGB to HSV (hue H, saturation S, and value V) as follows:

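$$
H = \begin{cases}
60 \cdot \dfrac{G - B}{MAX_{rgb} - MIN_{rgb}} & \text{if } MAX_{rgb} = R \\[6pt]
60 \cdot \dfrac{B - R}{MAX_{rgb} - MIN_{rgb}} + 120 & \text{if } MAX_{rgb} = G \\[6pt]
60 \cdot \dfrac{R - G}{MAX_{rgb} - MIN_{rgb}} + 240 & \text{if } MAX_{rgb} = B
\end{cases} \tag{1}
$$

$$
S = \frac{MAX_{rgb} - MIN_{rgb}}{MAX_{rgb}} \tag{2}
$$

$$
V = MAX_{rgb} \tag{3}
$$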



where $MAX_{rgb}$ and $MIN_{rgb}$ are the maximum and minimum of the R, G, and B values of a pixel. In Eq. (1), 360 is added to H when H is negative.
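The conversion in Eqs. (1)-(3) can be checked on single pixels with the following minimal Python sketch. It assumes RGB components scaled to [0, 1] and returns H in degrees. The equations do not cover achromatic pixels ($MAX_{rgb} = MIN_{rgb}$, where H is undefined) or black pixels (where S is undefined); setting these to 0 is our assumption, not part of the method.

```python
from __future__ import annotations


def rgb_to_hsv(r: float, g: float, b: float) -> tuple[float, float, float]:
    """Convert one RGB pixel (components in [0, 1]) to (H in degrees, S, V) following Eqs. (1)-(3)."""
    mx = max(r, g, b)  # MAX_rgb
    mn = min(r, g, b)  # MIN_rgb
    v = mx  # Eq. (3)
    # Eq. (2); S is undefined for black (MAX_rgb = 0), set to 0 here by assumption.
    s = 0.0 if mx == 0 else (mx - mn) / mx
    if mx == mn:
        h = 0.0  # achromatic pixel: hue undefined, set to 0 here by assumption
    elif mx == r:
        h = 60.0 * (g - b) / (mx - mn)
    elif mx == g:
        h = 60.0 * (b - r) / (mx - mn) + 120.0
    else:
        h = 60.0 * (r - g) / (mx - mn) + 240.0
    if h < 0:
        h += 360.0  # wrap a negative hue into [0, 360), as stated after Eq. (1)
    return h, s, v
```

For example, `rgb_to_hsv(1.0, 0.0, 0.5)` falls in the first case of Eq. (1) with H = -30, which the wrap turns into H = 330.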

HSV-AKAZE applies AKAZE to the hue, saturation, and value images independently, so that key points are extracted from regions where each color component changes significantly. The extracted key points are robust to changes in scale, luminance, and rotation.

Next, three local regions, called the hue region, the saturation region, and the value region, are extracted as follows.
1. Key points are extracted from the hue image, the saturation image, and the value image using HSV-AKAZE, as shown in Fig. 1.
2. As shown in Fig. 2, each of the hue, saturation, and value regions is represented as a rectangle whose upper-left corner is (xl, yu), lower-left corner is (xl, yd), upper-right corner is (xr, yu), and lower-right corner is (xr, yd). xl and xr are the leftmost and rightmost edges of the key points, calculated from the two-sided 90% confidence interval of the probability density of the key points along the x-axis. yu and yd are the uppermost and lowest edges of the key points, calculated from the two-sided 90% confidence interval of the probability density along the y-axis.
3. In this paper, features extracted from the hue region are called the H-feature, features extracted from the saturation region the S-feature, features extracted from the value region the V-feature, and features extracted from the original image the O-feature.
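Step 2 can be sketched as follows. The paper does not state how the probability density is estimated, so this sketch assumes that the two-sided 90% interval is taken from the empirical 5th and 95th percentiles of the key-point coordinates; `keypoint_region` and `crop_region` are hypothetical helper names. Key points are given as an (N, 2) array of (x, y) pixel coordinates, such as those detected by AKAZE on one of the H, S, or V images.

```python
from __future__ import annotations

import numpy as np


def keypoint_region(points: np.ndarray, coverage: float = 0.90) -> tuple[float, float, float, float]:
    """Return (x_l, y_u, x_r, y_d) bounding the central `coverage` share of key points on each axis.

    Assumption: the two-sided interval is estimated by empirical percentiles of the coordinates.
    """
    points = np.asarray(points, dtype=float)
    if points.ndim != 2 or points.shape[1] != 2 or len(points) == 0:
        raise ValueError("points must be a non-empty (N, 2) array of (x, y) coordinates")
    tail = (1.0 - coverage) / 2.0 * 100.0  # 5th and 95th percentiles for 90% coverage
    x_l, x_r = np.percentile(points[:, 0], [tail, 100.0 - tail])
    # Image y-coordinates grow downward, so the lower percentile is the upper edge y_u.
    y_u, y_d = np.percentile(points[:, 1], [tail, 100.0 - tail])
    return float(x_l), float(y_u), float(x_r), float(y_d)


def crop_region(image: np.ndarray, region: tuple[float, float, float, float]) -> np.ndarray:
    """Cut the rectangle (x_l, y_u, x_r, y_d), inclusive, out of an (H, W[, C]) image array."""
    x_l, y_u, x_r, y_d = (int(round(c)) for c in region)
    return image[y_u:y_d + 1, x_l:x_r + 1]
```

Running `keypoint_region` separately on the key points of the hue, saturation, and value images would give the three rectangles; the 5% of key points in each tail along an axis are treated as outliers and fall outside the region.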

III. EVALUATION ENVIRONMENT

We prepared food images in 50 categories, which are listed in Table I. The images were collected by searching Google Images for each food name, and the top 40 images in the search results were obtained for each category. Images that did not match their food name were removed.

Fig. 1 The original image is divided into the hue region, saturation region, and value region by HSV-SIFT. The circles in each image are the key points detected by HSV-SIFT.

Fig. 2 Each of the hue, saturation, and value regions is represented as a rectangle.



The existing systems [4], [5] use features extracted by CSIFT, SIFT, and HOG, together with color information and Gabor features. The proposed method uses the same features except those extracted by CSIFT and SIFT. This is because the proposed method extracts the three regions and their features with HSV-AKAZE and AKAZE. For the AKAZE, HSV-AKAZE, and color features, frequency histograms are created with Bag of Features (BoF) [13].

In BoF, the number of visual words was set to 50. After the frequency histograms are converted into a vector, spatial pyramid matching is applied to add location information: the spatial pyramid divides a color image into 1×1, 2×2, and 3×3 blocks, and a frequency histogram is calculated for each block. For HOG, the number of orientation bins is set to 9, the image is divided into 8×8 cells, and each block consists of 3×3 cells. For the Gabor features, the number of scales is set to 4 and the number of orientations is set to 6.
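The BoF histogram with a 1×1, 2×2, and 3×3 spatial pyramid can be sketched in Python as below. The sketch assumes a visual-word codebook has already been built (for example by k-means) and assigns each descriptor to its nearest word by Euclidean distance; AKAZE's default descriptors are binary, for which Hamming distance is often used instead, so the metric here is a simplification. The function name `bof_spatial_pyramid` is hypothetical, and the raw counts are left unnormalized because the paper does not describe a normalization.

```python
import numpy as np


def bof_spatial_pyramid(descriptors, positions, codebook, width, height, levels=(1, 2, 3)):
    """Concatenate BoF word histograms over n x n grids for each n in `levels`.

    With levels (1, 2, 3) there are 1 + 4 + 9 = 14 blocks, so the output has 14 * K entries.
    """
    descriptors = np.asarray(descriptors, dtype=float)  # (N, D) local descriptors
    positions = np.asarray(positions, dtype=float)      # (N, 2) key-point (x, y) coordinates
    codebook = np.asarray(codebook, dtype=float)        # (K, D) visual words
    k = len(codebook)
    # Hard-assign each descriptor to its nearest visual word (squared Euclidean distance).
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    feats = []
    for n in levels:
        # Grid cell of each key point; clipping keeps points on the right/bottom border in the last cell.
        cx = np.minimum((positions[:, 0] * n / width).astype(int), n - 1)
        cy = np.minimum((positions[:, 1] * n / height).astype(int), n - 1)
        for row in range(n):
            for col in range(n):
                in_block = (cy == row) & (cx == col)
                feats.append(np.bincount(words[in_block], minlength=k))
    return np.concatenate(feats)
```

With the 50 visual words used in this paper, each feature type would yield a 14 × 50 = 700-dimensional vector under this construction.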

We compared two cases: (a) only the O-feature is used to build each classifier, and (b) the HSV-AKAZE features are used to build each classifier. In this paper, we call case (a) O-feature and case (b) HSV-feature. As classification algorithms, we used Naive Bayes, Random Forest, linear-kernel SVM, and radial basis function (RBF) kernel SVM as implemented in R, the language and environment for statistical computing, with all parameters left at their library defaults. Performance was evaluated by 10-fold cross-validation, in which recall, precision, and the F-measure (the harmonic mean of recall and precision) were calculated. The experiments were run on a computer with an Intel Core i7 870 (2.93 GHz) and 16 GB of memory.
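The paper reports mean recall, precision, and F-measure but does not say how they are averaged. The following Python sketch assumes a macro average, that is, per-category scores averaged over categories, computed from predictions pooled over the 10 folds; the experiments themselves were run in R, and `macro_scores` is a hypothetical name used only for illustration.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return (mean recall, mean precision, mean F-measure) averaged over all categories."""
    classes = sorted(set(y_true) | set(y_pred))
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct predictions per class
    true_n = Counter(y_true)  # images per true class
    pred_n = Counter(y_pred)  # predictions per class
    recalls, precisions, fs = [], [], []
    for c in classes:
        r = tp[c] / true_n[c] if true_n[c] else 0.0
        p = tp[c] / pred_n[c] if pred_n[c] else 0.0
        f = 2 * r * p / (r + p) if r + p else 0.0  # harmonic mean of recall and precision
        recalls.append(r)
        precisions.append(p)
        fs.append(f)
    n = len(classes)
    return sum(recalls) / n, sum(precisions) / n, sum(fs) / n
```

Because the F-measure is averaged per category, the mean F-measure need not equal the harmonic mean of the mean recall and mean precision.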

IV. EXPERIMENTAL RESULTS

First, Table II shows the classification results. Compared with O-feature, the mean F-measure obtained by HSV-feature increased by 3, 4, 15, and 9 percentage points for Naive Bayes, Random Forest, linear SVM, and RBF SVM, respectively, so that HSV-feature is on average 8 points higher than O-feature. Next, Fig. 3 shows the classification results for each of the 50 food categories. The figure shows that, compared with O-feature, the average classification accuracy of HSV-feature increased by 8%. These results show that the proposed method works well on food images. From the figure and Table I, the classification accuracy for kimpira gobo, corn soup, and tempura udon increased by more than 20%.

TABLE I
THE 50 FOOD CATEGORIES USED FOR CLASSIFICATION

(1) broiled eel and rice
(2) shrimps with chili sauce
(3) oden
(4) omelet
(5) savoury pancake with various ingredients
(6) udon
(7) pork cutlet on rice
(8) curry and rice
(9) kimpira gobo
(10) gratin
(11) croissant
(12) corn soup
(13) croquette
(14) rice
(15) zaru soba
(16) sandwiches
(17) Pacific saury
(18) stew
(19) sukiyaki
(20) spaghetti
(21) rice fried with chicken
(22) fried rice
(23) toast
(24) hamburger on a bun
(25) hamburger
(26) pizza
(27) bibimbap
(28) hot dog
(29) potato salad
(30) ramen
(31) stuffed cabbage
(32) sushi
(33) yakisoba
(34) rice topped with chicken and eggs
(35) sweet-and-sour pork
(36) chawan-mushi
(37) fried chicken
(38) tempura udon
(39) Tianjin rice bowl
(40) tendon
(41) niku-jyaga
(42) natto
(43) mabo-tofu
(44) miso soup
(45) sunny-side up
(46) vegetable tempura
(47) sauteed vegetables
(48) hiyashi chuka
(49) cold tofu
(50) gyoza

TABLE II
MEAN RECALL, PRECISION, AND F-MEASURE OBTAINED IN THE CLASSIFICATION

| Classifier | Original: Recall | Original: Precision | Original: F-measure | HSV-ROI: Recall | HSV-ROI: Precision | HSV-ROI: F-measure |
|---|---|---|---|---|---|---|
| Naive Bayes | 0.46 | 0.43 | 0.44 | 0.49 | 0.45 | 0.47 |
| Random Forest | 0.58 | 0.57 | 0.57 | 0.62 | 0.61 | 0.61 |
| Linear SVM | 0.38 | 0.34 | 0.36 | 0.52 | 0.51 | 0.51 |
| RBF SVM | 0.31 | 0.25 | 0.28 | 0.42 | 0.34 | 0.37 |
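The per-classifier gains quoted in Section IV can be recomputed from the F-measure columns of Table II with a few lines of Python. The inputs are the table's two-decimal values, so the result is only as precise as the table; the average gain of 7.75 points is what rounds to the 8-point improvement reported in the text.

```python
# F-measures from Table II as (Original, HSV-ROI) pairs per classifier.
table2_f = {
    "Naive Bayes": (0.44, 0.47),
    "Random Forest": (0.57, 0.61),
    "Linear SVM": (0.36, 0.51),
    "RBF SVM": (0.28, 0.37),
}

# Gain of HSV-ROI over Original, rounded to the table's two decimals.
gains = {name: round(hsv - orig, 2) for name, (orig, hsv) in table2_f.items()}
mean_gain = sum(gains.values()) / len(gains)

print(gains)                # {'Naive Bayes': 0.03, 'Random Forest': 0.04, 'Linear SVM': 0.15, 'RBF SVM': 0.09}
print(round(mean_gain, 4))  # 0.0775
```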



Fig. 4 shows details of the eight categories above. In the figure, the vertical axis represents a difference value that counts +1 when HSV-feature correctly classified a food image that O-feature misclassified, and -1 when O-feature correctly classified a food image that HSV-feature misclassified. The horizontal axis represents the category into which HSV-feature or O-feature incorrectly classified the image when the difference values were calculated. The figure shows that, compared with O-feature, HSV-feature markedly reduced the confusions between kimpira gobo and curry and rice, mabo tofu, and sauteed vegetables; the confusions in the croissant category; the confusions between corn soup and Pacific saury and ramen; and the confusions between tempura udon and oden, udon, and stew.
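The difference value plotted in Fig. 4 can be sketched as below. We assume that each +1 or -1 is attributed to the category predicted by the classifier that was wrong, which is our reading of the horizontal axis; `difference_counts` is a hypothetical name, and the inputs are the true labels and the predictions of the O-feature and HSV-feature classifiers for the same images.

```python
from collections import Counter


def difference_counts(y_true, pred_o, pred_hsv):
    """Per-category difference value: +1 when only HSV-feature is right, -1 when only O-feature is right.

    Assumption: each count is attributed to the category predicted by the classifier that was wrong.
    """
    counts = Counter()
    for truth, o, h in zip(y_true, pred_o, pred_hsv):
        if h == truth and o != truth:
            counts[o] += 1  # HSV-feature fixed an O-feature error
        elif o == truth and h != truth:
            counts[h] -= 1  # O-feature was right where HSV-feature failed
    return dict(counts)
```

A positive bar for a category then means that, overall, HSV-feature removed more misclassifications into that category than it introduced.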

V. DISCUSSION

As described in Section IV, HSV-feature significantly improved the classification performance on some food images compared with O-feature. Focusing on these images, this section describes the effect of the proposed method. As representative examples, we consider pork cutlet on rice and pizza, whose original images have quite similar RGB and density values. In the image of pork cutlet on rice, the color information changes significantly at the boundary between the white plate and the yellow pork, and at the boundary between the green trefoil and the yellow pork. In the hue space, the HSV-AKAZE algorithm extracts key points where the color information changes significantly. Therefore, many key points are located near the borderline between the white plate and the yellow pork, and between the green trefoil and the yellow pork, as shown in Fig. 6. In contrast, various ingredients such as shrimp and squid are placed on the pizza. As a result, key points are extracted from the whole image because its color values are not stable.

Fig. 3 Mean of F-measures for each of the categories in Random Forest.

Fig. 4 Comparison between HSV-ROI and O-feature with respect to the number of the food images incorrectly discriminated in the classification.

HSV-feature also increases classification performance by learning the local features extracted from the three regions located around borderlines between ingredients and sauces. Fig. 5 shows examples of food images correctly classified by the proposed method. In this figure, although the three regions are located in different places, they overlap. For example, the key points in each region were located near the borderlines between ingredients in the chawan-mushi image, between the sauces and the meat, and between the curry sauce and the rice in the curry and rice image. Hue, saturation, and value also fluctuate around these borderlines. These characteristics were also found in the curry and rice and tempura udon images.

Fig. 6 shows some of the food images misclassified by both HSV-feature and O-feature. The figure shows that the pairs sukiyaki and bibimbap, chawan-mushi and stew, and kimpira gobo and yakisoba are similar to each other with respect to the color and shape of all three regions. This is probably because the original images in each pair have similar color and shape, and their ingredients are placed similarly. To classify such images correctly, other information such as the situation, smell, and amount of food needs to be considered.

Fig. 5 Example of food images correctly discriminated by HSV-ROI.

Fig. 6 Example of food images incorrectly discriminated by HSV-feature.

VI. CONCLUSION

As preprocessing for a nutrition management system for solitary elderly persons, this paper has presented a method for classifying single-food images. Experimental results on 50 food categories have shown that the mean classification rate increased by 5.5% compared with the existing system. We also showed that the proposed method improves classification accuracy for images whose color and density information are similar to each other. On the other hand, images whose three regions contain similar information are still difficult to classify.

As future work, we need to consider other information such as the situation, smell, and amount of food. We would also like to develop a system that displays nutritional energy by measuring the amount of food with a 3D camera.

REFERENCES

[1] Yukawa, H., Longitudinal study on dietary intake and health by the elderly in an urban community: The Japan Association for the Integrated Study of Dietary Habits, 2005, 16(2):100-103.

[2] Yang, S., Chen, M., Pomerleau, D., Sukthankar, R., Food recognition using statistics of pairwise local features: Proc. IEEE Computer Vision and Pattern Recognition, 2010, 2249-2256.

[3] Zong, Z., Nguyen, D.T., Ogunbona, P., Li, W., On the combination of local texture and global structure for food classification: IEEE International Symposium on Multimedia, 2010, 204-211.

[4] Joutou, T., Hoashi, H., Yanai, K., 50-Class Food-Image Recognition Employing Multiple Kernel Learning: The Transactions of the Institute of Electronics, Information and Communication Engineers (IEICE), 2010, J93-D(8):1397-1406.

[5] Hoashi, H., Yanai, K., Recognition of multi-food images by detecting candidate regions: The Transactions of the Institute of Electronics, Information and Communication Engineers (IEICE), 2012, J95-D(8):1554-1564.

[6] Lowe, D., Distinctive Image Features from Scale-Invariant Keypoints: International Journal of Computer Vision, 2004, 60(2):91-110.

[7] Ojala, T., Pietikäinen, M., Harwood, D., A comparative study of texture measures with classification based on featured distributions: Pattern Recognition, 1996, 29:51-59.

[8] Abdel-Hakim, A.E., Farag, A.A., CSIFT: A SIFT descriptor with color invariant characteristics: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2006, 2:1978-1983.

[9] Alcantarilla, P.F., Nuevo, J., Bartoli, A., Fast Explicit Diffusion for Accelerated Features in Nonlinear Scale Spaces: British Machine Vision Conference (BMVC), 2013, 1-11.

[10] Dalal, N., Triggs, B., Histograms of oriented gradients for human detection: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005, 886-893.

[11] Jones, J., Palmer, L., An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex: J. Neurophysiol., 1987, 58(6):1233-1258.

[12] Varma, M., Ray, D., Learning the discriminative power-invariance trade-off: ICCV 2007, 2007, 1-8.

[13] Bosch, A., Zisserman, A., Munoz, X., Scene Classification Using a Hybrid Generative/Discriminative Approach: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008, 30(4):712-727.

[14] Csurka, G., Dance, C.R., Fan, L., Bray, C., Visual Categorization with Bags of Keypoints: European Conference on Computer Vision, 2004, 1-22.

[15] Lazebnik, S., Schmid, C., Ponce, J., Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories: Proc. IEEE Computer Vision and Pattern Recognition, 2006, 2169-2178.

[16] Arthur, D., Vassilvitskii, S., k-means++: the advantages of careful seeding: Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms, 2007, 1027-1035.

[17] Freund, Y., Schapire, R.E., A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting: Journal of Computer and System Sciences, 1997, 55(1):119-139.

[18] Breiman, L., Random Forests: Machine Learning, 2001, 45(1):5-32.

[19] Snoek, C.G.M., Worring, M., Smeulders, A.W.M., Early versus late fusion in semantic video analysis: Proceedings of the 13th annual ACM international conference on Multimedia, 2005, 399-402.

