
A Delicious Recipe Analysis Framework for Exploring Multi-Modal Recipes with Various Attributes

Weiqing Min 1, Shuqiang Jiang 1,2, Shuhui Wang 1, Jitao Sang 3,4, Shuhuan Mei 5,1
1 Key Lab of Intelligent Information Processing, Institute of Computing Technology, CAS, Beijing, 100190, China
2 University of Chinese Academy of Sciences, Beijing, 100049, China
3 National Lab of Pattern Recognition, Institute of Automation, CAS, Beijing, 100190, China
4 State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210093, China
5 Shandong University of Science and Technology, Qingdao, 266590, China
{minweiqing,sqjiang}@ict.ac.cn; [email protected]; {shuhui.wang,shuhuan.mei}@vipl.ict.ac.cn

ABSTRACT
Human beings have developed a diverse food culture. Many factors like ingredients, visual appearance, courses (e.g., breakfast and lunch), flavor and geographical regions affect our food perception and choice. In this work, we focus on multi-dimensional food analysis based on these food factors to benefit various applications like summary and recommendation. To this end, we propose a delicious recipe analysis framework that incorporates various types of continuous and discrete attribute features and multi-modal information from recipes. First, we develop a Multi-Attribute Theme Modeling (MATM) method, which can incorporate arbitrary types of attribute features and jointly model them with the textual content. We then utilize a multi-modal embedding method to build the correlation between the textual theme features learned by MATM and visual features from a deep learning network. By learning attribute-theme relations and multi-modal correlations, we are able to support different applications, including (1) flavor analysis and comparison for better understanding flavor patterns along different dimensions, such as region and course, (2) region-oriented multi-dimensional food summary with both multi-modal and multi-attribute information, and (3) multi-attribute oriented recipe recommendation. Furthermore, the proposed framework is flexible and enables easy incorporation of arbitrary types of attributes and modalities. Qualitative and quantitative evaluation results validate the effectiveness of the proposed method and framework on the collected Yummly dataset.

KEYWORDS
Multi-dimensional food analysis, multi-attribute theme modeling, flavor analysis, food summary, recipe recommendation

1 INTRODUCTION
Food is an integral part of our life. Many aspects like ingredients, visual appearance, flavor, meals and geographical regions play important roles in food perception, choice and consumption. For example,


obese subjects demonstrate changes in brain activity elicited by food-related visual cues [34], Asian recipes use soy sauce widely, and Hungarian cuisine typically relies on paprika and lard [2]. Therefore, multi-dimensional food modeling based on these aspects can benefit applications like culinary habit exploration [9, 35] and food perception and health [3, 31]. In this work, we focus on modeling these food factors in a unified framework, and then exploit it for various multimedia applications.

The proliferation of food-sharing websites (e.g., Instagram, Yummly and Yelp) has provided rich data for food-oriented research, such as food recognition [22, 33], recipe retrieval [8] and culinary practice understanding [2, 35]. For example, Rich et al. [33] conducted a large-scale content analysis of food images from Instagram, and Sajadmanesh et al. [35] presented a study of recipes with ingredient and flavor information from Yummly to explore culinary habits. However, little work has investigated how to bring various types of attributes and multi-modal information into a unified framework, which is critical for food-related analysis and understanding.

Food attributes are diverse. For example, each recipe from Yummly is associated with different attributes, such as cuisine, course and flavor information. Furthermore, the distributions of these attribute features differ: the cuisine and course attributes are discrete (categorical), while the value of each flavor attribute is continuous. Therefore, a multi-dimensional food analysis framework should support arbitrary types of attributes. In addition, besides text-based food research (e.g., on ingredients) [2, 35], several works [4, 43] have found visual information important in food-related analysis. In many cases, visual information reduces the cognitive load and eases understanding [43]. For example, Camacho et al. [4] proposed a framework that enhances each Yelp review by recommending relevant food images. Therefore, a multi-dimensional food analysis framework should also support different modalities.

Figure 1: The proposed delicious recipe analysis framework. [Diagram: multi-modal recipe data input feeds Multi-Attribute Theme Modeling and, via VGG-16 CNN visual features, Multi-Modal Embedding; the learned model supports (1) flavor analysis and comparison, (2) region-oriented multi-dimensional food summary and (3) multi-attribute oriented recipe recommendation.]

In this paper, we propose a delicious recipe analysis framework that is capable of incorporating various types of continuous and discrete attribute features and multi-modal information to meet these requirements. In particular, we study recipes from Yummly. As shown in Fig. 1, given a multi-modal recipe input with three types of attribute features, namely continuous flavor attribute features and discrete cuisine and course attribute features, we first develop a Multi-Attribute Theme Modeling (MATM) method to model the correlation between the ingredients and these food attributes. We then utilize the Multi-Modal Embedding (MME)

method to create a joint representation based on the ingredient theme features learned by MATM and visual features from a deep learning network. After mining attribute-theme patterns and correlating different modalities, we can conduct different tasks: (1) flavor analysis and comparison for better understanding flavor patterns along different dimensions, such as region and course, (2) region-oriented multi-dimensional food summary with both multi-modal and multi-attribute information, and (3) multi-attribute oriented recipe recommendation for different attribute queries.

The contributions of this paper can be summarized as follows:
• We propose a delicious recipe analysis framework, which utilizes various types of attributes and multi-modal information to enable multi-dimensional food analysis and applications.
• We propose a multi-attribute theme modeling method, which can incorporate arbitrary attributes to model the correlation between the content and the attributes.
• We present a wide variety of applications, including (1) flavor analysis and comparison, (2) region-oriented multi-dimensional food summary, and (3) multi-attribute oriented recipe recommendation.

2 RELATED WORK
Related work includes food recognition [7, 23, 26, 33, 41, 42], recipe retrieval and recommendation [8, 30], and culinary culture understanding [2, 20, 35, 38]. (1) Food recognition. Bossard et al. [7] used the random forest method to mine discriminative parts of food images for recognition. Some works [23, 41] utilized convolutional neural networks to extract deep visual features for recognition. Xu et al. [42] further introduced geo-location information to improve dish recognition. In contrast, Rich et al. [33] performed a large-scale content analysis of food images from Instagram taken in the wild. (2) Recipe retrieval

and recommendation. Some works [12, 24] employed traditional matrix factorization methods for recipe recommendation. Recently, Chen et al. [8] exploited visual features, ingredients and categories for recipe retrieval, and Min et al. [30] further utilized both categorical cuisine and course attributes for retrieval. Different from these works, we incorporate various types of discrete and continuous attribute features into a unified framework, and we apply the proposed framework to different applications, such as multi-dimensional summary and multi-attribute based recommendation. (3) Culinary culture understanding. Sajadmanesh et al. [35] exploited the ingredient and flavor information from Yummly to understand worldwide culinary culture. Ahn et al. [2] constructed a flavor network to understand culinary practice, and Simas et al. [38] utilized this flavor network to analyze the food-bridging hypothesis behind traditional cuisines. Howell et al. [20] relied on a food's name and nutritional content to predict its taste. Different from [20], we analyze and compare different flavor patterns based on the ingredients and various food attributes. Furthermore, we propose a unified framework that enables other food analysis applications, including summary and recommendation.

In addition, our work is also related to probabilistic topic models [5, 17], such as Latent Dirichlet Allocation (LDA) [6] and the Restricted Boltzmann Machine (RBM) [11], which have been applied to many tasks such as classification and recommendation [17, 29, 40]. Some works incorporated additional feature information like location and time into LDA for user interest modeling [36], landmark analysis [28] and social event analysis [32]. Srivastava et al. [39] proposed an RBM-based deep network for multi-modal data modeling. In this work, we extend the topic model to incorporate different types of continuous and discrete recipe attribute features, ingredients and recipe images for multi-dimensional food analysis and applications.


Figure 2: The multi-attribute theme model

3 DELICIOUS RECIPE ANALYSIS
3.1 Multi-Attribute Theme Modeling
We develop a theme modeling method to build the correlation between the ingredients and different types of attributes. Given the recipe set $\mathcal{D}$, each recipe $d \in \mathcal{D}$ is a tuple $[\mathbf{w}_d, \mathbf{l}_d, \mathbf{c}_d, \mathbf{t}_d]$. $\mathbf{w}_d$ is the ingredient vector. $\mathbf{l}_d$ is an $L$-dimensional vector encoding cuisine values: it contains 1 in the position of each cuisine of recipe $d$ and 0 otherwise. The $C$-dimensional vector $\mathbf{c}_d$ and the $T$-dimensional vector $\mathbf{t}_d$ are defined similarly, where $L$, $C$ and $T$ are the numbers of cuisines, courses and flavors, respectively. The goal of Multi-Attribute Theme Modeling (MATM) is to utilize the ingredients $\mathbf{w}_d$ and the three types of attributes $\mathbf{l}_d$, $\mathbf{c}_d$, $\mathbf{t}_d$ to learn the theme space of the ingredients $\{\boldsymbol{\phi}_k\}_{k=1}^{K}$, where $K$ is the number of themes. In addition, we introduce a background distribution $\boldsymbol{\phi}^{bg}$. As presented in Fig. 2, the generative process of MATM is:

(1) Draw a background distribution over ingredients $\boldsymbol{\phi}^{bg} \sim Dir(\boldsymbol{\beta})$.
(2) For each theme $k \in \{1, \dots, K\}$:
    (a) draw $\boldsymbol{\lambda}_k \sim \mathcal{N}(0, \sigma^2 \mathbf{I})$;
    (b) draw $\boldsymbol{\phi}_k \sim Dir(\boldsymbol{\beta})$.
(3) For each recipe $d \in \{1, \dots, D\}$:
    (a) draw $\boldsymbol{\pi}_d \sim Beta(\boldsymbol{\gamma})$;
    (b) for each theme $k$, let $\alpha_{dk} = \exp(\mathbf{f}_d^T \boldsymbol{\lambda}_k)$;
    (c) draw $\boldsymbol{\theta}_d \sim Dir(\boldsymbol{\alpha}_d)$;
    (d) for each ingredient $w_{d,n} \in \mathbf{w}_d$:
        (i) draw a switch variable $s_{d,n} \sim Binomial(\boldsymbol{\pi}_d)$;
        (ii) if $s_{d,n} = 0$, draw an ingredient $w_{d,n} \sim Multi(\boldsymbol{\phi}^{bg})$;
        (iii) if $s_{d,n} = 1$, draw a theme $z_{d,n} \sim Multi(\boldsymbol{\theta}_d)$, then draw an ingredient $w_{d,n} \sim Multi(\boldsymbol{\phi}_{z_{d,n}})$.

Here $Beta(\cdot)$, $Binomial(\cdot)$, $Multi(\cdot)$, $Dir(\cdot)$ and $\mathcal{N}(\cdot)$ denote the Beta, binomial, multinomial, Dirichlet and normal distributions, respectively. $\mathbf{I}$ is the identity matrix, and $\sigma^2 \mathbf{I}$ is a diagonal matrix with diagonal element $\sigma^2$.

Similar to LDA, we assume symmetric Dirichlet priors $\boldsymbol{\beta}$, i.e., $\beta_w = \beta$ for $w \in \{1, \dots, W\}$, where $W$ is the size of the vocabulary. However, the theme distribution $\boldsymbol{\theta}_d$ of each recipe is no longer drawn from a Dirichlet prior with fixed hyper-parameter $\alpha$. Instead, the elements of $\boldsymbol{\alpha}_d$ are $\alpha_{dk} = \exp(\mathbf{f}_d^T \boldsymbol{\lambda}_k)$, where $\boldsymbol{\lambda}$ is a $K \times (L + C + T + 1)$ feature matrix and $\mathbf{f}_d = [\mathbf{l}_d; \mathbf{c}_d; \mathbf{t}_d; 1]$ is the concatenation of the attribute vectors. Note that the representations of $\mathbf{l}_d$ and $\mathbf{c}_d$ are discrete while $\mathbf{t}_d$ is continuous: the score $s_{dt}$ of each flavor $t$ lies in $[0, 1]$. Introducing $\exp(\mathbf{f}_d^T \boldsymbol{\lambda}_k)$ makes it possible to incorporate arbitrary types of observed continuous and discrete features [27]. The trailing 1 is a default feature that accounts for the mean value of each theme. Therefore, the resulting $\boldsymbol{\alpha}$ values are distinct for different combinations of attributes, and the theme distributions extracted from the recipe set are induced by both the ingredient patterns and the various attribute features.
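To make the generative story concrete, the following is a minimal NumPy sketch of steps (1)-(3); all sizes and the toy attribute vectors are invented for illustration and do not correspond to the paper's actual training code.

```python
import numpy as np

rng = np.random.default_rng(0)

W, K, D = 50, 4, 3              # toy vocabulary, theme and recipe counts
F = 10                          # attribute feature length L + C + T + 1
beta, gamma, sigma2 = 0.01, 1.0, 0.5

phi_bg = rng.dirichlet(np.full(W, beta))            # (1) background distribution
lam = rng.normal(0.0, np.sqrt(sigma2), (K, F))      # (2a) theme-attribute weights
phi = rng.dirichlet(np.full(W, beta), size=K)       # (2b) theme-ingredient distributions

for d in range(D):                                  # (3) generate each recipe
    f_d = np.append(rng.integers(0, 2, F - 1), 1.0) # toy attribute vector [l; c; t; 1]
    pi_d = rng.beta(gamma, gamma)                   # (3a) switch prior
    alpha_d = np.exp(f_d @ lam.T)                   # (3b) attribute-conditioned Dirichlet
    theta_d = rng.dirichlet(alpha_d)                # (3c) recipe theme distribution
    for n in range(8):                              # (3d) draw 8 toy ingredients
        if rng.random() < pi_d:                     # s = 1: themed ingredient
            z = rng.choice(K, p=theta_d)
            w = rng.choice(W, p=phi[z])
        else:                                       # s = 0: background ingredient
            w = rng.choice(W, p=phi_bg)
```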

3.1.1 Model Inference and Parameter Estimation. We train this model using a Stochastic Expectation Maximization (SEM) method [10], an iterative procedure that alternates between (i) a stochastic E step and (ii) an M step.

(i) Stochastic E step: at iteration $m$, sample $\mathbf{z}^m$ and $\mathbf{s}^m$ given the current estimate $\boldsymbol{\lambda}^{m-1}$. We use collapsed Gibbs sampling [15] for model inference:

$$p(z_i^m = k, s_i^m = 1 \mid \mathbf{z}_{\neg i}^m, \mathbf{s}_{\neg i}^m, \mathbf{w}_{\neg i}, w_i = w, \boldsymbol{\alpha}_d^{m-1}, \beta, \gamma) \propto \frac{n_{d,1,\neg i} + \gamma}{\sum_s n_{d,s,\neg i} + 2\gamma} \cdot \frac{n_{d,k,\neg i} + \exp(\mathbf{f}_d^T \boldsymbol{\lambda}_k^{m-1})}{\sum_k \big(n_{d,k,\neg i} + \exp(\mathbf{f}_d^T \boldsymbol{\lambda}_k^{m-1})\big)} \cdot \frac{n_{k,w,\neg i} + \beta}{\sum_{w'} n_{k,w',\neg i} + W\beta}$$

$$p(s_i = 0 \mid \mathbf{z}_{\neg i}^m, \mathbf{s}_{\neg i}^m, \mathbf{w}_{\neg i}, w_i = w, \boldsymbol{\alpha}_d^{m-1}, \beta, \gamma) \propto \frac{n^{bg}_{w_i,\neg i} + \beta}{\sum_{w'} n^{bg}_{w',\neg i} + W\beta} \cdot \frac{n_{d,0,\neg i} + \gamma}{\sum_s n_{d,s,\neg i} + 2\gamma} \qquad (1)$$

where $i = (d, n)$ is the current index. The superscript $\neg i$ denotes a counting variable that excludes the $i$-th ingredient index in the corpus. $n_{d,k,\neg i}$ is the number of times theme $k$ is assigned to recipe $d$; $n_{k,w,\neg i}$ is the number of times ingredient $w$ is assigned to theme $k$; $n^{bg}_{w_i,\neg i}$ is the number of times ingredient $w_i$ is assigned to the background distribution; and $n_{d,s,\neg i}$ counts the ingredients of recipe $d$ assigned to the background distribution ($s = 0$) and to the themes ($s = 1$), respectively.
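For concreteness, here is a small NumPy sketch of the per-token sampling probabilities of Eqn. (1); the count arrays and their maintenance are assumed to live elsewhere, and all variable names are ours rather than from the paper's implementation.

```python
import numpy as np

def gibbs_token_probs(d, w, n_dk, n_kw, n_bg, n_ds, alpha_d, beta, gamma, W):
    """Unnormalized assignment probabilities of Eqn (1) for one ingredient
    token, assuming its current assignment was already removed from all
    count arrays (the ¬i convention). Array shapes (illustrative):
    n_dk: (D, K) recipe-theme counts; n_kw: (K, W) theme-ingredient counts;
    n_bg: (W,) background counts; n_ds: (D, 2) switch counts."""
    switch = n_ds[d].sum() + 2 * gamma
    # s = 1, theme k: switch prior * theme proportion * theme-ingredient term
    p_theme = ((n_ds[d, 1] + gamma) / switch
               * (n_dk[d] + alpha_d) / (n_dk[d] + alpha_d).sum()
               * (n_kw[:, w] + beta) / (n_kw.sum(axis=1) + W * beta))
    # s = 0: background term
    p_bg = ((n_bg[w] + beta) / (n_bg.sum() + W * beta)
            * (n_ds[d, 0] + gamma) / switch)
    probs = np.append(p_theme, p_bg)
    return probs / probs.sum()

# usage: idx = rng.choice(K + 1, p=probs); idx < K assigns theme idx with
# s = 1, while idx == K assigns the token to the background (s = 0)
```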

(ii) M step: after the stochastic E step, we obtain $\mathbf{n}_d^m = \{n_{d,1}^m, \dots, n_{d,k}^m, \dots, n_{d,K}^m\}$. In this step, we estimate $\boldsymbol{\lambda}^m$ by maximizing the likelihood function $\mathcal{L} = \ln \big( \prod_{d=1}^{D} p(\mathbf{w}_d, \mathbf{f}_d, \boldsymbol{\lambda}, \boldsymbol{\alpha}_d, \mathbf{z}, \mathbf{s}) \big)$. We solve $\partial \mathcal{L} / \partial \lambda_{kj}^m = 0$, where

$$\frac{\partial \mathcal{L}}{\partial \lambda_{kj}^m} = \sum_{d=1}^{D} \exp(\mathbf{f}_d^T \boldsymbol{\lambda}_k^m) f_{dj} \Big( \Psi\big(\textstyle\sum_{k'=1}^{K} \exp(\mathbf{f}_d^T \boldsymbol{\lambda}_{k'}^m)\big) - \Psi\big(\textstyle\sum_{k'=1}^{K} \exp(\mathbf{f}_d^T \boldsymbol{\lambda}_{k'}^m) + n_d^m\big) + \Psi\big(\exp(\mathbf{f}_d^T \boldsymbol{\lambda}_k^m) + n_{d,k}^m\big) - \Psi\big(\exp(\mathbf{f}_d^T \boldsymbol{\lambda}_k^m)\big) \Big) - \frac{\lambda_{kj}^m}{\sigma^2} \qquad (2)$$

where $\Psi(\cdot)$ is the digamma function, defined as the derivative of the log gamma function. Similar to [27], we use the L-BFGS optimizer to compute $\boldsymbol{\lambda}^m$.

After model inference, we obtain the theme-attribute feature matrix $\{\hat{\boldsymbol{\lambda}}_k\}_{k=1}^{K}$ and the following parameters:

$$\hat{\pi}_{d,s} = \frac{n_{d,s} + \gamma}{\sum_{s'=0}^{1} n_{d,s'} + 2\gamma}, \quad \hat{\phi}_{k,w} = \frac{n_{k,w} + \beta}{\sum_{w'=1}^{W} n_{k,w'} + W\beta}, \quad \hat{\phi}^{bg}_{w} = \frac{n^{bg}_{w} + \beta}{\sum_{w'=1}^{W} n^{bg}_{w'} + W\beta}, \quad \hat{\theta}_{d,k} = \frac{n_{d,k} + \exp(\mathbf{f}_d^T \hat{\boldsymbol{\lambda}}_k)}{\sum_{k'=1}^{K} \big(n_{d,k'} + \exp(\mathbf{f}_d^T \hat{\boldsymbol{\lambda}}_{k'})\big)} \qquad (3)$$
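The M step can be implemented with an off-the-shelf optimizer. Below is a hedged SciPy sketch of the Dirichlet-multinomial regression objective and the gradient of Eqn. (2), without the E-step machinery; the array names are our own, and the code is a sketch rather than the authors' implementation.

```python
import numpy as np
from scipy.special import gammaln, digamma
from scipy.optimize import minimize

def dmr_objective(lam_flat, F_mat, n_dk, sigma2):
    """Negative Dirichlet-multinomial log-likelihood with a Gaussian prior
    on lambda, plus its gradient (cf. Eqn (2)); a sketch assuming F_mat is
    the (D, F) matrix of attribute vectors f_d and n_dk the (D, K)
    theme-count matrix from the E step."""
    D, K = n_dk.shape
    lam = lam_flat.reshape(K, -1)
    alpha = np.exp(F_mat @ lam.T)                        # alpha_{dk} = exp(f_d^T lambda_k)
    n_d = n_dk.sum(axis=1)
    ll = (gammaln(alpha.sum(axis=1)) - gammaln(alpha.sum(axis=1) + n_d)
          + (gammaln(alpha + n_dk) - gammaln(alpha)).sum(axis=1)).sum()
    ll -= (lam ** 2).sum() / (2 * sigma2)                # Gaussian prior on lambda
    # gradient of Eqn (2), vectorized over recipes and themes
    common = (digamma(alpha.sum(axis=1)) - digamma(alpha.sum(axis=1) + n_d))[:, None]
    g_alpha = alpha * (common + digamma(alpha + n_dk) - digamma(alpha))
    grad = g_alpha.T @ F_mat - lam / sigma2              # shape (K, F)
    return -ll, -grad.ravel()

# usage: one L-BFGS solve per M step, warm-started from the previous lambda
# lam_hat = minimize(dmr_objective, lam0.ravel(), jac=True,
#                    args=(F_mat, n_dk, sigma2), method="L-BFGS-B").x
```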

3.2 Multi-Modal Embedding
Via MATM we obtain the theme distribution $\hat{\boldsymbol{\theta}}_d = (\hat{\theta}_{d,1}, \dots, \hat{\theta}_{d,k}, \dots, \hat{\theta}_{d,K})$ for each recipe $d$. In order to correlate multi-modal content through the learned theme space, we resort to an RBM-based model for joint learning because of the powerful inference of RBMs. In particular, we utilize a multi-modal DBM [39] to learn the common space.

Figure 3: The multi-modal embedding

As shown in Fig. 3, there are two pathways, a text pathway and an image pathway, and each pathway contains $M$ layers. For the text pathway, since $\hat{\boldsymbol{\theta}}_d$ is a multinomial distribution, the connections between $\hat{\boldsymbol{\theta}}_d$ and $\mathbf{h}^{(1s)}$ are modeled as a Multinomial RBM [37]. The conditional probability of an input unit is

$$P(\hat{\theta}_{dk} = 1 \mid \mathbf{h}^{(1s)}) = \frac{\exp\big(b_k + \sum_j W^{(1s)}_{kj} h^{(1s)}_j\big)}{\sum_{k'} \exp\big(b_{k'} + \sum_j W^{(1s)}_{k'j} h^{(1s)}_j\big)} \qquad (4)$$

where the weight matrix $\mathbf{W}^{(1s)} = \{W^{(1s)}_{kj}\}$ is associated with the connections between the visible units $\hat{\boldsymbol{\theta}}_d$ and the hidden units $\mathbf{h}^{(1s)} = \{h^{(1s)}_j\}$, with bias weights $\mathbf{b} = \{b_k\}$ for the visible units and $\mathbf{c}^{(1s)} = \{c^{(1s)}_j\}$ for the hidden units.

For the image pathway, the input units $\mathbf{v}_d$ are visual features, and therefore the connections between $\mathbf{v}_d$ and $\mathbf{h}^{(1v)}$ are modeled as a Gaussian RBM [18]. The conditional distribution of an input unit is

$$v_{di} \mid \mathbf{h}^{(1v)} \sim \mathcal{N}\Big(b_i + \sigma_i \sum_j W^{(1v)}_{ij} h^{(1v)}_j,\; \sigma_i^2\Big) \qquad (5)$$

where $\mathbf{v}_d$ is the visual feature vector of recipe $d$ and $\sigma_i^2$ is the variance of the Gaussian distribution. The remaining layers of the two pathways are standard binary RBMs [11]; their conditional distributions are written similarly and omitted here due to space limits.

Similar to [39], we train the network with stochastic gradient descent using the greedy layer-wise pre-training strategy. After training, for a given recipe-theme vector $\hat{\boldsymbol{\theta}}_d$ as input, we can use the mean-field method to sample the hidden modalities from $p(\mathbf{v} \mid \hat{\boldsymbol{\theta}}_d)$ by updating each hidden layer given the states of the adjacent layers. Finally, we obtain the visual representation based on the learned ingredient representation.
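As a rough illustration of the two first-layer conditionals (Eqns. (4)-(5)), the NumPy snippet below draws the visible units of each pathway given toy hidden states; all shapes and weights are invented, and the deeper binary layers and the mean-field loop are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
K, H, V = 100, 100, 4096                 # theme, hidden and visual-feature sizes (toy)

W_s = rng.normal(0, 0.01, (K, H))        # text-pathway weights of Eqn (4)
W_v = rng.normal(0, 0.01, (V, H))        # image-pathway weights of Eqn (5)
b_k, b_i = np.zeros(K), np.zeros(V)      # visible biases
sigma = np.ones(V)                       # unit variance, as used in training

h_s = rng.random(H)                      # toy hidden states of each pathway
h_v = rng.random(H)

# Eqn (4): multinomial visible units of the text pathway (softmax)
logits = b_k + W_s @ h_s
p_theta = np.exp(logits - logits.max())
p_theta /= p_theta.sum()

# Eqn (5): Gaussian visible units of the image pathway
v_mean = b_i + sigma * (W_v @ h_v)
v_sample = rng.normal(v_mean, sigma)
```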

4 APPLICATIONS
4.1 Flavor Analysis and Comparison
In this task, we analyze the flavor patterns of different attributes, such as region and course. For the flavor distributions across regions, we compute $p(t \mid l)$:

$$p(t \mid l) = \sum_k p(t \mid k)\, p(k \mid l) \qquad (6)$$

We obtain the cuisine representation $p(k \mid l)$ as

$$p(k \mid l) = \psi_{l,k} = \frac{\exp(\mathbf{f}_l^T \hat{\boldsymbol{\lambda}}_k)}{\sum_{k'} \exp(\mathbf{f}_l^T \hat{\boldsymbol{\lambda}}_{k'})} \qquad (7)$$

where in $\mathbf{f}_l$ the position corresponding to cuisine $l$ is 1 and all others are 0; the proportion of theme $k$ for cuisine $l$ is $\exp(\mathbf{f}_l^T \hat{\boldsymbol{\lambda}}_k)$. Similarly, we obtain the flavor representation $p(k \mid t)$ as

$$p(k \mid t) = \upsilon_{t,k} = \frac{\exp(\mathbf{f}_t^T \hat{\boldsymbol{\lambda}}_k)}{\sum_{k'} \exp(\mathbf{f}_t^T \hat{\boldsymbol{\lambda}}_{k'})} \qquad (8)$$

We then compute $p(t \mid k)$ as

$$p(t \mid k) = \frac{p(t, k)}{\sum_{t' \in T} p(k \mid t')\, p(t')} = \frac{\upsilon_{t,k}\, p(t)}{\sum_{t' \in T} \upsilon_{t',k}\, p(t')} \qquad (9)$$

where $p(t) = \frac{\sum_d s_{dt}}{\sum_{t'} \sum_d s_{dt'}}$. Therefore, the flavor distribution of a certain cuisine, $p(t \mid l)$, is

$$p(t \mid l) = \sum_k \frac{\upsilon_{t,k}\, p(t)}{\sum_{t' \in T} \upsilon_{t',k}\, p(t')}\, \psi_{l,k} \qquad (10)$$

Similarly, the flavor distribution of a course, $p(t \mid c)$, is computed as

$$p(t \mid c) = \sum_k \frac{\upsilon_{t,k}\, p(t)}{\sum_{t' \in T} \upsilon_{t',k}\, p(t')}\, \xi_{c,k} \qquad (11)$$

where $\xi_{c,k} = \frac{\exp(\mathbf{f}_c^T \hat{\boldsymbol{\lambda}}_k)}{\sum_{k'} \exp(\mathbf{f}_c^T \hat{\boldsymbol{\lambda}}_{k'})}$.
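A compact NumPy sketch of Eqns. (7)-(10) might look as follows; `lam_hat`, `flavor_feats` and `p_t` are our illustrative names for the learned theme-attribute matrix, the per-flavor feature vectors and the empirical flavor prior.

```python
import numpy as np

def attr_theme_dist(f, lam_hat):
    """p(k | attribute): softmax over exp(f^T lambda_k), cf. Eqns (7)-(8)."""
    scores = np.exp(lam_hat @ f)
    return scores / scores.sum()

def flavor_given_cuisine(lam_hat, f_l, flavor_feats, p_t):
    """Eqn (10): p(t | l) = sum_k p(t | k) psi_{l,k}. lam_hat is the learned
    (K, F) theme-attribute matrix, f_l the cuisine's feature vector,
    flavor_feats a (T, F) stack of flavor feature vectors and p_t the
    empirical flavor prior of Eqn (9)."""
    psi = attr_theme_dist(f_l, lam_hat)                     # (K,) psi_{l,k}
    ups = np.stack([attr_theme_dist(f, lam_hat)
                    for f in flavor_feats])                 # (T, K) upsilon_{t,k}
    p_tk = ups * p_t[:, None]                               # numerator of Eqn (9)
    p_tk /= p_tk.sum(axis=0, keepdims=True)                 # p(t | k), per theme k
    return p_tk @ psi                                       # (T,) flavor distribution
```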

4.2 Region-Oriented Multi-dimensional Food Summary
In this task, we summarize regional foods based on representative ingredients, representative recipe images, and flavor and course patterns.

The representativeness score of ingredient $w$ for region $l_q$ is measured by the expected number of times ingredient $w$ is projected onto the theme space:

$$\mathrm{Rep}_{l_q}(w) = n(l_q, w)\, \frac{\sum_k \hat{\phi}_{k,w}\, \psi_{l_q,k}}{\sum_k \hat{\phi}_{k,w}\, \psi_{l_q,k} + \hat{\phi}^{bg}_w}, \qquad p(w \mid l_q) = \frac{\mathrm{Rep}_{l_q}(w)}{\sum_{w'} \mathrm{Rep}_{l_q}(w')} \qquad (12)$$

where $n(l_q, w)$ is the frequency of $w$ in recipes with cuisine $l_q$, and $\psi_{l_q,k}$ is calculated by Eqn. (7). We can obtain a list of region-representative ingredients by ranking the ingredients according to $p(w \mid l_q)$.

In order to select representative recipe images for a region, we consider both textual and visual information. We obtain the representation $\boldsymbol{\psi}_{l_q}$ for the ingredients using Eqn. (7), and then use MME to infer the visual representation $\mathbf{v}_{l_q}$ based on $\boldsymbol{\psi}_{l_q}$. The images are sorted by the weighted sum of cosine similarities between $l_q$ and each image $I_d$:

$$\mathrm{sim}(l_q, I_d) = \tau\, \frac{\boldsymbol{\psi}_{l_q}^T \hat{\boldsymbol{\theta}}_d}{\|\boldsymbol{\psi}_{l_q}\| \|\hat{\boldsymbol{\theta}}_d\|} + (1 - \tau)\, \frac{\mathbf{v}_{l_q}^T \mathbf{v}_d}{\|\mathbf{v}_{l_q}\| \|\mathbf{v}_d\|} \qquad (13)$$

where $\hat{\boldsymbol{\theta}}_d$ and $\mathbf{v}_d$ are the theme and visual representations of recipe $d$. The first term is the semantic similarity and the second term is the visual similarity, and $\tau$ is the weight parameter.

Finally, we compute $p(t \mid l_q)$ and $p(c \mid l_q)$ to obtain the flavor distribution and course distribution using Eqns. (10)-(11), respectively.
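As a small sketch of the image-ranking step of Eqn. (13), the function below scores candidate images by the weighted sum of the two cosine similarities; all argument names are illustrative.

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_region_images(psi_lq, v_lq, thetas, visuals, tau=0.6):
    """Eqn (13): rank a region's recipe images by the weighted sum of the
    theme-space and visual-space cosine similarities. psi_lq and v_lq are
    the cuisine's theme representation and its MME-inferred visual
    representation; thetas and visuals hold one row per candidate recipe."""
    scores = [tau * cos(psi_lq, t) + (1 - tau) * cos(v_lq, v)
              for t, v in zip(thetas, visuals)]
    return np.argsort(scores)[::-1]          # indices of top-ranked images first
```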

4.3 Multi-Attribute Oriented Recipe Recommendation
Given multiple query attributes, including the cuisine $l$, course $c$ and flavor $t$, this task is to recommend relevant recipes matching these query attributes.

The attribute representation $\boldsymbol{\vartheta}_{f_q} = \{p(k \mid \mathbf{f}_q)\}_{k=1}^{K}$ for the specified cuisine, course and flavor information is calculated as

$$p(k \mid \mathbf{f}_q) = \frac{\exp(\mathbf{f}_q^T \hat{\boldsymbol{\lambda}}_k)}{\sum_{k'} \exp(\mathbf{f}_q^T \hat{\boldsymbol{\lambda}}_{k'})} \qquad (14)$$

For each recipe $d$ with its ingredient information, we compute the recipe representation $p(k \mid d)$ through the Gibbs sampler by maximizing

$$\mathcal{L}(\mathbf{w}_d) = \prod_{i=1}^{N_d} \Big[ \pi_{d,0}\, \hat{\phi}^{bg}_{w_i} + \pi_{d,1} \sum_{k=1}^{K} p(k \mid d)\, \hat{\phi}_{k,w_i} \Big] \qquad (15)$$

where $N_d$ is the number of ingredients in recipe $d$. We rank the recipes in the dataset according to their similarity to the query attributes, computed as

$$\mathrm{JsSim}(\boldsymbol{\vartheta}_f, \boldsymbol{\theta}_d) = \exp\{-D_{js}(\boldsymbol{\vartheta}_f \,\|\, \boldsymbol{\theta}_d)\} \qquad (16)$$

where $D_{js}(\cdot \| \cdot)$ denotes the Jensen-Shannon divergence and $\boldsymbol{\theta}_d = \{p(k \mid d)\}_{k=1}^{K}$.
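The similarity of Eqn. (16) is straightforward to compute with SciPy; a minimal sketch follows, with all names our own.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def js_sim(query_theta, recipe_theta):
    """Eqn (16): exp(-D_js). SciPy's jensenshannon returns the square root
    of the Jensen-Shannon divergence, hence the squaring."""
    return float(np.exp(-jensenshannon(query_theta, recipe_theta) ** 2))

# ranking sketch: sort recipes by similarity between the query
# representation of Eqn (14) and the theme distributions from Eqn (15)
# ranked = sorted(range(len(thetas)), key=lambda d: -js_sim(theta_q, thetas[d]))
```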

5 EXPERIMENT
5.1 Experimental Settings
5.1.1 Dataset. The experimental dataset is crawled from Yummly. Each recipe includes the ingredients, the image and three attributes: cuisine, course and flavor. There are 44,204 recipe items, 10 cuisines, 14 courses¹ and 6 flavors. Table 1 lists the values of the different attributes. Note that we use course names and their abbreviations interchangeably; the name in brackets is the abbreviation of the course. Each cuisine represents one region. The value of each flavor is continuous and lies in [0, 1]; the other two attributes are categorical. We preprocess each ingredient line using the method of [30]. We represent ingredients with two or more words by concatenating the words with "-"; for example, we use chili-powder to represent the ingredient "chili powder". After preprocessing, the ingredient vocabulary contains 2,407 entries.

¹ Different from the 13 courses predefined in Yummly, we consider Lunch and Snacks as two kinds of courses.
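The ingredient normalization described above is simple to reproduce; a minimal sketch (the paper additionally applies the preprocessing of [30], which is not shown here):

```python
def normalize_ingredient(name: str) -> str:
    """Join multi-word ingredient names with '-', e.g.
    'chili powder' -> 'chili-powder'."""
    return "-".join(name.lower().split())

assert normalize_ingredient("chili powder") == "chili-powder"
```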

Table 1: Values of different attributes

Type    | Value
Cuisine | American, Italian, Mexican, Indian, French, Thai, Chinese, Spanish, Greek, Japanese
Course  | Main Dishes (MD), Desserts (DE), Side Dishes (SD), Salads (SA), Lunch (LU), Snacks (SN), Soups (SO), Afternoon Tea (AT), Condiments and Sauces (CS), Breads (BR), Breakfast and Brunch (BB), Beverages (BE), Cocktails (CO), Appetizers (AP)
Flavor  | Piquant, Sour, Salty, Sweet, Bitter, Meaty

5.1.2 Implementation Details. For each image, we use the VGG-16 deep network to extract 4,096-D features following [16]. For MATM, $\beta = 0.01$ and $\gamma = 1.0$. Similar to [27], the variance $\sigma^2$ is set to 0.5 for all attribute features and 10.0 for the default feature. For the initialization of SEM, the parameter $\alpha = 1.0/K$. We run 2,000 iterations to train MATM; after an initial burn-in period of 200 iterations, we optimize $\boldsymbol{\lambda}$ every 50 iterations. For MME, the learning rates of the multinomial layer and the Gaussian layer are 0.10 and 0.001, respectively, and the learning rate of the other layers is 0.01. For the Gaussian RBM, each Gaussian visible unit is empirically set to have unit variance to guarantee training stability [39]. At inference time, the number of mean-field iterations is 100. In Eqn. (13), $\tau$ is empirically set to 0.6.

5.2 Evaluation of MATM
5.2.1 Theme Number K Selection. In theme modeling, the selection of the theme number $K$ is important. We resort to the perplexity [6] as the metric, a standard measure of how well a generative model fits the data: the lower the perplexity, the better the performance. In MATM, the perplexity of the test set is defined as:

$$\mathrm{perplexity}(\mathcal{D}_{test}) = \exp\Big( -\frac{\sum_{d \in \mathcal{D}_{test}} \log p(\mathbf{w}_d, \mathbf{f}_d \mid \mathcal{D}_{train})}{\sum_{d \in \mathcal{D}_{test}} N_d} \Big) \qquad (17)$$

$$p(\mathbf{w}_d, \mathbf{f}_d \mid \mathcal{D}_{train}) = \prod_{i=1}^{N_d} \Big[ \pi_{d,0}\, \hat{\phi}^{bg}_{w_i} + \pi_{d,1} \sum_{k=1}^{K} \frac{n_{d,k} + \exp(\mathbf{f}_d^T \hat{\boldsymbol{\lambda}}_k)}{\sum_{k'=1}^{K} \big(n_{d,k'} + \exp(\mathbf{f}_d^T \hat{\boldsymbol{\lambda}}_{k'})\big)}\, \hat{\phi}_{k,w_i} \Big] \qquad (18)$$

where $\mathcal{D}_{train}$ is the training set and $\mathcal{D}_{test}$ is the test set. A Gibbs sampler is run on the test data $\mathcal{D}_{test}$ to calculate the counts $n_{d,k}$ and $\pi_d$ given $\hat{\phi}_{k,w}$ and $\hat{\boldsymbol{\lambda}}$ learned from $\mathcal{D}_{train}$.

For the dataset, we randomly select 90% of the recipe items as training data and the remaining 10% as test data. We set $K \in \{40, 60, 80, 100, 120, 140\}$. The perplexity scores over different theme numbers and iterations are shown in Fig. 4. We can see that (1) the perplexity decreases slowly and converges to a stable level after 1,000 iterations, and (2) the perplexity decreases much more slowly when $K \geq 100$. Therefore, we choose $K = 100$ in the following experiments.
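A sketch of Eqns. (17)-(18) in NumPy; we assume a test-side Gibbs run has already produced, per held-out recipe, its ingredient ids `w_ids`, switch proportions `pi_d` and normalized theme mixture `theta_d` (all names illustrative).

```python
import numpy as np

def log_p_recipe(w_ids, pi_d, theta_d, phi_hat, phi_bg):
    """log p(w_d, f_d | D_train) of Eqn (18); theta_d is the normalized
    n_{d,k} + exp(f_d^T lambda_k) mixture, phi_hat the (K, W)
    theme-ingredient matrix and phi_bg the background distribution."""
    mix = pi_d[0] * phi_bg[w_ids] + pi_d[1] * (theta_d @ phi_hat)[w_ids]
    return float(np.log(mix).sum())

def perplexity(test_recipes, phi_hat, phi_bg):
    """Eqn (17): exponentiated negative per-token average log-likelihood,
    where test_recipes yields (w_ids, pi_d, theta_d) per held-out recipe."""
    total_ll = total_tokens = 0
    for w_ids, pi_d, theta_d in test_recipes:
        total_ll += log_p_recipe(w_ids, pi_d, theta_d, phi_hat, phi_bg)
        total_tokens += len(w_ids)
    return float(np.exp(-total_ll / total_tokens))
```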

Figure 4: The perplexity score over different theme numbers for different iterations. [Plot: perplexity (y-axis, 110-160) vs. number of iterations (x-axis, 1-2,000), one curve per K = 40, 60, 80, 100, 120, 140.]

5.2.2 Illustration of Discovered Themes. In addition, we provide some discovered themes in Table 2, where "#j" denotes the j-th theme index. Each theme is represented by its 10 top-ranked ingredients, sorted by $\hat{\phi}$. We observe that some themes denote the classic

combination of ingredients, such as Theme #11. Theme #12 denotes the ingredients of fruit drinks, while Theme #44 denotes Italian-style ingredients. From these examples, we can see that the themes have a reasonable interpretation as ingredient bases.

Table 2: Some examples of discovered themes.

Theme #11              Theme #12                     Theme #44
egg 0.366              honey 0.094                   mozzarella-cheese 0.124
milk 0.178             mango 0.073                   parmesan-cheese 0.110
salt 0.111             fresh-lime-juice 0.064        ricotta-cheese 0.098
butter 0.060           chopped-fresh-mint 0.041      egg 0.079
flour 0.044            fresh-mint 0.041              lasagna-noodle 0.058
salted-butter 0.032    juice 0.039                   marinara-sauce 0.053
melted-butter 0.016    navel-orange 0.036            pasta-sauce 0.044
egg-white 0.015        pineapple 0.035               italian-seasoning 0.043
bread 0.014            orange 0.029                  italian-sausage 0.037
honey 0.012            strawberry 0.027              tomato-sauce 0.030

5.3 Evaluation of MME
We investigate how many hidden layers $M$ we should put between the recipe's textual ingredients and the visual modality. As shown in Fig. 3, we could have one intervening layer, creating an RBM (theme input | joint hidden layer | image input), denoted 1-layer. A two-layered MME would have three intervening layers (theme input | theme hidden 1 | joint hidden | image hidden 1 | image input), denoted 2-layer, and so on. The numbers of units of the different layers are {4096, 1024, 1024, 1024, 1024} for the image pathway and {100, 100, 100, 100, 100} for the text pathway. For fair comparison, the joint layer has 1024 + 100 = 1124 units under every selection of layers from the image and text pathways.

MME is used to generate the visual modality given specified theme features; we therefore select the cross-modal retrieval task for evaluation. Since there is only one ground-truth match between the recipe ingredients and the recipe image, we use Top R% for evaluation [21], which is the relative number of images correctly retrieved within the first R% of the ranked list. Specifically, we set $R \in \{20, 40, 60, 80\}$. Based on the split used in the evaluation of MATM, we further select 10% of the training data as a validation set. Based on the learned representations, cosine similarity is calculated to obtain the ranked results. Table 3 shows the results of these models. Comparing the performance of MME with different numbers of layers, we can see that the 3-layer MME performs best. Therefore, we select $M = 3$ for the following experiments.

Table 3: The performance for different layers with Top R%.

Method   R=20   R=40   R=60   R=80
1-layer  0.199  0.399  0.598  0.798
2-layer  0.202  0.399  0.600  0.800
3-layer  0.207  0.411  0.615  0.810
4-layer  0.201  0.401  0.600  0.800
5-layer  0.203  0.404  0.603  0.799
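The Top R% metric is easy to compute once a query-by-candidate similarity matrix is available; a minimal sketch, assuming the ground-truth match of query q sits at index q:

```python
import numpy as np

def top_r_percent(sim_matrix, r):
    """Top R% [21]: the fraction of queries whose single ground-truth match
    (assumed on the diagonal of sim_matrix) appears within the first r%
    of its similarity-ranked candidate list."""
    n = sim_matrix.shape[0]
    cutoff = int(np.ceil(n * r / 100.0))
    hits = 0
    for q in range(n):
        ranked = np.argsort(-sim_matrix[q])   # most similar candidates first
        hits += int(q in ranked[:cutoff])
    return hits / n
```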

5.4 Evaluation of Applications
5.4.1 Flavor Analysis and Comparison. The flavor distributions of different regions are shown in Fig. 5. Due to space limits, we show the flavor distributions of four countries: American, French, Chinese and Mexican. From these results, we can analyze and compare flavor patterns within one country or across countries. For example, American cuisine shows a high consumption of meaty foods, with a meaty-flavor probability of 0.217, the largest among the flavors of this country; this is consistent with meat being a staple of the American diet. Besides meaty foods, French cuisine also favors sweet foods, with a probability of 0.130, the largest sweet proportion among the four countries. The most significant difference in the flavor distributions of these four countries is the high proportion of salty-labeled foods in Chinese cuisine, with a probability of 0.31; indeed, China is among the countries with the highest consumption of salty foods in the world [19].

In addition, we show the flavor differences among different courses in Fig. 6. Some courses, such as Beverages (BE), Side Dishes (SD) and Snacks (SN), show comparatively higher proportions of sweet foods than other courses, perhaps reflecting the prevalence of sugary items like fruit juices. Meanwhile, salty and meaty foods dominate courses such as Main Dishes (MD) and Lunch (LU), where more meat tends to be consumed.

These results can be insightful for future exploration enabling different applications, such as exploring culinary habits [2] and providing better recommendation services.

5.4.2 Region-Oriented Multi-dimensional Food Summary. As it is subjective to evaluate the summarized results, we resort to a user study and consider the following baselines for comparison.
• MATM based Textual Summary (MATM-TS). MATM-TS uses Eqn. (12), derived from MATM, to select representative ingredients for the summary.
• MATM based Multi-modal Summary (MATM-MS). MATM-MS uses both representative ingredients and images for the summary. This baseline selects recipe images based on the similarity of the ingredients.

• MATM and MME based Multi-modal Summary (MATM-MME-MS). This baseline also uses both representative ingredients and images for the summary; however, it considers both textual and visual information, using Eqn. (13) to select representative images.

Figure 5: Flavor distributions in different regions.
           piquant  sour   salty  sweet  bitter  meaty
American   0.085    0.159  0.207  0.116  0.216   0.217
French     0.075    0.119  0.278  0.130  0.178   0.220
Chinese    0.121    0.112  0.311  0.096  0.185   0.175
Mexican    0.143    0.159  0.235  0.079  0.233   0.151

Figure 6: Flavor distributions on different courses. [Bar chart of p(t|c) over the courses MD, DE, SD, SA, AT, SO, LU, SN, CS, BR, BB, BE, CO, AP for the six flavors piquant, sour, salty, sweet, bitter and meaty.]

Figure 7: Subjective evaluation for food summary. [Bar chart of 1-5 ratings of Relevance, Diversity and Satisfaction for MATM-TS, MATM-MS, MATM-MME-MS and our method.]

Compared with MATM-MME-MS, our framework additionally introduces the flavor and course distributions. We asked 15 graduate students to rate the summarized results on a 1 to 5 scale along three aspects: (1) Relevance (to what extent the summary depicts the regional food), (2) Diversity (whether the regional food is depicted from multiple aspects), and (3) Satisfaction (how satisfied the rater is with the summarized results). Raters could consult the Internet (e.g., Wikipedia) and cooking books to help make their judgments.

We show the 50 top-ranked ingredients as a tag cloud, generated by wordcloud², together with the 10 top-ranked food images, and average all the scores into the final rating. The results are shown in Fig. 7. We can see that (1) after incorporating visual information into the image selection, MATM-MME-MS scores higher than MATM-MS in all aspects. We further observed that some retrieved recipe images were not dishes corresponding to the ingredients but cookbooks or other irrelevant content; MATM-MME-MS can exclude such images and improve the ranked results. (2) For relevance, the difference between MATM-MME-MS and our method is small, because introducing the flavor and course distributions does not affect the relevance score; in the other two aspects, our method shows advantages due to the introduced flavor and course distributions. In addition, two example summaries, for Chinese and Mexican food, are illustrated in Fig. 8, with the recipe name shown below each image. From such multi-dimensional regional food summaries, users can understand the local culinary characteristics of a region efficiently and comprehensively through the textual information, visual information and different attribute patterns. For example, we can see that Mexican cuisine favors sour and piquant flavors, since its representative ingredients include chili-powder, sour-cream and so on; correspondingly, the combined probability of piquant and sour is the highest among the flavor proportions. From the representative recipe images, we can further easily grasp the local culinary habits.

² https://github.com/amueller/word_cloud

5.4.3 Multi-Attribute Oriented Recipe Recommendation. We divide the test set into two parts: one part contains the course, cuisine and flavor information, and the other contains the recipe id with the ingredients. Our goal is to use the flavor, course and cuisine attributes as the query to retrieve recipes. We choose S@H [44] as the metric, i.e., the success rate of finding the target within the top-H recommendation results, with H ∈ {1, 3, 5}. As a baseline, we consider LDA [6], trained without the attribute information; its attribute representation is the average of the theme representations of training recipes annotated with the query attributes. As shown in Fig. 9, our method outperforms the baseline due to the introduction of the attributes. Because some recipes share the same attributes, we further choose MAP@5 for evaluation.

Figure 8: Multi-dimensional summary for Chinese and Mexican food, each showing (a) representative recipe images, (b) representative ingredients, (c) the flavor distribution and (d) the course distribution. [Chinese images include Shumai, Classic Beef Fried Rice, Cantonese Fried Noodles, Penang Hokkien Char, Moo-Shu Pork, Trini-Chinese Chicken, Scallion Beef Stir-fry, Sesame Ginger Chicken, Soy Sauce Chow Mein and Chinese Bean Curd Rolls; Mexican images include Butternut Squash & Mushroom Enchiladas, New Mexican Posole, Quick and Easy Enchiladas, Lime Chicken Chimichangas, Grilled Tequila Lime Chicken, Mexican Lasagna with Fire-Roasted Tomato Sauce, Roasted Butternut Squash and Black Bean Enchiladas, Healthy Chicken Chili with Barley, Two-Bean and Corn Chipotle Turkey Chili and Slow-Cooker Shredded Beef Tacos.]

Figure 9: The evaluation of recipe recommendation. [S@H for H = 1, 3, 5: LDA 0.522, 0.571, 0.575; MATM 0.536, 0.574, 0.577.]

Figure 10: Some recommendation examples from MATM.

The MAP@5 of LDA and MATM is 0.571 and 0.578, respectively; our method again outperforms the baseline. Fig. 10 shows some example recommendation results from MATM, where the recipe image and name associated with each recipe id are presented.
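The S@H metric itself is simple; a minimal sketch with illustrative names:

```python
def success_at_h(ranked_lists, targets, h):
    """S@H [44]: the rate of finding the target recipe id within the top-H
    results; ranked_lists[q] holds the ranked recipe ids for query q and
    targets[q] its ground-truth id."""
    hits = sum(targets[q] in ranked_lists[q][:h] for q in range(len(targets)))
    return hits / len(targets)
```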

6 CONCLUSIONS
This work has presented a recipe analysis framework that incorporates multi-modal information and various types of continuous and discrete attribute features for multi-dimensional food analysis. A multi-attribute theme modeling method is proposed to jointly model the ingredients and arbitrary types of recipe attributes, such as the continuous flavor attributes and the discrete cuisine and course attributes. The derived attribute-theme representation and multi-modal correlation have demonstrated their effectiveness in our three proposed applications: flavor analysis, food summary and recipe recommendation. We hope that this framework can further the agenda of food-related study and also contribute to other fields, such as computational gastronomy and food science [1].

Four directions need further investigation. The first is to enlarge our dataset for more comprehensive and deeper multi-dimensional food analysis. The second is how to select recipe images of higher quality for commercial applications; for example, Alex M. [25] attempted to understand photo quality on Yelp for a cover-photo sorting application. Third, we have demonstrated the potential of our framework via three applications; based on the proposed framework, more real-world applications could be designed, such as food tourism [14]. The fourth is a product dimension: building a better personalized food recommendation system. For example, Ge et al. [13] designed a recipe recommender system that suits both users' preferences and their health; such a system should offer healthier options while still satisfying individual preferences.

ACKNOWLEDGMENT
This work was supported in part by the National Natural Science Foundation of China (61532018, 61322212, 61602437 and 61672497), in part by the Beijing Municipal Commission of Science and Technology (D161100001816001), in part by the Beijing Natural Science Foundation (4174106), in part by the Lenovo Outstanding Young Scientists Program, in part by the National Program for Special Support of Eminent Professionals and the National Program for Support of Top-notch Young Professionals, and in part by the China Postdoctoral Science Foundation (2016M590135, 2017T100110).


REFERENCES
[1] Yong Yeol Ahn and Sebastian Ahnert. 2013. The Flavor Network. Leonardo 46, 3 (2013), 272–273.
[2] Yong-Yeol Ahn, Sebastian E. Ahnert, James P. Bagrow, and Albert-László Barabási. 2011. Flavor network and the principles of food pairing. Scientific Reports 1 (2011).
[3] Kiyoharu Aizawa and Makoto Ogawa. 2015. FoodLog: Multimedia Tool for Healthcare Applications. IEEE Multimedia 22, 2 (2015), 4–8.
[4] Roberto Camacho Barranco, Laura M. Rodriguez, Rebecca Urbina, and M. Shahriar Hossain. 2016. Is a Picture Worth Ten Thousand Words in a Review Dataset? arXiv:1606.07496 (2016).
[5] David M. Blei. 2012. Probabilistic topic models. Communications of the ACM 55, 4 (2012), 77–84.
[6] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3 (2003), 993–1022.
[7] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. 2014. Food-101 – mining discriminative components with random forests. In European Conference on Computer Vision. 446–461.
[8] Jingjing Chen and Chong-Wah Ngo. 2016. Deep-based ingredient recognition for cooking recipe retrieval. In Proceedings of the 2016 ACM Multimedia Conference. 32–41.
[9] Munmun De Choudhury, Sanket Sharma, and Emre Kiciman. 2016. Characterizing Dietary Choices, Nutrition, and Language in Food Deserts via Social Media. In ACM Conference on Computer-Supported Cooperative Work and Social Computing. 1157–1170.
[10] José G. Dias and Michel Wedel. 2004. An empirical comparison of EM, SEM and MCMC performance for problematic Gaussian mixture likelihoods. Statistics and Computing 14, 4 (2004), 323–332.
[11] Asja Fischer and Christian Igel. 2014. Training restricted Boltzmann machines: An introduction. Pattern Recognition 47, 1 (2014), 25–39.
[12] Peter Forbes and Mu Zhu. 2011. Content-boosted matrix factorization for recommender systems: experiments with recipe recommendation. In Proceedings of the Fifth ACM Conference on Recommender Systems. 261–264.
[13] Mouzhi Ge, Francesco Ricci, and David Massimo. 2015. Health-aware Food Recommender System. In Proceedings of the Conference on Recommender Systems. 333–334.
[14] Andrea Giampiccoli and Janet Hayward Kalis. 2012. Tourism, Food, and Culture: Community-Based Tourism, Local Food, and Community Development in Mpondoland. Culture and Agriculture 34, 2 (2012), 101–123.
[15] T. L. Griffiths and M. Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America 101, Suppl 1 (2004), 5228–5235.
[16] Jack Hessel, Nicolas Savva, and Michael J. Wilber. 2015. Image Representations and New Domains in Neural Image Captioning. Computer Science (2015).
[17] Geoffrey E. Hinton and Ruslan Salakhutdinov. 2009. Replicated softmax: an undirected topic model. In Advances in Neural Information Processing Systems. 1607–1614.
[18] Geoffrey E. Hinton and Ruslan R. Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. Science 313, 5786 (2006), 504–507.
[19] D. B. Hipgrave, S. Chang, X. Li, and Y. Wu. 2016. Salt and Sodium Intake in China. JAMA 315, 7 (2016), 703.
[20] Patrick D. Howell, Layla D. Martin, Hesamoddin Salehian, Chul Lee, Kyler M. Eastman, and Joohyun Kim. 2016. Analyzing Taste Preferences From Crowdsourced Food Entries. In International Conference on Digital Health. 131–140.
[21] Yangqing Jia, Mathieu Salzmann, and Trevor Darrell. 2011. Learning cross-modality similarity for multinomial data. In IEEE International Conference on Computer Vision. 2407–2414.
[22] Hokuto Kagaya, Kiyoharu Aizawa, and Makoto Ogawa. 2014. Food Detection and Recognition Using Convolutional Neural Network. In Proceedings of the ACM International Conference on Multimedia. 1085–1088.
[23] Hokuto Kagaya, Kiyoharu Aizawa, and Makoto Ogawa. 2014. Food detection and recognition using convolutional neural network. In Proceedings of the ACM International Conference on Multimedia. 1085–1088.
[24] Chia-Jen Lin, Tsung-Ting Kuo, and Shou-De Lin. 2014. A content-based matrix factorization model for recipe recommendation. In Advances in Knowledge Discovery and Data Mining. 560–571.
[25] Alex M. 2016. Finding Beautiful Yelp Photos Using Deep Learning. https://engineeringblog.yelp.com/2016/11/finding-beautiful-yelp-photos-using-deep-learning.html (2016).
[26] Austin Meyers, Nick Johnston, Vivek Rathod, Anoop Korattikara, Alex Gorban, Nathan Silberman, Sergio Guadarrama, George Papandreou, Jonathan Huang, and Kevin P. Murphy. 2015. Im2Calories: towards an automated mobile vision food diary. In Proceedings of the IEEE International Conference on Computer Vision. 1233–1241.
[27] David Mimno and Andrew McCallum. 2008. Topic Models Conditioned on Arbitrary Features with Dirichlet-multinomial Regression. In Uncertainty in Artificial Intelligence. 411–418.
[28] Weiqing Min, Bing-Kun Bao, and Changsheng Xu. 2014. Multimodal Spatio-Temporal Theme Modeling for Landmark Analysis. IEEE Multimedia 21, 3 (2014), 20–29.
[29] Weiqing Min, Bing-Kun Bao, Changsheng Xu, and M. Shamim Hossain. 2015. Cross-Platform Multi-Modal Topic Modeling for Personalized Inter-Platform Recommendation. IEEE Transactions on Multimedia 17, 10 (2015), 1787–1801.
[30] Weiqing Min, Shuqiang Jiang, Jitao Sang, Huayang Wang, Xinda Liu, and Luis Herranz. 2017. Being a Super Cook: Joint Food Attributes and Multi-Modal Content Modeling for Recipe Retrieval and Exploration. IEEE Transactions on Multimedia 19, 5 (2017), 1100–1113.
[31] Ferda Ofli, Yusuf Aytar, Ingmar Weber, Raggi Al Hammouri, and Antonio Torralba. 2017. Is Saki delicious? The Food Perception Gap on Instagram and Its Relation to Health. arXiv preprint arXiv:1702.06318 (2017).
[32] Shengsheng Qian, Tianzhu Zhang, and Changsheng Xu. 2016. Multi-modal Multi-view Topic-opinion Mining for Social Event Analysis. In ACM Multimedia Conference. 2–11.
[33] Jaclyn Rich, Hamed Haddadi, and Timothy M. Hospedales. 2016. Towards Bottom-Up Analysis of Social Food. In Proceedings of the 6th International Conference on Digital Health. 111–120.
[34] M. Rosenbaum, M. Sy, K. Pavlovich, R. L. Leibel, and J. Hirsch. 2008. Leptin reverses weight loss-induced changes in regional neural activity responses to visual food stimuli. Journal of Clinical Investigation 118, 7 (2008), 2583–2591.
[35] Sina Sajadmanesh, Sina Jafarzadeh, Seyed Ali Ossia, Hamid R. Rabiee, Hamed Haddadi, Yelena Mejova, Mirco Musolesi, Emiliano De Cristofaro, and Gianluca Stringhini. 2016. Kissing Cuisines: Exploring Worldwide Culinary Habits on the Web. arXiv preprint arXiv:1610.08469 (2016).
[36] Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. 2007. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the Twenty-Fourth International Conference on Machine Learning. 791–798.
[37] R. Salakhutdinov, J. B. Tenenbaum, and A. Torralba. 2013. Learning with Hierarchical-Deep Models. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 8 (2013), 1958.
[38] Tiago Simas, Michal Ficek, Albert Diaz-Guilera, Pere Obrador, and Pablo R. Rodriguez. 2017. Food-bridging: a new network construction to unveil the principles of cooking. arXiv preprint (2017).
[39] Nitish Srivastava and Ruslan Salakhutdinov. 2014. Multimodal learning with deep Boltzmann machines. Journal of Machine Learning Research 15, 1 (2014), 2949–2980.
[40] Chong Wang, D. Blei, and Fei-Fei Li. 2009. Simultaneous image classification and annotation. In IEEE Conference on Computer Vision and Pattern Recognition. 1903–1910.
[41] Hui Wu, Michele Merler, Rosario Uceda-Sosa, and John R. Smith. 2016. Learning to Make Better Mistakes: Semantics-aware Visual Food Recognition. In ACM Multimedia Conference. 172–176.
[42] Ruihan Xu, Luis Herranz, Shuqiang Jiang, Shuang Wang, Xinhang Song, and Ramesh Jain. 2015. Geolocalized Modeling for Dish Recognition. IEEE Transactions on Multimedia 17, 8 (2015), 1187–1199.
[43] Longqi Yang, Yin Cui, Fan Zhang, John P. Pollak, Serge Belongie, and Deborah Estrin. 2015. PlateClick: Bootstrapping Food Preferences Through an Adaptive Visual Interface. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management. 183–192.
[44] Wanlei Zhao, Yu-Gang Jiang, and Chong-Wah Ngo. 2006. Keyframe Retrieval by Keypoints: Can Point-to-Point Matching Help? Lecture Notes in Computer Science 4071 (2006), 72–81.
