
Classification of photographic images based on perceived aesthetic quality

Jeff Hwang JEFF.HWANG@STANFORD.EDU

Department of Electrical Engineering, Stanford University

Sean Shi SSHI11@STANFORD.EDU

Department of Electrical Engineering, Stanford University

Abstract

In this paper, we explore automated aesthetic evaluation of photographs using machine learning and image processing techniques. We theorize that the spatial distribution of certain visual elements within a given image correlates with its aesthetic quality. To this end, we present a novel approach wherein we model each photograph as a set of tiles, extract visual features from each tile, and train a classifier on the resulting features along with the images' aesthetics ratings. Our model achieves a 10-fold cross-validation classification success rate of 83.60%, corroborating the efficacy of our methodology and therefore showing promise for future development.

1. Introduction

Aesthetics in photography are highly subjective. The average individual may judge the quality of a photograph simply by gut feeling; in contrast, a photographer might evaluate a photograph he or she captures vis-a-vis technical criteria such as composition, contrast, and sharpness. Towards fulfilling these criteria, photographers follow many rules of thumb. The actual and relative visual impact of doing so for the general public, however, remains unclear.

In our project, we show that the existence of certain characteristics does indeed make an image more aesthetically pleasing in general. We achieve this by building a machine learning pipeline that trains a hypothesis capable of classifying images as either exhibiting high levels of aesthetic quality or not.

The potential impact of building a system to solve this problem is broad. For example, by implementing such a system, websites with community-sourced images can programmatically filter out bad images to maintain the desired quality of content. Cameras can provide real-time visual feedback to help users improve their photographic skills. Moreover, from a cognitive standpoint, solving this problem may lend interesting insight into how humans perceive beauty.

We begin by identifying visual features that we believe correlate with the aesthetic quality of a photograph. We then build a learning pipeline that extracts these features from images on a per-tile basis and uses them along with the images' aesthetics ratings to train a classifier. In this manner, we endow the classifier with the freedom to infer spatial relationships amongst features that correlate with an image's aesthetics.

2. Related work

There have been several efforts to tackle this problem from different angles within the past decade. Pogacnik et al. [1] believed that the features depended heavily on identification of the subject of the photograph. Datta et al. [2] evaluated the performance of different machine learning models (support vector machines, decision trees) on the problem. Ke et al. [3] focused on extracting perceptual factors important to professional photographers, such as color, noise, blur, and the spatial distribution of edges.

Also, in contrast to our approach, it is interesting to note that these studies have focused primarily on extracting features that attempt to capture prior beliefs about the spatial arrangement of visual elements within the image. For example, Datta et al. attempted to model rule-of-thirds composition by computing the average hue, saturation, and luminance of the inner thirds rectangle, and Pogacnik et al. defined features that assessed adherence to a multitude of compositional rules as well as the positioning of the subject relative to the image's frame.


Figure 1. Tiling scheme applied to an image by the learning pipeline.


3. Dataset

Our learning pipeline downloads images and their average aesthetic ratings from two separate datasets.

The first is an image database hosted by photo.net, a photo-sharing website for photographers. The index file we use to locate images was generated by Datta et al. [2]. Members of photo.net can upload and critique each other's photographs and rate each photograph with a number between 1 and 7, with 7 being the best possible rating. Due to the wide range and subjectivity of ratings, we choose to use only photographs with ratings above 6 or below 4.2, which yields a dataset containing 1700 images split evenly between positive and negative labels.

The second comprises images scraped from DPChallenge, another photo-sharing website for photographers. The index file we used to locate images was generated by Murray et al. [4]. Following guidelines from prior work, we choose to use photographs with ratings above 7.2 or below 3.4, resulting in a dataset containing 2000 images split evenly between positive and negative labels.

4. Feature extraction

Prior to extracting features, we partition each image into five-by-five equally sized tiles (Figure 1). By extracting features on a per-tile basis, the learning algorithm can identify regions of interest and infer relationships between feature-tile pairs that indicate aesthetic quality. For example, in the case of the image depicted in Figure 1, we surmise that the learning algorithm would be able to discern the well-composed framing of the pier from the features extracted from its containing tiles with respect to those extracted from the surrounding tiles.
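A minimal sketch of this tiling step in Python with NumPy (the helper name and the use of linspace to handle dimensions not divisible by five are illustrative choices, not prescribed by our pipeline):

```python
import numpy as np

def tile_image(img, n=5):
    """Split an H x W (x C) image array into an n x n grid of near-equal tiles."""
    rows = np.linspace(0, img.shape[0], n + 1, dtype=int)
    cols = np.linspace(0, img.shape[1], n + 1, dtype=int)
    return [img[rows[i]:rows[i + 1], cols[j]:cols[j + 1]]
            for i in range(n) for j in range(n)]
```

Each per-tile feature described below is then computed once per element of this list, yielding 25 values per feature per image.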

Below, we describe the features we extract from each image tile.

Subject detection: Strong edges distinguish the subject from the image's background. To quantify the degree of subject-background separation, we apply a Sobel filter to each image tile, binarize the result via Otsu's method, and compute the proportion of pixels in the tile that are edge pixels:

f_{\mathrm{sd}} = \frac{1}{|\mathrm{Tile}|} \sum_{(x,y) \in \mathrm{Tile}} \mathbf{1}\{ I(x,y) = \mathrm{Edge} \}
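A sketch of this feature with scikit-image (the function name is illustrative; an RGB tile is assumed):

```python
from skimage.color import rgb2gray
from skimage.filters import sobel, threshold_otsu

def subject_detection_feature(tile_rgb):
    """Proportion of tile pixels marked as edges after Sobel filtering
    and Otsu binarization."""
    grad = sobel(rgb2gray(tile_rgb))      # Sobel gradient magnitude
    edges = grad > threshold_otsu(grad)   # binarize via Otsu's method
    return edges.mean()                   # fraction of edge pixels
```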

Color palette: A photograph's color composition can dramatically influence how a person perceives a photograph. We capture the color diversity of a photograph using a color histogram that subdivides the three-dimensional RGB color space into 64 equally sized bins. Since each pixel can take on one of 256 discrete values in each color channel, each bin is a cube spanning 64 values in each dimension. We normalize each bin's count by the total pixel count so that the histogram is invariant to image dimensions.
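A sketch of the histogram computation with NumPy (the helper name is illustrative; 8-bit RGB input is assumed):

```python
import numpy as np

def color_histogram(img_rgb_uint8):
    """64-bin RGB histogram (4 bins per channel), normalized by pixel count."""
    pixels = img_rgb_uint8.reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=4, range=[(0, 256)] * 3)
    return hist.ravel() / pixels.shape[0]  # invariant to image dimensions
```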

Detail: Higher levels of detail are generally desirable in a photograph, particularly for its subject. To approximate the amount of detail, we compare the number of edge pixels in a Gaussian-filtered version of the image tile to the number of edge pixels within the original image tile, i.e.

f_d = \frac{\sum_{(x,y) \in \mathrm{Tile}} \mathbf{1}\{ I_{\mathrm{filtered}}(x,y) = \mathrm{Edge} \}}{\sum_{(x,y) \in \mathrm{Tile}} \mathbf{1}\{ I(x,y) = \mathrm{Edge} \}}

For an image tile that is exceptionally detailed, many of the higher-frequency edges in the region would be removed by the Gaussian filter. Consequently, we would expect f_d to be closer to 0. Conversely, for a tile that lacks detail, since few edges exist in the region, applying the Gaussian filter would impart little change to the number of edges. In this case, we would expect f_d to be closer to 1.
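A sketch of f_d, reusing the edge binarization above (the Gaussian sigma below is an assumed value, not one specified by our pipeline):

```python
from skimage.color import rgb2gray
from skimage.filters import gaussian, sobel, threshold_otsu

def detail_feature(tile_rgb, sigma=2.0):
    """Ratio of edge pixels after Gaussian filtering to edge pixels before."""
    gray = rgb2gray(tile_rgb)

    def edge_count(img):
        grad = sobel(img)
        return (grad > threshold_otsu(grad)).sum()

    original = edge_count(gray)
    filtered = edge_count(gaussian(gray, sigma=sigma))
    return filtered / original if original else 1.0  # guard for edgeless tiles
```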

Hue: To the human eye, certain color combinations are more appealing than others. To capture this, for each image tile, we compute the proportion of pixels that correspond to a particular hue. We discretize hues into five regions, corresponding to red, yellow, green, blue, and purple.

Saturation: Saturation measures the intensity of a color. We extract the average saturation value for each image tile.
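A sketch covering both features (the hue-region boundaries below are assumptions for illustration; the exact cut points are a tuning choice):

```python
import numpy as np
from skimage.color import rgb2hsv

# Assumed boundaries on the [0, 1] hue circle; red wraps around both ends.
HUE_EDGES = [0.0, 0.11, 0.2, 0.45, 0.7, 0.9, 1.0]

def hue_and_saturation_features(tile_rgb):
    """Five hue-region proportions (red, yellow, green, blue, purple)
    plus the tile's mean saturation."""
    hsv = rgb2hsv(tile_rgb)
    counts, _ = np.histogram(hsv[..., 0].ravel(), bins=HUE_EDGES)
    props = counts / hsv[..., 0].size
    red = props[0] + props[-1]  # merge the two wrapped red segments
    return np.concatenate(([red], props[1:-1])), hsv[..., 1].mean()
```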


Figure 2. Block diagram of learning pipeline.

Contrast: Contrast is the difference in color or brightness amongst regions in an image. Generally, the higher the contrast, the more distinguishable objects are from one another. We approximate the contrast within each image tile by calculating the standard deviation of the grayscale intensities.
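A minimal sketch of this measurement (function name illustrative):

```python
import numpy as np
from skimage.color import rgb2gray

def contrast_feature(tile_rgb):
    """Standard deviation of grayscale intensities within the tile."""
    return float(np.std(rgb2gray(tile_rgb)))
```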

Blur: Depending on the image region, blurriness may or may not be desirable. Poor technique or camera shake tends to yield images that are blurry across the entire frame, which is generally undesirable. On the other hand, low depth-of-field images with blurred out-of-focus highlights ("bokeh") that complement sharp subjects are often regarded as pleasing.

To efficiently estimate the amount of blur within an image, we calculate the variance of the Laplacian of the image. Low variance corresponds to blurrier images, and high variance to sharper images.
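A sketch of the blur estimate using scikit-image's Laplacian filter:

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.filters import laplace

def blur_feature(img_rgb):
    """Variance of the Laplacian; lower values indicate blurrier images."""
    return float(np.var(laplace(rgb2gray(img_rgb))))
```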

Noise: The desirability of visual noise is contextual. For most modern images and for images that convey positive emotions, noise is generally undesirable. For images that convey negative semantics, however, noise may be desirable to accentuate their visual impact. We measure noise by calculating the image's entropy.
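A sketch of the entropy measurement (scikit-image's Shannon entropy is one concrete choice of estimator):

```python
from skimage.color import rgb2gray
from skimage.measure import shannon_entropy

def noise_feature(img_rgb):
    """Shannon entropy of the grayscale image as a proxy for noise."""
    return float(shannon_entropy(rgb2gray(img_rgb)))
```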

Saliency: The saliency of the subject within a photograph has a significant impact on the photograph's perceived aesthetic quality. We post-process each image to separate the salient region from the background using a center-vs-surround approach described in Achanta et al. [5]. We then sum the number of salient pixels per image tile and normalize by the tile size.
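A sketch of the frequency-tuned saliency map of Achanta et al. [5], which scores each pixel by the Lab-space distance between the image's mean color and a Gaussian-blurred version of the image (the blur sigma and the per-tile threshold below are assumptions; channel_axis requires scikit-image 0.19 or newer):

```python
import numpy as np
from skimage.color import rgb2lab
from skimage.filters import gaussian

def saliency_map(img_rgb):
    """Per-pixel saliency: ||mean Lab color - blurred Lab image||."""
    lab = rgb2lab(img_rgb)
    blurred = gaussian(lab, sigma=3, channel_axis=-1)
    mean_color = lab.reshape(-1, 3).mean(axis=0)
    return np.linalg.norm(blurred - mean_color, axis=-1)

def tile_saliency_feature(saliency_tile):
    """Fraction of tile pixels whose saliency exceeds twice the tile mean."""
    return float((saliency_tile > 2 * saliency_tile.mean()).mean())
```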

5. Methods

Figure 2 depicts a high-level block diagram of the learning pipeline we built. The pipeline comprises three main components: an image scraper, a bank of feature extractors, and a learning algorithm.

For each of the features we describe in Section 4, there exists a feature extractor function that accepts an image as input, calculates the feature value, and inserts the feature-value mapping into a sparse feature vector allocated for the image. We rely on image processing algorithms implemented in the scikit-image library for many of these functions.

After the pipeline generates feature vectors for all images in the training set, it uses them to train a classifier. For the learning algorithm, we experimented with support vector machines (SVMs), random forests, and gradient tree boosting.
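A sketch of this training-and-evaluation step with scikit-learn (the synthetic data below stands in for the real tile-feature matrix; the hyperparameter values match those reported in Section 6):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Placeholder data: in the real pipeline, X holds one feature vector per image
# (25 tiles x per-tile features) and y holds the binary aesthetics labels.
X, y = make_classification(n_samples=200, n_features=900, random_state=0)

models = {
    "SVM": SVC(kernel="rbf", C=1, gamma=0.1),
    "RF": RandomForestClassifier(n_estimators=300),
    "GBRT": GradientBoostingClassifier(n_estimators=200, subsample=0.9),
}
for name, clf in models.items():
    scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross-validation
    print(f"{name}: {scores.mean():.2%}")
```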

SVM: The SVM learning algorithm with \ell_1 regularization involves solving the primal optimization problem

\min_{\gamma, w, b} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i

\text{subject to } y^{(i)} (w^T x^{(i)} + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \dots, m,

the dual of which is

\max_{\alpha} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle

\text{subject to } 0 \leq \alpha_i \leq C, \; i = 1, \dots, m, \quad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0

Accordingly, provided that we find the values of \alpha that maximize the dual optimization problem, the hypothesis can be formulated as

h(x) = \begin{cases} 1 & \text{if } \sum_{i=1}^{m} \alpha_i y^{(i)} \langle x^{(i)}, x \rangle + b \geq 0 \\ -1 & \text{otherwise} \end{cases}

Note that since the dual optimization problem and hypothesis can be expressed in terms of inner products between input feature vectors, we can replace each inner product with a kernel applied to the two input vectors, which allows us to train our classifier and perform classification in a higher-dimensional feature space. This characteristic of SVMs makes them well suited to our problem, since we speculate that non-linear relationships amongst multiple features influence image aesthetic quality. For our system, we choose the Gaussian kernel K(x, y) = \exp(-\gamma \|x - y\|_2^2), which corresponds to an infinite-dimensional feature mapping.
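For reference, the kernel itself is a one-liner (gamma = 0.1 is the value reported in Section 6):

```python
import numpy as np

def gaussian_kernel(x, z, gamma=0.1):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    return float(np.exp(-gamma * np.sum((x - z) ** 2)))
```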

Random forest: Random forests comprise collections of decision trees. Each decision tree is grown by selecting a random subset of input variables to serve as candidates for splitting at a particular node. Prediction then involves taking the sign of the average of the predictions of all the constituent trees:

h(x) = \mathrm{sign}\left( \frac{1}{m} \sum_{i=1}^{m} T_i(x) \right)

Because of the way each decision tree is constructed, the variance of the averaged prediction is less than that of any individual tree's prediction. It is this characteristic that makes random forests more resistant to overfitting than individual decision trees and thus generally higher performing.

Gradient tree boosting: Boosting is a powerful learning method that sequentially applies weak classification algorithms to reweighted versions of the training data, with the reweighting done such that the examples misclassified by each classifier are weighted more heavily for the next classifier in the sequence. In this manner, each subsequent classifier in the ensemble is forced to concentrate on correctly classifying the examples that its predecessors misclassified.

In gradient tree boosting, or gradient-boosted regression trees (GBRT), the weak classifiers are decision trees. After fitting the trees, the predictions from all the decision trees are weighted and combined to form the final prediction:

h(x) = \mathrm{sign}\left( \sum_{i=1}^{m} \alpha_i T_i(x) \right)

In the literature, tree boosting has been identified as one of the best learning algorithms available [6].

6. Experimental results and analysis

For each learning algorithm, we measure the performance of our classifier using 10-fold cross-validation on the photo.net dataset and the DPChallenge dataset. We run backward feature selection to eliminate ineffective features and improve classification performance.

For the SVM, we tuned our parameters using grid search, which ultimately led us to use C = 1 and γ = 0.1. For the random forest, we used 300 decision trees; we determined this value by empirically finding the asymptotic limit of the generalization error with respect to the number of decision trees used. For gradient tree boosting, we used 200 decision trees and a subsampling coefficient of 0.9. Using a subsampling coefficient smaller than 1 allows us to trade a small increase in bias for a reduction in variance, thereby mitigating overfitting and improving generalization performance.
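A sketch of the grid search with scikit-learn (the candidate grids below are assumptions; only the selected values C = 1 and gamma = 0.1 come from our experiments):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder features, as in the training sketch above.
X, y = make_classification(n_samples=200, n_features=900, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
search.fit(X, y)
print(search.best_params_)  # e.g. {"C": 1, "gamma": 0.1}
```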

              SVM      RF       GBRT
photo.net     78.71%   78.58%   80.88%
DPChallenge   82.62%   82.85%   83.60%

Table 1. 10-fold cross-validation accuracy.

                      Predicted label
                      1            0
Actual label    1     TP 80.12%    FN 19.88%
                0     FP 18.35%    TN 81.65%

Figure 3. Confusion matrix for 10-fold cross-validation with GBRT on the photo.net dataset.

Table 1 shows our 10-fold cross-validation accuracy for each of the learning algorithms. For both datasets, we obtained the highest performance with GBRTs, with accuracies of 80.88% and 83.60%, respectively. That we see results of similar quality on both datasets suggests that our methodology is sound.

Figure 3 shows the confusion matrix for 10-fold cross-validation using GBRTs on the photo.net dataset. The true positive and false negative rates are approximately symmetric with the true negative and false positive rates, respectively, which signifies that our classifier is not biased towards predicting a certain class. This also holds true for the DPChallenge dataset.

To analyze the deficiencies of our methodology, we examine images that our classifier misclassified.

Figure 4 shows an example of a negative image from the photo.net dataset that the classifier mispredicted as positive. Note that the image is compositionally sound: the subject is clearly distinguishable from the background, fills most of the frame, is well balanced in the frame, and has components that lie along the rule-of-thirds axes. The hot-pink background, however, is incredibly jarring, and the subject matter is mundane and lacks significance. Unfortunately, because it discretizes color features so coarsely, the classifier is likely unable to effectively differentiate between shades of a color, such as the artificial pink of this image's background and the warm red of a beautiful sunset. Moreover, it has no way of gleaning meaning from images. We therefore believe that it is primarily due to these shortcomings that our classifier misclassified this particular image.


Figure 4. Negative image classified as positive by the model.

Figure 5. Positive image classified as negative by the model.

Figure 5 exhibits a photograph from the DPChallenge dataset for which our classifier predicts a false negative. While the photograph follows good composition techniques, the subject has few high-frequency edges, and most of the tiles are considered blurry. Our blur feature carries heavy weight in predictions for the DPChallenge dataset. Furthermore, the current method of detecting the salient region is not consistently reliable, so despite this photograph's having a distinct salient region, the classifier may deemphasize the contributions of this feature. We believe that improving our salient-region detection accuracy across all images may enable the classifier to utilize the saliency feature more effectively and thus correctly classify this photograph.

Another image our classifier mispredicts as negative is shown in Figure 6. The key visual element of this image is the strong leading lines that draw attention to the hiker, the subject of the image. Leading lines, however, are global features that are not well captured by our tiling methodology and thus are not considered by the classifier.

In sum, although our classifier performs respectably well, examining the images it mispredicts reveals many potential areas of improvement.

Figure 6. Positive image classified as negative by the model.

7. Future work and conclusions

We have demonstrated that modeling an image as a set of tiles, extracting certain visual features from each tile, and training a learning algorithm to infer relationships between tiles yields a high-performing system that adapts well to different datasets. Thus, our methodology lays a sound foundation for future development. In particular, we believe we can further improve the accuracy of our system by deriving global visual features and parsing semantics from photographs. Our model should also apply to regression for use cases where numerical ratings are desired. Finally, augmenting the system with the ability to choose a classifier depending on the identified mode of a photograph, e.g. portrait or landscape, may lead to more accurate classification of aesthetic quality.


References

[1] D. Pogacnik, R. Ravnik, N. Bovcon, and F. Solina. Evaluating Photo Aesthetics Using Machine Learning. University of Ljubljana.

[2] R. Datta, D. Joshi, J. Li, and J. Z. Wang. Studying Aesthetics in Photographic Images Using a Computational Approach.

[3] Y. Ke, X. Tang, and F. Jing. The Design of High-Level Features for Photo Quality Assessment. School of Computer Science, Carnegie Mellon University; Microsoft Research Asia.

[4] N. Murray, L. Marchesotti, and F. Perronnin. AVA: A Large-Scale Database for Aesthetic Visual Analysis. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition.

[5] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk. Frequency-Tuned Salient Region Detection. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition.

[6] T. Hastie, R. Tibshirani, and J. Friedman (2001). The Elements of Statistical Learning. New York, NY, USA: Springer.