
Classification of photographic images based on perceived aesthetic quality

Jeff Hwang jhwang89@stanford.edu

Department of Electrical Engineering, Stanford University

Sean Shi sshi11@stanford.edu

Department of Electrical Engineering, Stanford University

Abstract

In this paper, we explore automated aesthetic evaluation of photographs using machine learning and image processing techniques. We theorize that the spatial distribution of certain visual elements within an image correlates with its aesthetic quality. To this end, we present a novel approach wherein we model each photograph as a set of image tiles, extract visual features from each tile, and train a classifier on the resulting features along with the images' aesthetics ratings. Our model achieves a 10-fold cross-validation classification success rate of 85.03%, corroborating the efficacy of our methodology and therefore showing promise for future development.

1. Introduction

Aesthetics in photography are highly subjective. The average individual may judge the quality of a photograph simply by gut feeling; in contrast, a photographer might evaluate a photograph he or she captures vis-a-vis technical criteria such as composition, contrast, and sharpness. Towards fulfilling these criteria, photographers follow many rules of thumb. The actual and relative visual impact of doing so for the general public, however, remains unclear.

In our project, we show that the existence, arrangement, and combination of certain visual characteristics do indeed make an image more aesthetically pleasing in general. To understand why this may be true, consider the two images shown in Figure 1. Although both are of flowers, the left photograph has a significantly higher aesthetic rating than the right photograph. One might reason that this is so simply because the right photograph is blurry.

Figure 1. A photograph with a high aesthetic rating (left) and a photograph with a low aesthetic rating (right).

This conjecture, however, is imprecise because the left photograph is, on average, blurry as well. In fact, the majority of the image is even blurrier than the right photograph. Here, it seems that the juxtaposition of sharp salient regions with blurry regions and their locations in the frame positively influence the perceived aesthetic quality of the photograph.

Accordingly, we begin by identifying generic visual features that we believe may affect the aesthetic quality of a photograph, such as blur and hue. We then build a learning pipeline that extracts these features from images on a per-image-tile basis and uses them along with the images' aesthetics ratings to train a classifier. In this manner, we endow the classifier with the ability to infer spatial relationships amongst features that correlate with an image's aesthetics.

The potential impact of building a system to solve this problem is broad. For example, by implementing such a system, websites with community-sourced images can programmatically filter out bad images to maintain the desired quality of content. Cameras can provide real-time visual feedback to help users improve their photographic skills. Moreover, from a cognitive standpoint, solving this problem may lend interesting insight towards how humans perceive beauty.


2. Related work

There have been several efforts to tackle this problem from different angles within the past decade. Pogacnik et al. [1] believed that the features depended heavily on identification of the subject of the photograph. Datta et al. [2] evaluated the performance of different machine learning models (support vector machines, decision trees) on the problem. Ke et al. [3] focused on extracting perceptual factors important to professional photographers, such as color, noise, blur, and spatial distribution of edges.

It is also interesting to note that, in contrast to our approach, these studies have focused on extracting global features that attempt to capture prior beliefs about the spatiality of visual elements within high-quality images. For example, Datta et al. attempted to model rule-of-thirds composition by computing the average hue, saturation, and luminance of the inner thirds rectangle of the image, and Pogacnik et al. defined features that assessed adherence to a multitude of compositional rules as well as the positioning of the subject relative to the image's frame.

3. Dataset

Our learning pipeline downloads images and their average aesthetic ratings from two separate datasets.

The first is an image database hosted by photo.net, a photo sharing website for photographers. The index file we use to locate images was generated by Datta et al. Members of photo.net can upload and critique each other's photographs and rate each photograph with a number between 1 and 7, with 7 being the best possible rating. Due to the wide range and subjectivity of ratings, we choose to use only photographs with ratings above 6 or below 4.2, which yields a dataset containing 1700 images split evenly between positive labels and negative labels.

The second comprises images scraped from DPChallenge, another photo sharing website that allows members to rate community-uploaded images on a scale of 1 to 10. The index file we used to locate images was generated by Murray et al. [4]. Following guidelines from prior work, we choose to use photographs with ratings above 7.2 or below 3.4, resulting in a dataset containing 3000 images split evenly between positive and negative labels.

4. Feature extraction

Prior to extracting features, we partition each image into equally-sized tiles (Figure 2).

Figure 2. Tiling scheme applied to image by learning pipeline.

By extracting features on a per-tile basis, the learning algorithm can identify regions of interest and infer relationships between feature-tile pairs that indicate aesthetic quality. For example, in the case of the image depicted in Figure 2, we surmise that the learning algorithm would be able to discern the well-composed framing of the pier from the features extracted from its containing tiles with respect to those extracted from the surrounding tiles.
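As a concrete illustration, the tiling step could look like the following sketch in Python. The helper name and the handling of remainder pixels are our own assumptions; the paper only specifies a square grid, with five-by-five performing best (Section 6).

```python
import numpy as np

def tile_image(image, grid=5):
    """Split an H x W x C image array into a grid x grid list of tiles.

    Tiles along the bottom and right edges absorb any remainder pixels
    so that the whole frame is covered.
    """
    h, w = image.shape[:2]
    rows = np.linspace(0, h, grid + 1, dtype=int)
    cols = np.linspace(0, w, grid + 1, dtype=int)
    return [image[rows[i]:rows[i + 1], cols[j]:cols[j + 1]]
            for i in range(grid) for j in range(grid)]
```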

Below, we describe the features we extract from each image tile.

Subject detection: Strong edges distinguish the image's subject from its background. To quantify the degree of subject-background separation, we apply a Sobel filter to each image tile, binarize the result via Otsu's method, and compute the proportion of pixels in the tile that are edge pixels:

$$f_{sd} = \frac{\sum_{(x,y)\in\text{Tile}} 1\{I(x,y) = \text{Thresholded edge}\}}{\left|\{(x,y) \mid (x,y)\in\text{Tile}\}\right|}$$
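A minimal sketch of this feature using scikit-image follows; the helper name and the guard for flat tiles are our own, and the paper does not specify the exact edge-magnitude implementation.

```python
from skimage.color import rgb2gray
from skimage.filters import sobel, threshold_otsu

def subject_detection_feature(tile):
    """f_sd: fraction of a tile's pixels that survive Otsu thresholding of a Sobel edge map."""
    gray = rgb2gray(tile)                       # RGB tile -> grayscale in [0, 1]
    edges = sobel(gray)                         # Sobel gradient magnitude
    if edges.max() == edges.min():              # flat tile: no edges to threshold
        return 0.0
    edge_mask = edges > threshold_otsu(edges)   # binarize via Otsu's method
    return float(edge_mask.mean())              # proportion of edge pixels
```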

Color: A photograph's color composition can dramatically influence how a person perceives a photograph. We capture the color diversity within an image tile using a color histogram that subdivides the three-dimensional RGB color space into 64 equally sized bins. Since each pixel can take on one of 256 discrete values in each color channel, each bin is a cube spanning 64 values in each dimension. We normalize each bin's count by the total pixel count so that it is invariant to image dimensions.

We also measure the average saturation and luminance of each tile's pixels. Finally, for the entire image, we compute the proportion of pixels that correspond to a particular hue (red, yellow, green, blue, and purple).
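For instance, the per-tile color histogram could be computed as below, assuming 8-bit RGB tiles and four bins per channel (our reading of the 64-bin description); the function name is hypothetical.

```python
import numpy as np

def color_histogram_feature(tile):
    """Normalized 64-bin RGB histogram (4 x 4 x 4 cubes over the 8-bit color space)."""
    pixels = tile.reshape(-1, 3).astype(float)
    hist, _ = np.histogramdd(pixels, bins=(4, 4, 4), range=((0, 256),) * 3)
    return hist.ravel() / pixels.shape[0]   # normalize by the tile's pixel count
```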

Detail: Higher levels of detail are generally desirable for photographs, particularly for the subject. To approximate the amount of detail, we compare the number of edge pixels in a Gaussian-filtered version of the image tile to the number of edge pixels in the original image tile, i.e.

$$f_d = \frac{\sum_{(x,y)\in\text{Tile}} 1\{I_{\text{filtered}}(x,y) = \text{Edge}\}}{\sum_{(x,y)\in\text{Tile}} 1\{I(x,y) = \text{Edge}\}}$$

For an image tile that is exceptionally detailed, many of the higher-frequency edges in the region would be removed by the Gaussian filter. Consequently, we would expect $f_d$ to be closer to 0. Conversely, for a tile that lacks detail, since few edges exist in the region, applying the Gaussian filter would impart little change to the number of edges. In this case, we would expect $f_d$ to be closer to 1.
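A sketch of this ratio is shown below; the choice of the Canny edge detector and the Gaussian sigma are our own assumptions, since the paper does not name the edge detector used for this feature.

```python
from skimage.color import rgb2gray
from skimage.feature import canny
from skimage.filters import gaussian

def detail_feature(tile, sigma=2.0):
    """f_d: edge pixels remaining after Gaussian blurring, relative to the original tile."""
    gray = rgb2gray(tile)
    original_edges = canny(gray).sum()
    blurred_edges = canny(gaussian(gray, sigma=sigma)).sum()
    return blurred_edges / max(original_edges, 1)   # guard against tiles with no edges
```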

Contrast: Contrast is the difference in color or brightness amongst regions in an image. Generally, the higher the contrast, the more distinguishable objects are from one another. We approximate the contrast within each image tile by calculating the standard deviation of the grayscale intensities.

Blur: Depending on the image region, blurriness may or may not be desirable. Poor technique or camera shake tends to yield images that are blurry across the entire frame, which is generally undesirable. On the other hand, low depth-of-field images with blurred out-of-focus highlights ("bokeh") that complement sharp subjects are often regarded as being pleasing.

To efficiently estimate the amount of blur within an image, we calculate the variance of the Laplacian of the image. Low variance corresponds to blurrier images, and high variance to sharper images.
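Both the contrast and blur measures are one-liners with NumPy and OpenCV; the function names below are our own, and grayscale inputs are assumed.

```python
import cv2
import numpy as np

def contrast_feature(gray_tile):
    """Contrast proxy: standard deviation of a tile's grayscale intensities."""
    return float(np.std(gray_tile))

def blur_feature(gray_image):
    """Blur proxy: variance of the Laplacian; lower values indicate blurrier images."""
    return float(cv2.Laplacian(gray_image, cv2.CV_64F).var())
```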

Noise: The desirability of visual noise is contextual. For most modern images and for images that convey positive emotions, noise is generally undesirable. For images that convey negative semantics, however, noise may be desirable to accentuate their visual impact. We measure noise by calculating the image's entropy.
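One way to compute this with scikit-image is shown below; the paper does not specify the exact entropy formulation, so Shannon entropy of the grayscale image is an assumption.

```python
from skimage.color import rgb2gray
from skimage.measure import shannon_entropy

def noise_feature(image):
    """Noise proxy: Shannon entropy of the grayscale image."""
    return shannon_entropy(rgb2gray(image))
```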

Saliency: The saliency of the subject within a photograph can have a significant impact on the perceived aesthetic quality of the photograph. We post-process each image to separate the salient region from the background using a center-vs-surround approach described in Achanta et al. [5]. We then sum the number of salient pixels per image tile and normalize by the tile size.
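A rough sketch of the frequency-tuned approach and the per-tile feature follows; the blur sigma and the threshold choice are our assumptions (Achanta et al. [5] describe the method in full), and `channel_axis` requires a recent scikit-image.

```python
import numpy as np
from skimage.color import rgb2lab
from skimage.filters import gaussian

def saliency_map(image):
    """Frequency-tuned saliency: Lab-space distance of each (slightly blurred) pixel
    from the image's mean Lab color, after Achanta et al. [5]."""
    blurred = gaussian(image, sigma=1, channel_axis=-1)   # mild blur to suppress noise
    lab = rgb2lab(blurred)
    mean_lab = lab.reshape(-1, 3).mean(axis=0)
    return np.linalg.norm(lab - mean_lab, axis=-1)

def saliency_feature(saliency_tile, threshold):
    """Fraction of a tile's pixels whose saliency exceeds a threshold
    (e.g. twice the image-wide mean saliency; the exact rule is an assumption)."""
    return float((saliency_tile > threshold).mean())
```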

5. Methods

Figure 3 depicts a high-level block diagram of the learning pipeline we built. The pipeline comprises three main components: an image scraper, a bank of feature extractors, and a learning algorithm.

Figure 3. Block diagram of learning pipeline.

For each of the features we identified, there exists a feature extractor function that accepts an image as an input, calculates the feature value, and inserts the feature-value mapping into a sparse feature vector allocated for the image. We rely on image processing algorithms implemented in the scikit-image and opencv libraries for many of these functions [6, 7].
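Schematically, the extraction stage might be wired together as below, reusing the hypothetical per-tile helpers sketched in Section 4; the extractor list and the dense output are simplifications of the paper's sparse-vector design.

```python
import numpy as np

# Hypothetical per-tile extractors from the sketches in Section 4;
# each returns a scalar or a 1-D array.
TILE_EXTRACTORS = [subject_detection_feature, color_histogram_feature, detail_feature]

def extract_feature_vector(image, grid=5):
    """Concatenate every per-tile feature into a single vector for the classifier."""
    values = []
    for tile in tile_image(image, grid=grid):
        for extractor in TILE_EXTRACTORS:
            values.append(np.atleast_1d(extractor(tile)))
    return np.concatenate(values)
```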

After the pipeline generates feature vectors for all images in the training set, it uses them to train a classifier. For the learning algorithm, we experimented with scikit-learn's implementations of support vector machines (SVM), random forests (RF), and gradient tree boosting (GBRT) [8]. We focus our attention on these algorithms because they can account for non-linear relationships amongst features.

SVM: The SVM learning algorithm with $\ell_1$ regularization involves solving the primal optimization problem

$$\min_{\gamma, w, b} \;\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i$$

$$\text{subject to} \;\; y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; i = 1, \ldots, m$$

the dual of which is

$$\max_{\alpha} \;\; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle$$

$$\text{subject to} \;\; 0 \le \alpha_i \le C, \;\; i = 1, \ldots, m, \qquad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$$

Accordingly, provided that we find the values of $\alpha$ that maximize the dual optimization problem, the hypothesis can be formulated as

$$h(x) = \begin{cases} 1 & \text{if } \sum_{i=1}^{m} \alpha_i y^{(i)} \langle x^{(i)}, x \rangle + b \ge 0 \\ -1 & \text{otherwise} \end{cases}$$

Note that since the dual optimization problem and hypothesis can be expressed in terms of inner products between input feature vectors, we can replace each inner product with a kernel applied to the two input vectors, which allows us to train our classifier and perform classification in a higher-dimensional feature space. This characteristic of SVMs makes them well-suited to our problem, since we speculate that non-linear relationships amongst features influence image aesthetic quality. For our system, we use the Gaussian kernel $K(x, y) = \exp\left(-\gamma \|x - y\|_2^2\right)$, which corresponds to an infinite-dimensional feature mapping.
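With scikit-learn, the corresponding classifier is a short sketch like the one below. The synthetic X and y stand in for the extracted feature matrix and aesthetic labels, and the feature-scaling step is our own addition.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))       # stand-in for the extracted feature matrix
y = np.sign(rng.normal(size=200))    # stand-in for +/-1 aesthetic labels

# RBF (Gaussian) kernel with the hyperparameters reported in Section 6
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma=0.1))
scores = cross_val_score(svm, X, y, cv=10)   # 10-fold cross-validation accuracy
print(scores.mean())
```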

Random forest: Random forests comprise collections of decision trees. Each decision tree is grown by selecting a random subset of input variables to use for splitting at a particular node. Prediction then involves averaging the predictions of all the constituent trees and taking the sign:

$$h(x) = \operatorname{sign}\!\left(\frac{1}{m} \sum_{i=1}^{m} T_i(x)\right)$$

Because of the way each decision tree is constructed, the variance of the average prediction is less than that of any individual tree's prediction. It is this characteristic that makes random forests more resistant to overfitting than individual decision trees and, thus, generally higher-performing.
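The corresponding scikit-learn call, using the 300 trees tuned in Section 6 (X and y as in the SVM sketch above), might look like this:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rf = RandomForestClassifier(n_estimators=300)   # 300 trees, per Section 6
rf_scores = cross_val_score(rf, X, y, cv=10)    # 10-fold cross-validation accuracy
```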

Gradient tree boosting: Boosting is a powerful learning method that sequentially applies weak classification algorithms to reweighted versions of the training data, with the reweighting done in such a way that, between every pair of classifiers in the sequence, the examples that were misclassified by the previous classifier are weighted higher for the next classifier. In this manner, each subsequent classifier in the ensemble is forced to concentrate on correctly classifying the examples that were previously misclassified.

In gradient tree boosting, or gradient-boosted regression trees (GBRT), the weak classifiers are decision trees. After fitting the trees, the predictions from all the decision trees are weighted and combined to form the final prediction:

$$h(x) = \operatorname{sign}\!\left(\sum_{i=1}^{m} \alpha_i T_i(x)\right)$$

In the literature, tree boosting has been identified as one of the best learning algorithms available [9].
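In scikit-learn this corresponds to GradientBoostingClassifier, with the hyperparameters reported in Section 6 (X and y as in the earlier sketches):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

gbrt = GradientBoostingClassifier(n_estimators=500, subsample=0.9)  # per Section 6
gbrt_scores = cross_val_score(gbrt, X, y, cv=10)                    # 10-fold accuracy
```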

6. Experimental results and analysis

For each learning algorithm, we measure the performance of our classifier using 10-fold cross-validation on the photo.net dataset and the DPChallenge dataset. We run backward feature selection to eliminate ineffective features and improve classification performance. For the final set of features, we remove each feature from the set in turn and run cross-validation on the resulting set to verify that the feature contributes positively to the classifier's performance.

Figure 4. Classifier 10-fold cross-validation accuracy versus tiling dimension (SVM with C = 1 and γ = 0.1, DPChallenge dataset).

In experimenting with different tiling dimensions, we found that dividing each image into five-by-five tiles gave the best performance (Figure 4). A tiling dimension of 1 corresponds to extracting each feature across the entire frame of the image. As anticipated, this yields significantly worse performance. Larger tiling dimensions should theoretically work better provided that we have enough data to support the associated increase in the number of features. Unfortunately, given the limited sizes of our datasets, the addition of more features causes our classifier to overfit and thus degrades its accuracy for dimensions larger than 5.

For SVM, we tuned our parameters using grid search, which ultimately led us to use C = 1 and γ = 0.1. For random forest, we used 300 decision trees. We determined this value by empirically finding the asymptotic limit of the generalization error with respect to the number of decision trees used. For gradient tree boosting, we used 500 decision trees and a subsampling coefficient of 0.9. Using a subsampling coefficient smaller than 1 trades a small increase in bias for a reduction in variance, which mitigates overfitting and hence improves generalization performance.
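The SVM grid search could look like the following sketch; the parameter grid shown is illustrative, since the paper only reports the selected values C = 1 and γ = 0.1.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}  # illustrative grid
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
search.fit(X, y)              # X, y as in the earlier sketches
print(search.best_params_)    # e.g. {'C': 1, 'gamma': 0.1}
```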

Table 1 shows our 10-fold cross-validation accuracy for each learning algorithm. For both datasets, we obtained the highest performance with GBRT, with accuracies of 80.88% and 85.03%. The difference in performance may have resulted from the DPChallenge dataset's having higher-resolution images than the photo.net dataset, which makes certain visual features more distinct in the former than in the latter. Nonetheless, the similarity in results suggests that our methodology generalizes well to different datasets.

Figure 5 shows the confusion matrix for 10-fold cross-validation using GBRT on the DPChallenge dataset.


              SVM       RF        GBRT
photo.net     78.71%    78.58%    80.88%
DPChallenge   84.00%    83.15%    85.03%

Table 1. 10-fold cross-validation accuracy.

                      Predicted label
                      1              0
Actual     1     TP 85.80%      FN 14.20%
label      0     FP 15.73%      TN 84.27%

Figure 5. Confusion matrix for 10-fold cross-validation with GBRT on the DPChallenge dataset.

The true positive and false negative rates are approximately symmetric with the true negative and false positive rates, respectively, which signifies that our classifier is not biased towards predicting a certain class. This also holds true for the photo.net dataset.

To analyze the shortcomings of our approach, we examine images that our classifier misclassified.

Figure 6 shows an example of a negative image from the photo.net dataset that the classifier mispredicted as being positive. Note that the image is compositionally sound – the subject is clearly distinguishable from the background, fills most of the frame, is well-balanced in the frame, and has components that lie along the rule-of-thirds axes. The hot-pink background, however, is jarring, and the subject matter is mundane and lacks significance. Unfortunately, because it discretizes color features so coarsely, the classifier is likely not able to effectively differentiate between the artificial pink shade of the image's background and the warm red shade of a sunset, for instance. Moreover, it has no way of gleaning meaning from images. We therefore believe that it is primarily due to these shortcomings that our classifier misclassified this particular image.

Figure 6. Negative image classified as positive by the model.

Figure 7. Positive images classified as negative by the model.

Figure 7 shows two photographs from the DPChallenge dataset that our classifier misclassified as being negative. While the left photograph follows good composition techniques, the subject has few high-frequency edges, so the classifier would likely need to rely more on saliency detection to pinpoint the subject. Unfortunately, the current method of detecting the salient region is not consistently reliable, so despite this photograph's having a distinct salient region, the classifier may deemphasize the contributions of this feature. We believe that improving our salient region detection accuracy across all images may enable the classifier to utilize the saliency feature more effectively, and thus correctly classify this photograph. In the right image in Figure 7, the key visual element is the strong leading lines that draw attention to the hiker – the subject of the image. Leading lines, however, are global features that are not well-captured by our tiling methodology and, thus, are likely not considered by the classifier.

In sum, although our system performs respectably, examining the images it mispredicts reveals many potential areas of improvement.

7. Future work and conclusions

We have demonstrated that modeling an image as a set of tiles, extracting certain visual features from each tile, and training a learning algorithm to infer relationships between tiles yields a high-performing photographic aesthetics classification system that adapts well to different image datasets. Thus, our work lays a sound foundation for future development. In particular, we believe we can further improve the accuracy of our system by deriving global visual features and parsing semantics from photographs. Our model should also apply to regression for use cases where numerical ratings are desired. Finally, augmenting the system with the ability to choose a classifier depending on the identified mode of a photograph, e.g. portrait or landscape, may lead to more accurate classification of aesthetic quality.


References

[1] Pogacnik, D., Ravnik, R., Bovcon, N., & Solina, F. (2012). Evaluating photo aesthetics using machine learning.

[2] Datta, R., Joshi, D., Li, J., & Wang, J. Z. (2006). Studying aesthetics in photographic images using a computational approach. In Computer Vision–ECCV 2006 (pp. 288-301). Springer Berlin Heidelberg.

[3] Ke, Y., Tang, X., & Jing, F. (2006, June). The design of high-level features for photo quality assessment. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on (Vol. 1, pp. 419-426). IEEE.

[4] Murray, N., Marchesotti, L., & Perronnin, F. (2012, June). AVA: A large-scale database for aesthetic visual analysis. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on (pp. 2408-2415). IEEE.

[5] Achanta, R., Hemami, S., Estrada, F., & Susstrunk, S. (2009, June). Frequency-tuned salient region detection. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on (pp. 1597-1604). IEEE.

[6] van der Walt, S., Schonberger, J. L., Nunez-Iglesias, J., Boulogne, F., Warner, J. D., Yager, N., ... & Yu, T., the scikit-image contributors (2014). scikit-image: image processing in Python. PeerJ, 2(6).

[7] Bradski, G. (2000). The OpenCV library. Dr. Dobb's Journal, 25(11), 120-126.

[8] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12, 2825-2830.

[9] Friedman, J., Hastie, T., & Tibshirani, R. (2001). The Elements of Statistical Learning (Vol. 1). Springer, Berlin: Springer Series in Statistics.