Scene Detection in De Boer Historical Photo Collection

Melvin Wevers
Department of History, University of Amsterdam, Amsterdam, The Netherlands

Keywords: Digital History, Computer Vision, Scene Detection, Digital Heritage.

Abstract: This paper demonstrates how transfer learning can be used to improve scene detection applied to a historical press photo collection. After applying transfer learning to a pre-trained Places-365 ResNet-50 model, we achieve a top-1 accuracy of .68 and a top-5 accuracy of .89 on our data set, which consists of 132 categories. In addition to describing our annotation and training strategy, we also reflect on the use of transfer learning and the evaluation of computer vision models for heritage institutes.

1 INTRODUCTION

Computer vision algorithms have become capable of locating and identifying objects represented in images with high accuracy. While this technology is commonplace in self-driving cars, drone technology, and the analysis of social media posts, its use by heritage institutes has only recently been growing (Bhargav et al., 2019; Bell et al., 2018; Mager et al., 2020; Niebling et al., 2020). The possible benefits for heritage institutions range from automatically enriching collections to improving search capabilities and enabling large-scale analysis of visual collections. The latter is of particular interest to historians interested in visual representations of the past.

While computer vision technology has advanced rapidly in the last decade, most computer vision research focuses on use cases that require contemporary data as training material. Exceptions include art-historical research that relies on computer vision (Madhu et al., 2019; Bell and Impett, 2019; Offert, 2018). Training on contemporary material results in models that are not tuned to detect objects in heritage collections. In other words, the models have difficulty detecting past representations of objects, which often looked noticeably different. Moreover, the categories of objects in existing models are not always present in, or relevant to, heritage collections. In addition to objects changing, the visual medium itself has often changed, giving the contemporary visual medium a different materiality. Compare, for example, a grainy black-and-white image of a traffic situation in the 1950s to a high-resolution color image of a highway taken with a zoom lens in 2020.

In this instance, the cars will look different, but the improved technological capabilities of the camera also shaped the materiality of the picture. Models trained on millions of contemporary images do not have the sensitivity to deal with the color schemes and object representations in older images.

One approach to counter this blind spot in models trained on contemporary data is to fine-tune their performance by feeding them historical material. This method builds upon the categories present in the modern data sets. However, these categories do not always map onto the categories present in the collections of heritage institutes, or they do not align with the search interests of the users of such collections. Another approach is transfer learning, which adds new categories to existing models; the new categories are trained using historical material that has been annotated with them. Because the models have already been pre-trained on large collections of images, we often only require a small number of images to learn a new category. Of course, this depends on the visual complexity of the category and the diversity present in its training data. In other words, an object that always looks the same is easier to learn than one that shares visual aspects across examples but also differs considerably.

Existing research that applies computer vision methods to historical photographs looks into automatically dating images (Palermo et al., 2012), matching images taken from different viewpoints (Maiwald, 2019), photogrammetric analysis of images (Maiwald et al., 2017), and the classification of historical buildings (Llamas et al., 2017).

As part of this research, we set out to discover what types of computer vision tasks were seen as most relevant in the context of heritage institutions. For this purpose, we conducted fourteen interviews with digital humanities experts, visitors of historical archives, and heritage experts.1 We discussed several computer vision tasks, of which the respondents showed the most interest in object and scene detection, with a specific interest in tracing buildings. Other tasks, such as the estimation of group sizes, facial emotion detection, detection of logos, and posture estimation, were deemed less relevant by the respondents.

This paper focuses on one particular computer vision task: scene detection. This task tries to describe "the place in which the objects seat", rather than classifying the object depicted in an image or locating particular objects in a picture by drawing bounding boxes around them (Zhou et al., 2018). More specifically, this paper applies transfer learning to Places-365, a scene detection model trained on contemporary data. We reflect on creating training data, the training strategy, and the evaluation of the model. Finally, we discuss the benefits of this type of computer vision algorithm for heritage institutes and visual archives. Rather than merely detecting objects, the ability to search for images based on the scene represented in them is a useful feature for heritage institutions, especially since this information is often not included in the metadata. Using such information, historians could examine the representations of particular scenes, for example protests, at scale.

2 DATA

The data set for this study is the photo collection of press agency De Boer for the period 1945-2004.2 The De Boer collection focuses on national and regional events, although over time the collection's focus gradually shifted to the region of Kennemerland. This region, just north of Amsterdam, includes cities such as Haarlem, Zandvoort, Bloemendaal, and Alkmaar. The value of the collection was recognized locally, nationally, and internationally. In 1962, the founder of the press photo agency, Cees de Boer, was awarded the World Press Photo and the Silver Camera (Zilveren Camera). He won the World Press Photo for a picture showing Dutch singer Ria Kuyken being attacked by a young bear in a circus.

1Due to the Covid-19 pandemic, we had to conduct these interviews virtually. We originally intended to invite more people in person to the archive to conduct face-to-face interviews.

2This data has been kindly provided by the Noord-Hollands Archief.

Figure 1: Arrival of Martin Luther King Jr. at Schiphol Airport, August 15, 1964. Taken from the De Boer Collection.

Seven years later, Cees' son Poppe won the national photojournalism award. The photo collection depicts a wide range of events, ranging from the first and only show of The Beatles in the Netherlands, in Blokker (a small village just north of Amsterdam), and the arrival of Martin Luther King in 1964 (see Figure 1), to the openings of restaurants and sports events. The photo press agency was one of the main providers of pictures to the regional newspaper Haarlems Dagblad.

The photo collection consists of approximately two million negatives, accompanied by metadata for the period 1945-2004. The metadata is extracted from physical topic cards and logs kept by the photo agency. On approximately nine thousand topic cards covering over a thousand unique topics, the agency detailed what was depicted in the pictures. For the period 1952-1990, these logs have been transcribed by volunteers.3

The archive is currently in the process of digitizing the two million negatives and linking these to the already-transcribed metadata.4 At the moment of this pilot study, the archive had only digitized a selection of images. This pilot study explores whether transfer learning can be applied to the data and how we should categorize and label the data. During the larger digitization project, we will use the annotation strategy developed in this study to label a selection of images to further improve the scene detection model.

3https://velehanden.nl/projecten/bekijk/details/project/ranh

4https://velehanden.nl/projecten/bekijk/details/project/ranh tagselection deboer


This model will also be used to automatically label images, or to detect images that are more difficult to label and require human annotators. The enriched photo collection will be used to improve the search functionality of the archive. The labeled data and resulting model will be made publicly available to serve as a starting point for other institutions that want to apply computer vision to their historical photo collections.

For this pilot study, we relied on a subset of 2,545 images that had already been digitized by the Noord-Hollands Archief. Together with archivists and cultural historians, we constructed scene categories for these images.5 We used the categorization scheme of Places-365 as a starting point. As the name implies, this scheme contains 365 categories of scenes, ranging from 'alcove' to 'raft'.6 We combined the categories in Places-365 with categories from the catalog cards that De Boer used. These catalog cards could not be linked to the categorization scheme directly, because they often contain categories that are too specific or that are not represented visually; such categories can only be inferred from contextual information, not from the image itself. During the process, we found that the creation of categories is always a trade-off between specificity and generalization. In making these categories, we kept in mind whether the categories remained historically and visually consistent and whether they would be of use to users of the collection. At the same time, we had to make sure we were not too specific, which would limit the amount of training data per category.

During the annotation process, we noticed the difficulty of distinguishing between scenes characterized by a particular object and scenes defined by the location or the action performed in the image. For example, the images in Figure 4 contain specific objects, while the images in Figure 3 depict scenes. Still, the images in the latter also contain objects, such as flags, wreaths, cranes, and trees, that are not necessarily exclusive to a specific scene. This particular challenge is also discussed by developers of scene detection data sets. The developers of the SUN database, for example, describe how scenes are linked to particular functions and behaviors, which are closely tied to the visual features that structure the space. Large rooms or spacious areas allow for different types of behavior, opening up the possibility of different scene categories (Xiao et al., 2010).

5In the Appendix, we include the annotation guidelines, which will be used for the annotation of the full data set.

6An overview of the categories can be found here: http://places2.csail.mit.edu/explore.html

Figure 2: Distribution of categories (N > 50) in the training set.

Moreover, there are instances of large inter-class similarity between scenes, for instance between the categories 'library' and 'bookstore'. At the same time, we find large intra-class variations, in which the depicted scenes that belong to one category are actually quite diverse (Xie et al., 2020).

After annotation, we used an initial baseline model to correct incorrectly annotated images. During the annotation process of the larger dataset, we will iteratively evaluate the training categories and possibly add new categories and merge or remove existing ones. Our resulting dataset includes 115 unique categories, which can be found along with their descriptions in the Appendix. The distribution of the categories in our training set is heavily skewed and long-tailed (Figure 2). The categories 'soccer' and 'construction site', for instance, are very well represented, while most others appear much less frequently. For training, we removed categories with fewer than twenty images; this included categories such as 'houseboat', 'castle', and 'campsite'. However, we intend to include these categories again when we annotate the newly digitized photographs. Even though transfer learning can produce accurate results with few training examples, more training data, especially in categories with greater visual variation, is preferable.
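This frequency-based filtering step can be expressed in a few lines; a minimal sketch, assuming a hypothetical annotations.csv with one row per image and the columns filename and label:

```python
import pandas as pd

# Hypothetical annotation file with one row per image: columns 'filename', 'label'.
df = pd.read_csv("annotations.csv")

# Keep only categories with at least twenty annotated images.
counts = df["label"].value_counts()
keep = counts[counts >= 20].index
df_train = df[df["label"].isin(keep)].reset_index(drop=True)

print(f"kept {df_train['label'].nunique()} of {df['label'].nunique()} categories")
```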


3 SCENE DETECTION

The authors of the Places-365 data set describe a scene as an "environment(s) in the world, bounded by spaces where a human body would fit" (Zhou et al., 2018). Scene detection is aimed at detecting such an environment, or scene, in visual material. Humans can quickly recognize the functional and categorical properties of a scene while ignoring the objects and locations in it (Oliva and Torralba, 2001). Still, this does not mean that humans recognize scenes in an unequivocal manner; contextual factors aid humans in describing the scenes depicted in an image. Users of heritage collections rely on search to discover more than just representations of particular objects (Petras et al., 2017; Clough et al., 2017). Especially in the context of press photography, photographs often captured a particular historical event or setting. Press photos are highly diverse and more often than not contain groups of people or objects placed in a particular setting.

In the De Boer collection, we find depictions of scenes such as memorials, construction sites, or parades (see Figure 3). Even though these events are characterized by the presence of particular objects, in many cases they cannot be described by just a single object or person. At the same time, the collection also includes images that can accurately be described by a single object (see Figure 4). Scene detection, however, is also able to learn such representations.

We decided not to focus on object detection and the segmentation of possible objects, but rather on the image as a whole, using scene detection. The category schemes in existing object detection models were not directly applicable to the photographs in this collection, and using existing object detection models did not yield useful categories. For example, for a picture of a shopping street, object detection would identify people, bags, and perhaps a traffic light. To detect the objects represented in our photo collection, we would need to draw bounding boxes around these objects and annotate them, which would be a very time-consuming task. Nevertheless, in future work, we would like to explore the development of a framework that provides historical descriptions based on the relationships between different objects in an image.

3.1 Transfer Learning

For the adaptation of existing scene detection models to our data set, we turn to transfer learning (Rawat and Wang, 2017). This method refers to using "what has been learned in one setting [...] to improve generalization in another setting" (Goodfellow et al., 2016).

(a) Memorial.

(b) Parade.

(c) Construction Site.

Figure 3: Three photographs with scene labels from the De Boer collection.

In our case, we use what has been learned about scenes in the Places-365 models to learn how to detect the scenes in our training data using our categorization scheme. Rather than training a model from scratch, transfer learning is known for reaching good accuracy in a relatively short amount of time. Essentially, we build upon the information already stored in the Places model, which is based on millions of labeled images.7

As a starting point, we use a ResNet-50 model, a convolutional neural network of fifty layers, trained on the Places-365 dataset.8 This dataset consists of 1.8 million images from 365 scene categories.

7For code and data, see: https://github.com/melvinwevers/hisvis

8https://github.com/CSAILVision/places365


(a) Artwork.

(b) Castle.

(c) Train.

Figure 4: Three photographs characterized by a central object from the De Boer collection.

The Places-365 data set builds upon the SUN database, which consists of 899 categories with short descriptions.9
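As an illustration of this starting point, the sketch below loads the pre-trained weights into a torchvision ResNet-50, following the usage documented in the places365 repository (see footnote 8); the checkpoint filename is an assumption based on that repository's instructions.

```python
import torch
from torchvision.models import resnet50

# ResNet-50 with a 365-way output layer, matching the Places-365 checkpoint.
model = resnet50(num_classes=365)

# Checkpoint as distributed via the places365 repository (footnote 8); the
# filename is an assumption based on that repository's instructions.
checkpoint = torch.load("resnet50_places365.pth.tar", map_location="cpu")

# The weights were saved from a DataParallel model, so strip the 'module.' prefix.
state_dict = {k.replace("module.", ""): v
              for k, v in checkpoint["state_dict"].items()}
model.load_state_dict(state_dict)
model.eval()
```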

For our annotation guidelines, we used the SUN descriptions as a starting point. Given that our data set consists primarily of black-and-white images, it is worth noting that this type of image was excluded from the SUN data. Next, we created a random validation set containing twenty percent of the training images. We use this validation set to estimate the model's performance in terms of accuracy and its fit. In the context of a heritage institute, scene detection is a meaningful task.

9https://groups.csail.mit.edu/vision/SUN/

Applied to the Places-365 validation data set, the Places-365 ResNet-50 model reaches a 54.74% top-1 accuracy and an 85.08% top-5 accuracy. Top-1 accuracy refers to the percentage of images for which the predicted label with the highest probability matches the ground-truth label. Top-5 accuracy measures whether the ground-truth label is among the five highest-ranked predictions. Because of the ambiguity between scene categories, top-5 accuracy is generally seen as a more suitable criterion than top-1 accuracy (Zhou et al., 2018). The Places-365 model outperforms the widely used ImageNet-CNN on scene-related data sets, underscoring the benefit of using tailor-made models for scene detection over more generic models, such as ImageNet. Due to its performance on Places-365, we also opted for the ResNet-50 implementation. We have also experimented with simpler model architectures, but these were less able to capture the complexity of the images.
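Both metrics are straightforward to compute; a minimal sketch in PyTorch:

```python
import torch

def top_k_accuracy(logits: torch.Tensor, targets: torch.Tensor, k: int = 5) -> float:
    """Fraction of samples whose ground-truth label is among the k top-scoring classes."""
    topk = logits.topk(k, dim=1).indices               # (N, k) best class indices
    hits = (topk == targets.unsqueeze(1)).any(dim=1)   # target among the top k?
    return hits.float().mean().item()

# Top-1 accuracy is the special case k=1.
```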

For our study, we load a pre-trained Places-365 model, which we then tune to our categories using the deep learning framework Fast.AI (Howard et al., 2018). We apply transfer learning using the One-Cycle method, an efficient approach to setting hyper-parameters (learning rate, momentum, and weight decay) that can lead to faster convergence (Smith, 2018).
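The released code (footnote 7) contains the actual training pipeline; the sketch below only illustrates the general setup under assumed paths and settings: the Places-365 backbone receives a new classification head sized to our categories, and fastai's fit_one_cycle applies the One-Cycle schedule.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50
from fastai.vision.all import (
    ImageDataLoaders, Learner, Resize, aug_transforms, accuracy, top_k_accuracy,
)

# Places-365 ResNet-50 backbone (see the loading sketch above).
model = resnet50(num_classes=365)
ckpt = torch.load("resnet50_places365.pth.tar", map_location="cpu")
model.load_state_dict({k.replace("module.", ""): v
                       for k, v in ckpt["state_dict"].items()})

# Images organized one folder per category ("data/deboer" is an assumed path),
# with a random 20% validation split as described above.
dls = ImageDataLoaders.from_folder(
    "data/deboer", valid_pct=0.2, seed=42,
    item_tfms=Resize(224),
    batch_tfms=aug_transforms(max_rotate=10.0, max_zoom=1.1),  # flip/rotate/zoom
)

# Swap the 365-way Places head for one output per category in our scheme.
model.fc = nn.Linear(model.fc.in_features, dls.c)

learn = Learner(dls, model, metrics=[accuracy, top_k_accuracy])
learn.fit_one_cycle(10)  # one-cycle schedule; the epoch count here is illustrative
```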

To account for the uneven distribution and often small amount of training data, and to prevent overfitting, we experiment with different data augmentation methods, including image transformations, label smoothing, MixUp, and CutMix. Overfitting refers to the model adapting too closely to the training data, making it less suitable for images it was not trained on. The image transformations in this case include flipping, rotating, and zooming the source image. These transformations increase the number of training images the model sees and make it harder for the model to overfit to a particular image. Label smoothing adds a small amount of noise to the one-hot encoding of the labels, making it more difficult for the model to be confident about a prediction. This reduces overfitting and improves the generalizability of the model (Szegedy et al., 2014).

MixUp and CutMix are two relatively recent data augmentation techniques. MixUp essentially overlays two different images with different labels; for example, an image of a cat can be overlaid with an image of a dog. This makes it more difficult for the model to determine what is in each image: the model has to predict two labels for each image, as well as the extent to which they are 'mixed' (Zhang et al., 2018).


CutMix is an extension of random erasing, a technique often used in data augmentation. In random erasing, a random block of pixels is removed from the image, making it harder for the model to memorize the image. CutMix cuts and pastes random patches of pixels between training images, and the labels are mixed proportionally to the area of the patches in the training image. CutMix pushes the model to focus on the less discriminative parts of an image, which also makes it suitable for object detection (Yun et al., 2019). The downside of MixUp and CutMix can be that they make it too difficult for the model to detect features, causing underfitting.

We train the model for a maximum of 35 epochs, with early stopping after five epochs without improvement in validation loss. To account for underfitting, we experimented with varying the α parameter of MixUp and CutMix, which controls how strongly the augmentation is applied; too much augmentation would make it too difficult for the model to extract generalizable features. Training a model means finding a balance between underfitting and overfitting, which depends on the size and complexity of the training data. In the domain of cultural heritage, we often work with limited sets of annotated training material, making it worthwhile to examine to what extent transfer learning can be applied and how we can cope with overfitting to these limited sets of data. In our use case, we want the model to classify unseen data, which requires a model that can generalize.
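In fastai, the augmentations and the stopping criterion discussed above map onto a loss function and callbacks; a sketch, continuing from the learner defined above (the α value is one of the settings reported in Table 1):

```python
from fastai.vision.all import (
    MixUp, CutMix, LabelSmoothingCrossEntropy, EarlyStoppingCallback,
)

# Label smoothing is set as the loss function; MixUp/CutMix are callbacks whose
# alpha parameter controls how strongly images and labels are blended.
learn.loss_func = LabelSmoothingCrossEntropy()

learn.fit_one_cycle(
    35,  # maximum number of epochs
    cbs=[
        MixUp(alpha=0.2),  # or, e.g., CutMix(alpha=0.5); values as in Table 1
        EarlyStoppingCallback(monitor="valid_loss", patience=5),
    ],
)
```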

4 RESULTS

Our baseline model, which only uses image transformations, achieves a top-1 accuracy of 0.62 and a top-5 accuracy of 0.88 (see Table 1). Adding MixUp with a low α (0.2) and label smoothing improves the top-1 accuracy to 0.68, but these augmentations have little effect on the top-5 accuracy. The addition of CutMix and label smoothing has a similar effect, with a top-1 accuracy of 0.67. It is noteworthy that the baseline model already achieves good results, which underscores the power of transfer learning and the use of the Places-365 model as a starting point for more specific scene detection tasks. In almost nine out of ten cases, the correct label can be found among the top-5 predictions. For our larger data set, we will again explore to what extent MixUp and CutMix can boost the performance of the model.

There was one category that was never correctly identified, namely 'accident stretcher', which depicts people being carried away on a stretcher. The category consists of only six training images, which are also quite diverse.

Table 1: Training Results.

                            Top-1 Acc   Top-5 Acc
Baseline + Aug                0.62        0.88
Label Smoothing (LS)          0.61        0.87
MixUp (0.2)                   0.62        0.86
MixUp (0.2) + Aug             0.63        0.87
MixUp (0.2) + Aug + LS        0.68        0.89
MixUp (0.3)                   0.63        0.88
MixUp (0.3) + Aug             0.61        0.87
MixUp (0.3) + Aug + LS        0.60        0.86
MixUp (0.4)                   0.67        0.89
MixUp (0.4) + Aug             0.61        0.86
MixUp (0.4) + Aug + LS        0.67        0.89
CutMix (0.2)                  0.63        0.87
CutMix (0.2) + Aug            0.63        0.87
CutMix (0.2) + Aug + LS       0.64        0.87
CutMix (0.5)                  0.62        0.87
CutMix (0.5) + Aug            0.66        0.89
CutMix (0.5) + Aug + LS       0.67        0.89
CutMix (1.0)                  0.61        0.86
CutMix (1.0) + Aug            0.63        0.87
CutMix (1.0) + Aug + LS       0.67        0.89

The category 'funeral' also scores low on precision (0.25) and recall (0.1). This category contains ten training images, but again they are quite diverse and show a strong resemblance to 'church indoor', 'parade', and 'memorial'. We expect this accuracy to improve with more training material. The categories that the model most often confuses include 'ceremony', 'handshake', 'portrait children', 'residential neighborhood', 'harbor', and 'portrait group'; these were commonly predicted as, respectively, 'handshake', 'ceremony', 'portrait group', 'street', 'boats', and 'ceremony'. The predictions offered by the model are sensible, since they are for the most part closely related to the correct labels. For example, the distinction between the categories 'street' and 'residential neighborhood' is difficult, and it might actually be more appropriate to attach both labels to the image. Moreover, a street might be better described as an object rather than a scene, while in some contexts a street can also be an environment in which objects are housed. This example foregrounds the conceptual overlap between objects and scenes. Figure 8 shows the images with the top losses, that is, images for which the correct label received a low predicted probability. Upon closer inspection, we have to conclude that these predictions are not necessarily wrong. We see, for example, an image that shows a handshake but also people in military attire; this image was labeled as 'military' but classified as 'handshake'. This leads us to conclude that, for the annotation of the larger dataset, it might be worthwhile to allow multiple labels and create a multi-label classifier.


Figure 5: Top-5 predictions for a picture with the label 'Dining Room'.

Figure 6: Top-5 predictions for a picture with the label 'Harbor'.

Figure 7: Top-5 predictions for a picture with the label 'Portrait Group'.

Such a classifier requires more training data, but in future work, we will explore how much more training data is required to reach accurate predictions.
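A confusion analysis of this kind can be reproduced with fastai's interpretation utilities; a sketch, assuming the trained learn object from Section 3.1:

```python
from fastai.vision.all import ClassificationInterpretation

# Which category pairs does the model confuse most often, and which validation
# images have the highest loss (cf. Figure 8)?
interp = ClassificationInterpretation.from_learner(learn)
print(interp.most_confused(min_val=3))  # e.g. [('ceremony', 'handshake', 5), ...]
interp.plot_top_losses(9, figsize=(12, 10))
```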

5 DISCUSSION

This paper demonstrated how we annotated a data set of historical press photographs and applied transfer learning to leverage existing scene detection models in a new domain. We highlighted the difficulties of categorization and possible solutions for working with skewed data sets. While our baseline model with basic image transformations already reached good accuracy scores, further augmentations, including MixUp and label smoothing, improved our top-1 accuracy (from 0.62 to 0.68). Our top-5 accuracy improved only slightly with these additional augmentations (from 0.88 to 0.89); it indicates that the correct answer was often among the top five answers given. While there still exists some ambiguity about labels, in some instances one of these five labels was understandable from a visual perspective but might cause ethical concerns. For example, images of a funeral with many flowers were occasionally labelled as 'parade'.

Figure 8: Top Losses.

Such mistakes are more impactful than mistaking, for instance, an automobile for a truck. In future work, we want to explore how we can penalize such mistakes and thereby improve the learning process. Furthermore, once more images of the collection have been digitized, we can further refine the presented model and improve and expand our categorization scheme.
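One possible direction, sketched below as an illustration rather than the method used in this paper, is a cost-sensitive loss in which a hand-specified matrix assigns higher penalties to ethically sensitive confusions:

```python
import torch
import torch.nn.functional as F

def cost_sensitive_loss(logits: torch.Tensor, targets: torch.Tensor,
                        cost_matrix: torch.Tensor) -> torch.Tensor:
    """Expected misclassification cost under a hand-specified penalty matrix.

    cost_matrix[i, j] is the (hypothetical) penalty for predicting class j when
    the true class is i, e.g. a high value for 'funeral' -> 'parade'.
    """
    probs = F.softmax(logits, dim=1)   # (N, C) predicted class distribution
    penalties = cost_matrix[targets]   # (N, C) penalty row for each true class
    return (probs * penalties).sum(dim=1).mean()
```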

REFERENCES

Bell, P. and Impett, L. (2019). Ikonographie und Interaktion. Computergestützte Analyse von Posen in Bildern der Heilsgeschichte. Das Mittelalter, 24(1):31–53.

Bell, P. and Ommer, B. (2018). Computing Art Reader: Einführung in die digitale Kunstgeschichte, P. Kuroczyński et al. (Eds.).

Bhargav, S., van Noord, N., and Kamps, J. (2019). Deep Learning as a Tool for Early Cinema Analysis. In Proceedings of the 1st Workshop on Structuring and Understanding of Multimedia Heritage Contents, SUMAC '19, pages 61–68, Nice, France. Association for Computing Machinery.

Clough, P., Hill, T., Paramita, M. L., and Goodale, P. (2017). Europeana: What users search for and why. In International Conference on Theory and Practice of Digital Libraries, pages 207–219. Springer.

Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. The MIT Press, Cambridge, Massachusetts.

Howard, J. et al. (2018). fastai. https://github.com/fastai/fastai.

Llamas, J., Lerones, P. M., Medina, R., Zalama, E., and Gómez-García-Bermejo, J. (2017). Classification of Architectural Heritage Images Using Deep Learning Techniques. Applied Sciences, 7(10):992.


Madhu, P., Kosti, R., Mührenberg, L., Bell, P., Maier, A., and Christlein, V. (2019). Recognizing characters in art history using deep learning. In Proceedings of the 1st Workshop on Structuring and Understanding of Multimedia Heritage Contents, pages 15–22.

Mager, T., Khademi, S., Siebes, R., Hein, C., de Boer, V., and van Gemert, J. (2020). Visual Content Analysis and Linked Data for Automatic Enrichment of Architecture-Related Images. In Kremers, H., editor, Digital Cultural Heritage, pages 279–293. Springer International Publishing, Cham.

Maiwald, F. (2019). Generation of a benchmark dataset using historical photographs for an automated evaluation of different feature matching methods. International Archives of the Photogrammetry, Remote Sensing & Spatial Information Sciences.

Maiwald, F., Vietze, T., Schneider, D., Henze, F., Münster, S., and Niebling, F. (2017). Photogrammetric analysis of historical image repositories for virtual reconstruction in the field of digital humanities. The International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, 42:447.

Niebling, F., Bruschke, J., Messemer, H., Wacker, M., and von Mammen, S. (2020). Analyzing Spatial Distribution of Photographs in Cultural Heritage Applications. In Liarokapis, F., Voulodimos, A., Doulamis, N., and Doulamis, A., editors, Visual Computing for Cultural Heritage, Springer Series on Cultural Computing, pages 391–408. Springer International Publishing, Cham.

Offert, F. (2018). Images of image machines. Visual interpretability in computer vision for art. In Proceedings of the European Conference on Computer Vision (ECCV).

Oliva, A. and Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175.

Palermo, F., Hays, J., and Efros, A. A. (2012). Dating historical color images. In Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., and Schmid, C., editors, Computer Vision – ECCV 2012, pages 499–512, Berlin, Heidelberg. Springer Berlin Heidelberg.

Petras, V., Hill, T., Stiller, J., and Gäde, M. (2017). Europeana – a search engine for digitised cultural heritage material. Datenbank-Spektrum, 17(1):41–46.

Rawat, W. and Wang, Z. (2017). Deep Convolutional Neural Networks for Image Classification: A Comprehensive Review. Neural Computation, 29(9):2352–2449.

Smith, L. N. (2018). A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay. arXiv:1803.09820 [cs, stat].

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2014). Going Deeper with Convolutions. arXiv:1409.4842 [cs].

Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., and Torralba, A. (2010). SUN database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3485–3492. IEEE.

Xie, L., Lee, F., Liu, L., Kotani, K., and Chen, Q. (2020). Scene recognition: A comprehensive survey. Pattern Recognition, 102:107205.

Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., and Yoo, Y. (2019). CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. arXiv:1905.04899 [cs].

Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. (2018). mixup: Beyond Empirical Risk Minimization. arXiv:1710.09412 [cs, stat].

Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., and Torralba, A. (2018). Places: A 10 Million Image Database for Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1452–1464.

APPENDIX

Here we list our annotation guidelines per category. Some of these categories have not been included in the training because they contained fewer than twenty images.

Accident Car. Traffic accident involving an automobile.
Accident Stretcher. Accident involving a person on a stretcher.
Accident Train. Traffic accident involving a train.
Aerial. Picture with an aerial perspective.
Amphitheater. The collection contains many pictures taken at an outside amphitheater in Bloemendaal.
Animals cow. Pictures of live and dead cows.
Animals dog. Pictures of dogs.
Animals horse. Pictures of horses.
Animals misc. Animals that do not fit in the previous categories. For the final data set, this might be subdivided into more categories.
Artwork. Artwork without people; the focus is on the artwork. There is also a separate category for statues.
Auditorium. Public building (used for speeches, performances) where an audience sits. Pictures with and without audiences. Overlap with the categories 'conference room' and 'speech'.
Bakery. Photos inside a bakery, baking bread, presenting bread indoors/outdoors. Overlap with kitchen.
Bar. Area where drinks are served/consumed. Overlap with dining room/game room.
Baseball. People playing baseball.
Basketball. People playing basketball.
Beach. If the beach is the picture's main focus, we select beach. Dunes is a separate category. Overlap with crowd/horse/running. When there is only water visible, we still select beach. Possibly 'sea' or 'ocean' as a category.
Beauty Salon. Images that feature hairdressers or beauty salons.
Bicycles. Features bicycles or people riding bicycles. Overlap with cycling (the category for professional cycling) and street.
Boat Deck. Pictures that focus on the deck of a boat, either with or without people; not showing the entire ship/boat from a distance. Overlap with the boats category.
Boats. Category focusing on a boat or multiple boats. Overlap with harbor/shipyard, if the boat is harboured and the harbor takes up a large area of the picture. Shipyard depicts a construction area for boats and boats under construction. Overlap with beach, canal, river, waterfront; the difference is the focus on the boat.
Bookstore Library. Bookstore or library as a combined category, as they are often difficult to separate. The pictures feature rows of books and/or people reading. Overlap with office, a room which also often features books.
Bridge. The picture should feature a bridge as a central element. Overlap with street/canal/river/boat.
Building Facade. Depicting the facade of a building or rows of buildings; not showing the full building from a distance. Overlap with residential area/mansion: the latter are single large houses, the former pictures of areas without a clear focus on the facade. Overlap with shopping area/residential area/mansion.
Bus/truck. Focus on large busses and trucks. Overlap with street.
Butchers Shop. Butcher shop from the inside, or people preparing meat. Overlap with animals/shopfront/shop/kitchen.
Canal. Flow of water; also includes natural flows of water such as rivers. Overlap with bridge/river/boat/fishing/waterfront.
Car. Pictures that focus on a car.
Car Shop. Showroom in which cars are sold.
Catwalk. Models on a catwalk.
Cemetery. Pictures taken of a cemetery. Important elements include tombs or tombstones.
Ceremony. A group of people bundled together for a ceremony. This could be awards, flowers, or a medal. Note that there is also a specific category for handshakes.
Chess Checkers. People playing either chess or checkers. These are visually quite similar; for the larger data set, this category might be divided into two if there are enough images.
Church Indoor. Pictures taken inside a church. This could include masses, but also views of the church without people.

Church Outdoor. Pictures of the church/cathedral building.
Circus. Pictures taken of a circus, inside a circus tent. The outside of the circus tent is categorized as tent.
City Hall Stairs. Pictures taken of a group of people on the stairs outside the city hall of Haarlem. This is quite a specific category, but there are quite a few pictures that fit it.
Classroom. Students inside a classroom.
Clergy. Pictures of clergy, indoors or outdoors.
Construction Site. Construction site; this also includes pictures of demolitions, as the two are quite difficult to separate visually. If there are enough images of both, we could separate the two.
Courtyard. Area between buildings, or an outside yard in a group of buildings.
Cricket. People playing cricket.
Crowd. Gathering of people where individuals are not clearly discernible. When posing for the picture, put in a portrait category.
Cycling. Professional cyclists.
Dancing. People dancing. Overlap with bar, musical performance, plaza.
Dining Room. Area where people eat, both in restaurants and houses.
Drive Way. Driveway in front of buildings.
Dunes. Photos of dunes. Overlap with parade/memorial/crowd.
Excavation. People digging up something, archeological finds. Overlap with construction site.
Exhibition. Room with artwork(s) and people. The focus is more clearly on the setting.
Factory Indoor. Pictures taken in factories/assembly lines/production facilities.
Farm Field. People working in farm fields, pictures of crops/farm fields. Overlap with animals.
Field Hockey. People playing field hockey.
Fire. Pictures of fires, or buildings destroyed by fires.
Fishery. Pictures of the fishing industry.
Flag. People holding or raising flags. Overlap with parade, in which people hold flags.
Flowers. Pictures that include flowers.
Forest/Park. Picture taken in a forest or park.
Funeral. Pictures of a funeral. Similar to memorial and cemetery; here we only choose pictures that show a casket or the burial itself. Overlap with memorial, parade, cemetery, flowers, church indoor.
Gymnastics. People performing gymnastics, indoor and outdoor. Overlap with circus/dancing.


Handshake. People shaking hands; subcategory of ceremony.
Harbor. Ships docked at a harbor; the focus is not the ships but the context of the harbor. Overlap with boats and waterfront; choose waterfront if the focus is on the waterfront/quay.
Historical Plays. People dressed up in historical garments enacting historical plays.
Hospital. Pictures taken in a hospital/dentist/medical lab setting.
Ice Skating. People skating on ice.
Kitchen. People in a kitchen, preparing food. Overlap with butcher shop.
Living Room. People situated in a living-room space. Overlap with portraits, which are often taken in a living room. To learn the living room category, I placed pictures in here taken in living rooms, with enough information on the living room. Overlap with portrait.
Mansion. Large houses, separated from housing blocks.
Market Indoor. Indoor market/shopping fair.
Market Outdoor. Outdoor market. Overlap with crowd.
Marriage. Depicting a marriage couple or marriage ceremony. Overlap with portrait group.
Meeting Room. Setting features a meeting table with people sitting around it.
Memorial. Depicting a memorial site. Overlap with funeral/flowers/flag.
Military. Pictures depicting military personnel or military equipment. Overlap with parade and bus/trucks.
Motorcycle. Depicting motorcycle(s).
Musical Performance. People performing music. Overlap with dance.
Office. Pictures set in an office environment.
Parade. People parading, marching bands.
Parade Floats. Pictures with flower trucks/floats.
Patio. People sitting on patios.
Playground. People/children playing in a playground. Overlap with funfair and portrait children.
Pond. Scenery of a pond. Overlap with canal/river/park.
Portrait Child. One child.
Portrait Children. More children.
Portrait Individual.
Portrait Group.
Protest. People protesting; banners are a clear signal.
Racecourse. Pictures taken on a racecourse. Also category soapbox race. Overlap with car, accident.
Residential Neighborhood. Living area. Overlap with street/building facade.
Rowing. People rowing.
Running. People running in a sports event.
Saint Niclaus. Pictures of the Saint Nicholas festivities.

Shed. Wooden buildings, beach houses, living trailers.
Shipyard. Construction area for ships. Overlap with boats and harbor.
Shop Interior. Photos of different kinds of shop interiors.
Shop Front. Picture of a shop front, a shop window display ('etalage').
Shopping Street. Depicting a street of shops/shoppers.
Sign. Signs, plaques, maps.
Snowfield. Scenes set in snow, people skiing, sledding.
Soapbox Race.
Soccer. People playing soccer.
Soccer Indoor. People playing indoor soccer.
Soccer Team. Portrait of a soccer team.
Speech. Focus on a person speaking.
Statue. Pictures of sculptures/statues.
Street. Depicting street sceneries. Overlap with car, residential neighbourhood.
Swimming Pool Indoor.
Swimming Pool Outdoor.
Theatre. Plays performed in a theatre setting.
Tower. Pictures of towers, not church towers.
Train. Pictures of trains. Overlap with train station.
Train Station. Pictures of train stations, trains at stations, etc.
Tram. Pictures of trams.
Volley ball. Pictures of people playing volleyball.
Water Ski. People on water skis.
Water Front. Scenes that focus on the waterfront.
Windmill. Pictures that contain a windmill.
