interpreting black-box semantic segmentation models in...

MLVis Machine Learning Methods in Visualisation for Big Data (2019)D Archambault I Nabney and J Peltonen (Editors)

Interpreting Black-Box Semantic Segmentation Models in RemoteSensing Applications

A Janik1 K Sankaran1 A Ortiz2

1Montreal Institute of Learning Algorithms2University of Texas - El Paso

AbstractIn the interpretability literature attention is focused on understanding black-box classifiers but many problems ranging frommedicine through agriculture and crisis response in humanitarian aid are tackled by semantic segmentation models The ab-sence of interpretability for these canonical problems in computer vision motivates this study In this study we present a user-centric approach that blends techniques from interpretability representation learning and interactive visualization It allowsto visualize and link latent representation to real data instances as well as qualitatively assess strength of predictions We haveapplied our method to a deep learning model for semantic segmentation U-Net in a remote sensing application of buildingdetection This application is of high interest for humanitarian crisis response teams that rely on satellite images analysisPreliminary results shows utility in understanding semantic segmentation models demo presenting the idea is available online

CCS ConceptsbullHuman-centered computing rarr Information visualization bull Computing methodologies rarr Knowledge representation andreasoning Image segmentation

1 Introduction

The possibility of exploring characteristics of models beyond ac-curacy is becoming a legal demand in business applications un-der the light of recently introduced laws including the EuropeanGDPR and the ldquoright to explanationrdquo of decisions made by algo-rithms [GF16] Machine Learning is often introduced as an oraclerather than a scientifically explainable approach and this is causefor concern Relying on visualizations of neuron activations is notenough ndash people need interpretations How do models link to theunderlying datasets on which they were trained How can we usethis knowledge to open the black-box and discover the reasoningbehind the model

Being able to tell what properties of the data result in good modelperformance is useful for designing them in a more transparentway and with more honest certifications of when they can be de-ployed reliably

While interpretability in machine learning can be realized inmany ways the focus of this work is the problem of explain-ing black-box models as defined in [GMRlowast18] While there issubstantial research on explanation of black box image classifiers[RSG16 KWGlowast17] less is available for image segmentation Thequestion we explore in this study is how can we explain predictedsegmentations by inspecting their learned representations and nav-igating the associated latent space

One of the problems of remote sensing is segmentation of differ-ent elements of satellite images eg roads bridges buildings carsland coverage etc Information about detected buildings is beingused for example to estimate region populations This knowledgeguides humanitarian efforts in distribution of food water and otherbasic resources for people affected by the crisis and for creatingstrategies for epidemiology prevention

11 Related Work

The survey [HKPC] present the framework for classification of vi-sual methods in deep learning Related approaches [YYBlowastYCNlowast]are based on plotting filters and exploring features in our approachwe plot observations itself within the context giving a possibilityto select region of interest to a user we decrease cognitive load thatcomes with representing everything at once on the same chart

Our contribution

bull combining feature representation ideas in computer vision withinteractive visualizationbull predicting evaluation score for entire latent space (IoU smooth-

ing)bull demo visualization - available online here httpadrijanik

githubiounet-vis

ccopy 2019 The Author(s)Eurographics Proceedings ccopy 2019 The Eurographics Association

A Janik K Sankaran amp A Ortiz Interpreting Black-Box Semantic Segmentation Models in Remote Sensing Applications

12 Incompleteness in Remote Sensing

According to [DVK] the need for interpretability originates froman incompleteness of the problem definition which makes it diffi-cult to optimize and evaluate To understand this in the context ofremote sensing one needs to understand the userrsquos perspective asone of the questions about interpretability is to whom it should beinterpretable [TBHlowast18]

Incompleteness in remote sensing may manifest itself in differ-ent ways Domain knowledge is one - resources for inference areoften limited in humanitarian remote sensing applications whichmay guide model choice Another aspect is safety and reliabilityndash we are not able to flag all undesired outputs for an end-to-endsystem it will never be fully testable Finally ethics are an im-portant consideration ndash every model is biased by the data it wastrained on and by the model of the world used to annotate data Forexample main street in Chicago and in Niger State have differentvisual representations although they fulfill similar roles Incom-pleteness may also be associated with mismatched objectives ormulti-objective trade-offs like privacy vs quality

Therefore in the presence of incompleteness explanations canensure that underspecifications of formalization are visible and un-derstood by users [DVK]

In the remote sensing scenario interpretability could highlight

bull biases from the training set (eg a model trained on cities shouldnot be used in rural areas)bull more honest information about the characteristics of data and

their effect on model performance so that users can set theirexpectations accordinglybull techniques to guide sample collection (eg how target areas dif-

fer from the areas that was covered in the training set)bull the importance of the underlying data to a wider audience (eg

one might mistakenly think that the model should work for everycity in the world in the case of a building detection task whichmight be disappointing and can undermine trust towards usagemachine learning at all)

2 Method

This study presents an interactive visualization method for high-lighting model capabilities Let us first introduce one of the metricsfor evaluating semantic segmentation models that our method usesIt is the Intersection over Union score (IoU) which intuitively canbe understood as a ratio between overlapping area and union area ofdetected mask and ground truth mask (Equation 1) The method isbased on linked brushing and IoU smoothing to interact with latentrepresentations from an encoder-decoder segmentation model IoUsmoothing is an approximation of the IoU score across the wholetraining dataset and it will be explained in the sections following

Our method was designed for explaining the U-Net semanticsegmentation model [RFB15] though in the future we also planto explore other networks with encoder-decoder architecture Itrequires access to a trained U-Net segmentation model trainingdataset and activations at the bottleneck layer To evaluate segmen-tation prediction with respect to ground truth we used IoU score

defined as

IoU(y y) =ycap yycup y

(1)

where y is a ground truth mask and y is a predicted mask

21 U-Net

U-Net is a deep network commonly used in segmentation problemsIt learns a reduced representation of an image through a down-sampling path while at the same time preserving localized infor-mation about desired properties through an up-sampling path withskip connections which is used to make a prediction

Each component is composed of convolutional layers goingdown and transposed convolutions going up with max-pooling lay-ers in between Down-sampling is responsible for reducing the in-put image to a concise representation while up-sampling retrieveslocalized information for the networkrsquos output The latent repre-sentation referred in this work is represented by activations of thebottleneck layer ndash the layer that contains the quintessence of ana-lyzed image

22 Dimensionality Reduction

At the bottleneck there are 512 neurons from which activationsare collected this amount of data is incomprehensible for a hu-man without any aid To decrease cognitive load of such a hugeamount of data we used dimensionality reduction method PrincipalComponent Analysis (PCA) [Dun] was chosen as an example of awell-established dimensionality reduction method that can projecta high-dimensional representation into a low-dimensional spacepreserving distance relationships between the data Even thoughsome information is lost during the process we gain the ability toshow the reduced representation to the user in a form of cognitivelyaccessible low dimension views The choice of dimensionality re-duction method was motivated by the fact that PCA preserves dis-tances between input samples which is important for our applica-tion The first two components were used for visualization becauseof lower cognitive load of two dimensional charts in comparisonto 3D charts but of course depending on the end goal the first 3components can be also plotted as a 3D chart or even all of thecomponents can be plotted as a scatterplot matrice with a cost ofincreased complexity of the visualization

23 IoU Smoothing

Prediction of IoU score over the whole training data space was ob-tained by training a multi-layer perceptron (MLP) [RHW85] whichis a neural network with one hidden layer with 100 neurons opti-mized with Adam [KB] with initial learning rate of 0001 regular-ized with L2 norm As an input values of 9 principle componentswere used for 85 of training samples - remaining 15 was heldout as a test set

IoU smoothing gives a sense of IoU score for areas of latentspace where there is no data available The method presented givesa map of the latent landscape learned by the model



Estimated IoU plotted as a background of scatter plot gives pos-sibility to qualitatively assess strength of predictions based on lo-cation in the latent space and its estimated score

24 Algorithm

The motivation behind our approach is to guide users in under-standing and successfully applying the model to the consideredtask Our visualization is based on principle of linked brushing[Spe] allowing users to interactively explore coordinated viewsacross subsets of the data [Kei02]

Our visualization is obtained through the following steps Firstlyactivations from the bottleneck of the network are collected for thetraining dataset Next step is to associate IoU score and set of ac-tivations with image (patch) and predictions (ground truth and in-ference) Once the activations are collected the dimensionality isreduced by PCA and two principal components are used to plotpoints on the scatterplot Each point is a representation of an in-put image Given a set of reduced activations and associated IoUscores we train a MLP regressor to estimate an IoU score acrossthe whole training space This prediction is visualized as a heatmapof IoU scores plotted in the background of the scatterplot The laststep is applying brushing to the plot in order to associate the la-tent views with the original samples For each point contained inthe brush selection we display the corresponding ground truth andprediction mask This provides a convenient view of the dataset andmodel properties

In the following section we present a proof of concept designedfor the task of interpreting building detection models in remotesensing

3 Application

One potential application of this method could be the pre-screeningof satellite images in a newly encountered region filtering to thoselikely to contain features of interest To prioritize manual labelingwe can evaluate the similarity between the latent representationof new patches and those from urbanized regions used for train-ing One of the on-going initiatives in humanitarian applicationsis collaborative mapping [HS19] ndash this currently is not integratedwith machine learning tools to the extent it could be Visualizationand interpretability focused methods may help convince people toadopt AI solutions To give a sense of how collaborative mappingworks letrsquos take an example of the application MapSwipe a partof OpenStreetMap ecosystem Currently volunteers have to man-ually swipe tiles that contains images of undeveloped areas Fil-tering down to those that have desired features present is tediousespecially where the images have the same geography In a buildingdetection task we could use smarter way of prescreening tiles withour method that can instantly prioritize regions with higher proba-bility of being inhabited Examples will be presented in Figure 1

31 Dataset

We applied this method to Inria Aerial Labelling Dataset[MTCA17] as it is an example of a well-explored labeled datasetfor satellite imagery The training set contains 180 color image tiles

of size 5000 x 5000 covering a surface of 1500m x 1500m each (ata 30cm resolution) There are 36 tiles for each region It covers5 regions Austin Chicago Kitsap County Western Tyrol ViennaFor the test set there were another 5 regions chosen BellinghamWA Bloomington IN Innsbruck San Francisco Eastern Tyrol Itprovides all together coverage of 810 km2 Images were sliced intopatches of size 572 x 572

32 Trained Model

We analyzed trained U-Net model optimized with Adam algorithmwith batch normalization It scored overall IoU of 7187 on vali-dation set and 6798 IoU on the transfer set

33 Demo Visualization

The red region in the Figure 1 is an artifact of how IoU is definedif in the image there is nothing to be detected there is no unionbetween detections and formula of IoU does not make sense weassumed that in such situation the IoU score will be 0 This area isalso highly condensed and qualitatively we can see that it containsmostly images of undeveloped areas without any buildings Demois available on GitHub httpgithubcomadrijanikunet-vis

34 Clustering

After reducing dimensionality through PCA we explored cluster-ing of the new representation To get the idea of what was learned indifferent locations of space for each cluster we selected representa-tive points characteristic for them We used k-means and DBSCANafter comparing silhouette scores of several parameters configura-tions better clustering was obtained with k-means algorithm with14 classes Despite representatives we also explored their medianand mean IoU scores The choice of representatives was based ontheir proximity to the centroid of the cluster another direction thatseems to be better is choice of prototypical points as describedin [WT] which we plan to explore further

Exploring clusters led us to some peculiar discovery about ourgiven model For example according to our U-Net cementeriesand car parking lots are similar (Figure 2) Why Probably becauseof similar pattern of rectangular shaped objects positioned next toeach other The question is if it is a desirable generalization for agiven task

Another interesting observation that we made were errors of pre-dictions that were attributed to erroneous ground truth predictions(Figure 3) It seems that annotations of data were collected in a dif-ferent time then the actual images in the dataset We found manyexamples were image represents the construction site but ground-truth annotation shows buildings

In this case study alone we have discovered that undesired out-puts may originate from

bull poor generalization capabilities for specific type of databull ground-truth errorsbull the definition of error metric



Figure 1 Demo visualization with several selections merged together Selections present samples from three qualitatively distinguishableregions a) that does not contain any buildings b) that contains few buildings c) highly urbanized with many buildings and with higher IoUscore We can see a bipolar nature of learned representation undeveloped area and urbanized

(a) Car parking (b) Cemetery

Figure 2 Two representatives of one cluster - according to the modelIs it a desirable generalization Does it matter if network confusescars with a grave

(a) Image (b) Ground-truth (c) Prediction

Figure 3 Network almost correctly segmented image of a constructionsite despite ground-truth being wrong it can be an evidence supportinggeneralization capabilities of the network despite noisy annotations

4 Conclusion

Users tend to not trust the models which they do not understandThis is not surprising since models really are genuinely compli-cated structures Overall performance measures of black-box mod-els are simply not enough to justify the use of model predictionsin the field If users do not trust models they do not use them Ifa crisis response team does not trust AI predictions then we arenot using the full potential of current technology meaning that thehelp offered to people affected by crises is not best that it couldbe To address this problem we propose the usage of interpretableapproaches - focus on the end-user the decision maker workingin limited resources and time critical environment for introducingmachine learning to current humanitarian workflows Are there anydistinguishable clusters of outliers Can we find the reason whymodels make errors Are the errors consistent What are the mostcommon errors What is the generalization capability of the modelTo what extent can you trust the model in a new region Those areonly a handful of questions that are of interest not only for practi-tioners but for scientists and we believe that approach focused pri-marily on interpretability could shed a new light on those questionsWith this work we emphasize the importance of interpretabilityand explore its utility in remote sensing analysis in the context ofhumanitarian AI enhancing tools that are already used by com-munity We presented a method of visualization of a segmentationmodel along with its training data and describe a latent space viewthat we believe will be useful for estimating IOU score or error ofnew unseen data



References[Dun] DUNTEMAN G Principal Components Analy-

sis URL httpsmethodssagepubcombookprincipal-components-analysis doi1041359781412985475 2

[DVK] DOSHI-VELEZ F KIM B Towards a rigorous science of inter-pretable machine learning URL httparxivorgabs170208608 arXiv170208608 2

[GF16] GOODMAN B FLAXMAN S European Union regulations onalgorithmic decision-making and a rdquoright to explanationrdquo arXiv e-prints(Jun 2016) arXiv160608813 arXiv160608813 1

[GMRlowast18] GUIDOTTI R MONREALE A RUGGIERI S TURINI FPEDRESCHI D GIANNOTTI F A Survey Of Methods For Explain-ing Black Box Models arXiv180201933 [cs] (Feb 2018) arXiv180201933 URL httparxivorgabs180201933 1

[HKPC] HOHMAN F KAHNG M PIENTA R CHAU D H Visualanalytics in deep learning An interrogative survey for the next fron-tiers URL httparxivorgabs180106889 arXiv180106889 1

[HS19] HUNT A SPECHT D Crowdsourced mapping in cri-sis zones collaboration organisation and impact Journal ofInternational Humanitarian Action 4 1 (Dec 2019) URLhttpsjhumanitarianactionspringeropencomarticles101186s41018-018-0048-1 doi101186s41018-018-0048-1 3

[KB] KINGMA D P BA J Adam A method for stochastic opti-mization URL httparxivorgabs14126980 arXiv14126980 2

[Kei02] KEIM D A Information visualization and visual data miningIEEE Transactions on Visualization and Computer Graphics 8 1 (Jan2002) 1ndash8 doi1011092945981847 3

[KWGlowast17] KIM B WATTENBERG M GILMER J CAI C WEXLERJ VIEGAS F SAYRES R Interpretability Beyond Feature Attri-bution Quantitative Testing with Concept Activation Vectors (TCAV)arXiv171111279 [stat] (Nov 2017) arXiv 171111279 URL httparxivorgabs171111279 1

[MTCA17] MAGGIORI E TARABALKA Y CHARPIAT G AL-LIEZ P Can semantic labeling methods generalize to any citythe inria aerial image labeling benchmark In 2017 IEEE Inter-national Geoscience and Remote Sensing Symposium (IGARSS)(Fort Worth TX July 2017) IEEE pp 3226ndash3229 URLhttpieeexploreieeeorgdocument8127684doi101109IGARSS20178127684 3

[RFB15] RONNEBERGER O FISCHER P BROX T U-Net Convolu-tional Networks for Biomedical Image Segmentation arXiv150504597[cs] (May 2015) arXiv 150504597 URL httparxivorgabs150504597 2

[RHW85] RUMELHART D E HINTON G E WILLIAMS R J Learn-ing internal representations by error propagation Tech rep CaliforniaUniv San Diego La Jolla Inst for Cognitive Science 1985 2

[RSG16] RIBEIRO M T SINGH S GUESTRIN C WhyShould I Trust You Explaining the Predictions of Any ClassifierarXiv160204938 [cs stat] (Feb 2016) arXiv 160204938 URLhttparxivorgabs160204938 1

[Spe] SPENCE R Information Visualization Design for Interaction 2edition ed Pearson 3

[TBHlowast18] TOMSETT R BRAINES D HARBORNE D PREECE ACHAKRABORTY S Interpretable to Whom A Role-based Model forAnalyzing Interpretable Machine Learning Systems arXiv180607552[cs] (June 2018) arXiv 180607552 URL httparxivorgabs180607552 2

[WT] WU C TABAK E G Prototypal analysis and prototypal regres-sion URL httparxivorgabs170108916 arXiv170108916 3

[YCNlowast] YOSINSKI J CLUNE J NGUYEN A FUCHS T LIP-SON H Understanding neural networks through deep visualizationURL httparxivorgabs150606579 arXiv150606579 1

[YYBlowast] YU W YANG K BAI Y YAO H RUI Y Visualizing andcomparing convolutional neural networks URL httparxivorgabs14126631 arXiv14126631 1



12 Incompleteness in Remote Sensing

According to [DVK] the need for interpretability originates froman incompleteness of the problem definition which makes it diffi-cult to optimize and evaluate To understand this in the context ofremote sensing one needs to understand the userrsquos perspective asone of the questions about interpretability is to whom it should beinterpretable [TBHlowast18]

Incompleteness in remote sensing may manifest itself in differ-ent ways Domain knowledge is one - resources for inference areoften limited in humanitarian remote sensing applications whichmay guide model choice Another aspect is safety and reliabilityndash we are not able to flag all undesired outputs for an end-to-endsystem it will never be fully testable Finally ethics are an im-portant consideration ndash every model is biased by the data it wastrained on and by the model of the world used to annotate data Forexample main street in Chicago and in Niger State have differentvisual representations although they fulfill similar roles Incom-pleteness may also be associated with mismatched objectives ormulti-objective trade-offs like privacy vs quality

Therefore in the presence of incompleteness explanations canensure that underspecifications of formalization are visible and un-derstood by users [DVK]

In the remote sensing scenario interpretability could highlight

bull biases from the training set (eg a model trained on cities shouldnot be used in rural areas)bull more honest information about the characteristics of data and

their effect on model performance so that users can set theirexpectations accordinglybull techniques to guide sample collection (eg how target areas dif-

fer from the areas that was covered in the training set)bull the importance of the underlying data to a wider audience (eg

one might mistakenly think that the model should work for everycity in the world in the case of a building detection task whichmight be disappointing and can undermine trust towards usagemachine learning at all)

2 Method

This study presents an interactive visualization method for high-lighting model capabilities Let us first introduce one of the metricsfor evaluating semantic segmentation models that our method usesIt is the Intersection over Union score (IoU) which intuitively canbe understood as a ratio between overlapping area and union area ofdetected mask and ground truth mask (Equation 1) The method isbased on linked brushing and IoU smoothing to interact with latentrepresentations from an encoder-decoder segmentation model IoUsmoothing is an approximation of the IoU score across the wholetraining dataset and it will be explained in the sections following

Our method was designed for explaining the U-Net semanticsegmentation model [RFB15] though in the future we also planto explore other networks with encoder-decoder architecture Itrequires access to a trained U-Net segmentation model trainingdataset and activations at the bottleneck layer To evaluate segmen-tation prediction with respect to ground truth we used IoU score

defined as

IoU(y y) =ycap yycup y

(1)

where y is a ground truth mask and y is a predicted mask

21 U-Net

U-Net is a deep network commonly used in segmentation problemsIt learns a reduced representation of an image through a down-sampling path while at the same time preserving localized infor-mation about desired properties through an up-sampling path withskip connections which is used to make a prediction

Each component is composed of convolutional layers goingdown and transposed convolutions going up with max-pooling lay-ers in between Down-sampling is responsible for reducing the in-put image to a concise representation while up-sampling retrieveslocalized information for the networkrsquos output The latent repre-sentation referred in this work is represented by activations of thebottleneck layer ndash the layer that contains the quintessence of ana-lyzed image

22 Dimensionality Reduction

At the bottleneck there are 512 neurons from which activationsare collected this amount of data is incomprehensible for a hu-man without any aid To decrease cognitive load of such a hugeamount of data we used dimensionality reduction method PrincipalComponent Analysis (PCA) [Dun] was chosen as an example of awell-established dimensionality reduction method that can projecta high-dimensional representation into a low-dimensional spacepreserving distance relationships between the data Even thoughsome information is lost during the process we gain the ability toshow the reduced representation to the user in a form of cognitivelyaccessible low dimension views The choice of dimensionality re-duction method was motivated by the fact that PCA preserves dis-tances between input samples which is important for our applica-tion The first two components were used for visualization becauseof lower cognitive load of two dimensional charts in comparisonto 3D charts but of course depending on the end goal the first 3components can be also plotted as a 3D chart or even all of thecomponents can be plotted as a scatterplot matrice with a cost ofincreased complexity of the visualization

23 IoU Smoothing

Prediction of IoU score over the whole training data space was ob-tained by training a multi-layer perceptron (MLP) [RHW85] whichis a neural network with one hidden layer with 100 neurons opti-mized with Adam [KB] with initial learning rate of 0001 regular-ized with L2 norm As an input values of 9 principle componentswere used for 85 of training samples - remaining 15 was heldout as a test set

IoU smoothing gives a sense of IoU score for areas of latentspace where there is no data available The method presented givesa map of the latent landscape learned by the model




24 Algorithm




3 Application


31 Dataset



32 Trained Model




34 Clustering













4 Conclusion


























24 Algorithm




3 Application


31 Dataset



32 Trained Model




34 Clustering













4 Conclusion






























4 Conclusion
























interpreting black-box semantic segmentation models in...

Documents