Journal Pre-proof

Segmentation and Quantification of Infarction without Contrast Agents via Spatiotemporal Generative Adversarial Learning
Chenchu Xu, Joanne Howey, Pavlo Ohorodnyk, Mike Roth, Heye Zhang, Shuo Li
PII: S1361-8415(19)30108-2
DOI: https://doi.org/10.1016/j.media.2019.101568
Reference: MEDIMA 101568
To appear in: Medical Image Analysis
Received date: 31 December 2018
Revised date: 16 September 2019
Accepted date: 30 September 2019
Please cite this article as: Chenchu Xu, Joanne Howey, Pavlo Ohorodnyk, Mike Roth, Heye Zhang, Shuo Li, Segmentation and Quantification of Infarction without Contrast Agents via Spatiotemporal Generative Adversarial Learning, Medical Image Analysis (2019), doi: https://doi.org/10.1016/j.media.2019.101568
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
© 2019 Published by Elsevier B.V.
Highlights
• Simultaneous segmentation and quantification of infarction without contrast agent.
• Learning time-series images by using spatiotemporal pyramid representation.
• Improving performance by exploiting the commonalities and differences across tasks.
• Embedding segmentation and quantification tasks into adversarial learning.
[Graphical abstract: cine MR (64 × 64 × 25) enters the spatiotemporal variation encoder (STVE) for enriched feature extraction of MI; the encoded representation z feeds the cross-task generator (CTG), in which shared representations drive the segmentation and quantification tasks (improved generating performance by reciprocal action across tasks); three task label relatedness discriminators (intra-label relatedness for the segmentation task, intra-label relatedness for the quantification task, and inter-label relatedness between Seg and Quan) compare the generated results with the ground truth and regulate the network, correcting the inter-/intra-task label relatedness by adversarial learning. Workflow/network structure: input cine MR → encoder → generator → segmentation/quantification outputs, with the discriminators trained using an L1+L2 loss and a fake/true loss.]
Segmentation and Quantification of Infarction without Contrast Agents via Spatiotemporal Generative Adversarial Learning

Chenchu Xu^a, Joanne Howey^a, Pavlo Ohorodnyk^a, Mike Roth^a, Heye Zhang^b,*, Shuo Li^a,*

^a Western University, London ON, Canada
^b Sun Yat-Sen University, Shenzhen, China
Abstract
Accurate and simultaneous segmentation and full quantification (all indices required in a clinical assessment) of the myocardial infarction (MI) area are crucial for early diagnosis and surgical planning. Current clinical methods remain subject to potential high-risk, nonreproducibility and time-consumption issues. In this study, a deep spatiotemporal adversarial network (DSTGAN) is proposed as a contrast-free, stable and automatic clinical tool to simultaneously segment and quantify MIs directly from the cine MR image. The DSTGAN is implemented using a conditional generative model, which conditions the distributions of the objective cine MR image to directly optimize the generalized error of the mapping between the input and the output. The method consists of the following: 1) A multi-level and multi-scale spatiotemporal variation encoder learns a coarse-to-fine hierarchical feature to effectively encode the MI-specific morphological and kinematic abnormality structures, which vary across spatial locations and time periods. 2) The top-down and cross-task generators learn the shared representations between segmentation and quantification to use the commonalities and differences between the two related tasks and enhance the generator preference. 3) Three inter-/intra-task label relatedness discriminators are iteratively imposed on the encoder and generator to detect and correct inconsistencies in the label relatedness between and within tasks via adversarial learning. Our proposed method yields a pixel classification accuracy of 96.98%, and the mean absolute error of the MI centroid is 0.96 mm on 165 clinical subjects. These results indicate the potential of our proposed method in aiding standardized MI assessments.

*Corresponding author. Email address: [email protected] (Shuo Li)

Preprint submitted to Elsevier, October 4, 2019
Keywords: Myocardial infarction, Segmentation, Full quantification, Sequential images, Generative adversarial networks
1. Introduction

Accurate and simultaneous segmentation and full quantification of myocardial infarction (MI) areas (all indices are required in a clinical assessment, including infarct size, percentage of infarct size, percentage of segments, perimeter, centroid, major axis length, minor axis length, orientation and transmurality (Ørn et al., 2007)) are the most effective tools to help clinicians quantitatively and qualitatively assess MI areas (Karim et al., 2013; Heiberg et al., 2008; Schuijf et al., 2004). MI frequently leads to fatal heart failure, and the success of its therapy depends highly on the assessment of the scar status. In recent years, clinical studies have indicated that the accurate segmentation of MI areas is important because it reveals the infarcted tissues and facilitates the identification and recovery of the dysfunctional but still viable myocardium (Karim et al., 2013; Ingkanisorn et al., 2004). However, segmentation alone cannot provide all the scar information necessary for further treatment. The accurate estimation of all indices of the MI area has gained increasing attention because it reveals the presence, location, and transmurality of acute and chronic MI to improve the selection of the appropriate therapeutic options (ablation or resynchronization) (Heiberg et al., 2008; Flett et al., 2011). Thus, both segmentation and full quantification of MIs are indispensable for early MI patient management and therapy planning because they provide all required information for a thorough understanding of MIs in clinical studies.
Current clinical procedures for MI patients rely on late gadolinium enhancement (LGE) cardiac magnetic resonance (MR) imaging, the 'gold standard' approach to MI area imaging, and are subject to potential high-risk, non-repeatable and time-consuming problems (Fox et al., 2010; Kali et al., 2014; Ordovas and Higgins, 2011), as shown in Fig. 1. A potential high risk arises from the intravenous administration of gadolinium-based contrast agents during the imaging of MI patients (patients after the initial diagnosis by an electrocardiograph): this administration can be fatal to patients with chronic kidney diseases (Fox et al., 2010). Patients with coronary artery disease have a higher prevalence of coexisting chronic kidney disease than patients without coronary artery disease.
Figure 1: The proposed method can segment and quantify an infarction directly from a cine MR image without contrast agents. This approach greatly benefits patients with MI compared with existing LGE-based clinical procedures.
The process is non-repeatable because it requires an experienced clinician to manually/semi-automatically segment the MI areas from the LGE MR image, and this process is subject to high inter-observer variability and subjectivity (Ordovas and Higgins, 2011). The process is also time consuming because it involves three steps: LGE imaging, manual/semi-automatic segmentation from the LGE image, and manual/semi-automatic quantification from the segmentation itself to obtain the final quantification result, which may even introduce an accumulative error (Kali et al., 2014).
To overcome the aforementioned problems of current clinical procedures, several methods have been attempted to directly segment the MI area from cine MR images.
Cine MR imaging is one of the most common myocardial function examinations, and it presents accurate myocardial spatiotemporal changes. Ultimately, these spatiotemporal changes in myocardial viability caused by MI directly affect the electrical and muscle contractions and drive the morphological and kinematic differences between the MI area and the healthy myocardium (Bijnens et al., 2007). The use of cine MR images to successfully detect and locate infarcted areas without contrast agents has been reported (Suinesiaputra et al., 2017; Wong et al., 2016). However, this approach has not been sufficiently exploited. Only two cine MR-based methods have achieved segmentation of the MI with accuracy comparable to LGE MR (Duchateau et al., 2016; Xue et al., 2017), and no previous method can simultaneously and accurately segment and quantify infarcted areas.
Accurate and simultaneous segmentation and full quantification of infarcted areas in MI patients directly from cine MR images is extremely challenging and involves the following issues: 1) Systematic learning of the spatiotemporal features of the MI. There are significant differences and inconsistencies in the spatial and temporal correlations of MI in different cine MR images. Since there is no a priori information on the infarcted area in the cine MR images (Petitjean and Dacher, 2011), effective learning of the hidden spatiotemporal information of cine MR images becomes a prerequisite to establish the intrinsic representation of the MI and to differentiate infarcted areas from other tissues. 2) Direct integration of the contiguous feature maps of different outputs. There is great heterogeneity and local-optimum discrepancy when training the mapping from the input 2D+T (two-dimensional + time) cine MR to two different outputs (1D and 2D) (Luc et al., 2016). Accurate learning of the mapping between the segmented image or estimated indices and the cine MR images highly depends on the ability to leverage the adjacency in the output label maps to guide the optimization of feature maps during training, such as the optimization of fine details in the segmentation feature maps. 3) Effective exploitation of the significant commonalities and differences between the segmentation and quantification tasks (Xue et al., 2018). Segmentation and quantification, as two related tasks that share commonalities, can use a shared representation during the learning process to produce a beneficial interaction and improve the learning efficiency and prediction accuracy.
To address these challenges, a deep spatiotemporal generative adversarial network (DSTGAN) is proposed to jointly segment and quantify MIs directly from cine MR images without the use of contrast agents. The DSTGAN investigates the infarcted area segmentation/quantification problem from the perspective of conditional generative adversarial networks (GANs), which learn a conditional generative model (Fig. 2) (Mirza and Osindero, 2014).
Figure 2: The proposed DSTGAN is a conditional generative model. It is flexible enough to simultaneously segment the image and quantify the indices by conditioning the spatiotemporal representation on both the generator and the three discriminators.
The reason for using this model is that the rapid development of GANs has shown impressive results in continuous/discrete data generation (Goodfellow et al., 2014), particularly for assessing the joint configuration of multi-label variables and enforcing forms of higher-order consistency (Luc et al., 2016). As shown in Figure 1, for a query cine MR image of any patient who is preliminarily diagnosed with MI, the DSTGAN is flexible enough to simultaneously obtain a segmented image and quantified indices by conditioning the spatiotemporal representation of the MI on the input and the correspondingly generated output. Specifically, the cine MR images are first mapped to a latent vector through an encoder; then, the vector is projected to segmentation and quantification through a deconvolutional/regression generator. Three adversarial networks are imposed on the generator, forcing the generated results to be closer to clinical standards and to the distribution of input cine MR images. The contributions of the proposed DSTGAN can be summarized from four aspects:
• For the first time, a non-contrast agent MI simultaneous segmentation and quantification approach is proposed to provide clinicians with all images and indices required in the current clinical assessment of MI directly from 2D+T cine MR images.

• An inherent hierarchical and multi-scale 2D+T encoder is proposed for the learning of a strong and comprehensive spatiotemporal pyramid representation of 2D+T images. It systematically models the time-series image sequence with large rotation, distortion and different local correlation structures between consecutive frames in different spatial locations and time periods.

• A reciprocal cross-task architecture is proposed to improve the learning efficiency and generation accuracy by exploiting the commonalities and differences across tasks. It merges the deep and coarse information with the shallow and fine information of the MI throughout the network to enrich this beneficial interaction representation.

• An iterative conditional adversarial learning strategy is proposed to integrate three different target discriminators to flexibly use multiple inductive biases and to simultaneously and directly approximate the generated distributions to the clinical standards.
This work advances our preliminary work in MICCAI-2018: 1) A new powerful spatiotemporal feature learning network now encodes cine MR images and improves the segmentation/quantification. 2) Conditional GANs have been used in the adversarial training process to enhance the generated results. 3) Three iterative discriminators optimize different distribution tasks to make the model more flexible, while the iterative training strategy ensures that their training is stable. 4) Two additional quantitative indicators (percentage of infarct size and transmurality) have been estimated to cover all indicators that can be used in the clinic. To our knowledge, this is the first report of these two indicators being directly estimated from the cine MR images of MI patients. 5) Experiments are extended to a larger database, and more statistical validations with rigorous discussion have been added to this paper.
2. Related Work

2.1. Existing MI segmentation or quantification methods

Currently, there is no method for simultaneous segmentation and quantification of the MI directly from a cine MR image, and even a direct MI quantification method is not available (Fig. 3). Existing MI segmentation methods cannot satisfy clinical needs in terms of accuracy because they provide only sparse movement and deformation information, which cannot accurately detect detailed abnormal heart wall motion. These methods mainly fall into three categories:
[Figure 3 overview: methods using temporal information only require post-processing to regularize the segmentation; coarse segment-level localization yields inaccurate quantification from the localization results and requires endocardial/epicardial segmentation; the proposed method combines both spatial and temporal learning, reciprocal task interaction, and inter-/intra-task label relatedness to achieve simultaneous segmentation and full quantification directly from multi-frame cine MR.]
Figure 3: The proposed method can simultaneously segment and quantify infarcts directly from the cine MR image because of improvements over existing methods and a specific design for MI characteristics.
energy-based (Ledesma-Carbayo et al., 2005), statistical shape model-based (Suinesiaputra et al., 2017), and machine learning-based (Bleton et al., 2015). However, until now, only Duchateau et al. (Duchateau et al., 2016) and Xue et al. (Xue et al., 2017) have successfully achieved pixel-wise MI segmentation, through an LV shape uncertainty model and patch-based recurrent neural networks (RNNs), respectively. However, these two methods face the issue of insufficient accuracy because of the cumbersome LV shape modeling process via manual or semi-automated endocardial/epicardial delineation and the use of only single-scale correlation learning for both spatial and temporal features. Moreover, all existing MI quantification methods are based on quantifying the segmentation results (Kali et al., 2014). Compared to such multi-step methods, direct quantification in a one-step regression is more efficient and more accurate because it avoids the cumulative error generated by multiple steps. Because of the excellent ability of deep learning in task-aware feature representation, direct quantification has been widely recognized in the medical image community and used in the estimation of different tissues or lesions (Schlegl et al., 2017; Sun et al., 2017). Specifically in the cardiac field, success in using direct quantification has also been reported for estimating a wide range of indices (Zhen et al., 2015b, 2017a; Afshin et al., 2012), including volume (Zhen et al., 2017b), wall thickness (Xue et al., 2017) and ejection fraction (Gu et al., 2018), from various cardiac images (Zhen et al., 2016, 2017b, 2015a), which led to a challenge called LV full quantification at STACOM-MICCAI'18 (Xue et al., 2017).
2.2. Deep spatiotemporal feature learning

Existing deep spatiotemporal feature learning methods, mainly divided into RNN-based (Yeung et al., 2016) and 3D convolution (Conv)-based (Tran et al., 2015) methods, demonstrate powerful performance (Xingjian et al., 2015; Shou et al., 2016). However, most of these methods cannot be mechanically applied to MI feature learning: 1) RNN-based methods, in which the spatial part of the spatiotemporal features is missing or incomplete, are incapable of modeling the morphological and kinematic changes specific to the MI. The RNN and its variants (such as long short-term memory (LSTM) and convolutional RNN (Xingjian et al., 2015)) have been widely used to model the temporal state transitions between frames to capture motion information and improve inter-frame consistency. However, LSTM fails to take spatial features into consideration because its fully connected layers learn the 2D+T images as a one-dimensional sequence. Convolutional RNN methods, in which the convolution layer is a locally constant (fixed kernel size) filter, lack the ability to model the non-local spatiotemporal features specific to the MI, with long-range rotation or distortion and different local correlation structures in different spatial locations and time periods (Shou et al., 2016). 2) The loss of granularity over time leaves 3DConv-based models unable to systematically model all 25 frames from beginning to end. 3DConv has been used to capture temporal scene changes while actions are being performed; it is commonly placed into a 3D convolutional neural network framework, which can learn the spatiotemporal abstraction of high-level semantics directly from 2D+T images (Shou et al., 2016). However, 3DConv loses granularity during temporal growth (Shou et al., 2017).
2.3. Adversarial label relatedness reinforcement

GANs have achieved much success in reinforcing and correcting inconsistent label relatedness in training feature maps. Although conditional random field (CRF)-based methods also enable reinforcing label relatedness and demonstrate impressive performance when labels are independently predicted, CRF-based methods have limitations in our application because the learnable parameters of higher-order potentials are limited in number when both segmented and quantified label relatedness are integrated into 2D+T data. Thus, using an adversarial training approach, which enforces multi-range label relatedness without being limited to a notably specific class of higher-order potentials in a CRF model, is a wise choice. The GAN framework estimates generative models via an adversarial process (Goodfellow et al., 2014). It is composed of two models that are alternately trained to compete with each other. The generator G captures the distribution of training samples and learns to generate new samples that imitate the training data. The discriminative model D differentiates the generated samples from the training data. D is optimized to distinguish training samples from results generated by G using a two-player min-max game with the following objective function:
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))] \tag{1}$$
where training data are represented by x, the data distribution is p_data, and z is a vector sampled from a certain distribution p(z).
In conditional GANs (Mirza and Osindero, 2014), by conditioning the model on additional information, it is possible to direct the data generation process. Conditional GANs successfully address the instability of the original GAN, whose training process is unstable and whose generated results are often noisy and incomprehensible.
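To make the alternating optimization of Eq. (1) concrete, the toy training step below pits a small fully connected generator against a small discriminator; the layer sizes, latent dimension, and the common non-saturating generator objective are illustrative assumptions, not the networks used in this paper.

```python
import tensorflow as tf

latent_dim = 8
# Toy generator G: z -> fake sample; toy discriminator D: sample -> logit.
G = tf.keras.Sequential([tf.keras.layers.Dense(32, activation="relu"),
                         tf.keras.layers.Dense(2)])
D = tf.keras.Sequential([tf.keras.layers.Dense(32, activation="relu"),
                         tf.keras.layers.Dense(1)])
g_opt = tf.keras.optimizers.Adam(1e-4)
d_opt = tf.keras.optimizers.Adam(1e-4)
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

@tf.function
def train_step(x_real):
    z = tf.random.normal([tf.shape(x_real)[0], latent_dim])
    # D step: ascend E[log D(x)] + E[log(1 - D(G(z)))], written as a binary
    # cross-entropy that labels real samples 1 and generated samples 0.
    with tf.GradientTape() as tape:
        real_logits, fake_logits = D(x_real), D(G(z))
        d_loss = (bce(tf.ones_like(real_logits), real_logits) +
                  bce(tf.zeros_like(fake_logits), fake_logits))
    d_opt.apply_gradients(zip(tape.gradient(d_loss, D.trainable_variables),
                              D.trainable_variables))
    # G step: the non-saturating surrogate of minimizing E[log(1 - D(G(z)))].
    with tf.GradientTape() as tape:
        fake_logits = D(G(z))
        g_loss = bce(tf.ones_like(fake_logits), fake_logits)
    g_opt.apply_gradients(zip(tape.gradient(g_loss, G.trainable_variables),
                              G.trainable_variables))
    return d_loss, g_loss
```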
3. Methodology

The structure of the DSTGAN is shown in Fig. 4. The input of the DSTGAN is a 2D+T cine MR dataset X, and the output is Y = (y_1, y_2, ..., y_n), which includes a segmentation task y_s = y_1 and a quantification task y_q = (y_2, ..., y_n), n = 10 (including infarct size, percentage of infarct size, percentage of segments, perimeter, centroid, major axis length, minor axis length, orientation and transmurality). The DSTGAN applies the GAN model in the conditional setting via three functional parts: 1) The spatiotemporal variation encoder (STVE) E(·) simultaneously encodes the enriched local correlation structures of different spatial locations and temporal periods in X through spatiotemporal variation learning with a pyramidal structure (Goodfellow et al., 2016). The output of the STVE, z, preserves a comprehensive representation of the input X, which includes all feature maps from the low level to the high level. 2) The cross-task generator (CTG) G(·) generates the segmentation task and quantification task results simultaneously and
Figure 4: The proposed DSTGAN effectively integrates the spatiotemporal variation encoder, cross-task generator, and three intra-/inter-task label relatedness discriminators to accurately segment (Seg) and quantify (Quan) MIs.
reciprocally from the pyramid feature map z via a top-down pathway, and multiple connections upsample and connect the beneficial interaction feature maps of these two related tasks. The output of the CTG can be expressed as G(z) = Y. 3) Three discriminators (segmentation discriminator D_s, quantification discriminator D_q, and relationship discriminator D_r) are conditioned on X and iteratively imposed on G. Among them, D_s and D_q force the label relatedness within the segmentation y_1 and within the quantification y_2, ..., y_n, respectively, to conform to the relatedness in the ground truth; D_r forces the relationship between y_1 and y_2, ..., y_n to conform to that in the ground truth.
3.1. Spatiotemporal variation encoder (STVE) to enrich the feature extraction of MI

The STVE innovatively leverages 4 spatiotemporal variation convolution blocks (STVCBs) to build a pyramidal feature that accurately learns the left ventricular morphology and kinematic abnormalities. As shown in Fig. 5, this framework with four STVCBs uses a spatiotemporal scaling step of 2 to sequentially encode the cine MR images X from high resolution to low resolution and outputs four spatiotemporal-resolution feature maps. These four feature maps are then combined into a set of pyramidal features z. Specifically, the four STVCBs are block1, block2, block3, and block4 and correspond to the four output
[Figure 5 layout: the input cine MR (64 × 64 × 25, frames T1 ... T25) passes through stacked STVCBs into a feature pyramid structure, giving a strong spatiotemporal representation for MIs of all shapes; within a spatiotemporal variation convolution block (STVCB), parallel convolutions operate on images or features of size W × H × T with windows at W × H × T (Conv 1), W/2 × H × T (Conv 2), W × H/2 × T (Conv 3), ..., W/2 × H/2 × T/2 (Conv k), designed to model different local correlation structures of different spatial locations and time periods of MI.]
Figure 5: The proposed STVE systematically models different local correlation structures of different spatial locations and time periods of MIs and other tissues in successive frames using the innovative design of multiple spatiotemporal variation convolution blocks and the spatiotemporal feature pyramid structure. The STVE maps the input cine MR images to a feature pyramid z.
spatiotemporal feature maps {C1, C2, C3, C4}, whose spatiotemporal resolutions are 32 × 32 × 16, 16 × 16 × 6, 8 × 8 × 3, and 4 × 4 × 1, respectively.

The benefit of such pyramid features is two-fold: First, by using a multiple-scale feature learning framework, the encoder can handle asymmetric distortion (Falsetti et al., 1981), a mutable myocardial motion and deformation pattern caused by MI. Second, by using strided convolutions instead of pooling layers, the encoder is fully differentiable and learns its own spatial downsampling.
Each STVCB integrates multiple spatiotemporal-scale sliding sub-blocks that consist of 3DConvs and a ConvLSTM to produce different kernel sizes and avoid the defects of using them separately. Multiple sliding convolutional windows of different kernel sizes are used to learn the feature map from the input of a block. Different sliding windows with different kernel sizes proportionally scale the spatiotemporal resolution of the STVCB to simultaneously learn the spatiotemporal dependency structure over different spatial scales and multiple time periods. Specifically, 3 scales and 3 aspect ratios are the default for each STVCB. Given the block input size W0, H0, and T0, the scale ratios are {1, 1/2, 1/4} for all of W0, H0, T0, and the aspect ratios are (1, 1, 2), (1, 2, 1), (2, 1, 1) for both scale ratios {1/2, 1/4}. ConvLSTM is selected for the convolutional windows with a scale ratio of 1, i.e., the input feature map of an STVCB. This selection is natural because ConvLSTM handles long temporal information with spatial correlation better than 3DConv. The other scales and aspect ratios use 3DConv because it can directly and systematically model a short time period by simply adjusting the size of the convolution kernel.
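A minimal Keras sketch of one STVCB and the four-block pyramid is given below; the branch kernel sizes, filter counts, and the exact mapping of the scale/aspect ratios onto Conv3D kernels are illustrative assumptions, while the overall pattern (a ConvLSTM branch at scale ratio 1, 3DConv branches with different aspect ratios, and a strided convolution for the scaling step of 2) follows the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def stvcb(x, filters=64):
    """A sketch of one spatiotemporal variation convolution block (STVCB).
    x: (batch, T, H, W, C) features of the cine MR sequence."""
    # ConvLSTM branch: long-range temporal dependencies at scale ratio 1.
    b0 = layers.ConvLSTM2D(filters, 3, padding="same", return_sequences=True)(x)
    # 3DConv branches: short-period patterns with different aspect ratios
    # (kernel order is (T, H, W) for channels-last 5D inputs).
    b1 = layers.Conv3D(filters, (3, 3, 3), padding="same")(x)
    b2 = layers.Conv3D(filters, (1, 3, 5), padding="same")(x)  # wide in W
    b3 = layers.Conv3D(filters, (5, 3, 1), padding="same")(x)  # long in T
    y = layers.Concatenate()([b0, b1, b2, b3])
    # Strided convolution implements the spatiotemporal scaling step of 2
    # (instead of pooling, so the block stays fully differentiable).
    y = layers.Conv3D(filters, 3, strides=2, padding="same")(y)
    return layers.LeakyReLU()(layers.BatchNormalization()(y))

# Four stacked blocks yield the pyramid z = {C1, C2, C3, C4}; the exact
# resolutions (32 x 32 x 16, ..., 4 x 4 x 1 in the text) depend on padding.
inp = layers.Input((25, 64, 64, 1))
c, pyramid = inp, []
for _ in range(4):
    c = stvcb(c)
    pyramid.append(c)
stve = tf.keras.Model(inp, pyramid)
```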
3.2. Cross-task generator (CTG) to share the commonalities between the tasks

The CTG upsamples and merges feature maps from the spatiotemporal feature pyramid z in a top-down manner, and it reciprocally connects the features of different layers and different tasks to jointly generate the MI segmentation and quantification results, as shown in Fig. 6. The CTG efficiently integrates coarse, high-layer information with fine, low-layer information to avoid contextual information loss and gradient dispersion.
• For the segmentation task, the CTG component is similar to U-Net (Ronneberger et al., 2015). It begins at the coarsest-resolution feature map C4 and merges the rest of z (C3, C2, C1) into features of matching spatial size during iterated upsampling, with an upsampling factor of 2. In other words, the CTG merges the feature map output by each upsampling unit with the corresponding {C1, C2, C3, C4} map (a 1 × 1 convolutional layer unifies the dimensions) and feeds the merged features into the next upsampling unit.

• For the quantification task, the CTG component is modified from the FPN (Lin et al., 2017). It uses five fully connected layers to extract five quantification-related feature vectors {Q1, Q2, Q3, Q4, Q5} from five scales of feature maps, namely C4 and the four outputs of the upsampling units. It then combines these five features through lateral connections to regress the final quantification result. All fully connected layers have dimensionality 256.
Moreover, the CTG also exploits the detailed commonalities and differences between the segmentation task and the quantification task to produce beneficial interactions among the tasks through shared training.
Figure 6: The proposed CTG simultaneously and accurately generates the MI segmentation and quantification results using the creative design of top-down pathways and cross-task connections. After concatenating z, the CTG maps the input z to the generated result Y.
Both the segmentation and quantification components of the CTG use the shared upsampling outputs and the pyramid feature z, and the weights and biases of these two related tasks are forcibly shared and mutually improved during training.
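The following Keras sketch shows the CTG pattern just described: a U-Net-like top-down pathway with 1 × 1 lateral convolutions for segmentation, plus five 256-dimensional fully connected heads (from C4 and the four upsampling units) combined for the quantification regression. It assumes the pyramid features have already been reduced to 2D maps over the time axis; filter counts and the output activations are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def ctg(pyramid, n_indices=10):
    """Cross-task generator (sketch). pyramid = [C1, C2, C3, C4] are 2D
    feature maps at 32, 16, 8 and 4 px (time axis assumed already reduced)."""
    c1, c2, c3, c4 = pyramid
    # Q1 comes from C4; Q2..Q5 come from the four upsampling units below.
    q_feats = [layers.Dense(256)(layers.GlobalAveragePooling2D()(c4))]
    x = c4
    for lateral in (c3, c2, c1, None):           # top-down pathway, 2x steps
        x = layers.UpSampling2D(2, interpolation="nearest")(x)
        x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
        if lateral is not None:                  # 1 x 1 conv unifies dimensions
            x = layers.Add()([x, layers.Conv2D(256, 1)(lateral)])
        # Cross-task connection: every upsampling unit also feeds the
        # quantification branch, so the two tasks share the representation.
        q_feats.append(layers.Dense(256)(layers.GlobalAveragePooling2D()(x)))
    y_seg = layers.Conv2D(1, 1, activation="sigmoid")(x)     # 64 x 64 mask
    y_quan = layers.Dense(n_indices)(layers.Concatenate()(q_feats))
    return y_seg, y_quan

# Example wiring from four dummy pyramid inputs:
ins = [layers.Input((s, s, 256)) for s in (32, 16, 8, 4)]
generator = tf.keras.Model(ins, ctg(ins))
```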
3.3. Iterative label relatedness discriminators to regularize the contiguous generation of intra-/inter-tasks

The iterative label relatedness discriminators (D_s, D_q, and D_r) leverage the intrinsic label relatedness between and within the segmentation and quantification tasks to regularize the generated results (Fig. 7), which effectively improves the accuracy and generalization of the generation. The three discriminators model three types of intrinsic label relatedness: 1) D_s models intra-task relatedness within the pixels of the segmentation image (such as spatial structure information); 2) D_q models intra-task relatedness within the values of the quantification indices (such as the geometric relationship between the area and the perimeter); 3) D_r models inter-task relatedness between segmentation and quantification (such as the relationship between the perimeter measured on the segmentation and the quantified perimeter). The three discriminators check that these relatedness characteristics are consistent with the ground truth and then improve the generation performance by refining the relatedness characteristics of the generated results until they match the ground truth during adversarial training. All discriminators are conditioned on the spatiotemporal distribution of the input cine MR dataset, c (obtained after 5 ConvLSTM-BatchNorm-ReLU blocks).
The discriminators D_s and D_q use a CNN and a Bi-LSTM (Huang et al., 2015), respectively, to build the intra-task label relatedness for segmentation and quantification. For D_s, since the CNN-based model (4 layers, 3 × 3 kernels, max-pooling, stride 2) can assess the spatial correlation of label variables over a field of view spanning the entire image or a large portion of it, mismatches in the label relatedness statistics can be penalized by the adversarial loss term, which is the binary cross-entropy. For D_q, since the Bi-LSTM (3 layers) can consider a complete contextual relationship to learn a vector pattern, the adversarial process forces the distribution of the generated y_q to gradually approach the prior distribution of the ground truth. Note that c and y_q are uniformly scaled to dimension d = 100 via duplication or 1DConv.

The discriminator D_r uses a Bi-LSTM to build the inter-task label relatedness between segmentation and quantification. Specifically, the segmentation results y_s are fed into two downsampling Bi-LSTM units with dimension d. These features are then concatenated along the channel dimension with the scaled quantification features and fed into three layers of Bi-LSTMs to jointly learn the features across the image and the indices. Finally, a fully connected layer with one node produces the decision score. Such an inter-task label relatedness learning network helps the CTG rectify small defects in the generated results, which enhances detail in the MI boundary areas and makes the quantification more stable.
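As a rough sketch, the inter-task discriminator D_r can be assembled as below; reading the 64 × 64 segmentation map as a 64-step sequence of rows and broadcasting the d-dimensional quantification features over that sequence before concatenation are assumptions, since the text specifies only the unit counts (two Bi-LSTM units for the segmentation, three joint Bi-LSTM layers, d = 100) and the single-node output.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_dr(d=100, n_indices=10):
    seg = layers.Input((64, 64))        # segmentation y_s, rows as time steps
    quan = layers.Input((n_indices,))   # quantification y_q
    # Two Bi-LSTM units over the segmentation rows.
    h = layers.Bidirectional(layers.LSTM(d, return_sequences=True))(seg)
    h = layers.Bidirectional(layers.LSTM(d, return_sequences=True))(h)
    # Scale y_q to dimension d and concatenate along the channel dimension.
    q = layers.RepeatVector(64)(layers.Dense(d)(quan))
    h = layers.Concatenate()([h, q])
    # Three joint Bi-LSTM layers learn features across image and indices.
    h = layers.Bidirectional(layers.LSTM(d, return_sequences=True))(h)
    h = layers.Bidirectional(layers.LSTM(d, return_sequences=True))(h)
    h = layers.Bidirectional(layers.LSTM(d))(h)
    score = layers.Dense(1)(h)          # one-node decision score (critic-style)
    return tf.keras.Model([seg, quan], score)

d_r = build_dr()
```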
3.4. Task-aware loss function to effectively and steadily promote network training

Our task-aware loss function (Eq. 7) is dedicatedly formulated to train the DSTGAN stably and accurately by combining three loss terms: the multi-task generation loss (Eq. 2), the adversarial loss (Eqs. 3 and 4), and the reciprocal loss (Eq. 5).

Generation loss. To improve the accuracy of the segmented images and quantified indices, the multi-task generation leverages an elastic-net loss function that integrates L1 (lasso) and L2 (ridge) regularization to obtain a sparse model that encourages less blurring and improves the regression performance when predicting the related tasks.
[Figure 7 layout: discriminator D_s for intra-label relatedness of segmentation takes the cine CMR condition and a segmentation result and outputs {0, 1}; discriminator D_q for intra-label relatedness of quantification takes the condition and a quantification result; discriminator D_r for inter-label relatedness between Seg and Quan takes the segmentation and quantification results G(E_st(X)) jointly; the three discriminators are trained iteratively.]
Figure 7: Three inter-/intra-task label relatedness discriminators effectively improve the performance of the generator by directly regulating the distribution of results to approximate the clinical standards.
The generation loss L_g(E, G, X) is formulated as follows:

$$L_g = \min_{E,G}\, \ell\big(\hat{Y},\, G(E(X))\big) = \frac{1}{n} \sum_{i=1}^{n} \big( \|y_i - \hat{y}_i\|_1 + \|y_i - \hat{y}_i\|_2^2 \big) \tag{2}$$

where $\hat{Y} = (\hat{y}_1, \hat{y}_2, ..., \hat{y}_n)$ is the ground truth.
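A direct TensorFlow transcription of Eq. (2) is sketched below; flattening each output per sample and averaging the per-output terms are reduction choices assumed here, since the equation leaves them implicit.

```python
import tensorflow as tf

def generation_loss(y_true_list, y_pred_list):
    """Elastic-net generation loss of Eq. (2): an L1 (lasso) plus a squared
    L2 (ridge) term per task output, averaged over the n outputs and the
    batch. The arguments are parallel lists, one tensor per task output."""
    terms = []
    for y, y_hat in zip(y_true_list, y_pred_list):
        d = tf.reshape(y - y_hat, [tf.shape(y)[0], -1])      # flatten per sample
        terms.append(tf.reduce_sum(tf.abs(d), axis=-1)       # ||.||_1
                     + tf.reduce_sum(tf.square(d), axis=-1)) # ||.||_2^2
    return tf.reduce_mean(tf.add_n(terms)) / len(terms)
```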
Adversarial loss. To make the adversarial training more stable, the adversarial loss is adopted from WGAN-GP (Gulrajani et al., 2017) for all three discriminators. In particular, WGAN-GP avoids potential discontinuity with respect to the generator's parameters and the local saturation that leads to vanishing gradients by using the continuous Earth-Mover distance to replace the Jensen-Shannon divergence in the GAN formulation. It also adds a gradient penalty on the norm of the gradient of the critic with respect to its input, thereby avoiding undesired behaviors. Thus, the adversarial loss L_a, consisting of the intra-task label relatedness term L_{a1}(E, G, X, D_s, D_q) and the inter-task label relatedness term L_{a2}(G, X, Y, D_r), is given as:
$$L_{a1} = \mathbb{E}_{E(X) \sim P(X)}\big[D_{\{s,q\}}(G(E(X)))\big] - \mathbb{E}_{y_{\{s,q\}} \sim P_{\{s,q\}}}\big[D_{\{s,q\}}(y_{\{s,q\}}, X)\big] + \lambda_1\, \mathbb{E}_{I_{\{s,q\}} \sim P_{I_{\{s,q\}}}}\Big[\big(\big\|\nabla_{I_{\{s,q\}}} D_{\{s,q\}}(I_{\{s,q\}})\big\|_2 - 1\big)^2\Big] \tag{3}$$
L_{a1} is a min-max objective function to train the generator G, the segmentation discriminator D_s, and the quantification discriminator D_q with condition X. I denotes the random samples at which the gradient norm is penalized, and λ_1 is the penalty coefficient. The encoded pyramid feature E(X) is imposed on D_s or D_q, the distribution of the training data is P_{s,q}, and the distribution of the cine MR is P(X).
Similarly, the discriminator D_r and the generator G with condition X can be trained by:
$$L_{a2} = \mathbb{E}_{E(X) \sim P(X)}\big[D_r(G(E(X)))\big] - \mathbb{E}_{Y \sim P_Y}\big[D_r(Y, X)\big] + \lambda_1\, \mathbb{E}_{I_Y \sim P_{I_Y}}\Big[\big(\|\nabla_{I_Y} D_r(I_Y)\|_2 - 1\big)^2\Big] \tag{4}$$
L_{a2} drives the discriminator D_r on both the segmentation image and the quantification indices to force the generator to further improve its performance. L_{a1} (for D_s and D_q) and L_{a2} (for D_r) are trained with an iterative (separate) strategy.
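The WGAN-GP critic loss shared by Eqs. (3) and (4) can be sketched as follows; the conditioning on the cine MR input X is omitted for brevity, and the uniform interpolation used to draw the penalty samples I follows Gulrajani et al. (2017).

```python
import tensorflow as tf

def wgan_gp_critic_loss(critic, real, fake, lam1=10.0):
    """Earth-Mover critic estimate E[D(fake)] - E[D(real)] plus the gradient
    penalty on random interpolates I between real and fake samples."""
    batch = tf.shape(real)[0]
    eps = tf.random.uniform([batch] + [1] * (real.shape.rank - 1))
    interp = eps * real + (1.0 - eps) * fake           # penalty samples I
    with tf.GradientTape() as tape:
        tape.watch(interp)
        score = critic(interp)
    grads = tf.reshape(tape.gradient(score, interp), [batch, -1])
    gp = tf.reduce_mean(tf.square(tf.norm(grads, axis=1) - 1.0))
    return (tf.reduce_mean(critic(fake)) - tf.reduce_mean(critic(real))
            + lam1 * gp)
```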
Reciprocal loss. To further improve generalization and stability after adversarial training, the reciprocal loss minimizes the inter-task label relatedness difference between y_s and y_q while maintaining performance. The intuition behind the reciprocal loss is to maintain the accuracy of the results while reducing the difference between two quantification results: 1) the result regressed directly by the network and 2) the result computed from the segmentation image. These two sets of results should be the same because they describe the same infarction. Thus, the reciprocal loss is added as an auxiliary term behind the generator loss and the adversarial loss. It gives the network a reciprocal constraint between these two results and an inductive bias that further improves generalization and stability once adversarial training has produced accurate results. Note that the percentage of infarct size, the percentage of segments, and the transmurality are not included. The reciprocal loss L_r(G, X, y_q, y_s) is defined as:
$$L_r = \mathbb{E}_{Y \sim P_Y}\big[\|y_q - \tilde{y}_q\|_1 + \|y_q - \tilde{y}_q\|_2^2\big] \tag{5}$$

where $\tilde{y}_q$ is the quantification result computed from the segmentation image.
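The segmentation-derived quantification $\tilde{y}_q$ in Eq. (5) can be sketched with scikit-image region properties, which map naturally onto the geometric indices; the 1.26 mm pixel spacing comes from Section 4, while the choice of regionprops fields and keeping the largest labeled region are assumptions.

```python
import numpy as np
from skimage.measure import label, regionprops

def indices_from_mask(mask, px_mm=1.26):
    """Derive geometric MI indices from a binary segmentation mask."""
    regions = regionprops(label(mask.astype(int)))
    p = max(regions, key=lambda r: r.area)       # keep the largest region
    return {
        "infarct_size_mm2": p.area * px_mm ** 2,
        "perimeter_mm": p.perimeter * px_mm,
        "centroid_mm": tuple(c * px_mm for c in p.centroid),
        "major_axis_mm": p.major_axis_length * px_mm,
        "minor_axis_mm": p.minor_axis_length * px_mm,
        "orientation_deg": np.degrees(p.orientation),
    }
```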
Finally, to generate the output Y, the full objective L linearly combines all previous partial losses (generation loss L_g, adversarial loss L_{a{1,2}}, and reciprocal loss L_r):

$$L = \lambda_2 L_g(E, G) + \lambda_3 L_{a\{1,2\}}(E, G, X, Y, D_s, D_q, D_r) + \lambda_4 L_r(G, X, y_q, y_s) \tag{6}$$
where λ_2, λ_3 and λ_4 are hyper-parameters that balance the relative importance of the different terms; they are tuned to the values that achieved the overall highest performance in the experiments. We then define the following minimax problem:

$$\arg\min_{E,G}\; \max_{D \in \{(D_s, D_q), (D_r)\}} L \tag{7}$$
4. Materials and Implementation Details

Materials. A total of 165 patients were retrospectively selected from two clinical databases for a retrospective observational study, including 140 MI patients and 25 non-MI patients. All patients were admitted with acute ST-segment elevation MI (in at least 2 contiguous electrocardiogram leads). All patients completed cine (12375 images) and LGE (495 images) MR imaging scans. Cine MR imaging was performed using a 3T MR system (MAGNETOM Verio, Siemens Healthcare) with a 32-channel cardiac coil. After the scout MR images were obtained, cine images were acquired during repeated breath-holds to cover the entire heart. LGE MR imaging was performed in the same orientations and with the same slice thickness as cine imaging using a segmented inversion-recovery gradient-echo sequence ten minutes after the intravenous injection of gadolinium (Magnevist, 0.2 mmol/kg; Bayer Schering Healthcare). The inversion time was optimized to null the myocardium during the image acquisition. The imaging parameters were as follows: TR = 10.5 ms, TE = 5.4 ms, FA = 30°, matrix of 256 × 162, slice thickness of 6 mm, BW of 140 Hz/px, and 25 cardiac phases; each pixel is 1.26 × 1.26 mm.
The ground truth was obtained by two radiologists. In detail, two experienced (over 10 years) radiologists (Dr. N. Zhang and Dr. L. Xu) manually assessed the LGE MR images (the well-known gold standard in clinical MI assessment) and segmented the infarction area to obtain the ground truth of the segmentation; if a disagreement occurred, it was resolved by consensus between the experts. Then, Dr. N. Zhang measured the ground-truth segmentation to obtain the ground truth of the quantification.
Implementation details. The DSTGAN is designed to generate 64 × 64 binary images and 1 × 10 indices directly from X = (x_1, x_2, ..., x_T) ∈ R^{H×W×T}, where H and W are the height and width of each temporal frame, respectively (H = W = 64), and T is the temporal step, set to T = 25. The input is first encoded into a feature pyramid, which consists of 32 × 32 × 256T, 16 × 16 × 256T, 8 × 8 × 256T, and 4 × 4 × 256T tensors, where T is the number of temporal resolutions in the tensor. Batch normalization (Goodfellow et al., 2016) and LeakyReLU activation (Goodfellow et al., 2016) are applied after every block of the encoder except the last one. The residual blocks consist of 3 × 3 stride-1 convolutions, batch normalization and LeakyReLU. The upsampling blocks consist of nearest-neighbor upsampling followed by a 3 × 3 stride-1 convolution. All code is based on Keras (TensorFlow), with 8× NVIDIA P100 GPUs for training. All networks are trained using the ADAM solver (Kingma and Ba, 2014) with a batch size of 6 and an initial learning rate of 0.001. λ_1 is 10, λ_2 is 0.5, λ_3 is 1, and λ_4 is 10. An iterative training strategy is used: stage 1 iteratively trains {E, G} and {D_s, D_q} for 800 epochs while fixing D_r; stage 2 then iteratively trains G and D_r for another 600 epochs while fixing {E} and {D_s, D_q}. There are thus two iterative training processes. A network with heart localization layers, as described in (Xu et al., 2017), was used to automatically crop the cine MR images to 64 × 64 region-of-interest sequences including the LV. Moreover, because LGE MR imaging was performed in the same orientations and with the same slice thickness as cine imaging, no extra steps were required to align them. All experiments were assessed with a 10-fold cross-validation test (no patient appears in more than one group).
Evaluation criteria. The performance of the proposed DSTGAN framework is evaluated based on the generation accuracy of the MI segmentation and quantification. For the segmentation task, the overall pixel classification accuracy, sensitivity, specificity, Dice coefficient (Dice = 2|Y ∩ Ŷ| / (|Y| + |Ŷ|) × 100%) and intersection-over-union (IoU = |Y ∩ Ŷ| / |Y ∪ Ŷ| × 100%) between the ground truth and the generated image are calculated to evaluate the accuracy. For the quantification task, the mean absolute error (MAE = (1/n) Σ_{i=1}^{n} |Y_i − Ŷ_i|) and statistical hypothesis testing (p-value) between the ground-truth and generated indices are calculated to evaluate the accuracy.
[Figure 8 rows: LGE image; ground truth segmentation; our method's segmentation; difference after overlap.]
Figure 8: The infarction areas segmented by our method (third row) are consistent with the ground truth (second row).
Note that the MAE of the centroid is the sum of the MAE along the X-axis and the MAE along the Y-axis. The p-value is obtained by an independent t-test between the predicted values and the ground truth. If it is greater than 0.05 (i.e., p > .05), the two groups can be treated as consistent. If p < 0.05, the two groups violate the assumption of homogeneity of variances and can be treated as inconsistent.
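For reference, the segmentation and quantification metrics defined above can be computed as in the NumPy sketch below; the binary-mask and index-vector input conventions are assumptions.

```python
import numpy as np

def dice(y_true, y_pred):
    """Dice = 2|Y ∩ Ŷ| / (|Y| + |Ŷ|) x 100% for binary masks."""
    inter = np.logical_and(y_true, y_pred).sum()
    return 200.0 * inter / (y_true.sum() + y_pred.sum())

def iou(y_true, y_pred):
    """IoU = |Y ∩ Ŷ| / |Y ∪ Ŷ| x 100% for binary masks."""
    inter = np.logical_and(y_true, y_pred).sum()
    return 100.0 * inter / np.logical_or(y_true, y_pred).sum()

def mae(y_true, y_pred):
    """Mean absolute error between ground-truth and generated indices."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))
```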
5. Experiments and Results

The DSTGAN demonstrates high performance, with a pixel classification accuracy of 96.98%. The MAE of the centroid is 0.96 mm relative to the ground truth obtained manually by human experts, which demonstrates the effectiveness of this method in MI segmentation and quantification.
5.1. Accurate MI segmentation and precise MI quantification

The experimental results show that the DSTGAN can accurately segment the MI, as shown in Figure 8. The DSTGAN achieves an overall pixel classification accuracy of 96.98% with a sensitivity of 92.15% and a specificity of 98.70%. The Dice coefficient is 92.89 ± 0.33%, and the IoU is 82.16 ± 0.27%; the ROC and PR curves are shown in Fig. 9. The ground truth and the result are binary images; each pixel is assessed as infarcted or normal (0 or 1).
Figure 9: The ROC and PR curves demonstrate that the DSTGAN outperforms each separate component and other spatiotemporal learning methods.
The DSTGAN produces a good quantification of the MI, as shown in Tables 1 and 2. The MAE computed between the ground truth and our estimate is 17.15 ± 8.91 mm² for the infarct size; 9.24 ± 7.73% for the percentage of infarct size; 4.07 ± 2.91% for the percentage of segments; 3.95 ± 2.21 mm for the perimeter; 0.96 ± 0.77 mm for the centroid; 2.04 ± 1.10 mm for the major axis length; 1.07 ± 0.80 mm for the minor axis length; 5.47 ± 2.24° for the orientation; and 13.18 ± 9.83% for the transmurality. Table 3 summarizes the comparative results on each quantification index subgroup. The ground truth was obtained by radiologists who measured the segmentation results of DE-MR images.

Tables 1 and 2 also demonstrate that joint learning of the segmentation task and the quantification task outperforms each individual task. Joint learning improves the effectiveness and stability of both segmentation and quantification in terms of accuracy (1% improvement), Dice (2%), IoU (3%), infarct size (1 mm), percentage of infarct size (4%), percentage of segments (0.5%), perimeter (1 mm), major axis length (0.1 mm), minor axis length (0.7 mm), orientation (1°), and transmurality (3%).
Table 1: Each technological innovation component in the DSTGAN effectively improves the accuracy of the MI segmentation and part of the quantification.

                      Segmentation                               Quantification
                      Accuracy (%)  Dice (%)     IOU (%)      Size (mm²)    Per-size (%)  Per-Seg (%)
ConvLSTM (Baseline)   89.27±3.04    83.70±3.27   66.10±6.91   72.73±62.89   10.54±8.36    8.91±7.07
+ STVCB               90.94±1.15    83.68±3.46   69.63±3.82   80.37±67.14   11.63±9.91    6.95±5.27
+ Feature pyramid     92.37±0.81    85.68±1.01   70.51±2.66   49.32±42.41   10.14±8.72    4.46±3.16
  structure
+ CTG                 95.10±0.56    90.68±0.73   73.47±1.41   27.26±24.91   12.64±9.85    5.17±4.27
+ Ds                  95.92±0.58    91.26±0.58   77.34±0.30   -             -             -
+ Dq                  -             -            -            18.72±14.15   10.11±7.04    4.62±3.30
+ Ds and Dq           96.71±0.47    91.43±0.30   79.52±0.32   19.24±11.93   8.76±6.91     4.34±3.57
+ Dr                  96.93±0.23    92.11±0.38   78.85±0.37   20.61±10.04   9.69±6.72     1.42±3.34
+ Conditional         96.98±0.31    92.89±0.33   80.16±0.27   17.15±8.91    9.24±7.73     4.07±2.91

MAE is used to evaluate all the quantification indices. Size = infarct size, Per-size = percentage of infarct size, Per-Seg = percentage of segments.
5.2. Advantages of the spatiotemporal feature encoder

Figures 9 and 10 and Tables 1, 2 and 3 indicate that the STVE performs better than all other frameworks because it creates a pyramid structure to produce a coarse-to-fine hierarchical feature and achieves a strong spatial and temporal representation over multiple scales and aspect ratios. To evaluate the feature-learning performance, we replaced the STVE with a version without the pyramid structure and without STVCBs, and with recently popular methods (3DConvs, ConvLSTMs, 3DConvs + LSTMs, CNNs and LSTMs), in our framework. These experimental results demonstrate that integrating different spatiotemporal scales and aspect ratios with hierarchical, coarse-to-fine spatiotemporal feature learning is more reliable and robust than not using them or using them independently.
Table 2: Table 1 continued.

                      Quantification
                      Peri (mm)     Cent (mm)    Majo (mm)    Mino (mm)    Orie (°)      Tran (%)
ConvLSTM (Baseline)   10.17±8.20    4.55±2.97    9.49±7.31    8.64±7.08    12.62±10.35   18.61±15.96
+ STVCB               7.37±6.94     2.78±1.28    6.84±5.01    5.35±3.50    8.27±6.22     15.07±13.72
+ Feature pyramid     6.94±5.40     1.42±0.72    7.08±5.49    4.58±3.38    8.14±7.15     18.31±16.83
  structure
+ CTG                 8.61±6.68     1.13±0.88    5.34±3.97    2.23±1.39    7.74±5.81     15.27±13.35
+ Ds                  -             -            -            -            -             -
+ Dq                  4.86±2.28     0.91±0.74    2.12±1.50    1.76±0.94    6.58±4.29     16.37±10.92
+ Ds and Dq           5.07±1.16     0.97±0.57    2.31±1.53    1.60±0.77    6.36±3.48     16.18±11.40
+ Dr                  4.21±3.04     0.92±0.72    2.68±1.78    1.16±0.82    7.60±4.07     19.86±12.09
+ Conditional         3.95±2.21     0.96±0.77    2.04±1.10    1.07±0.80    5.47±2.24     13.18±9.38

MAE is used to evaluate all the quantification indices. Peri = perimeter, Cent = centroid, Majo = major axis length, Mino = minor axis length, Orie = orientation, Tran = transmurality.
This integration provides our framework with a more comprehensive understanding than all existing methods in learning the morphological and kinematic abnormalities of the left ventricle. Furthermore, our experiments demonstrate that the STVCB, which combines ConvLSTM and 3DConv to simultaneously extract features at different spatiotemporal resolutions, is the best choice among similar methods. This combination avoids the disadvantages of ConvLSTM, which has only a single spatiotemporal resolution, and of 3DConv, which is only suitable for short-term spatiotemporal learning, and it directly models the advanced temporal and spatial abstraction of MIs by considering spatial and temporal correlations.
Table 3: The STVE works better than other popular spatiotemporal learning constructs.

                     Dice      Infarct size (mm²)
STVE                 92.89%    17.15
3DConvs              81.67%    65.80
ConvLSTMs            86.37%    49.78
3DConvs + LSTMs      87.46%    42.01
LSTMs                74.07%    116.74
CNNs                 71.91%    224.44
5.3. Advantage of the conditional GAN architecture

Figure 9 and Tables 1 and 2 show that the DSTGAN has better segmentation and quantification performance than STVE + CTG without the GAN architecture because it enhances the inter-/intra-label relatedness via adversarial learning. To evaluate this ability, the discriminators D_s, D_q and D_r were individually implemented in the CTG for the estimated tasks. Among these results, Tables 1 and 2 indicate that integrating the top-down, coarse-to-fine feature map generation process with the cross-task connection sharing process improves the accuracy of the task generation. This integration reduces the loss of information when a feature map is downsampled or upsampled and improves the shared representation between the segmentation and quantification tasks by learning the reciprocal interaction between them. Tables 1 and 2 and Figure 9 indicate that the three types of discriminators effectively improve the accuracy and robustness of the generated results by learning the relatedness within and among the task labels, and this architecture is flexible and easy to implement. Moreover, Tables 1 and 2 indicate that the conditional information effectively controls the generation process to reduce the fluctuation of the results and improve the robustness of the network.
5.4. Advantage of jointly using the L1 and L2 loss

Table 4 demonstrates that our generator loss, which jointly uses the L1 and L2 terms, achieves an overall better solution than using the L1 term alone. Jointly using L1 and L2 improves the performance for the infarct size, the percentage of segments, the major axis length, the minor axis length, and the orientation. This is because our network involves multi-task learning, and the L2 term of L1 + L2 outperforms L1 in the regression part in both efficiency and effectiveness. Moreover, jointly using L1 and L2 significantly improves the stability of all segmentation and quantification results. This is because the addition of the L2 term
Table 4: Jointly using the L1 and L2 loss achieves a better solution than the L1 term alone.

         Dice          Size          Per-size     Per-Seg      Peri
L1       92.90±0.51    17.94±9.11    9.06±8.62    4.61±3.27    3.90±2.88
L1+L2    92.89±0.33    17.15±8.91    9.24±7.73    4.07±2.91    3.95±2.21

         Cent          Majo          Mino         Orie         Tran
L1       0.95±0.77     2.21±1.11     1.13±0.82    5.49±3.09    13.02±10.52
L1+L2    0.96±0.77     2.04±1.10     1.07±0.80    5.47±2.24    13.18±9.38

MAE is used to evaluate all the quantification indices. Peri = perimeter, Cent = centroid, Majo = major axis length, Mino = minor axis length, Orie = orientation, Tran = transmurality.
avoids the issue that the derivative of L1 is discontinuous in all tasks, and thus more stable results are achieved.
5.5. Comparison with state-of-the-art methods

Table 5 demonstrates that the DSTGAN achieves higher segmentation and quantification accuracy, and covers more quantification terms, than the existing segmentation/quantification (Seg/Qua) methods. MuTGAN corresponds to our MICCAI 2018 work. The terms 'Seg' and 'Qua' indicate that a method can directly segment or quantify myocardial infarction without additional steps. The term 'NaN' indicates that a method cannot estimate this index. Overall, by comparison, the DSTGAN consistently outperforms our previous method and all state-of-the-art MI segmentation methods (Xue et al., 2017; Popescu et al., 2016; Bleton et al., 2015) in terms of accuracy, Dice coefficient, IoU, infarct size, perimeter and centroid, and the MI's transmurality and percentage of infarct size of the LV are obtained for the first time. These significant improvements in segmentation, particularly in IoU, indicate that the DSTGAN improves on the existing methods, which suffer from large numbers of badly scattered points, unconnected areas and severe overestimation problems (Xue et al., 2017). A possible reason is that our method systematically models the motion and deformation of 2D+T data instead of the pixel-by-pixel estimation in existing methods and uses the inter-/intra-label relatedness to regulate the training process. Furthermore, the improvement in MI quantification demonstrates the superiority of the DSTGAN as a direct quantification method and as a combination of quantification and segmentation. This work exceeds the work in MICCAI-2018 through a more effective spatiotemporal feature learning framework, a more detailed generation structure, and more advanced training strategies.
Table 5: DSTGANs simultaneously segment and quantify MIs without a contrast agent and yield higher performance than some state-of-the-art methods.

                    DSTGANs      MuTGAN       Xu et al.          Popescu et al.          Bleton et al.
                                              (Xu et al., 2017)  (Popescu et al., 2016)  (Bleton et al., 2015)
Seg/Qua             Seg and Qua  Seg and Qua  Seg only           Seg only                Seg only
Accuracy            96.98%       96.57%       94.73%             86.47%                  84.39%
Dice                92.89%       90.18%       89.40%             75.33%                  71.51%
IoU                 82.16%       79.51%       76.28%             68.15%                  63.72%
Infarct size        17.15        23.24        93.16*             159.93*                 194.75*
Perimeter           3.95         5.27         17.83*             31.62*                  44.37*
Centroid            0.96         1.16         6.49*              8.03*                   10.92*
Pct. infarct size   4.07         NaN          NaN                NaN                     NaN
Transmurality       13.18        NaN          NaN                NaN                     NaN

* Quantification result estimated from the segmentation result. Pct. = Percentage of. Note that MuTGAN corresponds to our MICCAI 2018 work.
6. Discussion544
To the best of our knowledge, this is the first report that describes a non-545
contrast agent approach that can provide clinicians with all the required factors546
for the current clinical assessment of MI directly from cine MR imaging.547
Our approach satisfies a clinical desire to replace contrast agent administration with an inexpensive, end-to-end, low-risk, reproducible and stable tool with high clinical impact that can change the clinical procedure and management of MI patients. This is vital for clinical studies because of the limitations of contrast agent administration (e.g., gadolinium chelates), which may not be suitable for patients with chronic kidney disease according to the US Renal Data System. Specifically, >40% of patients with chronic kidney disease also suffer from cardiovascular disease (Fox et al., 2010), and 20% of acute MI patients have accompanying chronic kidney disease (Fox et al., 2010). In contrast, our framework only requires cine images, which are inevitably acquired as a routine part of a cardiac examination for the assessment of function, and it produces segmentation and quantification results comparable to those of the LGE technique. This is evidenced by Table 6, which indicates that no significant differences (P > 0.05) were found between the results obtained by our method and the ground truth obtained by experienced experts.
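The text does not state which statistical test produced the P-values in Table 6; a paired two-sided t-test over per-subject index values is one plausible choice, sketched below with synthetic stand-in data rather than study data.

```python
# Sketch of a method-vs-expert agreement test (assumption: a paired
# two-sided t-test; the specific test used in the paper is not stated).
# All arrays below are synthetic stand-ins, not study data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
n_subjects = 165
dstgan_infarct = rng.normal(9.31, 2.69, size=n_subjects)            # cm², synthetic
expert_infarct = dstgan_infarct + rng.normal(0.0, 2.5, size=n_subjects)  # zero-mean disagreement

t_stat, p_value = stats.ttest_rel(dstgan_infarct, expert_infarct)   # paired comparison
print(f"paired t = {t_stat:.3f}, P = {p_value:.3f}")                # P > 0.05 -> no significant difference
```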
Table 6: No significant differences (P > 0.05) were found between the quantification results obtained using our method and the clinician-measured values in the LGE image.

Characteristic             DSTGANs            Ground Truth       P
Infarct size (cm²)         9.31 ± 2.69        7.74 ± 2.20        .146
Pct. infarct size (%)      28.94 ± 19.31      24.47 ± 16.25      .060
Pct. segments (%)          27.39 ± 16.43      25.97 ± 12.29      .079
Perimeter (mm)             172.55 ± 73.49     170.85 ± 82.79     .102
Centroid                   X: 30.24 ± 13.66   X: 30.03 ± 13.00   .743
                           Y: 34.34 ± 10.91   Y: 33.72 ± 11.28
Major axis length (mm)     73.74 ± 26.15      72.16 ± 28.42      .515
Minor axis length (mm)     33.64 ± 21.22      31.40 ± 19.99      .260
Orientation (°)            109.44 ± 56.49     106.02 ± 56.28     .112
Transmurality (%)          77.60 ± 33.54      64.59 ± 27.38      .089

Pct. = percentage of

For the first time, our approach provides a clinical tool that can automatically quantify the transmurality and the percentage infarct size of the LV directly from cine MR images. This ability can help clinicians to more accurately 1) plan the patient's treatment because these indices are the primary determinant of
LV remodeling in long-term survivors of MI (Ørn et al., 2007), 2) diagnose LV constriction damage because the percentage infarct size of the LV is a major determinant of LV performance after MI (Ørn et al., 2007) and 3) determine the possibility of cardiac functional recovery after revascularization because lower transmurality indicates a high likelihood of recovery (Ørn et al., 2007). However, directly quantifying these two indices is extremely challenging for a diagnosis based on nonenhanced imaging because 1) small transmural and subendocardial scars often show no signal abnormalities on cine MR images and 2) additional information on the LV wall that is not included in our training data is normally required, with one index requiring the LV wall thickness and the other requiring the LV wall size. Even so, our approach successfully quantifies these two indices directly from cine MR images because it learns detailed pixel-level motion and deformation information of all tissues in the cine MR images during training (some characteristics of the LV wall have been learned and displayed), as shown in Figure 10.

Figure 10: The feature map clearly demonstrates the ability of our method to learn the motion and deformation of cine MR images. The second row shows the 32 × 32 feature maps of the corresponding segmentation ground truth (first row). The high-intensity regions in these feature maps approximately correspond to the areas most responsive to deformation and motion. The low-intensity regions represent the infarction area because infarction causes deformation and motion abnormalities (reduced motion and thinned myocardium). In addition, our method can also learn, without supervision, some of the characteristics of the LV wall through the unique spatiotemporal information of the myocardium.
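For illustration only, the two indices could be defined over binary masks as below. This is a hedged sketch of the quantities themselves; our network regresses them directly rather than deriving them from a segmentation, and the sector-based transmurality surrogate is an assumption, not the published definition.

```python
# Hedged sketch of the two indices over hypothetical binary masks.
# The sector-wise transmurality surrogate is a common proxy, NOT the
# definition used by the DSTGAN (which regresses the indices directly).
import numpy as np

def pct_infarct_size(infarct_mask: np.ndarray, lv_myo_mask: np.ndarray) -> float:
    """Percentage of the LV myocardium occupied by infarct."""
    return float(100.0 * infarct_mask.sum() / max(int(lv_myo_mask.sum()), 1))

def transmurality(infarct_mask: np.ndarray, lv_myo_mask: np.ndarray,
                  centre: tuple, n_sectors: int = 60) -> float:
    """Mean sector-wise ratio of infarcted to total wall pixels in radial
    sectors around the LV centre (a surrogate for transmural extent)."""
    ys, xs = np.nonzero(lv_myo_mask)
    if ys.size == 0:
        return 0.0
    angles = np.arctan2(ys - centre[0], xs - centre[1])
    sector = ((angles + np.pi) / (2 * np.pi) * n_sectors).astype(int) % n_sectors
    inf = infarct_mask[ys, xs].astype(float)
    ratios = [inf[sector == s].mean() for s in range(n_sectors) if (sector == s).any()]
    return float(100.0 * np.mean(ratios))
```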
Our proposed DSTGAN can be extended to similar applications because of the following technical advantages: 1) Multi-scale-kernel 3D convolution and spatiotemporal feature pyramids are first constructed to effectively handle the high spatiotemporal variability across spatial scales and temporal periods (see the sketch after this paragraph); they overcome the limitation of large gaps in the spatiotemporal patterns and disturbances in the pixel gray levels among different locations or periods. 2) The task-aware objective function learns a reciprocal commonality representation of the two related tasks to train mappings from the input cine MR images to the two related outputs, and it ensures, via adversarial training, that the generated results indeed approach the ground truth. This function overcomes the limitation of multi-task adversarial training, which otherwise requires notably different loss formulations to address different feature distributions and is therefore difficult to train. 3) Conditional adversarial learning with multiple discriminators iteratively optimizes the integrated distribution, with the input as conditional information, using the label contiguity to produce deterministic outputs. This learning strategy avoids the stochastic output that a basic conditional GAN exhibits because of the noise z, and it enhances the generation performance.
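As an illustration of advantage 1), a minimal PyTorch sketch of a multi-scale-kernel 3D convolution block over a 2D+T cine clip is given below; the kernel set (3, 5, 7) and channel widths are assumptions rather than the published architecture.

```python
# Minimal sketch of multi-scale-kernel 3D convolution over a 2D+T clip.
# Kernel sizes and channel widths are assumptions, not the published network.
import torch
import torch.nn as nn

class MultiScale3DConv(nn.Module):
    def __init__(self, in_ch: int = 1, branch_ch: int = 16):
        super().__init__()
        # One branch per kernel scale; 'same' padding keeps T x H x W intact.
        self.branches = nn.ModuleList([
            nn.Conv3d(in_ch, branch_ch, kernel_size=k, padding=k // 2)
            for k in (3, 5, 7)
        ])
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width), e.g., (N, 1, 25, 64, 64)
        return torch.cat([self.act(b(x)) for b in self.branches], dim=1)

clip = torch.randn(2, 1, 25, 64, 64)    # two synthetic cine MR clips
features = MultiScale3DConv()(clip)     # -> (2, 48, 25, 64, 64)
```

Concatenating the branches gives each spatial location access to responses at several receptive-field sizes at once, which is one way to absorb the scale and period variability described above.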
7. Conclusion
A deep spatiotemporal adversarial network has been proposed for the simultaneous segmentation and quantification of MIs without the use of contrast agents. The DSTGAN was evaluated on 165 subjects and yielded a pixel classification accuracy of 96.98%; the MAE of the infarction size was 17.15 mm². These results demonstrate that our proposed method can serve as an efficient and accurate clinical tool to systematize MI assessment and provide timely interpretations for clinical follow-up.
References
Afshin, M., Ayed, I.B., Islam, A., Goela, A., Peters, T.M., Li, S., 2012. Global assessment of cardiac function using image statistics in MRI, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 535–543.

Bijnens, B., Claus, P., Weidemann, F., Strotmann, J., Sutherland, G.R., 2007. Investigating cardiac function using motion and deformation analysis in the setting of coronary artery disease. Circulation 116, 2453–2464.

Bleton, H., Margeta, J., Lombaert, H., Delingette, H., Ayache, N., 2015. Myocardial infarct localization using neighbourhood approximation forests, in: International Workshop on Statistical Atlases and Computational Models of the Heart, Springer. pp. 108–116.

Duchateau, N., De Craene, M., Allain, P., Saloux, E., Sermesant, M., 2016. Infarct localization from myocardial deformation: prediction and uncertainty quantification by regression from a low-dimensional space. IEEE Transactions on Medical Imaging 35, 2340–2352.

Falsetti, H.L., Marcus, M.L., Kerber, R.E., Skorton, D.J., 1981. Quantification of myocardial ischemia and infarction by left ventricular imaging. Circulation 63, 747–751.

Flett, A.S., Hasleton, J., Cook, C., Hausenloy, D., Quarta, G., Ariti, C., Muthurangu, V., Moon, J.C., 2011. Evaluation of techniques for the quantification of myocardial scar of differing etiology using cardiac magnetic resonance. JACC: Cardiovascular Imaging 4, 150–156.

Fox, C.S., Muntner, P., Chen, A.Y., Alexander, K.P., Roe, M.T., Cannon, C.P., Saucedo, J.F., Kontos, M.C., Wiviott, S.D., 2010. Use of evidence-based therapies in short-term outcomes of ST-segment elevation myocardial infarction and non-ST-segment elevation myocardial infarction in patients with chronic kidney disease. Circulation 121, 357–365.

Goodfellow, I., Bengio, Y., Courville, A., Bengio, Y., 2016. Deep Learning. Volume 1. MIT Press, Cambridge.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y., 2014. Generative adversarial nets, in: Advances in Neural Information Processing Systems, pp. 2672–2680.

Gu, B., Shan, Y., Sheng, V.S., Zheng, Y., Li, S., 2018. Sparse regression with output correlation for cardiac ejection fraction estimation. Information Sciences 423, 303–312.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C., 2017. Improved training of Wasserstein GANs, in: Advances in Neural Information Processing Systems, pp. 5767–5777.

Heiberg, E., Ugander, M., Engblom, H., Gotberg, M., Olivecrona, G.K., Erlinge, D., Arheden, H., 2008. Automated quantification of myocardial infarction from MR images by accounting for partial volume effects: animal, phantom, and human study. Radiology 246, 581–588.

Huang, Z., Xu, W., Yu, K., 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.

Ingkanisorn, W.P., Rhoads, K.L., Aletras, A.H., Kellman, P., Arai, A.E., 2004. Gadolinium delayed enhancement cardiovascular magnetic resonance correlates with clinical measures of myocardial infarction. Journal of the American College of Cardiology 43, 2253–2259.

Kali, A., Cokic, I., Tang, R.L., Yang, H.J., Sharif, B., Marban, E., Li, D., Berman, D., Dharmakumar, R., 2014. Determination of location, size and transmurality of chronic myocardial infarction without exogenous contrast media using cardiac magnetic resonance imaging at 3T. Circulation: Cardiovascular Imaging.

Karim, R., Housden, R.J., Balasubramaniam, M., Chen, Z., Perry, D., Uddin, A., Al-Beyatti, Y., Palkhi, E., Acheampong, P., Obom, S., et al., 2013. Evaluation of current algorithms for segmentation of scar tissue from late gadolinium enhancement cardiovascular magnetic resonance of the left atrium: an open-access grand challenge. Journal of Cardiovascular Magnetic Resonance 15, 105.

Kingma, D.P., Ba, J.L., 2014. Adam: A method for stochastic optimization, in: Proc. 3rd Int. Conf. Learn. Representations.
Ledesma-Carbayo, M.J., Kybic, J., Desco, M., Santos, A., Suhling, M., Hunziker, P., Unser, M., 2005. Spatio-temporal nonrigid registration for ultrasound cardiac motion estimation. IEEE Transactions on Medical Imaging 24, 1113–1126.

Lin, T.Y., Dollar, P., Girshick, R.B., He, K., Hariharan, B., Belongie, S.J., 2017. Feature pyramid networks for object detection, in: CVPR.

Luc, P., Couprie, C., Chintala, S., Verbeek, J., 2016. Semantic segmentation using adversarial networks. arXiv preprint arXiv:1611.08408.

Mirza, M., Osindero, S., 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.

Ordovas, K.G., Higgins, C.B., 2011. Delayed contrast enhancement on MR images of myocardium: past, present, future. Radiology 261, 358–374.

Ørn, S., Manhenke, C., Anand, I.S., Squire, I., Nagel, E., Edvardsen, T., Dickstein, K., 2007. Effect of left ventricular scar size, location, and transmurality on left ventricular remodeling with healed myocardial infarction. The American Journal of Cardiology 99, 1109–1114.

Petitjean, C., Dacher, J.N., 2011. A review of segmentation methods in short axis cardiac MR images. Medical Image Analysis 15, 169–184.

Popescu, I.A., Irving, B., Borlotti, A., Dall'Armellina, E., Grau, V., 2016. Myocardial scar quantification using SLIC supervoxels-parcellation based on tissue characteristic strains, in: International Workshop on Statistical Atlases and Computational Models of the Heart, Springer. pp. 182–190.

Ronneberger, O., Fischer, P., Brox, T., 2015. U-net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 234–241.

Schlegl, T., Waldstein, S.M., Bogunovic, H., Endstraßer, F., Sadeghipour, A., Philip, A.M., Podkowinski, D., Gerendas, B.S., Langs, G., Schmidt-Erfurth, U., 2017. Fully automated detection and quantification of macular fluid in OCT using deep learning. Ophthalmology.

Schuijf, J.D., Kaandorp, T.A., Lamb, H.J., van der Geest, R.J., Viergever, E.P., van der Wall, E.E., de Roos, A., Bax, J.J., 2004. Quantification of myocardial infarct size and transmurality by contrast-enhanced magnetic resonance imaging in men. The American Journal of Cardiology 94, 284–288.
Shou, Z., Chan, J., Zareian, A., Miyazawa, K., Chang, S.F., 2017. CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos, in: Computer Vision and Pattern Recognition, IEEE. pp. 1417–1426.

Shou, Z., Wang, D., Chang, S.F., 2016. Temporal action localization in untrimmed videos via multi-stage CNNs, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1049–1058.

Suinesiaputra, A., Ablin, P., Alba, X., Alessandrini, M., Allen, J., Bai, W., Cimen, S., Claes, P., Cowan, B.R., D'hooge, J., et al., 2017. Statistical shape modeling of the left ventricle: myocardial infarct classification challenge. IEEE Journal of Biomedical and Health Informatics.

Sun, H., Zhen, X., Bailey, C., Rasoulinejad, P., Yin, Y., Li, S., 2017. Direct estimation of spinal Cobb angles by structured multi-output regression, in: International Conference on Information Processing in Medical Imaging, Springer. pp. 529–540.

Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M., 2015. Learning spatiotemporal features with 3D convolutional networks, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497.

Wong, K.C., Tee, M., Chen, M., Bluemke, D.A., Summers, R.M., Yao, J., 2016. Regional infarction identification from cardiac CT images: a computer-aided biomechanical approach. International Journal of Computer Assisted Radiology and Surgery 11, 1573–1583.

Xingjian, S., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.c., 2015. Convolutional LSTM network: A machine learning approach for precipitation nowcasting, in: Advances in Neural Information Processing Systems, pp. 802–810.

Xu, C., Xu, L., Gao, Z., Zhao, S., Zhang, H., Zhang, Y., Du, X., Zhao, S., Ghista, D., Li, S., 2017. Direct detection of pixel-level myocardial infarction areas via a deep-learning algorithm, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 240–249.

Xue, W., Brahm, G., Pandey, S., Leung, S., Li, S., 2018. Full left ventricle quantification via deep multitask relationships learning. Medical Image Analysis 43, 54–65.
Xue, W., Nachum, I.B., Pandey, S., Warrington, J., Leung, S., Li, S., 2017. Direct estimation of regional wall thicknesses via residual recurrent neural network, in: International Conference on Information Processing in Medical Imaging, Springer. pp. 505–516.

Yeung, S., Russakovsky, O., Mori, G., Fei-Fei, L., 2016. End-to-end learning of action detection from frame glimpses in videos, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2678–2687.

Zhen, X., Islam, A., Bhaduri, M., Chan, I., Li, S., 2015a. Direct and simultaneous four-chamber volume estimation by multi-output regression, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 669–676.

Zhen, X., Wang, Z., Islam, A., Bhaduri, M., Chan, I., Li, S., 2016. Multi-scale deep networks and regression forests for direct bi-ventricular volume estimation. Medical Image Analysis 30, 120–129. doi:10.1016/j.media.2015.07.003.

Zhen, X., Wang, Z., Yu, M., Li, S., 2015b. Supervised descriptor learning for multi-output regression, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1211–1218.

Zhen, X., Yu, M., He, X., Li, S., 2017a. Multi-target regression via robust low-rank learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 497–504.

Zhen, X., Zhang, H., Islam, A., Bhaduri, M., Chan, I., Li, S., 2017b. Direct and simultaneous estimation of cardiac four chamber volumes by multioutput sparse regression. Medical Image Analysis 36, 184–196.
Conflicts of interest: none