Journal Pre-proof

Segmentation and Quantification of Infarction without Contrast Agents via Spatiotemporal Generative Adversarial Learning
Chenchu Xu, Joanne Howey, Pavlo Ohorodnyk, Mike Roth, Heye Zhang, Shuo Li
PII: S1361-8415(19)30108-2
DOI: https://doi.org/10.1016/j.media.2019.101568
Reference: MEDIMA 101568
To appear in: Medical Image Analysis
Received date: 31 December 2018
Revised date: 16 September 2019
Accepted date: 30 September 2019
Please cite this article as: Chenchu Xu, Joanne Howey, Pavlo Ohorodnyk, Mike Roth, Heye Zhang, Shuo Li, Segmentation and Quantification of Infarction without Contrast Agents via Spatiotemporal Generative Adversarial Learning, Medical Image Analysis (2019), doi: https://doi.org/10.1016/j.media.2019.101568
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
© 2019 Published by Elsevier B.V.
Highlights
• Simultaneous segmentation and quantification of infarction without contrast agent.
• Learning time-series images by using spatiotemporal pyramid representation.
• Improving performance by exploiting the commonalities and differences across tasks.
• Embedding segmentation and quantification tasks into adversarial learning.
[Graphical abstract: cine MR (64 × 64 × 25) enters the spatiotemporal variation encoder (STVE) for enriched feature extraction of MI; the encoded representation z feeds the cross-task generator (CTG), in which shared representations drive the segmentation and quantification tasks (improved generating performance by reciprocal action across tasks); three task label relatedness discriminators (intra-label relatedness for the segmentation task, intra-label relatedness for the quantification task, and inter-label relatedness between Seg and Quan) compare the generated results with the ground truth and regulate the network, correcting the inter-/intra-task label relatedness by adversarial learning. Workflow/network structure: input cine MR → encoder → generator → segmentation/quantification outputs, with the discriminators trained using an L1+L2 loss and a fake/true loss.]
Segmentation and Quantification of Infarction without Contrast Agents via Spatiotemporal Generative Adversarial Learning

Chenchu Xu^a, Joanne Howey^a, Pavlo Ohorodnyk^a, Mike Roth^a, Heye Zhang^b,*, Shuo Li^a,*

^a Western University, London ON, Canada
^b Sun Yat-Sen University, Shenzhen, China
Abstract
Accurate and simultaneous segmentation and full quantification (all indices required in a clinical assessment) of the myocardial infarction (MI) area are crucial for early diagnosis and surgical planning. Current clinical methods remain subject to potential high-risk, nonreproducibility and time-consumption issues. In this study, a deep spatiotemporal adversarial network (DSTGAN) is proposed as a contrast-free, stable and automatic clinical tool to simultaneously segment and quantify MIs directly from the cine MR image. The DSTGAN is implemented using a conditional generative model, which conditions the distributions of the objective cine MR image to directly optimize the generalized error of the mapping between the input and the output. The method consists of the following: 1) A multi-level and multi-scale spatiotemporal variation encoder learns a coarse-to-fine hierarchical feature to effectively encode the MI-specific morphological and kinematic abnormality structures, which vary across spatial locations and time periods. 2) The top-down and cross-task generators learn the shared representations between segmentation and quantification to use the commonalities and differences between the two related tasks and enhance the generator preference. 3) Three inter-/intra-task label relatedness discriminators are iteratively imposed on the encoder and generator to detect and correct inconsistencies in the label relatedness between and within tasks via adversarial learning. Our proposed method yields a pixel classification accuracy of 96.98%, and the mean absolute error of the MI centroid is 0.96 mm on 165 clinical subjects. These results indicate the potential of our proposed method in aiding standardized MI assessments.

*Corresponding author. Email address: [email protected] (Shuo Li)

Preprint submitted to Elsevier, October 4, 2019
Keywords: Myocardial infarction, Segmentation, Full quantification, Sequential images, Generative adversarial networks
1. Introduction

Accurate and simultaneous segmentation and full quantification of myocardial infarction (MI) areas (all indices are required in a clinical assessment, including infarct size, percentage of infarct size, percentage of segments, perimeter, centroid, major axis length, minor axis length, orientation and transmurality (Ørn et al., 2007)) are the most effective tools to help clinicians quantitatively and qualitatively assess MI areas (Karim et al., 2013; Heiberg et al., 2008; Schuijf et al., 2004). MI frequently leads to fatal heart failure, and the success of its therapy depends highly on the assessment of the scar status. In recent years, clinical studies have indicated that the accurate segmentation of MI areas is important because it reveals the infarcted tissues and facilitates the identification and recovery of the dysfunctional but still viable myocardium (Karim et al., 2013; Ingkanisorn et al., 2004). However, segmentation alone cannot provide all the scar information necessary for further treatment. The accurate estimation of all indices of the MI area has gained increasing attention because it reveals the presence, location, and transmurality of acute and chronic MI to improve the selection of the appropriate therapeutic options (ablation or resynchronization) (Heiberg et al., 2008; Flett et al., 2011). Thus, both segmentation and full quantification of MIs are indispensable for early MI patient management and therapy planning because they provide all required information for a thorough understanding of MIs in clinical studies.
Current clinical procedures for MI patients rely on late gadolinium enhancement (LGE) cardiac magnetic resonance (MR) imaging, the 'gold standard' approach to MI area imaging, and are subject to potential high-risk, non-repeatable and time-consuming problems (Fox et al., 2010; Kali et al., 2014; Ordovas and Higgins, 2011), as shown in Fig. 1. A potential high risk arises from the intravenous administration of gadolinium-based contrast agents during the imaging of MI patients (patients after the initial diagnosis by an electrocardiograph): this administration can be fatal to patients with chronic kidney diseases (Fox et al., 2010). Patients with coronary artery disease have a higher prevalence of coexisting chronic kidney disease than patients without coronary artery disease.
Figure 1: The proposed method can segment and quantify an infarction directly from a cine MR image without contrast agents. This approach greatly benefits patients with MI compared with existing LGE-based clinical procedures.
The process is non-repeatable because it requires an experienced clinician to manually/semi-automatically segment the MI areas from the LGE MR image, and this process is subject to high inter-observer variability and subjectivity (Ordovas and Higgins, 2011). The process is also time consuming because it involves three steps: LGE imaging, manual/semi-automatic segmentation from the LGE image, and manual/semi-automatic quantification from the segmentation itself to obtain the final quantification result, which may even introduce an accumulative error (Kali et al., 2014).
To overcome the aforementioned problems of current clinical procedures, several methods have been attempted to directly segment the MI area from cine MR images.
Cine MR imaging is one of the most common myocardial function examinations, and it presents accurate myocardial spatiotemporal changes. Ultimately, these spatiotemporal changes in myocardial viability caused by MI directly affect the electrical and muscle contractions and drive the morphological and kinematic differences between the MI area and the healthy myocardium (Bijnens et al., 2007). The use of cine MR images to successfully detect and locate infarcted areas without contrast agents has been reported (Suinesiaputra et al., 2017; Wong et al., 2016). However, this approach has not been sufficiently exploited. Only two cine MR-based methods have achieved segmentation of the MI with accuracy comparable to LGE MR (Duchateau et al., 2016; Xue et al., 2017), and no previous method can simultaneously and accurately segment and quantify infarcted areas.
Accurate and simultaneous segmentation and full quantification of infarcted areas in MI patients directly from cine MR images is extremely challenging and involves the following issues: 1) Systematic learning of the spatiotemporal features of the MI. There are significant differences and inconsistencies in the spatial and temporal correlations of MI in different cine MR images. Since there is no a priori information on the infarcted area in the cine MR images (Petitjean and Dacher, 2011), effective learning of the hidden spatiotemporal information of cine MR images becomes a prerequisite to establish the intrinsic representation of the MI and to differentiate infarcted areas from other tissues. 2) Direct integration of the contiguous feature maps of different outputs. There is great heterogeneity and local-optimum discrepancy when training the mapping from the input 2D+T (two-dimensional + time) cine MR to two different outputs (1D and 2D) (Luc et al., 2016). Accurate learning of the mapping between the segmented image or estimated indices and the cine MR images highly depends on the ability to leverage the adjacency in the output label maps to guide the optimization of feature maps during training, such as the optimization of fine details in the segmentation feature maps. 3) Effective exploitation of the significant commonalities and differences between the segmentation and quantification tasks (Xue et al., 2018). Segmentation and quantification, as two related tasks that share commonalities, can use a shared representation during the learning process to produce a beneficial interaction and improve the learning efficiency and prediction accuracy.
To address these challenges, a deep spatiotemporal generative adversarial network (DSTGAN) is proposed to jointly segment and quantify MIs directly from cine MR images without the use of contrast agents. The DSTGAN investigates the infarcted area segmentation/quantification problem from the perspective of conditional generative adversarial networks (GANs), which learn a conditional generative model (Fig. 2) (Mirza and Osindero, 2014).
Figure 2: The proposed DSTGAN is a conditional generative model. It is flexible enough to simultaneously segment the image and quantify the indices by conditioning the spatiotemporal representation on both the generator and the three discriminators.
The reason for using this model is that the rapid development of GANs has shown impressive results in continuous/discrete data generation (Goodfellow et al., 2014), particularly for assessing the joint configuration of multi-label variables and enforcing forms of higher-order consistency (Luc et al., 2016). As shown in Figure 1, for a query cine MR image of any patient who is preliminarily diagnosed with MI, the DSTGAN is flexible enough to simultaneously obtain a segmented image and quantified indices by conditioning the spatiotemporal representation of the MI on the input and the correspondingly generated output. Specifically, the cine MR images are first mapped to a latent vector through an encoder; then, the vector is projected to segmentation and quantification through a deconvolutional/regression generator. Three adversarial networks are imposed on the generator, forcing the generated results to be closer to clinical standards and to the distribution of input cine MR images. The contributions of the proposed DSTGAN can be summarized from four aspects:
• For the first time, a non-contrast agent MI simultaneous segmentation and quantification approach is proposed to provide clinicians with all images and indices required in the current clinical assessment of MI directly from 2D+T cine MR images.

• An inherent hierarchical and multi-scale 2D+T encoder is proposed for the learning of a strong and comprehensive spatiotemporal pyramid representation of 2D+T images. It systematically models the time-series image sequence with large rotation, distortion and different local correlation structures between consecutive frames in different spatial locations and time periods.

• A reciprocal cross-task architecture is proposed to improve the learning efficiency and generation accuracy by exploiting the commonalities and differences across tasks. It merges the deep and coarse information with the shallow and fine information of the MI throughout the network to enrich this beneficial interaction representation.

• An iterative conditional adversarial learning strategy is proposed to integrate three different target discriminators to flexibly use multiple inductive biases and to simultaneously and directly approximate the generated distributions to the clinical standards.
This work advances our preliminary work in MICCAI-2018: 1) A new powerful spatiotemporal feature learning network now encodes cine MR images and improves the segmentation/quantification. 2) Conditional GANs have been used in the adversarial training process to enhance the generated results. 3) Three iterative discriminators optimize different distribution tasks to make the model more flexible, while the iterative training strategy ensures that their training is stable. 4) Two additional quantitative indicators (percentage of infarct size and transmurality) have been estimated to cover all indicators that can be used in the clinic. To our knowledge, this is the first report of these two indicators being directly estimated from the cine MR images of MI patients. 5) Experiments are extended to a larger database, and more statistical validations with rigorous discussion have been added to this paper.
2. Related Work

2.1. Existing MI segmentation or quantification methods

Currently, there is no method for simultaneous segmentation and quantification of the MI directly from a cine MR image, and even a direct MI quantification method is not available (Fig. 3). Existing MI segmentation methods cannot satisfy clinical needs in terms of accuracy because they provide only sparse movement and deformation information, which cannot accurately detect detailed abnormal heart wall motion. These methods mainly fall into three categories:
[Figure 3 overview: methods using temporal information only require post-processing to regularize the segmentation; coarse segment-level localization yields inaccurate quantification from the localization results and requires endocardial/epicardial segmentation; the proposed method combines both spatial and temporal learning, reciprocal task interaction, and inter-/intra-task label relatedness to achieve simultaneous segmentation and full quantification directly from multi-frame cine MR.]
Figure 3: The proposed method can simultaneously segment and quantify infarcts directly from the cine MR image because of improvements over existing methods and a specific design for MI characteristics.
energy-based (Ledesma-Carbayo et al., 2005), statistical shape model-based (Suinesiaputra et al., 2017), and machine learning-based (Bleton et al., 2015). However, until now, only Duchateau et al. (Duchateau et al., 2016) and Xue et al. (Xue et al., 2017) have successfully achieved pixel-wise MI segmentation, through an LV shape uncertainty model and patch-based recurrent neural networks (RNNs), respectively. However, these two methods face the issue of insufficient accuracy because of the cumbersome LV shape modeling process via manual or semi-automated endocardial/epicardial delineation and the use of only single-scale correlation learning for both spatial and temporal features. Moreover, all existing MI quantification methods are based on quantifying the segmentation results (Kali et al., 2014). Compared to such multi-step methods, direct quantification in a one-step regression is more efficient and more accurate because it avoids the cumulative error generated by multiple steps. Because of the excellent ability of deep learning in task-aware feature representation, direct quantification has been widely recognized in the medical image community and used in the estimation of different tissues or lesions (Schlegl et al., 2017; Sun et al., 2017). Specifically in the cardiac field, success in using direct quantification has also been reported for estimating a wide range of indices (Zhen et al., 2015b, 2017a; Afshin et al., 2012), including volume (Zhen et al., 2017b), wall thickness (Xue et al., 2017) and ejection fraction (Gu et al., 2018), from various cardiac images (Zhen et al., 2016, 2017b, 2015a), which led to a challenge called LV full quantification at STACOM-MICCAI'18 (Xue et al., 2017).
2.2. Deep spatiotemporal feature learning

Existing deep spatiotemporal feature learning methods, mainly divided into RNN-based (Yeung et al., 2016) and 3D convolution (Conv)-based (Tran et al., 2015) methods, demonstrate powerful performance (Xingjian et al., 2015; Shou et al., 2016). However, most of these methods cannot be mechanically applied to MI feature learning: 1) RNN-based methods, in which the spatial part of the spatiotemporal features is missing or incomplete, are incapable of modeling the morphological and kinematic changes specific to the MI. The RNN and its variants (such as long short-term memory (LSTM) and convolutional RNN (Xingjian et al., 2015)) have been widely used to model the temporal state transitions between frames to capture motion information and improve inter-frame consistency. However, LSTM fails to take spatial features into consideration because its fully connected layers learn the 2D+T images as a one-dimensional sequence. Convolutional RNN methods, in which the convolution layer is a locally constant (fixed kernel size) filter, lack the ability to model the non-local spatiotemporal features specific to the MI, with long-range rotation or distortion and different local correlation structures in different spatial locations and time periods (Shou et al., 2016). 2) The loss of granularity over time leaves 3DConv-based models unable to systematically model all 25 frames from beginning to end. 3DConv has been used to capture temporal scene changes while actions are being performed; it is commonly placed into a 3D convolutional neural network framework, which can learn the spatiotemporal abstraction of high-level semantics directly from 2D+T images (Shou et al., 2016). However, 3DConv loses granularity during temporal growth (Shou et al., 2017).
2.3. Adversarial label relatedness reinforcement

GANs have achieved much success in reinforcing and correcting inconsistent label relatedness in training feature maps. Although conditional random field (CRF)-based methods also enable reinforcing label relatedness and demonstrate impressive performance when labels are independently predicted, CRF-based methods have limitations in our application because the learnable parameters of higher-order potentials are limited in number when both segmented and quantified label relatedness are integrated into 2D+T data. Thus, using an adversarial training approach, which enforces multi-range label relatedness without being limited to a notably specific class of higher-order potentials in a CRF model, is a wise choice. The GAN framework estimates generative models via an adversarial process (Goodfellow et al., 2014). It is composed of two models that are alternately trained to compete with each other. The generator G captures the distribution of training samples and learns to generate new samples that imitate the training data. The discriminative model D differentiates the generated samples from the training data. D is optimized to distinguish training samples from results generated by G using a two-player min-max game with the following objective function:
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))] \tag{1}$$
where training data are represented by x, the data distribution is p_data, and z is a vector sampled from a certain distribution p(z).
In conditional GANs (Mirza and Osindero, 2014), by conditioning the model on additional information, it is possible to direct the data generation process. Conditional GANs successfully address the instability of the original GAN, whose training process is unstable and whose generated results are often noisy and incomprehensible.
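To make the alternating optimization of Eq. (1) concrete, the toy training step below pits a small fully connected generator against a small discriminator; the layer sizes, latent dimension, and the common non-saturating generator objective are illustrative assumptions, not the networks used in this paper.

```python
import tensorflow as tf

latent_dim = 8
# Toy generator G: z -> fake sample; toy discriminator D: sample -> logit.
G = tf.keras.Sequential([tf.keras.layers.Dense(32, activation="relu"),
                         tf.keras.layers.Dense(2)])
D = tf.keras.Sequential([tf.keras.layers.Dense(32, activation="relu"),
                         tf.keras.layers.Dense(1)])
g_opt = tf.keras.optimizers.Adam(1e-4)
d_opt = tf.keras.optimizers.Adam(1e-4)
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

@tf.function
def train_step(x_real):
    z = tf.random.normal([tf.shape(x_real)[0], latent_dim])
    # D step: ascend E[log D(x)] + E[log(1 - D(G(z)))], written as a binary
    # cross-entropy that labels real samples 1 and generated samples 0.
    with tf.GradientTape() as tape:
        real_logits, fake_logits = D(x_real), D(G(z))
        d_loss = (bce(tf.ones_like(real_logits), real_logits) +
                  bce(tf.zeros_like(fake_logits), fake_logits))
    d_opt.apply_gradients(zip(tape.gradient(d_loss, D.trainable_variables),
                              D.trainable_variables))
    # G step: the non-saturating surrogate of minimizing E[log(1 - D(G(z)))].
    with tf.GradientTape() as tape:
        fake_logits = D(G(z))
        g_loss = bce(tf.ones_like(fake_logits), fake_logits)
    g_opt.apply_gradients(zip(tape.gradient(g_loss, G.trainable_variables),
                              G.trainable_variables))
    return d_loss, g_loss
```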
3. Methodology

The structure of the DSTGAN is shown in Fig. 4. The input of the DSTGAN is a 2D+T cine MR dataset X, and the output is Y = (y_1, y_2, ..., y_n), which includes a segmentation task y_s = y_1 and a quantification task y_q = (y_2, ..., y_n), n = 10 (including infarct size, percentage of infarct size, percentage of segments, perimeter, centroid, major axis length, minor axis length, orientation and transmurality). The DSTGAN applies the GAN model in the conditional setting via three functional parts: 1) The spatiotemporal variation encoder (STVE) E(·) simultaneously encodes the enriched local correlation structures of different spatial locations and temporal periods in X through spatiotemporal variation learning with a pyramidal structure (Goodfellow et al., 2016). The output of the STVE, z, preserves a comprehensive representation of the input X, which includes all feature maps from the low level to the high level. 2) The cross-task generator (CTG) G(·) generates the segmentation task and quantification task results simultaneously and
Figure 4: The proposed DSTGAN effectively integrates the spatiotemporal variation encoder, cross-task generator, and three intra-/inter-task label relatedness discriminators to accurately segment (Seg) and quantify (Quan) MIs.
reciprocally from the pyramid feature map z via a top-down pathway, and multiple connections upsample and connect the beneficial interaction feature maps of these two related tasks. The output of the CTG can be expressed as G(z) = Y. 3) Three discriminators (segmentation discriminator D_s, quantification discriminator D_q, and relationship discriminator D_r) are conditioned on X and iteratively imposed on G. Among them, D_s and D_q force the label relatedness within the segmentation y_1 and within the quantification y_2, ..., y_n, respectively, to conform to the relatedness in the ground truth; D_r forces the relationship between y_1 and y_2, ..., y_n to conform to that in the ground truth.
3.1. Spatiotemporal variation encoder (STVE) to enrich the feature extraction of MI

The STVE innovatively leverages 4 spatiotemporal variation convolution blocks (STVCBs) to build a pyramidal feature that accurately learns the left ventricular morphology and kinematic abnormalities. As shown in Fig. 5, this framework with four STVCBs uses a spatiotemporal scaling step of 2 to sequentially encode the cine MR images X from high resolution to low resolution and outputs four spatiotemporal-resolution feature maps. These four feature maps are then combined into a set of pyramidal features z. Specifically, the four STVCBs are block1, block2, block3, and block4 and correspond to the four output
[Figure 5 layout: the input cine MR (64 × 64 × 25, frames T1 ... T25) passes through stacked STVCBs into a feature pyramid structure, giving a strong spatiotemporal representation for MIs of all shapes; within a spatiotemporal variation convolution block (STVCB), parallel convolutions operate on images or features of size W × H × T with windows at W × H × T (Conv 1), W/2 × H × T (Conv 2), W × H/2 × T (Conv 3), ..., W/2 × H/2 × T/2 (Conv k), designed to model different local correlation structures of different spatial locations and time periods of MI.]
Figure 5: The proposed STVE systematically models different local correlation structures of different spatial locations and time periods of MIs and other tissues in successive frames using the innovative design of multiple spatiotemporal variation convolution blocks and the spatiotemporal feature pyramid structure. The STVE maps the input cine MR images to a feature pyramid z.
spatiotemporal feature maps {C1, C2, C3, C4}, whose spatiotemporal resolutions are 32 × 32 × 16, 16 × 16 × 6, 8 × 8 × 3, and 4 × 4 × 1, respectively.

The benefit of such pyramid features is two-fold: First, by using a multiple-scale feature learning framework, the encoder can handle asymmetric distortion (Falsetti et al., 1981), a mutable myocardial motion and deformation pattern caused by MI. Second, by using strided convolutions instead of pooling layers, the encoder is fully differentiable and learns its own spatial downsampling.
Each STVCB integrates multiple spatiotemporal-scale sliding sub-blocks that consist of 3DConvs and a ConvLSTM to produce different kernel sizes and avoid the defects of using them separately. Multiple sliding convolutional windows of different kernel sizes are used to learn the feature map from the input of a block. Different sliding windows with different kernel sizes proportionally scale the spatiotemporal resolution of the STVCB to simultaneously learn the spatiotemporal dependency structure over different spatial scales and multiple time periods. Specifically, 3 scales and 3 aspect ratios are the default for each STVCB. Given the block input size W0, H0, and T0, the scale ratios are {1, 1/2, 1/4} for all of W0, H0, T0, and the aspect ratios are (1, 1, 2), (1, 2, 1), (2, 1, 1) for both scale ratios {1/2, 1/4}. ConvLSTM is selected for the convolutional windows with a scale ratio of 1, i.e., the input feature map of an STVCB. This selection is natural because ConvLSTM handles long temporal information with spatial correlation better than 3DConv. The other scales and aspect ratios use 3DConv because it can directly and systematically model a short time period by simply adjusting the size of the convolution kernel.
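A minimal Keras sketch of one STVCB and the four-block pyramid is given below; the branch kernel sizes, filter counts, and the exact mapping of the scale/aspect ratios onto Conv3D kernels are illustrative assumptions, while the overall pattern (a ConvLSTM branch at scale ratio 1, 3DConv branches with different aspect ratios, and a strided convolution for the scaling step of 2) follows the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def stvcb(x, filters=64):
    """A sketch of one spatiotemporal variation convolution block (STVCB).
    x: (batch, T, H, W, C) features of the cine MR sequence."""
    # ConvLSTM branch: long-range temporal dependencies at scale ratio 1.
    b0 = layers.ConvLSTM2D(filters, 3, padding="same", return_sequences=True)(x)
    # 3DConv branches: short-period patterns with different aspect ratios
    # (kernel order is (T, H, W) for channels-last 5D inputs).
    b1 = layers.Conv3D(filters, (3, 3, 3), padding="same")(x)
    b2 = layers.Conv3D(filters, (1, 3, 5), padding="same")(x)  # wide in W
    b3 = layers.Conv3D(filters, (5, 3, 1), padding="same")(x)  # long in T
    y = layers.Concatenate()([b0, b1, b2, b3])
    # Strided convolution implements the spatiotemporal scaling step of 2
    # (instead of pooling, so the block stays fully differentiable).
    y = layers.Conv3D(filters, 3, strides=2, padding="same")(y)
    return layers.LeakyReLU()(layers.BatchNormalization()(y))

# Four stacked blocks yield the pyramid z = {C1, C2, C3, C4}; the exact
# resolutions (32 x 32 x 16, ..., 4 x 4 x 1 in the text) depend on padding.
inp = layers.Input((25, 64, 64, 1))
c, pyramid = inp, []
for _ in range(4):
    c = stvcb(c)
    pyramid.append(c)
stve = tf.keras.Model(inp, pyramid)
```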
3.2. Cross-task generator (CTG) to share the commonalities between the tasks

The CTG upsamples and merges feature maps from the spatiotemporal feature pyramid z in a top-down manner, and it reciprocally connects the features of different layers and different tasks to jointly generate the MI segmentation and quantification results, as shown in Fig. 6. The CTG efficiently integrates coarse, high-layer information with fine, low-layer information to avoid contextual information loss and gradient dispersion.
• For the segmentation task, the CTG component is similar to U-Net (Ronneberger et al., 2015). It begins at the coarsest-resolution feature map C4 and merges the rest of z (C3, C2, C1) into features of matching spatial size during iterated upsampling, with an upsampling factor of 2. In other words, the CTG merges the feature map output by each upsampling unit with the corresponding {C1, C2, C3, C4} map (a 1 × 1 convolutional layer unifies the dimensions) and feeds the merged features into the next upsampling unit.

• For the quantification task, the CTG component is modified from the FPN (Lin et al., 2017). It uses five fully connected layers to extract five quantification-related feature vectors {Q1, Q2, Q3, Q4, Q5} from five scales of feature maps, namely C4 and the four outputs of the upsampling units. It then combines these five features through lateral connections to regress the final quantification result. All fully connected layers have dimensionality 256.
Moreover, the CTG also exploits the detailed commonalities and differences between the segmentation task and the quantification task to produce beneficial interactions among the tasks through shared training.
Figure 6: The proposed CTG simultaneously and accurately generates the MI segmentation and quantification results using the creative design of top-down pathways and cross-task connections. After concatenating z, the CTG maps the input z to the generated result Y.
Both the segmentation and quantification components of the CTG use the shared upsampling outputs and the pyramid feature z, and the weights and biases of these two related tasks are forcibly shared and mutually improved during training.
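The following Keras sketch shows the CTG pattern just described: a U-Net-like top-down pathway with 1 × 1 lateral convolutions for segmentation, plus five 256-dimensional fully connected heads (from C4 and the four upsampling units) combined for the quantification regression. It assumes the pyramid features have already been reduced to 2D maps over the time axis; filter counts and the output activations are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def ctg(pyramid, n_indices=10):
    """Cross-task generator (sketch). pyramid = [C1, C2, C3, C4] are 2D
    feature maps at 32, 16, 8 and 4 px (time axis assumed already reduced)."""
    c1, c2, c3, c4 = pyramid
    # Q1 comes from C4; Q2..Q5 come from the four upsampling units below.
    q_feats = [layers.Dense(256)(layers.GlobalAveragePooling2D()(c4))]
    x = c4
    for lateral in (c3, c2, c1, None):           # top-down pathway, 2x steps
        x = layers.UpSampling2D(2, interpolation="nearest")(x)
        x = layers.Conv2D(256, 3, padding="same", activation="relu")(x)
        if lateral is not None:                  # 1 x 1 conv unifies dimensions
            x = layers.Add()([x, layers.Conv2D(256, 1)(lateral)])
        # Cross-task connection: every upsampling unit also feeds the
        # quantification branch, so the two tasks share the representation.
        q_feats.append(layers.Dense(256)(layers.GlobalAveragePooling2D()(x)))
    y_seg = layers.Conv2D(1, 1, activation="sigmoid")(x)     # 64 x 64 mask
    y_quan = layers.Dense(n_indices)(layers.Concatenate()(q_feats))
    return y_seg, y_quan

# Example wiring from four dummy pyramid inputs:
ins = [layers.Input((s, s, 256)) for s in (32, 16, 8, 4)]
generator = tf.keras.Model(ins, ctg(ins))
```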
3.3. Iterative label relatedness discriminators to regularize the contiguous generation of intra-/inter-tasks

The iterative label relatedness discriminators (D_s, D_q, and D_r) leverage the intrinsic label relatedness between and within the segmentation and quantification tasks to regularize the generated results (Fig. 7), which effectively improves the accuracy and generalization of the generation. The three discriminators model three types of intrinsic label relatedness: 1) D_s models intra-task relatedness within the pixels of the segmentation image (such as spatial structure information); 2) D_q models intra-task relatedness within the values of the quantification indices (such as the geometric relationship between the area and the perimeter); 3) D_r models inter-task relatedness between segmentation and quantification (such as the relationship between the perimeter measured on the segmentation and the quantified perimeter). The three discriminators check that these relatedness characteristics are consistent with the ground truth and then improve the generation performance by refining the relatedness characteristics of the generated results until they match the ground truth during adversarial training. All discriminators are conditioned on the spatiotemporal distribution of the input cine MR dataset, c (obtained after 5 ConvLSTM-BatchNorm-ReLU blocks).
The discriminators D_s and D_q use a CNN and a Bi-LSTM (Huang et al., 2015), respectively, to build the intra-task label relatedness for segmentation and quantification. For D_s, since the CNN-based model (4 layers, 3 × 3 kernels, max-pooling, stride 2) can assess the spatial correlation of label variables over a field of view spanning the entire image or a large portion of it, mismatches in the label relatedness statistics can be penalized by the adversarial loss term, which is the binary cross-entropy. For D_q, since the Bi-LSTM (3 layers) can consider a complete contextual relationship to learn a vector pattern, the adversarial process forces the distribution of the generated y_q to gradually approach the prior distribution of the ground truth. Note that c and y_q are uniformly scaled to dimension d = 100 via duplication or 1DConv.

The discriminator D_r uses a Bi-LSTM to build the inter-task label relatedness between segmentation and quantification. Specifically, the segmentation results y_s are fed into two downsampling Bi-LSTM units with dimension d. These features are then concatenated along the channel dimension with the scaled quantification features and fed into three layers of Bi-LSTMs to jointly learn the features across the image and the indices. Finally, a fully connected layer with one node produces the decision score. Such an inter-task label relatedness learning network helps the CTG rectify small defects in the generated results, which enhances detail in the MI boundary areas and makes the quantification more stable.
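As a rough sketch, the inter-task discriminator D_r can be assembled as below; reading the 64 × 64 segmentation map as a 64-step sequence of rows and broadcasting the d-dimensional quantification features over that sequence before concatenation are assumptions, since the text specifies only the unit counts (two Bi-LSTM units for the segmentation, three joint Bi-LSTM layers, d = 100) and the single-node output.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_dr(d=100, n_indices=10):
    seg = layers.Input((64, 64))        # segmentation y_s, rows as time steps
    quan = layers.Input((n_indices,))   # quantification y_q
    # Two Bi-LSTM units over the segmentation rows.
    h = layers.Bidirectional(layers.LSTM(d, return_sequences=True))(seg)
    h = layers.Bidirectional(layers.LSTM(d, return_sequences=True))(h)
    # Scale y_q to dimension d and concatenate along the channel dimension.
    q = layers.RepeatVector(64)(layers.Dense(d)(quan))
    h = layers.Concatenate()([h, q])
    # Three joint Bi-LSTM layers learn features across image and indices.
    h = layers.Bidirectional(layers.LSTM(d, return_sequences=True))(h)
    h = layers.Bidirectional(layers.LSTM(d, return_sequences=True))(h)
    h = layers.Bidirectional(layers.LSTM(d))(h)
    score = layers.Dense(1)(h)          # one-node decision score (critic-style)
    return tf.keras.Model([seg, quan], score)

d_r = build_dr()
```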
3.4. Task-aware loss function to effectively and steadily promote network training

Our task-aware loss function (Eq. 7) is dedicatedly formulated to train the DSTGAN stably and accurately by combining three loss terms: the multi-task generation loss (Eq. 2), the adversarial loss (Eqs. 3 and 4), and the reciprocal loss (Eq. 5).

Generation loss. To improve the accuracy of the segmented images and quantified indices, the multi-task generation leverages an elastic-net loss function that integrates L1 (lasso) and L2 (ridge) regularization to obtain a sparse model that encourages less blurring and improves the regression performance when predicting the related tasks.
[Figure 7 layout: discriminator D_s for intra-label relatedness of segmentation takes the cine CMR condition and a segmentation result and outputs {0, 1}; discriminator D_q for intra-label relatedness of quantification takes the condition and a quantification result; discriminator D_r for inter-label relatedness between Seg and Quan takes the segmentation and quantification results G(E_st(X)) jointly; the three discriminators are trained iteratively.]
Figure 7: Three inter-/intra-task label relatedness discriminators effectively improve the performance of the generator by directly regulating the distribution of results to approximate the clinical standards.
The generation loss L_g(E, G, X) is formulated as follows:

$$L_g = \min_{E,G}\, \ell\big(\hat{Y},\, G(E(X))\big) = \frac{1}{n} \sum_{i=1}^{n} \big( \|y_i - \hat{y}_i\|_1 + \|y_i - \hat{y}_i\|_2^2 \big) \tag{2}$$

where $\hat{Y} = (\hat{y}_1, \hat{y}_2, ..., \hat{y}_n)$ is the ground truth.
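A direct TensorFlow transcription of Eq. (2) is sketched below; flattening each output per sample and averaging the per-output terms are reduction choices assumed here, since the equation leaves them implicit.

```python
import tensorflow as tf

def generation_loss(y_true_list, y_pred_list):
    """Elastic-net generation loss of Eq. (2): an L1 (lasso) plus a squared
    L2 (ridge) term per task output, averaged over the n outputs and the
    batch. The arguments are parallel lists, one tensor per task output."""
    terms = []
    for y, y_hat in zip(y_true_list, y_pred_list):
        d = tf.reshape(y - y_hat, [tf.shape(y)[0], -1])      # flatten per sample
        terms.append(tf.reduce_sum(tf.abs(d), axis=-1)       # ||.||_1
                     + tf.reduce_sum(tf.square(d), axis=-1)) # ||.||_2^2
    return tf.reduce_mean(tf.add_n(terms)) / len(terms)
```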
Adversarial loss. To make the adversarial training more stable, the adversarial loss is adopted from WGAN-GP (Gulrajani et al., 2017) for all three discriminators. In particular, WGAN-GP avoids potential discontinuity with respect to the generator's parameters and the local saturation that leads to vanishing gradients by using the continuous Earth-Mover distance to replace the Jensen-Shannon divergence in the GAN formulation. It also adds a gradient penalty on the norm of the gradient of the critic with respect to its input, thereby avoiding undesired behaviors. Thus, the adversarial loss L_a, consisting of the intra-task label relatedness term L_{a1}(E, G, X, D_s, D_q) and the inter-task label relatedness term L_{a2}(G, X, Y, D_r), is given as:
$$L_{a1} = \mathbb{E}_{E(X) \sim P(X)}\big[D_{\{s,q\}}(G(E(X)))\big] - \mathbb{E}_{y_{\{s,q\}} \sim P_{\{s,q\}}}\big[D_{\{s,q\}}(y_{\{s,q\}}, X)\big] + \lambda_1\, \mathbb{E}_{I_{\{s,q\}} \sim P_{I_{\{s,q\}}}}\Big[\big(\big\|\nabla_{I_{\{s,q\}}} D_{\{s,q\}}(I_{\{s,q\}})\big\|_2 - 1\big)^2\Big] \tag{3}$$
L_{a1} is a min-max objective function to train the generator G, the segmentation discriminator D_s, and the quantification discriminator D_q with condition X. I denotes the random samples at which the gradient norm is penalized, and λ_1 is the penalty coefficient. The encoded pyramid feature E(X) is imposed on D_s or D_q, the distribution of the training data is P_{s,q}, and the distribution of the cine MR is P(X).
Similarly, the discriminator D_r and the generator G with condition X can be trained by:
$$L_{a2} = \mathbb{E}_{E(X) \sim P(X)}\big[D_r(G(E(X)))\big] - \mathbb{E}_{Y \sim P_Y}\big[D_r(Y, X)\big] + \lambda_1\, \mathbb{E}_{I_Y \sim P_{I_Y}}\Big[\big(\|\nabla_{I_Y} D_r(I_Y)\|_2 - 1\big)^2\Big] \tag{4}$$
L_{a2} drives the discriminator D_r on both the segmentation image and the quantification indices to force the generator to further improve its performance. L_{a1} (for D_s and D_q) and L_{a2} (for D_r) are trained with an iterative (separate) strategy.
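The WGAN-GP critic loss shared by Eqs. (3) and (4) can be sketched as follows; the conditioning on the cine MR input X is omitted for brevity, and the uniform interpolation used to draw the penalty samples I follows Gulrajani et al. (2017).

```python
import tensorflow as tf

def wgan_gp_critic_loss(critic, real, fake, lam1=10.0):
    """Earth-Mover critic estimate E[D(fake)] - E[D(real)] plus the gradient
    penalty on random interpolates I between real and fake samples."""
    batch = tf.shape(real)[0]
    eps = tf.random.uniform([batch] + [1] * (real.shape.rank - 1))
    interp = eps * real + (1.0 - eps) * fake           # penalty samples I
    with tf.GradientTape() as tape:
        tape.watch(interp)
        score = critic(interp)
    grads = tf.reshape(tape.gradient(score, interp), [batch, -1])
    gp = tf.reduce_mean(tf.square(tf.norm(grads, axis=1) - 1.0))
    return (tf.reduce_mean(critic(fake)) - tf.reduce_mean(critic(real))
            + lam1 * gp)
```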
Reciprocal loss. To further improve generalization and stability after adversarial training, the reciprocal loss minimizes the inter-task label relatedness difference between y_s and y_q while maintaining performance. The intuition behind the reciprocal loss is to maintain the accuracy of the results while reducing the difference between two quantification results: 1) the result regressed directly by the network and 2) the result computed from the segmentation image. These two sets of results should be the same because they describe the same infarction. Thus, the reciprocal loss is added as an auxiliary term behind the generator loss and the adversarial loss. It gives the network a reciprocal constraint between these two results and an inductive bias that further improves generalization and stability once adversarial training has produced accurate results. Note that the percentage of infarct size, the percentage of segments, and the transmurality are not included. The reciprocal loss L_r(G, X, y_q, y_s) is defined as:
$$L_r = \mathbb{E}_{Y \sim P_Y}\big[\|y_q - \tilde{y}_q\|_1 + \|y_q - \tilde{y}_q\|_2^2\big] \tag{5}$$

where $\tilde{y}_q$ is the quantification result computed from the segmentation image.
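The segmentation-derived quantification $\tilde{y}_q$ in Eq. (5) can be sketched with scikit-image region properties, which map naturally onto the geometric indices; the 1.26 mm pixel spacing comes from Section 4, while the choice of regionprops fields and keeping the largest labeled region are assumptions.

```python
import numpy as np
from skimage.measure import label, regionprops

def indices_from_mask(mask, px_mm=1.26):
    """Derive geometric MI indices from a binary segmentation mask."""
    regions = regionprops(label(mask.astype(int)))
    p = max(regions, key=lambda r: r.area)       # keep the largest region
    return {
        "infarct_size_mm2": p.area * px_mm ** 2,
        "perimeter_mm": p.perimeter * px_mm,
        "centroid_mm": tuple(c * px_mm for c in p.centroid),
        "major_axis_mm": p.major_axis_length * px_mm,
        "minor_axis_mm": p.minor_axis_length * px_mm,
        "orientation_deg": np.degrees(p.orientation),
    }
```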
Finally, to generate the output Y, the full objective L linearly combines all previous partial losses (generation loss L_g, adversarial loss L_{a{1,2}}, and reciprocal loss L_r):

$$L = \lambda_2 L_g(E, G) + \lambda_3 L_{a\{1,2\}}(E, G, X, Y, D_s, D_q, D_r) + \lambda_4 L_r(G, X, y_q, y_s) \tag{6}$$
where λ_2, λ_3 and λ_4 are hyper-parameters that balance the relative importance of the different terms; they are tuned to the values that achieved the overall highest performance in the experiments. We then define the following minimax problem:

$$\arg\min_{E,G}\; \max_{D \in \{(D_s, D_q), (D_r)\}} L \tag{7}$$
4. Materials and Implementation Details

Materials. A total of 165 patients were retrospectively selected from two clinical databases for a retrospective observational study, including 140 MI patients and 25 non-MI patients. All patients were admitted with acute ST-segment elevation MI (in at least 2 contiguous electrocardiogram leads). All patients completed cine (12375 images) and LGE (495 images) MR imaging scans. Cine MR imaging was performed using a 3T MR system (MAGNETOM Verio, Siemens Healthcare) with a 32-channel cardiac coil. After the scout MR images were obtained, cine images were acquired during repeated breath-holds to cover the entire heart. LGE MR imaging was performed in the same orientations and with the same slice thickness as cine imaging using a segmented inversion-recovery gradient-echo sequence ten minutes after the intravenous injection of gadolinium (Magnevist, 0.2 mmol/kg; Bayer Schering Healthcare). The inversion time was optimized to null the myocardium during the image acquisition. The imaging parameters were as follows: TR = 10.5 ms, TE = 5.4 ms, FA = 30°, matrix of 256 × 162, slice thickness of 6 mm, BW of 140 Hz/px, and 25 cardiac phases; each pixel is 1.26 × 1.26 mm.
The ground truth was obtained by two radiologists. In detail, two experienced (over 10 years) radiologists (Dr. N. Zhang and Dr. L. Xu) manually assessed the LGE MR images (the well-known gold standard in clinical MI assessment) and segmented the infarction area to obtain the ground truth of the segmentation; if a disagreement occurred, it was resolved by consensus between the experts. Then, Dr. N. Zhang measured the ground-truth segmentation to obtain the ground truth of the quantification.
Implementation details. The DSTGAN is designed to generate 64 × 64 binary images and 1 × 10 indices directly from X = (x_1, x_2, ..., x_T) ∈ R^{H×W×T}, where H and W are the height and width of each temporal frame, respectively (H = W = 64), and T is the temporal step, set to T = 25. The input is first encoded into a feature pyramid, which consists of 32 × 32 × 256T, 16 × 16 × 256T, 8 × 8 × 256T, and 4 × 4 × 256T tensors, where T is the number of temporal resolutions in the tensor. Batch normalization (Goodfellow et al., 2016) and LeakyReLU activation (Goodfellow et al., 2016) are applied after every block of the encoder except the last one. The residual blocks consist of 3 × 3 stride-1 convolutions, batch normalization and LeakyReLU. The upsampling blocks consist of nearest-neighbor upsampling followed by a 3 × 3 stride-1 convolution. All code is based on Keras (TensorFlow), with 8× NVIDIA P100 GPUs for training. All networks are trained using the ADAM solver (Kingma and Ba, 2014) with a batch size of 6 and an initial learning rate of 0.001. λ_1 is 10, λ_2 is 0.5, λ_3 is 1, and λ_4 is 10. An iterative training strategy is used: stage 1 iteratively trains {E, G} and {D_s, D_q} for 800 epochs while fixing D_r; stage 2 then iteratively trains G and D_r for another 600 epochs while fixing {E} and {D_s, D_q}. There are thus two iterative training processes. A network with heart localization layers, as described in (Xu et al., 2017), was used to automatically crop the cine MR images to 64 × 64 region-of-interest sequences including the LV. Moreover, because LGE MR imaging was performed in the same orientations and with the same slice thickness as cine imaging, no extra steps were required to align them. All experiments were assessed with a 10-fold cross-validation test (no patient appears in more than one group).
Evaluation criteria. The performance of the proposed DSTGAN framework is evaluated based on the generation accuracy of the MI segmentation and quantification. For the segmentation task, the overall pixel classification accuracy, sensitivity, specificity, Dice coefficient (Dice = 2|Y ∩ Ŷ| / (|Y| + |Ŷ|) × 100%) and intersection-over-union (IoU = |Y ∩ Ŷ| / |Y ∪ Ŷ| × 100%) between the ground truth and the generated image are calculated to evaluate the accuracy. For the quantification task, the mean absolute error (MAE = (1/n) Σ_{i=1}^{n} |Y_i − Ŷ_i|) and statistical hypothesis testing (p-value) between the ground-truth and generated indices are calculated to evaluate the accuracy.
[Figure 8 rows: LGE image; ground truth segmentation; our method's segmentation; difference after overlap.]
Figure 8: The infarction areas segmented by our method (third row) are consistent with the ground truth (second row).
Note that the MAE of the centroid is the sum of the MAE along the X-axis and the MAE along the Y-axis. The p-value is obtained by an independent t-test between the predicted values and the ground truth. If it is greater than 0.05 (i.e., p > .05), the two groups can be treated as consistent. If p < 0.05, the two groups violate the assumption of homogeneity of variances and can be treated as inconsistent.
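For reference, the segmentation and quantification metrics defined above can be computed as in the NumPy sketch below; the binary-mask and index-vector input conventions are assumptions.

```python
import numpy as np

def dice(y_true, y_pred):
    """Dice = 2|Y ∩ Ŷ| / (|Y| + |Ŷ|) x 100% for binary masks."""
    inter = np.logical_and(y_true, y_pred).sum()
    return 200.0 * inter / (y_true.sum() + y_pred.sum())

def iou(y_true, y_pred):
    """IoU = |Y ∩ Ŷ| / |Y ∪ Ŷ| x 100% for binary masks."""
    inter = np.logical_and(y_true, y_pred).sum()
    return 100.0 * inter / np.logical_or(y_true, y_pred).sum()

def mae(y_true, y_pred):
    """Mean absolute error between ground-truth and generated indices."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))
```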
5. Experiments and Results

The DSTGAN demonstrates high performance, with a pixel classification accuracy of 96.98%. The MAE of the centroid is 0.96 mm relative to the ground truth obtained manually by human experts, which demonstrates the effectiveness of this method in MI segmentation and quantification.
5.1. Accurate MI segmentation and precise MI quantification

The experimental results show that the DSTGAN can accurately segment the MI, as shown in Figure 8. The DSTGAN achieves an overall pixel classification accuracy of 96.98% with a sensitivity of 92.15% and a specificity of 98.70%. The Dice coefficient is 92.89 ± 0.33%, and the IoU is 82.16 ± 0.27%; the ROC and PR curves are shown in Fig. 9. The ground truth and the result are binary images; each pixel is assessed as infarcted or normal (0 or 1).
Figure 9: The ROC and PR curves demonstrate that the DSTGAN outperforms each separate component and other spatiotemporal learning methods.
The DSTGAN produces a good quantification of the MI, as shown in Tables 1 and 2. The MAE computed between the ground truth and our estimate is 17.15 ± 8.91 mm² for the infarct size; 9.24 ± 7.73% for the percentage of infarct size; 4.07 ± 2.91% for the percentage of segments; 3.95 ± 2.21 mm for the perimeter; 0.96 ± 0.77 mm for the centroid; 2.04 ± 1.10 mm for the major axis length; 1.07 ± 0.80 mm for the minor axis length; 5.47 ± 2.24° for the orientation; and 13.18 ± 9.83% for the transmurality. Table 3 summarizes the comparative results on each quantification index subgroup. The ground truth was obtained by radiologists who measured the segmentation results of DE-MR images.

Tables 1 and 2 also demonstrate that joint learning of the segmentation task and the quantification task outperforms each individual task. Joint learning improves the effectiveness and stability of both segmentation and quantification in terms of accuracy (1% improvement), Dice (2%), IoU (3%), infarct size (1 mm), percentage of infarct size (4%), percentage of segments (0.5%), perimeter (1 mm), major axis length (0.1 mm), minor axis length (0.7 mm), orientation (1°), and transmurality (3%).
Table 1: Each technological innovation component in the DSTGAN effectively improves the accuracy of the MI segmentation and part of the quantification.

                      Segmentation                               Quantification
                      Accuracy (%)  Dice (%)     IOU (%)      Size (mm²)    Per-size (%)  Per-Seg (%)
ConvLSTM (Baseline)   89.27±3.04    83.70±3.27   66.10±6.91   72.73±62.89   10.54±8.36    8.91±7.07
+ STVCB               90.94±1.15    83.68±3.46   69.63±3.82   80.37±67.14   11.63±9.91    6.95±5.27
+ Feature pyramid     92.37±0.81    85.68±1.01   70.51±2.66   49.32±42.41   10.14±8.72    4.46±3.16
  structure
+ CTG                 95.10±0.56    90.68±0.73   73.47±1.41   27.26±24.91   12.64±9.85    5.17±4.27
+ Ds                  95.92±0.58    91.26±0.58   77.34±0.30   -             -             -
+ Dq                  -             -            -            18.72±14.15   10.11±7.04    4.62±3.30
+ Ds and Dq           96.71±0.47    91.43±0.30   79.52±0.32   19.24±11.93   8.76±6.91     4.34±3.57
+ Dr                  96.93±0.23    92.11±0.38   78.85±0.37   20.61±10.04   9.69±6.72     1.42±3.34
+ Conditional         96.98±0.31    92.89±0.33   80.16±0.27   17.15±8.91    9.24±7.73     4.07±2.91

MAE is used to evaluate all the quantification indices. Size = infarct size, Per-size = percentage of infarct size, Per-Seg = percentage of segments.
5.2. Advantages of the spatiotemporal feature encoder

Figures 9 and 10 and Tables 1, 2 and 3 indicate that the STVE performs better than all other frameworks because it creates a pyramid structure to produce a coarse-to-fine hierarchical feature and achieves a strong spatial and temporal representation over multiple scales and aspect ratios. To evaluate the feature-learning performance, we replaced the STVE with a version without the pyramid structure and without STVCBs, and with recently popular methods (3DConvs, ConvLSTMs, 3DConvs + LSTMs, CNNs and LSTMs), in our framework. These experimental results demonstrate that integrating different spatiotemporal scales and aspect ratios with hierarchical, coarse-to-fine spatiotemporal feature learning is more reliable and robust than not using them or using them independently.
Table 2: Table 1 continued.

                      Quantification
                      Peri (mm)     Cent (mm)    Majo (mm)    Mino (mm)    Orie (°)      Tran (%)
ConvLSTM (Baseline)   10.17±8.20    4.55±2.97    9.49±7.31    8.64±7.08    12.62±10.35   18.61±15.96
+ STVCB               7.37±6.94     2.78±1.28    6.84±5.01    5.35±3.50    8.27±6.22     15.07±13.72
+ Feature pyramid     6.94±5.40     1.42±0.72    7.08±5.49    4.58±3.38    8.14±7.15     18.31±16.83
  structure
+ CTG                 8.61±6.68     1.13±0.88    5.34±3.97    2.23±1.39    7.74±5.81     15.27±13.35
+ Ds                  -             -            -            -            -             -
+ Dq                  4.86±2.28     0.91±0.74    2.12±1.50    1.76±0.94    6.58±4.29     16.37±10.92
+ Ds and Dq           5.07±1.16     0.97±0.57    2.31±1.53    1.60±0.77    6.36±3.48     16.18±11.40
+ Dr                  4.21±3.04     0.92±0.72    2.68±1.78    1.16±0.82    7.60±4.07     19.86±12.09
+ Conditional         3.95±2.21     0.96±0.77    2.04±1.10    1.07±0.80    5.47±2.24     13.18±9.38

MAE is used to evaluate all the quantification indices. Peri = perimeter, Cent = centroid, Majo = major axis length, Mino = minor axis length, Orie = orientation, Tran = transmurality.
This integration provides our framework with a more comprehensive understanding than all existing methods in learning the morphological and kinematic abnormalities of the left ventricle. Furthermore, our experiments demonstrate that the STVCB, which combines ConvLSTM and 3DConv to simultaneously extract features at different spatiotemporal resolutions, is the best choice among similar methods. This combination avoids the disadvantages of ConvLSTM, which has only a single spatiotemporal resolution, and of 3DConv, which is only suitable for short-term spatiotemporal learning, and it directly models the advanced temporal and spatial abstraction of MIs by considering spatial and temporal correlations.
Table 3: The STVE works better than other popular spatiotemporal learning constructs.

                     Dice      Infarct size (mm²)
STVE                 92.89%    17.15
3DConvs              81.67%    65.80
ConvLSTMs            86.37%    49.78
3DConvs + LSTMs      87.46%    42.01
LSTMs                74.07%    116.74
CNNs                 71.91%    224.44
5.3. Advantage of the conditional GAN architecture

Figure 9 and Tables 1 and 2 show that the DSTGAN has better segmentation and quantification performance than STVE + CTG without the GAN architecture because it enhances the inter-/intra-label relatedness via adversarial learning. To evaluate this ability, the discriminators D_s, D_q and D_r were individually implemented in the CTG for the estimated tasks. Among these results, Tables 1 and 2 indicate that integrating the top-down, coarse-to-fine feature map generation process with the cross-task connection sharing process improves the accuracy of the task generation. This integration reduces the loss of information when a feature map is downsampled or upsampled and improves the shared representation between the segmentation and quantification tasks by learning the reciprocal interaction between them. Tables 1 and 2 and Figure 9 indicate that the three types of discriminators effectively improve the accuracy and robustness of the generated results by learning the relatedness within and among the task labels, and this architecture is flexible and easy to implement. Moreover, Tables 1 and 2 indicate that the conditional information effectively controls the generation process to reduce the fluctuation of the results and improve the robustness of the network.
5.4. Advantage of jointly using the L1 and L2 loss

Table 4 demonstrates that our generator loss, which jointly uses the L1 and L2 terms, achieves an overall better solution than using the L1 term alone. Jointly using L1 and L2 improves the performance for the infarct size, the percentage of segments, the major axis length, the minor axis length, and the orientation. This is because our network involves multi-task learning, and the L2 term of L1 + L2 outperforms L1 in the regression part in both efficiency and effectiveness. Moreover, jointly using L1 and L2 significantly improves the stability of all segmentation and quantification results. This is because the addition of the L2 term
Table 4: Jointly using the L1 and L2 loss achieves a better solution than the L1 term alone.

         Dice          Size          Per-size     Per-Seg      Peri
L1       92.90±0.51    17.94±9.11    9.06±8.62    4.61±3.27    3.90±2.88
L1+L2    92.89±0.33    17.15±8.91    9.24±7.73    4.07±2.91    3.95±2.21

         Cent          Majo          Mino         Orie         Tran
L1       0.95±0.77     2.21±1.11     1.13±0.82    5.49±3.09    13.02±10.52
L1+L2    0.96±0.77     2.04±1.10     1.07±0.80    5.47±2.24    13.18±9.38

MAE is used to evaluate all the quantification indices. Peri = perimeter, Cent = centroid, Majo = major axis length, Mino = minor axis length, Orie = orientation, Tran = transmurality.
avoids the issue that the derivative of L1 is discontinuous in all tasks, and thus more stable results are achieved.
5.5. Comparison with state-of-the-art methods

Table 5 demonstrates that the DSTGAN achieves higher segmentation and quantification accuracy, and covers more quantification terms, than the existing segmentation/quantification (Seg/Qua) methods. MuTGAN corresponds to our MICCAI 2018 work. The terms 'Seg' and 'Qua' indicate that a method can directly segment or quantify myocardial infarction without additional steps. The term 'NaN' indicates that a method cannot estimate this index. Overall, by comparison, the DSTGAN consistently outperforms our previous method and all state-of-the-art MI segmentation methods (Xue et al., 2017; Popescu et al., 2016; Bleton et al., 2015) in terms of accuracy, Dice coefficient, IoU, infarct size, perimeter and centroid, and the MI's transmurality and percentage of infarct size of the LV are obtained for the first time. These significant improvements in segmentation, particularly in IoU, indicate that the DSTGAN improves on the existing methods, which suffer from large numbers of badly scattered points, unconnected areas and severe overestimation problems (Xue et al., 2017). A possible reason is that our method systematically models the motion and deformation of 2D+T data instead of the pixel-by-pixel estimation in existing methods and uses the inter-/intra-label relatedness to regulate the training process. Furthermore, the improvement in MI quantification demonstrates the superiority of the DSTGAN as a direct quantification method and as a combination of quantification and segmentation. This work exceeds the work in MICCAI-2018 through a more effective spatiotemporal feature learning framework, a more detailed generation structure, and more advanced training strategies.
Table 5: DSTGANs simultaneously segment and quantify MIs without a contrast agent and yield higher performance than some state-of-the-art methods.

                    DSTGANs      MuTGAN       Xu et al.          Popescu et al.          Bleton et al.
                                              (Xu et al., 2017)  (Popescu et al., 2016)  (Bleton et al., 2015)
Seg/Qua             Seg and Qua  Seg and Qua  Seg only           Seg only                Seg only
Accuracy            96.98%       96.57%       94.73%             86.47%                  84.39%
Dice                92.89%       90.18%       89.40%             75.33%                  71.51%
IoU                 82.16%       79.51%       76.28%             68.15%                  63.72%
Infarct size        17.15        23.24        93.16*             159.93*                 194.75*
Perimeter           3.95         5.27         17.83*             31.62*                  44.37*
Centroid            0.96         1.16         6.49*              8.03*                   10.92*
Pct. infarct size   4.07         NaN          NaN                NaN                     NaN
Transmurality       13.18        NaN          NaN                NaN                     NaN

* Quantification result estimated from the segmentation result. Pct. = Percentage of. Note that MuTGAN corresponds to our MICCAI 2018 work.
6. Discussion544
To the best of our knowledge, this is the first report that describes a non-545
contrast agent approach that can provide clinicians with all the required factors546
for the current clinical assessment of MI directly from cine MR imaging.547
Our approach satisfies a clinical desire to replace contrast agent administration with an inexpensive, end-to-end, low-risk, reproducible and stable tool with high clinical impact that can change the clinical procedure and management of MI patients. This is vital for clinical studies because of the limitations of contrast agent administration (e.g., gadolinium chelates), which may not be suitable for patients with chronic kidney disease according to the US Renal Data System. Specifically, >40% of patients with chronic kidney disease also suffer from cardiovascular disease (Fox et al., 2010), and 20% of acute MI patients have accompanying chronic kidney disease (Fox et al., 2010). In contrast, our framework only requires cine images, which are inevitably acquired as a routine part of a cardiac examination for the assessment of function, and it produces segmentation and quantification results comparable to those of the LGE technique. This is evidenced by Table 6, which indicates that no significant differences (P > 0.05) were found between the results obtained by our method and the ground truth obtained by experienced experts.
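The text does not state which statistical test produced the P-values in Table 6; a paired two-sided t-test over per-subject index values is one plausible choice, sketched below with synthetic stand-in data rather than study data.

```python
# Sketch of a method-vs-expert agreement test (assumption: a paired
# two-sided t-test; the specific test used in the paper is not stated).
# All arrays below are synthetic stand-ins, not study data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
n_subjects = 165
dstgan_infarct = rng.normal(9.31, 2.69, size=n_subjects)            # cm², synthetic
expert_infarct = dstgan_infarct + rng.normal(0.0, 2.5, size=n_subjects)  # zero-mean disagreement

t_stat, p_value = stats.ttest_rel(dstgan_infarct, expert_infarct)   # paired comparison
print(f"paired t = {t_stat:.3f}, P = {p_value:.3f}")                # P > 0.05 -> no significant difference
```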
Table 6: No significant differences (P > 0.05) were found between the quantification results obtained using our method and the clinician-measured values in the LGE image.

Characteristic             DSTGANs            Ground Truth       P
Infarct size (cm²)         9.31 ± 2.69        7.74 ± 2.20        .146
Pct. infarct size (%)      28.94 ± 19.31      24.47 ± 16.25      .060
Pct. segments (%)          27.39 ± 16.43      25.97 ± 12.29      .079
Perimeter (mm)             172.55 ± 73.49     170.85 ± 82.79     .102
Centroid                   X: 30.24 ± 13.66   X: 30.03 ± 13.00   .743
                           Y: 34.34 ± 10.91   Y: 33.72 ± 11.28
Major axis length (mm)     73.74 ± 26.15      72.16 ± 28.42      .515
Minor axis length (mm)     33.64 ± 21.22      31.40 ± 19.99      .260
Orientation (°)            109.44 ± 56.49     106.02 ± 56.28     .112
Transmurality (%)          77.60 ± 33.54      64.59 ± 27.38      .089

Pct. = percentage of

For the first time, our approach provides a clinical tool that can automatically quantify the transmurality and the percentage infarct size of the LV directly from cine MR images. This ability can help clinicians to more accurately 1) plan the patient's treatment because these indices are the primary determinant of
LV remodeling in long-term survivors of MI (Ørn et al., 2007), 2) diagnose LV constriction damage because the percentage infarct size of the LV is a major determinant of LV performance after MI (Ørn et al., 2007) and 3) determine the possibility of cardiac functional recovery after revascularization because lower transmurality indicates a high likelihood of recovery (Ørn et al., 2007). However, directly quantifying these two indices is extremely challenging for a diagnosis based on nonenhanced imaging because 1) small transmural and subendocardial scars often show no signal abnormalities on cine MR images and 2) additional information on the LV wall that is not included in our training data is normally required, with one index requiring the LV wall thickness and the other requiring the LV wall size. Even so, our approach successfully quantifies these two indices directly from cine MR images because it learns detailed pixel-level motion and deformation information of all tissues in the cine MR images during training (some characteristics of the LV wall have been learned and displayed), as shown in Figure 10.

Figure 10: The feature map clearly demonstrates the ability of our method to learn the motion and deformation of cine MR images. The second row shows the 32 × 32 feature maps of the corresponding segmentation ground truth (first row). The high-intensity regions in these feature maps approximately correspond to the areas most responsive to deformation and motion. The low-intensity regions represent the infarction area because infarction causes deformation and motion abnormalities (reduced motion and thinned myocardium). In addition, our method can also learn, without supervision, some of the characteristics of the LV wall through the unique spatiotemporal information of the myocardium.
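For illustration only, the two indices could be defined over binary masks as below. This is a hedged sketch of the quantities themselves; our network regresses them directly rather than deriving them from a segmentation, and the sector-based transmurality surrogate is an assumption, not the published definition.

```python
# Hedged sketch of the two indices over hypothetical binary masks.
# The sector-wise transmurality surrogate is a common proxy, NOT the
# definition used by the DSTGAN (which regresses the indices directly).
import numpy as np

def pct_infarct_size(infarct_mask: np.ndarray, lv_myo_mask: np.ndarray) -> float:
    """Percentage of the LV myocardium occupied by infarct."""
    return float(100.0 * infarct_mask.sum() / max(int(lv_myo_mask.sum()), 1))

def transmurality(infarct_mask: np.ndarray, lv_myo_mask: np.ndarray,
                  centre: tuple, n_sectors: int = 60) -> float:
    """Mean sector-wise ratio of infarcted to total wall pixels in radial
    sectors around the LV centre (a surrogate for transmural extent)."""
    ys, xs = np.nonzero(lv_myo_mask)
    if ys.size == 0:
        return 0.0
    angles = np.arctan2(ys - centre[0], xs - centre[1])
    sector = ((angles + np.pi) / (2 * np.pi) * n_sectors).astype(int) % n_sectors
    inf = infarct_mask[ys, xs].astype(float)
    ratios = [inf[sector == s].mean() for s in range(n_sectors) if (sector == s).any()]
    return float(100.0 * np.mean(ratios))
```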
Our proposed DSTGAN can be extended to similar applications because of the following technical advantages: 1) Multi-scale-kernel 3D convolution and spatiotemporal feature pyramids are first constructed to effectively handle the high spatiotemporal variability across spatial scales and temporal periods (see the sketch after this paragraph); they overcome the limitation of large gaps in the spatiotemporal patterns and disturbances in the pixel gray levels among different locations or periods. 2) The task-aware objective function learns a reciprocal commonality representation of the two related tasks to train mappings from the input cine MR images to the two related outputs, and it ensures, via adversarial training, that the generated results indeed approach the ground truth. This function overcomes the limitation of multi-task adversarial training, which otherwise requires notably different loss formulations to address different feature distributions and is therefore difficult to train. 3) Conditional adversarial learning with multiple discriminators iteratively optimizes the integrated distribution, with the input as conditional information, using the label contiguity to produce deterministic outputs. This learning strategy avoids the stochastic output that a basic conditional GAN exhibits because of the noise z, and it enhances the generation performance.
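As an illustration of advantage 1), a minimal PyTorch sketch of a multi-scale-kernel 3D convolution block over a 2D+T cine clip is given below; the kernel set (3, 5, 7) and channel widths are assumptions rather than the published architecture.

```python
# Minimal sketch of multi-scale-kernel 3D convolution over a 2D+T clip.
# Kernel sizes and channel widths are assumptions, not the published network.
import torch
import torch.nn as nn

class MultiScale3DConv(nn.Module):
    def __init__(self, in_ch: int = 1, branch_ch: int = 16):
        super().__init__()
        # One branch per kernel scale; 'same' padding keeps T x H x W intact.
        self.branches = nn.ModuleList([
            nn.Conv3d(in_ch, branch_ch, kernel_size=k, padding=k // 2)
            for k in (3, 5, 7)
        ])
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width), e.g., (N, 1, 25, 64, 64)
        return torch.cat([self.act(b(x)) for b in self.branches], dim=1)

clip = torch.randn(2, 1, 25, 64, 64)    # two synthetic cine MR clips
features = MultiScale3DConv()(clip)     # -> (2, 48, 25, 64, 64)
```

Concatenating the branches gives each spatial location access to responses at several receptive-field sizes at once, which is one way to absorb the scale and period variability described above.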
7. Conclusion
A deep spatiotemporal adversarial network has been proposed for the simultaneous segmentation and quantification of MIs without the use of contrast agents. The DSTGAN was evaluated on 165 subjects and yielded a pixel classification accuracy of 96.98%; the MAE of the infarction size was 17.15 mm². These results demonstrate that our proposed method can serve as an efficient and accurate clinical tool to systematize MI assessment and provide timely interpretations for clinical follow-up.
References
Afshin, M., Ayed, I.B., Islam, A., Goela, A., Peters, T.M., Li, S., 2012. Global assessment of cardiac function using image statistics in MRI, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 535–543.

Bijnens, B., Claus, P., Weidemann, F., Strotmann, J., Sutherland, G.R., 2007. Investigating cardiac function using motion and deformation analysis in the setting of coronary artery disease. Circulation 116, 2453–2464.

Bleton, H., Margeta, J., Lombaert, H., Delingette, H., Ayache, N., 2015. Myocardial infarct localization using neighbourhood approximation forests, in: International Workshop on Statistical Atlases and Computational Models of the Heart, Springer. pp. 108–116.

Duchateau, N., De Craene, M., Allain, P., Saloux, E., Sermesant, M., 2016. Infarct localization from myocardial deformation: prediction and uncertainty quantification by regression from a low-dimensional space. IEEE Transactions on Medical Imaging 35, 2340–2352.

Falsetti, H.L., Marcus, M.L., Kerber, R.E., Skorton, D.J., 1981. Quantification of myocardial ischemia and infarction by left ventricular imaging. Circulation 63, 747–751.

Flett, A.S., Hasleton, J., Cook, C., Hausenloy, D., Quarta, G., Ariti, C., Muthurangu, V., Moon, J.C., 2011. Evaluation of techniques for the quantification of myocardial scar of differing etiology using cardiac magnetic resonance. JACC: Cardiovascular Imaging 4, 150–156.

Fox, C.S., Muntner, P., Chen, A.Y., Alexander, K.P., Roe, M.T., Cannon, C.P., Saucedo, J.F., Kontos, M.C., Wiviott, S.D., 2010. Use of evidence-based therapies in short-term outcomes of ST-segment elevation myocardial infarction and non-ST-segment elevation myocardial infarction in patients with chronic kidney disease. Circulation 121, 357–365.

Goodfellow, I., Bengio, Y., Courville, A., Bengio, Y., 2016. Deep Learning. Volume 1. MIT Press, Cambridge.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y., 2014. Generative adversarial nets, in: Advances in Neural Information Processing Systems, pp. 2672–2680.

Gu, B., Shan, Y., Sheng, V.S., Zheng, Y., Li, S., 2018. Sparse regression with output correlation for cardiac ejection fraction estimation. Information Sciences 423, 303–312.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C., 2017. Improved training of Wasserstein GANs, in: Advances in Neural Information Processing Systems, pp. 5767–5777.

Heiberg, E., Ugander, M., Engblom, H., Gotberg, M., Olivecrona, G.K., Erlinge, D., Arheden, H., 2008. Automated quantification of myocardial infarction from MR images by accounting for partial volume effects: animal, phantom, and human study. Radiology 246, 581–588.

Huang, Z., Xu, W., Yu, K., 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.

Ingkanisorn, W.P., Rhoads, K.L., Aletras, A.H., Kellman, P., Arai, A.E., 2004. Gadolinium delayed enhancement cardiovascular magnetic resonance correlates with clinical measures of myocardial infarction. Journal of the American College of Cardiology 43, 2253–2259.

Kali, A., Cokic, I., Tang, R.L., Yang, H.J., Sharif, B., Marban, E., Li, D., Berman, D., Dharmakumar, R., 2014. Determination of location, size and transmurality of chronic myocardial infarction without exogenous contrast media using cardiac magnetic resonance imaging at 3T. Circulation: Cardiovascular Imaging.

Karim, R., Housden, R.J., Balasubramaniam, M., Chen, Z., Perry, D., Uddin, A., Al-Beyatti, Y., Palkhi, E., Acheampong, P., Obom, S., et al., 2013. Evaluation of current algorithms for segmentation of scar tissue from late gadolinium enhancement cardiovascular magnetic resonance of the left atrium: an open-access grand challenge. Journal of Cardiovascular Magnetic Resonance 15, 105.

Kingma, D.P., Ba, J.L., 2014. Adam: A method for stochastic optimization, in: Proc. 3rd Int. Conf. Learn. Representations.
Ledesma-Carbayo, M.J., Kybic, J., Desco, M., Santos, A., Suhling, M., Hunziker, P., Unser, M., 2005. Spatio-temporal nonrigid registration for ultrasound cardiac motion estimation. IEEE Transactions on Medical Imaging 24, 1113–1126.

Lin, T.Y., Dollar, P., Girshick, R.B., He, K., Hariharan, B., Belongie, S.J., 2017. Feature pyramid networks for object detection, in: CVPR.

Luc, P., Couprie, C., Chintala, S., Verbeek, J., 2016. Semantic segmentation using adversarial networks. arXiv preprint arXiv:1611.08408.

Mirza, M., Osindero, S., 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.

Ordovas, K.G., Higgins, C.B., 2011. Delayed contrast enhancement on MR images of myocardium: past, present, future. Radiology 261, 358–374.

Ørn, S., Manhenke, C., Anand, I.S., Squire, I., Nagel, E., Edvardsen, T., Dickstein, K., 2007. Effect of left ventricular scar size, location, and transmurality on left ventricular remodeling with healed myocardial infarction. The American Journal of Cardiology 99, 1109–1114.

Petitjean, C., Dacher, J.N., 2011. A review of segmentation methods in short axis cardiac MR images. Medical Image Analysis 15, 169–184.

Popescu, I.A., Irving, B., Borlotti, A., Dall'Armellina, E., Grau, V., 2016. Myocardial scar quantification using SLIC supervoxels-parcellation based on tissue characteristic strains, in: International Workshop on Statistical Atlases and Computational Models of the Heart, Springer. pp. 182–190.

Ronneberger, O., Fischer, P., Brox, T., 2015. U-net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 234–241.

Schlegl, T., Waldstein, S.M., Bogunovic, H., Endstraßer, F., Sadeghipour, A., Philip, A.M., Podkowinski, D., Gerendas, B.S., Langs, G., Schmidt-Erfurth, U., 2017. Fully automated detection and quantification of macular fluid in OCT using deep learning. Ophthalmology.

Schuijf, J.D., Kaandorp, T.A., Lamb, H.J., van der Geest, R.J., Viergever, E.P., van der Wall, E.E., de Roos, A., Bax, J.J., 2004. Quantification of myocardial infarct size and transmurality by contrast-enhanced magnetic resonance imaging in men. The American Journal of Cardiology 94, 284–288.
Shou, Z., Chan, J., Zareian, A., Miyazawa, K., Chang, S.F., 2017. CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos, in: Computer Vision and Pattern Recognition, IEEE. pp. 1417–1426.

Shou, Z., Wang, D., Chang, S.F., 2016. Temporal action localization in untrimmed videos via multi-stage CNNs, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1049–1058.

Suinesiaputra, A., Ablin, P., Alba, X., Alessandrini, M., Allen, J., Bai, W., Cimen, S., Claes, P., Cowan, B.R., D'hooge, J., et al., 2017. Statistical shape modeling of the left ventricle: myocardial infarct classification challenge. IEEE Journal of Biomedical and Health Informatics.

Sun, H., Zhen, X., Bailey, C., Rasoulinejad, P., Yin, Y., Li, S., 2017. Direct estimation of spinal Cobb angles by structured multi-output regression, in: International Conference on Information Processing in Medical Imaging, Springer. pp. 529–540.

Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M., 2015. Learning spatiotemporal features with 3D convolutional networks, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497.

Wong, K.C., Tee, M., Chen, M., Bluemke, D.A., Summers, R.M., Yao, J., 2016. Regional infarction identification from cardiac CT images: a computer-aided biomechanical approach. International Journal of Computer Assisted Radiology and Surgery 11, 1573–1583.

Xingjian, S., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.c., 2015. Convolutional LSTM network: A machine learning approach for precipitation nowcasting, in: Advances in Neural Information Processing Systems, pp. 802–810.

Xu, C., Xu, L., Gao, Z., Zhao, S., Zhang, H., Zhang, Y., Du, X., Zhao, S., Ghista, D., Li, S., 2017. Direct detection of pixel-level myocardial infarction areas via a deep-learning algorithm, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 240–249.

Xue, W., Brahm, G., Pandey, S., Leung, S., Li, S., 2018. Full left ventricle quantification via deep multitask relationships learning. Medical Image Analysis 43, 54–65.
Xue, W., Nachum, I.B., Pandey, S., Warrington, J., Leung, S., Li, S., 2017. Direct estimation of regional wall thicknesses via residual recurrent neural network, in: International Conference on Information Processing in Medical Imaging, Springer. pp. 505–516.

Yeung, S., Russakovsky, O., Mori, G., Fei-Fei, L., 2016. End-to-end learning of action detection from frame glimpses in videos, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2678–2687.

Zhen, X., Islam, A., Bhaduri, M., Chan, I., Li, S., 2015a. Direct and simultaneous four-chamber volume estimation by multi-output regression, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 669–676.

Zhen, X., Wang, Z., Islam, A., Bhaduri, M., Chan, I., Li, S., 2016. Multi-scale deep networks and regression forests for direct bi-ventricular volume estimation. Medical Image Analysis 30, 120–129. doi:10.1016/j.media.2015.07.003.

Zhen, X., Wang, Z., Yu, M., Li, S., 2015b. Supervised descriptor learning for multi-output regression, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1211–1218.

Zhen, X., Yu, M., He, X., Li, S., 2017a. Multi-target regression via robust low-rank learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 497–504.

Zhen, X., Zhang, H., Islam, A., Bhaduri, M., Chan, I., Li, S., 2017b. Direct and simultaneous estimation of cardiac four chamber volumes by multioutput sparse regression. Medical Image Analysis 36, 184–196.
Conflicts of interest: none