


Blockwise Parallel Decoding for Deep Autoregressive Models

Mitchell Stern, UC Berkeley
Noam Shazeer, Google Brain
Jakob Uszkoreit, Google Brain

Overview

Some recent sequence-to-sequence models like the Transformer (Vaswani et al., 2017) can score all output positions in parallel. We propose a simple algorithmic technique that exploits this property to generate multiple tokens in parallel at decoding time with little to no loss in quality. Our fastest models exhibit wall-clock speedups of up to 4x over standard greedy decoding on the tasks of machine translation and image super-resolution.

Basic Approach

Combined Approach

Implementation and Training

• Augment the decoder architecture to predict the next k tokens in parallel with sub-models p_1, …, p_k

• Either use a frozen base model to ensure comparable quality, or employ fine-tuning to improve internal consistency and achieve better future prediction

• Optionally use sequence-level knowledge distillation to construct a training set with greater predictability, arising from consistent mode breaking by the teacher model
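As a rough illustration of the first bullet above, the decoder's output layer can feed one shared hidden state into k separate projections, so that sub-models p_1, …, p_k each propose the token at a different future offset in a single forward pass. The sketch below uses toy dimensions and random weights; all names and shapes are hypothetical, not taken from the paper's implementation:

```python
# Toy sketch of an augmented decoder output layer: k projections over one
# shared hidden state, one per future position (hypothetical, not the
# authors' code).
import random

random.seed(0)
d_model, vocab, k = 4, 6, 3

# one toy weight matrix (d_model x vocab) per sub-model p_1 ... p_k
W = [[[random.uniform(-1.0, 1.0) for _ in range(vocab)]
      for _ in range(d_model)] for _ in range(k)]

def predict_block(hidden):
    """Return k proposed next tokens: the argmax of each sub-model's logits."""
    proposals = []
    for W_i in W:
        logits = [sum(hidden[h] * W_i[h][v] for h in range(d_model))
                  for v in range(vocab)]
        proposals.append(max(range(vocab), key=lambda v: logits[v]))
    return proposals

tokens = predict_block([0.5, -1.0, 0.25, 2.0])  # k proposals from one state
```

Because all k projections read the same hidden state, the extra proposals cost one matrix multiply each rather than k additional decoder passes.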

Examples

English-German machine translation using a model trained with k = 10:

Input: The James Webb Space Telescope (JWST) will be launched into space on board an Ariane5 rocket by 2018 at the earliest.

Output: Das James Webb Space Teleskop (JWST) wird bis spätestens 2018 an Bord einer Ariane5-Rakete in den Weltraum gestartet.

• Step 1 (10 tokens): [Das_, James_, Web, b_, Space_, Tele, sko, p_, (_, J]
• Step 2 (5 tokens): [W, ST_, ) _, wird_, bis_]
• Step 3 (4 tokens): [späte, stens_, 2018_, an_]
• Step 4 (10 tokens): [Bord_, einer_, Ari, ane, 5_, -_, Rak, ete_, in_, den_]
• Step 5 (2 tokens): [Weltraum, _]
• Step 6 (3 tokens): [gestartet_, ._, <EOS>]
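The step sizes listed above recover the "mean accepted block size" statistic reported under Results: 34 tokens are produced in 6 parallel steps rather than 34 sequential greedy steps.

```python
# Mean accepted block size for the translation example above.
block_sizes = [10, 5, 4, 10, 2, 3]          # tokens accepted per step
total = sum(block_sizes)                     # 34 tokens overall
mean_block = total / len(block_sizes)        # ~5.67 tokens per step
```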

Image super-resolution using a model trained with k = 10 and allowing for approximate pixel matches (left: input, middle: greedy decode, right: parallel decode). [Image triptych not reproduced in this transcript.]

Results

Figure captions (plots not reproduced in this transcript):
• EN-DE machine translation: dev BLEU score and mean accepted block size
• EN-DE machine translation: test BLEU score and wall-clock speedup
• Image super-resolution: mean accepted block size
• Image super-resolution: human evaluation
• Wall-clock speedup vs. mean accepted block size

Basic Approach (diagram caption): Predict the next k tokens using the base scoring model and k-1 auxiliary models; verify the predictions in parallel using the base model; accept the longest prefix that agrees with the greedy predictions.
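The predict-verify-accept cycle described above can be sketched as follows. The toy oracle models and token strings here are hypothetical stand-ins for the real scoring and proposal networks; note that verification also yields the base model's greedy token at the first mismatch, so each step always makes progress:

```python
# Sketch of blockwise parallel decoding (toy models, not the authors'
# implementation): propose k tokens, verify them against the base model's
# greedy choices, and accept the agreeing prefix plus the correction token.

def blockwise_decode(base_greedy, propose, k, eos, max_len=50):
    out, steps = [], 0
    while len(out) < max_len and (not out or out[-1] != eos):
        props = propose(out)[:k]                      # predict substep
        # verify substep: greedy base prediction at each proposed position
        # (a real Transformer computes all of these in one parallel call)
        greedy = [base_greedy(out + props[:j]) for j in range(len(props))]
        for p, g in zip(props, greedy):
            out.append(g)          # g == p when the proposal is verified;
            if p != g:             # otherwise g is the base model's
                break              # correction, ending this block early
        steps += 1
    return out, steps

# Toy demo: the base model is an oracle for a fixed target sentence, and the
# proposal model makes one systematic mistake ("in" instead of "on").
target = ["the", "cat", "sat", "on", "the", "mat", "<eos>"]
base_greedy = lambda seq: target[len(seq)]
propose = lambda seq: ["in" if t == "on" else t
                       for t in target[len(seq):len(seq) + 4]]

out, steps = blockwise_decode(base_greedy, propose, k=4, eos="<eos>")
# recovers the exact greedy output in 2 parallel steps instead of 7
```

Because acceptance is gated on agreement with the base model's greedy choices, the output is identical to plain greedy decoding; only the number of sequential steps changes.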

Combined Approach (diagram caption): Combining the scoring and proposal models allows us to merge the current verify substep with the next predict substep, reducing the number of parallel model invocations during inference by a factor of 2.
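A back-of-the-envelope count illustrates the factor-of-2 claim. The function below is a simplifying sketch (it ignores the final partial step and any per-call overhead): the basic scheme spends one predict call and one verify call per accepted block, while the combined scheme reuses each verify call as the next predict, so m blocks need about m + 1 invocations instead of 2m.

```python
# Hypothetical invocation count per decoded sequence (a sketch, not a
# result from the paper's tables).
def invocations(num_blocks, combined):
    # basic: predict + verify per block = 2m calls
    # combined: one initial predict, then one merged call per block = m + 1
    return num_blocks + 1 if combined else 2 * num_blocks

# For the 6-step translation example above:
basic = invocations(6, combined=False)    # 12 calls
merged = invocations(6, combined=True)    # 7 calls, near the 2x reduction
```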