
  • GTC 2019, San Jose CA

    Optimizing Runtime Performance of Neural Net Architectures for High Scalability in Speech Recognition Servers

    John Kominek, CTO, Voci Technologies
    March 21, 2019
    S9535

  • High Density Speech Recognition on Nvidia GPUs

    ▪ Training of neural nets receives a lot of attention
    – Consumes the full resources of the GPU

    ▪ Less attention is given to evaluation, but it is critical to large commercial deployments
    – Each stream uses only a fraction of GPU resources
    – But you want to get maximal use of the card

  • An Untold Story of Neural Nets in ASR

    ▪ Lessons learned
    – Insight into what's easy and what's hard
    – The multi-threading trick that works surprisingly well
    – The mystery (and pain) of negative scaling
    – Kepler, Maxwell, Pascal, Volta – how do they stack up?

    ▪ Intermediate-level talk
    – Light on math, plenty of insider jargon

  • Company Highlights – Deliver the world's best speech-to-text platform for analytics

    [Timeline graphic, pre-2011 through 2019. Milestones (exact year assignments not fully recoverable): CMU government-funded projects lead to founding of Voci; first-generation V-Blaze speech engine developed (speed, accuracy, scalability; real-time transcription); Series A funding; AI powers ASR; deep learning; V-Spark introduced; speaker separation; emotion, sentiment, and gender labeling; integration; partner enablement; 2016: language ID introduced, expanded language models, over 10m minutes transcribed; V-Cloud introduced; speaker ID; 40 new logos; V-Blaze 5.0 released; custom language models; over 100m minutes transcribed; Series B funding; 5 billion minutes transcribed; biometrics introduced; 200 man-years of development; 2019: 50 employees, >10m bookings, >8 billion est. minutes transcribed, new website launched, focused business model.]

  • Neural Net Revolution = Company Crossroads

    ▪ Up to 2013
    – FPGA-implemented large fully continuous GMM models with integrated statistical language model evaluation and search. Fastest ASR engine in the world at the time.

    ▪ 2013
    – G. Hinton et al. established the superiority of deep neural networks, leading to a rare seismic shift in the field of speech recognition.

  • Voci's Technology Shift from FPGAs to GPUs

    ▪ 2013-2014 – Technology bakeoff
    – DNN evaluation implemented on a Xilinx Virtex-5 was pitted against a CUDA implementation on a Tesla K20
    – The Nvidia platform won convincingly
    – Matrix multiplication primitives are tailor-made for deep feedforward network evaluation
    – Migrated to the open source Kaldi toolkit for model training

  • Voci V-Blaze Runs on an Extensive Array of GPUs

    ▪ Servers: Tesla K20, K40, K80, M10, M60, P100, V100
    ▪ Embedded: Jetson Tegra TK1, TX1, TX2, CX2
    ▪ Laptops: GeForce GTX 960M, GTX 1050, GTX 1050 Ti
    ▪ In the cloud on AWS
    ▪ Red Hat/CentOS, Debian/Ubuntu

  • Pictures of Server Rooms are Boring, so...

    Voci powering advanced automotive conversational systems

  • If it's a Neural Net, Throw it on the GPU

    ▪ Voci V-Blaze runs
    – DNN
    – LSTM, BLSTM
    – CNN
    – TDNN
    – RNNLM
    – Combinations: e.g. DNN + CNN + BLSTM

  • A Story of Joy, Struggle, and Triumph

    ▪ Easy to accelerate: feedforward DNN
    ▪ Hard to accelerate: bidirectional LSTM

  • Evaluating Feedforward DNNs

    ▪ Single-threaded evaluation is a straightforward sequence of matrix multiplications and non-linear range-compression functions
    ▪ Invoke the appropriate cuDNN functions … and voila, marketing gold
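
    As a rough sketch of what one such layer looks like (illustrative only, not Voci's production code, using cuBLAS directly rather than cuDNN; all sizes are made up):

    // One DNN layer: y = sigmoid(W*x + b), with W (out x in), x (in x batch),
    // everything column-major as cuBLAS expects.
    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <math.h>

    /* Fused bias-add + sigmoid: the "non-linear range compression" step. */
    __global__ void bias_sigmoid(float *y, const float *b, int rows, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = 1.0f / (1.0f + expf(-(y[i] + b[i % rows])));
    }

    void dnn_layer(cublasHandle_t h, const float *W, const float *x,
                   const float *b, float *y, int out, int in, int batch) {
        const float one = 1.0f, zero = 0.0f;
        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, out, batch, in,
                    &one, W, out, x, in, &zero, y, out);  /* y = W * x */
        int n = out * batch;
        bias_sigmoid<<<(n + 255) / 256, 256>>>(y, b, out, n);
    }

    int main(void) {
        const int in = 440, out = 1024, batch = 8;  /* hypothetical sizes */
        float *W, *x, *b, *y;
        cudaMalloc(&W, sizeof(float) * out * in);
        cudaMalloc(&x, sizeof(float) * in * batch);
        cudaMalloc(&b, sizeof(float) * out);
        cudaMalloc(&y, sizeof(float) * out * batch);
        cublasHandle_t h;
        cublasCreate(&h);
        dnn_layer(h, W, x, b, y, out, in, batch);  /* one hidden layer */
        cudaDeviceSynchronize();
        cublasDestroy(h);
        cudaFree(W); cudaFree(x); cudaFree(b); cudaFree(y);
        return 0;
    }

    Chaining six such layers, the output of each feeding the next, gives the whole forward pass.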

  • Single Threaded Performance

  • Multi-Core Performance is What Matters

    ▪ There are plenty of CUDA cores left over
    ▪ There are untapped Xeon cores
    ▪ How well does neural net inference scale as the compute load is increased and more cores (GPU or CPU) are invoked?

  • Increasing Load on an M10, DNN Evaluation

  • Increasing Load on an M60, DNN Evaluation

  • Increasing Load on a P100, DNN Evaluation

  • Increasing Load on a V100, DNN Evaluation

  • Meaning of Compute Load

    ▪ A compute load of 1 is one process pumping audio to the GPU as fast as results can be returned
    ▪ Load = number of such processes in parallel
    ▪ Independent processes, not threads (a minimal load-generator sketch follows)
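
    A sketch of a load generator matching this definition (run_decode_client and its one-second stub are hypothetical stand-ins for the real benchmark client):

    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Stand-in for the real client, which would pump audio to the GPU
     * engine as fast as results can be returned. */
    static void run_decode_client(void) {
        sleep(1);
    }

    int main(int argc, char **argv) {
        int load = (argc > 1) ? atoi(argv[1]) : 1;  /* compute load N */
        for (int i = 0; i < load; i++) {
            if (fork() == 0) {       /* each unit of load is a full OS process */
                run_decode_client();
                _exit(0);
            }
        }
        while (wait(NULL) > 0)       /* block until all N processes finish */
            ;
        return 0;
    }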

  • Voci Engineers are Scofflaws!

    ▪ The Nvidia programming guidelines recommend multi-threading: separate processes do not truly run in parallel; to run in parallel, program threads
    ▪ We're like, "yeah, whatever."

  • Translating Compute Load to Speed

    ▪ Depends on the size of the neural net (rough arithmetic below)
    • Small = 1024x6, ~12 million connections
    • Medium = 2048x6, ~34 million
    • Large = 4096x6, ~110 million
    ▪ Speed is reported as x times faster than real time
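
    Sanity-checking the connection counts, and assuming (these dimensions are not stated in the talk) typical hybrid acoustic-model sizes of roughly $n_{in} \approx 440$ input features and $n_{out} \approx 8000$ senone outputs, six hidden layers of width $n$ give

    $C \approx n_{in} \cdot n + 5n^2 + n \cdot n_{out}$

    For the large model, $C \approx 440 \cdot 4096 + 5 \cdot 4096^2 + 4096 \cdot 8000 \approx 119$ million, in the ballpark of the quoted ~110 million; the small and medium models tally similarly.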

  • DNN Evaluation Speed vs Model Size (P100)

  • DNN Evaluation by Tesla Generation (K80)

  • DNN Evaluation by Tesla Generation (M10)

  • DNN Evaluation by Tesla Generation (M60)

  • DNN Evaluation by Tesla Generation (P100)

  • DNN Evaluation by Tesla Generation (V100)

  • Comparison to Pure CPU Performance Curve

  • V100 Provides Best Peak Power Efficiency

  • So Much for Easy, Now for Hard

    https://github.com/dophist/kaldi-lstm

  • Speed of Open Source Kaldi Implementation

  • Visual Profiler Reveals the Problem

    [Profiler timelines: DNN vs. BLSTM – in the BLSTM trace, kernel synchronization dominates]

  • Highly Suspicious Power/Utilization Pattern

  • The Shock of Negative Scaling

    Instead of saturating, speed decreases!

  • M10/M60 Scale According to GPU Count, then Drop

  • What to do?

    ▪ Separate processes were interfering with each other
    ▪ Three avenues forward
    – Switch to older cards that present multiple, less powerful GPU interfaces (the M10)
    – Re-engineer the infrastructure code into a multi-threaded, single-process server
    – See how far optimizing the code will take you

  • 4 Custom Optimizations

    ▪ Kernel merging (15%)
    ▪ Matrix transpose into row-major form (10%)*
    ▪ Reverse-direction compute stream pairs (24%)
    ▪ Application-specific data parallelism (26%)
    ▪ Together these increase single-process speed by 2x (kernel merging is sketched below)
    – * J. Appleyard, Optimizing Recurrent Neural Networks in cuDNN 5, GTC 2016
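
    To illustrate kernel merging (hypothetical kernels, not Voci's): two back-to-back element-wise launches are fused into one, halving the per-step launch and synchronization overhead that the profiler showed dominating BLSTM evaluation:

    #include <cuda_runtime.h>
    #include <math.h>

    /* Unmerged: two kernel launches per element-wise step. */
    __global__ void add_bias(float *y, const float *b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] += b[i];
    }
    __global__ void apply_tanh(float *y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = tanhf(y[i]);
    }

    /* Merged: one launch performs both steps. */
    __global__ void add_bias_tanh(float *y, const float *b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = tanhf(y[i] + b[i]);
    }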

  • Application Specific Data Parallelism

    ▪ Serialism inherent in recurrent loops can be approximated (sketched below)

    [Diagram: along the time axis, multiple fwd/bwd compute stream pairs run concurrently]

    cudaStreamCreate(&stream_fwd)
    cudaStreamCreate(&stream_bwd)
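
    A minimal sketch of one fwd/bwd stream pair (dummy_lstm_step is a placeholder, not the real recurrent kernel): the forward-direction and backward-direction passes of a BLSTM layer are independent of each other, so issuing each on its own stream lets the GPU overlap them, even though each direction remains serial in time:

    #include <cuda_runtime.h>

    __global__ void dummy_lstm_step(float *state, int dir) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        state[i] += dir;  /* placeholder for the real gate computations */
    }

    int main(void) {
        const int T = 100, N = 1024;  /* time steps, state size */
        float *state_fwd, *state_bwd;
        cudaMalloc(&state_fwd, N * sizeof(float));
        cudaMalloc(&state_bwd, N * sizeof(float));

        cudaStream_t stream_fwd, stream_bwd;
        cudaStreamCreate(&stream_fwd);
        cudaStreamCreate(&stream_bwd);

        /* Each direction is serial across t, but the two directions
         * proceed concurrently on their own streams. */
        for (int t = 0; t < T; t++) {
            dummy_lstm_step<<<N / 256, 256, 0, stream_fwd>>>(state_fwd, +1);
            dummy_lstm_step<<<N / 256, 256, 0, stream_bwd>>>(state_bwd, -1);
        }
        cudaStreamSynchronize(stream_fwd);
        cudaStreamSynchronize(stream_bwd);

        cudaStreamDestroy(stream_fwd);
        cudaStreamDestroy(stream_bwd);
        cudaFree(state_fwd);
        cudaFree(state_bwd);
        return 0;
    }

    Running several such pairs, e.g. one per chunk of the utterance, is the application-specific data parallelism shown in the diagram.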

  • Before and After – from 37.5x to 375x

  • Negative Scaling Eliminated

  • Power/Utilization after Optimization is Sane Again

  • V100 Still Leads on Power Efficiency

  • Best Price/Performance Provided by M10

  • Unsurprising Findings

    ▪ What's easy and what's hard
    – DNNs are easy, BLSTMs are hard

    ▪ Kepler, Maxwell, Pascal, Volta comparison
    – V100 is fastest
    – V100 has best power efficiency
    – M10 has best price/performance

  • Unexpected Findings

    ▪ The multi-threading trick that works surprisingly well to achieve high performance scaling
    – Don't multi-thread (even though you should)

    ▪ Negative scaling can happen – and can be overcome
    – It's still kind of a mystery, though
    – For advanced details, join our company

  • www.vocitec.com

    The only true enterprise speech-to-text platform that solves real business challenges

    [email protected], [email protected] (CEO)

    www.vocitec.com