Big Data - Ops vs Flops
Steve Wallach
swallach "at" conveycomputer "dot" com
Convey <= Convex++
Why is this interesting?
• Big Data is on everyone’s mind. It is in the NEWS.
– HPC Classic (Flops) & Big Data (Ops)
• Power efficiency is on everyone’s mind. It is in the NEWS.
• Are today’s processor architectures a match for the Exascale applications that are in the NEWS?
– Ops vs Flops
• A deep dive into some micro-architecture discussions
• A smarter computer for Exascale. It is in the NEWS.
From a NEWS Perspective
• The World Economic Forum declared data a new class of economic asset, like currency or gold.
• Data is growing at 50 percent a year, or more than doubling every two years, estimates IDC.
• “Data-driven decision making” achieved productivity gains that were 5 to 6 percent higher than other factors could explain.
• But beware “false discoveries”: “many bits of straw look like needles.”
February 11, 2012, NY Times, “The Age of Big Data”, Steve Lohr
Support from Washington DC
OBAMA ADMINISTRATION UNVEILS “BIG DATA” INITIATIVE: ANNOUNCES $200 MILLION IN NEW R&D INVESTMENTS
Office of Science and Technology Policy, Executive Office of the President
New Executive Office Building, Washington, DC 20502
March 29, 2012
Perhaps another reason
Venture Capitalist, Seriously
• “Computing may be on the cusp of another such wave.”
• “The wave will be based on smarter machines and software that will automate more tasks and help people make better decisions.”
• “The technological building blocks, both hardware and software, are falling into place, stirring optimism.”
September 8, 2012, NY Times, “Tech’s New Wave, Driven by Data”, Steve Lohr
Thinking on Technology
• Flops and Ops are applicable to Exascale
• More items in common than initially obvious
• Examine in more detail
• The role of Time Sharing
• The role of cloud computing
Compare Flops with Ops
ExaFlop (FP Intensive) – Historical Metrics/Features
• Memory Capacity – Byte/Flop
• Interconnect Bandwidth – 0.1 Bytes/Flop
• Memory Bandwidth – 4 Bytes/sec/Flop
• Linpack
– Relatively static data set size
– 3D … X,Y,Z – multiple points per cell
• WaveFront Algorithms
– Node-level synchronization
– Domain Decomposition
• 64-bit memory references
– 32- and 64-bit floats
• Cache Friendly (in general?)

ExaOp (Data Intensive)
• Transactions per sec
• Transaction Latency
• Graph edges per sec
• YES/NO
• Data Sets are dynamic
– New data always being added
• Never enough physical memory
– Memory/compute algorithms
– Limited disk access
– Flash memory / Memristor / Phase Change?
– Memcached
• Graphs, Trees, Bit and Byte strings
• Fine-grain synchronization
– Threads/Transactions
• Granularity of memory references
– User-defined strings
• NORA (Non-Obvious Relationship Analysis)
• Cache Unfriendly
Solving Flops – HPC Classic
• Multi (Many) Core
• Multiple FMACs – Floating Point
• Vector Accumulators
• Structured Data
• Attached accelerators – GPU, FPGA, …
• High-Speed Interconnects – MPI Pings
• Open-source numerical libraries (e.g., LAPACK)
• Memory system
– Classic – highly interleaved
– Micro – multiple cache levels
Solving Ops – An Architectural Perspective
• Very Large Physical Memory
– ExaOps → ExaBytes
– 64 bits of address is not enough; run out by 2020
• 1 to 1.5 bits of increase per 1 to 1.5 years
• Single-Level, Global Physical Memory
– Simplifies the Programming Model
– Extensible over time
– Updates in Real Time
– Runtime binding
– PGAS Languages
– MAP REDUCE / HADOOP
• Multiple machine state models can exist
• Heterogeneous Processing
• Fine-grain synchronization
– Threads
– Bits and Bytes
– Adjacency Matrices
Now What
• Uniprocessor performance has to be increased
– Heterogeneous is here to stay
– The easiest to program will be the correct technology
• Smarter Memory Systems (PIM)
– Synchronization
– Complex Data Structures
– DRAM, NVRAM, Disk
• New HPC/Big Data software must be developed
– SMARTER COMPILERS
– ARCHITECTURALLY TRANSPARENT
– Domain-Specific Languages
• New algorithms
• New ways of referencing DATA at the processor level
Ops and Flops – Processor Memory System
• Caches create in-order memory references for the ALUs from out-of-order main memory references (see the sketch below).
– A cache block is in order
• Direct memory references tend to be out of order
– Better for random references
– Hundreds (thousands) of outstanding loads and stores
• Which memory reference approach is preferred?
• How does HMC (Hybrid Memory Cube) change the approach?
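To make the contrast concrete, here is a minimal C sketch (mine, not from the talk): the same reduction run once over a sequential index stream and once over a randomly permuted one. The sequential pass streams through cache lines in order, Flops-style; the permuted pass defeats the cache the way graph and string workloads do.

  /* Minimal sketch: sequential vs. random references over the same data. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define N (1 << 23)                       /* 8M doubles, larger than any cache */

  static unsigned long rng = 1;
  static unsigned long next_rand(void) {    /* small LCG, illustration only */
      rng = rng * 6364136223846793005UL + 1442695040888963407UL;
      return rng >> 33;
  }

  static double sum_indexed(const double *a, const size_t *idx, size_t n) {
      double s = 0.0;
      for (size_t i = 0; i < n; i++)
          s += a[idx[i]];                   /* access pattern is set by idx[] */
      return s;
  }

  int main(void) {
      double *a   = malloc(N * sizeof *a);
      size_t *seq = malloc(N * sizeof *seq);
      size_t *rnd = malloc(N * sizeof *rnd);
      for (size_t i = 0; i < N; i++) { a[i] = 1.0; seq[i] = i; rnd[i] = i; }
      for (size_t i = N - 1; i > 0; i--) {  /* Fisher-Yates shuffle of rnd[] */
          size_t j = next_rand() % (i + 1);
          size_t t = rnd[i]; rnd[i] = rnd[j]; rnd[j] = t;
      }
      clock_t t0 = clock();
      double s1 = sum_indexed(a, seq, N);   /* in order: cache-friendly */
      clock_t t1 = clock();
      double s2 = sum_indexed(a, rnd, N);   /* out of order: cache-hostile */
      clock_t t2 = clock();
      printf("seq %.2fs  rnd %.2fs  sums %.0f %.0f\n",
             (double)(t1 - t0) / CLOCKS_PER_SEC,
             (double)(t2 - t1) / CLOCKS_PER_SEC, s1, s2);
      free(a); free(seq); free(rnd);
      return 0;
  }

On typical hardware the random pass runs several times slower than the sequential one, which is the Ops-vs-Flops memory story in a single loop.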
Ops and Flops in one system?
• For in-order memory references, a vector ISA is very efficient. For out-of-order references, a thread model is more efficient (and also hides latency better).
– Threads execute independently but synchronize.
– Supported by OpenMP
DO i = 1, n
   A(i) = B(i) + C(i)
ENDDO
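The same loop in the thread model is a one-directive change. A minimal OpenMP sketch in C (mine, not from the talk; nothing Convey-specific):

  /* The vector loop above, recast in the OpenMP thread model:
     each thread independently takes a chunk of iterations, and the
     implicit barrier at the end of the loop is the synchronization. */
  void vec_add(double *a, const double *b, const double *c, int n) {
      #pragma omp parallel for
      for (int i = 0; i < n; i++)
          a[i] = b[i] + c[i];
  }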
Source: NVIDIA, Exascale Workshop, December 2009, San Diego
In Beta Test… Fox Coprocessor
Four huge Virtex-6 Application Engines; 32 high-density SG-DIMMs; DDR3 Memory Controllers
• Shared, Virtual Memory System
• Scatter/Gather
– Up to 4 TB (1 TB with current DRAMs)
– 128 GB/s bandwidth
– TB/s of internal register bandwidth
• Atomic Memory Operators
– Native to the memory controllers
– 8-Byte Accessible
• Tag Bit Semantics
– A lock or “tag” bit for each 8-byte word
• Thousands of simultaneous threads
• Scalable coprocessor memory interconnect
Atomic Operations & FE Memory Bits
• Atomic operations avoid round trips to memory to acquire a lock, update the data, and release the lock
− Accessible from coprocessor instruction space
− Accessible from the host via the lock engine and compiler intrinsics
• Full/Empty Bits
− An extra bit stored with each word
− Can be used to signify when data is valid/ready
− Accessible from the host via the lock engine & compiler intrinsics

Atomic Operations: Add, Sub, Min, Max, Exch, Inc, Dec, CAS, And, Or, Xor
Tag Bit Semantic Operations: WriteEF, WriteFF, WriteXF, WriteXE, ReadFE, ReadEF, ReadFF, ReadXX, IncFF
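Convey reaches these through its compiler intrinsics; purely to illustrate the point, that one atomic read-modify-write replaces a lock-acquire/update/release round trip, here is a portable C11 sketch using the standard <stdatomic.h> (not the Convey API):

  /* Portable C11 sketch (not the Convey API). One fetch-and-add, or one
     CAS retry loop, does the work of lock-acquire / update / lock-release. */
  #include <stdatomic.h>

  atomic_long counter = 0;

  void count_edge(void) {                     /* the "Add" operator */
      atomic_fetch_add(&counter, 1);          /* single atomic RMW, no lock */
  }

  void record_max(atomic_long *m, long v) {   /* the "Max" operator via CAS */
      long cur = atomic_load(m);
      while (cur < v && !atomic_compare_exchange_weak(m, &cur, v))
          ;                                   /* a failed CAS reloads cur */
  }

The tag-bit operations do the same work in the memory system itself: by the usual full/empty naming, WriteEF waits until a word's tag is Empty, writes it, and marks it Full, while ReadFE waits until Full, reads, and marks it Empty, giving a per-word producer/consumer handshake with no lock variable at all.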
• Convey Coprocessor Interconnect (CCI) enables shared memory between coprocessors
– Direct connection of up to 8 coprocessors
– Architected for 32 coprocessors in a 32 TB address space
– Load/store and atomic memory operations between coprocessors
• Extended Memory Allocators
– Mimic the memory allocation and layout of SHMEM or UPC
[Figure: “Wolfpack” – a host plus multiple coprocessors (each an AEH and AE0–AE3) linked by the Convey Coprocessor Interconnect]
CHOMP: Convey Hybrid OpenMP
• Scalable, MIMD Personality Framework
– Simple, RISC-like instruction set
– Atomic memory operations
– Fine-grained synchronization using tag-bit semantics
– Software-driven, hardware-based low-latency thread/task scheduler
– Every instruction explicitly determines whether to continue or pend
• Programming model is OpenMP 3.1 (a minimal sketch follows)
– Support for OpenMP code directives
– Simple, portable parallelism
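For concreteness, a minimal, generic OpenMP example in C (plain OpenMP, nothing Convey-specific); these are the directives CHOMP maps onto its many simple cores:

  /* Generic OpenMP sketch of the programming model CHOMP targets. */
  #include <stdio.h>

  long count_below(const long *key, long n, long limit) {
      long hits = 0;
      #pragma omp parallel for reduction(+:hits)   /* threads run independently, */
      for (long i = 0; i < n; i++)                 /* then synchronize at the    */
          if (key[i] < limit)                      /* implicit barrier           */
              hits++;
      return hits;
  }

  int main(void) {
      long key[8] = {5, 1, 9, 3, 7, 2, 8, 4};
      printf("%ld below 5\n", count_below(key, 8, 5));  /* prints "4 below 5" */
      return 0;
  }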
CHOMP Software Stack
Development Frameworks for Multiple Architectural Models
• CHOMP for Arrays of Customized Cores
− MIMD parallelism with many simple cores
− OMP-based shared memory programming model
− Compile-and-run user model
• Convey Vectorizer for Custom SIMD
− Vector personalities optimized for specific data types
− Automatic vectorization of C/C++ and Fortran
− Support for customized instructions via intrinsics
• PDK for Algorithmic Personalities
− Specific routines in hardware
− Pipelining and replication for very high performance
− PDK supports implementations in Verilog or with high-level design tools
Example workloads:
− Stencils [FD/FV/FE], e.g.
  X(I,J,K) = S0*Y(I  ,J  ,K  ) + S1*Y(I-1,J  ,K  ) + S2*Y(I+1,J  ,K  )
           + S3*Y(I  ,J-1,K  ) + S4*Y(I  ,J+1,K  )
           + S5*Y(I  ,J  ,K-1) + S6*Y(I  ,J  ,K+1)
− Genomics/Proteomics (e.g., ACTGTGACATGCTGACATGCTAG)
− Graph Theory (edge vector, vertex vector), etc.
System Level - Trends
• Global Flat, Virtual Address Space
– Common to Flops and Ops
– Program using a PGAS Language
• Single Program Multiple Data (SPMD)
– 64 bits growing to 128 bits in 2020 and beyond
A page-table entry (PTE) for this model:
• Physical Page Address – 40 bits; with 4 KB to 1 MB page sizes, that spans 4 PetaBytes to 1 ExaByte of physical address
• Node Field – which node holds the page (PGAS model)
• Control – physical memory level: DRAM – Flash – Disk
• Data Semantics – PIM control, private vs. shared
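Rendered as a C bitfield, the entry might look like the sketch below; the 40-bit page address and the field roles come from the slide, while the remaining widths are my assumptions for illustration:

  /* Hypothetical layout of the slide's PTE. Only the 40-bit physical page
     address and the field roles are from the slide; other widths assumed. */
  #include <stdint.h>

  typedef struct {
      uint64_t ppn       : 40;  /* physical page address: 4 KB pages give 4 PB,
                                   1 MB pages give 1 EB of physical memory */
      uint64_t node      : 12;  /* node field: owning PGAS node */
      uint64_t level     : 2;   /* control: DRAM, flash, or disk tier */
      uint64_t shared    : 1;   /* data semantics: private vs. shared */
      uint64_t pim       : 4;   /* data semantics: PIM control bits */
      uint64_t page_size : 2;   /* selects 4 KB ... 1 MB page size */
      uint64_t valid     : 1;
  } pte_t;                      /* 62 of 64 bits used */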
Exascale Programming Model
#include <upc_relaxed.h>
#define N 200*THREADS                  /* THREADS is 16000 here */

shared [N] double A[N][N];             /* block size N: row i lives on thread i%THREADS */
shared double b[N], x[N];

void main()
{
    int i, j;
    /* after reading the elements of matrix A and the vector x,
       and initializing the vector b to zeros: */
    upc_forall(i = 0; i < N; i++; i)   /* affinity i: iteration i runs on thread i%THREADS */
        for (j = 0; j < N; j++)
            b[i] += A[i][j] * x[j];
}

Example 4.1-1: Matrix by Vector Multiply
Optical Technology
IBM and HP

March 15, 2012, HPC Wire, “HP Scientists Envision 10-Teraflop Manycore Chip”
http://domino.research.ibm.com
Why OPTICS?
• Bandwidth per channel exceeds electronics
• One connector per node
– NIC does the parallel-to-serial conversion(s) – outgoing
– NIC does serial-to-parallel – incoming
• Scalable bandwidth per node, using WDM
– Constant mechanics
– No distance considerations within a computer room
– No noise/crosstalk
– More colors, more bandwidth (see the arithmetic below)
• Same network/switch for Flops and Ops
– Flops has more channels than Ops (perhaps)
– Facilitates a REAL Cloud Computer
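To put numbers on "more colors, more bandwidth" (illustrative figures, mine, not the slide's): with WDM carrying, say, 40 wavelengths of 25 Gb/s each, one fiber and one connector deliver 40 × 25 Gb/s = 1 Tb/s, and capacity grows by lighting more colors while the mechanics, the connector and the fiber, stay constant.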
Cloud/Big Data System
[Figure: up to 16,000 single nodes (Ops/Flops) connected through an ALL-OPTICAL SWITCH to I/O and the LAN/WAN; one color = one channel; 10 meters = 50 ns of delay; compute, storage, and link bandwidth allocated on demand]
Finally
The system which is the simplest to program will win. USER cycles are more important than CPU cycles.
FLOPs, OPs, Flops/Ops per Watt