Intelligent Pre-fetching
DESCRIPTION
Intelligent Pre-fetching. Current transfer protocols are tuned to work with large pipelined data buffers where bandwidth is the key parameter. On the other hand, interactive data analysis requires access to scattered records in remote files and latency is the main factor.
TRANSCRIPT
Improvements in ROOT IO Cache when accessing remote data sets
Intelligent Pre-fetching
Data access can be between 10 and 100 times faster!!!
Read multiple buffers with one request
(Figure: a simulation with a simple example and a real-life measurement, showing the number of read requests with and without the cache: 4795 reads, 89 reads, 9 reads, 4 reads.)
The file is on a CERN machine (LAN 100 MB/s).
A is a local read (same machine).
B is on a LAN (100 Mb/s - PIV 3 GHz).
C is on a wireless network (10 Mb/s - Mac duo 2 GHz).
D is in Orsay (LAN 100 Mb/s, WAN 1 Gb/s - PIV 3 GHz).
E is in Amsterdam (LAN 100 Mb/s, WAN 10 Gb/s - AMD64).
F is on ADSL (8 Mb/s - Mac duo 2 GHz).
G is in Caltech (10 Gb/s).
The results are in real-time seconds.
Balance between:
• Size of the buffer
• Performance gain
• Number of cache misses
Ideal size:
• Around 10%
Gain (overall):
• Close to 13%
Parallel Unzipping
root [0] TFile f("h1big.root");
root [1] f.DrawMap();
2 colored branches
283813 entries
280 Mbytes
152 branches
Taking advantage of multi-core machines
Flow between TBasket and TTreeCache:

GetUnzipBuffer(pos,len)
  if buffer IS found → return buffer(pos,len)
  if buffer IS NOT found → ReadBuffer(zip, pos,len); UnzipBuffer(unzip, zip); return buffer(pos,len)

StartThreadUnzip() → ThreadUnzip:
  WaitForSignal() → UnzipCache() → SendSignal()
(Diagram: client-server request sequences, old method vs. new method.)
n = number of requests
CPT = process time (client)
RT = response time (server)
L = latency (round trip)

Old Method:
Total time = n (CPT + RT + L)
The total time depends on both the number of requests and the latency.

New Method:
Total time = n (CPT + RT) + L
The total time does not depend on the latency anymore !!!
Remote access example
• Distance CERN-SLAC: 9393 km
• Maximum speed: 3.0x10^5 km/s (the speed of light)
• Lowest possible latency (RTT): 62.6 ms
• Measured latency: 166 ms
Latency is proportional to distance and cannot be reduced!!!
Scattered reads
Trees represent data in a very efficient way
• Data is grouped in branches.
• We read only subsets of the branch buffers.
We need buffers that are not contiguous but we know their position and size.
High latency and numerous reads create a problem
Since we know which buffers to read, an additional thread can unzip them in advance.
It stabilizes at 12.5% (maximum gain).
To alleviate the latency problem, an efficient pre-fetching/cache algorithm has been implemented in recent versions of ROOT.
Read time (real-time seconds) by cache size:

Site  Latency (ms)   0 KB      64 KB   10 MB
A     0.0            3.4       3.4     3.4
B     0.3            8.0       6.0     4.0
C     2.0            11.6      5.6     4.9
D     11.0           124.7     12.3    9.0
E     22.0           230.9     11.7    8.4
F     72.0           743.7     48.3    28.0
G     240.0          >1800.0   125.4   9.9**

** With TCP/IP Jumbo frames (9 KB) and a TCP/IP window size of 10 Mbytes. We hope to bring this number down to about 4 seconds by reducing the number of transactions when opening the file.