
Grupo de Arquitectura de Computadores, Comunicaciones y Sistemas

VIDAS: OBJECT-BASED VIRTUALIZED DATA SHARING FOR HIGH PERFORMANCE STORAGE I/O

ScienceCloud New York City, June 2013

Pablo Llopis, Javier Garcia Blas, Florin Isaila, Jesus Carretero

Computer Architecture and Technology Group (ARCOS), University Carlos III of Madrid, Spain


Agenda

1. Motivation

2. Goals and common problems

3. Common storage I/O virtualization solutions

4. VIDAS Design and Implementation

5. Evaluation

6. Conclusion


Motivation

‣ HPC in virtualized clouds gaining popularity

‣ Increasing amount of data stresses importance of I/O performance

‣ I/O overhead in virtualized environments is still significant

‣ Agreement that POSIX not suitable for high performance

‣ Virtualized environments trade off protection for performance

‣ Lack of I/O coordination mechanisms across domains


Goals and contributions

Propose new node-level abstractions and mechanisms in virtualized environments which:

‣ Enable building efficient virtualized data sharing

‣ Coordinate I/O across domains

‣ Provide shared access spaces across node-local domains

‣ Relax POSIX consistency

‣ Allow flexible data write and data read policies

‣ Expose data locality


Agenda

1. Motivation

2. Goals and common problems

3. Common storage I/O virtualization solutions

3.1. Block-level solution

3.2. Filesystem-level solution

4. VIDAS Design and Implementation

5. Evaluation

6. Conclusion


Virtualized block device drivers

[Figure: three block-level virtualization stacks (A1, A2, A3), each with an application in a guest domain, the hypervisor, and host-side device drivers, scheduler, and buffer cache; A3 splits the device driver into a front-end and a back-end]

A1 virtualized block-level device on top of a physical device driver

A2 virtualized block-level device on top of a filesystem

A3 paravirtualized device drivers


Virtualized filesystems

B1 virtualized filesystem within a file (stacked filesystems)

B2 paravirtualized filesystem

B3 sophisticated paravirtualized filesystem, inter-domain coordination

[Figure: guest/host stacks for B1, B2, and B3, built from applications, paravirtualized filesystem front-ends and back-ends, device driver front-ends and back-ends, schedulers, and buffer caches]


Agenda

1. Motivation

2. Goals and common problems

3. Common storage I/O virtualization solutions

4. VIDAS Design and Implementation

4.1. Abstractions

4.2. Design

4.3. Implementation

5. Evaluation


VIDAS

[Figure: guest domains Guest 0 ... Guest N-1 on one node, sharing data objects Object 0 ... Object M-1 that are backed by storage]

Abstractions: Objects and Containers

[Figure: guests write to and update regions (offset, size) of objects Object 0 ... Object M-1]


Containers


Containers are access control domains that allow a set of guests to share objects.


Objects: POSIX differences

Data writes and updates are guided by a configurable policy: write-through or write-back


Provide locality awareness

Strong consistency not enforced, but optional.

Each object is uniquely associated with an external storage resource through the resource's name


API Overview

Container operations
  int container_create(char *name, int domain_ids[])
  int container_destroy(char *name)
  int container_attach(char *name)
  int container_leave(char *name)

Object metadata operations
  obj_handle_t object_create(char *ext_storage_rsc, size_t offset, size_t size, char *cname)
  obj_handle_t object_join(char *ext_storage_rsc, size_t offset, size_t size, char *cname)
  int object_get_locality(char *ext_storage_rsc, obj_handle_t *objects[])
  int object_leave(obj_handle_t o)
  int object_destroy(obj_handle_t o)
  int object_getattr(obj_handle_t o, char *name, void *value, size_t size)
  int object_setattr(obj_handle_t o, char *name, void *value, size_t size)

Object data operations
  int obj_write(obj_handle_t o, char *buf, size_t offset, size_t sz)
  int obj_read(obj_handle_t o, char *buf, size_t offset, size_t sz)
  int obj_flush(obj_handle_t o)
  int obj_update(obj_handle_t o)

Object synchronization operations
  int object_wait(obj_handle_t o, char **bufp)
  int object_notify(obj_handle_t o, char *buf)
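A minimal usage sketch (not from the slides): one guest stages data into a shared object and notifies, while a second guest joins the container and reads the data from shared memory. It assumes a header exposing the declarations above (called "vidas.h" here, a hypothetical name); the domain list, external resource name, and buffer size are illustrative, the slides do not say how the domain list is terminated, and error handling is omitted.

/* Hypothetical header assumed to declare the VIDAS API listed above. */
#include <string.h>
#include "vidas.h"

#define SZ 4096

/* Producer guest: create a shared space, stage data, notify readers. */
void producer(void)
{
    int domains[] = {1, 2};            /* guests allowed to attach (illustrative) */
    char buf[SZ];
    memset(buf, 'a', SZ);

    container_create("sim", domains);  /* access control domain */
    obj_handle_t o = object_create("/data/input.dat", 0, SZ, "sim");

    obj_write(o, buf, 0, SZ);          /* write into the shared object */
    object_notify(o, buf);             /* wake guests blocked in object_wait() */
    obj_flush(o);                      /* push the object to external storage */
}

/* Consumer guest: attach, wait for the notification, read from shared memory. */
void consumer(void)
{
    char *p;
    char buf[SZ];

    container_attach("sim");
    obj_handle_t o = object_join("/data/input.dat", 0, SZ, "sim");

    object_wait(o, &p);                /* block until the producer notifies */
    obj_read(o, buf, 0, SZ);           /* data is served from the object, not storage */

    object_leave(o);
    container_leave("sim");
}

Under a write-back policy the explicit obj_flush is presumably what propagates the object to the external resource; under write-through this would happen on each obj_write.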


Agenda

1. Motivation

2. Goals and common problems

3. Common storage I/O virtualization solutions

4. VIDAS Design

5. VIDAS Implementation

5.1. Inter-domain communication mechanisms

5.2. Implementation

6. Using VIDAS

7. Evaluation


Inter-domain communication

Ring buffers and shared memory

[Figure: two inter-domain data communication techniques: a ring buffer between Domain A and Domain B, and a shared data region mapped by Domain 1, Domain 2, ... Domain N]


Inter-domain communication

• "A high-efficient inter-domain data transferring system for virtual machines," Proceedings of the 3rd International Conference on Ubiquitous Information Management and Communication, ACM, New York, NY, USA, 2009, pp. 385–390.

• "A study of inter-domain communication mechanisms on Xen-based hosting platforms," 2009.

• Diakhaté, F., Pérache, M., Namyst, R., & Jourdren, H., "Efficient shared memory message passing for inter-VM communications," Euro-Par 2008 Workshops, 2009.

• Huang, W., Koop, M. J., Gao, Q., & Panda, D. K., "Virtual machine aware communication libraries for high performance computing," SC '07, 2007, p. 9.

• "Inter-domain socket communications supporting high performance and full binary compatibility on Xen," Proceedings of the 4th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, ACM, New York, NY, USA, pp. 11–20.

• Huang, W., Koop, M. J., & Panda, D. K., "Efficient one-copy MPI shared memory communication in virtual machines," 2008 IEEE International Conference on Cluster Computing (CLUSTER), IEEE, 2008, pp. 107–115.


All of these optimize inter-VM data sharing and communication, but none of them works around the limitation that a page can be shared with only two domains at a time.

Xen/Linux grant reference extension

[Figure: a memory page shared between Domain 1 and Domain 2 only, versus a memory page shared among multiple domains at once]

Hypervisors usually provide a mechanism to share a page with only two domains at once.

In our work we extended Linux/Xen to provide user-space applications with the ability to share a page with any number of domains.

Paravirtualized data sharing

[Figure, point-to-point: applications in VM A and VM B go through their guest I/O stacks and paravirtualized device drivers, each connected by its own ring buffer to the host I/O stack, device drivers, and buffer cache]

[Figure, collective: applications in VM A and VM B directly map a shared data object that is backed by the host I/O stack and device drivers]

While most solutions rely mainly on ring buffers for paravirtualized I/O, VIDAS combines ring buffers (point-to-point) with multi-domain shared memory (collective).


Implementation

[Figure: two guest domains, each running an application and a VIDAS front-end with a ring buffer transport to the back-end in the host; the back-end maintains the container index and object index, while object data and object attributes sit in memory shared with the front-ends. Metadata and container calls (object_create, object_join, object_destroy, object_leave, object_get_locality, object_flush, object_update, container_create, container_destroy) travel over the ring buffer transport; data and synchronization calls (object_getattr, object_setattr, object_write, object_read, object_notify, object_wait) operate on the shared object data and attributes]


Implementation

[Figure: the same architecture, highlighting container_create and object_create issued by the creating guest's front-end over the ring buffer transport to the back-end]

A container creator specifies which domains are allowed to access the objects within the container


Implementation

Another domain can then access the objects in a container by attaching to it and joining its objects

[Figure: the same architecture, highlighting container_attach and object_join issued by a second guest's front-end over the ring buffer transport to the back-end]


[Figure: the same architecture, highlighting a consumer guest calling object_wait and then object_read on the shared object data]


[Figure: the same architecture, highlighting a producer guest calling object_getattr, object_setattr, object_write, and object_notify on the shared object data and attributes]


Agenda

1. Motivation

2. Goals and common problems

3. Common storage I/O virtualization solutions

4. VIDAS Design and Implementation

5. Evaluation

6. Conclusion


Evaluation

‣ CPU: Intel Xeon, 12 cores

‣ RAM: 64 GB DDR3 synchronous RAM @ 1333 MHz

‣ HDD: 1 TB Toshiba MK1002TS (Linux LVM, ext4)

‣ Hypervisor: Xen 4.2

‣ Privileged domain: Linux 3.5.7

‣ Guest domains: Linux 3.5.7

‣ Synthetic benchmarks: broadcast, independent read, collective read/write


Bandwidth: In-memory communication

0"

100"

200"

300"

400"

500"

600"

2" 4" 8" 16"

Band

width)(M

B/s))

Number)of)Virtual)Machines)

MPI-Bcast"

VIDAS-Bcast"

ringbuffer-Bcast"

MPI-Bcast: share 128MB of data using MPI_Bcast

VIDAS-Bcast: share 128MB of data using VIDAS

ringbuffer-Bcast: transfer 128MB of data to each VM using Xen ringbuffer transports
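For reference, a sketch of what the MPI-Bcast baseline boils down to; buffer contents, timing, and the derived MB/s figure are illustrative, not the exact benchmark code.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define DATA_SIZE (128 * 1024 * 1024)   /* 128 MB payload shared with every VM */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(DATA_SIZE);
    if (rank == 0)
        memset(buf, 'x', DATA_SIZE);    /* root holds the data to broadcast */

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    MPI_Bcast(buf, DATA_SIZE, MPI_BYTE, 0, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("bandwidth: %.1f MB/s\n", (DATA_SIZE / (1024.0 * 1024.0)) / (t1 - t0));

    free(buf);
    MPI_Finalize();
    return 0;
}

The VIDAS-Bcast variant would correspond to the sender staging the 128MB buffer into an object and notifying, with each receiver joining and reading it, as in the earlier API sketch.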


Independent file read (512MB)

0"

20"

40"

60"

80"

100"

120"

140"

160"

1" 2" 4" 8" 16"

Throughp

ut)(M

B/s))

Number)of)Virtual)Machines)MPI+IO"(NFS)" VIDAS"I/O+Bcast" MPI+IO"(PVFS2)"

MPI-IO (NFS) Concurrent file read on a node-local NFS mountMPI-IO (PVFS2) Concurrent file read on a node-local PVFS2 mount

VIDAS I/O+Bcast Load data into object, broadcast to all VMs


Collective I/O (Two-phase I/O)

0"

20"

40"

60"

80"

100"

120"

140"

160"

1" 2" 4" 8" 16" 1" 2" 4" 8" 16"

Throughp

ut)(M

B/s))

Number)of)Virtual)Machines)

Read))))))))))))))))))))))))))))))))))))))))))))))Write)

MPI+IO"collec1ve"(NFS)" VIDAS"collec1ve" MPI+IO"collec1ve"(PVFS2)"

Access pattern: non-overlapping, interleaved strided vectors of 2 MB blocks (sketched below)
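A sketch of this access pattern with plain MPI-IO; the file path, block count, and omission of error handling are illustrative. Each process reads its interleaved 2 MB blocks through a collective call, which the MPI-IO layer services with two-phase I/O.

#include <mpi.h>
#include <stdlib.h>

#define BLOCK   (2 * 1024 * 1024)   /* 2 MB blocks */
#define NBLOCKS 16                  /* blocks per process (illustrative) */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_File fh;
    MPI_Datatype filetype;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Strided file type: this process owns one 2 MB block out of every
     * group of nprocs consecutive blocks. */
    MPI_Type_vector(NBLOCKS, BLOCK, nprocs * BLOCK, MPI_BYTE, &filetype);
    MPI_Type_commit(&filetype);

    char *buf = malloc((size_t)NBLOCKS * BLOCK);

    MPI_File_open(MPI_COMM_WORLD, "/mnt/shared/testfile", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);
    /* Starting each view at rank * BLOCK interleaves the processes without overlap. */
    MPI_File_set_view(fh, (MPI_Offset)rank * BLOCK, MPI_BYTE, filetype,
                      "native", MPI_INFO_NULL);
    MPI_File_read_all(fh, buf, NBLOCKS * BLOCK, MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Type_free(&filetype);
    free(buf);
    MPI_Finalize();
    return 0;
}

The write benchmark is analogous, using MPI_File_write_all on a file opened for writing.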


Agenda

1. Motivation

2. Goals and common problems

3. Common storage I/O virtualization solutions

4. VIDAS Design and Implementation

5. Evaluation

6. Conclusion


Conclusion and Contributions

‣ I/O overhead in virtualized environments is still significant

‣ Agreement that POSIX not suitable for high performance

‣ Virtualized environments trade off protection for performance

‣ Lack of I/O coordination mechanisms across domains

VIDAS addresses these problems through the following contributions:

‣ Proposed new multi-domain data sharing mechanisms

‣ Created shared access spaces that coordinate storage I/O across domains

‣ Reduced memory copy operations and domain context switches

‣ Relaxed POSIX consistency

‣ Control over data write and update policies and over data locality


Questions?
