distributed applications monitoring at system and network level a.brunengo (infn- ge), a.ghiselli...

27
Distributed applications monitoring at system and network level A.Brunengo (INFN- Ge), A.Ghiselli (INFN-Cnaf), L.Luminari (INFN-Roma1), L.Perini (INFN-Mi), S.Resconi (INFN-Mi), M.Sgaravatto (INFN-Pd), C.Vistoli (INFN-Cnaf) for the Monarc Collaboration

Upload: edith-patrick

Post on 18-Jan-2016

228 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Distributed applications monitoring at system and network level A.Brunengo (INFN- Ge), A.Ghiselli (INFN-Cnaf), L.Luminari (INFN-Roma1), L.Perini (INFN-Mi),

Distributed applications monitoring at

system and network level

A.Brunengo (INFN- Ge), A.Ghiselli (INFN-Cnaf), L.Luminari (INFN-Roma1), L.Perini (INFN-Mi), S.Resconi (INFN-Mi), M.Sgaravatto (INFN-Pd), C.Vistoli (INFN-Cnaf)

for the

Monarc Collaboration

Page 2: Distributed applications monitoring at system and network level A.Brunengo (INFN- Ge), A.Ghiselli (INFN-Cnaf), L.Luminari (INFN-Roma1), L.Perini (INFN-Mi),

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 2

Objectives• analysis of measurements regarding resource

utilization in a distributed environment: – CPU usage, – network throughput – wall clock time of a single job

• to locate– system and network bottlenecks,– software and hardware inefficiencies in different scenarios.

• to understand– client behaviour and system resource requirements– network impact to the application behaviour and application efficency using

network

Page 3: Distributed applications monitoring at system and network level A.Brunengo (INFN- Ge), A.Ghiselli (INFN-Cnaf), L.Luminari (INFN-Roma1), L.Perini (INFN-Mi),

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 3

Test description• ATLFast++ stress tests: increasing number of concurrent

jobs with read access to the Data Base• A single job reads ~3000 events 40KB each• Single objectivity federation• System configuration

– Single workstation (without AMS server)– One AMS server - one client machine – One AMS server - many client machines

• Network configuration– LAN (Gigabit Ethernet,Fast Ethernet, Ethernet)– WAN with different bandwidth capacity (2Mbps to 8Mbps)– QOS/Differentiated services WAN link 2Mbps

Page 4: Distributed applications monitoring at system and network level A.Brunengo (INFN- Ge), A.Ghiselli (INFN-Cnaf), L.Luminari (INFN-Roma1), L.Perini (INFN-Mi),

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 4

• System parameters collected:– Client side: CPU use (by user and system), job wall clock time– Server side: CPU use (by user and system), network throughput

• CPU use in client machine is important to evaluate machine load versus number of concurrent jobs with different link speed.

• CPU use on Server is important to evaluate the maximum number of client-jobs that can be served and if this is related with client characteristics and network link capacity.

• Wall clock time execution is important to evaluate system capacity to deliver workload in connection with the number of jobs and network speed.

Application monitoring

Page 5: Distributed applications monitoring at system and network level A.Brunengo (INFN- Ge), A.Ghiselli (INFN-Cnaf), L.Luminari (INFN-Roma1), L.Perini (INFN-Mi),

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 5

• CPU usage (client and server) is collected via periodical ‘vmstat’ commands

• Application itself records elapsed time and Cpu time• Aggregate server throughput is collected tracing the AMS

server process systems calls:– every 2 minutes a script compute and sum n.bytes read/send

and n.bytes write /received.

Application monitoring

Page 6: Distributed applications monitoring at system and network level A.Brunengo (INFN- Ge), A.Ghiselli (INFN-Cnaf), L.Luminari (INFN-Roma1), L.Perini (INFN-Mi),

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 6

Test results

• One Client – One Server – Gigabit Ethernet

Server

1000BaseSX

sunlab1 gsun

Client

sunlab1, gsun: Sun Ultra5, 333 MHz, 128 MB RAM, Solaris 2.7 (14 SpecInt 95)

Page 7: Distributed applications monitoring at system and network level A.Brunengo (INFN- Ge), A.Ghiselli (INFN-Cnaf), L.Luminari (INFN-Roma1), L.Perini (INFN-Mi),

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 7

Server and client connected with GE CPU use on client

0

20

40

60

80

100

120

09-2

2-18

:29:

18

09-2

2-18

:43:

25

09-2

2-18

:57:

33

09-2

2-19

:11:

42

09-2

2-19

:25:

57

09-2

2-19

:40:

13

09-2

2-19

:54:

31

09-2

2-20

:08:

55

09-2

2-20

:23:

08

09-2

2-20

:37:

35

09-2

2-20

:51:

55

09-2

2-21

:06:

04

09-2

2-21

:20:

16

09-2

2-21

:34:

27

09-2

2-21

:48:

37

09-2

2-22

:02:

47

09-2

2-22

:17:

00

09-2

2-22

:31:

13

09-2

2-22

:45:

27

09-2

2-22

:59:

35

09-2

2-23

:16:

10

09-2

2-23

:33:

33

09-2

2-23

:47:

58

09-2

3-00

:02:

06

09-2

3-00

:23:

13

09-2

3-00

:38:

44

09-2

3-00

:53:

05

09-2

3-01

:07:

25

09-2

3-01

:21:

35

09-2

3-01

:42:

10

09-2

3-02

:01:

45

09-2

3-02

:18:

16

09-2

3-02

:32:

39

09-2

3-02

:46:

47

09-2

3-03

:05:

51

09-2

3-03

:41:

12

09-2

3-04

:15:

38

09-2

3-04

:47:

31

09-2

3-05

:13:

50

%C

PU

n. job %cpu user %cpu system %cpu totale

Page 8: Distributed applications monitoring at system and network level A.Brunengo (INFN- Ge), A.Ghiselli (INFN-Cnaf), L.Luminari (INFN-Roma1), L.Perini (INFN-Mi),

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 8

Server and Client connected via GE CPU use on Server

0

20

40

60

80

100

120

09-2

2-18

:29:

17

09-2

2-18

:45:

47

09-2

2-19

:02:

54

09-2

2-19

:20:

16

09-2

2-19

:38:

53

09-2

2-19

:57:

35

09-2

2-20

:15:

24

09-2

2-20

:36:

18

09-2

2-20

:55:

27

09-2

2-21

:16:

10

09-2

2-21

:37:

42

09-2

2-21

:55:

47

09-2

2-22

:17:

26

09-2

2-22

:38:

41

09-2

2-22

:56:

32

09-2

2-23

:18:

44

09-2

2-23

:40:

01

09-2

2-23

:58:

17

09-2

3-00

:20:

01

09-2

3-00

:40:

17

09-2

3-01

:00:

54

09-2

3-01

:19:

06

09-2

3-01

:40:

30

09-2

3-02

:00:

38

09-2

3-02

:21:

14

09-2

3-02

:40:

00

09-2

3-03

:00:

39

09-2

3-03

:19:

02

09-2

3-03

:36:

10

09-2

3-03

:53:

17

09-2

3-04

:10:

28

09-2

3-04

:27:

43

09-2

3-04

:44:

58

09-2

3-05

:02:

27

09-2

3-05

:22:

44

%cp

u

n.job %Cpu User %Cpu system %Cpu totale

Page 9: Distributed applications monitoring at system and network level A.Brunengo (INFN- Ge), A.Ghiselli (INFN-Cnaf), L.Luminari (INFN-Roma1), L.Perini (INFN-Mi),

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 9

Server Client connected via GE Throughpit in kbit/sec

0

5000

10000

15000

20000

25000

30000

35000

40000

1999

-09-

22-1

8:31

:21

1999

-09-

22-1

8:47

:53

1999

-09-

22-1

9:05

:24

1999

-09-

22-1

9:22

:28

1999

-09-

22-1

9:40

:58

1999

-09-

22-2

0:00

:07

1999

-09-

22-2

0:17

:33

1999

-09-

22-2

0:39

:07

1999

-09-

22-2

0:57

:36

1999

-09-

22-2

1:18

:57

1999

-09-

22-2

1:40

:16

1999

-09-

22-2

1:58

:56

1999

-09-

22-2

2:20

:11

1999

-09-

22-2

2:41

:23

1999

-09-

22-2

2:59

:27

1999

-09-

22-2

3:21

:30

1999

-09-

22-2

3:42

:43

1999

-09-

23-0

0:00

:39

1999

-09-

23-0

0:22

:42

1999

-09-

23-0

0:42

:59

1999

-09-

23-0

1:03

:38

1999

-09-

23-0

1:21

:28

1999

-09-

23-0

1:43

:09

1999

-09-

23-0

2:03

:23

1999

-09-

23-0

2:23

:56

1999

-09-

23-0

2:42

:05

1999

-09-

23-0

3:03

:18

1999

-09-

23-0

3:21

:13

1999

-09-

23-0

3:38

:21

1999

-09-

23-0

3:55

:28

1999

-09-

23-0

4:12

:41

1999

-09-

23-0

4:29

:55

1999

-09-

23-0

4:47

:09

1999

-09-

23-0

5:04

:46

1999

-09-

23-0

5:25

:16

kbp

s

Page 10: Distributed applications monitoring at system and network level A.Brunengo (INFN- Ge), A.Ghiselli (INFN-Cnaf), L.Luminari (INFN-Roma1), L.Perini (INFN-Mi),

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 10

Server - Client connected via GEMean Wall Clock Time

0

1000

2000

3000

4000

5000

6000

7000

8000

9000

1 5 10 20 30 40 50 60 70 80 90 100

n.job

se

c.

gsun

Page 11: Distributed applications monitoring at system and network level A.Brunengo (INFN- Ge), A.Ghiselli (INFN-Cnaf), L.Luminari (INFN-Roma1), L.Perini (INFN-Mi),

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 11

Comments:

– Client CPU: 100 % used with 5 jobs

– Server CPU: 100 % used with 50 jobs

– After 40 concurrent jobs, Client CPU decreases as well as network throughput

Page 12: Distributed applications monitoring at system and network level A.Brunengo (INFN- Ge), A.Ghiselli (INFN-Cnaf), L.Luminari (INFN-Roma1), L.Perini (INFN-Mi),

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 12

100BaseT

bbfarm01 bbfarm02

Client Server

CPU Memory DISK OS LINK

bbfarm01 Sun E450 Ultra 2 400MHz, 4 CPU, each 17 SpecInt

512MB RAID A3500

Solaris 2.7 Sun PCI FE

bbfarm02 Sun E450 Ultra 2 400MHz, 4 CPU, each 17 SpecInt

512MB RAID A3500

Solaris 2.7 Sun PCI FE

One Client – One Server – Fast EthernetRome Babar Farm

Page 13: Distributed applications monitoring at system and network level A.Brunengo (INFN- Ge), A.Ghiselli (INFN-Cnaf), L.Luminari (INFN-Roma1), L.Perini (INFN-Mi),

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 13

Server and Client connected via FECPU use on Server

0

10

20

30

40

50

60

70

09-2

2-10

:21:

02

09-2

2-10

:31:

42

09-2

2-10

:43:

15

09-2

2-10

:54:

03

09-2

2-11

:06:

20

09-2

2-11

:19:

37

09-2

2-11

:31:

34

09-2

2-11

:45:

45

09-2

2-11

:58:

46

09-2

2-12

:09:

33

09-2

2-12

:20:

51

09-2

2-12

:32:

14

09-2

2-12

:44:

09

09-2

2-12

:56:

24

09-2

2-13

:08:

42

09-2

2-13

:21:

06

09-2

2-13

:32:

41

09-2

2-13

:44:

15

09-2

2-13

:56:

15

09-2

2-14

:08:

23

09-2

2-14

:20:

27

09-2

2-14

:32:

35

09-2

2-14

:44:

34

09-2

2-14

:56:

32

09-2

2-15

:08:

28

09-2

2-15

:20:

26

09-2

2-15

:31:

24

09-2

2-15

:43:

35

09-2

2-15

:55:

30

09-2

2-16

:07:

26

09-2

2-16

:19:

23

09-2

2-16

:30:

38

09-2

2-16

:41:

52

09-2

2-16

:53:

09

09-2

2-17

:04:

34

%C

PU

Nr.Jobs %CPU (user) %CPU (system) %CPU (total)

Page 14: Distributed applications monitoring at system and network level A.Brunengo (INFN- Ge), A.Ghiselli (INFN-Cnaf), L.Luminari (INFN-Roma1), L.Perini (INFN-Mi),

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 14

Server and Client connected via FECPU use on Client

0

10

20

30

40

50

60

70

80

1999

-09-

22-1

0:18

:57

1999

-09-

22-1

0:31

:20

1999

-09-

22-1

0:43

:44

1999

-09-

22-1

0:56

:07

1999

-09-

22-1

1:08

:31

1999

-09-

22-1

1:20

:57

1999

-09-

22-1

1:33

:21

1999

-09-

22-1

1:45

:53

1999

-09-

22-1

1:58

:23

1999

-09-

22-1

2:10

:46

1999

-09-

22-1

2:23

:10

1999

-09-

22-1

2:35

:34

1999

-09-

22-1

2:47

:58

1999

-09-

22-1

3:00

:22

1999

-09-

22-1

3:12

:46

1999

-09-

22-1

3:25

:10

1999

-09-

22-1

3:37

:33

1999

-09-

22-1

3:49

:57

1999

-09-

22-1

4:02

:21

1999

-09-

22-1

4:14

:45

1999

-09-

22-1

4:27

:09

1999

-09-

22-1

4:39

:33

1999

-09-

22-1

4:51

:57

1999

-09-

22-1

5:04

:21

1999

-09-

22-1

5:16

:45

1999

-09-

22-1

5:29

:09

1999

-09-

22-1

5:41

:32

1999

-09-

22-1

5:53

:56

1999

-09-

22-1

6:06

:20

1999

-09-

22-1

6:18

:44

1999

-09-

22-1

6:31

:08

1999

-09-

22-1

6:43

:32

1999

-09-

22-1

6:55

:56

1999

-09-

22-1

7:08

:20

%C

PU

(4

CP

U)

Nr.Jobs %CPU (user) %CPU (system) %CPU (total)

Page 15: Distributed applications monitoring at system and network level A.Brunengo (INFN- Ge), A.Ghiselli (INFN-Cnaf), L.Luminari (INFN-Roma1), L.Perini (INFN-Mi),

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 15

Server and Client connected via FEThroughput in Kbit/sec

0

10000

20000

30000

40000

50000

60000

70000

80000

Kbi

t/sec

Throughput in kbit/sec

Page 16: Distributed applications monitoring at system and network level A.Brunengo (INFN- Ge), A.Ghiselli (INFN-Cnaf), L.Luminari (INFN-Roma1), L.Perini (INFN-Mi),

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 16

Server and client connected via FEMean Wall clock time

0

1000

2000

3000

4000

5000

6000

7000

0 10 20 30 40 50 60

n.job

se

c

bbfarm02

Page 17: Distributed applications monitoring at system and network level A.Brunengo (INFN- Ge), A.Ghiselli (INFN-Cnaf), L.Luminari (INFN-Roma1), L.Perini (INFN-Mi),

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 17

Comments:

– Client CPU: 60 % used up to 30 jobs then 20% – Server CPU: 100 % used with 5 jobs (multi-

processor server used as mono-processor)– After 30 concurrent jobs execution wall clock

execution increase rapidly – After 30 concurrent jobs, Client CPU decreases as

well as network throughput

Page 18: Distributed applications monitoring at system and network level A.Brunengo (INFN- Ge), A.Ghiselli (INFN-Cnaf), L.Luminari (INFN-Roma1), L.Perini (INFN-Mi),

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 18

one server (Bologna) – one client(CERN)QOS/Differentiaited Service

2Mbps WAN ATM link

2 Mbps

sunlab1 monarc01

Server Client

sunlab1: Sun Ultra5, 333 MHz, 128 MB, Solaris 2.7monarc01: Sun Enterprise 450 - 4 X 400 MHz, 512 MB, Solaris 2.6

Page 19: Distributed applications monitoring at system and network level A.Brunengo (INFN- Ge), A.Ghiselli (INFN-Cnaf), L.Luminari (INFN-Roma1), L.Perini (INFN-Mi),

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 19

• 4717 cells/sec --> 4717 * 48 byte/ sec = 1811 Kbit/sec.

• Available bandwidth completely used.• CPU server and client unloaded.• Very high job elapsed time. • Differentiated services mechanism working

properly (with precise network parameters tuning).

Page 20: Distributed applications monitoring at system and network level A.Brunengo (INFN- Ge), A.Ghiselli (INFN-Cnaf), L.Luminari (INFN-Roma1), L.Perini (INFN-Mi),

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 20

Clientssunlab1, gsun, cmssun4, atlsun1, atlas4: Sun Ultra5/10, 333 MHz, 128 MBvlsi06: Sun SPARC20, 125 MHz, 128 MBmonarc01: Sun Enterprise 450, 4X 400 MHz, 512 MB

sunlab1

Server

1000BaseT

2 Mbps

2 Mbps

2 Mbps8 Mbps

sunlab1

cmssun4 vlsi06 atlsun1 atlas4 monarc01

Access Link technologyto GARR-B (ATMbackbone)

Access Speed to GARR(Mbps)

RTT to Cnaf

Cnaf ATM 6M

Padova Point-to-point 2M 7msec

Genova Frame-relay 2M 21msec

Milano ATM 8M 8msec

Cnaf - Cern ATM 2M ( 4717 cells/sec) 51msec

WAN Test

Page 21: Distributed applications monitoring at system and network level A.Brunengo (INFN- Ge), A.Ghiselli (INFN-Cnaf), L.Luminari (INFN-Roma1), L.Perini (INFN-Mi),

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 21

Comments

• client CPU: never 100 % used

• server CPU: never 100 % used

• wall clock time for job running in the workstation connected via GE very high (i.e. for 10 jobs >1000’; 400’ without other clients in WAN): slow clients degrade performances on fast clients ???

Page 22: Distributed applications monitoring at system and network level A.Brunengo (INFN- Ge), A.Ghiselli (INFN-Cnaf), L.Luminari (INFN-Roma1), L.Perini (INFN-Mi),

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 22

ClientsClients and server: Sun Enterprise 450, 4X 400 MHz, 512 MB

cutter

Server

100BaseT

100 BaseT 100BaseT

bbfarm01

bbfarm02 bbfarm03 bbfarm04

One server –many clients LAN FERome Babar Farm

Page 23: Distributed applications monitoring at system and network level A.Brunengo (INFN- Ge), A.Ghiselli (INFN-Cnaf), L.Luminari (INFN-Roma1), L.Perini (INFN-Mi),

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 23

Comments:

• Throughput decrease when there are more than 30 job in the server.

• Cpu server is always 100% used.

• Starting with 40 concurrent jobs in one client jobs start crashing .

Page 24: Distributed applications monitoring at system and network level A.Brunengo (INFN- Ge), A.Ghiselli (INFN-Cnaf), L.Luminari (INFN-Roma1), L.Perini (INFN-Mi),

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 24

CLIENT SERVER Network speed

Max CPU Number of jobs Max CPU Number of jobs

1000M (GE)

100% >5 100% >50

100M (FE)

60% , then 20%

Up to 30, then up to 60

100% Up to 60

10M(Eth) 80% >20 30% >60 2M (PPP ATM WAN

5% Up to 20 10% (constant)

1-20 (during the all test)

Test results

• GE test: client CPU is saturated with 5 jobs• FE test: server CPU is saturated with 30 jobs• Ethernet test: network is saturated

Page 25: Distributed applications monitoring at system and network level A.Brunengo (INFN- Ge), A.Ghiselli (INFN-Cnaf), L.Luminari (INFN-Roma1), L.Perini (INFN-Mi),

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 25

Test results

• GE utilization is poor

Network Link Server host

Network speed

Max throughput

Number of jobs

1000M GEthernet

37Mbps 20

100M FEthernet

80Mbps 30

10M Ethernet 9Mbps 20

2M VC ATM 1.7Mbps 20

Page 26: Distributed applications monitoring at system and network level A.Brunengo (INFN- Ge), A.Ghiselli (INFN-Cnaf), L.Luminari (INFN-Roma1), L.Perini (INFN-Mi),

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 26

Conclusion• Objectivity 5.1 behaviour on different network layouts:

– Application/Objectivity in powerful client machine use high CPU.– AMS is not able to use multi CPU machine

• Optimal measured values for the server corresponds to 30 connection from concurrent remote jobs.– Too small for a production environment.

• Identified boundary condition for efficent running with the specific CPU.• Acceptable running condition:

– Link Server/Client minimum speed 8Mbps– Client machine from 6 to 15 concurrent analysis jobs– Server 30 conncurrent jobs request

• Global performance degrades rapidly moving away from optimal condition

Page 27: Distributed applications monitoring at system and network level A.Brunengo (INFN- Ge), A.Ghiselli (INFN-Cnaf), L.Luminari (INFN-Roma1), L.Perini (INFN-Mi),

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 27

Future works• Application monitoring tools able to real time check the

working conditions, to take necessary action to mantain the system around the optimal conditions.

• Test Objectivity 5.2 features• Test a Multi Server Configuration using read/write application. • Dedicated WAN Test-bed with 10Mbps bandwitdh links.• LAN - WAN behaviour with equivalent high bandwidth

capacity to be investigated deeply:– Host tuning and RTT could impact performances– QoS

• documentation: www.cern.ch/MONARC