distributed applications monitoring at system and network level a.brunengo (infn- ge), a.ghiselli...

Post on 18-Jan-2016

228 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Distributed applications monitoring at

system and network level

A.Brunengo (INFN- Ge), A.Ghiselli (INFN-Cnaf), L.Luminari (INFN-Roma1), L.Perini (INFN-Mi), S.Resconi (INFN-Mi), M.Sgaravatto (INFN-Pd), C.Vistoli (INFN-Cnaf)

for the

Monarc Collaboration

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 2

Objectives• analysis of measurements regarding resource

utilization in a distributed environment: – CPU usage, – network throughput – wall clock time of a single job

• to locate– system and network bottlenecks,– software and hardware inefficiencies in different scenarios.

• to understand– client behaviour and system resource requirements– network impact to the application behaviour and application efficency using

network

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 3

Test description• ATLFast++ stress tests: increasing number of concurrent

jobs with read access to the Data Base• A single job reads ~3000 events 40KB each• Single objectivity federation• System configuration

– Single workstation (without AMS server)– One AMS server - one client machine – One AMS server - many client machines

• Network configuration– LAN (Gigabit Ethernet,Fast Ethernet, Ethernet)– WAN with different bandwidth capacity (2Mbps to 8Mbps)– QOS/Differentiated services WAN link 2Mbps

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 4

• System parameters collected:– Client side: CPU use (by user and system), job wall clock time– Server side: CPU use (by user and system), network throughput

• CPU use in client machine is important to evaluate machine load versus number of concurrent jobs with different link speed.

• CPU use on Server is important to evaluate the maximum number of client-jobs that can be served and if this is related with client characteristics and network link capacity.

• Wall clock time execution is important to evaluate system capacity to deliver workload in connection with the number of jobs and network speed.

Application monitoring

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 5

• CPU usage (client and server) is collected via periodical ‘vmstat’ commands

• Application itself records elapsed time and Cpu time• Aggregate server throughput is collected tracing the AMS

server process systems calls:– every 2 minutes a script compute and sum n.bytes read/send

and n.bytes write /received.

Application monitoring

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 6

Test results

• One Client – One Server – Gigabit Ethernet

Server

1000BaseSX

sunlab1 gsun

Client

sunlab1, gsun: Sun Ultra5, 333 MHz, 128 MB RAM, Solaris 2.7 (14 SpecInt 95)

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 7

Server and client connected with GE CPU use on client

0

20

40

60

80

100

120

09-2

2-18

:29:

18

09-2

2-18

:43:

25

09-2

2-18

:57:

33

09-2

2-19

:11:

42

09-2

2-19

:25:

57

09-2

2-19

:40:

13

09-2

2-19

:54:

31

09-2

2-20

:08:

55

09-2

2-20

:23:

08

09-2

2-20

:37:

35

09-2

2-20

:51:

55

09-2

2-21

:06:

04

09-2

2-21

:20:

16

09-2

2-21

:34:

27

09-2

2-21

:48:

37

09-2

2-22

:02:

47

09-2

2-22

:17:

00

09-2

2-22

:31:

13

09-2

2-22

:45:

27

09-2

2-22

:59:

35

09-2

2-23

:16:

10

09-2

2-23

:33:

33

09-2

2-23

:47:

58

09-2

3-00

:02:

06

09-2

3-00

:23:

13

09-2

3-00

:38:

44

09-2

3-00

:53:

05

09-2

3-01

:07:

25

09-2

3-01

:21:

35

09-2

3-01

:42:

10

09-2

3-02

:01:

45

09-2

3-02

:18:

16

09-2

3-02

:32:

39

09-2

3-02

:46:

47

09-2

3-03

:05:

51

09-2

3-03

:41:

12

09-2

3-04

:15:

38

09-2

3-04

:47:

31

09-2

3-05

:13:

50

%C

PU

n. job %cpu user %cpu system %cpu totale

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 8

Server and Client connected via GE CPU use on Server

0

20

40

60

80

100

120

09-2

2-18

:29:

17

09-2

2-18

:45:

47

09-2

2-19

:02:

54

09-2

2-19

:20:

16

09-2

2-19

:38:

53

09-2

2-19

:57:

35

09-2

2-20

:15:

24

09-2

2-20

:36:

18

09-2

2-20

:55:

27

09-2

2-21

:16:

10

09-2

2-21

:37:

42

09-2

2-21

:55:

47

09-2

2-22

:17:

26

09-2

2-22

:38:

41

09-2

2-22

:56:

32

09-2

2-23

:18:

44

09-2

2-23

:40:

01

09-2

2-23

:58:

17

09-2

3-00

:20:

01

09-2

3-00

:40:

17

09-2

3-01

:00:

54

09-2

3-01

:19:

06

09-2

3-01

:40:

30

09-2

3-02

:00:

38

09-2

3-02

:21:

14

09-2

3-02

:40:

00

09-2

3-03

:00:

39

09-2

3-03

:19:

02

09-2

3-03

:36:

10

09-2

3-03

:53:

17

09-2

3-04

:10:

28

09-2

3-04

:27:

43

09-2

3-04

:44:

58

09-2

3-05

:02:

27

09-2

3-05

:22:

44

%cp

u

n.job %Cpu User %Cpu system %Cpu totale

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 9

Server Client connected via GE Throughpit in kbit/sec

0

5000

10000

15000

20000

25000

30000

35000

40000

1999

-09-

22-1

8:31

:21

1999

-09-

22-1

8:47

:53

1999

-09-

22-1

9:05

:24

1999

-09-

22-1

9:22

:28

1999

-09-

22-1

9:40

:58

1999

-09-

22-2

0:00

:07

1999

-09-

22-2

0:17

:33

1999

-09-

22-2

0:39

:07

1999

-09-

22-2

0:57

:36

1999

-09-

22-2

1:18

:57

1999

-09-

22-2

1:40

:16

1999

-09-

22-2

1:58

:56

1999

-09-

22-2

2:20

:11

1999

-09-

22-2

2:41

:23

1999

-09-

22-2

2:59

:27

1999

-09-

22-2

3:21

:30

1999

-09-

22-2

3:42

:43

1999

-09-

23-0

0:00

:39

1999

-09-

23-0

0:22

:42

1999

-09-

23-0

0:42

:59

1999

-09-

23-0

1:03

:38

1999

-09-

23-0

1:21

:28

1999

-09-

23-0

1:43

:09

1999

-09-

23-0

2:03

:23

1999

-09-

23-0

2:23

:56

1999

-09-

23-0

2:42

:05

1999

-09-

23-0

3:03

:18

1999

-09-

23-0

3:21

:13

1999

-09-

23-0

3:38

:21

1999

-09-

23-0

3:55

:28

1999

-09-

23-0

4:12

:41

1999

-09-

23-0

4:29

:55

1999

-09-

23-0

4:47

:09

1999

-09-

23-0

5:04

:46

1999

-09-

23-0

5:25

:16

kbp

s

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 10

Server - Client connected via GEMean Wall Clock Time

0

1000

2000

3000

4000

5000

6000

7000

8000

9000

1 5 10 20 30 40 50 60 70 80 90 100

n.job

se

c.

gsun

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 11

Comments:

– Client CPU: 100 % used with 5 jobs

– Server CPU: 100 % used with 50 jobs

– After 40 concurrent jobs, Client CPU decreases as well as network throughput

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 12

100BaseT

bbfarm01 bbfarm02

Client Server

CPU Memory DISK OS LINK

bbfarm01 Sun E450 Ultra 2 400MHz, 4 CPU, each 17 SpecInt

512MB RAID A3500

Solaris 2.7 Sun PCI FE

bbfarm02 Sun E450 Ultra 2 400MHz, 4 CPU, each 17 SpecInt

512MB RAID A3500

Solaris 2.7 Sun PCI FE

One Client – One Server – Fast EthernetRome Babar Farm

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 13

Server and Client connected via FECPU use on Server

0

10

20

30

40

50

60

70

09-2

2-10

:21:

02

09-2

2-10

:31:

42

09-2

2-10

:43:

15

09-2

2-10

:54:

03

09-2

2-11

:06:

20

09-2

2-11

:19:

37

09-2

2-11

:31:

34

09-2

2-11

:45:

45

09-2

2-11

:58:

46

09-2

2-12

:09:

33

09-2

2-12

:20:

51

09-2

2-12

:32:

14

09-2

2-12

:44:

09

09-2

2-12

:56:

24

09-2

2-13

:08:

42

09-2

2-13

:21:

06

09-2

2-13

:32:

41

09-2

2-13

:44:

15

09-2

2-13

:56:

15

09-2

2-14

:08:

23

09-2

2-14

:20:

27

09-2

2-14

:32:

35

09-2

2-14

:44:

34

09-2

2-14

:56:

32

09-2

2-15

:08:

28

09-2

2-15

:20:

26

09-2

2-15

:31:

24

09-2

2-15

:43:

35

09-2

2-15

:55:

30

09-2

2-16

:07:

26

09-2

2-16

:19:

23

09-2

2-16

:30:

38

09-2

2-16

:41:

52

09-2

2-16

:53:

09

09-2

2-17

:04:

34

%C

PU

Nr.Jobs %CPU (user) %CPU (system) %CPU (total)

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 14

Server and Client connected via FECPU use on Client

0

10

20

30

40

50

60

70

80

1999

-09-

22-1

0:18

:57

1999

-09-

22-1

0:31

:20

1999

-09-

22-1

0:43

:44

1999

-09-

22-1

0:56

:07

1999

-09-

22-1

1:08

:31

1999

-09-

22-1

1:20

:57

1999

-09-

22-1

1:33

:21

1999

-09-

22-1

1:45

:53

1999

-09-

22-1

1:58

:23

1999

-09-

22-1

2:10

:46

1999

-09-

22-1

2:23

:10

1999

-09-

22-1

2:35

:34

1999

-09-

22-1

2:47

:58

1999

-09-

22-1

3:00

:22

1999

-09-

22-1

3:12

:46

1999

-09-

22-1

3:25

:10

1999

-09-

22-1

3:37

:33

1999

-09-

22-1

3:49

:57

1999

-09-

22-1

4:02

:21

1999

-09-

22-1

4:14

:45

1999

-09-

22-1

4:27

:09

1999

-09-

22-1

4:39

:33

1999

-09-

22-1

4:51

:57

1999

-09-

22-1

5:04

:21

1999

-09-

22-1

5:16

:45

1999

-09-

22-1

5:29

:09

1999

-09-

22-1

5:41

:32

1999

-09-

22-1

5:53

:56

1999

-09-

22-1

6:06

:20

1999

-09-

22-1

6:18

:44

1999

-09-

22-1

6:31

:08

1999

-09-

22-1

6:43

:32

1999

-09-

22-1

6:55

:56

1999

-09-

22-1

7:08

:20

%C

PU

(4

CP

U)

Nr.Jobs %CPU (user) %CPU (system) %CPU (total)

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 15

Server and Client connected via FEThroughput in Kbit/sec

0

10000

20000

30000

40000

50000

60000

70000

80000

Kbi

t/sec

Throughput in kbit/sec

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 16

Server and client connected via FEMean Wall clock time

0

1000

2000

3000

4000

5000

6000

7000

0 10 20 30 40 50 60

n.job

se

c

bbfarm02

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 17

Comments:

– Client CPU: 60 % used up to 30 jobs then 20% – Server CPU: 100 % used with 5 jobs (multi-

processor server used as mono-processor)– After 30 concurrent jobs execution wall clock

execution increase rapidly – After 30 concurrent jobs, Client CPU decreases as

well as network throughput

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 18

one server (Bologna) – one client(CERN)QOS/Differentiaited Service

2Mbps WAN ATM link

2 Mbps

sunlab1 monarc01

Server Client

sunlab1: Sun Ultra5, 333 MHz, 128 MB, Solaris 2.7monarc01: Sun Enterprise 450 - 4 X 400 MHz, 512 MB, Solaris 2.6

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 19

• 4717 cells/sec --> 4717 * 48 byte/ sec = 1811 Kbit/sec.

• Available bandwidth completely used.• CPU server and client unloaded.• Very high job elapsed time. • Differentiated services mechanism working

properly (with precise network parameters tuning).

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 20

Clientssunlab1, gsun, cmssun4, atlsun1, atlas4: Sun Ultra5/10, 333 MHz, 128 MBvlsi06: Sun SPARC20, 125 MHz, 128 MBmonarc01: Sun Enterprise 450, 4X 400 MHz, 512 MB

sunlab1

Server

1000BaseT

2 Mbps

2 Mbps

2 Mbps8 Mbps

sunlab1

cmssun4 vlsi06 atlsun1 atlas4 monarc01

Access Link technologyto GARR-B (ATMbackbone)

Access Speed to GARR(Mbps)

RTT to Cnaf

Cnaf ATM 6M

Padova Point-to-point 2M 7msec

Genova Frame-relay 2M 21msec

Milano ATM 8M 8msec

Cnaf - Cern ATM 2M ( 4717 cells/sec) 51msec

WAN Test

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 21

Comments

• client CPU: never 100 % used

• server CPU: never 100 % used

• wall clock time for job running in the workstation connected via GE very high (i.e. for 10 jobs >1000’; 400’ without other clients in WAN): slow clients degrade performances on fast clients ???

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 22

ClientsClients and server: Sun Enterprise 450, 4X 400 MHz, 512 MB

cutter

Server

100BaseT

100 BaseT 100BaseT

bbfarm01

bbfarm02 bbfarm03 bbfarm04

One server –many clients LAN FERome Babar Farm

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 23

Comments:

• Throughput decrease when there are more than 30 job in the server.

• Cpu server is always 100% used.

• Starting with 40 concurrent jobs in one client jobs start crashing .

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 24

CLIENT SERVER Network speed

Max CPU Number of jobs Max CPU Number of jobs

1000M (GE)

100% >5 100% >50

100M (FE)

60% , then 20%

Up to 30, then up to 60

100% Up to 60

10M(Eth) 80% >20 30% >60 2M (PPP ATM WAN

5% Up to 20 10% (constant)

1-20 (during the all test)

Test results

• GE test: client CPU is saturated with 5 jobs• FE test: server CPU is saturated with 30 jobs• Ethernet test: network is saturated

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 25

Test results

• GE utilization is poor

Network Link Server host

Network speed

Max throughput

Number of jobs

1000M GEthernet

37Mbps 20

100M FEthernet

80Mbps 30

10M Ethernet 9Mbps 20

2M VC ATM 1.7Mbps 20

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 26

Conclusion• Objectivity 5.1 behaviour on different network layouts:

– Application/Objectivity in powerful client machine use high CPU.– AMS is not able to use multi CPU machine

• Optimal measured values for the server corresponds to 30 connection from concurrent remote jobs.– Too small for a production environment.

• Identified boundary condition for efficent running with the specific CPU.• Acceptable running condition:

– Link Server/Client minimum speed 8Mbps– Client machine from 6 to 15 concurrent analysis jobs– Server 30 conncurrent jobs request

• Global performance degrades rapidly moving away from optimal condition

7 feb. 2000 - CHEP 2000 C. Vistoli INFN/CNAF 27

Future works• Application monitoring tools able to real time check the

working conditions, to take necessary action to mantain the system around the optimal conditions.

• Test Objectivity 5.2 features• Test a Multi Server Configuration using read/write application. • Dedicated WAN Test-bed with 10Mbps bandwitdh links.• LAN - WAN behaviour with equivalent high bandwidth

capacity to be investigated deeply:– Host tuning and RTT could impact performances– QoS

• documentation: www.cern.ch/MONARC

top related