data center cooling plan document - des dessy
TRANSCRIPT

7/31/2019 Data Center Cooling Plan Document - Des Dessy
Cooling Device Requirements
Optimization Team
8 March 2012
Chapter 1
Requirements for the cooling plant:
1.1 General Requirements:
The Data Center is divided into four zones, so each of the four racks in the Data Center
will be associated with a particular zone, as shown below:

R1 - Z1, R2 - Z2, R3 - Z3, R4 - Z4

where Ri is a particular rack, and Zi is the zone associated with that rack.
A unique CRAC will be used, but each zone can be cooled independently, using a
mechanical control system that includes a set of valves. The interaction of cooling between
different zones will not be taken into account [1].
The cost for the cooling depends on the CRAC usage level; it can be calculated as the
sum of the power needed to cool each rack.
Cool air will be pumped into the Data Center from the floor at a fixed temperature
Tin; hot air will be exhausted from the ceiling at a variable temperature Tout [2].
Each zone should have an independent emergency cooling device available, in order to
avoid dangerous black-outs. If even the emergency device fails, the affected zone will
be declared out of order, and every job executing in that particular zone will be
suppressed.

[1] It is important to discuss this problem with the thermal team.
[2] A good policy could be to exploit more the lower CPUs, as they receive cooler air, and so they can be
cooled more easily.
1.2 Cooling device characterization:
Three cooling levels will be available for each zone, plus the off level, which will be
called level 0.
Transitions between different cooling levels, including turning the cooling on/off, will be possible
every Tc, where Tc is called the cooling epoch and is equal to five minutes.
Table 1.1: Example of cooling device characterization

Cooling Level | 5 min   | 10 min  | 15 min  | Energy used
4             | -3 C    | -6 C    | -10 C   | 5U [W]
3             | -1.8 C  | -3 C    | -5 C    | 2U [W]
2             | -1 C    | -1.6 C  | -2.5 C  | 1U [W]
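Combining the per-level energy figures of Table 1.1 with the rule of section 1.1 (the cooling cost is the sum of the power needed to cool each rack), the cost of a cooling plan could be computed as sketched below. The class and method names are illustrative, not taken from the actual code, and level 1 is assumed to use no energy since Table 1.1 gives no figure for it:

```java
// Sketch of the cooling-cost computation described in section 1.1,
// using the per-level energy figures of Table 1.1.
// Names are hypothetical, not from the actual cooling code.
public class CoolingCost {
    // Energy used per epoch, indexed by cooling level 0..4, in units of U [W].
    // Levels 0 and 1 are assumed to use no energy (not specified in Table 1.1).
    static final int[] ENERGY_PER_LEVEL = {0, 0, 1, 2, 5};

    // Total cost: the sum of the energy needed to cool each of the four racks.
    public static int totalEnergy(int[] zoneLevels) {
        int total = 0;
        for (int level : zoneLevels) {
            total += ENERGY_PER_LEVEL[level];
        }
        return total;
    }
}
```

For example, a plan cooling zone 1 at level 4 and zone 3 at level 2, with the other zones off, would cost 5U + 1U = 6U per epoch.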
The temperature will be a function of the current scheduling and of the current
cooling plan:

T = f( cooling plan, current scheduling )

The cooling level in each zone will be a function of the average temperature of the
CPUs in the rack, of the maximum temperature, and of the derivative of the
temperature in that particular area:

Cl = f( Tavg, Tmax, dT/dt )
1.3 Cooling policy:
The Data Center should have an average temperature of 27 C when it is working.
The cooling system will be turned on in one of the zones when the average temperature
exceeds 30 C or the maximum temperature exceeds 60 C.
The derivative of the temperature will be used to determine the cooling level: the most
powerful cooling level will be used when the derivative is high.
The cooling system will be turned off when the temperature goes below 24 C.
If no information is retrieved from the thermal map, an additional safety mechanism
is implemented [3].
[3] The way the safety mechanism works will be explained later in this report.
Figure 1.1: Example of what could happen with the adopted cooling policy (temperature [C] vs. time [m], showing the unsafe, desired, start-cooling, and stop-cooling temperatures, and the max-, med-, and min-power cooling levels).
1.4 Inputs:
A [4 x 10] float matrix containing the temperature of each CPU will be received from
the thermal team. This matrix will be sampled once every Tc, in order to have enough
information to calculate the derivative of the temperature and the average temperature.
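The per-rack statistics could be extracted from each row of the [4 x 10] matrix along these lines. The first two method names mirror calculateMax and calculateAVG from the test suite of section 4.2.2, while the derivative helper and the finite-difference approach (comparing the average against the previous epoch) are assumptions:

```java
// Sketch: extracting per-rack statistics from the [4 x 10] thermal map.
// Each row holds the 10 CPU temperatures of one rack. The derivative
// helper is an assumed name; the real code may compute it differently.
public class RackStats {
    // Maximum temperature in one rack.
    public static float calculateMax(float[] rack) {
        float max = rack[0];
        for (float t : rack) {
            if (t > max) max = t;
        }
        return max;
    }

    // Average temperature in one rack.
    public static float calculateAVG(float[] rack) {
        float sum = 0;
        for (float t : rack) {
            sum += t;
        }
        return sum / rack.length;
    }

    // Finite-difference estimate of the temperature derivative over one
    // cooling epoch Tc, from the previous and current averages.
    public static float derivative(float previousAvg, float currentAvg) {
        return currentAvg - previousAvg;
    }
}
```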
1.5 Outputs:
The cooling schedule will be sent to the database every Tc, in order to allow the Thermal
Team to update their thermal model, and to allow the Power Team to calculate the
amount of non-computing energy. The schedule will consist of an array of
four integers in the interval [1, 4].
NOTE: more details on the input and output will be given in the parametric diagram.
Chapter 2
Formal models:
2.1 Algebraic specifications:
Three state variables will be used:

The cooling level { 1 (off), 2, 3, 4 }
The average temperature { 0 C - 70 C }
The maximum temperature { 0 C - 70 C }
If Tmax > 60 C => Cooling level = 4 (2.1)
If Tmax < 60 C and Tavg > 30 C and dT/dt > 4 and Cooling level = 1 => Cooling level = 4 (2.2)
If Tmax < 60 C and Tavg > 30 C and 2 < dT/dt < 4 and Cooling level = 1 => Cooling level = 3 (2.3)
If Tmax < 60 C and Tavg > 30 C and dT/dt < 2 and Cooling level = 1 => Cooling level = 2 (2.4)
If Tmax < 60 C and Tavg < 24 C => Cooling level = 1 (2.5)
If Tmax < 60 C and dT/dt > 0 and Cooling level = 2 => Cooling level = 3 (2.6)
If Tmax < 60 C and dT/dt > 0 and Cooling level = 3 => Cooling level = 4 (2.7)
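These rules can be read as a small state machine over the current cooling level. A possible Java rendering is sketched below; it uses the 0..3 level numbering that appears in the test outputs of section 4.2 (off = 0, maximum = 3), whereas the algebraic specification numbers the same levels 1..4. The class name and the exact boundary handling are assumptions; the real calculateCooling method may differ:

```java
// Sketch of the transition rules of section 2.1, with levels 0..3
// (off = 0), matching the numbering used in the test outputs.
public class CoolingAutomaton {
    public static int calculateCooling(float tMax, float tAvg,
                                       float tDeriv, int currentLevel) {
        if (tMax > 60) {
            return 3;                      // emergency: maximum cooling
        }
        if (tAvg < 24) {
            return 0;                      // cool enough: switch off
        }
        if (currentLevel == 0 && tAvg > 30) {
            if (tDeriv > 4) return 3;      // temperature rising fast
            if (tDeriv > 2) return 2;      // moderate rise
            return 1;                      // slow rise
        }
        if (tDeriv > 0 && currentLevel >= 1 && currentLevel < 3) {
            return currentLevel + 1;       // still warming: escalate one level
        }
        return currentLevel;               // otherwise keep the current level
    }
}
```

This sketch reproduces all ten transitions exercised by the JUnit test case of section 4.2.2, e.g. calculateCooling(61, 24, 0, 0) gives 3 and calculateCooling(40, 31, 3, 0) gives 2.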
2.2 Parametric Diagram and Automata:
Chapter 3
Code implementation:
3.1 Code:
The code needed for the cooling device implementation is listed and commented below:

[ADD THE CODE HERE]
3.2 Code automata:
A code automaton is presented below; it explains how the code works. As the figure
shows, the steps in the code are the following:
1. the variables are initialized;
2. the system enters a forever loop and waits;
3. the system checks if a valid thermal map is retrievable;
4. if it is not, the safe plan is adopted, and the waiting state is entered again;
5. if it is, the data values are checked for consistency;
6. if the received data is not ok, the safe plan is adopted, and the waiting state is entered again;
7. otherwise the cooling plan is calculated and the waiting state is finally entered.
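The steps above could be sketched as the following control loop. All collaborator names (ThermalSource, CoolingController, applySafePlan, calculatePlan) are assumptions introduced for illustration, not identifiers from the actual code; the valid temperature range [0, 70] is taken from section 2.1:

```java
// Sketch of the control loop described by the code automaton.
// Collaborator names are hypothetical.
public class CoolingLoop {
    public void run(ThermalSource source, CoolingController controller) {
        // 1. variables are initialized
        int[] coolingPlan = new int[4];
        while (true) {                                  // 2. forever loop
            float[][] map = source.readThermalMap();    // 3. try to get a map
            if (map == null || !isConsistent(map)) {
                controller.applySafePlan();             // 4./6. safe plan
                continue;                               // back to waiting
            }
            coolingPlan = controller.calculatePlan(map); // 7. normal operation
        }
    }

    // 5. data consistency: every temperature must be in the valid range
    static boolean isConsistent(float[][] map) {
        for (float[] rack : map) {
            for (float t : rack) {
                if (t < 0 || t > 70) return false;
            }
        }
        return true;
    }

    interface ThermalSource { float[][] readThermalMap(); }
    interface CoolingController {
        void applySafePlan();
        int[] calculatePlan(float[][] map);
    }
}
```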
Chapter 4
Testing
4.1 Testing plan:
The testing of the cooling device will be divided into two sections.
4.1.1 Input-output data structure:
First of all, the format of the input-output data needs to be checked. This means that
if the incoming matrix is not a matrix containing [4 x 10] floats, the input is not valid.
The input validity is guaranteed by the integration group, which provided a method that ensures
a [4 x 10] float matrix is received every time.
On the other hand, if the output is not an array containing four integers in the
specified interval, the cooling plan must be recalculated.
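The two structural checks could look like the sketch below. The class and method names are assumptions; the output interval [1, 4] is the one stated in section 1.5:

```java
// Sketch of the two format checks of section 4.1.1 (names hypothetical).
public class FormatChecks {
    // Input must be a [4 x 10] matrix of floats.
    public static boolean isValidInput(float[][] matrix) {
        if (matrix == null || matrix.length != 4) return false;
        for (float[] row : matrix) {
            if (row == null || row.length != 10) return false;
        }
        return true;
    }

    // Output must be an array of four integers in the specified interval
    // ([1, 4] per section 1.5); otherwise the plan must be recalculated.
    public static boolean isValidOutput(int[] plan) {
        if (plan == null || plan.length != 4) return false;
        for (int level : plan) {
            if (level < 1 || level > 4) return false;
        }
        return true;
    }
}
```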
4.1.2 Data consistency:
We will test the behaviour of the cooling device by giving it as input a matrix containing
invalid values. The values we will use are:

-20 : in this case the software will send to the database an ERROR TYPE 1: Temperature out of valid range message.
0 : in this case the software should accept the incoming input.
30 : in this case the software should accept the incoming input.
70 : in this case the software should accept the incoming input.
100 : in this case the software will send to the database an ERROR TYPE 1: Temperature out of valid range message.
4.1.3 Automata path coverage:
A JUnit test suite will then be used to test some of the methods in the code. We will
focus our efforts in particular on one of them, the calculateCooling method, as it is the one
that effectively implements the automaton of section 2.2, and so the one that is at the core
of the cooling device. In order to have an exhaustive test, we will try each possible transition
between the different states, so as to ensure proper path coverage.
4.2 Test results:
In order to test the code we used JUnit inside the Eclipse environment.
4.2.1 Data Consistency:
As stated before, the data consistency test is already done inside the code itself.
The software reads the input from the database. The database passes a matrix to the
cooling code: one element of this matrix is read at a time. While processing the input, the cooling code
can send back different messages. Among these messages, two are error messages, while the
other one is a validation signal:
1. ERROR TYPE 1: Temperature out of valid range: this is sent when one of
the temperatures in the input is out of the acceptable range. In this case the system uses
the safe policy.
2. ERROR TYPE 2: Map not updated: if the thermal map is not ready, the system
sends this message, in order to ask for the input data again. In this case the system uses the safe
policy.
3. OK 1: Valid Input: if, while processing the input data, none of the errors listed
above occurs, then the input is correct. The system can now proceed to calculate the
cooling plan.
Results:
Input:
Rack 0  Rack 1  Rack 2  Rack 3
32      27      61      1
32      27      29      1
32      27      29      1
32      27      29      1
32      27      29      1
32      27      29      1
32      27      29      1
32      27      29      1
32      27      29      1
32      27      29      1
System output:

OK 1: valid input...
On rack number 0 cooling level 3 have been used.
The average temperature here is of: 32.0 C.
The maximum temperature on this rack is 32.0 C
On rack number 1 cooling level 0 have been used.
The average temperature here is of: 27.0 C.
The maximum temperature on this rack is 27.0 C
On rack number 2 cooling level 3 have been used.
The average temperature here is of: 32.2 C.
The maximum temperature on this rack is 61.0 C
On rack number 3 cooling level 0 have been used.
The average temperature here is of: 1.0 C.
The maximum temperature on this rack is 1.0 C
Input:

Rack 0  Rack 1  Rack 2  Rack 3
32      27      61      1
32      27      29      1
32      27      29      1
100     27      29      1
32      27      29      1
32      27      29      1
32      27      29      1
32      27      29      1
32      27      29      1
32      27      29      1
System output:

ERROR TYPE 1: Temperature out of the valid range
Safe plan has been adopted due to invalid input
4.2.2 Path Coverage and Methods Testing:
Now that we know what happens when valid and invalid inputs are passed, it is possible to check
how the system behaves when proper inputs are used.
All the code used for this part of the testing is reported here below:
import junit.framework.*;
import org.junit.After;
import org.junit.Before;

public class cooling1Test extends TestCase {

    @Before
    public void setUp() {
    }

    @After
    public void tearDown() {
    }

    public void testMax() {
        float[] a = new float[10];
        for (int i = 0; i < a.length; i++) {
            a[i] = i;
        }
        float max = 9;
        float Max = cooling1.calculateMax(a);
        assertEquals(max, Max);
    }

    public void testAVG() {
        float[] a = new float[10];
        for (int i = 0; i < a.length; i++) {
            a[i] = 10;
        }
        float avg = 10;
        float AVG = cooling1.calculateAVG(a);
        assertEquals(avg, AVG);
    }

    public void testCalculateCool() {
        float[] a = new float[10];
        for (int i = 0; i < a.length; i++) {
            a[i] = 31;
        }
        int OtoA = cooling1.calculateCooling(61, 24, 0, 0);
        assertEquals(3, OtoA);
        int AtoO = cooling1.calculateCooling(30, 23, -3, 3);
        assertEquals(0, AtoO);
        int BtoA1 = cooling1.calculateCooling(61, 23, -3, 2);
        assertEquals(3, BtoA1);
        int BtoA2 = cooling1.calculateCooling(40, 28, 1, 2);
        assertEquals(3, BtoA2);
        int OtoB = cooling1.calculateCooling(40, 31, 3, 0);
        assertEquals(2, OtoB);
        int BtoO = cooling1.calculateCooling(40, 23, -2, 2);
        assertEquals(0, BtoO);
        int OtoC = cooling1.calculateCooling(40, 31, 1, 0);
        assertEquals(1, OtoC);
        int CtoO = cooling1.calculateCooling(40, 23, -4, 1);
        assertEquals(0, CtoO);
        int CtoA1 = cooling1.calculateCooling(61, 25, -3, 1);
        assertEquals(3, CtoA1);
        int CtoA2 = cooling1.calculateCooling(40, 29, 2, 1);
        assertEquals(2, CtoA2);
    }
}
As is easy to see, we focused particularly on the calculateCooling method, as it is the core
of the cooling software. In order to test it properly we analyzed the State Chart
Diagram of section 2.2, and created a test case for each possible state transition.
At the end of the test JUnit declared that all the methods were tested successfully, and so the
code is supposed to correctly implement the cooling system model we've created.
4.3 Errors vs Time:
Figure: Bugs found and tests executed on the cooling code, plotted against the days passed after the first software release.
Comments: As can be seen, errors and tests cluster around the days of the software
releases. In particular:

After the first release a few bugs were found, and it was fairly easy to fix them all,
but the code was still running as a standalone piece of software.
After the second release lots of bugs were found, both in the database and in the
software itself. Most of them were sorted out at the beginning, but this required a lot
of testing effort.
With the third release new bugs were introduced, in addition to the unsolved ones.
This time, however, it was quite easy to fix them all.
The last release of the software gave very few problems, as just one bug was found, and
it was extremely easy to fix. This is probably due to the fact that this last release
is really simplified and the code was completely renewed. It is important to note that
the fourth release was introduced because the third was becoming too complicated
and not very manageable, so it sometimes gave strange results.
Chapter 5
Safety mechanism:
A safety mechanism has been added to the code, in order to cool the datacenter even in the
case a valid thermal map is not found.
The safety mechanism starts to work if a correct thermal map is not found within a minute.
In this case all the zones are cooled using level 4, which is of course the highest one.
The usage of this policy should ensure that the datacenter will not get too hot even if the
thermal map fails.
This safety strategy is really simple, but seems to be the most effective. It is important to
note that it should be used only in extreme cases, as it is really expensive in terms of power
usage.
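One way to realize this one-minute timeout is sketched below. Timestamps are passed in explicitly (e.g. from System.currentTimeMillis(), as in release 4); the class and method names are assumptions, not identifiers from the actual code:

```java
// Sketch of the safety mechanism: if no valid thermal map has arrived
// within one minute, every zone is cooled at the highest level.
// Names are hypothetical except System.currentTimeMillis().
public class SafetyMechanism {
    static final long TIMEOUT_MS = 60_000;   // one minute
    static final int MAX_LEVEL = 4;          // highest (most expensive) level

    private long lastValidMapMillis;

    public SafetyMechanism(long nowMillis) {
        this.lastValidMapMillis = nowMillis;
    }

    // Called whenever a valid thermal map is received.
    public void mapReceived(long nowMillis) {
        lastValidMapMillis = nowMillis;
    }

    // Returns the emergency plan (level 4 in all four zones) if the map
    // is stale, or null while the map is still fresh.
    public int[] checkTimeout(long nowMillis) {
        if (nowMillis - lastValidMapMillis > TIMEOUT_MS) {
            return new int[]{MAX_LEVEL, MAX_LEVEL, MAX_LEVEL, MAX_LEVEL};
        }
        return null;
    }
}
```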
Chapter 6
Software releases:
6.1 Release 1:
Date: 25th January 2012
The first release of the software included all the basic functions needed to test whether the module
was working properly:

Basic functionality to read/write input/output from/to .txt files;
A simple trigger system, used to pace the software;
A procedure, used to determine whether the input is correct or not;
A signaling system, used to notify the result of that check;
Additional methods used to calculate the cooling levels in all the racks of the datacenter.
The JUnit related file included:
A test case for each one of the methods used in the code.
Some additional test cases for the applyCooling method.
6.2 Release 2:
Date: 20th February 2012
The basic improvements of this second release, with respect to the first one, were:

Input/output from/to the database, i.e. database interaction;
A more sophisticated triggering system;
Simplified and more readable code ( we went from 306 lines to less than 250 );
The JUnit related file included:

A test case for each one of the methods used in the code.
A test case for each possible transition of the applyCooling method. To make
that possible, we used the cooling device automaton as an oracle. We tested all the
possible transitions, and even what happens if the cooling level remains constant.
6.3 Release 3:
Date: 8th March 2012
The basic improvements of this third release, with respect to the second one, were:
A safety mechanism added to the main class;
A new secondary class, used to store past information about the cooling levels and the
past average temperatures in each rack.
The JUnit related file included:
A test case for each one of the methods used in the code.
A test case for each possible transition of the applyCooling method.
A test case for the safety mechanism.
6.4 Release 4:
Date: 10th April 2012
The basic improvements of release 4 were:

An extreme simplification of the code, based on the use of some new methods that
made the code simpler and more linear.
A new synchronization system, using the method System.currentTimeMillis().
A better use of the safety mechanism.
The JUnit related file included:
A test case for some of the methods used in the code.
A test case for each possible transition of the applyCooling method.
A test case for the safety mechanism.
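The synchronization system mentioned among the release 4 improvements might pace the loop on the cooling epoch Tc roughly as follows. Only System.currentTimeMillis() is taken from the text; the class name and structure are assumptions:

```java
// Sketch of an epoch trigger built around System.currentTimeMillis():
// isEpochElapsed fires once every Tc = 5 minutes.
public class EpochTrigger {
    static final long TC_MS = 5 * 60 * 1000;   // the cooling epoch Tc

    private long lastTick;

    public EpochTrigger(long nowMillis) {
        this.lastTick = nowMillis;
    }

    // True exactly when a full epoch has passed since the last tick;
    // the tick is then reset so the trigger fires again one epoch later.
    public boolean isEpochElapsed(long nowMillis) {
        if (nowMillis - lastTick >= TC_MS) {
            lastTick = nowMillis;
            return true;
        }
        return false;
    }
}
```

In the real code the loop would presumably call isEpochElapsed(System.currentTimeMillis()) and recompute the cooling plan whenever it returns true.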