fcm2063 printed notes

Probability and Statistics For Scientist and Technologist

A Course in Probability and Statistics

By

Radzuan Razali
Afza Shafie


Contents:

Part One: Probability and Distribution

Chapter 1

1. The concept of probability

1.1 Introduction
1.2 Sample Space, probability of events, counting rule
1.3 Conditional probability
1.4 Multiplication rule
1.5 Bayes theorem

Chapter 2:

2. Discrete random variable and probability distribution

2.1 Introduction
2.2 Discrete random variable
2.3 Discrete probability distribution
2.4 Special functions for discrete probability distribution

Chapter 3:

3. Continuous random variable and probability distribution

3.1 Introduction
3.2 Continuous random variable
3.3 Continuous probability distribution
3.4 Special functions for continuous probability distribution

Part Two: Descriptive Statistics

Chapter 4:

4. Data display and summary of data

4.1 Introduction
4.2 The definition and the difference between sample and population
4.3 Graphical display of data: Stem and leaf, and Box-plot
4.4 The mean, variance and standard deviation of the data

Chapter 5:

5. Statistical process control: X-bar and R-charts.

5.1 Introduction
5.2 Statistical process control: X-bar and R-charts

Part Three: Inferential Statistics

Chapter 6:

6. Hypothesis testing for single population

6.1 Introduction
6.2 Test about a sample mean for large sample, population variance is known
6.3 The P-value for the test and confidence interval for mean
6.4 Test about sample mean for small sample, population variance is unknown
6.5 The P-value for the test and confidence interval for mean
6.6 Test about proportion
6.7 The P-value for the test and confidence interval for proportion
6.8 Confidence interval for proportion
6.9 Test about variance
6.10 The P-value for the test and confidence interval for variance

Chapter 7:

7. Simple linear regression model

8.1 Introduction
8.2 Least squares estimator
8.3 The coefficient of determination
8.4 Confidence intervals and significance tests

Part Four: Design of Experiments

Chapter 8:

8. The design and analysis of experiments

10.1 Introduction
10.2 Two-Factorial design experiment

Appendix

Table 1: The normal (Z) distribution
Table 2: The Student’s t-distribution
Table 3: The chi-squared distribution
Table 4: The F-distribution


Preface

This book provides an introduction to probability and statistics, with particular emphasis on applications in applied sciences, technology and engineering. Typically introductory texts on engineering statistics spend a great deal of time on basic probability ideas for the first several chapters. In fact, basic probabilities can easily fill up a standard introductory course. Because engineering students often have only one probability/statistics course, the material needs to be reorganized in order to allow for coverage of statistical methodology.

This book is divided into four parts. Part 1 covers the basic concepts of probability and probability distributions in Chapters 1 to 3. Chapter 1 gives a brief introduction to the basic concepts of probability. Chapter 2 introduces discrete random variables and their probability distributions. Continuous random variables and their probability distributions are discussed in Chapter 3.

Part 2 covers descriptive statistics in Chapters 4 and 5. Chapter 4 introduces ways to display and summarize data. The random sample, the central limit theorem, the normal approximation and statistical process control are discussed in Chapter 5.

Engineering students must also gain some experience in basic data analysis, and inferential statistics is an important part of the statistical methods used for it. Part 3 covers inferential statistics in Chapters 6 to 9. Chapter 6 introduces hypothesis testing for a single population, and hypothesis testing for two populations is discussed in Chapter 7. Chapters 8 and 9 introduce simple and multiple linear regression, respectively.

Finally, Part 4 gives students some experience with real applications in engineering. Related topics such as factorial design and the design of experiments are discussed in Chapter 10.

Afza Shafie
Radzuan Razali

April 2010.

Chapter 1

1. Basic concept of probability

Learning objectives:

At the end of this chapter, students should be able to:

Define and construct the sample space of an experiment.
Define random events, identify types of events, and apply Venn diagrams and set laws to find event sets, including intersection, union and complement.
Identify mutually exclusive and exhaustive events.
Apply Bayes’ theorem to find the conditional probability of an event when the sample space is partitioned into several mutually exclusive and exhaustive subsets.

1.1 Introduction

Probability theory is the study of randomness and uncertainty. Probability forms the basis on which we make inferences about a population, and it provides methods for quantifying the chance, or likelihood, associated with various outcomes. It helps to explain many everyday occurrences, and we discuss it more often than we may realize.

Probability is also used every day in engineering and technology. Examples include the probability that a good part is produced and the reliability of a new machine (reliabilities are in fact probabilities).

For example, an engineer wants to be fairly certain that at least 90% of the rods produced are good; otherwise he will shut down the process for recalibration. How certain can he be that at least 90% of a batch of 1000 rods are good?

What is the difference between probability and inferential statistics? In probability, the properties of the population under study are assumed known, and questions about a sample taken from that population are posed and answered. In inferential statistics, the characteristics of a sample are available to the experimenter, and this information is used to draw conclusions about the population.


1.1.1 Definitions:

Some basic terms in probability must be known and well understood:

A random process is a situation in which the possible results are known but the actual result cannot be predicted with certainty in advance.

An outcome is one possible result of a random process.

An experiment is a process by which an observation or measurement is obtained; that is, a process that yields outcomes.

1.2 Sample Space, probability of events, counting rule

Before data are collected, analysed and interpreted, it is crucial to decide how the random experiment is modelled. Two terms are central to this: the sample space and the event.

1.2.1 Sample Space:

The sample space, denoted by S, is the set of all possible outcomes of an experiment. An event is any collection (subset) of outcomes contained in the sample space S. An event is called simple if it consists of exactly one outcome and compound if it consists of more than one outcome. The null event is the event with no outcomes; it is the impossible event, or empty set.

Example 1.1:

Experiment: roll a die.

The sample space is: S = {1, 2, 3, 4, 5, 6}

The simple events (or outcomes) are:

E1 = {1}, E2 = {2}, E3 = {3}, E4 = {4}, E5 = {5}, E6 = {6}, where Ei is the event “observe the number i”.

The compound events include:

A: observe an odd number = {1, 3, 5}
B: observe a number greater than or equal to 4 = {4, 5, 6}


Example 1.2:

Toss a coin three times and observe the number of heads. The sample space is

S = {0, 1, 2, 3}

The sample space for the lifetime of a machine (in days) is,

S = { t | t ≥ 0 } = [ 0, ∞ )

The sample space for the number of calls at a telephone exchange during a specific time interval is,

S = {0, 1, 2, …}

Some knowledge of set theory is needed to understand the basics of probability. The union of events A and B, denoted by A U B and read “A or B”, is the event consisting of all outcomes that are in A, in B, or in both.

The intersection of A and B, denoted by A ∩ B and read “A and B”, is the event consisting of all outcomes that are in both A and B.

The complement of event A, denoted by A^c, is the event consisting of all outcomes in the sample space S that are not contained in A.

If two events A and B have no outcomes in common, they are said to be mutually exclusive or disjoint. This means that if one of the events occurs, the other cannot.

All of these events can be visualized using a Venn diagram.

1.2.2 Probability of Events

An event is a subset of the possible outcomes of an experiment. To each event E we assign a number P(E), called the probability of E, which gives a precise measure of the chance that E will occur. When all outcomes are equally likely, the probability of an event E is defined as the ratio of the number of outcomes favorable to the event, n, to the total number of possible outcomes, N. That is, P(E) = n/N.

For example, if a die is tossed repeatedly, what proportion of tosses do we expect, in the long run, to show an even number? That is, what is P(E), where E = {2, 4, 6}?

Three of the outcomes are even numbers, so n = 3. The total number of possible outcomes is six, so N = 6. Hence the probability that an even number occurs is P(E) = 3/6 = 0.5.
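The ratio P(E) = n/N can be checked by listing the outcomes directly. The following is a minimal Python sketch, assuming a fair six-sided die; the names sample_space, event_even and classical_probability are illustrative and do not come from these notes.

```python
from fractions import Fraction

# Sample space for one roll of a fair die (all outcomes assumed equally likely).
sample_space = {1, 2, 3, 4, 5, 6}

# Event E: the roll shows an even number.
event_even = {x for x in sample_space if x % 2 == 0}


def classical_probability(event, space):
    """Return P(E) = n / N, valid only for equally likely outcomes."""
    return Fraction(len(event & space), len(space))


p_even = classical_probability(event_even, sample_space)
print(p_even)         # 1/2
print(float(p_even))  # 0.5
```

Using Fraction keeps the result exact, which makes it easy to compare with hand calculations such as 3/6.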

Condition of Probability

A probability, denoted by P, is a rule (or function) that assigns a number between 0 and 1 to each event and must satisfy:

0 ≤ P(E) ≤ 1 for any event E

P(Ø) = 0, P(S) = 1

If A1, A2, … is an infinite collection of mutually exclusive events, then

P(A1 U A2 U …) = P(A1) + P(A2) + …

The probability of the complement of any event A is given as

P(A’) = 1 – P(A)

For example, if P(rain tomorrow) = 0.6 then P(no rain tomorrow) = 1 – 0.6 = 0.4.

Other notations for the complement of A are A^c and Ā.

Example 1.3:

An oil-prospecting firm plans to drill two exploratory wells. Past evidence is used to assess the possible outcomes listed in the following table:

Event   Description                             Probability
A       Neither well produces oil or gas        0.85
B       Exactly one well produces oil or gas    0.12
C       Both wells produce oil or gas           0.03

Find P(A or B), P(B or C) and P(B’), and give a description of each.


Solution:

Events A, B and C are mutually exclusive because the occurrence of one event precludes the occurrence of either of the other two.

P(A or B) = P(A) + P(B) = 0.85 + 0.12 = 0.97 (the probability that at most one well produces oil or gas)

P(B or C) = P(B) + P(C) = 0.12 + 0.03 = 0.15 (the probability that at least one well produces oil or gas)

P(B’) = 1 – P(B) = 1 – 0.12 = 0.88 (the probability that either neither well or both wells produce oil or gas)

1.2.3 General Addition Law

Let A and B be two events defined in a sample space S. Then

P(A U B) = P(A) + P(B) – P(A ∩ B)

If two events A and B are mutually exclusive, then

P(A ∩ B) = 0

Thus

P(A U B) = P(A) + P(B)

This can be extended to more than two mutually exclusive events.

Example 1.4

In a residential area in Ipoh, 45% of all households subscribe to the Sinar Harian newspaper published in a nearby city, 75% subscribe to Utusan Malaysia, and 30% subscribe to both papers. Draw a Venn diagram for this problem. If a household is selected at random, what is the probability that it subscribes to

a) at least one of the two newspapers?
b) exactly one of the two newspapers?

Solution:

Let A be the event that the household subscribes to Sinar Harian, and B the event that it subscribes to Utusan Malaysia.

a) P(A U B) = P(A) + P(B) – P(A ∩ B) = 0.45 + 0.75 – 0.30 = 0.90

b) P(exactly one) = P(A ∩ B’) + P(A’ ∩ B) = (0.45 – 0.30) + (0.75 – 0.30) = 0.15 + 0.45 = 0.60
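The arithmetic in Example 1.4 can be reproduced with a short Python sketch. It only assumes the three percentages given in the example; the variable names are illustrative, and exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction

# Probabilities given in Example 1.4.
p_a = Fraction("0.45")        # subscribes to Sinar Harian
p_b = Fraction("0.75")        # subscribes to Utusan Malaysia
p_a_and_b = Fraction("0.30")  # subscribes to both

# General addition law: P(A U B) = P(A) + P(B) - P(A ∩ B).
p_at_least_one = p_a + p_b - p_a_and_b

# Exactly one paper: (A only) + (B only).
p_exactly_one = (p_a - p_a_and_b) + (p_b - p_a_and_b)

print(float(p_at_least_one))  # 0.9
print(float(p_exactly_one))   # 0.6
```

The same two lines of arithmetic apply to any pair of events once P(A), P(B) and P(A ∩ B) are known.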


When all outcomes are equally likely, the probability of an event A equals the number of outcomes (sample points) contained in A divided by the total number of possible outcomes. That is:

P(A) = n(A) / n(S)

Important condition: all outcomes must be equally likely to occur. Listing the outcomes one by one becomes inefficient when n(S) is large.

1.2.4 Counting Rule:

Counting rules eliminate the need to list each simple event and make it easy to assign probabilities to events when the outcomes are equally likely. They are especially helpful when the sample space is large.

Product (Multiplication) Rule

If there are k elements (or things) to choose, with n1 choices for the first element, n2 for the second element, and so on up to nk choices for the kth element, then the number of possible ways of selecting them is

n1 × n2 × … × nk

This rule applies only when the elements are different or the order of the elements matters.

Example 1.5:

A chemical engineer wishes to conduct an experiment to determine how four factors affect the quality of a coating. She is interested in comparing two charge levels, three density levels, four temperature levels, and five speed levels. How many experimental conditions are possible?

Solution:

The number of possible experimental conditions is 2 × 3 × 4 × 5 = 120.
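The product rule can be verified by enumerating every experimental condition. In the Python sketch below, the level labels (c1, d1, and so on) are hypothetical placeholders, since the example gives only the number of levels of each factor.

```python
from itertools import product

# Hypothetical level labels; only the counts (2, 3, 4, 5) come from Example 1.5.
levels = {
    "charge": ["c1", "c2"],
    "density": ["d1", "d2", "d3"],
    "temperature": ["t1", "t2", "t3", "t4"],
    "speed": ["s1", "s2", "s3", "s4", "s5"],
}

# Enumerate every combination of one level per factor.
conditions = list(product(*levels.values()))

# Product rule: n1 x n2 x ... x nk.
count_by_rule = 1
for options in levels.values():
    count_by_rule *= len(options)

print(len(conditions))  # 120
print(count_by_rule)    # 120
```

Enumeration is practical here, but for larger designs the product rule gives the count without building the list.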

Permutations and Combinations

A permutation is an ordered arrangement of k objects taken from a set of n distinct objects (k ≤ n).

The number of permutations of k objects from n distinct objects is denoted by the symbol Pk,n, where

Pk,n = n! / (n – k)! = n(n – 1)…(n – k + 1)

Example 1.6:

Eight teaching assistants are available to grade an exam of four questions. We wish to select a different assistant to grade each question (only one assistant per question). In how many ways can the assistants be chosen for grading?

Solution:

The number of possible ways is

P4,8 = 8! / (8 – 4)! = 8 × 7 × 6 × 5 = 1680

Combination

A combination is an unordered subset of k objects taken from a set of n distinct objects.

The number of combinations of k objects from n distinct objects is denoted by the symbol Ck,n, where

Ck,n = n! / [k! (n – k)!]

Permutation vs. Combination

There are more permutations than combinations. For example, the arrangements (1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2) and (3, 2, 1) are all different permutations of the numbers 1, 2 and 3. However, they all represent the same combination of numbers.

Example 1.7:

Fifteen players compete in a tournament. In how many ways can

a) rankings be assigned to the top five competitors?
b) the best five competitors be randomly chosen?

Solution

a) The number of rankings that can be assigned to the top five competitors is

P5,15 = 15! / (15 – 5)! = 15 × 14 × 13 × 12 × 11 = 360360

b) The number of ways that five competitors can be chosen is

C5,15 = 15! / [5! (15 – 5)!] = 360360 / 120 = 3003
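The counts in Examples 1.6 and 1.7 can be cross-checked with Python's standard library. This sketch assumes Python 3.8 or later, where math.perm and math.comb are available; the variable names are illustrative.

```python
import math

# Example 1.6: ordered choice of 4 graders from 8 assistants.
ways_to_assign_graders = math.perm(8, 4)

# Example 1.7 a): ordered ranking of the top 5 out of 15 players.
rankings_top_five = math.perm(15, 5)

# Example 1.7 b): unordered choice of 5 out of 15 players.
choices_best_five = math.comb(15, 5)

print(ways_to_assign_graders)  # 1680
print(rankings_top_five)       # 360360
print(choices_best_five)       # 3003

# Each unordered group of 5 can be ranked in 5! ways.
print(rankings_top_five == choices_best_five * math.factorial(5))  # True
```

The last line illustrates why permutations outnumber combinations: every combination of k objects corresponds to k! permutations.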

1.3 Conditional Probability; Independent Events

Sometimes it is useful to know the probability that an event will occur given that another event occurred. Given two possible events, if we know that one event occurred then this information can be applied in calculating the other event’s probability.

1.3.1 Conditional Probability

The conditional probability of A, given that B has already occurred, is denoted by P(A | B) and defined as:

P(A | B) = P(A ∩ B) / P(B), provided P(B) > 0

The conditional probability of B, given that A has already occurred, is denoted by P(B | A) and defined as:

P(B | A) = P(A ∩ B) / P(A), provided P(A) > 0

Example 1.8:

The Information Resource Center (IRC), UTP displays three types of books entitled “Science” (S), “Engineering” (E), and “Technology” (T). The reading habits of a randomly selected reader with respect to these types of books are

Read regularly   S      E      T      S∩E    S∩T    E∩T    S∩E∩T
Probability      0.14   0.23   0.37   0.08   0.09   0.13   0.05

Find the following probabilities and interpret them:

a) P(S | E)
b) P(S | E U T)
c) P(S | reads at least one)
d) P(S U E | T)

Solution:
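One way to work out the four conditional probabilities from the table is sketched below in Python. The approach (inclusion-exclusion followed by the definition of conditional probability) is an assumption about how the solution proceeds; the dictionary keys and variable names are illustrative.

```python
# Probabilities from the table in Example 1.8.
p = {
    "S": 0.14, "E": 0.23, "T": 0.37,
    "SE": 0.08, "ST": 0.09, "ET": 0.13,
    "SET": 0.05,
}

# a) P(S | E) = P(S ∩ E) / P(E)
p_s_given_e = p["SE"] / p["E"]

# b) P(S | E U T): numerator and denominator by inclusion-exclusion.
p_e_or_t = p["E"] + p["T"] - p["ET"]
p_s_and_e_or_t = p["SE"] + p["ST"] - p["SET"]
p_s_given_e_or_t = p_s_and_e_or_t / p_e_or_t

# c) P(S | reads at least one); S is contained in S U E U T.
p_at_least_one = (p["S"] + p["E"] + p["T"]
                  - p["SE"] - p["ST"] - p["ET"] + p["SET"])
p_s_given_any = p["S"] / p_at_least_one

# d) P(S U E | T) = P((S U E) ∩ T) / P(T)
p_s_or_e_and_t = p["ST"] + p["ET"] - p["SET"]
p_s_or_e_given_t = p_s_or_e_and_t / p["T"]

for label, value in [("a", p_s_given_e), ("b", p_s_given_e_or_t),
                     ("c", p_s_given_any), ("d", p_s_or_e_given_t)]:
    print(f"{label}) {value:.4f}")
```

Under these assumptions the results are approximately 0.348, 0.255, 0.286 and 0.459. For instance, the first value says that about 35% of readers who regularly read Engineering books also regularly read Science books.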


1.3.2 Independent Events

The probability that both events occur can be calculated by rearranging the terms in the expression for conditional probability, that is, P(A ∩ B) = P(A | B) P(B).

Two events A and B are called independent if the probability of event A is not affected by the occurrence of event B, so P(A | B) = P(A) and P(B | A) = P(B).

Example 1.9:

In rolling a fair die, let event A = {1, 3, 5} and event B = {4, 5, 6}. Are events A and B independent?

Solution:

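P(A) = 1/2, P(B) = 1/2 and P(A ∩ B) = P({5}) = 1/6, so

P(A | B) = P(A ∩ B) / P(B) = (1/6) / (1/2) = 1/3

Since P(A | B) = 1/3 ≠ P(A) = 1/2, A and B are not independent events.

[Figure: Venn diagram for Example 1.8 showing the events S, E and T, with region probabilities S only = 0.02, E only = 0.07, T only = 0.20, S∩E only = 0.03, S∩T only = 0.04, E∩T only = 0.08, and S∩E∩T = 0.05.]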

What is the difference between mutually exclusive and independent events?

Two events are mutually exclusive (disjoint) if they cannot both happen when the experiment is performed, so P(A | B) = 0 and P(B | A) = 0.

For two mutually exclusive events, P(A ∩ B) = 0 and P(A U B) = P(A) + P(B). Mutually exclusive events with nonzero probabilities must be dependent.

For two independent events, P(A ∩ B) = P(A) P(B), so P(A U B) = P(A) + P(B) – P(A) P(B).

Example 1.10:

Toss a single die and observe the events

A: a number less than 4
B: a number less than or equal to 2
C: a number greater than 3

Are events A and B independent? Are events A and B mutually exclusive? Are events A and C independent? Are events A and C mutually exclusive?

Solution:

P(A) = 1/2, P(B) = 1/3, P(C) = 1/2.

A ∩ B = {1, 2}, so P(A | B) = 1 ≠ P(A). A and B are dependent and not mutually exclusive.

A ∩ C = Ø, so P(A | C) = 0 ≠ P(A). A and C are dependent and mutually exclusive.
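The checks in Examples 1.9 and 1.10 can be automated by enumerating the outcomes of a fair die. The Python sketch below tests independence through the product form P(A ∩ B) = P(A) P(B), which is equivalent to P(A | B) = P(A) whenever P(B) > 0; the helper names prob, independent and mutually_exclusive are illustrative.

```python
from fractions import Fraction

# Sample space of a fair die; all outcomes assumed equally likely.
S = frozenset(range(1, 7))


def prob(event):
    """Classical probability n(E) / n(S)."""
    return Fraction(len(event & S), len(S))


def independent(a, b):
    """True if P(A ∩ B) = P(A) P(B)."""
    return prob(a & b) == prob(a) * prob(b)


def mutually_exclusive(a, b):
    """True if A and B share no outcomes."""
    return len(a & b) == 0


# Example 1.10
A = {1, 2, 3}   # a number less than 4
B = {1, 2}      # a number less than or equal to 2
C = {4, 5, 6}   # a number greater than 3

print(independent(A, B), mutually_exclusive(A, B))  # False False
print(independent(A, C), mutually_exclusive(A, C))  # False True

# Example 1.9
print(independent({1, 3, 5}, {4, 5, 6}))            # False
```

The second line of output shows the point made in the text: disjoint events with positive probability are always dependent.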

1.4 Bayes’ theorem

1.4.1 Multiplicative Law of Probability and Independence

For two events A and B,

P(A ∩ B) = P(A | B) P(B)

Events A and B are independent if and only if

P(A ∩ B) = P(A) P(B)

If events A1, …, Ak are independent, then

P(A1 ∩ A2 ∩ … ∩ Ak) = P(A1) P(A2) … P(Ak)

The multiplication rule is most useful when the experiment consists of several stages in succession. The conditioning event B describes the outcome of the first stage and A the outcome of the second, so that P(A | B), which conditions on what occurs first, will often be known.

Example 1.11:

During a space shot, the primary computer system is backed up by two secondary systems. They operate independently of one another, and each is 95% reliable. What is the probability that all three systems will be operable at the time of the launch?

Solution

Let
A1: the event that the main system is operable
A2: the event that the first backup is operable
A3: the event that the second backup is operable

Given P(A1) = P(A2) = P(A3) = 0.95. Since the systems operate independently,

P(A1 ∩ A2 ∩ A3) = P(A1) P(A2) P(A3) = (0.95)^3 ≈ 0.857
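For independent systems, the joint probability is simply the product of the individual reliabilities. A minimal Python sketch of Example 1.11, assuming Python 3.8 or later for math.prod and using illustrative variable names:

```python
import math

# Reliability of the main system and the two backups (Example 1.11).
reliabilities = [0.95, 0.95, 0.95]

# Independence: P(A1 ∩ A2 ∩ A3) = P(A1) P(A2) P(A3).
p_all_operable = math.prod(reliabilities)

print(round(p_all_operable, 3))  # 0.857
```

Changing the list to other reliabilities reuses the same calculation, provided the independence assumption still holds.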

1.4.2 The Law of Total Probability

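Suppose B1, B2, …, Bn are mutually exclusive and exhaustive in S. Then for any event A,

P(A) = P(A | B1) P(B1) + P(A | B2) P(B2) + … + P(A | Bn) P(Bn)

1.4.3 Bayes’ Theorem

Suppose B1, B2, …, Bn are mutually exclusive and exhaustive (their union is S). Let A be an event such that P(A) > 0. Then for any event Bj, j = 1, 2, …, n,

P(Bj | A) = P(A | Bj) P(Bj) / [P(A | B1) P(B1) + P(A | B2) P(B2) + … + P(A | Bn) P(Bn)]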

Example 1.12:

A store stocks bulbs for LCD projectors from three suppliers. Suppliers A, B, and C supply 10%, 20%, and 70% of the bulbs, respectively. It has been determined that supplier A’s bulbs are 1% defective, supplier B’s are 3% defective and supplier C’s are 4% defective. If a bulb is selected at random and found to be defective, what is the probability that it came from supplier B?

Solution:

Let D be the event that the selected bulb is defective. Then the probability that it came from supplier B is

P(B | D) = P(D | B) P(B) / [P(D | A) P(A) + P(D | B) P(B) + P(D | C) P(C)]
= (0.03)(0.20) / [(0.01)(0.10) + (0.03)(0.20) + (0.04)(0.70)]
= 0.006 / 0.035 ≈ 0.171
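The law of total probability and Bayes’ theorem translate directly into a few lines of Python. The sketch below uses the supplier shares and defect rates from Example 1.12; the function names total_probability and posterior are illustrative and not part of these notes.

```python
# Supplier shares P(supplier) and defect rates P(D | supplier), Example 1.12.
priors = {"A": 0.10, "B": 0.20, "C": 0.70}
defect_rates = {"A": 0.01, "B": 0.03, "C": 0.04}


def total_probability(priors, likelihoods):
    """Law of total probability: sum of P(D | Bi) P(Bi) over the partition."""
    return sum(likelihoods[k] * priors[k] for k in priors)


def posterior(priors, likelihoods):
    """Bayes' theorem: P(Bi | D) for every Bi in the partition."""
    evidence = total_probability(priors, likelihoods)
    return {k: likelihoods[k] * priors[k] / evidence for k in priors}


p_defective = total_probability(priors, defect_rates)
post = posterior(priors, defect_rates)

print(round(p_defective, 3))  # 0.035
print(round(post["B"], 3))    # 0.171
```

The posterior probabilities over all three suppliers sum to 1, which is a useful sanity check on any Bayes calculation.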


Exercise Chapter 1:

1. Each message in a digital communication system is classified as to whether it is received within the time specified by the system design. If 3 messages are classified, what is an appropriate sample space for this experiment?

2. A digital scale is used that provides weights to the nearest gram. Let event A: a weight exceeds 11 grams, B: a weight is less than or equal to 15 grams, and C: a weight is greater than or equal to 8 grams and less than 12 grams. What is the sample space for this experiment? Find

(a) A U B (b) A’ (c) A ∩ B

(d) (A U C)’ (e) A ∩ B ∩ C (f) B’ ∩ C

3. Samples of building materials from three suppliers are classified for conformance to air-quality specifications. The results from 100 samples are summarized as follows:

            Conforms
Supplier    Yes    No
R           30     10
S           22     8
T           25     5

Let A denote the event that a sample is from supplier R, and B denote the event that a sample conforms to the specifications. If a sample is selected at random, determine the following probabilities:

(a) P(A) (b) P(B) (c) P(B’)
(d) P(A U B) (e) P(A ∩ B) (f) P(A U B’)
(g) (h)

4. The compact discs from a certain supplier are analyzed for scratch and shock resistance. The results from 100 discs tested are summarized as follows:

                    Scratch Resistance
Shock Resistance    High    Low
High                30      10
Medium              22      8
Low                 25      5

Let A denote the event that a disc has high shock resistance, and B denote the event that a disc has high scratch resistance. If a disc is selected at random, determine the following probabilities:

(a) P(A) (b) P(B) (c) P(B’)
(d) P(A U B) (e) P(A ∩ B) (f) P(A U B’)
(g) (h)

5. The reaction times (in minutes) of a reactor for two batches are measured in an experiment.

(a) Define the sample space of the experiment.
(b) Define event A, where the reaction time of the first batch is less than 45 minutes, and event B, where the reaction time of the second batch is greater than 75 minutes.
(c) Find A U B, A ∩ B and A’.
(d) Verify whether events A and B are mutually exclusive.

6. When a die is rolled and a coin is tossed, use a tree diagram to describe the set of possible outcomes and find the probability that the die shows an odd number and the coin shows a head.

7. A bag contains 3 black and 4 white balls. Two balls are drawn at random, one at a time, without replacement.

(i) What is the probability that the second ball drawn is black?
(ii) What is the conditional probability that the first ball drawn is black if the second ball is known to be black?

8. An oil-prospecting firm plans to drill two exploratory wells. Past evidence is used to assess the possible outcomes listed in the following table:

Event   Description                             Probability
A       Neither well produces oil or gas        0.80
B       Exactly one well produces oil or gas    0.18
C       Both wells produce oil or gas           0.02

Find and give description for

9. In a residential suburb, 60% of all households subscribe to the metro newspaper published in a nearby city, 80% subscribe to the local paper, and 50% of all households subscribe to both papers. Draw a Venn diagram for this problem. If a household is selected at random, what is the probability that it subscribes to
(a) at least one of the two newspapers
(b) exactly one of the two newspapers

10. In a student organization election, one president is to be elected from five candidates, one vice president from six candidates, and one secretary from three candidates. How many possible outcomes are there?

11. Suppose each student is assigned a 5-digit number. How many different numbers can be created?

12. A chemical engineer wishes to conduct an experiment to determine how four factors affect the quality of a coating. She is interested in comparing three charge levels, five density levels, four temperature levels, and three speed levels. How many experimental conditions are possible?

13. A menu has five appetizers, three soups, seven main courses, six salad dressings and eight desserts. In how many ways can
(a) a full meal be chosen?
(b) a meal be chosen if either an appetizer or a soup is ordered, but not both?

14. Ten teaching assistants are available to grade a test of four questions. We wish to select a different assistant to grade each question (only one assistant per question). In how many ways can the assistants be chosen for grading?

15. A participant samples 8 products and is asked to pick the best, the second best, and the third best. How many possible outcomes are there?

16. Suppose that in the taste test, each participant samples eight products and is asked to select the three best products. What is the number of possible outcomes?

17. A contractor has 8 suppliers from which to purchase electrical supplies. He will select 3 of these at random and ask each supplier to submit a project bid. In how many ways can the selection of bidders be made?

18. Twenty players compete in a tournament. In how many ways can
(a) rankings be assigned to the top five competitors?
(b) the best five competitors be randomly chosen?

19. Three balls are selected at random without replacement from the jar below. Find the probability that one ball is red and two are black.

20. A university warehouse has received shipment of 25 printers, of which 10 are laser printers and 15 are inkjet models. If 6 of these 25 are selected at random by a technician, what is the probability that exactly 3 of those selected are laser printers?

21. There are 17 broken light bulbs in a box of 100 light bulbs. A random sample of 3 light bulbs is chosen without replacement.
(a) How many ways are there to choose the sample?
(b) How many samples contain no broken light bulbs?
(c) What is the probability that the sample contains no broken light bulbs?
(d) How many ways are there to choose a sample that contains exactly 1 broken light bulb?
(e) What is the probability that the sample contains no more than 1 broken light bulb?

22. An agricultural research establishment grows vegetables and grades each one as either good or bad for taste, good or bad for size, and good or bad for appearance. Overall, 78% of the vegetables have a good taste. However, only 69% of the vegetables have both a good taste and a good size. Also, 5% of the vegetables have a good taste and a good appearance, but a bad size. Finally, 84% of the vegetables have either a good size or a good appearance.
(a) If a vegetable has a good taste, what is the probability that it also has a good size?
(b) If a vegetable has a bad size and a bad appearance, what is the probability that it has a good taste?

23. A local library displays three types of books entitled “Science” (S), “Arts” (A), and “Novels” (N). The reading habits of a randomly selected reader with respect to these types of books are

Read regularly   S      A      N      S∩A    S∩N    A∩N    S∩A∩N
Probability      0.14   0.23   0.37   0.08   0.09   0.13   0.05

Find the following probabilities and interpret them:
(a) P(S | A)
(b) P(S | A U N)
(c) P(S | reads at least one)
(d) P(S U A | N)

24. A batch of 500 containers for frozen orange juice contains 5 that are defective. Two are selected at random, without replacement, from the batch. Let A and B denote the events that the first and second containers selected are defective, respectively.
(a) Are A and B independent events?
(b) If the sampling were done with replacement, would A and B be independent?

25. Every day (Monday to Friday) a batch of components sent by a first supplier arrives at a certain inspection facility. Two days a week, a batch also arrives from a second supplier. Eighty percent of all batches from supplier 1 pass inspection, and 90% of all batches from supplier 2 pass inspection. On a randomly selected day, what is the probability that two batches pass inspection?

26. The probability is 1% that an electrical connector that is kept dry fails during the warranty period of a portable computer. If the connector is ever wet, the probability of a failure during the warranty period is 5%. If 90% of the connectors are kept dry and 10% are wet, what proportion of connectors fail during the warranty period?


27. Computer keyboard failures are due to faulty electrical connects (12%) or mechanical defects (88%). Mechanical defects are related to loose keys (27%) or improper assembly (73%). Electrical connect defects are caused by defective wires (35%), improper connections (13%) or poorly welded wires (52%). Find the probability that a failure is due to
(a) loose keys
(b) improperly connected or poorly welded wires.

28. During a space shot, the primary computer system is backed up by two secondary systems. They operate independently of one another, and each is 90% reliable. What is the probability that all three systems will be operable at the time of the launch?

29. A store stocks light bulbs from three suppliers. Suppliers A, B, and C supply 10%, 20%, and 70% of the bulbs respectively. It has been determined that company A’s bulbs are 1% defective while company B’s are 3% defective and company C’s are 4% defective. If a bulb is selected at random and found to be defective, what is the probability that it came from supplier B?

30. A particular city has three airports. Airport A handles 50% of all airline traffic, while airports B and C handle 30% and 20%, respectively. The rates of losing baggage at airports A, B and C are 0.3, 0.15 and 0.14, respectively. If a passenger arrives in the city and loses a bag, what is the probability that the passenger arrived at airport A?

31. A company rated 75% of its employees as satisfactory and 25% unsatisfactory. Of the satisfactory ones 80% had experience, of the unsatisfactory only 40%. If a person with experience is hired, what is the probability that (s)he will be satisfactory?

32. In a certain assembly plant, three machines, B1, B2 and B3, make 30%, 45% and 25%, respectively, of the products. It is known from past experience that 2%, 3% and 2% of the products made by each machine, respectively, are defective. Now, suppose that a finished product is randomly selected.
(a) What is the probability that it is defective?
(b) If a product was chosen randomly and found to be defective, what is the probability that it was produced by machine B3?

33. Three machines A, B and C produce identical items. Of their respective outputs, 5%, 4% and 3% of the items are faulty. On a certain day A has produced 25%, B has produced 30% and C has produced 45% of the total output. An item selected at random is found to be faulty. What is the probability that it was produced by C?

34. Suppose that a test for Influenza A (H1N1) has a very high success rate: if a tested patient has the disease, the test accurately reports this, a “positive”, 99% of the time, and if a tested patient does not have the disease, the test accurately reports that, a “negative”, 95% of the time. Suppose also, however, that only 0.1% of the population have the disease.
(a) What is the probability that the test returns a positive result?
(b) If the patient has a positive result, what is the probability that he has the disease?
(c) What is the probability of a false positive?

35. An insurance company charges younger drivers a higher premium than it does older drivers because younger drivers as a group tend to have more accidents. The company has 3 age groups: Group A includes those less than 25 years old and makes up 22% of all its policyholders; Group B includes those 25-39 years old and makes up 43% of all its policyholders; Group C includes those 40 years old and older and makes up 35% of all its policyholders. Company records show that in any given one-year period, 11% of its Group A policyholders have an accident. The percentages for Groups B and C are 3% and 2%, respectively.
(a) What is the probability that one of the company’s policyholders has an accident during the next 12 months?
(b) Suppose Mr. Chong has just had a car accident. If he is one of the company’s policyholders, what is the probability that he is under 25?


Chapter 2

2. Discrete random variable and discrete probability distributions



Learning objectives:

At the end of this chapter, students should be able to:

Define random variables.
Differentiate between discrete and continuous random variables.
Define discrete probability distributions.
Know the special functions for discrete probability distributions.

2.1 Introduction

A random variable is a rule that assigns a number to each outcome of an experiment. These numbers are called the measured values of the random variable. Capital letters such as X, Y and Z are used to denote random variables, and lowercase letters such as x, y and z denote their measured values.

Example 2.1:

Select a soccer player; the random variable Y is the number of goals the player has scored during the season.

The measured values of Y are 0, 1, 2, 3, …

Consider the test marks of 100 engineering students; the random variable Z is the average mark scored by the students.

Possible values of Z are 65.4, 67.8, 70.5, 77.3, …

There are two types of random variables: discrete random variables and continuous random variables.

2.2 Discrete random variable

The measured values of a discrete random variable are finite or countable in number; typically they are integers. The number of students in this class is an example of a discrete random variable.

2.3 Continuous random variable

The measured values of a continuous random variable are real numbers and can be any value within a range. The weight of a student in this class is an example of a continuous random variable.

Example 2.2:

Identify whether the random variable below is discrete or continuous random variable.

(i) The number of female students in the class.(ii) The number of telephone calls.(iii) The time between two accidents.(iv) The number of cracks in a certain length of road.(v) The height of the athletes participated in the Asian Game.(vi) The volume of water in the tank.(vii) A score on the statistics final examination. (viii) The number of cars on the road at a certain period of time.

Solution:

(i) discrete
(ii) discrete
(iii) continuous
(iv) discrete
(v) continuous
(vi) continuous
(vii) continuous
(viii) discrete

2.4 Discrete probability distribution

If a random variable is a discrete variable, its probability distribution is called a discrete probability distribution or probability mass function, pmf.

Suppose the experiment is flipping a coin two times. This simple experiment has four possible outcomes, which form the sample space: HH, HT, TH, and TT. Now, let the random variable X represent the number of Heads that result from this experiment. The random variable X can only take on the values 0, 1, or 2, so it is a discrete random variable.

The probability distribution for this experiment appears as below.

Number of heads, x    Probability, P(X = x)
0                     1/4
1                     1/2
2                     1/4

The above table represents a discrete probability distribution, and the probability function P(X = xi) is called the probability mass function (pmf) of X because it relates each value of a discrete random variable to its probability of occurrence.

2.4.1 The properties of the probability mass function (pmf)

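The pmf, P(X = xi), of a discrete random variable X must satisfy two conditions:

(i) $0 \le P(X = x_i) \le 1$ for every value $x_i$

(ii) $\sum_i P(X = x_i) = 1$

Given the pmf, the probability of any event involving X can be calculated. For example, the probability that X is at most one is

$P(X \le 1) = \sum_{x_i \le 1} P(X = x_i)$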

Example 2.3:

Two balls are drawn at random in succession without replacement from an urn containing 4 red balls and 6 black balls. Find the probabilities of all the possible outcomes.

Solution:

Let X denote the number of red balls in the outcome.

Possible outcomes   RR   RB   BR   BB
X                   2    1    1    0

Here, x1 = 2, x2 = 1, x3 = 1, x4 = 0

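Now, the probability of getting 2 red balls when we draw the balls one at a time is found as follows.

Probability of first ball being red = 4/10

Probability of second ball being red = 3/9 (because there are 3 red balls left in the urn, out of a total of 9 balls left). So:

$P(RR) = \frac{4}{10} \times \frac{3}{9} = \frac{12}{90} = \frac{2}{15}$

Likewise, the probability of red first is 4/10, followed by black with probability 6/9 (because there are 6 black balls still in the urn and 9 balls altogether). So:

$P(RB) = \frac{4}{10} \times \frac{6}{9} = \frac{24}{90} = \frac{4}{15}$

Similarly for black then red:

$P(BR) = \frac{6}{10} \times \frac{4}{9} = \frac{24}{90} = \frac{4}{15}$

Finally, for 2 black balls:

$P(BB) = \frac{6}{10} \times \frac{5}{9} = \frac{30}{90} = \frac{5}{15}$

Since X = 1 for both RB and BR, P(X = 1) = 4/15 + 4/15 = 8/15. So the probability distribution is:

X          2      1      0
P(X = x)   2/15   8/15   5/15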
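The distribution above can be checked by brute-force enumeration. The following is a minimal Python sketch, not part of the original notes; the function name `red_count_pmf` and the list `urn` are illustrative. It counts every equally likely ordered pair of distinct balls and keeps exact fractions.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import permutations

# Urn from Example 2.3: 4 red (R) and 6 black (B) balls.
urn = ["R"] * 4 + ["B"] * 6


def red_count_pmf(balls, draws=2):
    """Return {number of red balls: probability} for draws without replacement."""
    counts = defaultdict(int)
    total = 0
    # Every ordered selection of distinct balls is equally likely.
    for ordered in permutations(range(len(balls)), draws):
        reds = sum(1 for i in ordered if balls[i] == "R")
        counts[reds] += 1
        total += 1
    return {x: Fraction(c, total) for x, c in sorted(counts.items())}


pmf = red_count_pmf(urn)
for x, p in pmf.items():
    print(x, p)
```

Note that `Fraction` prints 5/15 in lowest terms, as 1/3.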

Example 2.4:

Given the probability distribution,

X          0      1     2   3     4      5
P(X = x)   1/10   1/5   k   1/5   3/10   1/10

Find the value of k that makes P(X=x) a valid pmf of X.

Solution:

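For P(X = x) to be a valid pmf of X, it must satisfy $\sum_x P(X = x) = 1$. Hence,

$\frac{1}{10} + \frac{1}{5} + k + \frac{1}{5} + \frac{3}{10} + \frac{1}{10} = 1$, so $k = 1 - \frac{9}{10} = \frac{1}{10}$.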

2.4.2 The cumulative distribution function (cdf)

The cdf of a discrete random variable X is defined by,

$F(x) = P(X \le x) = \sum_{x_i \le x} P(X = x_i)$

Example 2.5:

Given pmf,

X          2      1      0
P(X = x)   2/15   8/15   5/15

Find the cdf of X.

Solution:

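For $x < 0$, $F(x) = 0$

For $0 \le x < 1$, $F(x) = P(X = 0) = \frac{5}{15}$

For $1 \le x < 2$, $F(x) = P(X = 0) + P(X = 1) = \frac{5}{15} + \frac{8}{15} = \frac{13}{15}$

For $x \ge 2$, $F(x) = \frac{5}{15} + \frac{8}{15} + \frac{2}{15} = 1$

So the cdf of X is:

$F(x) = \begin{cases} 0, & x < 0 \\ 5/15, & 0 \le x < 1 \\ 13/15, & 1 \le x < 2 \\ 1, & x \ge 2 \end{cases}$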

2.4.3 The mean and the variance of X

Given the pmf of X, P(X = x), all the parameters of X such as the mean, the variance and the standard deviation can be determined by using the definition of expectation. The mean of X is defined by,

$\mu = E(X) = \sum_i x_i P(X = x_i)$

The variance of X is defined by,

$\sigma^2 = Var(X) = \sum_i (x_i - \mu)^2 P(X = x_i) = E(X^2) - \mu^2$

The standard deviation, $\sigma$, is the square root of the variance.

Example 2.6:

Let X be a random variable with pmf,

X          2      1      0
P(X = x)   2/15   8/15   5/15

Find the mean, the variance and the standard deviation of X.

Solution:

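The mean of X is:

$\mu = E(X) = 2\left(\frac{2}{15}\right) + 1\left(\frac{8}{15}\right) + 0\left(\frac{5}{15}\right) = \frac{12}{15} = 0.8$

The variance of X is:

$\sigma^2 = E(X^2) - \mu^2 = \left[4\left(\frac{2}{15}\right) + 1\left(\frac{8}{15}\right) + 0\left(\frac{5}{15}\right)\right] - (0.8)^2 = \frac{16}{15} - 0.64 = 0.4267$

The standard deviation is $\sigma = \sqrt{0.4267} = 0.653$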
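The same quantities can be computed directly from the pmf table. The sketch below is illustrative only (the helper names `pmf_mean`, `pmf_variance` and `pmf_cdf` are not from the notes); it applies the definitions of the mean, variance and cdf to the pmf of Example 2.6 using exact fractions.

```python
from fractions import Fraction
from math import sqrt

# pmf from Example 2.6, stored as {value: probability}.
pmf = {0: Fraction(5, 15), 1: Fraction(8, 15), 2: Fraction(2, 15)}


def pmf_mean(pmf):
    """E(X) = sum of x * P(X = x)."""
    return sum(x * p for x, p in pmf.items())


def pmf_variance(pmf):
    """Var(X) = sum of (x - mu)^2 * P(X = x)."""
    mu = pmf_mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())


def pmf_cdf(pmf, x):
    """F(x) = P(X <= x), the sum of the pmf over all values not exceeding x."""
    return sum(p for xi, p in pmf.items() if xi <= x)


mean = pmf_mean(pmf)
variance = pmf_variance(pmf)
std_dev = sqrt(variance)

print(mean, variance, round(std_dev, 3))
print(pmf_cdf(pmf, 1))
```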

2.5 Special functions for discrete probability distribution

There are many special discrete probability distributions such as Bernoulli distribution, Binomial distribution and Poisson distribution.

2.5.1 Bernoulli distribution

A Bernoulli experiment is an experiment with only two possible outcomes. For example, toss a fair coin once and let X be the number of heads. There are only two possible outcomes, X = 0 or X = 1, with probability distribution:

Possible outcomes   Head   Tail
X                   1      0
P(X = x)            1/2    1/2

2.5.2 Binomial distribution

If a Bernoulli experiment is repeated independently n times, and the random variable X is the number of successes, then the probability distribution of X is called the Binomial distribution, with pmf

$P(X = x) = \binom{n}{x} p^x q^{n-x}, \quad x = 0, 1, \ldots, n$

where p is the probability of success and q = 1 - p.

By using the definition, it can be shown that, if X has a Binomial distribution, then the mean of X is E(X) = np and the variance of X is Var(X) = npq.

Example 2.7:

A fair coin is tossed 10 times, and X is the number of heads.
(i) What is the pmf of X?
(ii) Find the probability that heads appear exactly 5 times.
(iii) What is the probability of no heads?
(iv) Find the mean and the variance of X.

Solution:

(i) The probability mass function of X is given by:

$P(X = x) = \binom{10}{x} (0.5)^x (0.5)^{10-x}, \quad x = 0, 1, \ldots, 10$

(ii) $P(X = 5) = \binom{10}{5} (0.5)^{10} = \frac{252}{1024} = 0.2461$

(iii) $P(X = 0) = (0.5)^{10} = \frac{1}{1024} = 0.0010$

(iv) The mean of X is np = 10(0.5) = 5. The variance of X is npq = 10(0.5)(0.5) = 2.5.
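The binomial calculations in Example 2.7 can be reproduced with the standard library. This is a small sketch under the stated model (n = 10 independent tosses, p = 0.5); the function name `binomial_pmf` is illustrative.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p): C(n, x) * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n, p = 10, 0.5

p_five = binomial_pmf(5, n, p)   # exactly 5 heads
p_zero = binomial_pmf(0, n, p)   # no heads
mean = n * p                     # E(X) = np
variance = n * p * (1 - p)       # Var(X) = npq

print(round(p_five, 4), round(p_zero, 4), mean, variance)
```

Summing `binomial_pmf(x, n, p)` over x = 0, ..., n gives 1, which is a quick check that the pmf is valid.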

2.5.3 Poisson distribution

Another important discrete distribution is the Poisson distribution. The random variable X is the number of occurrences in the interval of interest. An example of a Poisson random variable is the number of accidents within a certain period of time. If X is a random variable with a Poisson distribution, then the pmf of X is given by,

$P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!}, \quad x = 0, 1, 2, \ldots$

where $\lambda$ is the mean of X for the interval of interest.

By using the definition, it can be shown that, if X has a Poisson distribution, then the mean of X is E(X) = $\lambda$ and the variance of X is Var(X) = $\lambda$.

Example 2.8:

Anne's answering machine receives about 6 telephone calls between 8 a.m. and 10 a.m. What is the probability that Anne receives more than 1 call in the next 15 minutes?

Solution:

Let X = the number of calls Anne receives in 15 minutes. (The interval of interest is 15 minutes.) The random variable X takes on the values 0, 1, 2, ... If Anne receives, on average, 6 telephone calls in 2 hours, then she receives 6 × (1/8) = 0.75 calls in 15 minutes, on average, because 15 minutes is 1/8 of 2 hours. So $\lambda = 0.75$.

Hence, the probability that Anne receives more than 1 call in the next 15 minutes is,

$P(X > 1) = 1 - P(X = 0) - P(X = 1) = 1 - e^{-0.75}(1 + 0.75) = 1 - 0.8266 = 0.1734$
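As a numerical check of Example 2.8, the sketch below rescales the rate to the 15-minute interval and evaluates the Poisson pmf. The names `poisson_pmf` and `calls_per_two_hours` are illustrative, not from the notes.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """P(X = x) for a Poisson random variable with mean lam."""
    return exp(-lam) * lam**x / factorial(x)


calls_per_two_hours = 6
lam = calls_per_two_hours * 15 / 120   # mean number of calls in 15 minutes

# P(X > 1) = 1 - P(X = 0) - P(X = 1)
p_more_than_one = 1 - poisson_pmf(0, lam) - poisson_pmf(1, lam)

print(lam, round(p_more_than_one, 4))
```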

Exercise 2

1. Identify each of the random variables as continuous or discrete random variable.

(a) The number of atoms
(b) The number of fish in a pond
(c) The home team score in a football game
(d) The voltage on a power line
(e) A score on the mathematics final exam
(f) The volume of gas in the tank
(g) The number of cars at the petrol station
(h) The number of accidents in Ipoh
(i) The number of cakes left in the pantry
(j) The height of civil engineering students in UTP

2. Let,

x          0      1      2   3
P(X = x)   0.15   0.25   k   0.35

(i) Find the value of k that results in a valid probability distribution.
(ii) Find the expected value of X and the standard deviation of X.
(iii) What is the probability that X is greater than or equal to 1?

3. At UTP, the business students run an investment club. Each semester they create investment portfolios in multiples of RM1,000 each. Records from the past several years show the following probabilities of profits (rounded to the nearest RM50). In the table below, x = profit per RM1,000 and P(x) is the probability of earning that profit.

x      0      50     100   150   200
P(x)   0.15   0.35   k     0.2   0.05

(a) Determine the value of k that results in a valid probability distribution.
(b) The profit per RM1,000 is a random variable. Is it discrete or continuous? Explain.
(c) Find the expected value of the profit in a RM1,000 portfolio.
(d) Find the standard deviation of the profit.
(e) What is the probability of a profit of RM150 or more in a RM1,000 portfolio?

4. Let X denote the number of bars of service on your cell phone whenever you are at an intersection with the following probabilities:

x          0      1      2      3      4      5
P(X = x)   0.05   0.15   0.20   0.35   0.15   0.1

Determine the following:
(a) F(x)
(b) Mean and variance
(c) P(X < 2)
(d) P(X > 2.5)

5. A local cab company is interested in the number of pieces of luggage a cab carries on a taxi run. A random sample of 260 taxi runs gave the following information. x = number of pieces of luggage and f is the frequency with which taxi runs carried x pieces of luggage.

x : 0  1  2  3  4  5  6  7  8  9  10
f : 42 51 63 38 19 16 12 10 6  2  1

(a) Find the probability distribution for x.
(b) Estimate the probability that a taxi run will have from 0 to 4 pieces of luggage (including 0 and 4).
(c) What is the expected value of x?
(d) What is the standard deviation of x?

6. A professor estimates the probability that he will receive at least one telephone call at home during the hours of 5pm to 7pm on a weekday to be 1/3. Use the formulas for computing binomial probabilities to answer the following questions:
(a) What is the probability that he will receive at least one call on all five of the next five weekday nights?
(b) What is the probability that he will not receive a call on any of the next five weekday nights?
(c) What is the probability that he will receive a call on at least four of the next five weekday nights?

7. The probability of successfully landing a plane using a flight simulator is given as 0.80. Nine randomly and independently chosen student pilots are asked to try to fly the plane using the simulator.
(a) What is the probability that all the student pilots successfully land the plane using the simulator?
(b) What is the probability that none of the student pilots successfully lands the plane using the simulator?
(c) What is the probability that exactly eight of the student pilots successfully land the plane using the simulator?

8. Suppose X has a Poisson distribution with a mean of 7. Determine the following.
(a) P(X = 0);
(b) P(X = 5);
(c) P(X < 3); and
(d) .

9. At the drive-thru window of a McDonald's food establishment, it was found that during slower periods of the day, vehicles visited at the rate of 15 per hour. Determine the probability that
(a) no vehicles visit the drive-thru within a ten-minute interval during one of these slow periods;
(b) exactly 3 vehicles visit the drive-thru within a ten-minute interval during one of these slow periods; and
(c) at least three vehicles visit the drive-thru within a ten-minute interval during one of these slow periods.

10. The number of cracks in a section of PLUS highway that are significant enough to require repair is assumed to follow a Poisson distribution with a mean of two cracks per kilometer. Determine the probability that
(a) there are no cracks at all in 2 km of highway;
(b) there is at least one crack in 500 m of highway; and
(c) there are exactly 3 cracks in 0.5 km of highway.

Chapter 3

3. Continuous probability distributions



Define continuous probability distributions
Know the special functions for continuous probability distributions

3.1 Introduction

If the outcome of an experiment is a continuous random variable, its probability distribution is called a continuous probability distribution and is described by a probability density function, pdf.

3.1.1 The properties of the probability density function (pdf)

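The pdf, f(x), of a continuous random variable X must satisfy two conditions:

(i) $f(x) \ge 0$ for all x

(ii) $\int_{-\infty}^{\infty} f(x)\,dx = 1$

Given the pdf, the probability of any event involving X can be calculated. For example, the probability that X is at most one is

$P(X \le 1) = \int_{-\infty}^{1} f(x)\,dx$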

Example 3.1:

Let X be continuous random variable with pdf given by,

Find the value of k that makes f(x) a valid pdf of X.

Solution:

To be a valid pdf, f(x) must satisfy $\int_{-\infty}^{\infty} f(x)\,dx = 1$.

So,

3.1.2 The cumulative distribution function (cdf)

The cdf of a continuous random variable X is defined by,

$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$

Example 3.2:


Find,

(i) P(X < 0.5), P( 0.5 < X < 0.75)

(ii) The cdf of X.

Solution:

(i)

(ii) The cdf of X,

So the cdf of X is:

3.1.3 The mean and the variance of X

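Given the pdf of X, f(x), all the parameters of X such as the mean, the variance and the standard deviation can be determined by using the definition of expectation. The mean of X is defined by,

$\mu = E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$

The variance of X is defined by,

$\sigma^2 = Var(X) = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx = E(X^2) - \mu^2$

The standard deviation, $\sigma$, is the square root of the variance.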

Example 3.3:


Find the mean and the variance of X.

Solution:

The mean of X is

The variance of X is,

3.2 Special functions for continuous probability distribution

There are many special continuous probability distributions such as Uniform distribution, Exponential distribution, Gamma distribution and Normal distribution.

3.2.1 Uniform distribution

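If the random variable X has a uniform distribution on the interval [a, b], then the pdf of X is given by,

$f(x) = \frac{1}{b - a}, \quad a \le x \le b$, and $f(x) = 0$ otherwise.

By using the definition, it can be shown that, if X has a uniform distribution, then the mean of X is E(X) = $\frac{a + b}{2}$ and the variance of X is Var(X) = $\frac{(b - a)^2}{12}$.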

3.2.2 Exponential distribution

If the random variable X has an exponential distribution, then the pdf of X is given by,

$f(x) = \lambda e^{-\lambda x}, \quad x \ge 0$, and $f(x) = 0$ otherwise.

By using the definition, it can be shown that, if X has an exponential distribution, then the mean of X is E(X) = $1/\lambda$ and the variance of X is Var(X) = $1/\lambda^2$.

Example 3.4:

Let X be the number of individuals failing in a large group, and suppose X has an exponential distribution. If we assume that the mean of X under a certain situation is 10, what is the probability that more than 20 will fail at the same time?

Solution:

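The random variable X has pdf,

$f(x) = \lambda e^{-\lambda x}, \quad x \ge 0$

where $\lambda$ is 1/(mean of X). So, $\lambda$ = 1/10 = 0.1

Hence,

$P(X > 20) = \int_{20}^{\infty} 0.1 e^{-0.1x}\,dx = e^{-2} = 0.1353$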
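For an exponential random variable the tail probability has the closed form P(X > x) = exp(-λx), so Example 3.4 can be checked in a few lines. The sketch below is illustrative; the function name `exponential_sf` (for "survival function") is an assumed name, not one used in the notes.

```python
from math import exp


def exponential_sf(x, lam):
    """P(X > x) for an exponential random variable with rate lam."""
    return exp(-lam * x) if x >= 0 else 1.0


mean_of_x = 10
lam = 1 / mean_of_x                    # rate is the reciprocal of the mean

p_more_than_20 = exponential_sf(20, lam)
print(round(p_more_than_20, 4))
```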

3.2.3 Normal distribution

If the random variable X has a Normal distribution, then the pdf of X is given by,

$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x - \mu)^2 / (2\sigma^2)}, \quad -\infty < x < \infty$

By using the definition, it can be shown that, if X is a random variable with a normal distribution, then the mean of X is E(X) = $\mu$ and the variance of X is Var(X) = $\sigma^2$. If X is a random variable with a normal distribution, this is written as,

$X \sim N(\mu, \sigma^2)$

This distribution is called the non-standard normal distribution. Probabilities for X can be found by integrating the pdf, f(x), but this integral is not easy to calculate. By using the transformation $Z = (X - \mu)/\sigma$, the random variable X is changed to the random variable Z with pdf,

$f(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}, \quad -\infty < z < \infty$

Z is a normally distributed random variable with mean 0 and variance 1, written as,

Z ~ N(0,1)

This distribution is called standard normal distribution.

Probabilities for Z can be calculated from the pdf, and the values of the integral are tabulated in the Standard Normal Distribution Table.

Example 3.5:

After completing a study, the civil engineering department in Universiti Teknologi PETRONAS (UTP) concluded that the time UTP employees spend commuting to work each day is normally distributed with a mean equal to 15 minutes and a standard deviation equal to 5 minutes. One employee has indicated that he commutes 25 minutes per day. Find the probability that an employee would commute 25 or more minutes per day,

Solution:

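The random variable X is the time employees spend commuting to work and

$X \sim N(\mu, \sigma^2)$

where $\mu = 15$ and $\sigma^2 = 5^2$

Hence,

$P(X \ge 25) = P\left(Z \ge \frac{25 - 15}{5}\right) = P(Z \ge 2) = 1 - 0.9772 = 0.0228$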
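Where a Standard Normal Distribution Table is not to hand, the normal cdf can be evaluated through the error function. The sketch below is illustrative (the function name `normal_cdf` is assumed); it standardises X to Z exactly as in the transformation above and applies the result to Example 3.5.

```python
from math import erf, sqrt


def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ N(mu, sigma^2), via the standardised value z."""
    z = (x - mu) / sigma
    return 0.5 * (1 + erf(z / sqrt(2)))


mu, sigma = 15, 5                        # commuting time in minutes
p_25_or_more = 1 - normal_cdf(25, mu, sigma)

print(round(p_25_or_more, 4))
```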

Exercise 3

1.Suppose that X is a continuous random variable having the probability density function

(a) Find the value of constant k
(b) Find P(-0.5 < X < 0.5)
(c) Determine x such that P(X > x) = 0.5
(d) Determine the mean and the variance of X.

2. Let X be a continuous random variable with pdf given by

Find
(a) the value of constant k
(b) P(X < 1)
(c) the mean of X
(d) the standard deviation of X.

3.Let X be a continuous random variable with pdf given by

Find
(a) the value of constant k
(b) P(X > 1)
(c) P(0 < X < 2)
(d) the mean of X
(e) the variance of X.

4. Let X be a continuous random variable with pdf given by

Find
(a) the value of constant k
(b) the cdf, F(x)
(c) P(X > 1)
(d) the mean of X
(e) the variance of X.

5. Find the cumulative probability distribution of X given that the density function is

Find
(a) the value of constant k
(b) the cdf, F(x)
(c) P(0.25 < X < 0.5)
(d) the mean of X
(e) the variance of X.

6. Suppose a random variable X has a uniform distribution with a = 5 and b = 9. Find
(a) P(5.5 < X < 8)
(b) P(X < 7)
(c) the mean of X
(d) the standard deviation of X.

7. Let X be an exponential random variable with λ = 0.01. Calculate the following:
(a) P(X < 50)
(b) P(X > 60)
(c) P(50 < X < 60)
(d) the mean and the variance of X.

8. The lifetime of a certain electronic component is known to be exponentially distributed with a mean lifetime of 100 hours. What is the probability that
(a) the lifetime of the component is more than 100 hours?
(b) the lifetime of the component is between 50 and 100 hours?
(c) a component will fail before 50 hours?

9. The time between telephone calls to ASTRO, a cable television payment processing center, follows an exponential distribution with a mean of 1.5 minutes. What is the probability that the time between the next two calls
(a) is at least 45 seconds?
(b) is between 50 and 100 seconds? and
(c) is at most 150 seconds?

10. The mean weight of 500 UTP students is 68 kg and the variance is 72.25 kg². Find the probability that a student weighs
(a) between 65 kg and 72 kg
(b) more than 70 kg

11. An average LCD projector bulb manufactured by the ABC Corporation lasts 300 days, with a variance of 2500 days². Assuming that the bulb life is normally distributed, what is the probability that the bulb will last
(a) at most 365 days?
(b) between 250 days and 350 days?
(c) at least 400 days?

12. The line width of a tool used for semiconductor manufacturing is assumed to be normally distributed with a mean of 0.5 micrometer and a standard deviation of 0.05 micrometer.
(a) What is the probability that a line width is greater than 0.62 micrometer?
(b) What is the probability that a line width is between 0.47 and 0.63 micrometer?
(c) The line width of 90% of samples is below what value?

oooOOOooo

Chapter 4

4. Data display and summary of data



Explain the difference between population and sample
Find the sample mean, sample variance and sample standard deviation
Plot data using a stem and leaf display
Construct the Box-Plot

4.1 Introduction

The major use of inferential statistics is to use information from a sample to infer something about a population. A population is a collection of data whose properties are analyzed. The population is the complete collection to be studied; it contains all subjects of interest. A sample is a part of the population of interest, a sub-collection selected from a population. A parameter is a numerical measurement that describes a characteristic of a population, while a statistic is a numerical measurement that describes a characteristic of a sample. In general, we will use a statistic to infer something about a parameter.

4.2 Mean and variance

The mean is the sum of all the numbers in a list divided by how many numbers are in the list. If the list is a statistical population, the mean is called the population mean; if the list is a statistical sample, the mean is called the sample mean. The sample mean has an expected value of μ, the population mean. The sample mean therefore makes a good estimator of the population mean, as its expected value is the same as the population mean.

Often, since the population variance is an unknown parameter, it is estimated by the mean sum of squares, which changes the distribution of the sample mean from a normal distribution to a Student's t distribution with n − 1 degrees of freedom.

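The population mean and variance, and the sample mean and sample variance, are defined below. Comparing the equations shows the difference between them.

Population mean and variance are defined as:

$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$

where N is the size of the population.

Sample mean and sample variance are defined as:

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s^2 = \frac{1}{n - 1}\sum_{i=1}^{n} (x_i - \bar{x})^2$

where n is the sample size.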

Example 4.1:

Given the sample data as 55, 68, 90, 42, 89, 70. Find the sample mean and the sample variance of this data.

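Solution:

$\bar{x} = \frac{55 + 68 + 90 + 42 + 89 + 70}{6} = \frac{414}{6} = 69$

$s^2 = \frac{(-14)^2 + (-1)^2 + 21^2 + (-27)^2 + 20^2 + 1^2}{6 - 1} = \frac{1768}{5} = 353.6$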
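The sample statistics for Example 4.1 can be checked with Python's standard `statistics` module. In this sketch, `statistics.variance` uses the n − 1 divisor (the sample variance) and `statistics.pvariance` uses the N divisor (the population variance); the variable names are illustrative.

```python
import statistics

data = [55, 68, 90, 42, 89, 70]

sample_mean = statistics.mean(data)
sample_var = statistics.variance(data)   # divides by n - 1
sample_sd = statistics.stdev(data)       # square root of the sample variance
pop_var = statistics.pvariance(data)     # divides by N, for comparison

print(sample_mean, sample_var, round(sample_sd, 2), round(pop_var, 2))
```

The sample variance is larger than the population-style variance of the same numbers because of the smaller divisor.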

4.3 A Stem and Leaf plot

Data can be shown in a variety of ways including graphs, charts and tables. A Stem and Leaf plot is a type of graph that is similar to a histogram but shows more information. The Stem-and-Leaf plot summarizes the shape of a set of data (the distribution) and provides extra detail regarding individual values.

The data is arranged by place value. The digits in the largest place are referred to as the stem and the digits in the smallest place are referred to as the leaf (leaves). The leaves are always displayed to the right of the stem. Stem and Leaf plots are great organizers for large amounts of information. They provide an 'at a glance' tool for specific information in large sets of data; otherwise one would have a long list of marks to sift through and analyze. The total number of data values, the median and the mode can also be determined from a Stem and Leaf plot. They are usually used when there are large amounts of numbers or data to analyze. Series of scores on sports teams, series of temperatures or rainfall over a period of time, and series of classroom test scores are examples of when Stem and Leaf plots could be used.

Example 4.2:

The following data is the temperatures for August in Malaysia.

77 80 82 68 65 59 61
55 50 62 61 70 69 64
65 70 62 65 65 75 76
85 80 82 83 79 79 71
80 77 89

Use the Stem and Leaf plot to determine the mode and the median for the temperatures.

Solution:

First step should be to place the numbers in order from smallest to the largest.
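The ordering and grouping step can be sketched in Python. This is an illustrative helper, not part of the original notes: `stem_and_leaf` is an assumed name, and it handles only two-digit values such as the temperatures in Example 4.2, using the tens digit as the stem and the ones digit as the leaf.

```python
from collections import defaultdict
from statistics import median, mode

temperatures = [
    77, 80, 82, 68, 65, 59, 61,
    55, 50, 62, 61, 70, 69, 64,
    65, 70, 62, 65, 65, 75, 76,
    85, 80, 82, 83, 79, 79, 71,
    80, 77, 89,
]


def stem_and_leaf(values):
    """Map each tens digit (stem) to its sorted list of ones digits (leaves)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return dict(plot)


plot = stem_and_leaf(temperatures)
for stem, leaves in plot.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))

temp_median = median(temperatures)
temp_mode = mode(temperatures)
print("median:", temp_median, "mode:", temp_mode)
```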

Temperatures

Tens   Ones
5      0 5 9
6      1 1 2 2 4 5 5 5 5 8 9
7      0 0 1 5 6 7 7 9 9
8      0 0 0 2 2 3 5 9

The mode is 65 and the median is 70.

4.4 A Box plot

In descriptive statistics, a box plot (also known as a box-and-whisker diagram) is a visual summary of many important aspects of a data distribution through its five-number summary: the smallest observation (sample minimum), lower quartile (Q1), median (Q2), upper quartile (Q3), and largest observation (sample maximum). A box plot may also indicate which observations, if any, might be considered outliers. A box plot can be drawn either horizontally or vertically.

4.4.1 Construct a box plot

Step 1: Place the numbers in order from smallest to the largest.

Step 2: Find the median, Q2, the lower quartile, Q1, and the upper quartile, Q3, of the given set of data.

Step 3: Find the interquartile range (IQR). The IQR is the difference between the upper quartile and the lower quartile.

Step 4: Start to draw the Box-plot either horizontally or vertically.

Step 5: Calculate 1.5IQR and mark the range from Q1 − 1.5IQR to Q3 + 1.5IQR. Values that fall outside this range are called outliers. Values that fall outside the range from Q1 − 3IQR to Q3 + 3IQR are called extreme outliers.

Example 4.3

Suppose that thirty UTP students live in Village 2. Their ages are as follows:

18, 20, 21, 26, 24, 19, 25, 20, 22, 21,

19, 24, 25, 28, 24, 20, 26, 20, 35, 17,

18, 24, 20, 21, 22, 27, 25, 28, 27, 24.

Step 1: Place the numbers in order from smallest to the largest.

17, 18, 18, 19, 19, 20, 20, 20, 20, 20,

21, 21, 21, 22, 22, 24, 24, 24, 24, 24,

25, 25, 25, 25, 26, 26, 27, 27, 28, 35.

Step 2: Find the median, Q2, the lower quartile, Q1, and the upper quartile, Q3, of the given set of data.

The median, Q2 = (X15 + X16)/2 = (22 + 24)/2 = 23

The position of Q1 = (0.25)(n + 1) = 0.25(31) = 7.75

So the lower quartile, Q1 = X7 + 0.75(X8 − X7) = 20 + 0.75(20 − 20) = 20

The position of Q3 = (0.75)(n + 1) = 0.75(31) = 23.25

So the upper quartile, Q3 = X23 + 0.25(X24 − X23) = 25 + 0.25(25 − 25) = 25

Step 3: The interquartile range (IQR) = Q3 − Q1 = 25 − 20 = 5

The 1.5IQR = 7.5 and 3IQR = 15

Step 4: Start to draw the Box-plot either horizontally or vertically.

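[Box plot: the box runs from Q1 = 20 to Q3 = 25 (IQR = 5) with the median Q2 = 23 marked inside it; the whiskers extend to 17 and 28; the 1.5IQR limits are at 12.5 and 32.5; the value 35 lies beyond the upper limit and is marked as an outlier.]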
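The five steps can be sketched in Python. This is an illustrative sketch, not the authors' code: `quantile_n_plus_1` is an assumed helper name that implements the (n + 1) position rule used in Step 2, and the input is the ordered list exactly as printed in Step 1 of Example 4.3.

```python
# Ordered ages as printed in Step 1 of Example 4.3.
ages_sorted = [
    17, 18, 18, 19, 19, 20, 20, 20, 20, 20,
    21, 21, 21, 22, 22, 24, 24, 24, 24, 24,
    25, 25, 25, 25, 26, 26, 27, 27, 28, 35,
]


def quantile_n_plus_1(sorted_values, q):
    """Quantile at 1-based position q * (n + 1), interpolating linearly."""
    n = len(sorted_values)
    pos = q * (n + 1)
    k = int(pos)          # index of the lower neighbour (1-based)
    frac = pos - k        # fractional distance towards the upper neighbour
    if k < 1:
        return sorted_values[0]
    if k >= n:
        return sorted_values[-1]
    lower, upper = sorted_values[k - 1], sorted_values[k]
    return lower + frac * (upper - lower)


q1 = quantile_n_plus_1(ages_sorted, 0.25)
q2 = quantile_n_plus_1(ages_sorted, 0.50)
q3 = quantile_n_plus_1(ages_sorted, 0.75)
iqr = q3 - q1

# Limits at 1.5 * IQR beyond the quartiles; values outside are outliers.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [x for x in ages_sorted if x < lower_limit or x > upper_limit]

# Whiskers end at the most extreme values that are not outliers.
inside = [x for x in ages_sorted if lower_limit <= x <= upper_limit]
whiskers = (min(inside), max(inside))

print(q1, q2, q3, iqr, lower_limit, upper_limit, outliers, whiskers)
```

Other software may use a different quartile convention and give slightly different quartiles for the same data.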

Exercise 4:

1. Find the mean, median and mode for the following observations:

6.5 7.8 4.6 3.7 6.5 9.2 12.1 6.5 3.7 10.8

2. Find the mean, median and mode for the following observations:

2.3 3.6 2.6 2.8 3.2 3.6 4.3 5.2 6.9 2.8 3.6

3. Seven oxide thickness measurements of wafers are studied to assess quality in a semiconductor manufacturing process. The data (in angstroms) are: 1264, 1280, 1301, 1300, 1292, 1307, and 1275. Calculate the sample average, variance and standard deviation.

4. The following data are direct solar intensity measurements (watts/m2) on different days at a location in southern Spain: 562, 869, 708, 775, 775, 704, 809, 856, 655, 806, 878, 909, 918, 558, 768, 870, 918, 940, 946, 661, 820, 898, 935, 952, 957, 693, 835, 905, 939, 955, 960, 498, 653, 730, 753. Calculate the sample mean, variance and sample standard deviation.

5. Find the mean, variance and standard deviation of the following samples of marks for the probability and statistics final examination.

84.9 81.9 80.8 79.4 78.2 76.5
75.0 73.8 72.7 72.6 71.4 70.9
69.3 68.6 67.5 66.8 65.2 64.4
59.5 58.3 58.5 57.6 56.9 55.2
48.2 48.0 47.8 46.5 45.9 44.6
38.3 37.4 36.8 36.5 35.6 34.9
38.4

6. Find the mean, variance and standard deviation of the following samples of marks for the engineering drawing course.

98.4 98.1 98.0 97.8 96.4 95.2 94.3 92.6 91.8 90.5
89.6 88.7 87.3 86.8 85.7 84.2 83.7 82.8 80.5 80.8
79.7 78.2 77.4 77.4 76.8 75.9 74.2 73.9 72.6 71.4
69.8 68.6 67.5 66.8 65.2 64.4 63.7 62.8 61.4 60.7
59.2 58.3 58.5 57.6 56.9 55.2 54.7 53.9 52.9 51.2
59.6 48.0 47.8 46.5 45.9 44.6 43.8 42.7 41.8 40.6
39.8 37.4 36.8 36.5 35.6 34.9 33.2 33.8 32.7 31.6

7.The shear strengths of 100 spot welds in a titanium alloy follow. Construct a stem-and-leaf diagram for the weld strength data and comment on any important features that you notice.

5408 5431 5475 5442 5376 5388 5459 5422 5416 5435

5420 5429 5401 5446 5487 5416 5382 5357 5388 5457

5407 5469 5416 5377 5454 5375 5409 5459 5445 5429

5463 5408 5481 5453 5422 5354 5421 5406 5444 5466

5399 5391 5477 5447 5329 5473 5423 5441 5412 5384

5445 5436 5454 5453 5428 5418 5465 5427 5421 5396

5381 5425 5388 5388 5378 5481 5387 5440 5482 5406

5401 5411 5399 5431 5440 5413 5406 5342 5452 5420

5458 5485 5431 5416 5431 5390 5399 5435 5387 5462

5383 5401 5407 5385 5440 5422 5448 5366 5430 5418

(a) Construct a stem-and-leaf display for these data.
(b) Find the median, the quartiles, and the 5th and 95th percentiles.

8. The data that follow represent the yield on 90 consecutive batches of ceramic substrate to which a metal coating has been applied by a vapor-deposition process.

94.1 87.3 94.1 92.4 84.6 85.4

93.2 84.1 92.1 90.6 83.6 86.6

90.6 90.1 96.4 89.1 85.4 91.7

91.4 95.2 88.2 88.8 89.7 87.5

88.2 86.1 86.4 86.4 87.6 84.2

86.1 94.3 85.0 85.1 85.1 85.1

95.1 93.2 84.9 84.0 89.6 90.5

90.0 86.7 78.3 93.7 90.0 95.6

92.4 83.0 89.6 87.7 90.1 88.3

87.3 95.3 90.3 90.6 94.3 84.1

86.6 94.1 93.1 89.4 97.3 83.7

91.2 97.8 94.6 88.6 96.8 82.9

86.1 93.1 96.3 84.1 94.4 87.3

90.4 86.4 94.7 82.6 96.1 86.4

89.1 87.6 91.1 83.1 98.0 84.5

(a) Construct a cumulative frequency plot and histogram for the yield.
(b) Construct a stem-and-leaf display for these data.
(c) Find the median, the quartiles, and the 5th and 95th percentiles for the yield.

9. The average ages of the football players on each team of the premier league are as follows.

29.4 29.8 29.4 31.8 32.7 34.0

28.5 27.9 30.9 29.3 28.8 28.6

29.1 31.0 30.7 30.3 29.7 31.0

28.4 28.9 27.7 28.7 30.5 29.8

26.6 27.9 27.9 29.9 29.3 28.1

(a) Construct a cumulative frequency plot and histogram for the average age.
(b) Construct a stem-and-leaf display for these data.
(c) Find the median, the quartiles, and the 5th and 95th percentiles for the average age.

10. The following "cold start ignition times" of an automobile engine were obtained for a test vehicle:

1.75 1.92 2.62 2.35 3.09 3.15 2.53 1.91

(a) Calculate the sample median, the quartiles and the IQR.
(b) Construct a box plot of the data.

11. The following data are the joint temperatures of the O-rings (°F) for each test firing or actual launch of the space shuttle rocket motor (from Presidential Commission on the Space Shuttle Challenger Accident, Vol. 1, pp. 129–131): 84, 49, 61, 40, 83, 67, 45, 66, 70, 69, 80, 58, 68, 60, 67, 72, 73, 70, 57, 63, 70, 78, 52, 67, 53, 67, 75, 61, 70, 81, 76, 79, 75, 76, 58, 31.

(a) Compute the sample mean and sample standard deviation;
(b) Calculate the median, the quartiles and the IQR;
(c) Construct a box plot of the data and comment on the possible presence of outliers.

12. Ipoh Pantai Hospital compiles data on the length of stay by patients in short-term hospitals. A random sample of 28 patients yielded the following data on length of stay, in days.

3 6 15 7 3 55 14 4 12 18 9 6 125 10 13 7 1 23 96 8 11 9 4 21 10

(a) Compute the sample mean and sample standard deviation;
(b) Calculate the median, the quartiles and the IQR;
(c) Construct a box plot of the data and comment on the possible presence of outliers.

oooOOOooo

Chapter 5

5. Random sample, Central Limit Theorem and Normal Approximation; Statistical process control



Define the random sample and sample mean
Use the Central Limit Theorem to define the sample mean distribution
Define the Normal Approximation to the Binomial and Poisson distributions
Construct the X-bar chart and R chart in statistical process control

5.1 Random sample and sample mean

In statistical terms, a random sample is a set of independent random variables X1, X2, …, Xn drawn from a population in such a way that each random variable has the same distribution and each member of the population has the same chance of being selected.

The sample mean is the average of the sample. If a sample has n observations, the sample mean is the sum of the observations divided by the sample size, n:

$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$

5.2 Central Limit Theorem and sample mean distribution

The Central Limit Theorem says that if X1, X2, …, Xn is a random sample of independent random variables from a population with mean $\mu$ and variance $\sigma^2$, and the sample size n is large, then the sample mean $\bar{X}$ is approximately normally distributed with mean $\mu$ and variance $\sigma^2/n$. That is,

$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$

Example 5.1:

At the chemical engineering department, Universiti Teknologi PETRONAS, the mean age of the students is 20.6 years, and the variance is 20 years². A random sample of 80 students is drawn from 250 students. What is the probability that the average age of these students is greater than 22 years?

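Solution:

By the Central Limit Theorem, $\bar{X} \sim N(20.6, 20/80) = N(20.6, 0.25)$. Hence,

$P(\bar{X} > 22) = P\left(Z > \frac{22 - 20.6}{\sqrt{0.25}}\right) = P(Z > 2.8) = 1 - 0.9974 = 0.0026$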
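Example 5.1 can be evaluated numerically with `statistics.NormalDist` from the Python standard library. This is a minimal sketch assuming the Central Limit Theorem applies as stated in Section 5.2 and ignoring any correction for sampling from a finite population of 250; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu = 20.6        # population mean age
variance = 20    # population variance
n = 80           # sample size

# Sampling distribution of the mean: N(mu, variance / n).
xbar_dist = NormalDist(mu, sqrt(variance / n))

p_mean_above_22 = 1 - xbar_dist.cdf(22)
print(round(p_mean_above_22, 4))
```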

5.3 Normal Approximation

The binomial and Poisson distributions are discrete, whereas the normal distribution is continuous. We need to take this into account when we use the normal distribution to approximate a binomial or Poisson distribution, by applying a continuity correction.

The continuity correction for a probability of X depends on the inequality sign (<, ≤, > or ≥). For example, P(X < a) is approximated by the normal probability P(X < a − 0.5), and P(X ≤ a) by the normal probability P(X < a + 0.5).

5.3.1 Normal approximation to Binomial

The Central Limit Theorem says that as n increases, the binomial distribution with n trials and probability p of success gets closer and closer to a normal distribution. That is, the binomial probability of any event gets closer and closer to the normal probability of the same event.

The normal distribution is a good approximation to the Binomial when n is sufficiently large and p is not too close to 0 or 1. How large n needs to be depends on the value of p. It is better to be conservative and limit the use of the normal distribution as an approximation to the binomial to cases where np > 5 and n(1 − p) > 5.

That is, if we have a random variable X ~ Bin(n, p) with n large enough that np > 5 and n(1 − p) > 5, then probabilities for X can be calculated approximately using the Normal distribution. This means that the random variable X is approximately normally distributed with mean μ = np and variance σ² = np(1 − p), i.e. X ~ N(np, np(1 − p)).

Example 5.2:

Suppose a fair coin is tossed 50 times. What is the probability of getting between 9 and 11 heads?

Solution:

Let X be the random variable representing the number of heads thrown.

X ~ Bin (50, 0.5)

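Since n is large and np > 5, we can use the normal approximation to find the probability. This means that X is approximately normally distributed with mean np = 25 and variance np(1 − p) = 12.5, i.e. X ~ N(25, 12.5). Hence, with the continuity correction,

$P(9 \le X \le 11) \approx P(8.5 < X < 11.5) = P\left(\frac{8.5 - 25}{\sqrt{12.5}} < Z < \frac{11.5 - 25}{\sqrt{12.5}}\right) = P(-4.67 < Z < -3.82) \approx 0.00007$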
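The approximation in Example 5.2 can be compared with the exact binomial probability. The sketch below assumes that "between 9 and 11 heads" includes both 9 and 11, and applies the continuity correction by widening the interval by 0.5 on each side; the variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 50, 0.5
low, high = 9, 11          # assumed inclusive: 9 <= X <= 11

# Normal approximation N(np, np(1 - p)) with continuity correction.
approx = NormalDist(n * p, sqrt(n * p * (1 - p)))
p_normal = approx.cdf(high + 0.5) - approx.cdf(low - 0.5)

# Exact binomial probability for comparison.
p_exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(low, high + 1))

print(p_normal, p_exact)
```

Both probabilities are very small because 9 to 11 heads lies far below the mean of 25. The approximation is least accurate this far out in the tail of the distribution.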

5.3.2 Normal approximation to Poisson


The normal distribution can also be used to approximate the Poisson distribution for large values of λ (the mean of the Poisson distribution).

That is, if X ~ Poisson(λ) and λ is large, then probabilities for X can be calculated approximately using the normal distribution with mean μ = λ and variance σ² = λ, i.e. X ~ N(λ, λ) approximately.

Example 5.3:

A car hire firm has 20 cars to hire. The number of demands for a car per day follows a Poisson distribution with a mean of 3. Calculate the probability that at most ten cars will be hired in one day.

Solution:

Let the random variable X denote the number of demands for a car in one day.

The given mean is λ = 3, so by the Poisson distribution X ~ Poisson(3).

Taking λ as large, the probability can be calculated using a normal approximation with mean λ = 3 and variance also λ = 3, i.e. X ~ N(3, 3).

Hence, with the continuity correction,

P(X ≤ 10) ≈ P(X < 10.5) = P(Z < (10.5 − 3)/√3) = P(Z < 4.33) ≈ 1.
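The calculation in Example 5.3 can be sketched numerically as follows. This is only an illustration under the example's stated assumptions (X ~ Poisson(3), approximated by N(3, 3) with a continuity correction); the exact Poisson sum is included for comparison.

```python
from math import exp, factorial, sqrt
from statistics import NormalDist

lam = 3.0   # Poisson mean from Example 5.3

# Exact Poisson probability P(X <= 10)
exact = sum(exp(-lam) * lam**k / factorial(k) for k in range(11))

# Normal approximation N(lam, lam) with continuity correction: P(Y < 10.5)
approx = NormalDist(lam, sqrt(lam)).cdf(10.5)

print(f"exact Poisson:        {exact:.6f}")
print(f"normal approximation: {approx:.6f}")
```

Both values are essentially 1, since ten demands is far above the mean of three per day.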

5.4 Statistical process control

Statistical process control (SPC) is a powerful set of tools that implements the concept of prevention, as a shift from the traditional approach of quality by inspection and correction. SPC employs simple statistical tools to control, monitor, and improve processes, and it is an important ingredient in continuous process improvement (CPI) strategies.

Among the most commonly used tools of SPC are:

histograms

cause-and-effect diagrams

Pareto diagrams

control charts

scatter or correlation diagrams

run charts

process flow diagrams

The most important SPC tool is the control chart: a graphical representation of process performance over time that shows how (or whether) a process varies at different intervals and helps identify nonrandom, or assignable, causes of variation. Control charts also provide a powerful analytical tool for monitoring process variability and changes in the process mean. Two charts are commonly used in SPC: the X-bar chart and the R-chart.

The X-bar and range (R) charts are a set of control charts for variables data (data that are both quantitative and continuous in measurement, such as a measured dimension or time). The X-bar chart monitors the process location over time, based on the average of a series of observations called a subgroup, while the R-chart monitors the variation between observations within the subgroup over time.

The X-bar chart and R-chart are used when measurements can rationally be collected in groups (subgroups) of between two and ten observations. The x-axes of the charts are time based, so the charts show a history of the process. The data are time-ordered; that is, entered in the sequence in which they were generated.

5.4.1 How to construct the X-bar chart and R-chart

To construct the charts, the sample mean of each subgroup, the average of the subgroup means, and the control limits must be calculated.

The sample mean of a subgroup is calculated from its n data values as x̄ = (x1 + x2 + … + xn)/n.

The average of the subgroup means is calculated as

x̿ = (x̄1 + x̄2 + … + x̄m)/m,

where n is the subgroup size and m is the total number of subgroups included in the analysis.

This is the centre line of the X-bar chart and is called the estimated process mean.

The average range is calculated as r̄ = (r1 + r2 + … + rm)/m, where ri is the range between the largest and the smallest value in subgroup i. This is the centre line of the R-chart.

The upper and lower control limits for the X-bar chart are calculated using the formulas

UCL = x̿ + A2 r̄ and LCL = x̿ − A2 r̄,

where A2 can be found from the process control chart table.

The upper limit for the R-chart is calculated using the formula UCL = D4 r̄ and the lower limit using the formula LCL = D3 r̄, where D4 and D3 can be found from the process control chart table.

After the centre lines and limits are calculated, the charts can be constructed by plotting x̄ against the sample number for the X-bar chart and r against the sample number for the R-chart.

Example 5.4:

A component part for a jet aircraft engine is manufactured by an investment casting process. The vane opening on this casting is an important functional parameter of the part. We will illustrate the use of X-bar and R control charts to assess the statistical stability of this process. The table presents 20 samples of five parts each. The values given in the table have been coded by using the last three digits of the dimension; that is, 31.6 should be 0.50316 inch.

Sample Number   x1   x2   x3   x4   x5   x̄      r
 1              33   29   31   32   33   31.6    4
 2              33   31   35   37   31   33.4    6
 3              35   37   33   34   36   35.0    4
 4              30   31   33   34   33   32.2    4
 5              33   34   35   33   34   33.8    2
 6              38   37   39   40   38   38.4    3
 7              30   31   32   34   31   31.6    4
 8              29   39   38   39   39   36.8   10
 9              28   33   35   36   43   35.0   15
10              38   33   32   35   32   34.0    6
11              28   30   28   32   31   29.8    4
12              31   35   35   35   34   34.0    4
13              27   32   34   35   37   33.0   10
14              33   33   35   37   36   34.8    4
15              35   37   32   35   39   35.6    7
16              33   33   27   31   30   30.8    6
17              35   34   34   30   32   33.0    5
18              32   33   30   30   33   31.6    3
19              25   27   34   27   28   28.2    9
20              35   35   36   33   30   33.8    6

(a) Construct X-bar and R control charts.

(b) After the process is in control, estimate the process mean and standard deviation.
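The trial limits for part (a) of Example 5.4 could be computed along the following lines. This is a minimal sketch, not the authors' worked solution: the constants A2 = 0.577, D3 = 0 and D4 = 2.114 are assumed to be the usual control chart table values for subgroups of size 5 (some tables print D4 as 2.115), and the variable names are illustrative.

```python
# Coded vane-opening data from Example 5.4: 20 subgroups of 5 parts each.
samples = [
    [33, 29, 31, 32, 33], [33, 31, 35, 37, 31], [35, 37, 33, 34, 36],
    [30, 31, 33, 34, 33], [33, 34, 35, 33, 34], [38, 37, 39, 40, 38],
    [30, 31, 32, 34, 31], [29, 39, 38, 39, 39], [28, 33, 35, 36, 43],
    [38, 33, 32, 35, 32], [28, 30, 28, 32, 31], [31, 35, 35, 35, 34],
    [27, 32, 34, 35, 37], [33, 33, 35, 37, 36], [35, 37, 32, 35, 39],
    [33, 33, 27, 31, 30], [35, 34, 34, 30, 32], [32, 33, 30, 30, 33],
    [25, 27, 34, 27, 28], [35, 35, 36, 33, 30],
]

# Control chart constants for subgroup size n = 5 (assumed table values).
A2, D3, D4 = 0.577, 0.0, 2.114

xbars = [sum(s) / len(s) for s in samples]     # subgroup means
ranges = [max(s) - min(s) for s in samples]    # subgroup ranges
grand_mean = sum(xbars) / len(xbars)           # centre line of the X-bar chart
r_bar = sum(ranges) / len(ranges)              # centre line of the R-chart

xbar_ucl = grand_mean + A2 * r_bar
xbar_lcl = grand_mean - A2 * r_bar
r_ucl = D4 * r_bar
r_lcl = D3 * r_bar

# Sample numbers (1-based) that plot outside the trial limits.
out_xbar = [i + 1 for i, x in enumerate(xbars) if not xbar_lcl <= x <= xbar_ucl]
out_r = [i + 1 for i, r in enumerate(ranges) if not r_lcl <= r <= r_ucl]

print(f"X-bar chart: CL={grand_mean:.2f}, LCL={xbar_lcl:.2f}, UCL={xbar_ucl:.2f}")
print(f"R chart:     CL={r_bar:.2f}, LCL={r_lcl:.2f}, UCL={r_ucl:.2f}")
print("outside X-bar limits:", out_xbar)
print("outside R limits:", out_r)
```

Plotting the values in xbars and ranges against the sample number, with these centre lines and limits drawn in, gives the two charts.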

Exercise 5

1. Suppose X1, X2, …, X20 is a sample from a normal distribution N(μ, σ²) with μ = 5, σ² = 4. Find (a) the expectation and variance of X̄; (b) the distribution of X̄.

2. Given that X is normally distributed with mean 50 and standard deviation 4, compute the following for n = 25. (a) Mean and variance of X̄ (b) (c) (d)

3. Given that X is normally distributed with mean 20 and standard deviation 2, compute the following for n = 40. (a) Mean and variance of X̄ (b) (c) (d)

4. Let X denote the number of flaws in a 1-inch length of copper wire. The pmf of X is given in the following table.

X = x      0      1      2      3
P(X = x)   0.48   0.39   0.12   0.01

100 wires are sampled from this population. What is the probability that the average number of flaws per wire in this sample is less than 0.5?

5. At a large university, the mean age of the students is 22.3 years, and the standard deviation is 4 years. A random sample of 64 students is drawn. What is the probability that the average age of these students is greater than 23 years?

6. Assuming an equal chance of a new baby being a boy or a girl, what is the probability that 60 or more out of the next 100 births at Pantai Hospital will be girls?

7. If 10% of UTP students are international students, what is the probability that fewer than 100 in a random sample of 818 students are coming from overseas?

8. Suppose that a sample of n = 1,600 tires of the same type are obtained at random from an ongoing production process in which 8% of all such tires produced are defective. What is the probability that in such a sample 150 or fewer tires will be defective?

9. For overseas flights, an airline has three different choices on its dessert menu: ice cream, apple pie, and chocolate cake. Based on past experience the airline feels that each dessert is equally likely to be chosen. (a) If a random sample of four passengers is selected, what is the probability that at least two will choose ice cream for dessert? (b) If a random sample of 21 passengers is selected, what is the approximate probability that at least two will choose ice cream for dessert?

10. Suppose that at a certain automobile plant, the number of work stoppages per day due to equipment problems during the production process follows a Poisson distribution with an average of 12.0. What is the approximate probability of having 15 or fewer work stoppages due to equipment problems on any given day?

11. The number of cars arriving per minute at a toll booth on a particular bridge is Poisson distributed with a mean of 2.5. What is the probability that in any given minute (a) no cars arrive? (b) not more than two cars arrive?

If the expected number of cars arriving at the toll booth per ten-minute interval is 25.0, what is the approximate probability that in any given ten-minute period (c) not more than 20 cars arrive? (d) between 20 and 30 cars arrive?


12. A component part for a jet aircraft engine is manufactured by an investment casting process. The vane opening on this casting is an important functional parameter of the part. We will illustrate the use of X-bar and R control charts to assess the statistical stability of this process. The table presents 20 samples of five parts each. The values given in the table have been coded by using the last three digits of the dimension; that is, 31.6 should be 0.50316 inch.

Sample Number   x1   x2   x3   x4   x5   x̄      r
 1              33   29   31   32   33   31.6    4
 2              33   31   35   37   31   33.4    6
 3              35   37   33   34   36   35.0    4
 4              30   31   33   34   33   32.2    4
 5              33   34   35   33   34   33.8    2
 6              38   37   39   40   38   38.4    3
 7              30   31   32   34   31   31.6    4
 8              29   39   38   39   39   36.8   10
 9              28   33   35   36   43   35.0   15
10              38   33   32   35   32   34.0    6
11              28   30   28   32   31   29.8    4
12              31   35   35   35   34   34.0    4
13              27   32   34   35   37   33.0   10
14              33   33   35   37   36   34.8    4
15              35   37   32   35   39   35.6    7
16              33   33   27   31   30   30.8    6
17              35   34   34   30   32   33.0    5
18              32   33   30   30   33   31.6    3
19              25   27   34   27   28   28.2    9
20              35   35   36   33   30   33.8    6

(a) Construct X-bar and R control charts.

(b) After the process is in control, estimate the process mean and standard deviation.


13. The overall length of a skew used in a knee replacement device is monitored using X-bar and R charts. The following table gives the length for 20 samples of size 4. (Measurements are coded from 2.00 mm; that is, 15 is 2.15 mm.)

Observation Observation

Sample 1 2 3 4 Sample 1 2 3 4

1 16 18 15 13 11 14 14 15 13

2 16 15 17 16 12 15 13 15 16

3 15 16 20 16 13 13 17 16 15

4 14 16 14 12 14 11 14 14 21

5 14 15 13 16 15 14 15 14 13

6 16 14 16 15 16 18 15 16 14

7 16 16 14 15 17 14 16 19 16

8 17 13 17 16 18 16 14 13 19

9 15 11 13 16 19 17 19 17 13

10 15 18 14 13 20 12 15 12 17

(a) Using all the data, find trial control limits for X-bar and R charts, construct the charts, and plot the data.

(b) Use the trial control limits from part (a) to identify out-of-control points. If necessary, revise your control limits, assuming that any samples that plot outside the control limits can be eliminated.

(c) Assuming that the process is in control, estimate the process mean and process standard deviation.

14. The thickness of a printed circuit board (PCB) is an important quality parameter. Data on board thickness (in cm) are given below for 25 samples of three boards each.

Sample 1 2 3 Sample 1 2 3

1 0.0629 0.0636 0.0640 14 0.0645 0.0640 0.0631

2 0.0630 0.0631 0.0622 15 0.0619 0.0644 0.0632

3 0.0628 0.0631 0.0633 16 0.0631 0.0627 0.0630

4 0.0634 0.0630 0.0631 17 0.0616 0.0623 0.0631

5 0.0619 0.0628 0.0630 18 0.0630 0.0630 0.0626

6 0.0613 0.0629 0.0634 19 0.0636 0.0631 0.0629

7 0.0630 0.0639 0.0625 20 0.0640 0.0635 0.0629

8 0.0628 0.0627 0.0622 21 0.0628 0.0625 0.0616

9 0.0623 0.0626 0.0633 22 0.0615 0.0625 0.0619

10 0.0631 0.0631 0.0633 23 0.0630 0.0632 0.0630


11 0.0635 0.0630 0.0638 24 0.0635 0.0629 0.0635

12 0.0623 0.0630 0.0630 25 0.0623 0.0629 0.0630

13 0.0635 0.0631 0.0630

(a) Using all the data, find trial control limits for X-bar and R charts, construct the charts, and plot the data.

(b) Use the trial control limits from part (a) to identify out-of-control points. If necessary, revise your control limits, assuming that any samples that plot outside the control limits can be eliminated.

(c) Assuming that the process is in control, estimate the process mean and process standard deviation.

oooOOOooo

Chapter 6

6. Hypothesis Testing – One population



Explain the concept of hypothesis testing

Understand the procedure or steps to perform the test

Test the mean when the population variance is known and when it is unknown

Test the proportion

Test the variance

6.1 Introduction

There are two types of statistical inference: estimation of population parameters and hypothesis testing. Hypothesis testing is one of the most important tools for applying statistics to real-life problems. Most often, decisions about populations must be made on the basis of sample information, and statistical tests are used in arriving at these decisions.

Statistical hypothesis testing is based on the concept of proof by contradiction. For example, suppose we test the mean (μ) of a population to see whether an experiment has caused an increase or decrease in μ. We do this by formulating a null hypothesis and testing it against an alternative hypothesis.

6.1.1 Null Hypothesis:

It is a hypothesis which states that there is no difference between the procedures and is denoted by H0. For the above example, the corresponding H0 would be that there has been no increase or decrease in the mean. It is always the null hypothesis that is tested; that is, we want to either accept or reject the null hypothesis, because we have information only for the null hypothesis.

6.1.2 Alternative Hypothesis:

It is a hypothesis which states that there is a difference between the procedures and is denoted by H1.

In hypothesis testing, the decision made about the null hypothesis is either correct or false, as summarized in this table.

Suppose          Accept H0 as true                        Reject H0 as false
H0 is true       Correct decision. Probability: 1 − α     Type I error. Probability: α
H0 is false      Type II error. Probability: β            Correct decision. Probability: 1 − β

We make a correct decision if we accept H0 when H0 is true, or if we reject H0 when H0 is actually false.

The risk of rejecting the null hypothesis when we should not reject it is called a type I error, with probability α. That is, we make a false decision because we reject H0 when H0 is actually true.

When we accept H0 but H0 is actually not true, we make a wrong decision called a type II error. The probability of a type II error is β. We cannot determine β (beta) with the statistical tools you learn in this course.

The probability of a type I error, α, is called the level of significance, (1 − α)100% is called the confidence level of the test, and (1 − β) is called the "power" of the test.
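The meaning of α as a long-run error rate can be illustrated by simulation. The sketch below is an assumption-laden illustration rather than part of the course method: it repeatedly draws samples from a normal population for which H0 is true, applies a two-sided z-test of the kind described in Section 6.2, and records how often H0 is wrongly rejected. The function name, default parameter values and seed are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def type_i_error_rate(mu0=21.0, sigma=sqrt(20.0), n=20,
                      alpha=0.05, trials=20_000, seed=1):
    """Estimate how often a two-sided z-test rejects H0 when H0 is true."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # critical value z_(alpha/2)
    rejections = 0
    for _ in range(trials):
        # Sample drawn from a population where H0 (mu = mu0) really holds.
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z0 = (fmean(sample) - mu0) / (sigma / sqrt(n))
        if abs(z0) > z_crit:
            rejections += 1                        # a type I error
    return rejections / trials

rate = type_i_error_rate()
print(f"estimated type I error rate: {rate:.3f}")
```

The estimated rate should be close to the chosen α = 0.05, which is what the level of significance means.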

6.1.3 Types of test

In hypothesis testing there are three types of test on any parameter of interest: the two-tailed (two-sided) test, the upper-tailed test and the lower-tailed test, as in the table below. Here θ stands for the parameter of interest and θ0 for its hypothesized value.

Type                   Null Hypothesis, H0   Alternative Hypothesis, H1
1 Two tailed test      θ = θ0                θ ≠ θ0
2 Upper tailed test    θ = θ0                θ > θ0
3 Lower tailed test    θ = θ0                θ < θ0

6.1.4 Test statistics

The test statistic is the random variable whose value is computed from the sample and used to arrive at a decision. In hypothesis testing the right choice of test statistic is essential. Among the test statistics used in hypothesis testing are the Z-statistic, the T-statistic, the χ²-statistic and the F-statistic. The choice of test statistic depends on the parameter of interest that we want to test. The Central Limit Theorem states that for large sample sizes (n > 30) drawn randomly from a population, the distribution of the sample means is approximately normal, even when the data in the parent population are not normally distributed. A Z-statistic is therefore usually used when the population variance is known or the sample size is large (n > 30). However, large samples are often not easy to obtain; when the sample is small and the population variance is unknown, the t-distribution can be used, with the population standard deviation estimated by the sample standard deviation, s. To test the population variance or standard deviation, the χ²-statistic is used. When performing multiple comparisons by one-way ANOVA, the F-statistic is normally used.

6.1.5 Rejection region

The rejection region is the part of the sample space (the critical region) where the null hypothesis H0 is rejected. The size of this region is determined by the probability (α) of the sample point falling in the critical region when H0 is true. α is also known as the level of significance: the probability of the value of the test statistic falling in the critical region when H0 is true. Note that the term "statistical significance" refers only to the rejection of a null hypothesis at some level α. It implies that the observed difference between the sample statistic and the mean of the sampling distribution is unlikely to have occurred by chance alone.

6.1.6 Decision making

If the test statistic falls in the rejection (critical) region, we reject H0 and conclude that there is enough evidence to support the alternative hypothesis. Otherwise we fail to reject H0, which means that there is not enough evidence to support the claim that H1 is true.

6.1.7 Steps to do the test

In hypothesis testing, there are seven steps to perform any statistical test:

(i) Identify the parameter of interest.

(ii) State the hypotheses: the null hypothesis and the alternative hypothesis.

(iii) Determine the appropriate test statistic.

(iv) Determine the critical value.

(v) Determine the rejection (critical) region, the P-value, or the 100(1 − α)% confidence interval.

(vi) Calculate the test statistic.

(vii) Make a decision or conclusion based on step (v).

6.2 Testing about the mean when the sample size is large and the variance is known

If the parameter of interest is the population mean and the variance is known, or the sample size is very large, then the test can be performed as below:

Step 1: The parameter of interest is the population mean, μ, with the population variance σ² known.

Step 2: H0: μ = μ0 against H1: μ ≠ μ0 (two-tailed), H1: μ > μ0 (upper-tailed) or H1: μ < μ0 (lower-tailed).

Step 3: Test statistic: z0 = (x̄ − μ0)/(σ/√n).

Step 4: The critical value at significance level α is zα for a one-tailed (one-sided) test and zα/2 for a two-tailed (two-sided) test.

Step 5: i. The critical region:

Alternative Hypothesis, H1          Rejection Criteria (Reject H0 IF)
Two tailed test, H1: μ ≠ μ0         |z0| > zα/2
Upper tailed test, H1: μ > μ0       z0 > zα
Lower tailed test, H1: μ < μ0       z0 < −zα

ii. P-value approach:

Alternative Hypothesis, H1          Reject H0 IF P-value < α
Two tailed test, H1: μ ≠ μ0         P-value = 2[1 − Φ(|z0|)]
Upper tailed test, H1: μ > μ0       P-value = 1 − Φ(z0)
Lower tailed test, H1: μ < μ0       P-value = Φ(z0)

iii. 100(1 − α)% confidence interval:

Alternative Hypothesis, H1          Reject H0 IF μ0 falls outside of the interval
Two tailed test, H1: μ ≠ μ0         x̄ − zα/2 σ/√n ≤ μ ≤ x̄ + zα/2 σ/√n
Upper tailed test, H1: μ > μ0       μ ≥ x̄ − zα σ/√n
Lower tailed test, H1: μ < μ0       μ ≤ x̄ + zα σ/√n

Step 6: Calculate the test statistic z0 in Step 3.

Step 7: Decision: draw a conclusion based on the criteria in Step 5.

Example 6.1:

Test the hypothesis that the mean age of UTP students is less than 21, given a random sample of 20 individuals who have a mean age of 20. Assume that age is normally distributed with a variance of 20.

i. Test the hypothesis that the mean age is less than 21. Use α = 0.05.

ii. What is the P-value for this test?

iii. Construct a 95% two-sided CI on the mean age.

iv. Use the CI found in part (iii) to test the hypothesis.

Solution:

i. (1) The parameter of interest is the true mean age of UTP students, μ.

(2) The hypotheses: H0: μ = 21 against H1: μ < 21.

(3) The test statistic is z0 = (x̄ − μ0)/(σ/√n).

(4) Critical value: α = 0.05, so z0.05 = 1.65.

(5) The critical region: reject H0 if z0 < −1.65.

(6) Computation: z0 = (20 − 21)/(√20/√20) = −1.

(7) Result and conclusion:

Since z0 = −1 > −1.65, we fail to reject H0 and conclude that there is not enough evidence to say that the true mean age of UTP students is less than 21 years at α = 0.05.

ii. The P-value for this test is Φ(−1) = 0.159. Since the P-value > 0.05, we fail to reject H0.

iii. A 95% two-sided CI on the mean age is

x̄ − z0.025 σ/√n ≤ μ ≤ x̄ + z0.025 σ/√n, i.e. 20 − 1.96(1) ≤ μ ≤ 20 + 1.96(1), or 18.04 ≤ μ ≤ 21.96.

iv. Since 21 is in the interval, we fail to reject H0 and conclude that there is not enough evidence to say that the true mean age of UTP students is less than 21 years at α = 0.05.
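The z-test procedure of Section 6.2, applied to the numbers of Example 6.1, can be sketched as below. The helper z_test_mean and its tail argument are hypothetical names introduced only for this illustration; the standard normal CDF Φ comes from Python's statistics.NormalDist.

```python
from math import sqrt
from statistics import NormalDist

def z_test_mean(xbar, mu0, sigma, n, alpha=0.05, tail="two"):
    """Return (z0, p_value, reject) for a test on one mean with sigma known."""
    phi = NormalDist().cdf                 # standard normal CDF
    z0 = (xbar - mu0) / (sigma / sqrt(n))
    if tail == "two":
        p_value = 2 * (1 - phi(abs(z0)))
    elif tail == "upper":
        p_value = 1 - phi(z0)
    elif tail == "lower":
        p_value = phi(z0)
    else:
        raise ValueError("tail must be 'two', 'upper' or 'lower'")
    return z0, p_value, p_value < alpha

# Example 6.1: n = 20, sample mean 20, variance 20, H1: mu < 21
z0, p_value, reject = z_test_mean(20, 21, sqrt(20), 20, tail="lower")

# 95% two-sided confidence interval on the mean
half_width = NormalDist().inv_cdf(0.975) * sqrt(20) / sqrt(20)
ci = (20 - half_width, 20 + half_width)

print(f"z0 = {z0:.2f}, P-value = {p_value:.3f}, reject H0: {reject}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```

All three approaches of Step 5 (critical region, P-value, confidence interval) lead to the same decision here: H0 is not rejected.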


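6.3 Testing about the mean when the population variance is unknown

If the parameter of interest is the population mean and the variance is unknown or the sample size is small, the test is performed in the same way as when the variance is known, except that in Step 3 the Z statistic is replaced by the T statistic and σ is replaced by s, the sample standard deviation.

Step 1: The parameter of interest is the population mean, μ, with the population variance σ² unknown.

Step 2: H0: μ = μ0 against H1: μ ≠ μ0 (two-tailed), H1: μ > μ0 (upper-tailed) or H1: μ < μ0 (lower-tailed).

Step 3: Test statistic: t0 = (x̄ − μ0)/(s/√n), which has a t-distribution with n − 1 degrees of freedom when H0 is true.

Step 4: The critical value at significance level α is tα,n−1 for a one-tailed (one-sided) test and tα/2,n−1 for a two-tailed (two-sided) test.

Step 5: i. The critical region:

Alternative Hypothesis, H1          Rejection Criteria (Reject H0 IF)
Two tailed test, H1: μ ≠ μ0         |t0| > tα/2,n−1
Upper tailed test, H1: μ > μ0       t0 > tα,n−1
Lower tailed test, H1: μ < μ0       t0 < −tα,n−1

ii. P-value approach:

Alternative Hypothesis, H1          Reject H0 IF P-value < α
Two tailed test, H1: μ ≠ μ0         P-value = 2P(Tn−1 > |t0|)
Upper tailed test, H1: μ > μ0       P-value = P(Tn−1 > t0)
Lower tailed test, H1: μ < μ0       P-value = P(Tn−1 < t0)

iii. 100(1 − α)% confidence interval:

Alternative Hypothesis, H1          Reject H0 IF μ0 falls outside of the interval
Two tailed test, H1: μ ≠ μ0         x̄ − tα/2,n−1 s/√n ≤ μ ≤ x̄ + tα/2,n−1 s/√n
Upper tailed test, H1: μ > μ0       μ ≥ x̄ − tα,n−1 s/√n
Lower tailed test, H1: μ < μ0       μ ≤ x̄ + tα,n−1 s/√n

Step 6: Calculate the test statistic t0 in Step 3.

Step 7: Decision: draw a conclusion based on the criteria in Step 5.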


Example 6.2:

A particular brand of diet margarine was analyzed to determine the level of polyunsaturated fatty acid (in percent). A sample of six packages resulted in the following data: 16.8, 17.2, 17.4, 16.9, 16.5 and 17.1.

i. Using the P-value approach, test the hypothesis that the mean is not 17.0.

ii. Construct a 95% two-sided CI on the mean.

iii. Use the CI found in part (ii) to test the hypothesis.

Solution:

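i. (1) The parameter of interest is the true mean level of polyunsaturated fatty acid, μ, with the variance unknown.

(2) The hypotheses: H0: μ = 17 against H1: μ ≠ 17.

(3) The test statistic is t0 = (x̄ − μ0)/(s/√n).

(4) Critical value: α = 0.05, so t0.025,5 = 2.571.

(5) The critical region: reject H0 if P-value < 0.05.

(6) Computation: from the data, x̄ = 16.98 and s = 0.3188, so t0 = (16.98 − 17)/(0.3188/√6) = −0.1537.

From the t-table with 5 df, |t0| = 0.1537 is less than t0.40,5 = 0.267, for which the upper-tail probability is greater than 0.4, so the P-value > 2(0.4) = 0.8.

(7) Result and conclusion:

Since the P-value > 0.05, we fail to reject H0 and conclude that there is not enough evidence to say that the true mean differs from 17 at α = 0.05.

ii. A 95% two-sided CI on the mean is

x̄ − t0.025,5 s/√n ≤ μ ≤ x̄ + t0.025,5 s/√n, i.e. 16.98 − 2.571(0.3188/√6) ≤ μ ≤ 16.98 + 2.571(0.3188/√6), or 16.65 ≤ μ ≤ 17.31.

iii. Since 17 falls in the interval, we fail to reject H0 and conclude that there is not enough evidence to say that the true mean differs from 17 at α = 0.05.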
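The quantities in Example 6.2 can be recomputed from the raw data as in the sketch below. It assumes the tabulated critical value t0.025,5 = 2.571, because Python's standard library has no t-distribution; the variable names are illustrative. Carrying x̄ unrounded gives t0 ≈ −0.128, whereas rounding x̄ to 16.98 first gives a magnitude of about 0.154, which corresponds to the 0.1537 quoted in the notes. The conclusion is the same either way.

```python
from math import sqrt
from statistics import mean, stdev

data = [16.8, 17.2, 17.4, 16.9, 16.5, 17.1]   # polyunsaturated fatty acid (%)
mu0 = 17.0
t_crit = 2.571        # t(0.025, 5 df), assumed from a standard t-table

n = len(data)
xbar = mean(data)
s = stdev(data)                   # sample standard deviation (n - 1 divisor)
se = s / sqrt(n)                  # estimated standard error of the mean

t0 = (xbar - mu0) / se
reject = abs(t0) > t_crit         # two-tailed critical-region decision
ci = (xbar - t_crit * se, xbar + t_crit * se)

print(f"xbar = {xbar:.4f}, s = {s:.4f}, t0 = {t0:.4f}")
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f}), reject H0: {reject}")
```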

6.4 Testing about the proportion

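In hypothesis testing, the procedure to test about the proportion of the population is the same as the procedure to test about the mean when the population variance is known.

Step 1: The parameter of interest is the population proportion, p.

Step 2: H0: p = p0 against H1: p ≠ p0 (two-tailed), H1: p > p0 (upper-tailed) or H1: p < p0 (lower-tailed).

Step 3: Test statistic: z0 = (p̂ − p0)/√(p0(1 − p0)/n), where p̂ = x/n is the sample proportion.

Step 4: The critical value at significance level α is zα for a one-tailed (one-sided) test and zα/2 for a two-tailed (two-sided) test.

Step 5: i. The critical region:

Alternative Hypothesis, H1          Rejection Criteria (Reject H0 IF)
Two tailed test, H1: p ≠ p0         |z0| > zα/2
Upper tailed test, H1: p > p0       z0 > zα
Lower tailed test, H1: p < p0       z0 < −zα

ii. P-value approach:

Alternative Hypothesis, H1          Reject H0 IF P-value < α
Two tailed test, H1: p ≠ p0         P-value = 2[1 − Φ(|z0|)]
Upper tailed test, H1: p > p0       P-value = 1 − Φ(z0)
Lower tailed test, H1: p < p0       P-value = Φ(z0)

iii. 100(1 − α)% confidence interval:

Alternative Hypothesis, H1          Reject H0 IF p0 falls outside of the interval
Two tailed test, H1: p ≠ p0         p̂ − zα/2 √(p̂(1 − p̂)/n) ≤ p ≤ p̂ + zα/2 √(p̂(1 − p̂)/n)
Upper tailed test, H1: p > p0       p ≥ p̂ − zα √(p̂(1 − p̂)/n)
Lower tailed test, H1: p < p0       p ≤ p̂ + zα √(p̂(1 − p̂)/n)

Step 6: Calculate the test statistic z0 in Step 3.

Step 7: Decision: draw a conclusion based on the criteria in Step 5.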
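A test about a proportion can be sketched in code as follows. The numbers used (52 successes in 200 trials, tested against p0 = 0.20) are hypothetical and chosen only for illustration, as is the helper name proportion_z_test; the statistic is the usual large-sample z-statistic with the null value p0 in the standard error.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z_test(x, n, p0, alpha=0.05, tail="two"):
    """Return (p_hat, z0, p_value, reject) for a large-sample proportion test."""
    phi = NormalDist().cdf
    p_hat = x / n                                   # sample proportion
    z0 = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)     # standard error under H0
    if tail == "two":
        p_value = 2 * (1 - phi(abs(z0)))
    elif tail == "upper":
        p_value = 1 - phi(z0)
    elif tail == "lower":
        p_value = phi(z0)
    else:
        raise ValueError("tail must be 'two', 'upper' or 'lower'")
    return p_hat, z0, p_value, p_value < alpha

# Hypothetical data: 52 successes out of 200, H0: p = 0.20 against H1: p > 0.20
p_hat, z0, p_value, reject = proportion_z_test(52, 200, 0.20, tail="upper")
print(f"p_hat = {p_hat:.2f}, z0 = {z0:.3f}, P-value = {p_value:.4f}, reject H0: {reject}")
```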




6.5 Testing about the variance

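When the parameter of interest is the population variance or standard deviation, the same steps are used, except that in Step 3 the test statistic is the χ² statistic. The steps are as follows:

Step 1: The parameter of interest is the population variance, σ².

Step 2: H0: σ² = σ0² against H1: σ² ≠ σ0² (two-tailed), H1: σ² > σ0² (upper-tailed) or H1: σ² < σ0² (lower-tailed).

Step 3: Test statistic: χ0² = (n − 1)s²/σ0², which has a χ² distribution with n − 1 degrees of freedom when H0 is true.

Step 4: The critical value at significance level α is χ²α,n−1 or χ²1−α,n−1 for a one-tailed (one-sided) test, and χ²α/2,n−1 and χ²1−α/2,n−1 for a two-tailed (two-sided) test.

Step 5: i. The critical region:

Alternative Hypothesis, H1           Rejection Criteria (Reject H0 IF)
Two tailed test, H1: σ² ≠ σ0²        χ0² > χ²α/2,n−1 or χ0² < χ²1−α/2,n−1
Upper tailed test, H1: σ² > σ0²      χ0² > χ²α,n−1
Lower tailed test, H1: σ² < σ0²      χ0² < χ²1−α,n−1

ii. P-value approach:

Alternative Hypothesis, H1           Reject H0 IF P-value < α
Two tailed test, H1: σ² ≠ σ0²        P-value = 2 min{P(χ²n−1 > χ0²), P(χ²n−1 < χ0²)}
Upper tailed test, H1: σ² > σ0²      P-value = P(χ²n−1 > χ0²)
Lower tailed test, H1: σ² < σ0²      P-value = P(χ²n−1 < χ0²)

iii. 100(1 − α)% confidence interval:

Alternative Hypothesis, H1           Reject H0 IF σ0² falls outside of the interval
Two tailed test, H1: σ² ≠ σ0²        (n − 1)s²/χ²α/2,n−1 ≤ σ² ≤ (n − 1)s²/χ²1−α/2,n−1
Upper tailed test, H1: σ² > σ0²      σ² ≥ (n − 1)s²/χ²α,n−1
Lower tailed test, H1: σ² < σ0²      σ² ≤ (n − 1)s²/χ²1−α,n−1

Step 6: Calculate the test statistic χ0² in Step 3.

Step 7: Decision: draw a conclusion based on the criteria in Step 5.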


Example 6.3:

Aerospace engineers claim that the standard deviation of the percentage in an alloy used in aerospace casting is greater than 0.3. A random sample of 51 parts was selected, and the sample standard deviation of the percentage in the alloy is s = 0.37. (i) At α = 0.05, do these data support the claim of the engineers? (ii) What is the P-value for this test? (iii) Construct a 95% two-sided CI for σ. What is the conclusion?

Solution:

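(i) (1) The parameter of interest is the population variance, σ² (equivalently the standard deviation, σ).

(2) The hypotheses: H0: σ² = 0.3² = 0.09 against H1: σ² > 0.09.

(3) The test statistic is χ0² = (n − 1)s²/σ0².

(4) Critical value: α = 0.05, so χ²0.05,50 = 67.50.

(5) The critical region: reject H0 if χ0² > 67.50.

(6) Computation: χ0² = (51 − 1)(0.37)²/(0.3)² = 76.0556.

(7) Result and conclusion:

Since 76.0556 > 67.50, we reject the null hypothesis and conclude that the engineers' claim is true at the 0.05 level of significance.

(ii) From the table, χ²0.025,50 = 71.42 and χ²0.01,50 = 76.15. Since 71.42 < 76.0556 < 76.15, the P-value satisfies 0.01 < P < 0.025. Because the P-value is < 0.05, we reject the null hypothesis.

(iii) A 95% two-sided CI for σ² is

(n − 1)s²/χ²0.025,50 ≤ σ² ≤ (n − 1)s²/χ²0.975,50, i.e. 50(0.37)²/71.42 ≤ σ² ≤ 50(0.37)²/32.36, or 0.0958 ≤ σ² ≤ 0.2115, so that 0.310 ≤ σ ≤ 0.460.

Since σ = 0.3 is outside of the interval, we reject the null hypothesis and conclude that the engineers' claim is true at the 0.05 level of significance.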
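The arithmetic of Example 6.3 can be checked with the short sketch below. Because Python's standard library has no χ² distribution, the critical values are assumed from a standard χ² table for 50 degrees of freedom (67.50, 71.42 and 32.36); the variable names are illustrative.

```python
from math import sqrt

n, s, sigma0 = 51, 0.37, 0.30

# Chi-square table values for 50 degrees of freedom (assumed from a table).
chi2_05 = 67.50      # upper 5% point, for the upper-tailed test
chi2_025 = 71.42     # upper 2.5% point
chi2_975 = 32.36     # lower 2.5% point

chi2_0 = (n - 1) * s**2 / sigma0**2        # test statistic
reject = chi2_0 > chi2_05                  # upper-tailed critical region

# 95% two-sided confidence interval for the variance, then for sigma
var_ci = ((n - 1) * s**2 / chi2_025, (n - 1) * s**2 / chi2_975)
sd_ci = tuple(sqrt(v) for v in var_ci)

print(f"chi-square statistic = {chi2_0:.4f}, reject H0: {reject}")
print(f"95% CI for sigma: ({sd_ci[0]:.3f}, {sd_ci[1]:.3f})")
```

The hypothesized value σ = 0.3 lies just below the lower end of the interval, in agreement with the rejection of H0.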

oooOOOooo


Exercise 6

1. A manufacturer of sprinkler systems used for fire protection in office buildings claims that the true average system-activation temperature is 130°F. A sample of 9 systems, when tested, yields an average activation temperature of 131.08°F. If the distribution of activation temperatures is normal with standard deviation 1.5°F, do the data contradict the firm's claim at significance level α = 0.01? What is the P-value for this test?

2. A random sample of 50 battery packs is selected and subjected to a life test. The average life of these batteries is 4.05 hours. Assume that the battery life is normally distributed with standard deviation equal to 0.2 hour. Is there evidence to support the claim that the mean battery life exceeds 4 hours? Use α = 0.05. What is the P-value for this test?

3. The flow discharge of the Perak River (measured in m³/s) was obtained at random. 40 readings were collected and the mean flow discharge was found to be 3.815 m³/s with a standard deviation of 0.5 m³/s. (a) Test the hypothesis that the mean flow discharge of the Perak River is not equal to 4 m³/s. Use α = 0.05. (b) Use the P-value approach to test the null hypothesis. (c) Construct a 95% two-sided CI on the mean flow discharge. What is the conclusion?

4. A civil engineer is analyzing the compressive strength of concrete. Compressive strength is approximately normally distributed with variance σ² = 1000 psi². A random sample of 12 specimens has a mean compressive strength of x̄ = 3255.42 psi. (a) Test the hypothesis that the mean compressive strength is 3500 psi. Use α = 0.01. (b) What is the smallest level of significance at which you would be willing to reject the null hypothesis? (c) Construct a 95% two-sided CI on the mean compressive strength. (d) Construct a 99% two-sided CI on the mean compressive strength. Compare the width of this confidence interval with the width of the one in part (c). What is your comment?

5. A new process for producing synthetic diamonds can be operated at a profitable level only if the average weight of the diamonds is greater than 0.5 karat. To evaluate the profitability of the process, six diamonds are generated with recorded weights 0.46, 0.61, 0.52, 0.48, 0.57 and 0.54 karat. (a) At the 5% significance level, do the six measurements present sufficient evidence that the average weight of the diamonds produced by the process is in excess of 0.5 karat? (b) Use the P-value approach to test the null hypothesis. (c) Construct a 95% CI on the average weight of the diamonds.

6. A cigarette company claims that its cigarettes contain an average of only 10 mg of tar. A random sample of 25 cigarettes shows the average tar content to be 12.5 mg with a standard deviation of 4.5 mg. (a) Construct a hypothesis test to determine whether the average tar content of the cigarettes exceeds 10 mg, using the P-value approach. (b) Construct a 95% two-sided CI on the average tar content of the cigarettes.

7. Regardless of age, about 20% of Malaysian adults participate in fitness activities at least twice a week. In a local survey of 100 adults over 40 years old, a total of 15 people indicated that they participated in a fitness activity at least twice a week. (a) Do these data indicate that the participation rate for adults over 40 years of age is significantly less than 20%? Carry out a test at the 10% significance level and draw an appropriate conclusion. (b) Construct a 95% two-sided CI on the participation rate.

8. A survey done one year ago showed that 45% of the population participated in recycling programs. In a recent poll, a random sample of 1250 people showed that 588 participate in recycling programs. (a) Test the hypothesis that the proportion of the population who participate in recycling programs is greater than it was one year ago. Use a 5% significance level. (b) Construct a 95% two-sided CI on the proportion.

9. An Ipoh city council member gave a speech in which she said that 18% of all private homes in the city had been undervalued by the county tax assessor's office. In a follow-up story the local newspaper reported that it had taken a random sample of 91 private homes. Using a professional evaluator to evaluate the properties and checking against county tax records, it found that 14 of the homes had been undervalued. (a) Do these data indicate that the proportion of private homes that are undervalued by the county tax assessor is different from 18%? Use a 5% significance level. (b) Construct a 95% two-sided CI on the proportion.


10. Engineers designing the front-wheel-drive half shaft of a new model automobile claim that the variance in the displacement of the constant velocity joints of the shaft is less than 1.5 mm². 20 simulations were conducted and the following results were obtained, and s = 1.41. (a) At α = 0.05, do these data support the claim of the engineers? (b) What is the P-value for this test? (c) Construct a two-sided CI for σ².

11. Aerospace engineers claim that the standard deviation of the percentage in an alloy used in aerospace casting is greater than 0.3. A random sample of 51 parts was selected, and the sample standard deviation of the percentage in the alloy is s = 0.37. (a) At α = 0.05, do these data support the claim of the engineers? (b) What is the P-value for this test? (c) Construct a 95% two-sided CI for σ. What is the conclusion?

12. Scientists claim that the variance of the sugar content of the syrup in canned peaches is 18 mg². A random sample of 10 cans yields a sample standard deviation of 4.8 mg. (a) At α = 0.05, do these data support the claim of the scientists? (b) What is the P-value for this test? (c) Construct a 95% two-sided CI for σ². What is the conclusion?

oooOOOooo


Chapter 7

7. Hypothesis Testing – Two populations



Understand the procedure or steps to perform the test for two populations.

Test the difference between the two means when the population variances are known.

Test the difference between the two means when the population variances are unknown but assumed to be equal.

Test the difference between the two means when the population variances are unknown and assumed to be unequal.

Test the difference between the two proportions.

Perform the test about the difference between the two variances.

7.1 Introduction

In hypothesis testing for two populations, the procedure is the same as in hypothesis testing for one population, but now we want to test the difference between the two populations' parameters of interest. For example, we may want to test the difference between the means of the two populations, μ1 and μ2, or the difference between the proportions of the two populations, p1 and p2.

7.1.1 Types of test

In this hypothesis testing there are three types of test on any parameters of interest: the two-tailed (two-sided) test, the upper-tailed test and the lower-tailed test, as in the table below. Here θ1 and θ2 stand for the parameters of interest of the two populations.

Type                   Null Hypothesis, H0   Alternative Hypothesis, H1
1 Two tailed test      θ1 = θ2               θ1 ≠ θ2
2 Upper tailed test    θ1 = θ2               θ1 > θ2
3 Lower tailed test    θ1 = θ2               θ1 < θ2

7.1.2 Steps to do the test

In hypothesis testing for two populations, the procedure to perform the test is the same as in the hypothesis testing for one population.

(i) Identify the parameter of interest.

(ii) State the hypotheses: the null hypothesis and the alternative hypothesis.

(iii) Determine the appropriate test statistic.

(iv) Determine the critical value.

(v) Determine the rejection (critical) region, the P-value, or the 100(1 − α)% confidence interval.

(vi) Calculate the test statistic.

(vii) Make a decision or conclusion based on step (v).

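7.2 Testing about the difference between the two means when both population variances are known

If the parameter of interest is the difference between the two means of two populations when both variances are known, then the test can be performed as below:

Step 1: The parameter of interest is the difference between the two means, μ1 − μ2, when both population variances, σ1² and σ2², are known.

Step 2: H0: μ1 − μ2 = Δ0 against H1: μ1 − μ2 ≠ Δ0 (two-tailed), H1: μ1 − μ2 > Δ0 (upper-tailed) or H1: μ1 − μ2 < Δ0 (lower-tailed), where Δ0 is the hypothesized difference (usually 0).

Step 3: Test statistic: z0 = (x̄1 − x̄2 − Δ0)/√(σ1²/n1 + σ2²/n2).

Step 4: The critical value at significance level α is zα for a one-tailed (one-sided) test and zα/2 for a two-tailed (two-sided) test.

Step 5: i. The critical region:

Alternative Hypothesis, H1               Rejection Criteria (Reject H0 IF)
Two tailed test, H1: μ1 − μ2 ≠ Δ0        |z0| > zα/2
Upper tailed test, H1: μ1 − μ2 > Δ0      z0 > zα
Lower tailed test, H1: μ1 − μ2 < Δ0      z0 < −zα

ii. P-value approach:

Alternative Hypothesis, H1               Reject H0 IF P-value < α
Two tailed test, H1: μ1 − μ2 ≠ Δ0        P-value = 2[1 − Φ(|z0|)]
Upper tailed test, H1: μ1 − μ2 > Δ0      P-value = 1 − Φ(z0)
Lower tailed test, H1: μ1 − μ2 < Δ0      P-value = Φ(z0)

iii. 100(1 − α)% confidence interval:

Alternative Hypothesis, H1               Reject H0 IF Δ0 falls outside of the interval
Two tailed test, H1: μ1 − μ2 ≠ Δ0        (x̄1 − x̄2) − zα/2 √(σ1²/n1 + σ2²/n2) ≤ μ1 − μ2 ≤ (x̄1 − x̄2) + zα/2 √(σ1²/n1 + σ2²/n2)
Upper tailed test, H1: μ1 − μ2 > Δ0      μ1 − μ2 ≥ (x̄1 − x̄2) − zα √(σ1²/n1 + σ2²/n2)
Lower tailed test, H1: μ1 − μ2 < Δ0      μ1 − μ2 ≤ (x̄1 − x̄2) + zα √(σ1²/n1 + σ2²/n2)

Step 6: Calculate the test statistic z0 in Step 3.

Step 7: Decision: draw a conclusion based on the criteria in Step 5.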



Example 7.1:

The burning rates of two different solid-fuel propellants used in rocket systems are being studied. It is known that both propellants have approximately the same standard deviation of burning rate, namely 3 cm/second. Two random samples, each of 20 specimens, are tested, and the sample mean burning rates are 18 cm/second and 24 cm/second respectively.

i. Test the hypothesis that both propellants have the same mean burning rate, using the P-value approach.

ii. Construct a two-sided 95% CI on the difference in means, μ1 − μ2.

iii. What is the practical meaning of this interval?

Solution:

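(i) (1) The parameter of interest is the difference in mean burning rate, μ1 - μ2. The variances, σ1² = σ2² = 9, are known.

(2) The hypothesis testing: H0: μ1 - μ2 = 0 against H1: μ1 - μ2 ≠ 0

(3) Test statistic:

$$z_0 = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$$

(4) Critical value: α = 0.05, z0.025 = 1.96

(5) The critical region is reject H0 if P-value < 0.05

(6) Computation: z0 and P-value

$$z_0 = \frac{18 - 24}{\sqrt{9/20 + 9/20}} = -6.32$$

P-value = 2[1 - Φ(6.32)] ≈ 2[1 - 1] = 0

(7) Since P-value < 0.05, we reject H0. The two propellants do not have the same mean burning rate.

ii. A 95% two-sided CI on the difference in means, μ1 - μ2, is

$$-6 \pm 1.96\sqrt{9/20 + 9/20}, \quad\text{that is,}\quad -7.86 \le \mu_1 - \mu_2 \le -4.14$$

iii. Since 0 is not in the interval, we reject H0: the two propellants do not have the same mean burning rate. In practical terms, the mean burning rate of the second propellant exceeds that of the first by between 4.14 and 7.86 cm/second, with 95% confidence.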
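The arithmetic in Example 7.1 can be checked with a short script. This is a minimal sketch using only the Python standard library; the function name `two_sample_z` and its argument names are illustrative choices, not part of the notes.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(xbar1, xbar2, sigma1, sigma2, n1, n2, delta0=0.0, alpha=0.05):
    """Two-tailed z-test and CI for mu1 - mu2 when both variances are known."""
    se = sqrt(sigma1**2 / n1 + sigma2**2 / n2)  # standard error of xbar1 - xbar2
    z0 = (xbar1 - xbar2 - delta0) / se          # test statistic
    std_normal = NormalDist()                   # standard normal, mean 0 and sd 1
    p_value = 2 * (1 - std_normal.cdf(abs(z0)))
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    diff = xbar1 - xbar2
    ci = (diff - z_crit * se, diff + z_crit * se)
    return z0, p_value, ci


# Data from Example 7.1: means 18 and 24, sigma = 3 for both, n = 20 for both.
z0, p_value, ci = two_sample_z(18, 24, 3, 3, 20, 20)
print(f"z0 = {z0:.2f}, P-value = {p_value:.2e}")
print(f"95% CI on mu1 - mu2: ({ci[0]:.2f}, {ci[1]:.2f})")
```

The P-value is far below 0.05 and the interval lies entirely below zero, so both approaches lead to the same decision.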

7.2 Testing the difference between two means when both population variances are unknown

If the parameter of interest is the difference between the means of two populations and both variances are unknown, the test falls into one of two cases. In the first case we assume that the population variances are equal, σ1² = σ2²; in the second case we assume that the variances are not equal, σ1² ≠ σ2².

7.2.1 First case:

Testing the difference between two means when both population variances are unknown but equal, σ1² = σ2². The test can be performed as follows:

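Step 1: The parameter of interest is the difference between two means, μ1 - μ2, when both population variances, σ1² and σ2², are unknown but σ1² = σ2².

Step 2: State the hypotheses, where Δ0 is the hypothesised difference (usually Δ0 = 0):

Two tailed test, H0: μ1 - μ2 = Δ0 against H1: μ1 - μ2 ≠ Δ0

Upper tailed test, H0: μ1 - μ2 = Δ0 against H1: μ1 - μ2 > Δ0

Lower tailed test, H0: μ1 - μ2 = Δ0 against H1: μ1 - μ2 < Δ0

Step 3: The test statistic is

$$t_0 = \frac{\bar{x}_1 - \bar{x}_2 - \Delta_0}{s_p\sqrt{1/n_1 + 1/n_2}}, \qquad s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

with n1 + n2 - 2 degrees of freedom.

Step 4: The critical value at significance level α is $t_{\alpha,\,n_1+n_2-2}$ for a one tailed (sided) test and $t_{\alpha/2,\,n_1+n_2-2}$ for a two tailed (sided) test.

Step 5: Determine one of the following.

i. Rejection region: reject H0 if

Two tailed test, $|t_0| > t_{\alpha/2,\,n_1+n_2-2}$

Upper tailed test, $t_0 > t_{\alpha,\,n_1+n_2-2}$

Lower tailed test, $t_0 < -t_{\alpha,\,n_1+n_2-2}$

ii. P-value: reject H0 if P < α, where T has a t distribution with n1 + n2 - 2 degrees of freedom

Two tailed test, $P = 2P(T > |t_0|)$

Upper tailed test, $P = P(T > t_0)$

Lower tailed test, $P = P(T < t_0)$

iii. 100(1-α)% Confidence Intervals: reject H0 if Δ0 falls outside of the interval, where $SE = s_p\sqrt{1/n_1 + 1/n_2}$

Two tailed test, $\bar{x}_1 - \bar{x}_2 - t_{\alpha/2,\,n_1+n_2-2}\,SE \le \mu_1 - \mu_2 \le \bar{x}_1 - \bar{x}_2 + t_{\alpha/2,\,n_1+n_2-2}\,SE$

Upper tailed test, $\mu_1 - \mu_2 \ge \bar{x}_1 - \bar{x}_2 - t_{\alpha,\,n_1+n_2-2}\,SE$

Lower tailed test, $\mu_1 - \mu_2 \le \bar{x}_1 - \bar{x}_2 + t_{\alpha,\,n_1+n_2-2}\,SE$

Step 6: Calculate the test statistic, t0, in Step 3.

Step 7: Make a decision or conclusion based on Step 5.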


Example 7.2:

Professor Adams taught the same large lecture course for two terms. Except for negligible differences the two courses were the same. However, one met at 8a.m. and the other met at 11a.m. The two courses were given final exams of the same degree of difficulty and covering the same material. Both exams were worth 100 points. A random sample of 49 students from the 8a.m. class had an average score of 73.2 with standard deviation 8.1. A random sample of 36 students from the 11:00a.m. class had an average score of 78.1 with standard deviation 10.0. Assume that the population variances are the same and the data are drawn from a normal distribution.

i. Do these data indicate that the mean score for the 11:00 a.m. class is higher than the mean score for the 8 a.m. class? Use a 5% significance level.
ii. What is the P-value for this test?
iii. Construct a two-sided 95% CI on the difference in average scores.

Solution:

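(i) (1) The parameter of interest is the difference in mean scores, μ1 - μ2, where μ1 is the mean score of the 8 a.m. class and μ2 is the mean score of the 11 a.m. class. The population variances are unknown but assumed equal.

(2) Hypothesis testing: H0: μ1 = μ2 against H1: μ1 < μ2

(3) Test statistic:

$$t_0 = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{1/n_1 + 1/n_2}}$$

with n1 + n2 - 2 = 83 degrees of freedom.

(4) Critical value: α = 0.05, t0.05,83 ≈ 1.66

(5) The critical region is reject H0 if t0 < -1.66

(6) Computation:

$$s_p^2 = \frac{48(8.1)^2 + 35(10.0)^2}{83} = 80.11, \qquad s_p = 8.95$$

$$t_0 = \frac{73.2 - 78.1}{8.95\sqrt{1/49 + 1/36}} = -2.49$$

(7) Since t0 = -2.49 < -1.66, we reject H0 at α = 0.05. There is enough evidence to say that the mean score of the 11 a.m. class is higher than the mean score of the 8 a.m. class.

ii. The P-value for the test:

From the t-table, |t0| = 2.49 lies between t = 2.358 and t = 2.617, which gives 0.005 < P < 0.01. Since P < 0.05, we reject H0 at the 0.05 level of significance and conclude that there is enough evidence to say that the mean score of the 11 a.m. class is higher than the mean score of the 8 a.m. class.

iii. A 95% CI for the difference in mean scores, μ1 - μ2, where t0.025,83 ≈ 1.99, is

$$-4.9 \pm 1.99(8.95)\sqrt{1/49 + 1/36}, \quad\text{that is,}\quad -8.81 \le \mu_1 - \mu_2 \le -0.99$$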
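The pooled variance and test statistic for Example 7.2 can be recomputed from the summary statistics given in the example. The sketch below uses only the Python standard library; the function name `pooled_t` is an illustrative choice, and the critical value still has to be read from a t-table because the standard library has no t distribution.

```python
from math import sqrt


def pooled_t(xbar1, s1, n1, xbar2, s2, n2):
    """Pooled-variance t statistic for H0: mu1 = mu2 (equal variances assumed)."""
    df = n1 + n2 - 2
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df  # pooled variance
    se = sqrt(sp2 * (1 / n1 + 1 / n2))                # standard error of the difference
    t0 = (xbar1 - xbar2) / se
    return t0, df, sp2, se


# Example 7.2: the 8 a.m. class is sample 1 and the 11 a.m. class is sample 2.
t0, df, sp2, se = pooled_t(73.2, 8.1, 49, 78.1, 10.0, 36)
print(f"sp^2 = {sp2:.2f}, t0 = {t0:.2f}, df = {df}")
```

A negative t0 well below the lower-tailed critical value supports the alternative that the 11 a.m. mean is higher.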

7.2.2 Second case:

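Testing the difference between two means when both population variances are unknown and not equal, σ1² ≠ σ2². The test can be performed as follows:

Step 1: The parameter of interest is the difference between two means, μ1 - μ2, when both population variances, σ1² and σ2², are unknown and σ1² ≠ σ2².

Step 2: State the hypotheses, where Δ0 is the hypothesised difference (usually Δ0 = 0):

Two tailed test, H0: μ1 - μ2 = Δ0 against H1: μ1 - μ2 ≠ Δ0

Upper tailed test, H0: μ1 - μ2 = Δ0 against H1: μ1 - μ2 > Δ0

Lower tailed test, H0: μ1 - μ2 = Δ0 against H1: μ1 - μ2 < Δ0

Step 3: The test statistic is

$$t_0 = \frac{\bar{x}_1 - \bar{x}_2 - \Delta_0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$$

with degrees of freedom given by,

$$v = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}$$

rounded down to the nearest integer.

Step 4: The critical value at significance level α is $t_{\alpha,\,v}$ for a one tailed (sided) test and $t_{\alpha/2,\,v}$ for a two tailed (sided) test.

Step 5: Determine one of the following.

i. Rejection region: reject H0 if

Two tailed test, $|t_0| > t_{\alpha/2,\,v}$

Upper tailed test, $t_0 > t_{\alpha,\,v}$

Lower tailed test, $t_0 < -t_{\alpha,\,v}$

ii. P-value: reject H0 if P < α, where T has a t distribution with v degrees of freedom

Two tailed test, $P = 2P(T > |t_0|)$

Upper tailed test, $P = P(T > t_0)$

Lower tailed test, $P = P(T < t_0)$

iii. 100(1-α)% Confidence Intervals: reject H0 if Δ0 falls outside of the interval, where $SE = \sqrt{s_1^2/n_1 + s_2^2/n_2}$

Two tailed test, $\bar{x}_1 - \bar{x}_2 - t_{\alpha/2,\,v}\,SE \le \mu_1 - \mu_2 \le \bar{x}_1 - \bar{x}_2 + t_{\alpha/2,\,v}\,SE$

Upper tailed test, $\mu_1 - \mu_2 \ge \bar{x}_1 - \bar{x}_2 - t_{\alpha,\,v}\,SE$

Lower tailed test, $\mu_1 - \mu_2 \le \bar{x}_1 - \bar{x}_2 + t_{\alpha,\,v}\,SE$

Step 6: Calculate the test statistic, t0, in Step 3.

Step 7: Make a decision or conclusion based on Step 5.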
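The unequal-variance test statistic and its approximate degrees of freedom are easy to get wrong by hand. The sketch below assumes the Welch-Satterthwaite form of the degrees of freedom (with n - 1 in the denominators); the function name `unequal_variance_t` and the summary statistics are hypothetical values chosen only for illustration, not data from the notes.

```python
from math import floor, sqrt


def unequal_variance_t(xbar1, s1, n1, xbar2, s2, n2, delta0=0.0):
    """t statistic and approximate degrees of freedom when variances are unequal."""
    v1 = s1**2 / n1  # estimated variance of xbar1
    v2 = s2**2 / n2  # estimated variance of xbar2
    t0 = (xbar1 - xbar2 - delta0) / sqrt(v1 + v2)
    # Welch-Satterthwaite approximation (assumed form of the df formula)
    nu = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t0, nu


# Hypothetical summary statistics, for illustration only.
t0, nu = unequal_variance_t(xbar1=12.0, s1=3.0, n1=10, xbar2=10.0, s2=1.5, n2=15)
print(f"t0 = {t0:.3f}, v = {nu:.2f}, rounded down to {floor(nu)}")
```

The rounded-down value of v is the row to use when reading the critical value from a t-table.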



Example 7.3:

Two companies manufacture a rubber material intended for use in an automotive application. The part will be subjected to abrasive wear in the application, so we decide to compare the material produced by each company in a test. Twenty-five samples of material from each company are tested in an abrasion test and the amount of wear after 1000 cycles is observed. The sample mean and standard deviation of wear for company A and B respectively, are

i. Do the data support the claim that the two companies produce material with different mean wear? Use α = 0.05, and assume that each population is normally distributed but their variances are not equal. What is the P-value for this test?

ii. Construct a two-sided 95% CI that will address the question in part (i) above.

Solution:

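(i) (1) The parameter of interest is the difference between the two means, μA - μB, with the variances unknown and not equal.

(2) Hypothesis testing: H0: μA = μB against H1: μA ≠ μB

(3) Test statistic:

$$t_0 = \frac{\bar{x}_A - \bar{x}_B}{\sqrt{s_A^2/n_A + s_B^2/n_B}}$$

with v = 27 degrees of freedom.

(4) Critical value: α = 0.05, t0.025,27 = 2.052

(5) The critical region is reject H0 if t0 > 2.052 or t0 < -2.052

(6) Computation: t0 = 4.85

(7) Since t0 = 4.85 > 2.052, we reject the null hypothesis; there is enough evidence to support the claim that the means are different.

The P-value for the test: t0 = 4.85 with 27 df exceeds t0.0005,27 = 3.69, so the two-tailed P-value is less than 2(0.0005) = 0.001.

(ii) A 95% CI for the difference in means, μA - μB, is

$$\bar{x}_A - \bar{x}_B \pm t_{0.025,\,27}\sqrt{s_A^2/n_A + s_B^2/n_B}$$

From the CI, since 0 is not in the interval, we reject H0. There is strong evidence that the mean wear of the materials from companies A and B is different.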


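7.3 Testing the difference between two proportions, p1 and p2

If the parameter of interest is the difference between the proportions of two populations, the test can be performed as follows:

Step 1: The parameter of interest is the difference between two proportions, p1 and p2.

Step 2: State the hypotheses:

Two tailed test, H0: p1 = p2 against H1: p1 ≠ p2

Upper tailed test, H0: p1 = p2 against H1: p1 > p2

Lower tailed test, H0: p1 = p2 against H1: p1 < p2

Step 3: The test statistic is

$$z_0 = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(1/n_1 + 1/n_2\right)}}$$

where,

$$\hat{p}_1 = \frac{x_1}{n_1}, \qquad \hat{p}_2 = \frac{x_2}{n_2}, \qquad \hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$$

Step 4: The critical value at significance level α is $z_{\alpha}$ for a one tailed (sided) test and $z_{\alpha/2}$ for a two tailed (sided) test.

Step 5: Determine one of the following.

i. Rejection region: reject H0 if

Two tailed test, $|z_0| > z_{\alpha/2}$

Upper tailed test, $z_0 > z_{\alpha}$

Lower tailed test, $z_0 < -z_{\alpha}$

ii. P-value: reject H0 if P < α, where

Two tailed test, $P = 2[1 - \Phi(|z_0|)]$

Upper tailed test, $P = 1 - \Phi(z_0)$

Lower tailed test, $P = \Phi(z_0)$

iii. 100(1-α)% Confidence Intervals: reject H0 if 0 falls outside of the interval, where $SE = \sqrt{\hat{p}_1(1 - \hat{p}_1)/n_1 + \hat{p}_2(1 - \hat{p}_2)/n_2}$

Two tailed test, $\hat{p}_1 - \hat{p}_2 - z_{\alpha/2}\,SE \le p_1 - p_2 \le \hat{p}_1 - \hat{p}_2 + z_{\alpha/2}\,SE$

Upper tailed test, $p_1 - p_2 \ge \hat{p}_1 - \hat{p}_2 - z_{\alpha}\,SE$

Lower tailed test, $p_1 - p_2 \le \hat{p}_1 - \hat{p}_2 + z_{\alpha}\,SE$

Step 6: Calculate the test statistic, z0, in Step 3.

Step 7: Make a decision or conclusion based on Step 5.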

Example 7.4:

In a study on the effects of sodium restricted diets on hypertension, 24 out of 55 hypertensive patients were on sodium restricted diets, and 36 out of 149 non-hypertensive patients were on sodium restricted diets.

i. Test the hypothesis that the proportion of patients on sodium restricted diets is higher for hypertensive patients at α = 0.05.
ii. What is the P-value for this test?
iii. Construct a two-sided 95% CI and comment.

Solution:

i. (1) The parameters of interest are the proportions of patients on sodium restricted diets among hypertensive patients, pA, and among non-hypertensive patients, pB.

(2) Hypothesis testing: H0: pA = pB against H1: pA > pB

(3) Test statistic:

$$z_0 = \frac{\hat{p}_A - \hat{p}_B}{\sqrt{\hat{p}(1 - \hat{p})\left(1/n_A + 1/n_B\right)}}$$

(4) Critical value: α = 0.05, z0.05 = 1.65

(5) The critical region is reject H0 if z0 > 1.65

(6) Computation:

$$\hat{p}_A = \frac{24}{55} = 0.436, \qquad \hat{p}_B = \frac{36}{149} = 0.242, \qquad \hat{p} = \frac{60}{204} = 0.294$$

$$z_0 = \frac{0.436 - 0.242}{\sqrt{0.294(0.706)\left(1/55 + 1/149\right)}} = 2.71$$

(7) Result and conclusions:

Since z0 = 2.71 > 1.65, we reject H0. There is enough evidence to claim that the proportion of patients on sodium restricted diets is higher for hypertensive patients than for non-hypertensive patients.

ii. P-value = 1 - Φ(2.71) = 0.0034. Since the P-value is less than 0.05, we reject the null hypothesis. There is enough evidence to claim that the proportion of patients on sodium restricted diets is higher for hypertensive patients than for non-hypertensive patients.

iii. A 95% CI on the difference in the two proportions of patients is

$$0.195 \pm 1.96\sqrt{\frac{0.436(0.564)}{55} + \frac{0.242(0.758)}{149}}, \quad\text{that is,}\quad 0.047 \le p_A - p_B \le 0.343$$

From the CI, since 0 is not in the interval, we reject H0. There is a significant difference between the two proportions of patients.
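The test statistic, P-value and confidence interval in this example can be recomputed directly from the counts given (24 of 55 and 36 of 149). The sketch below uses only the Python standard library and assumes the pooled standard error for the test and the unpooled standard error for the interval; the function name `two_proportion_z` is an illustrative choice.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2, alpha=0.05):
    """Upper-tailed z-test for p1 > p2 and two-sided CI on p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)                       # pooled proportion under H0
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z0 = (p1 - p2) / se_pool
    p_value = 1 - NormalDist().cdf(z0)                   # upper-tailed P-value
    se_ci = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # unpooled SE for the CI
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (p1 - p2 - z_crit * se_ci, p1 - p2 + z_crit * se_ci)
    return z0, p_value, ci


# Counts from the example: hypertensive (A) and non-hypertensive (B) patients.
z0, p_value, ci = two_proportion_z(24, 55, 36, 149)
print(f"z0 = {z0:.2f}, P-value = {p_value:.4f}")
print(f"95% CI on pA - pB: ({ci[0]:.3f}, {ci[1]:.3f})")
```

Because the interval excludes zero and the P-value is below 0.05, the decision is the same under all three approaches.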

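7.3 Testing the difference between two variances, σ1² and σ2²

If the parameter of interest is the difference between the variances of two populations, the test can be performed as follows:

Step 1: The parameters of interest are the two variances, σ1² and σ2².

Step 2: State the hypotheses:

Two tailed test, H0: σ1² = σ2² against H1: σ1² ≠ σ2²

Upper tailed test, H0: σ1² = σ2² against H1: σ1² > σ2²

Lower tailed test, H0: σ1² = σ2² against H1: σ1² < σ2²

Step 3: The test statistic is

$$F_0 = \frac{s_1^2}{s_2^2}$$

with n1 - 1 and n2 - 1 degrees of freedom.

Step 4: The critical value at significance level α is $f_{\alpha,\,n_1-1,\,n_2-1}$ for a one tailed (sided) test and $f_{\alpha/2,\,n_1-1,\,n_2-1}$ for a two tailed (sided) test.

Step 5: Determine one of the following.

i. Rejection region: reject H0 if

Two tailed test, $F_0 > f_{\alpha/2,\,n_1-1,\,n_2-1}$ or $F_0 < f_{1-\alpha/2,\,n_1-1,\,n_2-1}$

Upper tailed test, $F_0 > f_{\alpha,\,n_1-1,\,n_2-1}$

Lower tailed test, $F_0 < f_{1-\alpha,\,n_1-1,\,n_2-1}$

ii. 100(1-α)% Confidence Interval on the ratio of two variances: reject H0 if σ1²/σ2² = 1 falls outside of the interval

Two tailed test, $\dfrac{s_1^2}{s_2^2}\,f_{1-\alpha/2,\,n_2-1,\,n_1-1} \le \dfrac{\sigma_1^2}{\sigma_2^2} \le \dfrac{s_1^2}{s_2^2}\,f_{\alpha/2,\,n_2-1,\,n_1-1}$

Upper tailed test, $\dfrac{\sigma_1^2}{\sigma_2^2} \ge \dfrac{s_1^2}{s_2^2}\,f_{1-\alpha,\,n_2-1,\,n_1-1}$

Lower tailed test, $\dfrac{\sigma_1^2}{\sigma_2^2} \le \dfrac{s_1^2}{s_2^2}\,f_{\alpha,\,n_2-1,\,n_1-1}$

Step 6: Calculate the test statistic, F0, in Step 3.

Step 7: Make a decision or conclusion based on Step 5.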



Example 7.5:

A random sample of 12 air pollution index readings at UTP station produced a variance of 0.0340, while a random sample of another 13 air pollution index readings at Tronoh station produced a variance of 0.0525.

(i) Are the population variances equal? Use α = 0.05.

(ii) Find the 95% two-sided confidence interval on the ratio of the two variances.

Solution:

i. (1) The parameters of interest are the variances of the pollution indexes at the two stations, σ1² (UTP) and σ2² (Tronoh).

(2) Hypothesis testing: H0: σ1² = σ2² against H1: σ1² ≠ σ2²

(3) Test statistic:

$$F_0 = \frac{s_1^2}{s_2^2}$$

(4) Critical value: α = 0.05; from the F table, the upper critical value is f = 3.15 and the lower critical value is f = 0.32

(5) The critical region is reject H0 if F0 > 3.15 or F0 < 0.32

(6) Computation:

$$F_0 = \frac{0.0340}{0.0525} = 0.648$$

(7) Result and conclusions:

Since 0.32 < F0 < 3.15, we cannot reject H0. There is not enough evidence to say that the variances of the two pollution indexes are different.

ii. A 95% two-sided confidence interval on the ratio of the two variances of the pollution indexes is

$$\frac{s_1^2}{s_2^2}\,f_{0.975,\,12,\,11} \le \frac{\sigma_1^2}{\sigma_2^2} \le \frac{s_1^2}{s_2^2}\,f_{0.025,\,12,\,11}$$

From the CI, since σ1²/σ2² = 1 is in the interval, we cannot reject H0. There is not enough evidence to say that the variances of the two pollution indexes are different.
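The decision rule for the variance-ratio test is a simple comparison once the critical values have been read from the F table. The sketch below recomputes F0 from the two sample variances in the example and applies the two critical values quoted in the solution (0.32 and 3.15); the Python standard library has no F distribution, so those two numbers are taken from the text rather than computed. The function name `variance_ratio_test` is an illustrative choice.

```python
def variance_ratio_test(s1_sq, s2_sq, lower_crit, upper_crit):
    """Two-tailed F test decision given critical values read from an F table."""
    f0 = s1_sq / s2_sq                      # test statistic F0 = s1^2 / s2^2
    reject = f0 < lower_crit or f0 > upper_crit
    return f0, reject


# Sample variances from the example; critical values as quoted in the solution.
f0, reject = variance_ratio_test(0.0340, 0.0525, lower_crit=0.32, upper_crit=3.15)
print(f"F0 = {f0:.3f}, reject H0: {reject}")
```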

Exercise 7

1. A random sample of size n1 = 25, taken from a normal population with σ1 = 5.2, has a mean of 81. A second random sample of size n2 = 36, taken from a different normal population with σ2 = 3.4, has a mean of 76.
(a) Do the data indicate that the true mean values μ1 and μ2 are different? Carry out a test at α = 0.01.
(b) Find a 90% CI on the difference in means.

2. Two machines are used for filling plastic bottles with a net volume of 16.0 oz. The fill volume can be assumed normal with σ1 = 0.02 and σ2 = 0.025. A member of the quality engineering staff suspects that both machines fill to the same mean net volume, whether or not this volume is 16.0 oz. A random sample of 10 bottles is taken from the output of each machine with the following results:
(a) Do you think the engineer is correct? Use the P-value approach.
(b) Find a 95% CI on the difference in means.

3. Two machines are used to fill plastic bottles with dishwashing detergent. The standard deviations of fill volume are known to be σ1 = 0.01 and σ2 = 0.15 fluid ounce for the two machines, respectively. Two random samples of n1 = 12 bottles from machine 1 and n2 = 10 bottles from machine 2 are selected, and the sample mean fill volumes are x̄1 = 30.61 and x̄2 = 30.24 fluid ounces. Assume normality.
(a) Test the hypothesis that both machines fill to the same mean volume. Use the P-value approach.
(b) Construct a 90% two-sided CI on the mean difference in fill volume.
(c) Construct a 95% two-sided CI on the mean difference in fill volume. Compare and comment on the width of this interval relative to the width of the interval in part (b).

4. To find out whether a new serum will arrest leukemia, 9 mice, all with an advanced stage of the disease, are selected. 5 mice receive the treatment and 4 do not. Survival times, in years, from the time the experiment commenced are as follows:

Treatment       2.1  5.3  1.4  4.6  0.9
No treatment    1.9  0.5  2.8  3.1

At the 0.05 level of significance, can the serum be said to be effective? Assume the two distributions have equal variances.

5. A new policy regarding overtime pay was implemented. This policy decreased the pay factor for overtime work. Neither the staffing pattern nor the work loads changed. To determine if overtime loads changed under the policy, a random sample of employees was selected. Their overtime hours for a randomly selected week before and for another randomly selected week after the policy change were recorded as follows:

Employee   1   2   3   4   5   6   7   8   9   10  11  12
Before     5   4   2   8   10  4   9   3   6   0   1   5
After      3   7   5   3   7   4   4   1   2   3   2   2

Assume that the two population variances are equal and the underlying population is normally distributed.
(a) Is there any evidence to support the claim that the average number of hours worked as overtime per week changed after the policy went into effect? Use a P-value approach in arriving at this conclusion.
(b) Construct a 95% CI for the difference in mean before and after the policy change. Interpret this interval.

6. The diameter of steel rods manufactured on two different extrusion machines is being investigated. Two random samples of sizes n1 = 15 and n2 = 17 are selected, and

respectively. Assume that the data are drawn from normal distributions with equal variances.
(a) Is there evidence to support the claim that the two machines produce rods with different mean diameters? Use the P-value approach.
(b) Construct a 95% CI on the difference in mean rod diameter.

7. The following data represent the running times of films produced by 2 motion-picture companies. Test the hypothesis that the average running time of films produced by company 2 exceeds the average running time of films produced by company 1 by 10 minutes against the one-sided alternative that the difference is less than 10 minutes. Use α = 0.01 and assume the distributions of times to be approximately normal with unequal variances.

Company   Time
X1        102  86   98  109  92
X2        81   165  97  134  92  87  114

8. Two companies manufacture a rubber material intended for use in an automotive application. 25 samples of material from each company are tested, and the amount of wear after 1000 cycles is observed. For company 1, the sample mean and standard deviation of wear are and for company 2, we obtain

(a) Do the sample data support the claim that the two companies produce material with different mean wear? Assume each population is normally distributed but with unequal variances.

(b) Construct a 95% CI for the difference in mean wear of these two companies. Interpret this interval.

9. Professor A claims that a probability and statistics student can increase his or her score on tests if the person is provided with a pre-test the week before the exam. To test her theory she selected 16 probability and statistics students at random and gave these students a pre-test the week before an exam. She also selected an independent random sample of 12 students who were given the same exam but did not have access to the pre-test. The first group had a mean score of 79.4 with standard deviation 8.8. The second group had a sample mean score of 71.2 with standard deviation 7.9.
(a) Do the data support Professor A's claim that the mean score of students who get a pre-test is different from the mean score of those who do not get a pre-test before an exam? Use the P-value approach and assume that their variances are not equal.
(b) Construct a 95% CI for the difference in mean score of students who get a pre-test and those who do not get a pre-test before an exam. Interpret this interval.

10. A vote is to be taken among residents of a town and the surrounding county to determine whether a proposed chemical plant should be constructed. If 120 of 200 town voters favour the proposal and 240 of 500 county residents favour it, would you agree that the proportion of town voters favouring the proposal is higher than the proportion of county voters? Use α = 0.05.

11. The rollover rate of sport utility vehicles is a transportation safety issue. Safety advocates claim that manufacturer A's vehicle has a higher rollover rate than that of manufacturer B. One hundred crashes for each of these vehicles were examined. The rollover rates were pA = 0.35 and pB = 0.25.
(a) By using the P-value approach, does manufacturer A's vehicle have a higher rollover rate than manufacturer B's?
(b) Construct a 95% one-sided CI on the difference in the two rollover rates of the vehicles. Interpret this interval.

12. Professor Rady gave 58 A's and B's to a class of 125 students in his section of English 101. The next term Professor Hady gave 45 A's and B's to a class of 115 students in his section of English 101.
(a) By using a 5% significance level, test the claim that Professor Rady gives a higher percentage of A's and B's in English 101 than Professor Hady does. What is your comment?
(b) Construct a 95% one-sided CI on the difference in the percentage of A's and B's in English 101 given by these two professors.

13. The diameter of steel rods manufactured on two different extrusion machines is being investigated. Two random samples of sizes n1 = 15 and n2 = 17 are selected, and

respectively.
(a) Is there evidence to conclude that the variance of the diameter of steel rods is different for the two machines? Use the P-value approach.
(b) Construct a 95% two-sided CI on the difference in mean rod diameter.

14. Professor A claims that a probability and statistics student can increase his or her score on tests if the person is provided with a pre-test the week before the exam. To test her theory she selected 16 probability and statistics students at random and gave these students a pre-test the week before an exam. She also selected an independent random sample of 12 students who were given the same exam but did not have access to the pre-test. The first group had a mean score of 79.4 with standard deviation 8.8. The second group had a sample mean score of 71.2 with standard deviation 7.9.
(a) Do the data support Professor A's claim that the mean score of students who get a pre-test is different from the mean score of those who do not get a pre-test before an exam? Use the P-value approach and assume that their variances are not equal.
(b) Construct a 95% two-sided CI for the difference in mean score of students who get a pre-test and those who do not get a pre-test before an exam. Interpret this interval.

15. The melting points of two alloys were investigated by melting 15 samples of each material. The sample standard deviation for alloy 1 was 2.34 °F and for alloy 2 was 2.5 °F.
(a) Do the sample data support a claim that both alloys have the same variance of melting point? Use α = 0.05.
(b) Construct a 95% two-sided confidence interval on the ratio of the two variances.

16. A study was conducted to test whether there is a difference between the variances of petrol consumption for two types of petrol, RON95 and RON97. Five cars were selected at random and the petrol consumption in km/liter for each petrol type was obtained as follows:

         Km per liter
         RON95   RON97
Car 1    8.9     9.2
Car 2    7.5     7.8
Car 3    8.2     8.5
Car 4    8.6     8.8
Car 5    9.5     9.4

(a) Do the sample data support a claim that both petrol types have the same variance of petrol consumption? Use α = 0.05.

(b) Construct a 95% two-sided confidence interval on the ratio of the two variances.

oooOOOooo


Chapter 8

8. Simple Linear Regression

Learning Outcomes:

At the end of the lesson, the student should be able to

- Use the least squares method to estimate the intercept and slope of the linear regression model
- Carry out tests to determine if the model obtained is an adequate fit to the data
- Construct confidence intervals on regression parameters


8.1 Introduction

Regression models are statistical models that describe how one (or more) variable(s) vary when one or more other variable(s) vary. Inference based on such models is known as regression analysis. There are two types of linear regression model: the simple linear regression model and the multiple linear regression model.

Simple linear regression is a linear regression model with a single predictor variable. In other words, simple linear regression fits a straight line through the set of n points in such a way that the sum of squared residuals of the model (that is, the squared vertical distances between the data points and the fitted line) is as small as possible.

The word simple refers to the fact that this regression model has only one response or dependent variable, y, and one independent variable, x, also referred to as the regressor or predictor variable. The slope of the fitted line is equal to the correlation between y and x multiplied by the ratio of the standard deviations of these variables (that of y over that of x). The intercept of the fitted line is such that the line passes through the centre of mass (x̄, ȳ) of the data points. The statistical relation between x and y may be expressed as follows:

$$Y = \beta_0 + \beta_1 x + \varepsilon$$ (8.1)

where β0 is called the intercept of the regression and β1 is the slope of the regression. These two parameters are called the regression coefficients. The slope, β1, can be interpreted as the change in the mean value of Y for a unit change in x.

The random error term, ε, is assumed to follow the normal distribution with a mean of 0 and variance of σ². Since Y is the sum of this random term and the mean value, E(Y) = β0 + β1x (which is a constant), the variance of Y at any given value of x is also σ². Therefore, at any given value of x, say xi, the dependent variable Y follows a normal distribution with a mean of β0 + β1xi and a standard deviation of σ.

8.2 Fitted Regression Model

The true regression line corresponding to Equation (8.1) is usually never known. However, the regression line can be estimated by estimating the coefficients β0 and β1 for an observed data set. The estimates, β̂0 and β̂1, are calculated using the least squares method. The estimated regression line, obtained using the values of β̂0 and β̂1, is called the fitted line. The least squares estimates, β̂0 and β̂1, are given by

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$$ (8.2)

and

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$ (8.3)

where $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ is the mean of all the observed values and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the mean of all values of the predictor variable at which the observations were taken.

Once β̂0 and β̂1 are known, the fitted regression model can be written as:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$ (8.4)

where ŷ is the fitted or estimated value based on the fitted regression model. It is an estimate of the mean value, E(Y). The fitted value, ŷi, for a given value of the predictor variable, xi, may be different from the corresponding observed value, yi. The difference between the two values is called the residual,

$$e_i = y_i - \hat{y}_i$$ (8.5)

8.3 Assessment of the Regression Model

The fitted regression model, Equation (8.4), can be used to estimate the value of the response, y, for a given value of the predictor variable, x.

The regression model can be evaluated by three methods of assessment: the error of estimate, σ̂, the coefficient of determination, R2, and testing the slope of the regression.

8.3.1 Method 1: The error of estimate, σ̂

The error of estimate is the square root of the error sum of squares divided by the error degrees of freedom, n-2, i.e. $\hat{\sigma} = \sqrt{SSE/(n-2)}$. The smaller σ̂ is, the more successful the linear regression model is in explaining the response, y.

8.3.2 Method 2: The coefficient of determination, R2

The coefficient of determination can be interpreted as the proportion of variability in the observed response variable that is explained by the linear regression model. It measures the strength of the linear relationship and is given by

R2 = 1 - SSE/SST

The greater R2 is (the closer to 1), the more successful the linear regression model.

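8.3.3 Method 3: Testing the slope, β1

The significance of the fitted regression model can be tested by using Student's t-test on the parameter β1. The hypotheses are H0: β1 = 0 against H1: β1 ≠ 0, and the test statistic is $t_0 = \hat{\beta}_1 / se(\hat{\beta}_1)$. If $|t_0| > t_{\alpha/2,\,n-2}$, then the null hypothesis, H0: β1 = 0, is rejected. It means that there is a significant linear relationship between x and y and the regression model is useful; otherwise there is not enough evidence of a linear relationship between x and y.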

8.3.4 The ANOVA Approach

The analysis of variance (ANOVA) can also be used to test for the significance of regression as in the table below:

ANOVA Table for simple linear regression:

Source of variation   Degree of freedom   Sum of squares   Mean of squares     F
Regression            1                   SSR              MSR = SSR/1         MSR/MSE
Error                 n-2                 SSE              MSE = SSE/(n-2)
Total                 n-1                 SST

where

$SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$ is called the regression sum of squares, which measures the variability explained by the regression model;

$SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ is called the error sum of squares, which measures the unexplained variability in the response;

$SST = \sum_{i=1}^{n}(y_i - \bar{y})^2$ is called the total sum of squares, which measures the total variability in the response.

SST can be written as,

$$SST = SSR + SSE$$

If the value of F is large compared to $f_{\alpha,\,1,\,n-2}$, that is $F > f_{\alpha,\,1,\,n-2}$, then the regression model is significant.

8.4 Confidence intervals on regression parameters

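A 100(1-α)% confidence interval on the slope, β1, in a simple linear regression is given by

$$\hat{\beta}_1 - t_{\alpha/2,\,n-2}\,se(\hat{\beta}_1) \le \beta_1 \le \hat{\beta}_1 + t_{\alpha/2,\,n-2}\,se(\hat{\beta}_1)$$

Similarly, a 100(1-α)% confidence interval on the intercept, β0, is given by

$$\hat{\beta}_0 - t_{\alpha/2,\,n-2}\,se(\hat{\beta}_0) \le \beta_0 \le \hat{\beta}_0 + t_{\alpha/2,\,n-2}\,se(\hat{\beta}_0)$$

where $se(\hat{\beta}_1) = \sqrt{\hat{\sigma}^2 / S_{xx}}$ and $se(\hat{\beta}_0) = \sqrt{\hat{\sigma}^2\left[\dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{xx}}\right]}$, with $S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2$.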

Example 8.1

The following measurements of the specific heat of a certain chemical were made in order to investigate the variation in specific heat with temperature.

Temperature (°C)   0      10     20     30     40     50
Specific heat      0.51   0.55   0.57   0.59   0.63   0.65

i. Plot the points on a scatter diagram.
ii. Estimate the regression line of specific heat on temperature.
iii. Estimate the value of the specific heat when the temperature is 35 °C.
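Parts (ii) and (iii) of Example 8.1 can be worked through with the least squares formulas for the slope and intercept. The following is a minimal sketch in plain Python; the function name `least_squares` is an illustrative choice, and the slope is computed as the sum of cross-products over the sum of squares of x.

```python
def least_squares(x, y):
    """Least squares intercept and slope for the line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)                        # sum of squares of x
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # sum of cross-products
    b1 = sxy / sxx             # slope estimate
    b0 = y_bar - b1 * x_bar    # intercept estimate
    return b0, b1


# Data from Example 8.1.
temperature = [0, 10, 20, 30, 40, 50]
specific_heat = [0.51, 0.55, 0.57, 0.59, 0.63, 0.65]

b0, b1 = least_squares(temperature, specific_heat)
prediction_35 = b0 + b1 * 35   # fitted specific heat at 35 degrees C
print(f"fitted line: y = {b0:.4f} + {b1:.5f} x")
print(f"estimated specific heat at 35 C: {prediction_35:.3f}")
```

Since 35 °C lies inside the observed temperature range, the prediction is an interpolation rather than an extrapolation.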

Example 8.2

The following data were collected on 8 lung cancer patients, where x measures the number of years the patient has smoked cigarettes (or any form of nicotine product) and y is the physician's subjective evaluation of the extent of lung damage on a scale of 0 to 100.

x (years)   25  35  22  15  48  39  42  31
y (0-100)   55  60  50  30  75  70  71  55

An analysis of variance is conducted using a statistical software package and the output is displayed in the table below:

The regression equation is
y = 21.228 + 1.230x

Predictor   Coef     SE Coef   T       p-value
Constant    21.228   9.442     2.248   0.066
x           1.230    0.280     4.397   0.005

S = 8.17169   R-squared = ?   R-squared (adj) = 0.724

Analysis of Variance
Source           Df   SS        MS        F        p-value
Regression       1    1290.84   1290.84   19.331   0.005
Residual Error   6    400.66    66.777
Total            7    1691.50

i. Estimate the predicted physician scores, on the extent of lung damage, for two patients who have smoking habits of 20 and 40 years respectively.

ii. Obtain a 95% confidence interval for the true slope β.

iii. Estimate the coefficient of determination (R2). Discuss briefly the value obtained.

iv. Conduct a hypothesis test on the significance of the regression at level of significance α = 0.05.

v. Find a 95% confidence interval for the physician's subjective evaluation of the extent of lung damage.
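Several of the quantities asked for in Example 8.2 follow directly from the software output. The sketch below takes the coefficients and sums of squares from that output and the table value t0.025,6 = 2.447 (read from a t-table, since the Python standard library has no t distribution); the variable names are illustrative.

```python
from math import sqrt

# Values read from the software output in Example 8.2.
b0, b1, se_b1 = 21.228, 1.230, 0.280
ssr, sse, sst = 1290.84, 400.66, 1691.50
n = 8
t_crit = 2.447  # t_{0.025, 6} from a t-table (n - 2 = 6 degrees of freedom)

pred_20 = b0 + b1 * 20                 # predicted score after 20 years of smoking
pred_40 = b0 + b1 * 40                 # predicted score after 40 years of smoking
slope_ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)  # 95% CI on the slope
r_squared = ssr / sst                  # equivalently 1 - SSE/SST
f0 = (ssr / 1) / (sse / (n - 2))       # ANOVA F statistic, MSR/MSE
sigma_hat = sqrt(sse / (n - 2))        # error of estimate

print(f"predictions: {pred_20:.3f}, {pred_40:.3f}")
print(f"95% CI on slope: ({slope_ci[0]:.3f}, {slope_ci[1]:.3f})")
print(f"R^2 = {r_squared:.3f}, F = {f0:.2f}, sigma_hat = {sigma_hat:.3f}")
```

The recomputed F statistic and error of estimate agree with the F and S values printed in the output, which is a useful consistency check on the table.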

Exercise 8

1. The manager of a car plant wishes to investigate how the plant's electricity usage depends upon the plant's production. The data are given below.

Production, x (RM million)   4.51  3.58  4.31  5.06  5.64  4.99  5.29  5.83  4.7   5.61  4.9   4.2
Electricity usage, y         2.48  2.26  2.47  2.77  2.99  3.05  3.18  3.46  3.03  3.26  2.67  2.53

(a) Estimate the linear regression equation.
(b) Estimate the electricity usage when x = 5.
(c) Find a 90% confidence interval for the electricity usage.

2. An experiment was set up to investigate the variation of the specific heat of a certain chemical with temperature. Two measurements were taken at each temperature. The data are given below.

Temperature, x (°F)   50     60     70     80     90     100
Heat, y               1.60   1.63   1.67   1.70   1.71   1.71
                      1.64   1.65   1.67   1.72   1.72   1.74

(a) Estimate the linear regression equation.
(b) Plot the results on a scatter diagram.
(c) Estimate the specific heat when the temperature is 75 °F.
(d) Find a 95% confidence interval for the specific heat.

3. An engineer at a semiconductor company wants to model the relationship between the device HFE (y) and the parameter Emitter-RS (x1). Data for Emitter-RS were first collected, a statistical analysis was carried out, and the output is displayed in the table given.

Regression Analysis: y = 1075.2 - 63.87x1

Predictor   Coef     SE Coef   T       P-value
Constant    1075.2   121.1     8.88    0.000
x1          -63.87   8.002     -7.98   0.000

S = 19.4   R-Sq = 0.78

Analysis of variance
Source       DF   SS      MS      F
Regression   1    23965   23965   63.70
Residual     18   6772    376
Total        19   30737

(a) Estimate HFE when the Emitter-RS is 14.5.
(b) Obtain a 95% confidence interval for the true slope β.
(c) Test for significance of regression for α = 0.05.

4. A chemical engineer wants to model the relationship between the purity of oxygen (y) produced in a chemical distillation process and the percentage of hydrocarbons (x) that are present in the main condenser of the distillation unit. A statistical analysis is carried out and the output is displayed in the table given.

Regression Analysis: y = 74.3 + 14.9x

Predictor   Coef     SE Coef   T       P-value
Constant    74.283   1.593     46.62   0.000
x           14.947   1.317     11.35   0.000

S = 1.087   R-Sq = 87.7%

Analysis of variance
Source       DF   SS       MS       F
Regression   1    152.13   152.13   128.86
Residual     18   21.25    1.18
Total        19   173.38

(a) Estimate the purity of oxygen when the percentage of hydrocarbons is 1%.
(b) Obtain a 95% confidence interval for the true slope β.
(c) Test for significance of regression for α = 0.05.

5. Regression methods were used to analyze the data from a study investigating the relationship between roadway surface temperature (x) and pavement deflection (y). The data follow.

Temperature x Deflection y Temperature x Deflection y

70.0 0.621 72.7 0.637

77.0 0.657 67.8 0.627

72.1 0.640 76.6 0.652

72.8 0.623 73.4 0.630

78.3 0.661 70.5 0.627

74.5 0.641 72.1 0.631

74.0 0.637 71.2 0.641

72.4 0.630 73.0 0.631

75.2 0.644 72.7 0.634

76.0 0.639 71.4 0.638

(a) Estimate the intercept and slope regression coefficients. Write the estimated regression line.
(b) Compute SSE and estimate the variance.
(c) Find the standard error of the slope and intercept coefficients.
(d) Show that
(e) Compute the coefficient of determination, R2. Comment on the value.
(f) Use a t-test to test for significance of the intercept and slope coefficients at significance level α. Give the P-values of each and comment on your results.
(g) Construct the ANOVA table and test for significance of regression using the P-value. Comment on your results and their relationship to your results in part (f).
(h) Construct 95% CIs on the intercept and slope. Comment on the relationship of these CIs and your findings in parts (f) and (g).

6. The designers of a database information system that allows its users to search backwards for several days wanted to develop a formula to predict the time it would take to search. The actual elapsed time was measured for several different values of days. The measured data are shown in the following table:

Number of Days   1      2      4      8      16     25
Elapsed Time     0.65   0.79   1.36   2.26   3.59   5.39

(a) Estimate the intercept and slope regression coefficients. Write the estimated regression line.
(b) Compute SSE and estimate the variance.
(c) Find the standard error of the slope and intercept coefficients.
(d) Show that
(e) Compute the coefficient of determination, R2. Comment on the value.
(f) Use a t-test to test for significance of the intercept and slope coefficients at significance level α. Give the P-values of each and comment on your results.
(g) Construct the ANOVA table and test for significance of regression using the P-value. Comment on your results and their relationship to your results in part (f).
(h) Construct 95% CIs on the intercept and slope. Comment on the relationship of these CIs and your findings in parts (f) and (g).

Chapter 9

9. Multiple Linear Regression

Learning Outcomes:

At the end of the lesson, the student should be able to

- Use the least squares method to estimate a multiple linear model
- Carry out tests to determine if the model obtained is an adequate fit to the data


9.1 Introduction

Multiple regression (the term was first used by Pearson, 1908) is used to learn more about the relationship between several independent or predictor variables and a dependent or criterion variable. It is an extension of the simple linear regression model. Consider data consisting of n sets of values

$$(x_{1i}, x_{2i}, \ldots, x_{ki}, y_i), \qquad i = 1, 2, \ldots, n$$

The value of the dependent variable yi is modeled as

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + \varepsilon_i$$ (9.1)

where β0, β1, ..., βk are called the regression coefficients, which can be estimated by using the least squares method. The random error term, εi, is assumed to follow the normal distribution with a mean of 0 and variance of σ².

9.2 Fitted Regression Model

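The true regression model corresponding to Equation (9.1) is usually never known. However, the regression model can be estimated by estimating the coefficients β0, β1, ..., βk for an observed data set. The estimated regression model obtained using the values of β̂0, β̂1, ..., β̂k is called the fitted model. The least squares estimates, β̂0, β̂1, ..., β̂k, are given by the solution of the normal equations

$$\begin{aligned}
n\hat{\beta}_0 + \hat{\beta}_1\sum x_{1i} + \cdots + \hat{\beta}_k\sum x_{ki} &= \sum y_i \\
\hat{\beta}_0\sum x_{1i} + \hat{\beta}_1\sum x_{1i}^2 + \cdots + \hat{\beta}_k\sum x_{1i}x_{ki} &= \sum x_{1i}y_i \\
&\ \ \vdots \\
\hat{\beta}_0\sum x_{ki} + \hat{\beta}_1\sum x_{ki}x_{1i} + \cdots + \hat{\beta}_k\sum x_{ki}^2 &= \sum x_{ki}y_i
\end{aligned}$$

These equations can be solved by using matrices. Then we have the fitted regression model as given below

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_k x_k$$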
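As an illustration of solving the least squares equations with matrices, the sketch below forms the normal equations (X'X)b = X'y and solves them by Gaussian elimination in plain Python. The function name `fit_multiple_regression` and the data are hypothetical: the responses were generated exactly from y = 1 + 2x1 + 3x2, so the fit should recover those coefficients. In practice a statistical package would normally be used.

```python
def fit_multiple_regression(rows, y):
    """Solve the normal equations (X'X) b = X'y; returns [b0, b1, ..., bk]."""
    X = [[1.0, *row] for row in rows]  # design matrix with a column of ones
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    aug = [xtx[i] + [xty[i]] for i in range(p)]  # augmented matrix [X'X | X'y]
    # Forward elimination with partial pivoting (fails if X'X is singular).
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, p):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, p + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    b = [0.0] * p
    for i in reversed(range(p)):
        tail = sum(aug[i][j] * b[j] for j in range(i + 1, p))
        b[i] = (aug[i][p] - tail) / aug[i][i]
    return b


# Hypothetical data generated from y = 1 + 2*x1 + 3*x2 (no error term).
rows = [(1, 1), (1, 2), (2, 1), (2, 3), (3, 2), (4, 4)]
y = [6, 9, 8, 14, 13, 21]
coeffs = fit_multiple_regression(rows, y)
print([round(c, 6) for c in coeffs])
```

With real data the responses include random error, so the estimates will not reproduce any underlying coefficients exactly.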

9.3 Assessment of the Regression Model

The ANOVA Approach

The analysis of variance (ANOVA) can also be used to test for the significance of regression, as in the table below.

ANOVA table for multiple linear regression

Source of variation    Degrees of freedom    Sum of squares    Mean square                F
Regression             k                     SSR               MSR = SSR/k                MSR/MSE
Error                  n - (k+1)             SSE               MSE = SSE/[n - (k+1)]
Total                  n - 1                 SST

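where

$SSR = \sum_{i=1}^{n} (\hat y_i - \bar y)^2$ is called the regression sum of squares, which measures the variability explained by the regression model;

$SSE = \sum_{i=1}^{n} (y_i - \hat y_i)^2$ is called the error sum of squares, which measures the unexplained variability in the response;

$SST = \sum_{i=1}^{n} (y_i - \bar y)^2$ is called the total sum of squares, which measures the total variability in the response.

SST can be written as

$$SST = SSR + SSE$$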

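The ANOVA table is used to test for significance of regression.

The hypotheses are:

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0 \qquad H_1: \beta_j \neq 0 \text{ for at least one } j$$

The test statistic is $F_0 = MSR/MSE$.

If the value of $F_0$ is larger than $f_{\alpha,\,k,\,n-(k+1)}$, then $H_0$ is rejected and the multiple regression model is significant.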
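The quantities in the ANOVA table can be computed directly from a fitted model. The sketch below assumes NumPy; the function name `regression_anova` and the use of the Exercise 9, Question 1 data are illustrative choices, not part of the original notes. It returns the sums of squares, mean squares and the statistic F0, which would then be compared with the tabulated critical value.

```python
import numpy as np

def regression_anova(x_columns, y):
    """ANOVA quantities for a multiple linear regression with k predictors."""
    y = np.asarray(y, dtype=float)
    n, k = len(y), len(x_columns)
    X = np.column_stack([np.ones(n)] +
                        [np.asarray(c, dtype=float) for c in x_columns])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)   # least squares estimates
    y_hat = X @ b                               # fitted values
    ssr = float(np.sum((y_hat - y.mean()) ** 2))  # explained variability
    sse = float(np.sum((y - y_hat) ** 2))         # unexplained variability
    sst = float(np.sum((y - y.mean()) ** 2))      # total variability
    msr = ssr / k
    mse = sse / (n - (k + 1))
    return {"SSR": ssr, "SSE": sse, "SST": sst,
            "MSR": msr, "MSE": mse, "F0": msr / mse}

# Data of Exercise 9, Question 1 (y, x1, x2).
y = [1.6, 2.1, 2.4, 2.8, 3.6, 3.8, 4.3, 4.9, 5.7, 5.0]
x1 = [1, 1, 2, 2, 2, 3, 2, 4, 4, 3]
x2 = [1, 2, 1, 2, 3, 2, 4, 2, 3, 4]

table = regression_anova([x1, x2], y)
for name, value in table.items():
    print(f"{name:>4} = {value:.4f}")
```

A useful check on any such computation is that SSR + SSE reproduces SST up to rounding.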

Example 9.1:

A set of experimental runs was made to determine a way of predicting cooking time y at various levels of oven width x1 and temperature x2. The data were recorded as follows:

i. Estimate the multiple linear regression equation for the data.
ii. Test whether the regression explained by the model obtained in part (i) is significant at the 0.01 level of significance.

Solution:

i. Using the computer for computations, the following results were observed.

The regression equation is

Cooking time = 0.568 + 2.706 width + 2.051 temperature

Predictor Coef SE Coef T P

Constant 0.568 0.585 0.970 0.364

Width 2.706 0.194 13.935 0.000

Temp. 2.051 0.046 44.380 0.000

S = 0.6334 R-Sq = 100% R-Sq(adj) = 100%

ii. The following ANOVA table is obtained

Analysis of Variance

Source DF SS MS F P

Regression 2 10953.334 5476.667 13647.872 0.000

Residual Error 7 2.809 0.401

Total 9 10956.143

Since F = 13647.872 > $f_{0.01,\,2,\,7}$ = 9.55, we reject $H_0$. This means that the regression is significant at the 0.01 level of significance.
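The arithmetic of the ANOVA table in Example 9.1 can be checked with a few lines of Python. This is only a sketch: the sums of squares are copied from the printed table, and the critical value 9.55 is the tabulated value quoted above. Because the printed entries are rounded, the recomputed F may differ slightly from the printed one in its last digits.

```python
# Values copied from the ANOVA table of Example 9.1.
ssr, sse = 10953.334, 2.809
n, k = 10, 2            # 10 runs, 2 predictors (width and temperature)
f_crit = 9.55           # tabulated f(0.01; 2, 7) quoted in the solution

msr = ssr / k                   # mean square for regression
mse = sse / (n - (k + 1))       # mean square error, 7 degrees of freedom
f0 = msr / mse                  # test statistic

print(f"MSR = {msr:.3f}, MSE = {mse:.3f}, F0 = {f0:.1f}")
print("reject H0" if f0 > f_crit else "do not reject H0")
```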


Exercise 9

1. Given the data:

Test Number    y      x1    x2
1              1.6    1     1
2              2.1    1     2
3              2.4    2     1
4              2.8    2     2
5              3.6    2     3
6              3.8    3     2
7              4.3    2     4
8              4.9    4     2
9              5.7    4     3
10             5      3     4

(a) Fit a multiple linear regression model to these data.

2. Given the data:

Observation Number    Pull Strength y    Wire Length x1    Die Height x2
1                     9.95               2                 50
2                     24.45              8                 110
3                     31.75              11                120
4                     35.00              10                550
5                     25.02              8                 295
6                     16.86              4                 200
7                     14.38              2                 375
8                     9.60               2                 52
9                     24.35              9                 100
10                    27.50              8                 300
11                    17.08              4                 412
12                    37.00              11                400
13                    41.95              12                500
14                    11.66              2                 360
15                    21.65              4                 205
16                    17.89              4                 400
17                    69.00              20                600
18                    10.30              1                 585
19                    34.93              10                540
20                    46.59              15                250
21                    44.88              15                290
22                    54.12              16                510
23                    56.63              17                590
24                    22.13              6                 100
25                    21.15              5                 400

(a) Fit a multiple linear regression model to these data.

3. A study was performed to investigate the shear strength of soil (y) as it related to depth in meter (x1) and percentage moisture content (x2). Ten observations were collected and the following summary quantities obtained:

(a) Estimate the parameters to fit the multiple regression model for these data.
(b) What is the predicted strength when x1 = 18 meters and x2 = 43%?

4. A set of experimental runs was made to determine a way of predicting cooking time y at various levels of oven width x1 and temperature x2. The data were recorded as follows:



5. An article in Optical Engineering (“Operating Curve Extraction of a Correlator's Filter,” Vol. 43, 2004, pp. 2775–2779) reported the use of an optical correlator to perform an experiment by varying brightness and contrast. The resulting modulation is characterized by the useful range of gray levels. The data are shown below:

Brightness (%):       54    61    65    100   100   100   50    57    54
Contrast (%):         56    80    70    50    65    80    25    35    26
Useful range (ng):    96    50    50    112   96    80    155   144   255


(b) Estimate $\sigma^2$ and the standard errors of the regression coefficients.
(c) Test for significance of $\beta_1$ and $\beta_2$.
(d) Predict the useful range when brightness = 80 and contrast = 75. Construct a 95% PI.
(e) Compute the mean response of the useful range when brightness = 80 and contrast = 75. Compute a 95% CI.
(f) Interpret parts (d) and (e) and comment on the comparison between the 95% PI and 95% CI.

6. A study was performed on wear of a bearing y and its relationship to x1 = oil viscosity and x2 = load. The following data were obtained:

x1 1.6 15.5 22.0 43.0 33.0 40.0

x2 851 816 1058 1201 1357 1115

y 293 230 172 91 113 125

(a) Fit a multiple regression model to these data.
(b) Estimate $\sigma^2$ and the standard errors of the regression coefficients.
(c) Use the model to predict wear when x1 = 25 and x2 = 1000.
(d) Fit a multiple regression model with an interaction term to these data.
(e) Estimate $\sigma^2$ and $se(\hat\beta_j)$ for this new model. How did these quantities change? Does this tell you anything about the value of adding the interaction term to the model?
(f) Use the model in (d) to predict wear when x1 = 25 and x2 = 1000. Compare this prediction with the predicted value from part (c) above.

Chapter 10

10. Factorial Experiment

At the end of the lesson, the student should be able to:

• Design and conduct factorial experiments involving two factors using factorial design.
• Analyze and interpret main effects and interactions.
• Understand how to use ANOVA to analyze data from these experiments.

10.1 Introduction

Experimental design techniques based on statistics are useful in engineering for improving the performance of manufacturing processes. By using designed experiments, we can determine which subset of the process variables has the most influence on process performance. Among the advantages of experimental design are improved process yield, reduced variability in the process, reduced design and development time, and reduced cost of operation. Statistically designed experiments allow efficiency and economy in the experimental process. If data are collected without an experimental design, it may not be possible to extract the desired information, and the results of the analysis may be confusing, misleading, not credible and not reproducible. When several factors are of interest in an experiment, a factorial experiment should normally be used.

10.2 Terminology and definition

Some terminology used in experimental design:

• Factor: a variable whose influence upon the response variable is being studied in the experiment.
• Factor level: the different modes or settings of a factor.
• Trial (or run): the application of a treatment to an experimental unit.
• Treatment (or level combination): a specific combination of the levels of different factors.
• Experimental units (subjects): the basic units on which the response measurements are collected.
• Replicates: the number of experimental units to which a particular treatment is applied.
• Factorial experiment: an experiment in which, in each complete replicate, all possible combinations of the levels of the factors are investigated.
• Effect of a factor: the change in response produced by a change in the level of the factor.
• Main effect: the effect on the response variable of one of the primary factors in the study.
• Interaction effect: the change in the response variable that is due to an interaction between the factors.

10.3 The $2^k$ factorial design

Factorial designs can be used to identify factors with significant effects on the response, to discover interactions among factors, to identify which factors have the most important effects on the response and, finally, to decide whether further investigation of a factor’s effect is justified. The $2^k$ factorial design means that the experiment has been set up with k factors, each at 2 levels. The objective is to determine which of the main effects and interactions of the k factors are important.

The Analysis of Variance (ANOVA) can be used to analyze the data from experimental designs. In the ANOVA, the null hypothesis that an effect is equal to 0 is tested. When $H_0$ is rejected, this provides evidence that the factor involved actually affects the outcome (response). However, some assumptions should be made before the analysis is done: the same number of replicates for each treatment, at least 2 replicates in each cell, and each treatment being a random sample from a normal population.

10.4 The $2^2$ factorial design

The $2^2$ factorial design means that the experiment has been set up with 2 factors, each at 2 levels. The objective is to determine which of the two main effects and the interaction are important. As an example, consider an experiment in which factor A is reaction time and factor B is reaction temperature. For factor A, the two levels are 1 hour and 2 hours. These levels can also be denoted as - (minus) for one level and + (plus) for the other. For factor B, the two levels are 35°C (-) and 55°C (+). This is summarized in the table below:

                              Factor A (Time)
Factor B (Temperature)        1 hour (-)         2 hours (+)
35°C (-)                      Yields measured    Yields measured
55°C (+)                      Yields measured    Yields measured

where n = 3 replications and $x_{ijk}$, k = 1, ..., n, are the observations in cell (i, j).

The levels can be variable data (numbers) or attribute data such as male and female, or on and off. Normally one level is designated as high (+) and the other as low (-), as shown in the table below:

                              Factor A (Time)
Factor B (Temperature)        Low (-)    High (+)
Low (-)                       (1)        a
High (+)                      b          ab

All possible treatment combinations of the levels of the factors, that is, the factorial experiment, are given in a design (or test) matrix for the $2^2$ factorial design as follows:

Treatment combination    Factorial effect
                         A    B    AB
(1)                      -    -    +
a                        +    -    -
b                        -    +    -
ab                       +    +    +

The letters (1), a, b and ab represent the totals of all n observations at each treatment combination.
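The design matrix can also be generated programmatically, which becomes convenient when k grows. The sketch below uses only the Python standard library; the function names `sign_table` and `show` are hypothetical. It lists the treatment combinations in the same order as the table above and builds the AB column as the product of the A and B signs.

```python
from itertools import product

def sign_table(k=2):
    """Return (label, signs) rows of a 2^k design in standard order.

    signs is a tuple of -1/+1 levels, one per factor (factor A first).
    """
    letters = "abcdefghijklmnopqrstuvwxyz"[:k]
    rows = []
    for levels in product((-1, +1), repeat=k):
        signs = levels[::-1]  # reverse so that factor A changes fastest
        # A treatment is labelled by the letters of the factors at the high level.
        label = "".join(ch for ch, s in zip(letters, signs) if s > 0) or "(1)"
        rows.append((label, signs))
    return rows

def show(sign):
    return "+" if sign > 0 else "-"

print(f"{'Treatment':<10}{'A':>3}{'B':>3}{'AB':>4}")
for label, (a_sign, b_sign) in sign_table(2):
    ab_sign = a_sign * b_sign  # interaction column is the product of A and B
    print(f"{label:<10}{show(a_sign):>3}{show(b_sign):>3}{show(ab_sign):>4}")
```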

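10.5 Estimating the effects of factors in the $2^2$ factorial design

The main effects of factor A and factor B and the effect of the AB interaction can be calculated using the formulas below.

Effect of main factor A is:

$$A = \frac{1}{2n}\,[\,a + ab - b - (1)\,]$$

Effect of main factor B is:

$$B = \frac{1}{2n}\,[\,b + ab - a - (1)\,]$$

Effect of AB interaction is:

$$AB = \frac{1}{2n}\,[\,(1) + ab - a - b\,]$$

where the quantity in square brackets [...] is called the contrast.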
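A short sketch of the effect calculation follows. Each effect is its contrast divided by 2n, where the contrast for A is a + ab - b - (1), for B is b + ab - a - (1), and for AB is (1) + ab - a - b. The function name `effects_2x2` is hypothetical; the totals used in the call are those of Example 1 later in this chapter (n = 3 replicates).

```python
def effects_2x2(one, a, b, ab, n):
    """Estimate the A, B and AB effects of a 2^2 design.

    one, a, b, ab: treatment totals (1), a, b, ab; n: replicates per cell.
    """
    contrast_a = a + ab - b - one
    contrast_b = b + ab - a - one
    contrast_ab = one + ab - a - b
    # Each effect is its contrast divided by 2n.
    return {"A": contrast_a / (2 * n),
            "B": contrast_b / (2 * n),
            "AB": contrast_ab / (2 * n)}

# Treatment totals from Example 1: (1) = 70, a = 100, b = 49, ab = 31.
effects = effects_2x2(one=70, a=100, b=49, ab=31, n=3)
print(effects)  # {'A': 2.0, 'B': -15.0, 'AB': -8.0}
```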

10.6 Sum of squares formula for ANOVA

The effects of the main factors and of the interaction can be tested by using the two-way ANOVA table.

ANOVA Table:

Source of variation    Degrees of freedom    Sum of squares    Mean square               F          P-value
A                      1                     SSA               MSA = SSA/1               MSA/MSE
B                      1                     SSB               MSB = SSB/1               MSB/MSE
AB                     1                     SSAB              MSAB = SSAB/1             MSAB/MSE
Error                  4(n - 1)              SSE               MSE = SSE/[4(n - 1)]
Total                  4n - 1                SST

Here n is the number of replicates at each treatment combination, so the experiment has 4n observations in total.

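The sum of squares for factor A, SSA, is given by

$$SSA = \frac{[\,a + ab - b - (1)\,]^2}{4n}$$

The sum of squares for factor B, SSB, is given by

$$SSB = \frac{[\,b + ab - a - (1)\,]^2}{4n}$$

The sum of squares for the AB interaction, SSAB, is given by

$$SSAB = \frac{[\,(1) + ab - a - b\,]^2}{4n}$$

The total sum of squares, SST, is given by

$$SST = \sum_{i=1}^{2}\sum_{j=1}^{2}\sum_{k=1}^{n} x_{ijk}^2 - \frac{x_{\cdot\cdot\cdot}^2}{4n}$$

where $x_{\cdot\cdot\cdot}$ is the grand total of all 4n observations.

The error sum of squares is obtained by subtraction: SSE = SST - SSA - SSB - SSAB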
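The sums of squares can be computed from the raw observations as sketched below. Each of SSA, SSB and SSAB is the squared contrast divided by 4n, SST is the sum of the squared observations minus the squared grand total divided by 4n, and SSE follows by subtraction. The function name `sums_of_squares_2x2` and the dictionary layout are assumptions made for this illustration; the observations are those of Example 1 later in this chapter.

```python
def sums_of_squares_2x2(cells):
    """Sums of squares for a replicated 2^2 design.

    cells: dict mapping "(1)", "a", "b", "ab" to the list of n observations
    at that treatment combination.
    """
    n = len(cells["(1)"])
    total = {key: sum(obs) for key, obs in cells.items()}  # treatment totals
    ss_a = (total["a"] + total["ab"] - total["b"] - total["(1)"]) ** 2 / (4 * n)
    ss_b = (total["b"] + total["ab"] - total["a"] - total["(1)"]) ** 2 / (4 * n)
    ss_ab = (total["(1)"] + total["ab"] - total["a"] - total["b"]) ** 2 / (4 * n)
    all_obs = [x for obs in cells.values() for x in obs]
    grand_total = sum(all_obs)
    ss_t = sum(x * x for x in all_obs) - grand_total ** 2 / (4 * n)
    ss_e = ss_t - ss_a - ss_b - ss_ab  # error sum of squares by subtraction
    return {"SSA": ss_a, "SSB": ss_b, "SSAB": ss_ab, "SSE": ss_e, "SST": ss_t}

# Observations from Example 1 (tool life in hours).
cells = {"(1)": [22, 28, 20], "a": [34, 37, 29],
         "b": [18, 15, 16], "ab": [11, 10, 10]}
ss = sums_of_squares_2x2(cells)
for name, value in ss.items():
    print(f"{name:>4} = {value:.3f}")
```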

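10.7 Least squares regression model

An initial estimated regression model is

$$\hat y = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \hat\beta_{12} x_1 x_2$$

where $x_1$ and $x_2$ are coded variables for factors A and B (-1 at the low level and +1 at the high level), and

$\hat\beta_0$ = constant (grand average of all 4n observations)

$\hat\beta_1$ = the estimated coefficient of x1 (the effect of having factor A) = (effect A)/2

$\hat\beta_2$ = the estimated coefficient of x2 (the effect of having factor B) = (effect B)/2

$\hat\beta_{12}$ = the estimated coefficient of x1x2 (the effect of interaction between factor A and factor B) = (effect AB)/2

The final regression model can be determined from the ANOVA table. For instance, if the interaction between A and B is not significant, then the final regression model is

$$\hat y = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2$$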
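The relationship between the effects and the regression coefficients can be sketched as follows. The model is y = b0 + b1 x1 + b2 x2 + b12 x1 x2, where b0 is the grand average, each remaining coefficient is half the corresponding effect, and x1 and x2 are assumed to be coded as -1 (low) and +1 (high). The function name `coded_model` is hypothetical, and the numbers are taken from Example 1 of this chapter (grand total 250 over 12 observations; effects 2, -15 and -8).

```python
def coded_model(grand_mean, effect_a, effect_b, effect_ab=0.0):
    """Build a prediction function for a 2^2 design in coded units.

    Leaving effect_ab at 0 gives the reduced model without interaction.
    """
    b0 = grand_mean
    b1, b2, b12 = effect_a / 2, effect_b / 2, effect_ab / 2

    def predict(x1, x2):
        # x1, x2 are coded levels: -1 for low, +1 for high.
        return b0 + b1 * x1 + b2 * x2 + b12 * x1 * x2

    return predict

# Full model built from the Example 1 estimates.
full = coded_model(grand_mean=250 / 12, effect_a=2, effect_b=-15, effect_ab=-8)
for label, (x1, x2) in {"(1)": (-1, -1), "a": (1, -1),
                        "b": (-1, 1), "ab": (1, 1)}.items():
    print(f"{label:>3}: predicted life = {full(x1, x2):.2f}")
```

With all three terms included, the model reproduces the four cell averages of the design exactly. Dropping a term, as in the reduced model above, smooths the predictions towards what the remaining effects explain.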

Example 1: An engineer is interested in the effect of cutting speed (A) and tool geometry (B) on the life in hours of a machine tool. Two cutting speeds and two different geometries are used. Three experimental tests were done at each of the four combinations. The data are as follows:

                       Cutting Speed (A)
Tool Geometry (B)      Low           High
1                      22  28  20    34  37  29
2                      18  15  16    11  10  10

(a) Construct the $2^2$ factorial design table.
(b) Find the estimates of all effects and the interaction.
(c) Construct the ANOVA table and, for each effect, test the null hypothesis that the effect is equal to 0.

Solution:

(a) The $2^2$ factorial design table:

Treatment       Factorial effect    Life time (hours)
combination     A    B    AB        Observations    Total    Average
(1)             -    -    +         22  28  20      70       23.33
a               +    -    -         34  37  29      100      33.33
b               -    +    -         18  15  16      49       16.33
ab              +    +    +         11  10  10      31       10.33

(b) Estimates of the effects:

A = (1/2n)[a + ab - b - (1)] = (1/6)[100 + 31 - 49 - 70] = 2

B = (1/2n)[b + ab - a - (1)] = (1/6)[49 + 31 - 100 - 70] = -15

AB = (1/2n)[(1) + ab - b - a] = (1/6)[70 + 31 - 49 - 100] = -8

(c) ANOVA table:

To construct the ANOVA table we have to find the sums of squares for A, B, AB and Total.

SSA = [a + ab - b - (1)]^2/(4n) = [100 + 31 - 49 - 70]^2/12 = 144/12 = 12

SSB = [b + ab - a - (1)]^2/(4n) = [49 + 31 - 100 - 70]^2/12 = 8100/12 = 675

SSAB = [(1) + ab - b - a]^2/(4n) = [70 + 31 - 49 - 100]^2/12 = 2304/12 = 192

SST = (22^2 + 28^2 + 20^2 + 34^2 + 37^2 + 29^2 + 18^2 + 15^2 + 16^2 + 11^2 + 10^2 + 10^2) - (250)^2/12

= 6160 - 5208.333 = 951.667

SSE = SST - SSA - SSB - SSAB = 951.667 - 12 - 675 - 192 = 72.667

ANOVA Table:

Source of variation    SS         df    MS       F0
A                      12         1     12       1.321
B                      675        1     675      74.315
AB                     192        1     192      21.138
Error                  72.667     8     9.083
Total                  951.667    11

From F-Table: F0.05,1,8 = 5.32

Conclusion:

The effect of A is not significant because its F0 = 1.321 is less than 5.32. For B and AB, F0 is greater than 5.32, so the effect of factor B and the AB interaction are significant for the life of the machine tool.
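The F ratios and the decisions in this example can be re-derived with a short script. This is only a sketch: the sums of squares are copied from the ANOVA table above, and the critical value 5.32 is the tabulated F(0.05; 1, 8) quoted in the solution rather than something computed here. Small differences in the last digit of F0 can arise from rounding MSE.

```python
# Sums of squares and error degrees of freedom from the Example 1 ANOVA table.
ss_effects = {"A": 12.0, "B": 675.0, "AB": 192.0}
ss_error, df_error = 72.667, 8
f_crit = 5.32  # tabulated F(0.05; 1, 8) quoted in the solution

mse = ss_error / df_error
decisions = {}
for name, ss in ss_effects.items():
    ms = ss / 1            # each effect has 1 degree of freedom
    f0 = ms / mse
    decisions[name] = f0 > f_crit
    verdict = "significant" if decisions[name] else "not significant"
    print(f"{name:>2}: F0 = {f0:.3f} -> {verdict}")
```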

Exercise 10

1. An engineer is investigating the thickness of an epitaxial layer, which is subject to two levels of A, deposition time (+ for short time and - for long time), and two levels of B, arsenic flow rate (- for 55% and + for 59%). The engineer conducts a $2^2$ factorial design with n = 4 replicates. The data are as follows:

                     Arsenic Level
Deposition Time      B - (Low, 55%)                    B + (High, 59%)
A - (Long)           14.037  14.165  13.972  13.907    13.880  13.860  14.032  13.914
A + (Short)          14.821  14.757  14.843  14.878    14.888  14.921  14.415  14.932

(a) Construct the 2 x 2 factorial design table.
(b) Find the estimates of all effects and the interaction.
(c) Construct the ANOVA table and, for each effect, test the null hypothesis that the effect is equal to 0.

2. A two-factor experimental design was conducted to investigate the lifetime of a component being manufactured. The two factors are A (design) and B (cost of material). Two levels ((+) and (-)) of each factor are considered. Three components are manufactured with each combination of design and material, and the total lifetime measured (in hours) is as shown in the table below:

Treatment combination    Design A    Material B    AB    Total lifetime of 3 components (in hours)
(1)                      -           -             +     122
a                        +           -             -     60
b                        -           +             -     120
ab                       +           +             +     118

(a) Perform a two-way analysis of variance to estimate the effects of design and material expense on the component lifetime, given that the total sum of squares is 1050.
(b) Based on your results in part (a), what conclusions can you draw from the factorial experiment?
(c) Indicate which effects are significant to the lifetime of a component.
(d) Write the least squares fitted model using only the significant sources.

3. An engineer suspects that the surface finish of metal parts is influenced by the type of paint used and the drying time. He selected two drying times, 20 and 30 minutes, and used two types of paint. Three parts are tested with each combination of paint type and drying time. The data are as follows:

            Drying Time (min)
Paint       20 min        30 min
ICI         74  64  50    78  85  92
NIPPON      92  86  68    66  45  85

(a) Compute the estimates of the effects and their standard errors for this design.
(b) Perform an analysis of variance of the appropriate regression model for this design. Include in your analysis hypothesis tests for each coefficient, as well as residual analysis. State your final conclusions about the adequacy of the model. Compare your results to part (c) and comment.

4. An experiment involves a storage battery used in the launching mechanism of a shoulder-fired ground-to-air missile. Two material types can be used to make the battery plates. The objective is to design a battery that is relatively unaffected by the ambient temperature. The output response from the battery is effective life in hours. Two temperature levels are selected, and a factorial experiment with four replicates is run. The data are as follows:

            Temperature (°F)
Material    Low                   High
1           130  155  74  180     20  70  82  58
2           138  110  168  160    96  104  82  60

Perform an analysis of variance of the appropriate regression model for this design. Include in your analysis hypothesis tests for each coefficient, as well as residual analysis. State your final conclusions about the adequacy of the model. Compare your results to part (c) and comment.

5. An article in the IEEE Transactions on Semiconductor Manufacturing (Vol. 5, 1992, pp. 214-222) describes an experiment to investigate the surface charge on a silicon wafer. The factors thought to influence induced surface charge are cleaning method (spin rinse dry, or SRD, and spin dry, or SD) and the position on the wafer where the charge was measured. The surface charge ($\times 10^{11}$ q/cm$^3$) response data are shown.

                    Test Position
Cleaning Method     L        R
SD                  1.66     1.84
                    1.90     1.84
                    1.92     1.62
SRD                 -4.21    -7.58
                    -1.35    -2.20
                    -2.08    -5.36

Perform an analysis of variance of the appropriate regression model for this design. Include in your analysis hypothesis tests for each coefficient, as well as residual analysis. State your final conclusions about the adequacy of the model. Compare your results to part (c) and comment.

oooOOOooo
