Measurement and Evaluation (Book)



TABLE OF CONTENTS

UNIT-1:.........................................................................................................................1

INTRODUCTION........................................................................................................1

1.1 EVALUATION, ASSESSMENT, MEASUREMENT AND TEST:..................................1

1.2 THE PURPOSE OF TESTING................................................................................32

1.3 GENERAL PRINCIPLES OF ASSESSMENT:...........................................................37

1.4 TYPE OF EVALUATION PROCEDURE.................................................................39

1.5 NORM-REFERENCED AND CRITERION-REFERENCED TEST:.............................45

1.6 EDUCATIONAL:.................................................................................................47

UNIT-2:.......................................................................................................................52

JUDGING THE QUALITY OF THE TEST...........................................................52

2.1 VALIDITY, METHODS OF DETERMINING VALIDITY:.........................................53

2.2 FACTORS AFFECTING VALIDITY.......................................................................56

2.3 RELIABILITY, AND METHODS OF DETERMINING RELIABILITY:........................59

2.4 FACTORS AFFECTING RELIABILITY:.................................................................64

2.5 PRACTICALITY:.................................................................................................67

UNIT-3:.......................................................................................................................69

APPRAISING CLASSROOM TESTS (ITEMS ANALYSIS)...............................69

3.1 THE VALUE OF ITEM.........................................................................................69

3.2 THE PROCEDURE/ PURPOSE OF ITEM ANALYSIS:.............................................75

3.2 MAKING THE MOST OF EXAMS: PROCEDURES FOR ITEM ANALYSIS:..............76

3.3 ITEM DIFFICULTY:............................................................................................96

3.4 THE INDEX OF DISCRIMINATION.......................................................................97

UNIT-4:.....................................................................................................................102

INTERPRETING THE TEST SCORES...............................................................102

4.1 THE PERCENTAGE CORRECT SCORE:..............................................................102

INTERPRETING THE TESTS SCORES.............................................................102

4.2 THE PERCENTILE RANKS:...............................................................................113

4.3 STANDARD SCORES:.......................................................................................118

4.4 PROFILE:........................................................................................120

UNIT-5:.....................................................................................................................121

EVALUATING PRODUCT, PROCEDURES & PERFORMANCE.................121

5.1 EVALUATING THEMES AND TERM PAPERS:....................................................121

5.2 EVALUATING GROUP WORK & PERFORMANCE.............................................132


5.3 EVALUATING DEMONSTRATION:....................................................................136

5.4 EVALUATION OF PHYSICAL MOVEMENTS AND MOTOR SKILLS:....................141

5.5 EVALUATING ORAL PERFORMANCE:..............................................................148

UNIT-6:.....................................................................................................................152

PORTFOLIOS..........................................................................................................152

6.1 PURPOSE OF PORTFOLIOS:..............................................................................152

6.3 GUIDELINE AND STUDENTS ROLE IN SELECTION OF PORTFOLIO ENTRIES AND SELF-EVALUATION:................................................................................................160

6.4 USING PORTFOLIOS IN INSTRUCTION AND COMMUNICATION:.......................164

6.5 POTENTIAL STRENGTH AND WEAKNESSES OF PORTFOLIOS:..........................167

6.6 EVALUATION OF PORTFOLIO:.........................................................................174

UNIT-7:.....................................................................................................................176

BASIC CONCEPTS OF INFERENTIAL STATISTICS.........................................176

7.1 CONCEPT & PURPOSE OF INFERENTIAL STATISTICS:......................................176

7.2 SAMPLING ERROR:..........................................................................................179

7.3 NULL HYPOTHESIS:........................................................................................180

7.4 TESTS OF SIGNIFICANCE:................................................................................182

7.5 LEVELS OF SIGNIFICANCE:..............................................................................185

7.6 TYPE-I AND TYPE-II ERRORS: REMAINING:...................................................188

7.7 DEGREES OF FREEDOM:..................................................................................192

UNIT-8:.....................................................................................................................197

SELECTED TESTS OF SIGNIFICANCE............................................................197

8.1 T-TEST:...........................................................................................................197

8.2 CHI-SQUARE (X2):..........................................................................................200

8.3 REGRESSION:..................................................................................................205


UNIT-1:

INTRODUCTION

1.1 EVALUATION, ASSESSMENT, MEASUREMENT AND TEST:

1.1.1 Evaluation:

Literally, the term evaluation means "appraisal", "judgment", "assessment", "calculation", "estimation" or "rating" of a thing.

According to the International Dictionary of Education (by G. Terry Page & J.B. Thomas), evaluation means a "value judgment" on an observation, performance test or any data, whether directly measured or inferred. Evaluation is the qualitative assessment of a thing. It answers the question "How good?"

A. D. Jones defines evaluation as "the process of finding the value of something". He further says that "the process of evaluation is the attempt to find the worth of any enterprise".

The Oxford Advanced Learner's Dictionary defines the term "evaluate" as to find out or form an idea of the amount or value of something. When we evaluate something, we mean to determine the value or worth of that thing. Evaluation is, actually, the process through which we collect information about something and then make a decision in the light of that information.

So we can say that evaluation is concerned with making judgments about things. When we act as evaluators, we attribute "value" or "worth" to behaviour, objects and processes. In the wider community, for example, one may make evaluative comments about a play, clothes, a restaurant, a book or someone's behaviour. We may enjoy a play, admire someone's clothes, speak well of some restaurant, and so on. Invariably, these are rather simple, straightforward comments of value or worth, because such judgments are not based on appropriate and relevant data.


According to William Wiersma and Stephen G. Jurs, "The more effective evaluation requires judgment which is based on appropriate and relevant data". For example, to say that a film is 'good' or 'bad' is not a judgment based on appropriate and relevant data; it is, therefore, not an exact evaluation of the film. It is more effective to say that the film has a well-written script, tight direction, mood-enhancing music, suitable characters and so forth, because this judgment is based on some appropriate and relevant data. These are the characteristics upon which we can make a judgment about something.

Educational Evaluation:

The Concept of Evaluation in Education:

Educational Evaluation is a specific term which is used for the judgment of educational objectives. Educational Evaluation seeks to determine how well the student has achieved the stated objectives of the learning situation.

Different educationists have defined Educational Evaluation in different words, some of which are discussed below:

1. “Educational Evaluation is a systematic process of collecting, analyzing and

interpreting information to determine the extent to which pupils are achieving

instructional objectives”.

–– Norman E. Gronlund

2. “Educational Evaluation is the systematic process of collecting and analyzing

data in order to determine whether, and to what degree, objectives have been or are being achieved".

–– L.R. Gay

3. “Educational Evaluation is the estimation of the growth and progress of pupils

towards objectives or values in the curriculum”.

–– Wrightstone


4. "Educational Evaluation is defined as the process of determining the extent

to which educational objectives are achieved by the student”.

–– Remmers

Approaches to Evaluation

Evaluation in our schools is essentially concerned with two major approaches

to making judgments:

1. Product Evaluation:

It is the evaluation of students’ performance in a specific learning context.

Such kind of evaluation seeks to determine how well the student has achieved

the stated objectives of the learning situation. In this sense the student’s

performance is seen as a product of the educational experience. A school

report is an example of Product Evaluation.

2. Process Evaluation:

It is that kind of evaluation which seeks to examine the experiences and activities involved in the learning situation. It makes judgments about the process by which students acquire learning. In simpler words, it examines the process of the learning experience before it has concluded. The evaluation of the nature of student-teacher interaction, instructional methods, school curricula, specific programmes, etc. are the best examples of Process Evaluation.

1. Curriculum Evaluation

Curriculum Evaluation, as is clear from the name, is the evaluation of a certain

curriculum i.e. an instructional programme. It is used to determine the

outcome of a programme and to decide whether to accept or reject a

programme. This evaluation helps in the further development of the


curriculum materials for continuous improvement. For better learning, it is

necessary to assess a new programme in order to find out whether the desired

outcomes are being achieved or not. The use of evaluation techniques should

enable the curriculum workers to make steady progress in improving the

curriculum. Curriculum evaluation should not only be a means for judging

educational effectiveness, but also should lead to useful decisions that can

serve as a powerful force to improve the educational process. Careful

evaluation should demonstrate the strengths and weaknesses in the curriculum

so that necessary changes can be made in the instructional programme.

2. Programme Evaluation

Programme Evaluation is used for judging the effectiveness of a programme

or a special project. This evaluation is used to make a decision about

programme installation and modification. It helps to obtain evidence to

support or oppose a programme. Outside education, ‘programme evaluation' is

used as a means of determining the effectiveness, efficiency and acceptability

of any form of programme. But within education, we can use the term in a similar way, as in the case of evaluating the effectiveness of a new writing or reading programme in primary schools. A curriculum evaluation may qualify as a programme evaluation if the curriculum is focused on change or improvement. A programme evaluation, however, does not necessarily involve appraisal of curricula (e.g. the evaluation of a computerized student record-keeping system).

3. Personnel Evaluation

The evaluation of personnel is the assessment of the performance of the working personnel in an organization. That is why it is also called performance appraisal or staff evaluation. In education, 'personnel evaluation' is very necessary for adopting appropriate appraisal plans and procedures to achieve the goals of education. According to McNeil, J.D., "Evaluation of the


performance of working personnel can be an effective instrument for helping people in growing and developing in their roles. It could be used as a mechanism of continuing education and learning from one another. Through a well-organized appraisal system every employee can create learning spaces for himself in the system in which he works." A good personnel evaluation helps the employee to recognize his/her own strengths and weaknesses in order to enable him to improve his performance in a given role. It also helps in identifying people for the purpose of motivating, training and developing them for new roles or existing roles.

4. Institutional-Evaluation:

Institutional Evaluation is the evaluation of the total programme of a school,

college, university or other educational institution. The evaluation of an

institution is used to collect information and data on all aspects of the function

of that institution. The basic aim of this evaluation is to determine the degree

to which instructional objectives are being met and to identify areas of

strength and weakness in the total programme. An institutional evaluation

involves more than the administration of tests to students; it may require any

combination of questionnaires, interviews, and observations, with data being

collected from all persons in the institution community, including

administrators, teachers, and counsellors. The major component of institutional evaluation is the institutional testing programme. The more comprehensive the testing programme, the more valuable are the resulting data. That is why, to obtain the most valuable data, the institutional testing programme should include measures of achievement, aptitude, personality and

interest. Tests selected for an institutional evaluation must match the

objectives of the institution and be appropriate for the students to be tested.


Need or Importance of Evaluation

Evaluation plays a pivotal role in the teaching-learning process. It helps in providing information about the success or failure of an educational objective. It shows whether the student has achieved the required objective or not, and to what degree the goal has been reached. So evaluation provides the relevant information that decision-makers need about the input, output and operation of a programme, and the placement of students in programmes. Levels of understanding can be assessed, and future educational objectives set, based on student needs. Similarly, appropriate activities can be planned by the teacher based on knowledge of the attributes of the student. Evaluation also makes it easy for the teacher to formulate objectives, select content, and plan learning experiences. It also provides a guideline about all aspects of the teaching-learning

process. Without evaluation we cannot be aware of the effectiveness or

ineffectiveness of an educational program or objective.

Evaluation is as necessary for the student as for the teacher or decision-makers. Its importance for the student is great because the whole process of education is for the benefit of the student. The student is the centre of interest in the teaching-learning process. For the student, evaluation provides feedback regarding his or her strengths and weaknesses. It encourages the student to study better and increases his motivation. Improvement of the teacher's teaching and the student's learning, through judgments based on available information, is the ultimate need of the evaluation process.

In a nutshell, evaluation plays a central role in the teaching-learning process. It serves as a guiding principle for the selection of supervisory techniques and also as a means for improving school-community relations.

1.1.2 Assessment: Concept of Assessment

Literally assessment means the act of judging or assessing a person or

situation or event. It is the classification of someone or something with respect to its


worth. Assessment is a general term that includes the full range of procedures used to

gain information about student learning (observations, ratings of performances or

projects, paper-and-pencil tests) and the formation of value judgments concerning

learning progress. A test is a particular type of assessment that typically consists of a

set of questions administered during a fixed period of time under reasonably

comparable conditions for all students (Linn and Gronlund, 2000). Assessment may include both quantitative descriptions (measurement) and qualitative descriptions (non-measurement) of students. In addition, assessment always includes value judgments concerning the desirability of the results. Assessment may or may not be based on measurements; when it is, it goes beyond simple quantitative description.

The process of collecting, synthesizing, and interpreting information to aid in

decision making is called assessment. For many people, the words 'classroom assessment' evoke images of pupils taking paper-and-pencil tests, teachers scoring them, and grades being assigned to the pupils based upon their performance. Assessment, as the term is used here, includes the full range of information that teachers gather in their classrooms: information that helps them understand their pupils, monitor their instruction, and establish a viable classroom culture. It also includes the variety of ways teachers gather, synthesize, and interpret that information. Assessment is a general term that includes all the ways through which teachers gather information in their classrooms.

Need for Assessment in Education

As long as there is a need for the educator to make instructional decisions, curricular decisions, selection decisions, and placement or classification decisions based on the present or anticipated educational status of the child, so long will there be a need for assessment in the educational enterprise. To the modern educator, the ultimate goal of assessment is to facilitate learning. This can be done in a number of ways, and each way requires a separate type of decision. The assessment decision also determines which type of test is to be used for assessment. Thus there is a close relationship between the purposes of evaluation, the evaluation decisions and the types of tests to be used for them.

The purposes of assessment are as follows:

Selection Decision

Whenever there is a choice, a selection decision has to be made. In our daily life we see that institutions and organizations need persons for their work; they receive applications from several people but cannot take all of them, so they have to make a selection. Assessment of these persons is made on the basis of tests given to them. The tests provide information which helps in the selection decision: some persons will be acceptable while others will not. Similarly, universities have to make selection decisions for admitting students to various courses. For courses in which hundreds of candidates are applicants, the selection decision has to be made on a stronger footing, so naturally some tests are given to the candidates to help in the selection decision. Aptitude tests, intelligence tests, achievement tests or prognostic tests are generally given for the purpose of selection decisions. There have been rulings from the judiciary that the scores on these tests should have a good relationship with success in the job or the course for which the tests have been given. If a selection test does not fulfill this requirement, it needs to be improved or replaced by a better one. Although the perfection of such tests cannot be guaranteed, any institution or organization which is interested in the best students or workers will continue to make efforts to improve the tests being used for the purpose of selection.

Placement Decision

Since school education should be provided to all in a welfare state, the schools must make provision for all; they cannot reject candidates for admission as universities or colleges can do. How these candidates are placed in different programmes of school education is determined on the basis of their assessment. Such school determinations are called placement decisions. These decisions are required not only in the case of those who have some disadvantage but also for those who are gifted and talented. The schools have to find one or another programme for all school-age children depending upon their weaknesses or strengths. Placement tests have to be different from, and more useful than, selection tests because they inform the decision to assign students differentially to teaching programmes. Achievement tests and interviews are generally used for placement decisions.

Classification Decisions

Assessment is also required to help in making decisions about assigning a person to one of several different categories, jobs or programmes. These decisions are called classification decisions because within one particular job or programme there may be several levels or categories. The level or category to which a particular person or child is assigned depends upon the results of the test. Aptitude tests, achievement tests, interest inventories, value questionnaires, attitude scales and personality measures are used for classification decisions. There is a minor difference between classification, placement and selection: classification refers to cases where the categories are essentially unordered, placement refers to cases where the categories represent levels of teaching or treatment, and selection refers to cases where persons can be selected or rejected.

Diagnosis and Remedial Decisions

Assessment is required to locate the students who need special remedial help and to decide, for example, what instructional strategies the teacher should use to help a particular student or a group of students so that the opportunities to achieve the objective are maximized. Aptitude tests, intelligence tests, diagnostic achievement tests, diagnostic personality measures, etc. may be used to achieve this purpose.

Feedback


It is not sufficient to assess students through a test and then do nothing after that. A good teacher will use tests for the purpose of providing feedback to students. Feedback may be effective or ineffective depending upon the circumstances. Feedback will facilitate learning if it confirms the learner's correct responses or identifies errors and corrects them. Test results made available to parents may also be used as a feedback device. It should also be remembered that feedback is for both the student and the teacher, because it provides information to both and helps in knowing how well students have learnt and how well the teacher has taught.

Motivation and guidance of learning: Assessment is also used to motivate students to study more and to guide their learning. However, a motivation device can be used positively as well as negatively. Unfortunately, most schoolteachers use it negatively: the fear of failing an examination or of being refused annual promotion to the next class can motivate students, but if they are motivated through evaluation techniques which give them more confidence in the subject, the motivation will be more effective and lasting. Aptitude tests, achievement tests, attitude scales, personality measures, interest inventories and surprise quizzes encourage students to study more and understand better.

Assigning Marks to Students:

The instructional programme remains incomplete if it is not followed by assessment. Although no teacher chooses the teaching profession because he is interested in evaluating students, no teacher confines his job to teaching only. He regularly evaluates his students and assigns them marks; in fact, most teachers give a good deal of their time to this purpose. If teachers do not evaluate their students and do not assign them marks or grades, how can they check the effectiveness of their teaching and the learning outcomes of their students?

Role of Assessment in Education Process


The assessment of learning takes place in an instructional context. Consequently, that learning environment shapes the reasons why we evaluate, influences the purpose for evaluating as well as how we evaluate, and determines how we should use the outcomes of our assessment. Assessment is an integral part of instruction; it is not a separate entity that is somehow loosely attached to the teaching process. The instructional process and the role of evaluation in it must both be understood as background to the study of educational measurement. To that end, the role of assessment in instruction will be described using a model that explains how the teaching process works.

(A) There are many models that describe the variety of approaches to teaching found in schools, but the Basic Teaching Model (BTM), introduced by Glaser (1962), accounts for the fundamental components of most other specific teaching models, such as the Socratic approach, the individualized instruction approach, or the computer-dominated instructional approach (Joyce and Weil, 1980). Few teachers probably follow the BTM steps explicitly to guide their instructional activities. And though we do not specifically endorse the use of the BTM or any other particular model, we do advocate instructional approaches, by whatever name, that account for the fundamental functions represented in the BTM, as described next.

The main purposes of the BTM are to identify the major activities of the teacher and to describe the relationships among those activities; the figure below is a diagram of the model. Our primary interest is the Performance Assessment component, but we cannot completely understand the role of evaluation without understanding how Performance Assessment affects, and is affected by, the other teaching activities. Instructional Objectives, the first component of the BTM, represents the teacher's starting point in providing instruction. What should students learn? What skills and knowledge should be the focus of instruction? What is the curriculum and how is it defined? The second component, Entering Behaviour, indicates that the teacher must try to assess the students' level of achievement and readiness to learn prior to beginning instruction.


[Figure: the Basic Teaching Model, with its components in sequence, labelled A to E: Instructional Objectives, Entering Behaviour, Instructional Procedures, Performance Assessment, and the Feedback Loop.]

What do the students know already and what are their cognitive skills like? How receptive to learning are they? Which ones seem self-motivated? This component indicates a need for evaluation information before instruction actually begins.

Once the teacher has decided what will be taught and to whom the teaching is

to be directed, the "How?" must be determined. The Instructional Procedures

component deals with the material and methods of instruction the teacher selects or

develops to facilitate student learning. Does the text need to be supplemented with illustrations? Should small-group projects be developed? Is there computer software available to serve as a refresher for prerequisites? At this point instruction could begin, and often it does, but unless the teacher makes plans to evaluate students' performance, the students and teacher will never be sure when learning is complete. The Performance Assessment component helps to answer the question, "Did we accomplish what we set out to do?" Tests, quizzes, teacher observations, projects, and demonstrations are evaluation tools that help to answer this question. Thus evaluation

should be a significant aspect of the teaching process; teaching does not occur,

according to the model, unless evaluation of learner performance occurs.

Feedback Loop

The model shows a fifth component, the Feedback Loop, which can be used by the teacher as both a management and a diagnostic procedure. If the results of evaluation indicate that sufficient learning has occurred, the loop takes the teacher back to the Instructional Objectives component, and each successive component, so that plans for beginning the next instructional unit can be developed. (New objectives are needed, entering behaviour is different, and methods will need to be reconsidered.)

But when evaluation results are not so positive, the Feedback Loop is a mechanism

for identifying possible explanations. (Note the arrows that return to each

component.) Were the objectives too vaguely specified? Did students lack essential

prerequisite skills or knowledge? Was the film or text relatively ineffective? Was

there insufficient practice opportunity? Such questions need to be asked and

frequently are. However, questions need to be asked about the effectiveness of the

performance assessment procedures also, perhaps more frequently than they are. Were

the test questions appropriate? Were enough observations made? Were directions

clear to students? The Feedback Loop returns to the Performance Assessment

component to indicate that we must review and assess the quality of our evaluation procedures after the fact, to determine the appropriateness of the procedures and the

accuracy of the information. Unless the tools of evaluation are developed with care,

inadequate learning may go undetected or complete learning may be misinterpreted as

deficient.

In sum, good teaching requires planning for and using good evaluation tools. Furthermore, evaluation does not take place in a vacuum. The BTM shows that other components of the teaching process provide cues about what to evaluate, when to evaluate, and how to evaluate. Our purpose is to identify such cues and to take advantage of them in building tests and other assessment devices that measure achievement as precisely as possible.
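The sequence just described, including the Feedback Loop, can be summarised as a simple control loop. The short Python sketch below is only an illustration of that flow, not part of Glaser's model: the function names, the example scores, the 0.8 passing level and the 0.75 mastery requirement are all invented for the example.

# A minimal, self-contained sketch of the Basic Teaching Model as a loop.
def performance_assessment(scores, passing=0.8):
    # Component D: the proportion of pupils who reached the objectives.
    return sum(s >= passing for s in scores) / len(scores)

def feedback_loop(mastery, required=0.75):
    # Component E: either proceed to the next unit or review components A-D.
    if mastery >= required:
        return "proceed: plan objectives for the next unit"
    return "review: objectives, entering behaviour, procedures, assessment"

# Hypothetical end-of-unit scores, expressed as proportion correct per pupil.
unit_scores = [0.9, 0.65, 0.8, 0.7, 0.95]
mastery = performance_assessment(unit_scores)
print(f"Proportion of pupils at mastery: {mastery:.2f}")
print(feedback_loop(mastery))

Run on these invented scores, the sketch reports that three of five pupils reached mastery and recommends reviewing each component before reteaching, which mirrors the diagnostic use of the Feedback Loop described above.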

(B) Assessment serves the decision maker who is concerned about all aspects of the educational endeavour. The key point to consider and keep in mind is that evaluation involves appraisal against particular goals or purposes. Useful information may be obtained for evaluation procedures by both formal and informal means and should include information collected during instruction as well as end-of-course data.


According to Ahmann and Glock (1985), school administrators, guidance personnel, classroom teachers, and individual students require information that will allow them to make informed and appropriate decisions regarding their respective educational activities. Ideally, they should be aware of all the alternatives open to them, the possible outcomes of each alternative, and the advantages and disadvantages of the respective outcomes. Educational and psychological measurement can help individuals with these matters.

(C) Tyler (1966), Airasian and Madaus (1972), Gronlund (1976), and Thorndike and Hagen (1977) rightly observe that the data secured through testing procedures may have the uses given below:

First, measurement data may be employed in the placement of students in one or another instructional programme. Usually pupils take a pretest to measure whether they have mastered the skills that are prerequisite to admittance to a particular course or instructional sequence. For instance, foreign language and mathematics

programmes are usually arranged in some hierarchical order so that achievement at

each level of learning depends on mastery of the preceding level.

The student is led from the entering position in the hierarchy to the terminating phase via intermediate steps. Based upon the information provided by a pretest, a student can be placed: (1) at the most appropriate point in the instructional sequence; (2) in a programme with a particular instructional strategy; or (3) with an appropriate teacher.

Second, measurement data can be used in formative evaluation. Tests are

administered to students to monitor their success and to provide them with relevant

feedback. The information is employed less to grade a student than to make instruction responsive to the student's strengths and weaknesses as identified by the measurement device. Mastery learning procedures emphasize the use of formative tests to provide detailed information about each student's grasp of a unit's objectives.

Third, measurement data have a place in diagnostic evaluation. Diagnostic testing takes over where formative testing leaves off. When a student fails to respond to the feedback and corrective activities associated with formative testing, a more detailed search for the source of the learning difficulty is indicated. Remediation is only possible when the teacher understands the basis of a student's problem and then designs instruction to address the need.

Fourth, measurement data may be used for summative purposes. Such testing is employed to certify or grade students at the completion of a course or unit of instruction. Often the result is 'final' and follows the student throughout his or her academic career (as in the case of college and university transcripts). It is this aspect of evaluation that some educators find particularly objectionable.

Fifth, measurement data are used by employers and educational institutions in making selection decisions. Many jobs and slots in educational programmes are limited in number, and there are more applicants than positions. In order to identify the most promising candidates, standardized tests may be administered to the applicants. The information provided by the tests presumably increases the accuracy and objectivity of administrators' decisions. College Board examinations are used by many universities in admitting students; graduate and professional schools likewise employ data from standardized testing programmes to make their entrance decisions.

Sixth, school officials use measurement data in making curricular decisions, in order to evaluate existing programmes and to decide among instructional alternatives. School administrators need to assess their students' current levels of performance and to weigh the strengths and weaknesses of the evidence.


Seventh, measurement data find a place in personal decision-making. Individuals confront a variety of choices at any number of points in their lives. Should they attend college or pursue some other type of post-high school training? What kind of job seems most suited to their needs? What sort of training programme should they

enter? Measures of interest, temperament, and ability can give individuals insights

that can prove helpful in the decision-making process.

Types of Assessment

Tests and other assessment procedures can be classified in terms of their functional role in classroom instruction. One such classification system follows the sequence in which assessment procedures are likely to be used in the classroom. These categories classify the assessment of student performance in the following manner:

1. Placement assessment

To determine student performance at the beginning of instruction.

2. Formative assessment

To monitor learning progress during instruction.

3. Diagnostic assessment

To diagnose learning difficulties during instruction.

4. Summative assessment

To assess achievement at the end of instruction.

Although a single instrument may sometimes be useful for more than one

purpose (e.g., both for formative and summative assessment purposes), each of

these types of classroom assessment typically requires instruments specifically

designed for the intended use.

All these types of assessment are discussed below in detail.

Placement Assessment


This is also called Need Analysis Assessment. Placement assessment is

concerned with the student's entry performance and typically focuses on questions such as the following: (1) Does the student possess the knowledge and skills needed to begin the planned instruction? For example, is a student's reading comprehension at a level that allows him or her to do the expected independent reading for a unit in history, or does the beginning algebra student have a sufficient command of essential arithmetic concepts? (2) To what extent has the student already developed the understanding and skills that are the goals of the planned instruction? Sufficient levels of comprehension and proficiency might indicate the desirability of skipping certain

units or of being placed in a more advanced course. (3) To what extent do the

student's interests, work habits, and personality characteristics indicate that one mode

of instruction might be better than another (e.g., group instruction versus independent

study)? Answers to questions like these require the use of a variety of techniques:

records of past achievement, pretests on course objectives, self-report inventories,

observational techniques, and so on. The goal of placement assessment is to determine

for each student the position in the instructional sequence and the mode of instruction

that is most beneficial.
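As a purely illustrative sketch of such a placement decision (the cut-off scores and unit names below are invented, not prescribed by the text), a pretest score might be mapped to a starting point in the instructional sequence as follows:

# Hypothetical placement rule based on a pretest score out of 100.
# The cut-offs and unit names are invented for illustration only.
def place_student(pretest_score):
    if pretest_score >= 85:
        return "skip Unit 1; begin with Unit 2"
    if pretest_score >= 50:
        return "begin with Unit 1"
    return "review prerequisite skills before Unit 1"

for score in (92, 63, 38):
    print(score, "->", place_student(score))

In practice, of course, the decision would draw on the full range of evidence listed above (records, pretests, inventories, observations) rather than a single cut score.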

Formative Assessment

According to Gronlund (1990):

Formative assessment of work is used while it is in the process of being carried out, so that the assessment affects the development of the work.

Formative Assessment is a part of the instructional process. When

incorporated into classroom practice, it provides the information needed to adjust teaching and learning while they are happening. In this sense, formative assessment informs both teachers and students about student understanding at a point when timely adjustments can be made. These adjustments help to ensure students achieve targeted


standards-based learning goals within a set time frame. Although formative

assessment strategies appear in a variety of formats, there are some distinct ways to

distinguish them from summative assessments.

Formative assessment is used to monitor learning progress during instruction;

its purpose is to provide continuous feedback to both student and teacher concerning learning successes and failures. Feedback to students provides reinforcement of successful learning and identifies the specific learning errors and misconceptions that need correction. Feedback to the teacher provides information for modifying instruction and for prescribing group and individual work. Formative assessment depends heavily on specially prepared tests and assessments for each segment of instruction (e.g., unit, chapter). Tests and other types of assessment tasks used for formative assessment are most frequently teacher-made, but customized tests from publishers of textbooks and other instructional materials can also serve this function. Observational techniques are, of course, also useful in monitoring student progress and identifying learning errors. Because formative assessment is directed toward improving learning and instruction, the results typically are not used for assigning

course grades.

Diagnostic Assessment

According to Gronlund (1990):

Diagnostic assessment is concerned with those educational problems which remain unsolved even after the corrective prescriptions of formative assessment.

Diagnostic assessment is a highly specialized procedure. It is concerned with

the persistent or recurring learning difficulties that are left unresolved by the standard

corrective prescriptions of formative assessment. If a student continues to experience

failure in reading, mathematics, or other subjects, despite the use of prescribed

alternative methods of instruction, then a more detailed diagnosis is indicated. To use


a medical analogy, formative assessment provides first-aid treatment for simple

learning problems and diagnostic assessment searches for the underlying causes of

problems that do not respond to first-aid treatment. Thus, diagnostic assessment is

much more comprehensive and detailed. It involves the use of specially prepared

diagnostic tests as well as various observational techniques. Serious learning

disabilities also are likely to require the services of educational, psychological, and

medical specialists, and given the appropriate diagnosis, the development of an

individualized education plan (IEP) for the student. The aim of diagnostic assessment

is to determine the causes of persistent learning problems and to formulate a plan for

remedial action.

Summative Assessment

The assessment that is carried out at the end of a piece of work is called summative

assessment.

Summative assessment typically comes at the end of a course (or unit) of

instruction. It is designed to determine the extent to which the instructional goals have

been achieved and is used primarily for assigning course grades or for certifying student mastery of the intended learning outcomes. The techniques used in summative assessment are determined by the instructional goals, but they typically include teacher-made achievement tests, ratings on various types of performance (e.g., laboratory work, oral reports), and assessments of products (e.g., themes, drawings, research reports). These various sources of information about student achievement may be systematically collected into a portfolio of work that may be used to summarize or

showcase the student's accomplishments and progress. Although the main purpose of

summative assessment is grading, or the certification of student achievement, it also

provides information for judging the appropriateness of the course objectives and the

effectiveness of the instruction.


1.1.3 Measurement

Meaning &Definition of Measurement

Literally, the verb measure means to find or determine the 'size', 'quantity' or 'quality' of anything. According to the Chambers Dictionary, the term 'measure' means 'to find out the size or amount of something'. "Measurement" in the International Dictionary of Education (by G. Terry Page & J.B. Thomas) means "the act of finding the dimension of any object and the quantity found by such an act".

The Oxford Advanced Learner's Dictionary defines 'measurement' as the

'standard or system used in stating the size, quantity or degree of something.' It is the

way of assessing something quantitatively. It answers the question "How much?" In

other words we can say that measurement is the quantitative aspect of evaluation.

With the help of measurement we can easily describe students' achievement by telling

their scores. These definitions show that 'measurement' is the quantitative assessment

of something. Now let's see how the term is defined specifically in education. L. R.

Gay (1985) defines measurement as "a process of quantifying the degree to which

someone or something possesses a given trait, i.e. quality, characteristics or features."

Educational Measurement

(The concept of measurement in education)

In Education, the term 'measurement' is used in its specific meaning. It is the

quantitative assessment of the performance of a student, teacher, curriculum or an

educational program. We can say that the quantitative score used for educational

evaluation is called measurement. The term is used for the data collected about

student or teacher performance by using a measuring instrument in a given learning

situation. It shows the exact quantity or degree of the performance, traits or character

of the person or thing to be measured. For example, instead of saying that Hamid is underweight for his age and height, we can say that Hamid is 18 years old, 5' 8" tall, and weighs only 85 pounds. Similarly, instead of saying that Hamid is more intelligent than Zahid, we can say that Hamid has a measured IQ of 125 and Zahid has a measured IQ of 88. In each of the above cases, the numerical statement is more

precise, more objective and less open to interpretation than the corresponding verbal

statement.

Steps of measurement

There are two steps in the process of measurement. The first step is to devise a set of operations to isolate the attribute and make it apparent to us. Just as a standard is used for judging the durability of a thing, in the same way educators and psychologists use various methods for testing the behaviour or performance of a student. For this purpose they often use the Stanford-Binet tests or other tests that include operations for eliciting behaviour that we take to be indicative of intelligence.

The second step in measurement is to express the results of the operations

established in the first step in numerical or quantitative terms. This involves an answer to the question: how many, or how much? Just as the millimetre is used as a unit for indicating the thickness of a thing, in the same way educators and psychologists use numerical units for gauging intelligence, emotional maturity and other attributes. Thus each step in measurement rests on human-fashioned definitions. In the first step, we define the attribute that interests us. In the second step, we define the set of operations that will allow us to identify the attribute, and we express the result of our operations.
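These two steps can be made concrete with a small sketch: step one defines the set of operations that makes the attribute observable (here a hypothetical five-item quiz and its answer key, both invented for this illustration), and step two expresses the outcome of those operations as a number.

# Step 1: define the operations that make the attribute observable --
# a hypothetical five-item quiz and its scoring key.
answer_key = ["b", "d", "a", "c", "b"]

def elicit_responses():
    # In practice these would come from the examinee; they are hard-coded
    # here so that the sketch is self-contained.
    return ["b", "d", "c", "c", "b"]

# Step 2: express the result of those operations in numerical terms.
responses = elicit_responses()
raw_score = sum(r == k for r, k in zip(responses, answer_key))
print(f"Raw score: {raw_score} out of {len(answer_key)}")

The number printed at the end (4 out of 5) is the measurement; by itself it says nothing about whether that performance is good or poor, which is the province of evaluation.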

Difference between Evaluation and Measurement

Some people use 'evaluation' and 'measurement' with the same meaning. Both terms are used for the process of assessing the performance of the student and collecting information about an educational objective. Both tell how effective the school programme has been and refer to the collection of information, the appraisal of students, and the assessment of programmes. Some recognize that measurement is one of the essential components of evaluation. But there is a difference between the two terms. Roughly speaking, 'measurement' is the quantitative assessment, whereas 'evaluation' is the quantitative as well as qualitative assessment of the performance of a student or an educational objective. Measurement is a limited process used for the assessment of limited and specific educational objectives; evaluation, on the other hand, is a much more comprehensive term used for all kinds of educational objectives. Moreover, evaluation is the continuous inspection of all available information concerning the student, teacher, educational programme and the teaching-learning process, to ascertain the degree of change in students and to form valid judgements about the students and the effectiveness of the programme; 'measurement', on the other hand, is the collection of data about the performance of a student, teacher or curriculum, etc.

However, 'evaluation' and 'measurement' are closely related; we cannot separate one from the other. Both are used for assessing the effectiveness of a programme or the appraisal of students. Measurement collects data directly from the objects of concern, the students; other information is collected from students by non-testing procedures. Information provided by both testing and non-testing procedures is best thought of as material to be used in the evaluation process.

The Importance of Measurement in Education

Measurement plays very important role in the teaching-learning process.

Without measurement we cannot assess the effectiveness of an educational

programme, the school or its personnel. For effective teaching, it is necessary for the

teacher to be aware of the strengths and weaknesses of his teaching method.

Similarly, for an effective learning, it is necessary for the student to be aware of the

possible outcomes of all the alternatives. He should also be informed about the

advantages and disadvantages of the respective outcomes. All this is impossible

without measurement. Without measurement, how can a teacher be aware of his method of teaching, or how can a student be informed about the outcomes of the alternatives?


Without measurement, evaluation is impossible and without evaluation we cannot get

knowledge of the effectiveness of an educational programme. Measurement tells us

about the characteristics of students, their progress in studies and their achievements

in various subjects. It also tells how much, or to what extent, the instructional objectives of the school and of the individual classroom teacher are being achieved.

Measurement serves as a guideline for students to develop their educational and

vocational plans for the future. With the help of measurement, information is gathered

about school programmes, policies, and objectives. This information is conveyed to

parents and other members of the community. Similarly, measurement data are used

by employers and educational institutions in making selection decisions. With the help of standardized tests, administrators collect information about every applicant. The information provided by the tests increases the accuracy and objectivity of administrators and decision-makers. In this way measurement data are

employed by school officials in making curricular decisions.

In short, measurement occupies the central place in the process of teaching

and learning. It is the only means through which the educational condition can be

improved.

The Function of Measurement and Evaluation

'Measurement' and 'evaluation' are interdependent. We cannot separate one from the other, just as we cannot separate the two sides of a coin. Evaluation is the qualitative aspect of anything, which is based on the quantitative value (measurement) of that thing. Without measurement we cannot make an exact evaluation of a thing. In this respect evaluation and measurement perform the same functions in education.

Cronbach, in his book "Essentials of Psychological Testing", has discussed the following functions of measurement and evaluation.


(1) Effectiveness of Educational Programme

In education, the concerned people and personnel must be aware of the

effectiveness of an educational programme. This is possible only by making an

evaluation of that programme. By evaluation, a teacher is able to know to what extent the method of teaching is effective. He is also able to know to what extent the laboratory equipment is effective. This will enable him to improve his method of teaching and make the learning process effective.

(2) Prediction

After evaluation it is possible to predict the performance of students in the future. Through evaluation we come to know students' aptitudes and interests, with the help of which we can guide them to take admission in institutions that match those aptitudes and interests. So, on the basis of evaluation, we can plan for the future.

(3) Selection

Measurement and evaluation are used during the selection of suitable persons for different jobs in government as well as semi-government departments.

(4) Classification

Evaluation is helpful in classification in all educational institutions. At the end of every year, tests are given to students to check their ability, and classification is made on the basis of the results obtained from these tests.

Another educational psychologist, Camp, adds that evaluation plays an important role in making maladjusted students useful members of society by identifying their interests and attitudes. Students suffering from an inferiority complex can also be treated after proper evaluation.

In short, evaluation and measurement have important functions in education.

They serve as guidelines for students, teachers, counsellors and administrators.


1.1.4 Test

Measurement and evaluation are the two processes that are used to collect

information about the strengths and weaknesses of an educational programme or the

performance of a student, teacher or other personnel. But these processes need some

instruments for their operations. Such instruments are called tests. So, the instruments

that are used to measure a sample of students' behaviour under specific conditions

are called tests. In other words we can say that:

"A test is a systematic procedure for measuring a sample of students' behaviour under

specific conditions."

Some other definitions of test are given below:

1. A procedure for critical evaluation; a means of determining the presence,

quality, or truth of something.

2. A series of questions, problems, or physical responses designed to determine

knowledge, intelligence, or ability.

3. The means by which the presence, quality, or genuineness of anything is

determined: (e.g. a test of a new product.)

4. The trial of the quality of something: (e.g. to put to the test.)

5. A particular process or method for trying or assessing.

6. A set of problems, questions, etc., for evaluating abilities or performance.

A test consists of a number of questions to be answered, a series of problems

to be solved, or a set of tasks to be performed by the examinees. The questions might

ask the examinees to define a word, to do arithmetic computations, or to give some

information. The questions, problems and tasks are called test items.

Difference between Test, Measurement and Evaluation:


William Wiersma and Stephen G. Jurs (1990), in their book "Educational Measurement and Testing", remark that the terms testing, measurement, assessment and evaluation are used with similar meanings, but they are not synonymous though they are related to each other. They define these terms as follows:

Test: "(It) has a narrower meaning than either measurement or assessment. Test

commonly refers to a set of items or questions under specific conditions. When a test

is given, measurement takes place; however, all measurement is not necessarily

testing".

Measurement: "For all practical purposes assessment and measurement can be

considered synonymous. When assessment is taking place, information or data are

being collected and measurement is being conducted".

Evaluation: "Evaluation is a process that includes measurement and possibly testing

but it also contains the notion of a value judgment. If a teacher administer a test to a

class and computes the percentages of correct responses, measurement and testing

have taken place. The scores must be interpreted which may mean converting them to

values like As Bs Cs and so on or judging them to be excellent, good, fair or poor.

This process is evaluation because the value judgments are being made".
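To make this distinction concrete, the short Python sketch below first computes the percentage of correct responses (measurement and testing) and then attaches a letter grade to it (evaluation, a value judgment). It is only an illustration; the grade boundaries are assumed, not taken from the text.

# A minimal sketch of measurement versus evaluation (illustrative only).

def percent_correct(raw_score, total_items):
    # Measurement: express the raw score as a percentage of correct responses.
    return 100.0 * raw_score / total_items

def letter_grade(percent):
    # Evaluation: attach a value judgment (A, B, C, ...) to the measured percentage.
    # The cut-off points below are hypothetical, chosen only for illustration.
    if percent >= 80:
        return "A"
    elif percent >= 70:
        return "B"
    elif percent >= 60:
        return "C"
    elif percent >= 50:
        return "D"
    return "F"

score = percent_correct(raw_score=34, total_items=40)   # measurement and testing
print(score, letter_grade(score))                        # 85.0 A  (evaluation)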

Another distinction is given by Norman E. Gronlund (1985), who defines these terms as follows in the book "Measurement and Evaluation in Teaching".

Test: "An instrument or systematic procedure for measuring a sample of behaviour.

(Answers the question "How well does the individual perform-either in comparison

with others or in comparison with a domain of performance tasks"?

Measurement: "The process of obtaining numerical description of the degree to

which an individual possesses a particular characteristic. (Answers the question "How

much?").


Evaluation: "The systematic process of collecting, (Classroom) analyzing and

interpreting information to determine the extent to which pupils are achieving

instructional objectives. (Answers the question "How good").

Similarly Anthony J. Nitko (1983) in his book "Educational Tests and

Measurement" makes the distinction between Test, Measurement and Evaluation in

the following words:

Tests: "Tests are systematic procedures for observing persons and describing them

with either a numerical scale or a category system. Thus tests may give either

qualitative or quantitative information". Measurement: "Measurement is a procedure

for assigning numbers to specified attributes or characteristics of persons in a manner

that maintains the real world relationships among persons with regard to what is being

measured".

Evaluation: "Evaluation involves judging the value or worth of a pupil of an

instructional method or of an educational program. Such judgements may or may not

be based on information obtained from tests".

Robert L. Ebel and David A. Frisbie (1986) in their book "Essentials of Educational Measurement" rightly observe:

"All tests are a subset of the quantitative tools or techniques that are classified

as measurements. And all measurement techniques are a subset of the quantitative and

qualitative techniques used in evaluation."

Table showing the relationship between Test, Measurement and Evaluation:

Test: An instrument or systematic procedure for measuring a sample of behaviour.
Measurement: The process of obtaining a numerical description of the degree to which an individual possesses a particular characteristic.
Evaluation: A systematic process of collecting and analyzing data in order to make decisions.

Test: Answers the question "How well does the individual perform, as compared to others?"
Measurement: Answers the question "How much?"
Evaluation: Answers the question "How good?"

Test: It is a means of collecting information.
Measurement: It gives a numerical value to some trait.
Evaluation: It involves qualitative and quantitative assessment and decision-making.

Test: Its objective is to find out the facts pertaining to some aspect.
Measurement: Its objective is to present the information objectively.
Evaluation: Its objective is to make decisions about all components of the educational system.

Test: A test is only an instrument to obtain data.
Measurement: Measurement quantifies data and is an essential part of evaluation.
Evaluation: Evaluation depends upon testing and measurement for data.

Types of Tests

TESTS
(A) Ability Tests: (1) Achievement Tests (Objective Tests, Essay Tests), (2) Aptitude Tests, (3) Intelligence Tests
(B) Personality Tests: Attitude Tests, Character Tests, Interest Tests, Adjustment Tests

As shown in the outline above, tests can be classified into two broad categories according to the behaviour tested: ability tests and personality tests. These two types are discussed in detail and are further classified into sub-types in the following lines.

(A) Ability Tests

These tests are used to test the ability of a student. They measure the maximum performance of a student, i.e. the best that a student can do. Ability tests are further classified into three types: (1) achievement tests, (2) aptitude tests, (3) intelligence tests. These are discussed in the lines below.

(1) Achievement tests: These tests are used to appraise the outcomes of

classroom instruction. They measure the attained ability of a student i.e. what a

student has learnt to do. Achievement tests are further classified into two types of

tests i.e. 'Essay type tests' and 'objective type tests'. (These two types of tests will be

discussed in detail in the next question).

(2) Aptitude tests: Aptitude Tests are those tests that are used to measure the

potential ability of a student i.e. what a student can learn to do. They measure the

capacity of a student to learn a given content. According to Hull, C. L., "An aptitude test is a psychological test designed to predict an individual's potentialities for success or failure in a particular occupation, subject for study, etc." This shows that an aptitude test is a test designed to discover what potentiality a given person has for learning some particular vocation or acquiring some particular skill. Achievement tests and aptitude tests seem to be the same, but the distinction between the two is that they differ in use. If a test is used to measure present attainment, it is called an achievement test; and if a test is used to predict the future level of performance, it is called an aptitude test.


(3) Intelligence Tests: Intelligence tests are those tests that are used to measure the native capacity or the overall mental ability of a student. They are also called scholastic aptitude tests or tests of mental ability. There are many kinds of intelligence tests, but the most popular concept is the intelligence quotient (IQ) introduced by Terman. IQ is computed by dividing the mental age (MA) of a student by his physical or chronological age (PA or CA), i.e. the actual age of the student, and then multiplying the result by 100.

I.Q. = (M.A. / C.A.) × 100

Where:

I.Q. = Intelligence Quotient

M.A. = Mental Age

C.A. = Chronological Age (Physical Age)
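The formula can be applied directly. The small Python sketch below is a minimal illustration of the ratio IQ calculation; the ages used are hypothetical, and the assumption that ages are expressed in months (any consistent unit would do) is ours, not the text's.

# A minimal sketch of the ratio IQ formula: I.Q. = (M.A. / C.A.) x 100.
# Ages are assumed to be in months; any consistent unit gives the same result.

def ratio_iq(mental_age, chronological_age):
    return (mental_age / chronological_age) * 100

# A 10-year-old (120 months) with a mental age of 12 years (144 months):
print(round(ratio_iq(144, 120)))   # 120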

(B) Personality Tests

Tests used for the assessment of personality of a student are called personality

tests. They measure the typical performance of a student i.e. what a student will do.

They are widely administered all over the world in various fields, vocations and institutions, and for the selection of recruits. In Pakistan, too, personality tests are used for job selection and for the selection of army recruits, as in the ISSB

temperament tests, character tests, and tests of other motivational and interpersonal

characteristics.

Uses of Tests

Tests play an important role in the teaching-learning process. Without tests we can neither evaluate nor assess a student's or a teacher's performance, nor can we collect information about the effectiveness of an educational programme.


That is why tests are very important in education. They motivate students for learning.

They serve a number of purposes in a variety of educational activities. The following

are the different uses of tests:

1. Uses of tests in teaching process

With the help of the results obtained from tests, teachers can easily collect information about the aptitude, intelligence, interests, attitudes and overall performance of their students. The teacher comes to know the strengths and weaknesses of his teaching method. It becomes easy for him to grade students in a subject, and test results enable him to predict the future success of a student in a subject.

2. Uses of tests in learning process

The student is the centre of interest in the teaching-learning process. All kinds of educational activities are performed for the sake of the student. That is why the use and importance of tests in the process of learning is greater than in any other activity. Tests help students in knowing their strengths and weaknesses in a subject. The results obtained from these tests serve as guidelines for students. They motivate students to study.

3. Uses of Tests in Guidance

Tests show the overall performance of the students. Therefore, they enable the examiner to know how to guide students in their educational and vocational choices. Tests also make parents aware of the aptitude of their children, so that they can make a plan for their proper guidance. The result of the tests in itself serves as a guideline for the students.

4. Uses of Tests in Administration

The results obtained from the tests provide the administrators of the department with useful information. In the light of these results, they can easily decide how to promote students, how to admit them, and how to modify the school objectives, instructional methods and curricula. They can then easily decide how to make the teaching-learning process effective.

5. Uses of Tests in Research

The data collected from tests are used as powerful tools in research and experimentation in the classroom. Research workers use these data in their genetic or case-study research.

In short, tests are used in almost all educational activities. They are the real tools with the help of which information about teachers, students, curricula, etc. is gathered, and in the light of this information the teaching and learning process is improved.

1.2 THE PURPOSE OF TESTING

Introduction:

The purpose of a test is usually stated when the test is announced, or at the beginning of the semester when the evaluation procedures are described as part of the general orientation to the course. Should there be any doubt whether the purpose of the test is clear to all pupils, however, it should be explained again at the time of testing. This is usually done orally. The only time a statement of the purpose of the test needs to be included in the written directions is when the test is to be administered to several sections taught by different teachers; a written statement of purpose then ensures greater uniformity. Various types of tests are used in educational institutions because no single test can cover a child's abilities, interests and personality. One test measures only a specific ability; that is why school administrators use many different types of tests. Even in one single area, such as intelligence, more than one test is needed over a period of years to obtain a reliable estimate of ability. Each test serves its own purpose; in general, however, testing and evaluation serve the following purposes.


Types of Testing:

There are four types of testing.

Placement Testing:

Most placement tests constructed by classroom teachers are pretests designed to measure:

1. Whether pupils possess the prerequisite skills needed to succeed in a unit or course, or

2. To what extent pupils have already achieved the objectives of the planned

instruction.

In the first instance we are concerned with the pupils' readiness to begin the

instruction. In the second we are concerned with the appropriateness of our

planned instruction for the group and with proper placement of each pupil in

the instructional sequence.

Formative Testing:

Formative tests are given periodically during instruction to monitor pupils' learning progress and to provide ongoing feedback to pupils and teacher. Formative testing reinforces successful learning and reveals learning weaknesses in need of correction. A formative test typically covers some predefined segment of instruction and samples a rather limited set of learning tasks. The test items may be easy or difficult, depending on the learning tasks in the segment of instruction being tested. Formative tests are typically criterion-referenced mastery tests, but norm-referenced survey tests can also serve this function. Ideally, the test will be constructed in such a way that corrective prescriptions can be given for missed test items or sets of test items. Because the main purpose of the test is to improve learning, the results are seldom used for assigning grades.


Diagnostic Testing:

Diagnosis of persistent learning difficulties involves much more than diagnostic testing, but such tests are useful in the total process. The diagnostic test takes up where the formative test leaves off: if pupils do not respond to the feedback and corrective prescriptions of formative testing, a more detailed search for the source of the learning errors is needed. For this we need to include a number of test items in each specific area, with some slight variation from item to item. In diagnosing pupils' difficulties in adding whole numbers, for example, we would want to include addition problems containing various number combinations, with some not requiring carrying and some requiring carrying, to pinpoint the specific types of error each pupil is making. Because our focus is on the pupils' learning difficulties, a diagnostic test must be constructed in accordance with the most common sources of error that pupils encounter. Such tests are typically confined to a limited area of instruction, and the test items tend to have a relatively low level of difficulty.
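As a concrete illustration of the addition example above, the Python sketch below sorts a small pool of addition items into those that require carrying and those that do not, so that a diagnostic test can deliberately include both kinds. The item pool and function name are hypothetical, not taken from the text.

# A minimal sketch: classify addition items by whether they require carrying.
# The item pool below is invented for illustration.

def requires_carrying(a, b):
    # True if adding a and b needs carrying in any decimal column.
    while a > 0 or b > 0:
        if (a % 10) + (b % 10) >= 10:
            return True
        a //= 10
        b //= 10
    return False

item_pool = [(23, 45), (38, 27), (51, 18), (46, 46), (120, 305), (276, 148)]
with_carrying = [item for item in item_pool if requires_carrying(*item)]
without_carrying = [item for item in item_pool if not requires_carrying(*item)]
print("carrying needed:", with_carrying)     # (38, 27), (46, 46), (276, 148)
print("no carrying:", without_carrying)      # (23, 45), (51, 18), (120, 305)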

Summative Testing:

The summative test is given at the end of a course or unit of instruction, and the results are used primarily for assigning grades or certifying pupil mastery of the instructional objectives. The results can also be used for evaluating the effectiveness of the instruction. The end-of-course test (final examination) is typically a norm-referenced survey test that is broad in coverage and includes test items with a wide range of difficulty. The more restricted end-of-unit summative test might be norm-referenced or criterion-referenced, depending on whether mastery or developmental outcomes are the focus of instruction.

Purpose of Testing:

1. To Certify Pupils’ Achievements / Grading:

Tests are given to the students to ascertain their achievements. Tests provide the teacher with a student's actual achievements instead of an intuitive generalization based on simple observation. These tests give the teacher an objective and comprehensive picture of each pupil's progress. This is important because all concerned persons (the students themselves, students' parents, teachers, counselors, administrators, employers, admission officers, and even the community) need to know how students performed in school and in particular courses.

To Report Students' Progress to Parents:

Testing gives the teacher an objective and comprehensive picture of each pupil's progress, so that it can be presented to the parents. These reports form the foundation for effective cooperation between parents and teachers, which results in improved learning.

To Report to Administrators:

The results of tests indicate the extent to which the school's objectives are being achieved. From the results of evaluation the administrators are able to identify the weak points and strengths in the teaching programs of their schools and take the necessary action for their improvement.

To Assess Learner’s Needs:

Testing the pupils' knowledge and skills at the beginning of instruction enables the teacher to answer questions like: Do the pupils possess the abilities and skills needed to proceed with the instruction? To what extent have the pupils already mastered the intended outcomes? This information helps the teacher in planning his instructional activities.

To Provide Relevant Instruction:

Testing provides a type of continuous feedback about the usefulness of the instructional process. It helps the teacher in changing and adapting the instructional activities continuously according to the students' needs.


To Furnish Instruction:

Testing functions as an instructional device: it increases not only the self-knowledge of the students but also the attainment of specific objectives. The practice of giving tests is common in our institutions; through them the students become aware of their speed of progress, errors, and present status, on the basis of which they plan their further efforts.

To Provide Guidance and Counseling:

The results of tests are especially useful for guidance and counseling of the

students. These are useful in assisting the students with educational and vocational

decisions, guiding them in the selection of curricular and co-curricular activities, and

helping them solve personal and social adjustment problems.

To Know the level of Achievement of Objectives:

The first step in the instructional process is to determine the extent to which the pupils have achieved the instructional objectives. Testing and evaluation help in this regard: tests are useful in determining the learning outcomes of classroom instruction. The teacher can evaluate the success or failure of classroom learning in relation to the test results, and then adjust the level and direction of classroom instruction accordingly.

To Analyze the Instructional Objectives:

The information from carefully developed tests and evaluation is used to assess the appropriateness and attainability of the instructional objectives. The instructional objectives are then modified in the light of the evaluation information.

To Discover Maladjusted Children:

In every school there are some students who present severe problems of

educational or social adjustment. These include the withdrawn, the unhappy, the

mentally retarded, and others who are not adjusting to the pattern of the school. The


standardized tests help the teachers and counselors to understand and help such

students.

To Appraise Educational Instrumentalities:

Testing and evaluation are useful in the appraisal of educational instrumentalities such as teachers, teaching methods, teaching materials and textbooks.

To Conduct Research:

Test and evaluation data are important in research programs. The information obtained from evaluation is used to compare the effectiveness of different curricula, different teaching methods, different organizational plans and different techniques of evaluation, and to find ways to improve the teaching-learning process.

To Change the Curricula:

One purpose of tests and evaluation is to find out the weak points in the curriculum so that it can be changed in accordance with the needs of society.

To measure Behavior in Controlled Situation:

Another purpose of tests is to measure the behavior of the subject or student

under controlled conditions.

1.3 GENERAL PRINCIPLES OF ASSESSMENT:

Assessment is an integrated process for determining the nature and extent of

student learning and development. In order to make this process effective, the

following principles are taken into consideration.

1) Clearly specifying what is to be assessed has priority in the assessment process. The effectiveness of assessment depends as much on a careful description of what to assess as it does on the technical qualities of the assessment procedures used. When assessing student learning, this means clearly specifying the intended learning goals before selecting the assessment procedures to use.


2) An assessment procedure should be selected because of its relevance to the

characteristics or performance to be measured. Assessment procedures are

frequently selected on the basis of their objectivity, accuracy or convenience.

3) Comprehensive assessment requires a variety of procedures. No single type of

instrument or procedure can assess the vast array of learning and development

outcomes emphasized in a school program. Multiple choice and short answer

tests of achievement are useful for measuring knowledge, understanding, and

application outcomes, but essay tests and other written projects are needed to

assess the ability to organize and express ideas. A complete picture of student

achievement and development requires the use of many different assessment

procedures.

4) Proper use of assessment procedures requires an awareness of their limitations. Assessment procedures range from highly developed measuring instruments to rather crude assessment devices. Even the best educational and psychological measuring instruments yield results that are subject to various types of measurement error.

No test or assessment asks all the questions or poses all the problems that might appropriately be presented in a comprehensive coverage of the knowledge, skills and understanding relevant to the content standards or objectives of a course or instructional sequence. Instead, only a sample of the relevant problems or questions is presented.

Even in a relatively narrow part of a content domain, such as understanding

photosynthesis or the addition and subtraction of fractions, there are a host of

problems that might be presented, but any given test or assessment samples

but a small fraction of those problems. Limitations of assessment procedures

do not negate the value of tests and other types of assessments. A keen

awareness of the limitations of assessment instruments makes it possible to


use them more effectively. The cruder the instrument, the greater its limitations and, consequently, the more caution is required in its use.

5) Assessment is a means to an end, not an end in itself. The use of assessment

procedures implies that some useful purpose is being served and that the user

is clearly aware of this purpose. To blindly gather data about students and then

file the information away is a waste of both time and effort. Assessment is best

viewed as a process of obtaining information on which to base educational

decisions.

1.4 TYPE OF EVALUATION PROCEDURE

The evaluation process can basically be carried out at two main levels: programme and student. Student evaluation can be further subdivided into formative and summative evaluation.

Evaluation procedures

1. Programme evaluation

2. Student evaluation: (i) formative evaluation, (ii) summative evaluation, (iii) diagnostic evaluation

Programme Evaluation:

Programme evaluation is a systematic method for collecting, analyzing, and using information to answer questions about projects, policies and programmes, particularly about their effectiveness and efficiency.

When our concern is judging the compatibility between the aims and the learning outcomes of a programme, the emphasis is on the efficacy of that programme. On the other hand, a 'good' programme may be badly implemented. The task of quality control is to maintain and maximize the efficiency of a programme.

The quality of the content of a programme is determined, among other factors, by:


i. Its conceptual quality.

ii. Logical relevance to the need of the student.

iii. Simplicity and comprehensibility in terms of readability and literacy level of

the content.

iv. Relative stability and survival value in the literature.

v. Applicability to familiar and novel situation.

No matter how good a programme may be, the maintenance system must be well facilitated. The school administrators, heads of subject units, supervisors, the teacher and the pupils must be actively involved if successful implementation of the programme is to be realized. The teacher, being the main executor of the programme, must be well trained, not just to be able to teach facts but to select facts that relate to other facts and principles. The teacher education programmes in the advanced teacher colleges and the universities must prepare teachers to be able to teach their subjects effectively.

In order to be implemented, a programme should be designed in such a way that under favourable conditions certain intended learning outcomes will emerge. The school teacher, the headmaster and the supervisor must gather information from time to time in order to determine the success or weakness of the programme. If desirable outcomes are observed, the focus of all concerned with instruction should be to improve the programme through an effective maintenance system. If the products (students) produced are of poor quality, corrective measures are selected and applied in order to achieve the desired results. If, after all these efforts, the products are still found to be poor, the programme is usually abandoned.

Several processes are involved in the input-output process. The teacher is the most important component of the maintenance process of the programme. He interacts with the students, the staff, experts and administrators, and forms a bridge between them and the learning materials. Often he acts as the input analyzer and the identifier, as well as the teaching agent of the programme. The external sensor examines the learning environment to identify changes, perhaps economic, political, psychological or social, within the environment that can destabilize the system.

The input analyzer processes all the information supplied by the external sensor and transmits it to the school administrator for appropriate action. He analyzes and organizes the information obtained from the input variables into a comprehensible structure to be used in planning activities. The identifier (usually the teacher or his head of department) examines the output and the internal working conditions of the maintenance system. It is he who provides the decision maker (the headmaster) with a reliable picture of the internal condition of the system. The input-output information provided by the analyzer and the identifier becomes the input for decision rules, and it is utilized by the headmaster to produce a decision, policy or instruction to the teacher.

Any given programme introduced into a school setting is not left in its original form but assumes a different form in that setting, as its contents are emphasized differently by teachers, administrators and students.

Programme evaluation can be carried out through the use of surveys, interviews, experimental studies and so on.

Student Evaluation:

As pointed out earlier, testing forms an integral part of student evaluation. The purpose of this type of evaluation is to determine how well a student is performing in a programme. Through a series of oral questions, paper-and-pencil tests, manipulative skill tests, tutorial discussions, individualized instruction, assignments, projects and so on, the student is gradually guided towards a desired goal. Basically there are two types of student evaluation:

i. Formative and ii. Summative


i. Formative Evaluation:

Formative evaluation aims at ensuring a healthy acquisition and development of knowledge and skills by students. Formative evaluation is also used to identify students' needs in order to guide them towards desirable goals. As student needs and difficulties are identified, appropriate remedial measures are taken to solve such problems. The purpose is to find out whether, after a learning experience, students are able to do what they were previously unable to do. A short-term objective of formative evaluation may be to help students perform well at the end of the programme. It is a process of channelling input variables through a process that will yield the expected outputs. The classroom teacher is the best formative evaluator. Formative evaluation attempts:

1. To identify the content (knowledge or skills).

2. To appreciate the level of cognitive abilities such as memorization,

classification, comparison, analysis, explanation, quantification, application

and so on.

3. To specify the relationship between content and levels of cognitive abilities.

In other words, formative evaluation provides the evaluator with useful information about the strengths or weaknesses of the student within an instructional context.

1. Formative evaluation is done during an instructional programme.

2. The instructional programme should aim at the attainment of certain objectives during the implementation of the programme.

3. Formative evaluation is done to monitor learning and to modify the programme, if needed, before its completion.

4. Formative evaluation is for current students.


Characteristics of Formative Evaluation

1. It focuses relatively on molecular (detailed) analysis.

2. It is cause-seeking.

3. It is interested in the broader experience of the programme users.

4. Its design is exploratory and flexible.

5. It seeks to identify influential variables.

6. It requires analysis of the instructional material for mapping the hierarchical structure of the learning tasks, and actual teaching of the course for a certain period.

Summative Evaluation

Summative evaluation is primarily concerned with the purposes, progress and outcomes of the teaching-learning process. It attempts as far as possible to determine to what extent the broad objectives of a programme have been achieved. It is based on the following assumptions:

1. That the programme's objectives are achieved.

2. That the teaching-learning process has been conducted efficiently.

3. That the teacher-student-material interactions have been conducive to learning.

4. That there is uniformity in classroom conditions for all learners.

Unlike formative evaluation, which is guidance-oriented, summative evaluation is judgmental in nature. Promotion examinations, the first school-leaving certificate examination and public examinations belong to this form of evaluation. Summative evaluation carries a threat with it, in that the student may have no knowledge of the evaluator.

According to A. J. Nitko (1983), summative evaluation is concerned with an already completed programme, procedure or product. Summative evaluation is done at the conclusion of instruction and measures the extent to which students have attained the desired outcomes.


Chief Characteristics of Summative Evaluation:

1. It tends to use a well-defined evaluation design.

2. It focuses on analysis.

3. It provides descriptive analysis.

4. It tends to stress local effects.

5. It is unobtrusive and non-reactive as far as possible.

6. It is concerned with a broad range of issues.

7. Its instruments are reliable and valid.

Difference between the Summative and Formative Evaluation

In the beginning these terms were applied to the evaluation of curricular work only. M. Scriven explains the difference between these terms as follows in his book Evaluation Thesaurus (1980):

"Formative evaluation is conducted during the development or improvement of a programme or product (or person). It is an evaluation conducted for the in-house staff, but it may be done by an internal or external evaluator or (preferably) a combination. Summative evaluation, on the other hand, is conducted after completion of a programme (or a course of study) and for the benefit of some external audience or decision maker (e.g. a funding agency or future possible users), though it may be done by an internal or an external evaluator or by a combination."

Gloria Hitchcock and others (1986) state the difference between summative and formative evaluation in these words: "It is fairly straightforward to produce an 'ideal' type of either a summative or a formative profile. It is far more difficult to combine the two into one unified system. The underlying philosophies of the two appear difficult to reconcile."

Following are the main differences between these types of evaluation:

1. They differ in purpose, nature and timing.


2. Summative evaluation is the terminal assessment of performance at the end of instruction, but formative evaluation is the assessment made during the instructional phase to inform the teacher about progress in learning and what more is to be done.

3. Summative evaluation limits the use of profiles and records of achievement, but they are regularly used in formative evaluation.

4. In summative evaluation, the assessment is done to test learning outcomes against a set of objective criteria without revealing to the teacher the details of the route which the student followed in reaching that point. Formative evaluation takes the form of a dialogue between the student and the teacher in which both determine the task.

Broad Differences between Formative and Summative Evaluation:

Purpose: Formative: to monitor the progress of the student and provide feedback. Summative: to check the final status of the student.

Content focus: Formative: detailed, narrow scope. Summative: general, broad scope.

Methods: Formative: daily assignments and observations. Summative: projects.

Frequency: Formative: daily. Summative: weekly, quarterly, etc.

1.5 NORM- REFERENCED AND CRITERION REFERENCED TEST:

A test designed to provide a measure of performance that is interpretable in terms of an individual's relative standing in some known group is called a norm-referenced test. A norm group may be made up of students at the local level, district level, provincial level or national level.

Types of Norms: There are two types of norms, which are the following:


a) National Norms: Most standardized achievement and aptitude tests require national norms because the tests are intended for use across the country. The norm group should represent the population of students in the country.

b) Local Norms: There are many communities where local norms are more useful than the national ones; for example, there may be some cities whose citizens are above the national averages in educational and socioeconomic level.

Characteristics of Norm-Referenced Test

1. Its basic purpose is to measure students' achievement of curriculum-based skills; therefore it covers the majority of the course.

2. It is prepared for a particular grade level. As the test is curriculum-based, it can only be applied to the particular class for which it is prepared.

3. It classifies achievement as above average, average or below average for a given grade.

4. It is generally reported in the form of percentile ranks, linear standard scores, normalized standard scores and grade-equivalent scores.

5. A norm-referenced test is likely to have items that are very difficult for the grade level, so that students can be ranked.

Drawbacks of Norm–Referenced Test

1. Test items that are answered correctly by most of the pupils are not included in these tests because of their inadequate contribution to response variance, yet these will often be the items that deal with the important concepts of the course content.

2. Norm-referenced tests compare an individual's performance to the performance of a group, called the norm group; an entirely different conclusion about the same performance will be reached if the norm group is, for example, a collection of university seniors majoring in physics.


a) Criterion – Referenced Test:

1. According to Gronlund (1985), a test designed to provide a measure of performance that is interpretable in terms of a clearly defined and delimited domain of learning tasks is called a criterion-referenced test.

2. According to Wiersma and Jurs (1990), criterion-referenced tests describe the performance of the student in terms of the actual skills or tasks that are included in the test.

b) Characteristics of Criterion-Referenced Tests:

1. It measures student’s achievement of curriculum based skills.

2. It is prepared for a particular grade or course level.

3. It has balanced representation of goals and objectives.

4. It can be administered before and after instruction.

5. It is used to evaluate the curriculum plan, instructional progress and group

student’s interaction.

c) Limitations of CRTs:

CRTs tell only whether a learner has reached proficiency in a task area, but do not show how good or poor the learner's level of ability is.

Tasks included in CRTs may be highly influenced by a given teacher's interests or biases, leading to a general validity problem.

Only some areas readily lend themselves to the listing of the specific tasks from which a test can be built, and this may be a constraining element for the teacher.

1.6 EDUCATIONAL:

"Educational assessment can be defined as the process of documenting knowledge, skills, attitudes and beliefs."

Or


"Assessment is the process of collecting, synthesizing and interpreting information to aid in decision making."

General Principles of Assessment:

Following are the main principles of assessment.

1. Clearly specifying what is to be assessed has priority in the assessment process.

2. An assessment procedure should be selected because of its relevance to the characteristics or performance to be measured.

3. Comprehensive assessment requires a variety of procedures.

4. Proper use of assessment procedures requires an awareness of their limitations.

5. Assessment is a means to an end, not an end in itself.

Clearly specify what is to be assessed:

General statements from content standards or from course objectives can be a helpful starting point, but in most cases teachers need to be more specific for the assessment process to be effective. Thus specification of the characteristics to be measured should precede the selection or development of assessment procedures. Specify the intended learning goals before selecting the assessment procedures to use.

Example:

A content standard in the field of physics might specify that students should understand the ideas in important documents in the field of physics.

1. Assessment may be in the form of multiple-choice questions.

2. Short-answer questions.

3. Essay questions.

4. Numerical questions.

To establish assessment priorities for such a standard, the teacher needs to answer questions such as the following.


Q1. What idea?

Q2. What document?

Q3. What concepts of physics?

The general statement in the standard does not answer such questions, but they must be answered, either explicitly or implicitly, to develop assessments.

Assessment must be relevant to the performance to be measured:

Assessment procedures are frequently selected on the basis of their objectivity, accuracy or convenience. Although these criteria are important, they are secondary to the main criterion: relevance to the performance to be measured.

Examples:

If the teacher's goal is that students should learn writing skills such as creative writing, composition and sentence structure, then a multiple-choice test would be a poor choice of assessment; the teacher must include story writing, essays, summaries or similar tasks for assessing and improving the writing skills of a child.

"A close match between the intended learning goals and the type of assessment is a must."

Comprehensive assessment requires a variety of procedures:

Many procedures are required to assess a person's knowledge of anything. The things which are to be assessed also play a vital role in the choice of procedure. Some of the procedures are given below:

Multiple choice

Short answer

Essay test

Written projects

Observational technique


Multiple-choice and short-answer tests of achievement are useful for measuring knowledge, understanding, and application outcomes, but essay tests and other written projects are needed to assess the ability to organize and express ideas. Projects that require students to formulate problems and accumulate information through library research, or to collect data (e.g. through experimental observations or interviews), are needed to measure certain skills in formulating and solving problems. Observational techniques are needed to assess performance skills and various aspects of students' behavior, and self-report techniques are useful for assessing interests and attitudes. A complete picture of students' achievement and development requires the use of many different assessment procedures.

Proper use of Assessment Procedure Requires an Awareness of their Limitations

No single test can assess whatever the teacher wants; every procedure has its plus points and negative points, or it may simply not be suitable for the things to be assessed. So one must know about these limitations and take care of them so that we can get correct assessment results.

Some of the major problems are

1. Sampling error

2. Chance factor

3. Incorrect interpretations

Sampling Error:

An achievement test may not adequately sample a particular domain of instructional content. An observational instrument designed to assess a student's social adjustment may not sample enough behavior to give a dependable index of this trait.

Sampling error can be controlled through careful application of established measurement procedures.


Chance Factor:

A second source of error is caused by chance factors influencing assessment results, such as guessing on objective tests, subjective scoring on essay tests, errors in judgment on observational devices and inconsistent responding on self-report instruments.

Through the careful use of assessment procedures we are able to keep these errors of assessment to a minimum.

Incorrect Interpretation:

The incorrect interpretation of measurement results constitutes another major source of error. We usually treat results as more precise than they really are; that is why this problem exists. Results must be interpreted with appropriate precision.

Assessment is a means to an end, not an end in itself:

The use of assessment procedures implies that some useful purpose is being served and that the user is clearly aware of this purpose. To blindly gather data about students and then file the information away is a waste of time and effort. Assessment is best viewed as a process of obtaining information on which to base educational decisions.

Conclusion:

All these principles are very important because they are directly linked with the interpretation of information; if they are not followed, the assessment will be wrong.


UNIT-2:

JUDGING THE QUALITY OF THE TEST

Definition: Test percentile scores are just one type of test score you will find on your child's testing reports. Many test reports include several types of scores. Percentile scores are almost always reported on major achievement tests that are taken by your child's entire class. Percentile scores will also be found on individual diagnostic test reports. Understanding test percentile scores is important for you to make decisions about your child's special education program.

Test percentile scores are commonly reported on most standardized assessments a child takes in school. Percentile literally means per hundred. Percentile scores on teacher-made tests and homework assignments are developed by dividing the student's raw score on her work by the total number of points possible. Converting decimal scores to percentages is easy: the number is converted by moving the decimal point two places to the right and adding a percent sign. A score of .98 would equal 98%.
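The conversion described above is a one-line calculation. The following Python sketch reproduces it for a hypothetical 40-point assignment; the numbers are invented for illustration.

# A minimal sketch of the raw-score-to-percentage conversion described above.

raw_score = 39.2          # points earned (hypothetical)
points_possible = 40.0    # total points on the assignment (hypothetical)

decimal_score = raw_score / points_possible   # 0.98
percent = decimal_score * 100                 # move the decimal two places to the right
print(f"{percent:.0f}%")                      # 98%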

Test percentiles on a commercially produced, norm-referenced or standardized test are calculated in much the same way, although the calculations are typically included in test manuals or calculated with scoring software.

If a student scores at the 75th percentile on a norm-referenced test, it can be said that she has scored at least as well as, or better than, 75 percent of students her age from the normative sample of the test. Several other types of standard scores may also appear on test reports.

Percentile rank


The percentile rank of a score is the percentage of scores in its frequency distribution

that are the same or lower than it. For example, a test score that is greater than or


equal to 75% of the scores of people taking the test is said to be at the 75th percentile

rank.

Percentile ranks are commonly used to clarify the interpretation of scores on standardized tests. In test theory, the percentile rank of a raw score is interpreted as the percentage of examinees in the norm group who scored at or below the score of interest.

Percentile ranks (PRs) are often normally distributed (bell-shaped) while normal curve equivalents (NCEs) are uniform and rectangular in shape. Percentile ranks are not on an equal-interval scale; that is, the difference between any two scores is not the same as between any other two scores whose difference in percentile ranks is the same. For example, 50 − 25 = 25 is not the same distance as 60 − 35 = 25 because of the bell-curve shape of the distribution. Some percentile ranks are closer together than others: percentile rank 30 is closer on the bell curve to 40 than it is to 20.

The mathematical formula is:

PR = ((c_l + 0.5 × f_i) / N) × 100%

where c_l is the count of all scores less than the score of interest, f_i is the frequency of the score of interest, and N is the number of examinees in the sample. If the distribution is normally distributed, the percentile rank can be inferred from the standard score.
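The formula above can be applied directly to a list of scores. The Python sketch below is a straightforward implementation; the score distribution used in the example is invented for illustration.

# A minimal sketch of the percentile-rank formula PR = ((c_l + 0.5 * f_i) / N) * 100,
# where c_l is the count of scores below the score of interest,
# f_i is the frequency of the score of interest, and N is the number of examinees.

def percentile_rank(scores, score_of_interest):
    c_l = sum(1 for s in scores if s < score_of_interest)
    f_i = sum(1 for s in scores if s == score_of_interest)
    n = len(scores)
    return (c_l + 0.5 * f_i) / n * 100

norm_group = [45, 52, 52, 60, 61, 61, 61, 70, 75, 88]   # hypothetical norm-group scores
print(percentile_rank(norm_group, 61))                   # 55.0, i.e. the 55th percentile rank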

2.1 VALIDITY, METHODS OF DETERMINING VALIDITY:

Introduction:

Tests play a central role in the evaluation of pupil learning. They provide

relevant measures of many important learning outcomes. Tests and other evaluation

instruments serve a variety of uses in the school; for example, tests of achievement might be used for selection, placement, diagnosis or certification of mastery.


When constructing or selecting tests and other evaluation instruments, the most important question is: to what extent will the interpretation of the scores be appropriate, meaningful and useful for the intended application of the results? So validity is always concerned with the specific use of the results.

Factors Influencing Validity:

A careful examination of test items will indicate whether the test appears to measure the subject matter content and the mental functions that the teacher is interested in testing. Following are the factors that prevent the test items from functioning as intended and thereby lower the validity of the interpretation.

1. Unclear Direction:

Directions that do not clearly indicate to the pupil how to respond to the items

will reduce validity.

2. Reading Vocabulary and Sentence Structure too Difficult:

Vocabulary and sentence structure that is too complicated for the pupils will

distort the meaning of the test results.

3. Inappropriate level of difficulty:

Items that are too easy or too difficult also lower validity.

4. Poorly constructed items:

Test items that provide clues to the answers will measure the pupil’s alertness

in detecting clues as well as those aspects of pupil performance that the test is

intended to measure.

5. Ambiguity:

Ambiguous statements confuse the pupils and may cause the items to discriminate in a negative direction.

6. Inadequate time limits:

Time limits that do not provide pupils with enough time to consider the items

reduce the validity.


7. Test too short:

If the test is too short to provide a representative sample of the performance

we are interested in, its validity will suffer accordingly.

8. Improper arrangement:

Test items should be arranged in order of difficulty, with the easiest items first. Placing difficult items early may cause pupils to spend too much time on them and so lower the validity.

9. Identifiable pattern of answers:

Placing correct answers in some systematic pattern will enable pupils to guess

the answers more easily.

Methods of Determining Validity:

There are several methods of determining the validity of measuring instruments, which are the following:

1. Content Validity:

Content validity is evaluated by showing how well the content of the test

samples the class of situations. It is especially important in the case of

achievement and proficiency measures. It is also known as “face validity”.

2. Concurrent Validity:

It is evaluated by showing how well test scores correspond to already accepted measures of performance or status made at the same time. For example, we may give a social studies class a test on knowledge of basic concepts in social studies and at the same time obtain from its teacher a report on these abilities as far as the pupils in the class are concerned. If the relationship between the test scores and the teacher's report of abilities is high, the test will have high concurrent validity.

3. Predictive Validity:

It is evaluated by showing how well predictions made from the test are confirmed by evidence gathered at some subsequent time, for example when the tester wants to estimate how well a student may be able to do in college courses on the basis of how well he has done on tests he took in secondary school.

4. Construct Validity:

It is evaluated by investigating what psychological qualities a test measures. It

is ordinarily used when the tester has no definitive criterion measure of what

he is concerned with and hence must use indirect measures. This type of

validity is usually involved in such tests as those of study habits,

appreciations, understanding and interpretation of data.

Conclusion:

In short, we can say that validity is specific to the purpose and situation for which a test is used. A test can be reliable without being valid, but the converse is not true. In other words, it is conceivable that a test can measure some quality with a high degree of consistency without measuring at all the quality it was actually intended to measure.

2.2 FACTORS AFFECTING VALIDITY

Test experts generally agree that the most important quality of a test is its validity. The word "validity" means "effectiveness" or "soundness". It refers to the accuracy with which a thing is measured.

Types of Validity:

Validity is classified into three categories. 1) Content Validity. 2) Criterion

related validity. 3) Construct Validity.

A good measuring instrument is one which is valid with respect to all these three categories. These are discussed below.


i. Content Validity:

Content validity is the degree to which a test measures an intended content

area. In other words the content validity of a test refers to the extent to which

the test content represents a specified universe of content.

For Example: If a teacher taught a course in biology and would like to give a test at the end of the course, the test should adequately cover the content actually taught in that course.

ii. Construct Validity:

Construct validity is the degree to which a test measures an intended hypothetical construct; in other words, it refers to the extent to which the test measures the construct that it claims to measure. Evidence may be obtained by correlating the new measuring approach or tool with a standardized measure of ability in the same discipline (like a GRE subject test).

For Example: Examples of constructs are intelligence, creativity, the ability to apply principles and the ability to reason. For example, if a teacher wants to measure the ability to reason, he may give two reasoning tests to his class.

iii. Concurrent Validity:

Concurrent validity is the degree to which the scores on a test are related to the scores on another, already established, test administered at the same time, or to some other valid criterion available at the same time.

For Example: We may give a social studies class a test based on knowledge of basic concepts in social studies and at the same time obtain from its teacher a report on these abilities as far as the pupils in the class are concerned.

iv. Criterion related Validity:

This type of validity is used to predict future or current performance; it correlates the test results with another criterion of interest (Cozby, 2001).


For Example: For an educational program, measures may be developed to assess cumulative student learning and then correlated with a criterion of interest.

v. Predictive Validity:

Predictive validity is the degree to which a test can predict how well an individual will do in a future situation. In other words, predictive validity means the validity of a test or examination which is based upon its correlation with some future variable.

For Example: One speaks of the predictive validity of a school examination for future success in higher education. Similarly, if a short test gives the same standing to an individual as he achieved in a much longer test, it will be called concurrent validity.

Methods of Determining Validity:

The methods of determining validity are also termed forms of expressing validity. These forms are generally used for expressing the validity index of a test.

1. Correlation Coefficient:

Test scores are correlated with the criterion scores. The obtained coefficient of correlation is the validity index of the test.

2. Expectancy Table:

Test scores are evaluated or correlated with the rating of the supervisors. It

provides empirical probabilities of the validity index.

3. Cross Validation:

It means to have another look at the correlation coefficient with another criterion, or at expectancy tables with another criterion. It is of two types:

a. Empirical validation

b. Logical or rationale validation


2.3 RELIABILITY, AND METHODS OF DETERMINING RELIABILITY:

Meaning and Definition:

Reliability means consistency of measurement. In the words of Ebel and Frisbie, "the ability of a test to measure the same quantity when it is administered to an individual on two different occasions or by two different testers is called reliability." Reliability indicates the degree to which a measurement can be relied upon to measure the same thing each time it is used.

In simple words, we can say that a good measuring instrument (test) should report consistent results when it is taken by the same group of students under the same conditions.

Reliability is also called dependability or trustworthiness. Reliability is the degree to which a test consistently measures whatever it measures. The more reliable a test is, the more confidence we can have that the scores obtained from the administration of the test are essentially the same scores that would be obtained if the test were re-administered. An unreliable test is essentially useless. For example, if an intelligence test were unreliable, then a student scoring an IQ of 120 today might score an IQ of 140 tomorrow and a 95 the day after tomorrow. On the other hand, if the test is reliable, then the IQ of a student will remain nearly the same each time the test is administered. The reliability of a test also depends upon the number of questions it contains: a test will be more reliable if it possesses more questions. In this respect, objective-type tests are more reliable because their sampling is more extensive.

We can take another expert opinion to understand the meaning of reliability: "If a clinical thermometer on three successive determinations, for example, yielded readings of 97°, 103° and 99.6° for the same patient, it would not be considered very reliable.


Reliability, of course, is a necessary but not a sufficient condition for using a test. A highly reliable test may be totally invalid, or may not measure anything that is psychologically or educationally significant.

The reliability of a single test score is expressed quantitatively in terms of the instrument's standard error of measurement. If the standard error of measurement, for example, is 2.5, we can say that there are approximately two chances in three (more precisely 68 in 100) that the true score falls between 72.5 and 77.5 when the obtained score is 75. By definition, an unreliable test cannot possibly be valid. The necessary degree of reliability, however, depends on the use that is made of test scores."
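The example just quoted can be reproduced directly. The following minimal sketch uses only the numbers from the passage above, showing the approximately 68-in-100 band for the true score when the obtained score is 75 and the standard error of measurement is 2.5:

obtained_score = 75
sem = 2.5                       # standard error of measurement from the passage above
lower, upper = obtained_score - sem, obtained_score + sem
print(lower, upper)             # 72.5 77.5: roughly 68 chances in 100 that the true score lies in this band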

Methods of Determining Reliability:

For determining reliability, it is necessary that the test should be valid, i.e., it should measure what it is designed to measure. It should be administered to an appropriate person or group of persons for whom the test has been developed. Reliability is a statistical measure and therefore it can be computed by using different statistical methods, which are stated in detail below.

1. Test-retest method:

When the stability of results across two measurements is required, the test-retest method is used. In this method the test is administered to the same group of students at different periods of time. The scores obtained the first and second time are correlated in order to check the stability and consistency of the test.

In test-retest reliability the time factor counts a lot: with very close retesting the results are approximately the same, yielding a high correlation. But when the retest is administered after a year or two, as a result of changes in the characteristics of the students, large variations in the outcome are to be expected and therefore stability will be low.


Limitations:

i. The coefficient of reliability established through the test-retest method is erroneous.

ii. The reliability determined through the test-retest method suffers from memory or carry-over effects.

iii. The test-retest method is not an objective method of ascertaining the reliability of a test.

2. Equivalent forms method:

The second method of ascertaining reliability is the alternate forms method, or method of equivalence. Through this method one has to use two alternate or equivalent tests in order to establish reliability. This method is used to see the reliability of a test for measuring a certain content area. It is applied to standardized tests only, as they have two or more forms of the same test available.

Equivalent forms are used with the same group and in close succession. The results of both tests are correlated. The correlation shows the degree to which both tests are measuring the same content area. Sometimes the equivalent forms are used with a time interval. Results obtained by this method provide evidence of both the stability and the reliability of the test. This method is generally considered to be the best method.

Limitations:

i. Finding the reliability through this method is cumbersome, because it is difficult to judge whether a second test is equivalent in each and every respect.


ii. This process is more time consuming, and it is also not free from carry-over effects.

iii. Moreover, establishing reliability through this method is not feasible for each and every type of test.

3. Split-half method:

As the name indicates, in the split-half method the approach is to split the test into two reasonably equivalent halves. Such independent sub-tests are then used as a source of the two independent scores needed for reliability estimation.

In this method a test is administered to a group of students. Before scoring, the test is split into two equal halves; generally the odd- and even-numbered items are separated. By marking each part separately, each student gets two different scores, which are correlated. The correlation gives a measure of the internal consistency of the test.

Reliability of the test is estimated by applying the Spearman-Brown formula:

Reliability of full test = (2 × Reliability of 1/2 test) / (1 + Reliability of 1/2 test)

Like the equivalent forms method, the split-half method helps in determining whether the test items are a representative sample of the content.
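A minimal sketch of the split-half procedure and the Spearman-Brown correction described above, using invented item data (1 = right, 0 = wrong); the six-item test and the student responses are hypothetical:

from statistics import correlation   # Python 3.10+

responses = [                         # rows = students, columns = items of a 6-item test
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 1],
    [1, 1, 0, 1, 1, 0],
    [1, 0, 1, 0, 0, 1],
    [0, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
]

odd_half  = [sum(row[0::2]) for row in responses]   # score on odd-numbered items
even_half = [sum(row[1::2]) for row in responses]   # score on even-numbered items

r_half = correlation(odd_half, even_half)           # reliability of a half-length test
r_full = (2 * r_half) / (1 + r_half)                # Spearman-Brown estimate for the full test
print(round(r_full, 2))                             # about .61 for these invented data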

Limitations:

i. The general criticism of the split-half method is concerned with how the test is split. As there is no fixed rule, one may apply one's own judgment in splitting the test into two halves. The way of splitting varies from person to person, which affects the reliability coefficient.

ii. The second criticism is concerned with item difficulty. Generally the items of a test are arranged in order of difficulty, but this is not true for each and every type of test. For example, if, without knowing the difficulty level of the items, one splits all the difficult items into one half and the simple items into the other half, it will affect the reliability coefficient adversely.

4. Kuder-Richardson method:

Kuder and Richardson developed several formulas for measuring the internal consistency of a test. Kuder-Richardson formulas 20 and 21 are generally applied, but due to the simplicity of its computation, formula 21 is usually preferred.

Reliability (KR-21) = [K / (K − 1)] × [1 − M(K − M) / (K × S²)]

K = the number of items in the test;

M = mean of the test scores;

S = standard deviation of the test scores.
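A hypothetical sketch of the KR-21 formula given above (the scores are invented purely for illustration):

from statistics import mean, pstdev

scores = [18, 7, 15, 10, 19, 6, 14, 11]   # total scores of 8 students on a 20-item test (invented)
K = 20                                    # number of items in the test
M = mean(scores)                          # mean of the test scores
S = pstdev(scores)                        # standard deviation of the test scores

kr21 = (K / (K - 1)) * (1 - (M * (K - M)) / (K * S ** 2))
print(round(kr21, 2))                     # about .81 for these invented scores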

Summary: The following methods are used for determining the reliability of a test.

A. Test-retest method - i. Immediate (without interval)

B. Equivalent forms method - ii. With time interval

C. Split-half method - iii. Immediate

D. Kuder-Richardson formula - iv. With interval

5. Parallel forms reliability:

Different sets or different parts of a test (say, questionnaire A and questionnaire B) are developed; they must be linked in the sense of the knowledge, skills, and behaviours they cover. These assessment instruments are then administered to the same group. The results obtained are then correlated, which shows the reliability of the test with regard to the alternate sets of instruments.

6. Inter-rater method of Reliability:


The measure of the extent to which different judges agree in their decisions about an assessment is called the inter-rater method of reliability. Some answers cannot be scored mechanically and must be interpreted by human observers, and for that very purpose inter-rater reliability is of utmost importance.
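One simple (and deliberately minimal) way to express inter-rater agreement is the proportion of cases on which two raters give the same score; the ratings below are hypothetical, and more refined chance-corrected indexes also exist:

rater_a = [4, 3, 5, 2, 4, 3, 5, 1, 2, 4]   # hypothetical scores given by judge A to ten answers
rater_b = [4, 3, 4, 2, 4, 3, 5, 2, 2, 4]   # hypothetical scores given by judge B to the same answers

agreements = sum(a == b for a, b in zip(rater_a, rater_b))
print(agreements / len(rater_a))           # 0.8: the two judges agree on 80% of the answers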

2.4 FACTORS AFFECTING RELIABILITY:

Reliability:

Reliability is the degree or extent of similarity among the results obtained on several occasions; in other words, it can be defined as the degree to which an assessment instrument elicits stable and consistent results.

Reliability means consistency of measurement. In the words of Ebel & Frisbie, "The ability of a test to measure the same quantity when it is administered to an individual on two different occasions by two different testers is called reliability."

Reliability is also called dependability or trustworthiness. Reliability is the degree to which a test consistently measures whatever it measures.

Factors which affect reliability:

The factors which badly affect reliability are as under:

The examinee:

Fatigue, burden, lack of motivation, carelessness.

Traits of the test:

Ambiguous items, poorly worded directions, tricky questions in an unfamiliar format.

Conditions of test-taking and marking:

Poor examination conditions, excessive heat or cold, carelessness in marking, disregard of or lack of clear standards for scoring, computational errors.

There are also some other factors which affect reliability, as under:


1. A very important factor influencing test reliability is the number of test items: the greater the number of items in a test, the more reliable the test.

2. Other things being equal, the narrower the range of difficulty of the items of a test, the greater the reliability.

3. Evenness in scaling is a factor influencing the reliability of a test. Other things being equal, a test that is evenly scaled is more reliable than a test that has gaps in the scale of difficulty of its items.

4. Other things being equal, inter-dependent items tend to decrease the reliability of a test.

5. The more objective the scoring of a test, the more reliable the test.

6. Chance in getting the correct answer to an item is a factor which lowers test reliability.

7. Other things being equal, the more homogeneous the material of a test, the greater its reliability.

8. Other things being equal, the more common the experiences called for in a test are to the members of the group taking the test, the more reliable the test.

9. Other things being equal, the same test given late in the school year (i.e. after covering the unit in class) is more reliable than when given early in the year (i.e. without teaching the unit).

10. Other things being equal, 'catch' or trick questions in a test lower the reliability of the test. A test answered by the systematic recall or recognition of orderly facts or experience is more reliable than a test answered by sudden insight because of novelty.


11. Lengthy items lower the reliability, because certain factors in the item will be over- or under-estimated.

12. Inadequate or faulty directions, or failure to provide suitable illustrations of the task, lower the reliability.

13. Strange or unusual wording of items lowers the reliability.

14. The accuracy with which a test is timed is an important factor in test reliability.

15. Differences in incentive and effort tend to make tests unreliable. The appeal of a test is stronger with some individuals than with others, and is stronger with an individual at one time than at another.

16. Accidents occurring during the examination, such as breaking a pencil, running out of ink, or defective test booklets, influence the reliability of the test. Outside disturbances also lower the reliability.

17. The interval between the test and the retest is important for the reliability estimate.

18. Cheating in the examination is another factor which lowers the reliability, because the score of the individual may increase or decrease unduly.

19. Illness, worry, and excitement, though less important, still influence the reliability of the test.



2.5 PRACTICALITY:

Meaning:

The word “Practicality” means “feasibility” or “usability”.

A test will be practicable if it is easy to administer, easy to interpret, and economical in operation. A good test has sufficiently simple instructions so that it can be administered even by a person of low ability. Tests having difficult instructions, requiring a high level of training to administer, or being too expensive for wide use in schools are said to have low usability or practicability. Practicality refers to the economy of time, effort, and money in testing. In other words, a test should be:

Easy to design

Easy to administer

Easy to interpret

Test of Practicality of a Measuring Instrument:

The practicality attribute of a measuring instrument can be estimated in terms of its economy, convenience, and interpretability. The economy consideration suggests that some trade-off is required between the ideal research project and what the budget can afford; the length of the measuring instrument is an important area where economic pressures are quickly felt.

The convenience test suggests that the measuring instrument should be easily manageable. For this purpose one should pay proper attention to the layout of the measuring instrument. For example, a questionnaire with clear instructions and illustrative examples is easier to complete than one that lacks these features.


Characteristics of Practicality:

There are many characteristics of practicality. They are:

1. The test should be free from the drawbacks and limitations of both essay type and objective type tests, and should have the merits and good points of both these types. For this purpose a test should contain both essay and objective type questions, so that it may cover the whole course and, at the same time, improve the writing skill of the students.

2. It should not require long answers for essay type questions.

3. It should have a large number of short essay type questions so that it may cover the whole course in a short time.

4. It should not be prepared solely for evaluating the knowledge and information of the students.

5. It should be suited to the social and economic conditions of the country.

6. There should be no choice in the given questions; students should have to answer all the questions. This will discourage selective study.


UNIT-3:

APPRAISING CLASSROOM TESTS (ITEMS ANALYSIS)

3.1 THE VALUE OF ITEM

3.1.1 Item Analysis

Item analysis is a statistical technique which is used for selecting and rejecting the items of a test on the basis of their difficulty value and discriminative power. Item analysis is a general term that refers to the specific methods used in education to evaluate test items, typically for the purpose of test construction and revision. Regarded as one of the most important aspects of test construction and increasingly receiving attention, it is an approach incorporated into item response theory (IRT), which serves as an alternative to classical measurement theory or classical test theory (CTT). Classical measurement theory considers a score to be the direct result of a person's true score plus error. It is this error that is of interest, as previous measurement theories have been unable to specify its source. However, item response theory uses item analysis to differentiate between types of error in order to gain a clearer understanding of any existing deficiencies. Particular attention is given to individual test items, item characteristics, the probability of answering items correctly, the overall ability of the test taker, and the degrees or levels of knowledge being assessed.

Item analysis is concerned basically with two characteristics of an item: difficulty value and discriminative power.

Need of Item Analysis

Item analysis is a technique by which test items are selected and rejected. The selection of items may serve the purpose of the designer or test constructor, because the items have such characteristics. The following are the main purposes of a test:


(a) Classification of students or candidates.

(b) Selection of the candidates for the job.

(c) Gradation is an academic purpose to assign grades or divisions to the students.

(d) Prognosis and promotion of the candidates or students.

(e) Establishing individual differences, and

(f) Research for the verification of hypotheses.

The different purposes require different types of test having the items of

different characteristics. The selection or entrance test includes the items of high

difficulty value as well as high power of discrimination. The promotion or prognostic

test has the items of moderate difficulty value. There are various techniques of item

analysis which are used these days.

The Objectives of Item Analysis

(1) The main objective of the item analysis technique is to select appropriate items for the final draft and to reject the poor items which do not contribute to the functioning of the test. Some items are to be modified.

(2) Item analysis obtains the difficulty values of all the items of the preliminary draft of the test. The items are classified as difficult, moderate, and easy items.

(3) It provides the discriminative power (item reliability and validity) to differentiate between capable and less capable examinees for all the items of the preliminary draft of the test. The items are classified on the basis of these indexes as having positive, negative, or no discrimination. Items with negative or no discrimination power are rejected outright.


(4) It also indicates the functioning of the distractors in multiple-choice items. Distractors that are too powerful, or too poor, are changed. It provides the basis for the modifications to be made in some of the items of the preliminary draft.

(5) The reliability and validity of a test depend on these characteristics of its items. The functioning of a test is increased by this technique. Both of these indexes are considered simultaneously in selecting and rejecting the items of a test.

(6) It provides the basis for preparing the final draft of a test. In the final draft, items are arranged in order of difficulty: the easiest items are given at the beginning and the most difficult items are placed at the end.

(7) Item analysis is a cyclic technique. The modified items are tried out and their item analysis is done again to obtain these indexes (difficulty and discrimination). Empirical evidence is thus obtained for selecting the modified items for the final draft.

Functions of Item Analysis

The main function of item analysis is to obtain the indexes of the items which indicate their basic characteristics. There are three such characteristics:

(1) Item difficulty value (D.V.) is the proportion of subjects answering each item correctly.

(2) Discriminative power (D.P.) of an item; this characteristic is of two types:

(a) Item reliability: it is taken as the point-biserial correlation between an item and the total test score, multiplied by the item standard deviation.

(b) Item validity: it is taken as the point-biserial correlation between an item and a criterion score, multiplied by the item standard deviation.

The test as a whole should fulfil its purpose successfully; each of its items

must be able to discriminate between high and poor students on the test. In other


words, a test fulfils its purpose with maximum success when each item serves as a good predictor. Therefore it is essential that each item of the test should be analysed in terms of its difficulty value and discriminative power. Item analysis serves the following purposes:

(1) To improve and modify a test for immediate use on a parallel group of

subjects.

(2) To select the best items for a test with regard to its purpose after a proper try

out on the group of subjects selected from the target population.

(3) To provide the statistical check-up for the characteristics of the test items for

the judgment of test designer.

(4) To set up parallel forms of a test. Parallel forms of a test should not only have similar item content or types of items, but should also have the same difficulty values and discriminative power. The item analysis technique provides the empirical basis on which exactly parallel tests can be developed.

(5) To modify or reject the poor items of the test. The poor items may not serve the purpose of the test; distractors that are too powerful, as well as poor distractors, are changed.

(6) Item analysis is usually done for a power test rather than a speed test. In a speed test all the items are of the same difficulty value; the purpose of a speed test is to measure speed and accuracy, and speed is acquired through practice. Such a test is not a power test, because a time limit is imposed; therefore these are speeded tests. The speededness of the test depends on the difficulty values of its items: most of the students should reach the last items in the time allotted for the test. Item analysis is the study of the statistical properties of test items. The qualities usually of interest are the difficulty of the item and its ability or power to differentiate between more capable and less capable


examinees. Difficulty is usually expressed as the percent or proportion getting

the item right, and discrimination as some index comparing success by the

more capable and the less capable students.

Meaning and Definition of Difficulty Value (D.V.)

The term difficulty value of an item can be explained with the help of simple examples at the extreme ends. If an item of a test is answered correctly by every examinee, it means the item is very easy; the difficulty value is 100 percent, or the proportion is one. This item will not serve any purpose and there is no use including such items in a test. Such items are generally rejected.

If an item is not answered correctly by any of the examinees, i.e., none could answer it correctly, it means the item is most difficult; the difficulty value is zero percent, or the proportion is also zero. This item will not serve any purpose and there is no use including such items in a test. Such items are usually rejected.

"The difficulty value of an item is defined as the proportion or percentage of

the examinees who have answered the item correctly."

–J. P. Guilford

"The difficulty value of an item may be defined as the proportion of certain

sample of subjects who actually know the answer of item."

—Frank S. Freeman

In the first definition of difficulty value, it is stated that it is the percentage or proportion of examinees who answer the item correctly, but in the second definition the difficulty value is defined as the proportion of a certain sample of subjects who actually know the answer to the item. The second statement seems to be the more functional and dependable, because an item can be answered correctly by guessing even though the examinee does not know the answer. The difficulty value depends


on actually knowing the correct answer of an item rather than answering an item

correctly.

In the procedure of item analysis, a "correction for guessing" formula is applied to the scores rather than simply counting right answers. The difficulty value can also be obtained in terms of standard scores or z-scores.

Methods or Techniques of Item Analysis

A recent review of the literature on item analysis indicates that there are at least twenty-three different techniques of item analysis. As has been discussed, an item analysis technique obtains indexes for the characteristics of an item. The following two methods of item analysis are the most popular and widely used.

1) Davis method of item analysis. This is the basic method of item analysis. It is used for prognostic tests, for selecting and rejecting items on the basis of difficulty value and discriminative power. The right responses are considered in obtaining the indexes for the characteristics of an item; the proportion of right responses on each item is used for this purpose.

2) Stanley method of item analysis. This is used for diagnostic test items. The wrong responses are considered in obtaining the difficulty value and discriminative power, because the wrong responses reveal the weaknesses of the students. The proportion of wrong responses on an item is used for this purpose.

There are separate techniques for obtaining difficulty value and discriminative power

of the items.

(a) Techniques of Difficulty Value

There are two main approaches for obtaining the difficulty value:

a1 – Proportion of right responses on an item. Davis and Harper have used this technique.

a2 – Standard scores or z-scores on the normal probability curve.

(b) Techniques of Discriminative Power

b1 – Proportion of right responses on an item. Davis and Harper have used this technique.

3.2 THE PROCEDURE/ PURPOSE OF ITEM ANALYSIS:

The review of the literature on item analysis indicates that about two dozen techniques of item analysis have been devised to obtain the difficulty value and discrimination index of an item of a test. It is not possible to describe all the techniques of item analysis in this chapter; therefore, the most popular and widely used techniques are discussed:

Fredrick B. Davis's method of item analysis for prognostic tests, and

Stanley's method of item analysis for diagnostic tests.

"The item difficulty value may be defined as the proportion or percentage of

certain sample subjects that actually know the answer of an item.

--Frank S. Freeman

The difficulty value depends on actually knowing the answer rather than merely answering correctly (i.e., giving a right response). In an objective type test, items may be answered correctly by guessing rather than by actually knowing the answer; that is, an item may be answered without knowing its answer. Thus a correction for guessing is used to obtain scores that reflect the actual correct responses.

It is important to note that in the procedure of item analysis, item-wise scoring is done, whereas subject-wise scoring is done in general. Several formulas have


been developed by psychometricians for 'correction for guessing'. Some of the important formula-corrections for guessing are discussed below.

Formula-Correction for Guessing

The following two formula-corrections for guessing have been explained.

(a) Guilford's formula-correction for guessing and

(b) Horst's formula-correction for guessing.

(a) Guilford's formula-correction for guessing. J. P. Guilford developed the following formula-correction for guessing, which is used for estimating the number of examinees who actually know the answer:

S = R − W / (n − 1)

where R = right responses on the item

W = wrong responses on the item

n = number of alternatives in the item

S = actual correct responses on the item.
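A small hypothetical sketch of applying Guilford's formula above (the response counts are invented, not taken from the example that follows):

def corrected_score(right, wrong, n_alternatives):
    # S = R - W / (n - 1): estimated number of examinees who actually knew the answer.
    return right - wrong / (n_alternatives - 1)

# Suppose 30 of 50 examinees marked a 4-alternative item right and 16 marked it wrong (4 omitted).
print(round(corrected_score(right=30, wrong=16, n_alternatives=4), 2))   # 24.67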

Example. An item is administered to a group of 50 subjects. The following responses

(a) The functions of item analysis

(b) Selection of good items-8

(c) Rejection of poor items--7

3.2 MAKING THE MOST OF EXAMS: PROCEDURES FOR ITEM ANALYSIS:

One of the most important (if least appealing) tasks confronting faculty

members is the evaluation of student performance. This task requires considerable


skill, in part because it presents so many choices. Decisions must be made concerning

the method, format, timing, and duration of the evaluative procedures. Once designed,

the evaluative procedure must be administered and then scored, interpreted, and

graded. Afterwards, feedback must be presented to students. Accomplishing these

tasks demands a broad range of cognitive, technical, and interpersonal resources on

the part of faculty. But an even more critical task remains, one that perhaps too few

faculty undertake with sufficient skill and tenacity: investigating the quality of the

evaluative procedure.

Even after an exam, how do we know whether that exam was a good one? It is

obvious that any exam can only be as good as the items it comprises, but then what

constitutes a good exam item? Our students seem to know, or at least believe they

know. But are they correct when they claim that an item was too difficult, too tricky,

or too unfair?

Lewis Aiken (1997), the author of a leading textbook on the subject of

psychological and educational assessment, contends that a "postmortem" evaluation is

just as necessary in classroom testing as it is in medicine. Indeed, just such a

postmortem procedure for exams exists--item analysis, a group of procedures for

assessing the quality of exam items. The purpose of an item analysis is to improve the

quality of an exam by identifying items that are candidates for retention, revision, or

removal. More specifically, not only can the item analysis identify both good and

deficient items, it can also clarify what concepts the examinees have and have not

mastered.

So, what procedures are involved in an item analysis? The specific procedures

involved vary, but generally, they fall into one of two broad categories: qualitative

and quantitative.


Qualitative Item Analysis

Qualitative item analysis procedures include careful proofreading of the exam

prior to its administration for typographical errors, for grammatical cues that might

inadvertently tip off examinees to the correct answer, and for the appropriateness of

the reading level of the material. Such procedures can also include small group

discussions of the quality of the exam and its items with examinees who have already

taken the test, or with departmental student assistants, or even experts in the field.

Some faculty use a "think-aloud test administration" (cf. Cohen, Swerdlik, & Smith,

1992) in which examinees are asked to express verbally what they are thinking as they

respond to each of the items on an exam. This procedure can assist the instructor in

determining whether certain students (such as those who performed well or those who

performed poorly on a previous exam) misinterpreted particular items, and it can help

in determining why students may have misinterpreted a particular item.

Quantitative Item Analysis

In addition to these and other qualitative procedures, a thorough item analysis

also includes a number of quantitative procedures. Specifically, three numerical

indicators are often derived during an item analysis: Item difficulty, item

discrimination, and distractor power statistics.

Item Difficulty Index (p)

The item difficulty statistic is an appropriate choice for achievement or

aptitude tests when the items are scored dichotomously (i.e., correct vs. incorrect).

Thus, it can be derived for true-false, multiple-choice, and matching items, and even

for essay items, where the instructor can convert the range of possible point values

into the categories "passing" and "failing."

The item difficulty index, symbolized p, can be computed simply by dividing

the number of test takers who answered the item correctly by the total number of

students who answered the item. As a proportion, p can range between 0.00, obtained


when no examinees answered the item correctly, and 1.00, obtained when all

examinees answered the item correctly. Notice that no test item need have only one p

value. Not only may the p value vary with each class group that takes the test, an

instructor may gain insight by computing the item difficulty level for a number of

different subgroups within a class, such as those who did well on the exam overall and

those who performed more poorly.
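As a minimal sketch (the responses are hypothetical, not from the text), p can be computed for a single item as follows:

item_responses = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]   # 1 = answered correctly, 0 = answered incorrectly

p = sum(item_responses) / len(item_responses)
print(p)   # 0.7: 70% of the examinees who answered this item got it right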

Although the computation of the item difficulty index p is quite

straightforward, the interpretation of this statistic is not. To illustrate, consider an item

with a difficulty level of 0.20. We do know that 20% of the examinees answered the

item correctly, but we cannot be certain why they did so. Does this item difficulty

level mean that the item was challenging for all but the best prepared of the

examinees? Does it mean that the instructor failed in his or her attempt to teach the

concept assessed by the item? Does it mean that the students failed to learn the

material? Does it mean that the item was poorly written? To answer these questions,

we must rely on other item analysis procedures, both qualitative and quantitative ones.

Item Discrimination Index (D)

Item discrimination analysis deals with the fact that often different test takers

will answer a test item in different ways. As such, it addresses questions of

considerable interest to most faculty, such as, "does the test item differentiate those

who did well on the exam overall from those who did not?" or "does the test item

differentiate those who know the material from those who do not?" In a more

technical sense then, item discrimination analysis addresses the validity of the items

on a test, that is, the extent to which the items tap the attributes they were intended to

assess. As with item difficulty, item discrimination analysis involves a family of

techniques. Which one to use depends on the type of testing situation and the nature

of the items. I'm going to look at only one of those, the item discrimination index,

symbolized D. The index parallels the difficulty index in that it can be used whenever


items can be scored dichotomously, as correct or incorrect, and hence it is most

appropriate for true-false, multiple-choice, and matching items, and for those essay

items which the instructor can score as "pass" or "fail."

We test because we want to find out if students know the material, but all we

learn for certain is how they did on the exam we gave them. The item discrimination

index tests the test in the hope of keeping the correlation between knowledge and

exam performance as close as it can be in an admittedly imperfect system.

The item discrimination index is calculated in the following way:

1. Divide the group of test takers into two groups, high scoring and low scoring.

Ordinarily, this is done by dividing the examinees into those scoring above

and those scoring below the median. (Alternatively, one could create groups

made up of the top and bottom quintiles or quartiles or even deciles.)

2. Compute the item difficulty levels separately for the upper (Pupper) and lower

(Plower) scoring groups.

3. Subtract the two difficulty levels such that D = Pupper − Plower.
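A minimal sketch of these three steps, with invented data; each pair below is a student's total exam score and his or her score (1 or 0) on the item under analysis:

students = [(92, 1), (88, 1), (85, 1), (80, 1), (74, 0),
            (70, 1), (65, 0), (60, 0), (55, 0), (50, 0)]

students.sort(key=lambda s: s[0], reverse=True)     # order students by total exam score
half = len(students) // 2
upper, lower = students[:half], students[half:]     # split at the median

p_upper = sum(item for _, item in upper) / len(upper)
p_lower = sum(item for _, item in lower) / len(lower)
print(round(p_upper - p_lower, 2))   # D = 0.8 - 0.2 = 0.6, a reasonably good positive discriminator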

How is the item discrimination index interpreted? Unlike the item difficulty level p, the item discrimination index can take on negative values and can range between -1.00 and 1.00. Consider the following situation: suppose that overall, half of the examinees answered a particular item correctly, and that all of the examinees who scored above the median on the exam answered the item correctly and all of the examinees who scored below the median answered incorrectly. In such a situation Pupper = 1.00 and Plower = 0.00. As such, the value of the item discrimination index D

is 1.00 and the item is said to be a perfect positive discriminator. Many would regard

this outcome as ideal. It suggests that those who knew the material and were well-

prepared passed the item while all others failed it.


Though it's not as unlikely as winning a million-dollar lottery, finding a

perfect positive discriminator on an exam is relatively rare. Most psychometricians

would say that items yielding positive discrimination index values of 0.30 and above

are quite good discriminators and worthy of retention for future exams.

Finally, notice that the difficulty and discrimination are not independent. If all

the students in both the upper and lower levels either pass or fail an item, there's

nothing in the data to indicate whether the item itself was good or not. Indeed, the

value of the item discrimination index will be maximized when only half of the test

takers overall answer an item correctly; that is, when p = 0.50. Once again, the ideal

situation is one in which the half who passed the item were students who all did well

on the exam overall.

Does this mean that it is never appropriate to retain items on an exam that are

passed by all examinees, or by none of the examinees? Not at all. There are many

reasons to include at least some such items. Very easy items can reflect the fact that

some relatively straightforward concepts were taught well and mastered by all

students. Similarly, an instructor may choose to include some very difficult items on

an exam to challenge even the best-prepared students. The instructor should simply be

aware that neither of these types of items functions well to make discriminations

among those taking the test.

[material omitted...]

Conclusion

To those concerned about the prospect of extra work involved in item analysis,

take heart: item difficulty and discrimination analysis programs are often included in

the software used in processing exams answered on Scantron or other optically

scannable forms. As such, these analyses can often be performed for you by personnel

in your computer services office. You might consider enlisting the aid of your


departmental student assistants to help with item distractor analysis, thus providing

them with an excellent learning experience. In any case, an item analysis can certainly

help determine whether or not the items on your exams were good ones, and to

determine which items to retain, revise, or replace.

Understanding Item Analysis Reports

Item analysis is a process which examines student responses to individual test

items (questions) in order to assess the quality of those items and of the test as a

whole. Item analysis is especially valuable in improving items which will be used

again in later tests, but it can also be used to eliminate ambiguous or misleading items

in a single test administration. In addition, item analysis is valuable for increasing

instructors' skills in test construction, and identifying specific areas of course content

which need greater emphasis or clarity. Separate item analyses can be requested for each raw score¹ created during a given ScorePak® run.


A basic assumption made by ScorePak® is that the test under analysis is

composed of items measuring a single subject area or underlying ability. The quality

of the test as a whole is assessed by estimating its "internal consistency." The quality

of individual items is assessed by comparing students' item responses to their total test

scores.

Following is a description of the various statistics provided on a ScorePak®

item analysis report. This report has two parts. The first part assesses the items which

made up the exam. The second part shows statistics summarizing the performance of

the test as a whole.


Item Statistics

Item statistics are used to assess the performance of individual test items on

the assumption that the overall quality of a test derives from the quality of its items.

The ScorePak® item analysis report provides the following item information:

Item Number

This is the question number taken from the student answer sheet, and the

ScorePak® Key Sheet. Up to 150 items can be scored on the Standard Answer Sheet.

Mean and Standard Deviation

The mean is the "average" student response to an item. It is computed by

adding up the number of points earned by all students on the item, and dividing that

total by the number of students.

The standard deviation, or S.D., is a measure of the dispersion of student

scores on that item. That is, it indicates how "spread out" the responses were. The

item standard deviation is most meaningful when comparing items which have more

than one correct alternative and when scale scoring is used. For this reason it is not

typically used to evaluate classroom tests.

Item Difficulty

For items with one correct alternative worth a single point, the item difficulty

is simply the percentage of students who answer an item correctly. In this case, it is

also equal to the item mean. The item difficulty index ranges from 0 to 100; the

higher the value, the easier the question. When an alternative is worth other than a

single point, or when there is more than one correct alternative per question, the item

difficulty is the average score on that item divided by the highest number of points for

any one alternative. Item difficulty is relevant for determining whether students have

learned the concept being tested. It also plays an important role in the ability of an

item to discriminate between students who know the tested material and those who do


not. The item will have low discrimination if it is so difficult that almost everyone

gets it wrong or guesses, or so easy that almost everyone gets it right.

To maximize item discrimination, desirable difficulty levels are slightly higher

than midway between chance and perfect scores for the item. (The chance score for

five-option questions, for example, is 20 because one-fifth of the students responding

to the question could be expected to choose the correct option by guessing.) Ideal

difficulty levels for multiple-choice items in terms of discrimination potential are:

Format Ideal Difficulty

Five-response multiple-choice 70

Four-response multiple-choice 74

Three-response multiple-choice 77

True-false (two-response multiple-choice) 85

(from Lord, F.M. "The Relationship of the Reliability of Multiple-Choice Test to the

Distribution of Item Difficulties," Psychometrika, 1952, 18, 181-194.)

ScorePak® arbitrarily classifies item difficulty as "easy" if the index is 85% or

above; "moderate" if it is between 51 and 84%; and "hard" if it is 50% or below.

Item Discrimination

Item discrimination refers to the ability of an item to differentiate among

students on the basis of how well they know the material being tested. Various hand

calculation procedures have traditionally been used to compare item responses to total

test scores using high and low scoring groups of students. Computerized analyses

provide more accurate assessment of the discrimination power of items because they

take into account responses of all students rather than just high and low scoring

groups.

The item discrimination index provided by ScorePak® is a Pearson Product

Moment correlation2 between student responses to a particular item and total scores


on all other items on the test. This index is the equivalent of a point-biserial

coefficient in this application. It provides an estimate of the degree to which an

individual item is measuring the same thing as the rest of the items.
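A hypothetical sketch of an item-total discrimination index of this kind (correlating each item with the total score on all other items); the data are invented, and this is not the ScorePak® code itself:

from statistics import correlation   # Python 3.10+

responses = [                         # rows = students, columns = items (1 = correct, 0 = incorrect)
    [1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1],
    [1, 1, 0, 1, 0],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
]

for i in range(len(responses[0])):
    item = [row[i] for row in responses]
    rest = [sum(row) - row[i] for row in responses]   # total score minus the item itself
    print(f"item {i + 1}: r = {correlation(item, rest):.2f}")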

Because the discrimination index reflects the degree to which an item and the

test as a whole are measuring a unitary ability or attribute, values of the coefficient

will tend to be lower for tests measuring a wide range of content areas than for more

homogeneous tests. Item discrimination indices must always be interpreted in the

context of the type of test which is being analyzed. Items with low discrimination

indices are often ambiguously worded and should be examined. Items with negative

indices should be examined to determine why a negative value was obtained. For

example, a negative value may indicate that the item was mis-keyed, so that students

who knew the material tended to choose an unkeyed, but correct, response option.

Tests with high internal consistency consist of items with mostly positive

relationships with total test score. In practice, values of the discrimination index will

seldom exceed .50 because of the differing shapes of item and total score

distributions. ScorePak® classifies item discrimination as "good" if the index is above

.30; "fair" if it is between .10 and.30; and "poor" if it is below .10.

Alternate Weight

This column shows the number of points given for each response alternative.

For most tests, there will be one correct answer which will be given one point, but

ScorePak® allows multiple correct alternatives, each of which may be assigned a

different weight.

Means

The mean total test score (minus that item) is shown for students who selected

each of the possible response alternatives. This information should be looked at in

conjunction with the discrimination index; higher total test scores should be obtained


by students choosing the correct, or most highly weighted alternative. Incorrect

alternatives with relatively high means should be examined to determine why "better"

students chose that particular alternative.

Frequencies and Distribution

The number and percentage of students who choose each alternative are

reported. The bar graph on the right shows the percentage choosing each response;

each "#" represents approximately 2.5%. Frequently chosen wrong alternatives may

indicate common misconception among the students.

Difficulty and Discrimination Distributions

At the end of the Item Analysis report, test items are listed according to their

degrees of difficulty (easy, medium, hard) and discrimination (good, fair, poor). These

distributions provide a quick overview of the test, and can be used to identify items

which are not performing well and which can perhaps be improved or discarded.

Test Statistics

Two statistics are provided to evaluate the performance of the test as a whole.

Reliability Coefficient

The reliability of a test refers to the extent to which the test is likely to produce

consistent scores. The particular reliability coefficient computed by ScorePak®

reflects three characteristics of the test:

The intercorrelations among the items -- the greater the relative number of

positive relationships, and the stronger those relationships are, the greater the

reliability. Item discrimination indices and the test's reliability coefficient are

related in this regard.

The length of the test -- a test with more items will have a higher reliability, all

other things being equal.


The content of the test -- generally, the more diverse the subject matter tested

and the testing techniques used, the lower the reliability.

Reliability coefficients theoretically range in value from zero (no reliability) to

1.00 (perfect reliability). In practice, their approximate range is from .50 to .90 for

about 95% of the classroom tests scored by ScorePak®.

High reliability means that the questions of a test tended to "pull together."

Students who answered a given question correctly were more likely to answer other

questions correctly. If a parallel test were developed by using similar items, the

relative scores of students would show little change.

Low reliability means that the questions tended to be unrelated to each other in

terms of who answered them correctly. The resulting test scores reflect peculiarities of

the items or the testing situation more than students' knowledge of the subject matter.

As with many statistics, it is dangerous to interpret the magnitude of a

reliability coefficient out of context. High reliability should be demanded in situations

in which a single test score is used to make major decisions, such as professional

licensure examinations. Because classroom examinations are typically combined with

other scores to determine grades, the standards for a single test need not be as

stringent. The following general guidelines can be used to interpret reliability

coefficients for classroom exams:

Reliability Interpretation

.90 and above Excellent reliability; at the level of the best standardized tests

.80 - .90 Very good for a classroom test

.70 - .80 Good for a classroom test; in the range of most. There are

probably a few items which could be improved.


.60 - .70 Somewhat low. This test needs to be supplemented by other

measures (e.g., more tests) to determine grades. There are

probably some items which could be improved.

.50 - .60 Suggests need for revision of test, unless it is quite short (ten or

fewer items). The test definitely needs to be supplemented by

other measures (e.g., more tests) for grading.

.50 or below Questionable reliability. This test should not contribute heavily

to the course grade, and it needs revision.

The measure of reliability used by ScorePak® is Cronbach's Alpha. This is the

general form of the more commonly reported KR-20 and can be applied to tests

composed of items with different numbers of points given for different response

alternatives. When coefficient alpha is applied to tests in which each item has only

one correct answer and all correct answers are worth the same number of points, the

resulting coefficient is identical to KR-20.
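For readers who want to see coefficient alpha in computational form, the following hypothetical sketch uses the standard formula alpha = [k / (k - 1)] x [1 - (sum of item variances / variance of total scores)]; the data are invented, and this is not ScorePak® itself:

from statistics import pvariance

responses = [                 # rows = students, columns = items scored 0/1
    [1, 1, 1, 1],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

k = len(responses[0])
item_variances = [pvariance([row[i] for row in responses]) for i in range(k)]
total_variance = pvariance([sum(row) for row in responses])

alpha = (k / (k - 1)) * (1 - sum(item_variances) / total_variance)
print(round(alpha, 2))        # about .69 for these invented responses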

(Further discussion of test reliability can be found in J. C. Nunnally,

Psychometric Theory. New York: McGraw-Hill, 1967, pp. 172-235, see especially

formulas 6-26, p. 196.)

Standard Error of Measurement

The standard error of measurement is directly related to the reliability of the

test. It is an index of the amount of variability in an individual student's performance

due to random measurement error. If it were possible to administer an infinite number

of parallel tests, a student's score would be expected to change from one

administration to the next due to a number of factors. For each student, the scores

would form a "normal" (bell-shaped) distribution. The mean of the distribution is

assumed to be the student's "true score," and reflects what he or she "really" knows

about the subject. The standard deviation of the distribution is called the standard


error of measurement and reflects the amount of change in the student's score which

could be expected from one test administration to another.

Whereas the reliability of a test always varies between 0.00 and 1.00, the

standard error of measurement is expressed in the same scale as the test scores. For

example, multiplying all test scores by a constant will multiply the standard error of

measurement by that same constant, but will leave the reliability coefficient

unchanged.

A general rule of thumb to predict the amount of change which can be

expected in individual test scores is to multiply the standard error of measurement by

1.5. Only rarely would one expect a student's score to increase or decrease by more

than that amount between two such similar tests. The smaller the standard error of

measurement, the more accurate the measurement provided by the test.
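Although the passage above does not give the formula, the standard error of measurement is commonly computed from the test's standard deviation and reliability as SEM = SD x sqrt(1 - reliability); the numbers in this sketch are hypothetical:

from math import sqrt

sd_of_test_scores = 10.0        # hypothetical standard deviation of the test scores
reliability = 0.84              # hypothetical reliability coefficient

sem = sd_of_test_scores * sqrt(1 - reliability)
print(round(sem, 1))            # 4.0 score points
print(round(1.5 * sem, 1))      # 6.0: the rule-of-thumb change band mentioned above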

(Further discussion of the standard error of measurement can be found in J. C.

Nunnally, Psychometric Theory. New York: McGraw-Hill, 1967, pp.172-235, see

especially formulas 6-34, p. 201.)

A Caution in Interpreting Item Analysis Results

Each of the various item statistics provided by ScorePak® provides

information which can be used to improve individual test items and to increase the

quality of the test as a whole. Such statistics must always be interpreted in the context

of the type of test given and the individuals being tested. W. A. Mehrens and I. J.

Lehmann provide the following set of cautions in using item analysis results

(Measurement and Evaluation in Education and Psychology. New York: Holt,

Rinehart and Winston, 1973, 333-334):

Item analysis data are not synonymous with item validity. An external

criterion is required to accurately judge the validity of test items. By using the


internal criterion of total test score, item analyses reflect internal consistency

of items rather than validity.

The discrimination index is not always a measure of item quality. There is a

variety of reasons an item may have low discriminating power:

a) extremely difficult or easy items will have low ability to discriminate but such

items are often needed to adequately sample course content and objectives;

b) an item may show low discrimination if the test measures many different

content areas and cognitive skills. For example, if the majority of the test

measures "knowledge of facts," then an item assessing "ability to apply

principles" may have a low correlation with total test score, yet both types of

items are needed to measure attainment of course objectives.

Item analysis data are tentative. Such data are influenced by the type and

number of students being tested, instructional procedures employed, and

chance errors. If repeated use of items is possible, statistics should be recorded

for each administration of each item.

Raw scores are those scores which are computed by scoring answer sheets

against a ScorePak® Key Sheet. Raw score names are EXAM1 through EXAM9,

QUIZ1 through QUIZ9, MIDTRM1 through MIDTRM3, and FINAL. ScorePak®

cannot analyze scores taken from the bonus section of student answer sheets or

computed from other scores, because such scores are not derived from individual

items which can be accessed by ScorePak®. Furthermore, separate analyses must be

requested for different versions of the same exam.

A correlation is a statistic which indexes the degree of linear relationship

between two variables. If the value of one variable is related to the value of another,

they are said to be "correlated." In positive relationships, the value of one variable

tends to be high when the value of the other is high, and low when the other is low. In


negative relationships, the value of one variable tends to be high when the other is

low, and vice versa. The possible values of correlation coefficients range from -1.00

to 1.00. The strength of the relationship is shown by the absolute value of the

coefficient (that is how large the number is whether it is positive or negative). The

sign indicates the direction of the relationship (whether positive or negative).


QUESTION:

A few years ago in your Shiken column, you showed how to do item analysis

for weighted items using a calculator (Brown, 2000, pp. 19-21) and a couple of

columns back (Brown, 2002, pp. 20-23) you showed how to do distractor efficiency

analysis in a spreadsheet program. But, I don't think you have ever shown how to do

regular item analysis statistics in a spreadsheet. Could you please do that? I think

some of your readers would find it very useful.

ANSWER:

Yes, I see what you mean. In answering questions from readers, I explained

more advanced concepts of item analysis without laying the groundwork that other

readers might need. To remedy that, in this column, I will directly address your

question, but only with regard to norm-referenced item analysis. In my next Statistics

Corner column, I will address another reader's question, and in the process show how

criterion-referenced item analysis can be done in a spreadsheet.

The Overall Purpose of Item Analysis

Let's begin by answering the most basic question in item analysis: Why do we

do item analysis? We do it as the penultimate step in the test development process.

Such projects are usually accomplished in the following steps:


1. Assemble or write a relatively large number of items of the type you want on

the test.

2. Analyze the items carefully using item format analysis to make sure the items

are well written and clear (for guidelines, see Brown, 1996, 1999; Brown &

Hudson, 2002).

3. Pilot the items using a group of students similar to the group that will

ultimately be taking the test. Under less than ideal conditions, this pilot testing

may be the first operational administration of the test.

4. Analyze the results of the pilot testing using item analysis techniques. These

are described below for norm-referenced tests (NRTs) and in the next column

for criterion-referenced tests (CRTs).

5. Select the most effective items (and get rid of the ineffective items) to make a

shorter, more effective revised version of the test.

Basically, those five steps are followed in any test development or revision project.

Item Analysis Statistics for Norm-Referenced Tests

As indicated above, the fourth step, item analysis, is different for NRTs and

CRTs, and in this column, I will only explain item analysis statistics as they apply to

NRTs. The basic purpose of any NRT is to spread students out along a general

continuum of language abilities, usually for purposes of making aptitude, proficiency,

or placement decisions (for much more on this topic, see Brown, 1996, 1999; Brown

& Hudson, 2002). Two item statistics are typically used in the item analysis of such

norm-referenced tests: item facility and item discrimination.

Item facility (IF) is defined here as the proportion of students who answered a particular item correctly. Thus, if 45 out of 50 students answered a particular item correctly, the proportion would be 45/50 = .90. An IF of .90 means that 90% of the students answered the item correctly, and by extension, that the item is very easy. In Screen 1, you will see one way to calculate IF using the Excel® spreadsheet for item 1


(I1) in a small example data set coded 1 for correct and 0 for incorrect answers. Notice the cursor has outlined cell C21 and that the function/formula typed in that cell (shown both in the row above the column labels and in cell B21) is =AVERAGE(C2:C19), which means average the ones and zeros in the range between cells C2 and C19. The result in this case is .94, a very easy item because 94% of the students are answering correctly.
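For readers who want to check the spreadsheet result outside Excel, here is a minimal Python sketch of the same IF calculation; the 0/1 response vector is made up for illustration and is not the data from Screen 1.

# Item facility (IF): the proportion of examinees who answered the item correctly.
# The response vector below is hypothetical; real data would come from the pilot testing.
item1_responses = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]   # 1 = correct, 0 = incorrect

item_facility = sum(item1_responses) / len(item1_responses)
print(round(item_facility, 2))   # 0.80 -> 80% of this made-up group answered correctly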

All the other NRT and CRT item analysis techniques that I will discuss here

and in the next column are based on this notion of item facility. For instance, item

discrimination can be calculated by first figuring out who the upper and lower

students are on the test (using their total scores to sort them from the highest score to

the lowest). The upper and lower groups should probably be made up of equal

numbers of students who represent approximately one third of the total group each. In

Screen 1, I have sorted the students from high to low based on their total test scores


from 77 for Hide down to 61 for Hachiko. Then I separated the three groups such that

there are five in the top group, five in the bottom group, and six in the middle group.

Notice that Issaku and Naoyo both had scores of 68 but ended up in different groups

(as did Eriko and Kimi with their scores of 70). The decision as to which group they

were assigned to was made with a coin flip.

To calculate item discrimination (ID), I started by calculating IF for the upper group using the following: =AVERAGE(C2:C6), as shown in row 22. Then, I calculated IF for the lower group using the following: =AVERAGE(C15:C19), as shown in row 23. With IFupper and IFlower in hand, calculating ID simply required subtracting IFupper − IFlower. I did this by subtracting C22 minus C23, or =C22-C23, as shown in row 24, which resulted in an ID of .20 for I1.
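The same subtraction can be sketched in Python; the upper- and lower-group response vectors below are hypothetical and are simply chosen so that the result matches the ID of .20 reported for I1.

# Item discrimination (ID) = IF(upper) - IF(lower), where the upper and lower groups
# are the top and bottom thirds of examinees sorted by total test score.
upper_group = [1, 1, 1, 1, 0]   # item responses of the five highest scorers (illustrative)
lower_group = [1, 0, 1, 0, 1]   # item responses of the five lowest scorers (illustrative)

if_upper = sum(upper_group) / len(upper_group)    # 0.80
if_lower = sum(lower_group) / len(lower_group)    # 0.60
item_discrimination = if_upper - if_lower         # 0.20
print(round(item_discrimination, 2))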

Once I had calculated the four item analysis statistics shown in Screen 1 for I1, I then simply copied them and pasted them into the spaces below the other items, which resulted in all the other item statistics you see in Screen 1. [Note that the statistics didn't always fit in the available spaces, so I got results that looked like ### in some cells; to fix that, I blocked out all the statistics and typed Alt, O, C, A (the Format > Column > AutoFit Selection shortcut) and thus adjusted the column widths to fit the statistics. You may also want to adjust the number of decimal places, which is beyond the scope of this article. You can learn about this by looking in the Help menu or in the Excel manual.]

Ideal items in an NRT should have an average IF of .50. Such items would thus be well centered, i.e., 50 percent of the students would have answered correctly, and by extension, 50 percent would have answered incorrectly. In reality, however, items rarely have an IF of exactly .50, so those that fall in a range between .30 and .70 are usually considered acceptable for NRT purposes.

Once those items that fall within the .30 to .70 range of IFs are identified, the

items among them that have the highest IDs should be further selected for inclusion in


the revised test. This process would help the test designer to keep only those items

that are well centered and discriminate well between the high and the low scoring

students. Such items are indicated in Screen 1 by an asterisk in row 25 (cleverly

labeled "Keepers").

For more information on using item analysis to develop NRTs, see Brown

(1995, 1996, 1999). For information on calculating NRT statistics for weighted items

(i.e., items that cannot be coded 1 or 0 for correct and incorrect), see Brown (2000).

For information on calculating item discrimination using the point-biserial correlation

coefficient instead of ID, see Brown (2001). For an example NRT development and

revision project, see Brown (1988).

Conclusion

I hope you have found my explanation of how to do norm-referenced item

analysis statistics (item facility and item discrimination) in a spreadsheet clear and

helpful. I must emphasize that these statistics are only appropriate for developing and

analyzing norm-referenced tests, which are usually used at the institutional level, like,

for example, overall English language proficiency tests (to help with, say, admissions

decisions) or placement tests (to help place students into different levels of English

study within a program). However, these statistics are not appropriate for developing

and analyzing classroom-oriented criterion-referenced tests like the diagnostic,

progress, and achievement tests of interest to teachers. For an explanation of item

analysis as it is applied to CRTs, read the Statistics Corner column in the next issue of

this newsletter, where I will explain the distinction between the difference index and

the B-index.


3.3 ITEM DIFFICULTY:

Definition:

"Item difficulty is a measure of the proportion of individuals who responded correctly to each test item." Item difficulty for a test item is determined by the proportion of individuals who correctly respond to that particular item.

"Item difficulty of a test for a particular group is evaluated by the percentage of participants who respond correctly."

Explanation:

Item difficulty is simply the percentage of students taking the test who answered the item correctly. The larger the percentage getting an item right, the easier the item. The higher the difficulty index, the easier the item is understood to be (Wood, 1960). To compute the item difficulty, divide the number of people answering the item correctly by the total number of people answering the item. The proportion for the item is usually denoted by P and is called the item difficulty. The range is from 0% to 100%.

Examples:

To determine the difficulty level of test items, a measure called the difficulty index is used. This measure asks teachers to calculate the proportion of students who answered the item correctly. By looking at each alternative (for multiple choice), we can also find out if there are answer choices that should be replaced. For example, suppose we give a multiple-choice quiz and there were four answer choices (A, B, C and D). The following table illustrates how many students selected each answer choice for Question #1 and #2.

Questions A B C D

#1 0 3 24* 3

#2 12* 13 3 2

*Denotes the correct answer.


For Question #1, we can see that A was not a very good distracter; no one selected that answer. We can also compute the difficulty of the item by dividing the number of students who chose the correct answer (24) by the total number of students (30). Using the formula, the difficulty of Question #1 is

P = 24/30

P = .80

A rough rule of thumb is that if the item difficulty is more than .75, it is an easy item; if the difficulty is below .25, it is a difficult item. Given these parameters, this item could be regarded as moderately easy; 80% of the students got it correct. In contrast, Question #2 is much more difficult:

P = 12/30

P = .40

In fact, on Question #2, more students selected an incorrect answer (B) than selected the correct answer (A). This item should be carefully analyzed to ensure that B is an appropriate distracter.

Therefore, "item difficulty" should perhaps have been named "item easiness"; it expresses the proportion or percentage of students who answered the item correctly.
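The difficulty index for both questions can be computed directly from the answer-choice counts in the table above; the short Python sketch below assumes the counts shown there (30 students per question).

# Difficulty index P = (number choosing the keyed answer) / (total number answering).
counts = {
    "Q1": {"A": 0, "B": 3, "C": 24, "D": 3},    # keyed answer: C
    "Q2": {"A": 12, "B": 13, "C": 3, "D": 2},   # keyed answer: A
}
keys = {"Q1": "C", "Q2": "A"}

for question, choice_counts in counts.items():
    p = choice_counts[keys[question]] / sum(choice_counts.values())
    print(question, round(p, 2))   # Q1 0.8 (easy), Q2 0.4 (difficult)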

3.4 THE INDEX OF DISCRIMINATION

Introduction

1. The index of discrimination is a useful measure of item quality whenever the

purpose of a test is to produce a spread of scores, reflecting differences in

student achievement, so that distinctions may be made among the


performances of examinees. This is likely to be the purpose of norm-

referenced tests.

2. It is the degree to which students with high overall exam scores also got a particular item correct. It is often referred to as Item Effect, since it is an index

of an item's effectiveness at discriminating those who know the content from

those who do not.

3. The item discrimination index is a point-biserial correlation coefficient. Its possible range is -1.00 to 1.00. A strong and positive correlation suggests that students who get any one question correct also have a relatively high score on the overall exam. Theoretically, this makes sense: students who know the content should perform well both on individual items and on the test overall. There's a problem if students are getting correct answers on a test and they don't know the content.

Measurement of Index of Discrimination

Example 1: If we are using the Item Analysis provided by Scanning Operations, discrimination indices are listed under the column head 'Disc.'

RESPONSE TABLE - FORM A

ITEM NO   OMIT%   A%   B%   C%   D%   E%   KEY   %    DISC
1         0       0    18   82   0    0    C     82   0.22
2         0       79   0    0    21   0    A     79   0.23
3         0       4    7    89   0    0    C     89   -0.12

The Index of Discrimination

When we examine item discrimination, there are a number of things we should consider.

1. Item difficulty. Very easy or very difficult items are not good discriminators. If an item is so easy (e.g., difficulty = 98) that nearly everyone gets it correct, or so difficult (e.g., difficulty = 30) that nearly everyone gets it wrong, then it becomes very difficult to discriminate those who actually know the content from those who do not.

2. That does not mean that very easy and very difficult items should be eliminated. In fact, they are fine as long as they are used with the instructor's recognition that they will not discriminate well and if putting them on the test matches the intention of the instructor to either really challenge students or to make certain that everyone knows a certain bit of content.

3. A poorly written item will have little ability to discriminate.

Example 2

Another measure, the Discrimination Index, refers to how well an assessment

differentiates between high and low scorers. In other words, you should be able to

expect that the high-performing students would select the correct answer for each

question more often than the low-performing students. If this is true, then the

assessment is said to have a positive discrimination index (between 0 and 1) --

indicating that students who received a high total score chose the correct answer for a

specific item more often than the students who had a lower overall score.

If, however, you find that more of the low-performing students got a specific

item correct, then the item has a negative discrimination index (between -1 and 0).

Let's look at an example.

Table 1 displays the results of ten students on three quiz questions. Note that the students are arranged with the top overall scorers at the top of Table 1.

Table-1: The Index of Discrimination

Student   Total score (%)   Question 1   Question 2   Question 3
Asif      90                1            0            1
Sam       90                1            0            1
Jill      80                0            0            1
Charlie   80                1            0            1
Sonya     70                1            0            1
Ruben     60                1            0            0
Clay      60                1            0            1
Kelley    50                1            1            0
Justin    50                1            1            0
Tonya     40                0            1            0

“1” indicates the answer was correct; “0” indicates it was incorrect.

Steps to determine the Difficulty Index and the Discrimination Index.

1. After the students are arranged with the highest overall scores at the top, count

the number of students in the upper and lower group who got each item

correct. For Question #1, there were 4 students in the top half who got it

correct and 4 students in the bottom half.

2. Determine the Difficulty Index by dividing the number who got it correct by

the total number of students. For Question #1, this would be 8/10 or p=.80.

3. Determine the Discrimination Index by subtracting the number of students in

the lower group who got the item correct from the number of students in the

upper group who got the item correct. Then, divide by the number of students

in each group (in this case, there are five in each group). For Question #1, that

means you would subtract 4 from 4, and divide by 5, which results in a

Discrimination Index of 0.

4. The answers for Questions 1-3 are provided in Table 2.


Table-2

Item         # Correct (upper group)   # Correct (lower group)   Difficulty (p)   Discrimination (D)
Question 1   4                         4                         .80              0
Question 2   0                         3                         .30              -0.6
Question 3   5                         1                         .60              0.8

In table 2 we can see that Question #2 had a difficulty index of .30 (meaning it

was quite difficult), and it also had a negative discrimination index of -0.6 (meaning

that the low-performing students were more likely to get this item correct). This

question should be carefully analyzed, and probably deleted or changed. Our "best"

overall question is Question 3, which had a moderate difficulty level (.60), and

discriminated extremely well (0.8).
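A minimal Python sketch of the steps above, using the responses from Table 1 with upper and lower groups of five students each, reproduces the values in Table 2.

# Difficulty (p) and Discrimination (D) from Table 1.
# Upper group: Asif, Sam, Jill, Charlie, Sonya; lower group: Ruben, Clay, Kelley, Justin, Tonya.
upper = {"Q1": [1, 1, 0, 1, 1], "Q2": [0, 0, 0, 0, 0], "Q3": [1, 1, 1, 1, 1]}
lower = {"Q1": [1, 1, 1, 1, 0], "Q2": [0, 0, 1, 1, 1], "Q3": [0, 1, 0, 0, 0]}

for q in ("Q1", "Q2", "Q3"):
    correct_upper, correct_lower = sum(upper[q]), sum(lower[q])
    p = (correct_upper + correct_lower) / (len(upper[q]) + len(lower[q]))   # difficulty
    d = (correct_upper - correct_lower) / len(upper[q])                     # discrimination
    print(q, p, d)   # Q1 0.8 0.0, Q2 0.3 -0.6, Q3 0.6 0.8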

Recommendations for Determining Index of Discrimination

It is typically recommended that item discrimination be at least .20. It's best to

aim even higher. Items with a negative discrimination are theoretically indicating that

either the students who performed poorly on the test overall got the question correct

or that students with high overall test performance did not get the item correct. Thus,

the index could signal a number of problems:

There is a mistake on the scoring key.

Poorly prepared students are guessing correctly.

Well prepared students are somehow justifying the wrong answer.

In all cases, action must be taken! So, items with negative discrimination must be addressed. Items with discrimination indices less than .20 (or slightly over, but still relatively low) must be revised or eliminated. Be certain that there is only one


possible answer, that the question is written clearly, and that your answer key is

correct.

UNIT-4:

INTERPRETING THE TEST SCORES

4.1 THE PERCENTAGE CORRECT SCORE:

INTERPRETING THE TESTS SCORES

What does a test score mean?

A test score is a piece of information, usually a number, that conveys the performance

of an examinee on a test. One formal definition is that it is "a summary of the

evidence contained in an examinee's responses to the items of a test that are related to

the construct or constructs being measured."

Test scores are interpreted with a norm-referenced or criterion-referenced interpretation, or occasionally both. A norm-referenced interpretation means that the score conveys meaning about the examinee with regard to their standing among other examinees. A criterion-referenced interpretation means that the score conveys information about the examinee with regard to a specific subject matter, regardless of other examinees' scores.

Types of Test Scores

There are two types of test scores: raw scores and scaled scores. A raw score is a

score without any sort of adjustment or transformation, such as the simple number of

questions answered correctly. A scaled score is the result of some transformation

applied to the raw score.

The purpose of scaled scores is to report scores for all examinees on a consistent

scale. Suppose that a test has two forms, and one is more difficult than the other. It

has been determined by equating that a score of 65% on form 1 is equivalent to a


score of 68% on form 2. Scores on both forms can be converted to a scale so that

these two equivalent scores have the same reported scores. For example, they could

both be a score of 350 on a scale of 100 to 500.

Two well-known tests in the United States that have scaled scores are the ACT and the SAT. The ACT's scale ranges from 1 to 36 and the SAT's from 200 to 800 (per section). Ostensibly, these two scales were selected to represent a mean and standard deviation of 18 and 6 (ACT), and 500 and 100 (SAT). The upper and lower bounds were selected because an interval of plus or minus three standard deviations contains more than 99% of a population. Scores outside that range are difficult to measure, and return little practical value.

Note that scaling does not affect the psychometric properties of a test; it is something that occurs after the assessment process (and equating, if present) is completed. Therefore, it is not an issue of psychometrics, per se, but an issue of interpretability.

Interpretation of the Score by Criterion Referencing

The raw score is the number of points received on a test when the test has been scored according to the instructions. A raw score is not very meaningful without further information. Criterion-referenced test interpretation permits us to describe an individual's test performance without referring to the performance of other individuals. Thus we might describe a student's performance in terms of the speed and precision with which a certain task is performed. Criterion-referenced interpretation of test scores is most meaningful when the test is designed to measure a set of clearly stated learning tasks. Enough items are used for each interpretation to make dependable judgments.


Interpretation of the Score by Percentages

In mathematics, a ratio expressed in relation to 100 is called a percentage (denoted by %). Often it is useful to express the scores in terms of percentages for comparison. Consider the following example.

Grade   Class A: No. of Students   %       Class B: No. of Students   %
A       10                         12.50   8                          40
B       25                         31.25   6                          30
C       30                         37.50   4                          20
D       15                         18.75   2                          10
Total   80                         100     20                         100

Ten students from class A and eight students from class B got grade A. It looks apparently as if class A is better at getting grade A, but 12.5% of the students from class A and 40% of the students from class B got grade A. It is clear from the percentages that class B is far better at getting grade A than class A.

Interpretation of the Score by Norm Referencing

Interpretation of scores by norm referencing involves ranking the scores and expressing a given score in relation to the other scores. Norm-referenced test interpretation tells us how an individual compares with other persons who have taken the same test. The simplest type of comparison is to rank the scores from highest to lowest and to note where an individual's score falls. The rest of the scores serve as the norm group. The given score is compared with the other scores by norm referencing. If a student's score is second from the top in a group of 20 students, it is a high score, meaning that 90% of the students scored less than he did.

Ordering and Ranking


A first step in organizing scores is the listing of scores in order of magnitude from the largest to the smallest score. The data so arranged are called an ordered array. By scanning an ordered array, we can determine quickly the largest score, the smallest score and other facts about the data.

Ranked data consists of scores in a form that shows their relative position on

some characteristic but does not yield a numerical value for this characteristic. The

order of finish of cars in a race is an example of ranking. If we list the cars as first,

second, third etc. up to the last car, we can say that they were ranked on the

characteristic of overall speed. We know each car's position relative to any other car's

position but we have no precise knowledge of the speed of any car. If a high school teacher ranked Hamid 30th in a class of 100, it means that Hamid did better than 70 of his classmates but poorer than 29. But nothing has been said about Hamid's general level of achievement.

Measurement Scales

Measurement scales are of great significance in analyzing and interpreting

results. The important types of measurement scales are:

The Nominal Scale

The lowest measurement scale is the nominal scale. In this scale, each

individual is put into one of the distinct categories or classes. Each class has a name.

The names are just labels. There is no order in these classes. We cannot say that one

class is larger than the other class. You cannot do arithmetic operations (addition,

subtraction, multiplication, division) on this scale.

Examples of the nominal scale are: categorization of the blood groups of the students of a college into A, B, AB and O groups (we cannot say that group A is better than group B); classification of books in a college library according to subjects; and distribution of the population of Pakistan according to sex, religion, occupation, marital status, literacy, etc.

The Ordinal Scale

When measurements are not only different from category to category but can

also be ranked according to some criterion, they are to be measured on an ordinal

scale. The members of any one category are considered equal but members of one category are considered lower than those in another category. The ordinal scale is one step higher than the nominal scale because we distribute the individuals not only

in classes but we also order these classes.

An example of the ordinal scale is the categorization of schools according to their educational level into primary, middle, secondary or higher secondary. There is an order in these classes. The primary level is lower than the middle

level and the middle level is lower than the secondary level. You cannot do arithmetic

operations on this scale.

Individuals may be classified according to socioeconomic status as low, medium or high. The intelligence of students may be average, above average or below average. Examination results may be classified into different grades: A1, A, B, C, D, E, etc. In this measurement scale, we can say that one individual is higher than another but we cannot say how much higher.

The Interval Scale

In this scale, it is not only possible to order measurements, but the distance between two measurements is also known. We can say that the difference between the measurements 30 and 40 is equal to the difference between the measurements 40 and 50.

The level of the interval scale is higher than the nominal and the ordinal scales. This

is truly a quantitative scale. A unit of measurement and a zero point are required for

this scale. The selected zero point is not necessarily a true zero. It does not have to


indicate a total absence of the quantity being measured. We measure height in meters

or feet, weight in kilograms or pounds, temperature in centigrade or Fahrenheit,

income in rupees and time in seconds. Arithmetic operations can be done on this scale. You can add the income of a wife to that of her husband.

The Ratio Scale

The highest level of measurement is the ratio scale. Equality of ratios as well

as equality of intervals is determined in this scale. Fundamental to the ratio scale is

the true zero point. The measurement of height, weight and length makes use of the

ratio scale.

Frequency Distribution

Data that have been originally collected are called raw data or primary data. They have not yet undergone any statistical treatment. To understand the raw data easily, we arrange them into groups or classes. The data so arranged are called grouped data or a frequency distribution.

General rules for the construction of a frequency distribution:

1. Determine the Range. Range is the difference between highest and lowest

scores.

2. Decide the appropriate number of class intervals: There is no hard and fast

formula for deciding the number of class intervals. The number of class

intervals is usually taken between 5 and 20 depending on the length of the

data.

3. Determine the approximate length of the class interval by dividing the range

with number of class intervals.

4. Determine the limits of the class intervals, taking the smallest scores at the bottom of the column to the largest scores at the top.


5. Determine the number of scores falling in each class interval. This is done by

using a tally or score sheet.

Example:

The marks obtained by 120 students of a first-year class in the subject of Education are given below. Construct a frequency distribution.

57 86 69 62 75 73 80 78 87 83 77 35 70 68 84 73 81 78

61 72 59 98 95 63 76 73 88 60 52 83 86 45 70 53 85 74

62 78 89 84 60 79 91 64 84 85 81 79 90 78 83 50 71 65

76 58 71 79 51 61 61 89 81 74 76 74 82 91 71 76 80 52

71 66 77 65 44 79 95 74 79 63 83 87 77 75 83 48 70 85

61 70 72 67 61 83 75 79 97 75 66 54 81 68 78 75 83 61

33 76 62 55 72 76 78 75 99 80 83 86

The following steps are followed to make a frequency distribution.

1. Step-1: Range = maximum score − minimum score = 99 − 33 = 66.

2. Step-2: The approximate number of class intervals to be taken is 7.

3. Step-3: The length of the class intervals, usually denoted by i, is

i = Range / No. of class intervals = 66 / 7 ≈ 9.4

The length is usually rounded upward to a whole number. Therefore 9.4 is taken as 10.

4. Step-4: Determine the limits of the class intervals

90 — 99

80 — 89

70 — 79

60 — 69


50 — 59

40 — 49

30 — 39

The lowest class interval is taken such that the minimum score can be included in it. The minimum score is 33. The lowest class interval could be started from 33, but it is convenient to start the lowest class interval from a score to which addition of the length of the class intervals is easy. So we start from 30. This is called the lower limit of the class interval. Add 9 (i − 1 = 10 − 1 = 9) to the lower limit to get the upper limit of the first class interval. Now successively add i = 10 to the lower limits and upper limits to get the remaining class intervals.

5. Step-5: Distribute the scores in the class intervals by putting a tally mark in the

relevant class interval and count the number of scores in each class interval.

Grade     Tallies                                                   No. of Students
90 - 99   ||||| |||                                                 8
80 - 89   ||||| ||||| ||||| ||||| ||||| |||||                       30
70 - 79   ||||| ||||| ||||| ||||| ||||| ||||| ||||| ||||| |||||     45
60 - 69   ||||| ||||| ||||| ||||| ||                                22
50 - 59   ||||| |||||                                               10
40 - 49   |||                                                       3
30 - 39   ||                                                        2
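The tallying can also be done programmatically. The sketch below uses only the first row of the 120 marks as an illustration; applied to all 120 marks, the same code reproduces the frequencies in the table above (2, 3, 10, 22, 45, 30 and 8 for the intervals 30-39 up to 90-99).

# Tally scores into class intervals of length i = 10, from 30-39 up to 90-99.
marks = [57, 86, 69, 62, 75, 73, 80, 78, 87, 83, 77, 35, 70, 68, 84, 73, 81, 78]  # first row only

frequencies = {}
for lower in range(30, 100, 10):
    upper = lower + 9
    frequencies[f"{lower}-{upper}"] = sum(lower <= m <= upper for m in marks)
print(frequencies)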

Frequency

The number of scores lying in a class interval is called the frequency of that

class interval. For example, two scores lie in the class interval 30-39. Therefore 2 is

the frequency of the class interval 30-39.

Mid-Point or Class Mark

The middle of a class interval is called the mid-point or class mark and is usually denoted by X. It is calculated as


Midpoint = X = (Lower limit + Upper limit) / 2

For example, the mid-point of the class interval 30-39 is

X = (Lower limit + Upper limit) / 2 = (30 + 39) / 2 = 69 / 2 = 34.5

Measures of Central Tendency:

A single score calculated to represent all the scores is called an average.

Average tends to lie in the centre of an array. That is why averages are called

measures of central tendency. Since averages locate the centre of a data set, these are

also called measures of location.

Several types of average can be defined. The most commonly used averages are the arithmetic mean, the median and the mode.

The Arithmetic Mean or Mean

The arithmetic mean is the most commonly used average. It is usually called the mean or average. The arithmetic mean is defined as the number obtained by dividing the sum of the scores by their number. It is denoted by putting a bar on the variable symbol, e.g., X̄ (read as "X bar"). The formula for calculating the arithmetic mean for ungrouped data is:

X̄ = ΣX / N

where

Σ, read as sigma, is the Greek symbol meaning "the sum of".
ΣX means the sum of the values of the variable X.
N is the number of scores or measurements.

In order to calculate the arithmetic mean for grouped data, the formula is:

X̄ = Σfx / Σf


where

Σfx means the sum of the products of f and x, in which f means the frequency of the scores and x means the score. Σf means the sum of all the frequencies of the distribution.

The Median:

The median of a set of scores is the middle score, or the arithmetic mean of the two middle scores, in an array. 50% of the scores are less than the median and 50% of the scores are greater than the median.

Formula for calculating the median for ungrouped data:

Median = ((N + 1) / 2)th score

Formula for calculating the median for grouped data:

Median = L + (i / f)(N/2 − C)

where

L = lower class boundary of the median class interval.
i = length of the median class interval.
f = the frequency of the median class interval.
N = Σf
C = the cumulative frequency of the class interval below the median class interval.

The Mode

The mode is the score that occurs the greatest number of times in a data set. The mode does not always exist. If each score occurs the same number of times, there is no mode. There may be more than one mode. If two or more scores occur the greatest number of times, then there is more than one mode.

The mode can be calculated for grouped data with the help of the following formula.

Mode = L + ((fm − f1) / (2fm − f1 − f2)) × i

where

L = lower class boundary of the modal class interval.
fm = the maximum frequency.
f1 = the frequency preceding the modal class.
f2 = the frequency succeeding the modal class.
i = the length of the modal class interval.

Note: The mode lies in the class interval having maximum frequency. This class

interval is called the modal class.

Empirical Relationship between Mean, Median and Mode:

For moderately skewed distributions, we have the following empirical

relation:

Mode = 3 Median — 2 Mean

Mode = 3 (74.61) — 2 (73.42)

Mode = 76.99
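For ungrouped data the three averages are easy to compute directly; the scores below are a small hypothetical set, not the 120 marks of the earlier example.

# Mean, median and mode for a small, hypothetical set of ungrouped scores.
from statistics import mean, median, mode

scores = [61, 70, 72, 75, 75, 78, 83, 86]

print(mean(scores))    # arithmetic mean = sum of scores / number of scores = 75.0
print(median(scores))  # average of the two middle scores of the ordered array = 75.0
print(mode(scores))    # the most frequently occurring score = 75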

Comparison of Measures of Central Tendency:

The numerical value of every score in a data set contributes to the mean. This

is not true of the mode or median because only the mean is based on the sum of all the

scores. In a single peaked symmetrical distribution mean = median = mode. In


practice, no distribution is exactly symmetrical, so the mode, median and mean

usually have different values. If a population is not symmetrical, the mean, median

and mode will not be equal. The mean is affected by the presence of a few extreme

scores which the median and mode are not. The mean is preferred if extreme values

are not present in the data. Median is preferred if interest is centered on the typical

rather than the total score and if the distribution is skewed. If some scores are missing

so that the mean cannot be computed directly, the median is appropriate. Mode is

preferred only if the distribution is multimodal and a multi-valued index is

satisfactory.

The Quartiles

The values that divide a set of scores into four equal parts are called quartiles and are denoted by Q1, Q2 and Q3. Q1 is called the lower quartile and Q3 is called the upper quartile. 25% of the scores are less than Q1 and 75% of the scores are less than Q3. Q2 is the median. The formulas for the quartiles are given as:

Q1 = ((N + 1) / 4)th score

Q2 = (2(N + 1) / 4)th score = ((N + 1) / 2)th score

Q3 = (3(N + 1) / 4)th score

4.2 THE PERCENTILE RANKS:

The Percentiles:

The values that divide a set of scores into one hundred equal parts are called percentiles and are denoted by P1, P2, P3, ..., P99. P25 is the first quartile, P75 is the third quartile and P50 is the median.


The Percentile Ranks (PR):

The procedure for calculating percentile ranks is the reverse of the procedure for calculating percentiles. Here we have an individual's score and find the percentage of scores that lies below it. In the example, we calculated P78 = 83.37. It means that 83.37 is the score below which 78% of the scores fall. If a student has a score of 83.37, we can say that his percentile rank (PR) is 78 on a scale of 100.
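A percentile rank found this way is just the percentage of scores in the group falling below the given score; here is a minimal sketch over a hypothetical score list.

# Percentile rank (PR): percentage of scores in the group that lie below the given score.
scores = [45, 52, 58, 61, 64, 67, 70, 73, 77, 82]   # hypothetical norm group

def percentile_rank(score, group):
    below = sum(s < score for s in group)
    return 100 * below / len(group)

print(percentile_rank(73, scores))   # 70.0 -> a score of 73 exceeds 70% of the group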

Relationships with a Distribution

Computing the Coefficient of Correlation

A coefficient of correlation measures the degree of linear relationship between two sets of scores. The range of the coefficient is from −1 to +1, with the intermediate value 0 meaning no linear relationship. There are two extremes: r = +1 indicates perfect positive correlation and r = −1 indicates perfect negative correlation. The larger the absolute value of r, the higher is the degree of linear relationship.

[Scatter diagrams illustrating positive correlation, negative correlation and no correlation]

The most common methods of computing the Coefficient of correlation are:

1. Rank-difference method:

This method is useful when the number of scores to be correlated is small or the exact magnitude of the scores cannot be ascertained. The scores are ranked according to size or some other criterion using the numbers 1, 2, 3, ..., N. The rank-difference coefficient of correlation can be computed by the following formula.

Rs = 1 − (6ΣD²) / (N(N² − 1))


where D = the difference between the two rankings, and N = the number of pairs of scores.

2. The Product-moment method

The product-moment coefficient is usually used when the number of scores is

large. Thus this method is used in most research studies. The product-moment

coefficient is usually denoted by r.

rxy = [ΣXY/N − (ΣX/N)(ΣY/N)] / [√(ΣX²/N − (ΣX/N)²) × √(ΣY²/N − (ΣY/N)²)]
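Both coefficients can be sketched directly from the formulas above. The paired scores below are hypothetical, and the rank-difference part assumes there are no tied scores.

# Rank-difference (Spearman) and product-moment (Pearson) correlation coefficients.
from math import sqrt

X = [10, 20, 30, 40, 50]
Y = [12, 25, 33, 38, 52]
N = len(X)

# Rank-difference method: Rs = 1 - 6*sum(D^2) / (N*(N^2 - 1)), D = difference of ranks.
rank = lambda values: [sorted(values).index(v) + 1 for v in values]
D = [rx - ry for rx, ry in zip(rank(X), rank(Y))]
Rs = 1 - 6 * sum(d * d for d in D) / (N * (N * N - 1))

# Product-moment method, using the computational formula given in the text.
mean_x, mean_y = sum(X) / N, sum(Y) / N
cov = sum(x * y for x, y in zip(X, Y)) / N - mean_x * mean_y
sx = sqrt(sum(x * x for x in X) / N - mean_x ** 2)
sy = sqrt(sum(y * y for y in Y) / N - mean_y ** 2)
r = cov / (sx * sy)

print(round(Rs, 2), round(r, 2))   # 1.0 and about 0.99 for these made-up data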

Measures of Variability:

Measures of central tendency measure the centre of a set of scores. However,

two data sets can have the same mean, median and mode and yet be quite different in

other respects. For example, consider the heights (in inches) of the players of two

basketball teams.

Team-1: 72 73 76 76 78

Team-2: 67 72 78 76 84

The two teams have the same mean height, 75 inches, but it is clear that the heights of the players of team 2 vary much more than those of team 1. If we have information about the centre of the scores and the manner in which they are spread out, we know much more about a set of scores. The degree to which scores tend to spread about an average value is called dispersion.

The Range

It is the simplest measure of dispersion. The range of a set of scores is the difference between the maximum score and the minimum score.

In symbols


Range = Xm − Xo

where Xm is the maximum score and Xo is the minimum score.

The Quartile Deviation:

The quartile deviation is defined as half of the difference between the third and the first quartiles. In symbols

Q.D. = (Q3 − Q1) / 2

where Q1 is the first quartile and Q3 is the third quartile.

The Mean Deviation or Average Deviation:

The average deviation is defined as the arithmetic mean of the deviations of

the scores from the mean or median; the deviations are taken as positive. In symbols

M.D. = Σ|X − X̄| / N

For grouped data

M.D. = Σf|X − X̄| / Σf

The Standard Deviation:

The standard deviation is the positive square root of the arithmetic mean of the

squares of deviations of all the scores from their mean.

S = √(Σ(X − X̄)² / N)

A short formula for calculating the standard deviation is

S = √(ΣX²/N − (ΣX/N)²)

The Coefficient of Variation:

Karl Pearson introduced a relative measure of dispersion known as coefficient

of variation (denoted by c.v). It expresses the standard deviation as a percentage of the

arithmetic mean of a data set. It is a number without units and is used to compare variation in two or more distributions. A smaller value of the C.V. indicates less variation. It is also used as a criterion for consistent performance of students, players, etc.

C.V. = (S / X̄) × 100

Standard Scores:

A frequently used quantity in statistical analysis is the standard score or Z-score. The standard score for a data value is the number of standard deviations that the data value lies away from the mean of the data set.

Z = (X − X̄) / S
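The measures of variability defined above, and the Z-score, can be illustrated with the Team-1 heights used earlier (72, 73, 76, 76, 78 inches); the sketch below treats the five heights as the whole group.

# Range, standard deviation, coefficient of variation and Z-score for the Team-1 heights.
from math import sqrt

heights = [72, 73, 76, 76, 78]
N = len(heights)

mean = sum(heights) / N                                  # 75.0
score_range = max(heights) - min(heights)                # Range = Xm - Xo = 6
sd = sqrt(sum((x - mean) ** 2 for x in heights) / N)     # S is about 2.19
cv = sd / mean * 100                                     # C.V. is about 2.92 (percent)
z_tallest = (78 - mean) / sd                             # Z-score of the tallest player, about 1.37

print(round(score_range, 2), round(sd, 2), round(cv, 2), round(z_tallest, 2))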

The Normal Curve:

Before explaining the normal distribution, some basic concepts of probability are given below. An event is a specified result that may or may not occur when an experiment is performed. For example, in the tossing of a coin once, the appearance of a head is an event, which may or may not occur. The probability of an event is a measure of the likelihood of its occurrence. A probability near 0 indicates that the event is very unlikely to occur, whereas a probability near 1 indicates that the event is quite likely to occur.

Relative frequency interpretation of probability:


Consider the experiment of tossing a balanced coin once. There is a 50-50 chance that a head will appear. Consequently, we assign a probability of 0.5 to that event. The relative-frequency interpretation is that in a large number of tosses, a head will appear about half of the time.

Some Basic Properties of the Normal Curve

1. The total area under the normal curve is equal to 1.

2. The normal curve extends indefinitely in both directions.

3. The normal distribution is symmetric about the mean; that is, the part of the curve to the left of the mean is the mirror image of the part of the curve to the right of it.

4. The mean, the median and the mode are equal.

5. The mean deviation is 0.7979σ.

6. The quartile deviation is 0.6745σ.

7. In a normal distribution,

μ − 0.6745σ to μ + 0.6745σ covers 50% of the area.

μ − σ to μ + σ covers 68.27% of the area.

μ − 2σ to μ + 2σ covers 95.45% of the area.

μ − 3σ to μ + 3σ covers 99.73% of the area.
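Property 7 can be verified numerically from the standard normal distribution; the sketch below uses the error function to compute the area between μ − kσ and μ + kσ.

# Area under the normal curve between mu - k*sigma and mu + k*sigma.
from math import erf, sqrt

def area_within(k):
    # P(-k < Z < k) for a standard normal variable
    return erf(k / sqrt(2))

for k in (0.6745, 1, 2, 3):
    print(k, round(area_within(k) * 100, 2))   # about 50.00, 68.27, 95.45, 99.73 percent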

4.3 STANDARD SCORES:

Most educational and psychological tests provide standard scores that are

based on a scale that has a statistical mean (or average score) of 100. If a student earns

a standard score that is less than 100, then that student is said to have performed

below the mean, and if a student earns a standard score that is greater than 100, then


that student is said to have performed above the mean. However, there is a wide range

of average scores, from low average to high average, with most students earning

standard scores on educational and psychological tests that fall in the range of 85-115.

This is the range in which 68% of the general population performs and, therefore, is

considered the normal limits of functioning.
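The conversion from a raw score to a standard score on such a scale is a simple linear rescaling of the Z-score; the sketch below assumes a standard deviation of 15 for the reported scale, which is what makes 85-115 cover the middle 68%. The raw mean and raw standard deviation in the example are hypothetical.

# Convert a raw score to a standard score on a mean-100 scale.
def standard_score(raw, raw_mean, raw_sd, scale_mean=100, scale_sd=15):
    z = (raw - raw_mean) / raw_sd          # Z-score of the raw score
    return scale_mean + z * scale_sd       # rescale to mean 100, SD 15

print(standard_score(62, raw_mean=50, raw_sd=10))   # 118.0 -> falls in the high average range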

Classifying Standard Scores

However, the normal limits of functioning encompass three classification

categories: low average (standard scores of 80-89), average (standard scores of 90-

109), and high average (110-119). These classifications are used typically by school

psychologists and other assessment specialists to describe a student's ability compared

to same-age peers from the general population.

Subtest Scores

Many psychological tests are composed of multiple subtests that have a mean

of 10, 50, or 100. Subtests are relatively short tests that measure specific abilities,

such as. vocabulary, general knowledge, or short-term auditory memory. Two or more

subtest scores that reflect different aspects of the same broad ability (such as broad

Verbal Ability) are usually combined into a composite or index score that has a mean

of 100. For example, a Vocabulary subtest score, a Comprehension subtest score, and

a General Information subtest score (the three subtest scores that reflect different

aspects of Verbal Ability) may be combined to form a broad Verbal Comprehension

Index score. Composite scores, such as IQ scores, Index scores, and Cluster scores,

are more reliable and valid than individual subtest scores. Therefore, when a student's

performance demonstrates relatively uniform ability across subtests that measure

different aspects of the same broad ability (the Vocabulary, Comprehension, and


General Information subtest scores are all average), then the most reliable and valid

score is the composite score (Verbal Comprehension Index in this example).

However, when a student's performance demonstrates uneven ability across subtests

that measure different aspects of the same broad ability (the Vocabulary score is

below average, the Comprehension score is below average, and the General

Information score is high average), then the Verbal Comprehension Index may not

provide an accurate estimate of verbal ability. In this situation, the student's verbal

ability may be best understood by looking at what each subtest measures. In sum, it is

important to remember that unless performance is relatively uniform on the subtests

that make up a particular broad ability domain (such as Verbal Ability), then the

overall score (in this case the Verbal Comprehension Index) may be a misleading

estimate.

4.4 PROFILE:


UNIT-5:

EVALUATING PRODUCT, PROCEDURES & PERFORMANCE

5.1 EVALUATION THEMES AND TERM PAPERS:

The evaluation is structured around a logical sequence of seventeen questions

which fall under six evaluation themes. The following are major themes of evaluation.

1. Learning Outcomes:

The quality of learning outcomes is the first theme identified in the teaching and learning framework. Under this theme, one important sub-theme is identified.

Attainment of Curriculum Objectives:

Consider the knowledge, skills and understanding of our pupils.

How does the knowledge level of pupils reflect the curriculum objectives for the chosen area?

What opportunities are pupils afforded to use and display their ability and to apply their knowledge and skills?

Can pupils use their skills for curriculum reasoning in problem solving?

What is the attitude of pupils to learning the curriculum?

Do our pupils enjoy learning? Are they motivated to learn?

In numeracy what do you understand by each of the following skills?

Applying and problem solving

Communicating and expressing

Implementing

Integrating

Reasoning

Understanding and recalling


In the course of a week, how many of the following feature in your numeracy lessons? What opportunities are provided for development in the following contexts?

Oral language

Reading

Writing

Digital literacy

If we have not completed a school improvement plan to date, what do we need to focus on to support learning outcomes and the attainment of curriculum objectives in each curriculum area?

2. Learning Experience

The quality of the learning experience is the second theme identified in teaching and learning. Three important sub-themes are identified:

Learning environment

Engaged in learning

Learning to learn

Engaged in Learning

Are students interested and enthused by the content and teaching approaches used?

Do we encourage pupil questioning? Consider teacher input versus pupil participation in the classroom.

How active are pupils while the teacher works?

Collaborative and independent learning.

Progressive skill learning and skill development.

Challenge and support.

Support learning by referring to outcomes and related success criteria to allow for further enhancement of understanding.


Pupils enjoy learning in the classroom and are eager to find out more.

All students in the classroom are afforded the opportunity to participate in lessons and engage with learning.

Learning Environment

Involve the students in developing rules which recognize the rights and responsibilities of the community.

Provide proper supervision of pupils both within the class and at break times within the school setting.

All resources are well organized, labelled and clear to all learners.

Celebrate pupils' learning and achievements through a range of displays.

Use concrete and visual materials, centres of interest and displays of pupil work.

Learning to Learn

Learning to learn is the third sub-theme of learning experiences.

Engage pupils in monitoring their own progress, use learning-to-learn techniques properly in the classroom, and develop learners' skills through proper planning of lessons.

Allow the learner to communicate and share work with others in the class.

How do we enable learners to develop their personal organization and to plan their own work? What study and revision skills do we teach?

Teach pupils how to organize and present their work.

Make pupils creative and give them opportunities for collaborative work.

3. Teacher's Practice

The quality of the teacher's practice is the third theme of the teaching and learning framework. Under this theme, four sub-themes are identified:


Preparation for teaching

Teaching approach

Management of pupils

Assessment

1. Preparation for Teaching

Learning outcomes:

Do we provide clear, relevant and differentiated learning outcomes to pupils?

How are pupils made aware of what they are going to learn?

Are pupils familiar with the expected success criteria in learning activities?

Written Plans

Are the long- and short-term plans prepared in accordance with the rules for primary teachers?

Does the planning clearly indicate expected learning outcomes, teaching approaches, resources and activities?

Monthly Progress Report

Do our monthly progress records provide a clear picture of the progression and continuity of pupil learning across the curriculum?

Literacy and Numeracy

Are the literacy and numeracy opportunities identified across the curriculum?

How have we identified these opportunities in whole-school plans and individual planning?


Resources

How satisfied are we with the resources, materials and equipment we have within our classroom and available within the school? Are the necessary and relevant materials readily available?

Assessment

Reflect on the use of assessment as an aid to teaching and learning. How do we plan for assessment?

Does our planning reflect the whole-school assessment policy?

How do we incorporate best practice as ????? in the assessment guidelines (2007) into our teaching and learning?

Teaching Approach

Learning Outcome

How are lessons guided by expected learning outcomes and linked to the curriculum?

What provision is made to ensure expected learning outcomes are achieved during lessons?

Focus of Learning

Is attention given within each curriculum area to the systematic development and application of knowledge and skills, including ICT?

Is pupils' learning timely and does it happen at regular intervals?

Analysis and Use of Assessment Information

Inform the teacher's setting of learning targets and activities for individual pupils, groups and the whole class.

Inform the school improvement plan and revise and update whole-school improvement targets.


What is a term paper?

Term Paper:

Definition

"A term paper has two purposes: the student should demonstrate an understanding of the material as well as the ability to communicate that understanding effectively."

Writing term papers gives students practical experience in writing at length; communicating thoughts and ideas through the written word is a necessary skill in any profession.

A term paper is a research paper written by students over an academic term, accounting for a large part of a grade. Term papers are generally intended to describe an event or a concept, or to argue a point. A term paper is an original written work discussing a topic in detail, usually several typed pages in length, and is often due at the end of a semester. There is much overlap between term papers and research papers. "Term paper" was originally used to describe a paper (usually research based) that was due at the end of the "term", either a semester or quarter, depending on which unit of measure a school used. Common usage has "term paper" and "research paper" as interchangeable, but this is not completely accurate: not all term papers involve academic research, and not all research papers are term papers.

Term papers date back to the beginning of the 19th century, when print could be reproduced cheaply and written texts of all types (reports, memoranda, specifications and scholarly articles) could be easily produced and disseminated. Writing of the years from 1870 to 1900, Moulton and Holmes (2003) state that American education was transformed as writing became a method of discourse and research the hallmark of learning.


Importance:

Now that you are cognizant of the fundamentals of composing A+ research papers, here is one extra point to guarantee quality: never forget to carefully proofread your term paper.

The term paper is a necessary evil for every college student. Many students wonder why they need to regurgitate lectures and research on paper, but the term paper actually serves an important purpose for a college education and future careers.

Effects:

In addition to the immediate effects on a student's course grade and grade

point average, a term paper will be valuable when searching for or advancing in

careers.

Term Paper Evaluation

The term paper is graded according to the following criteria.

The Theoretical Section

Range, depth and quality of literature research on your topic.

The author has integrated a variety of key pieces of literature on the topic, thus representing the current state of research as well as covering various viewpoints.

The author has integrated a variety of key pieces of literature but focuses too much on one particular author or viewpoint.

No independent literature research has been carried out; the author exclusively refers to pieces of literature that have been assigned as course reading.

Correctness of the theoretical part

The contents of individual pieces of literature are represented correctly and given appropriate prominence.


The contents of individual pieces of literature are largely represented correctly, although the student may give too much prominence to individual pieces.

The literature review reveals that the student has not fully understood large parts of the literature. The content of the theoretical part is, as a result, incorrect to a considerable degree.

Presentation of the Literature Review and Development of the Argument:

The student has represented the views of prominent scholars on the topic and has developed a critical argument in support of or against the literature represented. His/her literature review is focused on the research question and relevant to it.

The student presents the current state of research concerning the topic at hand but focuses too much on one particular viewpoint.

The literature review shows a lack of focus. The author presents bits and pieces that are loosely related to the topic at hand.

Presentation of Results

The tables and figures are legible and easy to grasp at first glance. It is evident that the author has spent time finding the best visual means of presentation.

The formatting of tables and figures is satisfactory, yet not always easy to grasp at first glance.

The student has not attempted to use tables and figures to support his/her argument.

The student explains how he/she arrived at the results presented and indicates their significance to the topic or the field of linguistics in which the paper is written.


The author largely points out the most striking results; at the same time, however, he/she concentrates too much on discussing aspects that are not entirely relevant to the research question at hand.

The author mainly lists examples from his/her data; no comparison of the results with those of previous studies is offered.

Language (Vocabulary, Grammar, Style).

The student uses the academic writing register/style with appropriate linguistic terminology.

The language used is largely suitable for an academic piece of writing, but the paper exhibits some, mainly recurring, errors.

The student uses a writing style which is inappropriate for an academic paper. There are a great number of grammatical mistakes and paragraphs lack coherence.

Further instructions and rubric for term paper grading

Each student must submit an independently-written report of their term paper

project. Team members are welcome to share literature related to their project theme

and may work jointly to develop hypotheses, predictions and experimental design.

Nonetheless, the organization and text of each report must be developed

independently by each team member. Normal rules concerning plagiarism apply. If

you have any questions about this, best ask us first.

The written term paper must have the following structure and include all of the

following elements:

THE PAPER SHOULD BE A MAXIMUM OF 5 PAGES (double-spaced; Times New Roman 12 font), excluding the title page and literature cited.

First (title) page must include:


1. Descriptive title

2. Author

3. Abstract (NOTE: MAXIMUM 200 WORDS)

Subsequent pages must include:

4. Introduction

5. Hypotheses and predictions (these can be incorporated into the introduction or

presented below a separate sub-heading)

6. Study Area and Organisms

7. Methods and experimental design

8. Significance of work

9. Literature Cited (use a journal format of your choice, e.g., Journal of Ecology, Oecologia)

The following criteria will be used for grading the report:

1. Abstract: Does the abstract reflect the title of the project and the aim and scope of

the work? Does it contain essential information on rationale, hypothesis, study system

and significance? Is it written clearly? (10 points)

2. Introduction: Does the introduction start by introducing a significant question in

community ecology? Are statements supported by appropriate citations from

published literature? Is the initial question refined through the introduction to

statement of the objective of the study? (10 points)

3. Hypotheses and predictions: Are the hypotheses presented clearly related to the

objective of the study and are they logically connected to the ideas presented in the

introduction? Are hypothesis stated correctly (i.e., do they provide an explanation for

an observation)? Are predictions logically connected to the hypotheses? Are

alternatives to the primary hypothesis acknowledged (where appropriate)? (10 points)


4. Study area and organisms: Is necessary background information on the

ecology/natural history of study organisms and the study site presented? Is there

unnecessary/irrelevant information that could have been omitted? Is the choice of

organism/study site appropriate given the objective of the study? (10 points)

5. Methods and Experimental Design: Is the explanation of the methods clear (use

figures if necessary)? Are the methods appropriate to test the hypothesis proposed?

Are essential methodological details included? This might include, for example,

replication of treatments, size of treatments, duration of experiment, description of

independent and dependent variables. Has the author considered potential

confounding effects that might interfere with the ability to test the hypothesis? Are

these recognized/addressed (where possible)? (10 points)

6. Significance, originality and creativity of work: This is the justification

statement for the project – this does NOT mean just the conservation/management

(i.e., applied) importance of your proposed work. I am looking here for a statement to

indicate how this project can move the field of community ecology forward. Does this

study potentially provide new insights into how communities are organized? Are the

results from this study system broadly applicable to other groups of organisms or

other kinds of interactions? What other studies might build on the results from this

one – or would the results of this study allow you to infer something new about this

system that you could then go on to test? (10 points)

7. Literature Cited: Are the papers cited in the body of the proposal listed here? Are the references listed here cited in the body of the text? Is consistent formatting used? (5 points)

8. Presentation and clarity: Are the different sections of the proposal well linked?

Are the ideas presented clearly – and can they be followed from one section of the

proposal to the next? Is the writing style clear (topic sentences introduce themes

presented in each paragraph; concise language used; spelling and grammar


acceptable)? Is the use of tense and active/passive voice consistent? Note: Use the past

tense to describe results found in previous studies; use the future tense "we will..." to

describe work that you propose to do (5 points).
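To make the weighting of these criteria concrete, the short Python sketch below simply totals the eight criterion scores against the maxima listed above and reports a percentage. The criterion names and point values come from the list above; the awarded marks are hypothetical and shown only for illustration.

    # Illustrative only: totals term-paper rubric scores against the maxima listed above.
    MAX_POINTS = {
        "Abstract": 10,
        "Introduction": 10,
        "Hypotheses and predictions": 10,
        "Study area and organisms": 10,
        "Methods and experimental design": 10,
        "Significance, originality and creativity": 10,
        "Literature cited": 5,
        "Presentation and clarity": 5,
    }

    def total_score(awarded):
        # Sum awarded points (clipped to each criterion's maximum); return total, maximum, percent.
        total = sum(min(awarded.get(name, 0), maximum) for name, maximum in MAX_POINTS.items())
        maximum = sum(MAX_POINTS.values())  # 70 points in all
        return total, maximum, round(100.0 * total / maximum, 1)

    # Hypothetical marks for one paper:
    marks = {"Abstract": 8, "Introduction": 9, "Hypotheses and predictions": 7,
             "Study area and organisms": 8, "Methods and experimental design": 9,
             "Significance, originality and creativity": 8,
             "Literature cited": 5, "Presentation and clarity": 4}
    print(total_score(marks))  # (58, 70, 82.9)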

5.2 EVALUATING GROUP WORK & PERFORMANCE

Evaluating group work can provide valuable information about the degree to

which:

The use of group work enhanced (or otherwise) student achievement of

learning outcomes and engagement.

The use of group work enhanced (or otherwise) the evaluator's delivery or

assessment of the unit of study.

Specific Questions:

The evaluator can ask more specific questions about:

The response of individual students to group work as compared to individual

work.

Group work process versus the group work product.

The effectiveness of group work in class and/or out of class to enhance

learning.

The appropriateness of group work.

Organizational, planning, management and monitoring issues.

Strengths and weakness of group work and ideas for improvement.

Diversity issues (did some students find it easier or harder, benefit more

than others and why, what about issues of power).

The ways in which the evaluator explained, facilitated, managed and monitored the groups.

The overall nature of the unit of study.


Timing of Evaluation:

Evaluation can occur at any time during the unit of study program, but it

usually occurs at the end of the semester or at the end of the task that is being

undertaken and evaluated.

Ideally, students should be given time to reflect upon their experiences prior to

completing any form of evaluation, especially if the evaluator desires some specific

information about their experiences of group work or has a specific reflection

component within the work being evaluated.

It is also important to explain clearly why the evaluation is being undertaken.

It's a good idea to explain all of this at the start of the unit of study and to

provide opportunities for students to reflect along the way.

Evaluation can also be built into the requirements of the group work tasks by

asking students to complete an evaluation of their own or the whole group's

experience of group work. This could also be a requirement of their assessment. It is up to

the evaluator whether or not to allocate marks.

Method for Collecting Data for Evaluation:

There is no single method for designing or conducting an evaluation. Methods

can be quantitative or qualitative, formal or informal, formative or summative, self-

administered or externally administered, or any combination of these. There are

advantages and disadvantages to each method, and the choice will largely depend upon

the purpose of the evaluation and the content, materials, practices, tasks or activities

being evaluated.

1. Questionnaire:

A questionnaire is a common method that involves having students

complete a survey in class. When designing a questionnaire, ensure that

there is an introduction which explains the purpose of the evaluation, that there are


clear instructions for completion and that the questions are unambiguous. The

questions posed can be open ended or closed, or a combination.

Open-Ended Questions:

These questions have the advantage of allowing students to identify what were

the most important elements of their experience. A disadvantage is that they may not

write much, or may write nothing at all.

Closed Questions:

These are statements that allow students to rate their agreement or

disagreement with a comment or statement by using a Likert scale:

Strongly agree / Agree / Neutral / Disagree / Strongly disagree

Students are usually willing to answer these questions, especially if the questionnaires

are anonymous. A disadvantage is that they do not give detailed responses or answer

"why" or "how" questions.

2. Checklist:

Checklist is another method that can provide basic data.

An example may be a list of provided unit outcomes (knowledge, skills,

attributes, abilities etc) and students circle or tick the ones that apply.

Alternatively, the evaluator could ask students to generate their own list of outcomes. For

example: group work provided me with......

Autonomy.

Opportunity to get to know my classmates.

Opportunity to work on real-life problems.

Students are usually willing to complete these lists, but again the disadvantage is that

they do not give detailed responses or answer "why" or "how" questions.


Evaluation Hand-Out:

Some academics design their own evaluation hand-outs that combine a

number of evaluation methods and are anonymous, quick and easy to complete. They

can take any form, use images, diagrams, comment boxes or questions and lists as

above.

Interview:

Interviews can be done individually or in small groups and provide the

opportunity for the evaluator to probe for a deeper analysis of the process and experience.

The disadvantage of this method is that it can be time-consuming for both

the evaluator and the students, and in a larger group some students may be more

vocal than others.

Focus Group:

A focus group uses a facilitative rather than a direct questioning approach and is a

useful way of having students discuss the process of group work. This method

allows students to work off and build upon each other's answers.

The disadvantage is that it is time-consuming for both the evaluator and the students,

and there is the added difficulty of arranging a time that will suit everyone.

Practicality of the Evaluation Process:

Before making a choice about an evaluation method, also consider the following

questions:

What resources are needed to undertake the evaluation?

What has to be done in order to undertake the evaluation (printing of forms,

preparation of online questionnaires, ordering questionnaires, arranging

interview rooms)?


What level of participation does the evaluator require from the students, tutors,

organizations or any other parties who were involved in the group work

activities?

Uses of Evaluation:

It is important to consider who will use the evaluation and how it will be

used. This is a key part of the planning process which relates to the purpose of the

evaluation.

It is also important to reflect upon and consider the methods that have been

used to gather information about the effectiveness of group work.

5.3 EVALUATING DEMONSTRATION:

1. The evaluation portion of the demonstration-performance method is where

students get an opportunity to prove that they can do the manoeuvre without

assistance.

2. For the simulated forced approach you should tell students that you will be

simulating an engine failure and that they are to carry out the entire procedure

including all checks and look-out.

3. While the student is performing this manoeuvre you must refrain from making

any comments. Offer no assistance whatsoever, not even grunts or head nods.

You must, however, observe the entire manoeuvre very carefully, so that you

can analyze any errors that the student may make and debrief accordingly.

NOTE: You would interrupt the student's performance, of course, if safety became a

factor.

4. Success or failure during the evaluation stage of the lesson will determine

whether you carry on with the next exercise or repeat the lesson.


Demonstration

1. The explanation and demonstration may be done at the same time, or the

demonstration given first followed by an explanation, or vice versa. The skill

you are required to teach might determine the best approach.

2. Consider the following: You are teaching a student how to do a forced

landing. Here are your options:

a. Demonstrate a forced approach and simultaneously give an explanation of

what you are doing and why you are doing it; or,

b. Complete the demonstration with no explanation and then give a detailed

explanation of what you have done; or

c. Give an explanation of what you intend to do and then do it.

You will find that different instructors will approach the teaching of this skill

differently. The following represents a suggested approach that appears to work best

for most instructors.

On the flight prior to the exercise on forced landings, give a perfect

demonstration of a forced landing. It may be better not to talk during this

demonstration, since you want it to be as perfect as possible to set the standard for the

future performance. There is another advantage of giving a perfect demonstration

prior to the forced landing exercise. Your students will be able to form a clearer

mental picture when studying the flight manual because they have seen the actual

manoeuvre.

a. The next step would be for you to give a full detailed explanation of a forced

landing. During this explanation you would use all the instructional techniques

described previously. You must give reasons for what is expected, draw

comparisons with things already known and give examples to clarify points.


This explanation should be given on the ground using visual aids to assist

student learning.

b. When in the air, give a demonstration, but also include important parts of the

explanation. Usually asking students questions about what you are doing or

should do, will give them an opportunity to prove they know the procedure,

although they have not yet flown it.

c. After completing the forced landing approach, while climbing for altitude,

clear up any misunderstandings the students may have and ask questions.

d. The demonstration and explanation portion of the demonstration-performance

method is now complete and you should proceed to the next part, which is the

student performance and instructor supervision.

Evaluation Matrix for the Demonstration

When assessing the demonstration of teaching skills, attention is given to the

applicant's use of didactic solutions. The following matrix transparently describes the

criteria used to evaluate the demonstration. The matrix is indicative instead of

normative, and is used for support when evaluating the demonstration of teaching

skills. In other words, not all of the aspects listed in the matrix need to be assessed

systematically. The evaluators use the criteria listed below to form an overall

appraisal of the demonstration's standard by assessing the quality of the components

that are of a good or better level. If the demonstration includes a preliminary

assignment, all the individual components are assessed in relation to it. If well

grounded, the demonstration may also be virtual, held in real time and interactive.


Components of the demonstration of teaching skills (levels: Passable, Satisfactory, Good, Very good)

Objectives

Aspect assessed: specification of the objectives.

At the lower levels, the applicant specifies the objectives; at the higher levels, the applicant specifies the objectives clearly, taking into account the context, content and target group.

Content

Aspects assessed: correspondence between the topic and content of the demonstration; academic nature of the content; consistency and clarity of presentation of the content; critical approach; many-sided argumentation; connection between theory and practice; aptness, diversity and topicality of the research data used; use of own research results; consideration given to the target group in the choice of content.

At the lower levels, the topic and content of the demonstration correspond to each other; the content is academic; the applicant presents the content clearly and consistently; where appropriate, the applicant uses his/her own research results during the demonstration; and the applicant takes the target group into consideration in the choice of content.

At the higher levels, in addition, the content is topical; the applicant examines the content critically, discusses the topics from many angles and explains the connection between theory and practice; and the research data discussed are relevant, many-sided and topical.

Methods

Aspects assessed: organization of teaching; motivation of the target group; suitable use of teaching methods; suitable use of teaching aids and materials.

At the lower levels, the teaching situation is organized appropriately.

At the higher levels, the teaching situation is organized appropriately, taking into consideration its objectives, contents, target group and context; the applicant inspires the target group to engage, stimulates the listeners' interest and motivates them to participate; and the applicant uses different teaching methods appropriately in terms of the situation and objectives.

Wrap-up

Aspects assessed: evaluation of the teaching situation in terms of the objectives set; consideration given to the target group in solutions related to evaluation.

At the lower levels, the applicant evaluates the teaching situation in terms of the objectives set.

At the higher levels, in addition, the solutions related to evaluation are relevant and take the target group into consideration.

Interaction skills

Aspects assessed: use of voice; clarity and intelligibility of speech; coherence of oral and written communication; quality of interaction; other matters improving communication.

At the lower levels, the applicant's delivery is clear and understandable, and oral and written communication is coherent.

At the higher levels, oral, written and visual communication is coherent, and the applicant interacts with the listeners in a natural and appropriate manner in the teaching situation.

Alignment of the preliminary assignment and the demonstration of teaching skills

At the lower levels, the preliminary assignment and the demonstration of teaching skills are well aligned.

At the higher levels, the preliminary assignment lays the foundation for and supports the demonstration, and the two form a consistent whole.


5.4 EVALUATION OF PHYSICAL MOVEMENTS AND MOTOR SKILLS:

Motor Skills

A motor skill is a function, which involves the precise movement of muscles

with the intent to perform a specific act.

Motor skills are skills that are associated with the activity of the body muscles

like the skills performed in sport. Fine motor skills are the type that is associated with

small movements of the wrists, hands, feet, fingers and toes.

Motor skills are the ability to make particular bodily movements to achieve

certain tasks. They are a way of controlling muscles to make fluid and accurate

movements. These skills must be learned, practiced and mastered, and over time can

be performed without thought, for example, walking or swimming. Children are

clumsy in comparison to adults, because they have yet to learn many motor skills that

allow them to effectively accomplish tasks.

Motor skills are also learned and refined in adulthood. If a woman takes up

belly dancing, her first movements will not closely resemble that of the teacher.

Over time, however, she will learn how to control her muscles to make the signature

movements that a belly dancer makes.

Genetic factors also affect the development of motor skills, for example, the

children of a professional dancer are far more likely to be good at dancing, with good

coordination and muscular control, than the children of a biochemist. Gross motor

skills are usually learned during childhood and require a large group of muscles to

perform actions, such as balancing or crawling. Fine motor skills involve smaller

groups of muscles and are used for fine tasks, such as threading a needle or playing a

computer game. These skills can be forgotten if disused over time.


Types of Motor Skills

There are two major categories of motor skills

1. Gross Motor Skills

2. Fine Motor Skills

Gross Motor Skills

Gross motor skills maneuver large muscle groups coordinating functions for

sitting, standing, walking, running, keeping balance and changing positions. These

skills are typically acquired during infancy and early childhood to control the large

muscles of the body. They include sitting, crawling and walking. According to Anna

Maria Wilms Floet, MD, on Medicine:

Throwing a ball, riding a bike, playing sports, lifting and sitting upright are brief

descriptions of large motor movements. Gross motor skills depend upon muscle tone,

the contraction of muscles and their strength for positioning movements.

Fine Motor Skills

Fine motor skills coordinate precise, small movements involving the hands,

wrists, feet, toes, lips and tongue. Features of fine motor control include handwriting,

drawing, grasping objects, cutting and controlling a computer mouse. Experts agree

that one of the most significant fine motor achievements is picking up a small object

with the index finger and thumb referred to as the pincher grip, which usually occurs

between 8 and 12 months of age.

Fundamental Motor Skills

Fundamental motor skills are common motor activities with specific

observable patterns. Most skills used in sports and movement activities are advanced

versions of fundamental motor skills. For example, throwing in softball and cricket,

the baseball pitch, javelin throw, tennis serve and netball shoulder pass are all

advanced forms of the overhand throw. The presence of all or part of the overhand


throw can be detected in the patterns used in these sport specific motor skills. Similar

relationships can be detected among other fundamental motor skills and specific sport

skills and movements.

Assessment of Motor Skills

A motor skills assessment is an evaluation of a patient to determine the extent

and nature of motor skill dysfunction. Care providers like physical therapists and

neurologists can perform the assessment, which may be ordered for a number of

reasons. It is not invasive, but does require the completion of a number of tasks. The

length of time required can vary, depending on the test or tests used. It may be

necessary to set aside a full day for testing.

One reason for a motor skills assessment may be to establish a child's baseline

level of motor competency. This can provide a reference point for the future. Physical

education teachers, for example, may perform brief assessments with new students to

determine which kinds of activities would be safe and appropriate for them.

Pediatricians also use such testing to assess their patients. If a child appears to have

developmental delays, this may result in a referral for a more extensive examination.

Different Ways to Assess Motor Skills

Motor skills can be evaluated in different ways. Some of them are as follows.

1. Test gross motor skills using range of motion. Assess gross motor skills by

asking the individual to perform a series of movements known as range of motion.

Evaluate range of motion by asking the individual to hold an arm out and move it in a

circular direction. The arm should be able to move in a complete circle when fully

extended. Then ask the individual to stand and place one leg out. Have the individual

move the leg up and down, back and forth and left to right. Note any difficulty in

movement, abnormalities or pain experienced by the individual.


2. Assess gross motor skills using games. Gross motor skills can be evaluated using

games and sports. Ask the individual to kick a ball to test gross motor skills of the leg.

Jumping rope is a great way to evaluate motor skills, because it uses both the arms

and legs working together to accomplish the task. Hopscotch, basketball and walking

on a balance beam are also good ways to evaluate gross motor skills. Look for the

fluidity of movement, problems with balance and hand-eye coordination.

3. Evaluate fine motor skills of arms and legs. Ask the individual to put a

clothespin on the edge of a box. Stringing beads on a shoelace is another way to

assess fine motor skills. Using a stapler and placing a paperclip on a sheet of paper are

also ways to assess fine motor skills. Place an item on the floor and ask the individual

to pick it up using his toes only. Watch the individual perform each task, looking at

how smooth the movements are and how easily the task is completed, and note any

difficulties.

4. Test fine motor skills using common household items. Give the individual a jar

and ask her to unscrew the lid and screw the lid back on. Ask the individual to place

items, such as coins or blocks, into containers such as a bowl, bucket or cup. Draw a

straight line on a piece of paper and have the individual use a pair of scissors to cut

the line on the paper. Using pencils or pens of different sizes, ask the person to pick

up and grasp each pencil/pen. Then ask the individual to trace items drawn on the

paper. Watch for the completion of each task, looking for any problems during each

movement.

5. Assess fine motor skills while getting dressed. Ask the individual to put on and

button up a shirt. Next, have the individual put on a pair of pants that have a snap

closure and a zipper. Give the individual a pair of shoes, which have shoelaces and

not Velcro closures, and ask him to tie the shoes. Watch the individual perform each

task, looking for difficulties, abnormal movements and the ability to perform each

task completely without help.


Some Motor Skills and Their Evaluation for Preschoolers

Dancing, either freestyle or through songs with movements, such as "I'm a Little

Teapot". Dance and movement classes, like pre-ballet, can be fun but aren't necessary

for motor-skills development.

Walking around the house, neighborhood, or park. For variety, add in marching,

jogging, skipping, hopping, or even musical instruments to form a parade. As they

walk, tell stories, count, or play games. Observe how the child walks on a piece of

string or tape, a low beam or plank at the playground, or a homemade balance beam.

Playing pretend: Kids boost motor skills when they use their bodies to become

waddling ducks, stiff-legged robots, galloping horses, soaring planes—whatever their

imagination comes up with!

Riding tricycles, scooters, and other ride-on toys; pulling or pushing wagons, large

trucks, doll strollers, or shopping carts.

Playing tag or other classic backyard games, such as Follow the Leader, Red

Light/Green Light, Tails, or Simon Says (avoid or modify games that force kids to sit

still or to be eliminated from play, such as Duck Duck Goose or musical chairs).

Throwing, catching, and rolling large, lightweight, soft balls. Swinging, sliding, and

climbing at a playground or indoor play space.

Ball Control Skills

The following ball skills are generic in that they are not specific to a particular sport,

and they are grouped by whether they require one or two balls. Skills are listed in

their approximate order of difficulty. Younger movers may use a plastic ball,

volleyball, or child's basketball, and older movers may use a youth-sized or adult-

sized basketball.


Assessing Motor Skills in Early Childhood - Using the PDMS (Peabody

Developmental Motor Scale)

Does your toddler have special needs? Early diagnosis of problems in

developmental motor skills is crucial for helping children with special needs. One of

the most popular assessment tools is the Peabody Developmental Motor Scale. Is it

reliable and sufficiently responsive?

After more than ten years of extensive research, a second edition known as the

PDMS-2 finally replaced the first edition of the Peabody Developmental Motor Scale.

The authors, M. Rhonda Folio and Rebecca R. Fewell, claim that the new and updated

version provides better and more in-depth assessment of the gross and fine motor

skills of preschool-age children. The PDMS-2, of course, is just one of the most

commonly-used assessments for measuring the motor skills of toddlers. However, for

children with special needs, the Peabody Development Motor Scale is one of the most

reliable testing instruments used by many professionals, such as therapists,

psychologists, and diagnosticians.

Purpose of the Test

The main purpose of the Peabody Developmental Motor Scale is to test the

motor skills of children. Gross motor skills involve using large muscles such as in

bending, balancing, crawling, walking, and jumping. Fine motor skills, on the other

hand, involve using smaller muscles, particularly the muscles in the hand. A child, at

a specific age, is expected to display proficiency at certain motor skills.

With the PDMS-2, most dysfunctions of motor skills will be identified. And

using the results of the PDMS-2, the special education teacher, parents, and other

professionals of the IEP team can develop a more responsive learning and remediation

program for the child with special needs. Would you want your child to take this

assessment test? The next part describes how the test will be administered.


Administration of the Test

This assessment test is composed of six sub-tests that include special

instructions on how each is administered to the preschool-age child. To keep the

results of the test reliable and precise, the actual instructions on how the test will be

carried out are only given to the test administrators and psychologists. This will

prevent the parents from "preparing" their child to pass the test. But the sub-tests are

given below:

Reflexes – A reflexive action is a quick and automatic reaction to a particular

environmental stimulus. This reaction is measured in this sub-test that is composed of

eight items. This sub-test, however, is administered only to children who are 11

months and younger because reflexes have been observed to be extensively integrated

within 12 months.

Stationary – This sub-test aims to measure the child's ability to maintain balance or

equilibrium. It involves mainly the ability of the preschool-age child to control his or

her body. It is composed of 30 items.

Locomotion – This sub-test evaluates the child's ability to move. The movement

involves crawling, walking, running, and other similar actions. The sub-test has 89

items.

Object Manipulation – In this sub-test, the object that is manipulated is the ball.

Since it is developmentally impossible for babies to even hold a ball, this sub-test is

administered only to children who are older than 11 months. This 24-item sub-test

involves activities such as throwing, catching, and kicking balls.

Grasping – This sub-test primarily measures the preschool-age child's ability to use

the muscles of the hand. Made of 26 items, the sub-test progressively determines their

ability to grasp objects and to control fingers.


Visual-Motor Integration – This sub-test evaluates the child's eye and hand co-

ordination. Aside from controlling muscles, the test determines the level of the child's

visual perception. Some examples of the activities of this 72-item sub-test include

building blocks and copying designs.
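The item counts and age restrictions described above can be summarised in a small data structure. The Python sketch below simply records what the preceding paragraphs state about the six PDMS-2 sub-tests; the short notes are paraphrases of the text and the structure itself is only an illustrative overview, not part of the instrument.

    # Summary of the PDMS-2 sub-tests as described above (number of items and a short note).
    PDMS2_SUBTESTS = {
        "Reflexes":                 {"items": 8,  "note": "children 11 months and younger"},
        "Stationary":               {"items": 30, "note": "balance and body control"},
        "Locomotion":               {"items": 89, "note": "crawling, walking, running"},
        "Object Manipulation":      {"items": 24, "note": "children older than 11 months"},
        "Grasping":                 {"items": 26, "note": "hand muscles, finger control"},
        "Visual-Motor Integration": {"items": 72, "note": "eye-hand coordination"},
    }

    total_items = sum(sub["items"] for sub in PDMS2_SUBTESTS.values())
    print(f"Sub-tests: {len(PDMS2_SUBTESTS)}, items in total: {total_items}")  # 6, 249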

5.5 EVALUATING ORAL PERFORMANCE:

Communication skills are taught in a wide range of general education courses

and students are in need of speaking and listening skills that will help them succeed in

future courses and in the workplace. Thus, the assessment of communication skills is

an important issue in general education .Oral assessment is often carried out to look

for students' ability to produce words and phrases by evaluating students' fulfillment

of a variety of tasks such as asking and answering questions about themselves, doing

role-plays, making up mini-dialogues, defining or talking about some pictures given

to them. The operations in an oral ability test are either informational skills or

interactional skills. The testing of speaking is widely regarded as the most challenging

of all language tests to prepare, administer and score.

Kinds of Oral Communication

Oral communication can also be delivered individually or as part of a team.

Therefore, knowing the kind of oral communication act that is expected is a necessary

step in being able to give useful feedback and ultimately an accurate evaluation.

Pronunciation

Pronunciation is a basic quality of language learning. Though most second

language learners will never have the pronunciation of a native speaker, poor

pronunciation can obscure communication and prevent a student from making his

meaning known. When evaluating the pronunciation of students, listen for clearly

articulated words, appropriate pronunciations of unusual spellings, and assimilation

and contractions in suitable places. Also listen for intonation. Are students using the


correct inflection for the types of sentences they are saying? Do they know that the

inflection of a question is different from that of a statement? Listen for these

pronunciation skills and determine into which level the student falls.

Vocabulary

Vocabulary comprehension and vocabulary production are always two

separate banks of words in the mind of a speaker, native as well as second language.

Teacher should encourage students to have a large production vocabulary and an even

larger recognition vocabulary. For this reason it is helpful to evaluate students on the

level of vocabulary they are able to produce. Are they using the specific vocabulary

taught to them in class? Are they using vocabulary appropriate to the contexts in

which they are speaking? Listen for the level of vocabulary students are able to

produce without prompting and then decide how well they are performing in this area.

Accuracy

Grammar has always been and forever will be an important issue in foreign

language study. Writing sentences correctly on a test, though, is not the same as

accurate spoken grammar. As students speak, listen for the grammatical structures and

tools teachers have taught them. Are they able to use multiple tenses? Do they have

agreement? Is word order correct in the sentence? All these and more are important

grammatical issues, and an effective speaker will successfully include them in his or

her language.

Communication

A student may struggle with grammar and pronunciation, but how creative is

she when communicating with the language she knows? Assessing communication in

the students means looking at their creative use of the language they do know to make

their points understood. A student with a low level of vocabulary and grammar may

have excellent communication skills if she is able to make others understand

her, whereas an advanced student who is tied to manufactured dialogues may not


be able to be expressive with language and would therefore have low communication

skills. Don't let a lack of language skill keep the students from expressing themselves.

The more creative they can be with language and the more unique ways they can

express themselves, the better their overall communication skills will be.

Interaction

Ask the students questions. Observe how they speak to one another. Are they

able to understand and answer questions? Can they answer when the teacher asks them

questions? Do they give appropriate responses in a conversation? All these are

elements of interaction and are necessary for clear and effective communication in

English. A student with effective interaction skills will be able to answer questions

and follow along with a conversation happening around him. Great oratory skills will

not get anyone very far if he or she cannot listen to other people and respond

appropriately. Encourage your students to listen as they speak and have appropriate

responses to others in the conversation.

Fluency

Fluency may be the easiest quality to judge in your students' speaking. How

comfortable are they when they speak? How easily do the words come out? Are there

great pauses and gaps in the student's speaking? If there are then your student is

struggling with fluency. Fluency does not improve at the same rate as other language

skills. You can have excellent grammar and still fail to be fluent. You want your

students to be at ease when they speak to you or other English speakers. Fluency is a

judgment of this ease of communication and is an important criterion when evaluating

speaking.

Suggestions for Improvement

Offer suggestions (rather than criticisms) for improved delivery style. Many

students are aware of their difficulties in delivering oral communication and want


feedback and support, and they do want suggestions. Not so useful: "Don't wave your

hands when you talk." Better: "Let's figure out what you're going to do with your

hands so that you don't distract the audience from what you are saying. What feels

more natural to you?"

Present oral communication skills as a set of professional skills that all

professionals learn and practice steadily throughout their lives.


UNIT-6:

PORTFOLIOS

6.1 PURPOSE OF PORTFOLIOS:

Literal Definition:

A) a large, flat, thin case for carrying loose papers or drawings or maps; usually

leather

B) a set of pieces of creative work collected to be shown to potential customers or

employers; "the artist had put together a portfolio of his work"; "every actor

has a portfolio of photographs"

C) A collection of various company shares, fixed interest securities or money-

market instruments.

Terminological Definition:

A portfolio is a purposeful collection of student work that exhibits the

student's efforts, progress, and achievements in one or more areas of the curriculum.

The collection must include the following:

Student participation in selecting contents.

Criteria for selection.

Criteria for judging merits.

Evidence of a student's self-reflection.

It should represent a collection of students' best work or best efforts, student-

selected samples of work experiences related to outcomes being assessed, and

documents recording growth and development toward mastering identified outcomes.

Purpose of Portfolios:

In this new era of performance assessment related to the monitoring of

students' mastery of a core curriculum, portfolios can enhance the assessment process

by revealing a range of skills and understandings on students' parts; support


instructional goals, reflect change and growth over a period of time; encourage

student, teacher, and parent reflection; and provide for continuity in education from

one year to the next. Instructors can use them for a variety of specific purposes,

including:

Encouraging self-directed learning.

Enlarging the view of what is learned.

Fostering learning about learning.

To promote student control of learning

To track student progress

To demonstrate individual growth

To respond to individual needs

To evaluate and report on student progress

To facilitate student-led conferences

To show process and product

To show final products

To show student achievement with respect to specific curricular goals

To document achievement for alternative credit

To accumulate "best work" for admission to other educational institutions or

program

Demonstrating progress toward identified outcomes.

Creating an intersection for instruction and assessment.

Providing a way for students to value themselves as learners.

Offering opportunities for peer-supported growth.

Benefits of Portfolio:

One of the most important benefits of using portfolios is the enhancement of

critical thinking skills, which result from the need for students to:

Develop evaluation criteria


Students are pleased to observe their personal growth,

They have better attitudes toward their work, and

They are more likely to think of themselves as writers.

Factors that go into the development of a student portfolio assessment:

1. First, you must decide the purpose of your portfolio. For example, the portfolios

might be used to show student growth, to identify weak spots in student work,

and/or to evaluate your own teaching methods.

2. After deciding the purpose of the portfolio, you will need to determine how you

are going to grade it. In other words, what would a student need in their

portfolio for it to be considered a success and for them to earn a passing grade?

3. The answer to the previous two questions helps form the answer to the third:

What should be included in the portfolio? Are you going to have students put

in all their work or only certain assignments? Who gets to choose?

How to Build a Student Portfolio

The following suggestions will help you effectively design a student portfolio.

1. Set a Purpose for the Portfolio. First, we need to decide what the purpose

of the portfolio is. Is it going to be used to show student growth or identify

specific skills? Are we looking for a concrete way to quickly show parents

student achievement, or are we looking for a way to evaluate our own

teaching methods? Once we have figured out the goal of the portfolio, then

we think about how to use it.

2. Decide How You Will Grade It. Next, we will need to establish how

we are going to grade the portfolio. There are several ways to grade

students' work: we can use a rubric, a letter grade, or, most efficiently,

a rating scale. Is the work completed correctly and

completely? Can we comprehend it? We can use a grading scale of 4-1: 4 =

Meets All Expectations, 3 = Meets Most Expectations, 2 = Meets Some

Expectations, 1 = Meets No Expectations. Determine what skills you will be

evaluating, then use the rating scale to establish a grade (a small scoring

sketch is given after this list).

3. What Will Be Included in It. How will we determine what will go into the

portfolio? Assessment portfolios usually include specific pieces that students

are required to know. For example, work that correlates with the Common

Core Learning Standards. Working portfolios include whatever the student is

currently working on, and display portfolios showcase only the best work

students produce. Keep in mind that we can create a portfolio for one unit and

not the next. We get to choose what is included and how it is included. If you

want to use it as a long-term project and include various pieces throughout the

year, we can. But we can also use it for short-term projects as well.

4. How Much Will You Involve the Students. How much we involve the

students in the portfolio depends upon the students' age. It is important that all

students understand the purpose of the portfolio and what is expected

of them. Older students should be given a checklist of what is expected, and

how it will be graded. Younger students may not understand the grading

scale, so we can give them the option of what will be included in their

portfolio. Ask them questions such as: why did you choose this particular

piece and does it represent your best work? Involving students in the portfolio

process will encourage them to reflect on their work.

5. Will You Use a Digital Portfolio. With the fast-paced world of technology,

paper portfolios may become a thing of the past. Electronic portfolios (e-

portfolios/digital portfolios) are great because they are easily accessible, easy

to transport and easy to use. Today’s students are tuned into the latest must-

have technology, and electronic portfolios are part of that. With students using

an abundance of multimedia outlets, digital portfolios seem like a great fit.


The uses of these portfolios are the same: students still reflect upon their work,

but in a digital way.

The key to designing a student portfolio is to take the time to think about what

kind it will be, and how we will manage it. Once we do that and follow the

steps above, we will find it will be a success.
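A minimal sketch of how the 4-1 rating scale mentioned in step 2 could be applied is given below, in Python. The skill names and ratings are hypothetical, and the average is simply mapped back to the nearest descriptor; an actual grading policy would be set by the teacher.

    # Illustrative only: apply the 4-1 scale from step 2 to a few hypothetical portfolio skills.
    DESCRIPTORS = {4: "Meets all expectations", 3: "Meets most expectations",
                   2: "Meets some expectations", 1: "Meets no expectations"}

    ratings = {"Completeness": 4, "Correctness": 3, "Organization": 3, "Reflection": 2}

    average = sum(ratings.values()) / len(ratings)           # 3.0
    overall = DESCRIPTORS[round(average)]
    print(f"Average rating: {average:.1f} -> {overall}")     # 3.0 -> Meets most expectations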

Types of Portfolios

1) Best Work Portfolio

This type of portfolio highlights and shows evidence of the best work of

learners. Frequently, this type of portfolio is called a display or showcase portfolio.

For students, best work is often associated with pride and a sense of accomplishment

and can result in a desire to share their work with others. Best work can include both

product and process. It is often correlated with the amount of effort that learners

have invested in their work. A major advantage of this type of portfolio is that

learners can select items that reflect their highest level of learning and can explain why

these items represent their best effort and achievement. Best work portfolios are used

for the following purposes:

Student Achievement. Students may select a given number of entries (e.g., 10) that

reflect their best effort or achievement (or both) in a course of study. The portfolio

can be presented in a student-led parent conference or at a community open house. As

students publicly share their excellent work, work they have chosen and reflected

upon, the experience may enhance their self-esteem.

Post-Secondary Admissions. The preparation of a post-secondary portfolio targets

work samples from high school that can be submitted for consideration in the process

of admission to college or university. This portfolio should show evidence of a range

of knowledge, skills, and attitudes, and may highlight particular qualities relevant to

specific programs. Many colleges and universities are adding portfolios to the initial


admissions process while others are using them to determine particular placements

once students are admitted.

Employability. The audience for this portfolio is an employer. This collection of

work needs to be focused on specific knowledge, skills, and attitudes necessary for a

particular job or career. The school-to-work movements in North America are

influencing an increase in the use of employability portfolios. The Conference Board

of Canada (1992), for example, outlines the academic, personal management, and

teamwork skills that are the foundation of a high-quality Canadian workforce. An

employability portfolio is an excellent vehicle for showcasing these skills.

2) Growth Portfolio

A growth portfolio demonstrates an individual's development and growth over

time. Development can be focused on academic or thinking skills, content knowledge,

self-knowledge, or any area that is important in your setting. A focus on growth

connects directly to identified educational goals and purposes. When growth is

emphasized, a portfolio will contain evidence of struggle, failure, success, and

change. The growth will likely be an uneven journey of highs and lows, peaks and

valleys, rather than a smooth continuum. What is significant is that learners recognize

growth whenever it occurs and can discern the reasons behind that growth. The goal

of a growth portfolio is for learners to see their own changes over time and, in turn,

share their journey with others.

A growth portfolio can be culled to extract a best work sample. It also helps

learners see how achievement is often a result of their capacity to self-evaluate, set

goals, and work over time. Growth portfolios can be used for the following purposes:

Knowledge. This portfolio shows students' growth in knowledge in a particular

content area or across several content areas over time. This kind of portfolio can


contain samples of both satisfactory and unsatisfactory work, along with reflections to

guide further learning.

Skills and Attitudes. This portfolio shows students' growth in skills and attitudes in

areas such as academic disciplines, social skills, thinking skills, and work habits. In

this type of portfolio, challenges, difficult experiences, and other growth events can be

included to demonstrate students' developing skills. In a thinking skills portfolio, for

example, students might include evidence showing growth in their ability to recall,

comprehend, apply, analyze, synthesize, and evaluate information.

Teamwork. This portfolio demonstrates growth in social skills in a variety of

cooperative experiences. Peer responses and evaluations are vital elements in this

portfolio model, along with self-evaluations. Evidence of changing attitudes resulting

from team experiences can also be included, especially as expressed in self-reflections

and peer evaluations.

Career. This portfolio helps students identify personal strengths related to potential

career choices. The collection can be developed over several years, perhaps beginning

in middle school and continuing throughout high school. The process of selecting

pieces over time empowers young people to make appropriate educational choices

leading toward meaningful careers. Career portfolios may include items from outside the

school setting that substantiate students' choices and create a holistic view of the

students as learners and people. This type of portfolio may be modified for

employment purposes.

3) Showcase Portfolios

Showcase portfolios highlight the best products over a particular time period

or course. For example, a showcase portfolio in a composition class may include the

best examples of different writing genres, such as an essay, a poem, a short story, a

biographical piece, or a literary analysis. In a business class, the showcase portfolio


may include a resume, sample business letters, a marketing project, and a

collaborative assignment that demonstrates the individual's ability to work in a team.

Students are often allowed to choose what they believe to be their best work,

highlighting their achievements and skills. Showcase reflections typically focus on

the strengths of selected pieces and discuss how each met or exceeded required

standards.

4) Process Portfolios

Process portfolios, by contrast, concentrate more on the journey of learning

rather than the final destination or end products of the learning process. In the

composition class, for example, different stages of the process—an outline, first draft,

peer and teacher responses, early revisions, and a final edited draft—may be required.

A process reflection may discuss why a particular strategy was used, what was useful

or ineffective for the individual in the writing process, and how the student went

about making progress in the face of difficulty in meeting requirements. A process

reflection typically focuses on many aspects of the learning process, including the

following: what approaches work best, which are ineffective, information about

oneself as a learner, and strategies or approaches to remember in future assignments.

5) Evaluation Portfolios.

Evaluation portfolios may vary substantially in their content. Their basic

purpose, however, remains to exhibit a series of evaluations over a course and the

learning or accomplishments of the student in regard to previously determined criteria

or goals. Essentially, this type of portfolio documents tests, observations, records, or

other assessment artifacts required for successful completion of the course. A math

evaluation portfolio may include tests, quizzes, and written explanations of how one

went about solving a problem or determining which formula to use, whereas a science

evaluation portfolio might also include laboratory experiments, science project

outcomes with photos or other artifacts, and research reports, as well as tests and


quizzes. Unlike the showcase portfolio, evaluation portfolios do not simply include

the best work, but rather a selection of predetermined evaluations that may also

demonstrate students' difficulties and unsuccessful struggles as well as their better

work. Students who reflect on why some work was successful and other work was

less so continue their learning as they develop their metacognitive skills.

6) Online or e-portfolios

Online or e-portfolios may be one of the above portfolio types or a

combination of different types, a general requirement being that all information and

artifacts are somehow accessible online. A number of colleges require students to

maintain a virtual portfolio that may include digital, video, or Web-based products.

The portfolio assessment process may be linked to a specific course or an entire

program. As with all portfolios, students are able to visually track and show their

accomplishments to a wide audience.

Conclusion: The portfolio process will continue to be refined and efforts made to

improve students' perceptions of the process, as it is intended to develop the self-

assessment skills they will need to improve their knowledge and professional skills

throughout their education careers.

6.3 GUIDELINES AND STUDENTS' ROLE IN SELECTION OF PORTFOLIO ENTRIES AND SELF-EVALUATION:

Portfolio:

An organized presentation of an individual's education, work samples, and

skills.

Terminologically a portfolio is a purposeful collection of student work that

exhibits the student’s efforts, progress, and achievements in one or more areas of the

curriculum.


Guidelines:

Identify purpose

Select objectives.

Think about the kinds of entries that will best match instructional outcomes.

Decide who selects the entries.

Decide how much to include, how to organize the portfolio, where to keep it

and when to access it.

Set the criteria for judging the work (rating scales, rubrics, checklists) and

make sure students understand the criteria.

Review the student’s progress.

Hold portfolio conferences with students to discuss their progress.

These guidelines are discussed below in detail.

Identify Purpose:

Without purpose, a portfolio is only a collection of student work samples.

Different purposes result in different portfolios. For example, if the student is to be

evaluated on the basis of the work in the portfolio for admission to college, then his

final version of his best work would probably be included in the portfolio.

Select Objectives:

The objectives to be met by students should be clearly stated. A list of

communicative functions can be included for students to check when they feel

comfortable with them, and stapled to the inside cover. Students would list the title or

the number of the samples which address each function.

Portfolios also can be organized according to the selected objectives, addressing

one skill such as writing. The selected objectives will be directly related to the stated


purpose for the portfolio. At any rate, teachers must ensure that classroom instruction

supports the identified goals.

Decide how much to include & how to Organize:

Teachers may want to spend some time going over the purpose of the portfolio

at regular intervals with students to ensure that the selected pieces do address the

purpose and the objectives. At regular times, ask students to go through their entries,

to choose what should remain in the portfolio, and what could be replaced by another

work which might be more illustrative of the objectives. Other material no longer

current and/or not useful to document student progress toward attainment of the

objectives should be discarded.

What is the student’s role?

The student’s role of participation in the portfolio will be largely responsible

for the success of the portfolio. For this reason, students must be actively involved in

the choices of entries and in the rationale for selecting those entries.

i. Selecting:

The student's first role is in selecting some of the items to be part of the

portfolio. Some teachers give students a checklist for making choices. Others leave

students almost complete freedom in selecting their entries. At any rate, students should include

their best and favorite pieces of work along with those showing growth and process.

ii. Reflecting and self-assessing:

An essential component of self-assessment involves the student in reflecting

about their own work. At the beginning, students might not know what to say, so

teachers will need to model the kinds of reflection expected from students.

Set the Criteria for Judging the Work:

There are two kinds of criteria needed at this point.

Criteria for individual entries (refers to the section on rubrics for details).


Criteria for the portfolio as a whole.

Assessing the individual entries in a portfolio is different from assessing the

portfolio as a whole. If the purpose of the portfolio is to show student progress, then it

is highly probable that some of the beginning entries may not reflect high quality;

however, over several months, the student may have demonstrated growth toward the

stated objectives.

Criteria can be established by teachers alone and/or by teachers and students

together. At any rate, criteria for evaluating the portfolios must be announced ahead

of time.

Possibilities of criteria include teacher evaluation and/or observation, student

self-evaluation, peer assessment, and a combination of several teachers' comments.

Following is a list of suggested criteria for a portfolio as a whole.

Variety: Selected pieces display the range of tasks students can accomplish and skills

they have learned.

Growth: Student work represents the student’s growth in content knowledge and

language proficiency.

Completeness: Students organized the contents systematically.

Organization: Students organized the contents systematically.

Fluency: Selected pieces are meaningful to the students and communicate

information to the teacher.

Accuracy: Student work demonstrates skills in the mechanics of the language.

Goal Oriented: The contents reflect progress and accomplishment of curricular

objectives.


Following Directions: Students followed the teacher’s directions for pieces of the

portfolio.

Neatness: Student work is neatly written, typed or illustrated.

Justification or Significance: Students include reasonable justifications for the work

selected or explain why selected items are significant.

Reference

Katozai, Murad Ali. Measurement & Evaluation. Peshawar: University Publisher, 2013.

6.4 USING PORTFOLIOS IN INSTRUCTION AND COMMUNICATION:

Portfolio:

Literally the word “Portfolio” is used in the following meanings:

1. A portable, large, thin and flat briefcase, especially of leather, used for

carrying papers, pictures, drawings or maps.

2. A list of the financial assets held by an individual or a bank or other financial

institution.

3. The role of the head of a government department e.g. “He holds the portfolio

for foreign affairs”.

4. An organized presentation of an individual’s education, work samples and

skills.

Using portfolios of student work in instruction and communication:

The term portfolio has become a popular buzz word.

Unfortunately, it is not always clear exactly what is meant or implied by the

term, especially when used in the context of portfolio assessment. This training

module is intended to clarify the notion of portfolio assessment and help users design

such assessments in a thoughtful manner. We begin with a discussion of the rationale


for assessment alternatives and then discuss portfolio definitions, characteristics and

design considerations.

Educators and critics are currently reciting a litany of problems concerning the

use of multiple-choice and other structured format tests for assessing many important

student outcomes. This has been accompanied by an explosion of activity searching

for assessment alternatives. In general, the hope is that such alternatives will:

1. Capture a richer array of what students know and can do than is possible with

multiple-choice tests. Current goals for students go beyond knowledge of facts

and include such things as problem solving, critical thinking, lifelong learning

of new information and thinking independently. Goals also include

dispositions such as persistence, flexibility, motivation and self-confidence.

2. Portray the process by which students produce work. It is important for

example that students utilize efficient strategies for solving problems as well

as getting the right answer. It is also important for students to be able to do

such things as monitoring their own learning so that they know what to do when they

perceive they are not understanding.

3. Make our assessment align with what we consider important outcomes for

students in order to communicate the right message to students and others about

what we value. For example, if we emphasize higher order thinking in

instruction but only test knowledge because testing thinking is difficult,

students figure out pretty fast what is really valued.

4. Have realistic contexts for the production of work, so that we can examine

what students know and can do in real-life situations.

5. Provide continuous and ongoing information on how students are doing in

order to chronicle development, give effective feedback to students and

encourage students to observe their own growth.


6. Integrate assessment with instruction in a way consistent with both current

theories of instruction and goals for students. Specifically we want to

encourage active student engagement in learning, and student responsibility

for the control of learning. We also want to develop assessment techniques

that in their use, improve achievement and not just monitor it.

7. Using portfolios of student work for assessment, already an instructional tool

in many places, is seen as one potential way to accomplish these things. But

using portfolios will only have these desired effects if we plan them carefully.

Important Points in Portfolio Developing Process:

Some important points in portfolio development process are as follows:

1. Teachers, students, parents and school administrators should be consulted

in deciding which items will be placed in the portfolio.

2. A shared, clear purpose for using portfolios should be created.

3. It should reflect the actual day-to-day learning activities of students.

4. It should be on-going so that it shows students' efforts, progress and

achievements over a period of time.

5. Items in the portfolio should be collected in a systematic, purposeful and

meaningful way.

6. It should give opportunities for students in selecting pieces they consider most

representative of themselves as learners to be placed into their portfolios, and to

establish criteria for their selections.

7. It should be viewed as a part of the learning process rather than merely as a

record-keeping tool, as a way to enhance students' learning.

8. Students can access their portfolios.


9. Share the criteria that will be used to assess the work in the portfolio as well as

the ways in which the results are to be used. Teachers should give feedback to students

and parents about the use of the portfolio.

In conclusion, some necessary steps in the portfolio-making process are: the assessment of the work should be clearly explained, the process should extend over a certain time period, the portfolio should encourage students to learn, and items in the portfolio should be multi-dimensional and should address different learning areas. Besides, it is vitally important that the work in a portfolio be designed to present students' performance and development over any time period in detail.

Reference

Katozai, Murad Ali. Measurement & Evaluation. Peshawar: University Publisher, 2013.

6.5 POTENTIAL STRENGTH AND WEAKNESSES OF PORTFOLIOS:

Potential Strength of Portfolios

(Or Advantages of Portfolios as Method of Assessment)

A portfolio can present a wide perspective of the learning process for students and enables continuous feedback for them. Besides this, it enables students to self-assess their studies and learning and to review their progress. Since it provides visual and dynamic proof of students' interests, skills, strengths, successes and development over a certain time period, the portfolio, as a systematic collection of the student's work, helps in assessing students as a whole. Portfolios are strong devices that help students gain important abilities such as self-assessment, critical thinking and monitoring one's own learning. Furthermore, portfolios allow pre-service teachers to assess their own learning and growth, help them become self-directed and reflective practitioners, and contribute to their individual and professional development. Mullin (1998) stresses that the portfolio gives teachers a new perspective in education. For instance, the portfolio can answer these questions: What kinds of trouble do students have? Which activities are more effective or ineffective? What subjects are understood and not understood? How efficient is the teaching process? Some advantages or strengths of portfolios are given below:

1. Portfolio provides multiple ways of assessing students' learning over time

2. It provides for a more realistic evaluation of academic content than

pencil-and-paper tests.

3. It allows students, parents, teachers and staff to evaluate the student's strengths

and weaknesses.

4. It provides multiple opportunities for observation and assessment

5. It provides an opportunity for students to demonstrate their strengths as well

as their weaknesses.

6. It encourages students to develop some abilities needed to become

independent, self-directed learners

7. It also helps parents see themselves as partners in the learning process.

8. It allows students to express themselves in a comfortable way and to assess

their own learning and growth as learners.

9. It encourages students to think of creative ways to share what they are learning

10. It increases support to students from their parents and enhances

communication among teachers, students and parents.

11. It encourages teachers to change their instructional practice and it is a powerful

way to link curriculum and instruction with assessment

12. It assesses and promotes critical thinking.


13. It encourages students to become accountable and responsible for their own

learning (i.e., self-directed, active, peer-supported, adult learning).

14. It can be the focus of initiating a discussion between student and tutor.

15. It facilitates reflection and self-assessment.

16. It can accommodate diverse learning styles, though they are not suitable for all

learning styles.

17. Portfolios can monitor and assess students' progress over time.

18. Portfolios can assess performance, with practical application of theory, in real-

time naturalistic settings (i.e., authentic assessment).

19. Portfolios use multiple methods of assessment.

20. Portfolios take into account the judgment of multiple assessors.

21. Portfolios have high face validity, content validity, and construct validity.

22. Portfolios integrate learning and assessment.

23. Portfolios promote creativity and problem solving.

24. Portfolios promote learning about learning (i.e., metacognition).

25. Portfolios can be standardized and used in summative assessment.

26. Portfolios combine subjective and objective, as well as qualitative and

quantitative, assessment procedures.

27. Portfolios can be used to assess attitudes and professional and personal

development.

28. Portfolios enable identification of the unsatisfactory or struggling performer.

29. Portfolios offer teachers vital information for diagnosing students' strengths


and weaknesses to help them improve their performance (i.e., formative

assessment).

30. Portfolios reflect students' progression toward learning outcomes (i.e., student

profiling).

31. Portfolios allow the evaluators to see the student, group, or community as an

individual, each unique with its own characteristics, needs, and strengths.

32. Portfolios serve as a cross-section lens, providing a basis for future analysis

and planning. By viewing the total pattern of the community or of individual

participants, one can identify areas of strengths and weaknesses, and barriers

to success.

33. Portfolios serve as a concrete vehicle for communication, providing ongoing

communication or exchanges of information among those involved.

34. Portfolios promote a shift in ownership; communities and participants can take

an active role in examining where they have been and where they want to go.

35. Portfolio assessment offers the possibility of addressing shortcomings of

traditional assessment. It offers the possibility of assessing the more complex

and important aspects of an area or topic.

36. Portfolios cover a broad scope of knowledge and information, from many

different people who know the program or person in different contexts (e.g.,

participants, parents, teachers or staff, peers, or community leaders).

Potential Weaknesses of Portfolios

(Or Disadvantages of Portfolios as Method of Assessment)

1. When portfolios are used for summative assessment, students may be reluctant

to reveal weaknesses.

2. Portfolios are personal documents, and ethical issues of privacy and


confidentiality may arise when they are used for assessment.

3. Difficulties may arise in verifying whether the material submitted is the

candidate's own work.

4. Portfolios take a long time to complete and assess.

5. The portfolio process involves a large amount of paperwork.

6. Portfolio assessment may produce unacceptably low inter-rater reliability,

especially if the assessment rubrics are not properly prepared or are used by

untrained assessors.

7. May be seen as less reliable or fair than more quantitative evaluations such as

test scores.

8. Can be very time consuming for teachers or program staff to organize and

evaluate the contents, especially if portfolios have to be done in addition to

traditional testing and grading.

9. Having to develop your own individualized criteria can be difficult or

unfamiliar at first.

10. If goals and criteria are not clear, the portfolio can be just a miscellaneous

collection of artifacts that don't show patterns of growth or achievement.

11. Like any other form of qualitative data, data from portfolio assessments can be

difficult to analyze or aggregate to show change.

Portfolio Assessment is Most useful for:

1. Evaluating programs that have flexible or individualized goals or outcomes.

For example, within a program with the general purpose of enhancing

children's social skills, some individual children may need to become less

aggressive while other shy children may need to become more assertive.


2. Each child's portfolio assessment would be geared to his or her individual

needs and goals.

3. Allowing individuals and programs in the community (those being evaluated)

to be involved in their own change and decisions to change.

4. Providing information that gives meaningful insight into behaviour and related

change. Because portfolio assessment emphasizes the process of change or

growth, at multiple points in time, it may be easier to see patterns.

5. Providing a tool that can ensure communication and accountability to a range

of audiences. Participants, their families, funders, and members of the

community at large who may not have much sophistication in interpreting

statistical data can often appreciate more visual or experiential "evidence" of

success.

6. Allowing for the possibility of assessing some of the more complex and

important aspects of many constructs (rather than just the ones that are easiest

to measure).

Portfolio Assessment is not as useful for:

1. Evaluating programs that have very concrete, uniform goals or purposes. For

example, it would be unnecessary to compile a portfolio of individualized

“evidence” in a program whose sole purpose is full immunization of all

children in a community by the age of five years. The required immunizations

are the same, and the evidence is generally clear and straightforward.

2. Allowing you to rank participants or programs in a quantitative or

standardized way (although evaluators or program staff may be able to make

subjective judgments of relative merit).


3. Comparing participants or programs to standardized norms. While portfolios

can (and often do) include some standardized test scores along with other

kinds of “evidence”, this is not the main purpose of the portfolio.


6.6 EVALUATION OF PORTFOLIO:

According to Paulson, Paulson and Meyer, (1991, p. 63): “Portfolios offer a

way of assessing student learning that is different than traditional methods. Portfolio

assessment provides the teacher and students an opportunity to observe students in a

broader context: taking risks, developing creative solutions, and learning to make

judgments about their own performances”.

In order for thoughtful evaluation to take place, teachers must have multiple

scoring strategies to evaluate students' progress. Criteria for a finished portfolio might

include several of the following:

Thoughtfulness (including evidence of students' monitoring of their own

comprehension, metacognitive reflection, and productive habits of mind).


Growth and development in relationship to key curriculum expectancies and

indicators.

Understanding and application of key processes.

Completeness, correctness, and appropriateness of products and processes

presented in the portfolio.

Diversity of entries (e.g., use of multiple formats to demonstrate achievement

of designated performance standards).

It is especially important for teachers and students to work together to

prioritize those criteria that will be used as a basis for assessing and evaluating student

progress, both formatively (i.e., throughout an instructional time period) and

summatively (i.e., as part of a culminating project, activity, or related assessment to

determine the extent to which identified curricular expectancies, indicators, and

standards have been achieved).

As the school year progresses, students and teacher can work together to

identify especially significant or important artifacts and processes to be captured in

the portfolio. Additionally, they can work collaboratively to determine grades or

scores to be assigned. Rubrics, rules, and scoring keys can be designed for a variety of

portfolio components. In addition, letter grades might also be assigned, where

appropriate. Finally, some form of oral discussion or investigation should be included

as part of the summative evaluation process. This component should involve the

student, teacher, and if possible, a panel of reviewers in a thoughtful exploration of

the portfolio components, students' decision-making and evaluation processes related

to artifact selection, and other relevant issues.


UNIT-7:

BASIC CONCEPTS OF INFERENTIAL STATISTICS

7.1 CONCEPT & PURPOSE OF INFERENTIAL STATISTICS:

Introduction:

The role and importance of statistics in education cannot be denied. In

education we come across measurement, evaluation and research. Similarly, we

have to make educational policies and budgets. In all these fields we need to make

proper measurement and present the data quantitatively. Thus without statistics we

cannot make proper measurement. As quoted in different statistics books "Planning is

the order of the day, and planning without statistics is inconceivable". Good statistics

and sound statistical analysis assist in providing the basis for the design of educational

policies, monitor policy implications and evaluate policy impact. To generate reliable

and relevant information the data should be collected using appropriate statistical

methods. The materials one uses for data collection should be well designed. The data

analysis should also be done using appropriate statistical method. All these show that

statistics plays a vital role in education management and educational planning.

Concept of Inferential Statistics

Definition:

The branch of statistics concerned with using sample data to make an

inference about a larger group of data is called inferential statistics.

Example:

For instance, a college teacher decides to use the average grade achieved by one statistics class to estimate the average grade of all the sections of the same statistics course. This is a problem of estimation, which falls under inferential statistics.


In educational research, it is never possible to sample the entire population

that we want to draw a conclusion about. For example, we might want to determine

how well a new way of teaching mathematics can affect mathematical achievement

for all children in Primary 1. However, it would be impossible to test all children in

Primary 1 because of time, resources, and other logistical factors. Instead, we choose

a sample of the population to conduct a study. Then we want to make conclusions - or

inferences, about the entire population based on the results of the study from the

sample.

Quantitative research in education and social science aims to test theories

about the nature of the world in general (or some part of it) based on samples of

"subjects" taken from the world (or some part of it). When we perform research on the

effect of TV violence on children's aggression, our intention is to create theories that

apply to all children who watch TV, or perhaps to all children in cultures similar to

our own who watch TV. We of course cannot study all children, but we can perform

research on samples of children that, hopefully, will generalize back to the

populations from which the samples were taken. Recall that external validity is the

ability of a sample to generalize to the population.

Purpose of Inferential Statistics

The main purpose of inferential statistics is to determine whether the findings

from the sample can generalize to the entire population. There will always be

differences between groups in a research study. Inferential statistics can determine

whether the difference between the two groups in the sample is large enough to be

able to say that the findings are significant. If the findings are indeed significant, then

the conclusions can be applied - generalized - to the entire population. On the other

hand, if the difference between the groups is very small, then the findings are not

significant and therefore were simply the result of chance.


To illustrate this practically, imagine an entire room full of socks. You want to

determine whether there are more white socks than green socks in the room.

However, there are too many socks in the room to count them all, so you want to take

a sample of socks. Based on this sample of socks, you will draw a conclusion about

whether there are more white socks than green socks. After you collect your sample,

then you will need to calculate inferential statistics to determine whether the

colours chosen in your sample likely reflect the colours of socks in the entire room or

if your results were due to chance.

What factors will determine whether the colours in the sample of socks

adequately represent the colours of the entire room? Sample size. If you only pick

two socks, they would probably not represent the entire room. The larger the sample

is, the more representative the sample will be of the entire room and the more likely

the inferential statistics will find a significant result. This is why when conducting

experiments, the larger the sample is, the better: with large samples, the results will

more likely reflect the entire population.
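The point about sample size can be made concrete with a short simulation. The following Python sketch is only an illustration (the room contents and the sample sizes are invented, not taken from the text): it draws sock samples of different sizes from a room that is 60% white and shows that larger samples estimate the true proportion more reliably.

import random

random.seed(1)
# A "room" of 1,000 socks: 60% white, 40% green (invented numbers).
room = ["white"] * 600 + ["green"] * 400

for n in (2, 30, 300):
    sample = random.sample(room, n)          # draw n socks at random
    share_white = sample.count("white") / n
    print(f"sample of {n:>3} socks: proportion white = {share_white:.2f}")

# Larger samples tend to land closer to the true proportion (0.60),
# so inferential statistics are more likely to detect the real surplus of white socks.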

Inferential statistics is the mathematics and logic of how this generalization

from sample to population can be made. The fundamental question is: can we infer

the population's characteristics from the sample's characteristics? Descriptive statistics

remains local to the sample, describing its central tendency and variability, while

inferential statistics focuses on making statements about the population.

Unlike descriptive statistics, inferential statistics provide ways of testing the

reliability of the findings of a study and "inferring" characteristics from a small group

of participants or people (your sample) onto much larger groups of people (the

population). Descriptive statistics just describe the data, but inferential let you say

what the data mean.


7.2 SAMPLING ERROR:

In statistics, sampling error is incurred when the statistical characteristics of a

population are estimated from a subset, or sample, of that population. Since the

sample does not include all members of the population, statistics on the sample, such

as means and quartiles, generally differ from the parameters of the entire population.

For example:

If one measures the height of a thousand individuals from a country of one

million, the average height of the thousand is typically not the same as the average

height of all one million people in the country. Since sampling is typically done to

determine the characteristics of a whole population, the difference between the sample

and population values is considered a sampling error.
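A small simulation can make the height example concrete. The sketch below is illustrative only (the population of heights is generated artificially rather than measured); it draws a sample of 1,000 values from a simulated population and prints the difference between the sample mean and the population mean, which is the sampling error.

import random
import statistics

random.seed(42)
# A simulated population of heights in cm; the parameters are assumptions for illustration.
population = [random.gauss(165, 8) for _ in range(100_000)]
population_mean = statistics.mean(population)

sample = random.sample(population, 1000)     # measure only 1,000 individuals
sample_mean = statistics.mean(sample)

print(f"population mean = {population_mean:.2f} cm")
print(f"sample mean     = {sample_mean:.2f} cm")
print(f"sampling error  = {sample_mean - population_mean:+.2f} cm")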

Population and Samples:

A population is the entire group to which we want to generalize our results. A sample is a subset of the population. The population might be all adult humans, but our sample might be a group of 30 friends and relatives.

Types of sampling errors:

1. Random sampling

2. Bias problems

3. Non-sampling error

1. Random Sampling:

In statistics, sampling error is the error caused by observing a sample instead of the whole population. The sampling error can be found by subtracting the value of a parameter from the value of a statistic. In nursing research, a sampling error is the difference between a sample statistic used to estimate a population parameter and the actual but unknown value of the parameter (Burns and Grove, 2009).

Parameters and statistics:


A numerical summary of a population is called a parameter, while the same

numerical summary of a sample is called a statistic.

2. Bias Problems:

Sampling bias is a possible source of sampling errors. It leads to sampling errors that tend to be either positive or negative. Such errors can be considered systematic errors.

3. Non-sampling Error:

Sampling error can be contrasted with non-sampling error. Non-sampling error is a catch-all term for the deviations from the true value that are not a function of the sample chosen, including various systematic errors and any random errors that are not due to sampling. Non-sampling errors are much harder to quantify than sampling error.

Example of non-sampling error:

Answers given by respondents may be influenced by the desire to impress an

interviewer.

Characteristics of Sampling Error:

Sampling error:

1. Generally decreases as the sample size increases (but not proportionally).

2. Depends on the size of the population under study.

3. Depends on the variability of the characteristic of interest in the population.

4. Can be accounted for and reduced by an appropriate sampling plan.

5. Can be measured and controlled in probability sample surveys.

7.3 NULL HYPOTHESIS:

Before defining the term null-hypothesis, it is necessary that we must know

about Hypothesis and statistical hypothesis.

Hypothesis:

A hypothesis is any statement or assumption about any phenomena of nature.


Statistical Hypothesis:

A statistical hypothesis is a statement or assumption about the value of a

population parameter.

For example;

μ = 80 (the population mean is equal to 80)

μ > 22 (the population mean is greater than 22)

σ² ≠ 25 (the population variance is not equal to 25)

μ1 = μ2 (population mean 1 is equal to population mean 2)

μ1 − μ2 = 0 (there is no difference between μ1 and μ2)

Null Hypothesis:

The hypothesis to be tested in a test of hypothesis is called the null hypothesis. It is a hypothesis which is tested for possible rejection or nullification under the assumption that it is true. It is denoted by H0 and usually contains an equal sign. For example, if we want to test that the population mean is 80, then we write:

H0: μ = 80
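To illustrate how such a null hypothesis is actually tested against data, the following Python sketch runs a one-sample t-test of H0: μ = 80 on a small, invented set of scores (the data and the use of SciPy are assumptions for illustration, not part of the original text).

from scipy import stats

# Hypothetical sample of scores (invented for illustration).
scores = [78, 85, 90, 74, 81, 88, 76, 83]

# One-sample t-test of H0: mu = 80 against the two-sided alternative.
t_stat, p_value = stats.ttest_1samp(scores, popmean=80)

print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
if p_value <= 0.05:
    print("Reject H0: mu = 80")
else:
    print("Do not reject H0: mu = 80")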

Another definition of ‘Null-Hypothesis’:

Null hypothesis is a type of hypothesis used in statistics that proposes that no

statistical significance exists in a set of given observations.

The null hypothesis attempts to show that no variation exists between variables, or that a single variable is no different from zero. It is presumed to be true until statistical evidence nullifies it in favour of an alternative hypothesis.


Examples:

Hypothesis:

The loss of my socks is due to alien burglary. (Alien burglary means

unfamiliar theft).

Null Hypothesis:

The loss of my socks is nothing to do with alien burglary.

Alternative Hypothesis:

The loss of my socks is due to alien burglary. In statistics, the only way of supporting your hypothesis is to refute the null hypothesis. Rather than trying to prove your idea (the alternative hypothesis) right, you must show that the null hypothesis is likely to be wrong. You have to 'refute' or 'nullify' the null hypothesis.

7.4 TESTS OF SIGNIFICANCE:

Once sample data has been gathered through an observational study or

experiment, statistical inference allows analysts to assess evidence in favor of some

claim about the population from which the sample has been drawn. The methods of

inference used to support or reject claims based on sample data are known as tests of

significance.

Every test of significance begins with a null hypothesis H0. H0 represents a

theory that has been put forward, either because it is believed to be true or because it

is to be used as a basis for argument, but has not been proved. For example, in a

clinical trial of a new drug, the null hypothesis might be that the new drug is no better,

on average, than the current drug. We would write H0: there is no difference between

the two drugs on average.

The alternative hypothesis, Ha, is a statement of what a statistical hypothesis

test is set up to establish. For example, in a clinical trial of a new drug, the alternative

hypothesis might be that the new drug has a different effect, on average, compared to


that of the current drug. We would write Ha: the two drugs have different effects, on

average. The alternative hypothesis might also be that the new drug is better, on

average, than the current drug. In this case we would write Ha: the new drug is better

than the current drug, on average.

The final conclusion once the test has been carried out is always given in

terms of the null hypothesis. We either "reject H0 in favor of Ha" or "do not reject

H0"; we never conclude "reject Ha", or even "accept Ha".

If we conclude "do not reject H0", this does not necessarily mean that the null hypothesis is true; it only suggests that there is not sufficient evidence against H0 in favor of Ha. Rejecting the null hypothesis, then, suggests that the alternative hypothesis may be true.

Example

Suppose a test has been given to all high school students in a certain state. The

mean test score for the entire state is 70, with standard deviation equal to 10.

Members of the school board suspect that female students have a higher mean score

on the test than male students, because the mean score from a random sample of 64

female students is equal to 73. Does this provide strong evidence that the overall

mean for female students is higher?

The null hypothesis H0 claims that there is no difference between the mean score for female students and the mean for the entire population, so that μ = 70. The alternative hypothesis claims that the mean for female students is higher than the entire student population mean, so that μ > 70.
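Assuming, as is usual in this kind of textbook example, that the state-wide standard deviation of 10 also applies to the female students, the board's question can be answered with a one-sample z-test. The Python sketch below simply works through the arithmetic for the numbers given above.

import math

mu0, sigma = 70, 10        # state-wide mean and standard deviation
n, xbar = 64, 73           # sample size and sample mean for female students

z = (xbar - mu0) / (sigma / math.sqrt(n))                  # = 3 / 1.25 = 2.4
p_one_tailed = 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))  # P(Z > 2.4) under H0

print(f"z = {z:.2f}, one-tailed p = {p_one_tailed:.4f}")
# p is about 0.008, so the data give fairly strong evidence that the female mean exceeds 70.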

Steps in Testing for Statistical Significance

1. State the Research Hypothesis

2. State the Null Hypothesis

3. Select a probability of error level (alpha level)


4. Select and compute the test for statistical significance

5. Interpret the results

1) State the Research Hypothesis

A research hypothesis states the expected relationship between two variables.

It may be stated in general terms, or it may include dimensions of direction and

magnitude.

For example,

General: The length of the job training program is related to the rate of job placement of trainees.

Direction: The longer the training program, the higher the rate of job placement of trainees.

Magnitude: Longer training programs will place twice as many trainees into jobs as

shorter programs.

General: Graduate Assistant pay is influenced by gender.

Direction: Male graduate assistants are paid more than female graduate assistants.

Magnitude: Female graduate assistants are paid less than 75% of what male graduate

assistants are paid.

2) State the Null Hypothesis

A null hypothesis usually states that there is no relationship between the two

variables. For example,

There is no relationship between the length of the job training program and the

rate of job placement of trainees.

Graduate assistant pay is not influenced by gender.

A null hypothesis may also state that the relationship proposed in the research

hypothesis is not true. For example,


Longer training programs will place the same number or fewer trainees into

jobs as shorter programs.

Female graduate assistants are paid at least 75% or more of what male

graduate assistants are paid.

Researchers use a null hypothesis in research because it is easier to disprove a

null hypothesis than it is to prove a research hypothesis. The null hypothesis is the

researcher's "straw man." That is, it is easier to show that something is false once than

to show that something is always true. It is easier to find disconfirming evidence

against the null hypothesis than to find confirming evidence for the research

hypothesis.

(Definitions taken from Valerie J. Easton and John H. McColl's Statistics

Glossary v1.1)

One Tailed and Two Tailed Significant Tests

One important concept in significant testing is whether you use a one tailed or

two tailed test of significance. The answer is that it depends on your hypothesis.

When your research hypothesis states the direction of the difference or relationship,

then you use a one tailed probability. For example, a one tailed test would be used to

test these null hypotheses: females will not score significantly higher than males on an IQ test; Superman is not significantly stronger than the average person.

The one tailed probability is exactly half the value of the two tailed probability.
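This halving relationship is easy to verify numerically. The short Python sketch below (assuming SciPy is available) computes both probabilities for the same z statistic.

from scipy.stats import norm

z = 2.4                                 # any test statistic on the z scale
p_one_tailed = norm.sf(z)               # upper-tail area only
p_two_tailed = 2 * norm.sf(abs(z))      # area in both tails

print(f"one-tailed p = {p_one_tailed:.4f}")   # about 0.0082
print(f"two-tailed p = {p_two_tailed:.4f}")   # exactly twice the one-tailed value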

7.5 LEVELS OF SIGNIFICANCE:

In hypothesis testing, the significance level is the criterion used for rejecting

the null hypothesis.

The significance level is used in hypothesis testing as follows.


First, the difference between the results of the experiment and the null hypothesis is determined. Then, assuming the null hypothesis is true, the probability of a difference that large or larger is computed. Finally, the probability is compared to the significance level.

If the probability is less than or equal to the significance level, then the null hypothesis is rejected and the outcome is said to be statistically significant.

Traditionally, experiments have used either the 0.05 level (sometimes called the 5% level) or the 0.01 level (the 1% level), although the choice of level is largely subjective. The lower the significance level, the more the data must diverge from the null hypothesis to be significant; therefore, the 0.01 level is more conservative than the 0.05 level.

Symbols:

The Greek letter alpha (α) is sometimes used to indicate the significance level. The above explanation shows that the significance level is a value associated with a statistical test which indicates the probability of obtaining those or more extreme results. This value can be interpreted as the probability of obtaining those results if the null hypothesis were true (when sampling is random), or as the probability of obtaining those results by chance alone (when sampling is less than random). The value of this probability (also known as p, the p-value, alpha, or the Type I error rate) runs between 0 and 1. The closer it is to 0, the lower the probability of the results being found if the null hypothesis were true, or the lower the probability of the result being a chance result. As stated in the beginning, significance levels are used to reject null hypotheses such as "there is no correlation between variables", "there is no difference between groups" or "there is no change between treatments".

A significance level of 0.05 is conventionally used in the social sciences, although probabilities as high as 0.10 may also be used. Probabilities greater than 0.10 are rarely used. A significance level of 0.05, for example, indicates that there is a 5% probability that the results are due to chance. A significance level of 0.10 indicates a 10% probability that the results are due to chance. Thus, using significance levels above 0.10 is rather risky, while using lower significance levels is "safer".

History:

The present-day concept of statistical significance originated with Ronald Fisher when he developed statistical hypothesis testing, which he described as tests of significance, in his 1925 publication.

Fisher suggested a probability of one in twenty (0.05) as a convenient cut-off level for rejecting the null hypothesis.

Role in Statistics:

Statistical significance plays a pivotal role in statistical hypothesis testing, where it is used to determine whether a null hypothesis can be rejected or retained. A null hypothesis is the general default statement that nothing happened or changed. For a null hypothesis to be rejected as false, the result has to be identified as being statistically significant, i.e. unlikely to have occurred by chance alone.

To determine whether a result is statistically significant, a researcher has to calculate a p-value, which is the probability of observing an effect given that the null hypothesis is true.

References

www.en.wikipedia.org/wiki/statistical_significance.

Katozai, M. A. Measurement & Evaluation. Peshawar: University Publisher, 2013.


7.6 TYPE-I AND TYPE-II ERRORS:

Statistical Errors

Even in the best research project, there is always a possibility that the

researcher will make a mistake regarding the relationship between the two variables.

This mistake is called statistical error.

In statistical test theory the notion of statistical error is an integral part of

hypothesis testing. The test requires an unambiguous statement of a null hypothesis,

which usually corresponds to a default "state of nature", for example "this person is

healthy", "this accused is not guilty" or "this product is not broken". An alternative

hypothesis is the negation of null hypothesis, for example, "this person is not

healthy", "this accused is guilty" or "this product is broken". The result of the test may

be negative, relative to null hypothesis (not healthy, guilty, broken) or positive

(healthy, not guilty, not broken). If the result of the test corresponds with reality, then

a correct decision has been made. However, if the result of the test does not

correspond with reality, then an error has occurred. Due to the statistical nature of a

test, the result is never, except in very rare cases, free of error. Two types of error are

distinguished: type I error and type II error.

In statistics, a type I error (or error of the first kind) is the incorrect rejection

of a true null hypothesis. A type II error (or error of the second kind) is the failure to

reject a false null hypothesis. A type I error is a false positive. Usually a type I error

leads one to conclude that a thing or relationship exists when really it doesn't, for

example, that a patient has a disease being tested for when really the patient does not

have the disease, or that a medical treatment cures a disease when really it doesn't. A

type II error is a false negative. Examples of type II errors would be a blood test

failing to detect the disease it was designed to detect, in a patient who really has the

disease; or a clinical trial of a medical treatment failing to show that the treatment

works when really it does. When comparing two means, concluding the means were


different when in reality they were not different would be a Type I error; concluding

the means were not different when in reality they were different would be a Type II

error.

All statistical hypothesis tests have a probability of making type I and type II

errors. For example, all blood tests for a disease will falsely detect the disease in some

proportion of people who don't have it, and will fail to detect the disease in some

proportion of people who do have it. A test's probability of making a type I error is

denoted by α. A test's probability of making a type II error is denoted by β.
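The statement that α is the probability of a type I error can be illustrated by simulation. The Python sketch below is an invented illustration (it assumes SciPy): both groups are drawn from the same population, so H0 is true by construction, and roughly 5% of the tests nevertheless reject it at the 0.05 level.

import random
from scipy import stats

random.seed(0)
trials, false_positives = 2000, 0

for _ in range(trials):
    # Both samples come from the same population, so the null hypothesis is true.
    a = [random.gauss(50, 10) for _ in range(30)]
    b = [random.gauss(50, 10) for _ in range(30)]
    _, p = stats.ttest_ind(a, b)
    if p <= 0.05:
        false_positives += 1            # rejecting a true H0 is a type I error

print(f"observed type I error rate = {false_positives / trials:.3f}")   # close to alpha = 0.05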

The detail is given below:

Type-I Error:

The first is called a Type I error. This occurs when the researcher assumes that

a relationship exists when in fact the evidence is that it does not. In a Type I error, the

researcher should accept the null hypothesis and reject the research hypothesis, but

the opposite occurs. The probability of committing a Type I error is called alpha (α).

A type I error, also known as an error of the first kind, occurs when the null

hypothesis (H0) is true, but is rejected. It is asserting something that is absent, a false

hit. A type I error may be compared with a so-called false positive (a result that

indicates that a given condition is present when it actually is not present) in tests

where a single condition is tested for. Type I errors are philosophically a focus of

skepticism and Occam's razor. A Type I error occurs when we believe a falsehood. In

terms of folk tales, an investigator may be "crying wolf" without a wolf in sight

(raising a false alarm) (H0: no wolf).

The rate of the type I error is called the size of the test and is denoted by the Greek letter α (alpha). It usually equals the significance level of a test. In the case of a simple null hypothesis, α is the probability of a type I error. If the null hypothesis is composite, α is the maximum (supremum) of the possible probabilities of a type I error.

Explanation:

A Type I Error is also known as a False Positive or Alpha Error. This happens

when you reject the Null Hypothesis even if it is true. The Null Hypothesis is simply a

statement that is the opposite of your hypothesis. For example, you think that boys are

better in arithmetic than girls. Your null hypothesis would be: "Boys are not better

than girls in arithmetic."

You will make a Type I Error if you conclude that boys are better than girls in

arithmetic when in reality, there is no difference in how boys and girls perform. In

this case, you should accept the null hypothesis since there is no real difference

between the two groups when it comes to arithmetic ability. If you reject the null

hypothesis and say that one group is better, then you are making a Type I Error.

Type-II Error

The second is called a Type II error. This occurs when the researcher assumes

that a relationship does not exist when in fact the evidence is that it does. In a Type II

error, the researcher should reject the null hypothesis and accept the research

hypothesis, but the opposite occurs. The probability of committing a Type II error is

called beta.

Generally, reducing the possibility of committing a Type I error increases the

possibility of committing a Type II error and vice versa, reducing the possibility of

committing a Type II error increases the possibility of committing a Type I error.

Researchers generally try to minimize Type I errors, because when a

researcher assumes a relationship exists when one really does not, things may be

worse off than before. In Type II errors, the researcher misses an opportunity to

confirm that a relationship exists, but is no worse off than before.


Type II Error is a statistical term used within the context of hypothesis testing

that describes the error that occurs when one accepts a null hypothesis that is

actually false. The error rejects the alternative hypothesis, even though it does

not occur due to chance.

A type II error accepts the null hypothesis, although the alternative hypothesis

is the true state of nature. It confirms an idea that should have been rejected, claiming

that two observances are the same, even though they are different.

Example:

An example of a type II error would be a pregnancy test that gives a negative

result, even though the woman is in fact pregnant. In this example, the null hypothesis

would be that the woman is not pregnant, and the alternative hypothesis is that she is

pregnant.

In other words, a type II error, also known as an error of the second kind, occurs

when the null hypothesis is false, but erroneously fails to be rejected. It is failing to

assert what is present, a miss. A type II error may be compared with a so-called false

negative (where an actual 'hit' was disregarded by the test and seen as a 'miss') in a

test checking for a single condition with a definitive result of true or false. A Type II

error is committed when we fail to believe a truth. In terms of folk tales, an

investigator may fail to see the wolf ("failing to raise an alarm"). Again, H0: no wolf.

The rate of the type II error is denoted by the Greek letter β (beta) and is related

to the power of a test (which equals 1 − β).

What we actually call a type I or type II error depends directly on the null

hypothesis. Negation of the null hypothesis causes type I and type II errors to switch

roles.

The goal of the test is to determine if the null hypothesis can be rejected. A

statistical test can either reject (prove false) or fail to reject (fail to prove false) a null


hypothesis, but never prove it true (i.e., failing to reject a null hypothesis does not

prove it true).

Explanation:

A Type II Error is also known as a False Negative or Beta Error. This happens

when you accept the Null Hypothesis when you should in fact reject it. The Null

Hypothesis is simply a statement that is the opposite of your hypothesis. For example,

you think that dog owners are friendlier than cat owners. Your null hypothesis would

be: "Dog owners are as friendly as cat owners."

You will make a Type II Error if dog owners are actually friendlier than cat

owners, and yet you conclude that both kinds of pet owners have the same level of

friendliness. In this case, you should reject the null hypothesis since there is a real

difference in friendliness between the two groups. If you accept the null hypothesis

and say that both types of pet owners are equally friendly, then you are making a

Type II Error.

7.7 DEGREES OF FREEDOM:

In statistics, the number of degrees of freedom is the number of values in the final calculation of a statistic that are free to vary.

The number of independent ways by which a dynamic system can move without violating any constraint imposed on it is called its degrees of freedom. In other words, the degrees of freedom can be defined as the minimum number of independent coordinates that can specify the position of the system completely.

Estimates of statistical parameters can be based upon different amounts of

information or data. The number of independent pieces of information that go into the

estimate of a parameter is called the degrees of freedom. In general, the degrees of

freedom of an estimate of a parameter is equal to the number of independent scores

that go into the estimate minus the number of parameters used as intermediate steps in


the estimation of the parameter itself (which, in sample variance, is one, since the

sample mean is the only intermediate step).

In many statistical problems we are required to determine the degrees of

freedom. This refers to a positive whole number that indicates the lack of restrictions

in our calculations. The degrees of freedom are the number of values in a calculation

that we can vary.

One step in most statistical inference problems is to determine the number of

degrees of freedom. The number of degrees of freedom in a problem is related to the precise probability distribution that is to be used in the inference procedure. This step is an often overlooked but crucial detail in both the calculation of confidence intervals and the workings of hypothesis tests.

There is not a single general formula for the number of degrees of freedom for every inference problem. Instead there are specific formulas to be used for each type of procedure in inferential statistics. In other words, the setting that we are working in will determine how we calculate the number of degrees of freedom.

Determining Degree of Freedom:

Number of components that are free to vary about a parameter

Df = Sample size – Number of parameters estimated

Df is n − 1 for a one-sample test of the mean.

A Few Examples

For a moment suppose that we know the mean of the data is 25 and that the values are 20, 10, 50, and one unknown value. To find the mean of a list of data, we add all of the data and divide by the total number of values. This gives us the formula (20 + 10 + 50 + x)/4 = 25, where x denotes the unknown. Although x is unknown, we can use some algebra to determine that x = 20.


Let's alter this scenario slightly. Instead we suppose that we know the mean of a data set is 25, with values 20, 10, and two unknown values. These unknowns could be different, so we use two different variables, x and y, to denote them. The resulting formula is (20 + 10 + x + y)/4 = 25. With some algebra we obtain y = 70 − x. The formula is written in this form to show that once we choose a value for x, the value for y is determined. This shows that there is one degree of freedom.

Now we'll look at a sample size of one hundred. If we know that the mean of this sample data is 20, but do not know the values of any of the data, then there are 99 degrees of freedom. All values must add up to a total of 20 x 100 = 2000. Once we have the values of 99 elements in the data set, then the last one has been determined.
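The idea that fixing the mean leaves only N − 1 values free can be shown directly in a few lines of code. The Python sketch below uses the earlier worked example (mean 25 with values 20, 10, 50 and one unknown) to recover the last data point from the mean and the other values.

known_values = [20, 10, 50]      # the N - 1 values that are free to vary
mean = 25                        # the fixed parameter
n = len(known_values) + 1        # total number of values, N = 4

# Once the mean and the other N - 1 values are fixed, the last value is determined.
last_value = mean * n - sum(known_values)
print(last_value)                # prints 20, matching the worked example above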

Example

To compute the variance I first sum the squared deviations from the mean. The mean is a parameter: it is a characteristic of the variable under examination as a whole and is part of describing the overall distribution of values. If you know all the parameters you can accurately describe the data. The more parameters you know, that is to say the more you fix, the fewer samples fit this model of the data. If you know only the mean, there will be many possible sets of data that are consistent with this model, but if you know the mean and the standard deviation, fewer possible sets of data fit this model.

So in computing the variance I had first to calculate the mean. Once I have

calculated the mean, I could vary any of the scores in the data except for one. If I

leave one score unexamined it can always be calculated accurately from the rest of the

data and the mean itself. Maybe an example can make this clearer.

I take the ages of a class of students and find the mean. If I fix the mean, how

many of the other scores (there are N of them remember) could still vary? The answer

is N-1. There are N-1 independent pieces of information that could vary while the


mean is known. These are the degrees of freedom. One piece of information cannot

vary because its value is fully determined by the parameter (in this case the mean) and

the other scores. Each parameter that is fixed during our computations constitutes the

loss of a degree of freedom.

If we imagine starting with a small number of data points and then fixing a

relatively large number of parameters as we compute some statistic, we see that as

more degrees of freedom are lost, fewer and fewer different situations are accounted

for by our model since fewer and fewer pieces of information could in principle be

different from what is actually observed.

So, the interest, to put it very informally, in our data is determined by the

degrees of freedom: if there is nothing that can vary once our parameter is fixed

(because we have so very few data points - maybe just one) then there is nothing to

investigate. Degrees of freedom can be seen as linking sample size to explanatory

power.

The standard deviation is a measure of how spread out numbers are.

Its symbol is σ (the Greek letter sigma).

The formula is easy: it is the square root of the variance.

To calculate the variance follow these steps:

Work out the Mean (the simple average of the numbers)

Then for each number: subtract the mean and square the result (the squared

difference).

Then work out the average of those squared differences.

Let us suppose we have five values, i.e. 600, 470, 170, 430 and 300.

Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394

Variance: σ² = [(206)² + (76)² + (−224)² + (36)² + (−94)²] / 5

= (42,436 + 5,776 + 50,176 + 1,296 + 8,836) / 5

= 108,520 / 5 = 21,704

The standard deviation is then σ = √21,704 ≈ 147.3.
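The same arithmetic can be checked with the Python standard library. This sketch simply recomputes the mean, the population variance (dividing by n, as in the worked example) and the standard deviation for the five values above.

import math
import statistics

values = [600, 470, 170, 430, 300]

mean = statistics.mean(values)                  # 394
variance = statistics.pvariance(values, mean)   # population variance (divide by n): 21704
std_dev = math.sqrt(variance)                   # about 147.32

print(mean, variance, std_dev)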


UNIT-8:

SELECTED TESTS OF SIGNIFICANCE

8.1 T-TEST:

Definition:

i) A t-test helps you compare whether two groups have different average values (for example, whether men and women have different average heights).

ii) A t-test asks whether a difference between two groups' averages is unlikely to have occurred because of random chance in sample selection. A difference is more likely to be meaningful and "real" if (a) the difference between the averages is large, (b) the sample size is large, and (c) responses are consistently close to the average values and not widely spread out (the standard deviation is low).

iii) A t-test is a statistical examination of two population means. A two-sample t-test examines whether two samples are different and is commonly used when the variances of two normal distributions are unknown and when an experiment uses a small sample size. For example, a t-test could be used to compare the average floor routine score of the U.S. women's Olympic gymnastics team to the average floor routine score of China's women's team.

The t-test’s statistical significance and the t-test’s effect size are the two

primary outputs of the t-test. Statistical significance indicates whether the difference between sample averages is likely to represent an actual difference between the populations, and the effect size indicates whether that difference is large enough to be practically meaningful.


The “One sample t-test” is similar to the “independent samples t-test” except it

is used to compare one group’s average value to a single number .x. for practical

purposes you can look at the confidence interval around the average value to gain this

same information.

The “paired t-test” is used when each observation in one group is paired with a

related observation in the other group. For example do Kansans spend money on

movies in January to February. Where each respondent is asked about their January

from their February spending? In fact a period t-test subtracts each respondent’s

January spending from their February spending (yielding the increase is spending),

then take the average of all those increases in spending and looks to see wither that

average is statistically significantly greater than Zero (using a one sample t-test).

The “ranked independent t-test” ask a similar question to the typical unranked

test but it is more robust to outliners (a few bad outliners can make the results of an

unranked t-test invalid).

T-test (Independent Samples)

Dollars spent on movies per month: Statwing represents t-test results as distribution curves. Assuming there is a large enough sample size, the difference between these samples probably represents a "real" difference between the populations from which they were sampled.

Example:

Let’s say you are curious about wether New Yorkers and Kansans spend a

different amount of money per month on movies. It is impractical to ask every New

Yorker and Kansans about their movie spending, so instead you ask a sample of each

– may be 300 New Yorkers and 300 Kansans – and the average are 14 Dollars and 18

Dollars. The t-test asks wether that difference is probably representative of a real

199

Page 202: Measurement and Evaluation (Book).docx

difference between Kansans and New Yorkers generally or whether that is most likely

a meaningless statistical fluke.

Technically, it asks the following. If there were in fact no difference between

Kansans and New Yorkers generally, what are the chances that randomly selected groups

from those populations would be as different as these randomly selected groups are?

For example, if Kansans and New Yorkers as a whole actually spent the same amount of money on average, it is very unlikely that 300 randomly selected Kansans would average exactly 14 dollars while 300 randomly selected New Yorkers averaged exactly 18 dollars. So if your sampling yielded those results, you would conclude that the difference in the sample groups is most likely representative of a meaningful difference between the populations as a whole.

Statistical Analysis of the T-test:

The formula for the t-test is a ratio. The top part of the ratio is just the

difference between the two means or averages. The bottom part is a measure of the variability or dispersion of the scores. This formula is essentially another example of the signal-to-noise metaphor in research: the difference between the means is the signal that, in this case, we think our program or treatment introduced into the data; the bottom part of the formula is a measure of variability that is essentially noise that may make it harder to see the group difference.

Signal-to-Noise:

The top part of the formula is easy to compute: just find the difference between the means. The bottom part is called the standard error of the difference. To compute it, we take the variance of each group, divide it by the number of people in that group, add these two values, and then take the square root. The specific formula is given below:


SE(\bar{X}_T - \bar{X}_C) = \sqrt{\frac{var_T}{n_T} + \frac{var_C}{n_C}}

Remember that the variance is simply the square of the standard deviation. The final formula for the t-test is shown below:

t = \frac{\bar{X}_T - \bar{X}_C}{\sqrt{\frac{var_T}{n_T} + \frac{var_C}{n_C}}}

Formula for the t-test.
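As a rough check of this formula, the sketch below (with invented treatment and control scores) computes the signal-to-noise ratio by hand and compares it with scipy's unequal-variance t-test, which uses the same standard error of the difference:

```python
import numpy as np
from scipy import stats

# Hypothetical treatment and control scores.
treatment = np.array([23.0, 25.5, 21.0, 28.0, 24.5, 26.0])
control   = np.array([20.0, 22.5, 19.0, 24.0, 21.5, 20.5])

signal = treatment.mean() - control.mean()                 # difference between the means
noise = np.sqrt(treatment.var(ddof=1) / len(treatment)     # standard error of
                + control.var(ddof=1) / len(control))      # the difference
t_manual = signal / noise

# scipy's unequal-variance (Welch) t-test uses the same standard error.
t_scipy, _ = stats.ttest_ind(treatment, control, equal_var=False)
print(round(t_manual, 4), round(t_scipy, 4))   # the two values agree
```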


8.2 CHI-SQUARE (X2):

The X2-distribution (X is the Greek letter chi, pronounced “kai”) was first obtained in 1875 by F. R. Helmert, a German geodesist. Later, in 1900, Karl Pearson showed that as n increases to infinity, a discrete multinomial distribution may be transformed and made to approach a chi-square distribution. This approximation has broad applications, such as a test of goodness of fit, a test of independence, and a test of homogeneity.

The chi-square distribution contains only one parameter, called the number of degrees of freedom.


Chi-Square Distribution:

Let Z1, Z2, ..., Zn be normally and independently distributed variables with zero mean and unit variance, i.e. N(0, 1). Then the random variable given by

X^2 = Z_1^2 + Z_2^2 + \cdots + Z_n^2

follows the chi-square distribution with n degrees of freedom. In other words, it can be defined as “the sum of squares of n independent standardized random variables”.

Properties of Chi-Square Distribution:

The chi-square distribution has the following properties:

1. The chi-square distribution is continuous, ranging from zero to infinity.

2. The total area under the curve is unity.

3. The mean of the X2 distribution is equal to the number of degrees of freedom, i.e. n.

4. The variance of the X2 distribution is equal to twice the degrees of freedom, i.e. 2n.

5. The curve of the chi-square distribution is positively skewed.

6. The X2 distribution tends to the normal distribution as the number of degrees of freedom approaches infinity.

7. The moment generating function of the X2 distribution is (1 − 2t)^(−n/2).

8. The X2 distribution is leptokurtic, since β2 > 3.
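Properties 3 and 4, and the definition as a sum of squared standard normal variables, can be illustrated with a short simulation (the sample size and random seed below are arbitrary):

```python
import numpy as np

# Simulate X^2 = Z1^2 + ... + Zn^2 for n = 5 and check properties 3 and 4:
# the mean should be close to n and the variance close to 2n.
rng = np.random.default_rng(1)
n = 5
z = rng.standard_normal((100_000, n))   # independent standard normal variables
chi_sq = (z ** 2).sum(axis=1)           # sum of squares of n standardized variables

print(round(chi_sq.mean(), 2))   # approximately 5
print(round(chi_sq.var(), 2))    # approximately 10
```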

Uses of X2 Distribution:

1. X2 is used to test the goodness of fit.

2. X2 is used to test the independence of attributes.

3. X2 is used to test the validity of hypothetical ratios.

4. X2 is used to test the homogeneity of several variances.


5. X2 is used to test whether a hypothetical value of the population variance is true or not.

6. X2 is used to test the equality of several population correlation coefficients.

Goodness of Fit Test:

This test is based on the property that the statistic still follows an approximate chi-square distribution when the cell probabilities depend upon unknown parameters, provided that the unknown parameters are replaced with their estimates and that one degree of freedom is deducted for each parameter estimated. To see whether there is evidence of small or large differences between the observed and expected frequencies, the test statistic to use is:

X^2 = \sum_{i=1}^{k} \frac{(O_i - np_i)^2}{np_i} = \sum_{i=1}^{k} \frac{(O_i - e_i)^2}{e_i}

with k − 1 − (number of parameters estimated) degrees of freedom.

The symbols Oi and ei represent the observed and expected frequencies respectively. When the observed values are equal to the expected values, X2 = 0. The larger the difference between the observed and expected frequencies, the larger the X2 value will be. A small value of X2 indicates that the fit is good and leads to acceptance of H0; a large value of X2 indicates that the fit is poor and leads to acceptance of H1.
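A small goodness-of-fit sketch using scipy.stats.chisquare, with hypothetical die-roll counts, illustrates this comparison of observed and expected frequencies:

```python
from scipy import stats

# Hypothetical die-roll experiment: 60 rolls, H0 says all six faces are
# equally likely, so each expected frequency is n * p_i = 60 * (1/6) = 10.
observed = [8, 12, 9, 11, 10, 10]
expected = [10, 10, 10, 10, 10, 10]

chi2, p_value = stats.chisquare(observed, f_exp=expected)
print(f"X^2 = {chi2:.2f}, p = {p_value:.3f}")
# A small X^2 (and large p-value) means the fit is good, so H0 is not rejected.
```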

Contingency Table:

A table consisting of two or more rows and two or more columns, into which n observations are classified according to two different criteria (or variables), is commonly called a contingency table.

The simplest form of a contingency table is the 2×2 table, which is obtained when both criteria are dichotomized. The totals of the frequencies in each of the rows and columns are called the marginal totals or marginal frequencies. Contingency tables provide a useful method of comparing two variables.

A 2 × 2 contingency table is shown below.


Classes B1 B2 Total

A1 O11 O12 (A1)

A2 O21 O22 (A2)

Total (B1) (B2) N

A contingency table may be extended to higher dimensions, i.e. an r × c contingency table, where r represents the number of rows and c represents the number of columns.

Testing Hypothesis of Independence in Contingency Table:

The data presented in a contingency table can be used to test the hypothesis that the two variables of classification are independent. If this hypothesis is rejected, the two variables of classification are not independent, and we say that there is some association (or interaction) between the two variables of classification. To do so, we must calculate the expected frequencies based on this hypothesis, keeping the marginal totals fixed.

Let eij denote the expected frequency belonging to Ai and Bj. Assuming the hypothesis of independence is true, the proportion of members of class Ai that belong to Bj should be the same as the proportion of Bj in the total. Thus

\frac{e_{ij}}{(A_i)} = \frac{\sum_{i=1}^{r} e_{ij}}{\sum_{i=1}^{r} (A_i)} = \frac{(B_j)}{n}

so that

e_{ij} = \frac{(A_i)(B_j)}{n}

That is, under H0 (the classifications are independent), the expected frequency in any cell is equal to the product of the marginal totals common to that cell divided by the total number of observations.


If our hypothesis of independence is true, the differences between the observed and expected frequencies are small and are attributed to sampling error; large differences arise if the hypothesis is false. The chi-square statistic provides a means of deciding whether the differences are large or small overall. Hence the statistic to use is

X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - e_{ij})^2}{e_{ij}}

with (r − 1)(c − 1) degrees of freedom, where r represents the number of rows and c the number of columns. A large value of X2 indicates that the null hypothesis is false.

The procedure for testing the null hypothesis of independence in a contingency table is given below:

i) Formulate the null and alternative hypotheses as:

H0: The two variables of classification are independent, i.e. there is no relationship/association between the two variables.

H1: The two variables of classification are not independent, i.e. they are associated.

ii) Choose a significance level α. The commonly used levels are α = 0.01 and α = 0.05.

iii) The test statistic to use is

X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - e_{ij})^2}{e_{ij}}

which, if H0 is true, has an approximate chi-square distribution with (r − 1)(c − 1) degrees of freedom.


iv) Compute the expected frequencies under H0 for each cell by the formula

e_{ij} = \frac{(A_i)(B_j)}{n} = \frac{(i\text{th row total})(j\text{th column total})}{\text{total number of observations}}

Also calculate the value of X2 and the degrees of freedom.

v) Determine the critical region, which depends on α and the number of degrees of freedom.

vi) Decide as below:

(i) Reject H0 if the computed value of X2 > X2α with (r − 1)(c − 1) degrees of freedom.

(ii) Accept H0 if X2 ≤ X2α with (r − 1)(c − 1) degrees of freedom.
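The whole procedure can be illustrated with scipy.stats.chi2_contingency on a hypothetical 2×2 table; the counts below are invented for illustration only:

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 contingency table (rows A1, A2; columns B1, B2).
observed = np.array([[30, 20],
                     [20, 30]])

# chi2_contingency builds e_ij = (row total)(column total)/n for every cell
# and reports the statistic with (r-1)(c-1) degrees of freedom.
chi2, p_value, dof, expected = stats.chi2_contingency(observed, correction=False)
print(chi2, dof)          # 4.0 with 1 degree of freedom
print(round(p_value, 4))  # compare against the chosen significance level
print(expected)           # every e_ij is 50*50/100 = 25 here
```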


8.3 REGRESSION:

In statistics, regression analysis is a statistical technique for estimating the

relationships among variables. It includes many techniques for modelling and

analysing several variables, when the focus is on the relationship between a dependent

variable and one or more independent variables.


In other words regression is a statistical measure that attempts to determine the

strength of the relationship between one dependent variable (usually denoted by Y)

and a series of other changing variables (known as independent variables).

Types of Regression:

There are two basic types of regression:

(i) Linear regression

(ii) Multiple regression.

Linear regression uses one independent variable to explain and/or predict the

outcome of Y, while multiple regression uses two or more independent variables to

predict the outcome. The general form of each type of regression is:

Linear Regression: Y = a + bX + u

Multiple Regression: Y = a + b1X1 + b2X2 + b3X3 + … + btXt + u

Where:

Y = the variable that we are trying to predict

X = the variable that we are using to predict Y

a = the intercept

b = the slope

u = the regression residual.

In multiple regression, the separate variables are differentiated by using

subscripted numbers.

Regression takes a group of random variables, thought to be predicting Y, and

tries to find a mathematical relationship between them. This relationship is typically

in the form of a straight line (linear regression) that best approximates all the


individual data points. Regression is often used to determine how much specific

factors such as the price of a commodity, interest rates, particular industries or sectors

influence the price movement of an asset.
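As an illustrative sketch (with invented data), both the linear and the multiple form can be fitted by least squares in Python using numpy:

```python
import numpy as np

# Invented data purely for illustration.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # independent variable
Y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])   # dependent variable

# Linear regression Y = a + bX + u by least squares.
b, a = np.polyfit(X, Y, deg=1)            # slope b and intercept a
u = Y - (a + b * X)                       # regression residuals
print(f"Y = {a:.2f} + {b:.2f}X")

# Multiple regression with two predictors (here X and X squared) via the
# design-matrix form; the first fitted coefficient is the intercept a.
design = np.column_stack([np.ones_like(X), X, X ** 2])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
print(coef)
```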
