
Using observation checklists to validate speaking-test tasks

Barry O'Sullivan, The University of Reading; Cyril J. Weir, University of Surrey, Roehampton; and Nick Saville, University of Cambridge Local Examinations Syndicate

Test-task validation has been an important strand in recent revision projects for University of Cambridge Local Examinations Syndicate (UCLES) examinations. This article addresses the relatively neglected area of validating the match between intended and actual test-taker language with respect to a blueprint of language functions representing the construct of spoken language ability. An observation checklist designed for both a priori and a posteriori analysis of speaking task output has been developed. This checklist enables language samples elicited by the task to be scanned for these functions in real time, without resorting to the laborious and somewhat limited analysis of transcripts. The process and results of its development, implications and further applications are discussed.

I Background to the study

This article reports on the development and use of observation checklists in the validation of the Speaking Tests within the University of Cambridge Local Examinations Syndicate (UCLES) 'Main Suite' examination system (see Figure 1).

ALTE Level 1 (Waystage User): Cambridge Level 1, Key English Test (KET)
ALTE Level 2 (Threshold User): Cambridge Level 2, Preliminary English Test (PET)
ALTE Level 3 (Independent User): Cambridge Level 3, First Certificate in English (FCE)
ALTE Level 4 (Competent User): Cambridge Level 4, Certificate in Advanced English (CAE)
ALTE Level 5 (Good User): Cambridge Level 5, Certificate of Proficiency in English (CPE)
(The five levels span three broad bands: BASIC, INTERMEDIATE and ADVANCED.)

Figure 1 The Cambridge/ALTE five-level system

Address for correspondence: Barry O'Sullivan, Testing and Evaluation Unit, School of Linguistics and Applied Language Studies, The University of Reading, PO Box 241, Whiteknights, Reading RG6 6WB, UK; email: b.e.osullivan@reading.ac.uk

Language Testing 2002 19 (1) 33–56. DOI: 10.1191/0265532202lt219oa. © 2002 Arnold


Table 1 Format of the Main Suite Speaking Test

Part  Participants                        Task format
1     Interviewer–candidate               Interview; verbal questions
2     Candidate–candidate                 Collaborative task; visual stimulus, verbal instructions
3     Interviewer–candidate–candidate     Long turns and discussion; written stimulus, verbal questions

These checklists are intended to provide an effective and efficient tool for investigating variation in language produced by different task types, different tasks within task types and different interview organization at the proficiency levels in Figure 1. As such, they represent a unique attempt to validate the match between intended and actual test-taker language with respect to a blueprint of language functions representing the construct of spoken language ability in the UCLES tests of general language proficiency from PET to CPE level (for further information related to the different tests in the 'Main Suite' battery, see the individual handbooks produced by UCLES). Beyond this study, the application of such checklists has clear relevance for any test of spoken interaction.

The standard Cambridge approach in testing speaking is based on a paired format involving an interlocutor, an additional examiner and two candidates. Careful attention has been given to the tasks through which the spoken language performance is elicited in each different part. The format of the Main Suite Speaking Tests (with the exception of the Level 1 KET test) is summarized in Table 1.

II Issues in validating tests of oral performance

In considering the issue of the validity of a performance test¹ of speaking, we need a framework that describes the relationship between the construct being measured, the tasks used to operationalize that construct and the assessment of the performances that are used to make inferences to that underlying ability.

There have been a number of models that have attempted to portray the relationship between a test-taker's knowledge of, and ability to use, a language and the score they receive in a test designed to evaluate that knowledge (e.g., Milanovic and Saville, 1996; McNamara, 1996; Skehan, 1998; Upshur and Turner, 1999).

¹ By performance tests we are referring to direct tests, where a test-taker's ability is evaluated from their performance on a set task or tasks.


Milanovic and Saville (1996) provide a useful overview of the variables that interact in performance testing and suggest a conceptual framework for setting out different avenues of research. The framework was influential in the revisions of the Cambridge examinations during the 1990s, including the development of the KET and CAE exams and revisions to PET, FCE and, most recently, CPE (for a summary of the UCLES approach, see Saville and Hargreaves, 1999).

The Milanovic and Saville framework is one of the earliest and most comprehensive of these models (reproduced here as Figure 2). This framework highlights the many factors (or facets) that must be considered when designing a test from which particular inferences are to be drawn about performances; all of the factors represented in the model pose potential threats to the reliability and validity of these inferences. From this model a framework can be derived through which a validation strategy can be devised for Speaking Tests such as those produced by UCLES.

The essential elements of this framework are:
· the test-taker;
· the interlocutor/examiner;
· the assessment criteria (scales);
· the task;
· the interactions between these elements.

Figure 2 A conceptual framework for performance testing
Source: adapted from Milanovic and Saville, 1996: 6
(The figure links the following elements: examination developer; specifications and construct; examination conditions; tasks; assessment criteria; assessment conditions and training; examiners, with their knowledge and ability; candidates, with their knowledge and ability; sample of language; score.)


The subject of this study, the task, has been explored from a number of perspectives. Briefly, these have been:

· Task/method comparison (quantitative): studies in which comparisons are made between performances on different tasks or methods (Clark, 1979; 1988; Henning, 1983; Shohamy, 1983; Shohamy et al., 1986; Clark and Hooshmand, 1992; Stansfield and Kenyon, 1992; Wigglesworth and O'Loughlin, 1993; Chalhoub-Deville, 1995a; O'Loughlin, 1995; Fulcher, 1996; Lumley and O'Sullivan, 2000; O'Sullivan, 2000).
· Task/method comparison (qualitative): as above, but where qualitative methods are employed (Shohamy, 1994; Young, 1995; Luoma, 1997; O'Loughlin, 1997; Bygate, 1999; Kormos, 1999).
· Task performance (method effect): where aspects of the task are systematically manipulated, e.g., planning time, pre- or post-task operations, etc. (Foster and Skehan, 1996; 1999; Wigglesworth, 1997; Mehnert, 1998; Ortega, 1999; Upshur and Turner, 1999).
· Native speaker/nonnative speaker comparison: where native speaker performance on specific tasks is compared to nonnative speaker performance on the same tasks (Weir, 1983; Ballman, 1991).
· Task difficulty/classification: where an attempt has been made to classify tasks in terms of their difficulty (Weir, 1993; Fulcher, 1994; Kenyon, 1995; Robinson, 1995; Skehan, 1996; 1998; Norris et al., 1998).

The central importance of the test task has been clearly recognized; however, in terms of test validation, there is one question that has to date remained largely unexplored. Although there has been a great deal of debate over the validation of performance tests through analysis of the language generated in the performance of language elicitation tasks (LETs) (e.g., van Lier, 1989; Lazaraton, 1992; 1996), attention has not been drawn to the one aspect of task performance that would appear to be of most interest to the test designer. That is, when tasks are performed in a test event, how does that performance relate to the test designer's predictions or expectations, based on their definition or interpretation of the construct? After all, no matter how reliably the performance is scored, if it does not match the expectations of the test designer (in other words, represent the constructs which are to be tested), then the inferences that the test designer hopes to draw from the evaluated performance will not be valid.

Cronbach went to the heart of the matter (1971: 443): 'Construction of a test itself starts from a theory about behaviour or mental organization, derived from prior research, that suggests the ground plan for the test.' Davies (1977: 63) argued in similar vein: 'it is, after all, the theory on which all else rests; it is from there that the construct is set up and it is on the construct that validity of the content and predictive kinds is based.' Kelly (1978: 8) supported this view, commenting that 'the systematic development of tests requires some theory, even an informal, inexplicit one, to guide the initial selection of item content and the division of the domain of interest into appropriate sub-areas.'

Because we lack an adequate theory of language in use, a priori attempts to determine the construct validity of proficiency tests involve us in matters that relate more evidently to content validity. We need to talk of the communicative construct in descriptive terms and, as a result, we become involved in questions of content relevance and content coverage. Thus for Kelly (1978: 8) content validity seemed 'an almost completely overlapping concept' with construct validity, and for Moller (1982: 68) 'the distinction between construct and content validity in language testing is not always very marked, particularly for tests of general language proficiency'.

Content validity is considered important as it is principally concerned with the extent to which the selection of test tasks is representative of the larger universe of tasks of which the test is assumed to be a sample (see Bachman and Palmer, 1981; Henning, 1987: 94; Messick, 1989: 16; Bachman, 1990: 244). Similarly, Anastasi (1988: 131) defined content validity as involving 'essentially the systematic examination of the test content to determine whether it covers a representative sample of the behaviour domain to be measured'. She outlined (Anastasi, 1988: 132) the following guidelines for establishing content validity:

1) 'the behaviour domain to be tested must be systematically analysed to make certain that all major aspects are covered by the test items, and in the correct proportions';
2) 'the domain under consideration should be fully described in advance, rather than being defined after the test has been prepared';
3) 'content validity depends on the relevance of the individual's test responses to the behaviour area under consideration, rather than on the apparent relevance of item content'.

The directness of fit and adequacy of the test sample is thus dependent on the quality of the description of the target language behaviour being tested. In addition, if the responses to the item are invoked, Messick (1975: 961) suggests, 'the concern with processes underlying test responses places this approach to content validity squarely in the realm of construct validity'. Davies (1990: 23) similarly notes: 'content validity slides into construct validity'.


Content validation is, of course, extremely problematic, given the difficulty we have in characterizing language proficiency with sufficient precision to ensure the validity of the representative sample we include in our tests, and the further threats to validity arising out of any attempts to operationalize real life behaviours in a test. Specifying operations, let alone the conditions under which these are performed, is challenging and at best relatively unsophisticated (see Cronbach, 1990). Weir (1993) provides an introductory attempt to specify the operations and conditions that might form a framework for test task description (see also Bachman, 1990; Bachman and Palmer, 1996).

The difficulties involved do not, however, absolve us from attempting to make our tests as relevant as possible in terms of content. Generating content-related evidence is seen as a necessary, although not sufficient, part of the validation process of a speaking test. To this end, we sought to establish in this study an effective and efficient procedure for establishing the content validity of speaking tests. As well as being useful in helping specify the domain to be tested, we would argue that the checklist discussed below would enable the researcher to address how predicted vs. actual task performance can be compared.

III Methodological issues

While it is relatively easy to rationalize the need to establish that the LETs used in performance tests are working as predicted (i.e., in terms of language generated), the difficulty lies in how this might best be done.

UCLES EFL (English as a foreign language) routinely collects audio recordings and carries out transcriptions of its Speaking Tests. These transcripts are used for a range of validation purposes and, in particular, they contribute to revision projects for the Speaking Tests: for example, FCE, which was revised in 1996, and currently the revision of the International English Language Testing System (IELTS) Speaking Test, in addition to the CPE revision project.

In a series of UCLES studies focusing on the language of the Speaking Tests, Lazaraton has applied conversational analysis (CA) techniques to contribute to our understanding of the language used in pair-format Speaking Tests, including the language of the candidates and the interlocutor. Her approach requires a very careful, fine-tuned transcription of the tests in order to provide the data for analysis (see Lazaraton, 2000). Similar qualitative methodologies have been applied by Young and Milanovic (1992), also to UCLES data, by Brown (1998) and by Ross and Berwick (1992), amongst others.


While there is clearly a great deal of potential for this detailed analysis of transcribed performances, there are also a number of drawbacks, the most serious of which involves the complexity of the transcription process. In practice, this means that a great deal of time and expertise is required in order to gain the kind of data that will answer the basic question concerning validity. Even where this is done, it is impractical to attempt to deal with more than a small number of test events; therefore, the generalizability of the results may be questioned.

Clearly, then, a more efficient methodology is required that allows the test designer to evaluate the procedures, and especially the tasks, in terms of the language produced by a larger number of candidates. Ideally, this should be possible in 'real' time, so that the relationship of predicted outcome to specific outcome can be established using a data set that satisfactorily reflects the typical test-taking population. The primary objective of this project, therefore, was to create an instrument built on a framework that describes the language of performance in a way that can be readily accessed by evaluators who are familiar with the tests being observed. This work is designed to be complementary to the use of transcriptions and to provide an additional source of validation evidence.

The FCE was chosen as the focus of this study for a number of reasons:

· It is 'stable', in that it is neither under review nor due to be reviewed.
· It represents the middle of the ALTE (and UCLES Main Suite) range and is the most widely subscribed test in the battery.
· It offers the most likelihood of a wide range of performance of any Main Suite examination: as it is often used as an 'entry-point' into the suite, candidates tend to range from below to above this level in terms of ability.
· Like all of the other Main Suite examinations, a database of recordings (audio and video) already existed.

IV The development of the observation checklists

Weir (1993), building on the earlier work of Bygate (1988), suggests that the language of a speaking test can be described in terms of the informational and interactional functions, and those of interaction management, generated by the participants involved. With this as a starting point, a group of researchers at the University of Reading were commissioned by UCLES EFL to examine the spoken language, second language acquisition and language testing literatures to come up with an initial set of such functions (see Schegloff et al., 1977; Schwartz, 1980; van Ek and Trim, 1984; Bygate, 1988; Shohamy, 1988; 1994; Walker, 1990; Weir, 1994; Stenstrom, 1994; Chalhoub-Deville, 1995b; Hayashi, 1995; Ellerton, 1997; Suhua, 1998; Kormos, 1999; O'Sullivan, 2000; O'Loughlin, 2001).

These were then presented as a draft set of three checklists (Appendix 1), representing each of the elements of Weir's categorization. What follows in the three phases of the development process described below (Section VI) was an attempt to customize the checklist to more closely reflect the intended outcomes of spoken language test tasks in the UCLES Main Suite. The checklists were designed to help establish which of these functions resulted and which were absent.

The next concern was with the development of a procedure for devising a 'working' version of the checklists, to be followed by an evaluation of using this type of instrument in 'real' time (using tapes or perhaps live speaking tests).

V The development model

The process through which the checklists were developed is shown in Figure 3. The concept that drives this model is the evaluation at each level by different stakeholders. At this stage of the project, these stakeholders were identified as:

Figure 3 The development model


· the consulting 'expert' testers (the University of Reading group);
· the test development and validation staff at UCLES;
· UCLES Senior Team Leaders (i.e., key staff in the oral examiner training system).

All these individuals participated in the application of each draft. It should also be noted that a number of drafts were anticipated.

VI The development process

In order to arrive at a working version of the checklists, a number of developmental phases were anticipated. At each phase the latest version (or draft) of the instruments was applied and this application evaluated.

Phase 1

The first attempt to examine how the draft checklists would be viewed and applied by a group of language teachers was conducted by ffrench (1999). Of the participants at the seminar, approximately 50% of the group reported that English (British English, American English or Australian English) was their first language, while the remaining 50% were native Greek speakers.

In their introduction to the application of the Observation Checklists (OCs), the participants were given a series of activities that focused on the nature and use of those functions of language seen by task designers at UCLES to be particularly applicable to their EFL Main Suite Speaking Tests (principally FCE, CAE and CPE). Once familiar with the nature of the functions (and where they might occur in a test), the participants applied the OCs in 'real' time to an FCE Speaking Test from the 1998 Standardization Video. This video featured a pair of French speakers who were judged by a panel of 'expert' raters (within UCLES) to be slightly above the criterion ('pass') level.

Of the 37 participants, 32 completed the task successfully: that is, they attempted to make frequency counts of the items represented in the OCs. Among this group there appear to be varying degrees of agreement as to the use of language functions, particularly in terms of the specific number of observations of each function. However, when the data are examined from the perspective of agreement on whether a particular function was observed or not (ignoring the count, which in retrospect was highly ambitious when we consider the lack of systematic training in the use of the questionnaires given to the teachers who attended), we find that there is a striking degree of agreement on all but a small number of functions (Appendix 2). Note here that, in order to make these patterns of behaviour clear, the data have been sorted both horizontally and vertically by the total number of observations made by each participant and of each item.

From this perspective, this aspect of the developmental process was considered to be quite successful. However, it was apparent that there were a number of elements within the checklists that were causing some difficulty. These are highlighted in the table by the tram-lines. Items above the lines have been identified by some participants, in one case by a single person, while those below have been observed by a majority of participants (in two cases by all of them). For these cases we might infer a high degree of agreement. However, the middle range of items appears to have caused a degree of confusion, and so are highlighted here, i.e., marked for further investigation.
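The sorting described above is a simple marginal-totals reordering of a participants-by-functions matrix. The sketch below (Python, with invented data; the 0/1 presence coding and all names are our illustration, not the study's actual analysis) shows how such a matrix can be ordered so that rarely observed and near-universally observed functions separate out from the ambiguous middle range, which is where the tram-lines of Appendix 2 fall.

```python
# Rows are participants, columns are checklist functions; a cell is 1
# if the participant marked the function as observed, 0 otherwise
# (raw frequency counts are ignored, as in the study).
functions = ["express opinion", "compare", "summarize", "persuade"]
matrix = [
    [1, 1, 0, 0],  # participant 1
    [1, 0, 1, 0],  # participant 2
    [1, 1, 0, 0],  # participant 3
]

# Column totals: how many participants observed each function.
col_totals = [sum(row[j] for row in matrix) for j in range(len(functions))]

# Order columns by total (ascending) and rows by each participant's
# total, mirroring the horizontal and vertical sorting of Appendix 2.
col_order = sorted(range(len(functions)), key=lambda j: col_totals[j])
row_order = sorted(range(len(matrix)), key=lambda i: sum(matrix[i]))
sorted_matrix = [[matrix[i][j] for j in col_order] for i in row_order]

for j in col_order:
    print(f"{functions[j]:16s} observed by {col_totals[j]} participant(s)")
```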

Phase 2

In this phase a much smaller gathering was organized, this time involving members of the development team as well as the three UK-based UCLES Senior Team Leaders. In advance of this meeting, all participants were asked to study the existing checklists and to exemplify each function with examples drawn from their experiences of the various UCLES Main Suite examinations. The resulting data were collated and presented as a single document that formed the basis of discussion during a day-long session. Participants were not made aware of the findings from Phase 1.

During this session many questions were asked of all aspects of the checklist, and a more streamlined version of the three sections was suggested. In addition to a number of participants making a written record of the discussions, the entire session was recorded. This proved to be a valuable reminder of the way in which particular changes came about and was used when the final decisions regarding inclusion, conflation or omission were being made. Although it is beyond the scope of this project to analyse this recording, when coupled with the earlier and revised documents it is in itself a valuable source of data, in that it provides a significant record of the developmental process.

Among the many interesting outcomes of this phase were the decisions either to rethink, to reorganize or to omit items from the initial list. These decisions were seen to mirror the results of the Phase 1 application quite closely. Of the 13 items identified in Phase 1 as being in need of review (7 were rarely observed, indicating a high degree of agreement that they were not in fact present, and 6 appeared to be confused, with very mixed reported observations), 7 were recommended for either omission or inclusion in other items by the panel, while the remaining 6 items were identified by them as being of value. Although no examples of the latter had appeared in the earlier data, the panel agreed that they represented language functions that the UCLES Main Suite examinations were intended to elicit. It was also decided that each item in this latter group was in need of further clarification and/or exemplification. Of the remaining 17 items:

· two were changed: the item 'analysing' was recoded as 'staging' in order to clarify its intended meaning, while it was decided to separate the item '(dis)agreeing' into its two separate components;
· three were omitted: it was argued that the item 'providing non-personal information' referred to what was happening with the other items in the informational function category, while the items 'explaining' and 'justifying/supporting' were not functions usually associated with the UCLES Main Suite tasks, and no occurrences of these had been noted.

We would emphasize that, as reported in Section IV above, the initial list was developed to cover the language functions that various spoken language test tasks might elicit. The development of the checklists described here reflects an attempt to customize the lists in line with the intended functional outcomes of a specific set of tests.

We are, of course, aware that closed instruments of this type may be open to the criticism that valuable information could be lost. However, for reasons of practicality, we felt it necessary to limit the list to what the examinations were intended to elicit, rather than attempt to operationalize a full inventory. Secondly, any functions that appeared in the data that were not covered by the reduced list would have been noted. There appeared to be no cases of this.

The data from these two phases were combined to result in a working version of the checklists (Appendix 3), which was then applied to a pair of FCE Speaking Tests in Phase 3.

Phase 3

In the third phase the revised checklists were given to a group of 15 MA TEFL students, who were asked to apply them to two FCE tests. Both of these tests involved a mixed-sex pair of learners: one pair of approximately average ability and the other pair above average. Before using the observation checklists (OCs), the students were asked first to attempt to predict which functions they might expect to find. To help in this pre-session task, the students were given details of the FCE format and tasks.


Unfortunately, a small number of students did not manage to complete the observation task, as they were somewhat overwhelmed with the real-time application of the checklists. As a result, only 12 sets of completed checklists were included in the final analysis.

Prior to the session, the group was given an opportunity to have a practice run using a third FCE examination. While this 'training' period, coupled with the pre-session task, was intended to provide the students with the background they needed to apply the checklists consistently, there was a problem during the session itself. This problem was caused by the failure of a number of students to note the change from Task 3 to Task 4 in the first test observed. This was possibly caused by a lack of awareness of the test itself, and was not helped by the seamless way in which the examiner on the video moved from a two-way discussion involving the test-takers to a three-way discussion. This meant that a full set of data exists only for the first two tasks of this test. As the problem was noticed in time, the second test did not cause these problems. Unlike the earlier seminar, on this occasion the participants were asked only to record each function when it was first observed. This was done as it was felt that the earlier seminar showed that, without extensive training, it would be far too difficult to apply the OCs fully in 'real' time in order to generate comprehensive frequency counts. We are aware that a full tally would enable us to draw more precise conclusions about the relative frequency of occurrence of these functions and the degree of consensus (reliability) of observers.

Against this, we must emphasize that the checklists, in their current stage of development, are designed to be used in real time. Their use was therefore restricted to determining the presence or absence of a particular function. Rater agreement in this case is limited to a somewhat crude account of whether a function occurred or did not occur in a particular task performance. We do not, therefore, have evidence of whether the function observed was invariant across raters.

The results from this session are included as Appendix 4. It can be seen from this table that the participants again display mixed levels of agreement, ranging from a single perceived observation to total agreement. As with the earlier session, it appears that there is relatively broad agreement on a range of functions, but that others appear to be more difficult to identify easily. These difficulties appear to be greatest where the task involves a degree of interaction between the test-takers.
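Because the data record only presence or absence, rater agreement can be summarized by how one-sided the positive marks are for each function. A minimal sketch of such a tally follows (our illustration only: the cut-offs for the 'L', 'S' and 'G' bands are assumptions chosen for the example, since the article does not state the thresholds behind the labels used in Appendix 4).

```python
def band(count: int, n_raters: int) -> str:
    """Map the number of raters who marked a function as observed
    to a rough agreement-that-it-occurred label, in the spirit of
    the L/S/G codes of Appendix 4. Cut-offs are illustrative only."""
    p = count / n_raters
    if p >= 0.7:
        return "G"   # good agreement that the function occurred
    if p >= 0.4:
        return "S"   # some agreement
    return "L"       # little agreement

# Invented counts for three functions judged by 12 observers.
counts = {"expressing opinions": 11, "summarizing": 1, "speculating": 7}
for fn, c in counts.items():
    print(f"{fn:20s} {c:2d}/12 -> {band(c, 12)}")
```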

Phase 4

In this phase a transcription was made of the second of the two interviews used in Phase 3, since there was a full set of data available for this interview. The OCs were then 'mapped' on to this transcript in order to give an overview, from a different perspective, of what functions were generated (it being felt that this map would result in an accurate description of the test in terms of the items included in the OCs). This mapping was carried out by two researchers, who initially worked independently of each other but discussed their finished work in order to arrive at a consensus.

Finally, the results of Phases 2 and 3 were compared (Appendix 5). This clearly indicates that the checklists are now working well. There are still some problems in items such as 'staging' and 'describing', and feedback from participants suggests that this may be due to misunderstandings or misinterpretations of the gloss and examples used. In addition, there are some similar difficulties with the initial three items in the interactional functions checklist, in which the greatest difficulties in applying the checklists appear to lie.
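This comparison can be pictured as a merge of two evidence sources: a transcript map recording whether each function in fact occurred, and the real-time checklist results recording how consistently observers caught it. The sketch below uses hypothetical values echoing the T/L/S/G coding of Appendix 5; none of the figures are drawn from the study.

```python
# Hypothetical data for one task: the transcript map (did the function
# occur?) and the real-time agreement band among checklist users.
transcript = {"expressing opinions": True, "staging": True, "summarizing": True}
agreement = {"expressing opinions": "G", "staging": "L", "agreeing": "S"}

# Functions flagged in the transcript but caught with little real-time
# agreement point to glosses or examples that may need revision.
for fn in sorted(set(transcript) | set(agreement)):
    t = "T" if transcript.get(fn) else "-"
    print(f"{fn:20s} transcript={t}  checklist={agreement.get(fn, '-')}")
```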

VII Discussion and initial conclusions

The results of this study appear to substantiate our belief that, although still under development for use with the UCLES Main Suite examinations, an operational version of these checklists is certainly feasible and has potentially wider application, mutatis mutandis, to the content validation of other spoken language tests. Further refinement of the checklists is clearly required, although the developmental process adopted here appears to have borne positive results.

1 Validities

We would not wish to claim that the checklists on their own offer a satisfactory demonstration of the construct validity of a spoken language test, for, as Messick argues (1989: 16), 'the varieties of evidence supporting validity are not alternatives but rather supplements to one another'. We recognize the necessity for a broad view of 'the evidential basis for test interpretation' (Messick, 1989: 20). Bachman (1990: 237) similarly concludes: 'it is important to recognise that none of these [evidences of validity] by itself is sufficient to demonstrate the validity of a particular interpretation or use of test scores' (see also Bachman, 1990: 243). Fulcher (1999: 224) adds a further caveat against an overly narrow interpretation of content validity when he quotes Messick (1989: 41):

the major problem is that so-called content validity is focused upon test forms rather than test scores, upon instruments rather than measurements ... selecting content is an act of classification, which is in itself a hypothesis that needs to be confirmed empirically.


Like these authors, we regard as inadequate any conceptualization of validity that does not involve the provision of evidence on a number of levels, but would argue strongly that, without a clear idea of the match between intended content and actual content, any comprehensive investigation of the construct validity of a test is built on sand. Defining the construct is, in our view, underpinned by establishing the nature of the actual performances elicited by test tasks, i.e., the true content of tasks.

2 Present and future applications of observational checklists

Versions of the checklists require a degree of training and practice, similar to that given to raters, if a reliable and consistent outcome is to be expected. This requires that standardized training materials be developed alongside the checklists. In the case of these checklists, this process has already begun, with the initial versions piloted during Phase 3 of the project.

The checklists have great potential as an evaluative tool and can provide comprehensive insight into various issues. It is hoped that, amongst other issues, the checklists will provide insights into the following:

· the language functions that the different task-types (and different sub-tasks within these) employed in the UCLES Main Suite Paper 5 (Speaking) Tests typically elicit;
· the language that the pair-format elicits, and how it differs in nature and quality from that elicited by interlocutor-single candidate testing;
· the extent to which there is functional variation across the top four levels of the UCLES Main Suite Spoken Language Test.

In addition to these issues, the way in which the checklists can be applied may allow for other important questions to be answered. For example, by allowing the evaluator multiple observations (stopping and starting a recording of a test at will), it will be possible to establish whether there are quantifiable differences in the language functions generated by the different tasks, i.e., the evaluators will have the time they need to make frequency counts of the functions.

While the results to date have focused on a posteriori validation procedures, these checklists are also relevant to task design. By taking into account the expected response of a task (and by describing that response in terms of these functions), it will be possible to explore predicted and actual test-task outcome. It will also be a useful guide for item writers in taking a priori decisions about content coverage. Through this approach it should be possible to predict more accurately the linguistic response (in terms of the elements of the checklists) and to apply this to the design of test tasks, and of course to evaluate the success of the prediction later on. In the longer term this will lead to a greater understanding of how tasks and task formats can be manipulated to result in specific language use. We are not claiming that it is possible to predict language use at a micro level (grammatical form or lexis), but that it is possible to predict informational and interactional functions and features of interaction management, a notion supported by Bygate (1999).
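At its simplest, this a priori comparison reduces to set operations over function labels. A minimal sketch follows (the function sets are hypothetical, not drawn from any actual task specification):

```python
# Functions the item writers intend a task to elicit, vs. the functions
# the observation checklists actually recorded for that task.
predicted = {"expressing opinions", "comparing", "speculating", "agreeing"}
observed = {"expressing opinions", "comparing", "describing"}

print("elicited as intended:", sorted(predicted & observed))
print("intended but absent:  ", sorted(predicted - observed))
print("unplanned but observed:", sorted(observed - predicted))
```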

The checklists should also enable us to explore how systematic variation in such areas as interviewer questioning behaviour (and interlocutor frame adherence) affects the language produced in this type of test. In the interview transcribed for this study, for example, the examiner directed his questions very deliberately (systematically aiming the questions at one participant and then the other). This tended to stifle any spontaneity in the intended three-way discussion (Task 4), so occurrences of Interactional and Discourse Management Functions did not materialize to the extent intended by the task designers. It is possible that a less deliberate (unscripted) questioning technique would lead to a less interviewer-oriented interaction pattern and allow for the more genuine interactive communication envisaged in the task design.

Perhaps the most valuable contribution that this type of validation procedure offers is its potential to improve the quality of oral assessment in both low-stakes and high-stakes contexts. By offering the investigator an instrument that can be used in real time, the checklists broaden the scope of investigation from limited case study analysis of small numbers of test transcripts to large-scale field studies across a wide range of testing contexts.

Acknowledgements

We would like to thank Don Porter and Rita Green for their early input into the first version of the checklist. In addition, help was received from members of the ELT division in UCLES, in particular from Angela ffrench, Lynda Taylor and Christina Rimini, from a group of UCLES Senior Team Leaders, and from MA TEFL students at the University of Reading. Finally, we would like to thank the editors and anonymous reviewers of Language Testing for their insightful comments and helpful suggestions for its improvement. The faults that remain are, as ever, ours.


VIII References

Anastasi, A. 1988: Psychological testing. 6th edition. New York: Macmillan.
Bachman, L.F. 1990: Fundamental considerations in language testing. Oxford: Oxford University Press.
Bachman, L.F. and Palmer, A.S. 1981: The construct validation of the FSI oral interview. Language Learning 31, 67–86.
—— 1996: Language testing in practice. Oxford: Oxford University Press.
Ballman, T.L. 1991: The oral task of picture description: similarities and differences in native and nonnative speakers of Spanish. In Teschner, R.V., editor, Assessing foreign language proficiency of undergraduates. AAUSC Issues in Language Program Direction. Boston: Heinle and Heinle, 221–31.
Brown, A. 1998: Interviewer style and candidate performance in the IELTS oral interview. Paper presented at the Language Testing Research Colloquium, Monterey, CA.
Bygate, M. 1988: Speaking. Oxford: Oxford University Press.
—— 1999: Quality of language and purpose of task: patterns of learners' language on two oral communication tasks. Language Teaching Research 3, 185–214.
Chalhoub-Deville, M. 1995a: Deriving oral assessment scales across different tests and rater groups. Language Testing 12, 16–33.
—— 1995b: A contextualized approach to describing oral language proficiency. Language Learning 45, 251–81.
Clark, J.L.D. 1979: Direct vs. semi-direct tests of speaking ability. In Briere, E.J. and Hinofotis, F.B., editors, Concepts in language testing: some recent studies. Washington, DC: TESOL.
—— 1988: Validation of a tape-mediated ACTFL/ILR scale based test of Chinese speaking proficiency. Language Testing 5, 187–205.
Clark, J.L.D. and Hooshmand, D. 1992: 'Screen to Screen' testing: an exploratory study of oral proficiency interviewing using video teleconferencing. System 20, 293–304.
Cronbach, L.J. 1971: Validity. In Thorndike, R.L., editor, Educational measurement. 2nd edition. Washington, DC: American Council on Education, 443–597.
—— 1990: Essentials of psychological testing. 5th edition. New York: Harper & Row.
Davies, A. 1977: The construction of language tests. In Allen, J.P.B. and Davies, A., editors, Testing and experimental methods. The Edinburgh Course in Applied Linguistics Volume 4. London: Oxford University Press, 38–194.
—— 1990: Principles of language testing. Oxford: Blackwell.
Ellerton, A.W. 1997: Considerations in the validation of semi-direct oral testing. Unpublished PhD thesis, CALS, University of Reading.
ffrench, A. 1999: Language functions and UCLES speaking tests. Seminar in Athens, Greece, October 1999.
Foster, P. and Skehan, P. 1996: The influence of planning and task type on second language performance. Studies in Second Language Acquisition 18, 299–323.
—— 1999: The influence of source of planning and focus of planning on task-based performance. Language Teaching Research 3, 215–47.
Fulcher, G. 1994: Some priority areas for oral language testing. Language Testing Update 15, 39–47.
—— 1996: Testing tasks: issues in task design and the group oral. Language Testing 13, 23–51.
—— 1999: Assessment in English for academic purposes: putting content validity in its place. Applied Linguistics 20, 221–36.
Hayashi, M. 1995: Conversational repair: a contrastive study of Japanese and English. MA Project Report, University of Canberra.
Henning, G. 1983: Oral proficiency testing: comparative validities of interview, imitation and completion methods. Language Learning 33, 315–32.
—— 1987: A guide to language testing. Cambridge, MA: Newbury House.
Kelly, R. 1978: On the construct validation of comprehension tests: an exercise in applied linguistics. Unpublished PhD thesis, University of Queensland.
Kenyon, D. 1995: An investigation of the validity of task demands on performance-based tests of oral proficiency. In Kunnan, A.J., editor, Validation in language assessment: selected papers from the 17th Language Testing Research Colloquium, Long Beach. Mahwah, NJ: Lawrence Erlbaum, 19–40.
Kormos, J. 1999: Simulating conversations in oral-proficiency assessment: a conversation analysis of role plays and non-scripted interviews in language exams. Language Testing 16, 163–88.
Lazaraton, A. 1992: The structural organisation of a language interview: a conversational analytic perspective. System 20, 373–86.
—— 1996: A qualitative approach to monitoring examiner conduct in the Cambridge assessment of spoken English (CASE). In Milanovic, M. and Saville, N., editors, Performance testing, cognition and assessment: selected papers from the 15th Language Testing Research Colloquium, Cambridge and Arnhem. Studies in Language Testing 3. Cambridge: University of Cambridge Local Examinations Syndicate, 18–33.
—— 2000: A qualitative approach to the validation of oral language tests. Studies in Language Testing Volume 14. Cambridge: Cambridge University Press.
Lumley, T. and O'Sullivan, B. 2000: The effect of speaker and topic variables on task performance in a tape-mediated assessment of speaking. Paper presented at the 2nd Annual Asian Language Assessment Research Forum, The Hong Kong Polytechnic University.
Luoma, S. 1997: Comparability of a tape-mediated and a face-to-face test of speaking: a triangulation study. Unpublished Licentiate Thesis, Centre for Applied Language Studies, Jyvaskyla University, Finland.
McNamara, T. 1996: Measuring second language performance. London: Longman.
Mehnert, U. 1998: The effects of different lengths of time for planning on second language performance. Studies in Second Language Acquisition 20, 83–108.
Messick, S. 1975: The standard problem: meaning and values in measurement and evaluation. American Psychologist 30, 955–66.
—— 1989: Validity. In Linn, R.L., editor, Educational measurement. 3rd edition. New York: Macmillan.
Milanovic, M. and Saville, N. 1996: Introduction. Performance testing, cognition and assessment. Studies in Language Testing Volume 3. Cambridge: University of Cambridge Local Examinations Syndicate, 1–17.
Moller, A.D. 1982: A study in the validation of proficiency tests of English as a Foreign Language. Unpublished PhD thesis, University of Edinburgh.
Norris, J., Brown, J.D., Hudson, T. and Yoshioka, J. 1998: Designing second language performance assessments. Technical Report 18. Honolulu, HI: University of Hawaii Press.
O'Loughlin, K. 1995: Lexical density in candidate output on direct and semi-direct versions of an oral proficiency test. Language Testing 12, 217–37.
—— 1997: The comparability of direct and semi-direct speaking tests: a case study. Unpublished PhD Thesis, University of Melbourne, Melbourne.
—— 2001: An investigatory study of the equivalence of direct and semi-direct speaking skills. Studies in Language Testing 13. Cambridge: Cambridge University Press/UCLES.
Ortega, L. 1999: Planning and focus on form in L2 oral performance. Studies in Second Language Acquisition 20, 109–48.
O'Sullivan, B. 2000: Towards a model of performance in oral language testing. Unpublished PhD dissertation, CALS, University of Reading.
Robinson, P. 1995: Task complexity and second language narrative discourse. Language Learning 45, 99–140.
Ross, S. and Berwick, R. 1992: The discourse of accommodation in oral proficiency interviews. Studies in Second Language Acquisition 14, 159–76.
Saville, N. and Hargreaves, P. 1999: Assessing speaking in the revised FCE. ELT Journal 53, 42–51.
Schegloff, E., Jefferson, G. and Sachs, H. 1977: The preference for self-correction in the organisation of repair in conversation. Language 53, 361–82.
Schwartz, J. 1980: The negotiation for meaning: repair in conversations between second language learners of English. In Larsen-Freeman, D., editor, Discourse analysis in second language research. Rowley, MA: Newbury House.
Shohamy, E. 1983: The stability of oral language proficiency assessment in the oral interview testing procedure. Language Learning 33, 527–40.
—— 1988: A proposed framework for testing the oral language of second/foreign language learners. Studies in Second Language Acquisition 10, 165–79.
—— 1994: The validity of direct versus semi-direct oral tests. Language Testing 11, 99–123.
Shohamy, E., Reves, T. and Bejarano, Y. 1986: Introducing a new comprehensive test of oral proficiency. ELT Journal 40, 212–20.
Skehan, P. 1996: A framework for the implementation of task based instruction. Applied Linguistics 17, 38–62.
—— 1998: A cognitive approach to language learning. Oxford: Oxford University Press.
Stansfield, C.W. and Kenyon, D.M. 1992: Research on the comparability of the oral proficiency interview and the simulated oral proficiency interview. System 20, 347–64.
Stenstrom, A. 1994: An introduction to spoken interaction. London: Longman.
Suhua, H. 1998: A communicative test of spoken English for the CET 6. Unpublished PhD Thesis, Shanghai Jiao Tong University, Shanghai.
Upshur, J.A. and Turner, C. 1999: Systematic effects in the rating of second-language speaking ability: test method and learner discourse. Language Testing 16, 82–111.
van Ek, J.A. and Trim, J.L.M., editors, 1984: Across the threshold. Oxford: Pergamon.
van Lier, L. 1989: Reeling, writhing, drawling, stretching and fainting in coils: oral proficiency interviews as conversation. TESOL Quarterly 23, 489–508.
Walker, C. 1990: Large-scale oral testing. Applied Linguistics 11, 200–19.
Weir, C.J. 1983: Identifying the language needs of overseas students in tertiary education in the United Kingdom. Unpublished PhD thesis, University of London.
—— 1993: Understanding and developing language tests. Hemel Hempstead: Prentice Hall.
Wigglesworth, G. 1997: An investigation of planning time and proficiency level on oral test discourse. Language Testing 14, 85–106.
Wigglesworth, G. and O'Loughlin, K. 1993: An investigation into the comparability of direct and semi-direct versions of an oral interaction test in English. Melbourne Papers in Language Testing 2, 56–67.
Young, R. 1995: Conversational styles in language proficiency interviews. Language Learning 45, 3–42.
Young, R. and Milanovic, M. 1992: Discourse variation in oral proficiency interviews. Studies in Second Language Acquisition 14, 403–24.


Appendix 1 Items included in initial draft checklists (with short gloss)

Informational functions
Providing personal information: Give information on present circumstances; give information on past experiences; give information on future plans
Providing nonpersonal information: Give information which does not relate to the individual
Elaborating: Elaborate on an idea
Expressing opinions: Express opinions
Justifying opinions: Express reasons for assertions s/he has made
Comparing: Compare things/people/events
Complaining: Complain about something
Speculating: Hypothesize or speculate
Analysing: Separate out the parts of an issue
Making excuses: Make excuses
Explaining: Explain anything
Narrating: Describe a sequence of events
Paraphrasing: Paraphrase something
Summarizing: Summarize what s/he had said
Suggesting: Suggest a particular idea
Expressing preferences: Express preferences

Interactional functions
Challenging: Challenge assertions made by another speaker
(Dis)agreeing: Indicate (dis)agreement with what another speaker says (apart from 'yeah'/'no' or simply nodding)
Justifying/Providing support: Offer justification or support for a comment made by another speaker
Qualifying: Modify arguments or comments
Asking for opinions: Ask for opinions
Persuading: Attempt to persuade another person
Asking for information: Ask for information
Conversational repair: Repair breakdowns in interaction
Negotiating meaning: Check understanding; attempt to establish common ground or strategy; respond to requests for clarification; ask for clarification; make corrections; indicate purpose; indicate understanding/uncertainty

Managing interaction
Initiating: Start any interactions
Changing: Take the opportunity to change the topic
Reciprocity: Share the responsibility for developing the interaction
Deciding: Come to a decision
Terminating: Decide when the discussion should stop

Appendix 2 Phase 1 results (summarized)

(Table not reproduced. Its columns, sorted by total observations, listed the checklist items: make excuses; terminate; conversational repair; summarize; complain; paraphrase; persuade; change topic; challenge; qualify; ask for information; suggest; narrate; reciprocate; analyse; elaborate; initiate; provide nonpersonal information; explain; justify opinions; negotiate meaning; decide; (dis)agree; justify/support; ask for opinions; express preferences; speculate; compare; express opinion. Its rows were the individual participants.)


Appendix 3 Operational checklist (used in Phase 3)

Informational functions
Providing personal information: Give information on present circumstances; give information on past experiences; give information on future plans
Expressing opinions: Express opinions
Elaborating: Elaborate on or modify an opinion
Justifying opinions: Express reasons for assertions s/he had made
Comparing: Compare things/people/events
Speculating: Speculate
Staging: Separate out or interpret the parts of an issue
Describing: Describe a sequence of events; describe a scene
Summarizing: Summarize what s/he has said
Suggesting: Suggest a particular idea
Expressing preferences: Express preferences

Interactional functions
Agreeing: Agree with an assertion made by another speaker (apart from 'yeah' or nonverbal)
Disagreeing: Disagree with what another speaker says (apart from 'no' or nonverbal)
Modifying: Modify arguments or comments made by other speaker, or by the test-taker in response to another speaker
Asking for opinions: Ask for opinions
Persuading: Attempt to persuade another person
Asking for information: Ask for information
Conversational repair: Repair breakdowns in interaction
Negotiating meaning: Check understanding; indicate understanding of point made by partner; establish common ground/purpose or strategy; ask for clarification when an utterance is misheard or misinterpreted; correct an utterance made by other speaker which is perceived to be incorrect or inaccurate; respond to requests for clarification

Managing interaction
Initiating: Start any interactions
Changing: Take the opportunity to change the topic
Reciprocating: Share the responsibility for developing the interaction
Deciding: Come to a decision


Appendix 4 Summary of Phase 2 observation

Columns run in order: Tape 1, Tasks 1–4; then Tape 2, Tasks 1–4. Blank cells are omitted, so rows with fewer than eight entries had one or more empty cells in the original table.

Informational functions
Providing personal information
  Present: 12 (G), 1 (L), 1 (L), 1 (L), 12 (G), 1 (L), 4 (L)
  Past: 10 (G), 4 (S), 12 (G)
  Future: 11 (G), 3 (L), 6 (S), 12 (G)
Expressing opinions: 12 (G), 11 (G), 9 (G), 8 (G), 11 (G), 10 (G), 10 (G), 11 (G)
Elaborating: 9 (G), 11 (G), 9 (G), 7 (G), 3 (L), 9 (G), 7 (S), 12 (G)
Justifying opinions: 10 (G), 7 (G), 9 (G), 7 (G), 4 (L), 8 (S), 6 (S), 8 (S)
Comparing: 11 (G), 8 (G), 1 (L), 6 (S), 3 (L), 12 (G), 7 (S), 5 (S)
Speculating: 7 (S), 11 (G), 8 (G), 3 (L), 7 (S), 10 (G), 10 (G), 5 (S)
Staging: 6 (S), 1 (L), 3 (L), 6 (L)
Describing
  Sequence of events: 1 (L), 1 (L), 3 (L), 1 (L), 4 (L)
  Scene: 5 (S), 9 (G), 2 (S), 2 (S), 10 (G), 2 (S), 3 (S)
Summarizing: 1 (L), 1 (L), 1 (L), 1 (L), 3 (L), 1 (L), 1 (L), 1 (L)
Suggesting: 1 (L), 2 (L), 1 (L), 3 (L), 2 (L)
Expressing preferences: 12 (G), 11 (G), 6 (S), 8 (G), 11 (G), 10 (G), 5 (S), 12 (G)

Interactional functions
Agreeing: 6 (S), 9 (G), 2 (L), 10 (G), 4 (L)
Disagreeing: 9 (G), 4 (S), 2 (L), 6 (S)
Modifying: 1 (L), 5 (S), 4 (S), 7 (S), 1 (L)
Asking for opinions: 1 (L), 8 (G), 2 (L), 11 (G)
Persuading: 2 (L), 2 (L)
Asking for information: 2 (L), 1 (L), 5 (S)
Conversational repair: 5 (S), 4 (L), 1 (L)
Negotiating meaning
  Check meaning: 2 (L), 4 (S), 4 (L)
  Understanding: 5 (S), 3 (L), 3 (L)
  Common ground: 2 (L), 2 (L), 1 (L)
  Ask clarification: 2 (L), 1 (L), 2 (L)
  Correct utterance: 3 (L), 1 (L)
  Respond to requests for clarification: 4 (S), 1 (L)

Managing interaction
Initiating: 8 (G), 1 (L), 10 (G), 5 (S)
Changing: 8 (G), 7 (S)
Reciprocating: 7 (G), 9 (G), 1 (L)
Deciding: 3 (L), 1 (L), 1 (L), 2 (L)

Notes: The figures indicate the number of students that completed the task in each case. L = Little agreement; S = Some agreement; G = Good agreement. For Tasks 3 and 4 in the first tape observed, the maximum was 9; for all others the maximum was 12. This is because 3 of the 12 MA students did not complete the task for these last two tasks. This was not a problem during the observation of the second tape, so for all of these the maximum figures are 12.


Appendix 5 Transcript results and observation checklist results

Columns run in order: Task 1, Task 2, Task 3, Task 4. Blank cells are omitted, so rows with fewer than four entries had one or more empty cells in the original table.

Informational functions
Providing personal information
  Present: T G, L, T L
  Past: T G
  Future: T G
Expressing opinions: T G, T G, T G, T G
Elaborating: L, T G, T S, T G
Justifying opinions: L, T S, T S, T S
Comparing: L, T G, T S, S
Speculating: T S, T G, T G, S
Staging: T L, T S
Describing
  Sequence of events: T L, L, L
  Scene: T G, L, L
Summarizing: T L, L, L, L
Suggesting: L, L
Expressing preferences: T G, T G, S, T G

Interactional functions
Agreeing: T G, T L
Disagreeing: T S
Modifying: T S, T L
Asking for opinions: T G
Persuading: L
Asking for information: S
Conversational repair: T S, T L, L
Negotiating meaning
  Check meaning: L
  Understanding: L, L
  Common ground: L, L
  Ask clarification: L, T L
  Correct utterance: L
  Respond to requests for clarification: L

Managing interaction
Initiating: T G, T S
Changing: T S
Reciprocating: T G, L
Deciding: L, L

Notes: T indicates that this function has been identified as occurring in the transcript of the interaction. L, S and G indicate the degree of agreement among the raters using the checklists in real time (L = Little agreement; S = Some agreement; G = Good agreement).

Page 2: Using observation checklists to validate speaking-test taskspart. The format of the Main Suite Speaking Tests (with the exception of the Level 1 KET test) is summarized in Table 1

34 Validating speaking-test tasks

provide an effective and efficient tool for investigating variation in language produced by different task types, different tasks within task types and different interview organization at the proficiency levels in Figure 1. As such, they represent a unique attempt to validate the match between intended and actual test-taker language with respect to a blueprint of language functions representing the construct of spoken language ability in the UCLES tests of general language proficiency from PET to CPE level (for further information related to the different tests in the 'Main Suite' battery, see the individual handbooks produced by UCLES). Beyond this study, the application of such checklists has clear relevance for any test of spoken interaction.

The standard Cambridge approach in testing speaking is based on a paired format involving an interlocutor, an additional examiner and two candidates. Careful attention has been given to the tasks through which the spoken language performance is elicited in each different part. The format of the Main Suite Speaking Tests (with the exception of the Level 1 KET test) is summarized in Table 1.

Table 1 Format of the Main Suite Speaking Test

Part  Participants                       Task format
1     Interviewer–candidate              Interview: verbal questions
2     Candidate–candidate                Collaborative task: visual stimulus, verbal instructions
3     Interviewer–candidate–candidate    Long turns and discussion: written stimulus, verbal questions

II Issues in validating tests of oral performance

In considering the issue of the validity of a performance test¹ of speaking, we need a framework that describes the relationship between the construct being measured, the tasks used to operationalize that construct, and the assessment of the performances that are used to make inferences to that underlying ability.

There have been a number of models that have attempted to portray the relationship between a test-taker's knowledge of and ability to use a language and the score they receive in a test designed to evaluate that knowledge (e.g., Milanovic and Saville, 1996; McNamara, 1996; Skehan, 1998; Upshur and Turner, 1999).

¹ By performance tests we are referring to direct tests, where a test-taker's ability is evaluated from their performance on a set task or tasks.


Milanovic and Saville (1996) provide a useful overview of the variables that interact in performance testing and suggest a conceptual framework for setting out different avenues of research. The framework was influential in the revisions of the Cambridge examinations during the 1990s, including the development of the KET and CAE exams and revisions to PET, FCE and, most recently, CPE (for a summary of the UCLES approach, see Saville and Hargreaves, 1999).

The Milanovic and Saville framework is one of the earliest and most comprehensive of these models (reproduced here as Figure 2). This framework highlights the many factors (or facets) that must be considered when designing a test from which particular inferences are to be drawn about performances; all of the factors represented in the model pose potential threats to the reliability and validity of these inferences. From this model a framework can be derived through which a validation strategy can be devised for Speaking Tests such as those produced by UCLES.

The essential elements of this framework are:
· the test-taker;
· the interlocutor/examiner;
· the assessment criteria (scales);
· the task;
· the interactions between these elements.

[Figure 2, a flowchart, links the following elements: examination developer; specifications and construct; examination conditions; tasks; assessment criteria; assessment conditions and training; examiners (knowledge and ability); candidates (knowledge and ability); sample of language; score.]

Figure 2 A conceptual framework for performance testing
Source: adapted from Milanovic and Saville, 1996: 6


The subject of this study, the task, has been explored from a number of perspectives. Briefly, these have been:

· Task/method comparison (quantitative): involving studies in which comparisons are made between performances on different tasks or methods (Clark, 1979; 1988; Henning, 1983; Shohamy, 1983; Shohamy et al., 1986; Clark and Hooshmand, 1992; Stansfield and Kenyon, 1992; Wigglesworth and O'Loughlin, 1993; Chalhoub-Deville, 1995a; O'Loughlin, 1995; Fulcher, 1996; Lumley and O'Sullivan, 2000; O'Sullivan, 2000).

· Task/method comparison (qualitative): as above, but where qualitative methods are employed (Shohamy, 1994; Young, 1995; Luoma, 1997; O'Loughlin, 1997; Bygate, 1999; Kormos, 1999).

· Task performance (method effect): where aspects of the task are systematically manipulated, e.g., planning time, pre- or post-task operations, etc. (Foster and Skehan, 1996; 1999; Wigglesworth, 1997; Mehnert, 1998; Ortega, 1999; Upshur and Turner, 1999).

· Native speaker/Nonnative speaker comparison: where native speaker performance on specific tasks is compared to nonnative speaker performance on the same tasks (Weir, 1983; Ballman, 1991).

· Task difficulty/classification: where an attempt has been made to classify tasks in terms of their difficulty (Weir, 1993; Fulcher, 1994; Kenyon, 1995; Robinson, 1995; Skehan, 1996; 1998; Norris et al., 1998).

The central importance of the test task has been clearly recognized; however, in terms of test validation there is one question that has to date remained largely unexplored. Although there has been a great deal of debate over the validation of performance tests through analysis of the language generated in the performance of language elicitation tasks (LETs) (e.g., van Lier, 1989; Lazaraton, 1992; 1996), attention has not been drawn to the one aspect of task performance that would appear to be of most interest to the test designer. That is, when tasks are performed in a test event, how does that performance relate to the test designer's predictions or expectations, based on their definition or interpretation of the construct? After all, no matter how reliably the performance is scored, if it does not match the expectations of the test designer (in other words, represent the constructs which are to be tested), then the inferences that the test designer hopes to draw from the evaluated performance will not be valid.

Cronbach went to the heart of the matter (1971: 443): 'Construction of a test itself starts from a theory about behaviour or mental organization derived from prior research that suggests the ground plan for the test.' Davies (1977: 63) argued in similar vein: 'it is, after all, the theory on which all else rests; it is from there that the construct is set up and it is on the construct that validity of the content and predictive kinds is based.' Kelly (1978: 8) supported this view, commenting that 'the systematic development of tests requires some theory, even an informal, inexplicit one, to guide the initial selection of item content and the division of the domain of interest into appropriate sub-areas.'

Because we lack an adequate theory of language in use, a priori attempts to determine the construct validity of proficiency tests involve us in matters that relate more evidently to content validity. We need to talk of the communicative construct in descriptive terms, and as a result we become involved in questions of content relevance and content coverage. Thus, for Kelly (1978: 8) content validity seemed 'an almost completely overlapping concept' with construct validity, and for Moller (1982: 68) 'the distinction between construct and content validity in language testing is not always very marked, particularly for tests of general language proficiency'.

Content validity is considered important as it is principally concerned with the extent to which the selection of test tasks is representative of the larger universe of tasks of which the test is assumed to be a sample (see Bachman and Palmer, 1981; Henning, 1987: 94; Messick, 1989: 16; Bachman, 1990: 244). Similarly, Anastasi (1988: 131) defined content validity as involving 'essentially the systematic examination of the test content to determine whether it covers a representative sample of the behaviour domain to be measured'. She outlined (Anastasi, 1988: 132) the following guidelines for establishing content validity:

1) 'the behaviour domain to be tested must be systematically analysed to make certain that all major aspects are covered by the test items, and in the correct proportions';

2) 'the domain under consideration should be fully described in advance, rather than being defined after the test has been prepared';

3) 'content validity depends on the relevance of the individual's test responses to the behaviour area under consideration, rather than on the apparent relevance of item content'.

The directness of fit and adequacy of the test sample is thus dependent on the quality of the description of the target language behaviour being tested. In addition, if the responses to the item are invoked, Messick (1975: 961) suggests 'the concern with processes underlying test responses places this approach to content validity squarely in the realm of construct validity'. Davies (1990: 23) similarly notes: 'content validity slides into construct validity.'


Content validation is, of course, extremely problematic, given the difficulty we have in characterizing language proficiency with sufficient precision to ensure the validity of the representative sample we include in our tests, and the further threats to validity arising out of any attempts to operationalize real life behaviours in a test. Specifying operations, let alone the conditions under which these are performed, is challenging and at best relatively unsophisticated (see Cronbach, 1990). Weir (1993) provides an introductory attempt to specify the operations and conditions that might form a framework for test task description (see also Bachman, 1990; Bachman and Palmer, 1996).

The difficulties involved do not, however, absolve us from attempting to make our tests as relevant as possible in terms of content. Generating content related evidence is seen as a necessary, although not sufficient, part of the validation process of a speaking test. To this end, we sought to establish in this study an effective and efficient procedure for establishing the content validity of speaking tests. As well as being useful in helping specify the domain to be tested, we would argue that the checklist discussed below would enable the researcher to address how predicted vs. actual task performance can be compared.
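
By way of illustration (the sketch below is ours, not part of the UCLES procedure), the comparison at issue reduces to simple set operations: the designer predicts a set of functions for a task, observers record the set actually elicited, and the mismatches in both directions are the validation evidence. In Python, with invented function labels:

    # Compare the functions a task was designed to elicit with those observed.
    predicted = {'expressing opinions', 'justifying opinions', 'comparing', 'speculating'}
    observed = {'expressing opinions', 'comparing', 'agreeing'}

    not_elicited = predicted - observed   # intended but never produced
    unpredicted = observed - predicted    # produced but not anticipated

    print('Intended but not elicited:', sorted(not_elicited))
    print('Elicited but not predicted:', sorted(unpredicted))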

III Methodological issues

While it is relatively easy to rationalize the need to establish that the LETs used in performance tests are working as predicted (i.e., in terms of language generated), the difficulty lies in how this might best be done.

UCLES EFL (English as a foreign language) routinely collects audio recordings and carries out transcriptions of its Speaking Tests. These transcripts are used for a range of validation purposes, and in particular they contribute to revision projects for the Speaking Tests: for example, FCE, which was revised in 1996, and currently the revision of the International English Language Testing System (IELTS) Speaking Test, in addition to the CPE revision project.

In a series of UCLES studies focusing on the language of the Speaking Tests, Lazaraton has applied conversational analysis (CA) techniques to contribute to our understanding of the language used in pair-format Speaking Tests, including the language of the candidates and the interlocutor. Her approach requires a very careful, fine-tuned transcription of the tests in order to provide the data for analysis (see Lazaraton, 2000). Similar qualitative methodologies have been applied by Young and Milanovic (1992) – also to UCLES data – by Brown (1998), and by Ross and Berwick (1992), amongst others.


While there is clearly a great deal of potential for this detailed analysis of transcribed performances, there are also a number of drawbacks, the most serious of which involves the complexity of the transcription process. In practice, this means that a great deal of time and expertise is required in order to gain the kind of data that will answer the basic question concerning validity. Even where this is done, it is impractical to attempt to deal with more than a small number of test events; therefore, the generalizability of the results may be questioned.

Clearly, then, a more efficient methodology is required that allows the test designer to evaluate the procedures, and especially the tasks, in terms of the language produced by a larger number of candidates. Ideally, this should be possible in 'real' time, so that the relationship of predicted outcome to specific outcome can be established using a data set that satisfactorily reflects the typical test-taking population. The primary objective of this project, therefore, was to create an instrument built on a framework that describes the language of performance in a way that can be readily accessed by evaluators who are familiar with the tests being observed. This work is designed to be complementary to the use of transcriptions and to provide an additional source of validation evidence.

The FCE was chosen as the focus of this study for a number of reasons:

· It is 'stable', in that it is neither under review nor due to be reviewed.

· It represents the middle of the ALTE (and UCLES Main Suite) range and is the most widely subscribed test in the battery.

· It offers the most likelihood of a wide range of performance of any Main Suite examination: as it is often used as an 'entry-point' into the suite, candidates tend to range from below to above this level in terms of ability.

· Like all of the other Main Suite examinations, a database of recordings (audio and video) already existed.

IV The development of the observation checklists

Weir (1993), building on the earlier work of Bygate (1988), suggests that the language of a speaking test can be described in terms of the informational and interactional functions, and those of interaction management, generated by the participants involved. With this as a starting point, a group of researchers at the University of Reading were commissioned by UCLES EFL to examine the spoken language, second language acquisition and language testing literatures to come up with an initial set of such functions (see Schegloff et al., 1977; Schwartz, 1980; van Ek and Trim, 1984; Bygate, 1988; Shohamy, 1988; 1994; Walker, 1990; Weir, 1994; Stenstrom, 1994; Chalhoub-Deville, 1995b; Hayashi, 1995; Ellerton, 1997; Suhua, 1998; Kormos, 1999; O'Sullivan, 2000; O'Loughlin, 2001).

These were then presented as a draft set of three checklists (Appendix 1), representing each of the elements of Weir's categorization. What follows in the three phases of the development process described below (Section VI) was an attempt to customize the checklist to more closely reflect the intended outcomes of spoken language test tasks in the UCLES Main Suite. The checklists were designed to help establish which of these functions resulted and which were absent.
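
To make the shape of such an instrument concrete, the three-part checklist can be sketched as a mapping from category to functions, with each function ticked the first time it is heard. The labels below are a subset of those in Appendix 1; the recording mechanics are our own assumption rather than the published instrument:

    # Skeleton of an observation checklist: category -> functions (subset only).
    CHECKLIST = {
        'informational functions': [
            'providing personal information', 'expressing opinions',
            'justifying opinions', 'comparing', 'speculating', 'summarizing',
        ],
        'interactional functions': [
            '(dis)agreeing', 'asking for opinions', 'persuading',
            'conversational repair', 'negotiating meaning',
        ],
        'managing interaction': [
            'initiating', 'changing', 'reciprocating', 'deciding',
        ],
    }

    def new_sheet():
        """One sheet per task: every function starts as not yet observed."""
        return {f: False for funcs in CHECKLIST.values() for f in funcs}

    sheet = new_sheet()
    sheet['expressing opinions'] = True  # ticked on first occurrence, in real time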

The next concern was with the development of a procedure for devising a 'working' version of the checklists, to be followed by an evaluation of using this type of instrument in 'real' time (using tapes or perhaps live speaking tests).

V The development model

The process through which the checklists were developed is shown in Figure 3. The concept that drives this model is the evaluation at each level by different stakeholders. At this stage of the project, these stakeholders were identified as:

· the consulting 'expert' testers (the University of Reading group);
· the test development and validation staff at UCLES;
· UCLES Senior Team Leaders (i.e., key staff in the oral examiner training system).

Figure 3 The development model

All these individuals participated in the application of each draft. It should also be noted that a number of drafts were anticipated.

VI The development process

In order to arrive at a working version of the checklists, a number of developmental phases were anticipated. At each phase, the latest version (or draft) of the instruments was applied, and this application evaluated.

Phase 1

The first attempt to examine how the draft checklists would be viewed and applied by a group of language teachers was conducted by ffrench (1999). Of the participants at the seminar, approximately 50% of the group reported that English (British English, American English or Australian English) was their first language, while the remaining 50% were native Greek speakers.

In their introduction to the application of the Observation Checklists (OCs), the participants were given a series of activities that focused on the nature and use of those functions of language seen by task designers at UCLES to be particularly applicable to their EFL Main Suite Speaking Tests (principally FCE, CAE and CPE). Once familiar with the nature of the functions (and where they might occur in a test), the participants applied the OCs in 'real' time to an FCE Speaking Test from the 1998 Standardization Video. This video featured a pair of French speakers who were judged by a panel of 'expert' raters (within UCLES) to be slightly above the criterion ('pass') level.

Of the 37 participants, 32 completed the task successfully; that is, they attempted to make frequency counts of the items represented in the OCs. Among this group there appear to be varying degrees of agreement as to the use of language functions, particularly in terms of the specific number of observations of each function. However, when the data are examined from the perspective of agreement on whether a particular function was observed or not (ignoring the count, which in retrospect was highly ambitious when we consider the lack of systematic training in the use of the questionnaires given to the teachers who attended), we find that there is a striking degree of agreement on all but a small number of functions (Appendix 2). Note here that, in order to make these patterns of behaviour clear, the data have been sorted both horizontally and vertically by the total number of observations made by each participant and of each item.

From this perspective, this aspect of the developmental process was considered to be quite successful. However, it was apparent that there were a number of elements within the checklists that were causing some difficulty. These are highlighted in the table by the tram-lines. Items above the lines have been identified by some participants, in one case by a single person, while those below have been observed by a majority of participants (in two cases by all of them). For these cases we might infer a high degree of agreement. However, the middle range of items appears to have caused a degree of confusion, and so are highlighted here, i.e., marked for further investigation.

Phase 2

In this phase a much smaller gathering was organized, this time involving members of the development team as well as the three UK-based UCLES Senior Team Leaders. In advance of this meeting, all participants were asked to study the existing checklists and to exemplify each function with examples drawn from their experiences of the various UCLES Main Suite examinations. The resulting data were collated and presented as a single document that formed the basis of discussion during a day-long session. Participants were not made aware of the findings from Phase 1.

During this session many questions were asked of all aspects of the checklist, and a more streamlined version of the three sections was suggested. In addition to a number of participants making a written record of the discussions, the entire session was recorded. This proved to be a valuable reminder of the way in which particular changes came about, and was used when the final decisions regarding inclusion, conflation or omission were being made. Although it is beyond the scope of this project to analyse this recording, when coupled with the earlier and revised documents it is in itself a valuable source of data, in that it provides a significant record of the developmental process.

Among the many interesting outcomes of this phase were the decisions either to rethink, to reorganize or to omit items from the initial list. These decisions were seen to mirror the results of the Phase 1 application quite closely. Of the 13 items identified in Phase 1 as being in need of review (7 were rarely observed, indicating a high degree of agreement that they were not in fact present, and 6 appeared to be confused, with very mixed reported observations), 7 were recommended for either omission or inclusion in other items by the panel, while the remaining 6 items were identified by them as being of value. Although no examples of the latter had appeared in the earlier data, the panel agreed that they represented language functions that the UCLES Main Suite examinations were intended to elicit. It was also decided that each item in this latter group was in need of further clarification and/or exemplification. Of the remaining 17 items:

· two were changed: the item 'analysing' was recoded as 'staging' in order to clarify its intended meaning, while it was decided to separate the item '(dis)agreeing' into its two separate components;

· three were omitted: it was argued that the item 'providing non-personal information' referred to what was happening with the other items in the informational function category, while the items 'explaining' and 'justifying/supporting' were not functions usually associated with the UCLES Main Suite tasks, and no occurrences of these had been noted.

We would emphasize that, as reported in Section IV above, the initial list was developed to cover the language functions that various spoken language test tasks might elicit. The development of the checklists described here reflects an attempt to customize the lists in line with the intended functional outcomes of a specific set of tests.

We are, of course, aware that closed instruments of this type may be open to the criticism that valuable information could be lost. However, for reasons of practicality, we felt it necessary to limit the list to what the examinations were intended to elicit, rather than attempt to operationalize a full inventory. Secondly, any functions that appeared in the data that were not covered by the reduced list would have been noted. There appeared to be no cases of this.

The data from these two phases were combined to result in a working version of the checklists (Appendix 3), which was then applied to a pair of FCE Speaking Tests in Phase 3.

Phase 3

In the third phase, the revised checklists were given to a group of 15 MA TEFL students, who were asked to apply them to two FCE tests. Both of these tests involved a mixed-sex pair of learners: one pair of approximately average ability and the other pair above average. Before using the observation checklists (OCs), the students were asked first to attempt to predict which functions they might expect to find. To help in this pre-session task, the students were given details of the FCE format and tasks.


Unfortunately, a small number of students did not manage to complete the observation task, as they were somewhat overwhelmed with the real-time application of the checklists. As a result, only 12 sets of completed checklists were included in the final analysis.

Prior to the session, the group was given an opportunity to have a practice run, using a third FCE examination. While this 'training' period, coupled with the pre-session task, was intended to provide the students with the background they needed to apply the checklists consistently, there was a problem during the session itself. This problem was caused by the failure of a number of students to note the change from Task 3 to Task 4 in the first test observed. This was possibly caused by a lack of awareness of the test itself, and was not helped by the seamless way in which the examiner on the video moved from a two-way discussion involving the test-takers to a three-way discussion. This meant that a full set of data exists only for the first two tasks of this test. As the problem was noticed in time, the second test did not cause these problems. Unlike the earlier seminar, on this occasion the participants were asked only to record each function when it was first observed. This was done as it was felt that the earlier seminar showed that, without extensive training, it would be far too difficult to apply the OCs fully in 'real' time in order to generate comprehensive frequency counts. We are aware that a full tally would enable us to draw more precise conclusions about the relative frequency of occurrence of these functions and the degree of consensus (reliability) of observers.

Against this, we must emphasize that the checklists, in their current stage of development, are designed to be used in real time. Their use was therefore restricted to determining the presence or absence of a particular function. Rater agreement in this case is limited to a somewhat crude account of whether a function occurred or did not occur in a particular task performance. We do not, therefore, have evidence of whether the function observed was invariant across raters.
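
Presence/absence agreement of this kind is nevertheless easy to tabulate. A minimal sketch, in the spirit of the L/S/G labels used in Appendices 4 and 5, classifies each function by the proportion of observers who ticked it; the numeric cut-offs below are our own illustrative assumption, as the study does not publish them:

    # Band observer agreement on whether a function occurred in a task.
    def agreement_band(ticks, observers):
        """ticks: how many of the observers recorded the function."""
        p = ticks / observers
        if p >= 0.75:
            return 'G'  # Good agreement that the function occurred
        if p >= 0.40:
            return 'S'  # Some agreement
        return 'L'      # Little agreement

    # Hypothetical tallies for one task, out of 12 observers.
    tallies = {'expressing opinions': 12, 'speculating': 7, 'summarizing': 1}
    for function, ticks in tallies.items():
        print(f'{function}: {ticks} ({agreement_band(ticks, 12)})')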

The results from this session are included as Appendix 4. It can be seen from this table that the participants again display mixed levels of agreement, ranging from a single perceived observation to total agreement. As with the earlier session, it appears that there is relatively broad agreement on a range of functions, but that others appear to be more difficult to identify easily. These difficulties appear to be greatest where the task involves a degree of interaction between the test-takers.

Phase 4

In this phase, a transcription was made of the second of the two interviews used in Phase 3, since there was a full set of data available for this interview. The OCs were then 'mapped' on to this transcript in order to give an overview, from a different perspective, of what functions were generated (it being felt that this map would result in an accurate description of the test in terms of the items included in the OCs). This mapping was carried out by two researchers who initially worked independently of each other, but discussed their finished work in order to arrive at a consensus.

Finally, the results of Phases 2 and 3 were compared (Appendix 5). This clearly indicates that the checklists are now working well. There are still some problems in items such as 'staging' and 'describing', and feedback from participants suggests that this may be due to misunderstandings or misinterpretations of the gloss and examples used. In addition, there are some similar difficulties with the initial three items in the interactional functions checklist, in which the greatest difficulties in applying the checklists appear to lie.
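
The comparison reported in Appendix 5 amounts to a per-function join of the two evidence sources: whether the function was identified in the transcript (T) and the real-time agreement band. A hedged sketch of that tabulation, with invented data:

    # Join transcript evidence with real-time checklist agreement, per function.
    in_transcript = {'expressing opinions', 'staging', 'summarizing'}
    checklist_band = {'expressing opinions': 'G', 'staging': 'L', 'suggesting': 'L'}

    for function in sorted(in_transcript | set(checklist_band)):
        t = 'T' if function in in_transcript else '-'
        print(f'{function:22} {t} {checklist_band.get(function, "-")}')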

VII Discussion and initial conclusions

The results of this study appear to substantiate our belief that, although still under development for use with the UCLES Main Suite examinations, an operational version of these checklists is certainly feasible, and has potentially wider application, mutatis mutandis, to the content validation of other spoken language tests. Further refinement of the checklists is clearly required, although the developmental process adopted here appears to have borne positive results.

1 Validities

We would not wish to claim that the checklists on their own offer a satisfactory demonstration of the construct validity of a spoken language test, for, as Messick argues (1989: 16), 'the varieties of evidence supporting validity are not alternatives but rather supplements to one another'. We recognize the necessity for a broad view of 'the evidential basis for test interpretation' (Messick, 1989: 20). Bachman (1990: 237) similarly concludes: 'it is important to recognise that none of these [evidences of validity] by itself is sufficient to demonstrate the validity of a particular interpretation or use of test scores' (see also Bachman, 1990: 243). Fulcher (1999: 224) adds a further caveat against an overly narrow interpretation of content validity when he quotes Messick (1989: 41):

the major problem is that so-called content validity is focused upon test forms rather than test scores, upon instruments rather than measurements. Selecting content is an act of classification, which is in itself a hypothesis that needs to be confirmed empirically.


Like these authors, we regard as inadequate any conceptualization of validity that does not involve the provision of evidence on a number of levels, but would argue strongly that, without a clear idea of the match between intended content and actual content, any comprehensive investigation of the construct validity of a test is built on sand. Defining the construct is, in our view, underpinned by establishing the nature of the actual performances elicited by test tasks, i.e., the true content of tasks.

2 Present and future applications of observational checklists

Versions of the checklists require a degree of training and practice, similar to that given to raters, if a reliable and consistent outcome is to be expected. This requires that standardized training materials be developed alongside the checklists. In the case of these checklists, this process has already begun, with the initial versions piloted during Phase 3 of the project.

The checklists have great potential as an evaluative tool and can provide comprehensive insight into various issues. It is hoped that, amongst other issues, the checklists will provide insights into the following:

· the language functions that the different task-types (and different sub-tasks within these) employed in the UCLES Main Suite Paper 5 (Speaking) Tests typically elicit;

· the language that the pair-format elicits, and how it differs in nature and quality from that elicited by interlocutor–single candidate testing;

· the extent to which there is functional variation across the top four levels of the UCLES Main Suite Spoken Language Test.

In addition to these issues, the way in which the checklists can be applied may allow for other important questions to be answered. For example, by allowing the evaluator multiple observations (stopping and starting a recording of a test at will), it will be possible to establish whether there are quantifiable differences in the language functions generated by the different tasks, i.e., the evaluators will have the time they need to make frequency counts of the functions.

While the results to date have focused on a posteriori validation procedures, these checklists are also relevant to task design. By taking into account the expected response of a task (and by describing that response in terms of these functions), it will be possible to explore predicted and actual test task outcome. It will also be a useful guide for item writers in taking a priori decisions about content coverage. Through this approach, it should be possible to predict more accurately linguistic response (in terms of the elements of the checklists) and to apply this to the design of test tasks – and, of course, to evaluate the success of the prediction later on. In the longer term, this will lead to a greater understanding of how tasks and task formats can be manipulated to result in specific language use. We are not claiming that it is possible to predict language use at a micro level (grammatical form or lexis), but that it is possible to predict informational and interactional functions and features of interaction management – a notion supported by Bygate (1999).

The checklists should also enable us to explore how systematic variation in such areas as interviewer questioning behaviour (and interlocutor frame adherence) affects the language produced in this type of test. In the interview transcribed for this study, for example, the examiner directed his questions very deliberately (systematically aiming the questions at one participant and then the other). This tended to stifle any spontaneity in the intended three-way discussion (Task 4), so occurrences of Interactional and Discourse Management Functions did not materialize to the extent intended by the task designers. It is possible that a less deliberate (unscripted) questioning technique would lead to a less interviewer-oriented interaction pattern, and allow for the more genuine interactive communication envisaged in the task design.

Perhaps the most valuable contribution that this type of validation procedure offers is its potential to improve the quality of oral assessment in both low-stakes and high-stakes contexts. By offering the investigator an instrument that can be used in real time, the checklists broaden the scope of investigation from limited case study analysis of small numbers of test transcripts to large scale field studies across a wide range of testing contexts.

Acknowledgements

We would like to thank Don Porter and Rita Green for their early input into the first version of the checklist. In addition, help was received from members of the ELT division in UCLES, in particular from Angela ffrench, Lynda Taylor and Christina Rimini; from a group of UCLES Senior Team Leaders; and from MA TEFL students at the University of Reading. Finally, we would like to thank the editors and anonymous reviewers of Language Testing for their insightful comments and helpful suggestions for its improvement. The faults that remain are, as ever, ours.


VIII References

Anastasi, A. 1988: Psychological testing. 6th edition. New York: Macmillan.
Bachman, L.F. 1990: Fundamental considerations in language testing. Oxford: Oxford University Press.
Bachman, L.F. and Palmer, A.S. 1981: The construct validation of the FSI oral interview. Language Learning 31, 67–86.
Bachman, L.F. and Palmer, A.S. 1996: Language testing in practice. Oxford: Oxford University Press.
Ballman, T.L. 1991: The oral task of picture description: similarities and differences in native and nonnative speakers of Spanish. In Teschner, R.V., editor, Assessing foreign language proficiency of undergraduates. AAUSC Issues in Language Program Direction. Boston: Heinle and Heinle, 221–31.
Brown, A. 1998: Interviewer style and candidate performance in the IELTS oral interview. Paper presented at the Language Testing Research Colloquium, Monterey, CA.
Bygate, M. 1988: Speaking. Oxford: Oxford University Press.
Bygate, M. 1999: Quality of language and purpose of task: patterns of learners' language on two oral communication tasks. Language Teaching Research 3, 185–214.
Chalhoub-Deville, M. 1995a: Deriving oral assessment scales across different tests and rater groups. Language Testing 12, 16–33.
Chalhoub-Deville, M. 1995b: A contextualized approach to describing oral language proficiency. Language Learning 45, 251–81.
Clark, J.L.D. 1979: Direct vs. semi-direct tests of speaking ability. In Briere, E.J. and Hinofotis, F.B., editors, Concepts in language testing: some recent studies. Washington, DC: TESOL.
Clark, J.L.D. 1988: Validation of a tape-mediated ACTFL/ILR scale based test of Chinese speaking proficiency. Language Testing 5, 187–205.
Clark, J.L.D. and Hooshmand, D. 1992: 'Screen to Screen' testing: an exploratory study of oral proficiency interviewing using video teleconferencing. System 20, 293–304.
Cronbach, L.J. 1971: Validity. In Thorndike, R.L., editor, Educational measurement. 2nd edition. Washington, DC: American Council on Education, 443–597.
Cronbach, L.J. 1990: Essentials of psychological testing. 5th edition. New York: Harper & Row.
Davies, A. 1977: The construction of language tests. In Allen, J.P.B. and Davies, A., editors, Testing and experimental methods. The Edinburgh Course in Applied Linguistics Volume 4. London: Oxford University Press, 38–194.
Davies, A. 1990: Principles of language testing. Oxford: Blackwell.
Ellerton, A.W. 1997: Considerations in the validation of semi-direct oral testing. Unpublished PhD thesis, CALS, University of Reading.
ffrench, A. 1999: Language functions and UCLES speaking tests. Seminar in Athens, Greece, October 1999.
Foster, P. and Skehan, P. 1996: The influence of planning and task type on second language performance. Studies in Second Language Acquisition 18, 299–323.
Foster, P. and Skehan, P. 1999: The influence of source of planning and focus of planning on task-based performance. Language Teaching Research 3, 215–47.
Fulcher, G. 1994: Some priority areas for oral language testing. Language Testing Update 15, 39–47.
Fulcher, G. 1996: Testing tasks: issues in task design and the group oral. Language Testing 13, 23–51.
Fulcher, G. 1999: Assessment in English for academic purposes: putting content validity in its place. Applied Linguistics 20, 221–36.
Hayashi, M. 1995: Conversational repair: a contrastive study of Japanese and English. MA Project Report, University of Canberra.
Henning, G. 1983: Oral proficiency testing: comparative validities of interview, imitation and completion methods. Language Learning 33, 315–32.
Henning, G. 1987: A guide to language testing. Cambridge, MA: Newbury House.
Kelly, R. 1978: On the construct validation of comprehension tests: an exercise in applied linguistics. Unpublished PhD thesis, University of Queensland.
Kenyon, D. 1995: An investigation of the validity of task demands on performance-based tests of oral proficiency. In Kunnan, A.J., editor, Validation in language assessment: selected papers from the 17th Language Testing Research Colloquium, Long Beach. Mahwah, NJ: Lawrence Erlbaum, 19–40.
Kormos, J. 1999: Simulating conversations in oral-proficiency assessment: a conversation analysis of role plays and non-scripted interviews in language exams. Language Testing 16, 163–88.
Lazaraton, A. 1992: The structural organisation of a language interview: a conversational analytic perspective. System 20, 373–86.
Lazaraton, A. 1996: A qualitative approach to monitoring examiner conduct in the Cambridge assessment of spoken English (CASE). In Milanovic, M. and Saville, N., editors, Performance testing, cognition and assessment: selected papers from the 15th Language Testing Research Colloquium, Cambridge and Arnhem. Studies in Language Testing 3. Cambridge: University of Cambridge Local Examinations Syndicate, 18–33.
Lazaraton, A. 2000: A qualitative approach to the validation of oral language tests. Studies in Language Testing Volume 14. Cambridge: Cambridge University Press.
Lumley, T. and O'Sullivan, B. 2000: The effect of speaker and topic variables on task performance in a tape-mediated assessment of speaking. Paper presented at the 2nd Annual Asian Language Assessment Research Forum, The Hong Kong Polytechnic University.
Luoma, S. 1997: Comparability of a tape-mediated and a face-to-face test of speaking: a triangulation study. Unpublished Licentiate thesis, Centre for Applied Language Studies, Jyvaskyla University, Finland.
McNamara, T. 1996: Measuring second language performance. London: Longman.
Mehnert, U. 1998: The effects of different lengths of time for planning on second language performance. Studies in Second Language Acquisition 20, 83–108.
Messick, S. 1975: The standard problem: meaning and values in measurement and evaluation. American Psychologist 30, 955–66.
Messick, S. 1989: Validity. In Linn, R.L., editor, Educational measurement. 3rd edition. New York: Macmillan.
Milanovic, M. and Saville, N. 1996: Introduction. Performance testing, cognition and assessment. Studies in Language Testing Volume 3. Cambridge: University of Cambridge Local Examinations Syndicate, 1–17.
Moller, A.D. 1982: A study in the validation of proficiency tests of English as a Foreign Language. Unpublished PhD thesis, University of Edinburgh.
Norris, J., Brown, J.D., Hudson, T. and Yoshioka, J. 1998: Designing second language performance assessments. Technical Report 18. Honolulu, HI: University of Hawaii Press.
O'Loughlin, K. 1995: Lexical density in candidate output on direct and semi-direct versions of an oral proficiency test. Language Testing 12, 217–37.
O'Loughlin, K. 1997: The comparability of direct and semi-direct speaking tests: a case study. Unpublished PhD thesis, University of Melbourne, Melbourne.
O'Loughlin, K. 2001: An investigatory study of the equivalence of direct and semi-direct speaking skills. Studies in Language Testing 13. Cambridge: Cambridge University Press/UCLES.
Ortega, L. 1999: Planning and focus on form in L2 oral performance. Studies in Second Language Acquisition 20, 109–48.
O'Sullivan, B. 2000: Towards a model of performance in oral language testing. Unpublished PhD dissertation, CALS, University of Reading.
Robinson, P. 1995: Task complexity and second language narrative discourse. Language Learning 45, 99–140.
Ross, S. and Berwick, R. 1992: The discourse of accommodation in oral proficiency interviews. Studies in Second Language Acquisition 14, 159–76.
Saville, N. and Hargreaves, P. 1999: Assessing speaking in the revised FCE. ELT Journal 53, 42–51.
Schegloff, E., Jefferson, G. and Sacks, H. 1977: The preference for self-correction in the organisation of repair in conversation. Language 53, 361–82.
Schwartz, J. 1980: The negotiation for meaning: repair in conversations between second language learners of English. In Larsen-Freeman, D., editor, Discourse analysis in second language research. Rowley, MA: Newbury House.
Shohamy, E. 1983: The stability of oral language proficiency assessment in the oral interview testing procedure. Language Learning 33, 527–40.
Shohamy, E. 1988: A proposed framework for testing the oral language of second/foreign language learners. Studies in Second Language Acquisition 10, 165–79.
Shohamy, E. 1994: The validity of direct versus semi-direct oral tests. Language Testing 11, 99–123.
Shohamy, E., Reves, T. and Bejarano, Y. 1986: Introducing a new comprehensive test of oral proficiency. ELT Journal 40, 212–20.
Skehan, P. 1996: A framework for the implementation of task based instruction. Applied Linguistics 17, 38–62.
Skehan, P. 1998: A cognitive approach to language learning. Oxford: Oxford University Press.
Stansfield, C.W. and Kenyon, D.M. 1992: Research on the comparability of the oral proficiency interview and the simulated oral proficiency interview. System 20, 347–64.
Stenstrom, A. 1994: An introduction to spoken interaction. London: Longman.
Suhua, H. 1998: A communicative test of spoken English for the CET 6. Unpublished PhD thesis, Shanghai Jiao Tong University, Shanghai.
Upshur, J.A. and Turner, C. 1999: Systematic effects in the rating of second-language speaking ability: test method and learner discourse. Language Testing 16, 82–111.
van Ek, J.A. and Trim, J.L.M., editors, 1984: Across the threshold. Oxford: Pergamon.
van Lier, L. 1989: Reeling, writhing, drawling, stretching and fainting in coils: oral proficiency interviews as conversation. TESOL Quarterly 23, 489–508.
Walker, C. 1990: Large-scale oral testing. Applied Linguistics 11, 200–19.
Weir, C.J. 1983: Identifying the language needs of overseas students in tertiary education in the United Kingdom. Unpublished PhD thesis, University of London.
Weir, C.J. 1993: Understanding and developing language tests. Hemel Hempstead: Prentice Hall.
Wigglesworth, G. 1997: An investigation of planning time and proficiency level on oral test discourse. Language Testing 14, 85–106.
Wigglesworth, G. and O'Loughlin, K. 1993: An investigation into the comparability of direct and semi-direct versions of an oral interaction test in English. Melbourne Papers in Language Testing 2, 56–67.
Young, R. 1995: Conversational styles in language proficiency interviews. Language Learning 45, 3–42.
Young, R. and Milanovic, M. 1992: Discourse variation in oral proficiency interviews. Studies in Second Language Acquisition 14, 403–24.

Appendix 1 Items included in initial draft checklists (with short gloss)

Informational functions
Providing personal information: give information on present circumstances; give information on past experiences; give information on future plans
Providing nonpersonal information: give information which does not relate to the individual
Elaborating: elaborate on an idea
Expressing opinions: express opinions
Justifying opinions: express reasons for assertions s/he has made
Comparing: compare things/people/events
Complaining: complain about something
Speculating: hypothesize or speculate
Analysing: separate out the parts of an issue
Making excuses: make excuses
Explaining: explain anything
Narrating: describe a sequence of events
Paraphrasing: paraphrase something
Summarizing: summarize what s/he had said
Suggesting: suggest a particular idea
Expressing preferences: express preferences

Interactional functions
Challenging: challenge assertions made by another speaker
(Dis)agreeing: indicate (dis)agreement with what another speaker says (apart from 'yeah'/'no' or simply nodding)
Justifying/Providing support: offer justification or support for a comment made by another speaker
Qualifying: modify arguments or comments
Asking for opinions: ask for opinions
Persuading: attempt to persuade another person
Asking for information: ask for information
Conversational repair: repair breakdowns in interaction
Negotiating meaning: check understanding; attempt to establish common ground or strategy; respond to requests for clarification; ask for clarification; make corrections; indicate purpose; indicate understanding/uncertainty

Managing interaction
Initiating: start any interactions
Changing: take the opportunity to change the topic
Reciprocity: share the responsibility for developing the interaction
Deciding: come to a decision
Terminating: decide when the discussion should stop

Appendix 2 Phase 1 results (summarized)

[The summary table could not be recovered from the source. It reported, for each participant, the number of observations of each checklist item, with items sorted by total observations: from 'make excuses', 'terminate', 'conversational repair', 'summarize', 'complain', 'paraphrase' and 'persuade' at the rarely-observed end, through 'change topic', 'challenge', 'qualify', 'ask for information', 'suggest', 'narrate', 'reciprocate', 'analyse', 'elaborate', 'initiate', 'provide nonpersonal information', 'explain', 'justify opinions', 'negotiate meaning', 'decide', '(dis)agree', 'justify/support' and 'ask for opinions', to 'express preferences', 'speculate', 'compare' and 'express opinion' at the frequently-observed end.]


Appendix 3 Operational checklist (used in Phase 3)

Informational functions
Providing personal information: give information on present circumstances; give information on past experiences; give information on future plans
Expressing opinions: express opinions
Elaborating: elaborate on or modify an opinion
Justifying opinions: express reasons for assertions s/he had made
Comparing: compare things/people/events
Speculating: speculate
Staging: separate out or interpret the parts of an issue
Describing: describe a sequence of events; describe a scene
Summarizing: summarize what s/he has said
Suggesting: suggest a particular idea
Expressing preferences: express preferences

Interactional functions
Agreeing: agree with an assertion made by another speaker (apart from 'yeah' or nonverbal)
Disagreeing: disagree with what another speaker says (apart from 'no' or nonverbal)
Modifying: modify arguments or comments made by the other speaker, or by the test-taker in response to another speaker
Asking for opinions: ask for opinions
Persuading: attempt to persuade another person
Asking for information: ask for information
Conversational repair: repair breakdowns in interaction
Negotiating meaning: check understanding; indicate understanding of point made by partner; establish common ground/purpose or strategy; ask for clarification when an utterance is misheard or misinterpreted; correct an utterance made by the other speaker which is perceived to be incorrect or inaccurate; respond to requests for clarification

Managing interaction
Initiating: start any interactions
Changing: take the opportunity to change the topic
Reciprocating: share the responsibility for developing the interaction
Deciding: come to a decision


Appendix 4 Summary of Phase 2 observation

(Columns: Tape 1, Tasks 1–4, followed by Tape 2, Tasks 1–4.)

Informational functions
Providing personal information
Present 12 (G) 1 (L) 1 (L) 1 (L) 12 (G) 1 (L) 4 (L)
Past 10 (G) 4 (S) 12 (G)
Future 11 (G) 3 (L) 6 (S) 12 (G)
Expressing opinions 12 (G) 11 (G) 9 (G) 8 (G) 11 (G) 10 (G) 10 (G) 11 (G)
Elaborating 9 (G) 11 (G) 9 (G) 7 (G) 3 (L) 9 (G) 7 (S) 12 (G)
Justifying opinions 10 (G) 7 (G) 9 (G) 7 (G) 4 (L) 8 (S) 6 (S) 8 (S)
Comparing 11 (G) 8 (G) 1 (L) 6 (S) 3 (L) 12 (G) 7 (S) 5 (S)
Speculating 7 (S) 11 (G) 8 (G) 3 (L) 7 (S) 10 (G) 10 (G) 5 (S)
Staging 6 (S) 1 (L) 3 (L) 6 (L)
Describing
Sequence of events 1 (L) 1 (L) 3 (L) 1 (L) 4 (L)
Scene 5 (S) 9 (G) 2 (S) 2 (S) 10 (G) 2 (S) 3 (S)
Summarizing 1 (L) 1 (L) 1 (L) 1 (L) 3 (L) 1 (L) 1 (L) 1 (L)
Suggesting 1 (L) 2 (L) 1 (L) 3 (L) 2 (L)
Expressing preferences 12 (G) 11 (G) 6 (S) 8 (G) 11 (G) 10 (G) 5 (S) 12 (G)

Interactional functions
Agreeing 6 (S) 9 (G) 2 (L) 10 (G) 4 (L)
Disagreeing 9 (G) 4 (S) 2 (L) 6 (S)
Modifying 1 (L) 5 (S) 4 (S) 7 (S) 1 (L)
Asking for opinions 1 (L) 8 (G) 2 (L) 11 (G)
Persuading 2 (L) 2 (L)
Asking for information 2 (L) 1 (L) 5 (S)
Conversational repair 5 (S) 4 (L) 1 (L)
Negotiating meaning
Check meaning 2 (L) 4 (S) 4 (L)
Understanding 5 (S) 3 (L) 3 (L)
Common ground 2 (L) 2 (L) 1 (L)
Ask clarification 2 (L) 1 (L) 2 (L)
Correct utterance 3 (L) 1 (L)
Respond to requests for clarification 4 (S) 1 (L)

Managing interaction
Initiating 8 (G) 1 (L) 10 (G) 5 (S)
Changing 8 (G) 7 (S)
Reciprocating 7 (G) 9 (G) 1 (L)
Deciding 3 (L) 1 (L) 1 (L) 2 (L)

Notes: The figures indicate the number of students recording each function in each case; L = Little agreement, S = Some agreement, G = Good agreement. For Tasks 3 and 4 in the first tape observed, the maximum was 9; for all others the maximum was 12. This is because 3 of the 12 MA students did not complete the task for these last 2 tasks. This was not a problem during the observation of the second tape, so for all of these the maximum figures are 12.


Appendix 5 Transcript results and observation checklist results

(Columns: Tasks 1–4.)

Informational functions
Providing personal information
Present T G L T L
Past T G
Future T G
Expressing opinions T G T G T G T G
Elaborating L T G T S T G
Justifying opinions L T S T S T S
Comparing L T G T S S
Speculating T S T G T G S
Staging T L T S
Describing
Sequence of events T L L L
Scene T G L L
Summarizing T L L L L
Suggesting L L
Expressing preferences T G T G S T G

Interactional functions
Agreeing T G T L
Disagreeing T S
Modifying T S T L
Asking for opinions T G
Persuading L
Asking for information S
Conversational repair T S T L L
Negotiating meaning
Check meaning L
Understanding L L
Common ground L L
Ask clarification L T L
Correct utterance L
Respond to requests for clarification L

Managing interaction
Initiating T G T S
Changing T S
Reciprocating T G L
Deciding L L

Notes: T indicates that this function has been identified as occurring in the transcript of the interaction; L, S and G indicate the degree of agreement among the raters using the checklists in real time (L = Little agreement; S = Some agreement; G = Good agreement).

Page 3: Using observation checklists to validate speaking-test taskspart. The format of the Main Suite Speaking Tests (with the exception of the Level 1 KET test) is summarized in Table 1

Barry OrsquoSullivan Cyril J Weir and Nick Saville 35

Milanovic and Saville (1996) provide a useful overview of the vari-ables that interact in performance testing and suggest a conceptualframework for setting out different avenues of research The frame-work was in uential in the revisions of the Cambridge examinationsduring the 1990s including the development of KET and CAE examsand revisions to PET FCE and most recently CPE (for a summaryof the UCLES approach see Saville and Hargreaves 1999)

The Milanovic and Saville framework is one of the earliest andmost comprehensive of these models (reproduced here as Figure 2)This framework highlights the many factors (or facets) that must beconsidered when designing a test from which particular inferences areto be drawn about performances all of the factors represented in themodel pose potential threats to the reliability and validity of theseinferences From this model a framework can be derived throughwhich a validation strategy can be devised for Speaking Tests suchas those produced by UCLES

The essential elements of this framework aremiddot the test-takermiddot the interlocutor examinermiddot the assessment criteria (scales)middot the taskmiddot the interactions between these elements

Examinationdeveloper

Specificationsand

construct

Examinationconditions

Tasks

Assessmentcriteria

Assessmentconditions

and training

Knowledgeand ability

Examiners

Sample oflanguage

Score

Candidates

Knowledgeand ability

Figure 2 A conceptual framework for performance testingSource adapted from Milanovic and Saville 1996 6

36 Validating speaking-test tasks

The subject of this study the task has been explored from a numberof perspectives Brie y these have been

middot Taskmethod comparison (quantitative) involving studies inwhich comparisons are made between performances on differenttasks or methods (Clark 1979 1988 Henning 1983 Shohamy1983 Shohamy et al 1986 Clark and Hooshmand 1992 Stans- eld and Kenyon 1992 Wigglesworth and OrsquoLoughlin 1993Chalhoub-Deville 1995a OrsquoLoughlin 1995 Fulcher 1996 Lum-ley and OrsquoSullivan 2000 OrsquoSullivan 2000)

middot Taskmethod comparison (qualitative) as above but where quali-tative methods are employed (Shohamy 1994 Young 1995Luoma 1997 OrsquoLoughlin 1997 Bygate 1999 Kormos 1999)

middot Task performance (method effect) where aspects of the task aresystematically manipulated eg planning time pre- or post-taskoperations etc (Foster and Skehan 1996 1999 Wigglesworth1997 Mehnert 1998 Ortega 1999 Upshur and Turner 1999)

middot Native speakerNonnative speaker comparison where nativespeaker performance on speci c tasks is compared to nonnativespeaker performance on the same tasks (Weir 1983 Ballman1991)

middot Task dif cultyclassi cation where an attempt has been made toclassify tasks in terms of their dif culty (Weir 1993 Fulcher1994 Kenyon 1995 Robinson 1995 Skehan 1996 1998 Norriset al 1998)

The central importance of the test task has been clearly recognizedhowever in terms of test validation there is one question that hasto date remained largely unexplored Although there has been a greatdeal of debate over the validation of performance tests through analy-sis of the language generated in the performance of language elici-tation tasks (LETs) (eg van Lier 1989 Lazaraton 1992 1996)attention has not been drawn to the one aspect of task performancethat would appear to be of most interest to the test designer That iswhen tasks are performed in a test event how does that performancerelate to the test designerrsquos predictions or expectations based on theirde nition or interpretation of the construct After all no matter howreliably the performance is scored if it does not match the expec-tations of the test designer (in other words represent the constructswhich are to be tested) then the inferences that the test designer hopesto draw from the evaluated performance will not be valid

Cronbach went to the heart of the matter (1971 443) lsquoConstruc-tion of a test itself starts from a theory about behaviour or mentalorganization derived from prior research that suggests the ground planfor the testrsquo Davies (1977 63) argued in similar vein lsquoit is after

Barry OrsquoSullivan Cyril J Weir and Nick Saville 37

all the theory on which all else rests it is from there that the constructis set up and it is on the construct that validity of the content andpredictive kinds is basedrsquo Kelly (1978 8) supported this view com-menting that lsquothe systematic development of tests requires sometheory even an informal inexplicit one to guide the initial selectionof item content and the division of the domain of interest into appro-priate sub-areasrsquo

Because we lack an adequate theory of language in use a prioriattempts to determine the construct validity of pro ciency testsinvolve us in matters that relate more evidently to content validityWe need to talk of the communicative construct in descriptive termsand as a result we become involved in questions of content relevanceand content coverage Thus for Kelly (1978 8) content validityseemed lsquoan almost completely overlapping conceptrsquo with constructvalidity and for Moller (1982 68) lsquothe distinction between constructand content validity in language testing is not always very markedparticularly for tests of general language pro ciencyrsquo

Content validity is considered important as it is principally con-cerned with the extent to which the selection of test tasks is represen-tative of the larger universe of tasks of which the test is assumed tobe a sample (see Bachman and Palmer 1981 Henning 1987 94Messick 1989 16 Bachman 1990 244) Similarly Anastasi (1988131) de ned content validity as involving lsquoessentially the systematicexamination of the test content to determine whether it covers arepresentative sample of the behaviour domain to be measuredrsquo Sheoutlined (Anastasi 1988 132) the following guidelines for estab-lishing content validity

1) lsquothe behaviour domain to be tested must be systematicallyanalysed to make certain that all major aspects are covered bythe test items and in the correct proportionsrsquo

2) lsquothe domain under consideration should be fully described inadvance rather than being de ned after the test has been pre-paredrsquo

3) lsquocontent validity depends on the relevance of the individualrsquos testresponses to the behaviour area under consideration rather thanon the apparent relevance of item contentrsquo

The directness of fit and adequacy of the test sample is thus dependent on the quality of the description of the target language behaviour being tested. In addition, if the responses to the item are invoked, Messick (1975: 961) suggests 'the concern with processes underlying test responses places this approach to content validity squarely in the realm of construct validity'. Davies (1990: 23) similarly notes 'content validity slides into construct validity'.

Content validation is, of course, extremely problematic, given the difficulty we have in characterizing language proficiency with sufficient precision to ensure the validity of the representative sample we include in our tests, and the further threats to validity arising out of any attempts to operationalize real life behaviours in a test. Specifying operations, let alone the conditions under which these are performed, is challenging and at best relatively unsophisticated (see Cronbach, 1990). Weir (1993) provides an introductory attempt to specify the operations and conditions that might form a framework for test task description (see also Bachman, 1990; Bachman and Palmer, 1996).

The difficulties involved do not, however, absolve us from attempting to make our tests as relevant as possible in terms of content. Generating content related evidence is seen as a necessary, although not sufficient, part of the validation process of a speaking test. To this end we sought to establish in this study an effective and efficient procedure for establishing the content validity of speaking tests. As well as being useful in helping specify the domain to be tested, we would argue that the checklist discussed below would enable the researcher to address how predicted vs. actual task performance can be compared.

III Methodological issues

While it is relatively easy to rationalize the need to establish that the LETs used in performance tests are working as predicted (i.e., in terms of language generated), the difficulty lies in how this might best be done.

UCLES EFL (English as a foreign language) routinely collects audio recordings and carries out transcriptions of its Speaking Tests. These transcripts are used for a range of validation purposes and, in particular, they contribute to revision projects for the Speaking Tests: for example, FCE, which was revised in 1996, and currently the revision of the International English Language Testing System (IELTS) Speaking Test, in addition to the CPE revision project.

In a series of UCLES studies focusing on the language of the Speaking Tests, Lazaraton has applied conversational analysis (CA) techniques to contribute to our understanding of the language used in pair-format Speaking Tests, including the language of the candidates and the interlocutor. Her approach requires a very careful, fine-tuned transcription of the tests in order to provide the data for analysis (see Lazaraton, 2000). Similar qualitative methodologies have been applied by Young and Milanovic (1992) – also to UCLES data – by Brown (1998) and by Ross and Berwick (1992), amongst others.

While there is clearly a great deal of potential for this detailed analysis of transcribed performances, there are also a number of drawbacks, the most serious of which involves the complexity of the transcription process. In practice this means that a great deal of time and expertise is required in order to gain the kind of data that will answer the basic question concerning validity. Even where this is done, it is impractical to attempt to deal with more than a small number of test events; therefore the generalizability of the results may be questioned.

Clearly, then, a more efficient methodology is required that allows the test designer to evaluate the procedures, and especially the tasks, in terms of the language produced by a larger number of candidates. Ideally this should be possible in 'real' time, so that the relationship of predicted outcome to specific outcome can be established using a data set that satisfactorily reflects the typical test-taking population. The primary objective of this project, therefore, was to create an instrument built on a framework that describes the language of performance in a way that can be readily accessed by evaluators who are familiar with the tests being observed. This work is designed to be complementary to the use of transcriptions and to provide an additional source of validation evidence.

The FCE was chosen as the focus of this study for a number of reasons:

· It is 'stable', in that it is neither under review nor due to be reviewed.

· It represents the middle of the ALTE (and UCLES Main Suite) range, and is the most widely subscribed test in the battery.

· It offers the most likelihood of a wide range of performance of any Main Suite examination, as it is often used as an 'entry-point' into the suite; candidates tend to range from below to above this level in terms of ability.

· Like all of the other Main Suite examinations, a database of recordings (audio and video) already existed.

IV The development of the observation checklists

Weir (1993), building on the earlier work of Bygate (1988), suggests that the language of a speaking test can be described in terms of the informational and interactional functions, and those of interaction management, generated by the participants involved. With this as a starting point, a group of researchers at the University of Reading were commissioned by UCLES EFL to examine the spoken language, second language acquisition and language testing literatures to come up with an initial set of such functions (see Schegloff et al., 1977; Schwartz, 1980; van Ek and Trim, 1984; Bygate, 1988; Shohamy, 1988; 1994; Walker, 1990; Weir, 1994; Stenström, 1994; Chalhoub-Deville, 1995b; Hayashi, 1995; Ellerton, 1997; Suhua, 1998; Kormos, 1999; O'Sullivan, 2000; O'Loughlin, 2001).

These were then presented as a draft set of three checklists (Appendix 1), representing each of the elements of Weir's categorization. What follows in the three phases of the development process described below (Section VI) was an attempt to customize the checklist to more closely reflect the intended outcomes of spoken language test tasks in the UCLES Main Suite. The checklists were designed to help establish which of these functions resulted and which were absent.
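
Although the article works with paper checklists, the three-category structure just described translates directly into a simple machine-readable form, which may help readers who want to pilot a comparable instrument or analyse its output automatically. The sketch below is our own illustration, not part of the authors' procedure; the category and function names are taken from the operational checklist in Appendix 3, while the class and variable names are invented for the example.

```python
from dataclasses import dataclass, field

# The three checklist categories and their functions, taken from the
# operational checklist in Appendix 3.
CHECKLIST = {
    "informational": [
        "providing personal information", "expressing opinions",
        "elaborating", "justifying opinions", "comparing", "speculating",
        "staging", "describing", "summarizing", "suggesting",
        "expressing preferences",
    ],
    "interactional": [
        "agreeing", "disagreeing", "modifying", "asking for opinions",
        "persuading", "asking for information", "conversational repair",
        "negotiating meaning",
    ],
    "managing interaction": [
        "initiating", "changing", "reciprocating", "deciding",
    ],
}
ALL_FUNCTIONS = {fn for fns in CHECKLIST.values() for fn in fns}

@dataclass
class Observation:
    """One observer's real-time record for one task performance."""
    observer: str
    task: str
    observed: set = field(default_factory=set)

    def mark(self, function: str) -> None:
        # Reject ticks for anything not on the checklist.
        if function not in ALL_FUNCTIONS:
            raise ValueError(f"not a checklist function: {function!r}")
        self.observed.add(function)

record = Observation(observer="rater_01", task="Task 3")
record.mark("suggesting")
record.mark("agreeing")
print(sorted(record.observed))
```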

The next concern was with the development of a procedure for devising a 'working' version of the checklists, to be followed by an evaluation of using this type of instrument in 'real' time (using tapes or perhaps live speaking tests).

V The development model

The process through which the checklists were developed is shown in Figure 3. The concept that drives this model is the evaluation at each level by different stakeholders. At this stage of the project these stakeholders were identified as:

Figure 3 The development model

· the consulting 'expert' testers (the University of Reading group);
· the test development and validation staff at UCLES;
· UCLES Senior Team Leaders (i.e., key staff in the oral examiner training system).

All these individuals participated in the application of each draft. It should also be noted that a number of drafts were anticipated.

VI The development process

In order to arrive at a working version of the checklists, a number of developmental phases were anticipated. At each phase the latest version (or draft) of the instruments was applied and this application evaluated.

Phase 1

The first attempt to examine how the draft checklists would be viewed and applied by a group of language teachers was conducted by ffrench (1999). Of the participants at the seminar, approximately 50% of the group reported that English (British English, American English or Australian English) was their first language, while the remaining 50% were native Greek speakers.

In their introduction to the application of the Observation Checklists (OCs), the participants were given a series of activities that focused on the nature and use of those functions of language seen by task designers at UCLES to be particularly applicable to their EFL Main Suite Speaking Tests (principally FCE, CAE and CPE). Once familiar with the nature of the functions (and where they might occur in a test), the participants applied the OCs in 'real' time to an FCE Speaking Test from the 1998 Standardization Video. This video featured a pair of French speakers who were judged by a panel of 'expert' raters (within UCLES) to be slightly above the criterion ('pass') level.

Of the 37 participants, 32 completed the task successfully; that is, they attempted to make frequency counts of the items represented in the OCs. Among this group there appear to be varying degrees of agreement as to the use of language functions, particularly in terms of the specific number of observations of each function. However, when the data are examined from the perspective of agreement on whether a particular function was observed or not (ignoring the count, which in retrospect was highly ambitious when we consider the lack of systematic training in the use of the questionnaires given to the teachers who attended), we find that there is a striking degree of agreement on all but a small number of functions (Appendix 2). Note here that, in order to make these patterns of behaviour clear, the data have been sorted both horizontally and vertically by the total number of observations made by each participant and of each item.
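
For readers wishing to replicate this kind of tabulation, the following sketch shows one way the sorting and the presence/absence comparison might be computed. The observation data here are invented; only the general procedure (sort rows and columns by totals, then inspect each function's share of observers) reflects the description above.

```python
# Invented illustration of the Phase 1 tabulation: rows are observers,
# columns are checklist functions, a tick means 'function observed'.
observations = {
    "obs_01": {"expressing opinions", "comparing", "agreeing"},
    "obs_02": {"expressing opinions", "comparing"},
    "obs_03": {"expressing opinions", "speculating"},
}
functions = sorted({f for ticks in observations.values() for f in ticks})

# Sort columns by how many observers ticked each function and rows by
# how many functions each observer ticked -- the horizontal and vertical
# sorting used in Appendix 2 to make agreement patterns visible.
col_total = {f: sum(f in t for t in observations.values()) for f in functions}
functions.sort(key=lambda f: col_total[f], reverse=True)
observers = sorted(observations, key=lambda o: len(observations[o]), reverse=True)

for o in observers:
    row = "".join("X" if f in observations[o] else "." for f in functions)
    print(f"{o}  {row}")
for f in functions:
    share = col_total[f] / len(observations)
    # Shares near 0 or 1 show consensus on absence or presence;
    # middling values flag the items needing further investigation.
    print(f"{f:22s} {share:.0%}")
```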

From this perspective, this aspect of the developmental process was considered to be quite successful. However, it was apparent that there were a number of elements within the checklists that were causing some difficulty. These are highlighted in the table by the tram-lines. Items above the lines have been identified by some participants, in one case by a single person, while those below have been observed by a majority of participants (in two cases by all of them). For these cases we might infer a high degree of agreement. However, the middle range of items appears to have caused a degree of confusion, and so are highlighted here, i.e., marked for further investigation.

Phase 2

In this phase a much smaller gathering was organized, this time involving members of the development team as well as the three UK-based UCLES Senior Team Leaders. In advance of this meeting all participants were asked to study the existing checklists and to exemplify each function with examples drawn from their experiences of the various UCLES Main Suite examinations. The resulting data were collated and presented as a single document that formed the basis of discussion during a day-long session. Participants were not made aware of the findings from Phase 1.

During this session many questions were asked of all aspects of the checklist, and a more streamlined version of the three sections was suggested. In addition to a number of participants making a written record of the discussions, the entire session was recorded. This proved to be a valuable reminder of the way in which particular changes came about, and was used when the final decisions regarding inclusion, conflation or omission were being made. Although it is beyond the scope of this project to analyse this recording, when coupled with the earlier and revised documents it is in itself a valuable source of data, in that it provides a significant record of the developmental process.

Among the many interesting outcomes of this phase were the decisions either to rethink, to reorganize or to omit items from the initial list. These decisions were seen to mirror the results of the Phase 1 application quite closely. Of the 13 items identified in Phase 1 as being in need of review (7 were rarely observed, indicating a high degree of agreement that they were not in fact present, and 6 appeared to be confused, with very mixed reported observations), 7 were recommended for either omission or inclusion in other items by the panel, while the remaining 6 items were identified by them as being of value. Although no examples of the latter had appeared in the earlier data, the panel agreed that they represented language functions that the UCLES Main Suite examinations were intended to elicit. It was also decided that each item in this latter group was in need of further clarification and/or exemplification. Of the remaining 17 items:

· two were changed: the item 'analysing' was recoded as 'staging' in order to clarify its intended meaning, while it was decided to separate the item '(dis)agreeing' into its two separate components;

· three were omitted: it was argued that the item 'providing non-personal information' referred to what was happening with the other items in the informational function category, while the items 'explaining' and 'justifying/supporting' were not functions usually associated with the UCLES Main Suite tasks, and no occurrences of these had been noted.

We would emphasize that, as reported in Section IV above, the initial list was developed to cover the language functions that various spoken language test tasks might elicit. The development of the checklists described here reflects an attempt to customize the lists in line with the intended functional outcomes of a specific set of tests.

We are, of course, aware that closed instruments of this type may be open to the criticism that valuable information could be lost. However, for reasons of practicality we felt it necessary to limit the list to what the examinations were intended to elicit, rather than attempt to operationalize a full inventory. Secondly, any functions that appeared in the data that were not covered by the reduced list would have been noted. There appeared to be no cases of this.

The data from these two phases were combined to result in a working version of the checklists (Appendix 3), which was then applied to a pair of FCE Speaking Tests in Phase 3.

Phase 3

In the third phase the revised checklists were given to a group of 15 MA TEFL students, who were asked to apply them to two FCE tests. Both of these tests involved a mixed-sex pair of learners, one pair of approximately average ability and the other pair above average. Before using the observation checklists (OCs), the students were asked first to attempt to predict which functions they might expect to find. To help in this pre-session task, the students were given details of the FCE format and tasks.

Unfortunately, a small number of students did not manage to complete the observation task, as they were somewhat overwhelmed with the real-time application of the checklists. As a result, only 12 sets of completed checklists were included in the final analysis.

Prior to the session the group was given an opportunity to have a practice run using a third FCE examination. While this 'training' period, coupled with the pre-session task, was intended to provide the students with the background they needed to apply the checklists consistently, there was a problem during the session itself. This problem was caused by the failure of a number of students to note the change from Task 3 to Task 4 in the first test observed. This was possibly caused by a lack of awareness of the test itself, and was not helped by the seamless way in which the examiner on the video moved from a two-way discussion involving the test-takers to a three-way discussion. This meant that a full set of data exists only for the first two tasks of this test. As the problem was noticed in time, the second test did not cause these problems. Unlike the earlier seminar, on this occasion the participants were asked only to record each function when it was first observed. This was done as it was felt that the earlier seminar showed that, without extensive training, it would be far too difficult to apply the OCs fully in 'real' time in order to generate comprehensive frequency counts. We are aware that a full tally would enable us to draw more precise conclusions about the relative frequency of occurrence of these functions and the degree of consensus (reliability) of observers.

Against this, we must emphasize that the checklists in their current stage of development are designed to be used in real time. Their use was therefore restricted to determining the presence or absence of a particular function. Rater agreement in this case is limited to a somewhat crude account of whether a function occurred or did not occur in a particular task performance. We do not, therefore, have evidence of whether the function observed was invariant across raters.
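
A crude presence/absence agreement index of this kind is easy to compute. The sketch below bands the share of raters ticking a function into the Little/Some/Good labels used in Appendices 4 and 5; note that the numeric cut-offs are our assumption for illustration, since the article reports the labels but not the thresholds behind them.

```python
def agreement_band(ticks: int, raters: int) -> str:
    """Band a raw presence count into the L/S/G labels of Appendices 4
    and 5. The cut-offs (thirds) are an assumption made for this sketch;
    the article reports the labels, not the thresholds behind them."""
    share = ticks / raters
    if share >= 2 / 3:
        return "G"  # good agreement that the function occurred
    if share >= 1 / 3:
        return "S"  # some agreement
    return "L"      # little agreement

# e.g., 7 of 12 observers ticking 'justifying opinions' on one task:
print(agreement_band(7, 12))  # -> 'S'
```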

The results from this session are included as Appendix 4. It can be seen from this table that the participants again display mixed levels of agreement, ranging from a single perceived observation to total agreement. As with the earlier session, it appears that there is relatively broad agreement on a range of functions, but that others appear to be more difficult to identify easily. These difficulties appear to be greatest where the task involves a degree of interaction between the test-takers.

Phase 4

In this phase a transcription was made of the second of the two interviews used in Phase 3, since there was a full set of data available for this interview. The OCs were then 'mapped' on to this transcript in order to give an overview, from a different perspective, of what functions were generated (it being felt that this map would result in an accurate description of the test in terms of the items included in the OCs). This mapping was carried out by two researchers, who initially worked independently of each other but discussed their finished work in order to arrive at a consensus.
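
The mapping exercise amounts to cross-tabulating two sources of evidence for each function: the transcript coding and the real-time checklist agreement. A minimal sketch of that comparison, with invented input values, might look as follows; functions where the two sources point in different directions are flagged for review, in the spirit of Appendix 5.

```python
# Compare the transcript mapping (Phase 4) with the real-time checklist
# results (Phase 3) for one task. Input values are invented; 'T' marks
# functions the two researchers agreed were present in the transcript.
transcript_functions = {"expressing opinions", "speculating", "agreeing"}
checklist_bands = {            # function -> L/S/G from the live session
    "expressing opinions": "G",
    "speculating": "S",
    "summarizing": "L",
}

for fn in sorted(transcript_functions | set(checklist_bands)):
    t = "T" if fn in transcript_functions else " "
    band = checklist_bands.get(fn, " ")
    flag = ""
    # A transcript hit with little or no real-time agreement (or the
    # reverse) marks an item whose gloss may need revision.
    if (t == "T") != (band in {"S", "G"}):
        flag = "  <- review"
    print(f"{fn:22s} {t} {band}{flag}")
```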

Finally, the results of Phases 3 and 4 were compared (Appendix 5). This clearly indicates that the checklists are now working well. There are still some problems in items such as 'staging' and 'describing', and feedback from participants suggests that this may be due to misunderstandings or misinterpretations of the gloss and examples used. In addition, there are some similar difficulties with the initial three items in the interactional functions checklist, in which the greatest difficulties in applying the checklists appear to lie.

VII Discussion and initial conclusions

The results of this study appear to substantiate our belief that, although still under development for use with the UCLES Main Suite examinations, an operational version of these checklists is certainly feasible and has potentially wider application, mutatis mutandis, to the content validation of other spoken language tests. Further refinement of the checklists is clearly required, although the developmental process adopted here appears to have borne positive results.

1 Validities

We would not wish to claim that the checklists on their own offer a satisfactory demonstration of the construct validity of a spoken language test, for as Messick argues (1989: 16), 'the varieties of evidence supporting validity are not alternatives but rather supplements to one another'. We recognize the necessity for a broad view of 'the evidential basis for test interpretation' (Messick, 1989: 20). Bachman (1990: 237) similarly concludes: 'it is important to recognise that none of these [evidences of validity] by itself is sufficient to demonstrate the validity of a particular interpretation or use of test scores' (see also Bachman, 1990: 243). Fulcher (1999: 224) adds a further caveat against an overly narrow interpretation of content validity when he quotes Messick (1989: 41):

the major problem is that so-called content validity is focused upon test forms rather than test scores, upon instruments rather than measurements. Selecting content is an act of classification, which is in itself a hypothesis that needs to be confirmed empirically

Like these authors, we regard as inadequate any conceptualization of validity that does not involve the provision of evidence on a number of levels, but would argue strongly that without a clear idea of the match between intended content and actual content, any comprehensive investigation of the construct validity of a test is built on sand. Defining the construct is, in our view, underpinned by establishing the nature of the actual performances elicited by test tasks, i.e., the true content of tasks.

2 Present and future applications of observational checklists

Versions of the checklists require a degree of training and practice, similar to that given to raters, if a reliable and consistent outcome is to be expected. This requires that standardized training materials be developed alongside the checklists. In the case of these checklists this process has already begun, with the initial versions piloted during Phase 3 of the project.

The checklists have great potential as an evaluative tool and can provide comprehensive insight into various issues. It is hoped that, amongst other issues, the checklists will provide insights into the following:

· the language functions that the different task-types (and different sub-tasks within these) employed in the UCLES Main Suite Paper 5 (Speaking) Tests typically elicit;

· the language that the pair-format elicits, and how it differs in nature and quality from that elicited by interlocutor-single candidate testing;

· the extent to which there is functional variation across the top four levels of the UCLES Main Suite Spoken Language Test.

In addition to these issues, the way in which the checklists can be applied may allow for other important questions to be answered. For example, by allowing the evaluator multiple observations (stopping and starting a recording of a test at will), it will be possible to establish whether there are quantifiable differences in the language functions generated by the different tasks, i.e., the evaluators will have the time they need to make frequency counts of the functions.
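
With stop/start viewing, an evaluator's log becomes a sequence of (task, function) events, and a per-task tally falls out directly. The sketch below illustrates this with invented data; the task labels and counts are not drawn from the study.

```python
from collections import Counter, defaultdict

# Invented log of observations made while replaying a recording with
# stop/start control: one (task, function) pair per noted occurrence.
log = [
    ("Task 1", "providing personal information"),
    ("Task 3", "suggesting"), ("Task 3", "agreeing"),
    ("Task 3", "disagreeing"), ("Task 3", "agreeing"),
    ("Task 4", "expressing opinions"), ("Task 4", "justifying opinions"),
]

# Tally occurrences of each function per task, to compare the
# functional load of the different tasks.
by_task = defaultdict(Counter)
for task, fn in log:
    by_task[task][fn] += 1

for task in sorted(by_task):
    print(task, dict(by_task[task]))
```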

While the results to date have focused on a posteriori validation procedures, these checklists are also relevant to task design. By taking into account the expected response of a task (and by describing that response in terms of these functions), it will be possible to explore predicted and actual test task outcome. It will also be a useful guide for item writers in taking a priori decisions about content coverage. Through this approach it should be possible to predict more accurately linguistic response (in terms of the elements of the checklists) and to apply this to the design of test tasks – and, of course, to evaluate the success of the prediction later on. In the longer term, this will lead to a greater understanding of how tasks and task formats can be manipulated to result in specific language use. We are not claiming that it is possible to predict language use at a micro level (grammatical form or lexis), but that it is possible to predict informational and interactional functions and features of interaction management – a notion supported by Bygate (1999).
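
In code, this a priori use of the checklist reduces to a set comparison between the functions a task is designed to elicit and those actually observed. The sketch below is illustrative only; the predicted and observed sets are invented.

```python
# Intended functional content of a (hypothetical) paired discussion
# task, stated before administration...
predicted = {"suggesting", "agreeing", "disagreeing", "deciding"}

# ...and the functions observers actually recorded for that task.
observed = {"suggesting", "agreeing", "expressing preferences"}

as_planned = predicted & observed   # elicited as intended
missing = predicted - observed      # intended but not produced
unplanned = observed - predicted    # produced but not intended

print("elicited as planned:", sorted(as_planned))
print("predicted but absent:", sorted(missing))
print("not predicted:", sorted(unplanned))
print(f"coverage of prediction: {len(as_planned)}/{len(predicted)}")
```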

The checklists should also enable us to explore how systematic variation in such areas as interviewer questioning behaviour (and interlocutor frame adherence) affects the language produced in this type of test. In the interview transcribed for this study, for example, the examiner directed his questions very deliberately (systematically aiming the questions at one participant and then the other). This tended to stifle any spontaneity in the intended three-way discussion (Task 4), so occurrences of Interactional and Discourse Management Functions did not materialize to the extent intended by the task designers. It is possible that a less deliberate (unscripted) questioning technique would lead to a less interviewer-oriented interaction pattern and allow for the more genuine interactive communication envisaged in the task design.

Perhaps the most valuable contribution that this type of validation procedure offers is its potential to improve the quality of oral assessment in both low-stakes and high-stakes contexts. By offering the investigator an instrument that can be used in real time, the checklists broaden the scope of investigation from limited case study analysis of small numbers of test transcripts to large-scale field studies across a wide range of testing contexts.

Acknowledgements

We would like to thank Don Porter and Rita Green for their early input into the first version of the checklist. In addition, help was received from members of the ELT division in UCLES, in particular from Angela ffrench, Lynda Taylor and Christina Rimini, from a group of UCLES Senior Team Leaders, and from MA TEFL students at the University of Reading. Finally, we would like to thank the editors and anonymous reviewers of Language Testing for their insightful comments and helpful suggestions for its improvement. The faults that remain are, as ever, ours.

VIII References

Anastasi, A. 1988: Psychological testing. 6th edition. New York: Macmillan.
Bachman, L.F. 1990: Fundamental considerations in language testing. Oxford: Oxford University Press.
Bachman, L.F. and Palmer, A.S. 1981: The construct validation of the FSI oral interview. Language Learning 31, 67–86.
—— 1996: Language testing in practice. Oxford: Oxford University Press.
Ballman, T.L. 1991: The oral task of picture description: similarities and differences in native and nonnative speakers of Spanish. In Teschner, R.V., editor, Assessing foreign language proficiency of undergraduates. AAUSC Issues in Language Program Direction. Boston: Heinle and Heinle, 221–31.
Brown, A. 1998: Interviewer style and candidate performance in the IELTS oral interview. Paper presented at the Language Testing Research Colloquium, Monterey, CA.
Bygate, M. 1988: Speaking. Oxford: Oxford University Press.
—— 1999: Quality of language and purpose of task: patterns of learners' language on two oral communication tasks. Language Teaching Research 3, 185–214.
Chalhoub-Deville, M. 1995a: Deriving oral assessment scales across different tests and rater groups. Language Testing 12, 16–33.
—— 1995b: A contextualized approach to describing oral language proficiency. Language Learning 45, 251–81.
Clark, J.L.D. 1979: Direct vs. semi-direct tests of speaking ability. In Briere, E.J. and Hinofotis, F.B., editors, Concepts in language testing: some recent studies. Washington, DC: TESOL.
—— 1988: Validation of a tape-mediated ACTFL/ILR scale based test of Chinese speaking proficiency. Language Testing 5, 187–205.
Clark, J.L.D. and Hooshmand, D. 1992: 'Screen to Screen' testing: an exploratory study of oral proficiency interviewing using video teleconferencing. System 20, 293–304.
Cronbach, L.J. 1971: Validity. In Thorndike, R.L., editor, Educational measurement. 2nd edition. Washington, DC: American Council on Education, 443–597.
—— 1990: Essentials of psychological testing. 5th edition. New York: Harper & Row.
Davies, A. 1977: The construction of language tests. In Allen, J.P.B. and Davies, A., editors, Testing and experimental methods. The Edinburgh Course in Applied Linguistics Volume 4. London: Oxford University Press, 38–194.
—— 1990: Principles of language testing. Oxford: Blackwell.
Ellerton, A.W. 1997: Considerations in the validation of semi-direct oral testing. Unpublished PhD thesis, CALS, University of Reading.
ffrench, A. 1999: Language functions and UCLES speaking tests. Seminar in Athens, Greece, October 1999.
Foster, P. and Skehan, P. 1996: The influence of planning and task type on second language performance. Studies in Second Language Acquisition 18, 299–323.
—— 1999: The influence of source of planning and focus of planning on task-based performance. Language Teaching Research 3, 215–47.
Fulcher, G. 1994: Some priority areas for oral language testing. Language Testing Update 15, 39–47.
—— 1996: Testing tasks: issues in task design and the group oral. Language Testing 13, 23–51.
—— 1999: Assessment in English for academic purposes: putting content validity in its place. Applied Linguistics 20, 221–36.
Hayashi, M. 1995: Conversational repair: a contrastive study of Japanese and English. MA Project Report, University of Canberra.
Henning, G. 1983: Oral proficiency testing: comparative validities of interview, imitation and completion methods. Language Learning 33, 315–32.
—— 1987: A guide to language testing. Cambridge, MA: Newbury House.
Kelly, R. 1978: On the construct validation of comprehension tests: an exercise in applied linguistics. Unpublished PhD thesis, University of Queensland.
Kenyon, D. 1995: An investigation of the validity of task demands on performance-based tests of oral proficiency. In Kunnan, A.J., editor, Validation in language assessment: selected papers from the 17th Language Testing Research Colloquium, Long Beach. Mahwah, NJ: Lawrence Erlbaum, 19–40.
Kormos, J. 1999: Simulating conversations in oral-proficiency assessment: a conversation analysis of role plays and non-scripted interviews in language exams. Language Testing 16, 163–88.
Lazaraton, A. 1992: The structural organisation of a language interview: a conversational analytic perspective. System 20, 373–86.
—— 1996: A qualitative approach to monitoring examiner conduct in the Cambridge assessment of spoken English (CASE). In Milanovic, M. and Saville, N., editors, Performance testing, cognition and assessment: selected papers from the 15th Language Testing Research Colloquium, Cambridge and Arnhem. Studies in Language Testing 3. Cambridge: University of Cambridge Local Examinations Syndicate, 18–33.
—— 2000: A qualitative approach to the validation of oral language tests. Studies in Language Testing Volume 14. Cambridge: Cambridge University Press.
Lumley, T. and O'Sullivan, B. 2000: The effect of speaker and topic variables on task performance in a tape-mediated assessment of speaking. Paper presented at the 2nd Annual Asian Language Assessment Research Forum, The Hong Kong Polytechnic University.
Luoma, S. 1997: Comparability of a tape-mediated and a face-to-face test of speaking: a triangulation study. Unpublished Licentiate Thesis, Centre for Applied Language Studies, Jyväskylä University, Finland.
McNamara, T. 1996: Measuring second language performance. London: Longman.
Mehnert, U. 1998: The effects of different lengths of time for planning on second language performance. Studies in Second Language Acquisition 20, 83–108.
Messick, S. 1975: The standard problem: meaning and values in measurement and evaluation. American Psychologist 30, 955–66.
—— 1989: Validity. In Linn, R.L., editor, Educational measurement. 3rd edition. New York: Macmillan.
Milanovic, M. and Saville, N. 1996: Introduction. Performance testing, cognition and assessment. Studies in Language Testing Volume 3. Cambridge: University of Cambridge Local Examinations Syndicate, 1–17.
Moller, A.D. 1982: A study in the validation of proficiency tests of English as a Foreign Language. Unpublished PhD thesis, University of Edinburgh.
Norris, J., Brown, J.D., Hudson, T. and Yoshioka, J. 1998: Designing second language performance assessments. Technical Report 18. Honolulu, HI: University of Hawaii Press.
O'Loughlin, K. 1995: Lexical density in candidate output on direct and semi-direct versions of an oral proficiency test. Language Testing 12, 217–37.
—— 1997: The comparability of direct and semi-direct speaking tests: a case study. Unpublished PhD Thesis, University of Melbourne, Melbourne.
—— 2001: An investigatory study of the equivalence of direct and semi-direct speaking skills. Studies in Language Testing 13. Cambridge: Cambridge University Press/UCLES.
Ortega, L. 1999: Planning and focus on form in L2 oral performance. Studies in Second Language Acquisition 20, 109–48.
O'Sullivan, B. 2000: Towards a model of performance in oral language testing. Unpublished PhD dissertation, CALS, University of Reading.
Robinson, P. 1995: Task complexity and second language narrative discourse. Language Learning 45, 99–140.
Ross, S. and Berwick, R. 1992: The discourse of accommodation in oral proficiency interviews. Studies in Second Language Acquisition 14, 159–76.
Saville, N. and Hargreaves, P. 1999: Assessing speaking in the revised FCE. ELT Journal 53, 42–51.
Schegloff, E., Jefferson, G. and Sacks, H. 1977: The preference for self-correction in the organisation of repair in conversation. Language 53, 361–82.
Schwartz, J. 1980: The negotiation for meaning: repair in conversations between second language learners of English. In Larsen-Freeman, D., editor, Discourse analysis in second language research. Rowley, MA: Newbury House.
Shohamy, E. 1983: The stability of oral language proficiency assessment in the oral interview testing procedure. Language Learning 33, 527–40.
—— 1988: A proposed framework for testing the oral language of second/foreign language learners. Studies in Second Language Acquisition 10, 165–79.
—— 1994: The validity of direct versus semi-direct oral tests. Language Testing 11, 99–123.
Shohamy, E., Reves, T. and Bejarano, Y. 1986: Introducing a new comprehensive test of oral proficiency. ELT Journal 40, 212–20.
Skehan, P. 1996: A framework for the implementation of task based instruction. Applied Linguistics 17, 38–62.
—— 1998: A cognitive approach to language learning. Oxford: Oxford University Press.
Stansfield, C.W. and Kenyon, D.M. 1992: Research on the comparability of the oral proficiency interview and the simulated oral proficiency interview. System 20, 347–64.
Stenström, A. 1994: An introduction to spoken interaction. London: Longman.
Suhua, H. 1998: A communicative test of spoken English for the CET 6. Unpublished PhD Thesis, Shanghai Jiao Tong University, Shanghai.
Upshur, J.A. and Turner, C. 1999: Systematic effects in the rating of second-language speaking ability: test method and learner discourse. Language Testing 16, 82–111.
van Ek, J.A. and Trim, J.L.M., editors, 1984: Across the threshold. Oxford: Pergamon.
van Lier, L. 1989: Reeling, writhing, drawling, stretching and fainting in coils: oral proficiency interviews as conversation. TESOL Quarterly 23, 489–508.
Walker, C. 1990: Large-scale oral testing. Applied Linguistics 11, 200–19.
Weir, C.J. 1983: Identifying the language needs of overseas students in tertiary education in the United Kingdom. Unpublished PhD thesis, University of London.
—— 1993: Understanding and developing language tests. Hemel Hempstead: Prentice Hall.
Wigglesworth, G. 1997: An investigation of planning time and proficiency level on oral test discourse. Language Testing 14, 85–106.
Wigglesworth, G. and O'Loughlin, K. 1993: An investigation into the comparability of direct and semi-direct versions of an oral interaction test in English. Melbourne Papers in Language Testing 2, 56–67.
Young, R. 1995: Conversational styles in language proficiency interviews. Language Learning 45, 3–42.
Young, R. and Milanovic, M. 1992: Discourse variation in oral proficiency interviews. Studies in Second Language Acquisition 14, 403–24.

Appendix 1 Items included in initial draft checklists (with short gloss)

Informational functions
Providing personal information:
· Give information on present circumstances
· Give information on past experiences
· Give information on future plans
Providing nonpersonal information: Give information which does not relate to the individual
Elaborating: Elaborate on an idea
Expressing opinions: Express opinions
Justifying opinions: Express reasons for assertions s/he has made
Comparing: Compare things/people/events
Complaining: Complain about something
Speculating: Hypothesize or speculate
Analysing: Separate out the parts of an issue
Making excuses: Make excuses
Explaining: Explain anything
Narrating: Describe a sequence of events
Paraphrasing: Paraphrase something
Summarizing: Summarize what s/he had said
Suggesting: Suggest a particular idea
Expressing preferences: Express preferences

Interactional functions
Challenging: Challenge assertions made by another speaker
(Dis)agreeing: Indicate (dis)agreement with what another speaker says (apart from 'yeah'/'no' or simply nodding)
Justifying/Providing support: Offer justification or support for a comment made by another speaker
Qualifying: Modify arguments or comments
Asking for opinions: Ask for opinions
Persuading: Attempt to persuade another person
Asking for information: Ask for information
Conversational repair: Repair breakdowns in interaction
Negotiating meaning:
· Check understanding
· Attempt to establish common ground or strategy
· Respond to requests for clarification
· Ask for clarification
· Make corrections
· Indicate purpose
· Indicate understanding/uncertainty

Managing interaction
Initiating: Start any interactions
Changing: Take the opportunity to change the topic
Reciprocity: Share the responsibility for developing the interaction
Deciding: Come to a decision
Terminating: Decide when the discussion should stop

Appendix 2 Phase 1 results (summarized)

[The original presents a participant-by-function table of Phase 1 observation counts, sorted horizontally and vertically by totals; the cell values did not survive extraction. The recoverable column headings (the observed functions) are: Make excuses; Terminate; Conversational repair; Summarize; Complain; Paraphrase; Persuade; Change topic; Challenge; Qualify; Ask for information; Suggest; Narrate; Reciprocate; Analyse; Elaborate; Initiate; Provide nonpersonal information; Explain; Justify opinions; Negotiate meaning; Decide; (Dis)agree; Justify/Support; Ask for opinions; Express preferences; Speculate; Compare; Express opinion. The rows are labelled by participant.]

Appendix 3 Operational checklist (used in Phase 3)

Informational functions
Providing personal information:
· Give information on present circumstances
· Give information on past experiences
· Give information on future plans
Expressing opinions: Express opinions
Elaborating: Elaborate on or modify an opinion
Justifying opinions: Express reasons for assertions s/he had made
Comparing: Compare things/people/events
Speculating: Speculate
Staging: Separate out or interpret the parts of an issue
Describing:
· Describe a sequence of events
· Describe a scene
Summarizing: Summarize what s/he has said
Suggesting: Suggest a particular idea
Expressing preferences: Express preferences

Interactional functions
Agreeing: Agree with an assertion made by another speaker (apart from 'yeah' or nonverbal)
Disagreeing: Disagree with what another speaker says (apart from 'no' or nonverbal)
Modifying: Modify arguments or comments made by other speaker, or by the test-taker in response to another speaker
Asking for opinions: Ask for opinions
Persuading: Attempt to persuade another person
Asking for information: Ask for information
Conversational repair: Repair breakdowns in interaction
Negotiating meaning:
· Check understanding
· Indicate understanding of point made by partner
· Establish common ground/purpose or strategy
· Ask for clarification when an utterance is misheard or misinterpreted
· Correct an utterance made by other speaker which is perceived to be incorrect or inaccurate
· Respond to requests for clarification

Managing interaction
Initiating: Start any interactions
Changing: Take the opportunity to change the topic
Reciprocating: Share the responsibility for developing the interaction
Deciding: Come to a decision

Appendix 4 Summary of Phase 3 observation

(Columns in the original run Tape 1, Tasks 1–4, then Tape 2, Tasks 1–4. Entries are listed below in row order; their alignment with individual task columns could not be fully recovered.)

Informational functions
Providing personal information
  Present: 12 (G), 1 (L), 1 (L), 1 (L), 12 (G), 1 (L), 4 (L)
  Past: 10 (G), 4 (S), 12 (G)
  Future: 11 (G), 3 (L), 6 (S), 12 (G)
Expressing opinions: 12 (G), 11 (G), 9 (G), 8 (G), 11 (G), 10 (G), 10 (G), 11 (G)
Elaborating: 9 (G), 11 (G), 9 (G), 7 (G), 3 (L), 9 (G), 7 (S), 12 (G)
Justifying opinions: 10 (G), 7 (G), 9 (G), 7 (G), 4 (L), 8 (S), 6 (S), 8 (S)
Comparing: 11 (G), 8 (G), 1 (L), 6 (S), 3 (L), 12 (G), 7 (S), 5 (S)
Speculating: 7 (S), 11 (G), 8 (G), 3 (L), 7 (S), 10 (G), 10 (G), 5 (S)
Staging: 6 (S), 1 (L), 3 (L), 6 (L)
Describing
  Sequence of events: 1 (L), 1 (L), 3 (L), 1 (L), 4 (L)
  Scene: 5 (S), 9 (G), 2 (S), 2 (S), 10 (G), 2 (S), 3 (S)
Summarizing: 1 (L), 1 (L), 1 (L), 1 (L), 3 (L), 1 (L), 1 (L), 1 (L)
Suggesting: 1 (L), 2 (L), 1 (L), 3 (L), 2 (L)
Expressing preferences: 12 (G), 11 (G), 6 (S), 8 (G), 11 (G), 10 (G), 5 (S), 12 (G)

Interactional functions
Agreeing: 6 (S), 9 (G), 2 (L), 10 (G), 4 (L)
Disagreeing: 9 (G), 4 (S), 2 (L), 6 (S)
Modifying: 1 (L), 5 (S), 4 (S), 7 (S), 1 (L)
Asking for opinions: 1 (L), 8 (G), 2 (L), 11 (G)
Persuading: 2 (L), 2 (L)
Asking for information: 2 (L), 1 (L), 5 (S)
Conversational repair: 5 (S), 4 (L), 1 (L)
Negotiating meaning
  Check meaning: 2 (L), 4 (S), 4 (L)
  Understanding: 5 (S), 3 (L), 3 (L)
  Common ground: 2 (L), 2 (L), 1 (L)
  Ask clarification: 2 (L), 1 (L), 2 (L)
  Correct utterance: 3 (L), 1 (L)
  Respond to requests for clarification: 4 (S), 1 (L)

Managing interaction
Initiating: 8 (G), 1 (L), 10 (G), 5 (S)
Changing: 8 (G), 7 (S)
Reciprocating: 7 (G), 9 (G), 1 (L)
Deciding: 3 (L), 1 (L), 1 (L), 2 (L)

Notes: The figures indicate the number of students recording each function in each case; L = little agreement, S = some agreement, G = good agreement. For Tasks 3 and 4 of the first tape observed the maximum was 9; for all others the maximum was 12. This is because 3 of the 12 MA students did not complete the checklists for these last two tasks. This was not a problem during the observation of the second tape, so for all of its tasks the maximum figure is 12.

Appendix 5 Transcript results and observation checklist results

(Columns in the original run Task 1 to Task 4. Entries are listed below in row order; their alignment with individual task columns could not be fully recovered.)

Informational functions
Providing personal information
  Present: T G L T L
  Past: T G
  Future: T G
Expressing opinions: T G T G T G T G
Elaborating: L T G T S T G
Justifying opinions: L T S T S T S
Comparing: L T G T S S
Speculating: T S T G T G S
Staging: T L T S
Describing
  Sequence of events: T L L L
  Scene: T G L L
Summarizing: T L L L L
Suggesting: L L
Expressing preferences: T G T G S T G

Interactional functions
Agreeing: T G T L
Disagreeing: T S
Modifying: T S T L
Asking for opinions: T G
Persuading: L
Asking for information: S
Conversational repair: T S T L L
Negotiating meaning
  Check meaning: L
  Understanding: L L
  Common ground: L L
  Ask clarification: L T L
  Correct utterance: L
  Respond to requests for clarification: L

Managing interaction
Initiating: T G T S
Changing: T S
Reciprocating: T G L
Deciding: L L

Notes: T indicates that the function was identified as occurring in the transcript of the interaction; L, S and G indicate the degree of agreement among the raters using the checklists in real time (L = little agreement, S = some agreement, G = good agreement).

Page 4: Using observation checklists to validate speaking-test taskspart. The format of the Main Suite Speaking Tests (with the exception of the Level 1 KET test) is summarized in Table 1

36 Validating speaking-test tasks

The subject of this study the task has been explored from a numberof perspectives Brie y these have been

middot Taskmethod comparison (quantitative) involving studies inwhich comparisons are made between performances on differenttasks or methods (Clark 1979 1988 Henning 1983 Shohamy1983 Shohamy et al 1986 Clark and Hooshmand 1992 Stans- eld and Kenyon 1992 Wigglesworth and OrsquoLoughlin 1993Chalhoub-Deville 1995a OrsquoLoughlin 1995 Fulcher 1996 Lum-ley and OrsquoSullivan 2000 OrsquoSullivan 2000)

middot Taskmethod comparison (qualitative) as above but where quali-tative methods are employed (Shohamy 1994 Young 1995Luoma 1997 OrsquoLoughlin 1997 Bygate 1999 Kormos 1999)

middot Task performance (method effect) where aspects of the task aresystematically manipulated eg planning time pre- or post-taskoperations etc (Foster and Skehan 1996 1999 Wigglesworth1997 Mehnert 1998 Ortega 1999 Upshur and Turner 1999)

middot Native speakerNonnative speaker comparison where nativespeaker performance on speci c tasks is compared to nonnativespeaker performance on the same tasks (Weir 1983 Ballman1991)

middot Task dif cultyclassi cation where an attempt has been made toclassify tasks in terms of their dif culty (Weir 1993 Fulcher1994 Kenyon 1995 Robinson 1995 Skehan 1996 1998 Norriset al 1998)

The central importance of the test task has been clearly recognizedhowever in terms of test validation there is one question that hasto date remained largely unexplored Although there has been a greatdeal of debate over the validation of performance tests through analy-sis of the language generated in the performance of language elici-tation tasks (LETs) (eg van Lier 1989 Lazaraton 1992 1996)attention has not been drawn to the one aspect of task performancethat would appear to be of most interest to the test designer That iswhen tasks are performed in a test event how does that performancerelate to the test designerrsquos predictions or expectations based on theirde nition or interpretation of the construct After all no matter howreliably the performance is scored if it does not match the expec-tations of the test designer (in other words represent the constructswhich are to be tested) then the inferences that the test designer hopesto draw from the evaluated performance will not be valid

Cronbach went to the heart of the matter (1971 443) lsquoConstruc-tion of a test itself starts from a theory about behaviour or mentalorganization derived from prior research that suggests the ground planfor the testrsquo Davies (1977 63) argued in similar vein lsquoit is after

Barry OrsquoSullivan Cyril J Weir and Nick Saville 37

all the theory on which all else rests it is from there that the constructis set up and it is on the construct that validity of the content andpredictive kinds is basedrsquo Kelly (1978 8) supported this view com-menting that lsquothe systematic development of tests requires sometheory even an informal inexplicit one to guide the initial selectionof item content and the division of the domain of interest into appro-priate sub-areasrsquo

Because we lack an adequate theory of language in use a prioriattempts to determine the construct validity of pro ciency testsinvolve us in matters that relate more evidently to content validityWe need to talk of the communicative construct in descriptive termsand as a result we become involved in questions of content relevanceand content coverage Thus for Kelly (1978 8) content validityseemed lsquoan almost completely overlapping conceptrsquo with constructvalidity and for Moller (1982 68) lsquothe distinction between constructand content validity in language testing is not always very markedparticularly for tests of general language pro ciencyrsquo

Content validity is considered important as it is principally con-cerned with the extent to which the selection of test tasks is represen-tative of the larger universe of tasks of which the test is assumed tobe a sample (see Bachman and Palmer 1981 Henning 1987 94Messick 1989 16 Bachman 1990 244) Similarly Anastasi (1988131) de ned content validity as involving lsquoessentially the systematicexamination of the test content to determine whether it covers arepresentative sample of the behaviour domain to be measuredrsquo Sheoutlined (Anastasi 1988 132) the following guidelines for estab-lishing content validity

1) lsquothe behaviour domain to be tested must be systematicallyanalysed to make certain that all major aspects are covered bythe test items and in the correct proportionsrsquo

2) lsquothe domain under consideration should be fully described inadvance rather than being de ned after the test has been pre-paredrsquo

3) lsquocontent validity depends on the relevance of the individualrsquos testresponses to the behaviour area under consideration rather thanon the apparent relevance of item contentrsquo

The directness of t and adequacy of the test sample is thus dependenton the quality of the description of the target language behaviourbeing tested In addition if the responses to the item are invokedMessick (1975 961) suggests lsquothe concern with processes underlyingtest responses places this approach to content validity squarely in therealm of construct validityrsquo Davies (1990 23) similarly notes lsquocon-tent validity slides into construct validityrsquo

38 Validating speaking-test tasks

Content validation is of course extremely problematic given thedif culty we have in characterizing language pro ciency with suf- cient precision to ensure the validity of the representative samplewe include in our tests and the further threats to validity arising out ofany attempts to operationalize real life behaviours in a test Specifyingoperations let alone the conditions under which these are performedis challenging and at best relatively unsophisticated (see Cronbach1990) Weir (1993) provides an introductory attempt to specify theoperations and conditions that might form a framework for test taskdescription (see also Bachman 1990 Bachman and Palmer 1996)

The dif culties involved do not however absolve us fromattempting to make our tests as relevant as possible in terms of con-tent Generating content related evidence is seen as a necessaryalthough not suf cient part of the validation process of a speakingtest To this end we sought to establish in this study an effective andef cient procedure for establishing the content validity of speakingtests As well as being useful in helping specify the domain to betested we would argue that the checklist discussed below wouldenable the researcher to address how predicted vs actual task per-formance can be compared

III Methodological issues

While it is relatively easy to rationalize the need to establish that theLETs used in performance tests are working as predicted (ie interms of language generated) the dif culty lies in how this mightbest be done

UCLES EFL (English as a foreign language) routinely collectsaudio recordings and carries out transcriptions of its Speaking TestsThese transcripts are used for a range of validation purposes and inparticular they contribute to revision projects for the Speaking Testsfor example FCE which was revised in 1996 and currently therevision of the International English Language Testing System(IELTS) Speaking Test in addition to the CPE revision project

In a series of UCLES studies focusing on the language of theSpeaking Tests Lazaraton has applied conversational analysis (CA)techniques to contribute to our understanding of the language used inpair-format Speaking Tests including the language of the candidatesand the interlocutor Her approach requires a very careful ne-tunedtranscription of the tests in order to provide the data for analysis (seeLazaraton 2000) Similar qualitative methodologies have beenapplied by Young and Milanovic (1992) ndash also to UCLES data ndash byBrown (1998) and by Ross and Berwick (1992) amongst others

Barry OrsquoSullivan Cyril J Weir and Nick Saville 39

While there is clearly a great deal of potential for this detailed analy-sis of transcribed performances there are also a number of drawbacksthe most serious of which involves the complexity of the transcriptionprocess In practice this means that a great deal of time and expertiseis required in order to gain the kind of data that will answer the basicquestion concerning validity Even where this is done it is impracticalto attempt to deal with more than a small number of test eventstherefore the generalizability of the results may be questioned

Clearly then a more ef cient methodology is required that allowsthe test designer to evaluate the procedures and especially the tasksin terms of the language produced by a larger number of candidatesIdeally this should be possible in lsquorealrsquo time so that the relationshipof predicted outcome to speci c outcome can be established using adata set that satisfactorily re ects the typical test-taking populationThe primary objective of this project therefore was to create aninstrument built on a framework that describes the language of per-formance in a way that can be readily accessed by evaluators whoare familiar with the tests being observed This work is designed tobe complementary to the use of transcriptions and to provide anadditional source of validation evidence

The FCE was chosen as the focus of this study for a number ofreasons

middot It is lsquostablersquo in that it is neither under review nor is due to bereviewed

middot It represents the middle of the ALTE (and UCLES Main Suite)range and is the most widely subscribed test in the battery

middot It offers the most likelihood of a wide range of performance ofany Main Suite examination as it is often used as an lsquoentry-pointrsquointo the suite candidates tend to range from below to above thislevel in terms of ability

middot Like all of the other Main Suite examinations a database ofrecordings (audio and video) already existed

IV The development of the observation checklists

Weir (1993) building on the earlier work of Bygate (1988) suggeststhat the language of a speaking test can be described in terms ofthe informational and interactional functions and those of interactionmanagement generated by the participants involved With this as astarting point a group of researchers at the University of Readingwere commissioned by UCLES EFL to examine the spoken langu-age second language acquisition and language testing literatures tocome up with an initial set of such functions (see Schegloff et al

40 Validating speaking-test tasks

1977 Schwartz 1980 van Ek and Trim 1984 Bygate 1988Shohamy 1988 1994 Walker 1990 Weir 1994 Stenstrom 1994Chalhoub-Deville 1995b Hayashi 1995 Ellerton 1997 Suhua1998 Kormos 1999 OrsquoSullivan 2000 OrsquoLoughlin 2001)

These were then presented as a draft set of three checklists(Appendix 1) representing each of the elements of Weirrsquos categoriz-ation What follows in the three phases of the development processdescribed below (Section VI) was an attempt to customize thechecklist to more closely re ect the intended outcomes of spokenlanguage test tasks in the UCLES Main Suite The checklists weredesigned to help establish which of these functions resulted andwhich were absent

The next concern was with the development of a procedure fordevising a lsquoworkingrsquo version of the checklists to be followed by anevaluation of using this type of instrument in lsquorealrsquo time (using tapesor perhaps live speaking tests)

V The development model

The process through which the checklists were developed is shownin Figure 3 The concept that drives this model is the evaluation ateach level by different stakeholders At this stage of the project thesestakeholders were identi ed as

Figure 3 The development model

Barry OrsquoSullivan Cyril J Weir and Nick Saville 41

middot the consulting lsquoexpertrsquo testers (the University of Reading group)middot the test development and validation staff at UCLESmiddot UCLES Senior Team Leaders (ie key staff in the oral examiner

training system)

All these individuals participated in the application of each draft Itshould also be noted that a number of drafts were anticipated

VI The development process

In order to arrive at a working version of the checklists a number ofdevelopmental phases were anticipated At each phase the latest ver-sion (or draft) of the instruments was applied and this applicationevaluated

Phase 1

The rst attempt to examine how the draft checklists would beviewed and applied by a group of language teachers was conductedby ffrench (1999) Of the participants at the seminar approximately50 of the group reported that English (British English AmericanEnglish or Australian English) was their rst language while theremaining 50 were native Greek speakers

In their introduction to the application of the Observation Check-lists (OCs) the participants were given a series of activities thatfocused on the nature and use of those functions of language seen bytask designers at UCLES to be particularly applicable to their EFLMain Suite Speaking Tests (principally FCE CAE and CPE) Oncefamiliar with the nature of the functions (and where they might occurin a test) the participants applied the OCs in lsquorealrsquo time to an FCESpeaking Test from the 1998 Standardization Video This video fea-tured a pair of French speakers who were judged by a panel oflsquoexpertrsquo raters (within UCLES) to be slightly above the criterion(lsquopassrsquo) level

Of the 37 participants, 32 completed the task successfully; that is, they attempted to make frequency counts of the items represented in the OCs. Among this group there appear to be varying degrees of agreement as to the use of language functions, particularly in terms of the specific number of observations of each function. However, when the data are examined from the perspective of agreement on whether a particular function was observed or not (ignoring the count, which in retrospect was highly ambitious when we consider the lack of systematic training in the use of the questionnaires given to the teachers who attended), we find that there is a striking degree of agreement on all but a small number of functions (Appendix 2). Note here that, in order to make these patterns of behaviour clear, the data have been sorted both horizontally and vertically, by the total number of observations made by each participant and of each item.
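The reduction and sorting just described are easy to picture in code. The following Python sketch is ours, not part of the original study, and every participant label, function name and count in it is invented for illustration: it reduces raw tallies to presence/absence and orders both axes by their totals, as was done for Appendix 2.

```python
# Sketch of the Phase 1 agreement analysis; all data are invented.
counts = {
    "P01": {"expressing opinions": 5, "comparing": 2, "persuading": 0},
    "P02": {"expressing opinions": 4, "comparing": 3, "persuading": 1},
    "P03": {"expressing opinions": 6, "comparing": 0, "persuading": 0},
}

functions = sorted({f for row in counts.values() for f in row})

# Reduce raw tallies to presence/absence: was the function ticked at all?
presence = {p: {f: int(row.get(f, 0) > 0) for f in functions}
            for p, row in counts.items()}

# Sort rows (participants) and columns (functions) by their totals,
# mirroring the horizontal and vertical sorting used for Appendix 2.
row_totals = {p: sum(row.values()) for p, row in presence.items()}
col_totals = {f: sum(presence[p][f] for p in presence) for f in functions}
participants_sorted = sorted(presence, key=row_totals.get, reverse=True)
functions_sorted = sorted(functions, key=col_totals.get)

print("participants by total:", participants_sorted)
for f in functions_sorted:
    print(f"{f:20s} observed by {col_totals[f]} of {len(presence)} observers")
```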

From this perspective, this aspect of the developmental process was considered to be quite successful. However, it was apparent that there were a number of elements within the checklists that were causing some difficulty. These are highlighted in the table by the tram-lines: items above the lines have been identified by some participants, in one case by a single person, while those below have been observed by a majority of participants (in two cases by all of them). For these cases we might infer a high degree of agreement. However, the middle range of items appears to have caused a degree of confusion, and so these are highlighted here, i.e., marked for further investigation.

Phase 2

In this phase a much smaller gathering was organized, this time involving members of the development team as well as the three UK-based UCLES Senior Team Leaders. In advance of this meeting all participants were asked to study the existing checklists and to exemplify each function with examples drawn from their experiences of the various UCLES Main Suite examinations. The resulting data were collated and presented as a single document that formed the basis of discussion during a day-long session. Participants were not made aware of the findings from Phase 1.

During this session many questions were asked of all aspects of the checklist, and a more streamlined version of the three sections was suggested. In addition to a number of participants making a written record of the discussions, the entire session was recorded. This proved to be a valuable reminder of the way in which particular changes came about, and was used when the final decisions regarding inclusion, conflation or omission were being made. Although it is beyond the scope of this project to analyse this recording, when coupled with the earlier and revised documents it is in itself a valuable source of data, in that it provides a significant record of the developmental process.

Among the many interesting outcomes of this phase were the decisions either to rethink, to reorganize or to omit items from the initial list. These decisions were seen to mirror the results of the Phase 1 application quite closely. Of the 13 items identified in Phase 1 as being in need of review (7 were rarely observed, indicating a high degree of agreement that they were not in fact present, and 6 appeared to be confused, with very mixed reported observations), 7 were recommended for either omission or inclusion in other items by the panel, while the remaining 6 items were identified by them as being of value. Although no examples of the latter had appeared in the earlier data, the panel agreed that they represented language functions that the UCLES Main Suite examinations were intended to elicit. It was also decided that each item in this latter group was in need of further clarification and/or exemplification. Of the remaining 17 items:

· two were changed: the item 'analysing' was recoded as 'staging' in order to clarify its intended meaning, while it was decided to separate the item '(dis)agreeing' into its two separate components;

· three were omitted: it was argued that the item 'providing non-personal information' referred to what was happening with the other items in the informational function category, while the items 'explaining' and 'justifying/supporting' were not functions usually associated with the UCLES Main Suite tasks, and no occurrences of these had been noted.

We would emphasize that, as reported in Section IV above, the initial list was developed to cover the language functions that various spoken language test tasks might elicit. The development of the checklists described here reflects an attempt to customize the lists in line with the intended functional outcomes of a specific set of tests.

We are, of course, aware that closed instruments of this type may be open to the criticism that valuable information could be lost. However, for reasons of practicality we felt it necessary to limit the list to what the examinations were intended to elicit, rather than attempt to operationalize a full inventory. Secondly, any functions that appeared in the data but were not covered by the reduced list would have been noted; there appeared to be no cases of this.

The data from these two phases were combined to result in a working version of the checklists (Appendix 3), which was then applied to a pair of FCE Speaking Tests in Phase 3.

Phase 3

In the third phase the revised checklists were given to a group of 15 MA TEFL students, who were asked to apply them to two FCE tests. Both of these tests involved a mixed-sex pair of learners: one pair of approximately average ability and the other pair above average. Before using the observation checklists (OCs), the students were asked first to attempt to predict which functions they might expect to find. To help in this pre-session task, the students were given details of the FCE format and tasks.


Unfortunately, a small number of students did not manage to complete the observation task, as they were somewhat overwhelmed by the real-time application of the checklists. As a result, only 12 sets of completed checklists were included in the final analysis.

Prior to the session the group was given an opportunity to have a practice run, using a third FCE examination. While this 'training' period, coupled with the pre-session task, was intended to provide the students with the background they needed to apply the checklists consistently, there was a problem during the session itself. This problem was caused by the failure of a number of students to note the change from Task 3 to Task 4 in the first test observed. This was possibly caused by a lack of awareness of the test itself, and was not helped by the seamless way in which the examiner on the video moved from a two-way discussion involving the test-takers to a three-way discussion. This meant that a full set of data exists only for the first two tasks of this test. As the problem was noticed in time, the second test did not cause these problems. Unlike the earlier seminar, on this occasion the participants were asked only to record each function when it was first observed. This was done as it was felt that the earlier seminar showed that, without extensive training, it would be far too difficult to apply the OCs fully in 'real' time in order to generate comprehensive frequency counts. We are aware that a full tally would enable us to draw more precise conclusions about the relative frequency of occurrence of these functions and the degree of consensus (reliability) of observers.

Against this, we must emphasize that the checklists, at their current stage of development, are designed to be used in real time. Their use was therefore restricted to determining the presence or absence of a particular function. Rater agreement in this case is limited to a somewhat crude account of whether a function occurred or did not occur in a particular task performance. We do not, therefore, have evidence of whether the function observed was invariant across raters.

The results from this session are included as Appendix 4. It can be seen from this table that the participants again display mixed levels of agreement, ranging from a single perceived observation to total agreement. As with the earlier session, it appears that there is relatively broad agreement on a range of functions, but that others appear to be more difficult to identify easily. These difficulties appear to be greatest where the task involves a degree of interaction between the test-takers.
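For concreteness, banding observer agreement into the 'little/some/good' labels of the kind used in Appendix 4 might be computed as below. The article does not define the bands numerically, so the thresholds in this sketch are purely illustrative assumptions of ours.

```python
# Hypothetical banding of observer agreement into L/S/G labels.
# Thresholds are illustrative; the study gives no numeric cut-offs.
def agreement_band(n_observed: int, n_observers: int) -> str:
    proportion = n_observed / n_observers
    if proportion >= 0.75:
        return "G"  # good agreement
    if proportion >= 0.40:
        return "S"  # some agreement
    return "L"      # little agreement

# e.g., 9 of 12 observers ticked a function in a given task
print(agreement_band(9, 12))   # -> G
print(agreement_band(5, 12))   # -> S
print(agreement_band(1, 12))   # -> L
```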

Phase 4

In this phase a transcription was made of the second of the two interviews used in Phase 3, since there was a full set of data available for this interview. The OCs were then 'mapped' on to this transcript in order to give an overview, from a different perspective, of what functions were generated (it being felt that this map would result in an accurate description of the test in terms of the items included in the OCs). This mapping was carried out by two researchers who initially worked independently of each other but discussed their finished work in order to arrive at a consensus.

Finally, the results of Phases 3 and 4 were compared (Appendix 5). This comparison indicates that the checklists are now working well. There are still some problems with items such as 'staging' and 'describing', and feedback from participants suggests that this may be due to misunderstandings or misinterpretations of the gloss and examples used. In addition, there are some similar difficulties with the initial three items in the interactional functions checklist, in which the greatest difficulties in applying the checklists appear to lie.
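The comparison underlying Appendix 5 amounts to cross-checking, for each function and task, the consensus transcript mapping (did the function occur?) against the real-time agreement band. A minimal sketch of that cross-check, with entirely invented data:

```python
# Cross-check transcript evidence (T) against real-time agreement bands,
# in the spirit of Appendix 5. All data below are invented for illustration.
transcript = {  # function -> tasks where the mapped transcript shows it
    "expressing opinions": {1, 2, 3, 4},
    "staging": {2, 4},
    "suggesting": set(),
}
realtime = {    # function -> task -> agreement band from the checklists
    "expressing opinions": {1: "G", 2: "G", 3: "G", 4: "G"},
    "staging": {2: "L", 4: "S"},
    "suggesting": {3: "L"},
}

for function, tasks in realtime.items():
    for task, band in sorted(tasks.items()):
        in_transcript = task in transcript.get(function, set())
        verdict = "match" if (in_transcript and band in ("S", "G")) \
                  else "review gloss/examples"
        print(f"{function:20s} task {task}: "
              f"transcript={'T' if in_transcript else '-'} "
              f"band={band} -> {verdict}")
```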

VII Discussion and initial conclusions

The results of this study appear to substantiate our belief that, although still under development for use with the UCLES Main Suite examinations, an operational version of these checklists is certainly feasible and has potentially wider application, mutatis mutandis, to the content validation of other spoken language tests. Further refinement of the checklists is clearly required, although the developmental process adopted here appears to have borne positive results.

1 Validities

We would not wish to claim that the checklists on their own offer a satisfactory demonstration of the construct validity of a spoken language test, for as Messick argues (1989: 16), 'the varieties of evidence supporting validity are not alternatives but rather supplements to one another'. We recognize the necessity for a broad view of 'the evidential basis for test interpretation' (Messick, 1989: 20). Bachman (1990: 237) similarly concludes: 'it is important to recognise that none of these [evidences of validity] by itself is sufficient to demonstrate the validity of a particular interpretation or use of test scores' (see also Bachman, 1990: 243). Fulcher (1999: 224) adds a further caveat against an overly narrow interpretation of content validity when he quotes Messick (1989: 41):

the major problem is that so-called content validity is focused upon test forms rather than test scores, upon instruments rather than measurements … selecting content is an act of classification, which is in itself a hypothesis that needs to be confirmed empirically.


Like these authors, we regard as inadequate any conceptualization of validity that does not involve the provision of evidence on a number of levels, but would argue strongly that, without a clear idea of the match between intended content and actual content, any comprehensive investigation of the construct validity of a test is built on sand. Defining the construct is, in our view, underpinned by establishing the nature of the actual performances elicited by test tasks, i.e., the true content of tasks.

2 Present and future applications of observational checklists

Versions of the checklists require a degree of training and practice, similar to that given to raters, if a reliable and consistent outcome is to be expected. This requires that standardized training materials be developed alongside the checklists. In the case of these checklists this process has already begun, with the initial versions piloted during Phase 3 of the project.

The checklists have great potential as an evaluative tool. It is hoped that, amongst other issues, they will provide insights into the following:

· the language functions that the different task-types (and different sub-tasks within these) employed in the UCLES Main Suite Paper 5 (Speaking) Tests typically elicit;
· the language that the pair-format elicits, and how it differs in nature and quality from that elicited by interlocutor–single candidate testing;
· the extent to which there is functional variation across the top four levels of the UCLES Main Suite Spoken Language Test.

In addition to these issues, the way in which the checklists can be applied may allow for other important questions to be answered. For example, by allowing the evaluator multiple observations (stopping and starting a recording of a test at will), it will be possible to establish whether there are quantifiable differences in the language functions generated by the different tasks, i.e., the evaluators will have the time they need to make frequency counts of the functions.
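Computationally, such a frequency count reduces to keeping one tally per task while replaying the recording. A minimal sketch (task numbers and function names invented):

```python
from collections import Counter

# One Counter per task; each key is a checklist function.
tally = {task: Counter() for task in (1, 2, 3, 4)}

# Illustrative log of (task, function) events noted while replaying a test.
events = [(1, "providing personal information"), (1, "expressing opinions"),
          (3, "speculating"), (3, "speculating"), (4, "agreeing")]
for task, function in events:
    tally[task][function] += 1

print(tally[3]["speculating"])  # -> 2
```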

While the results to date have focused on a posteriori validation procedures, these checklists are also relevant to task design. By taking into account the expected response to a task (and by describing that response in terms of these functions), it will be possible to explore predicted and actual test-task outcome. The checklists will also be a useful guide for item writers in taking a priori decisions about content coverage. Through this approach it should be possible to predict linguistic response more accurately (in terms of the elements of the checklists) and to apply this to the design of test tasks, and of course to evaluate the success of the prediction later on. In the longer term this will lead to a greater understanding of how tasks and task formats can be manipulated to result in specific language use. We are not claiming that it is possible to predict language use at a micro level (grammatical form or lexis), but that it is possible to predict informational and interactional functions and features of interaction management, a notion supported by Bygate (1999).
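In practice, predicted and observed functional coverage can be compared as two sets: anything predicted but not found, or found but not predicted, flags a task for possible redesign. A sketch under the same illustrative assumptions as above (function names invented, not drawn from the study's data):

```python
# Compare an item writer's predicted functions for a task with those
# actually observed; all names are illustrative.
predicted = {"expressing opinions", "speculating", "agreeing", "deciding"}
observed  = {"expressing opinions", "speculating", "describing"}

print("elicited as intended :", sorted(predicted & observed))
print("predicted, not found :", sorted(predicted - observed))
print("found, not predicted :", sorted(observed - predicted))
```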

The checklists should also enable us to explore how systematic variation in such areas as interviewer questioning behaviour (and interlocutor frame adherence) affects the language produced in this type of test. In the interview transcribed for this study, for example, the examiner directed his questions very deliberately (systematically aiming the questions at one participant and then the other). This tended to stifle any spontaneity in the intended three-way discussion (Task 4), so occurrences of Interactional and Discourse Management Functions did not materialize to the extent intended by the task designers. It is possible that a less deliberate (unscripted) questioning technique would lead to a less interviewer-oriented interaction pattern and allow for the more genuine interactive communication envisaged in the task design.

Perhaps the most valuable contribution that this type of validation procedure offers is its potential to improve the quality of oral assessment in both low-stakes and high-stakes contexts. By offering the investigator an instrument that can be used in real time, the checklists broaden the scope of investigation from limited case-study analysis of small numbers of test transcripts to large-scale field studies across a wide range of testing contexts.

Acknowledgements

We would like to thank Don Porter and Rita Green for their early input into the first version of the checklist. In addition, help was received from members of the ELT division in UCLES, in particular from Angela ffrench, Lynda Taylor and Christina Rimini; from a group of UCLES Senior Team Leaders; and from MA TEFL students at the University of Reading. Finally, we would like to thank the editors and anonymous reviewers of Language Testing for their insightful comments and helpful suggestions for its improvement. The faults that remain are, as ever, ours.


VIII References

Anastasi, A. 1988: Psychological testing. 6th edition. New York: Macmillan.
Bachman, L.F. 1990: Fundamental considerations in language testing. Oxford: Oxford University Press.
Bachman, L.F. and Palmer, A.S. 1981: The construct validation of the FSI oral interview. Language Learning 31, 67–86.
—— 1996: Language testing in practice. Oxford: Oxford University Press.
Ballman, T.L. 1991: The oral task of picture description: similarities and differences in native and nonnative speakers of Spanish. In Teschner, R.V., editor, Assessing foreign language proficiency of undergraduates. AAUSC Issues in Language Program Direction. Boston: Heinle and Heinle, 221–31.
Brown, A. 1998: Interviewer style and candidate performance in the IELTS oral interview. Paper presented at the Language Testing Research Colloquium, Monterey, CA.
Bygate, M. 1988: Speaking. Oxford: Oxford University Press.
—— 1999: Quality of language and purpose of task: patterns of learners' language on two oral communication tasks. Language Teaching Research 3, 185–214.
Chalhoub-Deville, M. 1995a: Deriving oral assessment scales across different tests and rater groups. Language Testing 12, 16–33.
—— 1995b: A contextualized approach to describing oral language proficiency. Language Learning 45, 251–81.
Clark, J.L.D. 1979: Direct vs. semi-direct tests of speaking ability. In Brière, E.J. and Hinofotis, F.B., editors, Concepts in language testing: some recent studies. Washington, DC: TESOL.
—— 1988: Validation of a tape-mediated ACTFL/ILR scale based test of Chinese speaking proficiency. Language Testing 5, 187–205.
Clark, J.L.D. and Hooshmand, D. 1992: 'Screen to Screen' testing: an exploratory study of oral proficiency interviewing using video teleconferencing. System 20, 293–304.
Cronbach, L.J. 1971: Validity. In Thorndike, R.L., editor, Educational measurement. 2nd edition. Washington, DC: American Council on Education, 443–597.
—— 1990: Essentials of psychological testing. 5th edition. New York: Harper & Row.
Davies, A. 1977: The construction of language tests. In Allen, J.P.B. and Davies, A., editors, Testing and experimental methods. The Edinburgh Course in Applied Linguistics Volume 4. London: Oxford University Press, 38–194.
—— 1990: Principles of language testing. Oxford: Blackwell.
Ellerton, A.W. 1997: Considerations in the validation of semi-direct oral testing. Unpublished PhD thesis, CALS, University of Reading.
ffrench, A. 1999: Language functions and UCLES speaking tests. Seminar in Athens, Greece, October 1999.
Foster, P. and Skehan, P. 1996: The influence of planning and task type on second language performance. Studies in Second Language Acquisition 18, 299–323.
—— 1999: The influence of source of planning and focus of planning on task-based performance. Language Teaching Research 3, 215–47.
Fulcher, G. 1994: Some priority areas for oral language testing. Language Testing Update 15, 39–47.
—— 1996: Testing tasks: issues in task design and the group oral. Language Testing 13, 23–51.
—— 1999: Assessment in English for academic purposes: putting content validity in its place. Applied Linguistics 20, 221–36.
Hayashi, M. 1995: Conversational repair: a contrastive study of Japanese and English. MA Project Report, University of Canberra.
Henning, G. 1983: Oral proficiency testing: comparative validities of interview, imitation and completion methods. Language Learning 33, 315–32.
—— 1987: A guide to language testing. Cambridge, MA: Newbury House.
Kelly, R. 1978: On the construct validation of comprehension tests: an exercise in applied linguistics. Unpublished PhD thesis, University of Queensland.
Kenyon, D. 1995: An investigation of the validity of task demands on performance-based tests of oral proficiency. In Kunnan, A.J., editor, Validation in language assessment: selected papers from the 17th Language Testing Research Colloquium, Long Beach. Mahwah, NJ: Lawrence Erlbaum, 19–40.
Kormos, J. 1999: Simulating conversations in oral-proficiency assessment: a conversation analysis of role plays and non-scripted interviews in language exams. Language Testing 16, 163–88.
Lazaraton, A. 1992: The structural organisation of a language interview: a conversational analytic perspective. System 20, 373–86.
—— 1996: A qualitative approach to monitoring examiner conduct in the Cambridge assessment of spoken English (CASE). In Milanovic, M. and Saville, N., editors, Performance testing, cognition and assessment: selected papers from the 15th Language Testing Research Colloquium, Cambridge and Arnhem. Studies in Language Testing 3. Cambridge: University of Cambridge Local Examinations Syndicate, 18–33.
—— 2000: A qualitative approach to the validation of oral language tests. Studies in Language Testing Volume 14. Cambridge: Cambridge University Press.
Lumley, T. and O'Sullivan, B. 2000: The effect of speaker and topic variables on task performance in a tape-mediated assessment of speaking. Paper presented at the 2nd Annual Asian Language Assessment Research Forum, The Hong Kong Polytechnic University.
Luoma, S. 1997: Comparability of a tape-mediated and a face-to-face test of speaking: a triangulation study. Unpublished Licentiate thesis, Centre for Applied Language Studies, Jyväskylä University, Finland.
McNamara, T. 1996: Measuring second language performance. London: Longman.
Mehnert, U. 1998: The effects of different lengths of time for planning on second language performance. Studies in Second Language Acquisition 20, 83–108.
Messick, S. 1975: The standard problem: meaning and values in measurement and evaluation. American Psychologist 30, 955–66.
—— 1989: Validity. In Linn, R.L., editor, Educational measurement. 3rd edition. New York: Macmillan.
Milanovic, M. and Saville, N. 1996: Introduction. Performance testing, cognition and assessment. Studies in Language Testing Volume 3. Cambridge: University of Cambridge Local Examinations Syndicate, 1–17.
Moller, A.D. 1982: A study in the validation of proficiency tests of English as a Foreign Language. Unpublished PhD thesis, University of Edinburgh.
Norris, J., Brown, J.D., Hudson, T. and Yoshioka, J. 1998: Designing second language performance assessments. Technical Report 18. Honolulu, HI: University of Hawaii Press.
O'Loughlin, K. 1995: Lexical density in candidate output on direct and semi-direct versions of an oral proficiency test. Language Testing 12, 217–37.
—— 1997: The comparability of direct and semi-direct speaking tests: a case study. Unpublished PhD thesis, University of Melbourne, Melbourne.
—— 2001: An investigatory study of the equivalence of direct and semi-direct speaking skills. Studies in Language Testing 13. Cambridge: Cambridge University Press/UCLES.
Ortega, L. 1999: Planning and focus on form in L2 oral performance. Studies in Second Language Acquisition 20, 109–48.
O'Sullivan, B. 2000: Towards a model of performance in oral language testing. Unpublished PhD dissertation, CALS, University of Reading.
Robinson, P. 1995: Task complexity and second language narrative discourse. Language Learning 45, 99–140.
Ross, S. and Berwick, R. 1992: The discourse of accommodation in oral proficiency interviews. Studies in Second Language Acquisition 14, 159–76.
Saville, N. and Hargreaves, P. 1999: Assessing speaking in the revised FCE. ELT Journal 53, 42–51.
Schegloff, E., Jefferson, G. and Sacks, H. 1977: The preference for self-correction in the organisation of repair in conversation. Language 53, 361–82.
Schwartz, J. 1980: The negotiation for meaning: repair in conversations between second language learners of English. In Larsen-Freeman, D., editor, Discourse analysis in second language research. Rowley, MA: Newbury House.
Shohamy, E. 1983: The stability of oral language proficiency assessment in the oral interview testing procedure. Language Learning 33, 527–40.
—— 1988: A proposed framework for testing the oral language of second/foreign language learners. Studies in Second Language Acquisition 10, 165–79.
—— 1994: The validity of direct versus semi-direct oral tests. Language Testing 11, 99–123.
Shohamy, E., Reves, T. and Bejarano, Y. 1986: Introducing a new comprehensive test of oral proficiency. ELT Journal 40, 212–20.
Skehan, P. 1996: A framework for the implementation of task based instruction. Applied Linguistics 17, 38–62.
—— 1998: A cognitive approach to language learning. Oxford: Oxford University Press.
Stansfield, C.W. and Kenyon, D.M. 1992: Research on the comparability of the oral proficiency interview and the simulated oral proficiency interview. System 20, 347–64.
Stenström, A. 1994: An introduction to spoken interaction. London: Longman.
Suhua, H. 1998: A communicative test of spoken English for the CET 6. Unpublished PhD thesis, Shanghai Jiao Tong University, Shanghai.
Upshur, J.A. and Turner, C. 1999: Systematic effects in the rating of second-language speaking ability: test method and learner discourse. Language Testing 16, 82–111.
van Ek, J.A. and Trim, J.L.M., editors, 1984: Across the threshold. Oxford: Pergamon.
van Lier, L. 1989: Reeling, writhing, drawling, stretching and fainting in coils: oral proficiency interviews as conversation. TESOL Quarterly 23, 489–508.
Walker, C. 1990: Large-scale oral testing. Applied Linguistics 11, 200–19.
Weir, C.J. 1983: Identifying the language needs of overseas students in tertiary education in the United Kingdom. Unpublished PhD thesis, University of London.
—— 1993: Understanding and developing language tests. Hemel Hempstead: Prentice Hall.
Wigglesworth, G. 1997: An investigation of planning time and proficiency level on oral test discourse. Language Testing 14, 85–106.
Wigglesworth, G. and O'Loughlin, K. 1993: An investigation into the comparability of direct and semi-direct versions of an oral interaction test in English. Melbourne Papers in Language Testing 2, 56–67.
Young, R. 1995: Conversational styles in language proficiency interviews. Language Learning 45, 3–42.
Young, R. and Milanovic, M. 1992: Discourse variation in oral proficiency interviews. Studies in Second Language Acquisition 14, 403–24.


Appendix 1 Items included in initial draft checklists (with short gloss)

Informational functions
Providing personal information: give information on present circumstances; give information on past experiences; give information on future plans
Providing nonpersonal information: give information which does not relate to the individual
Elaborating: elaborate on an idea
Expressing opinions: express opinions
Justifying opinions: express reasons for assertions s/he has made
Comparing: compare things/people/events
Complaining: complain about something
Speculating: hypothesize or speculate
Analysing: separate out the parts of an issue
Making excuses: make excuses
Explaining: explain anything
Narrating: describe a sequence of events
Paraphrasing: paraphrase something
Summarizing: summarize what s/he had said
Suggesting: suggest a particular idea
Expressing preferences: express preferences

Interactional functions
Challenging: challenge assertions made by another speaker
(Dis)agreeing: indicate (dis)agreement with what another speaker says (apart from 'yeah'/'no' or simply nodding)
Justifying/Providing support: offer justification or support for a comment made by another speaker
Qualifying: modify arguments or comments
Asking for opinions: ask for opinions
Persuading: attempt to persuade another person
Asking for information: ask for information
Conversational repair: repair breakdowns in interaction
Negotiating meaning: check understanding; attempt to establish common ground or strategy; respond to requests for clarification; ask for clarification; make corrections; indicate purpose; indicate understanding/uncertainty

Managing interaction
Initiating: start any interactions
Changing: take the opportunity to change the topic
Reciprocity: share the responsibility for developing the interaction
Deciding: come to a decision
Terminating: decide when the discussion should stop

Appendix 2 Phase 1 results (summarized)

[Table not reproduced: the original is a participants-by-functions grid of observation counts, sorted horizontally and vertically by totals. The function columns recovered from it are: make excuses; terminate; conversational repair; summarize; complain; paraphrase; persuade; change topic; challenge; qualify; ask for information; suggest; narrate; reciprocate; analyse; elaborate; initiate; provide nonpersonal information; explain; justify opinions; negotiate meaning; decide; (dis)agree; justify/support; ask for opinions; express preferences; speculate; compare; provide personal information; express opinion.]


Appendix 3 Operational checklist (used in Phase 3)

Informational functions
Providing personal information: give information on present circumstances; give information on past experiences; give information on future plans
Expressing opinions: express opinions
Elaborating: elaborate on or modify an opinion
Justifying opinions: express reasons for assertions s/he had made
Comparing: compare things/people/events
Speculating: speculate
Staging: separate out or interpret the parts of an issue
Describing: describe a sequence of events; describe a scene
Summarizing: summarize what s/he has said
Suggesting: suggest a particular idea
Expressing preferences: express preferences

Interactional functions
Agreeing: agree with an assertion made by another speaker (apart from 'yeah' or nonverbal)
Disagreeing: disagree with what another speaker says (apart from 'no' or nonverbal)
Modifying: modify arguments or comments made by other speaker, or by the test-taker in response to another speaker
Asking for opinions: ask for opinions
Persuading: attempt to persuade another person
Asking for information: ask for information
Conversational repair: repair breakdowns in interaction
Negotiating meaning: check understanding; indicate understanding of point made by partner; establish common ground/purpose or strategy; ask for clarification when an utterance is misheard or misinterpreted; correct an utterance made by other speaker which is perceived to be incorrect or inaccurate; respond to requests for clarification

Managing interaction
Initiating: start any interactions
Changing: take the opportunity to change the topic
Reciprocating: share the responsibility for developing the interaction
Deciding: come to a decision


Appendix 4 Summary of Phase 3 observations

Entries: number of students recording the function (agreement level in brackets), listed for Tape 1 Tasks 1–4 followed by Tape 2 Tasks 1–4; empty cells of the original table are omitted.

Informational functions
Providing personal information
  Present: 12 (G), 1 (L), 1 (L), 1 (L), 12 (G), 1 (L), 4 (L)
  Past: 10 (G), 4 (S), 12 (G)
  Future: 11 (G), 3 (L), 6 (S), 12 (G)
Expressing opinions: 12 (G), 11 (G), 9 (G), 8 (G), 11 (G), 10 (G), 10 (G), 11 (G)
Elaborating: 9 (G), 11 (G), 9 (G), 7 (G), 3 (L), 9 (G), 7 (S), 12 (G)
Justifying opinions: 10 (G), 7 (G), 9 (G), 7 (G), 4 (L), 8 (S), 6 (S), 8 (S)
Comparing: 11 (G), 8 (G), 1 (L), 6 (S), 3 (L), 12 (G), 7 (S), 5 (S)
Speculating: 7 (S), 11 (G), 8 (G), 3 (L), 7 (S), 10 (G), 10 (G), 5 (S)
Staging: 6 (S), 1 (L), 3 (L), 6 (L)
Describing
  Sequence of events: 1 (L), 1 (L), 3 (L), 1 (L), 4 (L)
  Scene: 5 (S), 9 (G), 2 (S), 2 (S), 10 (G), 2 (S), 3 (S)
Summarizing: 1 (L), 1 (L), 1 (L), 1 (L), 3 (L), 1 (L), 1 (L), 1 (L)
Suggesting: 1 (L), 2 (L), 1 (L), 3 (L), 2 (L)
Expressing preferences: 12 (G), 11 (G), 6 (S), 8 (G), 11 (G), 10 (G), 5 (S), 12 (G)

Interactional functions
Agreeing: 6 (S), 9 (G), 2 (L), 10 (G), 4 (L)
Disagreeing: 9 (G), 4 (S), 2 (L), 6 (S)
Modifying: 1 (L), 5 (S), 4 (S), 7 (S), 1 (L)
Asking for opinions: 1 (L), 8 (G), 2 (L), 11 (G)
Persuading: 2 (L), 2 (L)
Asking for information: 2 (L), 1 (L), 5 (S)
Conversational repair: 5 (S), 4 (L), 1 (L)
Negotiating meaning
  Check meaning: 2 (L), 4 (S), 4 (L)
  Understanding: 5 (S), 3 (L), 3 (L)
  Common ground: 2 (L), 2 (L), 1 (L)
  Ask clarification: 2 (L), 1 (L), 2 (L)
  Correct utterance: 3 (L), 1 (L)
  Respond to requests for clarification: 4 (S), 1 (L)

Managing interaction
Initiating: 8 (G), 1 (L), 10 (G), 5 (S)
Changing: 8 (G), 7 (S)
Reciprocating: 7 (G), 9 (G), 1 (L)
Deciding: 3 (L), 1 (L), 1 (L), 2 (L)

Notes: The figures indicate the number of students that completed the task in each case. L = Little agreement; S = Some agreement; G = Good agreement. For Tasks 3 and 4 in the first tape observed the maximum was 9; for all others the maximum was 12. This is because 3 of the 12 MA students did not complete the task for these last 2 tasks. This was not a problem during the observation of the second tape, so for all of its tasks the maximum figures are 12.


Appendix 5 Transcript results and observation checklist results

Entries are listed across Tasks 1–4; empty cells of the original table are omitted.

Informational functions
Providing personal information
  Present: T G, L, T L
  Past: T G
  Future: T G
Expressing opinions: T G, T G, T G, T G
Elaborating: L, T G, T S, T G
Justifying opinions: L, T S, T S, T S
Comparing: L, T G, T S, S
Speculating: T S, T G, T G, S
Staging: T L, T S
Describing
  Sequence of events: T L, L, L
  Scene: T G, L, L
Summarizing: T L, L, L, L
Suggesting: L, L
Expressing preferences: T G, T G, S, T G

Interactional functions
Agreeing: T G, T L
Disagreeing: T S
Modifying: T S, T L
Asking for opinions: T G
Persuading: L
Asking for information: S
Conversational repair: T S, T L, L
Negotiating meaning
  Check meaning: L
  Understanding: L, L
  Common ground: L, L
  Ask clarification: L, T L
  Correct utterance: L
  Respond to requests for clarification: L

Managing interaction
Initiating: T G, T S
Changing: T S
Reciprocating: T G, L
Deciding: L, L

Notes: T indicates that this function has been identified as occurring in the transcript of the interaction. L, S and G indicate the degree of agreement among the raters using the checklists in real time (L = Little agreement; S = Some agreement; G = Good agreement).


1977 Schwartz 1980 van Ek and Trim 1984 Bygate 1988Shohamy 1988 1994 Walker 1990 Weir 1994 Stenstrom 1994Chalhoub-Deville 1995b Hayashi 1995 Ellerton 1997 Suhua1998 Kormos 1999 OrsquoSullivan 2000 OrsquoLoughlin 2001)

These were then presented as a draft set of three checklists(Appendix 1) representing each of the elements of Weirrsquos categoriz-ation What follows in the three phases of the development processdescribed below (Section VI) was an attempt to customize thechecklist to more closely re ect the intended outcomes of spokenlanguage test tasks in the UCLES Main Suite The checklists weredesigned to help establish which of these functions resulted andwhich were absent

The next concern was with the development of a procedure fordevising a lsquoworkingrsquo version of the checklists to be followed by anevaluation of using this type of instrument in lsquorealrsquo time (using tapesor perhaps live speaking tests)

V The development model

The process through which the checklists were developed is shownin Figure 3 The concept that drives this model is the evaluation ateach level by different stakeholders At this stage of the project thesestakeholders were identi ed as

Figure 3 The development model

Barry OrsquoSullivan Cyril J Weir and Nick Saville 41

middot the consulting lsquoexpertrsquo testers (the University of Reading group)middot the test development and validation staff at UCLESmiddot UCLES Senior Team Leaders (ie key staff in the oral examiner

training system)

All these individuals participated in the application of each draft Itshould also be noted that a number of drafts were anticipated

VI The development process

In order to arrive at a working version of the checklists a number ofdevelopmental phases were anticipated At each phase the latest ver-sion (or draft) of the instruments was applied and this applicationevaluated

Phase 1

The rst attempt to examine how the draft checklists would beviewed and applied by a group of language teachers was conductedby ffrench (1999) Of the participants at the seminar approximately50 of the group reported that English (British English AmericanEnglish or Australian English) was their rst language while theremaining 50 were native Greek speakers

In their introduction to the application of the Observation Check-lists (OCs) the participants were given a series of activities thatfocused on the nature and use of those functions of language seen bytask designers at UCLES to be particularly applicable to their EFLMain Suite Speaking Tests (principally FCE CAE and CPE) Oncefamiliar with the nature of the functions (and where they might occurin a test) the participants applied the OCs in lsquorealrsquo time to an FCESpeaking Test from the 1998 Standardization Video This video fea-tured a pair of French speakers who were judged by a panel oflsquoexpertrsquo raters (within UCLES) to be slightly above the criterion(lsquopassrsquo) level

Of the 37 participants 32 completed the task successfully that isthey attempted to make frequency counts of the items represented inthe OCs Among this group there appear to be varying degrees ofagreement as to the use of language functions particularly in termsof the speci c number of observations of each function Howeverwhen the data are examined from the perspective of agreement onwhether a particular function was observed or not (ignoring the countwhich in retrospect was highly ambitious when we consider the lackof systematic training in the use of the questionnaires given to theteachers who attended) we nd that there is a striking degree of

42 Validating speaking-test tasks

agreement on all but a small number of functions (Appendix 2) Notehere that in order to make these patterns of behaviour clear the datahave been sorted both horizontally and vertically by the total numberof observations made by each participant and of each item

From this perspective this aspect of the developmental process wasconsidered to be quite successful However it was apparent that therewere a number of elements within the checklists that were causingsome dif culty These are highlighted in the table by the tram-linesItems above the lines have been identi ed by some participants inone case by a single person while those below have been observedby a majority of participants (in two cases by all of them) For thesecases we might infer a high degree of agreement However themiddle range of items appears to have caused a degree of confusionand so are highlighted here ie marked for further investigation

Phase 2

In this phase a much smaller gathering was organized this timeinvolving members of the development team as well as the three UK-based UCLES Senior Team Leaders In advance of this meeting allparticipants were asked to study the existing checklists and to exemp-lify each function with examples drawn from their experiences of thevarious UCLES Main Suite examinations The resulting data werecollated and presented as a single document that formed the basis ofdiscussion during a day-long session Participants were not madeaware of the ndings from Phase 1

During this session many questions were asked of all aspects ofthe checklist and a more streamlined version of the three sectionswas suggested In addition to a number of participants making a writ-ten record of the discussions the entire session was recorded Thisproved to be a valuable reminder of the way in which particularchanges came about and was used when the nal decisions regardinginclusion con ation or omission were being made Although it isbeyond the scope of this project to analyse this recording whencoupled with the earlier and revised documents it is in itself a valu-able source of data in that it provides a signi cant record of the devel-opmental process

Among the many interesting outcomes of this phase were thedecisions either to rethink to reorganize or to omit items from theinitial list These decisions were seen to mirror the results of the Phase1 application quite closely Of the 13 items identi ed in Phase 1 asbeing in need of review (7 were rarely observed indicating a highdegree of agreement that they were not in fact present and 6appeared to be confused with very mixed reported observations ) 7

Barry OrsquoSullivan Cyril J Weir and Nick Saville 43

were recommended for either omission or inclusion in other items bythe panel while the remaining 6 items were identi ed by them asbeing of value Although no examples of the latter had appeared inthe earlier data the panel agreed that they represented language func-tions that the UCLES Main Suite examinations were intended to elicitIt was also decided that each item in this latter group was in need offurther clari cation andor exempli cation Of the remaining 17items

middot two were changed the item lsquoanalysingrsquo was recoded as lsquostagingrsquoin order to clarify its intended meaning while it was decided toseparate the item lsquo(dis)agreeingrsquo into its two separate components

middot three were omitted it was argued that the item lsquoproviding non-personal informationrsquo referred to what was happening with theother items in the informational function category while the itemslsquoexplainingrsquo and lsquojustifyingsupportingrsquo were not functions usu-ally associated with the UCLES Main Suite tasks and no occur-rences of these had been noted

We would emphasize that as reported in Section IV above the initiallist was developed to cover the language functions that various spokenlanguage test tasks might elicit The development of the checklistsdescribed here re ects an attempt to customize the lists in line withthe intended functional outcomes of a speci c set of tests

We are of course aware that closed instruments of this type maybe open to the criticism that valuable information could be lost How-ever for reasons of practicality we felt it necessary to limit the listto what the examinations were intended to elicit rather than attemptto operationalize a full inventory Secondly any functions thatappeared in the data that were not covered by the reduced list wouldhave been noted There appeared to be no cases of this

The data from these two phases were combined to result in a work-ing version of the checklists (Appendix 3) which was then appliedto a pair of FCE Speaking Tests in Phase 3

Phase 3

In the third phase the revised checklists were given to a group of 15MA TEFL students who were asked to apply them to two FCE testsBoth of these tests involved a mixed-sex pair of learners one pair ofapproximately average ability and the other pair above averageBefore using the observation checklists (OCs) the students wereasked rst to attempt to predict which functions they might expect to nd To help in this pre-session task the students were given detailsof the FCE format and tasks

44 Validating speaking-test tasks

Unfortunately a small number of students did not manage to com-plete the observation task as they were somewhat overwhelmed withthe real-time application of the checklists As a result only 12 sets ofcompleted checklists were included in the nal analysis

Prior to the session the group was given an opportunity to have apractice run using a third FCE examination While this lsquotrainingrsquo per-iod coupled with the pre-session task was intended to provide thestudents with the background they needed to apply the checklists con-sistently there was a problem during the session itself This problemwas caused by the failure of a number of students to note the changefrom Task 3 to Task 4 in the rst test observed This was possiblycaused by a lack of awareness of the test itself and was not helpedby the seamless way in which the examiner on the video moved froma two-way discussion involving the test-takers to a three-way dis-cussion This meant that a full set of data exists only for the rst twotasks of this test As the problem was noticed in time the second testdid not cause these problems Unlike the earlier seminar on thisoccasion the participants were asked only to record each functionwhen it was rst observed This was done as it was felt that the earlierseminar showed that without extensive training it would be far toodif cult to apply the OCs fully in lsquorealrsquo time in order to generatecomprehensive frequency counts We are aware that a full tally wouldenable us to draw more precise conclusions about the relative fre-quency of occurrence of these functions and the degree of consensus(reliability) of observers

Against this we must emphasize that the checklists in their currentstage of development are designed to be used in real time Their usewas therefore restricted to determining the presence or absence of aparticular function Rater agreement in this case is limited to a some-what crude account of whether a function occurred or did not occurin a particular task performance We do not therefore have evidenceof whether the function observed was invariant across raters

The results from this session are included as Appendix 4 It canbe seen from this table that the participants again display mixed levelsof agreement ranging from a single perceived observation to totalagreement As with the earlier session it appears that there is rela-tively broad agreement on a range of functions but that others appearto be more dif cult to identify easily These dif culties appear to begreatest where the task involves a degree of interaction between thetest-takers

Phase 4In this phase a transcription was made of the second of the two inter-views used in Phase 3 since there was a full set of data available for

Barry OrsquoSullivan Cyril J Weir and Nick Saville 45

this interview The OCs were then lsquomappedrsquo on to this transcript inorder to give an overview from a different perspective of what func-tions were generated (it being felt that this map would result in anaccurate description of the test in terms of the items included in theOCs) This mapping was carried out by two researchers who initiallyworked independently of each other but discussed their nished workin order to arrive at a consensus

Finally the results of Phases 2 and 3 were compared (Appendix5) This clearly indicates that the checklists are now working wellThere are still some problems in items such as lsquostagingrsquo and lsquodescrib-ingrsquo and feedback from participants suggests that this may be due tomisunderstandings or misinterpretations of the gloss and examplesused In addition there are some similar dif culties with the initialthree items in the interactional functions checklist in which the great-est dif culties in applying the checklists appear to lie

VII Discussion and initial conclusions

The results of this study appear to substantiate our belief thatalthough still under development for use with the UCLES Main Suiteexaminations an operational version of these checklists is certainlyfeasible and has potentially wider application mutatis mutandis tothe content validation of other spoken language tests Further re ne-ment of the checklists is clearly required although the developmentalprocess adopted here appears to have borne positive results

1 Validities

We would not wish to claim that the checklists on their own offer asatisfactory demonstration of the construct validity of a spoken langu-age test for as Messick argues (1989 16) lsquothe varieties of evidencesupporting validity are not alternatives but rather supplements to oneanotherrsquo We recognize the necessity for a broad view of lsquothe eviden-tial basis for test interpretationrsquo (Messick 1989 20) Bachman (1990237) similarly concludes lsquoit is important to recognise that none ofthese [evidences of validity] by itself is suf cient to demonstrate thevalidity of a particular interpretation or use of test scoresrsquo (see alsoBachman 1990 243) Fulcher (1999 224) adds a further caveatagainst an overly narrow interpretation of content validity when hequotes Messick (1989 41)

the major problem is that so-called content validity is focused upon test formsrather than test scores upon instruments rather than measurements selectingcontent is an act of classi cation which is in itself a hypothesis that needs tobe con rmed empirically

46 Validating speaking-test tasks

Like these authors we regard as inadequate any conceptualization ofvalidity that does not involve the provision of evidence on a numberof levels but would argue strongly that without a clear idea of thematch between intended content and actual content any comprehen-sive investigation of the construct validity of a test is built on sandDe ning the construct is in our view underpinned by establishingthe nature of the actual performances elicited by test tasks ie thetrue content of tasks

2 Present and future applications of observational checklists

Versions of the checklists require a degree of training and practicesimilar to that given to raters if a reliable and consistent outcome isto be expected This requires that standardized training materials bedeveloped alongside the checklists In the case of these checkliststhis process has already begun with the initial versions piloted duringPhase 3 of the project

The checklists have great potential as an evaluative tool and canprovide comprehensive insight into various issues It is hoped thatamongst other issues the checklists will provide insights into the fol-lowing

· the language functions that the different task-types (and different sub-tasks within these) employed in the UCLES Main Suite Paper 5 (Speaking) Tests typically elicit;

· the language that the pair-format elicits, and how it differs in nature and quality from that elicited by interlocutor-single candidate testing;

· the extent to which there is functional variation across the top four levels of the UCLES Main Suite Spoken Language Test.

In addition to these issues, the way in which the checklists can be applied may allow other important questions to be answered. For example, by allowing the evaluator multiple observations (stopping and starting a recording of a test at will), it will be possible to establish whether there are quantifiable differences in the language functions generated by the different tasks; i.e., the evaluators will have the time they need to make frequency counts of the functions.
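
By way of illustration, the sketch below shows the kind of tally such multiple observations would support. It is our sketch, not part of the published procedure: the logged observations are invented, and the labels follow the operational checklist in Appendix 3.

```python
from collections import Counter

# Each time the evaluator observes a function while replaying the
# recording, it is logged against the task in hand. The observations
# below are invented; the labels follow the Appendix 3 checklist.
log = {
    "Task 2": ["comparing", "expressing opinions", "comparing",
               "speculating", "justifying opinions", "comparing"],
}

# Frequency counts per task fall out of a simple tally.
for task, observed in log.items():
    counts = Counter(observed)
    print(task, dict(counts))  # e.g. Task 2 {'comparing': 3, ...}
```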

While the results to date have focused on a posteriori validation procedures, these checklists are also relevant to task design. By taking into account the expected response of a task (and by describing that response in terms of these functions), it will be possible to explore predicted and actual test-task outcome. It will also be a useful guide for item writers in taking a priori decisions about content coverage. Through this approach it should be possible to predict more accurately the linguistic response (in terms of the elements of the checklists) and to apply this to the design of test tasks, and of course to evaluate the success of the prediction later on. In the longer term, this will lead to a greater understanding of how tasks and task formats can be manipulated to result in specific language use. We are not claiming that it is possible to predict language use at a micro level (grammatical form or lexis), but that it is possible to predict informational and interactional functions and features of interaction management, a notion supported by Bygate (1999).
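
A minimal sketch of such a predicted-versus-actual comparison follows. The prediction for the task is invented for illustration; the function names again follow the Appendix 3 checklist.

```python
# Functions the task designer expects a task to elicit (invented here).
predicted = {"expressing opinions", "justifying opinions", "agreeing",
             "disagreeing", "asking for opinions", "reciprocating"}

# Functions actually recorded by observers for the same task.
observed = {"expressing opinions", "justifying opinions", "speculating"}

# Set arithmetic yields the three outcomes of interest to the designer.
report = {
    "elicited as intended": predicted & observed,
    "predicted but absent": predicted - observed,
    "elicited but not predicted": observed - predicted,
}
for outcome, functions in report.items():
    print(f"{outcome}: {sorted(functions)}")
```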

The checklists should also enable us to explore how systematic variation in such areas as interviewer questioning behaviour (and interlocutor frame adherence) affects the language produced in this type of test. In the interview transcribed for this study, for example, the examiner directed his questions very deliberately (systematically aiming the questions at one participant and then the other). This tended to stifle any spontaneity in the intended three-way discussion (Task 4), so occurrences of Interactional and Discourse Management Functions did not materialize to the extent intended by the task designers. It is possible that a less deliberate (unscripted) questioning technique would lead to a less interviewer-oriented interaction pattern and allow for the more genuine interactive communication envisaged in the task design.

Perhaps the most valuable contribution that this type of validation procedure offers is its potential to improve the quality of oral assessment in both low-stakes and high-stakes contexts. By offering the investigator an instrument that can be used in real time, the checklists broaden the scope of investigation from limited case-study analysis of small numbers of test transcripts to large-scale field studies across a wide range of testing contexts.

Acknowledgements

We would like to thank Don Porter and Rita Green for their early input into the first version of the checklist. In addition, help was received from members of the ELT division in UCLES, in particular from Angela ffrench, Lynda Taylor and Christina Rimini; from a group of UCLES Senior Team Leaders; and from MA TEFL students at the University of Reading. Finally, we would like to thank the editors and anonymous reviewers of Language Testing for their insightful comments and helpful suggestions for its improvement. The faults that remain are, as ever, ours.


VIII References

Anastasi, A. 1988: Psychological testing. 6th edition. New York: Macmillan.
Bachman, L.F. 1990: Fundamental considerations in language testing. Oxford: Oxford University Press.
Bachman, L.F. and Palmer, A.S. 1981: The construct validation of the FSI oral interview. Language Learning 31, 67–86.
—— 1996: Language testing in practice. Oxford: Oxford University Press.
Ballman, T.L. 1991: The oral task of picture description: similarities and differences in native and nonnative speakers of Spanish. In Teschner, R.V., editor, Assessing foreign language proficiency of undergraduates. AAUSC Issues in Language Program Direction. Boston: Heinle and Heinle, 221–31.
Brown, A. 1998: Interviewer style and candidate performance in the IELTS oral interview. Paper presented at the Language Testing Research Colloquium, Monterey, CA.
Bygate, M. 1988: Speaking. Oxford: Oxford University Press.
—— 1999: Quality of language and purpose of task: patterns of learners' language on two oral communication tasks. Language Teaching Research 3, 185–214.
Chalhoub-Deville, M. 1995a: Deriving oral assessment scales across different tests and rater groups. Language Testing 12, 16–33.
—— 1995b: A contextualized approach to describing oral language proficiency. Language Learning 45, 251–81.
Clark, J.L.D. 1979: Direct vs. semi-direct tests of speaking ability. In Briere, E.J. and Hinofotis, F.B., editors, Concepts in language testing: some recent studies. Washington, DC: TESOL.
—— 1988: Validation of a tape-mediated ACTFL/ILR scale based test of Chinese speaking proficiency. Language Testing 5, 187–205.
Clark, J.L.D. and Hooshmand, D. 1992: 'Screen to Screen' testing: an exploratory study of oral proficiency interviewing using video teleconferencing. System 20, 293–304.
Cronbach, L.J. 1971: Validity. In Thorndike, R.L., editor, Educational measurement. 2nd edition. Washington, DC: American Council on Education, 443–597.
—— 1990: Essentials of psychological testing. 5th edition. New York: Harper & Row.
Davies, A. 1977: The construction of language tests. In Allen, J.P.B. and Davies, A., editors, Testing and experimental methods. The Edinburgh Course in Applied Linguistics Volume 4. London: Oxford University Press, 38–194.
—— 1990: Principles of language testing. Oxford: Blackwell.
Ellerton, A.W. 1997: Considerations in the validation of semi-direct oral testing. Unpublished PhD thesis, CALS, University of Reading.
ffrench, A. 1999: Language functions and UCLES speaking tests. Seminar in Athens, Greece, October 1999.
Foster, P. and Skehan, P. 1996: The influence of planning and task type on second language performance. Studies in Second Language Acquisition 18, 299–323.
—— 1999: The influence of source of planning and focus of planning on task-based performance. Language Teaching Research 3, 215–47.
Fulcher, G. 1994: Some priority areas for oral language testing. Language Testing Update 15, 39–47.
—— 1996: Testing tasks: issues in task design and the group oral. Language Testing 13, 23–51.
—— 1999: Assessment in English for academic purposes: putting content validity in its place. Applied Linguistics 20, 221–36.
Hayashi, M. 1995: Conversational repair: a contrastive study of Japanese and English. MA Project Report, University of Canberra.
Henning, G. 1983: Oral proficiency testing: comparative validities of interview, imitation and completion methods. Language Learning 33, 315–32.
—— 1987: A guide to language testing. Cambridge, MA: Newbury House.
Kelly, R. 1978: On the construct validation of comprehension tests: an exercise in applied linguistics. Unpublished PhD thesis, University of Queensland.
Kenyon, D. 1995: An investigation of the validity of task demands on performance-based tests of oral proficiency. In Kunnan, A.J., editor, Validation in language assessment: selected papers from the 17th Language Testing Research Colloquium, Long Beach. Mahwah, NJ: Lawrence Erlbaum, 19–40.
Kormos, J. 1999: Simulating conversations in oral-proficiency assessment: a conversation analysis of role plays and non-scripted interviews in language exams. Language Testing 16, 163–88.
Lazaraton, A. 1992: The structural organisation of a language interview: a conversational analytic perspective. System 20, 373–86.
—— 1996: A qualitative approach to monitoring examiner conduct in the Cambridge assessment of spoken English (CASE). In Milanovic, M. and Saville, N., editors, Performance testing, cognition and assessment: selected papers from the 15th Language Testing Research Colloquium, Cambridge and Arnhem. Studies in Language Testing 3. Cambridge: University of Cambridge Local Examinations Syndicate, 18–33.
—— 2000: A qualitative approach to the validation of oral language tests. Studies in Language Testing Volume 14. Cambridge: Cambridge University Press.
Lumley, T. and O'Sullivan, B. 2000: The effect of speaker and topic variables on task performance in a tape-mediated assessment of speaking. Paper presented at the 2nd Annual Asian Language Assessment Research Forum, The Hong Kong Polytechnic University.
Luoma, S. 1997: Comparability of a tape-mediated and a face-to-face test of speaking: a triangulation study. Unpublished Licentiate Thesis, Centre for Applied Language Studies, Jyväskylä University, Finland.
McNamara, T. 1996: Measuring second language performance. London: Longman.
Mehnert, U. 1998: The effects of different lengths of time for planning on second language performance. Studies in Second Language Acquisition 20, 83–108.
Messick, S. 1975: The standard problem: meaning and values in measurement and evaluation. American Psychologist 30, 955–66.
—— 1989: Validity. In Linn, R.L., editor, Educational measurement. 3rd edition. New York: Macmillan.
Milanovic, M. and Saville, N. 1996: Introduction. Performance testing, cognition and assessment. Studies in Language Testing Volume 3. Cambridge: University of Cambridge Local Examinations Syndicate, 1–17.
Moller, A.D. 1982: A study in the validation of proficiency tests of English as a Foreign Language. Unpublished PhD thesis, University of Edinburgh.
Norris, J., Brown, J.D., Hudson, T. and Yoshioka, J. 1998: Designing second language performance assessments. Technical Report 18. Honolulu, HI: University of Hawaii Press.
O'Loughlin, K. 1995: Lexical density in candidate output on direct and semi-direct versions of an oral proficiency test. Language Testing 12, 217–37.
—— 1997: The comparability of direct and semi-direct speaking tests: a case study. Unpublished PhD thesis, University of Melbourne, Melbourne.
—— 2001: An investigatory study of the equivalence of direct and semi-direct speaking skills. Studies in Language Testing 13. Cambridge: Cambridge University Press/UCLES.
Ortega, L. 1999: Planning and focus on form in L2 oral performance. Studies in Second Language Acquisition 20, 109–48.
O'Sullivan, B. 2000: Towards a model of performance in oral language testing. Unpublished PhD dissertation, CALS, University of Reading.
Robinson, P. 1995: Task complexity and second language narrative discourse. Language Learning 45, 99–140.
Ross, S. and Berwick, R. 1992: The discourse of accommodation in oral proficiency interviews. Studies in Second Language Acquisition 14, 159–76.
Saville, N. and Hargreaves, P. 1999: Assessing speaking in the revised FCE. ELT Journal 53, 42–51.
Schegloff, E., Jefferson, G. and Sacks, H. 1977: The preference for self-correction in the organisation of repair in conversation. Language 53, 361–82.
Schwartz, J. 1980: The negotiation for meaning: repair in conversations between second language learners of English. In Larsen-Freeman, D., editor, Discourse analysis in second language research. Rowley, MA: Newbury House.
Shohamy, E. 1983: The stability of oral language proficiency assessment in the oral interview testing procedure. Language Learning 33, 527–40.
—— 1988: A proposed framework for testing the oral language of second/foreign language learners. Studies in Second Language Acquisition 10, 165–79.
—— 1994: The validity of direct versus semi-direct oral tests. Language Testing 11, 99–123.
Shohamy, E., Reves, T. and Bejarano, Y. 1986: Introducing a new comprehensive test of oral proficiency. ELT Journal 40, 212–20.
Skehan, P. 1996: A framework for the implementation of task-based instruction. Applied Linguistics 17, 38–62.
—— 1998: A cognitive approach to language learning. Oxford: Oxford University Press.
Stansfield, C.W. and Kenyon, D.M. 1992: Research on the comparability of the oral proficiency interview and the simulated oral proficiency interview. System 20, 347–64.
Stenström, A. 1994: An introduction to spoken interaction. London: Longman.
Suhua, H. 1998: A communicative test of spoken English for the CET 6. Unpublished PhD thesis, Shanghai Jiao Tong University, Shanghai.
Upshur, J.A. and Turner, C. 1999: Systematic effects in the rating of second-language speaking ability: test method and learner discourse. Language Testing 16, 82–111.
van Ek, J.A. and Trim, J.L.M., editors, 1984: Across the threshold. Oxford: Pergamon.
van Lier, L. 1989: Reeling, writhing, drawling, stretching and fainting in coils: oral proficiency interviews as conversation. TESOL Quarterly 23, 489–508.
Walker, C. 1990: Large-scale oral testing. Applied Linguistics 11, 200–19.
Weir, C.J. 1983: Identifying the language needs of overseas students in tertiary education in the United Kingdom. Unpublished PhD thesis, University of London.
—— 1993: Understanding and developing language tests. Hemel Hempstead: Prentice Hall.
Wigglesworth, G. 1997: An investigation of planning time and proficiency level on oral test discourse. Language Testing 14, 85–106.
Wigglesworth, G. and O'Loughlin, K. 1993: An investigation into the comparability of direct and semi-direct versions of an oral interaction test in English. Melbourne Papers in Language Testing 2, 56–67.
Young, R. 1995: Conversational styles in language proficiency interviews. Language Learning 45, 3–42.
Young, R. and Milanovic, M. 1992: Discourse variation in oral proficiency interviews. Studies in Second Language Acquisition 14, 403–24.


Appendix 1 Items included in initial draft checklists (with short gloss)

Informational functions
Providing personal information: · Give information on present circumstances · Give information on past experiences · Give information on future plans
Providing nonpersonal information: Give information which does not relate to the individual
Elaborating: Elaborate on an idea
Expressing opinions: Express opinions
Justifying opinions: Express reasons for assertions s/he has made
Comparing: Compare things/people/events
Complaining: Complain about something
Speculating: Hypothesize or speculate
Analysing: Separate out the parts of an issue
Making excuses: Make excuses
Explaining: Explain anything
Narrating: Describe a sequence of events
Paraphrasing: Paraphrase something
Summarizing: Summarize what s/he had said
Suggesting: Suggest a particular idea
Expressing preferences: Express preferences

Interactional functions
Challenging: Challenge assertions made by another speaker
(Dis)agreeing: Indicate (dis)agreement with what another speaker says (apart from 'yeah'/'no' or simply nodding)
Justifying/Providing support: Offer justification or support for a comment made by another speaker
Qualifying: Modify arguments or comments
Asking for opinions: Ask for opinions
Persuading: Attempt to persuade another person
Asking for information: Ask for information
Conversational repair: Repair breakdowns in interaction
Negotiating meaning: · Check understanding · Attempt to establish common ground or strategy · Respond to requests for clarification · Ask for clarification · Make corrections · Indicate purpose · Indicate understanding/uncertainty

Managing interaction
Initiating: Start any interactions
Changing: Take the opportunity to change the topic
Reciprocity: Share the responsibility for developing the interaction
Deciding: Come to a decision
Terminating: Decide when the discussion should stop

Appendix 2 Phase 1 results (summarized)

[The original appendix is a rotated participants-by-function matrix that has not survived extraction; only its labels are recoverable. The function columns, in the order printed, are: Make excuses; Terminate; Conversational repair; Summarize; Complain; Paraphrase; Persuade; Change topic; Challenge; Qualify; Ask for info; Suggest; Narrate; Reciprocate; Analyse; Elaborate; Initiate; Provide nonpersonal information; Explain; Justify opinions; Negotiate meaning; Decide; (Dis)agree; Justify/Support; Ask for opinions; Express preferences; Speculate; Compare; Provide personal information; Express opinion. The rows are the Participants; the cell values are not recoverable from this copy.]

Appendix 3 Operational checklist (used in Phase 3)

Informational functions
Providing personal information: · Give information on present circumstances · Give information on past experiences · Give information on future plans
Expressing opinions: Express opinions
Elaborating: Elaborate on or modify an opinion
Justifying opinions: Express reasons for assertions s/he had made
Comparing: Compare things/people/events
Speculating: Speculate
Staging: Separate out or interpret the parts of an issue
Describing: · Describe a sequence of events · Describe a scene
Summarizing: Summarize what s/he has said
Suggesting: Suggest a particular idea
Expressing preferences: Express preferences

Interactional functions
Agreeing: Agree with an assertion made by another speaker (apart from 'yeah' or nonverbal)
Disagreeing: Disagree with what another speaker says (apart from 'no' or nonverbal)
Modifying: Modify arguments or comments made by other speaker, or by the test-taker in response to another speaker
Asking for opinions: Ask for opinions
Persuading: Attempt to persuade another person
Asking for information: Ask for information
Conversational repair: Repair breakdowns in interaction
Negotiating meaning: · Check understanding · Indicate understanding of point made by partner · Establish common ground/purpose or strategy · Ask for clarification when an utterance is misheard or misinterpreted · Correct an utterance made by other speaker which is perceived to be incorrect or inaccurate · Respond to requests for clarification

Managing interaction
Initiating: Start any interactions
Changing: Take the opportunity to change the topic
Reciprocating: Share the responsibility for developing the interaction
Deciding: Come to a decision
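
For readers who want to work with the checklist computationally, a sketch of one possible encoding follows. The structure and abbreviated item names are ours, not part of the published instrument: it simply gives one presence/absence flag per item, set on first occurrence as in the Phase 3 protocol.

```python
# A sketch of the Appendix 3 checklist as a data structure for
# real-time observation. Item names are abbreviated; the encoding
# is ours, not part of the published instrument.
CHECKLIST = {
    "informational": ["providing personal information", "expressing opinions",
                      "elaborating", "justifying opinions", "comparing",
                      "speculating", "staging", "describing", "summarizing",
                      "suggesting", "expressing preferences"],
    "interactional": ["agreeing", "disagreeing", "modifying",
                      "asking for opinions", "persuading",
                      "asking for information", "conversational repair",
                      "negotiating meaning"],
    "managing interaction": ["initiating", "changing", "reciprocating",
                             "deciding"],
}

def blank_record():
    """One presence/absence flag per checklist item."""
    return {item: False for items in CHECKLIST.values() for item in items}

record = blank_record()
record["comparing"] = True  # observer ticks 'comparing' on first occurrence
```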


Appendix 4 Summary of Phase 3 observation

Columns: Tape 1, Tasks 1–4; then Tape 2, Tasks 1–4. [Each row's entries appear in printed order; the blank cells of the original table did not survive this copy, so rows with fewer than eight entries cannot be assigned to tasks with certainty.]

Informational functions
Providing personal information
  Present: 12 (G), 1 (L), 1 (L), 1 (L), 12 (G), 1 (L), 4 (L)
  Past: 10 (G), 4 (S), 12 (G)
  Future: 11 (G), 3 (L), 6 (S), 12 (G)
Expressing opinions: 12 (G), 11 (G), 9 (G), 8 (G), 11 (G), 10 (G), 10 (G), 11 (G)
Elaborating: 9 (G), 11 (G), 9 (G), 7 (G), 3 (L), 9 (G), 7 (S), 12 (G)
Justifying opinions: 10 (G), 7 (G), 9 (G), 7 (G), 4 (L), 8 (S), 6 (S), 8 (S)
Comparing: 11 (G), 8 (G), 1 (L), 6 (S), 3 (L), 12 (G), 7 (S), 5 (S)
Speculating: 7 (S), 11 (G), 8 (G), 3 (L), 7 (S), 10 (G), 10 (G), 5 (S)
Staging: 6 (S), 1 (L), 3 (L), 6 (L)
Describing
  Sequence of events: 1 (L), 1 (L), 3 (L), 1 (L), 4 (L)
  Scene: 5 (S), 9 (G), 2 (S), 2 (S), 10 (G), 2 (S), 3 (S)
Summarizing: 1 (L), 1 (L), 1 (L), 1 (L), 3 (L), 1 (L), 1 (L), 1 (L)
Suggesting: 1 (L), 2 (L), 1 (L), 3 (L), 2 (L)
Expressing preferences: 12 (G), 11 (G), 6 (S), 8 (G), 11 (G), 10 (G), 5 (S), 12 (G)

Interactional functions
Agreeing: 6 (S), 9 (G), 2 (L), 10 (G), 4 (L)
Disagreeing: 9 (G), 4 (S), 2 (L), 6 (S)
Modifying: 1 (L), 5 (S), 4 (S), 7 (S), 1 (L)
Asking for opinions: 1 (L), 8 (G), 2 (L), 11 (G)
Persuading: 2 (L), 2 (L)
Asking for information: 2 (L), 1 (L), 5 (S)
Conversational repair: 5 (S), 4 (L), 1 (L)
Negotiating meaning
  Check meaning: 2 (L), 4 (S), 4 (L)
  Understanding: 5 (S), 3 (L), 3 (L)
  Common ground: 2 (L), 2 (L), 1 (L)
  Ask clarification: 2 (L), 1 (L), 2 (L)
  Correct utterance: 3 (L), 1 (L)
  Respond to requests for clarification: 4 (S), 1 (L)

Managing interaction
Initiating: 8 (G), 1 (L), 10 (G), 5 (S)
Changing: 8 (G), 7 (S)
Reciprocating: 7 (G), 9 (G), 1 (L)
Deciding: 3 (L), 1 (L), 1 (L), 2 (L)

Notes: The figures indicate the number of students recording the function in each case. L = Little agreement; S = Some agreement; G = Good agreement. For Tasks 3 and 4 in the first tape observed, the maximum was 9; for all others, the maximum was 12. This is because 3 of the 12 MA students did not complete the observation for these last two tasks. This was not a problem during the observation of the second tape, so there all the maximum figures are 12.

56 Validating speaking-test tasks

Appendix 5 Transcript results and observation checklist results

Informational functions Task 1 Task 2 Task 3 Task 4

Providing personal informationPresent T G L T LPast T GFuture T G

Expressing opinions T G T G T G T GElaborating L T G T S T GJustifying opinions L T S T S T SComparing L T G T S SSpeculating T S T G T G SStaging T L T SDescribing

Sequence of events T L L LScene T G L L

Summarizing T L L L LSuggesting L LExpressing preferences T G T G S T G

Interactional functionsAgreeing T G T LDisagreeing T SModifying T S T LAsking for opinions T GPersuading LAsking for information SConversational repair T S T L LNegotiating meaning

Check meaning LUnderstanding L LCommon ground L LAsk clari cation L T LCorrect utterance LRespond to required Lclari cation

Managing interactionInitiating T G T SChanging T SReciprocating T G LDeciding L L

Notes T indicates that this function has been identi ed as occurring in the transcript of theinteraction L S and G indicate the degree of agreement among the raters using the check-lists in real time (L Little agreement S Some agreement G Good agreement)



Appendix 4 Summary of Phase 2 observation

Tape 1 Tape 2

Task Task Task Task Task Task Task Task1 2 3 4 1 2 3 4

Informational functionsProviding personalinformation

Present 12 (G) 1 (L) 1 (L) 1 (L) 12 (G) 1 (L) 4 (L)Past 10 (G) 4 (S) 12 (G)Future 11 (G) 3 (L) 6 (S) 12 (G)

Expressing opinions 12 (G) 11 (G) 9 (G) 8 (G) 11 (G) 10 (G) 10 (G) 11 (G)Elaborating 9 (G) 11 (G) 9 (G) 7 (G) 3 (L) 9 (G) 7 (S) 12 (G)Justifying opinions 10 (G) 7 (G) 9 (G) 7 (G) 4 (L) 8 (S) 6 (S) 8 (S)Comparing 11 (G) 8 (G) 1 (L) 6 (S) 3 (L) 12 (G) 7 (S) 5 (S)Speculating 7 (S) 11 (G) 8 (G) 3 (L) 7 (S) 10 (G) 10 (G) 5 (S)Staging 6 (S) 1 (L) 3 (L) 6 (L)Describing

Sequence of events 1 (L) 1 (L) 3 (L) 1 (L) 4 (L)Scene 5 (S) 9 (G) 2 (S) 2 (S) 10 (G) 2 (S) 3 (S)

Summarizing 1 (L) 1 (L) 1 (L) 1 (L) 3 (L) 1 (L) 1 (L) 1 (L)Suggesting 1 (L) 2 (L) 1 (L) 3 (L) 2 (L)Expressing preferences 12 (G) 11 (G) 6 (S) 8 (G) 11 (G) 10 (G) 5 (S) 12 (G)

Interactional functionsAgreeing 6 (S) 9 (G) 2 (L) 10 (G) 4 (L)Disagreeing 9 (G) 4 (S) 2 (L) 6 (S)Modifying 1 (L) 5 (S) 4 (S) 7 (S) 1 (L)Asking for opinions 1 (L) 8 (G) 2 (L) 11 (G)Persuading 2 (L) 2 (L)Asking for information 2 (L) 1 (L) 5 (S)Conversational repair 5 (S) 4 (L) 1 (L)Negotiating meaning

Check meaning 2 (L) 4 (S) 4 (L)Understanding 5 (S) 3 (L) 3 (L)Common group 2 (L) 2 (L) 1 (L)Ask clari cation 2 (L) 1 (L) 2 (L)Correct utterance 3 (L) 1 (L)Respond to required 4 (S) 1 (L)clari cation

Managing interactionInitiating 8 (G) 1 (L) 10 (G) 5 (S)Changing 8 (G) 7 (S)Reciprocating 7 (G) 9 (G) 1 (L)Deciding 3 (L) 1 (L) 1 (L) 2 (L)

Notes The gures indicate the number of students that complete the task in each case L Littleagreement S Some agreement G Good aggreement For Tasks 3 and 4 in the rst tapeobserved the maximum was 9 for all others the maximum was 12 This is because 3 of the12 MA students did not complete the task for these last 2 tasks This was not a problem duringthe observation of the second tape so for all the maximum gures are 12

56 Validating speaking-test tasks

Appendix 5 Transcript results and observation checklist results

Informational functions Task 1 Task 2 Task 3 Task 4

Providing personal informationPresent T G L T LPast T GFuture T G

Expressing opinions T G T G T G T GElaborating L T G T S T GJustifying opinions L T S T S T SComparing L T G T S SSpeculating T S T G T G SStaging T L T SDescribing

Sequence of events T L L LScene T G L L

Summarizing T L L L LSuggesting L LExpressing preferences T G T G S T G

Interactional functionsAgreeing T G T LDisagreeing T SModifying T S T LAsking for opinions T GPersuading LAsking for information SConversational repair T S T L LNegotiating meaning

Check meaning LUnderstanding L LCommon ground L LAsk clari cation L T LCorrect utterance LRespond to required Lclari cation

Managing interactionInitiating T G T SChanging T SReciprocating T G LDeciding L L

Notes T indicates that this function has been identi ed as occurring in the transcript of theinteraction L S and G indicate the degree of agreement among the raters using the check-lists in real time (L Little agreement S Some agreement G Good agreement)

Page 7: Using observation checklists to validate speaking-test taskspart. The format of the Main Suite Speaking Tests (with the exception of the Level 1 KET test) is summarized in Table 1

Barry OrsquoSullivan Cyril J Weir and Nick Saville 39

While there is clearly a great deal of potential for this detailed analysis of transcribed performances, there are also a number of drawbacks, the most serious of which involves the complexity of the transcription process. In practice this means that a great deal of time and expertise is required in order to gain the kind of data that will answer the basic question concerning validity. Even where this is done, it is impractical to attempt to deal with more than a small number of test events; therefore, the generalizability of the results may be questioned.

Clearly, then, a more efficient methodology is required that allows the test designer to evaluate the procedures, and especially the tasks, in terms of the language produced by a larger number of candidates. Ideally, this should be possible in 'real' time, so that the relationship of predicted outcome to specific outcome can be established using a data set that satisfactorily reflects the typical test-taking population. The primary objective of this project, therefore, was to create an instrument built on a framework that describes the language of performance in a way that can be readily accessed by evaluators who are familiar with the tests being observed. This work is designed to be complementary to the use of transcriptions and to provide an additional source of validation evidence.

The FCE was chosen as the focus of this study for a number of reasons:

· It is 'stable', in that it is neither under review nor due to be reviewed.
· It represents the middle of the ALTE (and UCLES Main Suite) range and is the most widely subscribed test in the battery.
· It offers the most likelihood of a wide range of performance of any Main Suite examination: as it is often used as an 'entry-point' into the suite, candidates tend to range from below to above this level in terms of ability.
· Like all of the other Main Suite examinations, a database of recordings (audio and video) already existed.

IV The development of the observation checklists

Weir (1993), building on the earlier work of Bygate (1988), suggests that the language of a speaking test can be described in terms of the informational and interactional functions, and those of interaction management, generated by the participants involved. With this as a starting point, a group of researchers at the University of Reading were commissioned by UCLES EFL to examine the spoken language, second language acquisition and language testing literatures to come up with an initial set of such functions (see Schegloff et al., 1977; Schwartz, 1980; van Ek and Trim, 1984; Bygate, 1988; Shohamy, 1988; 1994; Walker, 1990; Weir, 1993; Stenstrom, 1994; Chalhoub-Deville, 1995b; Hayashi, 1995; Ellerton, 1997; Suhua, 1998; Kormos, 1999; O'Sullivan, 2000; O'Loughlin, 2001).

These were then presented as a draft set of three checklists (Appendix 1), representing each of the elements of Weir's categorization. What follows in the three phases of the development process described below (Section VI) was an attempt to customize the checklist to more closely reflect the intended outcomes of spoken language test tasks in the UCLES Main Suite. The checklists were designed to help establish which of these functions resulted and which were absent.

The next concern was with the development of a procedure for devising a 'working' version of the checklists, to be followed by an evaluation of using this type of instrument in 'real' time (using tapes or, perhaps, live speaking tests).
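In use, the instrument reduces to marking the presence or absence of each function in a fixed inventory, so a minimal sketch of such a checklist as a data structure may help fix the idea. The sketch below is purely illustrative and is not part of the procedure described in this article: the category and function labels follow the operational checklist in Appendix 3, while the Python structures and helper names are illustrative assumptions.

# Illustrative sketch only: an observation checklist as a
# presence/absence instrument over a fixed inventory of functions.
# Category and function labels follow Appendix 3; everything else
# (structure, helper name) is hypothetical.

CHECKLIST = {
    "Informational": [
        "Providing personal information", "Expressing opinions",
        "Elaborating", "Justifying opinions", "Comparing", "Speculating",
        "Staging", "Describing", "Summarizing", "Suggesting",
        "Expressing preferences",
    ],
    "Interactional": [
        "Agreeing", "Disagreeing", "Modifying", "Asking for opinions",
        "Persuading", "Asking for information", "Conversational repair",
        "Negotiating meaning",
    ],
    "Managing interaction": [
        "Initiating", "Changing", "Reciprocating", "Deciding",
    ],
}

def new_sheet():
    """One sheet per task: every function starts as 'not observed'."""
    return {fn: False for fns in CHECKLIST.values() for fn in fns}

# An observer ticks a function the first time it occurs in a task:
sheet = new_sheet()
sheet["Expressing opinions"] = True
sheet["Agreeing"] = True
print([fn for fn, seen in sheet.items() if seen])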

V The development model

The process through which the checklists were developed is shown in Figure 3. The concept that drives this model is the evaluation at each level by different stakeholders. At this stage of the project, these stakeholders were identified as:

· the consulting 'expert' testers (the University of Reading group);
· the test development and validation staff at UCLES;
· UCLES Senior Team Leaders (i.e., key staff in the oral examiner training system).

Figure 3 The development model

All these individuals participated in the application of each draft. It should also be noted that a number of drafts were anticipated.

VI The development process

In order to arrive at a working version of the checklists, a number of developmental phases were anticipated. At each phase the latest version (or draft) of the instruments was applied, and this application evaluated.

Phase 1

The first attempt to examine how the draft checklists would be viewed and applied by a group of language teachers was conducted by ffrench (1999). Of the participants at the seminar, approximately 50% of the group reported that English (British English, American English or Australian English) was their first language, while the remaining 50% were native Greek speakers.

In their introduction to the application of the Observation Checklists (OCs), the participants were given a series of activities that focused on the nature and use of those functions of language seen by task designers at UCLES to be particularly applicable to their EFL Main Suite Speaking Tests (principally FCE, CAE and CPE). Once familiar with the nature of the functions (and where they might occur in a test), the participants applied the OCs in 'real' time to an FCE Speaking Test from the 1998 Standardization Video. This video featured a pair of French speakers who were judged by a panel of 'expert' raters (within UCLES) to be slightly above the criterion ('pass') level.

Of the 37 participants, 32 completed the task successfully; that is, they attempted to make frequency counts of the items represented in the OCs. Among this group there appear to be varying degrees of agreement as to the use of language functions, particularly in terms of the specific number of observations of each function. However, when the data are examined from the perspective of agreement on whether a particular function was observed or not (ignoring the count, which in retrospect was highly ambitious when we consider the lack of systematic training in the use of the questionnaires given to the teachers who attended), we find that there is a striking degree of agreement on all but a small number of functions (Appendix 2). Note here that, in order to make these patterns of behaviour clear, the data have been sorted both horizontally and vertically by the total number of observations made by each participant and of each item.
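To make the sorting step concrete, here is a minimal sketch with an invented 0/1 participants-by-functions matrix; only the idea of ordering both rows and columns by their totals comes from the text above, and all names and values are hypothetical.

# Illustrative sketch: sort a participants-by-functions observation
# matrix by row and column totals, as was done to expose the
# agreement patterns summarized in Appendix 2. Data are invented.

counts = {  # participant -> {function: 1 if observed, else 0}
    "P1": {"Express opinions": 1, "Persuade": 0, "Compare": 1},
    "P2": {"Express opinions": 1, "Persuade": 0, "Compare": 1},
    "P3": {"Express opinions": 1, "Persuade": 1, "Compare": 0},
}

functions = sorted(
    {fn for row in counts.values() for fn in row},
    key=lambda fn: sum(row[fn] for row in counts.values()),  # column totals
)
participants = sorted(counts, key=lambda p: sum(counts[p].values()))  # row totals

# Least-observed functions appear first, so near-empty columns sit at
# one edge and near-full columns at the other, as in Appendix 2.
for p in participants:
    print(p, [counts[p][fn] for fn in functions])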

From this perspective, this aspect of the developmental process was considered to be quite successful. However, it was apparent that there were a number of elements within the checklists that were causing some difficulty. These are highlighted in the table by the tram-lines. Items above the lines have been identified by some participants, in one case by a single person, while those below have been observed by a majority of participants (in two cases by all of them). For these cases we might infer a high degree of agreement. However, the middle range of items appears to have caused a degree of confusion, and these items are highlighted here, i.e., marked for further investigation.

Phase 2

In this phase a much smaller gathering was organized, this time involving members of the development team as well as the three UK-based UCLES Senior Team Leaders. In advance of this meeting, all participants were asked to study the existing checklists and to exemplify each function with examples drawn from their experiences of the various UCLES Main Suite examinations. The resulting data were collated and presented as a single document that formed the basis of discussion during a day-long session. Participants were not made aware of the findings from Phase 1.

During this session many questions were asked of all aspects of the checklist, and a more streamlined version of the three sections was suggested. In addition to a number of participants making a written record of the discussions, the entire session was recorded. This proved to be a valuable reminder of the way in which particular changes came about, and was used when the final decisions regarding inclusion, conflation or omission were being made. Although it is beyond the scope of this project to analyse this recording, when coupled with the earlier and revised documents it is in itself a valuable source of data, in that it provides a significant record of the developmental process.

Among the many interesting outcomes of this phase were the decisions either to rethink, to reorganize or to omit items from the initial list. These decisions were seen to mirror the results of the Phase 1 application quite closely. Of the 13 items identified in Phase 1 as being in need of review (7 were rarely observed, indicating a high degree of agreement that they were not in fact present, and 6 appeared to be confused, with very mixed reported observations), 7 were recommended for either omission or inclusion in other items by the panel, while the remaining 6 items were identified by them as being of value. Although no examples of the latter had appeared in the earlier data, the panel agreed that they represented language functions that the UCLES Main Suite examinations were intended to elicit. It was also decided that each item in this latter group was in need of further clarification and/or exemplification. Of the remaining 17 items:

· two were changed: the item 'analysing' was recoded as 'staging' in order to clarify its intended meaning, while it was decided to separate the item '(dis)agreeing' into its two separate components;

· three were omitted: it was argued that the item 'providing nonpersonal information' referred to what was happening with the other items in the informational function category, while the items 'explaining' and 'justifying/supporting' were not functions usually associated with the UCLES Main Suite tasks, and no occurrences of these had been noted.

We would emphasize that, as reported in Section IV above, the initial list was developed to cover the language functions that various spoken language test tasks might elicit. The development of the checklists described here reflects an attempt to customize the lists in line with the intended functional outcomes of a specific set of tests.

We are, of course, aware that closed instruments of this type may be open to the criticism that valuable information could be lost. However, for reasons of practicality, we felt it necessary to limit the list to what the examinations were intended to elicit rather than attempt to operationalize a full inventory. Secondly, any functions that appeared in the data that were not covered by the reduced list would have been noted; there appeared to be no cases of this.

The data from these two phases were combined to result in a working version of the checklists (Appendix 3), which was then applied to a pair of FCE Speaking Tests in Phase 3.

Phase 3

In the third phase, the revised checklists were given to a group of 15 MA TEFL students who were asked to apply them to two FCE tests. Both of these tests involved a mixed-sex pair of learners: one pair of approximately average ability, and the other pair above average. Before using the observation checklists (OCs), the students were asked first to attempt to predict which functions they might expect to find. To help in this pre-session task, the students were given details of the FCE format and tasks.


Unfortunately, a small number of students did not manage to complete the observation task, as they were somewhat overwhelmed with the real-time application of the checklists. As a result, only 12 sets of completed checklists were included in the final analysis.

Prior to the session, the group was given an opportunity to have a practice run using a third FCE examination. While this 'training' period, coupled with the pre-session task, was intended to provide the students with the background they needed to apply the checklists consistently, there was a problem during the session itself. This problem was caused by the failure of a number of students to note the change from Task 3 to Task 4 in the first test observed. This was possibly caused by a lack of awareness of the test itself, and was not helped by the seamless way in which the examiner on the video moved from a two-way discussion involving the test-takers to a three-way discussion. This meant that a full set of data exists only for the first two tasks of this test. As the problem was noticed in time, the second test did not cause these problems. Unlike the earlier seminar, on this occasion the participants were asked only to record each function when it was first observed. This was done as it was felt that the earlier seminar showed that, without extensive training, it would be far too difficult to apply the OCs fully in 'real' time in order to generate comprehensive frequency counts. We are aware that a full tally would enable us to draw more precise conclusions about the relative frequency of occurrence of these functions and the degree of consensus (reliability) of observers.

Against this, we must emphasize that the checklists, in their current stage of development, are designed to be used in real time. Their use was therefore restricted to determining the presence or absence of a particular function. Rater agreement in this case is limited to a somewhat crude account of whether a function occurred or did not occur in a particular task performance. We do not, therefore, have evidence of whether the function observed was invariant across raters.
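As a rough illustration of how such crude presence/absence agreement might be summarized, consider the sketch below. The article reports agreement only in the qualitative bands used in Appendices 4 and 5, so the numeric cut-offs here are invented for illustration and should not be read as the authors' criteria.

# Illustrative sketch: bin the proportion of observers marking a
# function as present into the bands used in Appendices 4 and 5.
# The 1/3 and 2/3 thresholds are assumptions, not published cut-offs.

def agreement_band(n_marked: int, n_observers: int) -> str:
    p = n_marked / n_observers
    if p >= 2 / 3:
        return "G"  # good agreement
    if p >= 1 / 3:
        return "S"  # some agreement
    return "L"      # little agreement

print(agreement_band(12, 12))  # G: all 12 observers marked the function
print(agreement_band(6, 12))   # S
print(agreement_band(1, 12))   # L: a single perceived observation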

The results from this session are included as Appendix 4. It can be seen from this table that the participants again display mixed levels of agreement, ranging from a single perceived observation to total agreement. As with the earlier session, it appears that there is relatively broad agreement on a range of functions, but that others appear to be more difficult to identify easily. These difficulties appear to be greatest where the task involves a degree of interaction between the test-takers.

Phase 4

In this phase a transcription was made of the second of the two interviews used in Phase 3, since there was a full set of data available for this interview. The OCs were then 'mapped' on to this transcript in order to give an overview, from a different perspective, of what functions were generated (it being felt that this map would result in an accurate description of the test in terms of the items included in the OCs). This mapping was carried out by two researchers who initially worked independently of each other, but discussed their finished work in order to arrive at a consensus.

Finally, the results of Phases 3 and 4 were compared (Appendix 5). This clearly indicates that the checklists are now working well. There are still some problems in items such as 'staging' and 'describing', and feedback from participants suggests that this may be due to misunderstandings or misinterpretations of the gloss and examples used. In addition, there are some similar difficulties with the initial three items in the interactional functions checklist, in which the greatest difficulties in applying the checklists appear to lie.
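The comparison itself is mechanical once the transcript 'map' and the real-time bands are in hand; a hedged sketch of an Appendix 5-style merge follows, with invented data values and hypothetical variable names.

# Illustrative sketch: merge the transcript map (Phase 4) with the
# real-time checklist bands (Phase 3) into an Appendix 5-style row
# per function, flagging items where the two sources disagree.

transcript = {"Expressing opinions": True, "Staging": True, "Suggesting": False}
band = {"Expressing opinions": "G", "Staging": "L", "Suggesting": "L"}

for fn in transcript:
    t = "T" if transcript[fn] else "-"
    b = band.get(fn, "-")
    agree = transcript[fn] == (b in ("S", "G"))
    print(f"{fn:22s} {t} {b}" + ("" if agree else "  <- check gloss/examples"))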

VII Discussion and initial conclusions

The results of this study appear to substantiate our belief that, although still under development for use with the UCLES Main Suite examinations, an operational version of these checklists is certainly feasible and has potentially wider application, mutatis mutandis, to the content validation of other spoken language tests. Further refinement of the checklists is clearly required, although the developmental process adopted here appears to have borne positive results.

1 Validities

We would not wish to claim that the checklists on their own offer a satisfactory demonstration of the construct validity of a spoken language test for, as Messick argues (1989: 16), 'the varieties of evidence supporting validity are not alternatives but rather supplements to one another'. We recognize the necessity for a broad view of 'the evidential basis for test interpretation' (Messick, 1989: 20). Bachman (1990: 237) similarly concludes: 'it is important to recognise that none of these [evidences of validity] by itself is sufficient to demonstrate the validity of a particular interpretation or use of test scores' (see also Bachman, 1990: 243). Fulcher (1999: 224) adds a further caveat against an overly narrow interpretation of content validity when he quotes Messick (1989: 41):

the major problem is that so-called content validity is focused upon test forms rather than test scores, upon instruments rather than measurements … selecting content is an act of classification, which is in itself a hypothesis that needs to be confirmed empirically.


Like these authors, we regard as inadequate any conceptualization of validity that does not involve the provision of evidence on a number of levels, but would argue strongly that, without a clear idea of the match between intended content and actual content, any comprehensive investigation of the construct validity of a test is built on sand. Defining the construct is, in our view, underpinned by establishing the nature of the actual performances elicited by test tasks, i.e., the true content of tasks.

2 Present and future applications of observational checklists

Versions of the checklists require a degree of training and practice, similar to that given to raters, if a reliable and consistent outcome is to be expected. This requires that standardized training materials be developed alongside the checklists. In the case of these checklists, this process has already begun, with the initial versions piloted during Phase 3 of the project.

The checklists have great potential as an evaluative tool and can provide comprehensive insight into various issues. It is hoped that, amongst other issues, the checklists will provide insights into the following:

· the language functions that the different task-types (and different sub-tasks within these) employed in the UCLES Main Suite Paper 5 (Speaking) Tests typically elicit;
· the language that the pair-format elicits, and how it differs in nature and quality from that elicited by interlocutor-single candidate testing;
· the extent to which there is functional variation across the top four levels of the UCLES Main Suite Spoken Language Test.

In addition to these issues, the way in which the checklists can be applied may allow for other important questions to be answered. For example, by allowing the evaluator multiple observations (stopping and starting a recording of a test at will), it will be possible to establish whether there are quantifiable differences in the language functions generated by the different tasks; i.e., the evaluators will have the time they need to make frequency counts of the functions.
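In recording terms the difference is simply that, with stop/start viewing, an observer can keep a full tally per task rather than a first-occurrence tick. The sketch below is illustrative only, with an invented event stream.

# Illustrative sketch: frequency counts become possible when the
# evaluator can pause and replay the recording. Invented data.

from collections import Counter

events = ["Expressing opinions", "Agreeing", "Expressing opinions",
          "Negotiating meaning", "Expressing opinions"]
tally = Counter(events)
print(tally.most_common())  # e.g. [('Expressing opinions', 3), ...]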

While the results to date have focused on a posteriori validation procedures, these checklists are also relevant to task design. By taking into account the expected response of a task (and by describing that response in terms of these functions), it will be possible to explore predicted and actual test task outcome. It will also be a useful guide for item writers in taking a priori decisions about content coverage. Through this approach it should be possible to predict more accurately linguistic response (in terms of the elements of the checklists) and to apply this to the design of test tasks – and, of course, to evaluate the success of the prediction later on. In the longer term this will lead to a greater understanding of how tasks and task formats can be manipulated to result in specific language use. We are not claiming that it is possible to predict language use at a micro level (grammatical form or lexis), but that it is possible to predict informational and interactional functions and features of interaction management – a notion supported by Bygate (1999).
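In set terms, the a priori check suggested here amounts to comparing the functions a task was designed to elicit with those the checklists actually record; the sketch below is illustrative, with invented sets.

# Illustrative sketch: compare a task's intended functional content
# with what the checklists record, to evaluate the prediction.

intended = {"Expressing opinions", "Justifying opinions", "Persuading", "Deciding"}
observed = {"Expressing opinions", "Justifying opinions", "Agreeing"}

print("elicited as planned:", sorted(intended & observed))
print("intended but not elicited:", sorted(intended - observed))  # revise task?
print("elicited but unplanned:", sorted(observed - intended))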

The checklists should also enable us to explore how systematic variation in such areas as interviewer questioning behaviour (and interlocutor frame adherence) affects the language produced in this type of test. In the interview transcribed for this study, for example, the examiner directed his questions very deliberately (systematically aiming the questions at one participant and then the other). This tended to stifle any spontaneity in the intended three-way discussion (Task 4), so occurrences of Interactional and Discourse Management Functions did not materialize to the extent intended by the task designers. It is possible that a less deliberate (unscripted) questioning technique would lead to a less interviewer-oriented interaction pattern and allow for the more genuine interactive communication envisaged in the task design.

Perhaps the most valuable contribution that this type of validationprocedure offers is its potential to improve the quality of oral assess-ment in both low-stakes and high-stakes contexts By offering theinvestigator an instrument that can be used in real time the checklistsbroaden the scope of investigation from limited case study analysisof small numbers of test transcripts to large scale eld studies acrossa wide range of testing contexts

Acknowledgements

We would like to thank Don Porter and Rita Green for their early input into the first version of the checklist. In addition, help was received from members of the ELT division in UCLES, in particular from Angela ffrench, Lynda Taylor and Christina Rimini; from a group of UCLES Senior Team Leaders; and from MA TEFL students at the University of Reading. Finally, we would like to thank the editors and anonymous reviewers of Language Testing for their insightful comments and helpful suggestions for its improvement. The faults that remain are, as ever, ours.


VIII References

Anastasi, A. 1988: Psychological testing. 6th edition. New York: Macmillan.
Bachman, L.F. 1990: Fundamental considerations in language testing. Oxford: Oxford University Press.
Bachman, L.F. and Palmer, A.S. 1981: The construct validation of the FSI oral interview. Language Learning 31, 67–86.
—— 1996: Language testing in practice. Oxford: Oxford University Press.
Ballman, T.L. 1991: The oral task of picture description: similarities and differences in native and nonnative speakers of Spanish. In Teschner, R.V., editor, Assessing foreign language proficiency of undergraduates. AAUSC Issues in Language Program Direction. Boston: Heinle and Heinle, 221–31.
Brown, A. 1998: Interviewer style and candidate performance in the IELTS oral interview. Paper presented at the Language Testing Research Colloquium, Monterey, CA.
Bygate, M. 1988: Speaking. Oxford: Oxford University Press.
—— 1999: Quality of language and purpose of task: patterns of learners' language on two oral communication tasks. Language Teaching Research 3, 185–214.
Chalhoub-Deville, M. 1995a: Deriving oral assessment scales across different tests and rater groups. Language Testing 12, 16–33.
—— 1995b: A contextualized approach to describing oral language proficiency. Language Learning 45, 251–81.
Clark, J.L.D. 1979: Direct vs. semi-direct tests of speaking ability. In Briere, E.J. and Hinofotis, F.B., editors, Concepts in language testing: some recent studies. Washington, DC: TESOL.
—— 1988: Validation of a tape-mediated ACTFL/ILR scale based test of Chinese speaking proficiency. Language Testing 5, 187–205.
Clark, J.L.D. and Hooshmand, D. 1992: 'Screen to Screen' testing: an exploratory study of oral proficiency interviewing using video teleconferencing. System 20, 293–304.
Cronbach, L.J. 1971: Validity. In Thorndike, R.L., editor, Educational measurement. 2nd edition. Washington, DC: American Council on Education, 443–597.
—— 1990: Essentials of psychological testing. 5th edition. New York: Harper & Row.
Davies, A. 1977: The construction of language tests. In Allen, J.P.B. and Davies, A., editors, Testing and experimental methods. The Edinburgh Course in Applied Linguistics Volume 4. London: Oxford University Press, 38–194.
—— 1990: Principles of language testing. Oxford: Blackwell.
Ellerton, A.W. 1997: Considerations in the validation of semi-direct oral testing. Unpublished PhD thesis, CALS, University of Reading.
ffrench, A. 1999: Language functions and UCLES speaking tests. Seminar in Athens, Greece, October 1999.
Foster, P. and Skehan, P. 1996: The influence of planning and task type on second language performance. Studies in Second Language Acquisition 18, 299–323.
—— 1999: The influence of source of planning and focus of planning on task-based performance. Language Teaching Research 3, 215–47.
Fulcher, G. 1994: Some priority areas for oral language testing. Language Testing Update 15, 39–47.
—— 1996: Testing tasks: issues in task design and the group oral. Language Testing 13, 23–51.
—— 1999: Assessment in English for academic purposes: putting content validity in its place. Applied Linguistics 20, 221–36.
Hayashi, M. 1995: Conversational repair: a contrastive study of Japanese and English. MA Project Report, University of Canberra.
Henning, G. 1983: Oral proficiency testing: comparative validities of interview, imitation and completion methods. Language Learning 33, 315–32.
—— 1987: A guide to language testing. Cambridge, MA: Newbury House.
Kelly, R. 1978: On the construct validation of comprehension tests: an exercise in applied linguistics. Unpublished PhD thesis, University of Queensland.
Kenyon, D. 1995: An investigation of the validity of task demands on performance-based tests of oral proficiency. In Kunnan, A.J., editor, Validation in language assessment: selected papers from the 17th Language Testing Research Colloquium, Long Beach. Mahwah, NJ: Lawrence Erlbaum, 19–40.
Kormos, J. 1999: Simulating conversations in oral-proficiency assessment: a conversation analysis of role plays and non-scripted interviews in language exams. Language Testing 16, 163–88.
Lazaraton, A. 1992: The structural organisation of a language interview: a conversational analytic perspective. System 20, 373–86.
—— 1996: A qualitative approach to monitoring examiner conduct in the Cambridge assessment of spoken English (CASE). In Milanovic, M. and Saville, N., editors, Performance testing, cognition and assessment: selected papers from the 15th Language Testing Research Colloquium, Cambridge and Arnhem. Studies in Language Testing 3. Cambridge: University of Cambridge Local Examinations Syndicate, 18–33.
—— 2000: A qualitative approach to the validation of oral language tests. Studies in Language Testing Volume 14. Cambridge: Cambridge University Press.
Lumley, T. and O'Sullivan, B. 2000: The effect of speaker and topic variables on task performance in a tape-mediated assessment of speaking. Paper presented at the 2nd Annual Asian Language Assessment Research Forum, The Hong Kong Polytechnic University.
Luoma, S. 1997: Comparability of a tape-mediated and a face-to-face test of speaking: a triangulation study. Unpublished Licentiate Thesis, Centre for Applied Language Studies, Jyvaskyla University, Finland.
McNamara, T. 1996: Measuring second language performance. London: Longman.
Mehnert, U. 1998: The effects of different lengths of time for planning on second language performance. Studies in Second Language Acquisition 20, 83–108.
Messick, S. 1975: The standard problem: meaning and values in measurement and evaluation. American Psychologist 30, 955–66.
—— 1989: Validity. In Linn, R.L., editor, Educational measurement. 3rd edition. New York: Macmillan.
Milanovic, M. and Saville, N. 1996: Introduction. Performance testing, cognition and assessment. Studies in Language Testing Volume 3. Cambridge: University of Cambridge Local Examinations Syndicate, 1–17.
Moller, A.D. 1982: A study in the validation of proficiency tests of English as a Foreign Language. Unpublished PhD thesis, University of Edinburgh.
Norris, J., Brown, J.D., Hudson, T. and Yoshioka, J. 1998: Designing second language performance assessments. Technical Report 18. Honolulu, HI: University of Hawaii Press.
O'Loughlin, K. 1995: Lexical density in candidate output on direct and semi-direct versions of an oral proficiency test. Language Testing 12, 217–37.
—— 1997: The comparability of direct and semi-direct speaking tests: a case study. Unpublished PhD thesis, University of Melbourne, Melbourne.
—— 2001: An investigatory study of the equivalence of direct and semi-direct speaking skills. Studies in Language Testing 13. Cambridge: Cambridge University Press/UCLES.
Ortega, L. 1999: Planning and focus on form in L2 oral performance. Studies in Second Language Acquisition 20, 109–48.
O'Sullivan, B. 2000: Towards a model of performance in oral language testing. Unpublished PhD dissertation, CALS, University of Reading.
Robinson, P. 1995: Task complexity and second language narrative discourse. Language Learning 45, 99–140.
Ross, S. and Berwick, R. 1992: The discourse of accommodation in oral proficiency interviews. Studies in Second Language Acquisition 14, 159–76.
Saville, N. and Hargreaves, P. 1999: Assessing speaking in the revised FCE. ELT Journal 53, 42–51.
Schegloff, E., Jefferson, G. and Sachs, H. 1977: The preference for self-correction in the organisation of repair in conversation. Language 53, 361–82.
Schwartz, J. 1980: The negotiation for meaning: repair in conversations between second language learners of English. In Larsen-Freeman, D., editor, Discourse analysis in second language research. Rowley, MA: Newbury House.
Shohamy, E. 1983: The stability of oral language proficiency assessment in the oral interview testing procedure. Language Learning 33, 527–40.
—— 1988: A proposed framework for testing the oral language of second/foreign language learners. Studies in Second Language Acquisition 10, 165–79.
—— 1994: The validity of direct versus semi-direct oral tests. Language Testing 11, 99–123.
Shohamy, E., Reves, T. and Bejarano, Y. 1986: Introducing a new comprehensive test of oral proficiency. ELT Journal 40, 212–20.
Skehan, P. 1996: A framework for the implementation of task-based instruction. Applied Linguistics 17, 38–62.
—— 1998: A cognitive approach to language learning. Oxford: Oxford University Press.
Stansfield, C.W. and Kenyon, D.M. 1992: Research on the comparability of the oral proficiency interview and the simulated oral proficiency interview. System 20, 347–64.
Stenstrom, A. 1994: An introduction to spoken interaction. London: Longman.
Suhua, H. 1998: A communicative test of spoken English for the CET 6. Unpublished PhD thesis, Shanghai Jiao Tong University, Shanghai.
Upshur, J.A. and Turner, C. 1999: Systematic effects in the rating of second-language speaking ability: test method and learner discourse. Language Testing 16, 82–111.
van Ek, J.A. and Trim, J.L.M., editors, 1984: Across the threshold. Oxford: Pergamon.
van Lier, L. 1989: Reeling, writhing, drawling, stretching and fainting in coils: oral proficiency interviews as conversation. TESOL Quarterly 23, 489–508.
Walker, C. 1990: Large-scale oral testing. Applied Linguistics 11, 200–19.
Weir, C.J. 1983: Identifying the language needs of overseas students in tertiary education in the United Kingdom. Unpublished PhD thesis, University of London.
—— 1993: Understanding and developing language tests. Hemel Hempstead: Prentice Hall.
Wigglesworth, G. 1997: An investigation of planning time and proficiency level on oral test discourse. Language Testing 14, 85–106.
Wigglesworth, G. and O'Loughlin, K. 1993: An investigation into the comparability of direct and semi-direct versions of an oral interaction test in English. Melbourne Papers in Language Testing 2, 56–67.
Young, R. 1995: Conversational styles in language proficiency interviews. Language Learning 45, 3–42.
Young, R. and Milanovic, M. 1992: Discourse variation in oral proficiency interviews. Studies in Second Language Acquisition 14, 403–24.

Appendix 1 Items included in initial draft checklists (with short gloss)

Informational functions
Providing personal information: · Give information on present circumstances · Give information on past experiences · Give information on future plans
Providing nonpersonal information: Give information which does not relate to the individual
Elaborating: Elaborate on an idea
Expressing opinions: Express opinions
Justifying opinions: Express reasons for assertions s/he has made
Comparing: Compare things/people/events
Complaining: Complain about something
Speculating: Hypothesize or speculate
Analysing: Separate out the parts of an issue
Making excuses: Make excuses
Explaining: Explain anything
Narrating: Describe a sequence of events
Paraphrasing: Paraphrase something
Summarizing: Summarize what s/he had said
Suggesting: Suggest a particular idea
Expressing preferences: Express preferences

Interactional functions
Challenging: Challenge assertions made by another speaker
(Dis)agreeing: Indicate (dis)agreement with what another speaker says (apart from 'yeah'/'no' or simply nodding)
Justifying/Providing support: Offer justification or support for a comment made by another speaker
Qualifying: Modify arguments or comments
Asking for opinions: Ask for opinions
Persuading: Attempt to persuade another person
Asking for information: Ask for information
Conversational repair: Repair breakdowns in interaction
Negotiating meaning: · Check understanding · Attempt to establish common ground or strategy · Respond to requests for clarification · Ask for clarification · Make corrections · Indicate purpose · Indicate understanding/uncertainty

Managing interaction
Initiating: Start any interactions
Changing: Take the opportunity to change the topic
Reciprocity: Share the responsibility for developing the interaction
Deciding: Come to a decision
Terminating: Decide when the discussion should stop

Appendix 2 Phase 1 results (summarized)

[Table not reproducible from the source: a participants-by-functions grid of the Phase 1 observation counts, with both rows (participants) and columns (draft checklist functions, from 'make excuses' and 'terminate' at the least-observed end to 'express opinion' at the most-observed end) sorted by total number of observations.]


Appendix 3 Operational checklist (used in Phase 3)

Informational functions
Providing personal information: · Give information on present circumstances · Give information on past experiences · Give information on future plans
Expressing opinions: Express opinions
Elaborating: Elaborate on or modify an opinion
Justifying opinions: Express reasons for assertions s/he had made
Comparing: Compare things/people/events
Speculating: Speculate
Staging: Separate out or interpret the parts of an issue
Describing: · Describe a sequence of events · Describe a scene
Summarizing: Summarize what s/he has said
Suggesting: Suggest a particular idea
Expressing preferences: Express preferences

Interactional functions
Agreeing: Agree with an assertion made by another speaker (apart from 'yeah' or nonverbal)
Disagreeing: Disagree with what another speaker says (apart from 'no' or nonverbal)
Modifying: Modify arguments or comments made by other speaker, or by the test-taker in response to another speaker
Asking for opinions: Ask for opinions
Persuading: Attempt to persuade another person
Asking for information: Ask for information
Conversational repair: Repair breakdowns in interaction
Negotiating meaning: · Check understanding · Indicate understanding of point made by partner · Establish common ground/purpose or strategy · Ask for clarification when an utterance is misheard or misinterpreted · Correct an utterance made by other speaker which is perceived to be incorrect or inaccurate · Respond to requests for clarification

Managing interaction
Initiating: Start any interactions
Changing: Take the opportunity to change the topic
Reciprocating: Share the responsibility for developing the interaction
Deciding: Come to a decision


Appendix 4 Summary of Phase 3 observation

(Columns: Tape 1, Tasks 1–4, then Tape 2, Tasks 1–4. Where a row shows fewer than eight entries, the remaining cells were empty; the exact column placement of the empty cells could not be recovered from the source, so entries are given in source order.)

Informational functions
Providing personal information
  Present: 12 (G), 1 (L), 1 (L), 1 (L), 12 (G), 1 (L), 4 (L)
  Past: 10 (G), 4 (S), 12 (G)
  Future: 11 (G), 3 (L), 6 (S), 12 (G)
Expressing opinions: 12 (G), 11 (G), 9 (G), 8 (G), 11 (G), 10 (G), 10 (G), 11 (G)
Elaborating: 9 (G), 11 (G), 9 (G), 7 (G), 3 (L), 9 (G), 7 (S), 12 (G)
Justifying opinions: 10 (G), 7 (G), 9 (G), 7 (G), 4 (L), 8 (S), 6 (S), 8 (S)
Comparing: 11 (G), 8 (G), 1 (L), 6 (S), 3 (L), 12 (G), 7 (S), 5 (S)
Speculating: 7 (S), 11 (G), 8 (G), 3 (L), 7 (S), 10 (G), 10 (G), 5 (S)
Staging: 6 (S), 1 (L), 3 (L), 6 (L)
Describing
  Sequence of events: 1 (L), 1 (L), 3 (L), 1 (L), 4 (L)
  Scene: 5 (S), 9 (G), 2 (S), 2 (S), 10 (G), 2 (S), 3 (S)
Summarizing: 1 (L), 1 (L), 1 (L), 1 (L), 3 (L), 1 (L), 1 (L), 1 (L)
Suggesting: 1 (L), 2 (L), 1 (L), 3 (L), 2 (L)
Expressing preferences: 12 (G), 11 (G), 6 (S), 8 (G), 11 (G), 10 (G), 5 (S), 12 (G)

Interactional functions
Agreeing: 6 (S), 9 (G), 2 (L), 10 (G), 4 (L)
Disagreeing: 9 (G), 4 (S), 2 (L), 6 (S)
Modifying: 1 (L), 5 (S), 4 (S), 7 (S), 1 (L)
Asking for opinions: 1 (L), 8 (G), 2 (L), 11 (G)
Persuading: 2 (L), 2 (L)
Asking for information: 2 (L), 1 (L), 5 (S)
Conversational repair: 5 (S), 4 (L), 1 (L)
Negotiating meaning
  Check meaning: 2 (L), 4 (S), 4 (L)
  Understanding: 5 (S), 3 (L), 3 (L)
  Common ground: 2 (L), 2 (L), 1 (L)
  Ask clarification: 2 (L), 1 (L), 2 (L)
  Correct utterance: 3 (L), 1 (L)
  Respond to requests for clarification: 4 (S), 1 (L)

Managing interaction
Initiating: 8 (G), 1 (L), 10 (G), 5 (S)
Changing: 8 (G), 7 (S)
Reciprocating: 7 (G), 9 (G), 1 (L)
Deciding: 3 (L), 1 (L), 1 (L), 2 (L)

Notes: The figures indicate the number of students observing the function in each case. L = Little agreement; S = Some agreement; G = Good agreement. For Tasks 3 and 4 in the first tape observed, the maximum was 9; for all others, the maximum was 12. This is because 3 of the 12 MA students did not complete the observation task for these last two tasks. This was not a problem during the observation of the second tape, so there all maximum figures are 12.


Appendix 5 Transcript results and observation checklist results

(Columns: Tasks 1–4. Where a row shows fewer than four entries, the remaining cells were empty; the exact column placement of the empty cells could not be recovered from the source, so entries are given in source order.)

Informational functions
Providing personal information
  Present: T G, L, T L
  Past: T G
  Future: T G
Expressing opinions: T G, T G, T G, T G
Elaborating: L, T G, T S, T G
Justifying opinions: L, T S, T S, T S
Comparing: L, T G, T S, S
Speculating: T S, T G, T G, S
Staging: T L, T S
Describing
  Sequence of events: T L, L, L
  Scene: T G, L, L
Summarizing: T L, L, L, L
Suggesting: L, L
Expressing preferences: T G, T G, S, T G

Interactional functions
Agreeing: T G, T L
Disagreeing: T S
Modifying: T S, T L
Asking for opinions: T G
Persuading: L
Asking for information: S
Conversational repair: T S, T L, L
Negotiating meaning
  Check meaning: L
  Understanding: L, L
  Common ground: L, L
  Ask clarification: L, T L
  Correct utterance: L
  Respond to requests for clarification: L

Managing interaction
Initiating: T G, T S
Changing: T S
Reciprocating: T G, L
Deciding: L, L

Notes: T indicates that this function has been identified as occurring in the transcript of the interaction. L, S and G indicate the degree of agreement among the raters using the checklists in real time (L = Little agreement; S = Some agreement; G = Good agreement).

Page 8: Using observation checklists to validate speaking-test taskspart. The format of the Main Suite Speaking Tests (with the exception of the Level 1 KET test) is summarized in Table 1

40 Validating speaking-test tasks

1977 Schwartz 1980 van Ek and Trim 1984 Bygate 1988Shohamy 1988 1994 Walker 1990 Weir 1994 Stenstrom 1994Chalhoub-Deville 1995b Hayashi 1995 Ellerton 1997 Suhua1998 Kormos 1999 OrsquoSullivan 2000 OrsquoLoughlin 2001)

These were then presented as a draft set of three checklists(Appendix 1) representing each of the elements of Weirrsquos categoriz-ation What follows in the three phases of the development processdescribed below (Section VI) was an attempt to customize thechecklist to more closely re ect the intended outcomes of spokenlanguage test tasks in the UCLES Main Suite The checklists weredesigned to help establish which of these functions resulted andwhich were absent

The next concern was with the development of a procedure fordevising a lsquoworkingrsquo version of the checklists to be followed by anevaluation of using this type of instrument in lsquorealrsquo time (using tapesor perhaps live speaking tests)

V The development model

The process through which the checklists were developed is shownin Figure 3 The concept that drives this model is the evaluation ateach level by different stakeholders At this stage of the project thesestakeholders were identi ed as

Figure 3 The development model

Barry OrsquoSullivan Cyril J Weir and Nick Saville 41

middot the consulting lsquoexpertrsquo testers (the University of Reading group)middot the test development and validation staff at UCLESmiddot UCLES Senior Team Leaders (ie key staff in the oral examiner

training system)

All these individuals participated in the application of each draft Itshould also be noted that a number of drafts were anticipated

VI The development process

In order to arrive at a working version of the checklists a number ofdevelopmental phases were anticipated At each phase the latest ver-sion (or draft) of the instruments was applied and this applicationevaluated

Phase 1

The rst attempt to examine how the draft checklists would beviewed and applied by a group of language teachers was conductedby ffrench (1999) Of the participants at the seminar approximately50 of the group reported that English (British English AmericanEnglish or Australian English) was their rst language while theremaining 50 were native Greek speakers

In their introduction to the application of the Observation Check-lists (OCs) the participants were given a series of activities thatfocused on the nature and use of those functions of language seen bytask designers at UCLES to be particularly applicable to their EFLMain Suite Speaking Tests (principally FCE CAE and CPE) Oncefamiliar with the nature of the functions (and where they might occurin a test) the participants applied the OCs in lsquorealrsquo time to an FCESpeaking Test from the 1998 Standardization Video This video fea-tured a pair of French speakers who were judged by a panel oflsquoexpertrsquo raters (within UCLES) to be slightly above the criterion(lsquopassrsquo) level

Of the 37 participants 32 completed the task successfully that isthey attempted to make frequency counts of the items represented inthe OCs Among this group there appear to be varying degrees ofagreement as to the use of language functions particularly in termsof the speci c number of observations of each function Howeverwhen the data are examined from the perspective of agreement onwhether a particular function was observed or not (ignoring the countwhich in retrospect was highly ambitious when we consider the lackof systematic training in the use of the questionnaires given to theteachers who attended) we nd that there is a striking degree of

42 Validating speaking-test tasks

agreement on all but a small number of functions (Appendix 2) Notehere that in order to make these patterns of behaviour clear the datahave been sorted both horizontally and vertically by the total numberof observations made by each participant and of each item

From this perspective this aspect of the developmental process wasconsidered to be quite successful However it was apparent that therewere a number of elements within the checklists that were causingsome dif culty These are highlighted in the table by the tram-linesItems above the lines have been identi ed by some participants inone case by a single person while those below have been observedby a majority of participants (in two cases by all of them) For thesecases we might infer a high degree of agreement However themiddle range of items appears to have caused a degree of confusionand so are highlighted here ie marked for further investigation

Phase 2

In this phase a much smaller gathering was organized this timeinvolving members of the development team as well as the three UK-based UCLES Senior Team Leaders In advance of this meeting allparticipants were asked to study the existing checklists and to exemp-lify each function with examples drawn from their experiences of thevarious UCLES Main Suite examinations The resulting data werecollated and presented as a single document that formed the basis ofdiscussion during a day-long session Participants were not madeaware of the ndings from Phase 1

During this session many questions were asked of all aspects ofthe checklist and a more streamlined version of the three sectionswas suggested In addition to a number of participants making a writ-ten record of the discussions the entire session was recorded Thisproved to be a valuable reminder of the way in which particularchanges came about and was used when the nal decisions regardinginclusion con ation or omission were being made Although it isbeyond the scope of this project to analyse this recording whencoupled with the earlier and revised documents it is in itself a valu-able source of data in that it provides a signi cant record of the devel-opmental process

Among the many interesting outcomes of this phase were thedecisions either to rethink to reorganize or to omit items from theinitial list These decisions were seen to mirror the results of the Phase1 application quite closely Of the 13 items identi ed in Phase 1 asbeing in need of review (7 were rarely observed indicating a highdegree of agreement that they were not in fact present and 6appeared to be confused with very mixed reported observations ) 7

Barry OrsquoSullivan Cyril J Weir and Nick Saville 43

were recommended for either omission or inclusion in other items bythe panel while the remaining 6 items were identi ed by them asbeing of value Although no examples of the latter had appeared inthe earlier data the panel agreed that they represented language func-tions that the UCLES Main Suite examinations were intended to elicitIt was also decided that each item in this latter group was in need offurther clari cation andor exempli cation Of the remaining 17items

middot two were changed the item lsquoanalysingrsquo was recoded as lsquostagingrsquoin order to clarify its intended meaning while it was decided toseparate the item lsquo(dis)agreeingrsquo into its two separate components

middot three were omitted it was argued that the item lsquoproviding non-personal informationrsquo referred to what was happening with theother items in the informational function category while the itemslsquoexplainingrsquo and lsquojustifyingsupportingrsquo were not functions usu-ally associated with the UCLES Main Suite tasks and no occur-rences of these had been noted

We would emphasize that as reported in Section IV above the initiallist was developed to cover the language functions that various spokenlanguage test tasks might elicit The development of the checklistsdescribed here re ects an attempt to customize the lists in line withthe intended functional outcomes of a speci c set of tests

We are of course aware that closed instruments of this type maybe open to the criticism that valuable information could be lost How-ever for reasons of practicality we felt it necessary to limit the listto what the examinations were intended to elicit rather than attemptto operationalize a full inventory Secondly any functions thatappeared in the data that were not covered by the reduced list wouldhave been noted There appeared to be no cases of this

The data from these two phases were combined to result in a work-ing version of the checklists (Appendix 3) which was then appliedto a pair of FCE Speaking Tests in Phase 3

Phase 3

In the third phase the revised checklists were given to a group of 15MA TEFL students who were asked to apply them to two FCE testsBoth of these tests involved a mixed-sex pair of learners one pair ofapproximately average ability and the other pair above averageBefore using the observation checklists (OCs) the students wereasked rst to attempt to predict which functions they might expect to nd To help in this pre-session task the students were given detailsof the FCE format and tasks

44 Validating speaking-test tasks

Unfortunately a small number of students did not manage to com-plete the observation task as they were somewhat overwhelmed withthe real-time application of the checklists As a result only 12 sets ofcompleted checklists were included in the nal analysis

Prior to the session the group was given an opportunity to have apractice run using a third FCE examination While this lsquotrainingrsquo per-iod coupled with the pre-session task was intended to provide thestudents with the background they needed to apply the checklists con-sistently there was a problem during the session itself This problemwas caused by the failure of a number of students to note the changefrom Task 3 to Task 4 in the rst test observed This was possiblycaused by a lack of awareness of the test itself and was not helpedby the seamless way in which the examiner on the video moved froma two-way discussion involving the test-takers to a three-way dis-cussion This meant that a full set of data exists only for the rst twotasks of this test As the problem was noticed in time the second testdid not cause these problems Unlike the earlier seminar on thisoccasion the participants were asked only to record each functionwhen it was rst observed This was done as it was felt that the earlierseminar showed that without extensive training it would be far toodif cult to apply the OCs fully in lsquorealrsquo time in order to generatecomprehensive frequency counts We are aware that a full tally wouldenable us to draw more precise conclusions about the relative fre-quency of occurrence of these functions and the degree of consensus(reliability) of observers

Against this, we must emphasize that the checklists in their current stage of development are designed to be used in real time. Their use was therefore restricted to determining the presence or absence of a particular function. Rater agreement in this case is limited to a somewhat crude account of whether a function occurred or did not occur in a particular task performance. We do not, therefore, have evidence of whether the function observed was invariant across raters.
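
To make concrete what this presence/absence account involves, the short sketch below computes, for each checklist item, how many observers marked it as occurring at least once in a task performance. It is purely illustrative: the observer records and function names are invented, not data from the study.

```python
# Illustrative only: per-function agreement on presence/absence across observers.
# Each observer's record is the set of functions marked as occurring at least
# once in a given task performance (hypothetical data).

observers = [
    {"expressing opinions", "comparing", "agreeing"},
    {"expressing opinions", "comparing"},
    {"expressing opinions", "speculating"},
]

all_functions = sorted(set().union(*observers))

for function in all_functions:
    marked = sum(function in record for record in observers)
    print(f"{function}: noted by {marked}/{len(observers)} observers")
```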

The results from this session are included as Appendix 4. It can be seen from this table that the participants again display mixed levels of agreement, ranging from a single perceived observation to total agreement. As with the earlier session, it appears that there is relatively broad agreement on a range of functions, but that others appear to be more difficult to identify easily. These difficulties appear to be greatest where the task involves a degree of interaction between the test-takers.

Phase 4

In this phase, a transcription was made of the second of the two interviews used in Phase 3, since there was a full set of data available for this interview. The OCs were then 'mapped' on to this transcript in order to give an overview, from a different perspective, of what functions were generated (it being felt that this map would result in an accurate description of the test in terms of the items included in the OCs). This mapping was carried out by two researchers, who initially worked independently of each other but discussed their finished work in order to arrive at a consensus.
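
As a simple illustration of this independent-then-consensus procedure (a hypothetical sketch, not the authors' actual working method), the fragment below compares two researchers' mappings of checklist functions onto numbered transcript turns and lists the points they would need to discuss:

```python
# Hypothetical mappings of checklist functions onto numbered transcript turns
# by two researchers working independently; disagreements are flagged for the
# consensus discussion.

rater_a = {1: {"initiating"}, 2: {"expressing opinions"}, 3: {"agreeing"}}
rater_b = {1: {"initiating"}, 2: {"expressing opinions", "justifying opinions"}, 3: set()}

for turn in sorted(set(rater_a) | set(rater_b)):
    a = rater_a.get(turn, set())
    b = rater_b.get(turn, set())
    if a != b:
        # The symmetric difference holds the functions only one rater assigned.
        print(f"Turn {turn}: to discuss -> {sorted(a ^ b)}")
```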

Finally, the results of Phases 3 and 4 were compared (Appendix 5). This clearly indicates that the checklists are now working well. There are still some problems in items such as 'staging' and 'describing', and feedback from participants suggests that this may be due to misunderstandings or misinterpretations of the gloss and examples used. In addition, there are some similar difficulties with the initial three items in the interactional functions checklist, in which the greatest difficulties in applying the checklists appear to lie.

VII Discussion and initial conclusions

The results of this study appear to substantiate our belief that, although still under development for use with the UCLES Main Suite examinations, an operational version of these checklists is certainly feasible and has potentially wider application, mutatis mutandis, to the content validation of other spoken language tests. Further refinement of the checklists is clearly required, although the developmental process adopted here appears to have borne positive results.

1 Validities

We would not wish to claim that the checklists on their own offer a satisfactory demonstration of the construct validity of a spoken language test, for as Messick argues (1989: 16), 'the varieties of evidence supporting validity are not alternatives but rather supplements to one another'. We recognize the necessity for a broad view of 'the evidential basis for test interpretation' (Messick, 1989: 20). Bachman (1990: 237) similarly concludes: 'it is important to recognise that none of these [evidences of validity] by itself is sufficient to demonstrate the validity of a particular interpretation or use of test scores' (see also Bachman, 1990: 243). Fulcher (1999: 224) adds a further caveat against an overly narrow interpretation of content validity when he quotes Messick (1989: 41):

the major problem is that so-called content validity is focused upon test forms rather than test scores, upon instruments rather than measurements. Selecting content is an act of classification, which is in itself a hypothesis that needs to be confirmed empirically.


Like these authors, we regard as inadequate any conceptualization of validity that does not involve the provision of evidence on a number of levels, but would argue strongly that without a clear idea of the match between intended content and actual content, any comprehensive investigation of the construct validity of a test is built on sand. Defining the construct is, in our view, underpinned by establishing the nature of the actual performances elicited by test tasks, i.e., the true content of tasks.

2 Present and future applications of observational checklists

Versions of the checklists require a degree of training and practice, similar to that given to raters, if a reliable and consistent outcome is to be expected. This requires that standardized training materials be developed alongside the checklists. In the case of these checklists, this process has already begun, with the initial versions piloted during Phase 3 of the project.

The checklists have great potential as an evaluative tool and can provide comprehensive insight into various issues. It is hoped that, amongst other issues, the checklists will provide insights into the following:

· the language functions that the different task-types (and different sub-tasks within these) employed in the UCLES Main Suite Paper 5 (Speaking) Tests typically elicit;

· the language that the pair-format elicits, and how it differs in nature and quality from that elicited by interlocutor-single candidate testing;

· the extent to which there is functional variation across the top four levels of the UCLES Main Suite Spoken Language Test.

In addition to these issues, the way in which the checklists can be applied may allow for other important questions to be answered. For example, by allowing the evaluator multiple observations (stopping and starting a recording of a test at will), it will be possible to establish whether there are quantifiable differences in the language functions generated by the different tasks, i.e., the evaluators will have the time they need to make frequency counts of the functions.
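
Once the evaluator can pause and replay the recording, such frequency counts reduce to a simple tally of (task, function) observations. A minimal sketch, with an invented observation log:

```python
# Tally how often each function is logged per task when the evaluator can stop
# and restart the recording at will (the observation log is invented).
from collections import Counter

observation_log = [
    (1, "expressing opinions"), (1, "comparing"), (1, "expressing opinions"),
    (2, "speculating"), (2, "expressing opinions"), (4, "agreeing"),
]

counts = Counter(observation_log)
for (task, function), n in sorted(counts.items()):
    print(f"Task {task}: {function} observed {n} time(s)")
```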

While the results to date have focused on a posteriori validation procedures, these checklists are also relevant to task design. By taking into account the expected response of a task (and by describing that response in terms of these functions), it will be possible to explore predicted and actual test task outcome. It will also be a useful guide for item writers in taking a priori decisions about content coverage. Through this approach it should be possible to predict more accurately linguistic response (in terms of the elements of the checklists) and to apply this to the design of test tasks – and, of course, to evaluate the success of the prediction later on. In the longer term, this will lead to a greater understanding of how tasks and task formats can be manipulated to result in specific language use. We are not claiming that it is possible to predict language use at a micro level (grammatical form or lexis), but that it is possible to predict informational and interactional functions and features of interaction management – a notion supported by Bygate (1999).

The checklists should also enable us to explore how systematic variation in such areas as interviewer questioning behaviour (and interlocutor frame adherence) affects the language produced in this type of test. In the interview transcribed for this study, for example, the examiner directed his questions very deliberately (systematically aiming the questions at one participant and then the other). This tended to stifle any spontaneity in the intended three-way discussion (Task 4), so occurrences of Interactional and Discourse Management Functions did not materialize to the extent intended by the task designers. It is possible that a less deliberate (unscripted) questioning technique would lead to a less interviewer-oriented interaction pattern and allow for the more genuine interactive communication envisaged in the task design.

Perhaps the most valuable contribution that this type of validation procedure offers is its potential to improve the quality of oral assessment in both low-stakes and high-stakes contexts. By offering the investigator an instrument that can be used in real time, the checklists broaden the scope of investigation from limited case study analysis of small numbers of test transcripts to large-scale field studies across a wide range of testing contexts.

Acknowledgements

We would like to thank Don Porter and Rita Green for their early input into the first version of the checklist. In addition, help was received from members of the ELT division in UCLES, in particular from Angela ffrench, Lynda Taylor and Christina Rimini, from a group of UCLES Senior Team Leaders, and from MA TEFL students at the University of Reading. Finally, we would like to thank the editors and anonymous reviewers of Language Testing for their insightful comments and helpful suggestions for its improvement. The faults that remain are, as ever, ours.


VIII References

Anastasi, A. 1988: Psychological testing. 6th edition. New York: Macmillan.
Bachman, L.F. 1990: Fundamental considerations in language testing. Oxford: Oxford University Press.
Bachman, L.F. and Palmer, A.S. 1981: The construct validation of the FSI oral interview. Language Learning 31, 67–86.
—— 1996: Language testing in practice. Oxford: Oxford University Press.
Ballman, T.L. 1991: The oral task of picture description: similarities and differences in native and nonnative speakers of Spanish. In Teschner, R.V., editor, Assessing foreign language proficiency of undergraduates. AAUSC Issues in Language Program Direction. Boston: Heinle and Heinle, 221–31.
Brown, A. 1998: Interviewer style and candidate performance in the IELTS oral interview. Paper presented at the Language Testing Research Colloquium, Monterey, CA.
Bygate, M. 1988: Speaking. Oxford: Oxford University Press.
—— 1999: Quality of language and purpose of task: patterns of learners' language on two oral communication tasks. Language Teaching Research 3, 185–214.
Chalhoub-Deville, M. 1995a: Deriving oral assessment scales across different tests and rater groups. Language Testing 12, 16–33.
—— 1995b: A contextualized approach to describing oral language proficiency. Language Learning 45, 251–81.
Clark, J.L.D. 1979: Direct vs. semi-direct tests of speaking ability. In Briere, E.J. and Hinofotis, F.B., editors, Concepts in language testing: some recent studies. Washington, DC: TESOL.
—— 1988: Validation of a tape-mediated ACTFL/ILR scale based test of Chinese speaking proficiency. Language Testing 5, 187–205.
Clark, J.L.D. and Hooshmand, D. 1992: 'Screen to Screen' testing: an exploratory study of oral proficiency interviewing using video teleconferencing. System 20, 293–304.
Cronbach, L.J. 1971: Validity. In Thorndike, R.L., editor, Educational measurement. 2nd edition. Washington, DC: American Council on Education, 443–597.
—— 1990: Essentials of psychological testing. 5th edition. New York: Harper & Row.
Davies, A. 1977: The construction of language tests. In Allen, J.P.B. and Davies, A., editors, Testing and experimental methods. The Edinburgh Course in Applied Linguistics Volume 4. London: Oxford University Press, 38–194.
—— 1990: Principles of language testing. Oxford: Blackwell.
Ellerton, A.W. 1997: Considerations in the validation of semi-direct oral testing. Unpublished PhD thesis, CALS, University of Reading.
ffrench, A. 1999: Language functions and UCLES speaking tests. Seminar in Athens, Greece, October 1999.
Foster, P. and Skehan, P. 1996: The influence of planning and task type on second language performance. Studies in Second Language Acquisition 18, 299–323.
—— 1999: The influence of source of planning and focus of planning on task-based performance. Language Teaching Research 3, 215–47.
Fulcher, G. 1994: Some priority areas for oral language testing. Language Testing Update 15, 39–47.
—— 1996: Testing tasks: issues in task design and the group oral. Language Testing 13, 23–51.
—— 1999: Assessment in English for academic purposes: putting content validity in its place. Applied Linguistics 20, 221–36.
Hayashi, M. 1995: Conversational repair: a contrastive study of Japanese and English. MA Project Report, University of Canberra.
Henning, G. 1983: Oral proficiency testing: comparative validities of interview, imitation and completion methods. Language Learning 33, 315–32.
—— 1987: A guide to language testing. Cambridge, MA: Newbury House.
Kelly, R. 1978: On the construct validation of comprehension tests: an exercise in applied linguistics. Unpublished PhD thesis, University of Queensland.
Kenyon, D. 1995: An investigation of the validity of task demands on performance-based tests of oral proficiency. In Kunnan, A.J., editor, Validation in language assessment: selected papers from the 17th Language Testing Research Colloquium, Long Beach. Mahwah, NJ: Lawrence Erlbaum, 19–40.
Kormos, J. 1999: Simulating conversations in oral-proficiency assessment: a conversation analysis of role plays and non-scripted interviews in language exams. Language Testing 16, 163–88.
Lazaraton, A. 1992: The structural organisation of a language interview: a conversational analytic perspective. System 20, 373–86.
—— 1996: A qualitative approach to monitoring examiner conduct in the Cambridge assessment of spoken English (CASE). In Milanovic, M. and Saville, N., editors, Performance testing, cognition and assessment: selected papers from the 15th Language Testing Research Colloquium, Cambridge and Arnhem. Studies in Language Testing 3. Cambridge: University of Cambridge Local Examinations Syndicate, 18–33.
—— 2000: A qualitative approach to the validation of oral language tests. Studies in Language Testing Volume 14. Cambridge: Cambridge University Press.
Lumley, T. and O'Sullivan, B. 2000: The effect of speaker and topic variables on task performance in a tape-mediated assessment of speaking. Paper presented at the 2nd Annual Asian Language Assessment Research Forum, The Hong Kong Polytechnic University.
Luoma, S. 1997: Comparability of a tape-mediated and a face-to-face test of speaking: a triangulation study. Unpublished Licentiate Thesis, Centre for Applied Language Studies, Jyvaskyla University, Finland.
McNamara, T. 1996: Measuring second language performance. London: Longman.
Mehnert, U. 1998: The effects of different lengths of time for planning on second language performance. Studies in Second Language Acquisition 20, 83–108.
Messick, S. 1975: The standard problem: meaning and values in measurement and evaluation. American Psychologist 30, 955–66.
—— 1989: Validity. In Linn, R.L., editor, Educational measurement. 3rd edition. New York: Macmillan.
Milanovic, M. and Saville, N. 1996: Introduction. Performance testing, cognition and assessment. Studies in Language Testing Volume 3. Cambridge: University of Cambridge Local Examinations Syndicate, 1–17.
Moller, A.D. 1982: A study in the validation of proficiency tests of English as a Foreign Language. Unpublished PhD thesis, University of Edinburgh.
Norris, J., Brown, J.D., Hudson, T. and Yoshioka, J. 1998: Designing second language performance assessments. Technical Report 18. Honolulu, HI: University of Hawaii Press.
O'Loughlin, K. 1995: Lexical density in candidate output on direct and semi-direct versions of an oral proficiency test. Language Testing 12, 217–37.
—— 1997: The comparability of direct and semi-direct speaking tests: a case study. Unpublished PhD thesis, University of Melbourne, Melbourne.
—— 2001: An investigatory study of the equivalence of direct and semi-direct speaking skills. Studies in Language Testing 13. Cambridge: Cambridge University Press/UCLES.
Ortega, L. 1999: Planning and focus on form in L2 oral performance. Studies in Second Language Acquisition 20, 109–48.
O'Sullivan, B. 2000: Towards a model of performance in oral language testing. Unpublished PhD dissertation, CALS, University of Reading.
Robinson, P. 1995: Task complexity and second language narrative discourse. Language Learning 45, 99–140.
Ross, S. and Berwick, R. 1992: The discourse of accommodation in oral proficiency interviews. Studies in Second Language Acquisition 14, 159–76.
Saville, N. and Hargreaves, P. 1999: Assessing speaking in the revised FCE. ELT Journal 53, 42–51.
Schegloff, E., Jefferson, G. and Sacks, H. 1977: The preference for self-correction in the organisation of repair in conversation. Language 53, 361–82.
Schwartz, J. 1980: The negotiation for meaning: repair in conversations between second language learners of English. In Larsen-Freeman, D., editor, Discourse analysis in second language research. Rowley, MA: Newbury House.
Shohamy, E. 1983: The stability of oral language proficiency assessment in the oral interview testing procedure. Language Learning 33, 527–40.
—— 1988: A proposed framework for testing the oral language of second/foreign language learners. Studies in Second Language Acquisition 10, 165–79.
—— 1994: The validity of direct versus semi-direct oral tests. Language Testing 11, 99–123.
Shohamy, E., Reves, T. and Bejarano, Y. 1986: Introducing a new comprehensive test of oral proficiency. ELT Journal 40, 212–20.
Skehan, P. 1996: A framework for the implementation of task based instruction. Applied Linguistics 17, 38–62.
—— 1998: A cognitive approach to language learning. Oxford: Oxford University Press.
Stansfield, C.W. and Kenyon, D.M. 1992: Research on the comparability of the oral proficiency interview and the simulated oral proficiency interview. System 20, 347–64.
Stenstrom, A. 1994: An introduction to spoken interaction. London: Longman.
Suhua, H. 1998: A communicative test of spoken English for the CET 6. Unpublished PhD thesis, Shanghai Jiao Tong University, Shanghai.
Upshur, J.A. and Turner, C. 1999: Systematic effects in the rating of second-language speaking ability: test method and learner discourse. Language Testing 16, 82–111.
van Ek, J.A. and Trim, J.L.M., editors, 1984: Across the threshold. Oxford: Pergamon.
van Lier, L. 1989: Reeling, writhing, drawling, stretching and fainting in coils: oral proficiency interviews as conversation. TESOL Quarterly 23, 489–508.
Walker, C. 1990: Large-scale oral testing. Applied Linguistics 11, 200–19.
Weir, C.J. 1983: Identifying the language needs of overseas students in tertiary education in the United Kingdom. Unpublished PhD thesis, University of London.
—— 1993: Understanding and developing language tests. Hemel Hempstead: Prentice Hall.
Wigglesworth, G. 1997: An investigation of planning time and proficiency level on oral test discourse. Language Testing 14, 85–106.
Wigglesworth, G. and O'Loughlin, K. 1993: An investigation into the comparability of direct and semi-direct versions of an oral interaction test in English. Melbourne Papers in Language Testing 2, 56–67.
Young, R. 1995: Conversational styles in language proficiency interviews. Language Learning 45, 3–42.
Young, R. and Milanovic, M. 1992: Discourse variation in oral proficiency interviews. Studies in Second Language Acquisition 14, 403–24.

Appendix 1 Items included in initial draft checklists (with short gloss)

Informational functions
Providing personal information: · Give information on present circumstances · Give information on past experiences · Give information on future plans
Providing nonpersonal information: Give information which does not relate to the individual
Elaborating: Elaborate on an idea
Expressing opinions: Express opinions
Justifying opinions: Express reasons for assertions s/he has made
Comparing: Compare things/people/events
Complaining: Complain about something
Speculating: Hypothesize or speculate
Analysing: Separate out the parts of an issue
Making excuses: Make excuses
Explaining: Explain anything
Narrating: Describe a sequence of events
Paraphrasing: Paraphrase something
Summarizing: Summarize what s/he had said
Suggesting: Suggest a particular idea
Expressing preferences: Express preferences

Interactional functions
Challenging: Challenge assertions made by another speaker
(Dis)agreeing: Indicate (dis)agreement with what another speaker says (apart from 'yeah'/'no' or simply nodding)
Justifying/Providing support: Offer justification or support for a comment made by another speaker
Qualifying: Modify arguments or comments
Asking for opinions: Ask for opinions
Persuading: Attempt to persuade another person
Asking for information: Ask for information
Conversational repair: Repair breakdowns in interaction
Negotiating meaning: · Check understanding · Attempt to establish common ground or strategy · Respond to requests for clarification · Ask for clarification · Make corrections · Indicate purpose · Indicate understanding/uncertainty

Managing interaction
Initiating: Start any interactions
Changing: Take the opportunity to change the topic
Reciprocity: Share the responsibility for developing the interaction
Deciding: Come to a decision
Terminating: Decide when the discussion should stop

Appendix 2 Phase 1 results (summarized)

[The body of this table did not survive text extraction. The column headings (checklist items, ordered by total observations) were: Make excuses, Terminate, Conversational repair, Summarize, Complain, Paraphrase, Persuade, Change topic, Challenge, Qualify, Ask for info, Suggest, Narrate, Reciprocate, Analyse, Elaborate, Initiate, Provide nonpersonal information, Explain, Justify opinions, Negotiate meaning, Decide, (Dis)agree, Justify/Support, Ask for opinions, Express preferences, Speculate, Compare, Express opinion; the rows were labelled Participants. The cell values (participants' observation counts) are not recoverable from this copy.]

Appendix 3 Operational checklist (used in Phase 3)

Informational functions
Providing personal information: · Give information on present circumstances · Give information on past experiences · Give information on future plans
Expressing opinions: Express opinions
Elaborating: Elaborate on or modify an opinion
Justifying opinions: Express reasons for assertions s/he had made
Comparing: Compare things/people/events
Speculating: Speculate
Staging: Separate out or interpret the parts of an issue
Describing: · Describe a sequence of events · Describe a scene
Summarizing: Summarize what s/he has said
Suggesting: Suggest a particular idea
Expressing preferences: Express preferences

Interactional functions
Agreeing: Agree with an assertion made by another speaker (apart from 'yeah' or nonverbal)
Disagreeing: Disagree with what another speaker says (apart from 'no' or nonverbal)
Modifying: Modify arguments or comments made by other speaker, or by the test-taker in response to another speaker
Asking for opinions: Ask for opinions
Persuading: Attempt to persuade another person
Asking for information: Ask for information
Conversational repair: Repair breakdowns in interaction
Negotiating meaning: · Check understanding · Indicate understanding of point made by partner · Establish common ground/purpose or strategy · Ask for clarification when an utterance is misheard or misinterpreted · Correct an utterance made by other speaker which is perceived to be incorrect or inaccurate · Respond to requests for clarification

Managing interaction
Initiating: Start any interactions
Changing: Take the opportunity to change the topic
Reciprocating: Share the responsibility for developing the interaction
Deciding: Come to a decision


Appendix 4 Summary of Phase 3 observation

Columns: Tape 1, Tasks 1–4, then Tape 2, Tasks 1–4. [Empty cells were lost in text extraction, so where a row shows fewer than eight figures their assignment to individual tasks cannot be recovered.]

Informational functions
Providing personal information
  Present: 12 (G), 1 (L), 1 (L), 1 (L), 12 (G), 1 (L), 4 (L)
  Past: 10 (G), 4 (S), 12 (G)
  Future: 11 (G), 3 (L), 6 (S), 12 (G)
Expressing opinions: 12 (G), 11 (G), 9 (G), 8 (G), 11 (G), 10 (G), 10 (G), 11 (G)
Elaborating: 9 (G), 11 (G), 9 (G), 7 (G), 3 (L), 9 (G), 7 (S), 12 (G)
Justifying opinions: 10 (G), 7 (G), 9 (G), 7 (G), 4 (L), 8 (S), 6 (S), 8 (S)
Comparing: 11 (G), 8 (G), 1 (L), 6 (S), 3 (L), 12 (G), 7 (S), 5 (S)
Speculating: 7 (S), 11 (G), 8 (G), 3 (L), 7 (S), 10 (G), 10 (G), 5 (S)
Staging: 6 (S), 1 (L), 3 (L), 6 (L)
Describing
  Sequence of events: 1 (L), 1 (L), 3 (L), 1 (L), 4 (L)
  Scene: 5 (S), 9 (G), 2 (S), 2 (S), 10 (G), 2 (S), 3 (S)
Summarizing: 1 (L), 1 (L), 1 (L), 1 (L), 3 (L), 1 (L), 1 (L), 1 (L)
Suggesting: 1 (L), 2 (L), 1 (L), 3 (L), 2 (L)
Expressing preferences: 12 (G), 11 (G), 6 (S), 8 (G), 11 (G), 10 (G), 5 (S), 12 (G)

Interactional functions
Agreeing: 6 (S), 9 (G), 2 (L), 10 (G), 4 (L)
Disagreeing: 9 (G), 4 (S), 2 (L), 6 (S)
Modifying: 1 (L), 5 (S), 4 (S), 7 (S), 1 (L)
Asking for opinions: 1 (L), 8 (G), 2 (L), 11 (G)
Persuading: 2 (L), 2 (L)
Asking for information: 2 (L), 1 (L), 5 (S)
Conversational repair: 5 (S), 4 (L), 1 (L)
Negotiating meaning
  Check meaning: 2 (L), 4 (S), 4 (L)
  Understanding: 5 (S), 3 (L), 3 (L)
  Common ground: 2 (L), 2 (L), 1 (L)
  Ask clarification: 2 (L), 1 (L), 2 (L)
  Correct utterance: 3 (L), 1 (L)
  Respond to requests for clarification: 4 (S), 1 (L)

Managing interaction
Initiating: 8 (G), 1 (L), 10 (G), 5 (S)
Changing: 8 (G), 7 (S)
Reciprocating: 7 (G), 9 (G), 1 (L)
Deciding: 3 (L), 1 (L), 1 (L), 2 (L)

Notes: The figures indicate the number of students that completed the task in each case. L = Little agreement; S = Some agreement; G = Good agreement. For Tasks 3 and 4 in the first tape observed the maximum was 9; for all others the maximum was 12. This is because 3 of the 12 MA students did not complete the task for these last 2 tasks. This was not a problem during the observation of the second tape, so for all of these the maximum figures are 12.


Appendix 5 Transcript results and observation checklist results

Columns: Tasks 1–4. [Empty cells were lost in text extraction, so where a row shows fewer than four entries their assignment to individual tasks cannot be recovered.]

Informational functions
Providing personal information
  Present: T G, L, T, L
  Past: T G
  Future: T G
Expressing opinions: T G, T G, T G, T G
Elaborating: L, T G, T S, T G
Justifying opinions: L, T S, T S, T S
Comparing: L, T G, T S, S
Speculating: T S, T G, T G, S
Staging: T L, T S
Describing
  Sequence of events: T L, L, L
  Scene: T G, L, L
Summarizing: T L, L, L, L
Suggesting: L, L
Expressing preferences: T G, T G, S, T G

Interactional functions
Agreeing: T G, T L
Disagreeing: T S
Modifying: T S, T L
Asking for opinions: T G
Persuading: L
Asking for information: S
Conversational repair: T S, T L, L
Negotiating meaning
  Check meaning: L
  Understanding: L, L
  Common ground: L, L
  Ask clarification: L, T L
  Correct utterance: L
  Respond to requests for clarification: L

Managing interaction
Initiating: T G, T S
Changing: T S
Reciprocating: T G, L
Deciding: L, L

Notes: T indicates that this function has been identified as occurring in the transcript of the interaction. L, S and G indicate the degree of agreement among the raters using the checklists in real time (L = Little agreement; S = Some agreement; G = Good agreement).

middot two were changed the item lsquoanalysingrsquo was recoded as lsquostagingrsquoin order to clarify its intended meaning while it was decided toseparate the item lsquo(dis)agreeingrsquo into its two separate components

middot three were omitted it was argued that the item lsquoproviding non-personal informationrsquo referred to what was happening with theother items in the informational function category while the itemslsquoexplainingrsquo and lsquojustifyingsupportingrsquo were not functions usu-ally associated with the UCLES Main Suite tasks and no occur-rences of these had been noted

We would emphasize that as reported in Section IV above the initiallist was developed to cover the language functions that various spokenlanguage test tasks might elicit The development of the checklistsdescribed here re ects an attempt to customize the lists in line withthe intended functional outcomes of a speci c set of tests

We are of course aware that closed instruments of this type maybe open to the criticism that valuable information could be lost How-ever for reasons of practicality we felt it necessary to limit the listto what the examinations were intended to elicit rather than attemptto operationalize a full inventory Secondly any functions thatappeared in the data that were not covered by the reduced list wouldhave been noted There appeared to be no cases of this

The data from these two phases were combined to result in a work-ing version of the checklists (Appendix 3) which was then appliedto a pair of FCE Speaking Tests in Phase 3

Phase 3

In the third phase the revised checklists were given to a group of 15MA TEFL students who were asked to apply them to two FCE testsBoth of these tests involved a mixed-sex pair of learners one pair ofapproximately average ability and the other pair above averageBefore using the observation checklists (OCs) the students wereasked rst to attempt to predict which functions they might expect to nd To help in this pre-session task the students were given detailsof the FCE format and tasks

44 Validating speaking-test tasks

Unfortunately a small number of students did not manage to com-plete the observation task as they were somewhat overwhelmed withthe real-time application of the checklists As a result only 12 sets ofcompleted checklists were included in the nal analysis

Prior to the session the group was given an opportunity to have apractice run using a third FCE examination While this lsquotrainingrsquo per-iod coupled with the pre-session task was intended to provide thestudents with the background they needed to apply the checklists con-sistently there was a problem during the session itself This problemwas caused by the failure of a number of students to note the changefrom Task 3 to Task 4 in the rst test observed This was possiblycaused by a lack of awareness of the test itself and was not helpedby the seamless way in which the examiner on the video moved froma two-way discussion involving the test-takers to a three-way dis-cussion This meant that a full set of data exists only for the rst twotasks of this test As the problem was noticed in time the second testdid not cause these problems Unlike the earlier seminar on thisoccasion the participants were asked only to record each functionwhen it was rst observed This was done as it was felt that the earlierseminar showed that without extensive training it would be far toodif cult to apply the OCs fully in lsquorealrsquo time in order to generatecomprehensive frequency counts We are aware that a full tally wouldenable us to draw more precise conclusions about the relative fre-quency of occurrence of these functions and the degree of consensus(reliability) of observers

Against this we must emphasize that the checklists in their currentstage of development are designed to be used in real time Their usewas therefore restricted to determining the presence or absence of aparticular function Rater agreement in this case is limited to a some-what crude account of whether a function occurred or did not occurin a particular task performance We do not therefore have evidenceof whether the function observed was invariant across raters

The results from this session are included as Appendix 4 It canbe seen from this table that the participants again display mixed levelsof agreement ranging from a single perceived observation to totalagreement As with the earlier session it appears that there is rela-tively broad agreement on a range of functions but that others appearto be more dif cult to identify easily These dif culties appear to begreatest where the task involves a degree of interaction between thetest-takers

Phase 4In this phase a transcription was made of the second of the two inter-views used in Phase 3 since there was a full set of data available for

Barry OrsquoSullivan Cyril J Weir and Nick Saville 45

this interview The OCs were then lsquomappedrsquo on to this transcript inorder to give an overview from a different perspective of what func-tions were generated (it being felt that this map would result in anaccurate description of the test in terms of the items included in theOCs) This mapping was carried out by two researchers who initiallyworked independently of each other but discussed their nished workin order to arrive at a consensus

Finally the results of Phases 2 and 3 were compared (Appendix5) This clearly indicates that the checklists are now working wellThere are still some problems in items such as lsquostagingrsquo and lsquodescrib-ingrsquo and feedback from participants suggests that this may be due tomisunderstandings or misinterpretations of the gloss and examplesused In addition there are some similar dif culties with the initialthree items in the interactional functions checklist in which the great-est dif culties in applying the checklists appear to lie

VII Discussion and initial conclusions

The results of this study appear to substantiate our belief thatalthough still under development for use with the UCLES Main Suiteexaminations an operational version of these checklists is certainlyfeasible and has potentially wider application mutatis mutandis tothe content validation of other spoken language tests Further re ne-ment of the checklists is clearly required although the developmentalprocess adopted here appears to have borne positive results

1 Validities

We would not wish to claim that the checklists on their own offer asatisfactory demonstration of the construct validity of a spoken langu-age test for as Messick argues (1989 16) lsquothe varieties of evidencesupporting validity are not alternatives but rather supplements to oneanotherrsquo We recognize the necessity for a broad view of lsquothe eviden-tial basis for test interpretationrsquo (Messick 1989 20) Bachman (1990237) similarly concludes lsquoit is important to recognise that none ofthese [evidences of validity] by itself is suf cient to demonstrate thevalidity of a particular interpretation or use of test scoresrsquo (see alsoBachman 1990 243) Fulcher (1999 224) adds a further caveatagainst an overly narrow interpretation of content validity when hequotes Messick (1989 41)

the major problem is that so-called content validity is focused upon test formsrather than test scores upon instruments rather than measurements selectingcontent is an act of classi cation which is in itself a hypothesis that needs tobe con rmed empirically

46 Validating speaking-test tasks

Like these authors we regard as inadequate any conceptualization ofvalidity that does not involve the provision of evidence on a numberof levels but would argue strongly that without a clear idea of thematch between intended content and actual content any comprehen-sive investigation of the construct validity of a test is built on sandDe ning the construct is in our view underpinned by establishingthe nature of the actual performances elicited by test tasks ie thetrue content of tasks

2 Present and future applications of observational checklists

Versions of the checklists require a degree of training and practicesimilar to that given to raters if a reliable and consistent outcome isto be expected This requires that standardized training materials bedeveloped alongside the checklists In the case of these checkliststhis process has already begun with the initial versions piloted duringPhase 3 of the project

The checklists have great potential as an evaluative tool and canprovide comprehensive insight into various issues It is hoped thatamongst other issues the checklists will provide insights into the fol-lowing

middot the language functions that the different task-types (and differentsub-tasks within these) employed in the UCLES Main Suite Paper5 (Speaking) Tests typically elicit

middot the language that the pair-format elicits and how it differs in nat-ure and quality from that elicited by interlocutor-single candi-date testing

middot the extent to which there is functional variation across the topfour levels of the UCLES Main Suite Spoken Language Test

In addition to these issues the way in which the checklists can beapplied may allow for other important questions to be answered Forexample by allowing the evaluator multiple observations (stoppingand starting a recording of a test at will) it will be possible to estab-lish whether there are quanti able differences in the language func-tions generated by the different tasks ie the evaluators will havethe time they need to make frequency counts of the functions

While the results to date have focused on a posteriori validationprocedures these checklists are also relevant to task design By takinginto account the expected response of a task (and by describing thatresponse in terms of these functions ) it will be possible to explorepredicted and actual test task outcome It will also be a useful guidefor item writers in taking a priori decisions about content coverageThrough this approach it should be possible to predict more accurately

Barry OrsquoSullivan Cyril J Weir and Nick Saville 47

linguistic response (in terms of the elements of the checklists) andto apply this to the design of test tasks ndash and of course to evaluatethe success of the prediction later on In the longer term this willlead to a greater understanding of how tasks and task formats can bemanipulated to result in speci c language use We are not claimingthat it is possible to predict language use at a micro level(grammatical form or lexis) but that it is possible to predict infor-mational and interactional functions and features of interaction man-agement ndash a notion supported by Bygate (1999)

The checklists should also enable us to explore how systematicvariation in such areas as interviewer questioning behaviour (andinterlocutor frame adherence ) affects the language produced in thistype of test In the interview transcribed for this study for examplethe examiner directed his questions very deliberately (systematicallyaiming the questions at one participant and then the other) Thistended to sti e any spontaneity in the intended three-way discussion(Task 4) so occurrences of Interactional and Discourse ManagementFunctions did not materialize to the extent intended by the taskdesigners It is possible that a less deliberate (unscripted) questioningtechnique would lead to a less interviewer-oriented interaction patternand allow for the more genuine interactive communication envisagedin the task design

Perhaps the most valuable contribution that this type of validationprocedure offers is its potential to improve the quality of oral assess-ment in both low-stakes and high-stakes contexts By offering theinvestigator an instrument that can be used in real time the checklistsbroaden the scope of investigation from limited case study analysisof small numbers of test transcripts to large scale eld studies acrossa wide range of testing contexts

Acknowledgements

We would like to thank Don Porter and Rita Green for their earlyinput into the rst version of the checklist In addition help wasreceived from members of the ELT division in UCLES in particularfrom Angela ffrench Lynda Taylor and Christina Rimini from agroup of UCLES Senior Team Leaders and from MA TEFL studentsat the University of Reading Finally we would like to thank theeditors and anonymous reviewers of Language Testing for theirinsightful comments and helpful suggestions for its improvement Thefaults that remain are as ever ours

48 Validating speaking-test tasks

VIII References

Anastasi A 1988 Psychological testing 6th edition New York Macmil-lan

Bachman LF 1990 Fundamental considerations in language testingOxford Oxford University Press

Bachman LF and Palmer AS 1981 The construct validation of the FSIoral interview Language Learning 31 67ndash86

mdashmdash 1996 Language testing in practice Oxford Oxford University PressBallman TL 1991 The oral task of picture description similarities and

differences in native and nonnative speakers of Spanish In TeschnerRV editor Assessing foreign language pro ciency of undergrad-uates AAUSC Issues in Language Program Direction Boston Heinleand Heinle 221ndash31

Brown A 1998 Interviewer style and candidate performance in the IELSToral interview Paper presented at the Language Testing Research Col-loquium Monterey CA

Bygate M 1988 Speaking Oxford Oxford University Pressmdashmdash 1999 Quality of language and purpose of task patterns of learnersrsquo

language on two oral communication tasks Language TeachingResearch 3 185ndash214

Chalhoub-Deville M 1995a Deriving oral assessment scales across differ-ent tests and rater groups Language Testing 12 16ndash33

mdashmdash 1995b A contextualized approach to describing oral language pro- ciency Language Learning 45 251ndash81

Clark JLD 1979 Direct vs semi-direct tests of speaking ability In Bri-ere EJ and Hinofotis FB editors Concepts in language testingsome recent studies Washington DC TESOL

mdashmdash 1988 Validation of a tape-mediated ACTFLILR scale based test ofChinese speaking pro ciency Language Testing 5 187ndash205

Clark JLD and Hooshmand D 1992 lsquoScreen to Screenrsquo testing anexploratory study of oral pro ciency interviewing using video tele-conferencing System 20 293ndash304

Cronbach LJ 1971 Validity In Thorndike RL editor Educationalmeasurement 2nd edition Washington DC American Council on Edu-cation 443ndash597

mdashmdash 1990 Essentials of psychological testing 5th edition New YorkHarper amp Row

Davies A 1977 The construction of language tests In Allen JPB andDavies A editors Testing and experimental methods The EdinburghCourse in Applied Linguistics Volume 4 London Oxford UniversityPress 38ndash194

mdashmdash 1990 Principles of language testing Oxford BlackwellEllerton AW 1997 Considerations in the validation of semi-direct oral

testing Unpublished PhD thesis CALS University of Readingffrench A 1999 Language functions and UCLES speaking tests Seminar

in Athens Greece October 1999

Barry OrsquoSullivan Cyril J Weir and Nick Saville 49

Foster P and Skehan P 1996 The in uence of planning and task typeon second language performance Studies in Second Language Acqui-sition 18 299ndash323

mdashmdash 1999 The in uence of source of planning and focus of planning ontask-based performance Language Teaching Research 3 215ndash47

Fulcher G 1994 Some priority areas for oral language testing LanguageTesting Update 15 39ndash47

mdashmdash 1996 Testing tasks issues in task design and the group oral LanguageTesting 13 23ndash51

mdashmdash 1999 Assessment in English for academic purposes putting contentvalidity in its place Applied Linguistics 20 221ndash36

Hayashi M 1995 Conversational repair a contrastive study of Japaneseand English MA Project Report University of Canberra

Henning G 1983 Oral pro ciency testing comparative validities of inter-view imitation and completion methods Language Learning 33315ndash32

mdashmdash 1987 A guide to language testing Cambridge MA Newbury HouseKelly R 1978 On the construct validation of comprehension tests an exer-

cise in applied linguistics Unpublished PhD thesis University ofQueensland

Kenyon D 1995 An investigation of the validity of task demands onperformance-based tests of oral pro ciency In Kunnan AJ editorValidation in language assessment selected papers from the 17th Lan-guage Testing Research Colloquium Long Beach Mahwah NJ Lawr-ence Erlbaum 19ndash40

Kormos J 1999 Simulating conversations in oral-pro ciency assessmenta conversation analysis of role plays and non-scripted interviews inlanguage exams Language Testing 16 163ndash88

Lazaraton A 1992 The structural organisation of a language interview aconversational analytic perspective System 20 373ndash86

mdashmdash1996 A qualitative approach to monitoring examiner conduct in theCambridge assessment of spoken English (CASE) In Milanovic Mand Saville N editors Performance testing cognition andassessment selected papers from the 15th Language Testing ResearchColloquium Cambridge and Arnhem Studies in Language Testing 3Cambridge University of Cambridge Local Examinations Syndicate18ndash33

mdashmdash 2000 A qualitative approach to the validation of oral language testsStudies in Language Testing Volume 14 Cambridge Cambridge Uni-versity Press

Lumley T and OrsquoSullivan B 2000 The effect of speaker and topic vari-ables on task performance in a tape-mediated assessment of speakingPaper presented at the 2nd Annual Asian Language AssessmentResearch Forum The Hong Kong Polytechnic University

Luoma S 1997 Comparability of a tape-mediated and a face-to-face testof speaking a triangulation study Unpublished Licentiate ThesisCentre for Applied Language Studies Jyvaskyla University Finland

50 Validating speaking-test tasks

McNamara T 1996 Measuring second language performance LondonLongman

Mehnert U 1998 The effects of different lengths of time for planning onsecond language performance Studies in Second Language Acquisition20 83ndash108

Messick S 1975 The standard problem meaning and values in measure-ment and evaluation American Psychologist 30 955ndash66

mdashmdash 1989 Validity In Linn RL editor Educational measurement 3rdedition New York Macmillan

Milanovic M and Saville N 1996 Introduction Performance testing cog-nition and assessment Studies in Language Testing Volume 3 Cam-bridge University of Cambridge Local Examinations Syndicate 1ndash17

Moller A D 1982 A study in the validation of pro ciency tests of Englishas a Foreign Language Unpublished PhD thesis University of Edin-burgh

Norris J Brown J D Hudson T and Yoshioka J 1998 Designingsecond language performance assessments Technical Report 18Honolulu HI University of Hawaii Press

OrsquoLoughlin K 1995 Lexical density in candidate output on direct andsemi-direct versions of an oral pro ciency test Language Testing 12217ndash37

mdashmdash 1997 The comparability of direct and semi-direct speaking tests a casestudy Unpublished PhD Thesis University of Melbourne Melbourne

mdashmdash 2001 An investigatory study of the equivalence of direct and semi-direct speaking skills Studies in Language Testing 13 CambridgeCambridge University PressUCLES

Ortega L 1999 Planning and focus on form in L2 oral performance Stud-ies in Second Language Acquisition 20 109ndash48

OrsquoSullivan B 2000 Towards a model of performance in oral languagetesting Unpublished PhD dissertation CALS University of Reading

Robinson P 1995 Task complexity and second language narrative dis-course Language Learning 45 99ndash140

Ross S and Berwick R 1992 The discourse of accommodation in oralpro ciency interviews Studies in Second Language Acquisition 14159ndash76

Saville N and Hargreaves P 1999 Assessing speaking in the revisedFCE ELT Journal 53 42ndash51

Schegloff E Jefferson G and Sachs H 1977 The preference for self-correction in the organisation of repair in conversation Language 53361ndash82

Schwartz J 1980 The negotiation for meaning repair in conversationsbetween second language learners of English In Larsen-Freeman Deditor Discourse analysis in second language research Rowley MANewbury House

Shohamy E 1983 The stability of oral language pro ciency assessment inthe oral interview testing procedure Language Learning 33 527ndash40

mdashmdash 1988 A proposed framework for testing the oral language of

Barry OrsquoSullivan Cyril J Weir and Nick Saville 51

second foreign language learners Studies in Second Language Acqui-sition 10 165ndash79

mdashmdash 1994 The validity of direct versus semi-direct oral tests LanguageTesting 11 99ndash123

Shohamy E Reves T and Bejarano Y 1986 Introducing a new compre-hensive test of oral pro ciency ELT Journal 40 212ndash20

Skehan P 1996 A framework for the implementation of task based instruc-tion Applied Linguistics 17 38ndash62

mdashmdash 1998 A cognitive approach to language learning Oxford OxfordUniversity Press

Stans eld CW and Kenyon DM 1992 Research on the comparabilityof the oral pro ciency interview and the simulated oral pro ciencyinterview System 20 347ndash64

Stenstrom A 1994 An introduction to spoken interaction London Long-man

Suhua H 1998 A communicative test of spoken English for the CET 6Unpublished PhD Thesis Shanghai Jiao Tong University Shanghai

Upshur JA and Turner C 1999 Systematic effects in the rating ofsecond-language speaking ability test method and learner discourseLanguage Testing 16 82ndash111

van Ek JA and Trim JLM editors 1984 Across the thresholdOxford Pergamon

van Lier L 1989 Reeling writhing drawling stretching and fainting incoils oral pro ciency interviews as conversation TESOL Quarterly23 489ndash508

Walker C 1990 Large-scale oral testing Applied Linguistics 11 200ndash19Weir CJ 1983 Identifying the language needs of overseas students in

tertiary education in the United Kingdom Unpublished PhD thesisUniversity of London

mdashmdash 1993 Understanding and developing language tests HemelHempstead Prentice Hall

Wigglesworth G 1997 An investigation of planning time and pro ciencylevel on oral test discourse Language Testing 14 85ndash106

Wigglesworth G and OrsquoLoughlin K 1993 An investigation into the com-parability of direct and semi-direct versions of an oral interaction testin English Melbourne Papers in Language Testing 2 56ndash67

Young R 1995 Conversational styles in language pro ciency interviewsLanguage Learning 45 3ndash42

Young R and Milanovic M 1992 Discourse variation in oral pro ciencyinterviews Studies in Second Language Acquisition 14 403ndash24

52 Validating speaking-test tasks

Appendix 1 Items included in initial draft checklists (with short gloss)

Informational functionsProviding personal information middot Give information on present circumstances

middot Give information on past experiencesmiddot Give information on future plans

Providing nonpersonal Give information which does not relate to the individualinformationElaborating Elaborate on an ideaExpressing opinions Express opinionsJustifying opinions Express reasons for assertions she has madeComparing Compare thingspeopleeventsComplaining Complain about somethingSpeculating Hypothesize or speculateAnalysing Separate out the parts of an issueMaking excuses Make excusesExplaining Explain anythingNarrating Describe a sequence of eventsParaphrasing Paraphrase somethingSummarizing Summarize what she had saidSuggesting Suggest a particular ideaExpressing preferences Express preferences

Interactional functionsChallenging Challenge assertions made by another speaker(Dis)agreeing Indicate (dis)agreement with what another speaker

says (apart from lsquoyeahrsquolsquonorsquo or simply nodding)JustifyingProviding support Offer justication or support for a comment made by

another speakerQualifying Modify arguments or commentsAsking for opinions Ask for opinionsPersuading Attempt to persuade another personAsking for information Ask for informationConversational repair Repair breakdowns in interactionNegotiating meaning middot Check understanding

middot Attempt to establish common ground or strategymiddot Respond to requests for claricationmiddot Ask for claricationmiddot Make correctionsmiddot Indicate purposemiddot Indicate understandinguncertainty

Managing interactionInitiating Start any interactionsChanging Take the opportunity to change the topicReciprocity Share the responsibility for developing the interactionDeciding Come to a decisionTerminating Decide when the discussion should stop

Appendix 2 Phase 1 results (summarized)

[Figure: matrix of Phase 1 observations, participants against checklist items, sorted by total number of observations. Items shown: Make excuses; Terminate; Conversational repair; Summarize; Complain; Paraphrase; Persuade; Change topic; Challenge; Qualify; Ask for info; Suggest; Narrate; Reciprocate; Analyse; Elaborate; Initiate; Provide nonpersonal information; Explain; Justify opinions; Negotiate meaning; Decide; (Dis)agree; Justify/Support; Ask for opinions; Express preferences; Speculate; Compare; Express opinion.]


Appendix 3 Operational checklist (used in Phase 3)

Informational functions
Providing personal information: · Give information on present circumstances · Give information on past experiences · Give information on future plans
Expressing opinions: Express opinions
Elaborating: Elaborate on or modify an opinion
Justifying opinions: Express reasons for assertions she had made
Comparing: Compare things/people/events
Speculating: Speculate
Staging: Separate out or interpret the parts of an issue
Describing: · Describe a sequence of events · Describe a scene
Summarizing: Summarize what she has said
Suggesting: Suggest a particular idea
Expressing preferences: Express preferences

Interactional functions
Agreeing: Agree with an assertion made by another speaker (apart from 'yeah' or nonverbal)
Disagreeing: Disagree with what another speaker says (apart from 'no' or nonverbal)
Modifying: Modify arguments or comments made by other speaker, or by the test-taker in response to another speaker
Asking for opinions: Ask for opinions
Persuading: Attempt to persuade another person
Asking for information: Ask for information
Conversational repair: Repair breakdowns in interaction
Negotiating meaning: · Check understanding · Indicate understanding of point made by partner · Establish common ground/purpose or strategy · Ask for clarification when an utterance is misheard or misinterpreted · Correct an utterance made by other speaker which is perceived to be incorrect or inaccurate · Respond to requests for clarification

Managing interaction
Initiating: Start any interactions
Changing: Take the opportunity to change the topic
Reciprocating: Share the responsibility for developing the interaction
Deciding: Come to a decision


Appendix 4 Summary of Phase 2 observation

(Columns: Tape 1, Tasks 1–4, then Tape 2, Tasks 1–4)

Informational functions
Providing personal information
 Present: 12 (G), 1 (L), 1 (L), 1 (L), 12 (G), 1 (L), 4 (L)
 Past: 10 (G), 4 (S), 12 (G)
 Future: 11 (G), 3 (L), 6 (S), 12 (G)
Expressing opinions: 12 (G), 11 (G), 9 (G), 8 (G), 11 (G), 10 (G), 10 (G), 11 (G)
Elaborating: 9 (G), 11 (G), 9 (G), 7 (G), 3 (L), 9 (G), 7 (S), 12 (G)
Justifying opinions: 10 (G), 7 (G), 9 (G), 7 (G), 4 (L), 8 (S), 6 (S), 8 (S)
Comparing: 11 (G), 8 (G), 1 (L), 6 (S), 3 (L), 12 (G), 7 (S), 5 (S)
Speculating: 7 (S), 11 (G), 8 (G), 3 (L), 7 (S), 10 (G), 10 (G), 5 (S)
Staging: 6 (S), 1 (L), 3 (L), 6 (L)
Describing
 Sequence of events: 1 (L), 1 (L), 3 (L), 1 (L), 4 (L)
 Scene: 5 (S), 9 (G), 2 (S), 2 (S), 10 (G), 2 (S), 3 (S)
Summarizing: 1 (L), 1 (L), 1 (L), 1 (L), 3 (L), 1 (L), 1 (L), 1 (L)
Suggesting: 1 (L), 2 (L), 1 (L), 3 (L), 2 (L)
Expressing preferences: 12 (G), 11 (G), 6 (S), 8 (G), 11 (G), 10 (G), 5 (S), 12 (G)

Interactional functions
Agreeing: 6 (S), 9 (G), 2 (L), 10 (G), 4 (L)
Disagreeing: 9 (G), 4 (S), 2 (L), 6 (S)
Modifying: 1 (L), 5 (S), 4 (S), 7 (S), 1 (L)
Asking for opinions: 1 (L), 8 (G), 2 (L), 11 (G)
Persuading: 2 (L), 2 (L)
Asking for information: 2 (L), 1 (L), 5 (S)
Conversational repair: 5 (S), 4 (L), 1 (L)
Negotiating meaning
 Check meaning: 2 (L), 4 (S), 4 (L)
 Understanding: 5 (S), 3 (L), 3 (L)
 Common ground: 2 (L), 2 (L), 1 (L)
 Ask clarification: 2 (L), 1 (L), 2 (L)
 Correct utterance: 3 (L), 1 (L)
 Respond to requests for clarification: 4 (S), 1 (L)

Managing interaction
Initiating: 8 (G), 1 (L), 10 (G), 5 (S)
Changing: 8 (G), 7 (S)
Reciprocating: 7 (G), 9 (G), 1 (L)
Deciding: 3 (L), 1 (L), 1 (L), 2 (L)

Notes: The figures indicate the number of students that completed the task in each case. L = Little agreement; S = Some agreement; G = Good agreement. For Tasks 3 and 4 in the first tape observed, the maximum was 9; for all others, the maximum was 12. This is because 3 of the 12 MA students did not complete the task for these last 2 tasks. This was not a problem during the observation of the second tape, so for all tasks there the maximum figures are 12.


Appendix 5 Transcript results and observation checklist results

(Columns: Tasks 1–4)

Informational functions
Providing personal information
 Present: T G, L, T L
 Past: T G
 Future: T G
Expressing opinions: T G, T G, T G, T G
Elaborating: L, T G, T S, T G
Justifying opinions: L, T S, T S, T S
Comparing: L, T G, T S, S
Speculating: T S, T G, T G, S
Staging: T L, T S
Describing
 Sequence of events: T L, L, L
 Scene: T G, L, L
Summarizing: T L, L, L, L
Suggesting: L, L
Expressing preferences: T G, T G, S, T G

Interactional functions
Agreeing: T G, T L
Disagreeing: T S
Modifying: T S, T L
Asking for opinions: T G
Persuading: L
Asking for information: S
Conversational repair: T S, T L, L
Negotiating meaning
 Check meaning: L
 Understanding: L, L
 Common ground: L, L
 Ask clarification: L, T L
 Correct utterance: L
 Respond to requests for clarification: L

Managing interaction
Initiating: T G, T S
Changing: T S
Reciprocating: T G, L
Deciding: L, L

Notes: T indicates that this function has been identified as occurring in the transcript of the interaction; L, S and G indicate the degree of agreement among the raters using the checklists in real time (L = Little agreement; S = Some agreement; G = Good agreement).

Page 10: Using observation checklists to validate speaking-test taskspart. The format of the Main Suite Speaking Tests (with the exception of the Level 1 KET test) is summarized in Table 1

42 Validating speaking-test tasks

agreement on all but a small number of functions (Appendix 2) Notehere that in order to make these patterns of behaviour clear the datahave been sorted both horizontally and vertically by the total numberof observations made by each participant and of each item

From this perspective this aspect of the developmental process wasconsidered to be quite successful However it was apparent that therewere a number of elements within the checklists that were causingsome dif culty These are highlighted in the table by the tram-linesItems above the lines have been identi ed by some participants inone case by a single person while those below have been observedby a majority of participants (in two cases by all of them) For thesecases we might infer a high degree of agreement However themiddle range of items appears to have caused a degree of confusionand so are highlighted here ie marked for further investigation

Phase 2

In this phase a much smaller gathering was organized this timeinvolving members of the development team as well as the three UK-based UCLES Senior Team Leaders In advance of this meeting allparticipants were asked to study the existing checklists and to exemp-lify each function with examples drawn from their experiences of thevarious UCLES Main Suite examinations The resulting data werecollated and presented as a single document that formed the basis ofdiscussion during a day-long session Participants were not madeaware of the ndings from Phase 1

During this session many questions were asked of all aspects ofthe checklist and a more streamlined version of the three sectionswas suggested In addition to a number of participants making a writ-ten record of the discussions the entire session was recorded Thisproved to be a valuable reminder of the way in which particularchanges came about and was used when the nal decisions regardinginclusion con ation or omission were being made Although it isbeyond the scope of this project to analyse this recording whencoupled with the earlier and revised documents it is in itself a valu-able source of data in that it provides a signi cant record of the devel-opmental process

Among the many interesting outcomes of this phase were thedecisions either to rethink to reorganize or to omit items from theinitial list These decisions were seen to mirror the results of the Phase1 application quite closely Of the 13 items identi ed in Phase 1 asbeing in need of review (7 were rarely observed indicating a highdegree of agreement that they were not in fact present and 6appeared to be confused with very mixed reported observations ) 7

Barry OrsquoSullivan Cyril J Weir and Nick Saville 43

were recommended for either omission or inclusion in other items bythe panel while the remaining 6 items were identi ed by them asbeing of value Although no examples of the latter had appeared inthe earlier data the panel agreed that they represented language func-tions that the UCLES Main Suite examinations were intended to elicitIt was also decided that each item in this latter group was in need offurther clari cation andor exempli cation Of the remaining 17items

middot two were changed the item lsquoanalysingrsquo was recoded as lsquostagingrsquoin order to clarify its intended meaning while it was decided toseparate the item lsquo(dis)agreeingrsquo into its two separate components

middot three were omitted it was argued that the item lsquoproviding non-personal informationrsquo referred to what was happening with theother items in the informational function category while the itemslsquoexplainingrsquo and lsquojustifyingsupportingrsquo were not functions usu-ally associated with the UCLES Main Suite tasks and no occur-rences of these had been noted

We would emphasize that as reported in Section IV above the initiallist was developed to cover the language functions that various spokenlanguage test tasks might elicit The development of the checklistsdescribed here re ects an attempt to customize the lists in line withthe intended functional outcomes of a speci c set of tests

We are of course aware that closed instruments of this type maybe open to the criticism that valuable information could be lost How-ever for reasons of practicality we felt it necessary to limit the listto what the examinations were intended to elicit rather than attemptto operationalize a full inventory Secondly any functions thatappeared in the data that were not covered by the reduced list wouldhave been noted There appeared to be no cases of this

The data from these two phases were combined to result in a work-ing version of the checklists (Appendix 3) which was then appliedto a pair of FCE Speaking Tests in Phase 3

Phase 3

In the third phase the revised checklists were given to a group of 15MA TEFL students who were asked to apply them to two FCE testsBoth of these tests involved a mixed-sex pair of learners one pair ofapproximately average ability and the other pair above averageBefore using the observation checklists (OCs) the students wereasked rst to attempt to predict which functions they might expect to nd To help in this pre-session task the students were given detailsof the FCE format and tasks

44 Validating speaking-test tasks

Unfortunately a small number of students did not manage to com-plete the observation task as they were somewhat overwhelmed withthe real-time application of the checklists As a result only 12 sets ofcompleted checklists were included in the nal analysis

Prior to the session the group was given an opportunity to have apractice run using a third FCE examination While this lsquotrainingrsquo per-iod coupled with the pre-session task was intended to provide thestudents with the background they needed to apply the checklists con-sistently there was a problem during the session itself This problemwas caused by the failure of a number of students to note the changefrom Task 3 to Task 4 in the rst test observed This was possiblycaused by a lack of awareness of the test itself and was not helpedby the seamless way in which the examiner on the video moved froma two-way discussion involving the test-takers to a three-way dis-cussion This meant that a full set of data exists only for the rst twotasks of this test As the problem was noticed in time the second testdid not cause these problems Unlike the earlier seminar on thisoccasion the participants were asked only to record each functionwhen it was rst observed This was done as it was felt that the earlierseminar showed that without extensive training it would be far toodif cult to apply the OCs fully in lsquorealrsquo time in order to generatecomprehensive frequency counts We are aware that a full tally wouldenable us to draw more precise conclusions about the relative fre-quency of occurrence of these functions and the degree of consensus(reliability) of observers

Against this we must emphasize that the checklists in their currentstage of development are designed to be used in real time Their usewas therefore restricted to determining the presence or absence of aparticular function Rater agreement in this case is limited to a some-what crude account of whether a function occurred or did not occurin a particular task performance We do not therefore have evidenceof whether the function observed was invariant across raters

The results from this session are included as Appendix 4 It canbe seen from this table that the participants again display mixed levelsof agreement ranging from a single perceived observation to totalagreement As with the earlier session it appears that there is rela-tively broad agreement on a range of functions but that others appearto be more dif cult to identify easily These dif culties appear to begreatest where the task involves a degree of interaction between thetest-takers

Phase 4In this phase a transcription was made of the second of the two inter-views used in Phase 3 since there was a full set of data available for

Barry OrsquoSullivan Cyril J Weir and Nick Saville 45

this interview The OCs were then lsquomappedrsquo on to this transcript inorder to give an overview from a different perspective of what func-tions were generated (it being felt that this map would result in anaccurate description of the test in terms of the items included in theOCs) This mapping was carried out by two researchers who initiallyworked independently of each other but discussed their nished workin order to arrive at a consensus

Finally the results of Phases 2 and 3 were compared (Appendix5) This clearly indicates that the checklists are now working wellThere are still some problems in items such as lsquostagingrsquo and lsquodescrib-ingrsquo and feedback from participants suggests that this may be due tomisunderstandings or misinterpretations of the gloss and examplesused In addition there are some similar dif culties with the initialthree items in the interactional functions checklist in which the great-est dif culties in applying the checklists appear to lie

VII Discussion and initial conclusions

The results of this study appear to substantiate our belief thatalthough still under development for use with the UCLES Main Suiteexaminations an operational version of these checklists is certainlyfeasible and has potentially wider application mutatis mutandis tothe content validation of other spoken language tests Further re ne-ment of the checklists is clearly required although the developmentalprocess adopted here appears to have borne positive results

1 Validities

We would not wish to claim that the checklists on their own offer asatisfactory demonstration of the construct validity of a spoken langu-age test for as Messick argues (1989 16) lsquothe varieties of evidencesupporting validity are not alternatives but rather supplements to oneanotherrsquo We recognize the necessity for a broad view of lsquothe eviden-tial basis for test interpretationrsquo (Messick 1989 20) Bachman (1990237) similarly concludes lsquoit is important to recognise that none ofthese [evidences of validity] by itself is suf cient to demonstrate thevalidity of a particular interpretation or use of test scoresrsquo (see alsoBachman 1990 243) Fulcher (1999 224) adds a further caveatagainst an overly narrow interpretation of content validity when hequotes Messick (1989 41)

the major problem is that so-called content validity is focused upon test formsrather than test scores upon instruments rather than measurements selectingcontent is an act of classi cation which is in itself a hypothesis that needs tobe con rmed empirically

46 Validating speaking-test tasks

Like these authors we regard as inadequate any conceptualization ofvalidity that does not involve the provision of evidence on a numberof levels but would argue strongly that without a clear idea of thematch between intended content and actual content any comprehen-sive investigation of the construct validity of a test is built on sandDe ning the construct is in our view underpinned by establishingthe nature of the actual performances elicited by test tasks ie thetrue content of tasks

2 Present and future applications of observational checklists

Versions of the checklists require a degree of training and practicesimilar to that given to raters if a reliable and consistent outcome isto be expected This requires that standardized training materials bedeveloped alongside the checklists In the case of these checkliststhis process has already begun with the initial versions piloted duringPhase 3 of the project

The checklists have great potential as an evaluative tool and canprovide comprehensive insight into various issues It is hoped thatamongst other issues the checklists will provide insights into the fol-lowing

middot the language functions that the different task-types (and differentsub-tasks within these) employed in the UCLES Main Suite Paper5 (Speaking) Tests typically elicit

middot the language that the pair-format elicits and how it differs in nat-ure and quality from that elicited by interlocutor-single candi-date testing

middot the extent to which there is functional variation across the topfour levels of the UCLES Main Suite Spoken Language Test

In addition to these issues the way in which the checklists can beapplied may allow for other important questions to be answered Forexample by allowing the evaluator multiple observations (stoppingand starting a recording of a test at will) it will be possible to estab-lish whether there are quanti able differences in the language func-tions generated by the different tasks ie the evaluators will havethe time they need to make frequency counts of the functions

While the results to date have focused on a posteriori validationprocedures these checklists are also relevant to task design By takinginto account the expected response of a task (and by describing thatresponse in terms of these functions ) it will be possible to explorepredicted and actual test task outcome It will also be a useful guidefor item writers in taking a priori decisions about content coverageThrough this approach it should be possible to predict more accurately

Barry OrsquoSullivan Cyril J Weir and Nick Saville 47

linguistic response (in terms of the elements of the checklists) andto apply this to the design of test tasks ndash and of course to evaluatethe success of the prediction later on In the longer term this willlead to a greater understanding of how tasks and task formats can bemanipulated to result in speci c language use We are not claimingthat it is possible to predict language use at a micro level(grammatical form or lexis) but that it is possible to predict infor-mational and interactional functions and features of interaction man-agement ndash a notion supported by Bygate (1999)

The checklists should also enable us to explore how systematicvariation in such areas as interviewer questioning behaviour (andinterlocutor frame adherence ) affects the language produced in thistype of test In the interview transcribed for this study for examplethe examiner directed his questions very deliberately (systematicallyaiming the questions at one participant and then the other) Thistended to sti e any spontaneity in the intended three-way discussion(Task 4) so occurrences of Interactional and Discourse ManagementFunctions did not materialize to the extent intended by the taskdesigners It is possible that a less deliberate (unscripted) questioningtechnique would lead to a less interviewer-oriented interaction patternand allow for the more genuine interactive communication envisagedin the task design

Perhaps the most valuable contribution that this type of validationprocedure offers is its potential to improve the quality of oral assess-ment in both low-stakes and high-stakes contexts By offering theinvestigator an instrument that can be used in real time the checklistsbroaden the scope of investigation from limited case study analysisof small numbers of test transcripts to large scale eld studies acrossa wide range of testing contexts

Acknowledgements

We would like to thank Don Porter and Rita Green for their earlyinput into the rst version of the checklist In addition help wasreceived from members of the ELT division in UCLES in particularfrom Angela ffrench Lynda Taylor and Christina Rimini from agroup of UCLES Senior Team Leaders and from MA TEFL studentsat the University of Reading Finally we would like to thank theeditors and anonymous reviewers of Language Testing for theirinsightful comments and helpful suggestions for its improvement Thefaults that remain are as ever ours

48 Validating speaking-test tasks

VIII References

Anastasi A 1988 Psychological testing 6th edition New York Macmil-lan

Bachman LF 1990 Fundamental considerations in language testingOxford Oxford University Press

Bachman LF and Palmer AS 1981 The construct validation of the FSIoral interview Language Learning 31 67ndash86

mdashmdash 1996 Language testing in practice Oxford Oxford University PressBallman TL 1991 The oral task of picture description similarities and

differences in native and nonnative speakers of Spanish In TeschnerRV editor Assessing foreign language pro ciency of undergrad-uates AAUSC Issues in Language Program Direction Boston Heinleand Heinle 221ndash31

Brown A 1998 Interviewer style and candidate performance in the IELSToral interview Paper presented at the Language Testing Research Col-loquium Monterey CA

Bygate M 1988 Speaking Oxford Oxford University Pressmdashmdash 1999 Quality of language and purpose of task patterns of learnersrsquo

language on two oral communication tasks Language TeachingResearch 3 185ndash214

Chalhoub-Deville M 1995a Deriving oral assessment scales across differ-ent tests and rater groups Language Testing 12 16ndash33

mdashmdash 1995b A contextualized approach to describing oral language pro- ciency Language Learning 45 251ndash81

Clark JLD 1979 Direct vs semi-direct tests of speaking ability In Bri-ere EJ and Hinofotis FB editors Concepts in language testingsome recent studies Washington DC TESOL

mdashmdash 1988 Validation of a tape-mediated ACTFLILR scale based test ofChinese speaking pro ciency Language Testing 5 187ndash205

Clark JLD and Hooshmand D 1992 lsquoScreen to Screenrsquo testing anexploratory study of oral pro ciency interviewing using video tele-conferencing System 20 293ndash304

Cronbach LJ 1971 Validity In Thorndike RL editor Educationalmeasurement 2nd edition Washington DC American Council on Edu-cation 443ndash597

mdashmdash 1990 Essentials of psychological testing 5th edition New YorkHarper amp Row

Davies A 1977 The construction of language tests In Allen JPB andDavies A editors Testing and experimental methods The EdinburghCourse in Applied Linguistics Volume 4 London Oxford UniversityPress 38ndash194

mdashmdash 1990 Principles of language testing Oxford BlackwellEllerton AW 1997 Considerations in the validation of semi-direct oral

testing Unpublished PhD thesis CALS University of Readingffrench A 1999 Language functions and UCLES speaking tests Seminar

in Athens Greece October 1999

Barry OrsquoSullivan Cyril J Weir and Nick Saville 49

Foster P and Skehan P 1996 The in uence of planning and task typeon second language performance Studies in Second Language Acqui-sition 18 299ndash323

mdashmdash 1999 The in uence of source of planning and focus of planning ontask-based performance Language Teaching Research 3 215ndash47

Fulcher G 1994 Some priority areas for oral language testing LanguageTesting Update 15 39ndash47

mdashmdash 1996 Testing tasks issues in task design and the group oral LanguageTesting 13 23ndash51

mdashmdash 1999 Assessment in English for academic purposes putting contentvalidity in its place Applied Linguistics 20 221ndash36

Hayashi M 1995 Conversational repair a contrastive study of Japaneseand English MA Project Report University of Canberra

Henning G 1983 Oral pro ciency testing comparative validities of inter-view imitation and completion methods Language Learning 33315ndash32

mdashmdash 1987 A guide to language testing Cambridge MA Newbury HouseKelly R 1978 On the construct validation of comprehension tests an exer-

cise in applied linguistics Unpublished PhD thesis University ofQueensland

Kenyon D 1995 An investigation of the validity of task demands onperformance-based tests of oral pro ciency In Kunnan AJ editorValidation in language assessment selected papers from the 17th Lan-guage Testing Research Colloquium Long Beach Mahwah NJ Lawr-ence Erlbaum 19ndash40

Kormos J 1999 Simulating conversations in oral-pro ciency assessmenta conversation analysis of role plays and non-scripted interviews inlanguage exams Language Testing 16 163ndash88

Lazaraton A 1992 The structural organisation of a language interview aconversational analytic perspective System 20 373ndash86

mdashmdash1996 A qualitative approach to monitoring examiner conduct in theCambridge assessment of spoken English (CASE) In Milanovic Mand Saville N editors Performance testing cognition andassessment selected papers from the 15th Language Testing ResearchColloquium Cambridge and Arnhem Studies in Language Testing 3Cambridge University of Cambridge Local Examinations Syndicate18ndash33

mdashmdash 2000 A qualitative approach to the validation of oral language testsStudies in Language Testing Volume 14 Cambridge Cambridge Uni-versity Press

Lumley T and OrsquoSullivan B 2000 The effect of speaker and topic vari-ables on task performance in a tape-mediated assessment of speakingPaper presented at the 2nd Annual Asian Language AssessmentResearch Forum The Hong Kong Polytechnic University

Luoma S 1997 Comparability of a tape-mediated and a face-to-face testof speaking a triangulation study Unpublished Licentiate ThesisCentre for Applied Language Studies Jyvaskyla University Finland

50 Validating speaking-test tasks

McNamara T 1996 Measuring second language performance LondonLongman

Mehnert U 1998 The effects of different lengths of time for planning onsecond language performance Studies in Second Language Acquisition20 83ndash108

Messick S 1975 The standard problem meaning and values in measure-ment and evaluation American Psychologist 30 955ndash66

mdashmdash 1989 Validity In Linn RL editor Educational measurement 3rdedition New York Macmillan

Milanovic M and Saville N 1996 Introduction Performance testing cog-nition and assessment Studies in Language Testing Volume 3 Cam-bridge University of Cambridge Local Examinations Syndicate 1ndash17

Moller A D 1982 A study in the validation of pro ciency tests of Englishas a Foreign Language Unpublished PhD thesis University of Edin-burgh

Norris J Brown J D Hudson T and Yoshioka J 1998 Designingsecond language performance assessments Technical Report 18Honolulu HI University of Hawaii Press

OrsquoLoughlin K 1995 Lexical density in candidate output on direct andsemi-direct versions of an oral pro ciency test Language Testing 12217ndash37

mdashmdash 1997 The comparability of direct and semi-direct speaking tests a casestudy Unpublished PhD Thesis University of Melbourne Melbourne

mdashmdash 2001 An investigatory study of the equivalence of direct and semi-direct speaking skills Studies in Language Testing 13 CambridgeCambridge University PressUCLES

Ortega L 1999 Planning and focus on form in L2 oral performance Stud-ies in Second Language Acquisition 20 109ndash48

OrsquoSullivan B 2000 Towards a model of performance in oral languagetesting Unpublished PhD dissertation CALS University of Reading

Robinson P 1995 Task complexity and second language narrative dis-course Language Learning 45 99ndash140

Ross S and Berwick R 1992 The discourse of accommodation in oralpro ciency interviews Studies in Second Language Acquisition 14159ndash76

Saville N and Hargreaves P 1999 Assessing speaking in the revisedFCE ELT Journal 53 42ndash51

Schegloff E Jefferson G and Sachs H 1977 The preference for self-correction in the organisation of repair in conversation Language 53361ndash82

Schwartz J 1980 The negotiation for meaning repair in conversationsbetween second language learners of English In Larsen-Freeman Deditor Discourse analysis in second language research Rowley MANewbury House

Shohamy E 1983 The stability of oral language pro ciency assessment inthe oral interview testing procedure Language Learning 33 527ndash40

mdashmdash 1988 A proposed framework for testing the oral language of

Barry OrsquoSullivan Cyril J Weir and Nick Saville 51

second foreign language learners Studies in Second Language Acqui-sition 10 165ndash79

mdashmdash 1994 The validity of direct versus semi-direct oral tests LanguageTesting 11 99ndash123

Shohamy E Reves T and Bejarano Y 1986 Introducing a new compre-hensive test of oral pro ciency ELT Journal 40 212ndash20

Skehan P 1996 A framework for the implementation of task based instruc-tion Applied Linguistics 17 38ndash62

mdashmdash 1998 A cognitive approach to language learning Oxford OxfordUniversity Press

Stans eld CW and Kenyon DM 1992 Research on the comparabilityof the oral pro ciency interview and the simulated oral pro ciencyinterview System 20 347ndash64

Stenstrom A 1994 An introduction to spoken interaction London Long-man

Suhua H 1998 A communicative test of spoken English for the CET 6Unpublished PhD Thesis Shanghai Jiao Tong University Shanghai

Upshur JA and Turner C 1999 Systematic effects in the rating ofsecond-language speaking ability test method and learner discourseLanguage Testing 16 82ndash111

van Ek JA and Trim JLM editors 1984 Across the thresholdOxford Pergamon

van Lier L 1989 Reeling writhing drawling stretching and fainting incoils oral pro ciency interviews as conversation TESOL Quarterly23 489ndash508

Walker C 1990 Large-scale oral testing Applied Linguistics 11 200ndash19Weir CJ 1983 Identifying the language needs of overseas students in

tertiary education in the United Kingdom Unpublished PhD thesisUniversity of London

mdashmdash 1993 Understanding and developing language tests HemelHempstead Prentice Hall

Wigglesworth G 1997 An investigation of planning time and pro ciencylevel on oral test discourse Language Testing 14 85ndash106

Wigglesworth G and OrsquoLoughlin K 1993 An investigation into the com-parability of direct and semi-direct versions of an oral interaction testin English Melbourne Papers in Language Testing 2 56ndash67

Young R 1995 Conversational styles in language pro ciency interviewsLanguage Learning 45 3ndash42

Young R and Milanovic M 1992 Discourse variation in oral pro ciencyinterviews Studies in Second Language Acquisition 14 403ndash24

52 Validating speaking-test tasks

Appendix 1 Items included in initial draft checklists (with short gloss)

Informational functionsProviding personal information middot Give information on present circumstances

middot Give information on past experiencesmiddot Give information on future plans

Providing nonpersonal Give information which does not relate to the individualinformationElaborating Elaborate on an ideaExpressing opinions Express opinionsJustifying opinions Express reasons for assertions she has madeComparing Compare thingspeopleeventsComplaining Complain about somethingSpeculating Hypothesize or speculateAnalysing Separate out the parts of an issueMaking excuses Make excusesExplaining Explain anythingNarrating Describe a sequence of eventsParaphrasing Paraphrase somethingSummarizing Summarize what she had saidSuggesting Suggest a particular ideaExpressing preferences Express preferences

Interactional functionsChallenging Challenge assertions made by another speaker(Dis)agreeing Indicate (dis)agreement with what another speaker

says (apart from lsquoyeahrsquolsquonorsquo or simply nodding)JustifyingProviding support Offer justication or support for a comment made by

another speakerQualifying Modify arguments or commentsAsking for opinions Ask for opinionsPersuading Attempt to persuade another personAsking for information Ask for informationConversational repair Repair breakdowns in interactionNegotiating meaning middot Check understanding

middot Attempt to establish common ground or strategymiddot Respond to requests for claricationmiddot Ask for claricationmiddot Make correctionsmiddot Indicate purposemiddot Indicate understandinguncertainty

Managing interactionInitiating Start any interactionsChanging Take the opportunity to change the topicReciprocity Share the responsibility for developing the interactionDeciding Come to a decisionTerminating Decide when the discussion should stop

Barry OrsquoSullivan Cyril J Weir and Nick Saville 53A

pp

end

ix2

Ph

ase

1re

sult

s(s

um

mar

ized

)

Make e

xcuse

s

Term

inate

Convers

atio

nal r

ep

air

Su

mm

ari

ze

Com

pla

in

Pa

rap

hra

se

Pe

rsuade

Change

topic

Challe

ng

e

Qua

lify

Ask

fo

r in

fo

Sugg

est

Narr

ate

Reci

pro

cate

Analy

se

Ela

bora

te

Initi

ate

Pro

vide n

onpe

rsonal

info

rma

tion

Exp

lain

Justif

y op

inio

ns

Negotiate

meanin

g

Deci

de

(Dis

) ag

ree

Justif

yS

uppo

rtA

sk fo

r o

pin

ions

Exp

ress

pre

fere

nce

s

Specula

teC

om

pare

Pro

vide n

onpe

rsonal

info

rma

tion

Exp

ress

opin

ion

Pa

rtic

ipan

ts

54 Validating speaking-test tasks

Appendix 3 Operational checklist (used in Phase 3)

Informational functionsProviding personal information middot Give information on present circumstances

middot Give information on past experiencesmiddot Give information on future plans

Expressing opinions Express opinionsElaborating Elaborate on or modify an opinionJustifying opinions Express reasons for assertions she had madeComparing Compare thingspeopleeventsSpeculating SpeculateStaging Separate out or interpret the parts of an issueDescribing middot Describe a sequence of events

middot Describe a sceneSummarizing Summarize what she has saidSuggesting Suggest a particular ideaExpressing preferences Express preferences

Interactional functionsAgreeing Agree with an assertion made by another speaker

(apart from lsquoyeahrsquo or nonverbal)Disagreeing Disagree with what another speaker says (apart from

lsquonorsquo or nonverbal)Modifying Modify arguments or comments made by other speaker

or by the test-taker in response to another speakerAsking for opinions Ask for opinionsPersuading Attempt to persuade another personAsking for information Ask for informationConversational repair Repair breakdowns in interactionNegotiating meaning middot Check understanding

middot Indicate understanding of point made by partnermiddot Establish common groundpurpose or strategymiddot Ask for clari cation when an utterance is misheard or

misinterpretedmiddot Correct an utterance made by other speaker which is

perceived to be incorrect or inaccuratemiddot Respond to requests for clari cation

Managing interactionInitiating Start any interactionsChanging Take the opportunity to change the topicReciprocating Share the responsibility for developing the interactionDeciding Come to a decision

Barry OrsquoSullivan Cyril J Weir and Nick Saville 55

Appendix 4 Summary of Phase 2 observation

Tape 1 Tape 2

Task Task Task Task Task Task Task Task1 2 3 4 1 2 3 4

Informational functionsProviding personalinformation

Present 12 (G) 1 (L) 1 (L) 1 (L) 12 (G) 1 (L) 4 (L)Past 10 (G) 4 (S) 12 (G)Future 11 (G) 3 (L) 6 (S) 12 (G)

Expressing opinions 12 (G) 11 (G) 9 (G) 8 (G) 11 (G) 10 (G) 10 (G) 11 (G)Elaborating 9 (G) 11 (G) 9 (G) 7 (G) 3 (L) 9 (G) 7 (S) 12 (G)Justifying opinions 10 (G) 7 (G) 9 (G) 7 (G) 4 (L) 8 (S) 6 (S) 8 (S)Comparing 11 (G) 8 (G) 1 (L) 6 (S) 3 (L) 12 (G) 7 (S) 5 (S)Speculating 7 (S) 11 (G) 8 (G) 3 (L) 7 (S) 10 (G) 10 (G) 5 (S)Staging 6 (S) 1 (L) 3 (L) 6 (L)Describing

Sequence of events 1 (L) 1 (L) 3 (L) 1 (L) 4 (L)Scene 5 (S) 9 (G) 2 (S) 2 (S) 10 (G) 2 (S) 3 (S)

Summarizing 1 (L) 1 (L) 1 (L) 1 (L) 3 (L) 1 (L) 1 (L) 1 (L)Suggesting 1 (L) 2 (L) 1 (L) 3 (L) 2 (L)Expressing preferences 12 (G) 11 (G) 6 (S) 8 (G) 11 (G) 10 (G) 5 (S) 12 (G)

Interactional functionsAgreeing 6 (S) 9 (G) 2 (L) 10 (G) 4 (L)Disagreeing 9 (G) 4 (S) 2 (L) 6 (S)Modifying 1 (L) 5 (S) 4 (S) 7 (S) 1 (L)Asking for opinions 1 (L) 8 (G) 2 (L) 11 (G)Persuading 2 (L) 2 (L)Asking for information 2 (L) 1 (L) 5 (S)Conversational repair 5 (S) 4 (L) 1 (L)Negotiating meaning

Check meaning 2 (L) 4 (S) 4 (L)Understanding 5 (S) 3 (L) 3 (L)Common group 2 (L) 2 (L) 1 (L)Ask clari cation 2 (L) 1 (L) 2 (L)Correct utterance 3 (L) 1 (L)Respond to required 4 (S) 1 (L)clari cation

Managing interactionInitiating 8 (G) 1 (L) 10 (G) 5 (S)Changing 8 (G) 7 (S)Reciprocating 7 (G) 9 (G) 1 (L)Deciding 3 (L) 1 (L) 1 (L) 2 (L)

Notes The gures indicate the number of students that complete the task in each case L Littleagreement S Some agreement G Good aggreement For Tasks 3 and 4 in the rst tapeobserved the maximum was 9 for all others the maximum was 12 This is because 3 of the12 MA students did not complete the task for these last 2 tasks This was not a problem duringthe observation of the second tape so for all the maximum gures are 12

56 Validating speaking-test tasks

Appendix 5 Transcript results and observation checklist results

Informational functions Task 1 Task 2 Task 3 Task 4

Providing personal informationPresent T G L T LPast T GFuture T G

Expressing opinions T G T G T G T GElaborating L T G T S T GJustifying opinions L T S T S T SComparing L T G T S SSpeculating T S T G T G SStaging T L T SDescribing

Sequence of events T L L LScene T G L L

Summarizing T L L L LSuggesting L LExpressing preferences T G T G S T G

Interactional functionsAgreeing T G T LDisagreeing T SModifying T S T LAsking for opinions T GPersuading LAsking for information SConversational repair T S T L LNegotiating meaning

Check meaning LUnderstanding L LCommon ground L LAsk clari cation L T LCorrect utterance LRespond to required Lclari cation

Managing interactionInitiating T G T SChanging T SReciprocating T G LDeciding L L

Notes T indicates that this function has been identi ed as occurring in the transcript of theinteraction L S and G indicate the degree of agreement among the raters using the check-lists in real time (L Little agreement S Some agreement G Good agreement)

Page 11: Using observation checklists to validate speaking-test taskspart. The format of the Main Suite Speaking Tests (with the exception of the Level 1 KET test) is summarized in Table 1

Barry OrsquoSullivan Cyril J Weir and Nick Saville 43

were recommended for either omission or inclusion in other items bythe panel while the remaining 6 items were identi ed by them asbeing of value Although no examples of the latter had appeared inthe earlier data the panel agreed that they represented language func-tions that the UCLES Main Suite examinations were intended to elicitIt was also decided that each item in this latter group was in need offurther clari cation andor exempli cation Of the remaining 17items

middot two were changed the item lsquoanalysingrsquo was recoded as lsquostagingrsquoin order to clarify its intended meaning while it was decided toseparate the item lsquo(dis)agreeingrsquo into its two separate components

middot three were omitted it was argued that the item lsquoproviding non-personal informationrsquo referred to what was happening with theother items in the informational function category while the itemslsquoexplainingrsquo and lsquojustifyingsupportingrsquo were not functions usu-ally associated with the UCLES Main Suite tasks and no occur-rences of these had been noted

We would emphasize that as reported in Section IV above the initiallist was developed to cover the language functions that various spokenlanguage test tasks might elicit The development of the checklistsdescribed here re ects an attempt to customize the lists in line withthe intended functional outcomes of a speci c set of tests

We are of course aware that closed instruments of this type maybe open to the criticism that valuable information could be lost How-ever for reasons of practicality we felt it necessary to limit the listto what the examinations were intended to elicit rather than attemptto operationalize a full inventory Secondly any functions thatappeared in the data that were not covered by the reduced list wouldhave been noted There appeared to be no cases of this

The data from these two phases were combined to result in a work-ing version of the checklists (Appendix 3) which was then appliedto a pair of FCE Speaking Tests in Phase 3

Phase 3

In the third phase the revised checklists were given to a group of 15MA TEFL students who were asked to apply them to two FCE testsBoth of these tests involved a mixed-sex pair of learners one pair ofapproximately average ability and the other pair above averageBefore using the observation checklists (OCs) the students wereasked rst to attempt to predict which functions they might expect to nd To help in this pre-session task the students were given detailsof the FCE format and tasks

44 Validating speaking-test tasks

Unfortunately a small number of students did not manage to com-plete the observation task as they were somewhat overwhelmed withthe real-time application of the checklists As a result only 12 sets ofcompleted checklists were included in the nal analysis

Prior to the session the group was given an opportunity to have apractice run using a third FCE examination While this lsquotrainingrsquo per-iod coupled with the pre-session task was intended to provide thestudents with the background they needed to apply the checklists con-sistently there was a problem during the session itself This problemwas caused by the failure of a number of students to note the changefrom Task 3 to Task 4 in the rst test observed This was possiblycaused by a lack of awareness of the test itself and was not helpedby the seamless way in which the examiner on the video moved froma two-way discussion involving the test-takers to a three-way dis-cussion This meant that a full set of data exists only for the rst twotasks of this test As the problem was noticed in time the second testdid not cause these problems Unlike the earlier seminar on thisoccasion the participants were asked only to record each functionwhen it was rst observed This was done as it was felt that the earlierseminar showed that without extensive training it would be far toodif cult to apply the OCs fully in lsquorealrsquo time in order to generatecomprehensive frequency counts We are aware that a full tally wouldenable us to draw more precise conclusions about the relative fre-quency of occurrence of these functions and the degree of consensus(reliability) of observers

Against this we must emphasize that the checklists in their currentstage of development are designed to be used in real time Their usewas therefore restricted to determining the presence or absence of aparticular function Rater agreement in this case is limited to a some-what crude account of whether a function occurred or did not occurin a particular task performance We do not therefore have evidenceof whether the function observed was invariant across raters

The results from this session are included as Appendix 4 It canbe seen from this table that the participants again display mixed levelsof agreement ranging from a single perceived observation to totalagreement As with the earlier session it appears that there is rela-tively broad agreement on a range of functions but that others appearto be more dif cult to identify easily These dif culties appear to begreatest where the task involves a degree of interaction between thetest-takers

Phase 4In this phase a transcription was made of the second of the two inter-views used in Phase 3 since there was a full set of data available for

Barry OrsquoSullivan Cyril J Weir and Nick Saville 45

this interview The OCs were then lsquomappedrsquo on to this transcript inorder to give an overview from a different perspective of what func-tions were generated (it being felt that this map would result in anaccurate description of the test in terms of the items included in theOCs) This mapping was carried out by two researchers who initiallyworked independently of each other but discussed their nished workin order to arrive at a consensus

Finally the results of Phases 2 and 3 were compared (Appendix5) This clearly indicates that the checklists are now working wellThere are still some problems in items such as lsquostagingrsquo and lsquodescrib-ingrsquo and feedback from participants suggests that this may be due tomisunderstandings or misinterpretations of the gloss and examplesused In addition there are some similar dif culties with the initialthree items in the interactional functions checklist in which the great-est dif culties in applying the checklists appear to lie

VII Discussion and initial conclusions

The results of this study appear to substantiate our belief thatalthough still under development for use with the UCLES Main Suiteexaminations an operational version of these checklists is certainlyfeasible and has potentially wider application mutatis mutandis tothe content validation of other spoken language tests Further re ne-ment of the checklists is clearly required although the developmentalprocess adopted here appears to have borne positive results

1 Validities

We would not wish to claim that the checklists on their own offer asatisfactory demonstration of the construct validity of a spoken langu-age test for as Messick argues (1989 16) lsquothe varieties of evidencesupporting validity are not alternatives but rather supplements to oneanotherrsquo We recognize the necessity for a broad view of lsquothe eviden-tial basis for test interpretationrsquo (Messick 1989 20) Bachman (1990237) similarly concludes lsquoit is important to recognise that none ofthese [evidences of validity] by itself is suf cient to demonstrate thevalidity of a particular interpretation or use of test scoresrsquo (see alsoBachman 1990 243) Fulcher (1999 224) adds a further caveatagainst an overly narrow interpretation of content validity when hequotes Messick (1989 41)

the major problem is that so-called content validity is focused upon test formsrather than test scores upon instruments rather than measurements selectingcontent is an act of classi cation which is in itself a hypothesis that needs tobe con rmed empirically

46 Validating speaking-test tasks

Like these authors we regard as inadequate any conceptualization ofvalidity that does not involve the provision of evidence on a numberof levels but would argue strongly that without a clear idea of thematch between intended content and actual content any comprehen-sive investigation of the construct validity of a test is built on sandDe ning the construct is in our view underpinned by establishingthe nature of the actual performances elicited by test tasks ie thetrue content of tasks

2 Present and future applications of observational checklists

Versions of the checklists require a degree of training and practicesimilar to that given to raters if a reliable and consistent outcome isto be expected This requires that standardized training materials bedeveloped alongside the checklists In the case of these checkliststhis process has already begun with the initial versions piloted duringPhase 3 of the project

The checklists have great potential as an evaluative tool and canprovide comprehensive insight into various issues It is hoped thatamongst other issues the checklists will provide insights into the fol-lowing

middot the language functions that the different task-types (and differentsub-tasks within these) employed in the UCLES Main Suite Paper5 (Speaking) Tests typically elicit

middot the language that the pair-format elicits and how it differs in nat-ure and quality from that elicited by interlocutor-single candi-date testing

middot the extent to which there is functional variation across the topfour levels of the UCLES Main Suite Spoken Language Test

In addition to these issues the way in which the checklists can beapplied may allow for other important questions to be answered Forexample by allowing the evaluator multiple observations (stoppingand starting a recording of a test at will) it will be possible to estab-lish whether there are quanti able differences in the language func-tions generated by the different tasks ie the evaluators will havethe time they need to make frequency counts of the functions

While the results to date have focused on a posteriori validationprocedures these checklists are also relevant to task design By takinginto account the expected response of a task (and by describing thatresponse in terms of these functions ) it will be possible to explorepredicted and actual test task outcome It will also be a useful guidefor item writers in taking a priori decisions about content coverageThrough this approach it should be possible to predict more accurately

Barry OrsquoSullivan Cyril J Weir and Nick Saville 47

linguistic response (in terms of the elements of the checklists) andto apply this to the design of test tasks ndash and of course to evaluatethe success of the prediction later on In the longer term this willlead to a greater understanding of how tasks and task formats can bemanipulated to result in speci c language use We are not claimingthat it is possible to predict language use at a micro level(grammatical form or lexis) but that it is possible to predict infor-mational and interactional functions and features of interaction man-agement ndash a notion supported by Bygate (1999)

The checklists should also enable us to explore how systematicvariation in such areas as interviewer questioning behaviour (andinterlocutor frame adherence ) affects the language produced in thistype of test In the interview transcribed for this study for examplethe examiner directed his questions very deliberately (systematicallyaiming the questions at one participant and then the other) Thistended to sti e any spontaneity in the intended three-way discussion(Task 4) so occurrences of Interactional and Discourse ManagementFunctions did not materialize to the extent intended by the taskdesigners It is possible that a less deliberate (unscripted) questioningtechnique would lead to a less interviewer-oriented interaction patternand allow for the more genuine interactive communication envisagedin the task design

Perhaps the most valuable contribution that this type of validationprocedure offers is its potential to improve the quality of oral assess-ment in both low-stakes and high-stakes contexts By offering theinvestigator an instrument that can be used in real time the checklistsbroaden the scope of investigation from limited case study analysisof small numbers of test transcripts to large scale eld studies acrossa wide range of testing contexts

Acknowledgements

We would like to thank Don Porter and Rita Green for their earlyinput into the rst version of the checklist In addition help wasreceived from members of the ELT division in UCLES in particularfrom Angela ffrench Lynda Taylor and Christina Rimini from agroup of UCLES Senior Team Leaders and from MA TEFL studentsat the University of Reading Finally we would like to thank theeditors and anonymous reviewers of Language Testing for theirinsightful comments and helpful suggestions for its improvement Thefaults that remain are as ever ours

48 Validating speaking-test tasks

VIII References

Anastasi A 1988 Psychological testing 6th edition New York Macmil-lan

Bachman LF 1990 Fundamental considerations in language testingOxford Oxford University Press

Bachman LF and Palmer AS 1981 The construct validation of the FSIoral interview Language Learning 31 67ndash86

mdashmdash 1996 Language testing in practice Oxford Oxford University PressBallman TL 1991 The oral task of picture description similarities and

differences in native and nonnative speakers of Spanish In TeschnerRV editor Assessing foreign language pro ciency of undergrad-uates AAUSC Issues in Language Program Direction Boston Heinleand Heinle 221ndash31

Brown A 1998 Interviewer style and candidate performance in the IELSToral interview Paper presented at the Language Testing Research Col-loquium Monterey CA

Bygate M 1988 Speaking Oxford Oxford University Pressmdashmdash 1999 Quality of language and purpose of task patterns of learnersrsquo

language on two oral communication tasks Language TeachingResearch 3 185ndash214

Chalhoub-Deville M 1995a Deriving oral assessment scales across differ-ent tests and rater groups Language Testing 12 16ndash33

mdashmdash 1995b A contextualized approach to describing oral language pro- ciency Language Learning 45 251ndash81

Clark JLD 1979 Direct vs semi-direct tests of speaking ability In Bri-ere EJ and Hinofotis FB editors Concepts in language testingsome recent studies Washington DC TESOL

mdashmdash 1988 Validation of a tape-mediated ACTFLILR scale based test ofChinese speaking pro ciency Language Testing 5 187ndash205

Clark JLD and Hooshmand D 1992 lsquoScreen to Screenrsquo testing anexploratory study of oral pro ciency interviewing using video tele-conferencing System 20 293ndash304

Cronbach LJ 1971 Validity In Thorndike RL editor Educationalmeasurement 2nd edition Washington DC American Council on Edu-cation 443ndash597

mdashmdash 1990 Essentials of psychological testing 5th edition New YorkHarper amp Row

Davies A 1977 The construction of language tests In Allen JPB andDavies A editors Testing and experimental methods The EdinburghCourse in Applied Linguistics Volume 4 London Oxford UniversityPress 38ndash194

mdashmdash 1990 Principles of language testing Oxford BlackwellEllerton AW 1997 Considerations in the validation of semi-direct oral

testing Unpublished PhD thesis CALS University of Readingffrench A 1999 Language functions and UCLES speaking tests Seminar

in Athens Greece October 1999

Barry OrsquoSullivan Cyril J Weir and Nick Saville 49

Foster, P. and Skehan, P. 1996: The influence of planning and task type on second language performance. Studies in Second Language Acquisition 18, 299–323.
—— 1999: The influence of source of planning and focus of planning on task-based performance. Language Teaching Research 3, 215–47.
Fulcher, G. 1994: Some priority areas for oral language testing. Language Testing Update 15, 39–47.
—— 1996: Testing tasks: issues in task design and the group oral. Language Testing 13, 23–51.
—— 1999: Assessment in English for academic purposes: putting content validity in its place. Applied Linguistics 20, 221–36.
Hayashi, M. 1995: Conversational repair: a contrastive study of Japanese and English. MA Project Report, University of Canberra.
Henning, G. 1983: Oral proficiency testing: comparative validities of interview, imitation and completion methods. Language Learning 33, 315–32.
—— 1987: A guide to language testing. Cambridge, MA: Newbury House.
Kelly, R. 1978: On the construct validation of comprehension tests: an exercise in applied linguistics. Unpublished PhD thesis, University of Queensland.
Kenyon, D. 1995: An investigation of the validity of task demands on performance-based tests of oral proficiency. In Kunnan, A.J., editor, Validation in language assessment: selected papers from the 17th Language Testing Research Colloquium, Long Beach. Mahwah, NJ: Lawrence Erlbaum, 19–40.
Kormos, J. 1999: Simulating conversations in oral-proficiency assessment: a conversation analysis of role plays and non-scripted interviews in language exams. Language Testing 16, 163–88.
Lazaraton, A. 1992: The structural organisation of a language interview: a conversational analytic perspective. System 20, 373–86.
—— 1996: A qualitative approach to monitoring examiner conduct in the Cambridge assessment of spoken English (CASE). In Milanovic, M. and Saville, N., editors, Performance testing, cognition and assessment: selected papers from the 15th Language Testing Research Colloquium, Cambridge and Arnhem. Studies in Language Testing 3. Cambridge: University of Cambridge Local Examinations Syndicate, 18–33.
—— 2000: A qualitative approach to the validation of oral language tests. Studies in Language Testing, Volume 14. Cambridge: Cambridge University Press.
Lumley, T. and O'Sullivan, B. 2000: The effect of speaker and topic variables on task performance in a tape-mediated assessment of speaking. Paper presented at the 2nd Annual Asian Language Assessment Research Forum, The Hong Kong Polytechnic University.
Luoma, S. 1997: Comparability of a tape-mediated and a face-to-face test of speaking: a triangulation study. Unpublished Licentiate thesis, Centre for Applied Language Studies, Jyväskylä University, Finland.

McNamara, T. 1996: Measuring second language performance. London: Longman.
Mehnert, U. 1998: The effects of different lengths of time for planning on second language performance. Studies in Second Language Acquisition 20, 83–108.
Messick, S. 1975: The standard problem: meaning and values in measurement and evaluation. American Psychologist 30, 955–66.
—— 1989: Validity. In Linn, R.L., editor, Educational measurement. 3rd edition. New York: Macmillan.
Milanovic, M. and Saville, N. 1996: Introduction. Performance testing, cognition and assessment. Studies in Language Testing, Volume 3. Cambridge: University of Cambridge Local Examinations Syndicate, 1–17.
Moller, A.D. 1982: A study in the validation of proficiency tests of English as a Foreign Language. Unpublished PhD thesis, University of Edinburgh.
Norris, J., Brown, J.D., Hudson, T. and Yoshioka, J. 1998: Designing second language performance assessments. Technical Report 18. Honolulu, HI: University of Hawaii Press.
O'Loughlin, K. 1995: Lexical density in candidate output on direct and semi-direct versions of an oral proficiency test. Language Testing 12, 217–37.
—— 1997: The comparability of direct and semi-direct speaking tests: a case study. Unpublished PhD thesis, University of Melbourne, Melbourne.
—— 2001: An investigatory study of the equivalence of direct and semi-direct speaking skills. Studies in Language Testing 13. Cambridge: Cambridge University Press/UCLES.
Ortega, L. 1999: Planning and focus on form in L2 oral performance. Studies in Second Language Acquisition 20, 109–48.
O'Sullivan, B. 2000: Towards a model of performance in oral language testing. Unpublished PhD dissertation, CALS, University of Reading.
Robinson, P. 1995: Task complexity and second language narrative discourse. Language Learning 45, 99–140.
Ross, S. and Berwick, R. 1992: The discourse of accommodation in oral proficiency interviews. Studies in Second Language Acquisition 14, 159–76.
Saville, N. and Hargreaves, P. 1999: Assessing speaking in the revised FCE. ELT Journal 53, 42–51.
Schegloff, E., Jefferson, G. and Sachs, H. 1977: The preference for self-correction in the organisation of repair in conversation. Language 53, 361–82.
Schwartz, J. 1980: The negotiation for meaning: repair in conversations between second language learners of English. In Larsen-Freeman, D., editor, Discourse analysis in second language research. Rowley, MA: Newbury House.
Shohamy, E. 1983: The stability of oral language proficiency assessment in the oral interview testing procedure. Language Learning 33, 527–40.
—— 1988: A proposed framework for testing the oral language of second/foreign language learners. Studies in Second Language Acquisition 10, 165–79.
—— 1994: The validity of direct versus semi-direct oral tests. Language Testing 11, 99–123.
Shohamy, E., Reves, T. and Bejarano, Y. 1986: Introducing a new comprehensive test of oral proficiency. ELT Journal 40, 212–20.
Skehan, P. 1996: A framework for the implementation of task-based instruction. Applied Linguistics 17, 38–62.
—— 1998: A cognitive approach to language learning. Oxford: Oxford University Press.
Stansfield, C.W. and Kenyon, D.M. 1992: Research on the comparability of the oral proficiency interview and the simulated oral proficiency interview. System 20, 347–64.
Stenström, A. 1994: An introduction to spoken interaction. London: Longman.
Suhua, H. 1998: A communicative test of spoken English for the CET 6. Unpublished PhD thesis, Shanghai Jiao Tong University, Shanghai.
Upshur, J.A. and Turner, C. 1999: Systematic effects in the rating of second-language speaking ability: test method and learner discourse. Language Testing 16, 82–111.
van Ek, J.A. and Trim, J.L.M., editors, 1984: Across the threshold. Oxford: Pergamon.
van Lier, L. 1989: Reeling, writhing, drawling, stretching and fainting in coils: oral proficiency interviews as conversation. TESOL Quarterly 23, 489–508.
Walker, C. 1990: Large-scale oral testing. Applied Linguistics 11, 200–19.
Weir, C.J. 1983: Identifying the language needs of overseas students in tertiary education in the United Kingdom. Unpublished PhD thesis, University of London.
—— 1993: Understanding and developing language tests. Hemel Hempstead: Prentice Hall.
Wigglesworth, G. 1997: An investigation of planning time and proficiency level on oral test discourse. Language Testing 14, 85–106.
Wigglesworth, G. and O'Loughlin, K. 1993: An investigation into the comparability of direct and semi-direct versions of an oral interaction test in English. Melbourne Papers in Language Testing 2, 56–67.
Young, R. 1995: Conversational styles in language proficiency interviews. Language Learning 45, 3–42.
Young, R. and Milanovic, M. 1992: Discourse variation in oral proficiency interviews. Studies in Second Language Acquisition 14, 403–24.

Appendix 1 Items included in initial draft checklists (with short gloss)

Informational functions
Providing personal information:
· Give information on present circumstances
· Give information on past experiences
· Give information on future plans
Providing nonpersonal information: Give information which does not relate to the individual
Elaborating: Elaborate on an idea
Expressing opinions: Express opinions
Justifying opinions: Express reasons for assertions s/he has made
Comparing: Compare things/people/events
Complaining: Complain about something
Speculating: Hypothesize or speculate
Analysing: Separate out the parts of an issue
Making excuses: Make excuses
Explaining: Explain anything
Narrating: Describe a sequence of events
Paraphrasing: Paraphrase something
Summarizing: Summarize what s/he had said
Suggesting: Suggest a particular idea
Expressing preferences: Express preferences

Interactional functions
Challenging: Challenge assertions made by another speaker
(Dis)agreeing: Indicate (dis)agreement with what another speaker says (apart from 'yeah'/'no' or simply nodding)
Justifying/Providing support: Offer justification or support for a comment made by another speaker
Qualifying: Modify arguments or comments
Asking for opinions: Ask for opinions
Persuading: Attempt to persuade another person
Asking for information: Ask for information
Conversational repair: Repair breakdowns in interaction
Negotiating meaning:
· Check understanding
· Attempt to establish common ground or strategy
· Respond to requests for clarification
· Ask for clarification
· Make corrections
· Indicate purpose
· Indicate understanding/uncertainty

Managing interaction
Initiating: Start any interactions
Changing: Take the opportunity to change the topic
Reciprocity: Share the responsibility for developing the interaction
Deciding: Come to a decision
Terminating: Decide when the discussion should stop

Appendix 2 Phase 1 results (summarized)

[The chart printed here does not survive the text extraction. Its recoverable labels show the initial draft checklist functions (Make excuses, Terminate, Conversational repair, Summarize, Complain, Paraphrase, Persuade, Change topic, Challenge, Qualify, Ask for info, Suggest, Narrate, Reciprocate, Analyse, Elaborate, Initiate, Provide nonpersonal information, Explain, Justify opinions, Negotiate meaning, Decide, (Dis)agree, Justify/Support, Ask for opinions, Express preferences, Speculate, Compare, Express opinion) plotted against participants.]

Appendix 3 Operational checklist (used in Phase 3)

Informational functions
Providing personal information:
· Give information on present circumstances
· Give information on past experiences
· Give information on future plans
Expressing opinions: Express opinions
Elaborating: Elaborate on or modify an opinion
Justifying opinions: Express reasons for assertions s/he had made
Comparing: Compare things/people/events
Speculating: Speculate
Staging: Separate out or interpret the parts of an issue
Describing:
· Describe a sequence of events
· Describe a scene
Summarizing: Summarize what s/he has said
Suggesting: Suggest a particular idea
Expressing preferences: Express preferences

Interactional functions
Agreeing: Agree with an assertion made by another speaker (apart from 'yeah' or nonverbal)
Disagreeing: Disagree with what another speaker says (apart from 'no' or nonverbal)
Modifying: Modify arguments or comments made by other speaker or by the test-taker in response to another speaker
Asking for opinions: Ask for opinions
Persuading: Attempt to persuade another person
Asking for information: Ask for information
Conversational repair: Repair breakdowns in interaction
Negotiating meaning:
· Check understanding
· Indicate understanding of point made by partner
· Establish common ground/purpose or strategy
· Ask for clarification when an utterance is misheard or misinterpreted
· Correct an utterance made by other speaker which is perceived to be incorrect or inaccurate
· Respond to requests for clarification

Managing interaction
Initiating: Start any interactions
Changing: Take the opportunity to change the topic
Reciprocating: Share the responsibility for developing the interaction
Deciding: Come to a decision

Appendix 4 Summary of Phase 2 observation

Columns: Tape 1, Tasks 1–4; Tape 2, Tasks 1–4. Values appear in their printed order; cells left empty in the original table cannot be recovered from this extraction.

Informational functions
Providing personal information
· Present: 12 (G), 1 (L), 1 (L), 1 (L), 12 (G), 1 (L), 4 (L)
· Past: 10 (G), 4 (S), 12 (G)
· Future: 11 (G), 3 (L), 6 (S), 12 (G)
Expressing opinions: 12 (G), 11 (G), 9 (G), 8 (G), 11 (G), 10 (G), 10 (G), 11 (G)
Elaborating: 9 (G), 11 (G), 9 (G), 7 (G), 3 (L), 9 (G), 7 (S), 12 (G)
Justifying opinions: 10 (G), 7 (G), 9 (G), 7 (G), 4 (L), 8 (S), 6 (S), 8 (S)
Comparing: 11 (G), 8 (G), 1 (L), 6 (S), 3 (L), 12 (G), 7 (S), 5 (S)
Speculating: 7 (S), 11 (G), 8 (G), 3 (L), 7 (S), 10 (G), 10 (G), 5 (S)
Staging: 6 (S), 1 (L), 3 (L), 6 (L)
Describing
· Sequence of events: 1 (L), 1 (L), 3 (L), 1 (L), 4 (L)
· Scene: 5 (S), 9 (G), 2 (S), 2 (S), 10 (G), 2 (S), 3 (S)
Summarizing: 1 (L), 1 (L), 1 (L), 1 (L), 3 (L), 1 (L), 1 (L), 1 (L)
Suggesting: 1 (L), 2 (L), 1 (L), 3 (L), 2 (L)
Expressing preferences: 12 (G), 11 (G), 6 (S), 8 (G), 11 (G), 10 (G), 5 (S), 12 (G)

Interactional functions
Agreeing: 6 (S), 9 (G), 2 (L), 10 (G), 4 (L)
Disagreeing: 9 (G), 4 (S), 2 (L), 6 (S)
Modifying: 1 (L), 5 (S), 4 (S), 7 (S), 1 (L)
Asking for opinions: 1 (L), 8 (G), 2 (L), 11 (G)
Persuading: 2 (L), 2 (L)
Asking for information: 2 (L), 1 (L), 5 (S)
Conversational repair: 5 (S), 4 (L), 1 (L)
Negotiating meaning
· Check meaning: 2 (L), 4 (S), 4 (L)
· Understanding: 5 (S), 3 (L), 3 (L)
· Common ground: 2 (L), 2 (L), 1 (L)
· Ask clarification: 2 (L), 1 (L), 2 (L)
· Correct utterance: 3 (L), 1 (L)
· Respond to requests for clarification: 4 (S), 1 (L)

Managing interaction
Initiating: 8 (G), 1 (L), 10 (G), 5 (S)
Changing: 8 (G), 7 (S)
Reciprocating: 7 (G), 9 (G), 1 (L)
Deciding: 3 (L), 1 (L), 1 (L), 2 (L)

Notes: The figures indicate the number of students that completed the task in each case. L = Little agreement; S = Some agreement; G = Good agreement. For Tasks 3 and 4 in the first tape observed the maximum was 9; for all others the maximum was 12. This is because 3 of the 12 MA students did not complete the task for these last two tasks. This was not a problem during the observation of the second tape, so for all of its tasks the maximum figures are 12.

Appendix 5 Transcript results and observation checklist results

Columns: Task 1 to Task 4. Entries appear in their printed order; cells left empty in the original table cannot be recovered from this extraction.

Informational functions
Providing personal information
· Present: T G, L, T L
· Past: T G
· Future: T G
Expressing opinions: T G, T G, T G, T G
Elaborating: L, T G, T S, T G
Justifying opinions: L, T S, T S, T S
Comparing: L, T G, T S, S
Speculating: T S, T G, T G, S
Staging: T L, T S
Describing
· Sequence of events: T L, L, L
· Scene: T G, L, L
Summarizing: T L, L, L, L
Suggesting: L, L
Expressing preferences: T G, T G, S, T G

Interactional functions
Agreeing: T G, T L
Disagreeing: T S
Modifying: T S, T L
Asking for opinions: T G
Persuading: L
Asking for information: S
Conversational repair: T S, T L, L
Negotiating meaning
· Check meaning: L
· Understanding: L, L
· Common ground: L, L
· Ask clarification: L, T L
· Correct utterance: L
· Respond to requests for clarification: L

Managing interaction
Initiating: T G, T S
Changing: T S
Reciprocating: T G, L
Deciding: L, L

Notes: T indicates that this function has been identified as occurring in the transcript of the interaction. L, S and G indicate the degree of agreement among the raters using the checklists in real time (L = Little agreement; S = Some agreement; G = Good agreement).

44 Validating speaking-test tasks

Unfortunately a small number of students did not manage to com-plete the observation task as they were somewhat overwhelmed withthe real-time application of the checklists As a result only 12 sets ofcompleted checklists were included in the nal analysis

Prior to the session the group was given an opportunity to have apractice run using a third FCE examination While this lsquotrainingrsquo per-iod coupled with the pre-session task was intended to provide thestudents with the background they needed to apply the checklists con-sistently there was a problem during the session itself This problemwas caused by the failure of a number of students to note the changefrom Task 3 to Task 4 in the rst test observed This was possiblycaused by a lack of awareness of the test itself and was not helpedby the seamless way in which the examiner on the video moved froma two-way discussion involving the test-takers to a three-way dis-cussion This meant that a full set of data exists only for the rst twotasks of this test As the problem was noticed in time the second testdid not cause these problems Unlike the earlier seminar on thisoccasion the participants were asked only to record each functionwhen it was rst observed This was done as it was felt that the earlierseminar showed that without extensive training it would be far toodif cult to apply the OCs fully in lsquorealrsquo time in order to generatecomprehensive frequency counts We are aware that a full tally wouldenable us to draw more precise conclusions about the relative fre-quency of occurrence of these functions and the degree of consensus(reliability) of observers

Against this we must emphasize that the checklists in their currentstage of development are designed to be used in real time Their usewas therefore restricted to determining the presence or absence of aparticular function Rater agreement in this case is limited to a some-what crude account of whether a function occurred or did not occurin a particular task performance We do not therefore have evidenceof whether the function observed was invariant across raters

The results from this session are included as Appendix 4 It canbe seen from this table that the participants again display mixed levelsof agreement ranging from a single perceived observation to totalagreement As with the earlier session it appears that there is rela-tively broad agreement on a range of functions but that others appearto be more dif cult to identify easily These dif culties appear to begreatest where the task involves a degree of interaction between thetest-takers

Phase 4In this phase a transcription was made of the second of the two inter-views used in Phase 3 since there was a full set of data available for

Barry OrsquoSullivan Cyril J Weir and Nick Saville 45

this interview The OCs were then lsquomappedrsquo on to this transcript inorder to give an overview from a different perspective of what func-tions were generated (it being felt that this map would result in anaccurate description of the test in terms of the items included in theOCs) This mapping was carried out by two researchers who initiallyworked independently of each other but discussed their nished workin order to arrive at a consensus

Finally the results of Phases 2 and 3 were compared (Appendix5) This clearly indicates that the checklists are now working wellThere are still some problems in items such as lsquostagingrsquo and lsquodescrib-ingrsquo and feedback from participants suggests that this may be due tomisunderstandings or misinterpretations of the gloss and examplesused In addition there are some similar dif culties with the initialthree items in the interactional functions checklist in which the great-est dif culties in applying the checklists appear to lie

VII Discussion and initial conclusions

The results of this study appear to substantiate our belief thatalthough still under development for use with the UCLES Main Suiteexaminations an operational version of these checklists is certainlyfeasible and has potentially wider application mutatis mutandis tothe content validation of other spoken language tests Further re ne-ment of the checklists is clearly required although the developmentalprocess adopted here appears to have borne positive results

1 Validities

We would not wish to claim that the checklists on their own offer asatisfactory demonstration of the construct validity of a spoken langu-age test for as Messick argues (1989 16) lsquothe varieties of evidencesupporting validity are not alternatives but rather supplements to oneanotherrsquo We recognize the necessity for a broad view of lsquothe eviden-tial basis for test interpretationrsquo (Messick 1989 20) Bachman (1990237) similarly concludes lsquoit is important to recognise that none ofthese [evidences of validity] by itself is suf cient to demonstrate thevalidity of a particular interpretation or use of test scoresrsquo (see alsoBachman 1990 243) Fulcher (1999 224) adds a further caveatagainst an overly narrow interpretation of content validity when hequotes Messick (1989 41)

the major problem is that so-called content validity is focused upon test formsrather than test scores upon instruments rather than measurements selectingcontent is an act of classi cation which is in itself a hypothesis that needs tobe con rmed empirically

46 Validating speaking-test tasks

Like these authors we regard as inadequate any conceptualization ofvalidity that does not involve the provision of evidence on a numberof levels but would argue strongly that without a clear idea of thematch between intended content and actual content any comprehen-sive investigation of the construct validity of a test is built on sandDe ning the construct is in our view underpinned by establishingthe nature of the actual performances elicited by test tasks ie thetrue content of tasks

2 Present and future applications of observational checklists

Versions of the checklists require a degree of training and practicesimilar to that given to raters if a reliable and consistent outcome isto be expected This requires that standardized training materials bedeveloped alongside the checklists In the case of these checkliststhis process has already begun with the initial versions piloted duringPhase 3 of the project

The checklists have great potential as an evaluative tool and canprovide comprehensive insight into various issues It is hoped thatamongst other issues the checklists will provide insights into the fol-lowing

middot the language functions that the different task-types (and differentsub-tasks within these) employed in the UCLES Main Suite Paper5 (Speaking) Tests typically elicit

middot the language that the pair-format elicits and how it differs in nat-ure and quality from that elicited by interlocutor-single candi-date testing

middot the extent to which there is functional variation across the topfour levels of the UCLES Main Suite Spoken Language Test

In addition to these issues the way in which the checklists can beapplied may allow for other important questions to be answered Forexample by allowing the evaluator multiple observations (stoppingand starting a recording of a test at will) it will be possible to estab-lish whether there are quanti able differences in the language func-tions generated by the different tasks ie the evaluators will havethe time they need to make frequency counts of the functions

While the results to date have focused on a posteriori validationprocedures these checklists are also relevant to task design By takinginto account the expected response of a task (and by describing thatresponse in terms of these functions ) it will be possible to explorepredicted and actual test task outcome It will also be a useful guidefor item writers in taking a priori decisions about content coverageThrough this approach it should be possible to predict more accurately

Barry OrsquoSullivan Cyril J Weir and Nick Saville 47

linguistic response (in terms of the elements of the checklists) andto apply this to the design of test tasks ndash and of course to evaluatethe success of the prediction later on In the longer term this willlead to a greater understanding of how tasks and task formats can bemanipulated to result in speci c language use We are not claimingthat it is possible to predict language use at a micro level(grammatical form or lexis) but that it is possible to predict infor-mational and interactional functions and features of interaction man-agement ndash a notion supported by Bygate (1999)

The checklists should also enable us to explore how systematicvariation in such areas as interviewer questioning behaviour (andinterlocutor frame adherence ) affects the language produced in thistype of test In the interview transcribed for this study for examplethe examiner directed his questions very deliberately (systematicallyaiming the questions at one participant and then the other) Thistended to sti e any spontaneity in the intended three-way discussion(Task 4) so occurrences of Interactional and Discourse ManagementFunctions did not materialize to the extent intended by the taskdesigners It is possible that a less deliberate (unscripted) questioningtechnique would lead to a less interviewer-oriented interaction patternand allow for the more genuine interactive communication envisagedin the task design

Perhaps the most valuable contribution that this type of validationprocedure offers is its potential to improve the quality of oral assess-ment in both low-stakes and high-stakes contexts By offering theinvestigator an instrument that can be used in real time the checklistsbroaden the scope of investigation from limited case study analysisof small numbers of test transcripts to large scale eld studies acrossa wide range of testing contexts

Acknowledgements

We would like to thank Don Porter and Rita Green for their earlyinput into the rst version of the checklist In addition help wasreceived from members of the ELT division in UCLES in particularfrom Angela ffrench Lynda Taylor and Christina Rimini from agroup of UCLES Senior Team Leaders and from MA TEFL studentsat the University of Reading Finally we would like to thank theeditors and anonymous reviewers of Language Testing for theirinsightful comments and helpful suggestions for its improvement Thefaults that remain are as ever ours

48 Validating speaking-test tasks


Managing interactionInitiating Start any interactionsChanging Take the opportunity to change the topicReciprocity Share the responsibility for developing the interactionDeciding Come to a decisionTerminating Decide when the discussion should stop

Barry OrsquoSullivan Cyril J Weir and Nick Saville 53A

pp

end

ix2

Ph

ase

1re

sult

s(s

um

mar

ized

)

Make e

xcuse

s

Term

inate

Convers

atio

nal r

ep

air

Su

mm

ari

ze

Com

pla

in

Pa

rap

hra

se

Pe

rsuade

Change

topic

Challe

ng

e

Qua

lify

Ask

fo

r in

fo

Sugg

est

Narr

ate

Reci

pro

cate

Analy

se

Ela

bora

te

Initi

ate

Pro

vide n

onpe

rsonal

info

rma

tion

Exp

lain

Justif

y op

inio

ns

Negotiate

meanin

g

Deci

de

(Dis

) ag

ree

Justif

yS

uppo

rtA

sk fo

r o

pin

ions

Exp

ress

pre

fere

nce

s

Specula

teC

om

pare

Pro

vide n

onpe

rsonal

info

rma

tion

Exp

ress

opin

ion

Pa

rtic

ipan

ts

54 Validating speaking-test tasks

Appendix 3 Operational checklist (used in Phase 3)

Informational functionsProviding personal information middot Give information on present circumstances

middot Give information on past experiencesmiddot Give information on future plans

Expressing opinions Express opinionsElaborating Elaborate on or modify an opinionJustifying opinions Express reasons for assertions she had madeComparing Compare thingspeopleeventsSpeculating SpeculateStaging Separate out or interpret the parts of an issueDescribing middot Describe a sequence of events

middot Describe a sceneSummarizing Summarize what she has saidSuggesting Suggest a particular ideaExpressing preferences Express preferences

Interactional functionsAgreeing Agree with an assertion made by another speaker

(apart from lsquoyeahrsquo or nonverbal)Disagreeing Disagree with what another speaker says (apart from

lsquonorsquo or nonverbal)Modifying Modify arguments or comments made by other speaker

or by the test-taker in response to another speakerAsking for opinions Ask for opinionsPersuading Attempt to persuade another personAsking for information Ask for informationConversational repair Repair breakdowns in interactionNegotiating meaning middot Check understanding

middot Indicate understanding of point made by partnermiddot Establish common groundpurpose or strategymiddot Ask for clari cation when an utterance is misheard or

misinterpretedmiddot Correct an utterance made by other speaker which is

perceived to be incorrect or inaccuratemiddot Respond to requests for clari cation

Managing interactionInitiating Start any interactionsChanging Take the opportunity to change the topicReciprocating Share the responsibility for developing the interactionDeciding Come to a decision

Barry OrsquoSullivan Cyril J Weir and Nick Saville 55

Appendix 4 Summary of Phase 2 observation

Tape 1 Tape 2

Task Task Task Task Task Task Task Task1 2 3 4 1 2 3 4

Informational functionsProviding personalinformation

Present 12 (G) 1 (L) 1 (L) 1 (L) 12 (G) 1 (L) 4 (L)Past 10 (G) 4 (S) 12 (G)Future 11 (G) 3 (L) 6 (S) 12 (G)

Expressing opinions 12 (G) 11 (G) 9 (G) 8 (G) 11 (G) 10 (G) 10 (G) 11 (G)Elaborating 9 (G) 11 (G) 9 (G) 7 (G) 3 (L) 9 (G) 7 (S) 12 (G)Justifying opinions 10 (G) 7 (G) 9 (G) 7 (G) 4 (L) 8 (S) 6 (S) 8 (S)Comparing 11 (G) 8 (G) 1 (L) 6 (S) 3 (L) 12 (G) 7 (S) 5 (S)Speculating 7 (S) 11 (G) 8 (G) 3 (L) 7 (S) 10 (G) 10 (G) 5 (S)Staging 6 (S) 1 (L) 3 (L) 6 (L)Describing

Sequence of events 1 (L) 1 (L) 3 (L) 1 (L) 4 (L)Scene 5 (S) 9 (G) 2 (S) 2 (S) 10 (G) 2 (S) 3 (S)

Summarizing 1 (L) 1 (L) 1 (L) 1 (L) 3 (L) 1 (L) 1 (L) 1 (L)Suggesting 1 (L) 2 (L) 1 (L) 3 (L) 2 (L)Expressing preferences 12 (G) 11 (G) 6 (S) 8 (G) 11 (G) 10 (G) 5 (S) 12 (G)

Interactional functionsAgreeing 6 (S) 9 (G) 2 (L) 10 (G) 4 (L)Disagreeing 9 (G) 4 (S) 2 (L) 6 (S)Modifying 1 (L) 5 (S) 4 (S) 7 (S) 1 (L)Asking for opinions 1 (L) 8 (G) 2 (L) 11 (G)Persuading 2 (L) 2 (L)Asking for information 2 (L) 1 (L) 5 (S)Conversational repair 5 (S) 4 (L) 1 (L)Negotiating meaning

Check meaning 2 (L) 4 (S) 4 (L)Understanding 5 (S) 3 (L) 3 (L)Common group 2 (L) 2 (L) 1 (L)Ask clari cation 2 (L) 1 (L) 2 (L)Correct utterance 3 (L) 1 (L)Respond to required 4 (S) 1 (L)clari cation

Managing interactionInitiating 8 (G) 1 (L) 10 (G) 5 (S)Changing 8 (G) 7 (S)Reciprocating 7 (G) 9 (G) 1 (L)Deciding 3 (L) 1 (L) 1 (L) 2 (L)

Notes The gures indicate the number of students that complete the task in each case L Littleagreement S Some agreement G Good aggreement For Tasks 3 and 4 in the rst tapeobserved the maximum was 9 for all others the maximum was 12 This is because 3 of the12 MA students did not complete the task for these last 2 tasks This was not a problem duringthe observation of the second tape so for all the maximum gures are 12

56 Validating speaking-test tasks

Appendix 5 Transcript results and observation checklist results

Informational functions Task 1 Task 2 Task 3 Task 4

Providing personal informationPresent T G L T LPast T GFuture T G

Expressing opinions T G T G T G T GElaborating L T G T S T GJustifying opinions L T S T S T SComparing L T G T S SSpeculating T S T G T G SStaging T L T SDescribing

Sequence of events T L L LScene T G L L

Summarizing T L L L LSuggesting L LExpressing preferences T G T G S T G

Interactional functionsAgreeing T G T LDisagreeing T SModifying T S T LAsking for opinions T GPersuading LAsking for information SConversational repair T S T L LNegotiating meaning

Check meaning LUnderstanding L LCommon ground L LAsk clari cation L T LCorrect utterance LRespond to required Lclari cation

Managing interactionInitiating T G T SChanging T SReciprocating T G LDeciding L L

Notes T indicates that this function has been identi ed as occurring in the transcript of theinteraction L S and G indicate the degree of agreement among the raters using the check-lists in real time (L Little agreement S Some agreement G Good agreement)

Page 14: Using observation checklists to validate speaking-test taskspart. The format of the Main Suite Speaking Tests (with the exception of the Level 1 KET test) is summarized in Table 1

46 Validating speaking-test tasks

Like these authors, we regard as inadequate any conceptualization of validity that does not involve the provision of evidence on a number of levels, but would argue strongly that without a clear idea of the match between intended content and actual content, any comprehensive investigation of the construct validity of a test is built on sand. Defining the construct is, in our view, underpinned by establishing the nature of the actual performances elicited by test tasks, i.e. the true content of tasks.

2 Present and future applications of observational checklists

Versions of the checklists require a degree of training and practice similar to that given to raters if a reliable and consistent outcome is to be expected. This requires that standardized training materials be developed alongside the checklists. In the case of these checklists, this process has already begun, with the initial versions piloted during Phase 3 of the project.

The checklists have great potential as an evaluative tool and can provide comprehensive insight into various issues. It is hoped that, amongst other issues, the checklists will provide insights into the following:

· the language functions that the different task-types (and different sub-tasks within these) employed in the UCLES Main Suite Paper 5 (Speaking) Tests typically elicit;

· the language that the pair-format elicits, and how it differs in nature and quality from that elicited by interlocutor-single candidate testing;

· the extent to which there is functional variation across the top four levels of the UCLES Main Suite Spoken Language Test.

In addition to these issues, the way in which the checklists can be applied may allow for other important questions to be answered. For example, by allowing the evaluator multiple observations (stopping and starting a recording of a test at will), it will be possible to establish whether there are quantifiable differences in the language functions generated by the different tasks; i.e. the evaluators will have the time they need to make frequency counts of the functions.
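To make such frequency counts concrete, the following minimal sketch (in Python, not from the original study) tallies how many observers ticked each function on each task; the data layout and the sample function names are illustrative assumptions.

    from collections import Counter
    from typing import Dict, List

    # Each inner list is one observer's real-time checklist pass over a task:
    # the functions he or she ticked while watching the recording.
    observations: Dict[str, List[List[str]]] = {
        "Task 1": [
            ["Expressing opinions", "Comparing", "Elaborating"],  # observer 1
            ["Expressing opinions", "Comparing"],                 # observer 2
        ],
        "Task 4": [
            ["Expressing opinions", "Agreeing", "Deciding"],      # observer 1
            ["Expressing opinions", "Agreeing"],                  # observer 2
        ],
    }

    def function_frequencies(obs: Dict[str, List[List[str]]]) -> Dict[str, Counter]:
        """Count, per task, how many observers recorded each function."""
        return {task: Counter(f for checklist in passes for f in checklist)
                for task, passes in obs.items()}

    for task, counts in function_frequencies(observations).items():
        print(task, dict(counts))

Because each observation pass is kept separate, the same structure also supports the agreement comparisons reported in Appendices 4 and 5.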

While the results to date have focused on a posteriori validation procedures, these checklists are also relevant to task design. By taking into account the expected response of a task (and by describing that response in terms of these functions), it will be possible to explore predicted and actual test-task outcome. It will also be a useful guide for item writers in taking a priori decisions about content coverage. Through this approach it should be possible to predict more accurately the linguistic response (in terms of the elements of the checklists) and to apply this to the design of test tasks – and, of course, to evaluate the success of the prediction later on. In the longer term this will lead to a greater understanding of how tasks and task formats can be manipulated to result in specific language use. We are not claiming that it is possible to predict language use at a micro level (grammatical form or lexis), but that it is possible to predict informational and interactional functions and features of interaction management – a notion supported by Bygate (1999).
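The following sketch, under assumed data, illustrates this predicted-versus-actual comparison: an item writer's intended function coverage for a task is set against the functions the checklist observers actually recorded. Both function sets are hypothetical.

    # Intended (a priori) coverage specified at the task-design stage ...
    predicted = {"Expressing opinions", "Justifying opinions",
                 "Agreeing", "Disagreeing", "Deciding"}
    # ... and the functions recorded a posteriori with the checklist.
    observed = {"Expressing opinions", "Justifying opinions",
                "Agreeing", "Initiating"}

    print("elicited as planned:", sorted(predicted & observed))
    print("intended but not elicited:", sorted(predicted - observed))
    print("unplanned output:", sorted(observed - predicted))

The simple set operations are the point: the 'intended but not elicited' residue is exactly where a task has failed to deliver its designed content, and the 'unplanned output' set flags what the task elicits instead.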

The checklists should also enable us to explore how systematic variation in such areas as interviewer questioning behaviour (and interlocutor frame adherence) affects the language produced in this type of test. In the interview transcribed for this study, for example, the examiner directed his questions very deliberately (systematically aiming the questions at one participant and then the other). This tended to stifle any spontaneity in the intended three-way discussion (Task 4), so occurrences of Interactional and Discourse Management Functions did not materialize to the extent intended by the task designers. It is possible that a less deliberate (unscripted) questioning technique would lead to a less interviewer-oriented interaction pattern and allow for the more genuine interactive communication envisaged in the task design.

Perhaps the most valuable contribution that this type of validation procedure offers is its potential to improve the quality of oral assessment in both low-stakes and high-stakes contexts. By offering the investigator an instrument that can be used in real time, the checklists broaden the scope of investigation from limited case-study analysis of small numbers of test transcripts to large-scale field studies across a wide range of testing contexts.

Acknowledgements

We would like to thank Don Porter and Rita Green for their early input into the first version of the checklist. In addition, help was received from members of the ELT division in UCLES, in particular from Angela ffrench, Lynda Taylor and Christina Rimini, from a group of UCLES Senior Team Leaders, and from MA TEFL students at the University of Reading. Finally, we would like to thank the editors and anonymous reviewers of Language Testing for their insightful comments and helpful suggestions for its improvement. The faults that remain are, as ever, ours.


VIII References

Anastasi, A. 1988: Psychological testing. 6th edition. New York: Macmillan.
Bachman, L.F. 1990: Fundamental considerations in language testing. Oxford: Oxford University Press.
Bachman, L.F. and Palmer, A.S. 1981: The construct validation of the FSI oral interview. Language Learning 31, 67–86.
—— 1996: Language testing in practice. Oxford: Oxford University Press.
Ballman, T.L. 1991: The oral task of picture description: similarities and differences in native and nonnative speakers of Spanish. In Teschner, R.V., editor, Assessing foreign language proficiency of undergraduates. AAUSC Issues in Language Program Direction. Boston: Heinle and Heinle, 221–31.
Brown, A. 1998: Interviewer style and candidate performance in the IELTS oral interview. Paper presented at the Language Testing Research Colloquium, Monterey, CA.
Bygate, M. 1988: Speaking. Oxford: Oxford University Press.
—— 1999: Quality of language and purpose of task: patterns of learners' language on two oral communication tasks. Language Teaching Research 3, 185–214.
Chalhoub-Deville, M. 1995a: Deriving oral assessment scales across different tests and rater groups. Language Testing 12, 16–33.
—— 1995b: A contextualized approach to describing oral language proficiency. Language Learning 45, 251–81.
Clark, J.L.D. 1979: Direct vs. semi-direct tests of speaking ability. In Brière, E.J. and Hinofotis, F.B., editors, Concepts in language testing: some recent studies. Washington, DC: TESOL.
—— 1988: Validation of a tape-mediated ACTFL/ILR scale based test of Chinese speaking proficiency. Language Testing 5, 187–205.
Clark, J.L.D. and Hooshmand, D. 1992: 'Screen to Screen' testing: an exploratory study of oral proficiency interviewing using video teleconferencing. System 20, 293–304.
Cronbach, L.J. 1971: Validity. In Thorndike, R.L., editor, Educational measurement. 2nd edition. Washington, DC: American Council on Education, 443–597.
—— 1990: Essentials of psychological testing. 5th edition. New York: Harper & Row.
Davies, A. 1977: The construction of language tests. In Allen, J.P.B. and Davies, A., editors, Testing and experimental methods. The Edinburgh Course in Applied Linguistics Volume 4. London: Oxford University Press, 38–194.
—— 1990: Principles of language testing. Oxford: Blackwell.
Ellerton, A.W. 1997: Considerations in the validation of semi-direct oral testing. Unpublished PhD thesis, CALS, University of Reading.
ffrench, A. 1999: Language functions and UCLES speaking tests. Seminar in Athens, Greece, October 1999.
Foster, P. and Skehan, P. 1996: The influence of planning and task type on second language performance. Studies in Second Language Acquisition 18, 299–323.
—— 1999: The influence of source of planning and focus of planning on task-based performance. Language Teaching Research 3, 215–47.
Fulcher, G. 1994: Some priority areas for oral language testing. Language Testing Update 15, 39–47.
—— 1996: Testing tasks: issues in task design and the group oral. Language Testing 13, 23–51.
—— 1999: Assessment in English for academic purposes: putting content validity in its place. Applied Linguistics 20, 221–36.
Hayashi, M. 1995: Conversational repair: a contrastive study of Japanese and English. MA Project Report, University of Canberra.
Henning, G. 1983: Oral proficiency testing: comparative validities of interview, imitation and completion methods. Language Learning 33, 315–32.
—— 1987: A guide to language testing. Cambridge, MA: Newbury House.
Kelly, R. 1978: On the construct validation of comprehension tests: an exercise in applied linguistics. Unpublished PhD thesis, University of Queensland.
Kenyon, D. 1995: An investigation of the validity of task demands on performance-based tests of oral proficiency. In Kunnan, A.J., editor, Validation in language assessment: selected papers from the 17th Language Testing Research Colloquium, Long Beach. Mahwah, NJ: Lawrence Erlbaum, 19–40.
Kormos, J. 1999: Simulating conversations in oral-proficiency assessment: a conversation analysis of role plays and non-scripted interviews in language exams. Language Testing 16, 163–88.
Lazaraton, A. 1992: The structural organisation of a language interview: a conversational analytic perspective. System 20, 373–86.
—— 1996: A qualitative approach to monitoring examiner conduct in the Cambridge assessment of spoken English (CASE). In Milanovic, M. and Saville, N., editors, Performance testing, cognition and assessment: selected papers from the 15th Language Testing Research Colloquium, Cambridge and Arnhem. Studies in Language Testing 3. Cambridge: University of Cambridge Local Examinations Syndicate, 18–33.
—— 2000: A qualitative approach to the validation of oral language tests. Studies in Language Testing Volume 14. Cambridge: Cambridge University Press.
Lumley, T. and O'Sullivan, B. 2000: The effect of speaker and topic variables on task performance in a tape-mediated assessment of speaking. Paper presented at the 2nd Annual Asian Language Assessment Research Forum, The Hong Kong Polytechnic University.
Luoma, S. 1997: Comparability of a tape-mediated and a face-to-face test of speaking: a triangulation study. Unpublished Licentiate Thesis, Centre for Applied Language Studies, Jyväskylä University, Finland.
McNamara, T. 1996: Measuring second language performance. London: Longman.
Mehnert, U. 1998: The effects of different lengths of time for planning on second language performance. Studies in Second Language Acquisition 20, 83–108.
Messick, S. 1975: The standard problem: meaning and values in measurement and evaluation. American Psychologist 30, 955–66.
—— 1989: Validity. In Linn, R.L., editor, Educational measurement. 3rd edition. New York: Macmillan.
Milanovic, M. and Saville, N. 1996: Introduction. Performance testing, cognition and assessment. Studies in Language Testing Volume 3. Cambridge: University of Cambridge Local Examinations Syndicate, 1–17.
Moller, A.D. 1982: A study in the validation of proficiency tests of English as a Foreign Language. Unpublished PhD thesis, University of Edinburgh.
Norris, J., Brown, J.D., Hudson, T. and Yoshioka, J. 1998: Designing second language performance assessments. Technical Report 18. Honolulu, HI: University of Hawaii Press.
O'Loughlin, K. 1995: Lexical density in candidate output on direct and semi-direct versions of an oral proficiency test. Language Testing 12, 217–37.
—— 1997: The comparability of direct and semi-direct speaking tests: a case study. Unpublished PhD thesis, University of Melbourne, Melbourne.
—— 2001: An investigatory study of the equivalence of direct and semi-direct speaking skills. Studies in Language Testing 13. Cambridge: Cambridge University Press/UCLES.
Ortega, L. 1999: Planning and focus on form in L2 oral performance. Studies in Second Language Acquisition 20, 109–48.
O'Sullivan, B. 2000: Towards a model of performance in oral language testing. Unpublished PhD dissertation, CALS, University of Reading.
Robinson, P. 1995: Task complexity and second language narrative discourse. Language Learning 45, 99–140.
Ross, S. and Berwick, R. 1992: The discourse of accommodation in oral proficiency interviews. Studies in Second Language Acquisition 14, 159–76.
Saville, N. and Hargreaves, P. 1999: Assessing speaking in the revised FCE. ELT Journal 53, 42–51.
Schegloff, E., Jefferson, G. and Sachs, H. 1977: The preference for self-correction in the organisation of repair in conversation. Language 53, 361–82.
Schwartz, J. 1980: The negotiation for meaning: repair in conversations between second language learners of English. In Larsen-Freeman, D., editor, Discourse analysis in second language research. Rowley, MA: Newbury House.
Shohamy, E. 1983: The stability of oral language proficiency assessment in the oral interview testing procedure. Language Learning 33, 527–40.
—— 1988: A proposed framework for testing the oral language of second/foreign language learners. Studies in Second Language Acquisition 10, 165–79.
—— 1994: The validity of direct versus semi-direct oral tests. Language Testing 11, 99–123.
Shohamy, E., Reves, T. and Bejarano, Y. 1986: Introducing a new comprehensive test of oral proficiency. ELT Journal 40, 212–20.
Skehan, P. 1996: A framework for the implementation of task based instruction. Applied Linguistics 17, 38–62.
—— 1998: A cognitive approach to language learning. Oxford: Oxford University Press.
Stansfield, C.W. and Kenyon, D.M. 1992: Research on the comparability of the oral proficiency interview and the simulated oral proficiency interview. System 20, 347–64.
Stenström, A. 1994: An introduction to spoken interaction. London: Longman.
Suhua, H. 1998: A communicative test of spoken English for the CET 6. Unpublished PhD thesis, Shanghai Jiao Tong University, Shanghai.
Upshur, J.A. and Turner, C. 1999: Systematic effects in the rating of second-language speaking ability: test method and learner discourse. Language Testing 16, 82–111.
van Ek, J.A. and Trim, J.L.M., editors, 1984: Across the threshold. Oxford: Pergamon.
van Lier, L. 1989: Reeling, writhing, drawling, stretching and fainting in coils: oral proficiency interviews as conversation. TESOL Quarterly 23, 489–508.
Walker, C. 1990: Large-scale oral testing. Applied Linguistics 11, 200–19.
Weir, C.J. 1983: Identifying the language needs of overseas students in tertiary education in the United Kingdom. Unpublished PhD thesis, University of London.
—— 1993: Understanding and developing language tests. Hemel Hempstead: Prentice Hall.
Wigglesworth, G. 1997: An investigation of planning time and proficiency level on oral test discourse. Language Testing 14, 85–106.
Wigglesworth, G. and O'Loughlin, K. 1993: An investigation into the comparability of direct and semi-direct versions of an oral interaction test in English. Melbourne Papers in Language Testing 2, 56–67.
Young, R. 1995: Conversational styles in language proficiency interviews. Language Learning 45, 3–42.
Young, R. and Milanovic, M. 1992: Discourse variation in oral proficiency interviews. Studies in Second Language Acquisition 14, 403–24.

Appendix 1 Items included in initial draft checklists (with short gloss)

Informational functions
Providing personal information:
· Give information on present circumstances
· Give information on past experiences
· Give information on future plans
Providing nonpersonal information: Give information which does not relate to the individual
Elaborating: Elaborate on an idea
Expressing opinions: Express opinions
Justifying opinions: Express reasons for assertions s/he has made
Comparing: Compare things/people/events
Complaining: Complain about something
Speculating: Hypothesize or speculate
Analysing: Separate out the parts of an issue
Making excuses: Make excuses
Explaining: Explain anything
Narrating: Describe a sequence of events
Paraphrasing: Paraphrase something
Summarizing: Summarize what s/he had said
Suggesting: Suggest a particular idea
Expressing preferences: Express preferences

Interactional functions
Challenging: Challenge assertions made by another speaker
(Dis)agreeing: Indicate (dis)agreement with what another speaker says (apart from 'yeah'/'no' or simply nodding)
Justifying/Providing support: Offer justification or support for a comment made by another speaker
Qualifying: Modify arguments or comments
Asking for opinions: Ask for opinions
Persuading: Attempt to persuade another person
Asking for information: Ask for information
Conversational repair: Repair breakdowns in interaction
Negotiating meaning:
· Check understanding
· Attempt to establish common ground or strategy
· Respond to requests for clarification
· Ask for clarification
· Make corrections
· Indicate purpose
· Indicate understanding/uncertainty

Managing interaction
Initiating: Start any interactions
Changing: Take the opportunity to change the topic
Reciprocity: Share the responsibility for developing the interaction
Deciding: Come to a decision
Terminating: Decide when the discussion should stop

Appendix 2 Phase 1 results (summarized)

[Figure: summary chart of the Phase 1 results, plotting participants against the checklist functions observed (Make excuses, Terminate, Conversational repair, Summarize, Complain, Paraphrase, Persuade, Change topic, Challenge, Qualify, Ask for information, Suggest, Narrate, Reciprocate, Analyse, Elaborate, Initiate, Provide nonpersonal information, Explain, Justify opinions, Negotiate meaning, Decide, (Dis)agree, Justify/Support, Ask for opinions, Express preferences, Speculate, Compare, Express opinion). The chart itself is not recoverable from the source text.]

Appendix 3 Operational checklist (used in Phase 3)

Informational functions
Providing personal information:
· Give information on present circumstances
· Give information on past experiences
· Give information on future plans
Expressing opinions: Express opinions
Elaborating: Elaborate on or modify an opinion
Justifying opinions: Express reasons for assertions s/he had made
Comparing: Compare things/people/events
Speculating: Speculate
Staging: Separate out or interpret the parts of an issue
Describing:
· Describe a sequence of events
· Describe a scene
Summarizing: Summarize what s/he has said
Suggesting: Suggest a particular idea
Expressing preferences: Express preferences

Interactional functions
Agreeing: Agree with an assertion made by another speaker (apart from 'yeah' or nonverbal)
Disagreeing: Disagree with what another speaker says (apart from 'no' or nonverbal)
Modifying: Modify arguments or comments made by other speaker or by the test-taker in response to another speaker
Asking for opinions: Ask for opinions
Persuading: Attempt to persuade another person
Asking for information: Ask for information
Conversational repair: Repair breakdowns in interaction
Negotiating meaning:
· Check understanding
· Indicate understanding of point made by partner
· Establish common ground/purpose or strategy
· Ask for clarification when an utterance is misheard or misinterpreted
· Correct an utterance made by other speaker which is perceived to be incorrect or inaccurate
· Respond to requests for clarification

Managing interaction
Initiating: Start any interactions
Changing: Take the opportunity to change the topic
Reciprocating: Share the responsibility for developing the interaction
Deciding: Come to a decision

Appendix 4 Summary of Phase 2 observation

Values run in task order: Tape 1, Tasks 1–4, then Tape 2, Tasks 1–4; a semicolon separates the two tapes where all eight cells are filled. [Empty cells were not marked in the source text, so sparse rows are given in source order only.]

Informational functions
Providing personal information
· Present: 12 (G), 1 (L), 1 (L), 1 (L), 12 (G), 1 (L), 4 (L)
· Past: 10 (G), 4 (S), 12 (G)
· Future: 11 (G), 3 (L), 6 (S), 12 (G)
Expressing opinions: 12 (G), 11 (G), 9 (G), 8 (G); 11 (G), 10 (G), 10 (G), 11 (G)
Elaborating: 9 (G), 11 (G), 9 (G), 7 (G); 3 (L), 9 (G), 7 (S), 12 (G)
Justifying opinions: 10 (G), 7 (G), 9 (G), 7 (G); 4 (L), 8 (S), 6 (S), 8 (S)
Comparing: 11 (G), 8 (G), 1 (L), 6 (S); 3 (L), 12 (G), 7 (S), 5 (S)
Speculating: 7 (S), 11 (G), 8 (G), 3 (L); 7 (S), 10 (G), 10 (G), 5 (S)
Staging: 6 (S), 1 (L), 3 (L), 6 (L)
Describing
· Sequence of events: 1 (L), 1 (L), 3 (L), 1 (L), 4 (L)
· Scene: 5 (S), 9 (G), 2 (S), 2 (S), 10 (G), 2 (S), 3 (S)
Summarizing: 1 (L), 1 (L), 1 (L), 1 (L); 3 (L), 1 (L), 1 (L), 1 (L)
Suggesting: 1 (L), 2 (L), 1 (L), 3 (L), 2 (L)
Expressing preferences: 12 (G), 11 (G), 6 (S), 8 (G); 11 (G), 10 (G), 5 (S), 12 (G)

Interactional functions
Agreeing: 6 (S), 9 (G), 2 (L), 10 (G), 4 (L)
Disagreeing: 9 (G), 4 (S), 2 (L), 6 (S)
Modifying: 1 (L), 5 (S), 4 (S), 7 (S), 1 (L)
Asking for opinions: 1 (L), 8 (G), 2 (L), 11 (G)
Persuading: 2 (L), 2 (L)
Asking for information: 2 (L), 1 (L), 5 (S)
Conversational repair: 5 (S), 4 (L), 1 (L)
Negotiating meaning
· Check meaning: 2 (L), 4 (S), 4 (L)
· Understanding: 5 (S), 3 (L), 3 (L)
· Common ground: 2 (L), 2 (L), 1 (L)
· Ask clarification: 2 (L), 1 (L), 2 (L)
· Correct utterance: 3 (L), 1 (L)
· Respond to requests for clarification: 4 (S), 1 (L)

Managing interaction
Initiating: 8 (G), 1 (L), 10 (G), 5 (S)
Changing: 8 (G), 7 (S)
Reciprocating: 7 (G), 9 (G), 1 (L)
Deciding: 3 (L), 1 (L), 1 (L), 2 (L)

Notes: The figures indicate the number of students that completed the task in each case. L = little agreement; S = some agreement; G = good agreement. For Tasks 3 and 4 in the first tape observed, the maximum was 9; for all others, the maximum was 12. This is because 3 of the 12 MA students did not complete the task for these last two tasks. This was not a problem during the observation of the second tape, so there all the maximum figures are 12.
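The bands appear to track the share of observers recording a function rather than the raw count (a count of 7 is reported as G against a maximum of 9 but as S against 12). The sketch below encodes that reading with assumed cut-offs of 0.75 and 0.4; the article does not state the thresholds actually used, so both are illustrative assumptions.

    def agreement_band(n_recorded: int, n_max: int,
                       good: float = 0.75, some: float = 0.4) -> str:
        """Map the proportion of observers who recorded a function to a band.

        The 0.75 and 0.4 cut-offs are assumed for illustration only.
        """
        share = n_recorded / n_max
        if share >= good:
            return "G"  # good agreement
        if share >= some:
            return "S"  # some agreement
        return "L"      # little agreement

    # 7 of 9 observers falls in 'G'; the same count out of 12 falls in 'S'.
    print(agreement_band(7, 9), agreement_band(7, 12))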


Appendix 5 Transcript results and observation checklist results

Cells run in task order (Task 1 | Task 2 | Task 3 | Task 4) where all four are filled; sparse rows are given in source order, as empty cells were not marked in the source text.

Informational functions
Providing personal information
· Present: T G, L, T L
· Past: T G
· Future: T G
Expressing opinions: T G | T G | T G | T G
Elaborating: L | T G | T S | T G
Justifying opinions: L | T S | T S | T S
Comparing: L | T G | T S | S
Speculating: T S | T G | T G | S
Staging: T L, T S
Describing
· Sequence of events: T L, L, L
· Scene: T G, L, L
Summarizing: T L | L | L | L
Suggesting: L, L
Expressing preferences: T G | T G | S | T G

Interactional functions
Agreeing: T G, T L
Disagreeing: T S
Modifying: T S, T L
Asking for opinions: T G
Persuading: L
Asking for information: S
Conversational repair: T S, T L, L
Negotiating meaning
· Check meaning: L
· Understanding: L, L
· Common ground: L, L
· Ask clarification: L, T L
· Correct utterance: L
· Respond to requests for clarification: L

Managing interaction
Initiating: T G, T S
Changing: T S
Reciprocating: T G, L
Deciding: L, L

Notes: T indicates that this function has been identified as occurring in the transcript of the interaction. L, S and G indicate the degree of agreement among the raters using the checklists in real time (L = little agreement; S = some agreement; G = good agreement).
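Finally, a small sketch of how Appendix 5-style rows might be screened for divergence between the transcript analysis and the real-time checklist; the three sample rows are illustrative, not the study's full data set.

    # Each row: (function, found in transcript?, real-time agreement band).
    rows = [
        ("Expressing opinions", True, "G"),
        ("Summarizing", True, "L"),
        ("Asking for information", False, "S"),
    ]

    for function, in_transcript, band in rows:
        if in_transcript and band == "L":
            print(f"{function}: in transcript, but little real-time agreement")
        elif not in_transcript and band in ("S", "G"):
            print(f"{function}: observed in real time, absent from transcript")

Rows flagged by either condition are the ones worth revisiting, since they mark where the two validation methods disagree about the true content of the task.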

Page 15: Using observation checklists to validate speaking-test taskspart. The format of the Main Suite Speaking Tests (with the exception of the Level 1 KET test) is summarized in Table 1

Barry OrsquoSullivan Cyril J Weir and Nick Saville 47

linguistic response (in terms of the elements of the checklists) andto apply this to the design of test tasks ndash and of course to evaluatethe success of the prediction later on In the longer term this willlead to a greater understanding of how tasks and task formats can bemanipulated to result in speci c language use We are not claimingthat it is possible to predict language use at a micro level(grammatical form or lexis) but that it is possible to predict infor-mational and interactional functions and features of interaction man-agement ndash a notion supported by Bygate (1999)

The checklists should also enable us to explore how systematicvariation in such areas as interviewer questioning behaviour (andinterlocutor frame adherence ) affects the language produced in thistype of test In the interview transcribed for this study for examplethe examiner directed his questions very deliberately (systematicallyaiming the questions at one participant and then the other) Thistended to sti e any spontaneity in the intended three-way discussion(Task 4) so occurrences of Interactional and Discourse ManagementFunctions did not materialize to the extent intended by the taskdesigners It is possible that a less deliberate (unscripted) questioningtechnique would lead to a less interviewer-oriented interaction patternand allow for the more genuine interactive communication envisagedin the task design

Perhaps the most valuable contribution that this type of validationprocedure offers is its potential to improve the quality of oral assess-ment in both low-stakes and high-stakes contexts By offering theinvestigator an instrument that can be used in real time the checklistsbroaden the scope of investigation from limited case study analysisof small numbers of test transcripts to large scale eld studies acrossa wide range of testing contexts

Acknowledgements

We would like to thank Don Porter and Rita Green for their earlyinput into the rst version of the checklist In addition help wasreceived from members of the ELT division in UCLES in particularfrom Angela ffrench Lynda Taylor and Christina Rimini from agroup of UCLES Senior Team Leaders and from MA TEFL studentsat the University of Reading Finally we would like to thank theeditors and anonymous reviewers of Language Testing for theirinsightful comments and helpful suggestions for its improvement Thefaults that remain are as ever ours

48 Validating speaking-test tasks

VIII References

Anastasi A 1988 Psychological testing 6th edition New York Macmil-lan

Bachman LF 1990 Fundamental considerations in language testingOxford Oxford University Press

Bachman LF and Palmer AS 1981 The construct validation of the FSIoral interview Language Learning 31 67ndash86

mdashmdash 1996 Language testing in practice Oxford Oxford University PressBallman TL 1991 The oral task of picture description similarities and

differences in native and nonnative speakers of Spanish In TeschnerRV editor Assessing foreign language pro ciency of undergrad-uates AAUSC Issues in Language Program Direction Boston Heinleand Heinle 221ndash31

Brown A 1998 Interviewer style and candidate performance in the IELSToral interview Paper presented at the Language Testing Research Col-loquium Monterey CA

Bygate M 1988 Speaking Oxford Oxford University Pressmdashmdash 1999 Quality of language and purpose of task patterns of learnersrsquo

language on two oral communication tasks Language TeachingResearch 3 185ndash214

Chalhoub-Deville M 1995a Deriving oral assessment scales across differ-ent tests and rater groups Language Testing 12 16ndash33

mdashmdash 1995b A contextualized approach to describing oral language pro- ciency Language Learning 45 251ndash81

Clark JLD 1979 Direct vs semi-direct tests of speaking ability In Bri-ere EJ and Hinofotis FB editors Concepts in language testingsome recent studies Washington DC TESOL

mdashmdash 1988 Validation of a tape-mediated ACTFLILR scale based test ofChinese speaking pro ciency Language Testing 5 187ndash205

Clark JLD and Hooshmand D 1992 lsquoScreen to Screenrsquo testing anexploratory study of oral pro ciency interviewing using video tele-conferencing System 20 293ndash304

Cronbach LJ 1971 Validity In Thorndike RL editor Educationalmeasurement 2nd edition Washington DC American Council on Edu-cation 443ndash597

mdashmdash 1990 Essentials of psychological testing 5th edition New YorkHarper amp Row

Davies A 1977 The construction of language tests In Allen JPB andDavies A editors Testing and experimental methods The EdinburghCourse in Applied Linguistics Volume 4 London Oxford UniversityPress 38ndash194

mdashmdash 1990 Principles of language testing Oxford BlackwellEllerton AW 1997 Considerations in the validation of semi-direct oral

testing Unpublished PhD thesis CALS University of Readingffrench A 1999 Language functions and UCLES speaking tests Seminar

in Athens Greece October 1999

Barry OrsquoSullivan Cyril J Weir and Nick Saville 49

Foster P and Skehan P 1996 The in uence of planning and task typeon second language performance Studies in Second Language Acqui-sition 18 299ndash323

mdashmdash 1999 The in uence of source of planning and focus of planning ontask-based performance Language Teaching Research 3 215ndash47

Fulcher G 1994 Some priority areas for oral language testing LanguageTesting Update 15 39ndash47

mdashmdash 1996 Testing tasks issues in task design and the group oral LanguageTesting 13 23ndash51

mdashmdash 1999 Assessment in English for academic purposes putting contentvalidity in its place Applied Linguistics 20 221ndash36

Hayashi M 1995 Conversational repair a contrastive study of Japaneseand English MA Project Report University of Canberra

Henning G 1983 Oral pro ciency testing comparative validities of inter-view imitation and completion methods Language Learning 33315ndash32

mdashmdash 1987 A guide to language testing Cambridge MA Newbury HouseKelly R 1978 On the construct validation of comprehension tests an exer-

cise in applied linguistics Unpublished PhD thesis University ofQueensland

Kenyon D 1995 An investigation of the validity of task demands onperformance-based tests of oral pro ciency In Kunnan AJ editorValidation in language assessment selected papers from the 17th Lan-guage Testing Research Colloquium Long Beach Mahwah NJ Lawr-ence Erlbaum 19ndash40

Kormos J 1999 Simulating conversations in oral-pro ciency assessmenta conversation analysis of role plays and non-scripted interviews inlanguage exams Language Testing 16 163ndash88

Lazaraton A 1992 The structural organisation of a language interview aconversational analytic perspective System 20 373ndash86

mdashmdash1996 A qualitative approach to monitoring examiner conduct in theCambridge assessment of spoken English (CASE) In Milanovic Mand Saville N editors Performance testing cognition andassessment selected papers from the 15th Language Testing ResearchColloquium Cambridge and Arnhem Studies in Language Testing 3Cambridge University of Cambridge Local Examinations Syndicate18ndash33

mdashmdash 2000 A qualitative approach to the validation of oral language testsStudies in Language Testing Volume 14 Cambridge Cambridge Uni-versity Press

Lumley T and OrsquoSullivan B 2000 The effect of speaker and topic vari-ables on task performance in a tape-mediated assessment of speakingPaper presented at the 2nd Annual Asian Language AssessmentResearch Forum The Hong Kong Polytechnic University

Luoma S 1997 Comparability of a tape-mediated and a face-to-face testof speaking a triangulation study Unpublished Licentiate ThesisCentre for Applied Language Studies Jyvaskyla University Finland

50 Validating speaking-test tasks

McNamara T 1996 Measuring second language performance LondonLongman

Mehnert U 1998 The effects of different lengths of time for planning onsecond language performance Studies in Second Language Acquisition20 83ndash108

Messick S 1975 The standard problem meaning and values in measure-ment and evaluation American Psychologist 30 955ndash66

mdashmdash 1989 Validity In Linn RL editor Educational measurement 3rdedition New York Macmillan

Milanovic M and Saville N 1996 Introduction Performance testing cog-nition and assessment Studies in Language Testing Volume 3 Cam-bridge University of Cambridge Local Examinations Syndicate 1ndash17

Moller A D 1982 A study in the validation of pro ciency tests of Englishas a Foreign Language Unpublished PhD thesis University of Edin-burgh

Norris J Brown J D Hudson T and Yoshioka J 1998 Designingsecond language performance assessments Technical Report 18Honolulu HI University of Hawaii Press

OrsquoLoughlin K 1995 Lexical density in candidate output on direct andsemi-direct versions of an oral pro ciency test Language Testing 12217ndash37

mdashmdash 1997 The comparability of direct and semi-direct speaking tests a casestudy Unpublished PhD Thesis University of Melbourne Melbourne

mdashmdash 2001 An investigatory study of the equivalence of direct and semi-direct speaking skills Studies in Language Testing 13 CambridgeCambridge University PressUCLES

Ortega L 1999 Planning and focus on form in L2 oral performance Stud-ies in Second Language Acquisition 20 109ndash48

OrsquoSullivan B 2000 Towards a model of performance in oral languagetesting Unpublished PhD dissertation CALS University of Reading

Robinson P 1995 Task complexity and second language narrative dis-course Language Learning 45 99ndash140

Ross S and Berwick R 1992 The discourse of accommodation in oralpro ciency interviews Studies in Second Language Acquisition 14159ndash76

Saville N and Hargreaves P 1999 Assessing speaking in the revisedFCE ELT Journal 53 42ndash51

Schegloff E Jefferson G and Sachs H 1977 The preference for self-correction in the organisation of repair in conversation Language 53361ndash82

Schwartz J 1980 The negotiation for meaning repair in conversationsbetween second language learners of English In Larsen-Freeman Deditor Discourse analysis in second language research Rowley MANewbury House

Shohamy E 1983 The stability of oral language pro ciency assessment inthe oral interview testing procedure Language Learning 33 527ndash40

mdashmdash 1988 A proposed framework for testing the oral language of

Barry OrsquoSullivan Cyril J Weir and Nick Saville 51

second foreign language learners Studies in Second Language Acqui-sition 10 165ndash79

mdashmdash 1994 The validity of direct versus semi-direct oral tests LanguageTesting 11 99ndash123

Shohamy E Reves T and Bejarano Y 1986 Introducing a new compre-hensive test of oral pro ciency ELT Journal 40 212ndash20

Skehan P 1996 A framework for the implementation of task based instruc-tion Applied Linguistics 17 38ndash62

mdashmdash 1998 A cognitive approach to language learning Oxford OxfordUniversity Press

Stans eld CW and Kenyon DM 1992 Research on the comparabilityof the oral pro ciency interview and the simulated oral pro ciencyinterview System 20 347ndash64

Stenstrom A 1994 An introduction to spoken interaction London Long-man

Suhua H 1998 A communicative test of spoken English for the CET 6Unpublished PhD Thesis Shanghai Jiao Tong University Shanghai

Upshur JA and Turner C 1999 Systematic effects in the rating ofsecond-language speaking ability test method and learner discourseLanguage Testing 16 82ndash111

van Ek JA and Trim JLM editors 1984 Across the thresholdOxford Pergamon

van Lier L 1989 Reeling writhing drawling stretching and fainting incoils oral pro ciency interviews as conversation TESOL Quarterly23 489ndash508

Walker C 1990 Large-scale oral testing Applied Linguistics 11 200ndash19Weir CJ 1983 Identifying the language needs of overseas students in

tertiary education in the United Kingdom Unpublished PhD thesisUniversity of London

mdashmdash 1993 Understanding and developing language tests HemelHempstead Prentice Hall

Wigglesworth G 1997 An investigation of planning time and pro ciencylevel on oral test discourse Language Testing 14 85ndash106

Wigglesworth G and OrsquoLoughlin K 1993 An investigation into the com-parability of direct and semi-direct versions of an oral interaction testin English Melbourne Papers in Language Testing 2 56ndash67

Young R 1995 Conversational styles in language pro ciency interviewsLanguage Learning 45 3ndash42

Young R and Milanovic M 1992 Discourse variation in oral pro ciencyinterviews Studies in Second Language Acquisition 14 403ndash24

52 Validating speaking-test tasks

Appendix 1 Items included in initial draft checklists (with short gloss)

Informational functionsProviding personal information middot Give information on present circumstances

middot Give information on past experiencesmiddot Give information on future plans

Providing nonpersonal Give information which does not relate to the individualinformationElaborating Elaborate on an ideaExpressing opinions Express opinionsJustifying opinions Express reasons for assertions she has madeComparing Compare thingspeopleeventsComplaining Complain about somethingSpeculating Hypothesize or speculateAnalysing Separate out the parts of an issueMaking excuses Make excusesExplaining Explain anythingNarrating Describe a sequence of eventsParaphrasing Paraphrase somethingSummarizing Summarize what she had saidSuggesting Suggest a particular ideaExpressing preferences Express preferences

Interactional functionsChallenging Challenge assertions made by another speaker(Dis)agreeing Indicate (dis)agreement with what another speaker

says (apart from lsquoyeahrsquolsquonorsquo or simply nodding)JustifyingProviding support Offer justication or support for a comment made by

another speakerQualifying Modify arguments or commentsAsking for opinions Ask for opinionsPersuading Attempt to persuade another personAsking for information Ask for informationConversational repair Repair breakdowns in interactionNegotiating meaning middot Check understanding

middot Attempt to establish common ground or strategymiddot Respond to requests for claricationmiddot Ask for claricationmiddot Make correctionsmiddot Indicate purposemiddot Indicate understandinguncertainty

Managing interactionInitiating Start any interactionsChanging Take the opportunity to change the topicReciprocity Share the responsibility for developing the interactionDeciding Come to a decisionTerminating Decide when the discussion should stop

Barry OrsquoSullivan Cyril J Weir and Nick Saville 53A

pp

end

ix2

Ph

ase

1re

sult

s(s

um

mar

ized

)

Make e

xcuse

s

Term

inate

Convers

atio

nal r

ep

air

Su

mm

ari

ze

Com

pla

in

Pa

rap

hra

se

Pe

rsuade

Change

topic

Challe

ng

e

Qua

lify

Ask

fo

r in

fo

Sugg

est

Narr

ate

Reci

pro

cate

Analy

se

Ela

bora

te

Initi

ate

Pro

vide n

onpe

rsonal

info

rma

tion

Exp

lain

Justif

y op

inio

ns

Negotiate

meanin

g

Deci

de

(Dis

) ag

ree

Justif

yS

uppo

rtA

sk fo

r o

pin

ions

Exp

ress

pre

fere

nce

s

Specula

teC

om

pare

Pro

vide n

onpe

rsonal

info

rma

tion

Exp

ress

opin

ion

Pa

rtic

ipan

ts

54 Validating speaking-test tasks

Appendix 3 Operational checklist (used in Phase 3)

Informational functionsProviding personal information middot Give information on present circumstances

middot Give information on past experiencesmiddot Give information on future plans

Expressing opinions Express opinionsElaborating Elaborate on or modify an opinionJustifying opinions Express reasons for assertions she had madeComparing Compare thingspeopleeventsSpeculating SpeculateStaging Separate out or interpret the parts of an issueDescribing middot Describe a sequence of events

middot Describe a sceneSummarizing Summarize what she has saidSuggesting Suggest a particular ideaExpressing preferences Express preferences

Interactional functionsAgreeing Agree with an assertion made by another speaker

(apart from lsquoyeahrsquo or nonverbal)Disagreeing Disagree with what another speaker says (apart from

lsquonorsquo or nonverbal)Modifying Modify arguments or comments made by other speaker

or by the test-taker in response to another speakerAsking for opinions Ask for opinionsPersuading Attempt to persuade another personAsking for information Ask for informationConversational repair Repair breakdowns in interactionNegotiating meaning middot Check understanding

middot Indicate understanding of point made by partnermiddot Establish common groundpurpose or strategymiddot Ask for clari cation when an utterance is misheard or

misinterpretedmiddot Correct an utterance made by other speaker which is

perceived to be incorrect or inaccuratemiddot Respond to requests for clari cation

Managing interactionInitiating Start any interactionsChanging Take the opportunity to change the topicReciprocating Share the responsibility for developing the interactionDeciding Come to a decision

Barry OrsquoSullivan Cyril J Weir and Nick Saville 55

Appendix 4 Summary of Phase 2 observation

Tape 1 Tape 2

Task Task Task Task Task Task Task Task1 2 3 4 1 2 3 4

Informational functionsProviding personalinformation

Present 12 (G) 1 (L) 1 (L) 1 (L) 12 (G) 1 (L) 4 (L)Past 10 (G) 4 (S) 12 (G)Future 11 (G) 3 (L) 6 (S) 12 (G)

Expressing opinions 12 (G) 11 (G) 9 (G) 8 (G) 11 (G) 10 (G) 10 (G) 11 (G)Elaborating 9 (G) 11 (G) 9 (G) 7 (G) 3 (L) 9 (G) 7 (S) 12 (G)Justifying opinions 10 (G) 7 (G) 9 (G) 7 (G) 4 (L) 8 (S) 6 (S) 8 (S)Comparing 11 (G) 8 (G) 1 (L) 6 (S) 3 (L) 12 (G) 7 (S) 5 (S)Speculating 7 (S) 11 (G) 8 (G) 3 (L) 7 (S) 10 (G) 10 (G) 5 (S)Staging 6 (S) 1 (L) 3 (L) 6 (L)Describing

Sequence of events 1 (L) 1 (L) 3 (L) 1 (L) 4 (L)Scene 5 (S) 9 (G) 2 (S) 2 (S) 10 (G) 2 (S) 3 (S)

Summarizing 1 (L) 1 (L) 1 (L) 1 (L) 3 (L) 1 (L) 1 (L) 1 (L)Suggesting 1 (L) 2 (L) 1 (L) 3 (L) 2 (L)Expressing preferences 12 (G) 11 (G) 6 (S) 8 (G) 11 (G) 10 (G) 5 (S) 12 (G)

Interactional functionsAgreeing 6 (S) 9 (G) 2 (L) 10 (G) 4 (L)Disagreeing 9 (G) 4 (S) 2 (L) 6 (S)Modifying 1 (L) 5 (S) 4 (S) 7 (S) 1 (L)Asking for opinions 1 (L) 8 (G) 2 (L) 11 (G)Persuading 2 (L) 2 (L)Asking for information 2 (L) 1 (L) 5 (S)Conversational repair 5 (S) 4 (L) 1 (L)Negotiating meaning

Check meaning 2 (L) 4 (S) 4 (L)Understanding 5 (S) 3 (L) 3 (L)Common group 2 (L) 2 (L) 1 (L)Ask clari cation 2 (L) 1 (L) 2 (L)Correct utterance 3 (L) 1 (L)Respond to required 4 (S) 1 (L)clari cation

Managing interactionInitiating 8 (G) 1 (L) 10 (G) 5 (S)Changing 8 (G) 7 (S)Reciprocating 7 (G) 9 (G) 1 (L)Deciding 3 (L) 1 (L) 1 (L) 2 (L)

Notes The gures indicate the number of students that complete the task in each case L Littleagreement S Some agreement G Good aggreement For Tasks 3 and 4 in the rst tapeobserved the maximum was 9 for all others the maximum was 12 This is because 3 of the12 MA students did not complete the task for these last 2 tasks This was not a problem duringthe observation of the second tape so for all the maximum gures are 12

56 Validating speaking-test tasks

Appendix 5 Transcript results and observation checklist results

Informational functions Task 1 Task 2 Task 3 Task 4

Providing personal informationPresent T G L T LPast T GFuture T G

Expressing opinions T G T G T G T GElaborating L T G T S T GJustifying opinions L T S T S T SComparing L T G T S SSpeculating T S T G T G SStaging T L T SDescribing

Sequence of events T L L LScene T G L L

Summarizing T L L L LSuggesting L LExpressing preferences T G T G S T G

Interactional functionsAgreeing T G T LDisagreeing T SModifying T S T LAsking for opinions T GPersuading LAsking for information SConversational repair T S T L LNegotiating meaning

Check meaning LUnderstanding L LCommon ground L LAsk clari cation L T LCorrect utterance LRespond to required Lclari cation

Managing interactionInitiating T G T SChanging T SReciprocating T G LDeciding L L

Notes T indicates that this function has been identi ed as occurring in the transcript of theinteraction L S and G indicate the degree of agreement among the raters using the check-lists in real time (L Little agreement S Some agreement G Good agreement)

Page 16: Using observation checklists to validate speaking-test taskspart. The format of the Main Suite Speaking Tests (with the exception of the Level 1 KET test) is summarized in Table 1

48 Validating speaking-test tasks

VIII References

Anastasi A 1988 Psychological testing 6th edition New York Macmil-lan

Bachman LF 1990 Fundamental considerations in language testingOxford Oxford University Press

Bachman LF and Palmer AS 1981 The construct validation of the FSIoral interview Language Learning 31 67ndash86

mdashmdash 1996 Language testing in practice Oxford Oxford University PressBallman TL 1991 The oral task of picture description similarities and

differences in native and nonnative speakers of Spanish In TeschnerRV editor Assessing foreign language pro ciency of undergrad-uates AAUSC Issues in Language Program Direction Boston Heinleand Heinle 221ndash31

Brown A 1998 Interviewer style and candidate performance in the IELSToral interview Paper presented at the Language Testing Research Col-loquium Monterey CA

Bygate M 1988 Speaking Oxford Oxford University Pressmdashmdash 1999 Quality of language and purpose of task patterns of learnersrsquo

language on two oral communication tasks Language TeachingResearch 3 185ndash214

Chalhoub-Deville M 1995a Deriving oral assessment scales across differ-ent tests and rater groups Language Testing 12 16ndash33

mdashmdash 1995b A contextualized approach to describing oral language pro- ciency Language Learning 45 251ndash81

Clark JLD 1979 Direct vs semi-direct tests of speaking ability In Bri-ere EJ and Hinofotis FB editors Concepts in language testingsome recent studies Washington DC TESOL

mdashmdash 1988 Validation of a tape-mediated ACTFLILR scale based test ofChinese speaking pro ciency Language Testing 5 187ndash205

Clark JLD and Hooshmand D 1992 lsquoScreen to Screenrsquo testing anexploratory study of oral pro ciency interviewing using video tele-conferencing System 20 293ndash304

Cronbach LJ 1971 Validity In Thorndike RL editor Educationalmeasurement 2nd edition Washington DC American Council on Edu-cation 443ndash597

mdashmdash 1990 Essentials of psychological testing 5th edition New YorkHarper amp Row

Davies A 1977 The construction of language tests In Allen JPB andDavies A editors Testing and experimental methods The EdinburghCourse in Applied Linguistics Volume 4 London Oxford UniversityPress 38ndash194

mdashmdash 1990 Principles of language testing Oxford BlackwellEllerton AW 1997 Considerations in the validation of semi-direct oral

testing Unpublished PhD thesis CALS University of Readingffrench A 1999 Language functions and UCLES speaking tests Seminar

in Athens Greece October 1999

Barry OrsquoSullivan Cyril J Weir and Nick Saville 49

Foster, P. and Skehan, P. 1996: The influence of planning and task type on second language performance. Studies in Second Language Acquisition 18, 299–323.
—— 1999: The influence of source of planning and focus of planning on task-based performance. Language Teaching Research 3, 215–47.
Fulcher, G. 1994: Some priority areas for oral language testing. Language Testing Update 15, 39–47.
—— 1996: Testing tasks: issues in task design and the group oral. Language Testing 13, 23–51.
—— 1999: Assessment in English for academic purposes: putting content validity in its place. Applied Linguistics 20, 221–36.
Hayashi, M. 1995: Conversational repair: a contrastive study of Japanese and English. MA Project Report, University of Canberra.
Henning, G. 1983: Oral proficiency testing: comparative validities of interview, imitation and completion methods. Language Learning 33, 315–32.
—— 1987: A guide to language testing. Cambridge, MA: Newbury House.
Kelly, R. 1978: On the construct validation of comprehension tests: an exercise in applied linguistics. Unpublished PhD thesis, University of Queensland.
Kenyon, D. 1995: An investigation of the validity of task demands on performance-based tests of oral proficiency. In Kunnan, A.J., editor, Validation in language assessment: selected papers from the 17th Language Testing Research Colloquium, Long Beach. Mahwah, NJ: Lawrence Erlbaum, 19–40.
Kormos, J. 1999: Simulating conversations in oral-proficiency assessment: a conversation analysis of role plays and non-scripted interviews in language exams. Language Testing 16, 163–88.
Lazaraton, A. 1992: The structural organisation of a language interview: a conversational analytic perspective. System 20, 373–86.
—— 1996: A qualitative approach to monitoring examiner conduct in the Cambridge assessment of spoken English (CASE). In Milanovic, M. and Saville, N., editors, Performance testing, cognition and assessment: selected papers from the 15th Language Testing Research Colloquium, Cambridge and Arnhem. Studies in Language Testing 3. Cambridge: University of Cambridge Local Examinations Syndicate, 18–33.
—— 2000: A qualitative approach to the validation of oral language tests. Studies in Language Testing Volume 14. Cambridge: Cambridge University Press.
Lumley, T. and O'Sullivan, B. 2000: The effect of speaker and topic variables on task performance in a tape-mediated assessment of speaking. Paper presented at the 2nd Annual Asian Language Assessment Research Forum, The Hong Kong Polytechnic University.
Luoma, S. 1997: Comparability of a tape-mediated and a face-to-face test of speaking: a triangulation study. Unpublished Licentiate Thesis, Centre for Applied Language Studies, Jyväskylä University, Finland.


McNamara, T. 1996: Measuring second language performance. London: Longman.
Mehnert, U. 1998: The effects of different lengths of time for planning on second language performance. Studies in Second Language Acquisition 20, 83–108.
Messick, S. 1975: The standard problem: meaning and values in measurement and evaluation. American Psychologist 30, 955–66.
—— 1989: Validity. In Linn, R.L., editor, Educational measurement. 3rd edition. New York: Macmillan.
Milanovic, M. and Saville, N. 1996: Introduction. Performance testing, cognition and assessment. Studies in Language Testing Volume 3. Cambridge: University of Cambridge Local Examinations Syndicate, 1–17.
Moller, A.D. 1982: A study in the validation of proficiency tests of English as a Foreign Language. Unpublished PhD thesis, University of Edinburgh.
Norris, J., Brown, J.D., Hudson, T. and Yoshioka, J. 1998: Designing second language performance assessments. Technical Report 18. Honolulu, HI: University of Hawaii Press.
O'Loughlin, K. 1995: Lexical density in candidate output on direct and semi-direct versions of an oral proficiency test. Language Testing 12, 217–37.
—— 1997: The comparability of direct and semi-direct speaking tests: a case study. Unpublished PhD thesis, University of Melbourne, Melbourne.
—— 2001: An investigatory study of the equivalence of direct and semi-direct speaking skills. Studies in Language Testing 13. Cambridge: Cambridge University Press/UCLES.
Ortega, L. 1999: Planning and focus on form in L2 oral performance. Studies in Second Language Acquisition 20, 109–48.
O'Sullivan, B. 2000: Towards a model of performance in oral language testing. Unpublished PhD dissertation, CALS, University of Reading.
Robinson, P. 1995: Task complexity and second language narrative discourse. Language Learning 45, 99–140.
Ross, S. and Berwick, R. 1992: The discourse of accommodation in oral proficiency interviews. Studies in Second Language Acquisition 14, 159–76.
Saville, N. and Hargreaves, P. 1999: Assessing speaking in the revised FCE. ELT Journal 53, 42–51.
Schegloff, E., Jefferson, G. and Sacks, H. 1977: The preference for self-correction in the organisation of repair in conversation. Language 53, 361–82.
Schwartz, J. 1980: The negotiation for meaning: repair in conversations between second language learners of English. In Larsen-Freeman, D., editor, Discourse analysis in second language research. Rowley, MA: Newbury House.
Shohamy, E. 1983: The stability of oral language proficiency assessment in the oral interview testing procedure. Language Learning 33, 527–40.
—— 1988: A proposed framework for testing the oral language of second/foreign language learners. Studies in Second Language Acquisition 10, 165–79.
—— 1994: The validity of direct versus semi-direct oral tests. Language Testing 11, 99–123.

Shohamy, E., Reves, T. and Bejarano, Y. 1986: Introducing a new comprehensive test of oral proficiency. ELT Journal 40, 212–20.
Skehan, P. 1996: A framework for the implementation of task-based instruction. Applied Linguistics 17, 38–62.
—— 1998: A cognitive approach to language learning. Oxford: Oxford University Press.
Stansfield, C.W. and Kenyon, D.M. 1992: Research on the comparability of the oral proficiency interview and the simulated oral proficiency interview. System 20, 347–64.
Stenström, A. 1994: An introduction to spoken interaction. London: Longman.
Suhua, H. 1998: A communicative test of spoken English for the CET 6. Unpublished PhD thesis, Shanghai Jiao Tong University, Shanghai.
Upshur, J.A. and Turner, C. 1999: Systematic effects in the rating of second-language speaking ability: test method and learner discourse. Language Testing 16, 82–111.
van Ek, J.A. and Trim, J.L.M., editors, 1984: Across the threshold. Oxford: Pergamon.
van Lier, L. 1989: Reeling, writhing, drawling, stretching and fainting in coils: oral proficiency interviews as conversation. TESOL Quarterly 23, 489–508.
Walker, C. 1990: Large-scale oral testing. Applied Linguistics 11, 200–19.
Weir, C.J. 1983: Identifying the language needs of overseas students in tertiary education in the United Kingdom. Unpublished PhD thesis, University of London.
—— 1993: Understanding and developing language tests. Hemel Hempstead: Prentice Hall.
Wigglesworth, G. 1997: An investigation of planning time and proficiency level on oral test discourse. Language Testing 14, 85–106.
Wigglesworth, G. and O'Loughlin, K. 1993: An investigation into the comparability of direct and semi-direct versions of an oral interaction test in English. Melbourne Papers in Language Testing 2, 56–67.
Young, R. 1995: Conversational styles in language proficiency interviews. Language Learning 45, 3–42.
Young, R. and Milanovic, M. 1992: Discourse variation in oral proficiency interviews. Studies in Second Language Acquisition 14, 403–24.


Appendix 1 Items included in initial draft checklists (with short gloss)

Informational functions
Providing personal information: · Give information on present circumstances; · Give information on past experiences; · Give information on future plans
Providing nonpersonal information: Give information which does not relate to the individual
Elaborating: Elaborate on an idea
Expressing opinions: Express opinions
Justifying opinions: Express reasons for assertions s/he has made
Comparing: Compare things/people/events
Complaining: Complain about something
Speculating: Hypothesize or speculate
Analysing: Separate out the parts of an issue
Making excuses: Make excuses
Explaining: Explain anything
Narrating: Describe a sequence of events
Paraphrasing: Paraphrase something
Summarizing: Summarize what s/he had said
Suggesting: Suggest a particular idea
Expressing preferences: Express preferences

Interactional functions
Challenging: Challenge assertions made by another speaker
(Dis)agreeing: Indicate (dis)agreement with what another speaker says (apart from 'yeah'/'no' or simply nodding)
Justifying/Providing support: Offer justification or support for a comment made by another speaker
Qualifying: Modify arguments or comments
Asking for opinions: Ask for opinions
Persuading: Attempt to persuade another person
Asking for information: Ask for information
Conversational repair: Repair breakdowns in interaction
Negotiating meaning: · Check understanding; · Attempt to establish common ground or strategy; · Respond to requests for clarification; · Ask for clarification; · Make corrections; · Indicate purpose; · Indicate understanding/uncertainty

Managing interaction
Initiating: Start any interactions
Changing: Take the opportunity to change the topic
Reciprocity: Share the responsibility for developing the interaction
Deciding: Come to a decision
Terminating: Decide when the discussion should stop

Appendix 2 Phase 1 results (summarized)

[Figure: a chart whose graphics could not be recovered from the extracted text. It summarizes the Phase 1 results by function category; the category axis reads: make excuses; terminate; conversational repair; summarize; complain; paraphrase; persuade; change topic; challenge; qualify; ask for information; suggest; narrate; reciprocate; analyse; elaborate; initiate; provide nonpersonal information; explain; justify opinions; negotiate meaning; decide; (dis)agree; justify/support; ask for opinions; express preferences; speculate; compare; express opinion. The other axis is labelled 'Participants'.]


Appendix 3 Operational checklist (used in Phase 3)

Informational functions
Providing personal information: · Give information on present circumstances; · Give information on past experiences; · Give information on future plans
Expressing opinions: Express opinions
Elaborating: Elaborate on or modify an opinion
Justifying opinions: Express reasons for assertions s/he had made
Comparing: Compare things/people/events
Speculating: Speculate
Staging: Separate out or interpret the parts of an issue
Describing: · Describe a sequence of events; · Describe a scene
Summarizing: Summarize what s/he has said
Suggesting: Suggest a particular idea
Expressing preferences: Express preferences

Interactional functions
Agreeing: Agree with an assertion made by another speaker (apart from 'yeah' or nonverbal)
Disagreeing: Disagree with what another speaker says (apart from 'no' or nonverbal)
Modifying: Modify arguments or comments made by other speaker, or by the test-taker in response to another speaker
Asking for opinions: Ask for opinions
Persuading: Attempt to persuade another person
Asking for information: Ask for information
Conversational repair: Repair breakdowns in interaction
Negotiating meaning: · Check understanding; · Indicate understanding of point made by partner; · Establish common ground/purpose or strategy; · Ask for clarification when an utterance is misheard or misinterpreted; · Correct an utterance made by other speaker which is perceived to be incorrect or inaccurate; · Respond to requests for clarification

Managing interaction
Initiating: Start any interactions
Changing: Take the opportunity to change the topic
Reciprocating: Share the responsibility for developing the interaction
Deciding: Come to a decision
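For readers who want to work with the operational checklist computationally, the following is a minimal sketch in Python; all identifiers and the sample data are illustrative assumptions, not part of the study. It encodes the Phase 3 categories as a plain data structure and tallies how many observers ticked each function on a task, the kind of per-task count reported in Appendix 4.

from collections import Counter

# Phase 3 operational checklist: category -> observable functions
# (labels follow Appendix 3; the encoding itself is hypothetical)
CHECKLIST = {
    "informational": [
        "providing_personal_information", "expressing_opinions",
        "elaborating", "justifying_opinions", "comparing", "speculating",
        "staging", "describing", "summarizing", "suggesting",
        "expressing_preferences",
    ],
    "interactional": [
        "agreeing", "disagreeing", "modifying", "asking_for_opinions",
        "persuading", "asking_for_information", "conversational_repair",
        "negotiating_meaning",
    ],
    "managing_interaction": [
        "initiating", "changing", "reciprocating", "deciding",
    ],
}

def tally(observations):
    """observations: one set of ticked function names per observer
    watching the same task. Returns how many observers ticked each
    function (cf. the figures in Appendix 4)."""
    counts = Counter()
    for ticks in observations:
        counts.update(ticks)
    return counts

# Example: three observers watching one task
obs = [
    {"expressing_opinions", "elaborating", "agreeing"},
    {"expressing_opinions", "comparing"},
    {"expressing_opinions", "elaborating"},
]
print(tally(obs))  # expressing_opinions: 3, elaborating: 2, ...

Keeping the instrument (CHECKLIST) separate from the tallies mirrors the procedure the appendices imply: each observer completes a checklist individually in real time, and agreement is established only afterwards by aggregating the ticks.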


Appendix 4 Summary of Phase 2 observation

Columns: Tape 1, Tasks 1–4; then Tape 2, Tasks 1–4

Informational functions
Providing personal information
  Present: 12 (G) 1 (L) 1 (L) 1 (L) 12 (G) 1 (L) 4 (L)
  Past: 10 (G) 4 (S) 12 (G)
  Future: 11 (G) 3 (L) 6 (S) 12 (G)
Expressing opinions: 12 (G) 11 (G) 9 (G) 8 (G) 11 (G) 10 (G) 10 (G) 11 (G)
Elaborating: 9 (G) 11 (G) 9 (G) 7 (G) 3 (L) 9 (G) 7 (S) 12 (G)
Justifying opinions: 10 (G) 7 (G) 9 (G) 7 (G) 4 (L) 8 (S) 6 (S) 8 (S)
Comparing: 11 (G) 8 (G) 1 (L) 6 (S) 3 (L) 12 (G) 7 (S) 5 (S)
Speculating: 7 (S) 11 (G) 8 (G) 3 (L) 7 (S) 10 (G) 10 (G) 5 (S)
Staging: 6 (S) 1 (L) 3 (L) 6 (L)
Describing
  Sequence of events: 1 (L) 1 (L) 3 (L) 1 (L) 4 (L)
  Scene: 5 (S) 9 (G) 2 (S) 2 (S) 10 (G) 2 (S) 3 (S)
Summarizing: 1 (L) 1 (L) 1 (L) 1 (L) 3 (L) 1 (L) 1 (L) 1 (L)
Suggesting: 1 (L) 2 (L) 1 (L) 3 (L) 2 (L)
Expressing preferences: 12 (G) 11 (G) 6 (S) 8 (G) 11 (G) 10 (G) 5 (S) 12 (G)

Interactional functions
Agreeing: 6 (S) 9 (G) 2 (L) 10 (G) 4 (L)
Disagreeing: 9 (G) 4 (S) 2 (L) 6 (S)
Modifying: 1 (L) 5 (S) 4 (S) 7 (S) 1 (L)
Asking for opinions: 1 (L) 8 (G) 2 (L) 11 (G)
Persuading: 2 (L) 2 (L)
Asking for information: 2 (L) 1 (L) 5 (S)
Conversational repair: 5 (S) 4 (L) 1 (L)
Negotiating meaning
  Check meaning: 2 (L) 4 (S) 4 (L)
  Understanding: 5 (S) 3 (L) 3 (L)
  Common ground: 2 (L) 2 (L) 1 (L)
  Ask clarification: 2 (L) 1 (L) 2 (L)
  Correct utterance: 3 (L) 1 (L)
  Respond to requests for clarification: 4 (S) 1 (L)

Managing interaction
Initiating: 8 (G) 1 (L) 10 (G) 5 (S)
Changing: 8 (G) 7 (S)
Reciprocating: 7 (G) 9 (G) 1 (L)
Deciding: 3 (L) 1 (L) 1 (L) 2 (L)

Notes: The figures indicate the number of students that completed the task in each case. L = Little agreement; S = Some agreement; G = Good agreement. For Tasks 3 and 4 on the first tape observed the maximum was 9; for all others the maximum was 12. This is because 3 of the 12 MA students did not complete the task for these last two tasks. This was not a problem during the observation of the second tape, so there all the maximum figures are 12.


Appendix 5 Transcript results and observation checklist results

Columns: Tasks 1–4

Informational functions
Providing personal information
  Present: T G L T L
  Past: T G
  Future: T G
Expressing opinions: T G T G T G T G
Elaborating: L T G T S T G
Justifying opinions: L T S T S T S
Comparing: L T G T S S
Speculating: T S T G T G S
Staging: T L T S
Describing
  Sequence of events: T L L L
  Scene: T G L L
Summarizing: T L L L L
Suggesting: L L
Expressing preferences: T G T G S T G

Interactional functions
Agreeing: T G T L
Disagreeing: T S
Modifying: T S T L
Asking for opinions: T G
Persuading: L
Asking for information: S
Conversational repair: T S T L L
Negotiating meaning
  Check meaning: L
  Understanding: L L
  Common ground: L L
  Ask clarification: L T L
  Correct utterance: L
  Respond to requests for clarification: L

Managing interaction
Initiating: T G T S
Changing: T S
Reciprocating: T G L
Deciding: L L

Notes: T indicates that the function was identified as occurring in the transcript of the interaction. L, S and G indicate the degree of agreement among the raters using the checklists in real time (L = Little agreement; S = Some agreement; G = Good agreement).
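Read alongside these notes, the short hypothetical sketch below (Python; the values are invented for illustration, not the study's data) shows the kind of cross-check Appendix 5 tabulates: functions confirmed in the transcript but attracting only little real-time agreement, and functions ticked live without transcript confirmation.

# Hypothetical illustration only: cross-check transcript findings (T)
# against real-time checklist agreement (L/S/G) for one task.
transcript_found = {"summarizing", "conversational_repair", "expressing_opinions"}
realtime_agreement = {
    "summarizing": "L",
    "conversational_repair": "S",
    "expressing_opinions": "G",
    "suggesting": "L",  # ticked live but absent from the transcript
}

# Functions the transcript confirms but observers rarely agreed on
for fn in sorted(transcript_found):
    if realtime_agreement.get(fn) == "L":
        print(f"{fn}: in transcript, but only little real-time agreement")

# Functions ticked in real time with no transcript confirmation
for fn in sorted(set(realtime_agreement) - transcript_found):
    print(f"{fn}: observed live but not identified in the transcript")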

Page 17: Using observation checklists to validate speaking-test taskspart. The format of the Main Suite Speaking Tests (with the exception of the Level 1 KET test) is summarized in Table 1

Barry OrsquoSullivan Cyril J Weir and Nick Saville 49

Foster P and Skehan P 1996 The in uence of planning and task typeon second language performance Studies in Second Language Acqui-sition 18 299ndash323

mdashmdash 1999 The in uence of source of planning and focus of planning ontask-based performance Language Teaching Research 3 215ndash47

Fulcher G 1994 Some priority areas for oral language testing LanguageTesting Update 15 39ndash47

mdashmdash 1996 Testing tasks issues in task design and the group oral LanguageTesting 13 23ndash51

mdashmdash 1999 Assessment in English for academic purposes putting contentvalidity in its place Applied Linguistics 20 221ndash36

Hayashi M 1995 Conversational repair a contrastive study of Japaneseand English MA Project Report University of Canberra

Henning G 1983 Oral pro ciency testing comparative validities of inter-view imitation and completion methods Language Learning 33315ndash32

mdashmdash 1987 A guide to language testing Cambridge MA Newbury HouseKelly R 1978 On the construct validation of comprehension tests an exer-

cise in applied linguistics Unpublished PhD thesis University ofQueensland

Kenyon D 1995 An investigation of the validity of task demands onperformance-based tests of oral pro ciency In Kunnan AJ editorValidation in language assessment selected papers from the 17th Lan-guage Testing Research Colloquium Long Beach Mahwah NJ Lawr-ence Erlbaum 19ndash40

Kormos J 1999 Simulating conversations in oral-pro ciency assessmenta conversation analysis of role plays and non-scripted interviews inlanguage exams Language Testing 16 163ndash88

Lazaraton A 1992 The structural organisation of a language interview aconversational analytic perspective System 20 373ndash86

mdashmdash1996 A qualitative approach to monitoring examiner conduct in theCambridge assessment of spoken English (CASE) In Milanovic Mand Saville N editors Performance testing cognition andassessment selected papers from the 15th Language Testing ResearchColloquium Cambridge and Arnhem Studies in Language Testing 3Cambridge University of Cambridge Local Examinations Syndicate18ndash33

mdashmdash 2000 A qualitative approach to the validation of oral language testsStudies in Language Testing Volume 14 Cambridge Cambridge Uni-versity Press

Lumley T and OrsquoSullivan B 2000 The effect of speaker and topic vari-ables on task performance in a tape-mediated assessment of speakingPaper presented at the 2nd Annual Asian Language AssessmentResearch Forum The Hong Kong Polytechnic University

Luoma S 1997 Comparability of a tape-mediated and a face-to-face testof speaking a triangulation study Unpublished Licentiate ThesisCentre for Applied Language Studies Jyvaskyla University Finland

50 Validating speaking-test tasks

McNamara T 1996 Measuring second language performance LondonLongman

Mehnert U 1998 The effects of different lengths of time for planning onsecond language performance Studies in Second Language Acquisition20 83ndash108

Messick S 1975 The standard problem meaning and values in measure-ment and evaluation American Psychologist 30 955ndash66

mdashmdash 1989 Validity In Linn RL editor Educational measurement 3rdedition New York Macmillan

Milanovic M and Saville N 1996 Introduction Performance testing cog-nition and assessment Studies in Language Testing Volume 3 Cam-bridge University of Cambridge Local Examinations Syndicate 1ndash17

Moller A D 1982 A study in the validation of pro ciency tests of Englishas a Foreign Language Unpublished PhD thesis University of Edin-burgh

Norris J Brown J D Hudson T and Yoshioka J 1998 Designingsecond language performance assessments Technical Report 18Honolulu HI University of Hawaii Press

OrsquoLoughlin K 1995 Lexical density in candidate output on direct andsemi-direct versions of an oral pro ciency test Language Testing 12217ndash37

mdashmdash 1997 The comparability of direct and semi-direct speaking tests a casestudy Unpublished PhD Thesis University of Melbourne Melbourne

mdashmdash 2001 An investigatory study of the equivalence of direct and semi-direct speaking skills Studies in Language Testing 13 CambridgeCambridge University PressUCLES

Ortega L 1999 Planning and focus on form in L2 oral performance Stud-ies in Second Language Acquisition 20 109ndash48

OrsquoSullivan B 2000 Towards a model of performance in oral languagetesting Unpublished PhD dissertation CALS University of Reading

Robinson P 1995 Task complexity and second language narrative dis-course Language Learning 45 99ndash140

Ross S and Berwick R 1992 The discourse of accommodation in oralpro ciency interviews Studies in Second Language Acquisition 14159ndash76

Saville N and Hargreaves P 1999 Assessing speaking in the revisedFCE ELT Journal 53 42ndash51

Schegloff E Jefferson G and Sachs H 1977 The preference for self-correction in the organisation of repair in conversation Language 53361ndash82

Schwartz J 1980 The negotiation for meaning repair in conversationsbetween second language learners of English In Larsen-Freeman Deditor Discourse analysis in second language research Rowley MANewbury House

Shohamy E 1983 The stability of oral language pro ciency assessment inthe oral interview testing procedure Language Learning 33 527ndash40

mdashmdash 1988 A proposed framework for testing the oral language of

Barry OrsquoSullivan Cyril J Weir and Nick Saville 51

second foreign language learners Studies in Second Language Acqui-sition 10 165ndash79

mdashmdash 1994 The validity of direct versus semi-direct oral tests LanguageTesting 11 99ndash123

Shohamy E Reves T and Bejarano Y 1986 Introducing a new compre-hensive test of oral pro ciency ELT Journal 40 212ndash20

Skehan P 1996 A framework for the implementation of task based instruc-tion Applied Linguistics 17 38ndash62

mdashmdash 1998 A cognitive approach to language learning Oxford OxfordUniversity Press

Stans eld CW and Kenyon DM 1992 Research on the comparabilityof the oral pro ciency interview and the simulated oral pro ciencyinterview System 20 347ndash64

Stenstrom A 1994 An introduction to spoken interaction London Long-man

Suhua H 1998 A communicative test of spoken English for the CET 6Unpublished PhD Thesis Shanghai Jiao Tong University Shanghai

Upshur JA and Turner C 1999 Systematic effects in the rating ofsecond-language speaking ability test method and learner discourseLanguage Testing 16 82ndash111

van Ek JA and Trim JLM editors 1984 Across the thresholdOxford Pergamon

van Lier L 1989 Reeling writhing drawling stretching and fainting incoils oral pro ciency interviews as conversation TESOL Quarterly23 489ndash508

Walker C 1990 Large-scale oral testing Applied Linguistics 11 200ndash19Weir CJ 1983 Identifying the language needs of overseas students in

tertiary education in the United Kingdom Unpublished PhD thesisUniversity of London

mdashmdash 1993 Understanding and developing language tests HemelHempstead Prentice Hall

Wigglesworth G 1997 An investigation of planning time and pro ciencylevel on oral test discourse Language Testing 14 85ndash106

Wigglesworth G and OrsquoLoughlin K 1993 An investigation into the com-parability of direct and semi-direct versions of an oral interaction testin English Melbourne Papers in Language Testing 2 56ndash67

Young R 1995 Conversational styles in language pro ciency interviewsLanguage Learning 45 3ndash42

Young R and Milanovic M 1992 Discourse variation in oral pro ciencyinterviews Studies in Second Language Acquisition 14 403ndash24

52 Validating speaking-test tasks

Appendix 1 Items included in initial draft checklists (with short gloss)

Informational functionsProviding personal information middot Give information on present circumstances

middot Give information on past experiencesmiddot Give information on future plans

Providing nonpersonal Give information which does not relate to the individualinformationElaborating Elaborate on an ideaExpressing opinions Express opinionsJustifying opinions Express reasons for assertions she has madeComparing Compare thingspeopleeventsComplaining Complain about somethingSpeculating Hypothesize or speculateAnalysing Separate out the parts of an issueMaking excuses Make excusesExplaining Explain anythingNarrating Describe a sequence of eventsParaphrasing Paraphrase somethingSummarizing Summarize what she had saidSuggesting Suggest a particular ideaExpressing preferences Express preferences

Interactional functionsChallenging Challenge assertions made by another speaker(Dis)agreeing Indicate (dis)agreement with what another speaker

says (apart from lsquoyeahrsquolsquonorsquo or simply nodding)JustifyingProviding support Offer justication or support for a comment made by

another speakerQualifying Modify arguments or commentsAsking for opinions Ask for opinionsPersuading Attempt to persuade another personAsking for information Ask for informationConversational repair Repair breakdowns in interactionNegotiating meaning middot Check understanding

middot Attempt to establish common ground or strategymiddot Respond to requests for claricationmiddot Ask for claricationmiddot Make correctionsmiddot Indicate purposemiddot Indicate understandinguncertainty

Managing interactionInitiating Start any interactionsChanging Take the opportunity to change the topicReciprocity Share the responsibility for developing the interactionDeciding Come to a decisionTerminating Decide when the discussion should stop

Barry OrsquoSullivan Cyril J Weir and Nick Saville 53A

pp

end

ix2

Ph

ase

1re

sult

s(s

um

mar

ized

)

Make e

xcuse

s

Term

inate

Convers

atio

nal r

ep

air

Su

mm

ari

ze

Com

pla

in

Pa

rap

hra

se

Pe

rsuade

Change

topic

Challe

ng

e

Qua

lify

Ask

fo

r in

fo

Sugg

est

Narr

ate

Reci

pro

cate

Analy

se

Ela

bora

te

Initi

ate

Pro

vide n

onpe

rsonal

info

rma

tion

Exp

lain

Justif

y op

inio

ns

Negotiate

meanin

g

Deci

de

(Dis

) ag

ree

Justif

yS

uppo

rtA

sk fo

r o

pin

ions

Exp

ress

pre

fere

nce

s

Specula

teC

om

pare

Pro

vide n

onpe

rsonal

info

rma

tion

Exp

ress

opin

ion

Pa

rtic

ipan

ts

54 Validating speaking-test tasks

Appendix 3 Operational checklist (used in Phase 3)

Informational functionsProviding personal information middot Give information on present circumstances

middot Give information on past experiencesmiddot Give information on future plans

Expressing opinions Express opinionsElaborating Elaborate on or modify an opinionJustifying opinions Express reasons for assertions she had madeComparing Compare thingspeopleeventsSpeculating SpeculateStaging Separate out or interpret the parts of an issueDescribing middot Describe a sequence of events

middot Describe a sceneSummarizing Summarize what she has saidSuggesting Suggest a particular ideaExpressing preferences Express preferences

Interactional functionsAgreeing Agree with an assertion made by another speaker

(apart from lsquoyeahrsquo or nonverbal)Disagreeing Disagree with what another speaker says (apart from

lsquonorsquo or nonverbal)Modifying Modify arguments or comments made by other speaker

or by the test-taker in response to another speakerAsking for opinions Ask for opinionsPersuading Attempt to persuade another personAsking for information Ask for informationConversational repair Repair breakdowns in interactionNegotiating meaning middot Check understanding

middot Indicate understanding of point made by partnermiddot Establish common groundpurpose or strategymiddot Ask for clari cation when an utterance is misheard or

misinterpretedmiddot Correct an utterance made by other speaker which is

perceived to be incorrect or inaccuratemiddot Respond to requests for clari cation

Managing interactionInitiating Start any interactionsChanging Take the opportunity to change the topicReciprocating Share the responsibility for developing the interactionDeciding Come to a decision

Barry OrsquoSullivan Cyril J Weir and Nick Saville 55

Appendix 4 Summary of Phase 2 observation

Tape 1 Tape 2

Task Task Task Task Task Task Task Task1 2 3 4 1 2 3 4

Informational functionsProviding personalinformation

Present 12 (G) 1 (L) 1 (L) 1 (L) 12 (G) 1 (L) 4 (L)Past 10 (G) 4 (S) 12 (G)Future 11 (G) 3 (L) 6 (S) 12 (G)

Expressing opinions 12 (G) 11 (G) 9 (G) 8 (G) 11 (G) 10 (G) 10 (G) 11 (G)Elaborating 9 (G) 11 (G) 9 (G) 7 (G) 3 (L) 9 (G) 7 (S) 12 (G)Justifying opinions 10 (G) 7 (G) 9 (G) 7 (G) 4 (L) 8 (S) 6 (S) 8 (S)Comparing 11 (G) 8 (G) 1 (L) 6 (S) 3 (L) 12 (G) 7 (S) 5 (S)Speculating 7 (S) 11 (G) 8 (G) 3 (L) 7 (S) 10 (G) 10 (G) 5 (S)Staging 6 (S) 1 (L) 3 (L) 6 (L)Describing

Sequence of events 1 (L) 1 (L) 3 (L) 1 (L) 4 (L)Scene 5 (S) 9 (G) 2 (S) 2 (S) 10 (G) 2 (S) 3 (S)

Summarizing 1 (L) 1 (L) 1 (L) 1 (L) 3 (L) 1 (L) 1 (L) 1 (L)Suggesting 1 (L) 2 (L) 1 (L) 3 (L) 2 (L)Expressing preferences 12 (G) 11 (G) 6 (S) 8 (G) 11 (G) 10 (G) 5 (S) 12 (G)

Interactional functionsAgreeing 6 (S) 9 (G) 2 (L) 10 (G) 4 (L)Disagreeing 9 (G) 4 (S) 2 (L) 6 (S)Modifying 1 (L) 5 (S) 4 (S) 7 (S) 1 (L)Asking for opinions 1 (L) 8 (G) 2 (L) 11 (G)Persuading 2 (L) 2 (L)Asking for information 2 (L) 1 (L) 5 (S)Conversational repair 5 (S) 4 (L) 1 (L)Negotiating meaning

Check meaning 2 (L) 4 (S) 4 (L)Understanding 5 (S) 3 (L) 3 (L)Common group 2 (L) 2 (L) 1 (L)Ask clari cation 2 (L) 1 (L) 2 (L)Correct utterance 3 (L) 1 (L)Respond to required 4 (S) 1 (L)clari cation

Managing interactionInitiating 8 (G) 1 (L) 10 (G) 5 (S)Changing 8 (G) 7 (S)Reciprocating 7 (G) 9 (G) 1 (L)Deciding 3 (L) 1 (L) 1 (L) 2 (L)

Notes The gures indicate the number of students that complete the task in each case L Littleagreement S Some agreement G Good aggreement For Tasks 3 and 4 in the rst tapeobserved the maximum was 9 for all others the maximum was 12 This is because 3 of the12 MA students did not complete the task for these last 2 tasks This was not a problem duringthe observation of the second tape so for all the maximum gures are 12

56 Validating speaking-test tasks

Appendix 5 Transcript results and observation checklist results

Informational functions Task 1 Task 2 Task 3 Task 4

Providing personal informationPresent T G L T LPast T GFuture T G

Expressing opinions T G T G T G T GElaborating L T G T S T GJustifying opinions L T S T S T SComparing L T G T S SSpeculating T S T G T G SStaging T L T SDescribing

Sequence of events T L L LScene T G L L

Summarizing T L L L LSuggesting L LExpressing preferences T G T G S T G

Interactional functionsAgreeing T G T LDisagreeing T SModifying T S T LAsking for opinions T GPersuading LAsking for information SConversational repair T S T L LNegotiating meaning

Check meaning LUnderstanding L LCommon ground L LAsk clari cation L T LCorrect utterance LRespond to required Lclari cation

Managing interactionInitiating T G T SChanging T SReciprocating T G LDeciding L L

Notes T indicates that this function has been identi ed as occurring in the transcript of theinteraction L S and G indicate the degree of agreement among the raters using the check-lists in real time (L Little agreement S Some agreement G Good agreement)

Page 18: Using observation checklists to validate speaking-test taskspart. The format of the Main Suite Speaking Tests (with the exception of the Level 1 KET test) is summarized in Table 1

50 Validating speaking-test tasks

McNamara T 1996 Measuring second language performance LondonLongman

Mehnert U 1998 The effects of different lengths of time for planning onsecond language performance Studies in Second Language Acquisition20 83ndash108

Messick S 1975 The standard problem meaning and values in measure-ment and evaluation American Psychologist 30 955ndash66

mdashmdash 1989 Validity In Linn RL editor Educational measurement 3rdedition New York Macmillan

Milanovic M and Saville N 1996 Introduction Performance testing cog-nition and assessment Studies in Language Testing Volume 3 Cam-bridge University of Cambridge Local Examinations Syndicate 1ndash17

Moller A D 1982 A study in the validation of pro ciency tests of Englishas a Foreign Language Unpublished PhD thesis University of Edin-burgh

Norris J Brown J D Hudson T and Yoshioka J 1998 Designingsecond language performance assessments Technical Report 18Honolulu HI University of Hawaii Press

OrsquoLoughlin K 1995 Lexical density in candidate output on direct andsemi-direct versions of an oral pro ciency test Language Testing 12217ndash37

mdashmdash 1997 The comparability of direct and semi-direct speaking tests a casestudy Unpublished PhD Thesis University of Melbourne Melbourne

mdashmdash 2001 An investigatory study of the equivalence of direct and semi-direct speaking skills Studies in Language Testing 13 CambridgeCambridge University PressUCLES

Ortega L 1999 Planning and focus on form in L2 oral performance Stud-ies in Second Language Acquisition 20 109ndash48

OrsquoSullivan B 2000 Towards a model of performance in oral languagetesting Unpublished PhD dissertation CALS University of Reading

Robinson P 1995 Task complexity and second language narrative dis-course Language Learning 45 99ndash140

Ross S and Berwick R 1992 The discourse of accommodation in oralpro ciency interviews Studies in Second Language Acquisition 14159ndash76

Saville N and Hargreaves P 1999 Assessing speaking in the revisedFCE ELT Journal 53 42ndash51

Schegloff E Jefferson G and Sachs H 1977 The preference for self-correction in the organisation of repair in conversation Language 53361ndash82

Schwartz J 1980 The negotiation for meaning repair in conversationsbetween second language learners of English In Larsen-Freeman Deditor Discourse analysis in second language research Rowley MANewbury House

Shohamy E 1983 The stability of oral language pro ciency assessment inthe oral interview testing procedure Language Learning 33 527ndash40

mdashmdash 1988 A proposed framework for testing the oral language of

Barry OrsquoSullivan Cyril J Weir and Nick Saville 51

second foreign language learners Studies in Second Language Acqui-sition 10 165ndash79

mdashmdash 1994 The validity of direct versus semi-direct oral tests LanguageTesting 11 99ndash123

Shohamy E Reves T and Bejarano Y 1986 Introducing a new compre-hensive test of oral pro ciency ELT Journal 40 212ndash20

Skehan P 1996 A framework for the implementation of task based instruc-tion Applied Linguistics 17 38ndash62

mdashmdash 1998 A cognitive approach to language learning Oxford OxfordUniversity Press

Stans eld CW and Kenyon DM 1992 Research on the comparabilityof the oral pro ciency interview and the simulated oral pro ciencyinterview System 20 347ndash64

Stenstrom A 1994 An introduction to spoken interaction London Long-man

Suhua H 1998 A communicative test of spoken English for the CET 6Unpublished PhD Thesis Shanghai Jiao Tong University Shanghai

Upshur JA and Turner C 1999 Systematic effects in the rating ofsecond-language speaking ability test method and learner discourseLanguage Testing 16 82ndash111

van Ek JA and Trim JLM editors 1984 Across the thresholdOxford Pergamon

van Lier L 1989 Reeling writhing drawling stretching and fainting incoils oral pro ciency interviews as conversation TESOL Quarterly23 489ndash508

Walker C 1990 Large-scale oral testing Applied Linguistics 11 200ndash19Weir CJ 1983 Identifying the language needs of overseas students in

tertiary education in the United Kingdom Unpublished PhD thesisUniversity of London

mdashmdash 1993 Understanding and developing language tests HemelHempstead Prentice Hall

Wigglesworth G 1997 An investigation of planning time and pro ciencylevel on oral test discourse Language Testing 14 85ndash106

Wigglesworth G and OrsquoLoughlin K 1993 An investigation into the com-parability of direct and semi-direct versions of an oral interaction testin English Melbourne Papers in Language Testing 2 56ndash67

Young R 1995 Conversational styles in language pro ciency interviewsLanguage Learning 45 3ndash42

Young R and Milanovic M 1992 Discourse variation in oral pro ciencyinterviews Studies in Second Language Acquisition 14 403ndash24

52 Validating speaking-test tasks

Appendix 1 Items included in initial draft checklists (with short gloss)

Informational functionsProviding personal information middot Give information on present circumstances

middot Give information on past experiencesmiddot Give information on future plans

Providing nonpersonal Give information which does not relate to the individualinformationElaborating Elaborate on an ideaExpressing opinions Express opinionsJustifying opinions Express reasons for assertions she has madeComparing Compare thingspeopleeventsComplaining Complain about somethingSpeculating Hypothesize or speculateAnalysing Separate out the parts of an issueMaking excuses Make excusesExplaining Explain anythingNarrating Describe a sequence of eventsParaphrasing Paraphrase somethingSummarizing Summarize what she had saidSuggesting Suggest a particular ideaExpressing preferences Express preferences

Interactional functionsChallenging Challenge assertions made by another speaker(Dis)agreeing Indicate (dis)agreement with what another speaker

says (apart from lsquoyeahrsquolsquonorsquo or simply nodding)JustifyingProviding support Offer justication or support for a comment made by

another speakerQualifying Modify arguments or commentsAsking for opinions Ask for opinionsPersuading Attempt to persuade another personAsking for information Ask for informationConversational repair Repair breakdowns in interactionNegotiating meaning middot Check understanding

middot Attempt to establish common ground or strategymiddot Respond to requests for claricationmiddot Ask for claricationmiddot Make correctionsmiddot Indicate purposemiddot Indicate understandinguncertainty

Managing interactionInitiating Start any interactionsChanging Take the opportunity to change the topicReciprocity Share the responsibility for developing the interactionDeciding Come to a decisionTerminating Decide when the discussion should stop

Barry OrsquoSullivan Cyril J Weir and Nick Saville 53A

pp

end

ix2

Ph

ase

1re

sult

s(s

um

mar

ized

)

Make e

xcuse

s

Term

inate

Convers

atio

nal r

ep

air

Su

mm

ari

ze

Com

pla

in

Pa

rap

hra

se

Pe

rsuade

Change

topic

Challe

ng

e

Qua

lify

Ask

fo

r in

fo

Sugg

est

Narr

ate

Reci

pro

cate

Analy

se

Ela

bora

te

Initi

ate

Pro

vide n

onpe

rsonal

info

rma

tion

Exp

lain

Justif

y op

inio

ns

Negotiate

meanin

g

Deci

de

(Dis

) ag

ree

Justif

yS

uppo

rtA

sk fo

r o

pin

ions

Exp

ress

pre

fere

nce

s

Specula

teC

om

pare

Pro

vide n

onpe

rsonal

info

rma

tion

Exp

ress

opin

ion

Pa

rtic

ipan

ts

54 Validating speaking-test tasks

Appendix 3 Operational checklist (used in Phase 3)

Informational functionsProviding personal information middot Give information on present circumstances

middot Give information on past experiencesmiddot Give information on future plans

Expressing opinions Express opinionsElaborating Elaborate on or modify an opinionJustifying opinions Express reasons for assertions she had madeComparing Compare thingspeopleeventsSpeculating SpeculateStaging Separate out or interpret the parts of an issueDescribing middot Describe a sequence of events

middot Describe a sceneSummarizing Summarize what she has saidSuggesting Suggest a particular ideaExpressing preferences Express preferences

Interactional functionsAgreeing Agree with an assertion made by another speaker

(apart from lsquoyeahrsquo or nonverbal)Disagreeing Disagree with what another speaker says (apart from

lsquonorsquo or nonverbal)Modifying Modify arguments or comments made by other speaker

or by the test-taker in response to another speakerAsking for opinions Ask for opinionsPersuading Attempt to persuade another personAsking for information Ask for informationConversational repair Repair breakdowns in interactionNegotiating meaning middot Check understanding

middot Indicate understanding of point made by partnermiddot Establish common groundpurpose or strategymiddot Ask for clari cation when an utterance is misheard or

misinterpretedmiddot Correct an utterance made by other speaker which is

perceived to be incorrect or inaccuratemiddot Respond to requests for clari cation

Managing interactionInitiating Start any interactionsChanging Take the opportunity to change the topicReciprocating Share the responsibility for developing the interactionDeciding Come to a decision

Barry OrsquoSullivan Cyril J Weir and Nick Saville 55

Appendix 4 Summary of Phase 2 observation

Tape 1 Tape 2

Task Task Task Task Task Task Task Task1 2 3 4 1 2 3 4

Informational functionsProviding personalinformation

Present 12 (G) 1 (L) 1 (L) 1 (L) 12 (G) 1 (L) 4 (L)Past 10 (G) 4 (S) 12 (G)Future 11 (G) 3 (L) 6 (S) 12 (G)

Expressing opinions 12 (G) 11 (G) 9 (G) 8 (G) 11 (G) 10 (G) 10 (G) 11 (G)Elaborating 9 (G) 11 (G) 9 (G) 7 (G) 3 (L) 9 (G) 7 (S) 12 (G)Justifying opinions 10 (G) 7 (G) 9 (G) 7 (G) 4 (L) 8 (S) 6 (S) 8 (S)Comparing 11 (G) 8 (G) 1 (L) 6 (S) 3 (L) 12 (G) 7 (S) 5 (S)Speculating 7 (S) 11 (G) 8 (G) 3 (L) 7 (S) 10 (G) 10 (G) 5 (S)Staging 6 (S) 1 (L) 3 (L) 6 (L)Describing

Sequence of events 1 (L) 1 (L) 3 (L) 1 (L) 4 (L)Scene 5 (S) 9 (G) 2 (S) 2 (S) 10 (G) 2 (S) 3 (S)

Summarizing 1 (L) 1 (L) 1 (L) 1 (L) 3 (L) 1 (L) 1 (L) 1 (L)Suggesting 1 (L) 2 (L) 1 (L) 3 (L) 2 (L)Expressing preferences 12 (G) 11 (G) 6 (S) 8 (G) 11 (G) 10 (G) 5 (S) 12 (G)

Interactional functionsAgreeing 6 (S) 9 (G) 2 (L) 10 (G) 4 (L)Disagreeing 9 (G) 4 (S) 2 (L) 6 (S)Modifying 1 (L) 5 (S) 4 (S) 7 (S) 1 (L)Asking for opinions 1 (L) 8 (G) 2 (L) 11 (G)Persuading 2 (L) 2 (L)Asking for information 2 (L) 1 (L) 5 (S)Conversational repair 5 (S) 4 (L) 1 (L)Negotiating meaning

Check meaning 2 (L) 4 (S) 4 (L)Understanding 5 (S) 3 (L) 3 (L)Common group 2 (L) 2 (L) 1 (L)Ask clari cation 2 (L) 1 (L) 2 (L)Correct utterance 3 (L) 1 (L)Respond to required 4 (S) 1 (L)clari cation

Managing interactionInitiating 8 (G) 1 (L) 10 (G) 5 (S)Changing 8 (G) 7 (S)Reciprocating 7 (G) 9 (G) 1 (L)Deciding 3 (L) 1 (L) 1 (L) 2 (L)

Notes The gures indicate the number of students that complete the task in each case L Littleagreement S Some agreement G Good aggreement For Tasks 3 and 4 in the rst tapeobserved the maximum was 9 for all others the maximum was 12 This is because 3 of the12 MA students did not complete the task for these last 2 tasks This was not a problem duringthe observation of the second tape so for all the maximum gures are 12

56 Validating speaking-test tasks

Appendix 5 Transcript results and observation checklist results

Informational functions Task 1 Task 2 Task 3 Task 4

Providing personal informationPresent T G L T LPast T GFuture T G

Expressing opinions T G T G T G T GElaborating L T G T S T GJustifying opinions L T S T S T SComparing L T G T S SSpeculating T S T G T G SStaging T L T SDescribing

Sequence of events T L L LScene T G L L

Summarizing T L L L LSuggesting L LExpressing preferences T G T G S T G

Interactional functionsAgreeing T G T LDisagreeing T SModifying T S T LAsking for opinions T GPersuading LAsking for information SConversational repair T S T L LNegotiating meaning

Check meaning LUnderstanding L LCommon ground L LAsk clari cation L T LCorrect utterance LRespond to required Lclari cation

Managing interactionInitiating T G T SChanging T SReciprocating T G LDeciding L L

Notes T indicates that this function has been identi ed as occurring in the transcript of theinteraction L S and G indicate the degree of agreement among the raters using the check-lists in real time (L Little agreement S Some agreement G Good agreement)

Page 19: Using observation checklists to validate speaking-test taskspart. The format of the Main Suite Speaking Tests (with the exception of the Level 1 KET test) is summarized in Table 1

Barry OrsquoSullivan Cyril J Weir and Nick Saville 51

second foreign language learners Studies in Second Language Acqui-sition 10 165ndash79

mdashmdash 1994 The validity of direct versus semi-direct oral tests LanguageTesting 11 99ndash123

Shohamy E Reves T and Bejarano Y 1986 Introducing a new compre-hensive test of oral pro ciency ELT Journal 40 212ndash20

Skehan P 1996 A framework for the implementation of task based instruc-tion Applied Linguistics 17 38ndash62

mdashmdash 1998 A cognitive approach to language learning Oxford OxfordUniversity Press

Stans eld CW and Kenyon DM 1992 Research on the comparabilityof the oral pro ciency interview and the simulated oral pro ciencyinterview System 20 347ndash64

Stenstrom A 1994 An introduction to spoken interaction London Long-man

Suhua H 1998 A communicative test of spoken English for the CET 6Unpublished PhD Thesis Shanghai Jiao Tong University Shanghai

Upshur JA and Turner C 1999 Systematic effects in the rating ofsecond-language speaking ability test method and learner discourseLanguage Testing 16 82ndash111

van Ek JA and Trim JLM editors 1984 Across the thresholdOxford Pergamon

van Lier L 1989 Reeling writhing drawling stretching and fainting incoils oral pro ciency interviews as conversation TESOL Quarterly23 489ndash508

Walker C 1990 Large-scale oral testing Applied Linguistics 11 200ndash19Weir CJ 1983 Identifying the language needs of overseas students in

tertiary education in the United Kingdom Unpublished PhD thesisUniversity of London

mdashmdash 1993 Understanding and developing language tests HemelHempstead Prentice Hall

Wigglesworth G 1997 An investigation of planning time and pro ciencylevel on oral test discourse Language Testing 14 85ndash106

Wigglesworth G and OrsquoLoughlin K 1993 An investigation into the com-parability of direct and semi-direct versions of an oral interaction testin English Melbourne Papers in Language Testing 2 56ndash67

Young R 1995 Conversational styles in language pro ciency interviewsLanguage Learning 45 3ndash42

Young R and Milanovic M 1992 Discourse variation in oral pro ciencyinterviews Studies in Second Language Acquisition 14 403ndash24

52 Validating speaking-test tasks

Appendix 1 Items included in initial draft checklists (with short gloss)

Informational functionsProviding personal information middot Give information on present circumstances

middot Give information on past experiencesmiddot Give information on future plans

Providing nonpersonal Give information which does not relate to the individualinformationElaborating Elaborate on an ideaExpressing opinions Express opinionsJustifying opinions Express reasons for assertions she has madeComparing Compare thingspeopleeventsComplaining Complain about somethingSpeculating Hypothesize or speculateAnalysing Separate out the parts of an issueMaking excuses Make excusesExplaining Explain anythingNarrating Describe a sequence of eventsParaphrasing Paraphrase somethingSummarizing Summarize what she had saidSuggesting Suggest a particular ideaExpressing preferences Express preferences

Interactional functionsChallenging Challenge assertions made by another speaker(Dis)agreeing Indicate (dis)agreement with what another speaker

says (apart from lsquoyeahrsquolsquonorsquo or simply nodding)JustifyingProviding support Offer justication or support for a comment made by

another speakerQualifying Modify arguments or commentsAsking for opinions Ask for opinionsPersuading Attempt to persuade another personAsking for information Ask for informationConversational repair Repair breakdowns in interactionNegotiating meaning middot Check understanding

middot Attempt to establish common ground or strategymiddot Respond to requests for claricationmiddot Ask for claricationmiddot Make correctionsmiddot Indicate purposemiddot Indicate understandinguncertainty

Managing interactionInitiating Start any interactionsChanging Take the opportunity to change the topicReciprocity Share the responsibility for developing the interactionDeciding Come to a decisionTerminating Decide when the discussion should stop

Barry OrsquoSullivan Cyril J Weir and Nick Saville 53A

pp

end

ix2

Ph

ase

1re

sult

s(s

um

mar

ized

)

Make e

xcuse

s

Term

inate

Convers

atio

nal r

ep

air

Su

mm

ari

ze

Com

pla

in

Pa

rap

hra

se

Pe

rsuade

Change

topic

Challe

ng

e

Qua

lify

Ask

fo

r in

fo

Sugg

est

Narr

ate

Reci

pro

cate

Analy

se

Ela

bora

te

Initi

ate

Pro

vide n

onpe

rsonal

info

rma

tion

Exp

lain

Justif

y op

inio

ns

Negotiate

meanin

g

Deci

de

(Dis

) ag

ree

Justif

yS

uppo

rtA

sk fo

r o

pin

ions

Exp

ress

pre

fere

nce

s

Specula

teC

om

pare

Pro

vide n

onpe

rsonal

info

rma

tion

Exp

ress

opin

ion

Pa

rtic

ipan

ts

54 Validating speaking-test tasks

Appendix 3 Operational checklist (used in Phase 3)

Informational functionsProviding personal information middot Give information on present circumstances

middot Give information on past experiencesmiddot Give information on future plans

Expressing opinions Express opinionsElaborating Elaborate on or modify an opinionJustifying opinions Express reasons for assertions she had madeComparing Compare thingspeopleeventsSpeculating SpeculateStaging Separate out or interpret the parts of an issueDescribing middot Describe a sequence of events

middot Describe a sceneSummarizing Summarize what she has saidSuggesting Suggest a particular ideaExpressing preferences Express preferences

Interactional functionsAgreeing Agree with an assertion made by another speaker

(apart from lsquoyeahrsquo or nonverbal)Disagreeing Disagree with what another speaker says (apart from

lsquonorsquo or nonverbal)Modifying Modify arguments or comments made by other speaker

or by the test-taker in response to another speakerAsking for opinions Ask for opinionsPersuading Attempt to persuade another personAsking for information Ask for informationConversational repair Repair breakdowns in interactionNegotiating meaning middot Check understanding

middot Indicate understanding of point made by partnermiddot Establish common groundpurpose or strategymiddot Ask for clari cation when an utterance is misheard or

misinterpretedmiddot Correct an utterance made by other speaker which is

perceived to be incorrect or inaccuratemiddot Respond to requests for clari cation

Managing interactionInitiating Start any interactionsChanging Take the opportunity to change the topicReciprocating Share the responsibility for developing the interactionDeciding Come to a decision

Barry OrsquoSullivan Cyril J Weir and Nick Saville 55

Appendix 4 Summary of Phase 2 observation

Tape 1 Tape 2

Task Task Task Task Task Task Task Task1 2 3 4 1 2 3 4

Informational functionsProviding personalinformation

Present 12 (G) 1 (L) 1 (L) 1 (L) 12 (G) 1 (L) 4 (L)Past 10 (G) 4 (S) 12 (G)Future 11 (G) 3 (L) 6 (S) 12 (G)

Expressing opinions 12 (G) 11 (G) 9 (G) 8 (G) 11 (G) 10 (G) 10 (G) 11 (G)Elaborating 9 (G) 11 (G) 9 (G) 7 (G) 3 (L) 9 (G) 7 (S) 12 (G)Justifying opinions 10 (G) 7 (G) 9 (G) 7 (G) 4 (L) 8 (S) 6 (S) 8 (S)Comparing 11 (G) 8 (G) 1 (L) 6 (S) 3 (L) 12 (G) 7 (S) 5 (S)Speculating 7 (S) 11 (G) 8 (G) 3 (L) 7 (S) 10 (G) 10 (G) 5 (S)Staging 6 (S) 1 (L) 3 (L) 6 (L)Describing

Sequence of events 1 (L) 1 (L) 3 (L) 1 (L) 4 (L)Scene 5 (S) 9 (G) 2 (S) 2 (S) 10 (G) 2 (S) 3 (S)

Summarizing 1 (L) 1 (L) 1 (L) 1 (L) 3 (L) 1 (L) 1 (L) 1 (L)Suggesting 1 (L) 2 (L) 1 (L) 3 (L) 2 (L)Expressing preferences 12 (G) 11 (G) 6 (S) 8 (G) 11 (G) 10 (G) 5 (S) 12 (G)

Interactional functionsAgreeing 6 (S) 9 (G) 2 (L) 10 (G) 4 (L)Disagreeing 9 (G) 4 (S) 2 (L) 6 (S)Modifying 1 (L) 5 (S) 4 (S) 7 (S) 1 (L)Asking for opinions 1 (L) 8 (G) 2 (L) 11 (G)Persuading 2 (L) 2 (L)Asking for information 2 (L) 1 (L) 5 (S)Conversational repair 5 (S) 4 (L) 1 (L)Negotiating meaning

Check meaning 2 (L) 4 (S) 4 (L)Understanding 5 (S) 3 (L) 3 (L)Common group 2 (L) 2 (L) 1 (L)Ask clari cation 2 (L) 1 (L) 2 (L)Correct utterance 3 (L) 1 (L)Respond to required 4 (S) 1 (L)clari cation

Managing interactionInitiating 8 (G) 1 (L) 10 (G) 5 (S)Changing 8 (G) 7 (S)Reciprocating 7 (G) 9 (G) 1 (L)Deciding 3 (L) 1 (L) 1 (L) 2 (L)

Notes The gures indicate the number of students that complete the task in each case L Littleagreement S Some agreement G Good aggreement For Tasks 3 and 4 in the rst tapeobserved the maximum was 9 for all others the maximum was 12 This is because 3 of the12 MA students did not complete the task for these last 2 tasks This was not a problem duringthe observation of the second tape so for all the maximum gures are 12

56 Validating speaking-test tasks

Appendix 5 Transcript results and observation checklist results

Informational functions Task 1 Task 2 Task 3 Task 4

Providing personal informationPresent T G L T LPast T GFuture T G

Expressing opinions T G T G T G T GElaborating L T G T S T GJustifying opinions L T S T S T SComparing L T G T S SSpeculating T S T G T G SStaging T L T SDescribing

Sequence of events T L L LScene T G L L

Summarizing T L L L LSuggesting L LExpressing preferences T G T G S T G

Interactional functionsAgreeing T G T LDisagreeing T SModifying T S T LAsking for opinions T GPersuading LAsking for information SConversational repair T S T L LNegotiating meaning

Check meaning LUnderstanding L LCommon ground L LAsk clari cation L T LCorrect utterance LRespond to required Lclari cation

Managing interactionInitiating T G T SChanging T SReciprocating T G LDeciding L L

Notes T indicates that this function has been identi ed as occurring in the transcript of theinteraction L S and G indicate the degree of agreement among the raters using the check-lists in real time (L Little agreement S Some agreement G Good agreement)

Page 20: Using observation checklists to validate speaking-test taskspart. The format of the Main Suite Speaking Tests (with the exception of the Level 1 KET test) is summarized in Table 1

52 Validating speaking-test tasks

Appendix 1 Items included in initial draft checklists (with short gloss)

Informational functionsProviding personal information middot Give information on present circumstances

middot Give information on past experiencesmiddot Give information on future plans

Providing nonpersonal Give information which does not relate to the individualinformationElaborating Elaborate on an ideaExpressing opinions Express opinionsJustifying opinions Express reasons for assertions she has madeComparing Compare thingspeopleeventsComplaining Complain about somethingSpeculating Hypothesize or speculateAnalysing Separate out the parts of an issueMaking excuses Make excusesExplaining Explain anythingNarrating Describe a sequence of eventsParaphrasing Paraphrase somethingSummarizing Summarize what she had saidSuggesting Suggest a particular ideaExpressing preferences Express preferences

Interactional functionsChallenging Challenge assertions made by another speaker(Dis)agreeing Indicate (dis)agreement with what another speaker

says (apart from lsquoyeahrsquolsquonorsquo or simply nodding)JustifyingProviding support Offer justication or support for a comment made by

another speakerQualifying Modify arguments or commentsAsking for opinions Ask for opinionsPersuading Attempt to persuade another personAsking for information Ask for informationConversational repair Repair breakdowns in interactionNegotiating meaning middot Check understanding

middot Attempt to establish common ground or strategymiddot Respond to requests for claricationmiddot Ask for claricationmiddot Make correctionsmiddot Indicate purposemiddot Indicate understandinguncertainty

Managing interactionInitiating Start any interactionsChanging Take the opportunity to change the topicReciprocity Share the responsibility for developing the interactionDeciding Come to a decisionTerminating Decide when the discussion should stop

Barry OrsquoSullivan Cyril J Weir and Nick Saville 53A

pp

end

ix2

Ph

ase

1re

sult

s(s

um

mar

ized

)

Make e

xcuse

s

Term

inate

Convers

atio

nal r

ep

air

Su

mm

ari

ze

Com

pla

in

Pa

rap

hra

se

Pe

rsuade

Change

topic

Challe

ng

e

Qua

lify

Ask

fo

r in

fo

Sugg

est

Narr

ate

Reci

pro

cate

Analy

se

Ela

bora

te

Initi

ate

Pro

vide n

onpe

rsonal

info

rma

tion

Exp

lain

Justif

y op

inio

ns

Negotiate

meanin

g

Deci

de

(Dis

) ag

ree

Justif

yS

uppo

rtA

sk fo

r o

pin

ions

Exp

ress

pre

fere

nce

s

Specula

teC

om

pare

Pro

vide n

onpe

rsonal

info

rma

tion

Exp

ress

opin

ion

Pa

rtic

ipan

ts

54 Validating speaking-test tasks

Appendix 3 Operational checklist (used in Phase 3)

Informational functionsProviding personal information middot Give information on present circumstances

middot Give information on past experiencesmiddot Give information on future plans

Expressing opinions Express opinionsElaborating Elaborate on or modify an opinionJustifying opinions Express reasons for assertions she had madeComparing Compare thingspeopleeventsSpeculating SpeculateStaging Separate out or interpret the parts of an issueDescribing middot Describe a sequence of events

middot Describe a sceneSummarizing Summarize what she has saidSuggesting Suggest a particular ideaExpressing preferences Express preferences

Interactional functionsAgreeing Agree with an assertion made by another speaker

(apart from lsquoyeahrsquo or nonverbal)Disagreeing Disagree with what another speaker says (apart from

lsquonorsquo or nonverbal)Modifying Modify arguments or comments made by other speaker

or by the test-taker in response to another speakerAsking for opinions Ask for opinionsPersuading Attempt to persuade another personAsking for information Ask for informationConversational repair Repair breakdowns in interactionNegotiating meaning middot Check understanding

middot Indicate understanding of point made by partnermiddot Establish common groundpurpose or strategymiddot Ask for clari cation when an utterance is misheard or

misinterpretedmiddot Correct an utterance made by other speaker which is

perceived to be incorrect or inaccuratemiddot Respond to requests for clari cation

Managing interactionInitiating Start any interactionsChanging Take the opportunity to change the topicReciprocating Share the responsibility for developing the interactionDeciding Come to a decision

Barry OrsquoSullivan Cyril J Weir and Nick Saville 55

Appendix 4 Summary of Phase 2 observation

Values are read in task order: Tape 1 (Tasks 1-4) followed by Tape 2 (Tasks 1-4); tasks on which a function was not recorded are omitted.

Informational functions
Providing personal information
  Present: 12 (G), 1 (L), 1 (L), 1 (L), 12 (G), 1 (L), 4 (L)
  Past: 10 (G), 4 (S), 12 (G)
  Future: 11 (G), 3 (L), 6 (S), 12 (G)
Expressing opinions: 12 (G), 11 (G), 9 (G), 8 (G), 11 (G), 10 (G), 10 (G), 11 (G)
Elaborating: 9 (G), 11 (G), 9 (G), 7 (G), 3 (L), 9 (G), 7 (S), 12 (G)
Justifying opinions: 10 (G), 7 (G), 9 (G), 7 (G), 4 (L), 8 (S), 6 (S), 8 (S)
Comparing: 11 (G), 8 (G), 1 (L), 6 (S), 3 (L), 12 (G), 7 (S), 5 (S)
Speculating: 7 (S), 11 (G), 8 (G), 3 (L), 7 (S), 10 (G), 10 (G), 5 (S)
Staging: 6 (S), 1 (L), 3 (L), 6 (L)
Describing
  Sequence of events: 1 (L), 1 (L), 3 (L), 1 (L), 4 (L)
  Scene: 5 (S), 9 (G), 2 (S), 2 (S), 10 (G), 2 (S), 3 (S)
Summarizing: 1 (L), 1 (L), 1 (L), 1 (L), 3 (L), 1 (L), 1 (L), 1 (L)
Suggesting: 1 (L), 2 (L), 1 (L), 3 (L), 2 (L)
Expressing preferences: 12 (G), 11 (G), 6 (S), 8 (G), 11 (G), 10 (G), 5 (S), 12 (G)

Interactional functions
Agreeing: 6 (S), 9 (G), 2 (L), 10 (G), 4 (L)
Disagreeing: 9 (G), 4 (S), 2 (L), 6 (S)
Modifying: 1 (L), 5 (S), 4 (S), 7 (S), 1 (L)
Asking for opinions: 1 (L), 8 (G), 2 (L), 11 (G)
Persuading: 2 (L), 2 (L)
Asking for information: 2 (L), 1 (L), 5 (S)
Conversational repair: 5 (S), 4 (L), 1 (L)
Negotiating meaning
  Check meaning: 2 (L), 4 (S), 4 (L)
  Understanding: 5 (S), 3 (L), 3 (L)
  Common ground: 2 (L), 2 (L), 1 (L)
  Ask clarification: 2 (L), 1 (L), 2 (L)
  Correct utterance: 3 (L), 1 (L)
  Respond to requests for clarification: 4 (S), 1 (L)

Managing interaction
Initiating: 8 (G), 1 (L), 10 (G), 5 (S)
Changing: 8 (G), 7 (S)
Reciprocating: 7 (G), 9 (G), 1 (L)
Deciding: 3 (L), 1 (L), 1 (L), 2 (L)

Notes: The figures indicate the number of students recording the function in each case. L = Little agreement; S = Some agreement; G = Good agreement. For Tasks 3 and 4 on the first tape observed, the maximum was 9; for all others the maximum was 12. This is because 3 of the 12 MA students did not complete the observation for these last two tasks. This was not a problem during the observation of the second tape, so there all the maximum figures are 12.
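The figures in Appendix 4 are, in effect, column sums over the twelve observers' tick-sheets, while the L/S/G labels are a separate, qualitative summary of how far the observers agreed. The sketch below, which builds on the encoding given after Appendix 3, shows how such a tally and one simple agreement statistic could be computed; the pairwise percent-agreement measure is our illustrative choice, not the article's banding procedure.

from itertools import combinations

def tally(observers, function_key):
    """Number of tick-sheets that recorded a given function on one task:
    the kind of figure reported in Appendix 4."""
    category, function = function_key
    return sum(1 for form in observers if form[category][function])

def percent_agreement(observers, function_key):
    """Proportion of observer pairs giving the same yes/no judgement for
    one function on one task. This is one simple way to quantify the kind
    of rater agreement summarized as L/S/G; it is not the article's own
    (qualitative) banding."""
    category, function = function_key
    marks = [form[category][function] for form in observers]
    pairs = list(combinations(marks, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

# e.g., 12 observers, 8 of whom ticked "Agreeing" on this task:
forms = [new_observation_form() for _ in range(12)]
for form in forms[:8]:
    form["Interactional functions"]["Agreeing"] = True
key = ("Interactional functions", "Agreeing")
print(tally(forms, key), round(percent_agreement(forms, key), 2))  # 8 0.52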


Appendix 5 Transcript results and observation checklist results

Entries are read in task order (Tasks 1-4); tasks on which a function was neither found in the transcript nor recorded on the checklists are omitted.

Informational functions
Providing personal information
  Present: T G, L, T L
  Past: T G
  Future: T G
Expressing opinions: T G, T G, T G, T G
Elaborating: L, T G, T S, T G
Justifying opinions: L, T S, T S, T S
Comparing: L, T G, T S, S
Speculating: T S, T G, T G, S
Staging: T L, T S
Describing
  Sequence of events: T L, L, L
  Scene: T G, L, L
Summarizing: T L, L, L, L
Suggesting: L, L
Expressing preferences: T G, T G, S, T G

Interactional functions
Agreeing: T G, T L
Disagreeing: T S
Modifying: T S, T L
Asking for opinions: T G
Persuading: L
Asking for information: S
Conversational repair: T S, T L, L
Negotiating meaning
  Check meaning: L
  Understanding: L, L
  Common ground: L, L
  Ask clarification: L, T L
  Correct utterance: L
  Respond to requests for clarification: L

Managing interaction
Initiating: T G, T S
Changing: T S
Reciprocating: T G, L
Deciding: L, L

Notes: T indicates that this function has been identified as occurring in the transcript of the interaction. L, S and G indicate the degree of agreement among the raters using the checklists in real time (L = Little agreement; S = Some agreement; G = Good agreement).
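Appendix 5 sets two evidence sources side by side: functions identified in the transcript ('T') and functions recorded on the real-time checklists. A minimal sketch of that comparison as set operations follows; the set-based procedure and the example inputs are ours and purely hypothetical, not the authors' method.

def compare_sources(transcript_functions, checklist_functions):
    """Contrast the functions identified in a transcript (the 'T' entries
    of Appendix 5) with those recorded on the real-time checklists.
    Both arguments are sets of function names."""
    return {
        "both": transcript_functions & checklist_functions,
        "transcript only": transcript_functions - checklist_functions,
        "checklist only": checklist_functions - transcript_functions,
    }

# Hypothetical results for a single task:
transcript = {"Expressing opinions", "Speculating", "Initiating"}
checklist = {"Expressing opinions", "Initiating", "Agreeing"}
for relation, functions in compare_sources(transcript, checklist).items():
    print(relation, sorted(functions))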
