
iTEP Academic
VALIDITY & RELIABILITY REPORT


TABLE OF CONTENTS

About this Report

Chapter 1: Introduction
    Approach and Rationale for the Development of iTEP Academic

Chapter 2: Detailed Description of iTEP Academic
    Theoretical Model for Language Assessment
    Description of iTEP Academic Sections
        Grammar Section
        Listening Section
        Reading Section
        Writing Section
        Speaking Section
    Test Administration
        Delivery Method
        Examinee Experience
    Scoring/Grading
    Proficiency Levels

Chapter 3: iTEP Academic Development Process, Reliability, and Validity
    Development Process
    Reliability
        Internal Consistency Reliability
        Test-Retest Reliability
        Rater Agreement
    Validity
        Content Validity
        Convergent and Discriminant Validity

References

Appendix A: Examinee Pre-Assessment Modules and Instructions

Appendix B: iTEP Ability Guide

Appendix C: Summary of Steps to Minimize Content-Irrelevant Test Material


Copyright © iTEP International

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, except as may be expressly permitted by the 1976 Copyright Act or by iTEP in writing.

LIST OF FIGURES

Figure 1. Continuous Cycle of Item Development

LIST OF TABLES

Table 1. Internal Consistency Reliability Estimates for Relevant iTEP Academic Sections

Table 2. Test-Retest Reliability Estimates for iTEP Academic

Table 3. Raw Rater Agreement Analysis – Speaking Section

Table 4. Raw Rater Agreement Analysis – Writing Section

Table 5. rWG Rater Agreement Statistics – Speaking Section

Table 6. rWG Rater Agreement Statistics – Writing Section

Table 7. iTEP Academic Section Intercorrelations


ABOUT THIS REPORT

Purpose
The purpose of this report is to describe the academic details and rationale for the development of iTEP Academic and to summarize the reliability and validity of the assessment.

Acknowledgments
The data analyzed for this report were provided in raw form by iTEP International to Dr. Stephanie Seiler, Independent Consultant. The data were not manipulated in any way except according to accepted statistical procedures and/or other rationale described in this report. Dr. Seiler conducted all analyses and wrote the report. Dr. Jay Verkuillen and Dr. Rob Stilson provided statistical consultation.

About the Author
Stephanie Seiler holds a PhD in Industrial and Organizational Psychology from the University of Illinois at Urbana-Champaign. Dr. Seiler worked in the personnel selection industry for five years as a Selection and Assessment Manager and then as the Director of Research and Development for a leading personnel selection company. In these roles, she was responsible for customer research, consulting, implementation of assessments, and development of new and innovative assessments. Stephanie has published and presented in the areas of educational and personnel assessment, innovative assessment methodologies, and research ethics. She is currently operating as an independent assessment consultant and is a Lead Research Associate and Product Developer with the National Center for Professional and Research Ethics (NCPRE) at the University of Illinois at Urbana-Champaign.

About iTEP International
iTEP International, founded by career international educators, developed the iTEP suite of examinations to provide institutions and individual test-takers with an efficient, secure, accurate, and affordable on-demand language proficiency assessment.

There are currently seven versions of iTEP available: iTEP Academic, iTEP SLATE (Secondary Level Assessment Test of English), iTEP Business, iTEP Au Pair, iTEP Intern, iTEP Hospitality, and iTEP Conversation. All seven exams have the same basic structure, standardized rubric scoring, and administration procedures. The exams assess all or some of the following five components of English language proficiency: Grammar, Listening, Reading, Writing, and/or Speaking.


CHAPTER 1

INTRODUCTION

The International Test of English Proficiency - Academic (iTEP Academic), developed and published by iTEP International, is a multimedia assessment that evaluates the English language proficiency of English as a Second Language (ESL) college applicants and students.

iTEP Academic is commonly used for:

• Making admissions decisions

• Placing students within language programs

• Guiding course instruction and curriculum development

• Evaluating pre- and post-course progress

• Determining eligibility for scholarships

iTEP Academic is also used to assess the proficiency of English language teachers and to support government initiatives and official certifications.

In order to target the level and type of English proficiency needed to be a successful college student, the content of iTEP Academic is tailored to reflect the academic and life experiences of individuals who attend or plan to attend college. iTEP Academic does not require any specialized academic or cultural knowledge, so it is well-suited for testing in any academic discipline. The assessment evaluates examinees' ability to apply their English knowledge and skill to process, learn from, and respond appropriately to new information that is presented in English. iTEP Academic is delivered over the Internet at secure Certified iTEP Test Centers around the world. Examinees can schedule a testing date within three business days of contacting the test center.

There are two versions of iTEP Academic:

• iTEP Academic-Core: assesses Grammar, Listening, and Reading and is 50 minutes in length, with an additional 10 minutes for pre-test preparation. Results are available immediately.

• iTEP Academic-Plus: assesses Grammar, Listening, Reading, Writing, and Speaking and is 80 minutes in length, with an additional 10 minutes for pre-test preparation. Results are available within one business day.

iTEP automatically emails the examinee's official score report to the client. An online iTEP client account provides a variety of tools for managing results.


Approach and Rationale for the Development of iTEP Academic
College is a social and communicative experience. Whether the student is listening to a lecture, writing a paper, reading exam instructions, working on a group project, or making a purchase at the bookstore, the ability to understand and use the college's primary language is a fundamental prerequisite for the student to succeed. Though success in college can also depend on factors that have little direct link to language, such as intelligence, motivation, self-discipline, and physical and emotional health, these will have little use for the student if he/she is unable to process, learn from, and respond to information.

iTEP Academic was designed and developed to provide English language proficiency scores that are valid for many types of educational decision making. The developers of iTEP Academic recognized that in order to thoroughly evaluate English proficiency, the assessment needed to include items that evaluated both written and spoken language, as well as the examinee's grasp of English grammar. In addition, iTEP developers made the distinction between receptive language skills (i.e., listening and reading) and expressive language skills (i.e., writing and speaking). Assessment items that measure an examinee's ability to express ideas in English were developed for inclusion in iTEP Academic-Plus.

When language proficiency is measured accurately, reliably, and comprehensively, educators or administrators can use examinees' scores on the assessment to make more rigorous, evidence-based decisions. iTEP Academic was developed with these goals in mind. Furthermore, iTEP Academic uses the best technology available and on-demand support to help ensure an engaging, user-friendly examinee and administrator experience.

iTEP is recognized by the Academic Credentials Evaluation Institute (ACEI) as an approved internationally regarded English proficiency exam that meets institutional standards. ACCET, the U.S.-based Accrediting Council for Continuing Education and Training, has determined that iTEP satisfies the requirement of a nationally recognized English assessment exam to validate Intensive English Program (IEP) curricula. In addition, iTEP International is committed to actively engaging with the international education community through memberships and affiliations with NAFSA, EnglishUSA, TESOL, ACEI, ACCET, and AISAP.


CHAPTER 2

DETAILED DESCRIPTION OF iTEP ACADEMIC

Theoretical Model for Language Assessment
Traditionally, language researchers and educators have grouped language skills into four distinct categories (Listening, Reading, Writing, and Speaking), and from a commonsense perspective this categorization is no surprise, as each of these elements of communication refers to a distinct set of activities and knowledge used for distinct purposes. In addition, it is common for a distinction to be made between language skills and language knowledge (e.g., grammar and vocabulary) (Bachman, 1990).

On the surface, the Listening, Reading, Writing, and Speaking sections of iTEP Academic align with the traditional categorization of language skills, and the Grammar section aligns with the notion of language knowledge. Listening scores reflect the ability to comprehend spoken language, Grammar scores reflect the knowledge of correct grammar, and so on. Additionally, practical considerations clearly warrant testing across multiple competency areas. In the case of admissions or certification, use of multiple measures helps ensure content coverage (measurement breadth) across the most critical elements of language; in the case of placement or program evaluation, multiple measures help pinpoint different areas of an examinee's strengths and weaknesses.

The traditional categorization of language into skills and knowledge domains may seem to suggest that each iTEP Academic section measures an isolated language capability; however, modern theories of language emphasize the interrelatedness of language knowledge and skill and the practical fact that any attempt to measure a single component of language will likely be confounded by other language skills that are necessary to answer the question (e.g., an evaluation of reading proficiency requires knowledge of grammar, sentence structure, vocabulary, etc.). In addition, these theories emphasize that one must consider the context in which the communication occurs; communication in a casual setting is likely to involve a different set of competencies, and a different judgment of effectiveness, than communication in an academic or business setting. These modern theories suggest that in practice, language effectiveness must be evaluated in the situational context for which the assessment is to be used (Association of Language Testers in Europe (ALTE), 2011; Bachman, 1990). Plainly stated, a language assessment should represent the real-world use of language. iTEP Academic aligns with best practices in language assessment by evaluating one's ability to communicate effectively in the context of common scenarios that are encountered in college.


Description of iTEP Academic Sections

Grammar Section
The ability to understand and use a language's grammar rules correctly is an important component of effective communication. Grammar does not need to be perfect in order for someone to comprehend the meaning of a statement, yet as the number of grammatical errors increases, the likelihood that the information will be conveyed incorrectly also increases. Still higher standards for grammatical correctness are present within most academic settings.

The iTEP Academic Grammar section evaluates an examinee's understanding of and ability to use proper English grammar. It comprises 25 multiple-choice questions, each of which tests the examinee's familiarity with a key feature of English structure (e.g., use of the correct article, verb tense, modifier, or conjunction; identifying the correct sentence structure, pronoun, or part of speech). The Grammar section includes a range of sentence structures from simple to more complex, as well as both beginning and advanced vocabulary. The first 13 questions require the examinee to select the word or phrase that correctly completes a sentence, and the next 12 questions require the examinee to identify the word or phrase in a sentence that is grammatically incorrect. Each of the two question types is preceded by an on-screen example.

The Grammar section takes 10 minutes to complete.

Sample Grammar Item


Listening Section
The ability to comprehend spoken information is of central importance within an academic setting and is vital to navigating the social aspects of college life. The typical model for a college course, particularly during the first two years of coursework, involves students attending lectures. The iTEP Academic Listening section evaluates an examinee's proficiency in understanding spoken English information. In this section, the examinee listens to two types of spoken information: (1) a short conversation between two speakers; and (2) a brief lecture on an academic topic. After listening to the conversation or lecture, the examinee is presented with a question (orally and in writing) that measures several key indicators of whether the information was understood. These indicators include: identifying the primary subject of the conversation or lecture (Main Idea), recalling important points (Catching Details), understanding why a particular statement was made (Determining the Purpose), inferring information based on contextual information (Making Implications), and determining the relationship between key pieces of information (Connecting Content).

To ensure realism in the Listening section, item writers take steps to ensure that the content reflects a conversational tone. In addition, while the examinee listens to each audio file, a static image of the speaker(s) is presented on-screen.

The Listening section takes 20 minutes to complete and consists of three parts:

Part 1: Four high-beginning to low-intermediate-difficulty level conversations of two to three sentences, each followed by one multiple-choice question

Part 2: One two- to three-minute intermediate-difficulty level conversation followed by four multiple-choice questions

Part 3: One four-minute lecture followed by six multiple-choice questions

Sample Listening Item:

Transcript of audio played to examinee [text is for demonstration in this report and is not presented to the examinee]

Male Student

"Hi Tara. Did you hear that Professor Johnson's biology class was canceled? He moved the quiz to next week."

Tara

"No, I didn't. Thanks for telling me. That will give me more time to write my history report and finish my math homework."


Reading Section
Along with listening, the ability to comprehend written information is critical both for effective learning in an academic setting and for navigating college life in general. Course lectures are typically paired with required textbooks or other reading materials, and students are frequently evaluated on their recall and understanding of both the lectures and the readings. Additionally, the typical examination in lower-level college courses involves written materials such as multiple-choice questions.

The iTEP Academic Reading section evaluates an examinee's level of reading comprehension by measuring several key indicators of whether a written passage was understood. These indicators include: identifying the significant points and main focus of the written passage (Catching Details and Main Idea, respectively), determining what a word means based on its context (Vocabulary), and understanding why a particular statement within a larger passage was written by connecting together relevant information (Synthesis). In addition, the Reading section evaluates the examinee's understanding of how a paragraph should be constructed in order to properly convey information (Sequencing). Sequencing items require the examinee to read a paragraph and determine where a new target sentence should be placed based on the surrounding content.

The Reading section takes 20 minutes to complete and consists of two parts:

Part 1: One intermediate reading level passage about 250 words in length, followed by four multiple-choice questions

Part 2: One upper reading level paragraph about 450 words in length, followed by six multiple-choice questions

Sample Reading Item:


Writing Section
In addition to evaluating speaking, iTEP Academic-Plus also evaluates the examinee's English writing ability.

During the Writing section of the assessment, the examinee reads a question and then writes a response. The responses are submitted for later evaluation by a trained iTEP rater.

The Writing section takes 25 minutes to complete and consists of two parts:

Part 1: The examinee is given five minutes to write a 50-75 word note, geared at the low-intermediate level, on a supplied topic.

Part 2: The examinee is given 20 minutes to write a 175-225 word piece expressing and supporting his or her opinion on an upper-level topic.

Sample Writing Item:


Speaking Section
Both writing and speaking in a new language are often considered more advanced skills, developed after the individual has acquired a basic grasp of the language's grammar and vocabulary and learned to apply this knowledge to comprehend written and spoken information. The longer version of iTEP Academic, iTEP Academic-Plus, evaluates the examinee's English speaking ability (along with writing ability as previously described).

During the Speaking section of the assessment, the examinee listens to and reads a prompt (either a question or a brief lecture), and then prepares an oral response. The examinee then records his/her response for later evaluation by a trained iTEP rater.

The Speaking section takes five minutes to complete and consists of two parts:

Part 1: The examinee hears and reads a short question geared at the low-intermediate level, then has 30 seconds to prepare a spoken response and 45 seconds to speak.

Part 2: The examinee hears a brief upper-level statement presenting two sides of an issue, then is asked to express his or her thoughts on the topic with 45 seconds to prepare and 60 seconds to speak.

Sample Speaking Item:


Test Administration

Delivery Method
iTEP Academic-Plus is administered via the Internet. Items are administered to examinees at random from a larger item bank, according to programming logic and test development procedures that ensure each examinee receives an overall examination of comparable content and difficulty to other examinees.
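The operational selection logic itself is not described in this report. As an illustration only, the following minimal sketch (with hypothetical item fields and a hypothetical blueprint) shows how a stratified random draw from an item bank can give every examinee a different set of items while holding the overall content and difficulty mix constant.

```python
import random

def assemble_form(item_bank, blueprint, seed=None):
    """Illustrative sketch only, not iTEP's published algorithm.

    item_bank: list of dicts such as {"id": "G-0412", "skill": "grammar", "difficulty": "medium"}
    blueprint: dict mapping (skill, difficulty) -> number of items to draw
    """
    rng = random.Random(seed)
    form = []
    for (skill, difficulty), count in blueprint.items():
        pool = [item for item in item_bank
                if item["skill"] == skill and item["difficulty"] == difficulty]
        if len(pool) < count:
            raise ValueError(f"Not enough {skill}/{difficulty} items in the bank")
        form.extend(rng.sample(pool, count))   # random draw within each stratum
    rng.shuffle(form)                          # randomize presentation order
    return form

# Hypothetical blueprint: every form gets the same difficulty mix, so forms
# differ in the specific items drawn but not in overall composition.
grammar_blueprint = {("grammar", "easy"): 8, ("grammar", "medium"): 10, ("grammar", "hard"): 7}
```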

A static paper-and-pencil version of iTEP Academic-Core is also available.

iTEP Academic must be administered at a secure location or a Certified iTEP Test Center.

The examinee inputs responses to the test in the following manner:

• During the Grammar, Listening, and Reading sections, the examinee selects from a list of multiple-choice options for each question.

• Writing samples are keyboarded directly into a text entry field.

• Speaking samples are recorded with a headset and microphone at the examinee's computer.

Examinee Experience
Prior to the start of the test, the examinee logs in and completes a registration form. The system guides the examinee through a series of steps to ensure technical compatibility and to prepare him/her for the format of the assessment.

Each section/skill has a fixed time allotted to it. In the Grammar and Reading sections, examinees can advance to the next section if there is time remaining, or they are free to use any extra time to review and revise their answers. In the Listening section, the prompts each play only once, and once submitted, an item response cannot be reviewed or changed. In the Writing section, there are fixed time limits for each part, but examinees may advance to the next section before time expires. In the Speaking section, there are fixed time limits for each part and examinees cannot advance until time expires.

The directions for each section are displayed for a set amount of time, and are also read aloud. The amount of time instructions are displayed varies according to the amount of text to be read. If an examinee needs more time to read a particular section's directions, he or she can access them by clicking the Help button, which displays a complete menu of directions for all test sections.

Following each section of the test, examinees see a transition screen indicating which section will be completed next. This transition screen provides a 15-second break between sections, and displays a progress bar showing completed and remaining test sections. After the last test section is completed, examinees see a final screen telling them the test is complete and to wait for further directions from the administrator.

Screenshots of the examinee experience, including pre-assessment modules and instructions, are shown in Appendix A.


Scoring/Grading
iTEP Academic computes an overall proficiency level from 0 (Beginner) to 6 (Mastery), as well as individual skill proficiency levels from 0 to 6 for each skill. Linguistic sub-skill scores are also provided (e.g., parts of speech, synthesis, main idea) in order to give a more detailed picture of the examinee's level in the Grammar, Listening, and Reading sections. The Overall score represents the combination of scores across each skill; for greater accuracy, Overall scores are reported to one decimal point (e.g., 0.0, 0.1, 0.2, ..., 5.9, 6.0).

iTEP Academic is graded as follows:

• The Grammar, Listening, and Reading sections are scored automatically by the computer. Each response is worth one point. There is no penalty for guessing.

• The Writing and Speaking sections are evaluated by native English-speaking, ESL-trained professionals, according to a standardized scoring rubric (see Appendix B). Raters attend refresher training sessions throughout the year to ensure continued adherence to the rubric.

• For computing the Overall score, each test section is weighted equally (a minimal illustration follows this list).

• The official score report presents an individual's scoring information in both tabular and graphical formats. The graphical format, or skill profile, is particularly useful for displaying an examinee's strengths and weaknesses in each of the skills evaluated.
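To make the equal-weight combination concrete, here is a minimal sketch; the simple mean and the rounding rule are assumptions for illustration, not the published scoring algorithm.

```python
def overall_score(section_scores):
    """Combine section proficiency levels (each 0-6) into an Overall score.

    Illustrative assumption: an equal-weight mean reported to one decimal
    place, as described in the text; the operational computation may differ.
    """
    return round(sum(section_scores) / len(section_scores), 1)

# Example: Grammar 4, Listening 5, Reading 3, Writing 4, Speaking 4 -> 4.0
print(overall_score([4, 5, 3, 4, 4]))
```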

Proficiency Levels
The seven iTEP Academic proficiency levels may be expressed briefly as follows:

Level 0: Beginning

Level 1: Elementary

Level 2: Low Intermediate

Level 3: Intermediate

Level 4: High Intermediate

Level 5: Low Advanced

Level 6: Advanced

iTEP has mapped iTEP Academic Proficiency Levels to the levels described in the Common European Framework of Reference for Languages (CEFR; see Appendix B).


CHAPTER 3

iTEP ACADEMIC DEVELOPMENT PROCESS, RELIABILITY, AND VALIDITY¹

Development Process
iTEP International adheres to a continuous cycle of item analysis (see Figure 1) to ensure the content of the assessment adheres to the reliability and validity goals of the assessment. The cycle begins with item writing, enters an expert review and content analysis stage, and then works through a number of statistical analyses to evaluate the difficulty level and other psychometric properties of the item. Items that do not meet quality standards during the content analysis and/or statistical analysis phase are either removed from further consideration, or repurposed if it is determined that minor adjustments will improve the item. Items that meet quality standards during the content analysis and statistical analysis phases are retained in the assessment. In order to maintain a secure assessment and minimize the likelihood of an item being shared among examinees over time, all items used in the assessment are retired after a certain length of time. Items may also be identified as having "drifted" in difficulty over time, indicating that the item may have been compromised; these items are retired immediately upon identification.

¹ All analysis and evaluation of iTEP Academic as described in Chapter 3 was conducted in accordance with the Standards for Educational and Psychological Testing (hereafter Standards; American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014), the Uniform Guidelines on Employee Selection Procedures (Equal Employment Opportunity Commission, 1978), and the Principles for the Validation and Use of Personnel Selection Procedures (Society for Industrial and Organizational Psychology, 2003).

Figure 1. Continuous Cycle of Item Development
[Diagram: Write Item → Content Analysis → Statistical Analysis → Keep Item, Repurpose Item, or Retire Item]


Reliability
The reliability of an assessment refers to the degree to which the assessment provides stable, consistent information about an examinee. Demonstrating reliability is important because if a test is not stable and consistent (whether across the items in the assessment, across repeated administrations of the assessment, or based on performance scores provided by trained raters), then the results cannot be relied upon as accurate. Moreover, the reliability of an assessment theoretically sets a maximum limit for its validity; when an assessment is not consistent, it is less effective as an indicator of a person's true ability and will therefore demonstrate lower correlations with relevant outcomes (such as grades, academic adjustment, or attrition).

Internal Consistency Reliability

Internal consistency reliability refers to the stability of the items within a particular assessment, or in this case, within each assessment section. When it can be shown that the items are statistically related to each other, the case can be made that the assessment is consistent in its measurement. Cronbach's alpha (Cronbach, 1951) is a commonly used and accepted classical test theory (CTT) statistic that is used to estimate internal consistency reliability. The statistic reflects the average correlation between all items within an assessment or assessment section. Values of .70 or above have traditionally been considered desirable, with some scholars stating that test developers should aim to develop tests with values of at least .80 or even .90 and higher. These benchmarks are general rules and do not take into account other desirable characteristics of an assessment, such as assessment brevity to minimize testing time (Gatewood & Field, 2001), the breadth of content coverage within the assessment (to ensure that a large domain of the characteristic being measured is represented) (Loevinger, 1954), and the validity of the assessment (Nunnally & Bernstein, 1994). Test developers must think critically about the interrelated factors influencing test reliability and validity and use their best judgment when deciding what should be considered acceptable (Gatewood & Field, 2001).

Because the calculation of internal consistency reliability requires that the assessment section contain multiple items, this class of statistics is appropriate for the Grammar, Listening, and Reading sections of iTEP Academic; calculation of internal consistency reliability is not possible for the Speaking and Writing sections, as trained raters provide only one summary score for each of these sections based on the examinee's overall Speaking or Writing performance.

Within the Grammar, Listening, and Reading sections of iTEP Academic, the set of items administered to each examinee is selected at random from a larger item bank; therefore, the traditional CTT calculation of Cronbach's alpha is not possible. In order to compute an internal consistency reliability estimate for each scale, the following procedure was used to derive an estimate that can be interpreted in a manner similar to Cronbach's alpha. The procedure relies on statistics derived from item response theory (IRT), a class of statistical models that are particularly suited to handling randomly administered items.


1. For each section, compute the IRT common discrimination parameter using the 1-Parameter Logistic model (1PL). The common a parameter reflects the average extent to which each item provides statistical information that distinguishes lower-performing examinees from higher-performing examinees. The a parameter is in concept most similar to an item-total correlation from classical test theory.

2. Use the a parameter estimate to compute an intraclass correlation coefficient (ICC). The formula is:

$$\mathrm{ICC} = \frac{a^2}{a^2 + \pi^2/3}$$

3. The resulting value of the ICC reflects the average internal consistency reliability for any one item in the sections, and therefore the final internal reliability estimate (α) must be "stepped up" using the Spearman-Brown prophecy formula to reflect the reliability of the total sections. The Spearman-Brown prophecy method is the same method that would be used to examine the impact of shortening or lengthening a test (for example, cutting a 50-item test in half). The Spearman-Brown prophecy formula is:

$$\alpha = \frac{K \cdot \mathrm{ICC}}{1 + (K - 1) \cdot \mathrm{ICC}}$$

where K is a scaling factor reflecting the proportional increase or decrease in the number of test items. In the current case, K is the number of items in the sections.
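As a numerical check on this procedure, the following minimal sketch applies the two formulas to the rounded values reported in Table 1; because the published a and ICC values are rounded, the stepped-up estimates agree with Table 1 only to within rounding.

```python
import math

def icc_from_discrimination(a):
    """Single-item reliability implied by a common 1PL discrimination a:
    ICC = a^2 / (a^2 + pi^2 / 3)."""
    return a ** 2 / (a ** 2 + math.pi ** 2 / 3)

def spearman_brown(icc, k):
    """Step a single-item reliability up to a k-item section:
    alpha = k * ICC / (1 + (k - 1) * ICC)."""
    return k * icc / (1 + (k - 1) * icc)

# Stepping up the rounded ICCs reported in Table 1:
for section, icc, k in [("Grammar", .25, 25), ("Listening", .21, 14), ("Reading", .22, 10)]:
    print(section, round(spearman_brown(icc, k), 2))
# Prints approximately 0.89, 0.79, 0.74 (Table 1 reports .89, .78, and .74;
# the small Listening difference reflects rounding of the reported ICC).
```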

The internal consistency reliability results, which can be interpreted as conceptually similar to Cronbach's alpha estimates, were computed for a sample of over 17,000 examinees who completed iTEP Academic between 2014 and 2016.² The results are provided in Table 1. As shown, all values exceed the .70 benchmark, and the Grammar estimate exceeds the .80 benchmark.

Table 1. Internal Consistency Reliability Estimates for Relevant iTEP Academic Sections

Section     Number of Items   Discrimination (a)   Intraclass Correlation (ICC)   Internal Consistency Reliability (α)
Grammar     25                1.08                 .25                            .89
Listening   14                0.85                 .21                            .78
Reading     10                0.93                 .22                            .74

Note: The sample size for the analysis was N = 17,731. The internal consistency reliability estimates are not Cronbach's alpha values, but can be interpreted in a similar manner to Cronbach's alpha.


² All examinee data provided by iTEP were included in the analysis, with the exception of the following: (1) when a unique identifier indicated that the data were for an examinee re-testing, only the examinee's first testing occasion was included; (2) if the examinee timed out on any section without seeing one or more of the items, the examinee was removed; and (3) examinees younger than 14 years of age were removed. Examinee non-responses to items that were seen but not answered were scored as incorrect.


Test-Retest Reliability
Test-retest reliability refers to the stability of test scores across repeated administrations of the test. A high level of test-retest reliability indicates that the examinee is likely to receive a similar score every time he or she takes it, assuming the examinee's actual skill in the domain being measured has not changed. Test-retest reliability estimates for all iTEP Academic sections, and the Overall score, were computed using a sample of 198 examinees who took iTEP Academic twice in an operational environment (i.e., at a testing center for college admissions purposes). Analyses were restricted to examinees with at least 5 days and less than 2 months between testing occasions (average time elapsed: 24 days).
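A minimal sketch of this computation follows; the column names and data handling are hypothetical and simplified, not a reproduction of the analysis dataset.

```python
import pandas as pd

def test_retest_r(df):
    """Pearson correlation between first and second administrations, restricted
    to retests occurring 5 days to roughly 2 months apart.

    Assumed (hypothetical) columns: examinee_id, test_date (datetime), score.
    """
    df = df.sort_values(["examinee_id", "test_date"])
    df["occasion"] = df.groupby("examinee_id").cumcount()      # 0 = Time 1, 1 = Time 2
    wide = df[df["occasion"] < 2].pivot(index="examinee_id",
                                        columns="occasion",
                                        values=["score", "test_date"])
    elapsed = (wide[("test_date", 1)] - wide[("test_date", 0)]).dt.days
    wide = wide[(elapsed >= 5) & (elapsed <= 60)]               # keep 5-day to ~2-month retests
    return wide[("score", 0)].corr(wide[("score", 1)])          # Pearson r
```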

The test-retest values shown in Table 2 reflect the correlation between the Time 1 and Time 2 scores for the sample. Values can range from -1.0 to 1.0, with values at or exceeding .70 typically considered desirable. As can be seen, only the Overall score exceeds this threshold. However, it should be noted that the sample used to compute the test-retest correlations was an operational sample, and it could reasonably be assumed that at least some of the sample had worked diligently to improve their performance between Time 1 and Time 2 testing occasions; given the number of days between test administrations for the sample (up to 2 months; 24 days on average), this seems very likely. Had the test-retest estimates been computed on a research sample and/or if the sample size of available data allowed for the analysis of a shorter time period between testing occasions, the correlations would likely be higher. Therefore, the values given in Table 2 can be considered lower-bound estimates of the true test-retest reliability of iTEP Academic.

Table 2. Test-Retest Reliability Estimates for iTEP Academic

Section Test-Retest Reliability

Grammar .63

Listening .49

Reading .48

Writing .62

Speaking .64

Overall – Plus .77

Overall – Core .71

Note: The sample size for the analysis was N = 198. The Overall – Core score was approximated by removing the Speaking and Writing section scores from the Overall scores of examinees who completed the longer iTEP Academic-Plus.


Rater Agreement
The iTEP Academic Speaking and Writing sections are evaluated by a trained rater and, as such, it is necessary to estimate the accuracy of these judgments, specifically the extent to which the scores given by a rater are interchangeable with the scores of another. Evaluations of rater agreement, as opposed to rater reliability, are more appropriate in cases where the examinee's absolute score is of interest rather than the examinee's rank order position relative to other examinees (LeBreton & Senter, 2008).

Tables 3 and 4 summarize a raw investigation of rater agreement using a sample of Speaking and Writing ratings from six examinees obtained from eight raters during a training exercise. The examinees completed either iTEP Academic or iTEP SLATE.

It should be noted that the results in Tables 3-6 likely reflect a lower-bound estimate of rater agreement, as the cases used for the training exercise were purposely selected to be more challenging to rate than a typical case.

Table 3. Raw Rater Agreement Analysis – Speaking Section

Examinee   Average Score   R1    R2    R3    R4    R5    R6    R7    R8    Avg. Dev.   Max. Dev.
E1         1.79            .04   .54   .29   .04   .29   .46   .71   –     .34         .71
E2         4.88            .63   .38   .63   .38   .13   .38   .63   .88   .50         .88
E3         2.53            .28   .97   .03   .22   1.53  .22   .72   .28   .53         1.53
E5         4.03            .22   .72   .53   .53   .22   .03   .53   .47   .41         .72
E6         3.50            .50   .00   1.50  –     –     –     .25   .75   .60         1.50
Average    3.34            .33   .52   .59   .29   .54   .27   .57   .59   .47         1.07

Note: No Speaking section ratings were provided for Examinee 4 due to a technical issue with the audio recording. The missing values occurred because the rater(s) did not provide a rating. Average Score: the examinee's average rating across all eight raters. R1-R8 (Rater Deviations from Average Score): the absolute value of the difference between each rater's score and the Average Score for each examinee. Avg. Dev. (Average Deviation): average Rater Deviation for each examinee. Max. Dev. (Max Deviation): highest Rater Deviation value that was observed across all eight raters.


Table 4. Raw Rater Agreement Analysis – Writing Section

Examinee   Average Score   R1    R2    R3    R4    R5    R6    R7    R8    Avg. Dev.   Max. Dev.
E1         1.79            .04   .71   .29   .79   .04   .29   .71   –     .41         .79
E2         3.81            .56   .69   .06   .31   .44   .56   .06   .44   .39         .69
E4         3.43            .32   –     .07   .18   .18   .32   .18   .18   .20         .32
E5         4.41            .16   .66   .59   .09   .09   .09   .16   .09   .24         .66
E6         3.60            .15   .40   .85   –     –     –     .15   .15   .34         .85
Average    3.41            .25   .61   .37   .34   .19   .32   .25   .21   .32         .66

Note: No Writing section ratings were provided for Examinee 3. The missing values occurred because the rater(s) did not provide a rating. Average Score: the examinee's average rating across all eight raters. R1-R8 (Rater Deviations from Average Score): the absolute value of the difference between each rater's score and the Average Score for each examinee. Avg. Dev. (Average Deviation): average Rater Deviation for each examinee. Max. Dev. (Max Deviation): highest Rater Deviation value that was observed across all eight raters.

As seen in Table 3, in all but two instances the raters' Speaking scores for each examinee deviated less than 1 point from the average rating across all raters (as a reminder, section scores can range from 0 to 6). Across all raters and examinees, the average deviation was .47 points, and the average maximum deviation was 1.07 points. These results suggest a moderately strong agreement across raters.

As seen in Table 4, all of the raters' Writing scores for each examinee deviated less than 1 point from the average rating across all raters. Across all raters and examinees, the average deviation was .32 points, and the average maximum deviation was .66 points. These results suggest a strong agreement across raters.
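The quantities reported in Tables 3 and 4 can be reproduced from a set of ratings with a few lines of arithmetic; the ratings in the sketch below are invented for illustration and are not the training-exercise data.

```python
def deviation_stats(ratings):
    """Raw agreement statistics as defined in the notes to Tables 3 and 4.

    ratings: the scores (0-6 scale) given to one examinee by the raters who
    provided a rating; missing ratings are simply omitted from the list.
    """
    average_score = sum(ratings) / len(ratings)
    deviations = [abs(r - average_score) for r in ratings]   # per-rater absolute deviations
    return (average_score, deviations,
            sum(deviations) / len(deviations),               # Average Deviation
            max(deviations))                                 # Max. Deviation

# Illustrative ratings only (not the actual data behind Tables 3 and 4):
avg, devs, avg_dev, max_dev = deviation_stats([2.0, 1.5, 2.5, 1.5, 2.0, 2.0, 2.5])
```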

Using the same data that were used for Tables 3 and 4, rater agreement was also estimated using a version of the rWG agreement statistic (James, Demaree, & Wolf, 1984). The value of rWG can theoretically range from 0 to 1, and represents the observed variability in scores among raters relative to the amount of variability that would be present if all raters had assigned scores completely at random. The formula for rWG is:

$$r_{WG} = 1 - \frac{S_X^2}{\sigma_E^2}$$

where S²X is the observed variance of ratings on the variable across raters and σ²E is the variance expected if the ratings were completely random.

The specific version of rWG chosen for the analysis uses a value for σ²E that would occur if the raters' completely random scores came from a triangular (approximation of normal) distribution (see LeBreton & Senter, 2008).


The closer an rWG value is to 1, the higher the agreement. There is no agreed-upon minimum value that is considered acceptable for rWG, but as a benchmark, test developers might consider .80 or .90 to be a minimally acceptable value for an application such as assigning ratings based on a score rubric. To put these values in perspective, an rWG of .80 would suggest that 20% (1 – .80) of an average rater's score across examinees was due to error, or factors other than the examinee's "true score" on the exercise.
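A minimal sketch of the rWG calculation follows, using the error variance of 2.1 that Tables 5 and 6 adopt for a triangular null distribution; the ratings themselves are invented for illustration.

```python
from statistics import variance   # sample variance (n - 1 denominator)

def r_wg(ratings, error_variance=2.1):
    """rWG = 1 - (observed variance of the ratings / variance expected under
    completely random rating); 2.1 is the triangular-distribution value used
    in Tables 5 and 6."""
    return 1 - variance(ratings) / error_variance

# Illustrative ratings from eight raters on the 0-6 scale (not the study data):
print(round(r_wg([2, 1, 2, 2, 3, 2, 2, 2]), 2))   # prints 0.86
```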

The rWG agreement statistics are presented in Tables 5 and 6.

Table 5. rWG Rater Agreement Statistics – Speaking Section

Examinee   Observed Variance   Error Variance   rWG
E1         .20                 2.1              .91
E2         .34                 2.1              .84
E3         .58                 2.1              .72
E5         .24                 2.1              .89
E6         .78                 2.1              .63
Average                                         .80

Note: No Speaking section ratings were provided for Examinee 4 due to a technical issue with the audio recording.

Table 6. rWG Rater Agreement Statistics – Writing Section

Examinee   Observed Variance   Error Variance   rWG
E1         .30                 2.1              .86
E2         .23                 2.1              .89
E4         .06                 2.1              .97
E5         .12                 2.1              .94
E6         .24                 2.1              .89
Average                                         .91

Note: No Writing section ratings were provided for Examinee 3.

The results in Table 5 indicate moderately strong agreement amongst the raters. The minimum rWG was observed for Examinee 6, with a value of .63. The average rWG across all examinees was .80, indicating that 20% of the average rater's score across examinees was due to factors other than the examinee's "true score" on the exercise.

The results in Table 6 indicate strong agreement amongst the raters. The minimum rWG was observed for Examinee 1, with a value of .86. The average rWG across all examinees was .91, indicating that only 9% of the average rater's score across examinees was due to factors other than the examinee's "true score" on the exercise.

Overall, the results of the rater agreement analyses suggest that ratings provided by any one iTEP rater are likely to be a reliable indication of an examinee's actual proficiency on the Speaking and Writing sections.

Validity
The iTEP Academic examination was designed and developed to provide English language proficiency scores that are valid for many types of educational decision making. The Standards define validity as "the degree to which accumulated evidence and theory support specific interpretations of test scores entailed by proposed uses of a test" (AERA et al., 2014, p. 184). In other words, the term validity refers to the extent to which an assessment measures what it is intended to measure. Evidence for validity can, and should, come from multiple lines of investigation that together converge to form a conclusion regarding the relative validity of the assessment, including:

1. Expert judgments regarding the extent to which the content of the assessment reflects the real-world knowledge, skills, characteristics, or behaviors the assessment is designed to measure (Content Validity)

2. An examination of the degree to which the assessment (or assessment section) is correlated with theoretically similar measures and uncorrelated with theoretically unrelated measures (Convergent and Discriminant Validity; traditionally conceived of as the main contributors to Construct Validity³)

3. An examination of the degree to which the assessment is correlated with the real-world outcomes it is intended to measure, for example: adjustment to college, grades, or improvement in language proficiency (Criterion Validity)

Content Validity
Content Validity, or content validation, refers to the process of obtaining expert judgments on the extent to which the content of the assessment corresponds to the real-world knowledge, skill, or behavior the assessment is intended to measure. For example, an assessment that asks questions about an examinee's knowledge of cooking techniques may be judged by experts to be content valid for measuring that aspect of cooking skill, but it would not be content valid for measuring the examinee's athletic ability, even if it turned out that cooking assessment scores were correlated with athletic ability.

According to the Standards (AERA et al., 2014), evidence for assessment validity based on test content can be both logical and empirical and can include scrutiny of both the items/prompts themselves as well as the assessment's delivery method(s) and scoring.

³ The modern conception of Construct Validity refers not just to Convergent and Discriminant Validity, but to the accumulation of all forms of evidence in support of an assessment's validity (AERA et al., 2014).


Content-related validity evidence for iTEP Academic, for the purposes of academic decision-making, can be demonstrated via a correspondence between the assessment's content and relevant college educational and social experiences. To ensure correspondence, developers conducted a comprehensive curriculum review and met with educational experts to determine common educational goals and the knowledge and skills emphasized in curricula across the country. This information guided all phases of the design and development of iTEP Academic.

Content Validity evidence for iTEP Academic is also demonstrated through the use of trained item writers who are experts in the field of education and language assessment and who have substantial experience in item-writing. The content and quality of items submitted by item writers is continually supervised, and feedback is provided in order to ensure ongoing adherence to the content goals of the assessment and to avoid content-irrelevant test material. Some of the critical steps taken to achieve this objective are summarized in Appendix C.

Finally, Content Validity evidence for iTEP Academic is shown via its correspondence with the CEFR framework. iTEP mapped iTEP Academic to the CEFR framework through a process of expert evaluation and judgment on the content of the assessment and associated scores.

Convergent and Discriminant Validity
Convergent and Discriminant Validity evidence is demonstrated through a pattern of high correlations among sections that measure concepts that are known to be closely related, and lower correlations among sections measuring unrelated concepts (AERA et al., 2014). The intercorrelations among iTEP Academic sections are shown in Table 7. The examinee data analyzed are the same as described in the Reliability section.

Table 7. iTEP Academic Section Intercorrelations

Section     Listening   Reading   Writing   Speaking   Overall
Grammar     .59         .57       .66       .57        .83
Listening   –           .55       .57       .54        .80
Reading     –           –         .56       .50        .79
Writing     –           –         –         .82        .86
Speaking    –           –         –         –          .82

Note: N = 16,425 for correlations involving Speaking or Writing; N = 17,760 for all other correlations.

The pattern of correlations within iTEP Academic provides preliminary evidence for the convergent and discriminant validity of the assessment. Overall, the relatively strong correlations between the majority of sections (i.e., in the .50-.60 range) indicate that each scale is likely measuring related components of language proficiency, and the fact that the correlations do not approach 1.0 indicates that each section likely measures a distinct element of proficiency. Compared with the Grammar/Speaking correlation, the higher correlation between Grammar and Writing is conceptually logical given that more weight is placed on Grammar, by design, when iTEP raters evaluate examinees' writing ability than when evaluating their spoken ability. The strong correlation between Speaking and Writing is also to be expected, given that these skills are considered more advanced demonstrations of language proficiency that require expressive, as opposed to receptive, language skills.

In addition to the internal examination of convergent and discriminant validity within the iTEP Academic scales, preliminary analyses conducted by an iTEP partner suggested a .93 correlation between iTEP scores and TOEFL® scores. The correlation indicates that iTEP scores are closely aligned with those of other language proficiency tests.


REFERENCES

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Association of Language Testers in Europe (2011). Manual for language test development and examining. Cambridge: ALTE.

Bachman, L. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.

Equal Employment Opportunity Commission, Civil Service Commission, Department of Labor, and Department of Justice. (1978). Uniform guidelines on employee selection procedures.

Gatewood, R. D., & Field, H. S. (2001). Human resource selection (5th ed.). Ohio: South-Western.

James, L. R., Demaree, R. G., & Wolf, G. (1984). Estimating within-group interrater reliability with and without response bias. Journal of Applied Psychology, 69, 85-98.

LeBreton, J. M., & Senter, J. L. (2008). Answers to 20 questions about interrater reliability and interrater agreement. Organizational Research Methods, 11, 815-852.

Loevinger, J. (1954). The attenuation paradox in test theory. Psychological Bulletin, 51, 493-504.

Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York, NY: McGraw-Hill.

Society for Industrial and Organizational Psychology, Inc. (2003). Principles for the validation and use of personnel selection procedures (4th ed.). Bowling Green, OH: SIOP.


APPENDIX A: EXAMINEE PRE-ASSESSMENT MODULES AND INSTRUCTIONS

1. Candidate's government-issued photo ID is required and will be verified before beginning the test.

2. The iTEP Administrator will verify that all information provided on the Registration Form is identical to the Candidate's official ID document(s).

3. Reference materials/tools and other personal effects (e.g., dictionaries, mobile phones, audio recording devices, pagers, notepaper, etc.) are not permitted in the room during the test.

4. Smoking, eating, or drinking is not permitted in the room during the test.

5. The iTEP Administrator reserves the right to dismiss a Candidate from the test or declare a Candidate's test results void if the Candidate violates any of the above conditions or fails to follow the Administrator's instructions during the test.

6. If for technical or any other reasons a given test is not able to be completed or results cannot be provided, iTEP International's and the iTEP Administrator's liability shall be limited to providing a refund of fees received for said test and, at the Candidate's request, rescheduling a replacement test.


APPENDIX B: iTEP ABILITY GUIDE

The ability guide maps iTEP score bands to CEFR levels and describes typical Listening, Reading, Writing, and Speaking performance at each band.

iTEP 5.5-6.0 / CEFR C2 (Mastery)

• Comprehends overall meaning and virtually all details of lectures on diverse topics
• Understands English spoken in a variety of non-native accents
• Comprehends virtually all aspects of a wide variety of academic material for non-specialists
• Reads at near-native speed
• Rarely requires use of dictionary
• Writes complex documents such as research reports using appropriate style and vocabulary
• Grammar and orthographic accuracy is at near-native level
• Expresses complex relationships between ideas
• Communicates accurately and effectively on practically all academic and social topics in culturally appropriate ways
• Pronunciation is close to that of native speakers

iTEP 4.5-5.4 / CEFR C1 (Advanced)

• Identifies attitude and purpose of speakers
• Grasps main ideas and the majority of supporting details from academic lectures
• Is challenged by complex social and cultural references
• Understands main ideas and most of the details of academic texts, journal articles, and abstracts
• Requires little extra reading time
• Vocabulary is strong in specialty
• Satisfies demands of most general academic tasks with occasional grammar and style mistakes
• Exhibits fairly good organization and development
• Vocabulary is strong in specialty
• Satisfies demands of most general academic tasks with occasional grammar and style mistakes
• Exhibits fairly good organization and development

iTEP 3.5-4.4 / CEFR B2 (Upper Intermediate)

• Identifies main ideas and details in conversation
• Occasionally needs to ask for repetition or clarification
• Begins to determine the attitudes of speakers
• Understands main ideas from academic lectures, but misses significant details
• Utilizes contextual and syntactic clues to interpret meaning of complex sentences and new vocabulary
• Gathers most main ideas from textbooks and articles, but has an uneven grasp of details
• Misinterprets some abstract content and cultural references
• Writes reasonably coherent essays on familiar topics, but with some grammatical weakness
• Does not have complete grasp of stylistic features
• Vocabulary frequently lacks precision and sophistication
• Begins to express abstract concepts, especially on familiar topics
• Fluency is occasionally hampered by gaps in vocabulary and grammar
• Expresses viewpoints in fairly long stretches of discourse
• Sometimes is asked to repeat words or phrases


iTEP 2.5-3.4 / CEFR B1 (Intermediate)

• Grasps the general outline of topics discussed in an academic setting
• Unfamiliarity with complex structures and higher-level vocabulary leaves major gaps in understanding
• Limited vocabulary impedes speed
• Grasps the gist of material on familiar subjects, and identifies some significant details
• Follows step-by-step instructions in exams, labs, and assignments
• Communicates basic ideas, but with weak organizational structure and grammatical mistakes that sometimes hinder understanding
• Expresses him/herself with some circumlocution on topics such as family, hobbies, work, etc.
• Manages day-to-day communications with peers and instructors, marked by frequent grammar and vocabulary errors
• Pronunciation requires significant effort from listeners

iTEP 2.0-2.4 / CEFR A2 (Elementary)

• Maintains comprehension during conversations on familiar topics
• Relies heavily on non-verbal cues and repetition
• Understands very basic exchanges when spoken slowly using simple vocabulary
• Major vocabulary gaps lead to frequent inaccurate or incomplete comprehension, and slow pace
• Understands simplified material
• Begins to determine the meaning of words by familiar surrounding context
• Limited vocabulary results in repetitive style and simple sentences
• Considerable effort required by the reader to identify intended meaning
• Uses only basic vocabulary and simple grammatical structures
• Generates simple questions, greetings, expressions of needs, and preferences
• Pronunciation requires significant effort from listeners
• Pronunciation often obscures meaning

iTEP 0.0-1.9 / CEFR A1 (Beginner)

• Understands simple greetings, statements, and questions when spoken with extra clarity
• Follows simple familiar instructions
• Frequently requires repetition for comprehension
• Understands a few isolated words or phrases spoken slowly
• Comprehends only highly simplified phrases or sentences
• Recognizes familiar cohesive devices and basic pronouns
• Demonstrates understanding of a few simple grammatical and lexical structures
• Recognizes the alphabet and isolated words
• Writes only short simple sentences, often characterized by errors that obscure meaning
• Provides personal details with correct spelling and can copy familiar words and phrases
• Produces isolated words and phrases
• Capable of short simple presentation on familiar topic
• Responds to simple statements or questions
• Speech is marked with non-native stress and intonation patterns
• Communication is understood for short utterances
• Pauses, false starts, and reformulation are common
• Communicates with single words and short phrases at "survival level"
• Intense listener effort required
• Produces a few isolated words and phrases
• Pronunciation is mostly unintelligible


APPENDIX C: SUMMARY OF STEPS TO MINIMIZE CONTENT-IRRELEVANT TEST MATERIAL

• Implement best practices in item writing to reduce the likelihood that "test wise" test-takers will be able to select the best answer, through cues in the test, without needing to understand the test item itself (for example, by selecting the lengthiest option, eliminating options that are saying the same thing in different ways)

• Avoid content that may influence test-takers' performance on the test; items respect people's values, beliefs, identity, culture, and diversity.

• Topics on which a set of items may be based are submitted by item writers to iTEP; iTEP pre-approves topics prior to item writing

• Assessment content reflects the domain and difficulty of knowledge of someone with the educational level of a high school junior who expects to attend college. The content reflects materials that an examinee would be expected to encounter in textbooks, journals, classroom lectures, extra-curricular activities, and social situations involving students and professors. Items do not reflect specialized knowledge.

• Write items at an appropriate reading level (no higher than grade 12; lower reading level for easier items); avoid words that are used with low frequency

• Test items assess comprehension within the item, as opposed to common knowledge. Passages establish adequate context for the topic, but then go on to introduce material that is not generally known. Examinees should be able to gain sufficient new information from the passage to answer the questions.

• Content does not unduly advantage examinees from particular regions of the world.

iTEP International

+1.818.887.3888

www.iTEPexam.com

© 2016 iTEP International, LLC.
