
Linking Quality of Instruction to Instructionally Sensitive Assessments

Ming-Chih Lan1, Min Li1, Araceli Ruiz-Primo2, Ting Wang1, Michael Giamellaro2, Hillary Mason2

1University of Washington, Seattle

2University of Colorado Denver

Paper Presented at the AERA Annual Meeting, April 2012

Vancouver, Canada

The work reported herein was supported by the National Science Foundation (DRL-0816123). The findings and opinions expressed in the paper are those of the authors and do not reflect the position or policies of the National Science Foundation.

Contact: Ming-Chih Lan: [email protected]


Linking Quality of Instruction to Instructionally Sensitive Assessments

Measurement experts such as Popham (2007) have argued that large-scale tests

used for accountability purposes are instructionally insensitive mainly because very little

of what is taught gets tested. Due to the sampling procedures used for large-scale tests

(Leinhardt, 1983), the test items presented to students do not accurately reflect the

learning that takes place in the classroom (Polikoff, 2010). Thus, these test results tend to

reflect general ability or maturation rather than effective instruction (Wiliam, 2007). As

expressed by Popham (2007),

How can the prospect of annual accountability testing ever motivate educators to

improve their instruction once they realize that better instruction will not lead to

higher test scores? How can officials accurately intervene to improve instruction

on the basis of low scores if those scores really are not a consequence of

ineffective instruction? (p. 147)

In failing to use instructionally sensitive assessments, we cannot adequately monitor the

instructional quality students receive or evaluate the effectiveness of educational reforms.

Thus, we contend that a well-developed assessment to measure student learning should

be sensitive to the quality of instruction that students are exposed to. In this paper we

focus on issues directly related to our conception of instructional sensitivity of

assessment (Ruiz-Primo & Li, 2008): Are test items developed to vary in instructional sensitivity differentially related to the instructional quality observed in videotaped classroom lessons? How test items varying in instructional sensitivity were developed under the DEISA approach is described in the next section.


The study presented in this paper is part of a larger Development and Evaluation

of Instructionally Sensitive Assessment (DEISA, Ruiz-Primo & Li, 2008) project in which

we have developed, revised, and tested an approach to developing and evaluating

instructionally sensitive assessments. In this study, we evaluate the instructionally

sensitive assessments with respect to the quality of instruction teachers provided to their

students. The ultimate goal of the DEISA project is to offer a robust approach to

constructing and evaluating instructionally sensitive assessments that can detect the

impact of effective instruction or instructional programs.

Theoretical Framework

We define the instructional sensitivity of assessments (or instructionally sensitive

assessments) as the extent to which assessment items: (1) represent the intended and

enacted curriculum (i.e., material covered by the assessment has actually been taught);

(2) reflect the quality of the enacted curriculum (i.e., the quality of instruction provided

by teachers); and (3) have formative value (i.e., if items are sensitive to instruction,

teachers must be able to use the assessment information to adjust instruction). In this

study, we focus on the second aspect, instructional quality, because according to the literature it has been studied less than the aspect of content coverage.

Consistent with the approach proposed by Ruiz-Primo et al. (2002) and the

instructional sensitivity notion described above, we claim (Ruiz-Primo & Li, 2008; Ruiz-

Primo, 2010) that assessment evidence can be collected at different distances or

proximities to the enactment of a curriculum or a program students are exposed to (Ruiz-

Primo et al., 2002). Based on the idea of instructionally sensitive assessment, three types

of test items were developed to explore their relationships with quality of instruction in

the study: (1) close item – test items close to what students experience in the classroom for

an enacted curriculum; (2) proximal item – test items focused on the same topic and

concepts, but not as close to what students experienced; and (3) distal item – assessment

items developed based on other curriculum standards and neither close nor proximal to

the enacted curriculum. Test items appropriately developed at different

distances/proximity (i.e., close, proximal, or distal to the curriculum content taught by

teachers) are expected to vary in their capacity to reflect a teacher’s instruction in

conjunction with the three levels of the instructional sensitivity notion. For example,

student performance in either close or proximal test items should be higher than that in

distal items because either close or proximal items are related to the planned curriculum

content and distal items are not.

In order to test the proposed DEISA framework for the instructional sensitivity of

assessments, we developed close and proximal items at different proximities from the intended curriculum, while distal items were selected from existing large-scale tests¹. We

conceived of the proximity of the items, conceptually and theoretically, as a continuum between close and distal relative to the target curriculum, analogous to a test item difficulty index. In practice, by manipulating the variables we identified to control the level of proximity, three categories of test items (close, proximal, and distal) were

developed. Two issues are important to focus on: (1) recognize that a goal in the use of

proximal items is to evaluate how well students can transfer what they are learning. By

this we mean that the items, whenever possible, will assess whether students can apply

what they have learned to contexts different from those introduced in the module. (2)

Close and proximal items must measure the knowledge related to the learning goals of the target science modules; unlike distal items, each close or proximal item developed must be checked to ensure that it taps the learning goal(s) of the science module.

¹ We selected distal items from large-scale assessments (i.e., international, national, and state assessments) based on the Colorado state science standards and the National Science Education Standards, on the content and processes of scientific investigations that are the focus of the modules at hand.

The three steps we followed to develop the test items are described below.

1. Defining the Learning Goals. A strategy, called mapping the intended

curriculum, was developed to gather information about the critical characteristics of a

science unit or module. The purpose of mapping was to identify, develop, and understand the learning goals to be achieved by teachers and students by the end of a unit/module. To understand the learning goals of a unit/module,

two tasks during the mapping process were completed: (1) identification of the scientific

knowledge and scientific practices aimed at the learning goals students should achieve for

a taught unit/module, and (2) classification of the scientific knowledge and scientific

practices by types of knowledge – declarative, procedural or schematic knowledge.

Teachers who were scheduled to teach the science module targeted in the project

were invited to map the module to help identify in the mapping process the critical

concepts, principles, procedures and explanation models (Giamellaro, Lan, Ruiz-Primo,

& Li, 2011). In sum, the module maps, tracked seven aspects of each lesson/investigation

within a unit or module. including: (1) the learning targets for the lesson in terms of

scientific knowledge and scientific processes; (2) the type of knowledge being tapped or

engaged in the lesson (i.e., declarative, procedural, and schematic knowledge); (3) the

activities that are critical to achieve the learning targets for the lesson; (4) the

documentation required of students; (5) the materials used; (6) the graphical

representations; and (7) the vocabulary involved in each lesson, activity, or investigation.


These aspects were considered Sources of Instructional Sensitivity (SOIS) that could be

potentially used to manipulate close and proximal test items for the study in order to

adequately assess students’ learning outcomes.
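For illustration only, the sketch below shows one way a single lesson's entry in such a module map could be represented in Python; the seven keys mirror the seven tracked aspects listed above, while the specific entries are hypothetical and not taken from the project's actual maps.

# Hypothetical module-map entry for one lesson, tracking the seven SOIS aspects.
lesson_map = {
    "learning_targets": ["Determine plants' range of tolerance for water"],
    "knowledge_types": ["procedural", "schematic"],
    "critical_activities": ["Set up planters with different watering conditions"],
    "student_documentation": ["Data table of plant growth observations"],
    "materials": ["seeds", "planters", "water"],
    "graphical_representations": ["bar graph of growth by condition"],
    "vocabulary": ["range of tolerance", "optimal condition", "environmental factor"],
}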

2. Identifying the Big Ideas. A big idea is a concept that reflects the principled understanding educators should focus on when they are concerned with transfer of learning. Only by aligning instruction to big ideas can we make possible the conceptual connectivity needed to extend the original reasoning learned. With respect to types of knowledge, big ideas are most directly linked to a particular type of knowledge, schematic knowledge or knowing why – the knowledge used to reason about, predict, and explain things in nature. Schematic knowledge focuses on “why” explanations, or principled understanding (Ruiz-Primo et al., 2012). In other words, big ideas focus on

why observable events happen, not just descriptions of how they happen or that they

happen (Ruiz-Primo & Li, 2012). Synthesizing the module map developed by the

teachers and the map developed by the researchers, we developed a collective map that

helped to guide discussion about the big ideas. Working as a combined group of teachers and researchers, guided by two content experts, we constructed the big ideas for each unit or module. In the DEISA project, we have defined Big Ideas as scientific propositions

that have broad explanatory power within or across scientific disciplines and that

typically form the foundation for more advanced learning.

Big ideas were further organized around levels of understanding that could guide

the focus of the items developed. For example, in the Landforms FOSS module, one of

the big ideas explored was the relationship between erosion and deposition. The levels progress from the basic “students know that Earth’s land surfaces change” to a third level, “students recognize that deposition is the opposite of erosion.” Following the purpose of

the mapping, two types of big ideas were developed with one focusing on scientific

knowledge and another focusing on scientific practices. The role that the big idea played

in the development of the items was crucial. Each item needed to tap a particular level of

the primary big idea, either focusing on scientific knowledge or practices. On some

occasions, items could also tap into a secondary big idea.

3. Developing Items: Manipulating Item Proximity. DEISA team researchers have

focused on developing multiple‐choice items based on the rationale that they are the

predominant form of assessment at the classroom, district, state (e.g., CSAP), national

(e.g., NAEP), and international (e.g., TIMSS, PISA) levels. The variables manipulated to

develop different levels of proximity to the intended curriculum included: (1)

characteristics of the question, (2) students’ exposure to the big idea in the module, (3)

sharing of cognitive elements, (4) setting of the item, and (5) experimental setting.

Instructionally sensitive assessment items were developed on the basis of what we

named “bundles of triads,” all of which assess the same construct, defined by a particular

learning goal and level of its understanding. Within each bundle, one close item is first

developed and then two proximal items (Proximal 1 and Proximal 2) are developed using

the close item as reference. The difference between the two types of proximal items is the

magnitude or the extent of the distance from the close item. For a given characteristic,

there is a small change from the close item for developing Proximal 1 items and a bigger

change for developing Proximal 2 items. Presumably, the triads should ask exactly the

same or very similar questions of students, with the appropriate adaptations when needed

across Proximal 1 and Proximal 2. Two types of changes are defined: small changes and


big changes. The “small change” focuses on developing items using information from the

Close item and changing it to a very small degree. For example, Proximal 1 items will

use similar types of organisms as those used in the Close item. If corn and barley are used

in the module to study environmental preferences, either or both will be used in a Close

item, but plants of different types will be used in a Proximal 1 item. The “big change”

focuses on developing items using information from the Close item with a considerable

degree of change. For example, for Proximal 2 items, the target organism should be

completely different from the one used in a Close item. If the close item uses corn and

barley, the Proximal 2 item can use an animal that is not used in the science module at all,

while still measuring student understanding of environmental preferences. The introductory paper of this set (Ruiz-Primo & Li, 2012) includes more detailed information about how we developed the instructionally sensitive items for the study.
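To make the bundle structure concrete, here is a minimal sketch of how a bundle of triads could be represented; the class and field names and the item stems are illustrative assumptions rather than items from the actual DEISA pool, and only the small-change/big-change logic comes from the description above.

from dataclasses import dataclass

@dataclass
class Item:
    proximity: str  # "close", "proximal_1", or "proximal_2"
    stem: str       # question posed to students
    context: str    # surface feature manipulated across the triad

@dataclass
class Bundle:
    big_idea: str     # big idea (and level) that all three items target
    close: Item
    proximal_1: Item  # small change relative to the close item
    proximal_2: Item  # bigger change relative to the close item

# Illustrative bundle for the environmental-preferences learning goal.
bundle = Bundle(
    big_idea="Organisms have preferred environmental conditions",
    close=Item("close", "How much water do corn and barley prefer?",
               "corn and barley (organisms used in the module)"),
    proximal_1=Item("proximal_1", "How much water do these garden plants prefer?",
                    "plants of a different type"),
    proximal_2=Item("proximal_2", "Which soil moisture do earthworms prefer?",
                    "an animal not used in the module"),
)
print(bundle.close.stem)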

As a result, for the 5th grade Environments module introduced in the next section, 19 bundles of test items were developed and distributed across four booklets, each containing 30 items. Not all developed items were used in the booklets; four of the Proximal 2 items were used to create two open-ended questions presented as the 30th item in the booklets. Each booklet had seven Close, five Proximal 1, seven Proximal 2, and nine distal items. For the Heat and Change module, two booklets were developed, each also containing 30 items; each booklet had twelve Close, nine Proximal, and nine distal items. During the item development process, we did not have enough time to test the new approach with the Heat and Change module; therefore, we developed two proximal levels for Environments but only one proximal level for Heat and Change.


After the test items were developed, the correlation between performance on test items developed at different levels of distance to the intended curriculum and the quality of instruction was examined, following the procedures described in the next section.

Methods

Participants

In this paper, we present empirical evidence on the quality of teaching of thirteen 5th grade teachers who taught either the Environments module or the Heat and Change module, two of the three science modules studied in the parent project. The thirteen teachers, drawn from three school districts, and their students participated in the study. Table 1 provides information about the teachers’ characteristics. Teachers in the

rural school district used the FOSS Environments module and had taught the module an

average of 3.1 times (range = 0-5) in previous years. Teachers in the urban school district

used the BSCS Heat and Change module and had previously taught the module unit an

average of 5.8 times (range = 2-10).

Table 1

Participant Demographics by Module

Characteristics                        FOSS Environments       BSCS Heat and Change
School district location               Rural                   Suburban
Number of teachers                     7                       6
Gender                                 All females             5 females, 1 male
Education                              5 Master, 2 Bachelor    4 Master, 2 Bachelor
Average years teaching science         8.1                     6.2
Average number of times unit taught
  in previous years                    3.1                     5.8
Number of students in this project     163                     237


Curriculum and Sources of Data

For both the Environments module (from the Full Option Science System, FOSS, curriculum) and the Heat and Change module (from the Biological Science Curriculum Study, BSCS, curriculum), we selected the lessons heavily involving the big

ideas of the unit based on the learning objectives of each lesson and instructional efforts

to achieve these objectives specified in the curriculum maps. The FOSS Environments

module, which consists of 6 investigations, guides students to learn relationships between

organisms and their environments. The study chose to focus on investigations 3 and 5

from the module. Investigation 3 required students to conduct experiments and draw

conclusions on the ‘range of tolerance’ and ‘optimal conditions of organisms’ for four

kinds of plants. Investigation 5 required students to run the same experiments but focused

on the impact of salt concentration on the hatching of brine shrimp eggs. The critical

concepts that students need to graph and make connections to are ‘range of tolerance’,

‘optimal condition’, ‘environmental factors’, and ‘controlled experiment’.

The BSCS Heat and Change module, which consists of 9 lessons, guides students

to learn concepts in the physical sciences by studying the properties of water and a few

other substances. The selected lessons, 4 and 8, required students to draw conclusions on

heat transfer based on the temperature data collected when a container of hot water is

placed in cold water and when water is brought to boiling. The critical concepts that

students need to make sense of and make connections with are ‘heat transfer’ and ‘phase

change’.

Classroom videotapes of teachers (as an indicator for quality of instruction).

Teachers were provided with a video camera and lapel microphone. They were trained to


use the camera and they were asked to videotape every lesson related to instruction of the

module. Classroom videotapes were transcribed and analyzed. In this preliminary study,

we only analyzed one type of instructional episode: reporting data and drawing

conclusions, where students were supported by their teachers to make sense of data in

order to build explanatory models. These episodes typically served as closure at the end

of each investigation or lesson.

The coding scheme was part of a coding system focusing on the quality of

instructional activities implemented by science teachers. Following Tarr et al. (2010), we

coded 21 critical aspects organized under the following five categories.

1. Alignment of key concepts to the learning goal: the extent to which teachers

make learning connected to the learning goals (note: key concepts were defined as key words, terms, or combinations of key words/terms that are the focus of learning in an investigation or lesson).

2. Evidence provided to support claims: the extent to which teachers support

students in transforming observation and collected data into evidence, explanations, and

conclusions.

3. Making connections: the extent to which teachers make learning connected to

ideas from other lessons, real-world examples and applications, and other disciplines, and support students in developing in-depth understanding.

4. Engaging students in learning: the extent to which teachers encourage students to engage in learning.

5. Students’ response to learning: the extent to which students are involved in

learning, used to capture the quality of instruction as teachers implemented the curriculum.


The coding questions, how questions were categorized, coding options and rules,

and coding examples are shown in Appendix 1. Codes were assigned as presence or non-

presence according to the coding rules, and the codes were then summed into five dimensions used to evaluate their relationship with the quality of instruction. Table 2 below presents an excerpt of a transcript for Teacher 7 of the Environments module and explains the reasons for the assigned codes. Two raters

independently completed coding of two randomly selected video segments. The inter-

rater agreement reached 83%, indicating that this coding approach can be reliably used by

coders after participating in the coding training sessions.
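As an illustration, the sketch below shows one way the presence/absence codes could be tallied into the five dimensions and how simple percent agreement between two raters could be computed; the code IDs follow Appendix 1, but the grouping function, the example data, and the agreement formula are assumptions for illustration rather than the project's actual scoring procedure.

# Grouping of the 21 Appendix 1 codes into the five coding dimensions.
DIMENSIONS = {
    "alignment":   ["A1", "A2", "A3", "A4", "A5", "A6"],
    "evidence":    ["E1", "E2", "E3", "E4", "E5", "E6"],
    "connections": ["M1", "M2", "M3"],
    "engagement":  ["T1", "T2", "T3"],
    "response":    ["S1", "S2", "S3"],
}

def dimension_sums(chunk_codes):
    """Sum 0/1 presence codes (keyed by code ID) into the five dimension scores."""
    return {dim: sum(chunk_codes.get(code, 0) for code in ids)
            for dim, ids in DIMENSIONS.items()}

def percent_agreement(rater1, rater2):
    """Simple percent agreement between two raters' 0/1 codes on the same chunks."""
    matches = sum(a == b for a, b in zip(rater1, rater2))
    return matches / len(rater1)

# Hypothetical codes for one transcript chunk, and hypothetical paired ratings.
print(dimension_sums({"A4": 1, "E3": 1, "E6": 1, "T1": 1}))
print(percent_agreement([1, 0, 1, 1, 0], [1, 0, 1, 0, 0]))  # 0.8 in this toy example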

Pre- and posttest scores of students (as an indicator for student performance).

The study also collected students’ test scores on the instructionally sensitive assessment

administered before and after instruction. The effect sizes², which captured students’ gain scores from the pretest to the posttest, were calculated within each proximity level of test items (i.e., close, proximal 1, proximal 2, and distal) to indicate student performance.

² All effect sizes were estimated for the pre-posttest design using the formula ES = (Mean_Treatment - Mean_Control) / SD.
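For concreteness, the sketch below shows one way such a gain-score effect size could be computed in Python; the column names and example scores are hypothetical, and the choice of the pretest standard deviation as the denominator is an assumption, since the paper does not specify which SD was used.

import pandas as pd

def gain_effect_size(scores):
    """Effect size for a pre-posttest design: (mean posttest - mean pretest) / SD.
    The pretest SD is used here as one common convention (an assumption)."""
    mean_gain = scores["posttest"].mean() - scores["pretest"].mean()
    return mean_gain / scores["pretest"].std(ddof=1)

# Hypothetical proportion-correct scores on one item type for one teacher's class.
scores = pd.DataFrame({
    "pretest":  [0.30, 0.45, 0.50, 0.25, 0.40, 0.35],
    "posttest": [0.60, 0.70, 0.65, 0.55, 0.75, 0.50],
})
print(round(gain_effect_size(scores), 2))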


Table 2

An Example of the Video Coding

Teacher 7: Alright guys, who has a really good--shh--conclusion that they wanna share? Okay. Shh! I'll wait.

Student A: I said I picked radishes and barley because what we saw is that they don't take a lot of water and if they have a lot of water, they won't grow as much.
[Code E6 was assigned since the student provided evidence to support the claim.]

Teacher 7: So radishes and barley. I love how you said that and you kind of [backed] it up with your information.
[Code E3 was assigned since the teacher explicitly linked the student's evidence to claims.]

Student B: I picked peas and corn because...we saw a lot of growth with peas and corn in the drought. They both take a lot of water.
[Code E6 was assigned since the student connected evidence to claims.]

Teacher 7: They take a lot of water. So in the drought, are--drought means we have no water. So are we gonna plant that in a drought? No. And I love how you backed it up because you sho--you just proved your own point--it was wrong. You proved that yourself. So I loved that. So switch that for me. [Student] read me yours.
[Code E3 was assigned since the teacher explicitly reminded students to make connections between evidence and claims.]

Student C: Okay, the peas grew the best in the very wet.
[Code E4 was assigned because the student made a claim without supporting evidence.]

Data Analysis

The data analysis included a series of correlations between quality of instruction

(indicated by assigned video codes) and student performance (indicated by the effect

sizes of gain scores between pre- and posttest for close, proximal and distal test items

grouped, respectively) to explore how student performance on test items developed at different levels of proximity to the intended curriculum was related to the quality of teachers’ instruction that students were exposed to.
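A minimal sketch of this correlational analysis is shown below, using the per-teacher values reported in Tables 3 and 4 for the Environments module (with Teacher 2 excluded); the exact data handling and software used in the study are not specified, so this is an illustration rather than a reproduction of the reported coefficients.

import pandas as pd
from scipy.stats import pearsonr

# Per-teacher values for the Environments module (Teacher 2 excluded),
# taken from Tables 3 and 4 of this paper.
df = pd.DataFrame({
    "sum_codes":  [15, 43, 17, 1, 26, 21],
    "close":      [0.94, 1.05, 1.11, 0.75, 1.21, 1.15],
    "proximal_1": [0.25, 0.30, 0.79, 0.46, 0.85, 0.58],
    "proximal_2": [0.44, 0.43, 0.26, 0.54, 0.18, 0.95],
    "distal":     [0.42, 0.13, 0.32, 0.69, 0.15, 0.12],
})

# Correlate overall instructional quality with the effect size for each item type.
for item_type in ["close", "proximal_1", "proximal_2", "distal"]:
    r, p = pearsonr(df["sum_codes"], df[item_type])
    print(f"{item_type}: r = {r:.2f}, p = {p:.2f}")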

Results and Discussion

Our preliminary analysis includes (1) descriptive statistics of student performance

on different types of test items, (2) descriptive statistics of the assigned video coding for

seven teachers from the Environments module and six teachers from the Heat and

Change module, and (3) a series of correlations between the video codes and the effect

sizes derived from the pre- and posttest assessment items as the indicator of student learning gains.

Table 3 reports the average student performance indicated by the effect sizes at different levels of item proximity. For the Environments (EN) module, there was an overall linear trend (F = 10.17, p < .01) across the seven teachers, despite Teacher 2 showing bigger gains on Proximal 1 items and Teacher 7 showing bigger gains on Proximal 2 items. Student performance decreased as items moved from close to proximal and from proximal to distal, with the highest effect size of 1.00 for close items, middle values of .64 for Proximal 1 and .41 for Proximal 2, and the lowest effect size of .25 for distal items. Results for the Heat and Change (HC) module showed the same linear trend (F = 3.65, p < .05) as the EN module. Overall, the HC module showed the highest effect size of 1.16 for close items, a middle value of .88 for proximal items (unlike the EN module, there is only one proximal level), and the lowest effect size of .61 for distal items.
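The paper does not specify the exact trend-test procedure behind these F values; as a hedged illustration, the sketch below simply regresses the per-teacher effect sizes from Table 3 (Environments) on an ordinal proximity rank and reports the F test for that slope, which is one plausible way to check for a linear trend.

import pandas as pd
import statsmodels.formula.api as smf

# Per-teacher effect sizes for the Environments module, taken from Table 3.
table3_en = pd.DataFrame({
    "teacher":    [1, 2, 3, 4, 5, 6, 7],
    "close":      [0.94, 0.83, 1.05, 1.11, 0.75, 1.21, 1.15],
    "proximal_1": [0.25, 1.22, 0.30, 0.79, 0.46, 0.85, 0.58],
    "proximal_2": [0.44, 0.04, 0.43, 0.26, 0.54, 0.18, 0.95],
    "distal":     [0.42, -0.08, 0.13, 0.32, 0.69, 0.15, 0.12],
})

# Reshape to long format and code proximity as an ordinal rank (0 = close, ..., 3 = distal).
long_df = table3_en.melt(id_vars="teacher", var_name="proximity", value_name="effect_size")
long_df["proximity_rank"] = long_df["proximity"].map(
    {"close": 0, "proximal_1": 1, "proximal_2": 2, "distal": 3})

model = smf.ols("effect_size ~ proximity_rank", data=long_df).fit()
print(model.fvalue, model.f_pvalue)  # overall F test for the fitted linear trend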

The finding was consistent with what we hypothesized about the item proximity –

for individual teachers, if the intended curriculum is enacted, their students’ performance


in close and proximal items is expected to show higher gains than in distal items because

close and proximal items are more aligned with what students are supposed to learn than

are distal items.

Table 3

Summary of Effect Sizes as an Indicator of Student Performance by Type of Items

Effect sizes derived from pre-posttest gain scores, by type of item:

Environments
Teacher ID    Close    Proximal 1    Proximal 2    Distal
1             .94      .25           .44           .42
2             .83      1.22          .04           -.08
3             1.05     .30           .43           .13
4             1.11     .79           .26           .32
5             .75      .46           .54           .69
6             1.21     .85           .18           .15
7             1.15     .58           .95           .12
Average       1.00     .64           .41           .25

Heat and Change
Teacher ID    Close    Proximal a    Distal
1             .78      .87           .92
2             .37      .59           .55
3             1.45     .91           .18
4             1.28     1.21          .56
5             1.53     .68           1.00
6             1.57     1.04          .44
Average       1.16     .88           .61

a Only one level of proximal items was developed for the Heat and Change module.

Table 4 summarizes the length of time for instruction and the codes raters

assigned to evaluate the quality of teachers’ instruction. The originally assigned video

codes (21 coding items; see Appendix 1) were grouped into five instruction-related

activities. Because the points of focus could not be found in the videotaped recordings

for Teacher 2 of the Environments module, the teacher was dropped from the study,

leaving 6 teachers in the Environments module and 6 in the Heat and Change module for the further data analysis.

We took a close look at the time teachers took in helping students to make sense

of the investigation data and to discuss conclusions in relation to the concepts of these


two investigations/lessons. In the Environments module, the amount of time teachers

spent ranged from approximately 4 minutes (Teacher 5) to 21 minutes (Teacher 3) with

an average of 14 minutes or so (Table 4). That is, there was an average of approximately

seven minutes for each investigation that each teacher spent on reporting data and

drawing conclusions. There was a noticeable difference between teachers in the

Environments and Heat and Change modules. In the Heat and Change module, the

amount of time teachers spent ranged from approximately 24 minutes (Teacher 5) to 72

minutes (Teacher 3) with an average of around 48 minutes (Table 4). That is, each teacher spent an average of approximately twenty-four minutes on each lesson.

Compared to teachers using the EN module (about 7 minutes), teachers using the Heat and Change module provided more time (about 24 minutes) to support students in navigating between data and concepts when constructing conclusions. This can be

attributed to several characteristics about the module. For example, the curriculum

developers of the Heat and Change module intentionally included more tools for students

to make conceptual connections such as a map of evidence and claims and more relevant

guiding questions for the class discussion. That is, students would construct data tables

and record data over a period of time for two samples of water (e.g., one was warmer

and the other was cooler). Then students individually graphed their data and described

what took place to identify the heat source, the direction of heat flow, and the variables

that affected the heat transfer. Teachers would ask students to spend time plotting the

temperature and the time they recorded for warmer objects and cooler ones to explain the

relationship between them. In addition, teachers in the Heat and Change module spent

more time helping students to make sense of investigation data, which are often about temperature, time, and the state of matter. This may be related to the fact that the concepts in the Heat and Change module (e.g., heat energy, heat transfer, and molecules) are more intangible and abstract than the ones in the Environments module (e.g., shrimp, corn, and peas).

We then examined the correlation between teachers’ overall teaching quality

indicated by classroom videotape observations (last column in Table 4) and the

instructional time teachers spent (2nd column in Table 4) engaging students in “reporting data and drawing conclusions.” The correlation coefficient was positive for both the EN (r = .87, p < .05) and HC (r = .89, p < .05) modules, indicating that teachers who engaged students longer in discussing the investigations showed better quality of instruction than teachers who spent less time on this discussion.

We noticed that the segment of data reporting and drawing conclusions was almost always the last instructional episode in each investigation. We wondered whether teachers

would overlook the importance of this session because of its proximity to the end of the

instructional process and whether the average of seven minutes was enough for teachers

to carry out a productive discussion to make sense of the data and connect the data

pattern to the explanations. Fortunately, according to the findings, most of the teachers (6

teachers for each of the EN and HC modules) led the class discussion on “reporting the

investigation data and then drawing conclusions” in the two coded investigations/lessons.

However, for teachers teaching the EN module, this amount of time was very little compared to that of teachers in the HC module. Given the high positive correlation between the amount of time teachers engaged students in this session and the quality of instruction, the analysis indicated that when teachers did not devote adequate time to reporting data and drawing conclusions to carry out this type of conceptual discussion, their instructional quality was jeopardized and, consequently, students were less likely to achieve a coherent understanding of the key concepts and learning goals designed and embedded by the curriculum developers.

As shown in Table 4, for the Environments module, the assigned video codes, which indicate the quality of teachers’ instruction, ranged from 1 (Teacher 5) to 43 (Teacher 3) with an average of 20.5 and a standard deviation of 13.9. For the Heat and Change teachers, the assigned video codes ranged from 42 (Teacher 2) to 114 (Teacher 6) with an average of 69 and a standard

deviation of 30.27. It seemed that the teaching quality of the Heat and Change teachers was much better than that of the Environments teachers. However, after removing the effect of the difference in the amount of time teachers spent, using an ANCOVA with ‘the amount of time teachers spent on data reporting and drawing conclusions’ as the covariate (controlling variable), there was no difference (F = .336, p > .05) between the two groups of teachers in terms of instructional quality. In other words, based on these statistics, if the teachers who taught the Environments module had spent as much time as the Heat and Change teachers did, their instructional quality should have been equally as high.
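The paper does not detail the ANCOVA specification; the sketch below shows one plausible way to run it with statsmodels, comparing the two modules on the sum of video codes while controlling for time spent, using the per-teacher values from Table 4 (Teacher 2 of Environments excluded). It is an illustration and is not guaranteed to reproduce the reported F value exactly.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Per-teacher values taken from Table 4 (Teacher 2 of Environments excluded).
df = pd.DataFrame({
    "module":    ["EN"] * 6 + ["HC"] * 6,
    "time_sec":  [447, 1261, 865, 249, 1166, 1067,
                  3347, 2270, 4285, 1522, 1424, 4256],
    "sum_codes": [15, 43, 17, 1, 26, 21,
                  64, 42, 100, 56, 43, 114],
})

# ANCOVA: module effect on instructional quality (sum of codes),
# controlling for time spent on data reporting and drawing conclusions.
model = smf.ols("sum_codes ~ C(module) + time_sec", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))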


Table 4

Summary of Assigned Codes for Instructional Quality Based on Transcribed Videotaped Investigations

(Columns: Teacher ID; length of investigation in seconds; number of investigations/lessons; assigned codes for Alignment of Key Concepts to the Learning Goals, Evidence Provided to Support Claims, Making Connections, Engaging Student in Learning, and Students' Response to Learning; and the Sum of Video Codes.)

Teacher   Length   No. Inv.   Alignment   Evidence   Connections   Engaging   Response   Sum

Environments
1         447      2          4           7          3             1          0          15
2         - a      -          -           -          -             -          -          -
3         1261     2          21          9          10            0          33         43
4         865      2          9           0          5             2          12         17
5         249      2          0           0          0             0          1          1
6         1166     2          17          4          2             1          24         26
7         1067     2          6           8          0             4          21         21

Heat and Change
1         3347     2          19          34         6             2          3          64
2         2270     2          9           20         7             3          3          42
3         4285     2          19          64         2             2          13         100
4         1522     2          18          33         3             1          1          56
5         1424     2          10          25         1             2          5          43
6         4256     2          18          87         5             2          2          114

a Teacher 2 in the Environments module did not have any video clips for reporting data and drawing conclusions.


The study further examined how students’ performance was associated with their

teachers’ instructional quality by taking ‘the amount of time teachers spent in guiding

data reporting and conclusion drawing’ as a covariate. The summary of the correlation

coefficients is shown in Table 5. The correlations tested whether student performance, categorized by type of item proximity, reflects the quality of teachers’ instruction across the coding dimensions. Admittedly, the results have to be interpreted with caution because only six teachers per module were involved in the study. As a result, most of

the correlation coefficients were not significant (p > .05) because higher correlation

coefficients are required to reach statistical significance for smaller sample sizes.

However, the study showed some potential patterns.
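To make the sample-size point concrete, the short sketch below computes the critical Pearson r needed for two-tailed significance at the .05 level with six paired observations; this is a standard statistical calculation rather than a figure reported in the paper.

from scipy import stats

n = 6                                    # teachers per module in the correlations
df = n - 2                               # degrees of freedom for a Pearson correlation
t_crit = stats.t.ppf(1 - 0.05 / 2, df)   # two-tailed critical t at alpha = .05
r_crit = t_crit / (t_crit**2 + df) ** 0.5
print(round(r_crit, 2))                  # approximately 0.81, so |r| must exceed about .81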

One hypothesis of the study was that the assigned video codes, which indicated instructional quality, would show higher correlations with student performance in the close

and proximal items and no correlation with student learning outcomes in distal items.

That is, for a group of teachers, if the intended curriculum is enacted, the correlation

between student performance in both close and proximal items and the quality of teachers’

instruction is expected to be higher than that between student performance in distal items

and the quality of instruction.

The hypothesis was based on the rationale that quality instruction not only supports what students have learned in class but also helps students apply the underlying concepts they have learned to answering the proximal items, which involve transfer of learning to novel situations.


Table 5

Summary of Correlations between Instructional Quality of Teachers and Student Performance by Type of Items

(Coding dimensions: Alignment of Key Concepts to the Learning Goals; Evidence Provided to Support Claims; Making Connections; Engaging Student in Learning; Students' Response to Learning; Sum of Video Codes.)

Environments
  Close:        (none reported)
  Proximal 1:   .39
  Proximal 2:   .51, .85, .65
  Distal:       .14, -.19, .09, .05, .12, -.01

Heat and Change
  Close:        .79, .22, .83
  Proximal:     .87, .60, .72
  Distal:       -.03, -.23, .14, .10, -.25, -.30

Note. No correlation coefficients were significant at p < .05. Only correlation coefficients consistent with what the study hypothesized were reported; for the distal rows, the six values correspond to the six dimensions in the order listed above.


In Table 5, only correlation coefficients consistent with our hypotheses are

reported in order to make the relationships salient. Student performance on distal test items as a whole, across both the Environments and the Heat and Change modules, showed low correlations with the quality of instruction on all five teaching dimensions.

Student performance on proximal test items showed a higher correlation with teaching quality in the ‘alignment of key concepts to the learning goals’ dimension (r = .87; for the Heat and Change module), a higher correlation with the overall teaching quality (r = .72; for the Heat and Change module), and a higher correlation with teaching quality in the ‘engaging student in learning’ dimension (r = .85; for the Environments module).

Student performance on close test items showed a medium correlation with teaching quality in ‘alignment of key concepts to the learning goals’ (r = .32; for the Heat and Change module), a lower correlation with teaching quality in ‘students’ response to learning’ for the Heat and Change module (r = .22), and a lower correlation with teaching quality in ‘students’ response to learning’ for the Environments module (r = .27) as well.

On the other hand, the findings that were inconsistent with our hypotheses led us to reason that: (1) controlling variables such as teachers’ professional development, years of teaching, and students’ academic ability, which may influence either the quality of teachers’ instruction or student performance, should be included in the analysis to improve the internal validity of the study; (2) given the findings that did not support our hypotheses, the claims of the study should be re-examined; (3) the effect of the assigned video codes that indicated quality of instruction may not be universal; that is, some of our aspect codes may be more relevant to helping students gain clarity on what they have just learned from lessons (i.e., related to close items), whereas others help students transfer what they learned to novel situations (i.e., related to proximal items); (4) some coding dimensions did not have enough variation across teachers, suggesting a lack of sensitivity in the coding questions/rules, which may result from rater quality, the number of points on the coding scales, the number of items developed for each coding category, and the cognitive load for raters (Hill, Charalambous, & Kraft, 2012; Ing & Webb, 2012); and (5) the video clip sampling strategy, including the number of video clips selected, the number of lessons included, and the representativeness of the selected video clips, should be re-examined.

Conclusions

The success of science education relies on the quality of teaching that takes place

in the classroom. Meaningful and successful learning will not happen automatically for

students without deliberate scaffolding from teachers. If teachers fail to support students

to make connections between key concepts and investigation data for a given curriculum,

students will end up with scattered, disconnected pieces of information rather than

integrated big ideas that cross curricula.

As the findings indicated, the time spent in data reporting and drawing

conclusions was related to student gain scores in science achievement and may determine

whether students are able to connect collected data to the learning goals/objectives

embedded in the curriculum. According to the summarized codes of the instructional


quality, the study found most teachers did carry out class discussions for reporting data

and drawing conclusions after conducting scientific investigations. However, the time

they engaged students during these critical periods varied widely and, for some teachers, may not have been enough to guide students in making meaningful connections to the learning targets.

Furthermore, we looked into the correlation between the assigned video codes as

an indicator for the quality of instruction and the student performance effect sizes derived

from the gain scores between the pretest and posttest. Overall, there were no clear or significant patterns in the correlation coefficients.

Overall, the findings of the study partially supported what we hypothesized: to some degree, test items developed at different levels of proximity to the intended curriculum content showed differential correlations with the quality of teachers’ instruction. That is, the association between quality of teachers’ instruction and student performance varied depending on the items’ level of proximity to the intended curriculum. In general, the study in part supported one of our hypotheses: we

expected the instructional quality would show higher correlation with student

performance in close and proximal items, and no correlation with student performance in

distal items. The inconsistency in the findings may be associated with the small sample

size of participant teachers, the lack of variation in video coding, only a few

investigations or lessons involved in the study, or the decision to focus only on the video segments of data reporting and drawing conclusions.

The lessons we learned from the study also included the following: (1) we did not include other sources of data, such as science notebooks and teacher interviews, in our data analysis, which could have provided triangulation and a more reliable and complete picture of teachers’ instruction; (2) we only coded the occurrence of the instructional supports and did not take a closer look at the quality of these indicators, which may better differentiate instruction and reflect what teachers’ instruction really is; (3) we did not include covariates, such as students’ pretest scores and teachers’ background information, to adjust potentially biased outcome measures; (4) the lack of sensitivity of the coding system in some aspects raised concerns about rater quality, coding scales and scoring, and coding strategies; and (5) the sampling plan, including how to select video clips representative of teachers’ classroom practices and which aspects to include to measure the quality of instruction more effectively, should be re-examined.


References

Giamellaro, M., Lan, M., Ruiz-Primo, M. A., & Li, M. (2011, April). Addressing elementary teachers' misconceptions in science and supporting peer learning through curriculum mapping. Paper presented at the NARST annual meeting, Orlando, FL.

Hill, H. C., Charalambous, C. Y., & Kraft, M. A. (2012). When rater reliability is not enough: Teacher observation systems and a case for the generalizability study. Educational Researcher, 41(2), 56-63.

Ing, M., & Webb, N. M. (2012). Characterizing mathematics classroom practice: Impact of observation and coding choices. Educational Measurement: Issues and Practice, 31(1), 14-26.

Leinhardt, G. (1983). Overlap: Testing whether it is taught. In G. F. Madaus (Ed.), The courts, validity, and minimum competency testing (pp. 153-170). Boston, MA: Kluwer-Nijhoff Publishing.

Polikoff, M. S. (2010). Instructional sensitivity as a psychometric property of assessments. Educational Measurement: Issues and Practice, 29(4), 3-14.

Popham, W. J. (2007). Instructional sensitivity of tests: Accountability's dire drawback. Phi Delta Kappan, 89(2), 146-150, 155.

Ruiz-Primo, M. A. (2010, March). Developing and evaluating instructionally sensitive assessments. REESE PI Meeting. Washington, DC: National Science Foundation.

Ruiz-Primo, M. A., & Li, M. (2008). Building a methodology for developing and evaluating instructionally sensitive assessments. Proposal submitted to the National Science Foundation, Award ID: DRL-0816123. Washington, DC: National Science Foundation.

Ruiz-Primo, M. A., & Li, M. (2012). Assessing transfer of learning: Instructionally sensitive assessments, curriculum, and instruction. Paper presented at the AERA annual meeting, Vancouver, Canada.

Ruiz-Primo, M. A., Li, M., Giamellaro, M., Wills, K., Mason, H., Lan, M., & Wang, T. (2012). Instructional science curricula characteristics and transfer of learning: On learning goals, opportunities to achieve them, and opportunity to transfer what was learned. Paper presented at the AERA annual meeting, Vancouver, Canada.

Ruiz-Primo, M. A., Shavelson, R. J., Hamilton, L., & Klein, S. (2002). On the evaluation of systemic education reform: Searching for instructional sensitivity. Journal of Research in Science Teaching, 39(5), 369-393.

Tarr, J. E., Ross, D. J., McNaught, M. D., Chávez, O., Grouws, D. A., Reys, R. E., Swears, R., & Taylan, R. D. (2010, April). Identification of student- and teacher-level variables in modeling variation of mathematics achievement data. Paper presented at the AERA annual meeting, Denver, CO.

Wiliam, D. (2007, September). Sensitivity to instruction: The missing ingredient in large-scale assessment systems? Paper presented at the Annual Meeting of the International Association for Educational Assessment, Baku, Azerbaijan.


Appendix 1. Coding Questions, Categories, Rules, and Examples for Coding Transcribed Video Clips

Alignment of Key Concepts to the Learning Goals. Use of a concept (may be a key word, a term, or a combination of terms defined by researchers) for an instructional purpose: what does the teacher try to accomplish with the use of the concept? The teacher...

A1. Priming (0=No/1=Yes, for each highlighted chunk)
- Mentions the concept (key word, term, or a combination of terms) at the beginning of the day, the activity, or the discussion.
  Example: "We are going to talk about ROT."

A2. Defining/Elaborating (0=No/1=Yes, for each highlighted chunk)
- Provides/asks students for the definition of the concept (key word, term, or a combination of terms).
  Example: "What is ROT?"
- Corrects the use of the concept (key word, term, or a combination of terms).
  Examples: "You should use range of tolerance instead of tolerance."; "What is the vocabulary (key words) we use for this case/concept?"
- Models the right use of the vocabulary.

A3. Providing examples (0=No/1=Yes, for each highlighted chunk)
- Uses/asks for examples to illustrate or elaborate the concept (key word, term, or a combination of terms).
  Example: "The ROT in terms of water for peas and tomatoes is different."

A4. Making sense of results/Interpreting collected data (0=No/1=Yes, for each highlighted chunk)
- Uses the concept (key word, term, or a combination of terms) to (1) probe the investigation results or (2) facilitate the conclusion.
  Example: "The optimum condition for shrimp hatching is 2 spoons of salts."

A5. Comparing with other terms/conceptions (0=No/1=Yes, for each highlighted chunk)
- Compares, contrasts, and connects the concept (key word, term, or a combination of terms) with other relevant concepts, which may or may not come from other investigations or lessons.
  Example: "The optimum condition is somewhere within the ROT."

A6. Other (0=No/1=Yes, for each highlighted chunk)
- Whatever does not fit in any other category (A1 to A5).

Evidence Provided to Support Claims. Supporting students in transforming observations/collected data into evidence and into explanations/conclusions. The teacher...

E1. Prompting for evidence (for T only) (0=No/1=Yes, for each highlighted chunk)
- Reminds students that they need to include/provide evidence, but does not tell students how to do so or what the evidence is.
  Example: "T: You should have evidence in your conclusion" (e.g., evidence is mentioned at the beginning of the lesson or discussion).

E2. Linking evidence and claims implicitly (for T only) (0=No/1=Yes, for each highlighted chunk)
- Students offer evidence to support claims; however, the teacher hints/reminds/responds/comments to students about the use of evidence without explicitly describing why providing evidence is good. The comment is not descriptive, often just an evaluative comment/feedback.
  Examples: "T: It's a good way to back up your claim."; "T: Excellent job of evidence. The evidence is great!"

E3. Linking evidence and claims explicitly (for T only) (0=No/1=Yes, for each highlighted chunk)
- Explains explicitly what students need to do with evidence and claims.
- Only asks for evidence to support a claim provided by either the teacher or students prior to the conversation, since the instructional goal is to push students to include evidence in addition to the claims they made.
  Examples: "T: Be sure to include your investigation data when you report conclusions."; "T: More erosion happened with the flood, you know that is because ...?"; "T: What is the evidence for your conclusion?"
- Refers to cases in which students explicitly link or provide information about the use of evidence to support claims.
  Example: "T: What Tony claims is ... and the evidence he used to support his claim is ...."
- Models the type of evidence that should be included in the conclusion.
  Example: "T: Number of leaves and lengths of roots/stems should be reported for your claim."

The student...

E4. Providing claims without evidence (for S only) (0=No/1=Yes, for each highlighted chunk)
- Provides claims without evidence.
  Example: "S: The ROT of water is from dry to moist."
- Responds to or completes a claim that the teacher starts/asks.
  Example: T: "What is the optimum condition?" S: "4."

E5. Providing evidence without claims (for S only) (0=No/1=Yes, for each highlighted chunk)
- Provides evidence or describes the investigation data only, without any obvious claims in the chunk.
  Example: T: "Hunter, do you have a claim and evidence?" S: "This was hers but I noticed that the washer got to- from each edge and it eroded- it didn't erode that far but it eroded a lot-a really wide."

E6. Providing claims with evidence (for S only) (0=No/1=Yes, for each highlighted chunk)
- The student response includes the evidence, although the quality may not be good.
  Examples: "S: The optimum condition for hatching is 2 spoons of salts because the number of hatched eggs is the most."; T: "More erosion happened with the flood, you know that is because ...." S: "Our delta length is 34 cm."

Making Connections. The teacher supports students' learning by making connections. The teacher...

M1. Other investigations or lessons (0=No/1=Yes, for each highlighted chunk)
- Conceptually compares, contrasts, and connects the concept (key word, term, or a combination of terms) with another concept(s) in other lessons or investigations related to the Big Ideas.
  Example: "What environmental factor (a key word for another lesson) can be investigated for the ROT?"
- Connects to other activities or discussions in previous days, lessons, or units, or in future days, lessons, or units.
  Example: "What's the effect of floods on erosion and deposition?"

M2. Elaborated examples or application to the real world (0=No/1=Yes, for each highlighted chunk)
- Makes/explores connections to real-world phenomena (e.g., home and/or school connections) as examples or applications.
  Examples: "What's the ROT of temperature for the people living in Colorado in winter?"; "When you shop for plants, what does the label tell us (in terms of ROT)?"

M3. Elaborated examples or application to other disciplines (0=No/1=Yes, for each highlighted chunk)
- Makes connections to other content disciplines (e.g., language arts, mathematics, social sciences).
  Examples: "In math or in writing class you..."; "Please make a list of words that express the degree of wetness for the ROT for water."

Engaging Students in Learning. How the teacher fosters classroom interactions. The teacher...

T1. Invites and values multiple perspectives (0=No/1=Yes, based on the entire transcript)
- Compares and contrasts students' responses.
  Example: "Bonnie and Jo said..., but Shin ..."
- Revoices/paraphrases students' responses.
  Example: "Bonnie said that.... Does somebody else ...?"
- Follows students' responses.
- Writes down students' responses on the whiteboard or chart paper.
  Examples: "That's interesting. So why do you say your stream table...?"; "Do all of us agree with Jose's answer?"

T2. Students' air time (focusing on time: students vs. teacher) (0=No/1=Yes, based on the entire transcript)
- A high percentage/proportion of students' talking/interaction* during the segment with respect to air time, in contrast to the time the teacher occupied.
  *Students' talking/interaction is defined as (1) the time a student responds to the teacher or other students in the whole-class discussion, and (2) students talking or working with others in small-group settings such as sharing out (quiet writing time and test-taking time do not count).

T3. Students' participation in class discussion (focusing on the proportion of students) (0=No/1=Yes, based on the entire transcript)
- A high percentage/proportion of students participate during the discussion or class discourse.

Students' Response to Learning. How students participate in the learning. The student...

S1. Students ask questions (0=No/1=Yes, for each highlighted chunk)
- Asks questions related to the lessons (content questions).

S2. Students respond or comment to other students' questions (0=No/1=Yes, for each highlighted chunk)
- Responds, rather than the teacher, to questions posed by other students.
- Corrects/comments on the teacher's errors without being prompted.

S3. Students determine the focus and direction of the class (0=No/1=Yes, for each highlighted chunk)
- Determines the focus or direction of classroom discourse.