EDUCATIONAL DESIGNER
Journal of the International Society for Design and Development in Education

High-stakes Examinations to Support Policy: Design, Development and Implementation

Paul Black, Hugh Burkhardt, Phil Daro, Ian Jones, Glenda Lappan, Daniel Pead, and Max Stephens

Abstract

How can we help policy makers choose better exams? This question was the focus of the Assessment Working Group at the 2010 ISDDE Conference in Oxford. The group brought together high-level international expertise in assessment. It tackled issues that are central to policy makers looking for tests that, at reasonable cost, deliver valid, reliable assessments of students' performance in mathematics and science, with results that inform students, teachers, and school systems. This paper describes the analysis and recommendations from the group's discussions, with references that provide further detail. It has contributed to discussions, in the US and elsewhere, on how to do better. We hope it will continue to be useful both to policy makers and to assessment designers.

Executive Summary

What makes an exam better?

High-stakes testing has enormous influence on teaching and learning in classrooms, for better or for worse. Teachers give high priority to classroom activities that focus on the types of task in the test. This is understandable, indeed inevitable: after all, their careers are directly affected by the scores of their students on these tests, the official measure of their professional success. Yet this effect of testing on teaching and learning seems to be ignored in the choice of tests by policy makers, who see tests only as measurement instruments. Driven by pressures for low cost, simplicity of grading and task predictability, current tests have a narrow range of item types that does not reflect the breadth of world-class learning goals as set out, for example [1], in the Common Core State Standards for Mathematics (CCSS) or, indeed, in many of the state standards that CCSS is replacing. Yet, conversely, high-quality exams can help systems to educate their students better. Good tests combine valid and reliable information for accountability purposes with a beneficial influence on teaching and learning in classrooms; that is, they are tests worth teaching to.

How do we get better assessment?

This paper, building on substantial experience worldwide, sets out the essential elements of an assessment system that meets this goal, taking into account the necessary constraints of cost. In brief, this entails:

- Planning assessment, including high-stakes testing, as an integral part of a coherent system covering learning, teaching and professional development, all focused on the classroom.
- Treating assessment as a design and development challenge: first to introduce high-quality instruments which serve both formative and summative purposes, then, later, to counteract the inevitable pressures for degrading that quality.
The task of creating an assessment system to help guide and enhance teacher practices and students' learning should include the following steps:

- Create a pool of tasks from a variety of sources, each exemplifying:
  - research-based understanding of learning and performance
  - creative designs
  - refinement through trialing in classrooms
  - feedback from teachers and others;
- Establish authoritative committees, independent of the test vendors, with the needed expertise to select tasks from the pool so as to balance the tests across performance goals, as summarized in the standards;
- Involve teachers in both the test design and the scoring processes, using the potential of test design and of scoring training as powerful modes of professional development, with built-in monitoring to ensure quality and comparability across and between teachers and schools; and
- Grow human capacity by providing training for designers of curriculum and of assessment.

These recommendations are amplified and justified in the following sections.

1. The roles of assessment in education systems

Good educational systems must have the capacity to evolve over time. Testing systems must also have this capacity, both in relation to their purposes and the actual assessment instruments that are created. Given the more rigorous demands on teaching and learning that have become accepted internationally, exemplified by the US CCSS, test validation requires a concomitant rigor, with a broad range of strong evidence. To achieve the corresponding educative value, high-quality exams will require radical change in system design. The extent of the challenge may be seen by comparing familiar current tests with this extract from CCSS, reflecting international standards for mathematics:

"Mathematically proficient students understand and use stated assumptions, definitions, and previously established results in constructing arguments. They make conjectures and build a logical progression of statements to explore the truth of their conjectures. They are able to analyze situations by breaking them into cases, and can recognize and use counterexamples. They justify their conclusions, communicate them to others, and respond to the arguments of others. They reason inductively about data, making plausible arguments that take into account the context from which the data arose. Mathematically proficient students are also able to compare the effectiveness of two plausible arguments, distinguish correct logic or reasoning from that which is flawed, and, if there is a flaw in an argument, explain what it is. Elementary students can construct arguments using concrete referents such as objects, drawings, diagrams, and actions. Such arguments can make sense and be correct, even though they are not generalized or made formal until later grades. Later, students learn to determine domains to which an argument applies. Students at all grades can listen or read the arguments of others, decide whether they make sense, and ask useful questions to clarify or improve the arguments."

Examination design should reflect as far as possible the full range of goals of the mathematics curriculum. Anything less would not be acceptable as valid implementation of the intentions of the standards. The examination systems which will be developed should incorporate an auditing mechanism for checking how well the assessment practice is realizing the intentions.
Such a mechanism should identify problems: for example, that the current system of curriculum and assessment is not preparing students for the higher levels of mathematical thinking and reasoning embodied in CCSS, or other international standards.

Policy documents for school mathematics often point to the importance of mathematical proofs, mathematical modeling and investigations in achieving a balanced curriculum. However, these remain paper expectations and will receive little attention in the classroom unless there is a suite of required student assessments that will make clear to teachers the need for instruction to pay attention to these aspects of performance. Most current assessment systems fail to give students opportunity to show the range of desirable performances on educative exams in an environment that supports them in raising their own level of performance.

The system sends score reports back to district and school offices, but the information most often does not match the purposes of the various users. The district, which wants a limited amount of data that is valid and reliable, believes that these reports give a fair picture of the level of achievement of students, teachers and schools. Yet the multiple-choice tests on which they are based, consisting of short items, assess only fragments of mathematics, not students' ability to use their mathematics effectively, to reason mathematically about problems [2]. Teachers may learn a bit about the overall level of students from these reports, but not the kind of information that would enable them to help their students to raise their level of performance. Students may receive a score and a placement in the class as a whole, but with little to no information that could help them understand what higher-quality performance entails.

What school systems need is a valid and reliable overall picture of performance of students, classes and schools. In this regard, validity requires tests that are balanced across the performance goals, not just testing those aspects that are easy to test. What teachers and students need is detailed differential information on their strengths and weaknesses, accompanied by instructional ideas to help build on strengths and to remediate weaknesses. In this regard, valid information can only come from assessment of performance on mathematical tasks that demand substantial chains of reasoning. Most of this feedback needs to be formative, detailed, and timely enough to inform action. Knowing common patterns of mistakes, and locating student performance along the underlying developmental continua of learning in a given class, can help teachers plan remediation, as well as change their classroom teaching practices for future students.

To stimulate teachers toward this goal, rubrics for scoring and sample papers showing typical responses at each level of performance on the exam can promote changes in classroom practices. Reports that include typical responses of students at different scoring levels can be used in discussions with students to provide further learning opportunities. Separate scores on different dimensions of tests, along with individual item results, can help teachers improve their practice. Students should see where their performance lies along a progression of learning, so that they also understand the kinds of responses that would be classified as high-quality work. The goal is to make the examination system educative to students, teachers and parents.
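As a concrete illustration of the kind of report this implies, the sketch below derives per-dimension subscores from individual item results. It is not from the paper; the dimension names, item-to-dimension mapping, and scores are hypothetical, chosen only to show how separate scores on different dimensions can be computed from item-level data.

```python
# A minimal sketch of a dimension-level score report, assuming a hypothetical
# mapping from test items to the performance dimensions a test should balance.
# All names and numbers are illustrative, not taken from the paper.

# Each item contributes to one dimension and has a maximum score.
ITEMS = {
    "item1": ("concepts_and_skills", 2),
    "item2": ("concepts_and_skills", 2),
    "item3": ("chains_of_reasoning", 6),   # a substantial multi-step task
    "item4": ("modeling", 8),              # an extended modeling task
}

def dimension_report(item_scores: dict[str, int]) -> dict[str, str]:
    """Aggregate one student's item scores into per-dimension results."""
    earned: dict[str, int] = {}
    possible: dict[str, int] = {}
    for item, score in item_scores.items():
        dimension, max_score = ITEMS[item]
        earned[dimension] = earned.get(dimension, 0) + score
        possible[dimension] = possible.get(dimension, 0) + max_score
    return {d: f"{earned[d]}/{possible[d]} ({100 * earned[d] // possible[d]}%)"
            for d in possible}

# One student's item-level results (illustrative).
print(dimension_report({"item1": 2, "item2": 1, "item3": 2, "item4": 3}))
# {'concepts_and_skills': '3/4 (75%)', 'chains_of_reasoning': '2/6 (33%)',
#  'modeling': '3/8 (37%)'}
```

A report of this shape makes the differential information discussed above visible: a student can be strong on short skills items yet weak on extended chains of reasoning, a distinction a single total score hides.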
Timeliness is central: if the exam results, scoring rubrics, and sample papers are not returned promptly, the window of interest, engagement, and learning for teachers, parents, and students will have closed. Teachers and students will have moved on to other parts of the curriculum and will have no enthusiasm for feedback that is not currently relevant. Insofar as teachers can be trained, and then trusted, to do the scoring themselves or with close colleagues, this speedy response can be secured more easily.

It is clear from the above that high-quality assessment is an integral part of a coherent education system, linked directly to the improvement of teaching, learning and professional practice. This should not be a surprise; all complex adaptive systems are like this, with feedback designed to enhance every aspect of system performance. This is the strategic design challenge. In the following sections we describe how it can be, and has been, met.

This strategic view puts the issue of cost in perspective. The cost of various kinds of assessment must be seen in terms of the overall cost of educating a student, which approaches 10,000 US dollars per year. Is assessment cost-effective in delivering improved education? To make it so is primarily a design and development challenge.

2. Design principles and practices for summative examinations

In this section we outline the principles that guide the design of examinations, intended primarily for summative purposes, that aim for high quality, and the practices that enable these principles to be realized. We start with the principles: the criteria that set out what we mean by high quality. They are fairly non-controversial, probably shared by most knowledgeable people at the rhetorical level, at least; yet they are often neglected in practice.

Validity

The key here is to assess the things that you are really interested in: to assess students' progress towards the educational aims, expressed in performance terms, for whatever purposes the outcomes of the assessment are to be used. While this may seem so obvious as to be hardly worth stating, this fundamental principle is widely ignored in the design of examinations where, for example, the tasks are confined to elements of performance that are easy and cheap to assess.

Another common failure is to specify the outcomes of interest in terms of a simple model of performance in the subject, for example a list of some elements of performance (usually individual concepts and skills), and to test these fragments separately. Why is this problematic? Because there is no assessment of students' ability to integrate these elements into the holistic performances that are really of interest, for example solving substantial problems.

In seeking validity, certain questions deserve particular attention in test design:

- Inferences: Can the users [3] make valid inferences about the student's capabilities in the subject from the test results achieved by that student?
- Evaluation and decision: Can users evaluate the results and use them in making decisions, including pedagogical ones, with confidence that the results are a dependable basis, free of bias effects and reflecting a comprehensive interpretation of the subject's aims?
- Range and variety: Does the variety of tasks in the test match the range of educational and performance aims as set out in, for example, CCSS or another specification of the aims of the intended curriculum?
- Extrapolation: Does the breadth and balance of the domain actually assessed justify inferences about the full domain, if the former is a subset of the latter?
- The effects the test has on what happens in classrooms: Both common sense and data from observations show that, where there are high-stakes tests, the task types in the test dominate the pattern of learning activities in most classrooms. Does this influence represent the educational aims in a balanced way? Is this a test worth teaching to? Given this inevitable effect on classrooms, this question summarizes a very important criterion of validity.

Many high-stakes tests do not score well on these criteria. Validity for the users is sometimes justified by using evidence of correlation with other measures; this simply calls into question the validity of those other measures in relation to the aims of the assessments being justified. Very often, the effects on classrooms are simply ignored: tests are seen as just measurement. Validity of assessments may also be justified by redefining the educational goals to be what the test assesses. The harmful effects are exacerbated if the curriculum aims are expressed in very vague terms: then the test constructors become the arbiters who translate vague aspirations into specific tasks that convey an impoverished message to teachers.

Reliability

Reliability is generally defined as the extent to which, if the candidate were to take a parallel form of the test, on a different occasion but within a short time, the same result would be achieved. Clearly, reliability is a necessary condition for validity, but not a sufficient one; some argue that it should be treated as a component of validity. For a more detailed account, with a full discussion of the criteria considered here, see Stobart (2001). Some of the main threats to reliability are:

- Occasion variability: The student may perform at different levels on different days.
- Variability in presentation: The way a task is explained may sometimes be inadequate, and the conditions in which the assessment is attempted may be inappropriately varied.
- Variations in scoring: There may be poor inter-rater or intra-rater consistency; while simple responses to short items can be scored automatically, trained people are needed to score responses to complex tasks. Weak scoring can also threaten validity if the scoring protocol is too analytic, or too holistic, or fails to capture important qualities of task performance. Where raw scores are used to determine grades, variations in the setting of grade boundaries may also be problematic.
- Inappropriate aggregation: Variations in the weights given to different component tasks will threaten both reliability and validity, as will inconsistencies between the scoring criteria for the different tasks included in the aggregation.
- Inadequate sampling: A short test will have lower reliability than a longer one, other things being equal, because a smaller sample of each student's performance will have greater fluctuations due to irrelevant variations between tasks. To narrow the variety of tasks may produce an unacceptable reduction in validity, so more extensive and longer assessments may be needed to cover a wider variety of performance types with the same reliability. A recent US study finds that a broad-spectrum test of performance needs to be four times as long as a short-item multiple-choice test for the same reliability, close to the few hours of examinations common in other countries (Black & Wiliam, 2012); a standard worked form of this length-for-reliability trade-off is sketched at the end of this section.
  However, if the aggregated tasks are too diverse, the result may be hard to interpret, i.e. weak in validity.
- Variation between parallel forms: For any assessment of performance on non-routine tasks, variation from form to form among such tasks is essential to ensure that they remain non-routine; this can offset teaching to the test and stereotyping through repetition of the same tasks from year to year, but it may also introduce irrelevant variability.
- Variation in the setting of grade boundaries: Where raw scores are converted into grades, the score at which each grade boundary is fixed can have a very marked effect on those with scores close to the boundary. The influence of this feature depends on the number of grades in relation to the shape of the score distribution.

It will be evident from the above that there is strong interaction between reliability criteria and validity criteria. The key principle is that irrelevant variability should be minimized, as long as the methods used do not undermine validity: there is no value in an accurate measure of something other than what you are seeking to assess. Mathematics assessors tend to be proud that their inter-scorer variation is much lower than, say, that in the scoring of essays in history or English; however, this is largely because current math tests do not assess holistic performances on substantial tasks.

Poor understanding of test reliability causes problems. Ignoring the fine print in all test descriptions, users are inclined to view test results as perfectly accurate and take decisions on that basis. The likely degree of variability in test scores should be published, but the forms and language of communication have to be chosen with care: in common parlance, "error" conveys a judgment that somebody made a mistake, and to say that a result is "unreliable" may convey to many a judgment that it is not fit for the purpose (He & Opposs, 2010). Thus, it is important to distinguish between error, such as mistakes in scoring, and other sources of variability which are either in principle unavoidable or can only be avoided by unacceptable means, e.g. if twelve hours of formal testing were required to improve the sampling range.

Capacity for evolution

No test is perfect; some are not even adequate for their declared purpose. Further, the educational aims will change. For high-quality assessment, it is essential that tests can grow along with system improvement. This is particularly important when there are new initiatives, like the current one in the US based around the Common Core State Standards. (It is unlikely that current school systems can immediately absorb tests that fully reflect the aims of these standards; it would be unfortunate if limited first versions were regarded as a long-term solution.) Equally, an assessment system should be designed to counter degeneration under inevitable system pressures for:

- Task predictability: High-stakes tests make teachers, understandably, feel insecure. They seek tests that are highly predictable.
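The length-for-reliability trade-off noted under "Inadequate sampling" above can be made concrete with the Spearman-Brown relation from classical test theory. The paper itself gives no formula, and the figures below are illustrative assumptions, not data from the Black & Wiliam study; the relation simply shows why matching a short multiple-choice test's reliability with a broader, more heterogeneous mix of tasks demands substantially more testing time. If a test of given length has reliability $\rho$, a parallel test $k$ times as long has predicted reliability

$$\rho_k = \frac{k\,\rho}{1 + (k-1)\,\rho}.$$

For instance, if one hour of broad-spectrum tasks yields $\rho \approx 0.69$ (an assumed figure, reflecting the greater task-to-task variation of complex performances), then a four-hour form gives

$$\rho_4 = \frac{4 \times 0.69}{1 + 3 \times 0.69} = \frac{2.76}{3.07} \approx 0.90,$$

on a par with the reliability a one-hour test of short multiple-choice items can reach, and consistent with the fourfold length factor the study reports.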