In the course of their career, most educators will use a range of evaluation and testing methods.
They might also use more than one way to measure test results. One of the main ways to do this is via the criterion-referenced test, but how does it work exactly? And how is it different from other methods of measuring results, such as norm-referenced (bell curve) testing?
A criterion-referenced test assesses a person’s knowledge, ability, or skills against a predetermined standard. This means that in a classroom situation, each individual’s test results are measured against the set standard and not against the performance of other students in the class (or other students throughout a city, region, or entire country). In other words, the performance of the rest of the group taking the test will not affect the individual’s end mark or grade.
This assessment measures student performance against a clearly defined set of criteria or standards. These might include statements of what students should know at specific ages or learning stages, otherwise known as ‘learning outcomes’.
Criterion-referenced tests use what is known as ‘cut scores’ to assess whether students have passed a test — in other words, achieved the desired learning outcomes. In some cases, this assessment method also places results into tiered categories of achievement, for instance, ‘A’ grades or scores of ‘Excellent’.
Criterion-referenced tests can also be used before a course begins to assess a student’s level and place them in the group that matches their abilities (for instance, ‘Basic’, ‘Intermediate’, or ‘Advanced’). Again, these group allocations are based on the cut scores and not on how the student compares to the rest of the class.
This means that, theoretically, every student could fail a test in a single classroom, or every student could get an A. This is because everything is measured by the cut score and the performance of the individual, not the group. In other words, it is not about relative bell curve measurements.
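The point about cut scores can be illustrated with a minimal Python sketch. The cut scores, grade labels, and function names below are hypothetical values chosen for illustration; real cut scores are set by the educator or committee responsible for the test.

```python
from bisect import bisect_right

# Hypothetical cut scores (percentages), chosen for illustration only.
CUT_SCORES = [50, 60, 70, 80]          # lower bounds for D, C, B, A
GRADES = ["Fail", "D", "C", "B", "A"]  # one more label than there are cut scores


def criterion_grade(percent: float) -> str:
    """Return the grade for one score, using only the fixed cut scores."""
    # bisect_right counts how many cut scores the student has reached.
    return GRADES[bisect_right(CUT_SCORES, percent)]


def grade_class(scores: list[float]) -> list[str]:
    """Grade each student independently of everyone else in the group."""
    return [criterion_grade(s) for s in scores]


if __name__ == "__main__":
    print(grade_class([85, 92, 88]))  # every student reaches the top cut score
    print(grade_class([30, 45, 49]))  # every student falls below the pass mark
```

Because `criterion_grade` never looks at the other scores, a whole class can land in the same category, which is the behavior described above.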
But how are these cut scores established? Depending on the importance of the test, they might be defined by a single educator or an entire committee of academic experts. Either way, they will decide how the test should be implemented (e.g., the style of assessment and specific questions) and the cut scores. For instance, what are the percentage criteria for a pass? And what percentage is needed for an A, B, C, or D, respectively?
In other words, criterion-referenced tests don’t have one universal standard — for instance, a standard agreement that anything above 60% is always a B-grade. Instead, it is down to the individual educator or committee to decide on this, meaning that cut scores might vary widely.
Assessments can also be given different measurement criteria, for instance, letters, e.g., A to E, numbers, e.g., 1 to 5, or categories, e.g., ‘excellent/good/satisfactory/unsatisfactory’. Sometimes, there may be a straightforward cut score to determine either a pass or fail without the scores being broken down into tiered achievement categories.
This type of assessment can be implemented in many different ways, including:
This kind of assessment can have many different purposes, including:
Criterion-referenced tests can also be used in high and low-stakes evaluations, from casual class quizzes to end-of-year exams.
Criterion-referenced tests differ from norm-referenced tests because the latter are based on how each student performs relative to their peers.
In other words, norm-referenced tests are designed to rank individuals on a bell curve, meaning that when the scores are plotted on a graph, they form a bell shape. This kind of result is achieved when a small percentage of students get low scores, a small percentage get high scores, and the majority get average scores.
So, for instance, if most students in a class get a low score, then the criteria will be adjusted to bring some of those scores into the average range instead. The goal is to achieve a bell curve result, with the view being that if this is not achieved, then the test was not devised correctly in some way. For instance, it was too easy, too difficult, or otherwise unsuitable for that particular group.
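To make the contrast concrete, here is a rough Python sketch of one simple way to grade relative to a group: by percentile rank. The band thresholds, the sample groups, and the function names are assumptions made for illustration. Real norm-referenced tests use their own norming procedures, which this sketch does not attempt to reproduce.

```python
# Hypothetical bands: the share of the group a student must outperform
# to earn each grade. Chosen for illustration only.
NORM_BANDS = [(0.9, "A"), (0.7, "B"), (0.3, "C"), (0.1, "D")]


def percentile_rank(score: float, group: list[float]) -> float:
    """Fraction of the group scoring strictly below this score."""
    below = sum(1 for s in group if s < score)
    return below / len(group)


def norm_grade(score: float, group: list[float]) -> str:
    """Grade a score by its position in the group, not by a fixed standard."""
    rank = percentile_rank(score, group)
    for threshold, grade in NORM_BANDS:
        if rank >= threshold:
            return grade
    return "Fail"


if __name__ == "__main__":
    strong_group = [70, 75, 80, 85, 90, 92, 94, 96, 98, 99]
    weak_group = [20, 30, 35, 40, 45, 50, 55, 60, 65, 70]
    # The same raw score of 70 earns different grades in different groups.
    print(norm_grade(70, strong_group))
    print(norm_grade(70, weak_group))
```

In this sketch a raw score of 70 is the lowest mark in one group and the highest in the other, so its grade depends entirely on who else took the test.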
Here are some key arguments in support of this type of testing:
Criterion-referenced tests are arguably fairer than norm-referenced tests, as they are not relative to the particular class or group and are designed along a consistent set of standards. So there is no chance of an individual’s grades being unfairly distorted by, for example, a few wealthier students in the class whose parents can afford to give them private tuition.
Related to the above, this method is a better way to measure the actual progress of individual learners concretely, as the results aren’t ‘muddied’ by the performance of others in the group. As it applies the same learning expectations to everyone, this test can encourage students from disadvantaged backgrounds to achieve more. Conversely, if students within a disadvantaged group have to achieve less to get an ‘A’ (as would be the case with a norm-referenced bell curve test), then it is argued that this would not push them to achieve their full potential.
However, this type of assessment is not without its critics. Some of the key concerns include the following:
Criterion-referenced tests are only as fair as the learning standards that they are based on. For instance, if a committee devises a set of faulty cut scores that are either too strict or too lenient, then the test will not accurately measure knowledge or skills. In the end, there can be a subjective element to working out pass scores and signs of proficiency. After all, committees are made up of human beings subject to error, bias, and misjudgment.
This testing system is subject to ‘fudging’ or manipulation of results — or even outright corruption. For instance, schools or entire districts might tamper with criteria for cut grades in assessments that aren’t nationally standardized. This is so they can avoid developing a bad reputation, attracting negative media coverage, or even losing funding. Also, when an individual’s job might be at stake due to poor test results — whether a teacher or a school principal — this could also encourage tampering or corruption.
Criterion-referenced testing can be time-consuming and labor-intensive, as well as expensive. For instance, keeping the tests up to date might require the input of expert committees, which is not a small undertaking.
As we have seen, the main alternative to the criterion-referenced test is norm-referenced testing. Here are some advantages of the latter:
It is a valuable way of measuring an individual’s performance within a specific group. This is sometimes necessary for educators, particularly if their group is ‘outside of the average’ in some way, for instance, from an underprivileged or highly privileged area.
This method is a good way of gathering normative data across larger groups, for instance, entire states or countries. This can be important in educational research, policy-making, and funding allocation.
Norm-referenced tests do not cause students from disadvantaged groups to feel discouraged, as they are not being measured by pre-existing criteria that might be unfair to them in some way. Instead, they are being measured against a group of peers in similar circumstances, which could create a more level playing field.
As with criterion-referenced tests, norm-referenced assessments have also gathered criticisms. Here are a few of the main ones:
Who defines the ‘norms’ in a norm-referenced test? And what happens when these aren’t relevant to some groups? For instance, if test results are based on a national bell curve (rather than a local or class-based one), then a too generalized — or unfair or biased — set of norms might be applied.
A norm is not the same as a standard — in other words, this kind of assessment does not have set criteria to measure performance. It compares test takers to their particular group and bases scores on this. However, this does not measure whether the test taker has an adequate level of skill or knowledge in the subject area, nor does it measure who is truly excelling. The results are not concrete.
Norm-referenced testing can upset or anger students. This is because they might perceive bell curve grading as ‘unfair’ if the method makes their grade lower than what it might have been in a criterion-referenced assessment.
Choosing between criterion and norm-referenced tests isn’t the only thing to reflect upon when devising a test. Here are a few other factors to consider:
A test should be based on the key learning outcomes of a module, unit, or course. Usually, these outcomes are based on deciding what the reasonable — or required — knowledge or skills expectations should be upon course completion.
Multiple choice? Practical exercises? Open-ended questions? Deciding on a suitable assessment method can depend on the specific knowledge you are trying to cultivate in a class or group. For instance, do you want students to have a general understanding of a subject and be able to recall concrete facts? Or is the aim to develop abstract, critical, or imaginative thinking?
Or should they be able to perform a practical set of skills better evaluated via, say, roleplay scenarios rather than a written test?
How high are the stakes of this assessment? And will you be assessing for a simple pass or fail, or placing results into categories of excellence? Also, what will the passing grade be? 50%? 60%? 70%? Again, this all depends on the purpose and importance of the assessment itself. For instance, if you are running a job training course where 50% recall of a subject wouldn’t cut it within the actual role, then you may want to set the pass mark higher than this.
Or, if you want to encourage students to strive for excellence, you might want to devise an A to D grading system instead of simple pass-fail criteria.
For instance, does your test paper use framing, concepts, or terminology that students with English as a second language might struggle to understand? Or are some of the questions based on the cultural contexts and norms of a particular social class or ethnic group?
As cultural biases are a form of blind spot, they can fly under the radar within testing. That is why it is crucial to be vigilant about them.
Here are some of the key takeaways about this method of assessment:
Hopefully, you now understand criterion-referenced tests better, including how their pros and cons compare with those of norm-referenced tests.
Of course, no testing method is perfect, but if you are devising your own, the critical thing to bear in mind is that context is key. This can help you decide on the best evaluation method to meet the needs of your class and course overall.