A HISTORY OF STUDENT RATINGS: Meso-Unleaded Era (1990s)
The 1990s were like a nice, deep breath of fresh gasoline (which was $974.99 a gallon at the pump, $974.00 for cash only), hereafter referred to as the Meso-Unleaded Era. Little did anyone anticipate how this era would be trumped at the pump in the closing years of the current decade. The use of student rating scales had now spread to Kalamazoo (known to tourists as “The Big Apple”) and faculty began complaining about their validity, reliability, and overall value for decisions about promotion and tenure (the scales, that is, not Kalamazoo). This was not unreasonable, given the lack of attention to the quality of scales over the preceding 90 billion years.
(WEATHER ALERT: I interrupt this section to warn you of impending wetness in the next three paragraphs. You might want to don appropriate apparel. Don’t blame me if you get wet. You may now rejoin this section already in progress. END OF ALERT.)
This debate intensified throughout the decade with a torrential downpour of publications challenging and contributing to the evidence on the technical characteristics of the scales, particularly a series of articles by William Cashin of IDEA at Kansas State University, which was located in New Hampshire at the time, and an edited work by Mike Theall and Jennifer Franklin (Student Ratings of Instruction, 1990). As part of this debate, another steady stream of research flowed toward alternative strategies to measure teaching effectiveness, especially peer ratings, self-ratings, videos, alumni ratings, interviews, learning outcomes, teaching scholarship, and teaching portfolios.
This stream leaked into books by John Centra (Reflective Faculty Evaluation, 1993), Larry Braskamp and John Ory (Assessing Faculty Work, 1994), Peter Seldin (Improving College Teaching, 1995), and Raoul Arreola (1st and 2nd editions of Developing a Comprehensive Faculty Evaluation System, 1995, 2000), and an edited volume by Seldin and Associates (Changing Practices in Evaluating Teaching, 1999). They furnished a confluence of valuable resources for faculty and administrators to use to evaluate teaching.
This cascading trend was also reflected increasingly in practice. Although use of student ratings had peaked at 88% by the end of the decade, peer and self-ratings were on the rise over the rapids of teaching performance as my liquid metaphor came to a screeching halt.
My next blog will address developments in the Meso-Responserate Era, the first decade of the new millennium.
COPYRIGHT © 2010 Ronald A. Berk, LLC
WHAT ARE ITEM SCORES?
The next level is the item, where a statistic such as a mean or median is reported. Since anchor distributions are usually negatively skewed and the answers fall on a ranked, or ordinal, scale, the median is the most appropriate measure of central tendency. However, given the range of distributions that can occur, you may see both the mean and the median on your report form.
“WAIT!! Back up. How did you get from responses of SD, D, etc. to means and medians?” Great question! Glad you’re on the ball. First, you have to convert the “verbal” anchors into “numbers.”
(MEASUREMENT ALERT: Keep in mind that we started with a “qualitative scale” of verbal expressions of how students feel about each behavior and now we’re converting the words into a “quantitative scale” for the convenience of performing analysis of those feelings. Actually, this conversion involves an arbitrary numerical coding scheme.)
CREATE A ZERO-BASED NUMERICAL SCORE SCALE: For simplicity and interpretability, a zero-based scale is recommended, so that the most negative anchor, such as SD, would be coded as “0.” Zero-based scoring was originally recommended by Likert (1932), who created this scaling method. Then the other anchors would be coded in 1-point increments above 0.
Higher values are assigned to the more desirable or positive ratings than to the negative ones. SA, or Strongly Agreeing with a desirable teaching behavior or course characteristic, is weighted with the highest value of 3. An example of this coding for a 4-point, agree–disagree scale is shown below:
SD D A SA
0 1 2 3
The score range for this single item is 0 to 3. (Note: These score points will vary with the number of anchors and the base number on different scales. Yours may be one of these. Sometimes the number 1 is used as the base instead of 0. Although the number scale may be different, the final interpretation will be similar.)
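For readers who like to see the mechanics, here is a minimal sketch in Python (the dictionary and function names are mine, purely for illustration) of the arbitrary coding scheme just described for a zero-based, 4-point agree–disagree scale:

    # Hypothetical zero-based coding scheme: verbal anchors to numeric codes.
    # The mapping is arbitrary; a 1-based scheme (SD=1 ... SA=4) works the same way.
    ANCHOR_CODES = {"SD": 0, "D": 1, "A": 2, "SA": 3}

    def code_responses(verbal_responses):
        """Convert verbal anchors (e.g., ["SA", "A", "SD"]) to numeric scores."""
        return [ANCHOR_CODES[r] for r in verbal_responses]

    print(code_responses(["SA", "A", "SA"]))  # [3, 2, 3]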
COMPUTATION OF ITEM MEANS AND MEDIANS: If you hate stat, this section may make you hurl. Skip it. (SIDEBAR: Over 30 years of teaching stat, I had lots of student hurlers.) For you interested nonhurlers, here are the simple computational definitions:
MEAN = the sum of all students’ scores on an item, divided by the number of students, or N. This is the average score for the item, which falls within the range of 0–3 for this example.
MEDIAN = the middle score, after all students’ scores are ranked from high to low.
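If a few lines of code are clearer than the definitions, here is a minimal Python sketch (the data and function names are hypothetical) that computes both indices from a set of coded item scores:

    # Item mean and median from zero-based coded scores (0-3 in this example).
    def item_mean(scores):
        return sum(scores) / len(scores)            # average score on the item

    def item_median(scores):
        ranked = sorted(scores)                     # rank the scores
        n = len(ranked)
        mid = n // 2
        if n % 2:                                   # odd N: the single middle score
            return float(ranked[mid])
        return (ranked[mid - 1] + ranked[mid]) / 2  # even N: average of the two middle scores

    scores = [3, 3, 2, 3, 1, 2, 3, 0]               # eight hypothetical students
    print(item_mean(scores), item_median(scores))   # 2.125 2.5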
An example report, based on the anchor data shown in the previous blog, is shown below:
            SD      D       A       SA      N     Mean   Median
Statement 1 1.0%    3.1%    37.5%   58.6%   96    2.52   3.00
Statement 2 1.0%    3.1%    24.0%   71.9%   96    2.65   3.00
Statement 3 1.1%    1.1%    28.9%   68.9%   90    2.57   3.00
The median score of 3 means the typical student in the middle of the distribution rated those behaviors as SA. The means were slightly lower with ratings between A and SA. Those are very respectable scores. Of course, they are consistent with the anchor percentage distribution, where the highest percentages are concentrated on the A and SA anchors.
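To see where those report numbers come from, here is a hedged Python sketch that reconstructs the Statement 1 indices from its anchor distribution. The counts are back-calculated from the published percentages (N = 96), so the mean lands a hair off the reported 2.52 because of rounding; the median matches exactly:

    # Statement 1, N = 96: counts inferred from 1.0%, 3.1%, 37.5%, 58.6%.
    counts = {0: 1, 1: 3, 2: 36, 3: 56}             # SD, D, A, SA

    scores = [code for code, n in counts.items() for _ in range(n)]
    mean = sum(scores) / len(scores)                # about 2.53 (report shows 2.52)
    ranked = sorted(scores)
    median = (ranked[len(ranked) // 2 - 1] + ranked[len(ranked) // 2]) / 2  # 3.0
    print(round(mean, 2), median)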
So which index should you use? Mean? Median? Or both? Ah ha! The statistical plot thickens. Stay tuned…
COPYRIGHT © 2010 Ronald A. Berk, LLC