Comparing Test Equating by Item Response Theory and Raw Score Methods with Small Sample Sizes on a Study of the ARTé: Mecenas Learning Game

Steven W. Carruthers

Commenced in January 2007

Frequency: Monthly

Edition: International

Paper Count: 33122

Comparing Test Equating by Item Response Theory and Raw Score Methods with Small Sample Sizes on a Study of the ARTé: Mecenas Learning Game

Authors: Steven W. Carruthers

Abstract:

The purpose of the present research is to equate two test forms as part of a study to evaluate the educational effectiveness of the ARTé: Mecenas art history learning game. The researcher applied Item Response Theory (IRT) procedures to calculate item, test, and mean-sigma equating parameters. With the sample size n=134, test parameters indicated “good” model fit but low Test Information Functions and more acute than expected equating parameters. Therefore, the researcher applied equipercentile equating and linear equating to raw scores and compared the equated form parameters and effect sizes from each method. Item scaling in IRT enables the researcher to select a subset of well-discriminating items. The mean-sigma step produces a mean-slope adjustment from the anchor items, which was used to scale the score on the new form (Form R) to the reference form (Form Q) scale. In equipercentile equating, scores are adjusted to align the proportion of scores in each quintile segment. Linear equating produces a mean-slope adjustment, which was applied to all core items on the new form. The study followed a quasi-experimental design with purposeful sampling of students enrolled in a college level art history course (n=134) and counterbalancing design to distribute both forms on the pre- and posttests. The Experimental Group (n=82) was asked to play ARTé: Mecenas online and complete Level 4 of the game within a two-week period; 37 participants completed Level 4. Over the same period, the Control Group (n=52) did not play the game. The researcher examined between group differences from post-test scores on test Form Q and Form R by full-factorial Two-Way ANOVA. The raw score analysis indicated a 1.29% direct effect of form, which was statistically non-significant but may be practically significant. The researcher repeated the between group differences analysis with all three equating methods. For the IRT mean-sigma adjusted scores, form had a direct effect of 8.39%. Mean-sigma equating with a small sample may have resulted in inaccurate equating parameters. Equipercentile equating aligned test means and standard deviations, but resultant skewness and kurtosis worsened compared to raw score parameters. Form had a 3.18% direct effect. Linear equating produced the lowest Form effect, approaching 0%. Using linearly equated scores, the researcher conducted an ANCOVA to examine the effect size in terms of prior knowledge. The between group effect size for the Control Group versus Experimental Group participants who completed the game was 14.39% with a 4.77% effect size attributed to pre-test score. Playing and completing the game increased art history knowledge, and individuals with low prior knowledge tended to gain more from pre- to post test. Ultimately, researchers should approach test equating based on their theoretical stance on Classical Test Theory and IRT and the respective assumptions. Regardless of the approach or method, test equating requires a representative sample of sufficient size. With small sample sizes, the application of a range of equating approaches can expose item and test features for review, inform interpretation, and identify paths for improving instruments for future study.

Keywords: Effectiveness, equipercentile equating, IRT, learning games, linear equating, mean-sigma equating.

Digital Object Identifier (DOI): doi.org/10.5281/zenodo.1316395

Procedia APA BibTeX Chicago EndNote Harvard JSON MLA RIS XML ISO 690 PDF Downloads 1019

References:

[1] M. J. Kolen and R. L. Brennan, Test Equating, Scaling, and Linking: Methods and Practices. New York: Springer, 2014.
[2] F. M. Lord, Applications of Item Response Theory to Practical Testing Problems. Hillsdale, NJ: Erlbaum, 1980.
[3] R. K. Hambleton, H. Swaminathan, and H. J. Rogers, Fundamentals of Item Response Theory, vol. 2, New York: Sage, 1991.
[4] L. L. Cook and N. S. Paterson, “Problems related to the use of conventional and item response theory equating methods in less than optimal circumstances,” Applied Psychological Measurement, vol. 11, no. 3, pp. 225-244, 1987.
[5] J. González, “SNSequate: Standard and nonstandard statistical models and methods for test equating,” Journal of Statistical Software, vol. 59, no. 7, pp. 1-30, 2014.
[6] S. A. Livingston, “Equating Test Scores (Without IRT).” Princeton, NJ: Educational Testing Service, 2004.
[7] M. J. Kolen, “Effectiveness of analytic smoothing in equipercentile equating,” Journal of Educational Statistics, vol. 9, no. 1, pp. 25-44, 1984.
[8] J. P. Meyer and S. Zhu, “Fair and equitable measurement of student learning in MOOCs: An introduction to item response theory, scale linking, and score equating,” Research & Practice in Assessment, vol. 8, 2013.