(10) where $n(\cdot)$ is the (unit) normal density function. In particular, $\gamma_{j1}$

By well-known theory, the graph of the normal ogive $N(\cdot)$ is virtually indistinguishable from that of the logistic function but, unlike the latter, allows fairly straightforward development of the desired harmonic series, and was therefore preferred for this purpose. McDonald (1982) showed that the normal ogive is well approximated by the cubic polynomial obtained by terminating the series in Eq. (7) at p = 3.
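The closeness of the two curves can be checked numerically. The sketch below is illustrative only: it assumes the conventional scaling constant 1.702 for the logistic argument, which is not stated in the text, and the function names are hypothetical.

```python
import math


def normal_ogive(x: float) -> float:
    """Cumulative standard normal distribution N(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def logistic(x: float, scale: float = 1.702) -> float:
    """Logistic function; scale=1.702 is an assumed, conventional constant."""
    return 1.0 / (1.0 + math.exp(-scale * x))


def max_abs_difference(lo: float = -6.0, hi: float = 6.0, steps: int = 2401) -> float:
    """Largest absolute gap between the two curves on an evenly spaced grid."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(abs(normal_ogive(x) - logistic(x)) for x in grid)


if __name__ == "__main__":
    print(f"max |N(x) - logistic(1.702 x)| = {max_abs_difference():.4f}")
```

With this scaling the two curves differ by less than one hundredth in probability over the whole grid, which is the sense in which their graphs are hard to tell apart.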

for $\beta_{j1}$ in Eqs. (9) and (10), and $h_p(\cdot)$ is given by Eq. (8) as before. The first r terms of the series in Eq. (13), which will be denoted by $\phi_r(\cdot)$, yield a polynomial of degree r which, like the multidimensional normal ogive in Eq. (12), is constant on planes of dimension $k - 1$ orthogonal to the vector $\beta_j$ and yields a weighted least squares approximation to it, as in the unidimensional case treated by McDonald (1967). It further follows that the proportion of examinees passing an item is given by $\pi_j = \gamma_{j0}$

A multidimensional normal ogive model with a linear combination rule for the latent traits, the NOHARM model (Normal-Ogive Harmonic Analysis Robust Method), is defined as $P\{U_j = 1 \mid \theta\} = N\{\beta_{j0} + \beta_j'\theta\}$,

where, as before, $N(\cdot)$ is the cumulative normal distribution function. It is assumed that $\theta$ is random with a k-variate normal distribution, and a metric is chosen such that each component of $\theta$ has mean zero and variance unity. We write P for the $k \times k$ covariance (correlation) matrix of $\theta$ and $B = [\beta_{jk}]$,


In some applications, a pattern may be prescribed for B, with elements constrained to be zero (to define a "simple structure") or subject to other desired constraints, as in the counterpart factor-loading matrix in multiple

In practice, the first r terms of Eq. (16), denoted by $\pi_{jk}^{(r)}$, may be substituted as a finite approximation to it. Constructed data studies of the unidimensional case have shown that the precision of estimation of the item parameters does not depend upon the number of terms retained in the polynomial approximation. On the other hand, the residuals tend, of course, to reduce in absolute value as r is increased, though convergence is very rapid, and the contribution to fit from terms beyond the cubic in general seems negligible.

However, at this point it should be noted that the theory just described is much closer to that of Christoffersson (1975) than might be obvious at first sight. Indeed, Eq. (16) is identical with Eq. (21) below, given by Christoffersson, except for a reparameterization. Christoffersson defined a set of unobservable variables, v, that follow the multiple common factor model, say, $v_j = \lambda_j' f + \delta_j$

15. Normal-Ogive Multidimensional Model

Roderick P. McDonald

where $\Lambda' = [\lambda_1, \ldots, \lambda_n]$ is a matrix of common factor loadings, f is a vector of common factors, and $\delta_j$ is the jth unique factor. He then supposed that $U_j = 1$ if $v_j > t_j$, and $U_j = 0$ otherwise.

and $h_p(\cdot)$ is given by Eq. (8). Using Eq. (20), we may rewrite the result (16) obtained by harmonic analysis of the model in Eq. (12) as

in IRT but not in factor analysis may even yet have missed the point that Christoffersson's (1975) "factor analysis of dichotomized variables" is indeed the multidimensional normal ogive model with a distinct parameterization. Purely in terms of theory it is pleasing to find that the same result can be obtained either as the solution to the problem of evaluating a double integral connected with the normal distribution or as a solution to the problem of approximating a strictly nonlinear regression function by a wide-sense linear regression function. The main practical implication of the result is that it gives grounds for reducing numerical work by using what at first sight would appear to be unacceptably crude approximations, such as the simple linear approximation in Eq. (16). That is, fitting an approximation to the model may be considered instead of fitting the model itself, thereby estimating the parameters reasonably precisely, where there might not have been an expectation of obtaining reasonably precise estimates of the parameters using very crude approximations to the integrals required by the model itself.

The existence of the alternative parameterizations in Eqs. (11) and (24) for the multidimensional normal ogive requires some comment. First, it should be noted that in the well-studied unidimensional case there are three distinct parameterizations in common use, and the meaning of all three and the relationships between them do not seem to be widely understood. Given the seminal work of Lord, it is natural that much work on the unidimensional case has employed Lord's parameterization, namely $P\{U_j = 1 \mid \theta\} = N\{a_j(\theta - b_j)\}$,



To put this another way, Christoffersson's work implies writing the multidimensional normal-ogive model as $P\{U_j = 1 \mid \theta\} = N\{(t_j + \lambda_j'\theta)\,\psi_{jj}^{-1/2}\}$

It is then immediately evident that Eqs. (16) and (21) are identical except for a choice of parameterization. Each is obtained from the other by writing

in place of Eq. (12). The equivalence of Christoffersson's (1975) tetrachoric series to McDonald's (1967) harmonic analysis has both theoretical and practical consequences, apart from the possibility that a few research workers interested

where $b_j$ and $a_j$ range over the real numbers. (After fitting the model it is possible to rescore items with negative $a_j$ so that $a_j$ ranges over the positive numbers.) Since the location of the point of inflection of the curve is $b_j$ and its slope at that point is $a_j$, these parameters are correctly described as location and slope parameters. In the special case of cognitive items, where $\theta$ is an "ability," it is reasonable to call $b_j$ a "difficulty" parameter. Unfortunately, Lord's parameterization does not possess a multidimensional counterpart.

In the second parameterization, in Eq. (6), $\beta_{j1}$ has the same value and meaning as $a_j$. However, $\beta_{j0} = -a_j b_j$ is no longer a location parameter. It is still possible to say that the point of inflection is given by the solution of $\beta_{j0} + \beta_{j1}\theta = 0$, thus indirectly giving an interpretation to $\beta_{j0}$.

In the common factor parameterization in Eq. (24), the unidimensional (Spearman) case is $P\{U_j = 1 \mid \theta\} = N\{(t_j + \lambda_j\theta)\,\psi_{jj}^{-1/2}\}$.
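The relationships among the three unidimensional parameterizations can be sketched in a few lines. This is an illustration under stated assumptions rather than anything taken from NOHARM: the function names are hypothetical, and the common factor conversion assumes that both the latent trait and the underlying response variable have unit variance, so that the uniqueness equals one minus the squared loading.

```python
import math
from typing import Tuple


def normal_cdf(x: float) -> float:
    """Cumulative standard normal distribution N(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def lord_to_irt(a: float, b: float) -> Tuple[float, float]:
    """Lord's (a_j, b_j) to IRT (beta_j0, beta_j1), using beta_j0 = -a_j * b_j."""
    return -a * b, a


def irt_to_common_factor(beta0: float, beta1: float) -> Tuple[float, float, float]:
    """IRT (beta_j0, beta_j1) to common factor (t_j, lambda_j, psi_jj).

    Assumption: unit total variance, so psi_jj = 1 - lambda_j ** 2.
    """
    scale = math.sqrt(1.0 + beta1 ** 2)
    lam = beta1 / scale
    return beta0 / scale, lam, 1.0 - lam ** 2


def common_factor_to_irt(t: float, lam: float) -> Tuple[float, float]:
    """Common factor (t_j, lambda_j) back to IRT (beta_j0, beta_j1)."""
    psi = 1.0 - lam ** 2
    if psi <= 0.0:
        # lambda_j at or beyond 1 leaves no unique variance (improper solution).
        raise ValueError("psi_jj <= 0: improper solution")
    root = math.sqrt(psi)
    return t / root, lam / root


def prob_irt(beta0: float, beta1: float, theta: float) -> float:
    """N(beta_j0 + beta_j1 * theta)."""
    return normal_cdf(beta0 + beta1 * theta)


def prob_common_factor(t: float, lam: float, psi: float, theta: float) -> float:
    """N((t_j + lambda_j * theta) / sqrt(psi_jj))."""
    return normal_cdf((t + lam * theta) / math.sqrt(psi))
```

Under these assumptions the two probability functions return the same value for corresponding parameter sets, which is one way to see that the parameterizations describe a single model.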

By Eq. (18), the threshold parameter $t_j = N^{-1}(\pi_j)$ is a convenient transform of the classical difficulty parameter, and thus is directly interpretable. The


quantities $\lambda_j$ and $\psi_{jj}$ are factor loadings and residual variances (uniquenesses) interpretable by the usual factor-analytic standards. We note that as $\lambda_j \to 1$ and $\psi_{jj} \to 0$, $\beta_{j1} \to \infty$ and we have an improper solution or Heywood case.

In the multidimensional case, Lord's parameterization is no longer directly available. Since terminology has not yet settled, it is recommended that the parameterization in Eq. (11) be referred to as the IRT parameterization and the one in Eq. (24) as the common factor parameterization of the multidimensional normal ogive model. As Reckase (1985) points out, in the former, since $P\{U_j = 1\} = .5$ at points on the $(k - 1)$-dimensional plane $\beta_{j0} + \beta_j'\theta = 0$ (28), to which $\beta_j$ is orthogonal, it follows immediately that the minimum distance from the origin to the plane defined by Eq. (28) is along the direction-vector $\beta_j$ and is given by $d_j = -\beta_{j0}/(\beta_j'\beta_j)^{1/2}$ (29), thus generalizing Lord's location parameter. As in the unidimensional case the components of $\beta_j$ range over the real numbers, but it is no longer possible in general to score the items so that they are all positive (as will be seen in the examples below).

In the common factor parameterization, once again there is a simple interpretation, by (18), of $t_j$ as the inverse-normal transformation of classical item "difficulty," which is somewhat more transparent than $\beta_{j0}$ in (11), and the loading matrix

contains numbers that can be interpreted as (standardized) item factor loadings, with the classical criteria for salient versus negligible values. The NOHARM program, as described below, allows rotation of the factor loading matrix and the corresponding matrix of direction-vectors


(Normal Ogive Harmonic Analysis Robust Method), the threshold or position parameter is estimated in closed form by solving the sample analogue of Eq. (18), and reparameterizing it if desired, and the parameters $\beta_j$ are obtained by unweighted least squares, minimizing $q = \sum_{j \neq k} (P_{jk} - \pi_{jk}^{(r)})^2$,

where $\pi_{jk}^{(r)}$ is the r-term approximation to $\pi_{jk}$ in Eq. (16), by a quasi-Newton algorithm. The combination of the random-regressors model and the weak principle of local independence with the use of unweighted least squares makes it possible to analyze quite large numbers of items with unlimited sample size. The program allows users to read in "guessing" parameters for multiple choice cognitive items, setting the lower asymptote of the expression in Eq. (13) to a prescribed constant. (An attempt to estimate these parameters in an early version of the program suffered the usual difficulties for models with such parameters.)

The decision to employ ULS rather than the obvious choice of GLS was primarily determined by a desire to handle large data sets. The use of GLS by Christoffersson (1975) and Muthen (1978) asymptotically yields, like ML, SEs and a chi-squared test of significance, but limits applications to rather small data sets. It was also conjectured that the weight matrix for GLS would be poorly estimated until the sample size becomes extremely large. Unpublished studies show that the method gives satisfactory estimates of the parameters with sample sizes down to one hundred, and that it is reasonably robust against violations of normality of the latent trait distribution. (With an extreme distribution created by truncating the standard normal at zero, the location parameters were correctly ordered but somewhat underestimated at one end.)
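The sample quantities that enter this estimation step are simple to compute. The following is a minimal sketch, not NOHARM's code: it assumes a persons-by-items list of 0/1 scores, the function names are hypothetical, and the threshold is taken as the inverse-normal transform of the proportion passing each item.

```python
from statistics import NormalDist
from typing import List, Sequence, Tuple


def item_proportions(
    data: Sequence[Sequence[int]],
) -> Tuple[List[float], List[List[float]]]:
    """Return (P_j, P_jk): proportions passing each item and each item pair.

    `data` holds one row per examinee and one 0/1 score per item.
    """
    n_persons = len(data)
    n_items = len(data[0])
    p = [sum(row[j] for row in data) / n_persons for j in range(n_items)]
    p_joint = [
        [sum(row[j] * row[k] for row in data) / n_persons for k in range(n_items)]
        for j in range(n_items)
    ]
    return p, p_joint


def thresholds(p: Sequence[float]) -> List[float]:
    """Closed-form thresholds t_j = N^{-1}(P_j).

    NormalDist.inv_cdf raises an error when a proportion is exactly 0 or 1.
    """
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf(pj) for pj in p]
```

The least squares step itself needs the series approximation in Eq. (16) and is not sketched here.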

to approximate simple structure, or to fit a prescribed pattern of zero, nonzero, or equated coefficients, including prescribed simple structure, independent clusters, or basis items (items chosen to measure just one latent trait). The position taken here is that these devices are virtually essential for the substantive application of a multidimensional model, as without them the structure of the data cannot be understood.

Parameter Estimation

Let $P_j$ be the proportion of examinees passing item j, and $P_{jk}$ be the proportion passing items j and k, in a sample of size N. In program NOHARM

Goodness of Fit

An obvious lacuna, following from the use of ULS, is the lack of a statistical test of significance of the model. One way to rationalize away this defect of the NOHARM approach is to refer to Bollen and Long (1993) for a general recognition that all of these models are at best approximations and any restrictive model will be rejected at one's favorite level of statistical significance, given a sufficiently large sample size. Following Tanaka (1993) we can conveniently define a goodness-of-fit index $\gamma_{\mathrm{ULS}} = 1 - \mathrm{Tr}(R^2)/\mathrm{Tr}(S^2)$,

where S is the item sample covariance matrix and R the residual covariance matrix. The application of this criterion is illustrated below. Perhaps equally important, we can use an old principle from the practice of factor


TABLE 1. Sample Raw Product-Moments and Sample Covariances for LSAT-7 Item 1

TABLE 2. Parameter Estimates: Unidimensional. Table 2a

Data from sections of the Law School Admissions Test (LSAT) have been used by a number of writers to illustrate IRT. In particular, LSAT7 has been treated by Christoffersson (1975) as a two-dimensional case of the normal ogive model. The sample raw product-moment matrix (the product of the raw-score matrix and its transpose) for five items from LSAT7, with N = 1,000, is given in Table 1, together with the sample covariances. Program NOHARM (Fraser, 1988) was used to fit three models to the data.

Table 2 gives the results of fitting a unidimensional model, with Lord's parameterization, the IRT parameterization, and the common factor parameterization in Table 2a, and the item residuals in Table 2b. Tanaka's Index (25) is .99924, but the residuals can be said to be large relative to the item covariances in Table 1.

Table 3 gives the results of an exploratory two-dimensional analysis, with oblique solutions in Table 3a in the IRT and common factor parameterizations. Residuals are shown in Table 3b. The Tanaka Index becomes .999967 and the residuals are correspondingly reduced compared with the unidimensional solution. The oblique common factor solution suggests fairly clearly that item 1 measures the first dimension well, while items 2 and 3 primarily measure the second dimension, with items 4 and 5 dimensionally complex

Note to Table 1: Product-moments in lower triangle, including the diagonal; covariances in upper triangle.

TABLE 3. Parameter Estimates: Two-Dimensional Exploratory Table 3a

analysis that a model is a sufficiently close approximation to the data if we cannot find a more complex model that is identified and interpretable. For this purpose, inspection of the item residual covariance matrix is at least as useful as a goodness-of-fit index and possibly more so, since it is difficult to set criterion values for the application of such indices. A heuristic device for taking account of sample size is to recognize that, by standard sampling theory, the SEs of the parameters are of the order of $N^{-1/2}$, where N is the sample size.
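A goodness-of-fit index built from the sample covariance matrix S and the residual covariance matrix R can be computed directly once both are available. The sketch below assumes the usual unweighted least squares form, one minus the ratio of the trace of R squared to the trace of S squared; the function names and the small matrices in the usage lines are hypothetical and are not the LSAT-7 values.

```python
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def trace_of_square(m: Matrix) -> float:
    """Tr(M M) for a square matrix; for symmetric M this is the sum of squared entries."""
    n = len(m)
    return sum(m[i][k] * m[k][i] for i in range(n) for k in range(n))


def uls_fit_index(s: Matrix, r: Matrix) -> float:
    """Assumed form of the index: 1 - Tr(R^2) / Tr(S^2)."""
    return 1.0 - trace_of_square(r) / trace_of_square(s)


if __name__ == "__main__":
    s = [[0.25, 0.05], [0.05, 0.20]]   # made-up item covariance matrix
    r = [[0.0, 0.005], [0.005, 0.0]]   # made-up residual covariance matrix
    print(uls_fit_index(s, r))
```

Because the index is dominated by the item variances on the diagonal of S, values very close to one can coexist with residuals that are large relative to the off-diagonal covariances, which is why inspecting R itself remains useful.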

and not measuring either dimension very well. The correlation between the item factors is .621. For completeness of illustration, Table 4 gives results from a confirmatory solution based simply on the apparent pattern of the exploratory results, not on substantive grounds. (This practice is, of course, not recommended.) The Tanaka Index is .999966, and the residuals remain reasonably small, suggesting an acceptable fit.

Along with extensive constructed-data studies, the example serves to show that NOHARM yields reasonable parameter estimates in both exploratory and confirmatory multidimensional models. Perhaps the most important feature illustrated, which arises from its initial conception as a nonlinear common factor model, is the provision, either from a confirmatory or an exploratory simple structure, of a basis in the latent trait space, yielding interpretations by the familiar criteria of common factor theory.


TABLE 4. Parameter Estimates: Two-Dimensional Confirmatory Table 4a Item 1 2 3 4 5

References

Bollen, K.A. and Long, J.S. (Eds.) (1993). Testing Structural Equation Models. Newbury Park, CA: Sage.
Christoffersson, A. (1975). Factor analysis of dichotomized variables. Psychometrika 40, 5-22.
Etezadi-Amoli, J. and McDonald, R.P. (1983). A second generation nonlinear factor analysis. Psychometrika 48, 315-342.
Fraser, C. (1988). NOHARM: A Computer Program for Fitting Both Unidimensional and Multidimensional Normal Ogive Models of Latent Trait Theory. NSW: Univ. of New England.
Lord, F.M. (1950). A theory of test scores. Psychometric Monographs, No. 7.
McDonald, R.P. (1962a). A note on the derivation of the general latent class model. Psychometrika 27, 203-206.
McDonald, R.P. (1962b). A general approach to nonlinear factor analysis. Psychometrika 27, 397-415.
McDonald, R.P. (1967). Nonlinear factor analysis. Psychometric Monographs, No. 15.
McDonald, R.P. (1982). Linear versus nonlinear models in latent trait theory. Applied Psychological Measurement 6, 379-396.
McDonald, R.P. (1984). Confirmatory models for nonlinear structural analysis. In E. Diday et al. (Eds.), Data Analysis and Informatics, III. North Holland: Elsevier.
McDonald, R.P. (1985). Unidimensional and multidimensional models for item response theory. In D.J. Weiss (Ed.), Proceedings of the 1982 Item Response Theory and Computer Adaptive Testing Conference. Minneapolis: University of Minnesota.
Muthen, B.O. (1978). Contributions to factor analysis of dichotomous variables. Psychometrika 43, 551-560.
Tanaka, J.S. (1993). Multifaceted conceptions of fit in structural equation models. In K.A. Bollen and J.S. Long (Eds.), Testing Structural Equation Models. Newbury Park, CA: Sage.

16. A Linear Logistic Multidimensional Model for Dichotomous Item Response Data

Mark D. Reckase

Introduction

Since the implementation of large-scale achievement and ability testing in the early 1900s, cognitive test tasks, or test items, that are scored as either incorrect (0) or correct (1) have been quite commonly used (DuBois, 1970). Even though performance on these test tasks is frequently summarized by a single total score, often a sum of item scores, it is also widely acknowledged that multiple skills or abilities are required to determine the correct answers to these tasks. Snow (1993) states "The general conclusion seems to be that complex cognitive processing is involved in performance even on simple tasks. In addition to multiple processes, it is clear that performers differ in strategies." For example, even the simplest of mathematics story problems requires both reading and mathematical skills to determine the correct answer, and multiple strategies might be used as well. Performance on such test tasks can be said to be sensitive to differences in both reading and mathematics skills. However, the test task is probably insensitive to differences in many other skills, such as mechanical comprehension, in that high or low levels of those skills do not change the likelihood of determining the correct answer for the task.

Examinees bring a wide variety of cognitive skills to the testing situation, some of which are relevant to the task at hand and some of which are not. In addition, some test tasks are sensitive to certain types of skill differences, while others are not. The number of skill dimensions needed to model the item scores from a sample of individuals for a set of test tasks is dependent upon both the number of skill dimensions and level on those dimensions exhibited by the examinees, and the number of cognitive dimensions to which the test tasks are sensitive.


The mathematical formulation presented in this chapter is designed to model the results of the interactions of a sample of examinees with a set of test tasks as represented by the matrix of 0, 1 item scores for the examinees on the tasks. A linear function of skill dimensions and item characteristics describing the sensitivity of the items to skill dimensions is used as the basis for this model. The result of this linear combination is mapped to the probability metric using the logistic distribution function. While person skill dimensions are included in the model, it is a matter of construct validation as to whether the statistical dimensions derived from the response matrix directly relate to any psychological dimensions. The formulation presented was influenced by earlier work done using the normal ogive model to represent the interaction of the examinees and items (Lord and Novick, 1968, Chap. 16; Samejima, 1974). Lord and Novick (1968) developed the relationship between the unidimensional normal-ogive model and dimensions as defined by factor analysis, and Samejima (1974) derived a multidimensional normal-ogive model that used continuously scored items rather than those scored 0 or 1.

Presentation of the Model

Form of the Model

The data that are the focus of the model form the matrix of item scores, either 0 or 1, corresponding to either incorrect or correct answers to cognitive tasks. This matrix of data is usually oriented so that the rows (N) refer to persons and the columns (n) refer to test tasks, or items. Thus, a row by column entry refers to the score received by person j (j = 1, ..., N) on item i (i = 1, ..., n). Several assumptions are made about the mechanism that creates this data matrix:

1. With an increase in the value of the hypothetical constructs that are assessed, the probability of obtaining a correct response to a test item is nondecreasing. This is usually called the monotonicity assumption.
2. The function relating the probability of correct response to the underlying hypothetical constructs is "smooth" in the sense that derivatives of the function are defined. This assumption eliminates undesirable degenerate cases.
3. The probability of combinations of responses can be determined from the product of the probabilities of the individual responses when the probabilities are computed conditional on a point in the space defined by the hypothetical constructs. This is usually called the local independence assumption.

These assumptions are consistent with many different models relating examinee characteristics and the characteristics of test items. After reviewing many possible models that include vector parameters for both examinee and item characteristics [see McKinley and Reckase (1982) for a summary], the model given below was selected for further development because it was reasonable given what is known about item response data, consistent with simpler, unidimensional item response theory models, and estimable with commonly attainable numbers of examinees and test items.

The basic form of the model is a direct generalization of the three-parameter logistic model (Lord, 1980) to the case where examinees are described by a vector of parameters rather than a single scalar value. The model is given by $P(U_{ij} = 1 \mid a_i, d_i, c_i, \theta_j) = c_i + (1 - c_i)\,\frac{e^{a_i'\theta_j + d_i}}{1 + e^{a_i'\theta_j + d_i}}$, where $P(U_{ij} = 1 \mid a_i, d_i, c_i, \theta_j)$ is the probability of a correct response (score of 1) for person j on test item i; $U_{ij}$ represents the item response for person j on item i; $a_i$ is a vector of parameters related to the discriminating power of the test item (the rate of change of the probability of correct response to changes in trait levels for the examinees); $d_i$ is a parameter related to the difficulty of the test item; $c_i$ is the probability of correct response that is approached when the abilities assessed by the item are very low (approach $-\infty$) (usually called the lower asymptote, or less correctly, the guessing parameter); and $\theta_j$ is the vector of abilities for examinee j. The definitions of model parameters are necessarily brief at this point. They will be given more complete conceptual definitions later in this chapter.
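As a concrete illustration, the response probability can be evaluated as below. This is a minimal sketch assuming the model has the usual three-parameter logistic form with the linear exponent a'θ + d; the function name is hypothetical.

```python
import math
from typing import Sequence


def m3pl_probability(
    a: Sequence[float], d: float, c: float, theta: Sequence[float]
) -> float:
    """Probability of a correct response: c + (1 - c) * logistic(a'theta + d)."""
    if len(a) != len(theta):
        raise ValueError("a and theta must have the same number of dimensions")
    exponent = sum(a_k * theta_k for a_k, theta_k in zip(a, theta)) + d
    # 1 / (1 + exp(-z)) is algebraically the same as exp(z) / (1 + exp(z)).
    return c + (1.0 - c) / (1.0 + math.exp(-exponent))


if __name__ == "__main__":
    # A two-dimensional item with a = (0.8, 1.4), d = -2.0, c = 0.2,
    # evaluated for an examinee at the origin of the ability space.
    print(round(m3pl_probability([0.8, 1.4], -2.0, 0.2, [0.0, 0.0]), 4))
```

For very low abilities on every dimension the returned value approaches c, and for very high abilities it approaches one.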

Graphic Display of the Model

The equation for the model defines a surface that gives the probability of correct response for a test item as a function of the location of examinees in the ability space specified by the $\theta$-vector. The elements of this vector are statistical constructs that may or may not correspond to particular psychological traits or educational achievement domains. When there are only two statistical constructs, the form of the probability surface can be represented graphically. Figures 1 and 2 show the probability surface for


FIGURE 1. Item response surface for an item with parameters $a_1 = 0.8$, $a_2 = 1.4$, $d = -2.0$, and $c = 0.2$.

the same item ($a_1 = 0.8$, $a_2 = 1.4$, $d = -2.0$, $c = 0.2$) using two different methods of representation. Figure 1 uses a three-dimensional surface that emphasizes the monotonically increasing nature of the surface and the lower asymptote. Figure 2 shows the surface as a contour plot of the lines of equal probability of correct response. This representation emphasizes that the equiprobable lines are straight lines and that they are all parallel to each other. This feature of the model is a result of the linear form of the exponent of e in the model equation.
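The straight, parallel contours follow in one step from the linear exponent. The short derivation below is a sketch that assumes the three-parameter logistic form with exponent $a'\theta + d$ and a fixed probability level p between c and 1.

```latex
\begin{align}
p = c + (1 - c)\,\frac{e^{a'\theta + d}}{1 + e^{a'\theta + d}}
  \quad &\Longleftrightarrow \quad
  a'\theta + d = \ln\frac{p - c}{1 - p}, \qquad c < p < 1, \\
\text{so that, in two dimensions,} \quad
a_1\theta_1 + a_2\theta_2 &= \ln\frac{p - c}{1 - p} - d .
\end{align}
```

Each probability level therefore corresponds to a line with slope $-a_1/a_2$ in the $(\theta_1, \theta_2)$ plane. Only the intercept depends on p, so all contours are parallel; for the item in Figure 1 the common slope is $-0.8/1.4 \approx -0.57$.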

Interpretation of Model Parameters

The mathematical expression for the model contains parameters for both the examinees and the test items. These parameters can be interpreted as follows.

(1) Person parameters. The person parameters in the model are the elements of the vector $\theta_j$. The number of elements required to adequately model the data matrix is open to some debate. The research experience to date (Reckase and Hirsch, 1991) suggests that the number of dimensions is often underestimated and that overestimating the number of dimensions does little harm. Therefore, the order of the $\theta$-vector should be taken to be the maximum interpretable value rather than stressing the data reduction capabilities of the methodology. Of course, the number of dimensions used to model the item-examinee interaction will depend on the purpose of the


FIGURE 2. Contour plot for the item response surface given in Figure 1.

analysis. The $\theta$-dimensions are statistical constructs that are derived to provide adequate fit to the binary $N \times n$ data matrix. These dimensions may not have psychological or educational meaning. Whether they do or not is a matter of construct validation. Of course, the space can be rotated in a number of ways to align the $\theta$-axes with meaningful points in the space. These rotations may or may not retain the initial covariance structure of the $\theta$-dimensions. There is nothing in the model that requires that the dimensions be orthogonal. If the correlations among the $\theta$-dimensions are constrained to be 0.0, then the observed correlations among the item scores will be accounted for solely by the a-parameters. Alternatively, certain items or clusters of items can be constrained to define orthogonal dimensions. Then the observed correlations among the item scores will be reflected both in the a-parameters and in correlated $\theta$-dimensions.

(2) Item discrimination. The discrimination parameters for the model are given by the elements of the a-vector. These elements can be interpreted in much the same way as the a-parameters in unidimensional IRT models (Lord, 1980). The elements of the vector are related to the slope of the item response surface in the direction of the corresponding $\theta$-axis. The elements therefore indicate the sensitivity of the item to differences in ability along that $\theta$-axis. However, the discriminating power of an item differs depending on the direction that is being measured in the $\theta$-space. This can easily be seen from Fig. 1. If the direction of interest in the space is parallel to the surface, the slope will be zero, and the item is not discriminating. Unless an item is a pure measure of a particular dimension, it will be


more discriminating for combinations of dimensions than for single dimensions. The discriminating power of the item for the most discriminating combination of dimensions is given by $\mathrm{MDISC}_i = \left(\sum_{k=1}^{p} a_{ik}^2\right)^{1/2}$,

where $\mathrm{MDISC}_i$ is the discrimination of item i for the best combination of abilities; p is the number of dimensions in the $\theta$-space; and $a_{ik}$ is an element of the $a_i$ vector. For more detailed information about multidimensional discrimination, see Reckase and McKinley (1991).

(3) Item difficulty. The $d_i$ parameter in the model is related to the difficulty of the test item. However, the value of this parameter cannot be interpreted in the same way as the b-parameter of unidimensional IRT models because the model given here is in slope/intercept form. The usual way to represent the exponent of a unidimensional IRT model is $a(\theta - b)$, which is equivalent to $a\theta + (-ab)$. The term $-ab$ in the unidimensional model corresponds to $d_i$. A value that is equivalent in interpretation to the unidimensional b-parameter is given by $\mathrm{MDIFF}_i = -d_i/\mathrm{MDISC}_i$,

where the symbols are defined as above. The value of $\mathrm{MDIFF}_i$ indicates the distance from the origin of the $\theta$-space to the point of steepest slope in a direction from the origin. This meaning is analogous to that of the b-parameter in unidimensional IRT. The direction of greatest slope from the origin is given by $\alpha_{ik} = \arccos(a_{ik}/\mathrm{MDISC}_i)$,

where $\alpha_{ik}$ is the angle that the line from the origin of the space to the point of steepest slope makes with the kth axis for item i; and the other symbols have been defined previously. More information can be obtained about multidimensional difficulty from Reckase (1985).

(4) Lower asymptote. The $c_i$-parameter has the same meaning as for the three-parameter logistic model. The value of the parameter indicates the probability of correct response for examinees who are very low on all dimensions being modeled.
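The three descriptive quantities for an item can be computed from its a-vector and d-parameter. The sketch below assumes the definitions MDISC = square root of the sum of squared a-elements, MDIFF = -d / MDISC, and direction angles given by the arccosine of each a-element divided by MDISC; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def mdisc(a: Sequence[float]) -> float:
    """Multidimensional discrimination: Euclidean length of the a-vector."""
    return math.sqrt(sum(a_k * a_k for a_k in a))


def mdiff(a: Sequence[float], d: float) -> float:
    """Multidimensional difficulty: signed distance -d / MDISC from the origin."""
    return -d / mdisc(a)


def direction_angles(a: Sequence[float]) -> List[float]:
    """Angles (in degrees) between the direction of steepest slope and each axis."""
    length = mdisc(a)
    return [math.degrees(math.acos(a_k / length)) for a_k in a]


if __name__ == "__main__":
    a, d = [0.8, 1.4], -2.0   # the item shown in Figures 1 and 2
    print(round(mdisc(a), 3), round(mdiff(a, d), 3))
    print([round(angle, 1) for angle in direction_angles(a)])
```

For this item the direction of best measurement lies closer to the second axis than to the first, because the second a-element is the larger of the two.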

Derived Descriptive Statistics

Along with the parameters described above, the test characteristic curves and item and test information functions (Lord, 1980) have been generalized for use with this model. The test characteristic curve generalizes to a test characteristic surface. That surface is defined by $\zeta(\theta) = \frac{1}{n}\sum_{i=1}^{n} P_i(\theta)$,

where $\zeta(\theta)$ is the expected proportion correct score at the point defined by the $\theta$-vector; and $P_i(\theta)$ is shorthand notation for the probability of correct response to item i given by the model. Item information is given by $I_{i\alpha}(\theta) = \frac{[\nabla_\alpha P_i(\theta)]^2}{P_i(\theta)[1 - P_i(\theta)]}$,

where $I_{i\alpha}(\theta)$ is the information provided by item i in direction $\alpha$ in the space and $\nabla_\alpha$ is the operator defining the directional derivative in direction $\alpha$. The test information surface is given by the sum of the item information surfaces computed assuming the same direction. For more information about multidimensional information, see Reckase and McKinley (1991).
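Both derived statistics can be sketched for the logistic model as follows. The code assumes that the characteristic surface is the average of the item probabilities, that information is the squared directional derivative divided by P(1 - P), and that the direction is supplied as angles with the coordinate axes whose cosines have squares summing to one. All function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Item = Tuple[Sequence[float], float, float]   # (a-vector, d, c)


def logistic_part(a: Sequence[float], d: float, theta: Sequence[float]) -> float:
    """logistic(a'theta + d), the part of the model above the lower asymptote."""
    z = sum(a_k * t_k for a_k, t_k in zip(a, theta)) + d
    return 1.0 / (1.0 + math.exp(-z))


def probability(item: Item, theta: Sequence[float]) -> float:
    a, d, c = item
    return c + (1.0 - c) * logistic_part(a, d, theta)


def characteristic_surface(items: Sequence[Item], theta: Sequence[float]) -> float:
    """Expected proportion-correct score at the point theta."""
    return sum(probability(item, theta) for item in items) / len(items)


def directional_derivative(
    item: Item, theta: Sequence[float], angles_deg: Sequence[float]
) -> float:
    """Slope of the item surface at theta in the direction given by the angles."""
    a, d, c = item
    lp = logistic_part(a, d, theta)
    cosines = [math.cos(math.radians(angle)) for angle in angles_deg]
    return (1.0 - c) * lp * (1.0 - lp) * sum(
        a_k * cos_k for a_k, cos_k in zip(a, cosines)
    )


def item_information(
    item: Item, theta: Sequence[float], angles_deg: Sequence[float]
) -> float:
    """Squared directional derivative divided by P(1 - P)."""
    p = probability(item, theta)
    return directional_derivative(item, theta, angles_deg) ** 2 / (p * (1.0 - p))
```

Summing `item_information` over items for a common direction gives the test information at that point. A direction perpendicular to an item's a-vector yields zero information, which matches the remark that an item does not discriminate along its own equiprobable contours.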

Parameter Estimation

Parameters for the model were originally estimated using joint maximum likelihood procedures based on the algorithms operationalized in LOGIST (Wingersky et al., 1982). The goal of the estimation procedure was to find the set of item- and person-parameters that would maximize the likelihood of the observed item responses. The basic form of the likelihood equation is given by $L = \prod_{j=1}^{N}\prod_{i=1}^{n} P(u_{ij} \mid a_i, d_i, c_i, \theta_j)$,

where $u_{ij}$ is the response to item i by person j, either a 0 or a 1. For mathematical convenience, the computer programs minimize the negative logarithm of L, $F = -\ln(L)$, rather than maximize L. Since the function F cannot be minimized directly, an iterative Newton-Raphson procedure is used, first fixing the item parameters and estimating the person parameters, and then fixing the person parameters and estimating the item parameters. These procedures were implemented in both the MAXLOG (McKinley and Reckase, 1983) and MIRTE (Carlson, 1987) computer programs. McKinley (1987) has also developed a procedure based on marginal maximum likelihood (MULTIDIM).

Although the above computer programs were used for a number of applications of the model, other programs have been found to be more efficient and to yield more stable parameter estimates (Ackerman, 1988; Miller, 1991). Most of our current work is done using NOHARM (Fraser, 1986; McDonald, this volume), rather than the original estimation algorithms.
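The function F that these programs minimize can be written down compactly. The sketch below only evaluates F for given item and person parameters; it does not implement the alternating Newton-Raphson steps, and it is not the code of MAXLOG, MIRTE, or any other program named above. It assumes the three-parameter logistic form of the model, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Item = Tuple[Sequence[float], float, float]   # (a-vector, d, c)


def m3pl_probability(item: Item, theta: Sequence[float]) -> float:
    """c + (1 - c) * logistic(a'theta + d)."""
    a, d, c = item
    z = sum(a_k * t_k for a_k, t_k in zip(a, theta)) + d
    return c + (1.0 - c) / (1.0 + math.exp(-z))


def negative_log_likelihood(
    responses: Sequence[Sequence[int]],
    items: Sequence[Item],
    thetas: Sequence[Sequence[float]],
) -> float:
    """F = -ln(L), with one row of 0/1 responses and one theta vector per person."""
    total = 0.0
    for person_responses, theta in zip(responses, thetas):
        for u, item in zip(person_responses, items):
            p = m3pl_probability(item, theta)
            # A correct response contributes ln P, an incorrect one ln(1 - P).
            total -= math.log(p) if u == 1 else math.log(1.0 - p)
    return total
```

In a joint estimation scheme of the kind described, F would be reduced alternately over the theta vectors with the items held fixed and over the item parameters with the theta vectors held fixed.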


Goodness of Fit

The goal of the model is to accurately explain the interaction between persons and items. To the extent that this goal is attained, the model will be found useful for a particular application. Since all IRT models are simplifications of the complex interactions of persons and test tasks, any of the models, including the one presented here, will be rejected if the sample size is large enough. The question is not whether the model fits the data, but rather whether the model fits well enough to support the application. Detailed analysis of the skills assessed by a set of items may require more dimensions than an analysis that is designed to show that most of the variance in a set of responses can be accounted for by a single dimension.

Because the importance of goodness of fit varies with the application, and because goodness-of-fit tests tend to be based on total group performance rather than focused on the item/person interaction, significance tests as such are not recommended. However, it is important to determine whether the model fits the data well enough to support the application. The approach recommended here is to review carefully the inter-item residual covariance matrix to determine whether there is evidence that the use of the model is suspect for the particular application. The entries in this $n \times n$ matrix are computed using the following equation: $\mathrm{cov}_{ik}$

where i and k represent items on the test. Large residuals may indicate failure of the estimation procedure to converge, too few dimensions in the model or an inappropriate model. The evaluation of such residuals will require informed judgment and experience. No significance testing procedure will simplify the process of evaluating the fit of the model to a set of data. In addition to the analysis of residuals, it is recommended that, whenever possible, the results of an analysis should be replicated either on parallel forms of a test, or on equivalent samples of examinees. Experience with multidimensional IRT analyses has shown that many small effects are replicable over many test forms and may, therefore, be considered real (Ackerman, 1990). Others, however, do not replicate. The only way to discover the full nuances of the relationships in the interactions of a test and examinee population is to screen out idiosyncratic effects through careful replication.
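One way to obtain such a residual matrix is sketched below. It is an assumption, not the chapter's stated formula: each residual is taken as the observed 0/1 score minus the model probability at the person's estimated ability, and the covariance entry for a pair of items is the average over persons of the product of their residuals. The function name and the input layout are hypothetical.

```python
from typing import List, Sequence


def residual_covariances(
    responses: Sequence[Sequence[int]],
    predicted: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Inter-item residual covariance matrix (n x n).

    `responses[j][i]` is the 0/1 score of person j on item i;
    `predicted[j][i]` is the fitted probability for the same person and item.
    """
    n_persons = len(responses)
    n_items = len(responses[0])
    residuals = [
        [responses[j][i] - predicted[j][i] for i in range(n_items)]
        for j in range(n_persons)
    ]
    return [
        [
            sum(residuals[j][i] * residuals[j][k] for j in range(n_persons)) / n_persons
            for k in range(n_items)
        ]
        for i in range(n_items)
    ]
```

Off-diagonal entries that are large relative to the rest of the matrix point to item pairs whose association the fitted dimensions do not explain.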

Example

Most if not all tests of academic achievement require multiple skills to demonstrate proficiency. The items used on these tests will likely be differentially sensitive to these multiple skills. For example, some mathematics items may require the manipulation of algebraic symbols while others may require spatial skills. Often the skills required by an item may be related to the difficulty of the item. Reckase (1989) showed that the combination of multiple skills and variations in the difficulty of the items measuring those skills can result in different meanings for different points on a score scale. To guarantee common meaning at all points on the score scale used for reporting, the relationship between the skills being measured and the difficulty of the test items must be maintained for all test forms. Multidimensional IRT analysis gives a means for identifying the dimensional structure of a test for a specified population of examinees and for comparing that structure across test forms.

The example presented here uses data from a test designed to assess the mathematics achievement of 10th grade students in the United States. The test consists of 40 multiple-choice items covering the areas of prealgebra, elementary algebra, coordinate geometry, and plane geometry. Test forms were produced to match a content and cognitive level specifications matrix, and specifications for difficulty and discrimination. However, there are no specifications that call for items in a specific content area to be at a specific level of difficulty. The data analyzed here consisted of item scores from 1635 students who were selected to be representative of the usual examinee population.

The purposes for presenting this example are to demonstrate a typical analysis of test structure using multidimensional item response theory, to show the effects of confounding the difficulty of test items with the content that they assess, and to show the effect of using too few dimensions in analyzing the data matrix. The data were first analyzed using the BILOG program (Mislevy and Bock, 1989) for the purpose of estimating the c-parameters.
Those estimates were input to NOHARM, which does not estimate c, and both two-dimensional and six-dimensional solutions were obtained. The two-dimensional solution was obtained so that the results could be plotted and so the effect of underestimation of the dimensionality could be determined. The six-dimensional solution is the highest-dimensional solution that is supported by NOHARM. The item parameter estimates and the estimates of MDISC, MDIFF, and the angles with the coordinate axes for the two-dimensional solution are presented in Table 1. The orientation of the solution in the $\theta$-space was set by aligning Item 1 with the $\theta_1$-axis; therefore, its $a_2$-value is 0.0. The information in Table 1 is also presented graphically in Fig. 3. In that figure, each item is represented by a vector. The initiating point of the vector is MDIFF units from the origin of the space, the length of the vector is equal to MDISC, and the direction of the vector is specified by the angles of the item with the $\theta$-axes. This graphic representation has been found to be helpful for visualizing the structure of the test. A pattern in the vectors that has been found quite commonly in the analysis of achievement tests is that the easy items on the test (those at the lower left) tend to measure one dimension, in this case $\theta_1$, while some

16. A Linear Logistic Multidimensional Model

Mark D. Reckase

TABLE 1. Model and Item Parameter Estimates: Two Dimensions.

        Item Parameter Estimates     Derived Item Statistics
Item       d      a1      a2      MDISC   MDIFF      α1      α2
   1    2.49    2.37    0.00       2.37   -1.05    0.00   90.00
   2    1.00    0.58    0.38       0.69   -1.44   33.44   56.56
   3    0.66    0.63    0.27       0.69   -0.96   22.75   67.25
   4    0.76    0.99    0.47       1.10   -0.70   25.44   64.57
   5    0.29    0.60    0.18       0.62   -0.46   17.04   72.96
   6   -0.01    0.78    0.64       1.00    0.01   39.86   50.14
   7    0.57    0.75    0.41       0.86   -0.66   28.51   61.49
   8    1.24    1.64    0.15       1.65   -0.76    5.12   84.88
   9    0.61    1.04    0.52       1.16   -0.53   26.39   63.61
  10    0.23    1.26    0.51       1.36   -0.17   22.26   67.74
  11    1.11    1.17    0.20       1.19   -0.94    9.60   80.40
  12    0.92    1.26    0.39       1.32   -0.70   17.17   72.83
  13    0.24    1.71    0.48       1.78   -0.13   15.58   74.42
  14   -0.68    0.69    0.91       1.15    0.59   52.87   37.13
  15   -0.44    0.57    0.72       0.92    0.47   51.62   38.28
  16   -0.28    0.33    0.43       0.54    0.51   52.22   37.78
  17    0.59    2.10    0.69       2.21   -0.27   18.31   71.69
  18   -0.99    1.19    1.16       1.66    0.60   44.15   45.86
  19   -0.21    0.63    0.40       0.75    0.28   32.30   57.80
  20   -0.69    1.11    1.31       1.72    0.40   49.57   40.44
  21   -0.04    1.02    1.18       1.56    0.02   49.06   40.94
  22   -0.49    0.96    1.26       1.58    0.31   52.81   37.19
  23   -0.60    0.60    0.87       1.06    0.56   55.66   34.34
  24   -0.15    1.01    0.47       1.11    0.13   52.81   37.19
  25   -1.47    0.83    0.79       1.15    1.28   43.52   46.48
  26   -1.08    0.81    0.77       1.12    0.96   43.56   46.44
  27   -0.97    0.87    0.89       1.24    0.78   45.56   44.45
  28   -0.07    1.71    1.76       2.45    0.03   45.79   44.21
  29   -0.84    1.12    1.16       1.61    0.52   46.13   43.87
  30   -1.20    0.93    1.38       1.66    0.72   56.02   33.98
  31   -0.91    0.79    1.36       1.57    0.57   59.67   30.33
  32   -1.35    1.87    1.52       2.41    0.56   39.21   50.79
  33   -0.62    0.60    0.48       0.77    0.80   38.39   51.61
  34   -0.96    0.44    0.41       0.60    1.60   42.50   47.50
  35   -1.89    1.15    2.15       2.44    0.78   61.77   28.23
  36   -0.87    0.63    0.78       1.00    0.87   51.26   38.74
  37   -1.31    0.56    0.97       1.12    1.17   59.81   30.19
  38   -1.53    0.31    0.99       1.03    1.48   72.79   17.21
  39   -3.77    0.57    2.27       2.34    1.61   75.88   14.12
  40   -0.82    0.56    0.46       0.73    1.13   39.08   50.92

FIGURE 3. Item vectors and primary directions for content clusters from a two-dimensional analysis of a 40 item, 10th grade mathematics test.

of the more difficult items (those to the upper right) tend to measure a different dimension, in this case $\theta_2$. To get a better sense of the substantive meaning of this solution, the angles between the item vectors were cluster analyzed to determine which sets of items tended to be discriminating in the same direction in the $\theta$-space. The main direction in the space for each of the clusters is indicated by the lines labeled C1 to C5 in Fig. 3. The items closest to the line marked C5 are the geometry items on the test. Those closest to C1 are the algebra items. Since the geometry items are the most difficult items on the test, those examinees who get the top score on the test must function well on the geometry items as well as the rest of the content. Differences in the number-correct scores in the range from 38 to 40 are mainly differences in geometry skill, while scores at the lower end of the score scale are mainly differentiated on algebra skill. Clusters 3 and 4 are less well defined in this solution. The values of MDISC, MDIFF, and the angles with the $\theta$-axes for the six-dimensional NOHARM solution are given in Table 2. Note that the orientation of the solution in the space is fixed by aligning each of the first five items with an axis. The cluster analysis solution based on the angles between items in the six-dimensional space is given in Fig. 4. Seven clusters are indicated in this figure. Next to the item number is a code that tells the content area of the item: PA - pre-algebra; EA - elementary algebra; PG - plane geometry; and CG - coordinate geometry. The six-dimensional solution given in Fig. 4 represents a fairly fine-grained structural analysis of the test as responded to by the sample of tenth grade students. Cluster 5, for example, contains items that require computation to


TABLE 2. Item Parameter Estimates: Six Dimensions.

Item   MDISC   MDIFF      α1
   1    3.86   -1.00    0.00
   2    1.44   -1.00   53.01
   3    0.88   -0.82   42.61
   4    1.40   -0.63   38.83
   5    0.65   -0.45   39.85
   6    1.12    0.01   46.47
   7    0.98   -0.62   35.59
   8    1.93   -0.73    7.93
   9    1.22   -0.51   33.78
  10    1.36   -0.17   31.28
  11    1.21   -0.93   17.52
  12    1.40   -0.69   23.74
  13    2.28   -0.13   36.39
  14    1.84    0.51   56.43
  15    1.03    0.45   59.55
  16    1.50    0.29   69.90
  17    2.70   -0.26   35.37
  18    2.23    0.56   54.87
  19    0.99    0.24   39.66
  20    1.90    0.39   56.34
  21    2.35    0.02   50.76
  22    2.13    0.29   55.80
  23    1.51    0.49   59.71
  24    1.37    0.12   49.77
  25    1.68    1.12   60.01
  26    1.26    0.91   51.09
  27    1.65    0.71   55.99
  28    3.60    0.03   46.93
  29    1.95    0.50   56.51
  30    2.07    0.69   66.29
  31    1.63    0.57   61.94
  32    3.22    0.54   52.45
  33    1.13    0.65   53.08
  34    0.87    1.26   72.19
  35    2.64    0.77   67.44
  36    1.19    0.81   53.07
  37    1.34    1.09   66.47
  38    1.46    1.29   85.34
  39    2.09    1.64   78.77
  40    1.19    0.87   57.03

90.00 36.99 90.00 68.48 55.32 74.47 60.96 79.93 64.12 70.67 61.84 71.76 65.40 85.31 86.95 73.13 67.21 80.16 74.53 85.01 79.49 83.51 80.69 84.20 79.51 72.16 63.91 50.62 73.00 34.42 99.97 80.48 80.02 69.03 68.79 89.48 66.48 65.47 61.25 68.17 78.23 61.01 84.03 59.01 88.40 80.09 66.46 79.62 75.84 76.19 74.40 80.84 79.33 66.52 79.99 65.87 61.28 65.66 55.77 64.46 79.85 76.21 75.53 100.11 78.86 76.77 57.91 73.89 67.63 73.30 84.71 84.55 62.15 81.71 62.10 81.50 69.18 105.06 85.55

90.00 90.00 90.00 72.97 81.63 85.55 89.47 85.96 82.20 75.67 87.34 72.21 77.15 94.93 76.19 101.34 76.66 86.92 91.39 73.63 60.33 58.61 89.28 68.45 82.15 83.08 80.17 62.98 65.29 72.99 63.64 69.96 51.29 55.38 64.09 57.51 71.93 69.90 64.92 58.85

65.78 78.05 85.37 86.14 79.93 72.20 77.79 81.07 59.51 82.40 71.54 72.69 61.72 59.06 85.07 73.23

FIGURE 4. Cluster analysis of item directions.


solve geometry problems. Items 33, 38, 39, and 40 all deal with coordinates in a two-dimensional space. Item 35 has to do with coordinates on a line. Only Item 37 is slightly different, having to do with proportional triangles, but it does require computation. Other geometry clusters, Items 22, 36, 28, and 21, for example, deal with triangles and parallel lines. The important result from these analyses is not, however, the level to which the content structure of the test is recovered, although the recovery was very good and did provide insights into item functioning. The important result is that the two-dimensional solution is a projection of the higher-dimensional solution and information is lost when too few dimensions are used. Furthermore, the parameter estimates are sufficiently accurate with 1635 cases to support detailed analyses in six dimensions. For more details about the analysis of this particular set of data, see Miller and Hirsch (1990, 1992).
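The derived statistics reported in Tables 1 and 2 can be recomputed from the item parameters. The following is a minimal sketch, assuming the usual definitions MDISC = sqrt(sum of squared a-values), MDIFF = -d / MDISC, and angle with axis k = arccos(a_k / MDISC); these definitions reproduce the tabled values for the items checked below up to rounding. The function names `item_statistics` and `angle_between` are hypothetical, and `angle_between` only illustrates the kind of pairwise angle that a cluster analysis of item directions would take as input.

```python
import math


def item_statistics(a, d):
    """Return (MDISC, MDIFF, angles in degrees) for one item.

    a -- list of discrimination parameters, one per dimension
    d -- intercept parameter
    """
    mdisc = math.sqrt(sum(x * x for x in a))
    mdiff = -d / mdisc
    # Clamp the cosine to [-1, 1] to guard against floating-point overshoot.
    angles = [
        math.degrees(math.acos(max(-1.0, min(1.0, x / mdisc)))) for x in a
    ]
    return mdisc, mdiff, angles


def angle_between(a_first, a_second):
    """Angle in degrees between the direction vectors of two items."""
    dot = sum(x * y for x, y in zip(a_first, a_second))
    norm_first = math.sqrt(sum(x * x for x in a_first))
    norm_second = math.sqrt(sum(y * y for y in a_second))
    cosine = max(-1.0, min(1.0, dot / (norm_first * norm_second)))
    return math.degrees(math.acos(cosine))


# Items 1 and 2 of the two-dimensional solution (Table 1).
item1 = item_statistics([2.37, 0.00], 2.49)
item2 = item_statistics([0.58, 0.38], 1.00)
separation = angle_between([2.37, 0.00], [0.58, 0.38])
```

Because the tabled a- and d-values are rounded to two decimals, recomputed angles can differ from the tabled ones by a few tenths of a degree.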

Discussion

There are many similarities between the model that has been presented and the work of McDonald (1985) and other factor analytic procedures that operate on unstandardized matrices of data (e.g., Horst, 1965). The unique contribution of the formulation presented in this chapter, however, is that it focuses on the characteristics of the test items and the way that they interact with the examinee population. Item characteristics are considered to be worthy of detailed scrutiny, both for the information that scrutiny can give to understanding the functioning of a test, and for the information provided about the meaning of scores reported on the test. Other methods that have a similar mathematical basis tend to focus more on the use of the mathematics to achieve parsimonious descriptions of the data matrix or to define hypothetical psychological variables. Thus the model and methods described in this chapter are unique more in purpose and philosophy than in computational procedure or form. This model has proven useful for a variety of applications and has helped in the conceptualizing of a number of psychometric problems, including the assessment of differential item functioning and test parallelism (Ackerman, 1990, 1992).

References

Ackerman, T.A. (1988). Comparison of multidimensional IRT estimation procedures using benchmark data. Paper presented at the ONR Contractors' Meeting, Iowa City, IA.
Ackerman, T.A. (1990). An evaluation of the multidimensional parallelism of the EAAP Mathematics Test. Paper presented at the Meeting of the American Educational Research Association, Boston, MA.
Ackerman, T.A. (1992). A didactic explanation of item bias, item impact, and item validity from a multidimensional perspective. Journal of Educational Measurement 29(1), 67-91.
Carlson, J.E. (1987). Multidimensional item response theory estimation: A computer program (Research Report ONR 87-2). Iowa City, IA: American College Testing.
DuBois, P.H. (1970). A History of Psychological Testing. Boston: Allyn and Bacon.
Fraser, C. (1986). NOHARM: An IBM PC Computer Program for Fitting Both Unidimensional and Multidimensional Normal Ogive Models of Latent Trait Theory. Armidale, Australia: The University of New England.
Horst, P. (1965). Factor Analysis of Data Matrices. New York: Holt, Rinehart and Winston.
Lord, F.M. (1980). Application of Item Response Theory to Practical Testing Problems. Hillsdale, NJ: Lawrence Erlbaum Associates.
Lord, F.M. and Novick, M.R. (1968). Statistical Theories of Mental Test Scores. Reading, MA: Addison Wesley.
McDonald, R.P. (1985). Unidimensional and multidimensional models for item response theory. In D.J. Weiss (Ed), Proceedings of the 1982 Item Response Theory and Computerized Adaptive Testing Conference (pp. 127-148). Minneapolis, MN: University of Minnesota.
McKinley, R.L. (1987). User's Guide to MULTIDIM. Princeton, NJ: Educational Testing Service.
McKinley, R.L. and Reckase, M.D. (1982). The Use of the General Rasch Model with Multidimensional Item Response Data. Iowa City, IA: American College Testing.
McKinley, R.L. and Reckase, M.D. (1983). MAXLOG: A computer program for the estimation of the parameters of a multidimensional logistic model. Behavior Research Methods and Instrumentation 15, 389-390.
Miller, T.R. (1991). Empirical Estimation of Standard Errors of Compensatory MIRT Model Parameters Obtained from the NOHARM Estimation Program (Research Report ONR91-2). Iowa City, IA: American College Testing.
Miller, T.R. and Hirsch, T.M. (1990). Cluster Analysis of Angular Data in Applications of Multidimensional Item Response Theory. Paper presented at the Meeting of the Psychometric Society, Princeton, NJ.
Miller, T.R. and Hirsch, T.M. (1992). Cluster analysis of angular data in applications of multidimensional item-response theory. Applied Measurement in Education 5(3), 193-211.
Mislevy, R.J. and Bock, R.D. (1989). BILOG: Item Analysis and Test Scoring with Binary Logistic Models. Chicago: Scientific Software.
Reckase, M.D. (1985). The difficulty of test items that measure more than one ability. Applied Psychological Measurement 9(4), 401-412.
Reckase, M.D. (1989). Controlling the Psychometric Snake: Or, How I Learned to Love Multidimensionality. Invited address at the Meeting of the American Psychological Association, New Orleans.
Reckase, M.D. and Hirsch, T.M. (1991). Interpretation of Number Correct Scores When the True Number of Dimensions Assessed by a Test is Greater Than Two. Paper presented at the Meeting of the National Council on Measurement in Education, Chicago.
Reckase, M.D. and McKinley, R.L. (1991). The discriminating power of items that measure more than one dimension. Applied Psychological Measurement 14(4), 361-373.
Samejima, F. (1974). Normal ogive model for the continuous response level in the multidimensional latent space. Psychometrika 39, 111-121.
Snow, R.E. (1993). Construct validity and constructed-response tests. In R.E. Bennett and W.C. Ward (Eds), Construction Versus Choice in Cognitive Measurement: Issues in Constructed Response, Performance Testing, and Portfolio Assessment (pp. 45-60). Hillsdale, NJ: Lawrence Erlbaum Associates.
Wingersky, M.S., Barton, M.A., and Lord, F.M. (1982). LOGIST User's Guide. Princeton, NJ: Educational Testing Service.

17 Loglinear Multidimensional Item Response Models for Polytomously Scored Items

Henk Kelderman

Introduction

Over the last decade, there has been increasing interest in analyzing mental test data with loglinear models. Several authors have shown that the Rasch model for dichotomously scored items can be formulated as a loglinear model (Cressie and Holland, 1983; de Leeuw and Verhelst, 1986; Kelderman, 1984; Thissen and Mooney, 1989; Tjur, 1982). Because there are good procedures for estimating and testing Rasch models, this result was initially only of theoretical interest. However, the flexibility of loglinear models facilitates the specification of many other types of item response models. In fact, they give the test analyst the opportunity to specify a unique model tailored to a specific test. Kelderman (1989) formulated loglinear Rasch models for the analysis of item bias and presented a gamut of statistical tests sensitive to different types of item bias. Duncan and Stenbeck (1987) formulated a loglinear model specifying a multidimensional model for Likert-type items. Agresti (1993) and Kelderman and Rijkes (1994) formulated a loglinear model specifying a general multidimensional response model for polytomously scored items.

At first, the application of loglinear models to the analysis of test data was limited by the sheer size of the cross table of the item responses. Computer programs such as GLIM (Baker and Nelder, 1978), ECTA (Goodman and Fay, 1974), and SPSS LOGLINEAR (SPSS, 1988) require the internal storage of this table. However, numerical procedures have emerged that compute maximum likelihood estimates of loglinear models from their minimal sufficient statistics (Kelderman, 1992). These procedures require much less computer storage. The computer program LOGIMO (Kelderman and Steen, 1993), implementing these methods, has been specially designed to analyze test data.

It is the purpose of this chapter to introduce a general multidimensional polytomous item response model. The model is developed, starting from a saturated loglinear model for the item responses of a fixed individual. It is shown that, making the assumption of conditional (or local) independence and imposing certain homogeneity restrictions on the relation between item response and person, a general class of item response models emerges. Item response theory (IRT) models with fixed person parameters do not yield consistent parameter estimates. Several authors have proposed loglinear IRT models that consider the person as randomly drawn from some population. These models are estimable with standard methods for loglinear or loglinear-latent-class analysis. A review of these ideas, as well as an example of their application to polytomous test data, is given below. In the example, several multidimensional IRT models are estimated and tested, and the best model is chosen.

Presentation of the Models

Loglinear models describe the probability of the joint realization of a set of random categorical variables, given the values of a set of fixed categorical variables. The models are called loglinear because the natural logarithms of these (conditional) probabilities are described by a linear model. Some useful textbooks introducing these models are Fienberg (1980), Agresti (1984), and Bishop et al. (1975). For more details on loglinear models the reader is referred to these sources. To introduce the ability parameter, loglinear models for a fixed person are considered first. Fixed-person models are the models of interest but cannot be estimated directly. Therefore, random-person models are introduced; these models are slightly more general than fixed-person models but they have the advantage of being estimable with standard loglinear theory.

Fixed-Person Loglinear IRT Models

Let $\mathbf{Y} = (Y_1, \ldots, Y_i, \ldots, Y_n)$ be the random variable for the responses on n items and let $\mathbf{y} = (y_1, \ldots, y_i, \ldots, y_n)$ be a typical realization. The responses can be polytomous, where $y_i = 0$ usually denotes an incorrect response and $y_i = 1, \ldots, m_i$ denote correct or partially correct responses. Furthermore, let $j$ $(= 1, \ldots, N)$ denote the person. The probability of a person's response pattern can be described by a loglinear model. The model has parameter $\lambda_{\mathbf{y}}^{\bar{Y}}$ for the main effect of response pattern $\mathbf{y}$, and an interaction parameter $\lambda_{\mathbf{y}j}^{\bar{Y}J}$ describing the effect of the combination of response pattern $\mathbf{y}$ and person $j$. Superscripts are symbolic notation showing the variables whose effect is to be described. Subscripts denote the values that these variables take. The bar over the superscript means that the variable is considered as a random variable in the model. If no ambiguity arises, superscripts will be omitted. A general loglinear model for the probability of a random response pattern is

$$\log P_j(\mathbf{Y} = \mathbf{y}) = \mu_j + \lambda_{\mathbf{y}}^{\bar{Y}} + \lambda_{\mathbf{y}j}^{\bar{Y}J}, \qquad (1)$$

where the parameter $\mu_j$ is a proportionality constant that ensures that the sum of the conditional probabilities over all response patterns is equal to one. Parameters are random parameters if they involve random variables, and fixed parameters otherwise. Equation (1) has indeterminacies in its parameterization. For example, adding a constant $c$ to $\lambda_{\mathbf{y}}^{\bar{Y}}$ and subtracting it from $\mu_j$ does not change the model. To obtain a unique parameterization one linear constraint must be imposed on the parameters. The simplest type of constraint contrasts the response patterns $\mathbf{y}$ with the zero pattern $\mathbf{y} = \mathbf{0}$. This simple contrast (Bock, 1975, pp. 50, 239) corresponds to the linear constraint $\lambda_{\mathbf{0}}^{\bar{Y}} = 0$. In general, the parameters are constrained such that they are zero if one or more of their subscripts takes its lowest value. In the model shown in Eq. (1) these constraints are $\lambda_{\mathbf{0}}^{\bar{Y}} = 0$ and $\lambda_{\mathbf{0}j}^{\bar{Y}J} = \lambda_{\mathbf{y}1}^{\bar{Y}J} = 0$, where $\mathbf{0} = (0, \ldots, 0)$ is a vector of zeros. An alternative way to remove the indeterminacies is to constrain the sum of the parameters over one or more subscripts to zero. For Eq. (1) these so-called deviation contrasts (Bock, 1975, pp. 50, 239) are

$$\sum_{\mathbf{y}} \lambda_{\mathbf{y}}^{\bar{Y}} = 0 \quad \text{and} \quad \sum_{\mathbf{y}} \lambda_{\mathbf{y}j}^{\bar{Y}J} = \sum_{j} \lambda_{\mathbf{y}j}^{\bar{Y}J} = 0.$$
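The indeterminacy and the simple contrast can be checked numerically. The sketch below assumes the saturated form log P_j(Y = y) = mu_j + lambda_y + lambda_yj for a single fixed person with two dichotomous items; the parameter values and the helper name `log_probabilities` are made up for illustration only.

```python
import math


def log_probabilities(lam, lam_person):
    """Return (mu_j, {pattern: log P_j(Y = pattern)}) for one fixed person.

    lam        -- main-effect parameter for each response pattern
    lam_person -- pattern-by-person interaction parameter for this person
    mu_j is the constant that makes the probabilities sum to one.
    """
    mu = -math.log(sum(math.exp(lam[y] + lam_person[y]) for y in lam))
    return mu, {y: mu + lam[y] + lam_person[y] for y in lam}


# Hypothetical parameter values for the four patterns of two dichotomous items.
lam = {(0, 0): 0.3, (0, 1): -0.2, (1, 0): 0.5, (1, 1): 1.1}
lam_person = {(0, 0): 0.0, (0, 1): 0.4, (1, 0): 0.4, (1, 1): 0.8}
mu, logp = log_probabilities(lam, lam_person)

# Adding a constant c to every main effect is absorbed by mu_j.
c = 2.0
shifted = {y: value + c for y, value in lam.items()}
mu_shifted, logp_shifted = log_probabilities(shifted, lam_person)

# Simple contrast: subtract the zero-pattern value so that it becomes 0.
zero = (0, 0)
contrast = {y: value - lam[zero] for y, value in lam.items()}
mu_contrast, logp_contrast = log_probabilities(contrast, lam_person)
```

All three parameterizations give the same probabilities; only the split between the normalizing constant and the main effects changes, which is why one linear constraint is needed to pin the parameters down.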

In Eq. (1), all characteristics of the responses common to all persons are described by the term $\lambda_{\mathbf{y}}^{\bar{Y}}$, and all characteristics of the responses that are unique to the person are described by the term $\lambda_{\mathbf{y}j}^{\bar{Y}J}$. To obtain a model with an ability parameter, two additional restrictions must be imposed on the random parameters in the model shown in Eq. (1). The first restriction on the parameters of the loglinear model is the conditional independence restriction (Lord and Novick, 1968). This restriction means that there are no interactions between the person's item responses. The parameters for the joint response pattern $\mathbf{y}$ can then be written as a sum of parameters for each single $y_i$:

$$\lambda_{\mathbf{y}}^{\bar{Y}} = \lambda_{y_1}^{\bar{Y}_1} + \lambda_{y_2}^{\bar{Y}_2} + \lambda_{y_3}^{\bar{Y}_3} + \cdots \qquad (2)$$

and

$$\lambda_{\mathbf{y}j}^{\bar{Y}J} = \lambda_{y_1 j}^{\bar{Y}_1 J} + \lambda_{y_2 j}^{\bar{Y}_2 J} + \lambda_{y_3 j}^{\bar{Y}_3 J} + \cdots. \qquad (3)$$

Conditional independence restrictions imply (Goodman, 1970) that Eq. (1) can be written as a product of marginal probabilities,

$$P_j(\mathbf{Y} = \mathbf{y}) = \prod_{i=1}^{n} P_j(Y_i = y_i). \qquad (4)$$


TABLE 1. Some Multidimensional Polytomous Item Response Models.

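To obtain an IRT model, the interaction between person and response is constrained to be a linear combination of $k$ person ability parameters $\theta_{jq}$ $(q = 1, \ldots, k)$:

$$\lambda_{yj}^{\bar{Y}_i J} = \sum_{q=1}^{k} w_{iy}^{q}\, \theta_{jq}, \qquad (6)$$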

where $w_{iy}^{q}$ are fixed integer-valued weights of response category $y$ of item $i$ with respect to ability $q$. To ensure that the constraint $\lambda_{0j}^{\bar{Y}_i J} = 0$ is satisfied, the convention is adopted that $w_{i0}^{q} = 0$ for all $i = 1, \ldots, n$ and $q = 1, \ldots, k$. In cognitive terms, the score $y = 0$ usually corresponds to a completely incorrect response. The category weights may be further specified to define a particular IRT model. Before considering some meaningful specifications, the formulation of the model is completed first. In applications of IRT, it is important that the difficulty of the item response is measured on the same scale as the person ability (Fischer, 1987). To accomplish this, the item main-effect parameters are written as a linear combination of more basic response difficulty parameters:

$$\lambda_{y}^{\bar{Y}_i} = -\sum_{q=1}^{k} w_{iy}^{q}\, \beta_{iqy}, \qquad (7)$$

where $\beta_{iqy}$ is the difficulty of the $y$th response of item $i$ with respect to ability $q$. To use Eq. (7) in applications it is necessary that the parameters $\beta_{iqy}$ have an empirical interpretation, and that Eq. (7) does not restrict the values of $\lambda_{y}^{\bar{Y}_i}$ $(y > 0)$ in any way. Furthermore, some of the $\beta_{iqy}$ parameters will often be set equal to zero to ensure that the estimated $\beta_{iqy}$ parameters are unique. Thus $\{\beta_{iqy}, y > 0\}$ can be considered as a reparameterization of $\{\lambda_{y}^{\bar{Y}_i}, y > 0\}$. Substituting Eqs. (6) and (7) into Eq. (5) yields the multidimensional polytomous item response model formulation,

$$P_j(Y_i = y) = \frac{\exp \sum_{q=1}^{k} w_{iy}^{q} (\theta_{jq} - \beta_{iqy})}{\sum_{z=0}^{m_i} \exp \sum_{q=1}^{k} w_{iz}^{q} (\theta_{jq} - \beta_{iqz})}. \qquad (8)$$
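A small numerical sketch may help fix ideas. It assumes that the category probability for one item is proportional to the exponential of the weighted sum, over abilities q, of (theta_q minus beta_qy), with all weights of category 0 equal to zero. The function name `mplt_category_probabilities` and all parameter values are hypothetical.

```python
import math


def mplt_category_probabilities(weights, theta, beta):
    """Category probabilities of one item for one person.

    weights[y][q] -- integer weight of category y with respect to ability q
    theta[q]      -- person ability q
    beta[y][q]    -- difficulty of category y with respect to ability q
    Category 0 carries zero weights, so its kernel is exp(0) = 1.
    """
    kernels = []
    for w_y, b_y in zip(weights, beta):
        exponent = sum(w * (t - b) for w, t, b in zip(w_y, theta, b_y))
        kernels.append(math.exp(exponent))
    total = sum(kernels)
    return [kernel / total for kernel in kernels]


# Dichotomous Rasch item: one ability, weights equal to the responses.
rasch = mplt_category_probabilities(
    weights=[[0], [1]], theta=[0.5], beta=[[0.0], [-0.2]]
)

# Three-category item with weights equal to the scores (partial-credit style).
pcm = mplt_category_probabilities(
    weights=[[0], [1], [2]], theta=[0.5], beta=[[0.0], [-0.2], [0.3]]
)
```

Changing only the `weights` argument is what moves the sketch from one special case to another, which mirrors the role the category weights play in the text.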

This model is a generalization of Stegelmann's (1982) multidimensional Rasch model for dichotomously scored items. Kelderman and Rijkes (1994) have discussed it in its general polytomous form and called it the MPLT (multidimensional polytomous latent trait) model. This acronym will be used henceforth. It is easily seen that by specifying certain category weights, several well-known item response models emerge. In Table 1, specifications of category weights are given for some of these models. For example, in the dichotomous ($m_i = 1$) Rasch model (Model a), there is only one person ability parameter ($k = 1$), and the weights are equal to the responses ($w_{iy}^{1} =$

f  Duncan and Stenbeck's (1987) Multitrait Rasch Model
g  Partial Credit Model With Several Null Categories (Wilson & Masters, 1993)

$y$). This restriction is equivalent to the restriction that in the loglinear model in Eq. (5), each correct response shows the same interaction with the person ($\lambda_{1j}^{\bar{Y}_1 J} = \cdots = \lambda_{1j}^{\bar{Y}_n J} = \theta_{j1}$). That is, the parameters describing the interactions between the persons and the item responses are equal across items. If this common interaction parameter is high (low) for a certain person, his or her probability of getting the items correct is also high (low). Thus, the parameter provides an indication of the person's ability. The well-known partial credit model (PCM) (Masters, 1982, this volume) arises if the category weights are set equal to the scores. In the PCM, there is one incorrect response ($y = 0$), one correct response ($y = m_i$), and one or more partially correct responses ($y = 1, \ldots, m_i - 1$). Wilson and Adams (1993) discuss a partial credit model for ordered partition of response vectors, where every response vector corresponds to a level $w_{iy}^{1}$. Wilson and Masters (1993) also discuss models where one or more of the possible response categories are not observed in the data. The PCM is then changed such that $\exp(-\beta_{i1y}) = 0$ and $w_{iy}^{1} = 0$ if $y$ is a null category. This


measure has the effect of removing the elements from the sample space with the particular null category and keeping the response framework intact (see Model g in Table 1). Models b through h in Table 1 are models for polytomously scored items. Models b, e, and g specify one ability parameter but Models c, d, and f have several ability parameters. In Model e there is one incorrect response ($y = 0$) but several correct responses ($y > 0$). Each of these correct responses is related in the same way to the person, both within ($\lambda_{1j}^{\bar{Y}_i J} = \cdots = \lambda_{m_i j}^{\bar{Y}_i J} = \theta_{j1}$) [...] take higher values, the probability of a correct response becomes higher too. Note, however, that different correct responses may have different response probabilities because their difficulty parameters ($\beta_{i1y}$) may have different values. Model b has one person ability parameter, but it gives a response a higher category weight ($w_{iy}^{1}$) as a function of its level of correctness. Model d has $m_i$ ability parameters, where each level of correctness involves a distinct ability parameter. To ensure that the response-difficulty parameters are estimable, the restriction $\beta_{iqy} = 0$ if $q \neq y$ must be imposed on this model. Model c is a multidimensional Rasch model in which, except for the zero category, each response category has its own person parameter. It is easily verified in Eq. (8) that Model c is in fact a reparameterization of Model d where $\theta_{jq} \to \theta_{j1} + \cdots + \theta_{jq}$ and $\beta_{iqq} \to \beta_{i1q} + \cdots + \beta_{iqq}$. Duncan and Stenbeck (1987) have proposed a multidimensional model for the analysis of Likert-type items. A further restriction on a unidimensional ($k = 1$) version of the model in Eq. (8) is discussed in Adams and Wilson (1991). They impose a linear structure $\beta_{i1y} = \mathbf{a}_{iy}' \boldsymbol{\xi}$ on the item parameters so that LLTM-type models can be specified for polytomous data. Following this strategy, for example, rating scale models (Andersen, this volume; Andrich, 1978) for polytomous items can be formulated. The advantage of the MPLT model, however, lies not only in the models of Table 1. One can also go beyond these models, and specify one's own IRT model. In the Example, such an application of MPLT is described.

Random-Person IRT Models

In the previous section, all models were formulated as loglinear models for fixed persons. Each of these persons was described by one or more person parameters. A well-known problem in estimating the parameters of these models is that the number of person parameters increases as the number of persons increases. The result is that the usual asymptotic arguments for obtaining maximum likelihood estimates of the parameters do not hold, so that the estimates become inconsistent (Andersen, 1973; Neyman and Scott, 1948). In this section, some new models are discussed that do have consistent maximum likelihood estimates. These models can also be formulated as loglinear models, but now for persons randomly drawn from some population. Inserting Eq. (8) into (4) gives the probability of response vector $\mathbf{y}$ of person $j$:

$$\log P_j(\mathbf{Y} = \mathbf{y}) = \mu_j + \sum_{q=1}^{k} t_q \theta_{jq} + \sum_{i=1}^{n} \lambda_{y_i}^{\bar{Y}_i}, \qquad (9)$$

with

$$\mu_j = -\log \sum_{\mathbf{y}} \exp\left( \sum_{q=1}^{k} t_q \theta_{jq} + \sum_{i=1}^{n} \lambda_{y_i}^{\bar{Y}_i} \right),$$

where the outer sum runs over all response patterns and $t_q = w_{1y_1}^{q} + \cdots + w_{ny_n}^{q}$ is the sum of the weights corresponding to $\theta_{jq}$. Because $\{\beta_{iqy}, y > 0\}$ is a reparameterization of $\{\lambda_{y}^{\bar{Y}_i}, y > 0\}$, $\lambda_{y}^{\bar{Y}_i}$ will be used for simplicity. There are two ways to eliminate the person parameters from IRT models: (1) conditioning on their sufficient statistics and (2) integrating them out. Both approaches yield new loglinear models in which persons have to be considered as randomly drawn from a population. The joint MPLT model in Eq. (9) is an exponential-family model (Cox and Hinkley, 1974, p. 28), where the statistics $t_q$ $(q = 1, \ldots, k)$ are jointly sufficient for the person parameters $\theta_{jq}$ $(q = 1, \ldots, k)$. A convenient property of exponential-family models is that, upon conditioning on a jointly sufficient statistic, inference about the item parameters no longer depends on the person parameters (Andersen, 1970, 1980). To see this feature, let $\mathbf{t} = (t_1, \ldots, t_k)$ be the vector of sums of category weights and let $\mathbf{T} = (T_1, \ldots, T_k)$ be the corresponding random vector. The probability of person $j$'s sum of category weights $\mathbf{t}$ can be obtained by summing the probabilities in Eq. (9) over all possible response patterns with the same sum of category weights:

$$P_j(\mathbf{T} = \mathbf{t}) = \exp\left( \mu_j + \sum_{q=1}^{k} t_q \theta_{jq} \right) \sum_{\mathbf{y} \mid \mathbf{t}} \exp\left( \sum_{i=1}^{n} \lambda_{y_i}^{\bar{Y}_i} \right). \qquad (10)$$

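Elimination of the person parameters by conditioning is achieved by subtracting the logarithm of Eq. (10) from (9):

$$\log P_j(\mathbf{Y} = \mathbf{y} \mid \mathbf{T} = \mathbf{t}) = \mu_{\mathbf{t}} + \sum_{i=1}^{n} \lambda_{y_i}^{\bar{Y}_i}, \qquad (11)$$

with $\mu_{\mathbf{t}} = -\log \sum_{\mathbf{y} \mid \mathbf{t}} \exp\left( \sum_{i=1}^{n} \lambda_{y_i}^{\bar{Y}_i} \right)$. This result is a loglinear model that contains only random parameters for the item responses and a fixed parameter for the sum of category weights.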
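That conditioning removes the person parameter can be verified numerically in the simplest special case. The sketch below uses the dichotomous Rasch model with three items, where the sum of category weights is the number-correct score; the item difficulties and the function names are illustrative assumptions.

```python
import math
from itertools import product


def pattern_probability(y, theta, beta):
    """P_j(Y = y) under the dichotomous Rasch model with local independence."""
    probability = 1.0
    for y_i, beta_i in zip(y, beta):
        p_correct = math.exp(theta - beta_i) / (1.0 + math.exp(theta - beta_i))
        probability *= p_correct if y_i == 1 else 1.0 - p_correct
    return probability


def conditional_probability(y, theta, beta):
    """P_j(Y = y | T = t), where t is the number-correct score of y."""
    t = sum(y)
    same_score = [z for z in product((0, 1), repeat=len(beta)) if sum(z) == t]
    return pattern_probability(y, theta, beta) / sum(
        pattern_probability(z, theta, beta) for z in same_score
    )


beta = [-0.5, 0.0, 1.0]   # hypothetical item difficulties
pattern = (1, 0, 0)       # only the first item answered correctly

low = conditional_probability(pattern, theta=-1.0, beta=beta)
high = conditional_probability(pattern, theta=2.0, beta=beta)
```

The two conditional probabilities agree although the abilities differ by three logits, and both equal exp(-beta_1) divided by the sum of exp(-beta_i) over the three items, an expression in which the person parameter does not appear.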


Note that the person parameters have vanished. Therefore, the same model holds for all persons with the same sum of category weights $\mathbf{t}$. Consequently, the same model holds also for the probability $P(\mathbf{Y} = \mathbf{y} \mid \mathbf{T} = \mathbf{t})$ of the response pattern $\mathbf{y}$ of a randomly sampled person with sum of category weights $\mathbf{t}$:

$$\log P(\mathbf{Y} = \mathbf{y} \mid \mathbf{T} = \mathbf{t}) = \mu_{\mathbf{t}} + \sum_{i=1}^{n} \lambda_{y_i}^{\bar{Y}_i}. \qquad (12)$$

In the terminology of loglinear modeling (Bishop et al., 1975, Sec. 5.4; Haberman, 1979, Sec. 7.3), the model in Eq. (12) can be viewed as a quasi-loglinear model for the incomplete item 1 x item 2 x ... x item k x sum-of-weights contingency table with structurally zero cells if the sum of weights is not consistent with the response pattern (Kelderman, 198?; Kelderman and Rijkes, 1994). For example, Rasch (1960/1980), Andersen (1973), and Fischer (1974) discuss the estimation of item parameters after conditioning on the sufficient statistics of the person parameters in the dichotomous Rasch model.

The second approach also assumes that the persons are randomly drawn from some population and respond independently to one another. Furthermore, it is assumed that the sums of category weights $\mathbf{T}$ have a multinomial distribution with number parameter $N$ and with $n - 1$ unrestricted parameters for the probabilities. The probability of the person's response pattern can then be written as the following loglinear model:

$$\log P(\mathbf{Y} = \mathbf{y}) = \log\left( P(\mathbf{Y} = \mathbf{y} \mid \mathbf{T} = \mathbf{t})\, P(\mathbf{T} = \mathbf{t}) \right) = \log P(\mathbf{T} = \mathbf{t}) + \mu_{\mathbf{t}} + \sum_{i=1}^{n} \lambda_{y_i}^{\bar{Y}_i}. \qquad (13)$$

For the case of the dichotomous Rasch model, this approach was proposed by Tjur (1982), Duncan (1984), and Kelderman (1984). The results can readily be generalized to the MPLT.

The models in Eqs. (12) and (13) differ in the sampling scheme that is assumed for the data. The former model assumes product-multinomial sampling, whereas the latter assumes multinomial sampling. Bishop et al. (1975, Sec. 3.3) showed that, for both models, the kernel of the likelihood is identical. As a result, both models have the same set of likelihood equations for estimating the parameters and the same statistics for testing their goodness of fit. Both models also disregard the proper sum-of-category-weights distribution in Eq. (10). The model in Eq. (12) does this by considering the sum of category weights as fixed and concentrating only on the conditional distribution. The model in Eq. (13) does so by replacing the sum-of-category-weights distribution for each person by an unrestricted multinomial sum-of-category-weights distribution for the population. Tjur (1982) calls this model an "extended random model." Obviously, the extended random model and the original model in Eq. (9) are not the same. Note that the individual sum-of-category-weights distribution has a quite restrictive form. This restriction is not respected in the extended random model.

Another derivation of the loglinear Rasch model has shed more light on the difference between the two models. Cressie and Holland (1983) arrive at the loglinear Rasch model in Eq. (13) by integrating out the person parameters. They call this model the "extended Rasch model." Introducing more general notation here, let the persons again be randomly drawn from a population, and let $\Theta = (\Theta_1, \ldots, \Theta_q, \ldots, \Theta_k)$ be the vector random variable for the person parameters, with realization $\theta = (\theta_1, \ldots, \theta_q, \ldots, \theta_k)$ and distribution $f(\theta)$. Integrating the MPLT model in Eq. (9) over this distribution, the probability of a response pattern y becomes

$$P(Y = y) = \exp\Big(\sum \lambda\Big) \int \cdots \qquad (14)$$

where $\lambda$ satisfies the constraints imposed by the chosen model. For the case of the dichotomous Rasch model, Cressie and Holland (1983) then replaced the logarithm of the integral by an unrestricted parameter $\sigma_t$ ($= \mu + \lambda_t$) that depends only on the sum of category weights. They showed that the sum-of-category-weights parameters must satisfy complicated constraints, which can be checked (see also Hout et al., 1987; Lindsay et al., 1991) but under which estimation is hard to pursue. If these constraints are ignored, Cressie and Holland's extended Rasch model is obtained. This model is the same loglinear model as Tjur's extended random model in Eq. (13). The extended Rasch model is less restrictive than the original model in Eq. (14), but de Leeuw and Verhelst (1986) and Follmann (1988) proved that, for the case of the dichotomous Rasch model, under certain mild conditions on the distribution f, both models have estimators that are asymptotically equivalent.
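The conditioning argument used in this section can be checked numerically for the dichotomous Rasch model. The following is a minimal sketch, not part of the original text: the item difficulties `b`, the response pattern `y`, and both function names are hypothetical, and the sum of category weights is taken to be the number-correct score. It illustrates that the probability of a response pattern given the score does not depend on the person parameter.

```python
from itertools import product
from math import exp

def rasch_pattern_prob(y, theta, b):
    """P(Y = y | theta) under the dichotomous Rasch model with difficulties b."""
    prob = 1.0
    for y_i, b_i in zip(y, b):
        p_correct = exp(theta - b_i) / (1.0 + exp(theta - b_i))
        prob *= p_correct if y_i == 1 else 1.0 - p_correct
    return prob

def conditional_pattern_prob(y, theta, b):
    """P(Y = y | T = t, theta), where t = sum(y) is the number-correct score."""
    t = sum(y)
    # All response patterns that share the same score t.
    same_score = [z for z in product((0, 1), repeat=len(b)) if sum(z) == t]
    denominator = sum(rasch_pattern_prob(z, theta, b) for z in same_score)
    return rasch_pattern_prob(y, theta, b) / denominator

b = [-1.0, 0.0, 0.5, 1.2]   # hypothetical item difficulties
y = (1, 0, 1, 0)            # hypothetical response pattern with score t = 2

low_ability = conditional_pattern_prob(y, -2.0, b)
high_ability = conditional_pattern_prob(y, 3.0, b)
print(low_ability, high_ability)   # the two values agree up to rounding error
```

Because the person parameter cancels from the ratio, the same conditional model holds for every person with the same score, which is what permits item parameters to be estimated without estimating person parameters.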

Parameter Estimation

For the case of the MPLT, an extended loglinear formulation can be used to estimate the item parameters. These parameter estimates correspond to conditional maximum likelihood estimates. The log-likelihood of the model in Eq. (13) is

$$l = \sum_y f_y \log P(Y = y) = N\mu + \sum_t f_t \lambda_t + \sum_{i=1}^{n} \sum_{y_i} f_{y_i} \lambda_{y_i}, \qquad (15)$$

where $f_y$ is the observed sample frequency of vector y, $f_t$ the frequency of sum of category weights t, and $f_{y_i}$ the frequency of response $y_i$. Setting the derivatives with respect to the parameters to zero, the following likelihood equations are obtained:

$$F_t = f_t, \qquad F_{y_i} = f_{y_i}, \qquad (16)$$

for all t and $y_i = 0, \ldots, m_i$ (i = 1, ..., n), where F = E(f) denotes the expected frequency under Eq. (13) [see Haberman (1979), p. 448]. Equations (16) can be solved by iterative methods, such as iterative proportional fitting (IPF) or Newton-Raphson. In IPF, the (unrestricted) parameters are updated until convergence by

$$\lambda_t^{(new)} = \lambda_t^{(old)} + \log\frac{f_t}{F_t^{(old)}}, \qquad \lambda_{y_i}^{(new)} = \lambda_{y_i}^{(old)} + \log\frac{f_{y_i}}{F_{y_i}^{(old)}}, \qquad (17)$$

for all t and $y_i = 0, \ldots, m_i$ (i = 1, ..., n), where $F^{(old)}$ is the expected frequency computed from the previous parameter values $\lambda^{(old)}$. Note that this algorithm no longer requires the full contingency table $\{f_y\}$ or $\{F_y\}$. For a description of a Newton-Raphson algorithm that operates on the sufficient statistics, see Kelderman (1992).

Both algorithms still leave two practical difficulties to be solved. First, the marginal expected frequencies $\{F_t\}$ and $\{F_{y_i}\}$ require summation over all remaining variables. This problem can be solved by writing the model in a multiplicative form and, using the distributive law of multiplication over summation, reducing the number of summations. The reduction is achieved because the summations are only over parameters whose indices contain the summation variable. Second, for an arbitrary MPLT model, marginal frequencies $\{F_t\}$ may be hard to compute because the response vectors y over which summation has to take place to obtain $\{F_t\}$ from $\{F_y\}$ must be consistent with the vector of sums of category weights t. This problem can be solved by keeping track of the sum of the category weights t while summing over the $y_i$ variables.

For example, suppose there are four items and an MPLT model with multiplicative parameters $\psi = \exp(\mu)$, $\phi_t = \exp(\lambda_t)$, $\phi_{y_1} = \exp(\lambda_{y_1})$, $\phi_{y_2} = \exp(\lambda_{y_2})$, $\phi_{y_3} = \exp(\lambda_{y_3})$, and $\phi_{y_4} = \exp(\lambda_{y_4})$. Furthermore, let $t_q^{(i)} = w_{1qy_1} + \cdots + w_{iqy_i}$ be the sum of category weights computed over the first i variables, and let $t^{(i)} = (t_1^{(i)}, \ldots, t_k^{(i)})$ be the vector of these partial sums of category weights. The expected marginal frequency for the sums of category weights can now be obtained by nested summation, where $\sum_{a,b \mid c}$ means summation over all values of a and b that are consistent with c. The summation can be obtained in two steps. First, the products in the summand as well as their partial sums of category weights, e.g., $\{(\phi_{y_1}\phi_{y_2}), t^{(2)};\ y_1 = 1, \ldots, m_1;\ y_2 = 1, \ldots, m_2\}$, are computed. Second, these products $(\phi_{y_1}\phi_{y_2})$ are summed only if the partial sums of category weights $t^{(2)}$ are identical. This step yields a sum for each distinct value of the sum of category weights.

If the model is the dichotomous Rasch model (Table 1, Model a), this method to compute marginal expected frequencies is identical to the so-called summation algorithm for the computation of elementary symmetric functions (Andersen, 1980). See Verhelst et al. (1984) for a thorough discussion of the accuracy of the computation of elementary symmetric functions.

In the program LOGIMO (LOglinear Irt MOdeling; Kelderman and Steen, 1993), the above method is used to compute the marginal frequencies $\{F_z\}$, where z may be any vector of item or sum-of-category-weights variables. LOGIMO solves the likelihood equations of any loglinear model with sum-of-category-weights variables using iterative proportional fitting or Newton-Raphson methods. It is written in Pascal, runs on a 386 PC/MS-DOS system or higher, and can be obtained from iec ProGAMMA, Box 841, 9700 AV Groningen, The Netherlands. The program OPLM (Verhelst et al., 1993) estimates the parameters in Eq. (8) for the unidimensional case (k = 1). It can be obtained from the authors at Cito, Box 1034, 6801 MG, Arnhem, The Netherlands.
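The two-step summation described above (form products together with their partial sums of category weights, then merge products whose partial sums coincide) can be sketched as follows. This is an illustrative sketch only, not the LOGIMO implementation: the function name and the example values are hypothetical, and for simplicity the sum of category weights is a single integer rather than a vector.

```python
from collections import defaultdict

def marginal_sums_by_weight(item_params, item_weights):
    """Sum products of multiplicative item parameters per total category weight.

    item_params[i][y]  : multiplicative parameter phi for response y of item i
    item_weights[i][y] : category weight of response y of item i (integer here)
    Returns a dict {t: sum of prod_i phi over all patterns with weight sum t}.
    """
    partial = {0: 1.0}  # empty product, partial weight sum 0
    for phis, weights in zip(item_params, item_weights):
        updated = defaultdict(float)
        for t_partial, subtotal in partial.items():
            for phi, w in zip(phis, weights):
                # Extend each partial sum by one item; products whose
                # partial weight sums are identical are added together.
                updated[t_partial + w] += subtotal * phi
        partial = dict(updated)
    return partial

# Dichotomous Rasch case: response 0 has parameter 1 and weight 0, response 1
# has parameter eps_i and weight 1, so the result is the set of elementary
# symmetric functions of (eps_1, eps_2, eps_3).
eps = [0.5, 1.0, 2.0]  # hypothetical values
gamma = marginal_sums_by_weight([(1.0, e) for e in eps], [(0, 1)] * len(eps))
print(gamma)
```

The work grows with the number of items times the number of distinct partial sums, rather than with the number of response patterns, which is the reason for tracking the partial sums during the summation.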

Goodness of Fit

The overall fit of MPLT models can be tested by customary statistics for loglinear models, such as Pearson's statistic (X²) and the likelihood-ratio statistic (G²) (Bishop et al., 1975; Haberman, 1979). However, the usual asymptotic properties of these statistics break down if the number of items is large (Haberman, 1977; Koehler, 1977; Lancaster, 1961), although X² seems to be more robust than G² (Cox and Plackett, 1980; Larntz, 1978). In this case, the following two alternative test statistics can be used. One approach is to compare the model of interest (M) with a less restrictive model (M*) using the likelihood-ratio statistic −2[l(M) − l(M*)], with degrees of freedom equal to the difference in the numbers of estimable parameters of the two models (Rao, 1973, pp. 418-420). If one model is not a special case of the other, Akaike's (1977) information criterion (AIC), 2[(number of parameters) − l], can be compared for both models. The model with the



Henk Kelderman

TABLE 2. Specification of Cognitive Processes Involved in the Responses on the Items of the ASCP Medical Laboratory Test.

I Applies Knowledge

Item a 1 2 3 4 5 6 7 8 9 10 11

1 2 2 1 1 1 1 1 1


c 2

d 1

II Calculates a b c d

III Correlates Data a b c d

IV Correct a b c d 1 1

2 1 1 1 1 1 1 1

1 1 1 1 1

1 2







1 1




1 2 1


1 1

1 1


TABLE 3. AIC Statistics for Eleven-Item and Nine-Item Data Sets. Eleven-Item Data Set Model Independence A B C D

Ability None I, II, III IV I, II, III, IV I, III, IV


Nine-Item Data Set AIC-


18000 34 262 43 645

704 265 104 124

28 219 36 446 191

AIC18000 1223 1088 854 996 942

1 1


1 2



smallest AIC is chosen as the best-fitting model. The second approach is to consider the Pearson X² statistics for marginal contingency tables rather than the full contingency table. For example, for the dichotomous Rasch model, one could calculate a Pearson statistic for each combination of an item i and a sum of category weights q (van den Wollenberg, 1979, 1982). In the next section, an example will be presented in which both approaches are used to check model fit.
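The fit statistics named in this section are simple to compute once observed and expected frequencies and the maximized log-likelihood are available. The sketch below uses hypothetical frequencies and hypothetical function names. It follows the definitions as stated in the text, in particular AIC = 2[(number of parameters) − l], and is not taken from any particular program.

```python
from math import log

def pearson_x2(observed, expected):
    """Pearson statistic: sum of (o - e)^2 / e over cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def likelihood_ratio_g2(observed, expected):
    """Likelihood-ratio statistic: 2 * sum of o * log(o / e); empty cells add 0."""
    return 2.0 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)

def aic(log_likelihood, n_parameters):
    """Akaike's information criterion as defined in the text."""
    return 2.0 * (n_parameters - log_likelihood)

observed = [30, 20, 50]        # hypothetical cell counts
expected = [25.0, 25.0, 50.0]  # hypothetical fitted frequencies

x2 = pearson_x2(observed, expected)
g2 = likelihood_ratio_g2(observed, expected)
print(x2, g2)

# Two hypothetical non-nested models: the one with the smaller AIC is preferred.
aic_small_model = aic(log_likelihood=-100.0, n_parameters=5)
aic_large_model = aic(log_likelihood=-98.5, n_parameters=9)
print(aic_small_model, aic_large_model)
```

In this made-up comparison the larger model gains 1.5 log-likelihood units at the cost of four extra parameters, so its AIC is higher and the smaller model would be retained.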

Example

The American Society of Clinical Pathologists (ASCP) produces tests for the certification of paramedical personnel. In this section, the responses of 3331 examinees to 11 four-choice items from a secure data base were re-analyzed. The items measure the ability to perform medical laboratory tests. The composite test score used was equal to the number of correct responses given by the examinee. Although each item had one correct alternative and three incorrect alternatives, it was hypothesized by content experts that incorrect responses might often have been chosen after cognitive activities similar to those necessary to arrive at the correct response. Therefore, it was felt that the incorrect responses might also contain valid information about the examinee. It was hypothesized that three cognitive processes had been involved in answering the items: "Applying Knowledge" (I), "Calculating" (II), and "Correlating Data" (III). Table 2 provides the weights that content experts assigned to each of the item responses for each cognitive process. For example, the correct response in Item 4 involved two calculations, whereas

Response c involved only one. The 11 items were selected from the data base such that a balanced representation of the three cognitive processes was obtained. The correct responses (IV) are also given. It was hypothesized that giving the correct response required an additional metacognitive activity that went beyond Cognitive Processes I, II, and III. Finally, it was assumed that for each examinee distinct parameters were needed to describe the ability to perform each of the Cognitive Processes I, II, III, and IV.

To study whether the items provided information about the hypothesized abilities I, II, and III, several MPLT models from Eq. (9) were specified. Item responses were described by λ rather than β parameters so that responses with zero scoring weights could have different probabilities. To obtain an interpretable and unique set of parameters, two restrictions had to be imposed on the λ parameters. First, to make the parameters of the incorrect responses comparable, a reparameterization was done to guarantee that their sum was equal to zero in each item. In an item i, this result was achieved by subtracting their mean from all parameters of item i and adding it to the general mean parameter μ. Second, to fix the origin of the ability scale, the sum of the item parameters corresponding to the correct responses was set equal to zero. This was done by subtracting their mean from the correct response parameters and next subtracting $t_4$ times this value from $\lambda_t$.

The second column of Table 3 shows which ability parameters were specified in each model. The weights for each of these ability parameters are given in Table 2. In the complete independence model, no traits were specified. The corresponding loglinear model had main effects of Items 1 through 11 only. Model A contained the ability parameters for Cognitive Processes I, II, and III; Model B contained only the ability parameter for the correct response (IV); and Model C contained all ability parameters. Model D had all ability parameters except the one for "Calculating." This model will be discussed later on.

From the AIC values in Table 3, it is seen that the complete independence model did not explain the data as well as the other models. Furthermore, Models B and C, which also contained an ability for Cognitive Process



TABLE 4. Goodness-of-Fit Statistics for Grouped Item x Sum-of-Category-Weights IV (Model B; 11-Item Data Set).

























IV, had better fit to the data than Model A, which contained only ability parameters for Cognitive Processes I, II, and III. Surprisingly, the more restrictive model (Model B) provided a slightly better fit than the most inclusive model (Model C). This result suggests that the contribution of Cognitive Processes I, II, and III to explaining test behavior in this situation is quite small; in fact, it is probably smaller than the errors associated with the model parameter estimates. In this situation, Model B, which was based on number-correct scoring, probably described the response behavior to a sufficient degree.

To test whether the items fitted Model B, Pearson statistics were computed for each of the grouped item-response x sum-of-category-weights IV contingency tables. Because the sum of category weights is a sufficient statistic for the ability parameter, Pearson fit statistics are sensitive to lack of fit of the responses of an item with respect to the particular ability parameter. Note that the overall goodness-of-fit statistics X² or G² could not be used in this case because the contingency table had too many cells ($4^{11}$) relative to the number of examinees (3331). It is seen from Table 4 that Items 3 and 7 did not fit the model very well. Therefore, it was decided to set these items aside and analyze the remaining nine-item data set. Models A, B, and C were fitted again. Table 3 shows that the nine-item data set had the same pattern of fit as the 11-item data set: Model B fitted better than Models A or C. In this case, again, the data could not identify any of the cognitive-process abilities.

To consider one more alternative model, Model D was suggested. It was conjectured that too many cognitive processes were specified in Models A and C. In particular, the "Calculation" process (II) could have been too easy to discriminate between examinees. Therefore, in Model D, this ability was removed. Although this model fitted better than Model C, it did not fit better than Model B, indicating that the remaining abilities I and III could not be detected in the data. In Table 5, grouped-Pearson X² statistics and parameter estimates are given for Model B for the nine-item data set. It is seen that the fit of the individual items to the model was reasonable. The estimates of the


TABLE 5. Parameter Estimates and Goodness of Fit Statistics of Model B in Nine-Item Data Set. (Asterisk Denotes a Correct Response.) 1





5 32






Item 6 21










0.88 0.37 0.01* 0.14 -1.13



Response a


Response b

0.95 -0.02 0.66* -0.07 0.93 0.07* -0.60 -1.05


0.30 0.46* -0.97* 1.29

Response c -0.92* -0.50* -1.00 0.48 0.47* 0.58 -0.21* -0.16 -0.05* Response d -1.50 -0.85 0.63 -0.41 -1.07 0.54

0.30 0.90 -0.31

λ parameters are also given in Table 5. Items 1 and 2 turned out to be difficult, but Items 4, 6, and 10 were relatively easy. Parameter estimates for the incorrect alternatives showed, for example, that Alternative d of Item 1 was more attractive than Alternatives a or b.

References

Adams, R.J. and Wilson, M. (1991). The random coefficients multinomial logit model: A general approach to fitting Rasch models. Paper presented at the Annual Meeting of the American Educational Research Association, Chicago, April.
Agresti, A. (1984). Analysis of Ordinal Categorical Data. New York: Wiley.
Agresti, A. (1993). Computing conditional maximum likelihood estimates for generalized Rasch models using simple loglinear models with diagonals parameters. Scandinavian Journal of Statistics 20, 63-71.
Akaike, H. (1977). On entropy maximization principle. In P.R. Krishnaiah (Ed), Applications of Statistics (pp. 27-41). Amsterdam: North Holland.
Andersen, E.B. (1970). Asymptotic properties of conditional maximum likelihood estimators. Journal of the Royal Statistical Society B 32, 283-301.
Andersen, E.B. (1973). Conditional inference and multiple choice questionnaires. British Journal of Mathematical and Statistical Psychology 26, 31-44.
Andersen, E.B. (1980). Discrete Statistical Models with Social Science Applications. Amsterdam: North Holland.



Andrich, D. (1978). A rating scale formulation for ordered response categories. Psychometrika 43, 561-573.
Baker, R.J. and Nelder, J.A. (1978). The GLIM System: Generalized Linear Interactive Modeling. Oxford: The Numerical Algorithms Group.
Bishop, Y.M.M., Fienberg, S.E., and Holland, P.W. (1975). Discrete Multivariate Analysis. Cambridge, MA: MIT Press.
Bock, R.D. (1975). Multivariate Statistical Methods in Behavioral Research. New York: McGraw Hill.
Cox, D.R. and Hinkley, D.V. (1974). Theoretical Statistics. London: Chapman and Hall.
Cox, M.A.A. and Plackett, R.L. (1980). Small samples in contingency tables. Biometrika 67, 1-13.
Cressie, N. and Holland, P.W. (1983). Characterizing the manifest probabilities of latent trait models. Psychometrika 48, 129-142.
de Leeuw, J. and Verhelst, N.D. (1986). Maximum likelihood estimation in generalized Rasch models. Journal of Educational Statistics 11, 183-196.
Duncan, O.D. (1984). Rasch measurement: Further examples and discussion. In C.F. Turner and E. Martin (Eds), Surveying Subjective Phenomena (Vol. 2, pp. 367-403). New York: Russell Sage Foundation.
Duncan, O.D. and Stenbeck, M. (1987). Are Likert scales unidimensional? Social Science Research 16, 245-259.
Fienberg, S.E. (1980). The Analysis of Cross-Classified Categorical Data. Cambridge, MA: MIT Press.
Fischer, G.H. (1974). Einführung in die Theorie psychologischer Tests [Introduction to the Theory of Psychological Tests]. Bern: Huber. (In German.)
Fischer, G.H. (1987). Applying the principles of specific objectivity and generalizability to the measurement of change. Psychometrika 52, 565-587.
Follmann, D.A. (1988). Consistent estimation in the Rasch model based on nonparametric margins. Psychometrika 53, 553-562.
Goodman, L.A. (1970). Multivariate analysis of qualitative data. Journal of the American Statistical Association 65, 226-256.
Goodman, L.A. and Fay, R. (1974). ECTA Program, Description for Users. Chicago: Department of Statistics, University of Chicago.
Haberman, S.J. (1977). Log-linear models and frequency tables with small cell counts. Annals of Statistics 5, 1124-1147.
Haberman, S.J. (1979). Analysis of Qualitative Data: New Developments (Vol. 2). New York: Academic Press.



Hout, M., Duncan, O.D., and Sobel, M.E. (1987). Association and heterogeneity: Structural models of similarities and differences. Sociological Methodology 17, 145-184.
Kelderman, H. (1984). Loglinear Rasch model tests. Psychometrika 49, 223-245.
Kelderman, H. (1989). Item bias detection using loglinear IRT. Psychometrika 54, 681-697.
Kelderman, H. (1992). Computing maximum likelihood estimates of loglinear IRT models from marginal sums. Psychometrika 57, 437-450.
Kelderman, H. and Rijkes, C.P.M. (1994). Loglinear multidimensional IRT models for polytomously scored items. Psychometrika 59, 147-177.
Kelderman, H. and Steen, R. (1993). LOGIMO: Loglinear Item Response Modeling [computer manual]. Groningen, The Netherlands: iec ProGAMMA.
Koehler, K.J. (1977). Goodness-of-Fit Statistics for Large Sparse Multinomials. Unpublished doctoral dissertation, School of Statistics, University of Minnesota.
Lancaster, H.O. (1961). Significance tests in discrete distributions. Journal of the American Statistical Association 56, 223-234.
Larntz, K. (1978). Small-sample comparisons of exact levels for chi-square statistics. Journal of the American Statistical Association 73, 412-419.
Lord, F.M. and Novick, M.R. (1968). Statistical Theories of Mental Test Scores. Reading, MA: Addison Wesley.
Lindsay, B., Clogg, C.C., and Grego, J. (1991). Semiparametric estimation in the Rasch model and related exponential response models, including a simple latent class model for item analysis. Journal of the American Statistical Association 86, 96-107.
Masters, G.N. (1982). A Rasch model for partial credit scoring. Psychometrika 47, 149-174.
Neyman, J. and Scott, E.L. (1948). Consistent estimates based on partially consistent observations. Econometrica 16, 1-32.
Rao, C.R. (1973). Linear Statistical Inference and Its Applications (2nd ed.). New York: Wiley.
Rasch, G. (1960/1980). Probabilistic Models for Some Intelligence and Attainment Tests. Chicago: The University of Chicago Press.
SPSS (1988). SPSS User's Guide (2nd ed.). Chicago, IL: Author.
Stegelmann, W. (1983). Expanding the Rasch model to a general model having more than one dimension. Psychometrika 48, 257-267.
Thissen, D. and Mooney, J.A. (1989). Loglinear item response theory, with applications to data from social surveys. Sociological Methodology 19, 299-330.



Tjur, T. (1982). A connection between Rasch's item analysis model and a multiplicative Poisson model. Scandinavian Journal of Statistics 9, 23-30.
van den Wollenberg, A.L. (1979). The Rasch Model and Time Limit Tests. Unpublished doctoral dissertation, Katholieke Universiteit Nijmegen, The Netherlands.
van den Wollenberg, A.L. (1982). Two new test statistics for the Rasch model. Psychometrika 47, 123-140.
Verhelst, N.D., Glas, C.A.W., and van der Sluis, A. (1984). Estimation problems in the Rasch model. Computational Statistics Quarterly 1, 245-262.
Verhelst, N.D., Glas, C.A.W., and Verstralen, H.H.F.M. (1993). OPLM: Computer Program and Manual. Arnhem, The Netherlands: Cito.
Wilson, M. and Adams, R.J. (1993). Marginal maximum likelihood estimation for the ordered partition model. Journal of Educational Statistics 18, 69-90.
Wilson, M. and Masters, G.N. (1993). The partial credit model and null categories. Psychometrika 58, 87-99.

18 Multicomponent Response Models

Susan E. Embretson

Introduction

Cognitive theory has made a significant impact on theories of aptitude and intelligence. Cognitive tasks (including test items) are viewed as requiring multiple processing stages, strategies, and knowledge stores. Both tasks and persons vary on the processing components. That is, the primary sources of processing difficulty may vary between tasks, even when the tasks are the same item type.

Processing components have been studied on many tasks that are commonly found on aptitude tests. For example, cognitive components have been studied on analogies (Mulholland et al., 1980; Sternberg, 1977), progressive matrices (Carpenter et al., 1990), spatial tasks (Pellegrino et al., 1985), mathematical problems (Embretson, 1995a; Mayer et al., 1984), and many others. These studies have several implications for measurement, since persons also vary in performing the various processing components. First, item response theory (IRT) models that assume unidimensionality are often inappropriate. Since tasks depend on multiple sources of difficulty, each of which may define a source of person differences, multidimensional models are theoretically more plausible. Although unidimensional models can be made to fit multidimensional tasks, the resulting dimension is a confounded composite. Second, the same composite ability estimate may arise from different patterns of processing abilities, thus creating problems for interpretation. Third, a noncompensatory or conjunctive multidimensional IRT model is theoretically more plausible than a compensatory model for multicomponent tasks. That is, solving an item depends on the conjunction of successful outcomes on several aspects of processing.

The Multicomponent Latent Trait Model (MLTM) (Whitely, 1980) was developed to measure individual differences in underlying processing components on complex aptitude tasks. MLTM is a conjunctive multidimensional model in which task performance depends on component task difficulties and component person abilities. A generalization of MLTM, the General Component Latent Trait Model (GLTM; Embretson, 1984), permits the stimulus features of tasks to be linked to component task difficulties.



Such stimulus features are typically manipulated in experimental cognitive studies to control the difficulty of specific processes. Thus, GLTM permits a more complete linkage to cognitive theory than the original MLTM.

MLTM and GLTM have many research applications, especially if the sources of performance differences are a foremost concern. Perhaps the most extensive application is construct validation research. The fit of alternative MLTM or GLTM models and the component parameter estimates provide data that are directly relevant to the theoretical processing that underlies test performance. Furthermore, MLTM and GLTM are also quite useful for test design, including both item development and item selection. The model parameters can be used to develop tests with specified sources of processing difficulty. Additionally, applications of MLTM and GLTM to research on individual and group differences provide greater explanatory power. That is, the sources of performance differences, in terms of underlying processes, can be pinpointed. These applications will be illustrated or discussed in this chapter.

Presentation of the Model

MLTM specifies the relationship between the response to the item and responses to underlying processing components as conjunctive. That is, the underlying components must be executed successfully for the task to be solved. Thus, the probability of solving the task is given by the product of the probabilities of solving the individual tasks, as follows:

$$P(U_{ijT} = 1 \mid \theta_j, b_i) = (s - g) \prod_k P(U_{ijk} = 1 \mid \theta_{jk}, b_{ik}) + g, \qquad (1)$$

where $\theta_{jk}$ is the ability for person j on component k, $b_{ik}$ is the difficulty of item i on component k, $U_{ijk}$ is the response for person j on the kth component of item i, $U_{ijT}$ is the response for person j on the total task T for item i, g is the probability of solving the item by guessing, and s is the probability of applying the component outcomes to solve the task.

In turn, the processing components are governed by the ability of the person and the task difficulty. The person and item parameters of MLTM are included in a logistic model that contains the component ability $\theta_{jk}$ and component difficulty $b_{ik}$, such that the full MLTM model may be written as follows:

$$P(U_{ijT} = 1 \mid \theta_j, b_i) = (s - g) \prod_k \frac{\exp(\theta_{jk} - b_{ik})}{1 + \exp(\theta_{jk} - b_{ik})} + g, \qquad (2)$$

where $\theta_{jk}$ and $b_{ik}$ are defined as in Eq. (1). An inspection of Eq. (2) shows that the component probabilities are specified by a Rasch model.

For example, verbal analogies have been supported as involving two general components, rule construction and response evaluation (see Whitely, 1981). In a three-term analogy, the task is to complete the analogy with a term that relates to the third term in the same way as the second term relates to the first term. For example,

Event : Memory :: Fire :    1) Matches    2) Ashes*    3) Camera    4) Heat

would require: (1) determining the analogy rule "something that remains afterwards"; and (2) evaluating the four alternatives as remainders of "Fire." MLTM would give the probability of solving the task as the probability of constructing the rule correctly times the probability of response evaluation by the rule. (Handling data with sequential outcomes will be discussed below.)

The item response curves show the relationships between components. The degree to which solving the item depends on a given component (e.g., $\theta_1$) depends on the logit ($d_{ijk} = \theta_{jk} - b_{ik}$) of the other components. Figure 1 presents some item response curves for an item with two components, for simplicity assuming that s = 1.0 and g = 0.0. Note that when $d_{ijk}$ is high, such that the item is very easy or the person's ability is high, the probability of passing the item is well approximated by the probability of passing the first component. However, when $d_{ijk}$ is low, the regression on $\theta_1$ decreases sharply. Furthermore, the asymptotic value is not 1.0; it is the second component probability, as specified by the logit.

A variant of MLTM is appropriate for sequentially dependent strategies (Embretson, 1985), each of which may contain one or more components. In this case, the alternative strategies are attempted, in turn, only if the primary strategy fails. Thus, the probability of solving the item is represented as the sum of the conditional strategy probabilities. For example, a multiple strategy theory of verbal analogy solving could assume that the main strategy is the rule-oriented strategy described above. If that strategy fails, then an associational strategy, finding an alternative that has high associative strength to the third term, is attempted. If that fails, then random guessing is attempted. The following strategy model represents this example:

$$P_T = (s_r - g) P_1 P_2 + (s_a - g) P_a (1 - P_1 P_2) + g, \qquad (3)$$


where $P_T$, $P_1$, $P_2$, and $P_a$ are probabilities that depend on persons and items, as in Eq. (1), with the full notation omitted for brevity. $P_T$ is the total-item probability, $P_1$ and $P_2$ are component probabilities for the rule-oriented strategy, $P_a$ is a component probability for the associational strategy, $s_r$ is the probability of applying the rule-oriented strategy, $s_a$ is the conditional probability of applying the associational strategy, and g is the conditional probability of guessing correctly.

GLTM, a generalization of MLTM, permits constraints to be placed on item difficulty, similar to the Linear Logistic Latent Trait Model (LLTM) (Fischer, 1983; this volume). Unlike LLTM, in which only the total item response is modeled, GLTM sets constraints within components. These constraints permit a cognitive model of task difficulty, based on the item stimulus features, to be embedded in the measurement model. The full GLTM may be written as follows:

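$$P(U_{ijT} = 1 \mid \theta_j, \beta, \alpha) = (s - g) \prod_k \frac{\exp\big(\theta_{jk} - \sum_m \beta_{mk} v_{imk} + \alpha_k\big)}{1 + \exp\big(\theta_{jk} - \sum_m \beta_{mk} v_{imk} + \alpha_k\big)} + g, \qquad (4)$$

where $\theta_{jk}$ is defined as in Eq. (2), $v_{imk}$ is a score representing the difficulty of item i on stimulus factor m in component k, $\beta_{mk}$ is the weight of stimulus complexity factor m in item difficulty on component k, $\alpha_k$ is the normalization constant for component k, and s is the probability of applying the component outcomes.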
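The conjunctive structure of MLTM and GLTM can be illustrated with a short sketch. It assumes that each component probability is a Rasch-type logistic function and that the GLTM component logit is $\theta_{jk} - \sum_m \beta_{mk} v_{imk} + \alpha_k$. The function names, the argument layout (`v[k][m]`, `beta[k][m]`), and all numerical values are hypothetical choices made for this illustration.

```python
from math import exp

def logistic(x):
    """Rasch-type response function."""
    return 1.0 / (1.0 + exp(-x))

def mltm_prob(theta, b, s=1.0, g=0.0):
    """MLTM total-task probability; theta[k], b[k] refer to component k."""
    product = 1.0
    for theta_k, b_k in zip(theta, b):
        product *= logistic(theta_k - b_k)
    return (s - g) * product + g

def gltm_prob(theta, v, beta, alpha, s=1.0, g=0.0):
    """GLTM total-task probability.

    v[k][m]    : score of the item on stimulus factor m in component k
    beta[k][m] : weight of stimulus factor m in component k
    alpha[k]   : normalization constant for component k
    """
    product = 1.0
    for theta_k, v_k, beta_k, alpha_k in zip(theta, v, beta, alpha):
        difficulty = sum(w * score for w, score in zip(beta_k, v_k))
        product *= logistic(theta_k - difficulty + alpha_k)
    return (s - g) * product + g

# Two components, both logits zero: 0.5 * 0.5 with s = 1 and g = 0.
p_equal = mltm_prob([0.0, 0.0], [0.0, 0.0])

# A very high logit on Component 2 leaves essentially the Component 1 probability.
p_easy_second = mltm_prob([0.0, 10.0], [0.0, 0.0])

# GLTM reduces to MLTM when the weighted stimulus scores reproduce b and alpha = 0.
p_gltm = gltm_prob([0.3, -0.2], [[1.0, 2.0], [0.5]], [[0.2, 0.1], [1.0]], [0.0, 0.0])
p_mltm = mltm_prob([0.3, -0.2], [0.4, 0.5])
print(p_equal, p_easy_second, p_gltm, p_mltm)
```

The second call mirrors the pattern described for Figure 1: when the other component's logit is high, the total-task probability is close to the probability of passing the first component.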

FIGURE 1. Regression of total item response on Component 1 ability as a function of Component 2 logit.

Parameter Estimation

The original MLTM and GLTM estimation method (Embretson, 1984; Whitely, 1980) required subtasks to identify the component parameters. The subtasks operationalize a theory of task processing. For example, the rule-oriented process for solving the verbal analogy presented above would be operationalized as follows: (1) the total task, with standard instructions, (2) the rule construction subtask, in which the stem (Event:Memory::Fire:?) is presented, with instructions to give the rule in the analogy, and (3) the response evaluation subtask, the total task plus the correct rule, with instructions to select the alternative that fulfills the rule. There are 2K+1 possible responses patterns for an item, where K is the number of postulated components. For example, Table 1 presents a sample space for two components and a total task, with eight possible response patterns. MLTM, from Eq. (2), would give the probabilities of the response pattern as shown in the last column. Now, if U_p is the vector of responses Uijii Uij2, U^T to the two components and total task for person j on item i, s, P1: and P2 are denned as in Eqs. (1) and (2) and Qi and Q2 are (1 — Pi) and (1 — P2), respectively, the probability of any response pattern can be given as

$$P(\mathbf{U}) = \left[s^{u_T}(1 - s)^{1 - u_T}\right]^{\prod_k u_k} \left[g^{u_T}(1 - g)^{1 - u_T}\right]^{1 - \prod_k u_k} \prod_k P_k^{u_k} Q_k^{1 - u_k}, \qquad (5)$$

where the subscripts $i$ and $j$ are omitted in this expression, for brevity. Note that $s$, the probability of applying component outcomes, enters into the probability only if the components are passed ($\prod_k u_k = 1$), while $g$ enters only if the components are failed ($1 - \prod_k u_k = 1$). Furthermore, the status of the total task determines if $s$ and $g$ or their complements enter into the equation. Assuming local independence, the probabilities given in Eq. (5)


18. Multicomponent Response Models

Susan E. Embretson

TABLE 1. Response Patterns, MLTM Probabilities and Observed Frequencies for a Two-Component Verbal Analogy Task.

C1   C2   T    MLTM Probability
0    0    0    (1 - g) Q1 Q2
0    1    0    (1 - g) Q1 P2
1    0    0    (1 - g) P1 Q2
1    1    0    (1 - s) P1 P2
0    0    1    g Q1 Q2
0    1    1    g Q1 P2
1    0    1    g P1 Q2
1    1    1    s P1 P2

Observed Frequencies (in extracted order; one of the eight values is missing): 202, 539, 97, 352, 64, 539, 57


can be multiplied over items and persons to obtain the joint likelihood of the data, L, as follows:

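Since neither $s$ nor $g$ varies over items or persons, well-known theorems on the binomial distribution (e.g., Andersen, 1980) can be applied to give the maximum likelihood estimators as their relative frequencies, as follows:

$$\hat{s} = \frac{\sum_j \sum_i \left(\prod_k u_{ijk}\right) u_{ijT}}{\sum_j \sum_i \prod_k u_{ijk}}, \qquad (6a)$$

$$\hat{g} = \frac{\sum_j \sum_i \left(1 - \prod_k u_{ijk}\right) u_{ijT}}{\sum_j \sum_i \left(1 - \prod_k u_{ijk}\right)}.$$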
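The response-pattern probability of Eq. (5) and the relative-frequency estimators of $s$ and $g$ can be illustrated with a short Python sketch. The function names and the toy data are hypothetical; the code only assumes what the text states, namely that $s$ applies when all components are passed and $g$ applies otherwise.

```python
from itertools import product


def pattern_probability(u, u_total, p, s, g):
    """Probability of one response pattern.

    u       : tuple of 0/1 component responses
    u_total : 0/1 response on the total task
    p       : tuple of component success probabilities P_k
    """
    if all(u):
        # All components passed: s or its complement enters.
        total_term = s if u_total == 1 else 1.0 - s
    else:
        # At least one component failed: g or its complement enters.
        total_term = g if u_total == 1 else 1.0 - g
    component_term = 1.0
    for u_k, p_k in zip(u, p):
        component_term *= p_k if u_k == 1 else 1.0 - p_k
    return total_term * component_term


def estimate_s_g(records):
    """Relative-frequency estimates of s and g.

    records: list of (component_responses, total_response) pairs pooled
    over persons and items. Both groups must be non-empty.
    """
    passed = [t for u, t in records if all(u)]
    failed = [t for u, t in records if not all(u)]
    return sum(passed) / len(passed), sum(failed) / len(failed)


# The eight pattern probabilities for two components sum to one.
total = sum(pattern_probability(u, t, (0.7, 0.6), 0.84, 0.39)
            for u in product((0, 1), repeat=2) for t in (0, 1))
print(round(total, 10))

toy = [((1, 1), 1), ((1, 1), 1), ((1, 1), 0), ((1, 1), 1),
       ((1, 0), 1), ((0, 0), 0), ((0, 1), 0), ((0, 1), 1)]
print(estimate_s_g(toy))
```

In the toy data, three of the four records with both components passed solve the total task, and two of the four records with a failed component do, so the estimates are 0.75 and 0.5.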






Taking the partial derivative of $\ln L$ with respect to $b_{ik}$ yields the following estimation equation for items:

$$\frac{\partial \ln L}{\partial b_{ik}} = -\sum_j u_{ijk} + \sum_j \frac{\exp(\theta_{jk} - b_{ik})}{1 + \exp(\theta_{jk} - b_{ik})}.$$

Setting the derivatives to zero gives the $I$ estimation equations for item difficulties of the $K$ components. Note that the estimation equations for each component involve only data from the corresponding subtask. Thus, the joint maximum likelihood estimates may be obtained independently from the subtasks. If a CML procedure is applied, the conditions for the existence and uniqueness of the estimates are well established (Fischer, 1983; this volume). Similarly, the person component parameters can be estimated, given the item component difficulties, by maximum likelihood. Program MULTICOMP (Embretson, 1984) may be used to provide the estimates for both MLTM and GLTM.

The original maximum likelihood estimation algorithm for MLTM and GLTM (Embretson, 1984), as described above, is feasible if subtasks can be used to identify the components. However, the algorithm is limited if: (1) subtask data are impractical or (2) metacomponent abilities are involved in task performance. Metacomponents are not readily represented by subtasks, because they involve guiding the problem solving process across items to assure that effective strategies are applied. The population parameter $s$ does represent metacomponent functioning, but it is constant over persons and items.

Recently, Maris (1992) has developed a program for estimating latent item responses which is based on the Dempster et al. (1977) EM algorithm. The EM algorithm is appropriate for all kinds of missing data problems. In the Maris (1992) implementation of the EM algorithm for conjunctive models, the probabilities for components in MLTM, $P(U_{ijk} = 1 \mid \theta_{jk}, b_{ik})$, are treated as missing data. Thus, component parameters can be estimated directly from the total task, without subtask data. Maris (1992) has operationalized both maximum likelihood and modal a posteriori (Bayesian) estimators for conjunctive models, such as MLTM, in his program COLORA (Maris, 1993). Although only a partial proof for finite parameter estimates in conjunctive models is available, simulation data do support the convergence of the estimates. The current version of COLORA, however, is somewhat limited in handling large data sets, since person parameters must be estimated jointly with item parameters. A planned MML estimation procedure will greatly expand data handling capacity.

It should be noted that the Maris (1992) program for conjunctive models, if applied without subtask data, estimates components that are empirically driven to provide fit. The components do not necessarily correspond to any specified theoretical component unless appropriate restrictions are placed on the solution. Although Maris (1992) recommends using subtask data to place the appropriate restrictions, another method is to place certain restrictions on the parameters, in accordance with prior theory. This will be shown below in the Examples section.

Goodness of Fit

Goodness of fit for items in MLTM or GLTM may be tested by either a Pearsonian or likelihood ratio chi-square. The $2^{K+1}$ response patterns for each item, including the total task and $K$ components, can be treated as multinomial responses. To obtain a sufficient number of observations in each group, persons must be grouped by ability estimates, jointly on the basis of $K$ component abilities. The number of intervals for each component depends on the ability distributions and the sample size. In general, for the goodness-of-fit statistic to be distributed as $\chi^2$, no fewer than five observations should be expected for any group. The Pearsonian goodness-of-fit statistic for items can be given as follows:

$$G_P^2 = \sum_g \sum_p \frac{\left(r_{igp} - N_{ig} P(\mathbf{U}_{igp})\right)^2}{N_{ig} P(\mathbf{U}_{igp})}, \qquad (8)$$




where $P(\mathbf{U}_{igp})$ is the probability of the response pattern from Eq. (5), $r_{igp}$ is the number of observations on pattern $p$ in score group $g$ for item $i$, and $N_{ig}$ is the number of observations in score group $g$ on item $i$. Alternatively, the likelihood ratio goodness-of-fit test can be given as follows:

$$G_L^2 = 2 \sum_g \sum_p r_{igp} \ln\left(\frac{r_{igp}}{N_{ig} P(\mathbf{U}_{igp})}\right). \qquad (9)$$
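Both item-fit statistics can be computed from a table of observed pattern counts per score group and the corresponding model probabilities. The following Python sketch is illustrative only: the function name and the toy counts are hypothetical, and the factor of 2 in the likelihood ratio statistic follows the conventional definition.

```python
import math


def item_fit_statistics(observed, model_prob):
    """Pearson and likelihood ratio fit statistics for one item.

    observed[g][p]   : count of response pattern p in score group g
    model_prob[g][p] : model probability of pattern p in score group g
    Returns (pearson, likelihood_ratio).
    """
    pearson = 0.0
    likelihood_ratio = 0.0
    for counts, probs in zip(observed, model_prob):
        n_group = sum(counts)
        for r, p in zip(counts, probs):
            expected = n_group * p
            pearson += (r - expected) ** 2 / expected
            if r > 0:
                # Empty cells contribute nothing to the LR statistic.
                likelihood_ratio += 2.0 * r * math.log(r / expected)
    return pearson, likelihood_ratio


# Two score groups, two response categories (hypothetical counts).
observed = [[30, 20], [10, 40]]
model_prob = [[0.5, 0.5], [0.25, 0.75]]
pearson, lr = item_fit_statistics(observed, model_prob)
print(round(pearson, 3), round(lr, 3))
```

The two statistics are usually close when expected counts are reasonably large, which is one reason for the rule of thumb about minimum expected cell frequencies.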



Bock (1975) notes that small sample simulations show that $G_L^2$ is more powerful than $G_P^2$. If $G$ is the number of groups and $P$ is the number of item categories, the degrees of freedom for Eqs. (8) or (9) are given by the number of cells minus the number of constraints from the expected values. Following Mislevy and Bock (1990) in assuming no constraints for score groups, the degrees of freedom are $G(P - 1)$. These values can be summed over items for a global test of fit.

In the case of GLTM, alternative models of component item difficulty (from the stimulus complexity factors $v_{imk}$) can be compared by a likelihood ratio $\chi^2$, based on the estimation algorithm, if the models are hierarchically nested. If $L_1$ and $L_2$ are the likelihoods of two models, such that $L_1$ is nested within $L_2$, a chi-square goodness-of-fit test can be given as follows:

$$G_{1/2}^2 = -2(\ln L_1 - \ln L_2), \qquad (10)$$

which can be assumed to be asymptotically distributed as $\chi^2$ with degrees of freedom equal to the difference in the number of parameters between the two models.

Since likelihood ratio comparison tests, like Eq. (10), are sensitive to sample size, a fit index is useful to guide interpretations. Fit indices have often been proposed in the context of structural equation modeling (e.g., Bentler and Bonett, 1980). Embretson (1983a) developed a fit index for LLTM and GLTM that utilized comparisons to a null model (in which all items are postulated to be equally difficult) and a saturated model (in which separate difficulties are estimated for each item). Thus, within a component, the null model, $L_0$, would be specified as a constant value for item difficulties, while the saturated model, $L_s$, would be the Rasch model. The model to be evaluated, $L_m$, contains at least one less parameter than the saturated model. The fit index, $\Delta^2$, is given as follows:

$$\Delta^2 = \frac{\ln L_0 - \ln L_m}{\ln L_0 - \ln L_s}. \qquad (11)$$

Roughly, the denominator of Eq. (11) indicates the amount of information that could be modeled, since the saturated model is compared with the null model. The numerator compares the information in the proposed model with the null model. Alternatively, the fit index in Eq. (11) may also be formulated as a ratio of the goodness-of-fit statistics $G^2$ from various pairs of models, as follows:

$$\Delta^2 = \frac{G_{0/s}^2 - G_{m/s}^2}{G_{0/s}^2}.$$

Embretson (1983a) finds that $\Delta^2$ is very close in magnitude to the squared product moment correlation of the Rasch model item difficulties with the estimated value for item difficulty from the complexity factor model [obtained by evaluating the logistic function for items in Eq. (4)].

Examples

Component Processes in Verbal Analogies

Historically, verbal analogies have often been considered the best item type for measuring "g" (Spearman, 1927). More recently, they have been studied intensively by cognitive component analysis (e.g., Sternberg, 1977). Understanding the processing components that are involved in solving verbal analogies contributes to the construct representation aspect of construct validity [see Embretson (1983b)]. Whitely and Schneider (1980) present several studies on the components of verbal analogy items. In these studies, both the rule construction and the response evaluation subtasks are used to identify components. Additionally, however, subtasks are also used to identify alternative processing strategies. For example, the success of an associative strategy is measured by presenting the third term (Fire: ) and the response alternatives. The person attempts to find the alternative that has the highest associative relationship. In the analogy above, this strategy would not be very successful since two distractors (Matches, Heat) have strong associative relationships to Fire. Furthermore, a working backwards strategy has been measured with a subtask for rule construction after the person has viewed the response alternatives. MLTM and its variants, such as Eq. (5), can be compared by examining their fit, using the chi-square tests described above, plots of observed and predicted values, and the fit of the subtasks to the Rasch model (implicit in MLTM and GLTM). In one study, Whitely and Schneider (1980) fit the original MLTM (Whitely, 1980), which is Eq. (3) without the $s$ or $g$ parameters. They found that fit was substantially increased by adding a guessing parameter to the model, while adding the working backwards strategy resulted in only some improvement in fit. However, this study preceded the development of the full MLTM and GLTM models that contain both the $s$ and $g$ parameters.



Table 1 shows the probabilities for the various response patterns in a reanalysis of the Whitely and Schneider (1980) data, which contained 35 items and 104 subjects. Response patterns that are inconsistent in the original Whitely (1980) model are quite frequent. That is, the item is often solved when the components are failed, and the item is sometimes failed when the components are solved. Thus, both $g$ and $s$, to represent guessing and strategy application, respectively, are appropriate for the data. MLTM, from Eq. (3), was fit to the data. The $s$ parameter was found to be 0.84, while the $g$ parameter was 0.39. The latter is higher than the expected 0.25 from random guessing in the multiple choice items. However, the two components were relatively independent sources of item difficulty as indicated by a moderately low correlation of 0.44. Model fit was examined by a Pearson chi-square, as described above. Since the sample size was small, the data were collapsed into two response patterns, on the basis of the total item outcome. Subjects were categorized into four groups, by splitting jointly on the two components. $G_P^2$ was computed for each item and then summed over items. The observed value of $G_P^2$ was 200.64, which with 140 degrees of freedom is significant at the 0.05 level. Although the model failed to fit, useful predictions were made by the model. The total item probabilities correlated 0.70 with the MLTM predictions. Furthermore, the person's accuracy on the total item correlated 0.73 with MLTM predictions. It should be noted that five items failed to fit the model. By eliminating these items and reanalyzing the data, $G_P^2$ was 107.81, which did not exceed the critical value of $\chi^2_{120}$ at the 0.05 level. Thus, the poor overall fit was due only to a few items. GLTM was estimated since stimulus complexity factor scores were available for 18 of the 35 analogy items. The available scores represent aspects of rule construction.
Embretson and Schneider (1989) find substantial evidence for a contextualized inference process, in which inferences are modified by the third analogy term, but not by the response alternatives. Table 2 shows the GLTM parameter estimates for the stimulus complexity factor estimates for the rule construction component. It can be seen that difficult items are more likely to involve new inferences on the addition of the full stem context (i.e., the third term). Easy items, in contrast, are more likely to have many inferences available initially and to involve the selection of an inference from those initially available. The loglikelihood of the GLTM model, with these stimulus complexity factors, was -1065.91. The loglikelihoods of the null model (constant values for the first component only) and the saturated model were -1375.99 and -1003.13, respectively. Thus, compared to the null model, the model with five stimulus complexity factors fit significantly better ($\chi^2_5$ = 620.16, p < 0.001), but did not fit as well

TABLE 2. Stimulus Complexity Factor Weights for Rule Construction Component.

Variable                                         Weight      Standard Error
Number of initial inferences                     -0.871**    0.273
Probability correct inference in stem context    -1.298**    0.376
New inferences in stem context                    4.708**    1.084
Select inferences in stem context                -6.435**    1.785
Modify inferences in stem context                -2.069      1.369

Note to Table 2: ** means p < 0.01.

as the saturated Rasch model ($\chi^2_{12}$ = 62.78, p < 0.001). However, the fit index $\Delta^2$ of 0.83 was relatively high, showing the usefulness of the stimulus complexity factor model in explaining rule construction difficulty. The various results on model fitting indicate that the rule-oriented strategy accounts fairly well for solving verbal analogies, thus supporting the task as measuring inductive reasoning. However, some items involve different strategies, such as association (Embretson, 1985). Eliminating such items will improve the construct representation of the test. The component person parameters from verbal analogies have also been used in several studies on nomothetic span (e.g., Embretson et al., 1986). These will not be illustrated here because the next empirical application focuses on nomothetic span.
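The comparison with the null model and the fit index can be reproduced from the three loglikelihoods reported above. The Python sketch below is a minimal illustration; the function names are hypothetical.

```python
def nested_chi_square(lnl_restricted, lnl_general):
    """Likelihood ratio chi-square for two nested models."""
    return -2.0 * (lnl_restricted - lnl_general)


def fit_index(lnl_null, lnl_model, lnl_saturated):
    """Fit index: share of the null-to-saturated gap closed by the model."""
    return (lnl_null - lnl_model) / (lnl_null - lnl_saturated)


# Loglikelihoods reported for the rule construction component.
lnl_null, lnl_model, lnl_saturated = -1375.99, -1065.91, -1003.13

print(round(nested_chi_square(lnl_null, lnl_model), 2))        # 620.16
print(round(fit_index(lnl_null, lnl_model, lnl_saturated), 2))  # 0.83
```

Because the index is a ratio of loglikelihood differences, it does not grow with sample size the way the chi-square statistic does, which is why it is useful alongside the significance tests.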

Working Memory versus General Control Processes

Another empirical application (Embretson, 1995b) uses the Maris (1992) EM algorithm to distinguish between two competing explanations of fluid intelligence. Many theories (Brown, 1978; Sternberg, 1985) suggest that general control processes, or metacomponents, are the most important source of individual differences. These control processes are used by the examinee to select, maintain and evaluate a strategy for item solving. Alternatively, other theories emphasize the role of working memory capacity in inference processes. For example, both Carpenter et al. (1990) and Kyllonen and Christal (1990) suggest that working memory capacity is the primary basis of individual differences in fluid intelligence. Working memory can be distinguished from general control processing on a measuring task for general intelligence by GLTM if some constraints are set to identify the hypothesized components. A bank of progressive matrices items, which are well established measures of fluid intelligence, had been specially constructed to represent the Carpenter et al. (1990) theory. Thus, the memory load of each item, according to the theory, was known. Several forms of the Abstract Reasoning Test were administered to 577 Air Force recruits. Additionally, subtest scores from the Armed Services Vocational


Aptitude Battery (ASVAB) were available to examine the impact of the two components on nomothetic span. To distinguish general control processing from working memory capacity, a model with two latent (i.e., covert) response variables was postulated. The first latent response variable, $U_{ij1}$, is strategy application. This variable reflects the implementation of the rule strategy in task processing. The second latent response variable, $U_{ij2}$, is the success of the inference strategy. The following model may be postulated for the total task, $U_{ijT}$:

$$P(U_{ijT} = 1) = P(U_{ij1})\,P(U_{ij2}). \qquad (13)$$

FIGURE 2. Scatterplot of total task probabilities by predicted values from MLTM model for items. [Axes: Total Task Probabilities, MLTM Probabilities.]



FIGURE 3. Scatterplot of total task probabilities by predicted values from MLTM model for persons. [Axes: Total Task Probabilities, MLTM Probabilities.]

FIGURE 4. Scatterplot of two component item difficulties. [Axes: Component 1 Item Difficulty, Component 2 Item Difficulty.]

FIGURE 5. Structural equation model of working memory and general control processes and reference tests.

To identify working memory capacity in Eq. (13), indices of memory load are required for each item. No item information is required for general control processes since items are assumed to have a constant value. Thus, the full model, with the person and item parameters that govern the latent response variables, is the following:

$$P(U_{ijT} = 1) = \frac{\exp(\theta_{j1} - \alpha_1)}{1 + \exp(\theta_{j1} - \alpha_1)} \cdot \frac{\exp(\theta_{j2} - \beta_2 v_{i2} - \alpha_2)}{1 + \exp(\theta_{j2} - \beta_2 v_{i2} - \alpha_2)}, \qquad (14)$$

where $\theta_{j1}$ is the ability for person $j$ on general control processes, $\alpha_1$ is an intercept which expresses the (constant) difficulty of items on strategy application, $\theta_{j2}$ is working memory capacity for person $j$, $v_{i2}$ is the score for item $i$ on memory load, $\beta_2$ is the weight of memory load in rule inference difficulty, and $\alpha_2$ is an intercept for rule inference difficulty. The parameters for the model in Eq. (14) were estimated with the COLORA program (Maris, 1993). A Pearsonian chi-square test indicated that the data fit the model (p > 0.05). Furthermore, the average prediction residuals were small (0.06), for predicting total task probabilities from the model. In summary, the MLTM model with latent response variables separated the contribution of two postulated components to abstract reasoning. The results suggested that although both components had a role in construct validity, general control processes are a more important source of individual differences than working memory capacity.
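The two-component model with latent response variables can be sketched as a product of two logistic terms, one with constant item difficulty and one whose difficulty depends on memory load. The Python code below is an illustration under that reading of Eq. (14); the function and argument names are hypothetical, and it is not the COLORA estimation procedure.

```python
import math


def logistic(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def total_task_probability(theta_control, theta_memory, memory_load,
                           alpha1, beta2, alpha2):
    """Probability of solving an item as a product of two latent responses."""
    # Strategy application: item difficulty is the constant alpha1.
    p_strategy = logistic(theta_control - alpha1)
    # Rule inference: difficulty grows linearly with the memory load score.
    p_inference = logistic(theta_memory - beta2 * memory_load - alpha2)
    return p_strategy * p_inference


# Same person, items with increasing memory load (hypothetical values).
for load in (0.0, 1.0, 2.0):
    print(load, round(total_task_probability(0.5, 0.5, load, 0.0, 0.8, 0.0), 3))
```

With a positive weight for memory load, the total task probability falls as load increases, while the strategy application term stays the same for every item.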

Discussion

MLTM and GLTM have been applied in construct validation studies on many item types, including verbal analogies and classifications (Embretson, 1985; Embretson et al., 1986), mathematical problems (Bertrand et al., 1993), progressive matrices (Embretson, 1993), vocabulary items (Janssen et al., 1991) and spatial tasks (McCollam and Embretson, 1993). MLTM and GLTM have special advantages in contemporary construct validation studies, as illustrated in the two examples above, because they provide a complete set of parameters to represent both person and task differences in processing. The cognitive processing paradigm in aptitude research permits two aspects of construct validity, construct representation and nomothetic span (Embretson, 1983b), to be separated. Construct representation research concerns what is measured by identifying the underlying processes, knowledge structures and strategies that are involved in task performance. Mathematical modeling is a primary technique in such research. Nomothetic span research concerns the utility of tests, through correlations with other measures.

Another application is test design to measure specified cognitive components. If only one component is to be measured, the item difficulties on the other components must be well below the level of the target population. Ideally, the other component probabilities should approach 1.0. This can be accomplished in two ways. First, items with low $P_k$'s on the other components can be eliminated during item selection. Second, and perhaps more efficient, the stimulus complexity factors in GLTM can be used to guide item writing. That is, items can be constructed to vary on those features that predict the $b_{ik}$ on the target component but not on the other component. Thus, inappropriate items can be readily eliminated before item tryout.

Furthermore, MLTM and GLTM are useful in studies on individual differences. For example, sex differences are often found in mathematical problem solving. Bertrand et al. (1993) measured two underlying components of mathematical problem solving so as to pinpoint the source of sex differences, if any. Although no sex differences were observed on the computational component (involving strategic and algorithmic knowledge to solve systems of equations), boys did show higher abilities on the mathematization component (involving factual, linguistic and schematic knowledge to set up the equations). Bertrand et al.
(1993) also examined the impact of lifestyle differences between students (number of hours spent watching television, reading, and so on) on the two components and on the total task. It should be noted that research applications of MLTM and GLTM are especially active now, due to advances in the algorithm and the development of new computerized tests. Thus, continued activity can be expected both in aspects of the estimation algorithm and in applications.

Note to References: Susan E. Embretson has also published as Susan E. Whitely.

References

Andersen, E.B. (1980). Discrete Statistical Models with Social Science Applications. Amsterdam: North Holland.
Bentler, P.M. and Bonett, D.G. (1980). Significance tests and goodness of fit in the analysis of covariance structures. Psychological Bulletin 88, 588-606.
Bertrand, R., Dupuis, F.A., and Garneau, M. (1993). Effets des caractéristiques des items sur le rôle des composantes impliquées dans la performance en résolution de problèmes mathématiques écrits: une étude de validité de construit. Rapport Final Présenté au Fonds pour la Formation de Chercheurs et l'Aide à la Recherche. Université Laval, Sainte-Foy, Québec, Canada: Mars.
Bock, R.D. (1975). Multivariate Statistical Methods in Behavioral Research. New York, NY: McGraw-Hill.
Brown, A.L. (1978). Knowing when, where, and how to remember. In R. Glaser (Ed), Advances in Instructional Psychology (Vol. 1). Hillsdale, NJ: Erlbaum.
Carpenter, P.A., Just, M.A., and Shell, P. (1990). What one intelligence test measures: A theoretical account of processing in the Raven's Progressive Matrices Test. Psychological Review 97.
Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39, 1-38.
Embretson, S.E. (1983a). An incremental fit index for the linear logistic latent trait model. Paper presented at the Annual Meeting of the Psychometric Society, Los Angeles, CA.
Embretson, S.E. (1983b). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin 93, 179-197.
Embretson, S.E. (1984). A general multicomponent latent trait model for response processes. Psychometrika 49, 175-186.
Embretson, S.E. (1985). Multicomponent latent trait models for test design. In S. Embretson (Ed), Test Design: Developments in Psychology and Psychometrics (pp. 195-218). New York, NY: Academic Press.
Embretson, S.E. (1994). Applications of cognitive design systems to test development. In C. Reynolds (Ed), Advances in Cognitive Assessment: An Interdisciplinary Perspective (pp. 107-135). Plenum.
Embretson, S.E. (1995a). A measurement model for linking individual learning to processes and knowledge: Application to mathematical reasoning. Journal of Educational Measurement 32, 277-294.
Embretson, S.E. (1995b). Working memory capacity versus general control processes in abstract reasoning. Intelligence 20, 169-189.
Embretson, S.E. and Schneider, L.M. (1989). Cognitive models of analogical reasoning for psychometric tasks. Learning and Individual Differences 1, 155-178.
Embretson, S.E., Schneider, L.M., and Roth, D.L. (1986). Multiple processing strategies and the construct validity of verbal reasoning tests. Journal of Educational Measurement 23, 13-32.
Fischer, G.H. (1983). Logistic latent trait models with linear constraints. Psychometrika 48, 3-26.
Janssen, R., Hoskens, M., and DeBoeck, P. (1991). A test of Embretson's multicomponent model on vocabulary items. In R. Steyer and K. Widaman (Eds), Psychometric Methodology (pp. 187-190). Stuttgart, Germany: Springer-Verlag.
Kyllonen, P. and Christal, R. (1990). Reasoning ability is (little more than) working memory capacity? Intelligence 14, 389-434.
Maris, E.M. (1992). Psychometric models for psychological processes and structures. Invited paper presented at the European Meeting of the Psychometric Society. Barcelona, Spain: Pompeu Fabra University.
Mayer, R., Larkin, J., and Kadane, P. (1984). A cognitive analysis of mathematical problem solving. In R. Sternberg (Ed), Advances in the Psychology of Human Intelligence, Vol. 2 (pp. 231-273). Hillsdale, NJ: Erlbaum Publishers.
McCollam, K. and Embretson, S.E. (1993). Components or strategies? The basis of individual differences in spatial processing. Unpublished paper. Lawrence, Kansas: University of Kansas.
Mislevy, R.J. and Bock, R.D. (1990). BILOG 3: Item Analysis and Test Scoring with Binary Logistic Models. Mooresville, IN: Scientific Software.
Mulholland, T., Pellegrino, J.W., and Glaser, R. (1980). Components of geometric analogy solution. Cognitive Psychology 12, 252-284.
Pellegrino, J.W., Mumaw, R., and Shute, V. (1985). Analyses of spatial aptitude and expertise. In S. Embretson (Ed), Test Design: Developments in Psychology and Psychometrics (pp. 45-76). New York, NY: Academic Press.
Spearman, C. (1927). The Abilities of Man. New York: Macmillan.
Sternberg, R.J. (1977). Intelligence, Information Processing and Analogical Reasoning: The Componential Analysis of Human Abilities. Hillsdale, NJ: Erlbaum.
Whitely, S.E. (1980). Multicomponent latent trait models for ability tests. Psychometrika 45, 479-494.
Whitely, S.E. (1981). Measuring aptitude processes with multicomponent latent trait models. Journal of Educational Measurement 18, 67-84.
Whitely, S.E. and Schneider, L.M. (1980). Process Outcome Models for Verbal Aptitude. (Technical Report NIE-80-1). Lawrence, Kansas: University of Kansas.

19 Multidimensional Linear Logistic Models for Change

Gerhard H. Fischer and Elisabeth Seliger¹

Introduction

The chapter presents a family of multidimensional logistic models for change, which are based on the Rasch model (RM) and on the linear logistic test model (LLTM; see Fischer, this volume), but unlike these models do not require unidimensionality of the items. As will be seen, to abandon the unidimensionality requirement becomes possible under the assumption that the same items are presented to the testees on two or more occasions. This relaxation of the usual unidimensionality axiom of IRT is of great advantage especially in typical research problems of educational, applied, or clinical psychology, where items or symptoms often are heterogeneous. [See Stout (1987, 1990) for a quite different approach to weakening the strict unidimensionality assumption.]

Consider, for example, the problem of monitoring cognitive growth in children: A set of items appropriate for assessing intellectual development will necessarily contain items that address a number of different intelligence factors. If we knew what factors there are, and which of the items measure what factor, we might construct several unidimensional scales. This is unrealistic, however, because the factor structures in males and females, above- and below-average children, etc., generally differ, so that there is little hope of arriving at sufficiently unidimensional scales applicable to all children. Therefore, a model of change that makes no assumption about the latent dimensionality of the items is a very valuable tool for applied research.

The family of models to be treated in this chapter comprises models for two or more time points and for both dichotomous and polytomous items. The common element in all of them is that repeated measurement designs

¹ This research was supported in part by the Fonds zur Förderung der Wissenschaftlichen Forschung, Vienna, under Grant No. P19118-HIS.


G.H. Fischer and E. Seliger

19. Multidimensional Linear Logistic Models for Change

are posited and that testees are characterized by parameter vectors $\boldsymbol{\theta}_j = (\theta_{1j}, \ldots, \theta_{nj})$ rather than by scalar person parameters $\theta_j$; the components of $\boldsymbol{\theta}_j$ are associated with the items $I_1, \ldots, I_n$, and thus a separate latent dimension is assigned to each and every item in the test. It should be stressed from the beginning, however, that no assumptions are made about the mutual dependence or independence of these latent dimensions, which means that they can be conceived, for instance, as functions of a smaller number of ability factors, or even of a single factor. The models are therefore very flexible and applicable to various areas of research.

Fischer (this volume) shows how the (unidimensional dichotomous) LLTM can be used for measuring change: The same item $I_i$ presented on different occasions is considered as a set of "virtual items" $I^*_{it}$, $t = 1, 2, \ldots, z$, characterized by "virtual item parameters" $\beta^*_{it}$. These $\beta^*_{it}$ are then decomposed, e.g., as

$$\beta^*_{it} = \beta_i + \sum_l q_{jlt}\,\eta_l, \qquad (1)$$

where $\beta_i$ is the difficulty parameter of item $I_i$ in a Rasch model, $q_{jlt}$ is the dosage of treatment $B_l$ given to persons $S_j$ up to time point $T_t$ (with $q_{jl1} = 0$ for all $S_j$ and all $B_l$), and $\eta_l$ is the effect of one unit of treatment $B_l$. (The latter can be educational treatments, trainings, instruction, clinical therapies, psychotherapies, or any experimental conditions that influence the response behavior of the testees.) The probabilistic item response model, as in any LLTM, is

$$\operatorname{logit} P(+ \mid S_j, I_i, T_t) = \theta_j - \beta^*_{it}, \qquad (2)$$


that is, the difference between the person parameter and the (virtual) item parameter is the argument in a logistic IRF. We shall now relax the unidimensionality of the $\theta$-dimension by assuming that each person $S_j$ may have $n$ different abilities $\theta_{ij}$, $i = 1, \ldots, n$, corresponding to the items $I_1, \ldots, I_n$ of the test. For time point $T_1$, $q_{jl1} = 0$ implies $\beta^*_{i1} = \beta_i$, so that the $\beta^*_{i1}$ equal the difficulty parameters of the items. Inserting $\theta_{ij}$ instead of $\theta_j$ in Eq. (2), we get for $T_1$ that

$$\operatorname{logit} P(+ \mid S_j, I_i, T_1) = \theta_{ij} - \beta_i, \qquad (3)$$

which obviously is overparameterized: The item parameters $\beta_i$ can immediately be absorbed into the $\theta_{ij}$, giving new person parameters $\tilde{\theta}_{ij} = \theta_{ij} - \beta_i$; for simplicity, however, we shall drop the tildes and write

$$\operatorname{logit} P(+ \mid S_j, I_i, T_1) = \theta_{ij}, \qquad (4)$$

which is equivalent to setting $\beta_i = 0$ for all items $I_i$ for normalization. Note that this is admissible because there are $n$ such normalization conditions for the $n$ independent latent scales. Similarly, for time point $T_2$ we get

$$\operatorname{logit} P(+ \mid S_j, I_i, T_2) = \theta_{ij} - \sum_l q_{jl2}\,\eta_l. \qquad (5)$$



The model defined by Eqs. (4) and (5) has been denoted the "linear logistic model with relaxed assumptions" [LLRA; Fischer (1977b)], which expresses that the assumption of the unidimensionality of the items has been abandoned.

Presentation of the Models

The LLRA for Two Time Points

Equations (4) and (5) are equivalent with

$$P(+ \mid S_j, I_i, T_1) = \frac{\exp(\theta_{ij})}{1 + \exp(\theta_{ij})}, \qquad (6)$$

$$P(+ \mid S_j, I_i, T_2) = \frac{\exp(\theta_{ij} - \delta_j)}{1 + \exp(\theta_{ij} - \delta_j)}, \qquad (7)$$

$$\delta_j = \sum_l q_{jl}\,\eta_l, \qquad (8)$$

where $q_{jl}$ is used for simplicity as an abbreviation of $q_{jl2}$ if only two time points, $T_1$ and $T_2$, are considered. This is the form in which the LLRA usually appeared in the literature (Fischer, 1976, 1977a, 1977b, 1983a, 1983b, 1995b; Fischer and Formann, 1982). For convenience, we summarize the notation as follows:

$\theta_{ij}$ is person $S_j$'s position on latent dimension $D_i$ measured by item $I_i$ at time point $T_1$,
$\delta_j$ is the amount of change in $S_j$ between $T_1$ and $T_2$,
$q_{jl}$ is the dosage of treatment $B_l$ given to $S_j$ between $T_1$ and $T_2$,
$\eta_l$ is the effect of one unit of treatment $B_l$, $l = 1, \ldots, m$.

Equation (8) is often specified more explicitly as

$$\delta_j = \sum_l q_{jl}\,\eta_l + \sum_{l<k} q_{jl}\,q_{jk}\,\rho_{lk} + \tau, \qquad (9)$$

where (in addition to the "main effects" $\eta_l$) $\rho_{lk}$ are interaction effects between treatments, $B_l \times B_k$, and $\tau$ is a "trend effect," which comprises all causes of change that are unrelated to the treatments (e.g., general maturation of children and/or effects of repeated use of the same items).



Note that, upon appropriate redefinition of the $q_{jl}$, Eq. (9) always can be written in the simpler form of Eq. (8), because Eq. (9) is linear in all effect parameters $\eta_l$, $\rho_{lk}$, and $\tau$. Equation (8) is therefore the more general decomposition of the change parameters $\delta_j$, and it is more convenient for mathematical manipulations. The LLRA for measuring change might at first sight seem rather arbitrary, because no justification for the choice of the logistic IRF and for the additivity of the effect parameters was given. Both can be derived, however, from the postulate that "specifically objective" comparisons of treatments via their effect parameters $\eta_l$ be feasible, where specifically objective is synonymous with "generalizable over subjects with arbitrary parameters $\theta_j$." [Some technical assumptions are also required, though, that cannot be discussed here. The model derivable from these assumptions essentially is of the form of Eqs. (6) and (7); however, the argument of the logistic IRF in Eqs. (6) and (7) is multiplied by an unspecified (discrimination) parameter $a > 0$, which here has been equated to $a = 1$.] Hence, if a researcher is inclined to consider the person parameters $\theta_j$ as incidental nuisance parameters and desires to free his results about the treatments of interest of uncontrollable influences of these $\theta_j$, then the LLRA is the model of choice. For a formal treatment of this justification of the LLRA, the interested reader is referred to Fischer (1987a); see also Fischer (1995b).

Environmental Effects

The LLRA describes changes of reaction probabilities between two occasions, $T_1$ and $T_2$, and hence makes use of pairs of related observations, $u_{ij1}$ and $u_{ij2}$. It is obvious that this model can also be employed, for instance, for describing environmental effects on the probabilities of solving items, based on the test data of pairs of twins reared apart (Fischer, 1987b, 1993; Fischer and Formann, 1981). The LLRA therefore is an interesting alternative to the common heritability approach to the nature-nurture discussion. Let $\theta_{ij}$ be a parameter describing, for one pair of monozygotic twins (denoted $M_j$), the genotypic part of that ability that is required for solving item $I_i$, and let $\eta_l$ be a generalized effect of environment $E_l$ (or of an environmental factor $E_l$) on all latent abilities measured by a set of items, together representing "intelligence." Then the LLRA for measuring environmental effects on intelligence assumes that

$$P(+ \mid M_j, I_i, T_{j1}) = \frac{\exp\left(\theta_{ij} + \sum_l q_{jl1}\eta_l\right)}{1 + \exp\left(\theta_{ij} + \sum_l q_{jl1}\eta_l\right)} = \frac{\exp(H_{ij})}{1 + \exp(H_{ij})}, \tag{10}$$

where $P(+ \mid M_j, I_i, T_{j1})$ is the probability that twin $T_{j1}$ of pair $M_j$ solves item $I_i$, given that $T_{j1}$ has received the dosages $q_{j11}, \ldots, q_{jm1}$ of environments (or environmental factors) $E_l$, and $H_{ij}$ is the argument of the exponential function in Eq. (10). Similarly,

$$P(+ \mid M_j, I_i, T_{j2}) = \frac{\exp\left[\theta_{ij} + \sum_l q_{jl2}\eta_l\right]}{1 + \exp\left[\theta_{ij} + \sum_l q_{jl2}\eta_l\right]} = \frac{\exp\left[H_{ij} + \sum_l (q_{jl2} - q_{jl1})\eta_l\right]}{1 + \exp\left[H_{ij} + \sum_l (q_{jl2} - q_{jl1})\eta_l\right]} \tag{11}$$

is the probability that twin $T_{j2}$ of pair $M_j$ solves item $I_i$, where $T_{j2}$ has received dosages $q_{j12}, \ldots, q_{jm2}$ of environments (or environmental factors) $E_l$. Comparing Eqs. (10) and (11) with (6) and (7), it is seen that, if the model is interpreted as an LLRA, in a purely formal sense the parameters $H_{ij} = \theta_{ij} + \sum_l q_{jl1}\eta_l$ play the role of "person parameters," and the differences $q_{jl2} - q_{jl1}$ that of the treatment "dosages." [Note that it is completely irrelevant which of the two twins of $M_j$ is denoted $T_{j1}$ and which $T_{j2}$, because interchanging them automatically changes the sign in all differences $q_{jl2} - q_{jl1}$, $l = 1, \ldots, m$, and also adapts the definition of $H_{ij}$ accordingly. Alternatively, the model can be understood as an LBTL for paired comparisons of different combinations of environmental conditions, see Fischer and Tanzer (1994).]

That the differences $q_{jl2} - q_{jl1}$ are now interpreted as dosages in an LLRA has an important consequence: If the sums $\sum_l (q_{jl2} - q_{jl1})$ are constant across pairs of twins, for instance, if each twin has grown up in just one environment, so that $\sum_l (q_{jl2} - q_{jl1}) = 0$ for all $M_j$, the $\eta_l$ are no longer measured on a ratio scale, but only on an interval scale [see Fischer (1993)]. Therefore, one environment (or environmental condition) $E_a$ has to be taken as the reference point for the measurement of environmental effects, e.g., $\eta_a = 0$, and all other $\eta_l$ are calibrated with respect to $\eta_a$. This is a logical, even a necessary, consequence, because both genotypic ability and environmental effects on ability can never be determined relative to an absolute zero point. We cannot observe the genetic component of an ability without an environmental influence, and no environmental effect without the presence of a genetic component. Hence, the measurement of environmental effects must be carried out relative to some standard environment.
The present LLRA for environmental effects has many more interesting properties which, however, cannot be further discussed here. The interested reader is referred to Fischer (1993, 1995b). We only mention that the uniqueness conditions for the CML equations of this model imply that even small samples (as are common for data of twins reared apart) suffice for at least a rough estimate of the effects of interest and, especially, for testing hypotheses about such effects. These tests are essentially those described below in the section on hypothesis testing.
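The formal identification of the twin model with an LLRA can be checked numerically. The following Python sketch is only an illustration under the assumption that each twin's logit is $\theta_{ij}$ plus the dosage-weighted sum of environmental effects; the function names and all numbers are hypothetical.

```python
import math

def twin_probabilities(theta_ij, q_twin1, q_twin2, eta):
    """Solution probabilities of both twins of a pair for one item.
    H plays the role of the person parameter, and the dosage differences
    q_twin2 - q_twin1 play the role of treatment dosages."""
    h = theta_ij + sum(q * e for q, e in zip(q_twin1, eta))
    shift = sum((b - a) * e for a, b, e in zip(q_twin1, q_twin2, eta))
    logistic = lambda x: 1.0 / (1.0 + math.exp(-x))
    return logistic(h), logistic(h + shift)

def logit(p):
    return math.log(p / (1.0 - p))

# Hypothetical case: two environments, E_1 taken as reference (eta_1 = 0),
# twin 1 reared in E_1 and twin 2 in E_2.
eta = [0.0, 0.8]
for theta in (-1.0, 2.0):
    p1, p2 = twin_probabilities(theta, [1, 0], [0, 1], eta)
    # The logit difference depends only on the effects, not on theta.
    print(round(logit(p2) - logit(p1), 6))
```

In this single-environment-per-twin design the dosage differences sum to zero, so only the difference between the two effects is identified, which is why one environment has to be fixed as a reference point.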



The LLRA for Several Time Points

The advantageous properties of the LLRA (see the sections on estimation and hypothesis testing below) encouraged the generalization of this model to designs with any number of time points (Fischer, 1989). Assume that the same items are presented repeatedly to the persons at time points $T_1, \ldots, T_t, \ldots, T_z$, while treatments are given in the time intervals; and let the respective item score matrices be denoted by $U_t$, with elements $u_{ijt}$. Assuming that the item set is heterogeneous (multidimensional), as in the LLRA for two time points, individual differences are described again in terms of an $n$-vector of person parameters $\theta_j = (\theta_{1j}, \ldots, \theta_{nj})$ per person $S_j$. Change of the persons is modeled via the item parameters, that is, to item $I_i$ there corresponds a sequence of "virtual items," $I^*_{it}$, with item parameters depending on the treatments given to the person up to time point $T_t$. Assuming that the persons are grouped by same treatment combinations and same dosages, the virtual item parameters may be written $\beta^*_{igt}$, for $i = 1, \ldots, n$ items, $g = 1, \ldots, w$ treatment groups $G_g$, and $t = 1, \ldots, z$ time points. Now let these item parameters be decomposed, for instance, as

$$\beta^*_{igt} = \beta_i - \sum_l q_{glt}\,\eta_l - (T_t - T_1)\,\tau, \tag{12}$$

where the $\beta_i$ are item difficulty parameters, $q_{glt}$ is the dosage of treatment $B_l$ given to persons of group $G_g$ up to time point $T_t$, $\eta_l$ is the effect of one unit of treatment $B_l$, and $\tau$ is the trend. Inserting the $\theta_{ij}$ and $\beta^*_{igt}$ in the logistic IRF, absorbing the $\beta_i$ in the $\theta$-parameters as before, denoting the new person parameters again $\theta_{ij}$ for simplicity, and writing $\sum_l q_{glt}\eta_l$ for the sum of effects for short, we obtain as generalization of Eq. (5)

$$\operatorname{logit} P(+ \mid S_j, I_i, T_t) = \theta_{ij} + \sum_l q_{glt}\,\eta_l; \tag{13}$$

equivalently, we may write the probability of a positive response as

$$P(+ \mid S_j, I_i, T_t) = \frac{\exp\left\{\theta_{ij} + \sum_l q_{glt}\eta_l\right\}}{1 + \exp\left\{\theta_{ij} + \sum_l q_{glt}\eta_l\right\}}, \tag{14}$$

for $i = 1, \ldots, n$ items, $j = 1, \ldots, N$ persons, and $t = 1, \ldots, z$ time points. Formally, Eq. (14) can be interpreted as an LLTM, where there are $Nn$ "virtual persons" $S^*_{ij}$ with person parameters $\theta_{ij}$, and $w(z - 1) + 1$ "virtual items" $I^*_{gt}$ with parameters $\beta^*_{gt} = -\sum_l q_{glt}\eta_l$. (Remember that $q_{gl1} = 0$ for all $g$ and $l$, so that $\beta^*_{g1} = 0$ for all groups $G_g$.)

Parameter Estimation

LLRA for Two Time Points

It is an easy matter to show that the LLRA for two time points is a special case of an LLTM with a structurally incomplete design and $r_{ij} = 2$ items per person. [Alternatively, the LLRA can be seen as a "Linear Bradley-Terry-Luce" model, LBTL, for the comparison of virtual items, see Fischer and Tanzer (1994).] The CML estimation equations [see Fischer, this volume, Eq. (7)] can therefore be used for estimating the effect parameters. Nevertheless, it pays to look at the conditional likelihood function of the LLRA separately, because the present special case of only two time points leads to a considerable simplification of the estimation equations. Let the item score matrices for time points $T_1$ and $T_2$ be $U_1 = ((u_{ij1}))$ and $U_2 = ((u_{ij2}))$, respectively. The likelihood function, conditional on all marginal sums $u_{ij+} = u_{ij1} + u_{ij2}$, is

$$L = P(U_1, U_2 \mid U_1 + U_2) = \prod_j \prod_i \left[\frac{\exp(u_{ij2}\,\delta_j)}{1 + \exp(\delta_j)}\right]^{(u_{ij1} - u_{ij2})^2}, \tag{15}$$

where $\delta_j$ is subject to Eq. (8). This is easily recognized as the likelihood function of a logit model (Cox, 1970) for a subset of the data; the term $(u_{ij1} - u_{ij2})^2$ acts as a filter eliminating all pairs of responses with $u_{ij1} = u_{ij2}$, which are uninformative with respect to measuring change (measuring treatment effects). Taking logarithms of the likelihood in Eq. (15) and differentiating with respect to $\eta_l$ yields the CML estimation equations,

$$\sum_j q_{jl} \sum_i (u_{ij1} - u_{ij2})^2 \left[u_{ij2} - \frac{\exp(\delta_j)}{1 + \exp(\delta_j)}\right] = 0, \tag{16}$$

for $l = 1, \ldots, m$. Equations (16) are unconditional estimation equations of a logit model (for a subset of the data as mentioned above); they are very easy to solve (much easier than those of the more general LLTM, see Fischer, this volume) using the well-known Newton-Raphson method. The latter requires the second-order partial derivatives,

$$\frac{\partial^2 \ln L}{\partial \eta_l\, \partial \eta_k} = -\sum_j q_{jl}\,q_{jk} \sum_i (u_{ij1} - u_{ij2})^2\, \frac{\exp(\delta_j)}{[1 + \exp(\delta_j)]^2}, \tag{17}$$

for $l, k = 1, \ldots, m$. These derivatives are the negative elements of the information matrix $I$. Equations (15) through (17) were given in different, but equivalent, form in some earlier publications (Fischer, 1976, 1977b, 1983b). The estimation Eqs. (16) illustrate the central property of the LLRA: They contain only



the parameters $\delta_j$ (weighted sums of the effect parameters $\eta_l$), but are independent of the person parameters $\theta_{ij}$, which have been eliminated via the conditions $u_{ij1} + u_{ij2} = u_{ij+}$. Hence, whatever the endowment of the subjects in terms of person parameters $\theta_j = (\theta_{1j}, \ldots, \theta_{nj})$ (for instance, whatever the intellectual abilities of the children or the proneness of the patients to particular patterns of symptoms), the $\eta_l$ will be estimated consistently under mild regularity conditions (cf. Fischer, 1983, 1995b). Two questions regarding the CML approach in the LLRA are of practical interest: (i) What are the minimum conditions for Eqs. (16) to have a unique solution $\hat{\eta}$? (ii) What are the asymptotic properties of $\hat{\eta}$ under regularity conditions? Both questions are easily and satisfactorily answered by the following results:

(i) There exists a unique solution $\hat{\eta} = (\hat{\eta}_1, \ldots, \hat{\eta}_m)$ of the CML Eqs. (16) under the restrictions in Eq. (8) if (a) the design comprises at least $m$ treatment groups $G_g$, $g = 1, \ldots, m$, with linearly independent dosage vectors, $q_g = (q_{g1}, \ldots, q_{gm})$, and (b) for each of these treatment groups $G_g$, there exists at least one item $I_i$ and one person $S_j \in G_g$, such that $u_{ij1} = 1$ and $u_{ij2} = 0$, and one item $I_h$ and one person $S_k \in G_g$, such that $u_{hk1} = 0$ and $u_{hk2} = 1$. This result is a special case of a more general, but much more complicated necessary and sufficient condition for logit models (Fischer, 1983a; Haberman, 1974; Schumacher, 1980). The present sufficient condition is indeed easily satisfied: The linear independence (a) of the dosage vectors is an obvious requirement analogous to a well-known necessary condition in regression analysis and experimental design; and (b) means that, within each treatment group, there is at least one change from a positive to a negative response to some item, and one change from a negative to a positive response to some item (which may be the same or a different item). Hence, we may expect that the CML equations will possess a unique solution even in most small-sample applications.

(ii) If the design comprises at least $m$ treatment groups $G_g$, $g = 1, \ldots, m$, with linearly independent dosage vectors $q_g$, and the group sizes $N_g \to \infty$, such that the sequence of person parameters $\theta_{ij}$ within each group satisfies the condition (18)


then $\hat{\eta}$ is asymptotically multivariate normal around $\eta$ with asymptotic covariance matrix $I^{-1}$ (where $I$ is the information matrix). The formulation of the condition in Eq. (18) is due to Hillgruber (1990; referenced in Pfanzagl, 1994) and prevents that, within some treatment group $G_g$, almost all $|\theta_{ij}|$ become infinitely large, which would render the data uninformative. Condition (ii) again is a requirement that is well approximated in most realistic applications.

Although we cannot go into the details of deeper uniqueness results here, one important aspect has to be mentioned: The above statement (i) might mislead the reader to think that the effect parameters $\eta_l$ are measured on an absolute scale. Such a conclusion is not justified because, as was mentioned above, the derivation of the LLRA from a set of meaningful and testable assumptions leads to a more general logistic model with an unspecified discrimination parameter $a$, so that the scale for the $\delta$-parameters is a ratio scale. If, as is often the case in applications, $q_{jl} \in \{0, 1\}$, that is, subjects receive a treatment, undergo a training, take a course, etc., or do not do so, then it can be concluded that the respective $\eta_l$ are also measured on a ratio scale; in such cases statements like "treatment $B_l$ is 1.5 times as effective as treatment $B_k$" do make sense. If, however, the $q_{jl}$ are dosage measures like "psychotherapy sessions per month," "milligrams of a medicine," or the like, the quotients $\eta_l : \eta_k$ are not invariant under changes of the dosage units and can be interpreted only in relation to the latter.
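The Newton-Raphson solution of the conditional estimation equations for two time points can be sketched in a few lines of Python. This is a minimal illustration, not the authors' program: it assumes the sign convention in which $\delta_j = \sum_l q_{jl}\eta_l$ is added to $\theta_{ij}$ at $T_2$, so that an informative response pair has $u_{ij2} = 1$ with probability $\exp(\delta_j)/[1 + \exp(\delta_j)]$; the function name `llra_cml_two_timepoints` and the data are hypothetical.

```python
import numpy as np

def llra_cml_two_timepoints(u1, u2, Q, max_iter=50, tol=1e-10):
    """CML estimation of effect parameters by Newton-Raphson.

    u1, u2 : (N, n) arrays of 0/1 responses at T1 and T2
    Q      : (N, m) array of dosages q_jl
    Returns the estimate of eta and the inverse information matrix.
    """
    u1 = np.asarray(u1, dtype=float)
    u2 = np.asarray(u2, dtype=float)
    Q = np.asarray(Q, dtype=float)
    informative = (u1 - u2) ** 2                 # filter: 1 if responses differ
    n_inf = informative.sum(axis=1)              # informative pairs per person
    n_gain = (informative * u2).sum(axis=1)      # informative pairs with u2 = 1

    def score_and_information(eta):
        p = 1.0 / (1.0 + np.exp(-(Q @ eta)))     # exp(delta) / (1 + exp(delta))
        score = Q.T @ (n_gain - n_inf * p)
        information = Q.T @ (Q * (n_inf * p * (1.0 - p))[:, None])
        return score, information

    eta = np.zeros(Q.shape[1])
    for _ in range(max_iter):
        score, information = score_and_information(eta)
        step = np.linalg.solve(information, score)
        eta = eta + step
        if np.max(np.abs(step)) < tol:
            break
    _, information = score_and_information(eta)
    return eta, np.linalg.inv(information)

# Hypothetical data: 2 persons, 4 items, one treatment given to both persons.
u1 = np.array([[0, 0, 0, 1], [1, 0, 1, 0]])
u2 = np.array([[1, 1, 0, 0], [1, 1, 0, 1]])
Q = np.array([[1.0], [1.0]])
eta_hat, cov = llra_cml_two_timepoints(u1, u2, Q)
```

The sketch does not check the uniqueness conditions (a) and (b); if they fail, the information matrix may be singular or the iteration may diverge. The square roots of the diagonal of `cov` give approximate standard errors of the effect estimates.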

LLRA for Several Time Points

The task of estimating the model in Eq. (14) via CML is not as formidable as it may appear at first sight, because each virtual person $S^*_{ij}$ responds only to $z$ of the $w(z - 1) + 1$ virtual items $I^*_{gt}$, so that the person parameters $\theta_{ij}$ can be eliminated by conditioning on the raw scores $\sum_t u_{ijt} = u_{ij+}$. The resulting CML equations are basically the same as those of the LLTM [Fischer (1989); see also Fischer, this volume, CML Eq. (15)]; adapted to the present notation, they are

$$\sum_{i=1}^{n} \sum_{g=1}^{w} \sum_{t=1}^{z} q_{glt} \left[ s_{igt} - \sum_{r=1}^{z} n_{igr}\, \frac{\epsilon_{igt}\, \gamma^{(t)}_{ig,r-1}}{\gamma_{igr}} \right] = 0, \quad l = 1, \ldots, m, \tag{19}$$

where $q_{glt}$ is defined as before; moreover, $s_{igt} = \sum_{S_j \in G_g} u_{ijt}$ is the total number of correct responses to (real) item $I_i$ given by persons $S_j \in G_g$ at time point $T_t$, $n_{igr}$ is the number of persons $S_j \in G_g$ who respond correctly to (real) item $I_i$ at exactly $r$ time points, $\gamma_{igr} = \gamma_r(\epsilon_{ig1}, \ldots, \epsilon_{igz})$ is the elementary symmetric function of order $r$ of the (virtual) item parameters $\epsilon_{ig1}, \ldots, \epsilon_{igz}$ within group $G_g$, corresponding to (real) item $I_i$, and $\gamma^{(t)}_{ig,r-1} = \gamma_{r-1}(\epsilon_{ig1}, \ldots, \epsilon_{ig,t-1}, \epsilon_{ig,t+1}, \ldots, \epsilon_{igz})$ is the elementary symmetric function of order $r - 1$ of the same (virtual) item parameters, excluding, however, $\epsilon_{igt}$.
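The elementary symmetric functions that enter the CML equations for several time points can be computed with the usual summation recurrence. The Python sketch below is an illustration only; the function names `elementary_symmetric` and `expected_share` and the parameter values are hypothetical.

```python
def elementary_symmetric(eps):
    """Return [gamma_0, ..., gamma_z] for the parameters in eps."""
    gamma = [1.0]
    for e in eps:
        gamma = gamma + [0.0]
        # Update from the highest order downwards so that old values are used.
        for r in range(len(gamma) - 1, 0, -1):
            gamma[r] += e * gamma[r - 1]
    return gamma

def expected_share(eps, t, r):
    """eps_t * gamma_{r-1}(eps without eps_t) / gamma_r(eps):
    conditional probability of a positive response at time point t,
    given r positive responses in total."""
    eps = list(eps)
    rest = eps[:t] + eps[t + 1:]
    return eps[t] * elementary_symmetric(rest)[r - 1] / elementary_symmetric(eps)[r]

eps = [1.0, 2.0, 3.0]                 # hypothetical virtual item parameters, z = 3
gammas = elementary_symmetric(eps)
shares_r2 = [expected_share(eps, t, 2) for t in range(3)]
```

For a given $r$ the shares sum to $r$ over the time points, and for $r = z$ every share equals 1, so that response patterns with a positive response at all $z$ time points contribute nothing to the estimation equations.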

The summation over $r$ in Eqs. (19) runs from 1 to $z$ for the following reasons: Cases with $r = 0$ can be ignored, because $\gamma^{(t)}_{ig,-1} = 0$ for all $i, g, t$; cases with $r = z$ are statistically uninformative but need not be removed from the data, because they add 1 to each $s_{igt}$, for $t = 1, \ldots, z$, and symmetrically, $\epsilon_{igt}\,\gamma^{(t)}_{ig,z-1}/\gamma_{igz} = 1$ for all $t$. The numerical methods developed for CML estimation (see Fischer, this volume) can be used here again; indeed, it suffices to rearrange the data accordingly. The basic idea is that the $z$ responses of one (real) person $S_j$ to one (real) item $I_i$, and those of the same person to another (real) item $I_k$, are considered as responses of different virtual persons $S^*_{ij}$ and $S^*_{kj}$, with parameters $\theta_{ij}$ and $\theta_{kj}$, to the same virtual items $I^*_{gt}$, $t = 1, \ldots, z$, respectively, such that all person parameters $\theta_{ij}$ are projected on the same continuum. This means that LLTM programs allowing for incomplete data can be used for estimating the present model. [For further details, see Fischer (1989, 1995a).] The generalizability of the present model to several heterogeneous subsets of unidimensional items and to designs where different items are administered at different time points is obvious, but cannot be discussed within the scope of this chapter. Such extensions lead to nontrivial uniqueness problems, which have been solved satisfactorily at least for typical cases (cf. Fischer, 1989).

Goodness of Fit

Testing Hypotheses in the LLRA

The objective of LLRA applications is not just to estimate treatment effects, but also to test hypotheses about these effects. Conditional likelihood ratio tests turn out to be very general and convenient tools for that purpose. Let an LLRA for two time points be defined by Eqs. (6), (7), and (8), comprising $m$ effect parameters; this model, which is believed to be true, will be referred to as hypothesis $H_1$. Most null-hypotheses relevant in practical research can be formulated as linear constraints imposed on the effect parameters in Eq. (8), $h_a(\eta_1, \ldots, \eta_m) = 0$, $a = 1, \ldots, r$. Let $H_0$ be a null-hypothesis so defined, comprising $m_0 = m - r$ independent parameters; then the statistic

$$-2 \ln \lambda = -2 \ln \frac{L(U_1, U_2 \mid U_1 + U_2;\, H_0)}{L(U_1, U_2 \mid U_1 + U_2;\, H_1)}$$

is asymptotically $\chi^2$-distributed with $df = m - m_0 = r$, see Fischer (1995b). Some typical hypotheses testable in this way are listed in the following [for the sake of concretization, the notation of Eq. (9) is used here again]:

(i) No change; $H_0$: $\eta_l = \tau = \rho_{lk} = 0$ for all treatments $B_l$.
(ii) No trend effect; $H_0$: $\tau = 0$. (This test with $df = 1$ is very powerful, as it combines all the information in the data regarding a single parameter.)
(iii) No treatment-interaction effects; $H_0$: $\rho_{lk} = 0$ for all ordered pairs $(B_l, B_k)$.
(iv) Ineffective treatment(s); $H_0$: $\eta_l = 0$ for some or all $B_l$.
(v) Equality of treatment effects; $H_0$: $\eta_l = \eta_k = \eta_h = \cdots$ for certain treatments $B_l, B_k, B_h, \ldots$
(vi) Generalizability of effects over items (or symptoms); $H_0$: $\eta_l^{(A)} = \eta_l^{(B)} = \eta_l^{(C)} = \cdots$ for some or all $B_l$, where $A, B, C, \ldots$ denote subsets of items or symptoms. (Often, these subsets are the single items or items forming subscales of the test.)
(vii) Generalizability of effects over person groups; $H_0$: $\eta_l^{(U)} = \eta_l^{(V)} = \eta_l^{(W)} = \cdots$ for some or all $B_l$, where $U, V, W, \ldots$ denote subgroups of persons. (If person groups are defined by levels of some variable $Z$, this test amounts to testing for interactions between treatments and $Z$.)

The conditional likelihood tests are very convenient because the likelihoods needed are part of the CML estimation, and the test statistic is so simple that it can immediately be calculated by hand. However, there exist also other test statistics that are especially practical for testing the significance of single effect parameters: The inverse of the information matrix, $I^{-1}$, which is needed in the Newton-Raphson method anyway, yields the asymptotic variance-covariance matrix of the estimators and can be used for assessing the significance of any parameter of interest. Many concrete examples of applications from education, applied, and clinical psychology are mentioned in Fischer (1991). One example will be given below.
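Because the test statistic only needs the two maximized conditional log-likelihoods, it is easily computed. The following Python sketch uses the log-likelihoods, degrees of freedom, and critical values reported in the Example of this chapter; the function name `conditional_lr_test` is hypothetical, and the critical value is passed in by the caller rather than computed.

```python
def conditional_lr_test(lnL_restricted, lnL_general, critical_value):
    """Return (-2 ln lambda, reject H0?) for nested conditional likelihoods."""
    statistic = -2.0 * (lnL_restricted - lnL_general)
    return statistic, statistic > critical_value

lnL_model1 = -491.84                       # quasi-saturated Model 1

# Model 0 (no change): df = 63, chi-square(.95) = 82.53
stat0, reject0 = conditional_lr_test(-537.22, lnL_model1, 82.53)
# Model 2 (effects generalize within domains): df = 54, chi-square(.95) = 72.15
stat2, reject2 = conditional_lr_test(-516.82, lnL_model1, 72.15)
# Model 3 (effects generalize over all items): df = 60, chi-square(.95) = 79.08
stat3, reject3 = conditional_lr_test(-531.07, lnL_model1, 79.08)
```

The degrees of freedom are the difference in the number of independent parameters, $m - m_0$; for Model 2 against Model 1, for instance, this is $63 - 9 = 54$.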



Example

To illustrate how the LLRA can be applied to designs with more than two time points, a small subset of data is selected from a study on the effects of a certain communication training (Koschier, 1993). For the present example, the responses of $N_1 = 40$ persons of a Treatment Group (TG; participants of a one-week communication training seminar) and $N_2 = 34$ persons of a Control Group (CG; participants of other one-week seminars on topics unrelated to communication training) to $n = 21$ questionnaire items from the domains "Social Anxiety" (8 items), "Cooperation" (= democratic vs. authoritarian style; 5 items), and "Nervosity" (8 items), are reanalyzed. The items were presented to the respondents at three time points, namely, before ($T_1$), immediately after ($T_2$), and one month after the seminar ($T_3$). The response format of all items consisted of a 4-category rating scale with categories "fully agree" (response 3), "rather agree" (response 2), "rather disagree" (response 1), "fully disagree" (response 0). The complete set of data is given in Table 1. In order to apply the LLRA, the graded responses have to be dichotomized (categories 3 and 2 vs. 1 and 0); this entails some loss of information, but, as will be seen later, the main results about the effects of interest are not greatly affected. There is no justification to consider the items as unidimensional, as they stem from three different domains. It should be mentioned at this point that the IRT models for change would allow one to consider the items from each domain as a separate unidimensional scale, where differences between items are explained by means of item difficulty parameters, and to combine the three scales in one analysis (cf. Fischer, 1989).
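The dichotomization step can be written down directly. The Python sketch below is a small illustration; the function name `dichotomize` and the rating vector are hypothetical and not taken from Table 1.

```python
import numpy as np

def dichotomize(ratings):
    """Map graded responses 3 and 2 to 1, and responses 1 and 0 to 0."""
    return (np.asarray(ratings) >= 2).astype(int)

# Hypothetical ratings of one person on four items.
binary = dichotomize([3, 2, 1, 0])
```

Applied to the response arrays of the three time points, this yields the 0/1 item score matrices that the LLRA requires.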
General experience with questionnaire items, however, indicates that even items from the same domain, like "Social Anxiety," often do not satisfy strict homogeneity/unidimensionality requirements; therefore it is prudent to treat the total item set as multidimensional. As has been stressed already, the LLRA does not make any assumptions about relations between the different latent dimensions and hence is not at variance with a possible unidimensionality of the subscales. While our model thus is quite complex and very flexible regarding interindividual differences, we make rather simple assumptions about the treatment effects: The concrete questions of the study of Koschier (1993) were whether there are specific effects of the communication training that differ from general effects of other seminars and/or of a repeated presentation of the questionnaire. Such effects, if they exist, could be specific for each and every item, or generalizable over items within domains, or generalizable over all items. These are thus the hypotheses to be formalized and tested by means of the model. Each hypothetical effect will be represented by one parameter: a communication training effect between $T_1$ and $T_2$, by $\eta_1$; a "sleeper" effect of the training producing a change of response behavior between $T_2$ and $T_3$, by $\eta_2$; and a "trend" (effect of the other seminar


Table 1: Responses of $N_1 = 40$ Testees of the Treatment Group to Items $I_1$ through $I_{21}$, at 3 Time Points, Taken from the Study of Koschier (1993).

1 9*

3 4 5 6 7 8* 9*

10 11* 12* 13 14* 15 16 17 18* 19 20 21 22* 23 24* 25 26 27 28 29 30* 31 32 33* 34* 35* 36 37 38* 39 40

Anxiety h -It 3 2 3 2 1 3 3 3

2 12 3 12 3 10 10 2 3 2 12 1 1 0 2 3 10 3 13 10 1 1 1 1 2 2 0 12 0 2 2 3 0 0 0 1 1 2 3 112 111

33333033 23222222 12 2 12 1 1 3 2 3 11110 1 3 3332322 3 3223333 33223332 3 3 2 12 1 1 2 33333023 2 110 221 2222 122 2 2 2 2 3 12 2 32211322 2 3 2 1 2 2 12 3 3 11 2 2 2 1 3 3 3 13 2 2 2 3 3 12 1 1 2 2 33 112 2 22 2 3 13 3 1 2 1 2 0 1 2 10 11 2 3 0 1 2 0 11 33 112232 22 32 112 2 3 3 1 1 3 13 2 112 1 1 1 1 2 22222222 3 2 112 112 3 3 2 13 2 2 2 2 3 10 2 2 3 2 2 2 12 2 12 1 3 3 2 12 12 3 3 3 12 3 0 0 2

Time Point 1 Cooperation h - hi 2 2 2 11 11113 11111 3322 1 2 3 112 22 113 2 3 113 22112 2321 3 2 2 11 2 3 3 111 0 3 113 31101 22220 11220 12 1 1 3 3301 21112 21111 2 2 111 12 12 2 12 1 1 2 12 1 1 2 33223 33002 33222 2 12 11 21101 3 3 113 0 12 3 0 22 113 22221 2 2 112 22121 2 2 11 1 2 2 11 1 01211 21 22 1 2 112 2 3 3 10 2

Time Point 2 JNervosity hi - hi 2 1 1 2 2 12 1 11100003 1 1 1 1 2 12 2 2 12 0 12 2 1 1 1 1 0 10 12 0 10 10 0 0 2 0 0 11110 3 1112 1112 3 0 13 2 3 0 1 01232111 1 0 2 1 0-0 2 2 0 0 12 10 0 3 2 0 3 12 12 1 32232231 112 1113 1 22121002 1 1 1 2 10 0 1 12202112 1 1 1 1 2 12 2 11112122 11232 112 1 1 1 0 12 12 10 1 1 1 1 1 2 1 1 1 0 2 10 1 11132112 1112 1112 12 2 12 12 3 3 2 10 2 12 3 10 10 0 0 0 3 3 2 13 2 3 3 3 1 1 2 2 2 10 3 1 1 2 12 12 2 1 1 1 2 10 13 1 1 1 1 2 12 1 11121122 11112 2 2 2 3 13 1 1 3 2 3 21112321 12 10 1 1 1 3 10 0 0 0 0 12

Anxiety h -It

10 0 10 12 2 3 3 10 10 11 3 2 2 12 2 2 2 322232 12 110 11111 2 2 1 2 2 0 11 10 1 1 1 1 1 1 3 3 112 111 3 3 3 13 0 2 2 2222212 2 13 1 1 3 10 3 12 2 2 1 1 1 2 33333323 32222230 3 3 2 13 2 2 2 31212111 33033033 20212222 22222 112 22222122 2 2 111 3 1 2 3 2 12 2 1 1 1 2 2 112 2 1 1 3 2 3 13 2 2 2 3 3 13 3 0 12 3 3 1 1 2 2 11 2 3 12 3 0 2 3 13 1 1 1 0 2 1 12 0 10 0 11 3 3 1 1 2 13 1 2 2 111112 3 3 1 1 2 12 2 3 2 111112 22222222 112 3 2 12 3 3 2 1 123 2 2 12 12 12 2 3 111110 3 3 2 12 12 2 33222002

Note: Responses 3 denote "fully agree", responses 2 "rather agree", responses 1 "rather disagree", and responses 0 "fully disagree"; asterisks denote female sex. Continued on following page






Table 1 (continued): Responses of $N_2 = 34$ Testees of the Control Group to Items $I_1$ through $I_{21}$, at 3 Time Points, Taken from the Study of Koschier (1993).

Treatment Group Time Point 3

Time Point 2 ' Nervositv



Anxiety h 78

Cooperation 7 3 • / i :3 '

Control Group




1 1 1 2 2 0 9 9 3 i 10 2 1 1 2 3 2 1 2 2 2 1 2 3 2 1 2 2 1 1 2 3 1 0 0 3 1 1 3 1 0 0 0 22 2 2 1 2 12 2 1 22 1 13 3 3 0 1 2 2 1 1 1 2 2 2 1 1 2 2 2 1 2 1 2 2 1 3 1 1 2 1 1 2 2 1 1 1 2 3 2 2 2 2 3 3 0 0 2 3 0 2 1 2 1 1 1 1 1 2 2 1 0 2 3 3 0 0 3 0 1 2 2 1 2 3 1 1 2 2 2 1 2 1 2 2 1 1 9 2 2 1 2 2 2 2 1 1 2 1 2 1 1 1 0 0 2 2 1 2 2 2 2 1 2 1 1 2 2 3 2 1 0 1

1 3 2 2 2 2 2 2 3 3 3

2 1 1 2 2 2 2 1 1 1 1 0 0 0 0 3 2 1 2 1 2 1 12 9 1 2 1 2 2 1 1 ] 1 1 0 0 1 3 3 1 1 0 1 1 1 12 1 0 1 0 1 1 12 1 1 1 1 1 1 19 3 0 2 3 1 2 1 2 1 1 2 2 2 2 1 1 2 1 1 3 0 0 3 3 0 0 1 1 1 0 0 3 2 2 3 1 2 1 3 1 2 2 2 3 3 2 2 2 1 1 1 1 1 12 1 2 2 2 1 1 1 0 3 2 1 1 2 1 1 1 1 1 2 2 0 2 1 12 1 1 1 2 2 1 1 2 2 1 2 1 2 1 2 2 1 1 1 1 1 0 2 1 1 1 1 1 1 1 19 1 1 1 1 1 1 1 2 1 1 1 0 2 1 10 1 0 1 3 2 1 12 1 1 1 0 1 0 13 1 2 1 1 2 0 2 0 2 1 1 0 2 2 13 1 0 0 0 0 1 0 3 2 2 1 2 2 2 2 2 2 1 1 1 2 1 12 1 1 2 1 2 1 2 2 1 1 1 2 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 2 2 2 2 2 2 2 2 1 1 2 9 2 2 1 1 12 2 9 1 1 1 1 1 2 1 13 1 1 1 1 1 1 2 1

1 0 0 1 0 1 2 2 3 3 0 0 0 0 0 1 2 2 2 2 2 1 2 2 3 2 2 2 2 2 12 1 0 0 1 1 0 0 2 2 2 0 1 1 0 12 2 2 1 1 2 1 12 3 3 2 1 3 2 2 3 3 3 3 3 3 0 3 3 2 2 2 2 1 2 2 1 2 2 2 1 2 2 2 3 2 2 0 1 1 1 2 2 2 1 2 3 2 3 3 3 3 0 2 2 2 2 22 3 2 1 2 2 2 12 2 3 1 1 1 0 0 1 2 1 3 3 3 0 1 1 3 1 2 1 2 1 12 3 3 2 2 2 1 2 2 2 2 3 2 3 1 2 2 3 2 2 1 2 3 2 2 2 0 2 2 2 1 1 1 2 2 1 1 2 2 1 1 3 2 3 1 3 3 2 2 3 3 1 3 3 1 22 3 3 2 1 1 1 12 2 3 1 2 2 1 23 1 2 0 1 1 1 2 1 2 2 0 0 1 0 12 3 3 1 12 1 3 3 1 2 0 0 1 0 1 1 3 3 1 2 2 1 3 3 2 2 1 1 2 1 2 2 2 2 2 2 2 12 2 22 1 12 112 3 3 3 1 3 2 3 3 2 3 0 1 1 2 2 2 3 3 1 0 1 0 1 1 3 3 3 1 2 12 2 2 3 1 1 2 10 2

2 1 1 2 0 3 1 1 1 1 2 2 1 0 3 1 1 3 1 1 2 3 1 3 2 1 3 1 1 2 1 0 1 1 1 2 1 0 1 1 1 1 1 1 2 2 1 23 1 13 33 0 12 22 1 12 22 1 12 22 12 1 2 1 12 1 1 2 1 1 2 22 1 12 32 2 12 33 10 3 32 2 12 212 2 1 22 1 12 3 3 1 1 3 0 12 2 1 23 1 12 23 1 11 22 12 2 12 12 1 22 1 12 22 12 2 13 2 2 1 222 2 3 2112 3 3 3 1 0 2

21 0 1 11 32 22 22 22 22 22 22 33 33 33 22

2 1 1 1 2 1 2 2 0 10 0 0 0 0 3 1 12 2 2 1 2 1 2 12 0 1 1 1 1 0 1 1 0 1 0 0 3 1 10 0 0 0 0 2 1 1 1 1 1 1 1 2 1 0 0 0 2 0 0 2 3 0 1 3 2 1 0 0 1 12 2 1 2 1 1 1 1 1 3 1 1 2 3 0 1 1 1 0 0 1 3 1 0 3 0 1 1 2 1 2 2 2 3 2 2 2 2 1 1 1 1 2 1 2 1 1 12 1 0 0 0 2 1 0, 1 1 1 0 1 1 2 12 1 1 2 1 2 2 1 1 1 1 1 1 2 2 2 2 0 1 1 2 2 1 12 1 1 1 2 1 1 1 1 1 1 1 1 2 1 0 1 1 1 1 1 2 1 2 1 1 2 1 1 1 1 0 0 3 1 0 0 2 1 1 1 1 1 1 1 3 1 2 2 1 2 0 2 0 2 12 0 2 2 1 2 1 0 0 1 1 0 0 3 2 1 1 2 1 2 2 2 2 1 1 1 1 0 1 3 1 0 2 2 2 0 2 3 1 1 1 2 2 0 1 3 2 1 1 1 1 1 2 2 1 1 1 1 1 1 1 2 2 1 1 2 2 2 1 2 2 2 1 2 1 1 2 2 3 1 1 0 2 2 0 1 1 1 1 0 1 0 0 3 1 0 0 3 0 1 1 2

Time Point 2

Time Point 1 \nxiety


2 12 1 1 11 1 13



1* 2 3* 4* 5 6* 7 8* 9* 10 11 12 13 14 15 16 17 18 19 20 21* 22 23 24* 25 26 27 28* 29 30 31* 32 33* 34*


3 3 2 2 3 1 3 1 2 2 1 2 0 1 2 2 2 3 2 2 2 2 1 2 3 3 2 2 2 2 2 2 1 2 10 1 0 1 3 1 1 2 1 0 0 2 3 0 0 0 1 2 0 3 2 3 3 1 1 3 2 1 3 3 2 1 1 2 2 1 2 3 2 1 1 1 0 2 3 3 3 3 0 3 0 2 2 2 1 1 1 2 1 0 2 1 1 1 1 1 2 1 1 3 3 2 0 1 0 2 2 3 0 1 1 10 1 2 2 3 1 1 1 1 1 2 2 2 1 1 2 0 1 2 1 1 2 1 1 0 1 2 3 3 2 1 2 10 3 2 3 0 0 1 0 0 3 2 3 0 0 3 1 2 3 3 3 2 2 2 3 1 2 2 2 2 1 2 1 2 2 1 2 1 2 1 1 2 2 3 3 0 1 3 0 2 3 2 2 1 1 2 2 1 2 1 1 1 1 1 1 1 1 2 1 3 3 3 0 2 3 3 2 3 1 3 1 3 1 3 3 2 1 2 1 1 2 3 3 2 2 3 2 3 2 3 3 2 1 2 1 1 2 2 3 0 2 2 2 2 3 3 3 3 3 3 0 1 2

Cooperation 7, h:i 1 2 2 1 2 1 1 1 2 1 ] 1 2 2 2 2 2 1 2 1 2 3 2 1 2 1 2 1 1 1 3 3 2 1 2 3 1 2 0 0 1 3 2 1 1 3 3 1 0 3 2 2 1 1 2 1 2 1 2 2 1 2 1 12 2 3 2 2 2 3 3 1 1 3 3 2 0 0 3 r* 3 1 0 2 2 2 1 1 2 2 1 2 2 2 2 3 1 0 3 2 1 2 2 1 1 1 3 10 2 2 1 1 1 0 0 1 2 1 2 3 0 0 2 2 1 2 2 2 1 1 1 2 1 2 2 1 10 1 1 2 2 2 1 3 2 3 1 2 2 1 12 3 3 1 12 1 0 2 2 1 2 3 1 0 2

INervosit\ / 14


1 3 1 0 1 0 1 3 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 0 2 1 1 1 1 1 9 9 1 1 1 2 0 0 1 2 2 1 1 2 2 1 2 2 13 2 2 0 0 1 2 1 1 1 2 2 2 3 2 1 0 2 1 1 1 2 2 1 1 1 1 1 0 0 3 2 2 2 3 2 1 1 3 1 1 1 2 1 1 1 2 2 1 0 2 1 2 1 1 1 0 2 0 9 1 1 3 2 1 0 3 1 1 0 3 1 1 1 0 1 0 0 3 1 0 1 1 1 1 ] 3 2 1 1 1 1 1 1 3 2 1 1 1 1 1 1 2 2 1 2 3 0 2 0 2 2 0 2 0 2 0 2 3 3 2 2 2 2 2 3 2 2 1 2 1 1 1 2 1 2 1 1 1 2 1 2 2 1 1 3 0 2 2 1 3 22 1 12 2 12 1 1 1 2 1 1 2 2 3 3 3 3 2 2 3 1 2 0 0 2 2 1 1 1 2 1 3 1 2 1 2 3 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 2 2 2 0 1 1 0 1 0 1 1

Anxiety h 78 3 2 2 1 2 1 2 2 2 3 0 1 1 0 2 2 9 2 2 2 2 2 2 2 9 9 9 9 2 2 2 2 0 1 0 0 1 0 1 0 3 2 1 1 2 1 1 1 1 3 2 2 2 0 2 2 3 3 3 2 3 2 1 3 3 2 1 1 2 1 2 2 2 2 10 2 0 1 2 3 3 3 0 3 2 2 3 1 1 1 1 2 2 1 2 1 2 1 1 1 2 1 1 3 3 2 0 2 0 3 2 2 1 1 0 1 0 0 1 2 1 1 1 1 1 1 1 2 2 1 0 1 0 1 2 1 3 1 0 1 0 2 1 2 2 2 1 2 1 1 1 2 2 2 0 1 0 0 2 2 3 3 0 3 2 2 2 2 2 1 1 2 2 2 1 3 3 0 2 0 1 2 2 2 2 1 1 2 1 2 2 3 3 1 2 2 1 2 3 2 2 2 1 2 1 1 2 1 2 1 1 1 1 1 1 3 2 3 3 3 0 3 2 3 2 3 1 3 2 2 2 2 3 2 1 1 1 2 2 3 3 2 2 3 2 3 2 3 3 2 1 2 1 2 2 22 12 222 2 3 3 2 3 3 0 2 2

Continued on following page



Table 1 (continued)

Control Group

Time Point 2' Uooperation



h - /13

hi - hi

h -Is

112 2 3 112 3 3 11 221 22111 33 112 12 2 2 1 23 111 2 110 1 12 112 3 3 11 3 12 1 1 1 12 12 2 2 2 2 12 13 13 1 22113 33002 23112 2 12 2 1 112 2 2 13 10 3 2 12 2 1 12 2 2 2 22112 112 2 2 23112 112 2 2 2 12 2 1 11110 13 2 2 3 22222 12 1 1 2 23112 11331 22113

1 3 0 0 0 10 3 210 1110 1 222 2 2 2 2 2 111 1112 2 0 1001113 211 2 112 2 2 1 2 2 10 2 3 2 1 0 2 13 2 2 1122 1112 101 11002 22 1 3 3 12 2 111 2 1112 2102 1112 11202023 2 1 0 3 0 10 2 1 1 1 1 0 0 11 101 11113 0 1 1 12 1 2 3 111 11112 2 11 30 102 111 02122 231 2 1 1 1 2 212 121 11 2 2 2 12 2 11 1 1 1 0 2 0 13 221 11111 1 11 2 2 12 2 32 232332 1 0 0 2 10 0 1 21322 112 111 11112 111 11111 1 2 2 12 2 2 2 0 0 1 0 10 0 3

TABLE 2. Matrices $Q_{TG}$ and $Q_{CG}$ of Weights for $\eta_1$, $\eta_2$, and $\tau$, for 3 Time Points

Time Point 3

3 3 2 2 2 2 12 2 2 1 12 2 2 2 2 2 2 2 2 12 2 2 2 2 2 2 12 2 2 3033 003 3 2 2 12 1 1 2 3 1 302 032 3 3 2 2 3 10 3 3 2 222 222 2 2 2 2 2 12 2 3 3 233 322 1 3 1 2 2 2 11 1 1 1 1 1 2 11 3 3 202 022 2 2 1 110 2 2 2 2 1 12 1 1 1 2 2 1 2101 2 111 11111 2 3 1 12 11 1 3 3 03 3 112 3 3 3 0 3 12 3 3 2 223 232 3 3 2 2 2 12 2 3 3 2 2 3 12 2 3 3 122 222 2 2 2 12 2 12 2 2 1 11111 3 11 3 3 3 1 3 2 3 2 323 222 2 3 1 12 12 2 2 3 222 232 3 3 2 12 1 1 2 2 3 1 110 3 3 3 2 3 13 0 2 2


Cooperation h - 7i3

1 1 2 12 112 2 1 1112 1 2 12 12 3 3 12 3 2 1112 22 112 3 1113 22 111 2 2 12 2 2 1111 12 2 2 2 12 2 12 112 2 1 22112 33002 3 3 10 2 2 2 12 2 112 2 2 22 113 2 2 2 12 112 2 0 2 2 10 2 02222 3 2112 1 112 2 1 12 2 2 12 1 1 0 112 2 1 22221 12 1 1 2 33 113 10 3 2 1 2 2 10 2

Wervosity hi ~ hi

13 10 2 0 12 2 2 11 2 12 1 22212222 2111 1112 0 0 0 00 0 0 3 2 2 12 1 1 1 2 3 12 1 1 1 1 3 1 1 1 2 2 10 1 2 12 2 1 1 2 1 1112 2112 3 3 11 2 2 2 1 21121112 2111 1111 1 1 2 0 2 12 3 2103 0 110 0 12 0 10 12 1111 0 113 1111 1112 2111 1112 1 1 1 1 0 10 3 2 111 2 111 2 22 22232 2 12 1 1 1 1 1 2 2 11 2 1 1 0 1110 1112 2 2 11 12 12 1111 1112 32333331 00002 0 2 1 2 2 2 1 2 12 2 1111 1111 0 0 11 1 1 0 2 12 2 1 2 2 2 2 10 10 1 1 1 3


Time Point     $Q_{TG}$                         $Q_{CG}$
               $\eta_1$  $\eta_2$  $\tau$       $\eta_1$  $\eta_2$  $\tau$
$T_1$             0         0        0             0         0        0
$T_2$             1         0        1             0         0        1
$T_3$             1         1        2             0         0        2

plus the effect of repeatedly responding to the questions), by $\tau$. As a basis of all tests, a "saturated" model is required that is (trivially) true. In our case, a saturated model would have to comprise individual effects for all persons and would therefore hardly be manageable. We therefore use a "quasi-saturated" model that assumes generalizability of effects over individuals, but provides parameters for differential effects with regard to each item. Estimating this model is easily done by fitting one separate model to each item $I_i$, with three effect parameters, $\eta_{i1}$, $\eta_{i2}$, $\tau_i$. The weight matrices $Q_{TG}$ and $Q_{CG}$, for $S_j \in TG$ and $S_j \in CG$, respectively, are given in Table 2. This model will be denoted Model 1. We refrain from giving the $21 \times 3 = 63$ parameter estimates; they obviously have very little statistical precision (each triplet of estimates being based on a single item), and they will not be needed later. The log-likelihood at the point of the CML solution of Model 1 is $\ln L_1 = -491.84$, which will be used as a yardstick for the assessment of other hypotheses.

First we test the hypothesis of no change, setting all effects to zero, $\eta_{i1} = \eta_{i2} = \tau_i = 0$ for all $I_i$. This null hypothesis is denoted Model 0. The log-likelihood results as $\ln L_0 = -537.22$, and the likelihood ratio test statistic is $-2\ln\lambda = -2(\ln L_0 - \ln L_1) = -2(-537.22 + 491.84) = 90.76$,

with $df = 21 \times 3 = 63$, which is significant ($\chi^2_{.95} = 82.53$). Hence, Model 0 is rejected, and it is concluded that some change does occur in the responses of the testees.

In a next step, we test the hypothesis that effects generalize over items within domains, assuming effects $\eta_{A1}$, $\eta_{A2}$, $\tau_A$ for all Social Anxiety items, $\eta_{C1}$, $\eta_{C2}$, $\tau_C$ for all Cooperation items, and $\eta_{N1}$, $\eta_{N2}$, $\tau_N$ for all Nervosity items. This amounts to applying the LLRA with the matrices $Q_{TG}$ and $Q_{CG}$ separately to the items from each domain, denoted Model 2. The estimates so obtained are given in Table 3;² the log-likelihood is $\ln L_2 = -516.82$.

G.H. Fischer and E. Seliger

19. Multidimensional Linear Logistic Models for Change

TABLE 3. Results of the Estimation of Model 2 (Dichotomous LLRA).

Domain           Parameter      Estimate   Standard Deviation   Significance
Social Anxiety   $\eta_{A1}$    -1.26      0.28                 p = 0.00*
                 $\eta_{A2}$    -0.18      0.28                 p = 0.25
                 $\tau_A$        0.50      0.13                 p = 0.00*
Cooperation      $\eta_{C1}$     0.39      0.40                 p = 0.16
                 $\eta_{C2}$     0.43      0.41                 p = 0.15
                 $\tau_C$        0.00      0.15                 p = 0.50
Nervosity        $\eta_{N1}$     0.32      0.29                 p = 0.13
                 $\eta_{N2}$    -0.08      0.29                 p = 0.39
                 $\tau_N$       -0.35      0.13                 p = 0.00*

² Note to Table 3: Positive parameters reflect an increase, negative ones a decrease in the respective behavior; asterisks denote one-sided significance at $\alpha = 0.05$.

To test Model 2 against Model 1, we consider $-2\ln\lambda = -2(\ln L_2 - \ln L_1) = -2(-516.82 + 491.84) = 49.96$, with $df = 21 \times 3 - 3 \times 3 = 54$, which clearly is nonsignificant ($\chi^2_{.95} = 72.15$). Hence it is concluded that the effects generalize over items within domains.

Now we test the hypothesis that effects generalize over all items of all domains. This hypothesis provides only three effect parameters, $\eta_1$, $\eta_2$, $\tau$, and is denoted Model 3; since the signs of the effects on Cooperation are opposite to those on Social Anxiety and Nervosity, we have the restrictions $\eta_{C1} = -\eta_1$, $\eta_{A1} = \eta_{N1} = \eta_1$, $\eta_{C2} = -\eta_2$, $\eta_{A2} = \eta_{N2} = \eta_2$, $\tau_C = -\tau$,

$\tau_A = \tau_N = \tau$. The log-likelihood results as $\ln L_3 = -531.07$, and the test statistic as $-2\ln\lambda = -2(\ln L_3 - \ln L_1) = -2(-531.07 + 491.84) = 78.46$, with $df = 21 \times 3 - 3 = 60$, which is nonsignificant ($\chi^2_{.95} = 79.08$), if only by a narrow margin. Hence, Model 3 can be retained. The parameter estimates under this model are $\hat\eta_1 = -0.47$, $\hat\sigma(\hat\eta_1) = 0.18$, $\hat\eta_2 = -0.16$, $\hat\sigma(\hat\eta_2) = 0.18$, $\hat\tau = 0.05$, $\hat\sigma(\hat\tau) = 0.08$. The only significant effect is that of the communication seminar at time point $T_2$, which has the expected direction. However, we still remain somewhat skeptical about Model 3, because it just failed to depart significantly from Model 1, so that probably it would have to be rejected if the samples had been somewhat larger.

We therefore view Model 2 more closely; see the parameter estimates in Table 3. At least two of the three significant effects are readily interpretable: As expected, the communication training immediately reduces Social Anxiety (time point $T_2$), but there is no after-effect (or "sleeper" effect) at time point $T_3$. For both groups, there is a trend for Nervosity to decrease over time (significant $\tau_N$). Somewhat unexpectedly, there is a similar trend for Social Anxiety to increase over time (significant $\tau_A$); this might be due to a growth of consciousness of social relations in both groups.

For the sake of completeness, Model 2 is finally tested for fit by splitting the sample of testees by gender and re-estimating the model separately in each group; this yields $\ln L_2^{(f)} = -182.67$ (for females) and $\ln L_2^{(m)} = -330.11$ (for males). The likelihood ratio statistic for testing whether the effect parameters generalize over the two gender groups is $-2\ln\lambda = -2(\ln L_2 - \ln L_2^{(f)} - \ln L_2^{(m)}) = -2(-516.82 + 182.67 + 330.11) = 8.08$, with $df = 2 \times 9 - 9 = 9$, which clearly is nonsignificant ($\chi^2_{.95} = 16.92$). We conclude that the effects generalize over the two gender groups and take this as an indication of fit of Model 2.

A final word about the asymptotics of the conditional likelihood ratio statistics in this example seems to be indicated: The asymptotics rest on condition (ii) in the section on parameter estimation for the LLRA, that is, each of the two treatment groups should be "large." In our case, what counts is the number of "virtual" persons, $N^*_g$, per group $G_g$. Since there are $n = 21$ items, we have $N^*_{TG} = 21 \times 40 = 840$ for the Treatment Group, and $N^*_{CG} = 21 \times 34 = 714$ for the Control Group; these numbers should be sufficient for justifying the application of asymptotic results.
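The sequence of likelihood ratio tests reported above can be retraced with a few lines of Python. This is only a minimal sketch: the log-likelihoods and degrees of freedom are the values quoted in the text, whereas the function names and the Wilson-Hilferty approximation to the chi-square quantile are assumptions introduced here for illustration (the chapter itself quotes tabled critical values, from which the approximation may differ in the second decimal).

```python
from statistics import NormalDist


def lr_statistic(lnl_restricted: float, lnl_general: float) -> float:
    """Return -2 ln(lambda) for a restricted model nested in a general one."""
    return -2.0 * (lnl_restricted - lnl_general)


def chi2_quantile(p: float, df: int) -> float:
    """Approximate chi-square quantile (Wilson-Hilferty), not an exact value."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * c ** 0.5) ** 3


# (label, lnL of restricted model, lnL of general model, df) as quoted in the text
tests = [
    ("Model 0 vs. Model 1", -537.22, -491.84, 63),
    ("Model 2 vs. Model 1", -516.82, -491.84, 54),
    ("Model 3 vs. Model 1", -531.07, -491.84, 60),
    # general model: Model 2 estimated separately for females and males
    ("Model 2 pooled vs. by gender", -516.82, -182.67 - 330.11, 9),
]

results = {}
for label, lnl_r, lnl_g, df in tests:
    stat = lr_statistic(lnl_r, lnl_g)
    crit = chi2_quantile(0.95, df)
    results[label] = (stat, crit, stat > crit)
    print(f"{label}: -2 ln(lambda) = {stat:.2f}, df = {df}, "
          f"approx. critical value = {crit:.2f}, reject = {stat > crit}")
```

Under these assumptions only Model 0 is rejected, in line with the conclusions drawn in the text; the narrow margin for Model 3 is also visible in the output.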

Discussion

This chapter has shown the usefulness of the dichotomous LLRA for two or more time points for measuring treatment effects and testing hypotheses about such effects. A polytomous extension of this model (for items with more than two response categories and two time points) was suggested already in Fischer (1972) and further developed in Fischer (1974, 1976, 1977a, 1977b, 1983b). This extension is a straightforward generalization of Eqs. (6) and (8), namely,

$$P(h \mid S_j, I_i, T_1) = \frac{\exp(\theta_{ijh})}{\sum_{k=0}^{d} \exp(\theta_{ijk})}, \tag{21}$$

$$P(h \mid S_j, I_i, T_2) = \frac{\exp(\theta_{ijh} + \delta_{jh})}{\sum_{k=0}^{d} \exp(\theta_{ijk} + \delta_{jk})}, \tag{22}$$

$$\delta_{jh} = \sum_{l} q_{jl}\,\eta_{lh} + \tau_h, \tag{23}$$



where $h = 0, 1, \ldots, d$ denotes the response categories $C_h$ of the items. The parameterization in Eqs. (21) through (23) was inspired by that of the so-called "polytomous multidimensional Rasch model" (Rasch, 1961, 1965; Fischer, 1974), which assigns a separate latent dimension to each response category $C_h$. In this model, multidimensionality has a meaning different from that in the LLRA (above), because latent dimensions are assigned to response categories rather than to items. The model in Eqs. (21)-(23), on the other hand, combines both notions of multidimensionality and describes individual differences in terms of a separate vector of trait parameters for each person x item combination. Similarly, the model assigns a separate treatment effect $\eta_{lh}$ to each treatment $B_l$ per category $C_h$, and one trend effect $\tau_h$ per category. (Owing to necessary normalization conditions, however, the true dimensionality of these vectors is one less than the number of response categories.)

The LLRA extension in Eqs. (21)-(23) was used in just a few applications (e.g., Hammer, 1978; Kropiunigg, 1979a, 1979b), but the treatment effects turned out to be of little stability and interpretability, probably because of the too generous parameterization of the model. Therefore this line of development was not pursued further.

More recent promising ways of dealing with multidimensionality of polytomous items are reinterpretations of the (unidimensional) "linear rating scale model" [LRSM; Fischer and Parzer (1991); see Fischer, this volume] or of the "linear partial credit model" [LPCM; Fischer and Ponocny (1994); see Fischer, this volume]. Again, the unidimensional person parameters $\theta_j$ of these models can be replaced by vectors $\theta_j = (\theta_{1j}, \ldots, \theta_{nj})$ of different parameters for each item, and the responses of one person $S_j$ to the $n$ items of a test can be considered as responses of $n$ "virtual persons" $S^*_{ij}$. The multidimensional extension of the LRSM for change then becomes

$$P(h \mid S_j, I_i, T_t) = \frac{\exp\left[h\left(\theta_{ij} + \sum_l q_{glt}\,\eta_l\right) + \omega_h\right]}{\sum_{k=0}^{d} \exp\left[k\left(\theta_{ij} + \sum_l q_{glt}\,\eta_l\right) + \omega_k\right]}, \tag{24}$$

with

$\theta_{ij}$ for the position of person $S_j$, at time point $T_1$, on the latent dimension $D_i$ measured by item $I_i$,

$q_{glt}$ the dosage of treatment $B_l$ given to persons $S_j$ of group $G_g$ up to time point $T_t$,

$\eta_l$ the effect of one unit of treatment $B_l$,

$\omega_h = \sum_{l=1}^{h} \kappa_l$ for $h = 2, \ldots, d$, where $\kappa_l$, $l = 1, \ldots, d$, are threshold parameters measuring the difficulty of a transition from category $C_{l-1}$ to $C_l$, with $\omega_0 = \omega_1 = \kappa_1 = 0$ for normalization (cf. Andrich, 1978), and


$d$ the number of response categories $C_h$ (the $\omega_h$ and $d$ are assumed to be the same in all items).

The estimation in this model is technically more complicated than that of the dichotomous LLRA versions and thus cannot be treated within the scope of this chapter. The same goes for multidimensional extensions of the LPCM. [The interested reader finds all details of the algorithms in Fischer and Parzer (1991) and Fischer and Ponocny (1994, 1995), respectively. A computer program for PCs under DOS can be obtained from the authors. It is capable of estimating all the models discussed in this chapter.] Hypothesis testing via conditional likelihood ratio tests, however, is as simple as in the dichotomous LLRA; the procedures are in fact identical to those outlined above in the section on hypothesis testing.

For the sake of illustration, we give some selected results from the 4-categorical analysis of the data in Table 1. The log-likelihood of Model 1 (see above) resulted as $\ln L_1 = -1407.51$, that of Model 2 as $\ln L_2 = -1511.55$. When estimating Model 1 from the 4-categorical data, different attractiveness parameters $\omega_{ih}$ are assumed for each item; for Model 2, parameters $\omega_h$ are assumed constant across item domains. The parameter estimates under Model 2 are given in Table 4. Model 2 is tested against Model 1 by means of the statistic $-2\ln\lambda = -2(\ln L_2 - \ln L_1) = -2(-1511.55 + 1407.51) = 208.08$, with $df = 21 \times 5 - (3 \times 3 + 2) = 94$, which is significant ($\chi^2_{.95} = 117.63$). This comes as a surprise because Model 2 was retained under the dichotomous analysis. The reasons for this discrepancy are that (a) now differences of the attractiveness parameters $\omega_{ih}$ between items (which seem to be quite considerable) contribute to the significance, and (b) the 4-categorical analysis is based on more statistical information than the dichotomous one.
Even if Model 2 has now to be rejected, it is interesting to look at the parameter estimates in Table 4: All effects that had been significant in the dichotomous analysis are again significant, and two of the formerly nonsignificant ones ($\eta_{A2}$, $\eta_{C2}$) have become significant owing to the increase in statistical information. All these five parameter estimates have the same sign as those in Table 3. The two analyses can therefore be considered reasonably concurrent. The estimates of the two parameters $\omega_2$ and $\omega_3$ show that the transition from category $C_0$ to $C_1$ is relatively easy compared to a transition from $C_1$ to $C_2$, and that the latter is much easier than that from $C_2$ to $C_3$.

TABLE 4. Results of the Estimation of Model 2 (Under the Polytomous LLRA).

Domain           Parameter      Estimate   Standard Deviation   Significance (one-sided)
Social Anxiety   $\eta_{A1}$    -0.91      0.18                 p = 0.00*
                 $\eta_{A2}$    -0.29      0.18                 p = 0.05*
                 $\tau_A$        0.33      0.09                 p = 0.00*
Cooperation      $\eta_{C1}$    -0.13      0.22                 p = 0.28
                 $\eta_{C2}$     0.50      0.23                 p = 0.01*
                 $\tau_C$       -0.03      0.10                 p = 0.38
Nervosity        $\eta_{N1}$     0.26      0.17                 p = 0.06
                 $\eta_{N2}$    -0.25      0.17                 p = 0.07
                 $\tau_N$       -0.22      0.08                 p = 0.00*
                 $\omega_2$     -3.33      0.13                 p = 0.00*
                 $\omega_3$     -9.68      0.28                 p = 0.00*

Finally, a word of warning is in order: First, it is known that dichotomous and categorical models can fit the same data only under very artificial conditions (Jansen and Roskam, 1986; Roskam and Jansen, 1989). The 4-categorical analysis is mentioned here, however, only for demonstration, even if the LRSM does not fit. Second, while the polytomous analysis exploits more information, nontrivial problems arise through the attractiveness parameters $\omega_h$: If the items show different $\omega_{ih}$, a response like "rather agree" no longer has the same meaning across items, so that there remains hardly any meaningful use of the data, except maybe in very large samples. Therefore, one should not be overly optimistic about the advantages of polytomous response data.
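To make the role of the attractiveness parameters more tangible, the following sketch computes category probabilities of a rating-scale-type model for one "virtual person" before and after a shift on the latent dimension. It is an illustration under stated assumptions, not the authors' program: the function name `lrsm_probs`, the starting position `theta = 1.0`, and the use of a single unit dose are hypothetical, the probability formula is the standard rating scale form assumed for Eq. (24), and only the numerical values of $\omega_2$, $\omega_3$, and $\eta_{A1}$ are taken from Table 4.

```python
import math


def lrsm_probs(theta, shift, omega):
    """Category probabilities P(h), h = 0..d, for one virtual person.

    theta : position on the latent dimension at the first time point
    shift : accumulated change, i.e. the sum over treatments of dosage * effect
    omega : list [omega_0, ..., omega_d] with omega_0 = omega_1 = 0
    """
    logits = [h * (theta + shift) + w for h, w in enumerate(omega)]
    m = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(x - m) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


def expected_category(probs):
    """Mean response category under the given probabilities."""
    return sum(h * p for h, p in enumerate(probs))


omega = [0.0, 0.0, -3.33, -9.68]          # omega_2, omega_3 as in Table 4
before = lrsm_probs(theta=1.0, shift=0.0, omega=omega)     # hypothetical theta
after = lrsm_probs(theta=1.0, shift=-0.91, omega=omega)    # one unit of eta_A1

print([round(p, 3) for p in before], round(expected_category(before), 3))
print([round(p, 3) for p in after], round(expected_category(after), 3))
```

With strongly negative $\omega_2$ and $\omega_3$, almost all probability mass sits on the two lowest categories, which illustrates why the higher transitions are described as much harder.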

References

Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika 43, 561-573.
Cox, D.R. (1970). The Analysis of Binary Data. London: Methuen.
Fischer, G.H. (1972). A measurement model for the effect of mass-media. Acta Psychologica 36, 207-220.
Fischer, G.H. (1974). Einführung in die Theorie psychologischer Tests. [Introduction to Mental Test Theory. In German.] Berne: Huber.
Fischer, G.H. (1976). Some probabilistic models for measuring change. In D.N.M. de Gruijter and L.J.Th. van der Kamp (Eds), Advances in Psychological and Educational Measurement (pp. 97-110). New York: Wiley.
Fischer, G.H. (1977a). Some probabilistic models for the description of attitudinal and behavioral changes under the influence of mass communication. In W.F. Kempf and B. Repp (Eds), Mathematical Models for Social Psychology (pp. 102-151). Berne: Huber, and New York: Wiley.
Fischer, G.H. (1977b). Linear logistic latent trait models: Theory and applications. In H. Spada and W.F. Kempf (Eds), Structural Models of Thinking and Learning (pp. 203-225). Berne: Huber.
Fischer, G.H. (1983a). Logistic latent trait models with linear constraints. Psychometrika 48, 3-26.
Fischer, G.H. (1983b). Some latent trait models for measuring change in qualitative observations. In D.J. Weiss (Ed), New Horizons in Testing (pp. 309-329). New York: Academic Press.
Fischer, G.H. (1987a). Applying the principles of specific objectivity and generalizability to the measurement of change. Psychometrika 52, 565-587.
Fischer, G.H. (1987b). Heritabilität oder Umwelteffekte? Zwei verschiedene Ansätze zur Analyse von Daten getrennt aufgewachsener eineiiger Zwillinge. [Heritability or environmental effects? Two attempts at analyzing data of monozygotic twins reared apart. In German.] In E. Raab and G. Schulter (Eds), Perspektiven Psychologischer Forschung. Vienna: Deuticke.
Fischer, G.H. (1989). An IRT-based model for dichotomous longitudinal data. Psychometrika 54, 599-624.
Fischer, G.H. (1991). A new methodology for the assessment of treatment effects. Evaluación Psicológica / Psychological Assessment 7, 117-147.
Fischer, G.H. (1993). The measurement of environmental effects. An alternative to the estimation of heritability in twin data. Methodika 7, 20-43.
Fischer, G.H. and Formann, A.K. (1981). Zur Schätzung der Erblichkeit quantitativer Merkmale. [Estimating heritability of quantitative trait variables. In German.] Zeitschrift für Differentielle und Diagnostische Psychologie 3, 189-197.
Fischer, G.H. and Formann, A.K. (1982). Veränderungsmessung mittels linear-logistischer Modelle. [Measurement of change using linear logistic models. In German.] Zeitschrift für Differentielle und Diagnostische Psychologie 3, 75-99.
Fischer, G.H. and Parzer, P. (1991). An extension of the rating scale model with an application to the measurement of change. Psychometrika 56, 637-651.
Fischer, G.H. and Ponocny, I. (1994). An extension of the partial credit model with an application to the measurement of change. Psychometrika 59, 177-192.
Fischer, G.H. and Tanzer, N. (1994). Some LBTL and LLTM relationships. In G.H. Fischer and D. Laming (Eds), Contributions to Mathematical Psychology, Psychometrics, and Methodology (pp. 277-303). New York: Springer-Verlag.
Haberman, S.J. (1974). The Analysis of Frequency Data. Chicago: The University of Chicago Press.
Hammer, H. (1978). Informationsgewinn und Motivationseffekt einer Tonbildschau und eines verbalen Lehrvortrages. [Information gain and motivation by video show and a verbal teacher presentation. In German.] Unpublished doctoral dissertation. University of Vienna, Vienna.
Hillgruber, G. (1990). Schätzung von Parametern in psychologischen Testmodellen. [Parameter estimation in psychological test models. In German.] Unpublished master's thesis. University of Cologne, Cologne.
Jansen, P.G.W. and Roskam, E.E. (1986). Latent trait models and dichotomization of graded responses. Psychometrika 51, 69-91.
Koschier, A. (1993). Wirksamkeit von Kommunikationstrainings. [The efficacy of communication training seminars. In German.] Unpublished master's thesis. University of Vienna, Vienna.
Kropiunigg, U. (1979a). Wirkungen einer sozialpolitischen Medienkampagne. [Effects of a sociopolitical media campaign. In German.] Unpublished doctoral dissertation, University of Vienna, Vienna.
Kropiunigg, U. (1979b). Einstellungswandel durch Massenkommunikation. [Attitude change via mass communication. In German.] Österreichische Zeitschrift für Soziologie 4, 67-71.
Pfanzagl, J. (1994). On item parameter estimation in certain latent trait models. In G.H. Fischer and D. Laming (Eds), Contributions to Mathematical Psychology, Psychometrics, and Methodology (pp. 249-263). New York: Springer-Verlag.
Rasch, G. (1961). On general laws and the meaning of measurement in psychology. Proceedings of the IV Berkeley Symposium on Mathematical Statistics and Probability (Vol. 4). Berkeley: University of California Press.
Rasch, G. (1965). Statistisk Seminar. [Statistical Seminar.] University of Copenhagen, Department of Mathematical Statistics, Copenhagen. (Notes taken by J. Stene.)
Roskam, E.E. and Jansen, P.G.W. (1989). Conditions for Rasch-dichotomizability of the unidimensional polytomous Rasch model. Psychometrika 54, 317-332.
Schumacher, M. (1980). Point estimation in quantal response models. Biometrical Journal 22, 315-334.
Stout, W. (1987). A nonparametric approach for assessing latent trait unidimensionality. Psychometrika 52, 589-617.
Stout, W. (1990). A new item response theory modeling approach with applications to unidimensionality assessment and ability estimation. Psychometrika 55, 293-325.

Part IV. Nonparametric Models

The origins of nonparametric IRT are found in Guttman's (1947, 1950a, 1950b) early papers on scalogram analysis, published before any interest existed in parametric IRT. In scalogram analysis, the response functions of test items (or statements in the case of attitude measurements) are modeled to have the shape of a step curve. Such curves assume that the relation between success on an item and the underlying ability is deterministic; that is, they assume that, with probability one, up to a certain unknown threshold on the scale examinees will be unsuccessful on the item (disagree with the statement). At and beyond the threshold, examinees with probability one will be successful on the item (agree with the statement). The technical problem is to locate items (statements) on the scale along with examinees to maximize the fit between the model and the data. Guttman's model is refuted as soon as one response pattern with a wrong response is met on either side of the threshold. If this decision rule is followed strictly, and each item for which a wrong response is found is labeled as "unscalable," it is an empirical fact that hardly any items (statements) can be maintained if a sample of examinees of realistic size is used to analyze a test. Of course, in practice, some tolerance for misfit is allowed.

From the beginning of scalogram analysis, it was felt that a stochastic model with a continuous response function would be more realistic for many applications, and several attempts were made to formulate such functions in a nonparametric fashion. These attempts culminated in the work by Mokken (1971), who not only gave a nonparametric representation of item response functions in the form of a basic set of formal properties they should satisfy, but also provided the statistical theory needed to test whether these properties would hold in empirical data.
Though interest in nonparametric IRT has existed for almost 50 years, in particular in applications where sets of behavior or choices have to be shown to follow a linear order, parametric IRT models have generally been applied in preference to nonparametric IRT models. Logistic models for response functions were introduced, and these models offered solutions to such practical problems as test equating, item banking, and test bias. In addition, they offered the technology needed to implement computerized adaptive testing. In light of such powerful models and applications, it seemed as if the utility of nonparametric IRT was very limited.

In the early 1980s, however, mainly through the pioneering work of Holland (1981), Holland and Rosenbaum (1986), and Rosenbaum (1984, 1987a, 1987b), the topic of nonparametric IRT was re-introduced into psychometric theory. The idea was no longer to provide nonparametric models as an alternative to parametric models but to study the minimum assumptions that have to be met by any response model, be it nonparametric or parametric. Interest in the basic assumptions of IRT has a theoretical purpose but serves diagnostic purposes as well. Since for each assumption the observable consequences are clearly specified, aberrancies in response data can be better interpreted. A search for such aberrancies should be a standard procedure to precede the formal statistical goodness-of-fit tests in use in parametric IRT [see, for example, Hambleton (1989)]. In the same vein, nonparametric statistical techniques can be used to describe the probabilities of success on test items as monotonic functions of the underlying ability. As such functions are based on minimum assumptions, they can be assumed to be closer to the "true response functions" than those provided by any parametric model. Therefore, they are useful, for example, as a first check on whether a certain parametric form of response function would be reasonable.

This section offers three chapters on nonparametric IRT. The chapter by Mokken summarizes the early results in nonparametric IRT for dichotomous items and relates them to the concepts developed in the recent approaches to nonparametric IRT. Molenaar generalizes the theory of dichotomous items to polytomous items. Ramsay discusses techniques to estimate response functions for dichotomous items based on ordinal assumptions. Plots of these estimated response functions can be used as graphical checks on the behavior of response functions.

Additional reading on nonparametric IRT is found in Cliff (1989), who proposed an ordinal test theory based on the study of rank correlations between observed and true test scores.
A nonparametric treatment of various concepts of the dimensionality of the abilities underlying a test is given by Junker (1991, 1993) and Stout (1987, 1990).

References

Cliff, N. (1989). Ordinal consistency and ordinal true scores. Psychometrika 54, 75-92.
Guttman, L. (1947). The Cornell technique for scale and intensity analysis. Educational and Psychological Measurement 7, 247-280.
Guttman, L. (1950a). The basis for scalogram analysis. In S.A. Stouffer, L. Guttman, E.A. Suchman, P.F. Lazarsfeld, S.A. Star, and J.A. Clausen (Eds), Measurement and Prediction: Studies in Social Psychology in World War II (Vol. 4). Princeton, NJ: Princeton University Press.
Guttman, L. (1950b). Relation of scalogram analysis to other techniques. In S.A. Stouffer, L. Guttman, E.A. Suchman, P.F. Lazarsfeld, S.A. Star, and J.A. Clausen (Eds), Measurement and Prediction: Studies in Social Psychology in World War II (Vol. 4). Princeton, NJ: Princeton University Press.
Hambleton, R.K. (1989). Principles and selected applications of item response theory. In R.L. Linn (Ed), Educational Measurement (pp. 147-200). New York: Macmillan.
Holland, P.W. (1981). When are item response models consistent with observed data? Psychometrika 46, 79-92.
Holland, P.W. and Rosenbaum, P.R. (1986). Conditional association and unidimensionality in monotone latent variable models. Annals of Statistics 14, 1523-1543.
Junker, B.W. (1991). Essential independence and likelihood-based ability estimation for polytomous items. Psychometrika 56, 255-278.
Junker, B.W. (1993). Conditional association, essential independence and monotone unidimensional item response models. Annals of Statistics 21, 1359-1378.
Mokken, R.J. (1971). A Theory and Procedure of Scale Analysis, with Applications in Political Research. New York/Berlin: Walter de Gruyter Mouton.
Rosenbaum, P.R. (1984). Testing the conditional independence and monotonicity assumptions of item response theory. Psychometrika 49, 425-435.
Rosenbaum, P.R. (1987a). Probability inequalities for latent scales. British Journal of Mathematical and Statistical Psychology 40, 157-168.
Rosenbaum, P.R. (1987b). Comparing item characteristic curves. Psychometrika 52, 217-233.
Stout, W.F. (1987). A nonparametric approach for assessing latent trait unidimensionality. Psychometrika 52, 589-617.
Stout, W.F. (1990). A new item response theory modeling approach with applications to unidimensionality assessment and ability estimation. Psychometrika 55, 293-325.

20
Nonparametric Models for Dichotomous Responses

Robert J. Mokken

Introduction

The development of nonparametric approaches to psychometric and sociometric measurement dates back to the days before the establishment of regular item response theory (IRT). It has its roots in the early manifestations of scalogram analysis (Guttman, 1950), latent structure analysis (Lazarsfeld, 1950), and latent trait theory (Lord, 1953).

In many situations, such as attitude scaling, the analysis of voting behavior in legislative bodies, and market research, items and indicators are either difficult or scarce to obtain, or the level of information concerning item quality is not sufficiently high to warrant the use of specified parametric models. Researchers then have to assess attitudes, abilities, and associated item difficulties at the more lenient ordinal level, instead of the interval or ratio representations required by more demanding parametric models. The primary reference to this approach was Mokken (1971). Later sources are Henning (1976), Niemöller and van Schuur (1983), Mokken and Lewis (1982), Sijtsma (1988), and Giampaglia (1990).

The probabilistic models in this chapter are nonparametric alternatives to most of the current parametric ability models for responses to dichotomous items. A generalization of the initial model and procedures to polytomous items was developed by Molenaar and will be treated elsewhere (Molenaar, this volume).

Presentation of the Models

A set $J$ of $N$ persons $\{j = 1, \ldots, N\}$ is related to a set $I$ of $n$ items $\{i = 1, \ldots, n\}$ by some observed response behavior. Each set may be considered as a selection by some sampling procedure from some (sub)population of persons or (more rarely) of items. Each person $j$ responds to each item $i$ in the item set. Responses are supposed to be dichotomous, and, hence, can be scored as (0,1) variables. The responses and scores, represented by variates $U_{ij}$ with values $u_{ij}$, respectively, for each person $j$ and item $i$ in sets $J$ and $I$,¹ are supposed to be related in the same way to an underlying, not observable or latent ability, which "triggers" the positive response.

¹ According to the conventions for this Handbook, indices correspond to the transposed person-by-items data matrix.

An illustration for the deterministic model is given in Figure 1. For a dichotomous item $i$, there is only one step separating response "0" from response "1." For the deterministic case, this step marks a boundary on the ability variable $\theta$: Persons with ability $\theta_j$ to the left of the boundary are not sufficiently high on $\theta$ to pass item $i$, and score 0, whereas persons with $\theta_l$, to the right of the boundary, dominate item $i$ in the required amount of $\theta$, and score 1. In actual applications, scoring items uniformly with respect to the direction of $\theta$ is important.

FIGURE 1. Deterministic model (Guttman scalogram).

In probabilistic ability models, responses $U_{ij}$ of person $j$ to item $i$ are assumed to be generated by probabilities, depending on the values of persons and items on the ability $\theta$. For the simple deterministic case of Fig. 1, these probabilities can be given as:

$$P\{U_{ij} = 0; \theta_j\} = 1 \quad \text{and} \quad P\{U_{il} = 1; \theta_l\} = 1.$$

More generally the probabilities are given by functions depending on the item and the value $\theta$ of the ability of a given person $j$:

$$P\{U_{ij} = 1; \theta_j\} = \pi_i(\theta_j)$$

and

$$P\{U_{ij} = 0; \theta_j\} = 1 - \pi_i(\theta_j).$$

The functions $\pi_i(\theta)$ are called item response functions (IRF's). For an example, see Fig. 2. Other well-known names are "trace lines" or "trace functions" in latent structure analysis and "item characteristic curves" (ICC) in IRT. The values of $\pi_i(\theta)$ may be considered as local difficulties, measuring the difficulty of item $i$ for a person located at point $\theta$ along the ability continuum. Sometimes an item difficulty parameter $b$ is added, such as in $\pi_i(\theta, b_i)$, to denote the point on $\theta$ where the probability of passing the item is equal to 0.50, that is, where $\pi_i(b_i, b_i) = 0.50$.

FIGURE 2. Two singly monotonic (MH) IRF's.

Two related nonparametric models for dichotomous cumulative items can be used. The basic elements of these models are treated next.²

² In the sequel, the main results and properties will be given without derivations, for which we may refer to Mokken (1971, Chap. 4).

Monotonic Homogeneity

Some requirements or assumptions concerning the form of the item probabilities $\pi(\theta)$ imply certain types of homogeneity between the items. A first requirement is the following one of monotonic homogeneity (MH):

For all items $i \in I$, $\pi_i(\theta)$ should be monotonically nondecreasing in $\theta$, that is, for all items $i \in I$ and for all values $(\theta_j, \theta_l)$ it should hold that

$$\theta_j \le \theta_l \implies \pi_i(\theta_j) \le \pi_i(\theta_l).$$

A related requirement is the one of similar ordering (SO) of the persons in $J$ by the items in $I$ in terms of their IRF's:

A set of items or IRF's is similarly ordering a set of persons $\theta_1, \theta_2, \ldots, \theta_N$ if, whenever $\pi_i(\theta_1) \le \pi_i(\theta_2) \le \cdots \le \pi_i(\theta_N)$ for some $i \in I$, then $\pi_k(\theta_1) \le \pi_k(\theta_2) \le \cdots \le \pi_k(\theta_N)$ for all $k \in I$.

The SO property of a set of items with respect to a set of persons reflects the possibility of a unidimensional representation of the persons in terms of an ability supposed to underlie their response behavior. It can be shown that SO and MH are equivalent properties. MH sets of IRF's correspond to the cumulative or monotonic trace lines familiar in ability analysis (see Fig. 2). The SO property emphasizes their particular appropriateness for unidimensional representation.

Double Monotonicity

A set of SO or MH items suffices for the purpose of a joint ordering of persons on an ability. However, it is not sufficient for a uniform ordering of








(local) item difficulties, $\pi_i(\theta)$ for persons as they vary in ability $\theta$. Again, this can be seen for the two MH IRF's in Fig. 2. Item $k$ is more difficult than item $i$, but as their IRF's are intersecting, the local difficulty order of the items is reversed for persons on different sides of the point of intersection. So although MH has the items similarly ordering the persons, it does not warrant persons similarly ordering the items. An additional or stronger requirement is necessary to ensure also similar ordering of items by persons. This is the requirement of double monotonicity (DM) or strong monotonicity homogeneity (Mokken, 1971)³:

A MH set of items, $I$, satisfies the condition of double monotonicity (DM) with respect to an ability $\theta$ (or a set of persons, $J$) if for all pairs of items $(i, k) \in I$ it holds that, if for some $\theta_0$: $\pi_i(\theta_0) < \pi_k(\theta_0)$, then for all $\theta$ and $j \in J$: $\pi_i(\theta_j) \le \pi_k(\theta_j)$, where item $i$ is assumed to be the more difficult item.

Figure 3 gives an illustration for a set of three DM items. The item difficulty parameters $b_1$, $b_2$, $b_3$ reflect the order of the local item difficulties (with item 1 being most difficult), which is the same for all persons, that is,

$$b_1 > b_2 > b_3 \iff \text{for all } \theta: \ \pi(\theta, b_1) \le \pi(\theta, b_2) \le \pi(\theta, b_3).$$

Sets of two-parameter logistic items with different discrimination parameters are examples of MH sets (Birnbaum, 1968). Sets of two-parameter logistic items with equal discrimination parameters are examples of DM sets. These are equivalent to models based on the single item parameter logistic IRF associated with the Rasch (1960) model.
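The contrast between MH and DM sets of two-parameter logistic items can be checked numerically. The sketch below is an illustration only: the function names, the grid of ability values, and all item parameters are hypothetical, and a check on a finite grid is a numerical indication rather than a proof. It flags a set as doubly monotone when no pair of IRFs changes order anywhere on the grid.

```python
import math


def irf_2pl(theta, a, b):
    """Two-parameter logistic IRF with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def is_doubly_monotone(items, grid):
    """Return True if no pair of IRFs intersects on the grid.

    items : list of (a, b) pairs; grid : list of theta values.
    """
    for i in range(len(items)):
        for k in range(i + 1, len(items)):
            diffs = [irf_2pl(t, *items[i]) - irf_2pl(t, *items[k]) for t in grid]
            if min(diffs) < 0.0 < max(diffs):  # sign change: the IRFs cross
                return False
    return True


grid = [x / 10.0 for x in range(-60, 61)]          # theta from -6 to 6
equal_a = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]    # equal discrimination
unequal_a = [(0.5, 0.0), (2.0, 0.5)]               # different discrimination

print(is_doubly_monotone(equal_a, grid))    # IRFs never cross
print(is_doubly_monotone(unequal_a, grid))  # IRFs cross near theta = 2/3
```

Each individual IRF is still nondecreasing in both sets, so both are MH; only the set with equal discrimination keeps the same item order at every ability level.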

Two Types of Independence Assumptions

In IRT two types of independence assumptions are basic for the models:

1. Local or conditional independence or independence of responses within persons. This condition assumes that for every single person from $J$ (i.e., for any fixed value of $\theta$) the response to any item from $I$ is independent of the responses to any other item in $I$. This assumption is basic to most probabilistic theories of measurement, implying that all systematic variation in the responses of persons to items is solely due to the variation of persons over $\theta$. All variation at a single point $\theta$ is random and residual, as the sole source of systematic variation is then kept constant.

2. Sampling independence or independence of responses between persons. This condition assumes that the selection of respondents, as well as the procedure of test administration, imply that the responses of any person from set $J$ to the items in set $I$ are not related to the responses to $I$ of any other person in $J$.

FIGURE 3. Double monotony: Classes and class scores.

$^3$ Rosenbaum (1987b) introduced an equivalent concept where one item is uniformly more difficult than another when its IRF lies on or above the surface of the other one.

These two basic assumptions determine the mathematical expression of the joint probability of persons responding to items. Let $\mathbf{u}_j = (u_{1j}, \ldots, u_{nj})$, $u_{ij} = 0, 1$, denote the response vector of person j (with value $\theta_j$) for n items. Obviously, $\mathbf{u}_j$ can take $2^n$ possible values. Assumption 1 implies that for any person j the probability of a response pattern $\mathbf{u}_j$ is given by the product of the probabilities of the individual item responses in the pattern. Likewise, Assumption 2 implies that the probability of a matrix of response data for N persons is the product of the probabilities of the response vectors of the individual persons. Let $\pi_{ik}$ denote the joint probability of a correct response on items i and k. In addition, the simple score $t_v$ of a response pattern $\mathbf{u}_v$ is defined as the number of items correct in that pattern: $t_v = \sum_i u_{iv}$. This notation is used to discuss the following properties.

Simple Monotonicity (MH). An early result (Mokken, 1971) states that for MH sets of items, under the assumption of local independence, all item pairs are nonnegatively correlated for all subgroups of subjects and all subsets of items:

MH1: for an MH set of items, $I$, it holds for every distribution $F(\theta)$ that $\pi_{ik} \ge \pi_i \pi_k$, for all $(i, k) \in I$.
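The product rules implied by the two independence assumptions can be sketched in a few lines of Python. This is only an illustration: the function names and the success probabilities in `p` are hypothetical, standing in for the values $\pi_i(\theta_j)$ of some unspecified IRF's.

```python
import math
from itertools import product


def pattern_probability(u, p):
    """P(u | theta) under local independence (Assumption 1).

    u: sequence of 0/1 item scores for one person.
    p: sequence of success probabilities pi_i(theta) for that person.
    """
    prob = 1.0
    for u_i, p_i in zip(u, p):
        prob *= p_i if u_i == 1 else 1.0 - p_i
    return prob


def data_probability(patterns, probs_per_person):
    """Probability of an N x n response matrix under sampling
    independence (Assumption 2): product over persons."""
    return math.prod(
        pattern_probability(u, p) for u, p in zip(patterns, probs_per_person)
    )


def simple_score(u):
    """Simple score t: the number of items correct in pattern u."""
    return sum(u)


# Hypothetical success probabilities for one person on three items.
p = [0.2, 0.5, 0.8]

# The 2^n possible patterns exhaust the sample space, so they sum to one.
total = sum(pattern_probability(u, p) for u in product([0, 1], repeat=len(p)))
```

Enumerating all $2^n$ patterns, as in the last line, is feasible only for small n; it is used here merely to confirm that the pattern probabilities form a proper distribution.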


20. Nonparametric Models for Dichotomous Responses

Robert J. Mokken

An immediate implication is that the expected value of the Guttman error $\pi_{ik}(1, 0)$ ($i < k$; i more difficult than k) is smaller than expected under conditional marginal (or posterior) independence, given the manifest marginals. Rosenbaum (1984) [see also Holland and Rosenbaum (1986)] proved the following more general result from the same set of assumptions:

$\mathrm{Cov}\{f(\mathbf{u}_1), g(\mathbf{u}_1) \mid h(\mathbf{u}_2)\} \ge 0$,

where $\mathbf{u} = (\mathbf{u}_1, \mathbf{u}_2)$ denotes a partition of the response vector $\mathbf{u}$ into two subsets $\mathbf{u}_1$ and $\mathbf{u}_2$, $f$ and $g$ are nondecreasing functions, and $h$ is an arbitrary function (conditional association, CA).

MH2: every pair of items $(i, k) \in I$ is CA in every subgroup of persons with a particular subscore t on the remaining set of items;

MH3: for every item i the proportion correct $\pi_i$ is nondecreasing over increasing score groups t, where score t is taken over the remaining n - 1 items.

MH4 (ordering of persons): the simple score $t(\mathbf{u}; \theta)$ has a monotonic nondecreasing regression on ability $\theta$ (Lord and Novick, 1968). Moreover, for all $\theta$, $t(\mathbf{u}; \theta)$ is more discriminating than scores on single items (Mokken, 1971). The score $t(\mathbf{u}_j)$ of persons j thus reflects stochastically the ordering of the persons according to that ability.

MH5: the simple score $t(\mathbf{u}; \theta)$ has monotone likelihood ratio, that is, for any two score values $s > r$ the ratio

$P\{t(\mathbf{u}) = s \mid \theta\} / P\{t(\mathbf{u}) = r \mid \theta\}$

is nondecreasing in $\theta$ (Grayson, 1988). This result implies that $t(\mathbf{u}; \theta)$ is an optimal statistic for binary (two-group) classification procedures.

Double Monotonicity (DM). Only for sets of items which are DM is the difficulty order of the items the same for all persons. An immediate consequence is:

DM1: for a DM set of items, for every pair of items (i, k) the order of their manifest (population) difficulties $(\pi_i, \pi_k)$ is the same irrespective of the distribution $F(\theta)$ of $\theta$. Hence, if $i < k$ (that is, if item i is more difficult than item k), then $\pi_i \le \pi_k$ for all groups of persons (Mokken, 1971).

The order of the manifest probabilities $\pi_i$ reflects the uniform difficulty ordering of the items for every (sub)group of persons. This is not always so for just single monotonicity (MH) because then the IRF's of the two items can intersect. For two groups of persons with ability distributions $F(\theta)$ on different sides of the intersection point, the manifest probabilities $(\pi_i, \pi_k)$ will indicate different difficulty orders.

Another DM property concerns the joint probabilities of pairs of manifest responses for triples of items (i, k, l). Let $\pi_{ik}(1, 1)$ and $\pi_{ik}(0, 0)$ denote the probabilities of items i and k both correct and both failed, respectively. Let $k < l$ denote the item ordering for k more difficult than l. Then

DM2: for all $i \in I$, and all pairs $(k, l) \in I$, $k < l$:

$\pi_{ik}(1, 1) \le \pi_{il}(1, 1)$ and $\pi_{ik}(0, 0) \ge \pi_{il}(0, 0)$   (1)

(Mokken, 1971). The two inequalities can be visualized by means of two matrices. Let $\Pi = [\pi_{ik}(1, 1)]$ of order $n \times n$ ($i, k = 1, \ldots, n$ and $i < k$) be the symmetric matrix of manifest joint probabilities of scoring both correct, $\pi_{ik}$. According to the first part of Eq. (1), for any row i of $\Pi$ its row elements will increase with k. By symmetry, the elements of column k will increase with i. Note that the diagonal elements, $\pi_{ii}$, are based on replicated administrations of the item, and generally cannot be observed directly. Let $\Pi^{(0)} = [\pi_{ik}(0, 0)]$ of order $n \times n$ be the symmetric matrix, with diagonal elements not specified, containing the manifest probabilities of both items failed; then the second relation in Eq. (1) implies a reverse trend in rows and columns: the elements in row i will decrease with k and, by symmetry, the elements in any column will decrease with i. As the two matrices $\Pi$ and $\Pi^{(0)}$ can be estimated from samples of persons, these results allow for empirically checking for DM.

Rosenbaum (1987b) derived related results, considering for item pairs the event that exactly one of the two items is answered correctly. The corresponding probabilities are considered as conditional probabilities given some value of an arbitrary function $h(\mathbf{u}_2)$ of the response pattern $\mathbf{u}_2$ of the remaining n - 2 items. It is convenient to take the simple score $t(\mathbf{u}_2)$ or score groups based on $t(\mathbf{u}_2)$ (Rosenbaum, 1987b).

DM3: If item i is more difficult than k ($i < k$), then

$P\{U_i = 1 \mid U_i + U_k = 1; t(\mathbf{u}_2)\} \le 0.50$   (2)

or equivalently $P\{U_i = 1 \mid t(\mathbf{u}_2)\} \le P\{U_k = 1 \mid t(\mathbf{u}_2)\}$, that is, item i will be answered correctly by fewer persons than k in any score group $t(\mathbf{u}_2)$ on the remaining n - 2 items $\mathbf{u}_2$.
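The empirical check of DM2 through the two matrices of joint proportions can be sketched as follows. This is a minimal illustration, not the procedure implemented in any particular package: the function names are hypothetical, items are ordered by their sample proportion correct, and every reversal is counted without regard to sampling error, which a serious check would have to take into account.

```python
import numpy as np


def joint_matrices(X):
    """Sample versions of the two DM2 matrices.

    X: persons x items array of 0/1 scores.
    Items are reordered from most to least difficult (ascending
    proportion correct). Returns (P11, P00, pi) with NaN diagonals.
    """
    X = np.asarray(X, dtype=float)
    order = np.argsort(X.mean(axis=0), kind="stable")
    X = X[:, order]
    N = X.shape[0]
    P11 = X.T @ X / N                  # both items correct
    P00 = (1.0 - X).T @ (1.0 - X) / N  # both items failed
    np.fill_diagonal(P11, np.nan)      # diagonal cannot be observed
    np.fill_diagonal(P00, np.nan)
    return P11, P00, X.mean(axis=0)


def dm2_violations(P11, P00):
    """Count reversals of Eq. (1): for every row i and columns k < l,
    P11[i, k] should not exceed P11[i, l], and P00[i, k] should not
    fall below P00[i, l]."""
    n = P11.shape[0]
    bad11 = bad00 = 0
    for i in range(n):
        cols = [k for k in range(n) if k != i]
        for a in range(len(cols)):
            for b in range(a + 1, len(cols)):
                k, l = cols[a], cols[b]
                if P11[i, k] > P11[i, l]:
                    bad11 += 1
                if P00[i, k] < P00[i, l]:
                    bad00 += 1
    return bad11, bad00


# A perfect Guttman pattern (item 0 most difficult): no reversals expected.
guttman = np.array([
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 1],
])
P11, P00, pi = joint_matrices(guttman)
guttman_result = dm2_violations(P11, P00)

# A small artificial data set with one reversal in the both-correct matrix.
crossed = np.array([
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 1],
    [0, 0, 1],
])
crossed_result = dm2_violations(*joint_matrices(crossed)[:2])
```

Counting raw reversals is the numerical counterpart of visual inspection of the two matrices; small reversals may well be due to sampling fluctuation.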



(Schriever, 1985).

Once it has been established that a set of items can be used as a unidimensional (MH or DM) scale or test, the major purpose of its application is to estimate the $\theta$ values of the persons. In parametric models, this can be done by direct estimation of $\theta$ in the model in which it figures as a parameter. As nonparametric models do not contain these parameters, more indirect methods are needed to infer positions or orderings of respondents along the ability continuum. One simple method is to use the simple score $t(\mathbf{u})$. According to MH4, the score is correlated positively with $\theta$, and according to MH5 it is an optimal statistic for various types of classification. Schriever (1985) advocated optimal score methods derived from multiple correspondence analysis (MCA), where the first principal component $Y_1$ of the correlation matrix of the items $U = [U_1, \ldots, U_n]$ optimally fits the ability $\theta$. This first principal component can be written as

$Y_1 = \sum_{i=1}^{n} \alpha_{i1} Z_i$, with $Z_i = (U_i - \pi_i) / \sqrt{\pi_i (1 - \pi_i)}$,   (3)

where a correct and an incorrect response score are recoded as

$\alpha_{i1} \sqrt{(1 - \pi_i)/\pi_i}$ and $-\alpha_{i1} \sqrt{\pi_i/(1 - \pi_i)}$,

respectively, and $\alpha_1 = [\alpha_{11}, \ldots, \alpha_{n1}]$, with $\alpha_1' \alpha_1 = 1$, is an eigenvector for the largest eigenvalue of the correlation matrix of the items.

Even stronger order conditions are valid for a special case of DM.$^4$ Let a rescored correct response be denoted by $\gamma_{i1}$ and a rescored incorrect response by $\omega_{i1}$. Let $I_n$ be an MH set of n items. Finally, let $I_m$ be a DM subset of m items (i, k), with $i < k$, satisfying the following monotone likelihood ratio type of conditions: $\pi_k(\theta)/\pi_i(\theta)$ and $(1 - \pi_k(\theta))/(1 - \pi_i(\theta))$ are nonincreasing in $\theta$. Then, assuming the $(i, k) \in I_m$ to be ordered according to difficulty (that is, item 1 is the most difficult item, etc.), the following result holds:

DM4: the scores $\gamma_{i1}$ and $\omega_{i1}$ reflect the order of the DM items in the set of items: $\gamma_{11} > \cdots > \gamma_{m1} > 0$; and $0 > \omega_{11} > \cdots > \omega_{m1}$.

$^4$ More precisely, the IRF's of $(i, k) \in I_m$ are required to be DM and to demonstrate total positivity of order two (TP2) (Karlin, 1968; Schriever, 1985).

This type of scoring seems intuitively satisfactory: Answering a difficult item correctly contributes heavily to the total score, whereas an incorrect response is not penalized too much. Moreover, together with DM1-DM3, this property can be used for testing a set of items for DM. The method of optimal scoring can be generalized to polytomous item responses.

Lewis (Lewis, 1983; Mokken and Lewis, 1982), using the ordering assumptions of the DM model, introduced prior knowledge for use with a Bayesian method of allocating persons to n + 1 ordered ability classes $\theta_i$ ($i = 0, \ldots, n$; see Fig. 3 for three items and four score classes) on the basis of their observed response patterns. Working directly with posterior distributions and using exact, small sample results, this method enables one: (1) to obtain interval estimates for $\theta$; (2) to make classification decisions based on utilities associated with different mastery levels; and (3) to analyze responses obtained in tailored adaptive testing situations.

Goodness of Fit

Parametric models are formulated in terms of fully specified response functions defined on a parameter structure. Suitable methods for estimation and testing their goodness of fit to the data at hand are necessary for their effective application. Nonparametric models are more general, hence less specific with respect to estimation and testing. Different statistical methods have to be applied to make inferences. In this section, some statistics will be considered which are pertinent to a set of items as a scale (MH or DM).

Scalability: H-Coefficient

For MH sets of items to exist, positive association is a necessary condition (MH1). For dichotomous items, coefficient H of Loevinger (1947, 1948) was adapted by Mokken (1971) to define a family of coefficients indicating MH scalability for pairs of items within a set, for a single item with respect to the other items of a set, and for the set of items as a whole.$^5$ One intuitive way to define coefficient H is in terms of Guttman error probabilities $\pi_{ik}(1, 0)$ ($i < k$; $\pi_i < \pi_k$). For any pair of items (i, k), let the Guttman error probabilities be defined by

$e_{ik} = \pi_{ik}(1, 0)$;   $e_{ik}^{(0)} = \pi_i (1 - \pi_k)$   if $i < k$;
$e_{ik} = \pi_{ik}(0, 1)$;   $e_{ik}^{(0)} = (1 - \pi_i) \pi_k$   if $i > k$;   (4)

where $e_{ik}$ and $e_{ik}^{(0)}$ denote the probabilities of observed error and of expected error under marginal (bivariate) independence. Then the coefficient for an item pair (i, k) is defined by$^6$

$H_{ik} = 1 - e_{ik} / e_{ik}^{(0)}$.   (5)

In the deterministic case $e_{ik} = 0$ and $H_{ik} = 1$, so that high values of $H_{ik}$ correspond to steep (discriminating) IRF's. In the case of marginal independence, $e_{ik} = e_{ik}^{(0)}$ and $H_{ik} = 0$. So low values of $H_{ik}$ are associated with at least one nondiscriminating (flat) IRF. Negative correlations ($H_{ik} < 0$) contradict MH1 because error probabilities $e_{ik}$ are even larger than expected for completely random responses.

In a similar way, for any single item i, coefficient $H_i$ can be given with respect to the other n - 1 items of $I$ as a linear combination (weighted sum) of the $H_{ik}$ involved. Let $e_i = \sum_{k \ne i} e_{ik}$ and $e_i^{(0)} = \sum_{k \ne i} e_{ik}^{(0)}$; then the coefficient for item i is defined by:

$H_i = 1 - e_i / e_i^{(0)} = \sum_{k \ne i} e_{ik}^{(0)} H_{ik} \big/ \sum_{k \ne i} e_{ik}^{(0)}$.   (6)

$^5$ The H coefficient is due to Loevinger (1947, 1948). However, she did not devise an item coefficient $H_i$ of this type but proposed another instead.


An MH scale is defined as a set of items which are all positively correlated (MH1) with the property that every item coefficient ($H_i$) is greater than or equal to a given positive constant c ($0 < c < 1$).$^7$ From Eq. (7) we have that H, the coefficient testing the scalability of the set of items as a whole, will then also be greater than or equal to c, which can be designated as the scale-defining constant. This suggested the following classification of scales:

$0.50 \le H$:          strong scale;
$0.40 \le H < 0.50$:   medium scale;
$0.30 \le H < 0.40$:   weak scale.

Experience based on a few decades of numerous applications has shown that in practice the lower bound c = 0.30 performs quite satisfactorily, delivering long and useful scales. Meijer et al. (1990) compared the results for c = 0.30 with c = 0, finding that the former restriction yielded sets of items that discriminated better among persons. Sample estimates of $H_{ik}$, $H_i$, and H can be obtained by inserting the usual ML estimates for $\pi(\mathbf{u})$, $\pi_{ik}$, $\pi_i$, etc., into the relevant equations. Asymptotic sampling theory for these estimates was completely developed by Mokken (1971, Chap. 4.3). Generally, coefficients of scalability satisfy the following two requirements:

1. The theoretical maximum is 1 for all scales.
2. The theoretical minimum, assuming MH, is zero for all MH-scales.
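As an illustration of how such sample estimates might be computed, the following Python sketch plugs observed proportions into the error-ratio form of the coefficients: each coefficient is one minus the ratio of observed to expected Guttman error proportions, summed over the relevant item pairs. The function names and the label for values below 0.30 are hypothetical, the sketch assumes that every item has a sample proportion correct strictly between 0 and 1, and it makes no attempt at the sampling theory mentioned above.

```python
import numpy as np


def scalability(X):
    """Sample H_ik, H_i and H for an N x n array of 0/1 item scores.

    Assumes 0 < proportion correct < 1 for every item, so that the
    expected error proportions are nonzero.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[1]
    pi = X.mean(axis=0)
    E = np.zeros((n, n))   # observed Guttman error proportions
    E0 = np.zeros((n, n))  # expected under marginal independence
    for i in range(n):
        for k in range(n):
            if i == k:
                continue
            # The harder item has the smaller proportion correct
            # (ties are broken by item index).
            hard, easy = (i, k) if (pi[i], i) < (pi[k], k) else (k, i)
            # Guttman error: harder item passed, easier item failed.
            E[i, k] = np.mean((X[:, hard] == 1) & (X[:, easy] == 0))
            E0[i, k] = pi[hard] * (1.0 - pi[easy])
    off = ~np.eye(n, dtype=bool)
    H_ik = np.full((n, n), np.nan)
    H_ik[off] = 1.0 - E[off] / E0[off]
    H_i = 1.0 - E.sum(axis=1) / E0.sum(axis=1)
    H = 1.0 - E.sum() / E0.sum()
    return H_ik, H_i, H


def classify_scale(H):
    """Classification of scales by the value of H, as given in the text."""
    if H >= 0.50:
        return "strong"
    if H >= 0.40:
        return "medium"
    if H >= 0.30:
        return "weak"
    return "below c = 0.30"  # hypothetical label, not from the text


# Perfect Guttman data: no errors, so every coefficient equals 1.
guttman = np.array([
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 1],
])
H_ik, H_i, H = scalability(guttman)

# Two items that are exactly independent in the sample: H equals 0.
independent = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
H_indep = scalability(independent)[2]
```

The two toy data sets correspond to the theoretical maximum (the deterministic case) and to marginal independence, respectively.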




The $H_i$ values can be used to evaluate the fit of item i with respect to the other items in $I$. Finally, a coefficient can be defined the same way for the full set of items, $I$, in terms of an error ratio and as a linear combination of the $H_i$ or $H_{ik}$. Let $e = \sum_{i<k} e_{ik}$ and $e^{(0)} = \sum_{i<k} e_{ik}^{(0)}$; then the coefficient for the scale is given by:

$H = 1 - e / e^{(0)} = \sum_{i=1}^{n} e_i^{(0)} H_i \big/ \sum_{i=1}^{n} e_i^{(0)} = \sum_{i<k} e_{ik}^{(0)} H_{ik} \big/ \sum_{i<k} e_{ik}^{(0)}$.   (7)

For MH sets of items, property MH1 implies that the coefficients are all nonnegative:

$0 \le H_{ik}, H_i, H \le 1$.

Moreover, let $\min(H_i; i = 1, \ldots, n) = c$; then $H \ge c$, $0 < c < 1$, that is, H is at least as large as the smallest $H_i$. H can also be written as a normalized variance ratio of the simple score $t(\mathbf{u})$ (Mokken, 1971).

In addition, they can be used for the following goals:

1. Testing theoretically interesting hypotheses about H and $H_i$.
2. Constructing (population specific) confidence intervals for H and $H_i$.
3. Evaluating existing scales as a whole with H, as well as the scalability of individual items with the item coefficients $H_i$.
4. Constructing a scale from a given pool of items.
5. Multiple scaling: the construction of a number of scales from a given pool of items.
6. Extending an existing scale by adding new items from a pool of items.

For more details, see Mokken (1971), Mokken and Lewis (1982), or Molenaar et al. (1994).

$^6$ It should be noted that $H_{ik}$ is equal to $\phi/\phi_{\max}$, where $\phi$ and $\phi_{\max}$ denote Pearson's correlation coefficient in a 2 x 2-table and its maximum value, given the marginals $\pi_i$ and $\pi_k$, respectively.

$^7$ The term "scale" (or, for that matter, "test") is used here, as it is a familiar concept in social research and attitude scaling. It has no immediate connection with the basic concept of scale in axiomatic theories of measurement, where it is used to denote the triple of an empirical system, a numerical system, and a set of mapping rules.




Reliability

For the case of DM sets of items, Mokken (1971) showed a method of estimating item and test reliabilities based on using DM2 for an interpolation for the diagonal elements of the matrix $\Pi$ of Eq. (1). Sijtsma and Molenaar (1987) investigated this approach further, developing improved estimates. Meijer and Sijtsma (1993) demonstrated the use of item reliability as a measure of discriminating power and slope for nonparametric IRF's in person fit analysis.

Software

In the seventies, a FORTRAN package (SCAMMO) was available on Control Data Cyber (Niemoller and van Schuur, 1980) and IBM 360 mainframes (Lippert et al., 1978). This package handled dichotomous items only. In the eighties, PC versions (MSP) were developed for use under MS-DOS, incorporating the options and additional possibilities of Molenaar's generalization of the model to polytomous items (Debets et al., 1989). Recently, the latest version, MSP 3.0, distributed by iec ProGAMMA, has been totally redesigned (Molenaar et al., 1994).

Example

An early example taken from the author's electoral studies (Mokken, 1971, Chap. 8) will give a first illustration of the method for dichotomous items. A set of items designed to measure "sense of political efficacy," a familiar attitudinal variable in electoral research, is considered. The first version of the items, referring to politics at the national level, had been proven to be scalable in the USA and the Netherlands. A new set of eight items was designed to refer to this attitude at the local level of community politics in the city of Amsterdam. There were good reasons to suppose these items would form a scalable set as well. The text of the items as well as the results of a test of their scalability (c = 0.30; $\alpha$ = 0.05) are given in Table 1.$^8$ As expected, Table 1 shows the items to be reasonably (MH) scalable. The item coefficients range from 0.31 (L.6) to 0.48 (L.8); the H coefficient for the total score was 0.41. An analysis across various subgroups confirmed MH in the sense of positive correlation between pairs of items, although due to sampling variability, an occasional H coefficient showed values below c (0.30).

TABLE 1. A Local Efficacy Scale (Eight Items).

Items

If I communicate my views to the municipal authorities, they will be taken into account (Positive alternative: "agree")

The municipal authorities don't care much about the opinions of people like me (Positive alternative: "disagree")

Members of the City Council don't care much about the opinions of people like me (Positive alternative: "disagree")

People like me don't have any say about what the city government does (Positive alternative: "disagree")

If I communicate my views to members of the City Council, they will be taken into account (Positive alternative: "agree")

Sometimes city policies and governments in Amsterdam seem so complicated that a person like me can't really understand what's going on (Positive alternative: "disagree")

In the determination of city politics, the votes of people like me are not taken into account (Positive alternative: "disagree")

Because I know so little about city politics, I shouldn't really vote in municipal elections (Positive alternative: "disagree")

$^8$ Note on Table 1. Sample taken from Amsterdam electorate (n = 1,513); H = 0.41 (95% confidence interval: $0.38 \le H \le 0.43$). Source: Mokken (1971, p. 259).





Systematic inspection of the matrices $\Pi$ and $\Pi^{(0)}$ (see Tables 2 and 3)$^9$ showed the eight-item set not to be doubly monotonic (DM). Removal of item L1 would likely have improved the scale in this respect. The example also shows that visual inspection, always a necessary aid of judgment, is hardly a sufficient tool in itself. In recent versions of MSP, more objective tests of DM2 and DM3 have been implemented (Molenaar, this volume).

TABLE 2. $\Pi$ Matrix of Local Efficacy Scale (Eight Items).

Item                  L1    L2    L3    L4    L5    L6    L7    L8
L1                    -     0.20  0.20  0.17  0.24  0.16  0.26  0.27
L2                    0.20  -     0.24  0.23  0.22  0.21  0.29  0.29
L3                    0.20  0.24  -     0.21  0.22  0.22  0.29  0.29
L4                    0.17  0.23  0.21  -     0.21  0.21  0.32  0.31
L5                    0.24  0.22  0.22  0.21  -     0.19  0.28  0.30
L6                    0.16  0.21  0.22  0.21  0.19  -     0.32  0.36
L7                    0.26  0.29  0.29  0.32  0.28  0.32  -     0.50
L8                    0.27  0.29  0.29  0.31  0.30  0.36  0.50  -
Sample Difficulties   0.33  0.34  0.35  0.37  0.37  0.41  0.62



TABLE 3. $\Pi^{(0)}$ Matrix of Local Efficacy Scale (Eight Items).

Item                  L1    L2    L3    L4    L5    L6    L7    L8
L1                    -     0.52  0.51  0.47  0.54  0.41  0.30  0.26
L2                    0.52  -     0.55  0.52  0.51  0.45  0.32  0.27
L3                    0.51  0.55  -     0.50  0.50  0.46  0.32  0.26
L4                    0.47  0.52  0.50  -     0.47  0.43  0.33  0.26
L5                    0.54  0.51  0.50  0.47  -     0.41  0.29  0.25
L6                    0.41  0.45  0.46  0.43  0.41  -     0.28  0.27
L7                    0.30  0.32  0.32  0.33  0.29  0.28  -     0.20
L8                    0.26  0.27  0.26  0.26  0.25  0.27  0.20  -
Sample Difficulties   0.33  0.34  0.35  0.37  0.37  0.41  0.62



Discussion

Over the years, the methods sketched above have proven their usefulness in numerous applications in many countries as well as in such contexts as health research, electoral studies, market research, attitude studies, and labor studies (e.g., Giampaglia, 1990; Gillespie et al., 1988; Heinz, 1981; Henning and Six, 1977; Lippert et al., 1978).

$^9$ Source for Tables 2 and 3 is Mokken (1971, p. 280).

References

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F.M. Lord and M.R. Novick, Statistical Theories of Mental Test Scores (pp. 397-479). Reading, MA: Addison-Wesley.
Debets, P., Sijtsma, K., Brouwer, E., and Molenaar, I.W. (1989). MSP: A computer program for item analysis according to a nonparametric IRT approach. Psychometrika 54, 534-536.
Giampaglia, G. (1990). Lo Scaling Unidimensionale Nella Ricerca Sociale. Napoli: Liguori Editore.
Gillespie, M., TenVergert, E.M., and Kingma, J. (1988). Secular trends in abortion attitudes: 1975-1980-1985. Journal of Psychology 122, 232341.
Grayson, D.A. (1988). Two-group classification in latent trait theory: Scores with monotone likelihood ratio. Psychometrika 53, 383-392.
Guttman, L. (1950). The basis for scalogram analysis. In S.A. Stouffer et al. (Eds.), Measurement and Prediction (pp. 60-90). Princeton, NJ: Princeton University Press.
Heinz, W. (1981). Klassifikation und Latente Struktur. Unpublished doctoral dissertation, Rheinische Friedrich-Wilhelms-Universität, Bonn, Germany.
Henning, H.J. (1976). Die Technik der Mokken-Skalenanalyse. Psychologische Beiträge 18, 410-430.
Henning, H.J. and Six, B. (1977). Konstruktion einer Machiavellismus-Skala. Zeitschrift für Sozialpsychologie 8, 185-198.
Holland, P.W. and Rosenbaum, P.R. (1986). Conditional association and unidimensionality in monotone latent variable models. Annals of Statistics 14, 1523-1543.
Karlin, S. (1968). Total Positivity I. Stanford, CA: Stanford University Press.
Lazarsfeld, P.F. (1950). The logical and mathematical foundation of latent structure analysis. In S.A. Stouffer et al. (Eds.), Measurement and Prediction (pp. 362-412). Princeton, NJ: Princeton University Press.
Lewis, C. (1983). Bayesian inference for latent abilities. In S.B. Anderson and J.S. Helmick (Eds.), On Educational Testing (pp. 224-251). San Francisco: Jossey-Bass.
Lippert, E., Schneider, P., and Wakenhut, R. (1978). Die Verwendung der Skalierungsverfahren von Mokken und Rasch zur Überprüfung und Revision von Einstellungsskalen. Diagnostica 24, 252-274.
Loevinger, J. (1947). A systematic approach to the construction and evaluation of tests of ability. Psychological Monographs 61, No. 4.
Loevinger, J. (1948). The technic of homogeneous tests compared with some aspects of "scale analysis" and factor analysis. Psychological Bulletin 45, 507-530.
Lord, F.M. (1953). An application of confidence intervals and of maximum likelihood to the estimation of an examinee's ability. Psychometrika 18, 57-77.
Lord, F.M. and Novick, M.R. (1968). Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.
Meijer, R.R. and Sijtsma, K. (1993). Reliability of item scores and its use in person fit research. In R. Steyer, K.F. Wender, and K.F. Widaman (Eds.), Psychometric Methodology: Proceedings of the 7th European Meeting of the Psychometric Society (pp. 326-332). Stuttgart, Germany: Gustav Fischer Verlag.
Meijer, R.R., Sijtsma, K., and Smid, N.G. (1990). Theoretical and empirical comparison of the Mokken and the Rasch approach to IRT. Applied Psychological Measurement 11, 283-298.
Mokken, R.J. (1971). A Theory and Procedure of Scale Analysis with Applications in Political Research. New York, Berlin: Walter de Gruyter, Mouton.
Mokken, R.J. and Lewis, C. (1982). A nonparametric approach to the analysis of dichotomous item responses. Applied Psychological Measurement 6, 417-430.
Molenaar, I.W., Debets, P., Sijtsma, K., and Hemker, B.T. (1994). MSP, A Program for Mokken Scale Analysis for Polytomous Items, Version 3.0 (User's Manual). Groningen, The Netherlands: iec ProGAMMA.
Niemoller, B. and van Schuur, W.H. (1980). Stochastic Cumulative Scaling. STAP User's Manual, Vol. 4. Amsterdam, The Netherlands: Technisch Centrum FSW, University of Amsterdam.
Niemoller, B. and van Schuur, W.H. (1983). Stochastic models for unidimensional scaling: Mokken and Rasch. In D. McKay, N. Schofield, and P. Whiteley (Eds.), Data Analysis and the Social Sciences (pp. 120-170). London: Francis Pinter.
Rasch, G. (1960). Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen: Nielsen and Lydiche.
Rosenbaum, P.R. (1984). Testing the conditional independence and monotonicity assumptions of item response theory. Psychometrika 49, 425-435.
Rosenbaum, P.R. (1987a). Probability inequalities for latent scales. British Journal of Mathematical and Statistical Psychology 40, 157-168.
Rosenbaum, P.R. (1987b). Comparing item characteristic curves. Psychometrika 52, 217-233.
Schriever, B.F. (1985). Order Dependence. Unpublished doctoral dissertation, Free University, Amsterdam, The Netherlands.
Sijtsma, K. (1988). Contributions to Mokken's Nonparametric Item Response Theory. Unpublished doctoral dissertation, Free University, Amsterdam, The Netherlands.
Sijtsma, K. and Molenaar, I.W. (1987). Reliability of test scores in nonparametric item response theory. Psychometrika 52, 79-97.
Sijtsma, K. and Meijer, R.R. (1992). A method for investigating the intersection of item response functions in Mokken's nonparametric IRT model. Applied Psychological Measurement 16, 149-157.
Stouffer, S.A., Guttman, L., Suchman, E.A., Lazarsfeld, P.F., Star, S.A., and Clausen, J.A. (1950). Measurement and Prediction. Studies in Social Psychology in World War II, Vol. IV. Princeton, NJ: Princeton University Press.

21. Nonparametric Models for Polytomous Responses

Ivo W. Molenaar

Introduction

Mokken (this volume) has argued that his nonparametric IRT model for dichotomous responses can be used to order persons with respect to total score on a monotone homogeneous (MH) set of n items, such that apart from measurement error, this reflects the order of these persons on the property measured by the item set (ability, attitude, capacity, achievement, etc.). If the stronger model of double monotonicity (DM) holds, one can also order the items with respect to popularity. In a majority of cases, the respondents giving a positive reply to a difficult item will also answer positively to all easier items. It has been explained how Loevinger's H-coefficient per item pair, per item, and for the scale can be used to express the extent to which this Guttman pattern holds true, and to search for homogeneous scales from a larger pool of items.

Not only in attitude measurement but also in measuring achievement, it will often be the case that an item is scored into a limited number (say three to seven) of ordered response categories, such that a response in a higher category is indicative of the possession of a larger amount of the ability being measured. The more this holds, the more one can expect to gain by using a model for polytomously scored items. For the same number of items presented, one will have a more refined and more reliable instrument than obtained from either dichotomously offered or post hoc dichotomized items. The latter also has other drawbacks (see Jansen and Roskam, 1986). This leads naturally to the question of whether Mokken's nonparametric IRT model can be extended to cover the case of $m > 2$ ordered response categories. It will be argued here that this can be done for the most important aspects of the model.


Presentation of the Model

Just as in the case of dichotomous items (Mokken, this volume), the basic data being modeled form an N by n data matrix. In order to follow the notation of other chapters, the elements of its transpose are the responses $U_{ij}$ of person j to item i for $j = 1, 2, \ldots, N$ and $i = 1, 2, \ldots, n$. Now, however, item i is either offered, or post hoc scored, into a set C of $m_i$ ordered response categories, for which one mostly uses the integers $0, 1, \ldots, m_i - 1$. The number of categories will be denoted by m if it is the same across items; this is often desirable, and will be assumed here for ease of notation. Next, each item i is decomposed into m - 1 dichotomous item steps denoted by $V_{ijh}$ ($h = 1, 2, \ldots, m - 1$):

$V_{ijh} = 1$ if $U_{ij} \ge h$; $V_{ijh} = 0$ otherwise.   (1)

Thus the score of person j on item i equals the number of item steps passed:

$U_{ij} = \sum_{h=1}^{m-1} V_{ijh}$.   (2)

The polytomous Mokken model is essentially equal to the dichotomous Mokken model applied to the $(m - 1)n$ item steps. It will become clear below, however, that a few modifications are required. As a first step of the model specification, the item step response functions (ISRF's) are defined for each ability value $\theta$ by

$\pi_{ih}(\theta) = P(V_{ijh} = 1; \theta) = P(U_{ij} \ge h; \theta)$.   (3)

Just as in the dichotomous case, the requirement of monotone homogeneity (MH) postulates that all ISRF's $\pi_{ih}(\theta)$ are monotonic nondecreasing functions of the ability value $\theta$. This property is again equivalent to that of similar ordering (Mokken, this volume), now for the $(m - 1)n$ item steps. Note that our definition of ISRF's, if they were not only nondecreasing but even logistic, would refer to the so-called cumulative logits $\ln[P(U_{ij} \ge h)/P(U_{ij} < h)]$, as opposed to the continuation-ratio logits $\ln[P(U_{ij} = h)/P(U_{ij} < h)]$ or the adjacent-categories logits $\ln[P(U_{ij} = h)/P(U_{ij} = h - 1)]$ (see Agresti, 1984, p. 113).

The requirement of double monotonicity (DM) now means that the difficulty order of all $(m - 1)n$ item steps is the same for each ability value, or equivalently that the ISRF's may touch but may not cross: If for $i, k \in I$ and for $g, h \in \{1, 2, \ldots, m - 1\}$ there exists a value $\theta_0$ for which $\pi_{ih}(\theta_0) < \pi_{kg}(\theta_0)$, then $\pi_{ih}(\theta) \le \pi_{kg}(\theta)$ holds for all $\theta$. The only difference with the dichotomous case is that such a requirement is trivial for the case i = k of two item steps pertaining to the same item: if $h > g$, then $\pi_{ih}(\theta) \le \pi_{ig}(\theta)$ for all $\theta$ by definition. Indeed it is obvious, in terms of our item step indicators $V_{ijh}$ defined above, that

$V_{ijh} = 1 \Rightarrow V_{ijg} = 1$ for all $g < h$;
$V_{ijh} = 0 \Rightarrow V_{ijg} = 0$ for all $g > h$.   (4)
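The decomposition of a polytomous item score into item steps can be sketched in a few lines of Python. The function name `item_steps` is hypothetical; the sketch assumes categories coded 0, 1, ..., m - 1 as in the text.

```python
def item_steps(u, m):
    """Item step indicators [V_1, ..., V_{m-1}] for an item score u
    in {0, ..., m - 1}: V_h = 1 if u >= h, and 0 otherwise."""
    if not 0 <= u <= m - 1:
        raise ValueError("item score must lie in 0, ..., m - 1")
    return [1 if u >= h else 0 for h in range(1, m)]


m = 4  # four ordered categories, hence three item steps
all_steps = {u: item_steps(u, m) for u in range(m)}

# The item score is recovered as the number of steps passed.
scores_recovered = all(sum(v) == u for u, v in all_steps.items())

# Steps of one item are deterministically dependent: once a step is
# failed, all more difficult steps of the same item are failed too.
steps_nonincreasing = all(
    v[h] >= v[h + 1] for v in all_steps.values() for h in range(m - 2)
)
```

The last check is the deterministic dependence between steps of the same item: the indicator vector is always a run of ones followed by a run of zeros.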


So there is some deterministic dependence between item steps belonging to the same item. For item steps from different items, however, local or conditional independence is postulated, just as in the dichotomous case (Mokken, this volume), and sampling independence between responses of different persons is also assumed.

Next, consider the practical consequences of MH and DM that were listed for the dichotomous case by Mokken (this volume). His result that any item pair correlates nonnegatively for all subgroups of subjects, when MH holds, remains valid for the polytomous case: by Mokken's early result applied to just two item steps, say $V_{ih}$ and $V_{kg}$, one obtains $\mathrm{cov}(V_{ih}, V_{kg}) \ge 0$. By Eq. (2) the covariance between $U_i$ and $U_k$ is equal to the sum of the $(m - 1)(m - 1)$ such covariances between all item steps, each of which is nonnegative, and the result follows.

Rosenbaum's (1984) more general conditional association (CA) result, however, cannot be derived from just local independence and increasing ISRF's. For a generalization of the proof of his Lemma 1, one needs the so-called order 2 total positivity (TP2) property of $P(U_i = h \mid \theta)$, i.e., for any $\theta' > \theta$ it must hold that $P(U_i = h \mid \theta')/P(U_i = h \mid \theta)$ increases with h. This is trivial when h can only be 0 or 1, because $P(U_i = 0 \mid \theta) = 1 - P(U_i = 1 \mid \theta)$ in this case. For three categories, however, a counterexample was presented by Molenaar (1990). This implies also that property MH2 (CA per item pair in rest score groups on the remaining n - 2 items) cannot be demonstrated without the additional TP2 assumption.

Property MH3 would now mean that $P(U_i \ge h)$ would not decrease over increasing rest score groups on the other n - 1 items. A proof, unpublished to date, was prepared by Snijders (1989). Property MH4 immediately follows from Eq. (2) and the dichotomous MH4 property. Property MH5 (monotone likelihood ratio of $\theta$ and total score) is proven by Grayson (1988) in a way that makes very explicit use of the dichotomous character of the item scores. In the polytomous case MLR in the total score holds in the partial credit model but can be violated in extreme cases otherwise; see Hemker et al. (1996).

Mokken's property DM1 is also valid in the polytomous case, where $\pi_i$ now becomes $P(U_i \ge h)$. The same holds for Mokken's property DM2: One now has two matrices of order $(m - 1)n$ by $(m - 1)n$ with rows and columns corresponding to item steps. Matrix entries pertaining to the same item cannot be observed, but all other entries should, apart from sampling fluctuations, be increasing (for $\Pi$) and decreasing (for $\Pi^{(0)}$). DM3 is meaningful for polytomous items when $U_i$ and $U_k$ are replaced by item step indicators $V_{ih}$ and $V_{kg}$.
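The covariance argument used above for the polytomous version of MH1, namely that the covariance of two item scores is the sum of the covariances between their item steps, can be verified numerically. The sketch below uses a small, made-up set of scores on two three-category items; the function names are hypothetical. Note that in a finite sample an individual step covariance may turn out negative through sampling fluctuation even when MH holds in the population.

```python
import numpy as np


def item_step_matrix(u, m):
    """N x (m - 1) matrix of item step indicators V_h = 1 if u >= h."""
    u = np.asarray(u)
    steps = np.arange(1, m)
    return (u[:, None] >= steps[None, :]).astype(float)


def cov(a, b):
    """Covariance with divisor N (population form)."""
    return float(np.mean(a * b) - np.mean(a) * np.mean(b))


def step_covariance_sum(u_i, u_k, m):
    """Sum of the (m - 1)(m - 1) covariances between the item steps of
    item i and the item steps of item k."""
    V_i = item_step_matrix(u_i, m)
    V_k = item_step_matrix(u_k, m)
    return sum(
        cov(V_i[:, h], V_k[:, g]) for h in range(m - 1) for g in range(m - 1)
    )


# Made-up scores of eight persons on two items with m = 3 categories.
u_i = np.array([0, 1, 2, 2, 1, 0, 2, 1])
u_k = np.array([0, 0, 2, 1, 1, 1, 2, 0])

total = cov(u_i, u_k)
parts = step_covariance_sum(u_i, u_k, m=3)
```

The equality of `total` and `parts` is a consequence of the bilinearity of the covariance together with the fact that each item score is the sum of its item step indicators.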


Ivo W. Molenaar

Parameter Estimation

Scalability Coefficients

My first attempt to generalize Mokken scaling to polytomous items used the scalability coefficient H between two item step pairs. This was unsatisfactory, for two reasons. First, it follows from Eq. (4) that between any two steps belonging to the same item, say item i, the pairwise item step coefficient is identically 1: By definition, one cannot pass a difficult item step, say U_i ≥ 2, and at the same time fail an easier one pertaining to the same item, say U_i ≥ 1. Second, the search, test and fit procedures described by Mokken for the dichotomous case lead to decisions to admit or to remove certain items. It would make little sense to take such decisions at the item step level, say by admitting U_i ≥ 2 but removing U_i ≥ 1 from the scale. Although one could argue that this is permissible, and takes the form of a recoding in which categories 0 and 1 are joined, it will usually be preferable to keep or remove entire items, based on the joint scalability of all steps contained in an item. A suitable definition of the scalability coefficient for a pair of items with m ordered answer categories is obtained by establishing the perfect Guttman patterns in their cross table. This will be briefly described in the Example section; for details see Molenaar (1991). The derivation of a coefficient H for item i, and of a coefficient H for the total scale, proceeds exactly as in the dichotomous case. The features of H listed in Mokken (this volume) continue to hold. The reason to discuss scalability coefficients under the heading Parameter Estimation will be clear: such coefficients are initially defined for the population. When their sample counterparts are used for inference, the sampling variance should be taken into account. It refers to the conditional distribution of cell frequencies in the isomarginal family of all cross tables with the same marginals.
The software package MSP (Molenaar et al., 1994) uses the mean and variance of the sample H values under the null model of independence to obtain a normal approximation for testing the hypothesis that a population H value equals zero. Space limitations prohibit details being provided here.

Ability Estimation

The sum score across the n items of a multicategory Mokken scale, which is equal to the total number of item steps passed by Eq. (2), is positively correlated with the ability θ by property MH4. The stronger monotone likelihood ratio property, by which the distribution of θ would be stochastically increasing with total score, is only proven by Grayson (1988) for the dichotomous case. Ordering of persons by total score, however, will be indicative of their ability ordering (see also the next subsection on reliability). Note that each item contributes equally to the total score if the number m of categories is the same across items. If, for example, one item had two categories and another item four, then one could pass only one item step on the former and three on the latter. Unless one assumes that the presence of more categories implies higher discrimination of the item, this unequal contribution of items to the sum score appears to be undesirable. It is thus recommended that the same number of categories across items be used.

Reliability

The extension of the reliability derivations by Mokken (1971) and Sijtsma and Molenaar (1987) to the polytomous case is relatively straightforward: In the matrix Π per item step one may interpolate the missing values (where row and column refer to steps of the same item) from the observed values (where they refer to different items). The procedures are described by Molenaar and Sijtsma (1988) and have been implemented in the MSP program from version 2 onward. Note that the reliability results could be misleading if double monotonicity is seriously violated. Research by Meijer et al. (1993), however, indicates that minor violations of DM tend to have little influence on the reliability results.

Goodness of Fit

The search and test procedures based on the H-coefficients have already been described by Mokken (this volume); they work exactly in the same way for polytomous items. Although it is plausible that scales obtained in this way will very often comply with the MH and DM requirements, this section will present some additional tests and diagnostics that assess the properties of the ISRFs more directly. This material is presented in more detail in Chap. 7 of the MSP User Manual (Molenaar et al., 1994), with examples from the MSP output regarding such tests and diagnostics. The model of MH assumes that the item step characteristic curves are increasing functions of the latent trait value, and the model of DM also assumes that these curves do not intersect. The search and test procedures for scale construction and scale evaluation will often detect major violations of these assumptions because decreasing and/or intersecting curves will lead to far lower H-values per item or per item pair. Such detection methods are rough and global, however, and their success will depend on factors such as the proportion of respondents with latent trait values in the interval for which the curves do not follow the assumptions. More detailed checks of model assumptions would thus be welcome, in particular when the assumptions are intimately linked to the research goal.



For validity considerations it may be important to establish that the item step curves increase over the whole range of the latent trait. For an empirical check of an item hierarchy derived from theory, it may be important that the same difficulty order of items or item steps holds for the whole range, or that it holds for subgroups like males and females. One would thus sometimes like to use a detailed goodness-of-fit analysis that produces detailed conclusions per item step or per subgroup of persons. Such analyses, however, are difficult, for a number of reasons. The first is that the assumptions refer to certain probabilities for given latent trait values, whereas all we can do is use the ordering of subjects with respect to their observed total score T as a proxy for their ordering on the latent trait. This means that we can only obtain rough estimates of the curves for which the assumptions should hold. An additional flaw is that the observed scores have even more measurement error when they are partly based on items that violate the model assumptions. A third problem is that detailed checks of many estimated curves for many score values multiply the risk of at least some wrong conclusions: If one inspects several hundred pairs of values for which the model assumptions predict a specific order, by pure chance several of them will exhibit the reverse order, and in a few cases, this reverse order may even show statistical significance.

Checking of Monotone Homogeneity

The check of MH (increasing item step curves) is the most important one: ordering the persons is often the main goal. Let T denote a person's total score on a Mokken scale consisting of n items which each have m ordered response categories coded as 0, 1, ..., m - 1; the modifications for other codings are trivial. One may subdivide the total sample into score groups containing all persons with total score T = t, where t = 0, 1, ..., n(m - 1). If the scale is valid, persons with T = t + 1 will on average have larger ability values than persons with T = t. Thus, if one forms the fractions of persons in each score group that pass a certain fixed item step, these fractions should increase with t if the item step response function increases with the latent trait value. Just as an item total correlation is inflated because the item itself contributes to the total score (psychometricians therefore prefer an item rest correlation), it is superior to base the above subdivision not on the total score but on the rest score (say R) on the remaining n - 1 items. The analysis of MH is therefore based on the fraction of persons passing a fixed item step in the rest score groups R = r for r = 0, 1, ..., (n - 1)(m - 1), based on the other n - 1 items. Unless the total number of item steps is very small, several of these rest score groups may contain only a few persons, or even no persons at all. For this reason one may first join adjacent rest score groups until each new group contains enough persons. Here one needs a compromise between too few groups (by which a violation could be masked) and too many groups (by which the fractions per group would be very unstable). Every time an item step is passed by a larger fraction in some rest score group than in some higher rest score group, the absolute difference between such fractions is called a violation. Decisions about MH for a given item may be based on the number of violations, their sum and maximum, and their significance in the form of a z-value that, for large group sizes, would be normally distributed in the borderline case of exactly equal population fractions. In the 2 x 2 table of two rest score groups crossed by the item step result (0 or 1), the exact probability of exceedance would be found from the hypergeometric distribution. The z-value then comes from a good normal approximation to this hypergeometric distribution (Molenaar, 1970, Chap. 4, Eq. 2.37). It is clear that all such tests are one-sided; when there is no violation no test is made. For most realistic data sets, quite a few violations of MH tend to be found by the procedure. Most of them tend to be numerically small and to have low z-values, however; they may well be due to sampling variation. Indeed some of these violations disappear when a slightly different grouping of raw rest score groups into larger clusters is applied. In simulated data in which one non-monotone item was added to a set of MH items, the procedure has been shown to correctly spot the non-monotone item in most cases. Often, however, an item with non-monotone ISRFs is already detected because its item H-value is much lower than that of the remaining items. When either a low item H-value or the above procedure casts doubt on an item, it is obvious that substantive considerations should also be taken into account in a decision to remove or to keep it.
Moreover, it is recommended to remove one item at a time: removal of an item changes the rest score, so the fit of any other item may well change (the same holds for their item H-values, which are recalculated across n - 2 rather than n - 1 other items).
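The rest score check of MH described above can be sketched as follows. This is only an illustration under stated assumptions: the data are hypothetical, the names merge_groups and mh_violations are not MSP routines, the rule for joining adjacent rest score groups (accumulate until a minimum size is reached) is one simple choice among many, and the z-value from the normal approximation is omitted.

```python
def merge_groups(rest_scores, min_size):
    """Join adjacent rest score values until each group has >= min_size persons."""
    counts = {}
    for r in rest_scores:
        counts[r] = counts.get(r, 0) + 1
    groups, current, size = [], [], 0
    for r in sorted(counts):
        current.append(r)
        size += counts[r]
        if size >= min_size:
            groups.append(current)
            current, size = [], 0
    if current:  # leftover high rest scores join the last complete group
        if groups:
            groups[-1].extend(current)
        else:
            groups.append(current)
    return groups


def mh_violations(data, item, h, min_size=2):
    """Fractions passing step (item >= h) per rest score group, plus violations."""
    rest = [sum(person) - person[item] for person in data]
    fractions = []
    for group in merge_groups(rest, min_size):
        members = [p for p, r in zip(data, rest) if r in group]
        fractions.append(sum(p[item] >= h for p in members) / len(members))
    violations = []
    for lo in range(len(fractions)):
        for hi in range(lo + 1, len(fractions)):
            if fractions[lo] > fractions[hi]:  # lower group passes more often
                violations.append((lo, hi, fractions[lo] - fractions[hi]))
    return fractions, violations


# Hypothetical responses of eight persons to three items scored 0, 1, 2.
data = [[0, 0, 0], [0, 1, 0], [1, 0, 1], [1, 1, 1],
        [2, 1, 1], [0, 2, 2], [2, 2, 1], [2, 2, 2]]
fractions, violations = mh_violations(data, item=0, h=1, min_size=2)
print(fractions)
print(violations)
```

With so few persons the single violation found here would of course carry no weight; the point is only the bookkeeping of groups, fractions, and violation sizes.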

Checking DM Via the Rest Score

The property that two item step response functions may not intersect holds trivially for two steps belonging to the same item; therefore only pairs of item step curves belonging to two different items need to be examined. Such a pair is ordered by the sample popularity of the two item steps (the rare case of exactly equal sample popularity will require a special procedure not discussed here). The question then is whether there exist one or more intervals on the latent trait axis for which this overall order is reversed. As for the MH check, the present method estimates the curves by the item step popularities in rest score groups. For a fair comparison, the rest score is here defined as the sum score on the remaining n - 2 items. There is a violation if the overall more popular item step is less popular in a certain rest score group. The z-score for each violation is now based on a McNemar test. The frequency of 10 and 01 patterns for the two item steps in each rest score group must reflect their overall popularity order. The boundary case of equal probabilities of 01 and 10 in the group considered leads to a one-sided probability of exceedance in a binomial distribution with success probability 0.5. The software package MSP uses its z-value from a very accurate normal approximation to the binomial distribution with success probability 0.5 (Molenaar, 1970, Chap. 3, Eq. 5.5). The use of this procedure has led to findings similar to those presented for the checking of MH. In this case, it is important to keep in mind that each violation of DM involves two different items. When there is one bad item, whose curves intersect with many others, the DM tables for other items will also show many violations, which will diminish or even vanish when the offending item is removed. Moreover, this item will often have a low item H-value. Again it is recommended to remove one item at a time, even when two or more items appear to have many violations of double monotonicity.
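The McNemar-type test for one rest score group can be sketched as below. Note the swap: MSP is described as using a normal approximation, whereas this sketch computes the exact one-sided binomial exceedance probability with success probability 0.5. The function name dm_violation_p and the counts in the usage lines are hypothetical.

```python
from math import comb


def dm_violation_p(n10, n01):
    """Exact one-sided exceedance probability for a DM violation in one group.

    n10: persons passing the overall more popular step but failing the other
    n01: persons with the reverse pattern
    Returns None when the group shows no violation (n01 <= n10), since the
    test is one-sided and is made only when the order is reversed.
    """
    if n01 <= n10:
        return None
    n = n10 + n01
    # P(X >= n01) for X binomial with n trials and success probability 0.5
    return sum(comb(n, k) for k in range(n01, n + 1)) / 2 ** n


print(dm_violation_p(8, 2))  # no violation in this group
print(dm_violation_p(2, 8))  # reversed order: exceedance probability
```

Only the discordant patterns 10 and 01 enter the test, because persons who pass both or fail both steps carry no information on the order of the two popularities.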

Checking DM via the Π Matrices

In the polytomous case, the n(m - 1) by n(m - 1) matrices Π and Π^(0) contain the joint probabilities of passing (Π) and failing (Π^(0)) pairs of item steps. For cells referring to steps belonging to the same item no sample estimates are available. The DM assumption implies for all other cells that, apart from sampling fluctuation, their rows and columns should be nondecreasing (for Π) and nonincreasing (for Π^(0)). Consider one line of the matrices, in which the order of two columns should reflect their overall popularity. When this is violated in one of the two matrices, the general procedure combines the evidence from the two matrices into one 2 x 2 x 2 cross table for the three item steps involved, as required for a valid and complete analysis. Such violations can again be counted and their maximum, sum, and significance can be assessed.
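As an illustration of how the sample versions of the two matrices might be assembled, the sketch below computes, for every pair of item steps from different items, the fraction of persons passing both and the fraction failing both. The function name joint_step_matrices and the data are hypothetical; item steps are listed in item order here, so sorting rows and columns by popularity before inspecting monotonicity is left out.

```python
def joint_step_matrices(data, m):
    """Sample estimates of the Pi (pass both) and Pi^(0) (fail both) matrices.

    data: list of persons, each a list of n item scores in 0, ..., m - 1.
    Cells for two steps of the same item stay None (no estimate available).
    """
    n_items, n_persons = len(data[0]), len(data)
    steps = [(i, h) for i in range(n_items) for h in range(1, m)]
    size = len(steps)
    pass_both = [[None] * size for _ in range(size)]
    fail_both = [[None] * size for _ in range(size)]
    for r, (i, h) in enumerate(steps):
        for c, (k, g) in enumerate(steps):
            if i == k:
                continue
            pass_both[r][c] = sum(p[i] >= h and p[k] >= g for p in data) / n_persons
            fail_both[r][c] = sum(p[i] < h and p[k] < g for p in data) / n_persons
    return steps, pass_both, fail_both


# Hypothetical responses of four persons to two items scored 0, 1, 2.
data = [[0, 0], [1, 0], [1, 1], [2, 1]]
steps, pass_both, fail_both = joint_step_matrices(data, m=3)
print(steps)
print(pass_both[0][2], fail_both[0][2])
```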

Checking Equality of ISRFs Across Groups

It is an important assumption of item response theory that the response probabilities depend on a person's latent trait value, and on no other characteristics of the person. For an up-to-date and complete treatment of the mathematical implications of such a statement see Ellis and Van den Wollenberg (1993); for an alternative test procedure for the dichotomous case, see Shealy and Stout (1993). Here it suffices to mention an example. If an achievement test contains one item that is easy for boys but difficult for girls with the same ability, then one speaks of item bias or differential item functioning (DIF). It is often desirable to detect and remove such items. In our nonparametric IRT setting, we may investigate DIF by checking for equal item step order in the groups of respondents defined by a variable specified by the user. It is recommended to do this only for those grouping variables for which item bias would be both conceivable and harmful (it makes little sense to split on the basis of the first character of the respondent's name, for example). The point here is not whether one group has systematically higher scores on a scale than the other; rather, it is the question whether the scale has the same meaning and structure in subgroups of persons, in particular whether the item steps occur in the same order of popularity. When for some subgroups two item steps have the reverse ordering, this is a violation. Such violations can again be counted and summed into a summary table per item, and the significance of a violation can be expressed in a z-value.
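The comparison of item step order across subgroups can be sketched as follows, assuming hypothetical data for two dichotomous items and a hypothetical grouping variable; the names step_popularities and order_reversals are illustrative, and the significance test is omitted. A pair of item steps from different items is flagged when its popularity order in a subgroup is the reverse of the order in the total sample.

```python
def step_popularities(data, m):
    """Fraction of persons passing each item step (item i, step h)."""
    n = len(data)
    return {(i, h): sum(p[i] >= h for p in data) / n
            for i in range(len(data[0])) for h in range(1, m)}


def order_reversals(data, groups, m):
    """Item step pairs whose popularity order is reversed within a subgroup."""
    overall = step_popularities(data, m)
    reversals = []
    for label in sorted(set(groups)):
        sub = [p for p, g in zip(data, groups) if g == label]
        pop = step_popularities(sub, m)
        for s in overall:
            for t in overall:
                # s[0] < t[0]: each pair of steps from different items once
                if s[0] < t[0] and (overall[s] - overall[t]) * (pop[s] - pop[t]) < 0:
                    reversals.append((label, s, t, abs(pop[s] - pop[t])))
    return reversals


# Hypothetical data: four persons in group "a", four in group "b".
data = [[1, 0], [1, 0], [1, 1], [0, 0],
        [0, 1], [0, 1], [1, 1], [1, 0]]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
reversals = order_reversals(data, groups, m=2)
print(reversals)
```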

Example

Agresti (1993) presented the responses of 475 respondents to the 1989 General Social Survey. With response format 0 = always wrong, 1 = almost always wrong, 2 = wrong only sometimes, and 3 = not wrong, their opinions were asked on teenage sex, premarital sex by adults, and extramarital sex, respectively. The three items will be denoted TEEN (teenagers of age 14-16 having sex relations before marriage), PRE (a man and a woman having sex relations before marriage), and EXTRA (a married person having sex relations with someone other than the marriage partner). One may wonder whether a common latent trait, say sexual permissiveness, is measured, and how items and persons are positioned on that trait. It might be better to have more than three items, and possibly different ones, but for the purpose of illustrating polytomous Mokken scaling, a short test leads to easier presentation, so such objections will be ignored. The data can be viewed as nine item steps that may be ordered from easy to difficult with the result that PRE ≥ 1 is passed by 69% of the respondents, PRE ≥ 2 by 61%, PRE ≥ 3 by 39%, TEEN ≥ 1 by 25%, EXTRA ≥ 1 by 19%, TEEN ≥ 2 by 10%, EXTRA ≥ 2 by 8%, TEEN ≥ 3 by 3%, and EXTRA ≥ 3 by 1%. The frequency count per response pattern showed 140 scores of 000 (rejecting all three kinds of sex) and only four scores of 333 (no objections at all); whereas many respondents were at least somewhat tolerant toward premarital sex by adults, a large majority strongly objected to teenage sex, and even more to extramarital sex. The extent to which individual response patterns were in agreement with this difficulty order of the item steps in the total group can be expressed in terms of the scalability coefficients. First consider the pairwise H of the items PRE and TEEN, obtained from their cross tabulation in Table 1.

TABLE 1. Cross Tabulation of the Items PRE and TEEN for 475 Persons.

                         PRE
TEEN         0          1          2          3
0         141 (0)     34 (0)     72 (0)    109 (0)
1           4 (3)      5 (2)     23 (1)     38 (0)
2           1 (6)      0 (4)      9 (2)     23 (0)
3           0 (9)      0 (6)      1 (3)     15 (0)

Note. The number of Guttman errors is given in brackets next to the observed frequencies.

There were 141 respondents with score 0 on both items. Someone with a slightly larger sexual tolerance is expected to score 1 on premarital sex
but still score 0 on teenage sex (34 persons do so), because PRE ≥ 1 is the easiest item step. For people with a still higher position on the latent trait, the expected response becomes TEEN = 0, PRE = 2 (72 persons) and then TEEN = 0, PRE = 3 (109 persons). Only then is one assumed to endorse the more difficult item step TEEN ≥ 1 and score TEEN = 1, PRE = 3 (38 persons), next TEEN = 2, PRE = 3, and finally the most tolerant answer TEEN = 3, PRE = 3. The seven cells of the table marked zero (0) in brackets form a path from 00 to 33 on which one never endorses a difficult item step while rejecting an easier one, and on which the total score on the two items rises from 0 to 6. There were 141 + 34 + ... + 15 = 432 respondents along this path; the other 43 persons had score patterns with at least one Guttman error. Note that the 23 respondents with TEEN = 1 and PRE = 2 had just one such error (for the whole group the step PRE ≥ 3 comes just before the step TEEN ≥ 1). The one person with TEEN = 2 and PRE = 0, however, had six pairs of item steps in which the easier one was rejected and the more difficult one endorsed. As explained in more detail in Molenaar (1991) and in the MSP Manual (Molenaar et al., 1994), one may thus obtain a weighted error count of 72 for the 43 persons who were not on the path marked by zeros (0). This number was compared to the expected weighted count in the error cells under the null hypothesis of statistical independence of the scores on both items, which was found to be 240.56. Then, the pairwise H of the two items equaled H(TEEN, PRE) = 1 - 72/240.56 = 0.70. This value is also equal to the observed correlation between the two item scores divided by the maximum correlation that could be obtained given the marginal frequencies for each item (Molenaar, 1991); that calculation will be skipped here. From the other cross tables not reproduced here one obtains in the same way H(EXTRA, TEEN) = 0.45 and H(EXTRA, PRE) = 0.71 (note that the path followed from 00 to 33 may have a different shape in each table, depending on the popularities of the item steps under consideration). The item H values were found to be 0.57, 0.58, and 0.70 for EXTRA, TEEN, and PRE, respectively, and the scale of three items had estimated scalability H = 0.62 and estimated reliability 0.69. All these values differed significantly from zero, and under the standard settings of MSP there appeared to be no violations of single or double monotonicity. It can be concluded that the three items form a scale for sexual tolerance on which 140 persons have the minimum score of 0 and four persons have the maximal score of 9; the mean score is 2.36 with a standard deviation of 2.06 and the distribution of scores is slightly skewed in the positive direction. From a detailed analysis of all answer patterns, it emerges that 366 persons had no Guttman errors (from their total score, their responses on the items can be inferred), 51 persons had only one Guttman inversion, and the frequency for 2 to 7 errors is 15, 28, 5, 3, 3, 3, respectively. No one had eight or more errors. Note that for an unlikely pattern like TEEN = 0, EXTRA = 3, PRE = 0 one would obtain 18 errors because the six easiest item steps would be failed and the three most difficult ones passed. In this example both item steps and persons were successfully ordered on a latent trait that measured sexual permissiveness. Space does not permit a more detailed explanation of the example, or a full illustration of the other facets of the model and the software. The latter can be purchased from ProGAMMA, P.O. Box 841, 9700 AV Groningen, The Netherlands.
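The pairwise H computed in this example can be retraced with a short sketch. It assumes the PRE by TEEN counts of Table 1; the TEEN = 3 row used here (0, 0, 1, 15) is a reconstruction from the totals quoted in the text (475 persons, 432 on the error-free path, 43 persons with a weighted error count of 72), not a row copied from the printed table. The function name pairwise_h is illustrative and is not an MSP routine. The weight of a cell is the number of item step pairs in which the easier step is failed and the more difficult step is passed.

```python
def pairwise_h(table):
    """Weighted Guttman errors and H for a cross table of two items.

    table[a][b] is the number of persons scoring a on item A and b on item B.
    """
    m_a, m_b = len(table), len(table[0])
    n = sum(map(sum, table))
    row = [sum(r) for r in table]
    col = [sum(table[a][b] for a in range(m_a)) for b in range(m_b)]
    # All item steps with their popularities, sorted from easy to difficult.
    steps = [("A", h, sum(row[h:]) / n) for h in range(1, m_a)]
    steps += [("B", h, sum(col[h:]) / n) for h in range(1, m_b)]
    steps.sort(key=lambda s: -s[2])

    def weight(a, b):
        errors, failed_easier = 0, 0
        for item, h, _ in steps:
            passed = (a >= h) if item == "A" else (b >= h)
            if passed:
                errors += failed_easier  # one error per easier failed step
            else:
                failed_easier += 1
        return errors

    cells = [(a, b) for a in range(m_a) for b in range(m_b)]
    observed = sum(weight(a, b) * table[a][b] for a, b in cells)
    expected = sum(weight(a, b) * row[a] * col[b] / n for a, b in cells)
    return observed, expected, 1 - observed / expected


# Rows: TEEN = 0..3, columns: PRE = 0..3 (last row reconstructed, see above).
teen_by_pre = [[141, 34, 72, 109],
               [4, 5, 23, 38],
               [1, 0, 9, 23],
               [0, 0, 1, 15]]
observed, expected, h = pairwise_h(teen_by_pre)
print(observed, round(expected, 2), round(h, 2))
```

Under these assumptions the sketch returns the weighted error count of 72, the expected count of 240.56, and H = 0.70 reported in the text.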

References

Agresti, A. (1984). Analysis of Ordinal Categorical Data. New York: Wiley.
Agresti, A. (1993). Computing conditional maximum likelihood estimates for generalized Rasch models using simple loglinear models with diagonals parameters. Scandinavian Journal of Statistics 20, 63-71.
Ellis, J. and Van den Wollenberg, A.L. (1993). Local homogeneity in latent trait models: A characterization of the homogeneous monotone IRT model. Psychometrika 58, 417-429.
Grayson, D.A. (1988). Two-group classification in latent trait theory: Scores with monotone likelihood ratio. Psychometrika 53, 383-392.
Hemker, B.T., Sijtsma, K., Molenaar, I.W., and Junker, B.W. (1996). Polytomous IRT models and monotone likelihood ratio in the total score. Psychometrika, accepted.
Jansen, P.G.W. and Roskam, E.E. (1986). Latent trait models and dichotomization of graded responses. Psychometrika 51, 69-91.
Meijer, R.R., Sijtsma, K., and Molenaar, I.W. (1996). Reliability estimation for single dichotomous items based on Mokken's IRT model. Applied Psychological Measurement, in press.
Mokken, R.J. (1971). A Theory and Procedure of Scale Analysis. With Applications in Political Research. New York, Berlin: Walter de Gruyter, Mouton.
Molenaar, I.W. (1970). Approximations to the Poisson, Binomial and Hypergeometric Distribution Functions (MC Tract 31). Amsterdam: Mathematisch Centrum (now CWI).
Molenaar, I.W. (1990). Unpublished lecture for the Rasch research group. Arnhem: CITO.
Molenaar, I.W. (1991). A weighted Loevinger H-coefficient extending Mokken scaling to multicategory items. Kwantitatieve Methoden 12(37), 97-117.
Molenaar, I.W., Debets, P., Sijtsma, K., and Hemker, B.T. (1994). User's Manual MSP. Groningen: iec ProGAMMA.
Molenaar, I.W. and Sijtsma, K. (1988). Mokken's approach to reliability estimation extended to multicategory items. Kwantitatieve Methoden 9(28), 115-126.
Rosenbaum, P.R. (1984). Testing the conditional independence and monotonicity assumptions of item response theory. Psychometrika 49, 425-435.
Shealy, R. and Stout, W. (1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DIF as well as item bias/DIF. Psychometrika 58, 159-194.
Snijders, T.A.B. (1989). Unpublished notes.
Sijtsma, K. and Molenaar, I.W. (1987). Reliability of test scores in nonparametric item response theory. Psychometrika 52, 79-97.

22 A Functional Approach to Modeling Test Data
J.O. Ramsay

Introduction

The central problem in psychometric data analysis using item response theory is to model the response curve linking a level θ of ability and the probability of choosing a specific option on a particular item. Most approaches to this problem have assumed that the curve to be estimated is within a restricted class of functions defined by a specific mathematical model. The Rasch model and the three-parameter logistic model for binary data are the best-known examples. In this chapter, however, the aim is to estimate the response curve directly, thereby escaping the restrictions imposed by what can be achieved with a particular parametric family of curves. It will also be assumed that the responses to an item are polytomous, and can involve any number of options. First some notation. Let a test consist of n items, and be administered to N examinees. Assume that the consequence of the interaction of an examinee j and item i is one of a finite set of m_i states. These states may or may not be ordered, and little that appears in this chapter will depend on their order or on whether the items are dichotomous. The actual option chosen will be represented by the indicator vector y_ij of length m_i with a 1 in the position corresponding to the option chosen and zeros elsewhere. At the core of almost all modern procedures for the modeling of testing data is the problem of fitting the N indicator vectors y_ij for a specific item i on the basis of the covariate values θ_j. The model for these indicator vectors is a vector-valued function P_i(θ) whose values are the probabilities that a candidate j with ability θ_j will choose option h, or, equivalently, will have a 1 in position h of indicator vector y_ij. The general statistical problem of estimating such a smooth function on the basis of a set of indicator variables is known as the multinomial regression problem.
The special case of two options is the binomial regression or binary regression problem, and there is a large literature available, which is surveyed in Cox and Snell (1989). McCullagh and Nelder (1989) cover the important exponential family class of models along with the highly flexible and effective generalized linear model or GLM algorithm for fitting data. Although the value θ_j associated with examinee j is not directly observed, various techniques in effect use various types of surrogates for these values. The total number of correct items for long tests, for example, can serve. Attempts to estimate both item characteristics and the values of θ_j by maximum likelihood will alternate between an item-estimation step and an ability-estimation step, so that multinomial regression is involved in the item-estimation phase. For techniques that marginalize likelihood over a prior distribution for θ, numerical integration is almost always involved, and in this case the known values of θ are appropriately chosen quadrature points. Finally, certain nonparametric approaches to be described below will assign values of θ more or less arbitrarily to a pre-determined set of rank values. Like the quadrature procedures, these techniques, too, can bin or group the values prior to analysis to speed up calculation.
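The notation can be made concrete with a small sketch. It builds the indicator vector for a chosen option and then averages indicator vectors within bins of examinees ranked by a surrogate for θ, which gives a crude, unsmoothed estimate of the option probabilities. The names indicator and binned_option_proportions, the equal-sized bins, and the data are assumptions made for illustration; this is not the estimation method developed in the chapter.

```python
def indicator(option, m_i):
    """Indicator vector of length m_i: 1 at the chosen option, 0 elsewhere."""
    return [int(h == option) for h in range(m_i)]


def binned_option_proportions(choices, surrogate, m_i, n_bins):
    """Mean indicator vector per bin of examinees ranked by the surrogate.

    Assumes n_bins does not exceed the number of examinees.
    """
    order = sorted(range(len(choices)), key=lambda j: surrogate[j])
    n = len(order)
    bins = [order[b * n // n_bins:(b + 1) * n // n_bins] for b in range(n_bins)]
    curves = []
    for members in bins:
        total = [0] * m_i
        for j in members:
            for h, y in enumerate(indicator(choices[j], m_i)):
                total[h] += y
        curves.append([t / len(members) for t in total])
    return curves


# Hypothetical item with three options and six examinees; the surrogate
# could be, for example, the total number of correct items.
choices = [0, 0, 1, 2, 2, 2]
surrogate = [1, 2, 3, 4, 5, 6]
curves = binned_option_proportions(choices, surrogate, m_i=3, n_bins=2)
print(curves)
```

Each row of the result is a vector of proportions that lies within [0, 1] and sums to one, the two constraints any estimate of P_i must respect.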


Parametric and Nonparametric Modeling

Some thoughts are offered next as to the relative merits of parametric versus nonparametric modeling strategies. The essential problem in multinomial regression is to estimate the vector-valued probability function P_i. The principal challenges are to propose flexible or at least plausible classes of functions that respect the constraints that probabilities must lie within [0, 1] and sum to one. A secondary but often critical issue is computational convenience since N can be in the tens or hundreds of thousands. Finally, a number of statistical considerations lead to the need for P_i to have reasonably smooth first derivatives, and hence to be itself smooth. On the advantage side, parametric models are often motivated by some level of analysis of the psychological processes involved in the choice that an examinee makes when confronted with an item. The binary three-parameter logistic or 3PL model, for example, is essentially a mixture model motivated by the possibility that with a certain probability, denoted by parameter c, examinees will choose the correct of several options with certainty. Partly as a consequence of this psychological analysis, the parameter estimates can be useful descriptors of the data and thus provide numerical summaries of item characteristics that users can be taught to understand at some level. The use of a fixed and usually small number of parameters means that the machinery of classical mathematical statistics can be employed to provide interval estimates and hypothesis tests, although there is usually room for skepticism about the relevance of asymptotic theory in this context. Finally, some simple parametric models exhibit desirable statistical and mathematical properties, tempting one to wish that they were true or even to consider devising tests where they might be appropriate. On the down side, one has the ever-present concern about whether a parametric model offers an adequate account of the data. Most parametric models fail when the sample size gets large enough. But there is a very important counter-argument that should always be kept in mind: A simple wrong model can be more useful for many statistical purposes than a complex correct model. Bias is not everything in data analysis, and the reduction in sampling variance resulting from keeping the number of parameters small can often more than offset the loss due to increased bias. Many parametric models have been proposed for various binary regression problems, but the multinomial case, ordered or not, has proven more challenging. Existing models have tended to be within the logistic-normal class (Aitchison, 1986), to be Poisson process approximations (McCullagh and Nelder, 1989), or to be so specific in structure as to apply to only a very limited class of data. It is also not always easy to see how to extend a particular linear model to allow for new aspects of the problem, such as multidimensionality or the presence of observed covariates. A more subtle but nonetheless important argument against parametric models is their tendency to focus attention on parameters when in fact it is the functions themselves that are required. For example, the proper interpretation of estimates of parameters a, b, and c in the 3PL model requires a fair degree of statistical and mathematical sophistication, whereas other types of data display such as graphs can speak more directly to naive users of test analysis technology. A particular issue is the ontological status of θ. Parametric models are often motivated by appealing to the concept of a "latent trait" or something like "ability" that is imagined to underlie test performance. Index θ is understood by far too many users of test analysis technology as a type of measurement on an interval scale of an examinee's latent trait, a notion much favored by the widespread use of affectively loaded terms like "measure" and "instrument" to describe tests.
Although careful accounts of item response theory such as Lord and Novick (1968) caution against this misinterpretation, and affirm that 9 can be monotonically transformed at will even in the context of parametric models, there are far more passages in the journal and textbook literature that tout the interval scale properties of 9 as assessed in the context, for example, of the 3PL model as one of the prime motivations for using parametric item response theory. Finally, parametric models can be problematical on the computational side. The nonlinear dependency of P, on parameters in most widely used models implies iterative estimation even when 9 values are fixed, and these iterations can often converge slowly. A closely related problem is that two parameter estimates can have a disastrous degree of sampling correlation, as is the case for the guessing and discrimination parameters in the 3PL model, resulting in slow convergence in computation and large marginal sampling variances for even large sample sizes. The term "nonparametric" has undoubtably been over-used in the statistical literature, and now refers to many things besides rank- and countbased statistical techniques. Many of the approaches to be considered below


J.O. Ramsay

do involve parameter estimation. Perhaps the defining considerations are whether the parameters themselves are considered to have substantive or interpretive significance, and whether the number of parameters is regarded as fixed rather than at the disposal of the data analyst.

The main advantages to be gained by nonparametric models are flexibility and computational convenience. The fact that the number of parameters or some other aspect of the fitting process is available for control means that the analysis can be easily adapted to considerations such as varying sample sizes and levels of error variance. Moreover, the model usually depends on the parameters, when they are used, in a linear or quasi-linear manner, and this implies that parameter estimation can involve tried-and-true methods such as least squares. An appropriate choice of nonparametric procedure can lead to noniterative or at least rapidly convergent calculations.

Nonparametric techniques often lend themselves easily to extensions in various useful directions, just as the multivariate linear model can be extended easily to a mixture of linear and principal components style bilinear analysis. They can also adapt easily to at least smooth transformations of the index $\theta$, and have less of a tendency to encourage its over-interpretation.

The issue of how to describe the results of a nonparametric analysis can be viewed in various ways. The lack of obvious numerical summaries can be perceived as a liability, although appropriate numerical summaries of the function P can usually be constructed. On the other hand, nonparametric models encourage the graphical display of results, and the author's experience with the use of his graphically-oriented program TestGraf (Ramsay, 1993) by students and naive users has tended to confirm the communication advantage of the appropriate display over numerical summaries.

Presentation of the Models


This section contains an overview of nonparametric modeling strategies, and then focuses on a kernel smoothing approach. While specific examples are drawn from the author's work, there is no intention here to convey the impression that any particular approach is the best. This is a rapidly evolving field with great potential for developing new and better ideas.

Most nonparametric modeling strategies employ in some sense an expansion of the function to be modeled as a linear combination of a set of basis functions. That is, let $f(\theta)$ be the model, and let $x_k(\theta)$, $k = 1, \ldots, K$, be a set of $K$ known functions. Then the model is proposed to have the form

$$f(\theta) = \sum_{k=1}^{K} b_k x_k(\theta). \tag{1}$$

Familiar examples of basis functions $x_k$ are the monomials $\theta^{k-1}$, orthogonal polynomials, and 1, $\sin(\theta)$, $\cos(\theta)$, etc. Less familiar but very important in recent years are bases constructed from joining polynomial segments smoothly at fixed junction points, the polynomial regression splines (de Boor, 1978), and the various special forms that these have taken. Other types of basis functions will be mentioned below.

Associated with the choice of basis is the $N$ by $K$ matrix $X$, which contains the values of each basis function at each argument value. The coefficients $b_k$ of the linear combination are the parameters to be estimated, although the number of basis functions is to some extent also an estimable quantity. Let vector $b$ contain these coefficient values. The vector $Xb$ then contains the approximation values for the data being approximated. Once the coefficients have been estimated for a particular problem, there is usually a set of $Q$ evaluation points $\theta_q$, $q = 1, \ldots, Q$, often chosen to be equally spaced, at which the model function is to be evaluated for, among other reasons, plotting and display purposes. The function values are $Yb$, where the $Q$ by $K$ matrix $Y$ contains the basis functions evaluated at the evaluation points. As a rule $Q \ll N$, so it is the size of $N$ that dominates the computational aspects of the problem. The following discussion of strategies for nonparametric modeling will be within this linear expansion frame of reference.

What constitutes a good basis of functions for a particular problem? All of the following features are to some extent important:

Orthogonality. Ideally, the basis functions should be orthogonal in the appropriate sense. For example, if least squares analysis is used, then modern algorithms for computing the solution to a least squares fitting problem without orthogonality require on the order of $K^2 N$ floating point operations (flops), for large $N$, whereas if the columns of $X$ are orthogonal, only $KN$ flops are required. Moreover, rounding error in the calculations can be an enormous problem if matrix $X$ is nearly singular, a problem that orthogonality eliminates.

Local support. Because the independent variable values $\theta$ are ordered, it is highly advantageous to use basis functions which are zero everywhere except over a limited range of values. The cross-product matrix $X^{t}X$ is then band-structured, and the number of flops in the least squares problem is again of order $KN$. Regression splines such as B-splines and M-splines are very important tools because of this local support property (Ramsay, 1988). Local support also implies that the influence of a coefficient in the expansion is felt only over a limited range of values of $\theta$. This can imply a great deal more flexibility per basis function than can be achieved by non-local bases such as polynomials, and partly explains the popularity of regression splines. The importance of local support will be further highlighted below in discussing smoothing.

Appropriate behavior. Often of critical importance is the fact that the basis functions have certain behaviors that are like those of the functions being approximated. For example, we use sines and cosines in time series



analysis in part because their periodic behavior is strongly characteristic of the processes being analyzed. One particularly requires that the basis functions have the right characteristics in regions where the data are sparse or nonexistent. Polynomials, for example, become increasingly unpredictable or unstable as one moves toward more extreme values of the independent variable, and are therefore undesirable bases for extrapolation or for estimating behavior near the limits of the data. This issue is of particular importance in the multinomial regression problem, especially when the functions being approximated are probability functions.

In broad terms, nonparametric techniques can be cross-classified in two ways:

1. Do we model function P directly, or do we model some transformation of P, and then back-transform to get P?

2. If we use an expansion in terms of basis functions, is the number K of basis functions fixed by the data analyst, or is it determined by the data?

If the model function $f$ is intended to estimate a multinomial probability function $P_j$, then the probability constraints can pose some tough problems. Again, things are easier in the binary case because only one function per item is required, and it is only necessary that its value be kept within the interval [0,1]. In the multinomial case the $m_i$ probability functions must also be constrained to add to one, and this complicates the procedures. On the other hand, if the constraints can be effectively dealt with, the direct modeling of probability has the advantage of not getting too far from the data. Smoothing, least squares estimation, and even maximum likelihood estimation are estimation procedures with well-developed algorithms and often well-understood characteristics. Specific examples of these direct modeling approaches will be detailed below.

The alternative strategy is to bypass the constraints by transforming the problem to an unconstrained form.
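As a rough illustration of direct modeling with a fixed basis, the following Python sketch fits binary responses by ordinary least squares using a monomial basis. The simulated data, the logistic generating curve, the choice K = 4, and the helper names `polynomial_basis` and `fit_direct` are all assumptions made for this example and do not come from the chapter. The fitted polynomial is unconstrained, so its values may stray outside [0,1] near the extremes; the final clipping step is only one crude way of imposing the constraint.

```python
import numpy as np

def polynomial_basis(theta, K):
    """N-by-K matrix whose columns are the monomials theta**0, ..., theta**(K-1)."""
    return np.vander(np.asarray(theta, dtype=float), N=K, increasing=True)

def fit_direct(theta, y, K):
    """Least squares estimate of the K expansion coefficients b."""
    X = polynomial_basis(theta, K)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

# Hypothetical data: 500 examinees, binary responses from an assumed logistic curve.
rng = np.random.default_rng(0)
theta = np.sort(rng.standard_normal(500))
p_true = 1.0 / (1.0 + np.exp(-2.0 * theta))
y = (rng.random(500) < p_true).astype(float)

K = 4
b = fit_direct(theta, y, K)

# Evaluate the fitted function at Q equally spaced points (Q much smaller than N).
theta_eval = np.linspace(-3.0, 3.0, 51)
f_hat = polynomial_basis(theta_eval, K) @ b

# The raw fit is not guaranteed to respect the probability constraint.
f_clipped = np.clip(f_hat, 0.0, 1.0)
print("raw range:", f_hat.min(), f_hat.max())
```

Printing the raw range makes it easy to see whether the unconstrained fit has left the unit interval for a given sample.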
In the binary case, the log-odds transformation

$$H(\theta) = \ln \frac{P(\theta)}{1 - P(\theta)} \tag{2}$$

is widely used over a large variety of data analysis problems. This transformation is also the link function in the exponential family representation of the Bernoulli distribution, and is used to advantage in the generalized linear model or GLM algorithm. McCullagh and Nelder (1989) detail many applications of this technique, which has proven to be generally stable, simple to implement, remarkably flexible, and to generate many useful statistical summaries and tests. Since H is unbounded, there are no constraints to worry about, and expansions of H in terms of such standard bases as polynomials can be used. O'Sullivan et al. (1986) used smoothing techniques to develop what has come to be called a generalized additive model (GAM) for

binary regression problems, and Hastie and Tibshirani (1990) review this and other GAM models of potential interest to nonparametric modelers of psychometric data.

The log-odds transformation is not without its potential problems, however. The transformation presumes that the probabilities are bounded away from 0 and 1, and, consequently, estimates of H can become extremely unstable if the data strongly or completely favor 0 or 1 choice probabilities, which is not infrequently the case. In these cases, some form of conditioning procedure, such as the use of a prior distribution, is often used, and the computational edge over direct modeling can be rather reduced.

In the multinomial case, the counterpart of the log-odds transformation can take two forms (Aitchison, 1986). In the first, a specific category, which here will be taken for notational convenience to be the first, is chosen as a base or reference category, and the transformation is

$$H_h(\theta) = \ln \frac{P_h(\theta)}{P_1(\theta)}, \qquad h = 2, \ldots, m_i. \tag{3}$$

Of course, this transformation only works if the base probability $P_1$ keeps well clear of 0 or 1, and in the context of testing data, this may not be easy to achieve. Alternatively, the more stable transformation

$$H_h(\theta) = \ln \frac{P_h(\theta)}{\left[\prod_{l=1}^{m_i} P_l(\theta)\right]^{1/m_i}}, \qquad h = 1, \ldots, m_i, \tag{4}$$

can be used, where the denominator is the geometric mean of the probabilities. While zero probabilities still cause problems, one can usually eliminate these before computing the mean of the remainder. However, using the geometric mean means that one more function $H_h$ is produced than is strictly required, and some linear constraint such as fixing one function, or requiring the pointwise mean to be zero, is then needed.

The classical techniques for functional approximation have assumed that the data analyst pre-specifies the number K of basis functions on the basis of some experience or intuition about how much flexibility is required for the problem at hand. Various data-driven procedures can also be employed to guide the choice of K, and these are closely connected with the problem of variable selection in multiple regression analysis.

In recent years, an impressive variety of procedures collectively referred to as smoothing have found great favor with data analysts. A comparison of fixed basis versus smoothing procedures can be found in Buja et al. (1989), and the discussion associated with this paper constitutes a virtual handbook of functional approximation lore. The simplest of smoothing techniques is kernel smoothing, in which the expansion is of the form

$$f(\theta) = \sum_{j=1}^{N} y_j \, z\!\left(\frac{\theta_j - \theta}{\lambda}\right), \tag{5}$$




where $\theta$ is an evaluation point at which the value of function $f$ is to be estimated. A comparison of this expansion with Eq. (1) indicates that the coefficient $b_k$ has been replaced by the observed value $y_j$, and the basis function value for evaluation point $\theta$ is now

$$x_j(\theta) = z\!\left(\frac{\theta_j - \theta}{\lambda}\right). \tag{6}$$



Clearly, the problem of estimating the coefficients of the expansion has been completely bypassed, although the number of basis functions is now N. How can this work? Two features are required of function z, called a smoothing kernel:

1. z must be strongly localized as a function of $\theta$. This is because each $y_j$ effectively contributes or votes for a basis function proportional to itself at $\theta$, and only those values of y associated with values of $\theta_j$ close to $\theta$ are wanted to have any real impact on $f(\theta)$. This is because only these observations really convey useful information about the behavior of $f$ in this region. Parameter $\lambda$ plays the role of controlling how local z is.

2. The following condition is needed:

$$\sum_{j=1}^{N} z\!\left(\frac{\theta_j - \theta}{\lambda}\right) = 1. \tag{7}$$



This condition ensures that the expansion is a type of average of the values $y_j$, and along with the localness condition, implies that $f(\theta)$ will be a local average of the data. The second condition is easy to impose, once a suitable family of local functions has been identified.

Kernel smoothing local bases are usually developed by beginning with some suitable single function $K(u)$, of which the uniform kernel, $K(u) = 1$, $-1 < u < 1$, and 0 otherwise, is the simplest example. Another is the Gaussian kernel, $K(u) = \exp(-u^2/2)$. The basis functions are then constructed from K by shifting and rescaling the argument, so that (a) the peak is located at $\theta_q$, and (b) the width of the effective domain of z is proportional to $\lambda$. Finally the normalizing of z can be achieved by dividing by the sum of z values (called the Nadaraya-Watson smoothing kernel), or by other techniques (Eubank, 1988). Using Nadaraya-Watson kernel smoothing with a Gaussian kernel, the nonparametric estimate of the probability function at evaluation point $\theta_q$ is

$$\hat{P}(\theta_q) = \frac{\sum_{j=1}^{N} y_j \exp\left[-(\theta_j - \theta_q)^2 / (2\lambda^2)\right]}{\sum_{j=1}^{N} \exp\left[-(\theta_j - \theta_q)^2 / (2\lambda^2)\right]}, \tag{8}$$

where $y_j$ is the observed 0-1 response of examinee $j$.
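The Gaussian Nadaraya-Watson estimate just described can be sketched in a few lines of Python. This is a minimal illustration rather than the TestGraf implementation; the function name `nadaraya_watson` and its argument names are hypothetical.

```python
import numpy as np

def nadaraya_watson(theta, y, theta_eval, lam):
    """Gaussian Nadaraya-Watson estimate of a response curve.

    theta      : length-N array of examinee index values
    y          : length-N array of observed values (0/1 for a binary item)
    theta_eval : length-Q array of evaluation points
    lam        : smoothing parameter (bandwidth), must be positive
    """
    theta = np.asarray(theta, dtype=float)
    y = np.asarray(y, dtype=float)
    theta_eval = np.asarray(theta_eval, dtype=float)
    # Q-by-N matrix of scaled distances between evaluation points and data.
    u = (theta[None, :] - theta_eval[:, None]) / lam
    # Unnormalized Gaussian kernel values.
    w = np.exp(-0.5 * u ** 2)
    # Normalize each row so the weights sum to one.
    w /= w.sum(axis=1, keepdims=True)
    # Each estimate is a local weighted average of the observations.
    return w @ y

# Two observations placed symmetrically around the evaluation point
# receive equal weight, so the estimate is their plain average.
estimate = nadaraya_watson([-1.0, 1.0], [0.0, 1.0], [0.0], lam=0.3)
```

Because the weights in each row are nonnegative and sum to one, the estimate for binary data always lies in [0,1], which is why no separate constraint handling is needed with this approach.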



But where do the values $\theta$ come from? Here, the fact that these values are in effect any strictly monotone transformation of the ranks of the examinees is relied upon. The estimation process begins, as do most item response modeling techniques, with a preliminary ranking of examinees induced by some suitable statistic. For multiple choice exams, for example, this statistic can be the number of correct items, and for psychological scales it can be the scale score. Or it can come from a completely different test. The ranks thus provided are replaced by the corresponding quantiles of the standard normal distribution, and it is these values that provide $\theta$. Why the normal distribution? In part, because traditionally latent trait values have been thought of as roughly normally distributed. But in fact any target distribution could have served as well.

Flexibility is controlled by the choice of the smoothing parameter $\lambda$: the smaller $\lambda$, the smaller the bias introduced by the smoothing process, but the larger the sampling variance. Increasing $\lambda$ produces smoother functions, but at the expense of missing some curvature.

The great advantages of kernel smoothing are the spectacular savings in computation time, in that each evaluation of $f$ only requires N flops, and in the complexity of code. For example, a set of test data with 75 items and 18,500 examinees is processed by TestGraf (Ramsay, 1993) on a personal computer with a 486 processor in about 6 minutes. By contrast, commonly used parametric programs for dichotomous items take about 500 times as long, and polytomous response procedures even longer. Other advantages will be indicated below.

While the notion of using the observation itself as the coefficient in the expansion is powerful, it may be a little too simple. Kernel smoothing gets into trouble when the evaluation values $\theta_q$ are close to the extremes of $\theta$.
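The construction of $\theta$ values from a preliminary ranking, as described above, might look like the following Python sketch. Two details are assumptions of this example and are not specified in the text: tied scores receive the average of their ranks, and rank r among N examinees is mapped to the normal quantile at probability r/(N + 1). The function name `scores_to_theta` is hypothetical.

```python
import numpy as np
from statistics import NormalDist

def scores_to_theta(total_scores):
    """Replace the ranks of total scores by standard normal quantiles."""
    scores = np.asarray(total_scores, dtype=float)
    n = scores.size
    # Ranks 1..n in order of increasing score.
    order = np.argsort(scores, kind="stable")
    ranks = np.empty(n)
    ranks[order] = np.arange(1, n + 1)
    # Assumption: tied scores share the average of their ranks.
    for s in np.unique(scores):
        tied = scores == s
        ranks[tied] = ranks[tied].mean()
    # Assumption: rank r is mapped to the quantile at probability r / (n + 1).
    normal = NormalDist()
    return np.array([normal.inv_cdf(r / (n + 1)) for r in ranks])

# Three examinees with number-correct scores 3, 1, and 2.
theta = scores_to_theta([3, 1, 2])
```

Any other strictly monotone transformation of the ranks would serve equally well for the smoothing step; only the target distribution of $\theta$ changes.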
Local polynomial fitting is an extension of kernel smoothing that uses simple functions of the observations rather than $y_j$ as coefficients, and Hastie and Loader (1993) lay out the procedure's virtues. Other types of smoothing, such as spline smoothing, can be more computationally intensive, but bring other advantages.

The direct modeling of probability functions by kernel smoothing developed by Ramsay (1991) and implemented in the computer software TestGraf was motivated by the need to produce a much faster algorithm for analyzing polytomous response data. A large class of potential users of test analysis software work with multiple choice exams or psychological scales in university, school, and other settings where only personal computers are available. Their sample sizes are usually modest, meaning between 100 and 1000. Moreover, their level of statistical knowledge may be limited, making it difficult for them either to appreciate the properties of parametric models or to assess parameter estimates. On the other hand, direct graphical display of option response functions may be a more convenient means of summarizing item characteristics. Finally, in many applications, including those using psychological scales, the responses cannot be meaningfully reduced to dichotomous, with one response designated as "correct."







Goodness of Fit

Various diagnostics of fit are possible to assess whether an estimated response curve gives a reasonable account of a set of data. Most procedures appropriate for parametric models are equally applicable to nonparametric fits. Ramsay (1991, 1993) discusses various possibilities.

One useful technique illustrated in Ramsay (1991) is the plotting of the observed proportion of examinees passing an item among those whose total score is some fixed value x against the possible values of x. This is an empirical item-total score regression function. Each point in this plot can also be surrounded by pointwise confidence limits using standard formulas. At the same time, expected total score E(x) can be computed as a function of $\theta$, and subsequently the nonparametric estimates of probabilities plotted against expected total score. By overlaying these two plots, an evocative image of how well the estimated response curve fits the actual data is produced, and the user can judge if over-smoothing has occurred by noting whether the estimated curve falls consistently within the empirical confidence limits.

One great advantage offered by nonparametric methods is the fact that the fitting power is under user control. In the case of kernel smoothing, it is smoothing parameter $\lambda$ that permits any level of flexibility required in the curve. Of course, there is an inevitable trade-off between flexibility and sampling variance of the estimate, and another positive feature of nonparametric modeling is that this trade-off is made explicit in a parameter.

There seems to be no discernible loss in mean square error of estimation of the functions $P_i$ incurred by using kernel smoothing. Analysis of simulated data where the generating models were in the 3PL class produced fits which were at least as good as those resulting from using maximum likelihood estimation with the correct model. Of course, what is being fitted here is the function itself, and not parameter estimates.

With kernel smoothing approximation, in particular, it is also very simple to compute pointwise confidence regions for the true response curve. These are invaluable in communicating to the user the degree of precision that the data imply, and they are also helpful in deciding the appropriate level of smoothing. TestGraf along with an extensive manual is available on request from the author.
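The empirical item-total score regression function with pointwise confidence limits can be sketched as follows. The text refers only to "standard formulas"; this example assumes the normal-approximation binomial interval, p plus or minus 1.96 standard errors, truncated to [0,1]. The function name `item_total_regression` is hypothetical.

```python
import numpy as np

def item_total_regression(item, total, z=1.96):
    """Observed proportion passing an item at each total score, with limits.

    item  : length-N array of 0/1 responses to one item
    total : length-N array of total scores
    Returns a list of tuples (x, n, p, lower, upper), one per total score x.
    """
    item = np.asarray(item, dtype=float)
    total = np.asarray(total)
    rows = []
    for x in np.unique(total):
        group = item[total == x]
        n = group.size
        p = group.mean()
        # Assumption: normal-approximation interval for a binomial proportion.
        half_width = z * np.sqrt(p * (1.0 - p) / n)
        rows.append((x, n, p, max(p - half_width, 0.0), min(p + half_width, 1.0)))
    return rows

# Four examinees: two with total score 1, two with total score 2.
rows = item_total_regression(item=[0, 1, 1, 1], total=[1, 1, 2, 2])
```

With very few examinees at a given total score the approximate interval is wide or degenerate, so in practice adjacent score groups would likely need to be pooled before plotting.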



FIGURE 1. A fit by kernel smoothing to simulated data. (The solid curve indicates the smoothing estimate, and the dotted curve is the response curve which generates the data. The binary observations are indicated by the small vertical bars.)

Examples

Figure 1 gives an impression of how kernel smoothing works in practice. The smooth dashed line indicates a response curve that is to be estimated. The values of $\theta$ used for this problem were the 500 quantiles of the standard normal distribution, and for each such value, a Bernoulli 0-1 random variable was generated using the response curve value to determine the probability of generating one. The actual values of these random variables are plotted as small vertical bars. The solid line indicates the approximation resulting from smoothing these data with a Gaussian Nadaraya-Watson kernel smoother using $\lambda = 0.3$.

Figure 2 displays a test item from an administration of the Advanced Placement Chemistry Exam developed by Educational Testing Service and administered to about 18,500 candidates. The exam contained 75 multiple choice items, each with five options. The response of omitting the item was coded as a sixth option. The display was produced by TestGraf (Ramsay, 1993). The solid line is the correct option response function, and the vertical bars on this curve indicate pointwise 95% confidence regions for the true curve. It will be observed that the nearly linear behavior of this response function is not consistent with a logistic-linear parametric model such as the three-parameter logistic.
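A simulation in the spirit of the Figure 1 example can be sketched in Python. The generating response curve is not given in the text, so the three-parameter logistic form and its parameter values below are purely hypothetical, as is the convention of taking the j-th quantile at probability j/(N + 1). Only N = 500 and the smoothing parameter 0.3 come from the example itself.

```python
import numpy as np
from statistics import NormalDist

N = 500      # number of simulated examinees, as in the example
LAM = 0.3    # smoothing parameter, as in the example

# Assumption: the j-th of the N quantiles is taken at probability j / (N + 1).
normal = NormalDist()
theta = np.array([normal.inv_cdf(j / (N + 1)) for j in range(1, N + 1)])

def true_curve(t, a=1.7, b=0.0, c=0.2):
    """Hypothetical generating curve (3PL form); not the curve used in the chapter."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (t - b)))

# One Bernoulli 0-1 response per examinee.
rng = np.random.default_rng(1)
y = (rng.random(N) < true_curve(theta)).astype(float)

# Gaussian Nadaraya-Watson smooth at equally spaced evaluation points.
theta_eval = np.linspace(-3.0, 3.0, 61)
w = np.exp(-0.5 * ((theta[None, :] - theta_eval[:, None]) / LAM) ** 2)
w /= w.sum(axis=1, keepdims=True)
p_hat = w @ y

# Root mean square difference between the estimate and the generating curve.
rmse = float(np.sqrt(np.mean((p_hat - true_curve(theta_eval)) ** 2)))
print("RMSE over evaluation points:", rmse)
```

Rerunning with smaller or larger values of `LAM` shows the trade-off described earlier: a small value gives a wiggly, high-variance curve, and a large value flattens the curvature.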



22. A Functional Approach to Modeling Test Data






FIGURE 2. Response functions estimated by kernel smoothing for a multiple choice item, with probability plotted against ability. (The solid line is the response curve for the correct option and the dotted curves are for the incorrect options. The vertical bars on the solid curve indicate 95% pointwise confidence limits for the position of the true curve.)

Discussion

While the main nonparametric approach described in this chapter is based on kernel smoothing, many other approaches are possible. Monotone regression splines were investigated by Ramsay and Abrahamowicz (1989) and Ramsay and Winsberg (1991) for the modeling of dichotomous data. These curves are based on a set of basis functions called regression splines constructed from joining polynomial segments smoothly, and which can be designed to accommodate constraints such as monotonicity (Ramsay, 1988). Abrahamowicz and Ramsay (1991) applied regression spline modeling to polytomous response data.

A promising approach is the combination of spline smoothing with the GLM algorithm and statistical model. O'Sullivan et al. (1986) used polynomial smoothing splines to develop a nonparametric binary regression model. However, it can be very advantageous to adapt the nature of the spline to the character of the fitting problem, and Wang (1993) has developed a specialized smoothing spline which is very effective at not only estimating the response curve itself, but also its derivative and the associated information function. The increase in computation time over kernel smoothing does not seem to be prohibitive.

Response curve estimation in psychometric data analysis is typical of problems in which the essential task is to estimate a function. While parametric families can be a means to this end, they can also be too restrictive to capture features of actual data, and estimating a function via estimating parameters can bring other difficulties. By contrast, there are now a number of function estimation techniques which are fast, convenient, and arbitrarily accurate. Unless there are substantive reasons for preferring a particular parametric model, nonparametric estimation of the response curve may become the method of choice.

References

Abrahamowicz, M. and Ramsay, J.O. (1991). Multicategorical spline model for item response theory. Psychometrika 56, 5-27.
Aitchison, J. (1986). The Statistical Analysis of Compositional Data. London: Chapman and Hall.
Buja, A., Hastie, T., and Tibshirani, R. (1989). Linear smoothers and additive models (with discussion). The Annals of Statistics 17, 453-555.
Cox, D.R. and Snell, E.J. (1989). Analysis of Binary Data (2nd ed.). London: Chapman and Hall.
de Boor, C. (1978). A Practical Guide to Splines. New York: Springer-Verlag.
Eubank, R.L. (1988). Spline Smoothing and Nonparametric Regression. New York: Marcel Dekker, Inc.
Hastie, T. and Loader, C. (1993). Local regression: Automatic kernel carpentry (with discussion). Statistical Science 8, 120-143.
Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. London: Chapman and Hall.
Lord, F.M. and Novick, M.R. (1968). Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.
McCullagh, P. and Nelder, J.A. (1989). Generalized Linear Models. London: Chapman and Hall.
O'Sullivan, F., Yandell, B., and Raynor, W. (1986). Automatic smoothing of regression functions in generalized linear models. Journal of the American Statistical Association 81, 96-103.
Ramsay, J.O. (1988). Monotone regression splines in action (with discussion). Statistical Science 3, 425-461.
Ramsay, J.O. (1991). Kernel smoothing approaches to nonparametric item characteristic curve estimation. Psychometrika 56, 611-630.
Ramsay, J.O. (1993). TestGraf: A Program for the Graphical Analysis of Multiple Choice Test Data. Unpublished manuscript. McGill University, Montreal, Canada.
Ramsay, J.O. and Abrahamowicz, M. (1989). Binomial regression with monotone splines: A psychometric application. Journal of the American Statistical Association 84, 906-915.
Ramsay, J.O. and Winsberg, S. (1991). Maximum marginal likelihood estimation for semiparametric item analysis. Psychometrika 56, 365-379.
Wang, X. (1993). The analysis of test data using smoothing splines. Unpublished thesis, McGill University, Montreal, Canada.

Part V. Models for Nonmonotone Items

The first models in IRT were developed for responses to items in tests designed to measure such quantities as abilities, skills, knowledge, and intelligence. For these quantities, monotonicity of the response functions makes sense. Indeed, as explained in the chapters addressing nonparametric models in Section IV, one of the basic assumptions a set of dichotomous response data should meet is the one that states that the probability of a correct response is a monotonically increasing function of $\theta$. IRT shares the assumption of monotonicity with response models in the fields of psychophysics and bioassay (see Chapter 1).

However, the behavioral and social sciences do have a domain of measurement for which the assumption of monotonicity is unlikely to hold. This is the domain of attitudes, opinions, beliefs, and values. Usually, variables in this domain, for which this introduction uses the term "attitude" as a pars pro toto, are measured by instruments consisting of a set of statements for which persons are asked to indicate the extent to which they agree or disagree. It is a common experience that attitudes show that "extremes may meet." That is, it is not unusual to find persons with opposite attitudes agreeing with the same statements. For example, consider the statement, "The courts usually give fair sentences to criminals." It is likely that those persons who think the courts are too lenient in sentencing and those who think the courts give overly harsh sentences would be in agreement with respect to their opposition to the statement. Of course, their reasons for disagreeing with the statement are totally opposite to each other. This finding is a direct violation of the assumption of monotonicity.

The first successful attempt to establish a stochastic model for attitude measurement was made by Thurstone (1927). His model was also presented under the more provocative title Attitudes can be measured in Thurstone (1928).
Thurstone's model can only be used to infer the locations or scale values of attitude statements on an underlying unobserved scale from comparisons between pairs of statements. The model does not have any parameters for the attitudes of persons. Attitude measurement can therefore not take the form of model-based estimation but has to resort to ad hoc procedures for the inference of attitude measures from the scale values of statements the person agrees with. An attempt to place persons and statement locations jointly on a scale of



Coombs' (1964) method of unfolding. This method is able to "unfold" preferential choice data into a single continuum through the use of the principle that preferences between statements are guided by the distances between their locations and the person's location. The method has the advantage of jointly scaling persons and statements but is based on deterministic rules and does not allow for randomness in the response behavior of persons or in any other aspect of its underlying experiment. Therefore, though Coombs' paradigm was immediately recognized to expose the fundamental mechanism at work in preferential choice data, the need for a stochastic formulation of the mechanism has been felt for some time. A review of the attempts to do so is given in Bossuyt (1990, Chap. 1).

Some of these attempts have also tried to formulate these models for the experiment of direct responses of (dis)agreement to attitude statements rather than preferences between pairs of statements. Such models would make data collection more efficient because if n is the number of statements, only n responses rather than $\binom{n}{2}$ preferences are needed. At the same time, such models would have much in common with the models typically formulated in IRT. Formally, a stochastic model for direct responses to attitude statements in which the mechanism of folding is at work takes the familiar form of a mathematical function describing the relation between the probability of a certain response and an underlying unknown variable. The only difference with IRT models is that the function cannot be monotonic.

This section offers two chapters on models for nonmonotone items which have the well-known form of a response function. The chapter by Andrich uses the fact that response functions of the nonextreme categories in polytomous models also are nonmonotonic. His model is, in fact, the result of collapsing a 3-category rating scale model (see Section I) into a 2-category model for agree-disagree responses.
The PARELLA model by Hoijtink is motivated by Coombs' parallelogram analysis but specifies the response function directly as a Cauchy density function. To date, the chapters in this section present the only two models known to specify a response function for direct responses to attitude statements. However, additional reading is found in Verhelst and Verstralen (1993) who independently presented an equivalent form of the model in the chapter by Andrich.

References

Bossuyt, P. (1990). A Comparison of Probabilistic Unfolding Theories for Paired Comparison Data. New York, NY: Springer-Verlag.
Coombs, C.H. (1964). A Theory of Data. New York, NY: Wiley.
Thurstone, L.L. (1927). A law of comparative judgement. Psychological Review 34, 273-286. (Reprinted in Psychological Review 101, 266-270.)
Thurstone, L.L. (1928). Attitudes can be measured. American Journal of Sociology 33, 529-554.
Verhelst, N.D. and Verstralen, H.H.F.M. (1993). A stochastic unfolding model derived from the partial credit model. Kwantitatieve Methoden 42, 73-92.

23. A Hyperbolic Cosine IRT Model for Unfolding Direct Responses of Persons to Items

David Andrich

Introduction

The two main mechanisms for characterizing dichotomous responses of persons to items on a single dimension are the cumulative and the unfolding. In the former, the probability of a positive response is a monotonic function of the relevant parameters; in the latter, it is single-peaked. This chapter presents a unidimensional IRT model for unfolding. Figure 1 shows the response functions (RFs) of the probabilities of the responses, including the resolution of the negative response into its two constituent components. Table 1 shows a deterministic unfolding response pattern for five items.

Presentation of the Model

Although introduced by Thurstone (1927, 1928) for the measurement of attitude, the study of unfolding is associated with Coombs (1964), who worked within a deterministic framework in which the analysis becomes extremely complex when more than four or so items are involved. Probabilistic models have been introduced subsequently (e.g., Andrich, 1988; Davison, 1977; Post, 1992; van Schuur, 1984, 1989). In this chapter, a rationale that links the responses of persons to items directly to the unfolding mechanism through a graded response structure is used to construct an item response theory (IRT) model for unfolding. In addition to the measurement of attitude, unfolding models have been applied to the study of development along a continuum, e.g., in psychological development (Coombs and Smith, 1973); development in learning goals (Volet and Chalmers, 1992); social development (Leik and Matthews, 1968); in general preference studies (Coombs and Avrunin, 1977); and in political science (van Blokland-Vogelesang, 1991; van Schuur, 1987).

The model studied here is developed from first principles using a concrete example in attitude measurement. Consider a dichotomous response



TABLE 1. Deterministic Ideal Cumulative and Unfolding Response Patterns to Five Items.

Items
1      0 1 1 0 0 0 0 0
2      0 0 1 1 0 0 0 0
3      0 0 0 1 1 0 0 0
4      0 0 0 0 1 1 0 0
5      0 0 0 0 0 1 1 0



of Agree or Disagree of a person to the statement "I think capital punishment is necessary but I wish it were not." This statement appears in the example later in the chapter, and reflects an ambivalent attitude to capital punishment. If the person's location is also relatively ambivalent and therefore close to the location of the statement, then the person will tend to agree with the statement; if a person's location is far from that of the statement, either very much for or very much against capital punishment, then the probability of Agree will decrease and that of Disagree will increase correspondingly. This gives the single-peaked form of the response function for the Agree response shown in Fig. 1.

However, it is more instructive to consider the Disagree rather than the Agree response. Let the locations of person $j$ and statement $i$ be $\theta_j$ and $\delta_i$, respectively. Then formally, if $\theta_j \ll \delta_i$ or $\theta_j \gg \delta_i$, the probability of a Disagree response tends to 1.0. This reveals that the Disagree response occurs for two latent reasons: one because the person has a much stronger attitude against capital punishment, the other because the person has a much stronger attitude for capital punishment, than is reflected by the statement. The RFs with broken lines in Fig. 1 show the resolution of the single Disagree response into these two constituent components. This resolution shows that there are three possible latent responses that correspond to the two possible manifest responses of Agree/Disagree: (i) Disagree because persons consider themselves below the location of the statement (disagree below, DB); (ii) Agree because persons consider themselves close to the location of the statement (agree close, AC); and (iii) Disagree because persons consider themselves above the location of the statement (disagree above, DA). Furthermore, the probabilities of the two Disagree responses have a monotonic decreasing and increasing shape, respectively. This means that, as shown in Fig. 1, the three responses take the form of three graded responses. Accordingly, a model for graded responses can be applied to this structure. Because it is the simplest of models for graded responses, and because it can be expressed efficiently and simply in the case of three categories, the model applied is the Rasch (1961) model, which can take a number of forms, all equivalent to

$$\Pr\{y_{ij}\} = \frac{1}{\gamma_{ij}} \exp\Big[-\sum_{h=1}^{y_{ij}} \tau_{ih} + y_{ij}(\theta_j - \delta_i)\Big], \tag{1}$$

FIGURE 1. Response functions of the Agree Close (AC) and Disagree (D) responses and the resolution of the Disagree response into its constituent components: Disagree Below (DB) and Disagree Above (DA).

where $y_{ij} \in \{0, 1, 2, \ldots, m\}$ indicates the $m + 1$ successive categories beginning with 0; $\tau_{ih}$, $h = 1, \ldots, m$, are $m$ thresholds on the continuum dividing the categories; and $\gamma_{ij} = \sum_{y=0}^{m} \exp[-\sum_{h=1}^{y} \tau_{ih} + y(\theta_j - \delta_i)]$ is the normalizing factor which ensures that the sum of the probabilities is 1.0 (Andersen, 1977; Andrich, 1978; Wright and Masters, 1982). For $y = 0$ the sum over thresholds is empty and is taken as zero. Furthermore, without loss of generality (Andrich, 1978), it is taken that $\sum_{h=1}^{m} \tau_{ih} = 0$. The correspondence between the three responses in Fig. 1 and the random variable in Eq. (1), in which case $m = 2$, is as follows: $y = 0 \leftrightarrow$ DB; $y = 1 \leftrightarrow$ AC; $y = 2 \leftrightarrow$ DA.

It is evident that the RFs in Fig. 1 operate so that, when $\theta_j$ is close to $\delta_i$, the probability of the middle score of 1 (AC) is greater than the probability of either 0 (DB) or 2 (DA). This is as required in the unfolding mechanism. Now define the parameter $\lambda_i$ according to $\lambda_i = (\tau_{i2} - \tau_{i1})/2$. Because $\tau_{i1} + \tau_{i2} = 0$, this gives $\tau_{i1} = -\lambda_i$ and $\tau_{i2} = \lambda_i$. Then the location parameter $\delta_i$ is in the middle of the two thresholds, and $\lambda_i$ represents the distance from $\delta_i$ to each of the thresholds (Andrich, 1982). These parameters are also shown in Fig. 1. It is now instructive to write the probability of each RF explicitly in terms of the parameter $\lambda_i$:

$$\Pr\{y_{ij} = 0\} = \gamma_{ij}^{-1}, \tag{2a}$$

$$\Pr\{y_{ij} = 1\} = \gamma_{ij}^{-1} \exp[\lambda_i + (\theta_j - \delta_i)], \tag{2b}$$

$$\Pr\{y_{ij} = 2\} = \gamma_{ij}^{-1} \exp 2(\theta_j - \delta_i). \tag{2c}$$
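The three category probabilities of Eq. (2) can be sketched numerically. The following is a minimal illustration, not the author's program; the function name `graded_probs` and the example values are assumptions made here for the sake of the example.

```python
import math


def graded_probs(theta, delta, lam):
    """Probabilities of the latent responses (DB, AC, DA) as in Eq. (2).

    theta: person location, delta: statement location, lam: unit parameter.
    The name and signature are illustrative only.
    """
    d = theta - delta
    # Unnormalized terms for y = 0, 1, 2.
    terms = [1.0, math.exp(lam + d), math.exp(2.0 * d)]
    gamma = sum(terms)  # normalizing factor
    return [t / gamma for t in terms]


if __name__ == "__main__":
    # At theta = delta the middle category (AC) dominates when lam > 0.
    print(graded_probs(0.0, 0.0, 1.0))
    # At theta = delta - lam the DB and AC probabilities coincide.
    print(graded_probs(-1.0, 0.0, 1.0))
```

Evaluating the function at $\theta_j = \delta_i \pm \lambda_i$ shows the adjacent category probabilities becoming equal, which is the threshold property the text goes on to discuss.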

Because the thresholds define the points where the probability of each extreme response becomes greater than that of the middle response, the probabilities of, respectively, AC and DB, or AC and DA, are identical at $\delta_i \pm \lambda_i$. Thus, $\lambda_i$ is very much like the half-unit of measurement, such as half a centimeter, about the points marking off units on a ruler. For example, any object deemed to be located within half a centimeter on either side of a particular number of, say, $x$ centimeters will be declared to be $x$ centimeters long. Analogously, a location $\theta_j$ of person $j$ within $\pm\lambda_i$ of $\delta_i$ gives the highest probability for the Agree response, which reflects that the person is located close to the statement. This parameter is discussed further in Andrich and Luo (1993), and consistent with the interpretation presented briefly here, it is termed the unit parameter: it characterizes the natural unit of measurement of the statement.

Although the model characterizes the three implied responses of a person to a statement, there are, nevertheless, only two manifest responses, and, in particular, only one manifest Disagree response. That is, $y = 0$ and $y = 2$ are not distinguished in the data. To make the model correspond to the data, define a new random variable $U_{ij}$, which takes the values $u_{ij} = 0$ when $y_{ij} = 0$ or $y_{ij} = 2$, and $u_{ij} = 1$ when $y_{ij} = 1$. Then

$$\Pr\{u_{ij} = 0\} = \Pr\{y_{ij} = 0\} + \Pr\{y_{ij} = 2\}, \tag{3a}$$

$$\Pr\{u_{ij} = 1\} = \Pr\{y_{ij} = 1\}. \tag{3b}$$


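FIGURE 2. Response functions for four items: $\delta_1 = \delta_2$; $\delta_3 = \delta_4$; and $\lambda_1 = \lambda_3$; $\lambda_2 = \lambda_4$. (Vertical axis: $\Pr\{x = 1\}$.)

Inserting the explicit expressions of Eq. (2) into Eq. (3) gives

$$\Pr\{u_{ij} = 0\} = \gamma_{ij}^{-1}[1 + \exp 2(\theta_j - \delta_i)], \tag{4a}$$

$$\Pr\{u_{ij} = 1\} = \gamma_{ij}^{-1} \exp[\lambda_i + (\theta_j - \delta_i)], \tag{4b}$$

where $\gamma_{ij} = 1 + \exp[\lambda_i + (\theta_j - \delta_i)] + \exp 2(\theta_j - \delta_i)$. Because the random variable $U_{ij}$ is dichotomous, it is efficient to focus on only one of Eqs. (4a) or (4b). Focusing on Eq. (4b), and writing the normalizing factor $\gamma_{ij}$ explicitly gives

$$\Pr\{u_{ij} = 1\} = \frac{\exp[\lambda_i + (\theta_j - \delta_i)]}{1 + \exp[\lambda_i + (\theta_j - \delta_i)] + \exp 2(\theta_j - \delta_i)}. \tag{5}$$

Multiplying the numerator and denominator by $\exp[-(\theta_j - \delta_i)]$, and simplifying gives

$$\Pr\{u_{ij} = 1\} = \frac{\exp\lambda_i}{\exp\lambda_i + \exp(\theta_j - \delta_i) + \exp(-\theta_j + \delta_i)}. \tag{6}$$

Recognizing that the hyperbolic cosine is $\cosh(a) = [\exp(a) + \exp(-a)]/2$ gives

$$\Pr\{u_{ij} = 1\} = \frac{\exp\lambda_i}{\exp\lambda_i + 2\cosh(\theta_j - \delta_i)} \tag{7a}$$

and

$$\Pr\{u_{ij} = 0\} = 1 - \Pr\{u_{ij} = 1\} = \frac{2\cosh(\theta_j - \delta_i)}{\exp\lambda_i + 2\cosh(\theta_j - \delta_i)}. \tag{7b}$$

Equations (7a) and (7b) can be written as the single expression

$$\Pr\{u_{ij}\} = \gamma_{ij}^{-1} (\exp\lambda_i)^{u_{ij}} [2\cosh(\theta_j - \delta_i)]^{1 - u_{ij}}, \tag{8}$$

where $\gamma_{ij} = \exp\lambda_i + 2\cosh(\theta_j - \delta_i)$ is now the normalizing factor. For obvious reasons, the model of Eq. (8) is termed the hyperbolic cosine model (HCM). Figure 2 shows the RFs for $u_{ij} = 1$ for four statements, two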



each with the same unit $\lambda_i$ but different locations $\delta_i$, and two each with the same location $\delta_i$ but different $\lambda_i$. To consolidate the interpretation of the parameter $\lambda_i$, note that the greater its value, the greater the region in which a person is most likely to give an Agree response. In the case that $\lambda_i = \lambda$ for all $i$, Eq. (8) specializes to

$$\Pr\{u_{ij}\} = \gamma_{ij}^{-1} (\exp\lambda)^{u_{ij}} [2\cosh(\theta_j - \delta_i)]^{1 - u_{ij}}. \tag{9}$$



This is termed the simple hyperbolic cosine model (SHCM). Verhelst and Verstralen (1991) have presented effectively the same model but in a different form.
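The equivalence between the collapsed three-category form of Eq. (5) and the hyperbolic cosine form of Eq. (7a) can be checked numerically. The sketch below is illustrative only; the function names are hypothetical and the parameter values are arbitrary.

```python
import math


def hcm_prob_agree(theta, delta, lam):
    """Pr{u = 1} in the hyperbolic cosine form of Eq. (7a)."""
    return math.exp(lam) / (math.exp(lam) + 2.0 * math.cosh(theta - delta))


def collapsed_prob_agree(theta, delta, lam):
    """Pr{u = 1} written as the middle category of the graded model, Eq. (5)."""
    d = theta - delta
    numerator = math.exp(lam + d)
    return numerator / (1.0 + numerator + math.exp(2.0 * d))


if __name__ == "__main__":
    for theta in (-3.0, -0.5, 0.0, 1.2, 4.0):
        # The two columns should agree up to rounding error.
        print(theta,
              hcm_prob_agree(theta, 0.4, 0.8),
              collapsed_prob_agree(theta, 0.4, 0.8))
```

The cosh form makes the symmetry about $\delta_i$ explicit: persons at equal distances below and above the statement have the same probability of agreeing. For large $|\theta_j - \delta_i|$ the exponentials can overflow in floating point, so a production implementation would need a numerically safer formulation.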

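Parameter Estimation

Although the HCM is constructed from a Rasch model for graded responses, combining two categories destroys the distinguishing feature of the Rasch models, namely the existence of sufficient statistics for the person and statement parameters (Andersen, 1977). Therefore, methods of estimation that involve conditioning on sufficient statistics are not available. Other methods, including joint maximum likelihood (JML), the EM algorithm, and marginal maximum likelihood procedures, are being investigated. The procedure described here is the simplest of these, the JML. The likelihood of the matrix of responses of persons $j = 1, \ldots, J$ to statements $i = 1, \ldots, I$ is given by

$$L = \prod_{i=1}^{I} \prod_{j=1}^{J} \frac{(\exp\lambda_i)^{x_{ij}} [2\cosh(\theta_j - \delta_i)]^{1 - x_{ij}}}{\gamma_{ij}}, \tag{10}$$

where $x_{ij}$ is the observed value of $U_{ij}$, and provides the log-likelihood equation

$$\log L = \sum_{i}\sum_{j} x_{ij}\lambda_i + \sum_{i}\sum_{j} (1 - x_{ij}) \log[2\cosh(\theta_j - \delta_i)] - \sum_{i}\sum_{j} \log\gamma_{ij}. \tag{11}$$

Differentiating Eq. (11) partially with respect to each of the parameters $\theta$, $\delta$, $\lambda$, and equating to 0, provides the solution equations

$$\frac{\partial \log L}{\partial\lambda_i} = \sum_{j} (x_{ij} - p_{ij}) = 0, \quad i = 1, \ldots, I; \tag{12a}$$

$$\frac{\partial \log L}{\partial\delta_i} = \sum_{j} (x_{ij} - p_{ij}) \tanh(\theta_j - \delta_i) = 0, \quad i = 1, \ldots, I; \tag{12b}$$

$$\frac{\partial \log L}{\partial\theta_j} = \sum_{i} (p_{ij} - x_{ij}) \tanh(\theta_j - \delta_i) = 0, \quad j = 1, \ldots, J; \tag{12c}$$

where $p_{ij} = \Pr\{x_{ij} = 1\}$. In addition, the constraint $\sum_i \hat{\delta}_i = 0$ is imposed.

A Solution Algorithm

The Newton-Raphson algorithm is efficient in reaching an iterative solution to the implicit equations. The complete multivariate algorithm is not feasible because, as the number of persons increases, the matrix of second derivatives becomes too large to invert. The alternative algorithm commonly used in such situations is to take the statement parameters as fixed while the person parameters are improved individually to the required convergence criterion; the person parameters are then fixed, and the statement parameters are improved individually until the whole set of estimates converges. In that case, the algorithm takes the form

$$\lambda_i^{(q+1)} = \lambda_i^{(q)} - \frac{\partial \log L/\partial\lambda_i}{\psi_{\lambda\lambda}}, \quad i = 1, \ldots, I, \tag{13a}$$

$$\delta_i^{(q+1)} = \delta_i^{(q)} - \frac{\partial \log L/\partial\delta_i}{\psi_{\delta\delta}}, \quad i = 1, \ldots, I, \tag{13b}$$

$$\theta_j^{(q+1)} = \theta_j^{(q)} - \frac{\partial \log L/\partial\theta_j}{\psi_{\theta\theta}}, \quad j = 1, \ldots, J, \tag{13c}$$

where $\psi_{\lambda\lambda} = E[\partial^2 \log L/\partial\lambda_i^2]$; $\psi_{\delta\delta} = E[\partial^2 \log L/\partial\delta_i^2]$; $\psi_{\theta\theta} = E[\partial^2 \log L/\partial\theta_j^2]$, all evaluated at the estimates from iteration $q$. The hypothesis that $\lambda_i = \lambda$ for all $i$ can be tested using the usual likelihood ratio test. The full procedure and program are described in Luo and Andrich (1993).

Initial Estimates

The initial values $\lambda_i^{(0)}$ are simply taken as $\log 2$ because the model then simplifies to $\Pr\{x_{ij} = 1\} = 1/(1 + \cosh(\theta_j - \delta_i))$, a symmetric counterpart of the simple cumulative Rasch model (Andrich and Luo, 1993). Although no empirical case has yet been found, it is possible for an estimate of $\lambda_i$ to be negative, in which case the response curve would be U-shaped (Andrich, 1982), and it would provide immediate evidence that the response process does not give a single-peaked RF. The initial location values $\delta_i^{(0)}$ are obtained by setting all $\theta_j = 0$ and all $\lambda_i = \log 2$, giving, on simplification, $\delta_i^{(0)} = \cosh^{-1}((N - s_i)/s_i)$, where $N$ is the number of persons and $s_i = \sum_j u_{ij}$ is the total score of a statement. Because $\cosh\delta_i \ge 1$, if for some $i$, $(N - s_i)/s_i < 1$, then the minimum of these values is taken, and the difference $\alpha = 1 - \min_i[(N - s_i)/s_i]$ calculated to give

$$\delta_i^{(0)} = \cosh^{-1}[\alpha + (N - s_i)/s_i].$$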




The unfolding structure implies that statements at genuinely different locations can have the same total score statistic $s_i$. This difference can be taken into account by assigning a negative sign to the initial estimates of approximately half of the statements and a positive sign to the other half. There are two ways of choosing the statements that receive a negative sign, one empirical, the other theoretical, and of course, they can be used together. First, if data have an unfolding structure, then the traditional factor loadings on the first factor of a factor analysis unfold into positive and negative loadings (Davison, 1977). Second, a provisional ordering of the statements should be possible according to the theory governing the construction of the variable and the operationalization of the statements. The importance of the theoretical ordering will be elaborated upon in the discussion of fit between the data and the model. The initial estimates $\theta_j^{(0)}$ are obtained from the initial estimates of the statement locations $\delta_i^{(0)}$ according to $\theta_j^{(0)} = \sum_{i} u_{ij}\delta_i^{(0)}/I_j$, where $I_j$ is the number of items that person $j$ has with a score $u_{ij} = 1$. These are the average values of the statements to which each person agreed.
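The initial-estimate rules described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the program of Luo and Andrich (1993): the function name, the `negative_items` argument (the set of statements given a negative sign on empirical or theoretical grounds), the guard against rounding below 1 inside the inverse hyperbolic cosine, and the value 0.0 for persons who agree with no statement are all choices made here. Every statement is assumed to have a total score greater than zero.

```python
import math


def initial_estimates(responses, negative_items=()):
    """Initial lambda, delta, and theta values for the HCM.

    responses[j][i] is 0 or 1 for person j and statement i.
    negative_items holds the indices of statements whose initial
    location receives a negative sign (an illustrative convention).
    """
    n_persons = len(responses)
    n_items = len(responses[0])

    # Total score s_i of each statement (assumed > 0 here).
    scores = [sum(row[i] for row in responses) for i in range(n_items)]
    ratios = [(n_persons - s) / s for s in scores]

    # If some ratio is below 1, shift all ratios so the smallest equals 1.
    alpha = max(0.0, 1.0 - min(ratios))
    deltas = [math.acosh(max(1.0, alpha + r)) for r in ratios]
    deltas = [-d if i in negative_items else d for i, d in enumerate(deltas)]

    # Person start values: mean location of the statements agreed with.
    thetas = []
    for row in responses:
        agreed = [deltas[i] for i in range(n_items) if row[i] == 1]
        thetas.append(sum(agreed) / len(agreed) if agreed else 0.0)

    lambdas = [math.log(2.0)] * n_items
    return lambdas, deltas, thetas


if __name__ == "__main__":
    data = [[1, 1, 0], [0, 1, 1], [0, 1, 0], [1, 0, 0]]
    print(initial_estimates(data, negative_items={0}))
```

The sign assignment is left to the caller because, as the text explains, the total score alone cannot distinguish statements on opposite sides of the continuum.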

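Information and Standard Errors

According to maximum likelihood theory, large sample variances of the estimates $\hat{\lambda}_i$, $\hat{\delta}_i$, $\hat{\theta}_j$ are given by

$$\sigma^2_{\lambda} = -1/E[\psi_{\lambda\lambda}] = 1\Big/\sum_{j} p_{ij}(1 - p_{ij}), \tag{14}$$

$$\sigma^2_{\delta} = -1/E[\psi_{\delta\delta}] = 1\Big/\sum_{j} p_{ij}(1 - p_{ij}) \tanh^2(\theta_j - \delta_i), \tag{15}$$

$$\sigma^2_{\theta} = -1/E[\psi_{\theta\theta}] = 1\Big/\sum_{i} p_{ij}(1 - p_{ij}) \tanh^2(\theta_j - \delta_i), \tag{16}$$

respectively. Andrich and Luo (1993) show that the approximations are very good.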
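As an illustration of how the first derivative, the expected information, and the standard error fit together for a single person parameter, the sketch below performs one Newton-Raphson update of $\theta_j$ with the statement parameters held fixed. It assumes the derivative $\sum_i (p_{ij} - x_{ij})\tanh(\theta_j - \delta_i)$ and the information $\sum_i p_{ij}(1 - p_{ij})\tanh^2(\theta_j - \delta_i)$ that follow from the HCM of Eq. (8); the function names and example values are hypothetical, and the step fails if the information is zero (for example, when $\theta_j$ equals every $\delta_i$).

```python
import math


def prob_agree(theta, delta, lam):
    """Pr{x = 1} under the HCM, Eq. (8)."""
    return math.exp(lam) / (math.exp(lam) + 2.0 * math.cosh(theta - delta))


def person_log_lik(theta, x_row, deltas, lambdas):
    """Log-likelihood contribution of one person's response vector."""
    total = 0.0
    for x, d, lam in zip(x_row, deltas, lambdas):
        p = prob_agree(theta, d, lam)
        total += math.log(p) if x == 1 else math.log(1.0 - p)
    return total


def person_score_and_info(theta, x_row, deltas, lambdas):
    """First derivative of the log-likelihood and expected information."""
    score = 0.0
    info = 0.0
    for x, d, lam in zip(x_row, deltas, lambdas):
        p = prob_agree(theta, d, lam)
        t = math.tanh(theta - d)
        score += (p - x) * t
        info += p * (1.0 - p) * t * t
    return score, info


def newton_step(theta, x_row, deltas, lambdas):
    """One update of theta; returns the new value and its standard error.

    The expected second derivative equals -info, so the update
    theta - score / (-info) becomes theta + score / info.
    """
    score, info = person_score_and_info(theta, x_row, deltas, lambdas)
    return theta + score / info, 1.0 / math.sqrt(info)


if __name__ == "__main__":
    deltas = [-2.0, -1.0, 0.0, 1.0, 2.0]
    lambdas = [math.log(2.0)] * 5
    print(newton_step(0.5, [0, 0, 1, 1, 0], deltas, lambdas))
```

In the alternating scheme described in the text, such steps would be repeated for each person until convergence, after which analogous updates for $\delta_i$ and $\lambda_i$ would be applied with the person parameters fixed.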

Inconsistency of Parameter Estimates

From the work on estimation of parameters in the Rasch models (Andersen, 1973; Wright and Douglas, 1977) it is known that the JML estimates of the HCM are not consistent in the case of a fixed number of items and an unlimited increase in the number of persons. However, no assumptions regarding the distribution of person parameters need to be made, and this is particularly desirable in attitude measurement, where, for example, attitudes may be polarized and the empirical study of the shape of this distribution is important. With the number of items of the order of 15 or so, the inconsistency is noticeable (Andrich and Luo, 1993), but very small. However, in the case of the analysis of a particular sample of data, this effect is essentially a matter of scale in that the estimates are stretched relative to their real values. When the number of items is of the order of 100, the effect of the inconsistency is not noticeable. In the simple Rasch model, the estimates can be corrected effectively by multiplying them by $(I - 1)/I$ (Wright and Douglas, 1977), and it may be possible to work out a correction factor for the HCM. Alternative procedures, such as marginal maximum likelihood, may overcome the inconsistency without imposing undesirable constraints on the shape of the distribution. These possibilities are still being explored.

Goodness of Fit

In considering tests of fit between the data and a model, the perspective taken in this chapter is that the model is an explicit rendition of a theory, and that the data collection is in turn governed by the model. Therefore, to the degree that the data accord with the model, to that degree they confirm both the theory and its operationalization in the data collection, and vice versa. In the application of IRT, the model chosen is expected to summarize the responses with respect to some substantive variable in which the items reflect differences in degree on the latent continuum. Thus the aim should be to construct items at different locations on the continuum, and the relative order of the items becomes a hypothesis about the data. This theoretical ordering, taken as a hypothesis, is especially significant when the model that reflects the response process is single-peaked because, as already indicated, there are always two person locations that give the same probability of a positive response.

The hypothesis that the responses take a single-peaked form can be checked by estimating the locations of the persons and the statements, dividing the persons into class intervals, and checking whether the proportions of persons responding positively across class intervals take the single-peaked form specified by the model. This is a common general test of fit, and can be formalized as

$$\chi^2 = \sum_{i=1}^{I} \sum_{g=1}^{G} \frac{\big(\sum_{j \in g} x_{ij} - \sum_{j \in g} p_{ij}\big)^2}{\sum_{j \in g} p_{ij}(1 - p_{ij})}, \tag{17}$$

where $g = 1, \ldots, G$ are the class intervals. Asymptotically, as the number of persons and items increases, this statistic should approximate the $\chi^2$ distribution on $(G - 1)(I - 1)$ degrees of freedom. The power of this test of fit is governed by the relative locations of the persons and the statements: the greater the variation among these, the greater the power. Complementary to this formal procedure, a telling and simpler check is to display the responses of the persons to the statements when both are ordered according to their estimated locations. Then the empirical order




of the statements should accord with the theoretical ordering, which in the case of attitude statements is according to their affective values, and the matrix of responses should show the parallelogram form shown in Table 1. The study of the fit between the data and the model using the combination of these approaches, the theoretical and the statistical, is illustrated in the example.

Example

The example involves the measurement of an attitude to capital punishment using a set of eight statements originally constructed by Thurstone's (1927, 1928) methods and subsequently studied again by Wohlwill (1963) and Andrich (1988). The data involve the responses of 41 persons in a class in Educational Measurement at Murdoch University (Australia) in 1993, and the sample will be discussed again. Table 2 shows the statements, their estimated affective values, and standard errors under the HCM in which first all statements are assumed to have the same unit $\lambda$, and second, where the unit $\lambda_i$ is free to vary among statements. Table 3 shows the frequencies of all observed patterns, the estimated attitudes, and their standard errors. It may be considered that the sample is small. However, it has the advantage that the analyses can be studied very closely, and perhaps even more importantly, it shows that the model can be applied successfully even in the case of small samples. In this case, there is already strong theoretical and empirical evidence that the statements do work as a scale, and so the analysis takes a confirmatory role.

In Table 2 the statements are ordered according to their estimated affective values and in Table 3 the persons are likewise ordered according to the estimates of their attitudes. This helps in appreciating the definition of the variable and is a check on the fit. Table 2 shows that the ordering of the statements begins with an attitude that is strongly against capital punishment, passes through an ambivalent attitude, and ends with one that is strongly for capital punishment. In addition, the likelihood ratio test confirms that the unit value is the same across statements, and this is confirmed by the closeness of the estimates of the affective values of statements when equal and unequal units are assumed, and likewise for the estimates of the attitudes of the persons.

The ordering of the persons shows the required feature of 1's in a parallelogram around the diagonal. Figure 3 shows the distribution of the persons, and it is evident that it is bimodal. Even without any further checks on the fit, it should be apparent that the data are consistent with the model, and therefore this confirms the hypothesis of the ordering of the statements and their usefulness for measuring an attitude to capital punishment. The global test of fit according to Eq. (17), in which the sample is divided into three class intervals, has a value $\chi^2 = 18.00$, df = 14, $p > 0.21$, which confirms this impression.

TABLE 2. Scale Values of Statements about Capital Punishment from Direct Responses. (Column headings: Statement; Equal Unit $\lambda$; Estimated $\lambda_i$.)

Statements, in the order listed:
Capital punishment is not an effective deterrent to crime.
The state cannot teach the sacredness of human life by destroying it.
I don't believe in capital punishment but I am not sure it isn't necessary.
I think capital punishment is necessary, but I wish it were not.
Until we find a more civilized way to prevent crime, we must have capital punishment.
Capital punishment is justified because it does act as a deterrent to crime.
Capital punishment gives the criminal what he deserves.

Note: Likelihood ratio $\chi^2$ for $H_0$: $\lambda_i = \lambda$: $\chi^2 = 6.64$, df = 6, $p > 0.36$.
Copyright 1995, Applied Psychological Measurement, Inc. Reproduced by permission.

TABLE 3. Distribution of Attitude Scale Values.

Total   Response      Attitude estimate $\hat{\theta}$ ($\hat{\sigma}_\theta$)
score   pattern       Equal $\lambda$     Different $\lambda_i$
3       1 1 1 0 0 0   -9.70 (1.53)    -10.31 (2.42)
4       1 1 1 1 0 0   -6.50 (2.27)    -5.86 (2.31)
2       0 0 1 1 0 0   -2.32 (1.07)    -1.75 (1.5)
4       0 1 1 1 1 0   -2.32 (1.07)    -1.75 (1.5)
4       0 0 1 1 1 0   0.31 (1.01)     0.04 (1.05)
5       0 1 0 1 1 1   1.15 (1.06)     1.36 (1.04)
5       0 0 1 1 1 1   1.15 (1.06)     1.36 (1.04)
4       0 0 1 0 1 1   2.44 (1.06)     2.31 (1.10)
6       0 0 1 1 1 1   2.44 (1.06)     2.31 (1.10)
4       0 0 0 1 1 1   2.44 (1.06)     2.31 (1.10)
4       0 0 1 0 1 1   2.31 (1.10)     2.44 (1.06)
5       0 0 1 0 1 1   3.67 (1.28)     3.67 (1.22)
5       0 0 0 1 1 1   3.67 (1.28)     3.67 (1.22)
3       0 0 0 0 1 1   3.67 (1.28)     3.67 (1.22)
4       0 0 0 1 0 1   6.88 (2.81)     6.30 (2.31)
4       0 0 0 0 1 1   6.89 (2.81)     6.32 (2.31)
3       0 0 0 0 0 1   10.69 (1.57)    10.36 (2.01)

Further columns of the table (two response-pattern columns and the pattern frequencies), listed top to bottom:
0 0 0 0 1 0 1 1 1 1 0 1 1 1 1 1
0 0 0 0 0 1 0 0 1 0 1 1 1 0 1 1 1
7 1 1 1 1 2 1 2 1 1 1 3 2 1 10 1

Copyright 1995, Applied Psychological Measurement, Inc. Reproduced by permission.

FIGURE 3. Distribution of attitudes of persons and locations of statements in the sample.

In addition to the distribution of the persons, Fig. 3 shows the locations of the statements. Given the evidence that the data conform to the SHCM (i.e., the HCM with equal units), this figure summarizes the effort to simultaneously locate statements and persons on an attitude continuum when direct responses of persons to statements subscribe to the unfolding mechanism.

The original data involved 50 persons, but the papers of two persons had no responses, and those of three persons had incomplete responses. Although the models can be operationalized to cater for missing data, it was decided to use only complete data. It was surmised that, because the class had a number of students with non-English-speaking backgrounds, the persons who did not complete their papers might have been from such backgrounds. Lack of familiarity with the language of response is relevant in all assessment, but it has a special role in the measurement of attitude, where the statements often deliberately take the form of a cliche, catch phrase, or proverb, and it is not expected that a person ponders at length to make a response. Indeed, a relatively quick and spontaneous affective response, rather than a reasoned cognitive one, is required, and this generally requires fluency in the vernacular. The responses of the remaining 45 students with complete direct responses were then analyzed according to the model. The theoretical ordering was violated quite noticeably in that the statement "I don't believe in capital punishment but I am not sure it isn't necessary" was estimated to have a stronger attitude against capital punishment than the statement "Capital punishment is one of the most hideous practices of our time."
A close examination of the data showed that the responses of four persons had glaring anomalies, and they seemed to agree with statements randomly. It was surmised that these four persons may also have had trouble understanding these statements or were otherwise distracted. Therefore they, too, were removed, leaving the sample shown in Table 3. Although there is no space to describe the example in detail, responses from the same persons to the same statements were obtained according to a pairwise preference design and analyzed according to a model derived from the HCM. When the responses of these four persons were eliminated from the analysis of the pairwise preference data, the test of fit between the data and the model also improved (Andrich, in press). This confirmed that those four persons had responses whose validity was doubtful in both sets of data. The point of this discussion is that four persons in a sample of 45 were able to produce responses that provided an ordering of statements which grossly violated the theoretical ordering of two statements, and that, with a strong theoretical perspective, these four persons could be identified. Ideally, these persons, if they were known, would be interviewed regarding their responses. Then either it would be confirmed that they had trouble with the language or, if they did not, that they perhaps read the statements in some way different from the intended reading. It is by such a process of closer study that anomalies are disclosed, and in this case possible ambiguities in statements are understood.

Acknowledgments

The research on hyperbolic cosine models for unfolding was supported in part by a grant from the Australian Research Council. G. Luo worked through the estimation equations and wrote the computer programs for the analysis. Irene Styles read the chapter and made constructive comments. Tables 2 and 3 were reproduced with permission from Applied Psychological Measurement, Inc., 1995.

References

Andersen, E.B. (1973). Conditional inference for multiple choice questionnaires. British Journal of Mathematical and Statistical Psychology 26, 31-44.
Andersen, E.B. (1977). Sufficient statistics and latent trait models. Psychometrika 42, 69-81.
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika 43, 357-374.
Andrich, D. (1982). An extension of the Rasch model for ratings providing both location and dispersion parameters. Psychometrika 47, 105-113.
Andrich, D. (1988). The application of an unfolding direct-responses and pairwise preferences (Research Report No. 4). Murdoch, Australia: Murdoch University, Social Measurement Laboratory.
Andrich, D. (in press). Hyperbolic cosine latent trait models for unfolding direct responses and pairwise preferences. Applied Psychological Measurement.
Andrich, D. and Luo, G. (1993). A hyperbolic cosine latent trait model for unfolding dichotomous single-stimulus responses. Applied Psychological Measurement 17, 253-276.
Coombs, C.H. (1964). A Theory of Data. New York: Wiley.
Coombs, C.H. and Avrunin, G.S. (1977). Single-peaked functions and the theory of preference. Psychological Review 84(2), 216-230.
Coombs, C.H. and Smith, J.E.K. (1973). On the detection of structure in attitudes and developmental process. Psychological Review 80(5), 337-351.
Davison, M. (1977). On a metric, unidimensional unfolding model for attitudinal and developmental data. Psychometrika 42, 523-548.
Leik, R.K. and Matthews, M. (1968). A scale for developmental processes. American Sociological Review 33(1), 62-75.
Luo, G. and Andrich, D. (1993). HCMDR: A FORTRAN Program for analyzing direct responses according to hyperbolic cosine unfolding model (Social Measurement Laboratory Report No. 5). Western Australia: School of Education, Murdoch University.
Post, W.J. (1992). Nonparametric Unfolding Models: A Latent Structure Approach. Leiden: DSWO Press.
Rasch, G. (1961). On general laws and the meaning of measurement in psychology. In J. Neyman (ed.), Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability IV (pp. 321-334). Berkeley, CA: University of California Press.
Thurstone, L.L. (1927). A law of comparative judgement. Psychological Review 34, 273-286.
Thurstone, L.L. (1928). Attitudes can be measured. American Journal of Sociology 33, 529-554.
van Blokland-Vogelesang, R. (1991). Unfolding and Group Consensus Ranking for Individual Preferences. Leiden: DSWO Press.
van Schuur, W.H. (1984). Structure in Political Beliefs, a New Model for Stochastic Unfolding with Application to European Party Activists. Amsterdam: CT Press.
van Schuur, W.H. (1987). Constraint in European party activists' sympathy scores for interest groups: The left-right dimension as dominant structuring principle. European Journal of Political Research 15, 347-362.
van Schuur, W.H. (1989). Unfolding German political parties: A description and application of multiple unidimensional unfolding. In G. de Soete, H. Feger, and K.C. Klauer (eds.), New Developments in Psychological Choice Modelling (pp. 259-277). Amsterdam: North Holland.
Verhelst, N.D. and Verstralen, H.H.F.M. (1991). A stochastic unfolding model with inherently missing data. Unpublished manuscript, CITO, The Netherlands.
Volet, S.E. and Chalmers, D. (1992). Investigation of qualitative differences in university students' learning goals, based on an unfolding model of stage development. British Journal of Educational Psychology 62, 17-34.
Wohlwill, J.F. (1963). The measurement of scalability for noncumulative items. Educational and Psychological Measurement 23, 543-555.
Wright, B.D. and Douglas, G.A. (1977). Conditional versus unconditional procedures for sample-free item analysis. Educational and Psychological Measurement 37, 47-60.
Wright, B.D. and Masters, G.N. (1982). Rating Scale Analysis: Rasch Measurement. Chicago: MESA Press.

24 PARELLA: An IRT Model for Parallelogram Analysis

Herbert Hoijtink

Introduction

Parallelogram analysis was introduced by Coombs (1964, Chap. 15). In many respects it is similar to scalogram analysis (Guttman, 1950): Both models assume the existence of a unidimensional latent trait; both models assume that this trait is operationalized via a set of items indicative of different levels of this trait; both models assume that the item responses are completely determined by the location of person and item on the latent trait; both models assume that the item responses are dichotomous, that is, assume 0/1 scoring, indicating such dichotomies as incorrect/correct, disagree/agree, or dislike/like; and both models are designed to infer the order of the persons as well as the items along the latent trait of interest from the item responses.

The main difference between the models is the measurement model used to relate the item responses to locations of persons and items on the latent trait. The scalogram model is based on dominance relations between person and item:

$$U_{ij} = 1 \text{ if } (\theta_j - \delta_i) > 0, \qquad U_{ij} = 0 \text{ if } (\theta_j - \delta_i) \le 0, \tag{1}$$

where $U_{ij}$ denotes the response of person $j$, $j = 1, \ldots, N$, to item $i$, $i = 1, \ldots, n$, $\theta_j$ denotes the location of person $j$, and $\delta_i$ denotes the location of item $i$. The parallelogram model is based on proximity relations between person and item:

$$U_{ij} = 1 \text{ if } |\theta_j - \delta_i| \le \tau, \qquad U_{ij} = 0 \text{ if } |\theta_j - \delta_i| > \tau, \tag{2}$$

where $\tau$ is a threshold governing the maximum distance between $\theta$ and $\delta$ for which a person still renders a positive response. A disadvantage of both the scalogram and the parallelogram model is their deterministic nature: The item responses are completely determined by the distance between person and items. As a consequence, these models are not very useful for the analysis of empirical data. What is needed are probabilistic models that take into account that item responses are explained not only by the distance between person location and item location but also by random characteristics of either the person or the item. The logistic item response models (Hambleton and Swaminathan, 1985) can be seen as the probabilistic counterpart of the scalogram model. This chapter will describe the PARELLA model (Hoijtink, 1990; 1991a; 1991b; Hoijtink and Molenaar, 1992; Hoijtink et al., 1994), which is the probabilistic counterpart of the parallelogram model.

Most applications of the PARELLA model concern the measurement of attitudes and preferences. Some examples of such applications are the attitude with respect to nuclear power; the attitude with respect to capital punishment; political attitude; and preference for different quality/price ratios of consumer goods. PARELLA analyses are usually not finished with the estimation of the locations of the persons and the items. The estimated person locations can be used to investigate different hypotheses: Did the attitude toward gypsies in Hungary change between 1987 and 1992? Does the political attitude of Democrats and Republicans differ? Did the attitude toward safe sex change after an information campaign with respect to AIDS? Is the attitude in the car-environment issue related to income or age?

Presentation of the Model

The PARELLA model can be characterized by four properties that will be discussed next: unidimensionality; single-peaked item characteristic curves; local stochastic independence; and sample invariance of the item locations.

Unidimensionality

The PARELLA model assumes that the latent trait a researcher intends to measure is unidimensional. This implies that the items that are used to operationalize this latent trait differ only with respect to the level of the latent trait of which they are indicative. Furthermore, this implies that the only structural determinant of the item responses is a person's location on the latent trait, i.e., that there are no other person characteristics that have a structural influence on the responses. The representation resulting from an analysis with the PARELLA model is unidimensional since both persons and items are represented on the same scale or dimension. The locations of the items constitute (the grid of) the measurement instrument for the assessment of the trait. The locations of persons constitute the measures obtained.


FIGURE 1. Item characteristic curves as defined by the PARELLA model for items with location and γ-parameters (from left to right) of (−4, 0.5), (0, 1), and (4, 10). Vertical axis: probability of a positive response; horizontal axis: latent trait.

Single-Peaked Item Characteristic Curves

In the PARELLA model the distribution of each item response is a function of the distance between person and item on the latent trait, and random characteristics of either the person or the item. The relative importance of these structural and random components is reflected by the parameter γ. The larger γ, the more sensitive is the probability of the response to the distance between person and item:

$$P(U_{ij} = 1 \mid \theta_j, \delta_i, \gamma) = P_{ij} = 1 - Q_{ij} = \frac{1}{1 + |\theta_j - \delta_i|^{2\gamma}}. \tag{3}$$


In Fig. 1, the probability of a positive response in Eq. (3) (subsequently to be called item response function or IRF) is plotted as a function of θ for the following three δ-γ combinations: (−4, 0.5), (0, 1), and (4, 10). It can be seen that the probability of a positive response is single peaked. It decreases in the distance between person and item location, which is as it should be in a model where proximity relations are the main determinant of an item response. Note, furthermore, that the probability of a positive response equals one if θ = δ. Figure 1 also illustrates the function of the γ-parameter. From the IRF for γ = 10, it can be seen that a person's response is completely determined by the distance between the person and the item: If this distance is smaller than one, a person will respond 1; if this distance is larger than one, a person will respond 0. For γ-parameter values larger than 10, the PARELLA model is virtually identical to Coombs' deterministic parallelogram model.


24. PARELLA: An IRT Model for Parallelogram Analysis

Herbert Hoijtink

The IRFs for γ = 1 and γ = 0.5 show that the item response is no longer exclusively determined by the distance between person and item. There is a non-ignorable probability (increasing with decreasing γ) that persons located at a relatively large distance from the items will respond positively. Similarly, there is a non-ignorable probability that persons located close to an item will not respond positively. For γ = 0, the probability of a positive response in Eq. (3) is independent of the distance between person and item. This property implies that the item responses are completely random, and not (to at least some extent) determined by the locations of the persons and the items on the latent trait.
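The role of γ described above can be checked numerically. The following is a minimal sketch, assuming the PARELLA IRF has the form P = 1/(1 + |θ − δ|^(2γ)); the function name `parella_irf` and the distances at which the curve is evaluated are illustrative choices and are not taken from the PARELLA software.

```python
def parella_irf(theta, delta, gamma):
    """Probability of a positive response under the assumed PARELLA IRF.

    Assumed form: P = 1 / (1 + |theta - delta| ** (2 * gamma)).
    """
    distance = abs(theta - delta)
    return 1.0 / (1.0 + distance ** (2.0 * gamma))


# The three location/gamma combinations plotted in Fig. 1.
for delta, gamma in [(-4.0, 0.5), (0.0, 1.0), (4.0, 10.0)]:
    # Evaluate at the item location and at distances 0.5, 1, and 2 from it.
    values = [parella_irf(delta + d, delta, gamma) for d in (0.0, 0.5, 1.0, 2.0)]
    print(delta, gamma, [round(v, 4) for v in values])
```

Under this form the probability is one at distance zero for every γ > 0 and 0.5 at distance one regardless of γ. For γ = 10 the curve falls from nearly one to nearly zero around distance one, which is the near-deterministic behaviour discussed above, and for γ = 0 it is 0.5 everywhere.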


Local Stochastic Independence

The data typically analyzed with the PARELLA model consist for each person of a vector of 1's and 0's indicating positive and negative item responses, respectively. These data can be seen as realizations of a series of (differently distributed) Bernoulli variables. For a valid analysis of such data sets, the dependence structure between the random variables has to be taken into consideration. This can be done in two ways: Either (1) the multivariate distribution of the variables has to be specified completely; or (2) a random effects model has to be used. The latter option is followed, for example, in the logistic item response models and the PARELLA model. The rationale behind a random effects model is that the dependence between the response variables can be explained by a person-specific random component, i.e., by the location of the person on the latent trait. Within the context of item response models this principle is often identified using the label of "local stochastic independence." Local stochastic independence implies that, conditionally on the person location, the item responses are independent:

$$P(U_j = u_j \mid \theta_j) = \prod_{i=1}^{n} P_{ij}^{u_{ij}} Q_{ij}^{1 - u_{ij}}, \tag{4}$$

where $U_j$ denotes the response vector of person j.

Sample Invariance of the Item Locations

The PARELLA model contains two types of parameters: The item locations or fixed effects as well as the person locations or random effects. The model does not include an interaction between the fixed and the random effects. This has two implications that are relevant when the aim is to obtain measurements of some latent trait:

a. The absence of interaction implies that if the PARELLA model gives an adequate description of the response process, the item locations are the same for each person. In item response theory, this property is often called sample item invariance. As a consequence of this property, the estimators of the item locations do not have any (asymptotic) bias due to the sample of persons used.

b. Absence of interaction also implies that the person locations are the same for any (sub)set of items. This implies that any (sub)set of items can be used to measure a person without any bias due to the items used, provided the PARELLA model gives an adequate description of the response process for each item.

Other Parallelogram Models or Generalizations

The PARELLA model is not the only stochastic parallelogram model. Other formulations for the probability of a positive item response conditional on the locations of the person and the item have appeared in the literature. Verhelst and Verstralen (1993) and Andrich (this volume) suggest using






$$P(U_{ij} = 1 \mid \theta_j, \delta_i, \gamma) = \frac{\exp(\theta_j - \delta_i + \gamma)}{1 + \exp(\theta_j - \delta_i + \gamma) + \exp[2(\theta_j - \delta_i)]}. \tag{5}$$

In Fig. 2, the probability specified in Eq. (5) is plotted as a function of θ for the following three δ-γ combinations: (−4, 0.5), (0, 1), and (4, 2). It can be seen that the probability of a positive response is single peaked. It decreases in the distance between person and item location, which is as it should be in a model where proximity relations are the main determinant of an item response. There are two differences between the PARELLA model and the model in Eq. (5). In the latter, the probability of a positive response is smaller than 1 for θ = δ. Furthermore, the parameter γ has a different function. In the PARELLA model the parameter represents the relative importance of the distance between person and item. As can be seen in Fig. 2, the γ-parameter in Eq. (5) jointly influences the threshold and the maximum of the item characteristic curve; i.e., the larger γ, the wider the threshold and the larger the probability of a positive response. Still other formulations of single peaked item characteristic curves can be found in Andrich (1988), Munich and Molenaar (1990), and DeSarbo and Hoffman (1986). Others are easily constructed.

Both the PARELLA model and the model in Eq. (5) contain a location parameter for each person and item as well as a γ-parameter. One way to generalize these models would be to make the γ-parameter item specific. However, Verhelst and Verstralen (1993) show for Eq. (5) that, owing to a high correlation between the item specific γ and δ parameters, the model is not identified. For the PARELLA model no results with respect to this kind of generalization are available yet.
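The two differences between the PARELLA model and the model in Eq. (5) can be made concrete with a short numerical sketch. It assumes that Eq. (5) has the partial-credit-derived form exp(θ − δ + γ) / (1 + exp(θ − δ + γ) + exp(2(θ − δ))), which is algebraically equal to exp(γ) / (exp(γ) + 2 cosh(θ − δ)), and that the PARELLA IRF is 1/(1 + |θ − δ|^(2γ)). Both function names are hypothetical.

```python
import math


def eq5_irf(theta, delta, gamma):
    """Assumed form of Eq. (5), written as exp(g) / (exp(g) + 2*cosh(theta - delta))."""
    return math.exp(gamma) / (math.exp(gamma) + 2.0 * math.cosh(theta - delta))


def parella_irf(theta, delta, gamma):
    """Assumed PARELLA IRF: 1 / (1 + |theta - delta| ** (2 * gamma))."""
    return 1.0 / (1.0 + abs(theta - delta) ** (2.0 * gamma))


# The three location/gamma combinations plotted in Fig. 2.
for delta, gamma in [(-4.0, 0.5), (0.0, 1.0), (4.0, 2.0)]:
    peak_eq5 = eq5_irf(delta, delta, gamma)          # maximum of the Eq. (5) curve
    peak_parella = parella_irf(delta, delta, gamma)  # maximum of the PARELLA curve
    print(delta, gamma, round(peak_eq5, 4), peak_parella)
```

Under these assumptions the maximum of the Eq. (5) curve is exp(γ)/(2 + exp(γ)), which stays below one and grows with γ, whereas the PARELLA curve always reaches one at θ = δ.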




FIGURE 2. Item characteristic curves as defined by Eq. (5) for items with location and γ-parameters (from left to right) of (−4, 0.5), (0, 1), and (4, 2). Vertical axis: probability of a positive response; horizontal axis: latent trait.

Parameter Estimation

In this section, procedures for parameter estimation in the PARELLA model will be given. First, the problem of estimating the item parameters will be addressed, and, next, attention will be paid to the problem of estimating the person locations in the model.

Estimation of Item Parameters

Within item response models the basic elements in a sample are given by the response vectors U (Cressie and Holland, 1983). The marginal distribution of U can be written as follows:

$$P(U = u) = \int_{\theta} P(U \mid \theta)\, dG(\theta), \tag{6}$$

where g(θ) denotes the density function of θ. Using the property that a continuous density function can be fairly well approximated by a step function with a finite number of steps, Eq. (6) can be rewritten as

$$P(U = u) = \sum_{q=1}^{Q} P(U \mid B_q)\, \pi_q, \tag{7}$$

where the locations and the heights of the steps (weights) are denoted by B and π, respectively, both with index q = 1, ..., Q. An advantage of using the representation in Eq. (7) over the one in Eq. (6) is that it enables the estimation of the density function of θ without the necessity to make assumptions with respect to its shape. In other words, a nonparametric estimate of g(θ) will be obtained (Lindsay, 1983). The likelihood function of the parameters (δ, γ, π) conditional upon the sample of response vectors S and the step locations B is given by

$$\log L(\delta, \gamma, \pi \mid S, B) = \sum_{j=1}^{N} \log P(U_j = u_j) = \sum_{j=1}^{N} \log \sum_{q=1}^{Q} P(U_j \mid B_q)\, \pi_q. \tag{8}$$

Note that the locations of the steps, B, do not appear among the parameters. In the software for the PARELLA model, these locations are not estimated but chosen such that the weights for the first and the last location are in the interval (0.005, 0.025) with the other locations equally spaced in between (Hoijtink, 1990; Hoijtink et al., 1994). In this way, at least 95% of the density function of θ is located between the first and the last location. Note further that the parameters as they appear in Eq. (8) are only identified under the restrictions that

$$\sum_{i=1}^{n} \delta_i = 0 \tag{9}$$

and

$$\sum_{q=1}^{Q} \pi_q = 1. \tag{10}$$

If the data contain more than one sample (e.g., men and women, or pre- and post-measures), a sample structure, with samples indexed by g = 1, ..., G, can be incorporated in the model. The sum over samples of Eq. (8) under the restriction of equal item locations and equal γ parameters across samples but with sample-specific density functions of the person parameter, yields

$$\sum_{g=1}^{G} \sum_{j=1}^{N_g} \log \sum_{q=1}^{Q} P(U_j \mid B_q)\, \pi_q^g, \tag{11}$$

with restrictions

$$\sum_{i=1}^{n} \delta_i = 0$$

and, for g = 1, ..., G,

$$\sum_{q=1}^{Q} \pi_q^g = 1,$$

where $N_g$ is the number of persons in sample g.

The estimation procedure is an application of the Expectation/Conditional Maximization (ECM) algorithm (Meng and Rubin, 1993). In an iterative sequence across three stages, subsequently the item locations, the density function of the person locations (for one or more samples), and the parameter γ are updated. Within each stage, the EM algorithm (Dempster et al., 1977) is used to update the parameters of interest. In the M-step the values of the parameters that maximize the expected value of what Dempster et al. call the "complete data likelihood" (see below) have to be estimated. This is done conditionally upon the current estimates of the marginal posterior density function of the "missing data" (person location) for each person:

$$P(B_q \mid U_j) = \frac{P(U_j \mid B_q)\, \pi_q}{\sum_{q=1}^{Q} P(U_j \mid B_q)\, \pi_q}. \tag{12}$$

Let $\{\pi_{qj}\}$ be the set of probabilities in Eq. (12) for j = 1, ..., N, then the expected value of the complete data likelihood is given by

$$E[\log L_c] = \sum_{q=1}^{Q} \sum_{j=1}^{N} \sum_{i=1}^{n} \pi_{qj} \log P(U_{ij} = u_{ij} \mid B_q, \delta_i, \gamma) + \sum_{q=1}^{Q} \sum_{j=1}^{N} \pi_{qj} \log \pi_q. \tag{13}$$

Within each stage, the EM-algorithm iterates between the E-step and the M-step until the estimates of the current parameters have converged. Across all three stages, the ECM-algorithm iterates until all parameter estimates have converged. Standard errors of estimation are obtained by inversion of the Hessian matrix of the likelihood functions. For further details, the reader is referred to Hoijtink and Molenaar (1992) and Hoijtink et al. (1994).

A number of simulation studies were executed to determine the accuracy of the estimators of the item locations, the γ parameter, the nonparametric estimate of g(θ), and the standard errors for each of these estimators. There was strong evidence that all parameter estimators are unbiased and consistent. Furthermore, the simulation results indicated that the standard errors of the estimators were reasonably accurate. For some simulation results, see Hoijtink (1990, 1991a).

Estimation of Person Locations

Once the item locations, the γ parameter, and the density function of the person locations for one or more samples have been estimated, an estimate of each person's location and associated standard error of estimation can be obtained. The expected a posteriori estimator (EAP) (Bock and Aitkin, 1981) is the mean of the a posteriori density function of the person location conditional on the response vector U:

$$EAP = \sum_{q=1}^{Q} B_q\, P(B_q \mid U). \tag{14}$$

The error variance of the EAP estimate is

$$\sigma^2(EAP) = \sum_{q=1}^{Q} (B_q - EAP)^2\, P(B_q \mid U), \tag{15}$$

where, from Bayes theorem, it follows that

$$P(B_q \mid U) = \frac{P(U \mid B_q, \delta, \gamma)\, \pi_q}{\sum_{q=1}^{Q} P(U \mid B_q, \delta, \gamma)\, \pi_q}. \tag{16}$$

Software for the PARELLA Model


The PARELLA software is available from ProGAMMA, P.O. Box 841, 9700 AV Groningen, The Netherlands. The software has implemented the estimation procedures described above as well as the goodness-of-fit tests that will be described in the next section. Furthermore, it contains a number of diagnostic matrices (correlation matrix, conditional adjacency matrix) and goodness-of-fit tests that were developed and studied by Post (1992) for use with nonparametric parallelogram models, i.e., models that do not have a parametric specification of the item response function. The software comes with a comprehensive manual (Hoijtink et al., 1994) describing the theory underlying both the parametric and the nonparametric parallelogram models. The manual also describes the use of the software and the interpretation of the output. The program can handle 10,000 persons, 55 items, and 10 samples at the same time. It runs on PCs and compatibles and has a user-friendly interface to assist with the input of the data and the selection of the output options.
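To make the step-function computations in the estimation procedures concrete, the sketch below evaluates the marginal log-likelihood of Eq. (8) and the EAP estimate with its error variance (Eqs. (14) to (16)) for known item parameters. It is not the code of the PARELLA software. It assumes the IRF 1/(1 + |θ − δ|^(2γ)), and the item locations, γ, step locations, uniform weights, and response patterns are all made up for illustration.

```python
import math


def parella_irf(theta, delta, gamma):
    """Assumed PARELLA IRF: 1 / (1 + |theta - delta| ** (2 * gamma))."""
    return 1.0 / (1.0 + abs(theta - delta) ** (2.0 * gamma))


def pattern_likelihood(u, node, deltas, gamma):
    """P(U = u | B_q): product over items, using local stochastic independence."""
    like = 1.0
    for u_i, delta_i in zip(u, deltas):
        p = parella_irf(node, delta_i, gamma)
        like *= p if u_i == 1 else 1.0 - p
    return like


def marginal_loglik(patterns, nodes, weights, deltas, gamma):
    """Sum over persons of log sum_q P(U_j | B_q) * pi_q, as in Eq. (8)."""
    total = 0.0
    for u in patterns:
        total += math.log(sum(pattern_likelihood(u, b, deltas, gamma) * w
                              for b, w in zip(nodes, weights)))
    return total


def eap(u, nodes, weights, deltas, gamma):
    """Posterior mean and variance of the person location over the step locations."""
    joint = [pattern_likelihood(u, b, deltas, gamma) * w for b, w in zip(nodes, weights)]
    norm = sum(joint)
    posterior = [x / norm for x in joint]           # Bayes theorem, Eq. (16)
    estimate = sum(b * p for b, p in zip(nodes, posterior))
    variance = sum((b - estimate) ** 2 * p for b, p in zip(nodes, posterior))
    return estimate, variance, posterior


# Hypothetical setup: 16 equally spaced steps with uniform weights, three items.
nodes = [-3.75 + 0.5 * q for q in range(16)]
weights = [1.0 / len(nodes)] * len(nodes)
deltas = [-1.5, 0.0, 1.5]
gamma = 1.0
patterns = [(1, 0, 0), (1, 1, 0), (0, 1, 1), (0, 0, 1)]

print("log-likelihood:", round(marginal_loglik(patterns, nodes, weights, deltas, gamma), 4))
for u in patterns:
    estimate, variance, _ = eap(u, nodes, weights, deltas, gamma)
    print(u, round(estimate, 3), round(variance, 3))
```

In a full ECM run the posterior weights computed inside `eap` would play the role of the probabilities in Eq. (12), and the item locations, step weights, and γ would be updated in turn; that iteration is omitted here.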

Goodness of Fit As noted earlier, the PARELLA model has the four important properties of unidimensionality, single peaked item response functions, local stochastic independence, and sample invariance of item locations. In this section goodness-of-fit tests will be discussed that address the sample invariance of item locations and the single peakedness of the item response functions.

Tests for Sample Invariance of the Item Locations

Two procedures for testing the invariance of the item locations and γ across samples g = 1, ..., G are presented. In the first procedure the hypothesis of invariance is tested simultaneously for all parameters involved. Then, for diagnostic purposes, a separate test for each individual parameter is discussed.

The simultaneous hypotheses of sample invariance of the item and γ parameters are formulated as

$$H_0: \delta^1 = \cdots = \delta^G \quad \text{and} \quad H_0: \gamma^1 = \cdots = \gamma^G. \tag{17}$$

These hypotheses can be tested by a likelihood-ratio statistic consisting of the product of the unconstrained likelihood functions in Eq. (8) across the samples ($L_1$) over this product with item and γ parameters constrained to be equal across the samples ($L_2$):

$$LR = -2 \log(L_2 / L_1).$$

Under the hypotheses in Eq. (17), LR is asymptotically chi-squared distributed with (G − 1)n degrees of freedom (Hoijtink and Molenaar, 1992). If LR is significant, the hypotheses of sample invariant item and γ parameters have to be rejected. The next step is to determine which parameter is sample dependent and to remove that item from the item set. The individual hypotheses of sample invariant locations are formulated as

$$H_0: \delta_i^1 = \cdots = \delta_i^G, \tag{19}$$

where i runs from 1 to n across the tests. In addition, the hypothesis of invariant γ parameters,

$$H_0: \gamma^1 = \cdots = \gamma^G,$$

has to be tested separately. These hypotheses of invariant locations can be tested using the following Wald statistics:

$$Y_i = \sum_{g=1}^{G} \frac{(\hat\delta_i^g - \bar\delta_i)^2}{\hat\sigma^2(\hat\delta_i^g)},$$

where $\hat\delta_i^g$ is the estimate of the location of item i in sample g, $\hat\sigma^2(\hat\delta_i^g)$ is its estimated error variance, and $\bar\delta_i$ is the mean of these estimates across samples, weighted by the inverse of their error variances.

The Wald statistic for the γ parameter is obtained by substituting the estimates of this parameter and their standard errors. Under the null hypothesis each Wald statistic is asymptotically chi-squared distributed with G − 1 degrees of freedom (Hoijtink and Molenaar, 1992). Since the Wald statistics are correlated, usually only the item with the largest value for the statistic is removed from the item set, after which the procedure is repeated for the reduced item set.

Test for Single Peakedness of the Item Response Function

The adequacy of the item response function in the PARELLA model for the data at hand can be tested for each item through a comparison of the latter with the empirical function. This is done with the ICC statistic. The null distribution of the statistic is unknown but some relevant percentiles were determined via a simulation study (95th = 0.65, 99th = 0.78) (Hoijtink, Molenaar, and Post, 1994). The ICC statistic, Eq. (23), is defined for each item i = 1, ..., n in terms of the differences, across the nodes q = 1, ..., Q, between the empirical number of persons at node q giving a positive response to item i, estimated from the data, and this number as predicted by the PARELLA model. If the differences in Eq. (23) are small, the PARELLA model provides an adequate description of the response process for the item at hand: the empirical choice proportions will be small at nodes located at a large distance from the item at hand, and large at nodes at a small distance.
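The likelihood-ratio and Wald tests for sample invariance can be sketched in a few lines. The sketch rests on assumptions: the Wald statistic is taken in its standard form, a precision-weighted sum of squared deviations from the pooled estimate, which may differ in detail from the statistic implemented in the PARELLA software. The log-likelihood values and standard errors in the example are hypothetical and were chosen only to reproduce an LR of 214. The helper `chi2_sf` uses the closed-form series for the chi-square upper tail with integer degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df."""
    if x <= 0.0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r=0}^{df/2-1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= half / r
            total += term
        return math.exp(-half) * total
    # Odd df: normal tail plus a finite series in odd powers of sqrt(x).
    chi = math.sqrt(x)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    density = math.exp(-half) / math.sqrt(2.0 * math.pi)
    return math.erfc(chi / math.sqrt(2.0)) + 2.0 * density * total


def likelihood_ratio_test(loglik_constrained, loglik_unconstrained, n_items, n_samples):
    """LR = -2 log(L2 / L1) with (G - 1) * n degrees of freedom."""
    lr = -2.0 * (loglik_constrained - loglik_unconstrained)
    df = (n_samples - 1) * n_items
    return lr, df, chi2_sf(lr, df)


def wald_invariance(estimates, std_errors):
    """Assumed Wald statistic for equality of one parameter across samples."""
    precisions = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(precisions, estimates)) / sum(precisions)
    y = sum(w * (e - pooled) ** 2 for w, e in zip(precisions, estimates))
    df = len(estimates) - 1
    return y, df, chi2_sf(y, df)


# Hypothetical log-likelihoods giving LR = 214 for 12 items and 2 samples.
print(likelihood_ratio_test(-4107.0, -4000.0, n_items=12, n_samples=2))
# Two location estimates of one item; the standard errors are made up.
print(wald_invariance([-2.0, -3.9], [0.3, 0.4]))
```

For two samples the assumed Wald statistic reduces to the squared difference of the estimates divided by the sum of their error variances, which gives a quick check on the computation.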

The data for this example were collected by Rappoport et al. (1986), and concern the degree of agreement with 12 political measures (see Table 1) by 697 party activists from Missouri in the United States. The original ratings of 1-5 (agree strongly to disagree strongly) were recoded to dichotomous ratings so that the data could be analyzed with the PARELLA model. In doing so, the ratings 1 and 2 were recoded as 1, whereas the ratings 3, 4, and 5 were recoded as 0. The sample consisted of the following two subsamples: Democrats (N = 317) and Republicans (N = 380). The main question to be answered with the help of the PARELLA model is whether the 12 political measures were indicative of a latent trait ranging from a Democratic to Republican political attitude. This will only be the case if the proximity relations between the activists and the measures are primarily determined by the evaluation of the measures, that is, if being a Democrat (Republican) tends to coincide with agreement with the political measures located on the Democratic (Republican) end of the attitude dimension.

TABLE 1. Results of the separate and joint analyses of the Democratic and Republican samples.

                                       Democrats               Republicans         Joint
Political measure                  Location  Prop.  ICC    Location  Prop.  ICC     ICC
National health insurance            -2.0    0.45   0.90     -3.9    0.06   0.25    2.2
Affirmative actions program          -1.8    0.39   0.11     -1.8    0.17   0.39    0.5
Ratification of SALT II              -1.7    0.39   0.19     -3.6    0.05   0.33    1.9
Wage and price controls              -1.2    0.39   0.33     -1.4    0.17   0.36    1.3
U.S. military in Middle East         -0.5    0.48   0.94      1.4    0.67   0.20    2.1
Draft registration                    0.0    0.70   0.50      1.3    0.70   0.18    1.8
Increase defense spending             0.6    0.51   0.70      0.9    0.90   0.06    1.5
Spending cuts/balancing budget        1.1    0.38   0.50      1.2    0.80   0.23    0.6
Nuclear power                         1.2    0.35   0.75      1.3    0.70   0.39    1.3
Reduce inflation                      1.3    0.34   0.49      1.5    0.61   0.14    1.2
Amendment banning abortion            1.3    0.35   0.53      1.7    0.52   0.78    1.7
Deregulation of oil and gas           1.6    0.28   0.71      1.4    0.67   0.21    0.7
γ parameter
Mean of g(θ)                         -0.53                    0.81
Variance of g(θ)                      2.61                    0.55

In Table 1, the results of separate and joint analyses of the Democratic and Republican samples are presented. It is clear from the results for the test of the single peakedness of the item response function (column ICC) that within each subsample proximity relations between activists and measures are the main determinant of the evaluations of the measures, since most ICC values are smaller than the critical value associated with the 99th percentile of the null distribution (0.78). However, the last column of the table shows that the single peakedness assumption does not hold for the complete sample because most ICC values were larger than 0.78. Such a result usually implies that the locations of the measures differ between the two samples, i.e., if both Democrats and Republicans had to order the measures from "Very Democratic" to "Very Republican" each group would give a different order. Comparing the estimates of the locations of the political measures between Democrats and Republicans (see Table 1), this result is indeed found. To be sure that the differences are not due to sample fluctuations, the likelihood ratio test for sample invariance was computed. The result leaves no room for doubt (LR = 214, DF = 12, p = 0.00). It is not uncommon to find that a limited number of items/measures causes this problem. However, even after removal of 6 of the 12 measures from the analysis, the results did not improve.

Since the Democrats and the Republicans used different definitions for the political attitude dimension, their locations cannot be compared. Nevertheless, a number of interesting observations can be made. For example, in Table 1 it can be seen that the average political attitude of the Democrats equaled -0.53. This average is located toward a number of political measures of a social nature. The average of the Republicans equaled 0.81. This average is located toward a number of political measures of a more economic nature. Thus, although these numbers cannot directly be compared, they at least suggest that the Democrats indeed were more Democratic than the Republicans. For the variances of the distribution of the political attitudes within each sample, we found the value 2.61 for the Democrats, and 0.55 for the Republicans. This result implies that the Republicans constitute a much more homogeneous group with respect to the political attitude than the Democrats.

In Table 2, estimates of the full density functions of the political attitudes are presented for both the Democrats and the Republicans. As can be seen, the inhomogeneity of the Democrats is mainly caused by a group of activists located around 3.75. This value actually represents a very undemocratic location. From the first column in Table 1, it can be seen that this group is located much closer to the economic measures than to the more social measures.

TABLE 2. Nonparametric Estimate of g(θ).

Step Location    Democrat Weight    Republican Weight
   -3.75              0.04               0.00
   -2.75              0.12               0.01
   -1.75              0.14               0.00
   -0.75              0.05               0.00
    0.25              0.60               0.34
    1.25              0.00               0.61
    2.25              0.00               0.01
    3.25              0.02               0.00
    4.25              0.03               0.03
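The mean and variance of a step-function density follow directly from its step locations and weights. The sketch below computes them from the rounded weights in Table 2; the function name `step_moments` is an illustrative choice. Because the tabled weights are rounded to two decimals, the resulting moments should be expected to agree only roughly with the means and variances reported in Table 1.

```python
def step_moments(nodes, weights):
    """Mean and variance of a discrete (step-function) density."""
    total = sum(weights)  # renormalize in case the rounded weights do not sum to 1
    mean = sum(b * w for b, w in zip(nodes, weights)) / total
    variance = sum(w * (b - mean) ** 2 for b, w in zip(nodes, weights)) / total
    return mean, variance


# Step locations and weights as printed in Table 2.
nodes = [-3.75, -2.75, -1.75, -0.75, 0.25, 1.25, 2.25, 3.25, 4.25]
democrat = [0.04, 0.12, 0.14, 0.05, 0.60, 0.00, 0.00, 0.02, 0.03]
republican = [0.00, 0.01, 0.00, 0.00, 0.34, 0.61, 0.01, 0.00, 0.03]

for label, weights in [("Democrats", democrat), ("Republicans", republican)]:
    mean, variance = step_moments(nodes, weights)
    print(label, round(mean, 2), round(variance, 2))
```

The qualitative picture matches the discussion of the example: the Democratic density has a negative mean and a clearly larger variance than the Republican one.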

Discussion

Occasionally, it is found that the parametrization chosen for the PARELLA model is not flexible enough to model item responses. For such cases the use of a nonparametric parallelogram model (Post, 1992; Van Schuur, 1989) is recommended. This model only assumes single peakedness of the item response function. As a nonparametric parallelogram model only orders the persons and items along the latent trait, and does not offer a representation at interval level, it may fail to meet the demands of the researcher interested in applying parallelogram models. Currently, parametrizations more flexible than Eqs. (3) and (5) are under investigation. Some of these parametrizations contain more parameters than just the person and item locations and the γ parameter.

References

Andrich, D. (1988). The application of an unfolding model of the PIRT type to the measurement of attitude. Applied Psychological Measurement 12, 33-51.
Bock, R.D. and Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM-algorithm. Psychometrika 46, 443-459.
Coombs, C.H. (1964). A Theory of Data. Ann Arbor: Mathesis Press.
Cressie, N. and Holland, P.W. (1983). Characterizing the manifest probabilities of latent trait models. Psychometrika 48, 129-141.
Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977). Maximum likelihood estimation from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39, 1-38.
DeSarbo, W.S. and Hoffman, D.L. (1986). Simple and weighted unfolding threshold models for the spatial representation of binary choice data. Applied Psychological Measurement 10, 247-264.
Guttman, L. (1950). The basis for scalogram analysis. In S.A. Stouffer et al. (eds.), Measurement and Prediction (pp. 60-90). New York: Wiley.
Hambleton, R.K. and Swaminathan, H. (1985). Item Response Theory: Principles and Applications. Boston: Kluwer-Nijhoff Publishing.
Hoijtink, H. (1990). A latent trait model for dichotomous choice data. Psychometrika 55, 641-656.
Hoijtink, H. (1991a). PARELLA: Measurement of Latent Traits by Proximity Items. Leiden: DSWO-Press.
Hoijtink, H. (1991b). The measurement of latent traits by proximity items. Applied Psychological Measurement 15, 153-169.
Hoijtink, H. and Molenaar, I.W. (1992). Testing for DIF in a model with single peaked item characteristic curves: The PARELLA model. Psychometrika 57, 383-397.
Hoijtink, H., Molenaar, I.W., and Post, W.J. (1994). PARELLA User's Manual. Groningen, The Netherlands: iec ProGAMMA.
Lindsay, B.G. (1983). The geometry of mixture likelihoods: A general theory. The Annals of Statistics 11, 86-94.
Meng, X. and Rubin, D.B. (1993). Maximum likelihood via the ECM algorithm: A general framework. Biometrika 80, 267-279.
Munich, A. and Molenaar, I.W. (1990). A New Approach for Probabilistic Scaling (Heymans Bulletin HB-90-999-EX). Groningen: Psychologische Instituten, Rijksuniversiteit.
Post, W.J. (1992). Nonparametric Unfolding Models: A Latent Structure Approach. Leiden: DSWO-Press.
Rappoport, R.B., Abramowitz, A.I., and McGlennon, J. (1986). The Life of the Parties. Lexington: The Kentucky University Press.
van Schuur, W.H. (1989). Unfolding the German political parties. In G. de Soete, H. Feger, and K.C. Klauer (eds.), New Developments in Psychological Choice Modeling. Amsterdam: North-Holland.
Verhelst, N.D. and Verstralen, H.H.F.M. (1993). A stochastic unfolding model derived from the partial credit model. Kwantitatieve Methoden 42, 73-92.

Part VI. Models with Special Assumptions About the Response Process

The four models presented in this section do not fit into the other five sections and are difficult to describe as a group. At the same time, the four models have features that make them important to psychometricians in the analysis of special types of data. First, situations often arise in practice where the IRT model of choice is to be applied to several groups. A good example is the identification of differential item functioning (DIF). The groups might consist of (1) males and females; (2) Whites, Blacks, and Hispanics; or (3) examinees from multiple nations. These comparative IRT analyses have become about as common as classical item analyses in testing agencies. Another common situation involving multiple groups arises when tests are being linked or equated to a common scale. Many more popular IRT applications involving multiple groups can be identified. It is common to analyze the results for each group separately and then link them in some way (e.g., using anchor items) for the purposes of making comparisons at the item level, ability level, or both. The model developed by Bock and Zimowski (Chap. 25) unifies many of the applications of IRT involving multiple groups and even the analysis of data when the group becomes the unit of analysis (e.g., as in program evaluation). Some of the most common applications of IRT become special cases of the multiple group IRT model. A unified treatment is valuable in its own right but may also help in formulating solutions to new IRT applications. Rost's logistic mixture models introduced in Chap. 26 have some properties similar to the unified treatment of multiple group models of Bock and Zimowski. The main differences are that Rost's models are (1) a mixture of an IRT model and a latent class model (to handle multiple groups) and, to date, (2) less general (Rost has limited his work to the Rasch model only, though he has considered the extension to polytomous response data).
Related models by Gitomer and Yamamoto (1991) and Mislevy and Verhelst (1990) are not considered in this volume but would be excellent follow-up references.

Fundamental to every IRT model is the assumption of local independence. Unfortunately, this assumption is violated at least to some extent with many tests. For example, one violation occurs when a set of items in a test is organized around a common stimulus such as a graph or passage. In such cases, examinee responses to items may not be completely independent, and this is a violation of the assumption of local independence. It becomes a problem whenever one more ability is needed to explain test performance in excess of the abilities contained in the model of interest. Violations of the assumption of local independence complicate parameter estimation and may result in ability estimates of dubious value. One solution to this problem is to develop IRT models which do not make the troublesome assumption. These are called locally dependent models by Jannarone (Chap. 27) and, while not ready for day-to-day use, these new models do offer researchers some potential for solving a troublesome problem. More work along the same lines as Jannarone's would certainly be welcome in the coming years.

One of the criticisms that has been leveled at dichotomously-scored multiple-choice items is that there is no provision for assessing partial knowledge. As both Bock, and Thissen and Steinberg have reported in their chapters, there is assumed to be partial knowledge contained in the wrong answers of multiple-choice items. Some incorrect answers reflect more partial knowledge than others, and ability can be more accurately estimated if this information in partially correct answers is used. In Chap. 28, Hutchinson describes a family of models that he calls mismatch models for improving the estimation of examinee ability and gaining new information about the cognitive functioning of examinees on the test items. These models are new but deserve further exploration. A comparison of Hutchinson's solution to the problem of assessing partial knowledge in multiple-choice items with the solutions offered in the polytomous IRT models earlier in this volume is a topic deserving of additional research.

References

Gitomer, D. and Yamamoto, (1991). Performance modeling that integrates latent trait and class theory. Journal of Educational Statistics 28, 173-189.
Mislevy, R. and Verhelst, N. (1990). Modeling item responses when different subjects employ different solution strategies. Psychometrika 55, 195-215.
Thissen, D. (1976). Information in wrong responses to the Raven progressive matrices. Journal of Educational Measurement 13, 201-214.

25 Multiple Group IRT

R. Darrell Bock and Michele F. Zimowski

Introduction

The extension of item response theory to data from more than one group of persons offers a unified approach to such problems as differential item functioning, item parameter drift, nonequivalent groups equating, vertical equating, two-stage testing, and matrix-sampled educational assessment. The common element in these problems is the existence of persons from different populations responding to the same test or to tests containing common items. In differential item functioning, the populations typically correspond to sex or demographic groups; in item parameter drift, to annual cohorts of students; in vertical equating, to children grouped by age or grade; in nonequivalent groups equating, to normative samples from different places or times; in two-stage testing, to examinees classified by levels of performance on a pretest; and in matrix-sampled educational assessment, to students from different schools or programs administered matrix-sampled assessment instruments. In all these settings, the objective of the multiple-group analysis is to estimate jointly the item parameters and the latent distribution of a common attribute or ability of the persons in each of the populations.

In classical test theory it has long been known that the sample distribution of a fallible test score is not an unbiased estimator of the true-score distribution. For if the test score, y = θ + e, is assumed to be the sum of the true score, θ, distributed with mean μ and variance σ² in the population of persons, and an independent measurement error e distributed with mean zero and variance σ_e², then the mean of a sample of test scores is an unbiased estimator of μ, but the score variance estimates σ² + σ_e². This fact complicates considerably the comparison of population distributions based on scores from tests with different measurement error variances, even when the tests are presumed to measure the same true score.
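The inflation of the score variance by measurement error can be illustrated with a small simulation. This is only a sketch under assumed values: normal true scores with μ = 0 and σ = 1, independent normal errors with standard deviation 0.5, and 20,000 simulated persons; the function name `simulate_scores` is hypothetical.

```python
import random
import statistics


def simulate_scores(n_persons, mu, sigma, sigma_e, seed=0):
    """Draw true scores and fallible test scores y = theta + e."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(mu, sigma) for _ in range(n_persons)]
    observed = [theta + rng.gauss(0.0, sigma_e) for theta in true_scores]
    return true_scores, observed


true_scores, observed = simulate_scores(20000, mu=0.0, sigma=1.0, sigma_e=0.5)
print("mean of observed scores:", round(statistics.fmean(observed), 3))
print("variance of true scores:", round(statistics.variance(true_scores), 3))
print("variance of observed scores:", round(statistics.variance(observed), 3))
print("sigma^2 + sigma_e^2:", 1.0 ** 2 + 0.5 ** 2)
```

The observed mean stays close to μ, while the observed variance is close to σ² + σ_e² rather than σ², so two tests with different error variances would suggest different population spreads even for the same true-score distribution.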
Andersen and Madsen (1977) were the first to show that IRT can provide direct estimation of parameters of the latent distribution without the intervening calculation of test scores. They simplified matters somewhat by assuming that the item parameters were already known from some previous large-sample calibration, but subsequent work by Sanathanan and Blumenthal (1978), de Leeuw and Verhelst (1986) and Lindsay et al. (1991) extended their results to simultaneous estimation of parameters of the items and the latent distribution, and to semi-parametric and latent-class representation of the latent distribution. All of this work was limited to the one-parameter logistic model and to one population, however, and is primarily of interest for the connections it reveals between IRT, log-linear analysis, and latent class estimation. The results are too restrictive for application to practical testing problems where items may differ in discriminating power, the effects of guessing must be accounted for, some exercises may have multiple response categories, and more than one population may be involved. A more general approach adaptable to these situations was foreshadowed in Mislevy's (1984) work on estimating latent distributions by the maximum marginal likelihood (MML) method (Bock and Aitkin, 1981; Bock and Lieberman, 1970). In this chapter, we discuss the extension of Mislevy's work to applications, such as those above, involving multiple populations.

Presentation of the Model

To be fully general, we consider data arising from person $j$ in group $k$ responding in category $h$ of item $i$. We suppose there are $g$ groups, $N_k$ persons in group $k$, $m_i$ response categories of item $i$, and a test composed of $n$ items. A response to item $i$ may therefore be coded as $U_i$, where $U_i$ takes on any integral value 1 through $m_i$. Similarly, the pattern of response of person $j$ to the $n$ test items may be expressed as $U_j = [U_{1j}, U_{2j}, \ldots, U_{nj}]$. We assume the existence of a twice-differentiable response function,

$$P_{ih}(\theta) = P_i(U_i = h \mid \theta), \tag{1}$$

common to all groups and persons, that describes the probability of a response in category $h$, given the value of a continuous and unbounded person-attribute, $\theta$. We require the response categories to be mutually exclusive and exhaustive, so that

$$\sum_{h=1}^{m_i} P_{ih}(\theta) = 1.$$

We further assume conditional independence of the responses, so that the probability of pattern $U_j$, given $\theta$, is

$$P(U_j \mid \theta) = \prod_{i=1}^{n} P_i(U_{ij} = h \mid \theta). \tag{2}$$

We also assume $\theta$ to have a continuous distribution with finite mean and variance in the population of persons corresponding to group $k$ and write its probability density function as $g_k(\theta)$. The unconditional, or marginal, probability of pattern $U_j$ in group $k$ may therefore be expressed as, say,

$$P_k(U_j) = \int_{-\infty}^{\infty} P(U_j \mid \theta)\, g_k(\theta)\, d\theta. \tag{3}$$

Following Bock (1989), we suppose that the response functions depend upon some unknown, fixed parameters, $\zeta$, and that the population distribution functions depend on some other unknown, fixed parameters, $\eta$. In that case, Eq. (3) may be written more explicitly as

$$P_k(U_j) = \int_{-\infty}^{\infty} P(U_j \mid \theta, \zeta)\, g_k(\theta \mid \eta_k)\, d\theta;$$

that is, we assume each of the response and distribution functions to be members of their respective parametric families. We address below the problem of how to estimate $\zeta$ and $\eta$ simultaneously, given the observed item scores $U_{kij}$.

Other Representations of the Population Distributions

In educational, sociological, psychological, and medical applications of IRT, the presence of many independent sources of variation influencing the person attribute broadly justifies the assumption of normal distributions within groups. The most likely exceptions are (1) samples in which, unknown to the investigator, the respondents consist of a mixture of populations with different mean levels and (2) groups in which the respondents have been arbitrarily selected on some criterion correlated with their attribute value. Both of these exceptions, and many other situations leading to non-normal distributions of $\theta$, are amenable to the following alternative representations treated in Mislevy (1984).

Resolution of Gaussian Components

A plausible model for distributions in mixed populations is a mixture of Gaussian components with common variance but different means and proportions. (The assumption of different variances leads to difficult estimation problems and is best avoided.) Day (1969) proposed a method of obtaining the maximum likelihood estimates of the variance, means, and proportions in manifest distributions. Mislevy (1984) subsequently adapted it to latent distributions.
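For a manifest (directly observed) variable, a common-variance Gaussian mixture can be fitted by a short EM iteration. The sketch below is a generic EM routine written for illustration, not a transcription of Day's (1969) procedure; the function names, the starting values, and the deterministic pseudo-sample built from normal quantiles are all assumptions.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_common_variance(data, means, props, var, n_iter=200):
    """EM for a Gaussian mixture whose components share a single variance."""
    n, k = len(data), len(means)
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each observation.
        resp = []
        for x in data:
            w = [p * normal_pdf(x, m, var) for p, m in zip(props, means)]
            total = sum(w)
            resp.append([wi / total for wi in w])
        # M-step: update proportions, means, and the common variance.
        sums = [sum(r[c] for r in resp) for c in range(k)]
        props = [s / n for s in sums]
        means = [sum(r[c] * x for r, x in zip(resp, data)) / sums[c] for c in range(k)]
        var = sum(r[c] * (x - means[c]) ** 2
                  for r, x in zip(resp, data) for c in range(k)) / n
    return props, means, var


def quantile_sample(mu, sigma, n):
    """Deterministic pseudo-sample: n equally spaced quantiles of N(mu, sigma^2)."""
    dist = NormalDist(mu, sigma)
    return [dist.inv_cdf((i + 0.5) / n) for i in range(n)]


# Assumed example: 30% of cases centered at -1.5, 70% at 1.5, common variance 1.
data = quantile_sample(-1.5, 1.0, 300) + quantile_sample(1.5, 1.0, 700)
props, means, var = em_common_variance(data, means=[-1.0, 1.0], props=[0.5, 0.5], var=2.0)
print(props, means, var)
```

Keeping one variance for all components makes the M-step a single pooled sum, which is part of why the common-variance case is the easier estimation problem.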

Histogram

In educational testing one often encounters test data from groups of students who have been selected on complicated criteria based partly on previous measures of achievement. The effect of the selection process on the latent distribution may be too arbitrary or severe to allow parsimonious modeling. In that case, the only alternative may be a nonparametric representation of the latent density. A good choice is to estimate the density at a finite number of equally spaced points in an interval that includes almost all of the probability. If these estimates are constrained to sum to unity, they approximate the probability density in the form of a histogram.

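Parameter Estimation

Fortunately, almost all practical applications of item response theory involve large samples of respondents from the population of interest. This broadly justifies maximum likelihood methods and related large-sample statistics when estimating item parameters and population distributions. Specifically, the maximum marginal likelihood (MML) method applies here.

The Likelihood Equations

When sample sizes are large in relation to the number of items, there are computational advantages in counting the occurrences of distinct response patterns in the data. Let $r_{kj}$ be the number of occurrences of pattern $j$ in group $k$, and let $s_k \le \min(N_k, S)$, where $S = \prod_{i=1}^{n} m_i$, be the number of patterns with $r_{kj} > 0$. Assuming random sampling of independently responding persons, we have product-multinomial data of which the marginal likelihood is, say,

$$L_M = \prod_{k=1}^{g} \frac{N_k!}{\prod_{j=1}^{s_k} r_{kj}!} \prod_{j=1}^{s_k} P_{kj}^{\,r_{kj}}, \tag{4}$$

where $P_{kj} = P_k(U_{kj})$. The likelihood equations for the parameters of item $i$ may be expressed as

$$\frac{\partial \log L_M}{\partial \zeta_i} = \sum_{k=1}^{g} \sum_{j=1}^{s_k} \frac{r_{kj}}{P_{kj}} \int_{-\infty}^{\infty} \frac{\partial L_{kj}(\theta)}{\partial \zeta_i}\, g_k(\theta)\, d\theta = 0,$$

where

$$L_{kj}(\theta) = P(U_{kj} \mid \theta) = \prod_{i=1}^{n} P_i(U_{kij} = h \mid \theta)$$

is a conditional likelihood, given $\theta$. Applying the differential identity

$$\frac{\partial L_{kj}(\theta)}{\partial \zeta_i} = L_{kj}(\theta)\, \frac{\partial \log L_{kj}(\theta)}{\partial \zeta_i}$$

to Eq. (4) gives

$$\sum_{k=1}^{g} \sum_{j=1}^{s_k} \frac{r_{kj}}{P_{kj}} \int_{-\infty}^{\infty} \left[ \sum_{h=1}^{m_i} \frac{u_{kijh}}{P_{ih}(\theta)}\, \frac{\partial P_{ih}(\theta)}{\partial \zeta_i} \right] L_{kj}(\theta)\, g_k(\theta)\, d\theta = 0, \tag{5}$$

where

$$u_{kijh} = \begin{cases} 1 & \text{if } U_{kij} = h, \\ 0 & \text{otherwise,} \end{cases}$$

which must be solved under the restriction $\sum_{h=1}^{m_i} P_{ih}(\theta) = 1$.

The $m$-category logistic models (see Bock, this volume) give especially simple results in this context. For $m = 2$, for example, the restriction $P_{i2}(\theta) = 1 - P_{i1}(\theta)$ reduces Eq. (5) to

$$\sum_{k=1}^{g} \sum_{j=1}^{s_k} \frac{r_{kj}}{P_{kj}} \int_{-\infty}^{\infty} \left[ u_{kij1} - P_{i1}(\theta) \right] \frac{\partial z_i(\theta)}{\partial \zeta_i}\, L_{kj}(\theta)\, g_k(\theta)\, d\theta = 0, \tag{6}$$

where $z_i(\theta) = a_i\theta + \gamma_i$ is the logit in $\theta$, $a_i$ is the item slope, and $\gamma_i$ is the item intercept. Similar results for the logistic nominal categories model and ordered categories model may be deduced from derivatives given in Bock (1985).

The likelihood equations for the population parameters $\eta_k$ are considerably simpler:

$$\frac{\partial \log L_M}{\partial \eta_k} = \sum_{j=1}^{s_k} \frac{r_{kj}}{P_{kj}} \int_{-\infty}^{\infty} L_{kj}(\theta)\, \frac{\partial g_k(\theta)}{\partial \eta_k}\, d\theta = 0. \tag{7}$$

Suppose the distribution of $\theta$ for group $k$ is normal with mean $\mu_k$ and variance $\sigma_k^2$. Then $g_k(\theta) = (1/\sqrt{2\pi}\,\sigma_k) \exp[-(\theta - \mu_k)^2 / 2\sigma_k^2]$ and

$$\frac{\partial \log L_M}{\partial \mu_k} = \sigma_k^{-2} \sum_{j=1}^{s_k} \frac{r_{kj}}{P_{kj}} \int_{-\infty}^{\infty} (\theta - \mu_k)\, L_{kj}(\theta)\, g_k(\theta)\, d\theta = 0, \tag{8}$$

$$\frac{\partial \log L_M}{\partial \sigma_k^2} = \frac{1}{2\sigma_k^4} \sum_{j=1}^{s_k} \frac{r_{kj}}{P_{kj}} \int_{-\infty}^{\infty} \left[ (\theta - \mu_k)^2 - \sigma_k^2 \right] L_{kj}(\theta)\, g_k(\theta)\, d\theta = 0. \tag{9}$$

We recognize Eq. (8) as the sum of the posterior means (Bayes estimates) of $\theta$ in group $k$, given the response patterns $U_{kj}$, $j = 1, 2, \ldots, s_k$, minus the corresponding population mean. Thus, for provisional values of the item parameters, the likelihood equation for $\mu_k$ is, say,

$$\hat\mu_k = \frac{1}{N_k} \sum_{j=1}^{s_k} r_{kj}\, \bar\theta_{kj}, \tag{10}$$

where

$$\bar\theta_{kj} = \frac{1}{P_{kj}} \int_{-\infty}^{\infty} \theta\, L_{kj}(\theta)\, g_k(\theta)\, d\theta$$

is the posterior mean of $\theta$, given $U_{kj}$. Similarly, making the substitution $\theta - \mu_k = (\theta - \bar\theta_{kj}) + (\bar\theta_{kj} - \mu_k)$, we see that Eq. (9) is $N_k$ times the sample variance of the Bayes estimates of $\theta$ minus the population variance, plus the sum of the posterior variances of $\theta$, given the observed patterns. Thus, the likelihood equation for $\sigma_k^2$ when the item parameters are known is

$$\hat\sigma_k^2 = \frac{1}{N_k} \sum_{j=1}^{s_k} r_{kj} \left[ (\bar\theta_{kj} - \hat\mu_k)^2 + \sigma^2_{\theta \mid U_{kj}} \right], \tag{11}$$

where

$$\sigma^2_{\theta \mid U_{kj}} = \frac{1}{P_{kj}} \int_{-\infty}^{\infty} (\theta - \bar\theta_{kj})^2\, L_{kj}(\theta)\, g_k(\theta)\, d\theta$$

is the posterior variance of $\theta$, given $U_{kj}$. These equations are in a form suitable for an EM solution (see Bock, 1989; Dempster et al., 1981).

In IRT applications, the integrals in Eqs. (5) and (7) do not have closed forms and must be evaluated numerically. Although Gauss-Hermite quadrature is the method of choice when the population distribution is normal, it does not generalize to other forms of distributions we will require. For general use, simple Newton-Cotes formulas involving equally-spaced ordinates are quite satisfactory, as demonstrated in the computing examples below.

Estimating $\theta$

In IRT applications, the most prominent methods of estimating the person-attribute, $\theta$, are maximum likelihood and Bayes. In multiple-group IRT, the Bayes estimator, which appears in Eq. (10), has the advantage of incorporating the information contained in the person's group assignment. Conveniently, it is also a quantity that is calculated in the course of estimating the item parameters by the MML method. The quadrature required to evaluate the definite integral poses no special problem, but the range and number of quadrature points must be sufficient to cover the population distributions of all groups. The square root of the corresponding posterior variance, which appears in Eq. (11), serves as the standard error of the estimate.

In some applications (achievement testing, for example), it may not be desirable to include information from group membership in the estimation of $\theta$. In that case, the population density function in the Bayes estimator should be replaced by the sum of the group density functions normalized to unity.

Gaussian Resolution

In the present notation, the resulting likelihood equations at iteration $t$ for the proportion and mean attributable to component $\ell$, $\ell = 1, 2, \ldots, L$, of the distribution in group $k$ are

$$p_{k\ell}^{(t+1)} = \frac{1}{N_k} \sum_{j=1}^{s_k} \frac{r_{kj}}{P_{kj}} \int_{-\infty}^{\infty} L_{kj}(\theta)\, p_{k\ell}^{(t)}\, g_{k\ell}^{(t)}(\theta)\, d\theta, \tag{12}$$

$$\mu_{k\ell}^{(t+1)} = \frac{1}{N_k\, p_{k\ell}^{(t+1)}} \sum_{j=1}^{s_k} \frac{r_{kj}}{P_{kj}} \int_{-\infty}^{\infty} \theta\, L_{kj}(\theta)\, p_{k\ell}^{(t)}\, g_{k\ell}^{(t)}(\theta)\, d\theta, \tag{13}$$

where $g_{k\ell}(\theta) = (2\pi\sigma_k^2)^{-1/2} \exp[-(\theta - \mu_{k\ell})^2 / 2\sigma_k^2]$ and $L_{kj}(\theta)\, p_{k\ell}\, g_{k\ell}(\theta)/P_{kj}$ is the posterior probability that a person in group $k$ with attribute value $\theta$ belongs to population component $\ell$, given $U_{kj}$. Similarly, for the common $\sigma_k^2$,

$$(\sigma_k^2)^{(t+1)} = \frac{1}{N_k} \sum_{\ell=1}^{L} \sum_{j=1}^{s_k} \frac{r_{kj}}{P_{kj}} \int_{-\infty}^{\infty} \left( \theta - \mu_{k\ell}^{(t+1)} \right)^2 L_{kj}(\theta)\, p_{k\ell}^{(t)}\, g_{k\ell}^{(t)}(\theta)\, d\theta. \tag{14}$$

These equations are in a form suitable for an EM solution starting from plausible provisional values $p_{k\ell}^{(0)}$, $\mu_{k\ell}^{(0)}$, and $(\sigma_k^2)^{(0)}$. Typically, one starts with $\ell = 1$ and proceeds stepwise to higher numbers of components. The difference of marginal log likelihoods at each step provides a test of the hypothesis that no further components are required in group $k$. The difference in log likelihoods for group $k$ is evaluated as

$$\chi^2_{LR} = 2 \sum_{j=1}^{s_k} r_{kj} \log \frac{P_{kj}^{(\ell+1)}}{P_{kj}^{(\ell)}}, \tag{15}$$

where $P_{kj}^{(\ell+1)}$ is the marginal probability of pattern $j$ in group $k$ after fitting the $\ell + 1$ component latent distribution, and $P_{kj}^{(\ell)}$ is the similar probability for the $\ell$ component model. Under the null hypothesis, Eq. (15) is distributed in large samples as chi-square with df = 2. To be conservative, the $\ell + 1$ component should be added only if this statistic is clearly significant (say, 3 to 4 times its degrees of freedom).

Resolution of Gaussian components has interesting applications in behavioral and medical genetics, where it is used to detect major-gene effects on a quantitative trait in a background of polygenic variation (see Bock and Kolakowski, 1973; Dorus et al., 1983). It is also a tractable alternative to the computationally difficult Kiefer and Wolfowitz (1956) step-function representation of latent distributions, and it has the advantage of being analytic.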
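The stepwise test for adding a Gaussian component can be sketched in a few lines. The code assumes the statistic is twice the weighted sum of log ratios of pattern probabilities under the larger and smaller models, referred to chi-square with 2 degrees of freedom; the function names, the pattern counts, and the fitted probabilities are hypothetical illustration values.

```python
import math


def lr_statistic(counts, probs_small, probs_large):
    """Twice the gain in marginal log likelihood from the larger model."""
    return 2.0 * sum(r * math.log(p_large / p_small)
                     for r, p_small, p_large in zip(counts, probs_small, probs_large))


def chi2_upper_tail_df2(x):
    """Upper-tail probability of chi-square with 2 df, which has the closed form exp(-x/2)."""
    return math.exp(-x / 2.0)


def add_component(stat, df=2, multiple=3.0):
    """Conservative rule from the text: require the statistic to be several times its df."""
    return stat >= multiple * df


# Hypothetical pattern counts and fitted marginal probabilities for one group.
counts = [40, 30, 20, 10]
probs_small = [0.35, 0.35, 0.20, 0.10]   # model with the current number of components
probs_large = [0.40, 0.30, 0.20, 0.10]   # model with one more component
stat = lr_statistic(counts, probs_small, probs_large)
p_value = chi2_upper_tail_df2(stat)
print(f"statistic = {stat:.3f}, p = {p_value:.3f}, add component: {add_component(stat)}")
```

With these made-up numbers the statistic is well below three times its degrees of freedom, so the extra component would not be added.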

The Latent Histogram

Estimation of the latent distribution on a finite number of discrete points is especially convenient in the present context, because the density estimates are a by-product of the numerical integrations on equally-spaced ordinates employed in the MML estimation. That is, the estimated density for group $k$ at point $X_m$ is, say,

$$\phi_{km} = \frac{N_{km}}{\sum_{q} N_{kq}}, \tag{16}$$

where $N_{kq} = \sum_{j=1}^{s_k} r_{kj}\, L_{kj}(X_q)\, g_k(X_q)/P_{kj}$. These density estimates can be obtained jointly with the item parameter estimates provided the indeterminacy of location and scale is resolved by imposing on $X_q$ restrictions that fix the location and scale of the latent distribution [see Mislevy (1984)]. These point estimates of density may be cumulated to approximate the distribution function. Optimal kernel estimation (Gasser et al., 1985) may then be used to interpolate for specified percentiles of the latent distribution.
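The quantities discussed in this section (marginal pattern probabilities, Bayes estimates with their posterior standard deviations, and the latent histogram update) all come from the same sums over equally spaced points. The sketch below assumes a two-parameter logistic model with logit $a_i\theta + \gamma_i$ for binary items and a single group; the item parameters, pattern counts, grid, and function names are hypothetical.

```python
import math
from itertools import product


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def pattern_likelihood(pattern, slopes, intercepts, theta):
    """Conditional likelihood of a binary pattern at theta, with logit a * theta + c."""
    like = 1.0
    for u, a, c in zip(pattern, slopes, intercepts):
        p = logistic(a * theta + c)
        like *= p if u == 1 else 1.0 - p
    return like


def posterior_summary(patterns, counts, slopes, intercepts, points, weights):
    """Return (marginal, posterior mean, posterior SD) per pattern and updated weights."""
    expected = [0.0] * len(points)   # expected number of persons at each point
    summaries = []
    for pattern, r in zip(patterns, counts):
        joint = [pattern_likelihood(pattern, slopes, intercepts, x) * w
                 for x, w in zip(points, weights)]
        marginal = sum(joint)                      # quadrature value of the pattern probability
        post = [v / marginal for v in joint]       # posterior weights over the points
        mean = sum(x * p for x, p in zip(points, post))
        var = sum((x - mean) ** 2 * p for x, p in zip(points, post))
        summaries.append((marginal, mean, math.sqrt(var)))
        for q, p in enumerate(post):
            expected[q] += r * p
    total = sum(expected)
    new_weights = [e / total for e in expected]    # histogram estimate of the density
    return summaries, new_weights


# Hypothetical setup: 17 equally spaced points with normal starting weights.
points = [-4.0 + 0.5 * q for q in range(17)]
raw = [math.exp(-0.5 * x * x) for x in points]
weights = [v / sum(raw) for v in raw]
slopes = [1.0, 1.2, 0.8]
intercepts = [0.5, 0.0, -0.5]
patterns = list(product((0, 1), repeat=3))   # all 8 binary patterns
counts = [12, 5, 9, 8, 20, 11, 18, 17]       # made-up pattern frequencies
summaries, new_weights = posterior_summary(patterns, counts, slopes, intercepts,
                                           points, weights)
for pattern, (marginal, mean, sd) in zip(patterns, summaries):
    print(pattern, round(marginal, 4), round(mean, 3), round(sd, 3))
```

Because the weights are normalized to sum to one, the same routine serves for a normal prior or for a histogram carried over from a previous iteration.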

Example

The purpose of educational assessment is to evaluate the effectiveness of educational institutions or programs by directly testing learning outcomes. Inasmuch as only the performance of the students in aggregate is in question, there is no technical necessity of computing or reporting scores of individual students, except perhaps as a source of motivation (see Bock and Zimowski, 1989). Lord and Novick (1968, pp. 255-258) have shown that, for this type of testing, the most efficient strategy for minimizing measurement error at the group level is to sample both students and items and to present only one item to each student. In statistical terminology, this procedure is an example of matrix sampling: the students correspond to the rows of a matrix and the items correspond to the columns. If multiple content domains are measured simultaneously by presenting one item from each domain to each student, the procedure is called multiple matrix sampling.

The original conception was that the results of a matrix sampled assessment would be reported at the school or higher level of aggregation simply as the percent of correct responses among students presented items from a given domain. Standard errors of these statistics under matrix sampling are available for that purpose (Lord and Novick, 1968; Sirotnik and Wellington, 1977). An IRT treatment of such data is also possible, however, even though, with each student responding to only one item in each domain, matrix sampling data would not seem suitable for conventional IRT analysis. (At least two items per respondent are required for nondegenerate likelihoods of the item or person parameters.) Bock and Mislevy (1981) proposed a group-level IRT analysis of multiple matrix sampling data that retains most of the benefits of the person-level analysis. They developed the group-level model that the California Assessment Program applied successfully to binary scored multiple-choice items and multiple-category ratings of exercises in written expression.

Briefly, the statistical basis of this model, as described in Mislevy (1983), is as follows. The groups to which the model applies are students at specified grade levels of the schools in which the effectiveness of instruction is periodically evaluated. The probability that a randomly selected student $j$ in a given grade will respond correctly to item $i$ of a given content domain is assumed to be a function of a school parameter $\nu$ and one or more item parameters $\zeta$. Writing this probability for school $k$ and item $i$ as

$$P_{ki} = P(\nu_k, \zeta_i)$$

and assuming that the students respond independently, we have for the probability that $r_{ki}$ of the $N_{ki}$ students presented item $i$ will respond correctly:

$$P(r_{ki} \mid N_{ki}) = \frac{N_{ki}!}{r_{ki}!\,(N_{ki} - r_{ki})!}\, P_{ki}^{\,r_{ki}} (1 - P_{ki})^{N_{ki} - r_{ki}}. \tag{17}$$

Then on the condition that each student is presented only one item from the domain, the likelihood of $\nu$ and $\zeta$, given responses to $n$ items in $N$ schools, is

$$L(\nu, \zeta) = \prod_{k=1}^{N} \prod_{i=1}^{n} \frac{N_{ki}!}{r_{ki}!\,(N_{ki} - r_{ki})!}\, P_{ki}^{\,r_{ki}} (1 - P_{ki})^{N_{ki} - r_{ki}}. \tag{18}$$

Assuming that a good-fitting and computationally tractable group-level response function can be found, the likelihood equations derived from Eq. (18) can be solved by standard methods for product-binomial data on the assumption that $\nu$ is either a fixed effect or a random effect. Solutions for the fixed-effect case have been given by Kolakowski and Bock (1982) and Reiser (1983). In the random-effects case, the maximum marginal likelihood method of Bock and Aitkin (1981) applies if Eq. (17) is substituted for the likelihood for person-level data on the assumption of conditional independence.
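The product-binomial log likelihood for school-by-item counts can be written directly from the binomial form. In the sketch below the group-level response function is taken to be a two-parameter logistic curve in the school parameter, which is an assumption made for illustration; the function names and all numbers are hypothetical.

```python
import math


def group_prob(nu, slope, threshold):
    """Assumed group-level response function: logistic in the school parameter nu."""
    return 1.0 / (1.0 + math.exp(-slope * (nu - threshold)))


def log_binomial(r, n, p):
    """Log probability that r of n independent students respond correctly."""
    return math.log(math.comb(n, r)) + r * math.log(p) + (n - r) * math.log(1.0 - p)


def log_likelihood(nus, items, correct, presented):
    """Sum of binomial log probabilities over schools k and items i."""
    total = 0.0
    for k, nu in enumerate(nus):
        for i, (slope, threshold) in enumerate(items):
            n_ki = presented[k][i]
            if n_ki == 0:
                continue   # no student in school k was presented item i
            total += log_binomial(correct[k][i], n_ki, group_prob(nu, slope, threshold))
    return total


# Hypothetical data: 2 schools, 2 items, 10 students presented each item.
nus = [0.0, 1.0]
items = [(1.0, 0.0), (1.5, 0.5)]   # (slope, threshold) per item
correct = [[5, 3], [8, 6]]
presented = [[10, 10], [10, 10]]
loglik = log_likelihood(nus, items, correct, presented)
print(f"log likelihood = {loglik:.4f}")
```

Setting every entry of `presented` to 1 and `correct` to the binary item score turns each binomial term into a Bernoulli term, which corresponds to the person-level case.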


MML analysis is attractive because the substitution $N_{ki} = 1$ and $r_{ki} = U_{kij}$ in Eq. (17) specializes the computing algorithm to the person-level model for binary-scored items. It also is more stable numerically than the fixed-effects analysis when $N_{ki}$ (the number of students in school $k$ presented item $i$) is small. Concerning the crucial question of the choice of group-level response function, there is good reason to assume, as a working hypothesis testable in the data, the normal ogive model or its logistic approximation. For if the normal-ogive model applies at the student level, and the student proficiencies within schools are distributed $N(\nu_k, \sigma^2)$, where $\nu_k$ is the school mean and $\sigma^2$ is homogeneous among schools, then the response process for item $i$ within schools, $y = \theta + \varepsilon_i$, will be normal with mean $\nu_k$ and variance $\sigma^2 + a_i^{-2}$, where $a_i$ is the item slope. In that case, the proportion of students responding correctly to item $i$ in school $k$ will be

$$P_{ki} = \Phi\!\left[ \frac{\nu_k - \beta_i}{(\sigma^2 + a_i^{-2})^{1/2}} \right],$$
where $\Phi$ is the standard normal distribution function and $\beta_i$ is the item threshold. This line of reasoning broadly justifies the use of the conventional person-level item-response models at the group level, and it extends as well to the model for guessing effects (see Mislevy, 1984). The validity of the model is easy to check because, with different students presented each item, the frequencies of correct response are strictly independent and

$$\sum_{i=1}^{n} \frac{(r_{ki} - N_{ki}P_{ki})^2}{N_{ki}P_{ki}(1 - P_{ki})} \tag{19}$$

is distributed as $\chi^2$ on $n$ degrees of freedom. If Eq. (19) is generally nonsignificant for most schools in the sample, there is no evidence for failure of the model. For the generally easy items of the California assessment, Bock and Mislevy (1981) found uniformly good fit of the two-parameter logistic model.

The group-level model extends readily to multiple-category data and is especially valuable in the analysis and scoring of responses to essay questions. The California Direct Writing Assessment, for example, required students to write on one matrix-sampled essay topic in a 45-minute period. The essays were graded by trained readers on a six-category scale for each of three aspects of writing proficiency. Although these ratings did not permit IRT analysis at the student level, the group-level treatment was straightforward and greatly facilitated annual refreshing of the essay topics by nonequivalent groups equating.

The example is based on data from the California Direct Writing Assessment. Eighth-grade students in a sample of 20 public schools wrote a 40-minute essay in response to a randomly assigned prompt for a particular type of writing. Each type of writing was represented by ten different prompts. Each essay was scored on six-point scales of "Rhetorical Effectiveness" and "Conventions of Written English." The data in Table 1 summarize the scores of the former for twelfth grade students in a random sample of 20 California schools in response to ten prompts of a type of writing called "Autobiographical Incident." Because eight different types of writing were assessed, making a total of eighty distinct prompts, the data are sparse, even in large schools, as is evident in the table. Nevertheless, they are sufficient to estimate item parameters and school-level scores by the graded model or the generalized partial credit model (see Muraki, this volume).

Table 2 shows the MML estimates of the item slope and threshold parameters of the generalized rating scale model. The category parameters are shown in Table 3. A normal latent distribution with mean 0 and variance 1 was assumed in the estimation. As a consequence of the small number of schools and the few students per school responding to these prompts, the standard errors of the item parameters are relatively large.

School scale scores of the California Direct Writing Assessment were scaled to a mean of 250 and standard deviation of 50 in the first year of the program. The scale for subsequent years, with 60 percent replacement of prompts each year, was determined by nonequivalent groups equating. Table 4 represents first year maximum likelihood estimation of scale scores computed from the estimated parameters in Tables 2 and 3. The sparse data again result in relatively large standard errors. Maximum likelihood rather than Bayes estimation is necessary in this context to avoid regressions to the mean proportional to the number of students responding per school.
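Returning to the group-level response function for binary items discussed earlier in this section, the argument that a student-level normal ogive averaged over a normal within-school distribution is again a normal ogive can be checked numerically. The sketch below assumes the closed form $\Phi[(\nu_k - \beta_i)/(\sigma^2 + a_i^{-2})^{1/2}]$ and compares it with a sum over equally spaced points; it also computes the chi-square fit statistic for one school. All function names and numerical values are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def group_level_prob(nu, sigma2, slope, threshold):
    """Closed-form proportion correct in a school with mean nu and variance sigma2."""
    return STD_NORMAL.cdf((nu - threshold) / math.sqrt(sigma2 + slope ** -2))


def averaged_student_prob(nu, sigma2, slope, threshold, n_points=2001, half_width=8.0):
    """Average the student-level normal ogive over N(nu, sigma2) on an equally spaced grid."""
    sd = math.sqrt(sigma2)
    step = 2.0 * half_width * sd / (n_points - 1)
    school = NormalDist(nu, sd)
    total = 0.0
    for q in range(n_points):
        theta = nu - half_width * sd + q * step
        total += STD_NORMAL.cdf(slope * (theta - threshold)) * school.pdf(theta) * step
    return total


def fit_statistic(correct, presented, probs):
    """Sum over items of squared standardized residuals for one school."""
    return sum((r - n * p) ** 2 / (n * p * (1.0 - p))
               for r, n, p in zip(correct, presented, probs))


# Assumed illustration values for one school and one item.
closed = group_level_prob(nu=0.3, sigma2=0.8, slope=1.5, threshold=-0.2)
numeric = averaged_student_prob(nu=0.3, sigma2=0.8, slope=1.5, threshold=-0.2)
print(f"closed form = {closed:.6f}, numerical average = {numeric:.6f}")

# Made-up counts for three items in one school.
chi2 = fit_statistic(correct=[7, 4, 5], presented=[10, 10, 10], probs=[0.5, 0.5, 0.5])
print(f"fit statistic = {chi2:.2f} on 3 degrees of freedom")
```

The two probabilities agree to many decimal places, which is the numerical counterpart of the variance-addition argument.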

Discussion

With the exception of the group-level model, applications of multiple-group IRT in large-scale testing programs require substantial computing resources. They are quite amenable, however, to 486 or Pentium microcomputers with 8 MB of RAM and a large hard disk, and to RISC and SPARC workstations. With the exception of the resolution of latent distributions into Gaussian components, the multiple-group procedures are implemented on these platforms in the case of binary scored items in the BILOG-MG program of Zimowski et al. (1996). Incorporation of the Gaussian resolution is in progress for the BILOG program (Mislevy and Bock, 1996) and the BILOG-MG program. Implementation of multiple-group procedures for multiple-category ratings is in progress for the PARSCALE program of Muraki and Bock (1991). The group-level model above is implemented in the BILOG, BILOG-MG, and PARSCALE programs.

TABLE 1. School Level Data from the California Direct Writing Assessment: Six-Category Ratings of Ten Essay Prompts.

[Table body not recoverable from the source. Recoverable column headings: School, Categories.]


TABLE 2. Estimated Item Parameters for Rhetorical Effectiveness in Writing an "Autobiographical Incident."

Prompt   Slope (S.E.)      Threshold (S.E.)
1        0.518 (0.104)      0.376 (0.124)
2        0.789 (0.143)      0.118 (0.098)
3        0.515 (0.098)      0.236 (0.124)
4        1.068 (0.206)      0.339 (0.088)
5        0.534 (0.099)     -0.066 (0.119)
6        0.703 (0.123)      0.099 (0.100)
7        0.864 (0.155)      0.135 (0.093)
8        0.601 (0.111)      0.258 (0.113)
9        0.483 (0.099)      0.101 (0.141)
10       0.820 (0.148)      0.124 (0.096)

TABLE 3. Estimated Unadjusted Category Parameters.

Effectiveness (S.E.):   0.0 (0.0)   1.731 (0.195)   0.595 (0.094)   0.109 (0.084)   -0.901 (0.124)

TABLE 4. Group-Level Scores and (SE) for the Schools (Scaled to Mean 250 and SD 50).

School   Rhetorical Effectiveness (SE)
1        236 (36)
2        175 (32)
3        242 (32)
4        164 (36)
5        291 (30)
6        273 (32)
7        187 (22)
8        341 (64)
9        267 (24)
10       217 (23)
11       210 (24)
12       284 (25)
13       193 (23)
14       190 (30)
15       259 (31)
16       290 (33)
17       325 (33)
18       295 (27)
19       289 (110)
20       268 (29)

References

Andersen, E.B. and Madsen, M. (1977). Estimating the parameters of a latent population distribution. Psychometrika 42, 357-374.

Bock, R.D. (1985 reprint). Multivariate Statistical Methods in Behavioral Research. Chicago: Scientific Software International.

Bock, R.D. (1989). Measurement of human variation: A two-stage model. In R.D. Bock (ed.), Multilevel Analysis of Educational Data (pp. 319-342). New York: Academic Press.

Bock, R.D. and Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika 46, 443-445.

Bock, R.D. and Kolakowski, D. (1973). Further evidence of sex-linked major-gene influence on human spatial visualizing ability. American Journal of Human Genetics 25, 1-14.

Bock, R.D. and Lieberman, M. (1970). Fitting a response model for n dichotomously scored items. Psychometrika 35, 179-197.

Bock, R.D. and Mislevy, R.J. (1981). An item response model for matrix-sampling data: The California grade-three assessment. New Directions for Testing and Measurement 10, 65-90.

Bock, R.D. and Zimowski, M. (1989). Duplex Design: Giving Students A Stake in Educational Assessment. Chicago: Methodology Research Center, NORC.

Day, N.E. (1969). Estimating the components of a mixture of normal distributions. Biometrika 56, 463-473.

de Leeuw, J. and Verhelst, N. (1986). Maximum likelihood estimation in generalized Rasch models. Journal of Educational Statistics 11, 193-196.

Dempster, A.P., Rubin, D.B., and Tsutakawa, R.K. (1981). Estimation in covariance component models. Journal of American Statistical Association 76, 341-353.

Dorus, E., Cox, N.J., Gibbons, R.D., Shaughnessy, R., Pandey, G.N., and Cloninger, R.C. (1983). Lithium ion transport and affective disorders within families of bipolar patients. Archives of General Psychiatry 401, 945-552.

Gasser, T., Müller, H.-G., and Mammitzsch, V. (1985). Kernels for nonparametric curve estimation. Journal of the Royal Statistical Society, Series B 47, 238-252.

Kiefer, J. and Wolfowitz, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Annals of Mathematical Statistics 27, 887-906.

Kolakowski, D. and Bock, R.D. (1982). A multivariate generalization of probit analysis. Biometrics 37, 541-551.

Lindsay, B., Clogg, C.C., and Grego, J. (1991). Semiparametric estimation in the Rasch model and related exponential response models, including a simple latent class model for item analysis. Journal of the American Statistical Association 86, 96-107.

Lord, F.M. and Novick, M.R. (1968). Statistical Theories of Mental Test Scores (with Contributions by A. Birnbaum). Reading, MA: Addison-Wesley.

Mislevy, R.J. (1983). Item response models for grouped data. Journal of Educational Statistics 8, 271-288.

Mislevy, R.J. (1984). Estimating latent distributions. Psychometrika 49, 359-381.

Mislevy, R.J. and Bock, R.D. (1996). BILOG 3: Item Analysis and Test Scoring with Binary Logistic Models. Chicago: Scientific Software International.

Muraki, E. and Bock, R.D. (1991). PARSCALE: Parametric Scaling of Rating Data. Chicago: Scientific Software International.

Reiser, M.R. (1983). An item response model for demographic effects. Journal of Educational Statistics 8(3), 165-186.

Sanathanan, L. and Blumenthal, N. (1978). The logistic model and estimation of latent structure. Journal of the American Statistical Association 73, 794-798.

Sirotnik, K. and Wellington, R. (1977). Incidence sampling: An integrated theory for "matrix-sampling." Journal of Educational Measurement 14, 343-399.

Zimowski, M.F., Muraki, E., Mislevy, R.J., and Bock, R.D. (1996). BILOG-MG: Multiple-Group IRT Analysis and Test Maintenance for Binary Items. Chicago, IL: Scientific Software International.

26 Logistic Mixture Models

Jürgen Rost

Discrete mixture distribution models (MDM) assume that observed data do not stem from a homogeneous population of individuals but are a mixture of data from two or more latent populations (Everitt and Hand, 1981; Titterington et al., 1985). Applied to item response data this means that a particular IRT model does not hold for the entire sample but that different sets of model parameters (item parameters, ability parameters, etc.) are valid for different subpopulations. Whereas it has become a common procedure to compare parameter estimates between different manifest subpopulations defined, for example, by sex, age, or grade level, the subpopulations addressed by discrete MDM are latent, i.e., unknown both with respect to their number and membership of individuals. It is the aim of a mixture analysis to identify these latent populations. As an example, spatial reasoning tasks can be solved by different cognitive strategies, say, one strategy based on mental rotation of the stimulus and another based on feature comparison processes. If different individuals employ different strategies, an IRT model cannot fit the item response data because different strategies imply different item difficulties for the same items. On the other hand, the model may hold within each of the subpopulations employing the same strategy. The items, then, may have different parameter values for each subpopulation. Hence, the task of a data analysis with a discrete MDM is twofold: first, to unmix the data into homogeneous subpopulations, and second, to estimate the model parameters for each of the subpopulations. However, both tasks are done simultaneously by defining a discrete mixture model for the item responses and estimating all parameters by maximum likelihood methods.


Presentation of the Model

The general structure of mixed IRT models is as follows:

$$p(\mathbf{u}) = \sum_{g=1}^{G} \pi_g\, p(\mathbf{u} \mid g), \tag{1}$$

where $p(\mathbf{u})$ is the pattern probability, i.e., the probability of the response vector $\mathbf{u} = (u_1, u_2, \ldots, u_n)$, $G$ is the number of subpopulations or classes, $\pi_g$ is a probability parameter defining the size of class $g$ with restriction

$$\sum_{g=1}^{G} \pi_g = 1, \tag{2}$$

and $p(\mathbf{u} \mid g)$ the pattern probability within class $g$. The parameters $\pi_g$ are also called the mixing proportions. In the sequel, only discrete mixtures of the one-parameter logistic model, i.e., the Rasch model, and its polytomous generalizations are treated in detail. One reason among others is that parameter estimation becomes harder when multiplicative (discrimination) parameters are involved as well. Let $U_{ij}$ denote the response variable of individual $j$ on item $i$; then the conditional response probability of an individual being in class $g$ is defined as

$$p(U_{ij} = 1 \mid g) = \frac{\exp(\theta_{jg} - \sigma_{ig})}{1 + \exp(\theta_{jg} - \sigma_{ig})}, \tag{3}$$

where $\theta_{jg}$ is a class-specific individual parameter, denoting the ability of individual $j$ to solve the items if she or he belonged to class $g$. Hence, each individual gets as many parameters as there are latent classes. In the same way, $\sigma_{ig}$ is a class-specific item parameter describing the difficulty of item $i$ when it is solved by individuals in class $g$. The marginal response probability, then, is the mixed dichotomous Rasch model (Rost, 1990):

$$p(U_{ij} = 1) = \sum_{g=1}^{G} \pi_g\, \frac{\exp(\theta_{jg} - \sigma_{ig})}{1 + \exp(\theta_{jg} - \sigma_{ig})}, \tag{4}$$

with norming conditions

$$\sum_{g=1}^{G} \pi_g = 1 \quad \text{and} \quad \sum_{i=1}^{n} \sigma_{ig} = 0 \text{ for all classes } g,$$

where $n$ is the number of items. Here a third type of model parameter is involved, namely the mixing proportions $\pi_g$ defined in Eq. (2). From the usual assumption of stochastically independent responses follows the conditional pattern probability

$$p(\mathbf{u} \mid g) = \prod_{i=1}^{n} \frac{\exp[u_i(\theta_{jg} - \sigma_{ig})]}{1 + \exp(\theta_{jg} - \sigma_{ig})}. \tag{6}$$

The model assumes that the Rasch model holds within a number of $G$ different subpopulations, but in each of them with a possibly different set of ability and item parameters. Whereas the mixing proportions $\pi_g$ are model parameters which are estimated by maximum likelihood (ML) methods (see below), the number $G$ of latent classes cannot be estimated directly but has to be inferred from comparing the model fit under the assumption of different numbers of latent classes. A major application of the model in Eq. (4) is the possibility of testing the fit of the ordinary Rasch model to a set of data. Because item parameters have to be constant over different subsamples, a widely applied goodness-of-fit test, the conditional likelihood ratio (CLR) test by Andersen (1973), is based on a splitting of the sample into various subsamples, e.g., score groups. From a mixture-model perspective, the CLR test is testing the null hypothesis of a one-class Rasch model against the alternative hypothesis of a manifest mixture Rasch model, where the score or some other criterion is the mixing variable. Testing the one-class solution against the two-class solution of the mixed Rasch model (MRM) is a more powerful test of the hypothesis of constant item parameters across all subsamples: the MRM identifies that partition of the population into two subpopulations between which the item parameters differ most. As a consequence, it is not necessary to try out several splitting criteria for the sample of individuals in order to test the assumption of constant item parameters for all individuals (Rost and von Davier, 1995). Another field of application is tests in which different groups of people employ different solution strategies for solving the items.
Because different solution strategies usually are connected with different sets of item difficulties, the analysis of these tests requires a model which takes account of different item parameters for these subpopulations (Rost and von Davier, 1993).

In order to apply the model, some further assumptions have to be made regarding the ability parameters $\theta_{jg}$. Mislevy and Verhelst (1990) have chosen a marginal likelihood approach, i.e., they specify a distribution function of $\theta$ and estimate the parameters in this function along with the item parameters. Kelderman and Macready (1990) discuss the concept of the mixed Rasch model in the framework of log-linear models. Here, the ability parameters are neither estimated directly nor eliminated using the assumption


Jürgen Rost

26. Logistic Mixture Models

of a distribution function. The log-linear formulation of the Rasch model provides conditional item parameter estimates and some kind of score parameters reflecting the ability distribution. Rost (1990) chose to condition out the ability parameters within the classes, to estimate the score frequencies within the classes, and to provide unconditional ability estimates on the basis of the conditional estimates of item parameters. Although this topic sounds like a technical question of parameter estimation, it implies a reparameterization of the MRM with a different set of parameters and is therefore described in this section. The conditional pattern probability can be factorized by conditioning on the score $r$ associated with this pattern, $r = \sum_{i=1}^{n} u_i$, i.e.,

$$p(\mathbf{u} \mid g) = p(\mathbf{u} \mid g, r)\, p(r \mid g).$$


The pattern probability under the condition of score $r$ is, just as in the ordinary Rasch model, free of the ability parameters and can be computed by means of the elementary symmetric functions $\gamma_r$ of the item parameters:

$$p(\mathbf{u} \mid r, g) = \frac{p(\mathbf{u} \mid g)}{p(r \mid g)} = \frac{\exp\left(-\sum_{i=1}^{n} u_i \sigma_{ig}\right)}{\gamma_r(\exp(-\boldsymbol{\sigma}_g))}.$$

By introducing new probability parameters $\pi_{rg} = p(r \mid g)$ for the latent score probabilities, the marginal pattern probability is

$$p(\mathbf{u}) = \sum_{g=1}^{G} \pi_g\, \pi_{rg}\, \frac{\exp\left(-\sum_{i=1}^{n} u_i \sigma_{ig}\right)}{\gamma_r(\exp(-\boldsymbol{\sigma}_g))} \tag{9}$$

with normalization condition $\sum_{r=1}^{n-1} \pi_{rg} = 1$ for all $g$.

In the normalization of the score probabilities, those of the extreme scores $r = 0$ and $r = n$ are excluded, the reason being that these score frequencies cannot be estimated separately for each class because the patterns of only 0's or 1's have identical conditional probabilities of one in each class $g$. As a consequence, the number of independent score parameters is $2 + G(n - 2)$. The norming condition for the class sizes, therefore, is $\sum_{g=1}^{G} \pi_g = 1 - p(r = 0) - p(r = n)$.

It follows from this model that each individual belongs to each class with a (posterior) probability which depends on the response pattern in the following way:

$$p(g \mid \mathbf{u}) = \frac{\pi_g\, p(\mathbf{u} \mid g)}{p(\mathbf{u})} = \frac{\pi_g\, p(\mathbf{u} \mid g)}{\sum_{g'=1}^{G} \pi_{g'}\, p(\mathbf{u} \mid g')}. \tag{12}$$

According to these probabilities, each individual can be assigned to that latent population where his or her probability is highest. The model provides two pieces of diagnostic information about each individual: first, the latent population to which the individual belongs, and second, the ability estimate. In this sense, the mixed Rasch model simultaneously classifies and quantifies on the basis of the observed response patterns.

Logistic Mixture Models for Ordinal Data

The mixed Rasch model can easily be generalized to such Rasch models for ordinal data as the partial credit model (Masters, 1982, this volume), the rating-scale model (Andrich, 1978; Andersen, this volume), the dispersion model (Andrich, 1982), or the successive interval model (Rost, 1988). In this family of models, the partial credit model is the most general one. It is defined by the following response function:

$$p(U_{ij} = h) = \frac{\exp(h\theta_j - \sigma_{ih})}{\sum_{s=0}^{m} \exp(s\theta_j - \sigma_{is})} \quad \text{for } h = 0, 1, 2, \ldots, m,$$

where the item-category parameter $\sigma_{ih}$ is a cumulated threshold parameter defined as

$$\sigma_{ih} = \sum_{s=1}^{h} \tau_{is}, \qquad \sigma_{i0} = 0,$$

and $\tau_{is}$ is the location of threshold $s$ on the latent continuum. Generalized to a discrete mixture model, the marginal response probability becomes

$$p(U_{ij} = h) = \sum_{g=1}^{G} \pi_g \frac{\exp(h\theta_{jg} - \sigma_{ihg})}{\sum_{s=0}^{m} \exp(s\theta_{jg} - \sigma_{isg})} \quad \text{with} \quad \sigma_{ihg} = \sum_{s=1}^{h} \tau_{isg} \tag{15}$$

(Rost, 1991; von Davier and Rost, 1995). The normalization conditions are $\sum_{g=1}^{G} \pi_g = 1$ and $\sum_{i=1}^{n} \sum_{h=1}^{m} \tau_{ihg} = 0$ for all $g$.

Again, the likelihood function can be defined on the basis of conditional pattern probabilities, so that the individual parameters are eliminated and score probabilities are introduced:

$$L = \prod_{j=1}^{N} \sum_{g=1}^{G} \pi_g\, \pi_{r_j g}\, \frac{\exp\left(-\sum_{i=1}^{n} \sigma_{i u_{ij} g}\right)}{\gamma_{r_j}(\exp(-\boldsymbol{\sigma}_g))}, \tag{16}$$

where $r_j = \sum_{i=1}^{n} u_{ij}$ is the score of individual $j$.

One drawback of this kind of reparametrization is the relatively high number of score parameters, $2 + G(nm - 2)$, which, for example, lowers the power of the goodness-of-fit tests. In order to reduce this number, a logistic two-parameter function can be used to approximate the score distributions within the latent classes. In this function, the parameter $\mu_g$ is a location parameter whereas $\lambda_g$ represents the dispersion of the score distribution in class $g$:

$$\pi_{rg} = \frac{\exp\left(\frac{r}{nm}\mu_g + \frac{4r(nm - r)}{(nm)^2}\lambda_g\right)}{\sum_{s=0}^{nm} \exp\left(\frac{s}{nm}\mu_g + \frac{4s(nm - s)}{(nm)^2}\lambda_g\right)}. \tag{17}$$
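The two-parameter logistic score distribution can be illustrated with a short sketch. It assumes that the score probabilities are proportional to the exponential of a linear term in $r/(nm)$ weighted by the location parameter plus a parabolic term $4r(nm - r)/(nm)^2$ weighted by the dispersion parameter, normalized over all scores $0, \ldots, nm$. The function name and argument names are hypothetical and chosen only for this illustration.

```python
import math


def logistic_score_distribution(n_items, m, mu, lam):
    """Score probabilities for r = 0..n*m under a two-parameter logistic form.

    n_items: number of items, m: highest category score per item,
    mu: location parameter, lam: dispersion parameter (both per class).
    """
    nm = n_items * m
    exponents = [
        (r / nm) * mu + (4.0 * r * (nm - r) / nm ** 2) * lam
        for r in range(nm + 1)
    ]
    top = max(exponents)  # subtract the maximum to keep exp() numerically stable
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


# Twelve items with four categories (m = 3), as in a typical rating-scale setting.
peaked = logistic_score_distribution(12, 3, mu=0.0, lam=4.0)     # single peak
u_shaped = logistic_score_distribution(12, 3, mu=0.0, lam=-4.0)  # u-shaped
```

With a positive dispersion parameter the mass concentrates around the middle scores, and with a negative one it moves to the extreme scores, so each class needs only two score parameters instead of one per score.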





This distribution function can be used to model different kinds of single peaked distributions; it even becomes u-shaped when $\lambda_g$ is negative.

As special cases of the mixed partial credit model in Eq. (15), the following restricted ordinal Rasch models can be defined (von Davier and Rost, 1995):

(1) the mixed rating scale model

$$p(U_{ij} = h) = \sum_{g=1}^{G} \pi_g \frac{\exp(h\theta_{jg} - h\sigma_{ig} - \psi_{hg})}{\sum_{s=0}^{m} \exp(s\theta_{jg} - s\sigma_{ig} - \psi_{sg})}$$

with normalization conditions $\psi_{0g} = \psi_{mg} = \sum_{i=1}^{n} \sigma_{ig} = 0$ for all classes $g$;

(2) the mixed dispersion model

$$p(U_{ij} = h) = \sum_{g=1}^{G} \pi_g \frac{\exp\left(h\theta_{jg} - h\sigma_{ig} - h(h - m)\tfrac{1}{2}\delta_{ig}\right)}{\sum_{s=0}^{m} \exp\left(s\theta_{jg} - s\sigma_{ig} - s(s - m)\tfrac{1}{2}\delta_{ig}\right)}$$

with normalization condition $\sum_{i=1}^{n} \sigma_{ig} = 0$ for all classes $g$; and

(3) the mixed successive interval model

$$p(U_{ij} = h) = \sum_{g=1}^{G} \pi_g \frac{\exp\left(h\theta_{jg} - h\sigma_{ig} - \psi_{hg} - h(h - m)\tfrac{1}{2}\delta_{ig}\right)}{\sum_{s=0}^{m} \exp\left(s\theta_{jg} - s\sigma_{ig} - \psi_{sg} - s(s - m)\tfrac{1}{2}\delta_{ig}\right)}$$

with normalization conditions $\psi_{0g} = \psi_{mg} = \sum_{i=1}^{n} \sigma_{ig} = \sum_{i=1}^{n} \delta_{ig} = 0$ for all classes $g$.

These models can be applied, e.g., to questionnaires for attitude measurement in order to separate groups of individuals with a different attitude structure in terms of different item parameters in different subpopulations. Moreover, subpopulations may differ not only with respect to their sets of item parameter values but also with respect to the threshold distances between the categories of the response scale. These threshold distances can be interpreted in terms of response sets common to all individuals in that particular latent class. Large distances between the thresholds of the middle category of a rating scale can be interpreted as a tendency toward the mean, whereas a low threshold parameter value for the first categories combined with a high parameter value for the last reflects a tendency toward extreme judgments. Often, a latent class with threshold parameters not showing the expected increasing order can be interpreted as a class of unscalables because ordered thresholds are a prerequisite for measuring a latent trait with rating scales.

Until now, all of these mixed ordinal Rasch models have been defined such that the same type of model is assumed in all classes. Gitomer and Yamamoto (1991) proposed a hybrid model where in one class the Rasch model is assumed to describe the item response process while in other classes the ordinary latent class model is assumed. Such hybrid models may also be defined for the models described here, so that a large number of different mixture models can be generated (von Davier and Rost, 1996). However, data analysis with these models should be theory driven in any case: The idea of mixed logistic models is not to fit all possible data sets by a colorful mixture of IRT models. Rather, the idea is to have a mixture model when there are strong reasons to assume that different things may happen in different subpopulations: different response sets, sloppy response behavior with unscalable individuals, different cognitive strategies, or different attitude structures. Last but not least, the idea is to test the basic assumption of all IRT models, namely that the responses of all individuals can be described by the same model and the same set of parameters. If that is the case, the one-class solution, which is an ordinary IRT model, holds.

Parameter Estimation

The parameters in the mixed Rasch model and its generalizations to ordinal data can be estimated by means of an extended EM-algorithm with conditional maximum likelihood estimation of the item parameters in the M-step. On the basis of preliminary estimates (or starting values) of the model parameters, the pattern frequencies for each latent class are estimated in the E-step as:

$$\hat{n}(\mathbf{u}, g) = n(\mathbf{u}) \frac{\pi_g\, p(\mathbf{u} \mid g)}{\sum_{g'=1}^{G} \pi_{g'}\, p(\mathbf{u} \mid g')}, \tag{21}$$

where $n(\mathbf{u})$ denotes the observed frequency of response vector $\mathbf{u}$, and $\hat{n}(\mathbf{u}, g)$ is an estimate of the portion of that frequency in class $g$. The conditional pattern probabilities $p(\mathbf{u} \mid g)$ are defined by Eq. (6), or, as a function of the item and score parameters, by Eqs. (7) and (8). For the sake of simplicity, the estimation equations for the M-step will be given only for the dichotomous mixed Rasch model. Parameter estimation for the ordinal models can be done analogously and poses no special problems, except for the computation of the symmetric functions, where the so-called summation algorithm has to be applied (von Davier and Rost, 1995).

In the M-step, these proportions of observed pattern frequencies in Eq. (21) form the basis for calculating better estimates of the class size, item, and score parameters $\pi_{rg}$. These parameters can be estimated separately for each latent class by maximizing the log-likelihood function in class $g$:

$$\log L_g = \sum_{\mathbf{u}} \hat{n}(\mathbf{u}, g) \left[ \log(\pi_{rg}) - \sum_{i=1}^{n} u_i \sigma_{ig} - \log(\gamma_r(\exp(-\boldsymbol{\sigma}_g))) \right]. \tag{22}$$
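The E-step and the symmetric functions that enter the class-specific log-likelihood can be sketched in a few lines for the dichotomous case. This is an illustration under simplifying assumptions, not the procedure of any particular program: each class is given a full score distribution over $r = 0, \ldots, n$ (the special treatment of the extreme scores is ignored), and all function and variable names are hypothetical.

```python
import math
from itertools import product


def elementary_symmetric(eps):
    """Return [gamma_0, ..., gamma_n] for eps_i = exp(-sigma_i), adding one item at a time."""
    gamma = [1.0] + [0.0] * len(eps)
    for e in eps:
        for r in range(len(eps), 0, -1):  # descending, so each item enters once
            gamma[r] += e * gamma[r - 1]
    return gamma


def pattern_prob_given_class(u, sigma, score_probs):
    """p(u | g) = pi_rg * exp(-sum_i u_i * sigma_ig) / gamma_r(exp(-sigma_g))."""
    gamma = elementary_symmetric([math.exp(-s) for s in sigma])
    r = sum(u)
    kernel = math.exp(-sum(ui * si for ui, si in zip(u, sigma)))
    return score_probs[r] * kernel / gamma[r]


def e_step(freq, class_sizes, sigmas, score_probs):
    """Split each observed frequency n(u) over classes in proportion to pi_g * p(u | g)."""
    expected = {}
    for u, n_u in freq.items():
        joint = [
            pg * pattern_prob_given_class(u, sg, sp)
            for pg, sg, sp in zip(class_sizes, sigmas, score_probs)
        ]
        total = sum(joint)
        expected[u] = [n_u * j / total for j in joint]
    return expected


# Two items whose difficulty order is reversed between two equally large classes.
sigmas = [[-1.0, 1.0], [1.0, -1.0]]
score_probs = [[0.25, 0.5, 0.25], [0.25, 0.5, 0.25]]
expected = e_step({(1, 0): 10, (0, 1): 10}, [0.5, 0.5], sigmas, score_probs)
# Sanity check: the conditional pattern probabilities of one class sum to one.
check = sum(
    pattern_prob_given_class(u, sigmas[0], score_probs[0])
    for u in product((0, 1), repeat=2)
)
```

In this toy example both patterns have the same score, so the allocation to classes is driven only by which item was solved, not by how many.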

Setting the first partial derivatives with respect to $\sigma_{ig}$ to zero yields the estimation equations for the item parameters within class $g$:

$$\hat{\sigma}_{ig} = -\log \frac{n_{ig}}{\sum_{r} m_{rg}\, \gamma_{r-1,i} / \gamma_r}, \tag{23}$$

where $n_{ig}$ denotes preliminary estimates of the number of individuals with a positive response on item $i$ in class $g$, $m_{rg}$ estimates the number of individuals with score $r$ in class $g$ [both calculated by means of $\hat{n}(\mathbf{u}, g)$ obtained in the preceding E-step], and $\gamma_{r-1,i}$ are the elementary symmetric functions of order $r - 1$ of all item parameters except item $i$. The elementary symmetric functions on the right-hand side of Eq. (23) are calculated by means of preliminary item parameter estimates, and new (better) estimates are obtained on the left-hand side of Eq. (23). Only one iteration of this procedure suffices because it is performed in each M-step of the EM-algorithm and, hence, converges to maximum-likelihood estimates in the course of the EM-procedure. The estimates of the score probabilities and class sizes are explicit:

$$\hat{\pi}_{rg} = \frac{m_{rg}}{n_g} \quad \text{and} \quad \hat{\pi}_g = \frac{n_g}{N},$$

where $n_g$ denotes the number of individuals in class $g$, computed on the basis of $\hat{n}(\mathbf{u}, g)$.

The ability parameters $\theta_{jg}$ in Eq. (4) do not appear in these equations because their sufficient statistics $r_j$ were substituted. They can, however, be estimated using the final estimates of the item and score parameters. This is achieved by maximizing the following intra-class likelihood:

$$\log L_g = \sum_{j} \log p(\mathbf{u}_j \mid g) = \sum_{j} r_j \theta_{jg} - \sum_{i=1}^{n} n_{ig} \sigma_{ig} - \sum_{j} \sum_{i=1}^{n} \log(1 + \exp(\theta_{jg} - \sigma_{ig})) \tag{26}$$

with respect to the unknown ability parameters $\theta_{jg}$, which only depend on the score $r$ of individual $j$. The number of individuals in class $g$ solving item $i$ need not be known because this term vanishes when the first partial derivatives of (26) with respect to the trait parameters $\theta_{jg}$ are computed. The estimation equations are

$$r_j = \sum_{i=1}^{n} \frac{\exp(\theta_{jg} - \sigma_{ig})}{1 + \exp(\theta_{jg} - \sigma_{ig})}, \tag{27}$$

which can be solved iteratively. Hence, the class-dependent trait parameters of the model in Eq. (4) can be estimated in a second step of the estimation procedure, making use of the conditional item parameter estimates, $\hat{\sigma}_{ig}$, obtained in the first step.

Each individual has as many trait (or ability) parameter estimates as there are latent classes. These can be interpreted as conditional trait estimates, i.e., individual $j$ has the trait value $\theta_{jg}$ under the condition that he or she belongs to class $g$. However, these conditional abilities of a single individual $j$ usually do not differ much from one class to another, because the estimates depend mainly on $r_j$ which, of course, is the same in all classes. On the other hand, the class membership of an individual is nearly independent of his or her total score but strongly dependent on which items have been solved. In this sense, the MRM enables one to separate the quantitative and the qualitative aspects of a response pattern: The continuous mixing variable mainly depends on how many items have been solved, whereas the discrete mixing variable mainly depends on which items have been solved.

Software

The models described in the previous sections can be computed by means of the Windows program WIN-MIRA [MIxed RAsch model; von Davier (1995)]. The program provides ML-estimates of the expected score frequencies and individual parameters and conditional ML-estimates of the class-specific item parameters, including estimates of the standard errors based on Fisher's information function. A separate output file for all persons provides the overall ability estimates along with the number of the most likely latent class for each person and all class membership probabilities. Although there are no technical limitations from doing so, it usually makes little sense to apply a mixture model to more than 20 or 30 dichotomous items or 15 to 20 polytomous items, respectively. For large numbers of items the likelihood function becomes very flat, and the risk increases that there are several local maxima. Computation time strongly depends on the size of the data set and the number of classes, but usually a PC-486 would not take more than a few minutes to compute the parameters of a model with reasonable accuracy.

The Windows software WIN-MIRA (von Davier, 1995) provides the log-likelihood of all models computed, the log-likelihood of the saturated model, the LR-statistic based upon these values, as well as the AIC and BIC values. A bootstrapping procedure is optional. Furthermore, the standard errors of the ability estimates are given, and the item-Q index as a measure of item fit (Rost and von Davier, 1994) is calculated for all latent classes. The latter allows an insight into the fit of the Rasch model within each latent class.

Goodness of Fit

To test the fit of a logistic mixture model to empirical data, the usual statistics of cross-table analysis and latent class models can be applied, but a selection has to be made with respect to the size of the data set. The strongest, and from a statistical point of view the most satisfying tests, are the Pearson chi-square and the likelihood ratio for comparing observed and expected pattern frequencies. Both statistics yield, under regular conditions, similar results, and, likewise, have similar asymptotic requirements. However, these requirements are usually not fulfilled when more than 6 or 8 dichotomous items are analyzed, or even only 4 items with 4 categories: In the case of, say, about 500 different possible response patterns, an expectation of 1 or more for all pattern frequencies would require such gigantic sample sizes that these statistics are only applicable when some kind of pattern aggregation takes place.

Unfortunately, the same argument holds for testing the number of latent classes in a mixture model by means of a likelihood-ratio statistic. A likelihood-ratio statistic, which is based on the likelihood of a model with $G$ classes divided by the likelihood of the mixture of $G + 1$ classes, is only asymptotically chi-square distributed if all patterns have a reasonable chance to be observed (which is not possible for even 12 items). For these reasons, information criteria like the AIC or the BIC (Bozdogan, 1987; Read and Cressie, 1988) are a useful alternative for comparing competitive models without drawing statistical inferences in the usual sense. These indices enable the conclusion that a particular model describes the data better than a specified alternative model, given the number of independent model parameters. The information criteria are defined as:

$$\mathrm{AIC} = -2 \log(L) + 2k, \qquad \mathrm{BIC} = -2 \log(L) + \log(N)\,k,$$

where $L$ is the maximum of the likelihood function of a given model, $k$ is the number of independent model parameters, and $N$ is the number of individuals. The smaller the value, the better the model fits. These criteria are based on exactly the same information as the LR-chi-square statistic, i.e., the log-likelihood and the number of parameters. The crucial point, of course, is the kind of "penalty function" connecting the two aspects of model fit. In this respect, the BIC prefers simple solutions, i.e., it penalizes overparametrization more than the AIC. The fact that the number of estimated parameters plays such an important role when comparing the fit of two models was one reason for introducing a logistic approximation to the latent score distributions with only two parameters for each class.
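The two criteria are easy to compute once the maximized log-likelihood and the parameter count are known. The sketch below uses the values reported for the conscientiousness example in Table 1 with $N = 2112$; the $-\log L$ value for the one-class model is taken as 25300.7, the value implied by the reported AIC for that model ($50675.4 = 2 \cdot 25300.7 + 2 \cdot 37$). Function and variable names are hypothetical.

```python
import math


def aic(neg_log_l, k):
    """AIC = -2 log(L) + 2k, given the negative log-likelihood."""
    return 2.0 * neg_log_l + 2.0 * k


def bic(neg_log_l, k, n):
    """BIC = -2 log(L) + log(N) k, given the negative log-likelihood."""
    return 2.0 * neg_log_l + math.log(n) * k


N = 2112  # number of individuals in the example
# number of classes G -> (-log L, number of independent parameters k)
models = {
    1: (25300.7, 37),  # -log L implied by the reported AIC for G = 1
    2: (24815.7, 75),
    3: (24683.8, 113),
    4: (24567.6, 151),
}
criteria = {g: (aic(l, k), bic(l, k, N)) for g, (l, k) in models.items()}
best_aic = min(criteria, key=lambda g: criteria[g][0])
best_bic = min(criteria, key=lambda g: criteria[g][1])
```

Up to rounding, the computed values agree with those reported for the example: the AIC keeps decreasing as classes are added, whereas the heavier penalty of the BIC singles out the two-class solution.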

Examples

A typical application of mixed models for dichotomous items is to spatial reasoning tasks. Here it is assumed that subpopulations of individuals either employ some kind of analytical strategy, in which feature comparison processes take place, or employ some kind of strategy based on an analogous cognitive representation of the stimuli, called "holistic" strategies. As a consequence, at least two latent classes with a preference for one of these strategies are to be expected (Köller et al., 1994; Mislevy et al., 1990; Mislevy et al., 1991; Rost and von Davier, 1993).

Applications of the polytomous mixed models are primarily in the field of attitude questionnaires (Rost and Georg, 1991), measurement of interests (Rost, 1991), or personality assessment. In a project on the scalability of individuals with personality questionnaires, five scales of a questionnaire assessing the so-called "big five" personality factors were analyzed (Borkenau and Ostendorf, 1991). As an example, the results for the dimension "conscientiousness" are reported next.

Twelve items with 4 response categories ("strongly disagree," "disagree," "agree," "strongly agree") were designed to measure the dimension "conscientiousness." A total of 2112 individuals, an aggregation of different samples, were analyzed with the mixed partial credit model, using the approximation of the score distributions by means of the two-parameter logistic function described earlier. Out of $4^{12} = 16{,}777{,}216$ possible patterns, 1937 different response patterns were observed, so that the chi-square statistics could not be applied without (arbitrary) grouping. The decision on the number of latent populations was made on the basis of the BIC index because the AIC turned out to identify so many latent classes that they were no longer interpretable. (This finding is a general experience when working with data sets of this size.) Table 1 shows the results needed for the decision on the number of classes.
TABLE 1. Goodness of Fit Statistics for the Conscientiousness Example.

G    -log L     k      BIC        AIC
1    25300.7    37     50884.5    50675.4
2    24815.7    75     50205.4    49781.4
3    24683.8    113    50232.4    49598.6
4    24567.6    151    50290.7    49437.1

According to these results, the two-class solution fits best. The two classes have an interesting interpretation which can be drawn from the graphical representation of the model parameters in Figs. 1 and 2.

FIGURE 1. The thresholds of the 2-class solution for the scale "conscientiousness." (Legend: class 1, $\pi_1 = 0.65$; class 2, $\pi_2 = 0.35$; thresholds 1, 2, and 3.)

FIGURE 2. Expected frequencies of ability estimates in both classes. (Horizontal axis: ability estimates $\theta$, from $-7$ to $7$; legend: class 1, $\pi_1 = 0.65$; class 2, $\pi_2 = 0.35$.)

Figure 1 shows the threshold parameters of all 12 items in both classes. The bigger class with 65.2 percent of the total population is characterized by ordered thresholds of all items and reasonable threshold distances. This finding is to be expected for individuals that used the rating scale in correspondence with the dimension to be measured. The smaller class with 34.8 percent of the population is characterized by small threshold distances, especially for the second and third threshold. This result may be indicative of different response sets in both classes or even of the fact that the response format is not used adequately in the second class.

Figure 2 shows the estimated distributions of the individual parameters in both classes. Each dot or each square represents a score group. It turns out that the same score groups are assigned to more extreme parameter estimates in class 1 than in class 2. The range of the estimated conscientiousness parameter values obviously is larger in the 65% class than in the small class. This result is also indicative of the fact that the questionnaire measures the trait in both populations in a different way.

Discussion

Mixture models for item response analysis are a relatively new development in IRT. They seem especially useful where test analysis is driven by substantive theory about qualitative differences among people. Whenever different strategies are employed to solve items, or different cognitive structures govern the responses to attitude or personality assessment items, a mixture model may help to reveal subgroups of individuals connected to one of these strategies or structures. Further development should go in the direction of mixtures of different models, for example, models with a Rasch model in one population and some model representing guessing behavior in the other population. In general, mixture models provide an elegant way of testing an ordinary (non-mixed) IRT model by focusing on the basic assumption of all IRT models, which is homogeneity in the sense of constant item parameters for all individuals.

References

Andersen, E.B. (1973). Conditional inference for multiple-choice questionnaires. British Journal of Mathematical and Statistical Psychology 26, 31-44.
Andrich, D. (1978). Application of a psychometric rating model to ordered categories which are scored with successive integers. Applied Psychological Measurement 2, 581-594.
Andrich, D. (1982). An extension of the Rasch model for ratings providing both location and dispersion parameters. Psychometrika 47, 104-113.
Borkenau, P. and Ostendorf, F. (1991). Ein Fragebogen zur Erfassung fünf robuster Persönlichkeitsfaktoren. Diagnostica 37, 29-41.
Bozdogan, H. (1987). Model selection and Akaike's information criterion (AIC). Psychometrika 52, 345-370.
Everitt, B.S. and Hand, D.J. (1981). Finite Mixture Distributions. London: Chapman and Hall.
Gitomer, D.H. and Yamamoto, K. (1991). Performance modeling that integrates latent trait and class theory. Journal of Educational Measurement 28, 173-189.
Kelderman, H. and Macready, G.B. (1990). The use of loglinear models for assessing differential item functioning across manifest and latent examinee groups. Journal of Educational Measurement 27, 307-327.
Köller, O., Rost, J., and Köller, M. (1994). Individuelle Unterschiede beim Lösen von Raumvorstellungsaufgaben aus dem IST- bzw. IST-70-Untertest "Würfelaufgaben." Zeitschrift für Psychologie 202, 64-85.
Masters, G.N. (1982). A Rasch model for partial credit scoring. Psychometrika 47, 149-174.
Mislevy, R.J. and Verhelst, N. (1990). Modeling item responses when different subjects employ different solution strategies. Psychometrika 55, 195-215.
Mislevy, R.J., Wingersky, M.S., Irvine, S.H., and Dann, P.L. (1991). Resolving mixtures of strategies in spatial visualisation tasks. British Journal of Mathematical and Statistical Psychology 44, 265-288.
Read, T.R.C. and Cressie, N.A.C. (1988). Goodness-of-Fit Statistics for Discrete Multivariate Data. New York: Springer.
Rost, J. (1988). Measuring attitudes with a threshold model drawing on a traditional scaling concept. Applied Psychological Measurement 12, 397-409.
Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement 14, 271-282.
Rost, J. (1991). A logistic mixture distribution model for polytomous item responses. British Journal of Mathematical and Statistical Psychology 44, 75-92.
Rost, J. and Georg, W. (1991). Alternative Skalierungsmöglichkeiten zur klassischen Testtheorie am Beispiel der Skala "Jugendzentrismus." ZA-Information 28, 52-75.
Rost, J. and von Davier, M. (1993). Measuring different traits in different populations with the same items. In R. Steyer, K.F. Wender, and K.F. Widaman (eds.), Psychometric Methodology, Proceedings of the 7th European Meeting of the Psychometric Society (pp. 446-450). Stuttgart/New York: Gustav Fischer Verlag.
Rost, J. and von Davier, M. (1994). A conditional item fit index for Rasch models. Applied Psychological Measurement 18, 171-182.
Rost, J. and von Davier, M. (1995). Mixture distribution Rasch models. In G. Fischer and I. Molenaar (eds.), Rasch Models: Foundations, Recent Developments, and Applications (pp. 257-268). Springer.
Titterington, D.M., Smith, A.F.M., and Makov, U.E. (1985). Statistical Analysis of Finite Mixture Distributions. Chichester: Wiley.
von Davier, M.V. and Rost, J. (1995). WINMIRA: A program system for analyses with the Rasch model, with the latent class analysis and with the mixed Rasch model. Kiel: Institute for Science Education (IPN), distributed by iec ProGAMMA, Groningen.
von Davier, M.V. and Rost, J. (1995). Polytomous mixed Rasch models. In G. Fischer and I. Molenaar (eds.), Rasch Models: Foundations, Recent Developments, and Applications (pp. 371-379). Springer.
von Davier, M.V. and Rost, J. (1996). Self-monitoring—A class variable? In J. Rost and R. Langeheine (eds.), Applications of Latent Trait and Latent Class Models in the Social Sciences. Münster: Waxmann.

27
Models for Locally Dependent Responses: Conjunctive Item Response Theory

Robert J. Jannarone

Introduction

The last 15 years have seen locally dependent IRT advances along nonparametric and parametric lines. Results include nonparametric tests for unidimensionality and response function monotonicity (Holland, 1981; Holland and Rosenbaum, 1986; Rosenbaum, 1984, 1987; Stout, 1987, 1990; Suppes and Zanotti, 1981), locally dependent models without individual difference provisions (Andrich, 1985; Embretson, 1984; Gibbons et al., 1989; Spray and Ackerman, 1986), and locally dependent models with individual differences provisions (Jannarone, 1986, 1987, 1991; Jannarone et al., 1990; Kelderman and Jannarone, 1989).

This chapter features locally dependent, conjunctive IRT (CIRT) models that the author and his colleagues have introduced and developed. CIRT models are distinguished from other IRT models by their use of nonadditive item and person sufficient statistics. They are useful for IRT study because they make up a statistically sound family, within which special cases can be compared having differing dimensionalities, local dependencies, nonadditivities, and response function forms.

CIRT models are also useful for extending IRT practice, because they can measure learning during testing or tutoring while conventional IRT models cannot. For example, measuring how students respond to reinforcements during interactive training and testing sessions is potentially useful for placing them in settings that suit their learning abilities. However, such measurement falls outside the realm of conventional IRT, because it violates the local independence axiom.

The IRT local independence axiom resembles the classical laws of physics, and CIRT resembles modern physics in several ways (Jannarone, 1991). Just as modern physics explains particle reaction to measurement, CIRT explains human learning during measurement; just as modern physics includes classical physics as a special case, CIRT includes conventional IRT as a special case; just as modern physics extensions are essentially nonadditive, so are CIRT extensions; just as modern physics extensions are potentially useful, so are CIRT extensions; and just as its usefulness has resulted in modern physics acceptance, locally dependent IRT will become accepted eventually. IRT and CIRT differ from classical and modern physics in terms of historical impact, and neither IRT nor CIRT will become household terms for some time. They figure to become increasingly important, however, as information becomes increasingly available and human abilities to learn from it become increasingly taxed.


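Presentation of the Model

Conjunctive IRT Model Formulation

Although a variety of CIRT models have been proposed, this chapter emphasizes the family of conjunctive Rasch kernels because of their sound psychometric properties. (Kernels in this context correspond to IRT probability models with fixed person parameters.) To outline notation, Roman symbols are measurements and Greek symbols are parameters; upper case symbols are random variables and lower case symbols are their values; bold symbols are arrays, superscripts in brackets are power labels, subscripts in parentheses are array dimensions, and subscripts not in parentheses are array elements or vector labels; Greek symbols are parameters defined on $(-\infty, \infty)$; $U$ and $u$ are item response random variables and values defined on $\{0, 1\}$; $i$ labels denote items, $j$ labels denote subjects, and $k$ labels denote testlets. Each member of the conjunctive Rasch family is a special case of a general Rasch kernel,

$$\Pr\{\mathbf{U}_j = \mathbf{u}_j \mid \boldsymbol{\Theta}_j = \boldsymbol{\theta}_j; \boldsymbol{\beta}\} \propto \exp\Big\{ \sum_{i=1}^{n} (\theta_{ij} - \beta_i) u_{ij} + \sum_{i=1}^{n-1} \sum_{i'=i+1}^{n} (\theta_{ii'j} - \beta_{ii'}) u_{ij} u_{i'j} + \cdots + (\theta_{12 \cdots nj} - \beta_{12 \cdots n}) u_{1j} u_{2j} \cdots u_{nj} \Big\}, \quad j = 1, \ldots, N. \tag{1}$$

The $\theta$ values in Eq. (1) are person parameters usually associated with abilities, and the $\beta$ values in Eq. (1) are item parameters usually associated with difficulties. Equation (1) is too general to be useful as it stands, because it is not identifiable (Bickel and Doksum, 1977). For example, adding a positive constant to $\beta_1$ and adding the same positive constant to each $\theta_{1j}$ does not change any item response probabilities based on Eq. (1). As a result, constraints must be imposed on Eq. (1) to create special cases that can identify unique parameters. One such special case is the conventional Rasch kernel,

$$\Pr\{\mathbf{U}_j = \mathbf{u}_j \mid \Theta_j = \theta_j; \boldsymbol{\beta}\} = \prod_{i=1}^{n} \frac{\exp\{(\theta_j - \beta_i) u_{ij}\}}{1 + \exp\{\theta_j - \beta_i\}} \propto \exp\Big\{ \sum_{i=1}^{n} (\theta_j - \beta_i) u_{ij} \Big\}, \tag{2}$$

which is obtained by equating all first-order person parameters in Eq. (1) to

$$\theta_{1j} = \theta_{2j} = \cdots = \theta_{nj} = \theta_j, \quad j = 1, \ldots, N, \tag{3}$$

and excluding all higher order terms in Eq. (1) by setting

$$\theta_{12j} = \beta_{12},\ \theta_{13j} = \beta_{13},\ \ldots,\ \theta_{12 \cdots nj} = \beta_{12 \cdots n}, \quad j = 1, \ldots, N. \tag{4}$$

(To be precise, one additional constraint must be imposed on the parameters in Eq. (2) to identify it, such as fixing one of the $\beta$ values at 0.)

Conjunctive Model Exponential Family Structure

Many useful formal and practical features of conjunctive Rasch kernels stem from their exponential family membership (Andersen, 1980; Lehmann, 1983, 1986). The likelihood for a sample of independent response patterns $\mathbf{u}_1$ through $\mathbf{u}_N$ based on the general Rasch model in Eq. (1) has exponential structure,

$$\Pr\{\mathbf{U}_1 = \mathbf{u}_1, \ldots, \mathbf{U}_N = \mathbf{u}_N \mid \boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_N; \boldsymbol{\beta}\} \propto \exp\Big\{ \sum_{i=1}^{n} \sum_{j=1}^{N} \theta_{ij} u_{ij} - \sum_{i=1}^{n} \beta_i \sum_{j=1}^{N} u_{ij} + \sum_{i=1}^{n-1} \sum_{i'=i+1}^{n} \sum_{j=1}^{N} \theta_{ii'j} u_{ij} u_{i'j} - \sum_{i=1}^{n-1} \sum_{i'=i+1}^{n} \beta_{ii'} \sum_{j=1}^{N} u_{ij} u_{i'j} + \cdots + \sum_{j=1}^{N} \theta_{12 \cdots nj} u_{1j} \cdots u_{nj} - \beta_{12 \cdots n} \sum_{j=1}^{N} u_{1j} \cdots u_{nj} \Big\}, \tag{5}$$

and the likelihood based on the conventional Rasch special case in Eq. (2) has exponential structure,

$$\Pr\{\mathbf{U}_1 = \mathbf{u}_1, \ldots, \mathbf{U}_N = \mathbf{u}_N \mid \theta_1, \ldots, \theta_N; \boldsymbol{\beta}\} \propto \exp\Big\{ \sum_{j=1}^{N} \theta_j \sum_{i=1}^{n} u_{ij} - \sum_{i=1}^{n} \beta_i \sum_{j=1}^{N} u_{ij} \Big\}. \tag{6}$$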



The exponential structure in Eq. (5) identifies a linked parameter-sufficient statistic pair with each exponent term. The maximum likelihood estimate of each parameter is a monotone function of its linked sufficient statistic when all other sufficient statistics are fixed. As a result, special cases of Eq. (5) are easy to interpret. For example, the conventional Rasch model exponential structure in Eq. (6) links each individual's parameter with the number of items passed by that individual, giving individual parameters an ability interpretation. The same structure links each item parameter with the number of individuals passing the item, giving item parameters a difficulty interpretation. Exponential structure will be used to interpret other conjunctive Rasch kernels after reviewing local independence next.
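The linked sufficient statistics are simple counts, as the following sketch illustrates for a small persons-by-items matrix of 0/1 responses. The first function returns the additive statistics of the conventional Rasch model; the second returns the pairwise cross-product counts that are linked to second-order item parameters in conjunctive kernels. The function names and the example data are hypothetical.

```python
def rasch_sufficient_statistics(responses):
    """Return (items passed per person, persons passing per item)."""
    person_scores = [sum(row) for row in responses]
    item_totals = [sum(col) for col in zip(*responses)]
    return person_scores, item_totals


def pairwise_cross_products(responses):
    """Count, for each item pair, the persons who passed both items."""
    n_items = len(responses[0])
    return {
        (i, k): sum(row[i] * row[k] for row in responses)
        for i in range(n_items)
        for k in range(i + 1, n_items)
    }


# Three persons (rows) responding to three items (columns).
responses = [
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 1],
]
person_scores, item_totals = rasch_sufficient_statistics(responses)
cross_products = pairwise_cross_products(responses)
```

The cross-product counts are nonadditive in the item responses, which is what distinguishes the conjunctive statistics from the row and column sums of the conventional model.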




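Conjunctive IRT and Local Independence Form

The test theory local independence axiom (Lazarsfeld, 1960; Lord and Novick, 1968) has a formal side and a practical side. Formally, local independence requires test pattern kernels to be products of item response kernels. For example, the conventional Rasch kernel is a locally independent special case of Eq. (1), as the first expression in Eq. (2) shows. Unlike the conventional Rasch kernel, most other conjunctive Rasch kernels are locally dependent. For example, consider a test made up of $t$ "testlets" (Wainer and Kiely, 1987), with two items in each testlet. One locally dependent kernel for such a test has the form,

$$\Pr\{\mathbf{U}_{(2 \times t)j} = \mathbf{u}_{(2 \times t)j} \mid (\Theta_j^{[1]}, \Theta_j^{[2]}) = (\theta_j^{[1]}, \theta_j^{[2]}); \boldsymbol{\beta}_{(2 \times t)}^{[1]}, \boldsymbol{\beta}_{(t)}^{[12]}\} = \prod_{k=1}^{t} \frac{\exp\{(\theta_j^{[1]} - \beta_{1k}^{[1]}) u_{1jk} + (\theta_j^{[1]} - \beta_{2k}^{[1]}) u_{2jk} + (\theta_j^{[2]} - \beta_k^{[12]}) u_{1jk} u_{2jk}\}}{1 + \exp\{\theta_j^{[1]} - \beta_{1k}^{[1]}\} + \exp\{\theta_j^{[1]} - \beta_{2k}^{[1]}\} + \exp\{\theta_j^{[1]} + \theta_j^{[1]} + \theta_j^{[2]} - \beta_{1k}^{[1]} - \beta_{2k}^{[1]} - \beta_k^{[12]}\}}. \tag{7}$$

The test kernel in Eq. (7) indicates that items are locally independent between testlets, because Eq. (7) is a product of its $t$ testlet kernels. Item pairs within testlets are locally dependent, however, because each factor in Eq. (7) is not the product of its two marginal item probabilities. For example, the joint kernel for the first testlet item pair in Eq. (7) is

$$\Pr\{(U_{1j1}, U_{2j1}) = (u_{1j1}, u_{2j1}) \mid (\Theta_j^{[1]}, \Theta_j^{[2]}) = (\theta_j^{[1]}, \theta_j^{[2]}); \beta_{11}^{[1]}, \beta_{21}^{[1]}, \beta_1^{[12]}\} = \frac{\exp\{(\theta_j^{[1]} - \beta_{11}^{[1]}) u_{1j1} + (\theta_j^{[1]} - \beta_{21}^{[1]}) u_{2j1} + (\theta_j^{[2]} - \beta_1^{[12]}) u_{1j1} u_{2j1}\}}{D_{j1}}, \tag{8}$$

where $D_{j1} = 1 + \exp\{\theta_j^{[1]} - \beta_{11}^{[1]}\} + \exp\{\theta_j^{[1]} - \beta_{21}^{[1]}\} + \exp\{\theta_j^{[1]} + \theta_j^{[1]} + \theta_j^{[2]} - \beta_{11}^{[1]} - \beta_{21}^{[1]} - \beta_1^{[12]}\}$ is the $k = 1$ denominator of Eq. (7), but it follows from Eq. (7) that its two item kernels are

$$\Pr\{U_{2j1} = u_{2j1} \mid \theta_j^{[1]}, \theta_j^{[2]}\} = \frac{\exp\{(\theta_j^{[1]} - \beta_{21}^{[1]}) u_{2j1}\} + \exp\{(\theta_j^{[1]} - \beta_{11}^{[1]}) + (\theta_j^{[1]} - \beta_{21}^{[1]}) u_{2j1} + (\theta_j^{[2]} - \beta_1^{[12]}) u_{2j1}\}}{D_{j1}} \tag{9}$$

and

$$\Pr\{U_{1j1} = u_{1j1} \mid \theta_j^{[1]}, \theta_j^{[2]}\} = \frac{\exp\{(\theta_j^{[1]} - \beta_{11}^{[1]}) u_{1j1}\} + \exp\{(\theta_j^{[1]} - \beta_{11}^{[1]}) u_{1j1} + (\theta_j^{[1]} - \beta_{21}^{[1]}) + (\theta_j^{[2]} - \beta_1^{[12]}) u_{1j1}\}}{D_{j1}}, \tag{10}$$

the product of which is distinct from Eq. (8). Thus, the conjunctive kernel in Eq. (7) is formally locally dependent.

Practical Consequences of Local Dependence

The practical side of locally dependent conjunctive models can be seen by examining their conditional probability and exponential family structure. From Eq. (8) and Eq. (10), the conditional probability of passing the second item in testlet 1 once the first item has been taken is,

$$\Pr\{U_{2j1} = u_{2j1} \mid U_{1j1} = u_{1j1}; \theta_j^{[1]}, \theta_j^{[2]}\} = \frac{\exp\{[(\theta_j^{[1]} + \theta_j^{[2]} u_{1j1}) - (\beta_{21}^{[1]} + \beta_1^{[12]} u_{1j1})] u_{2j1}\}}{1 + \exp\{(\theta_j^{[1]} + \theta_j^{[2]} u_{1j1}) - (\beta_{21}^{[1]} + \beta_1^{[12]} u_{1j1})\}}. \tag{11}$$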

Evidently from Eq. (11), C/2ji will follow a Rasch model with ability parameter Of and difficulty parameter (3^ a f t e r uiji h a s b e e n failed, but it will follow a different Rasch model with ability parameter (Of +6f') and difficulty parameter ($} + pf]) after Uin has been passed. Thus U2ji follows a locally dependent Rasch model, with its ability and difficulty parameters depending on the previous u\j\ response. In practical terms, the structure of Eq. (11) shows that Eq. (7) can model learning during testing. For example, testlet item pairs like those in Eq. (7) can be given in a series of t trials, within which the student is asked to learn during an initial task, the student is next given a reinforcement in keeping with the task response and the student is then given a second task that requires proper comprehension from the first task. In that case, the second-order item parameters /?[/ indicate how students respond to



reinforced learning as a group and the second-order person parameters $\theta_j^{[2]}$ indicate how students respond to reinforced learning as individuals. The conditional probabilities in Eq. (11) for the testlet model in Eq. (7) show that one Rasch model applies if the first item has been passed and another Rasch model applies otherwise. The second-order ability and difficulty parameters in Eq. (11) have similar interpretations to their first-order counterparts in Eq. (11) and in the conventional Rasch model. The second-order difficulty parameter represents the difference between difficulty when the first item is passed and difficulty when the first item is failed, and the second-order ability parameter represents the difference between ability when a first item is passed and ability when a first item is failed. While its conditional probability structure explains Eq. (7) as a learning model, its exponential family structure justifies it as a learning measurement tool. Its structure based on Eq. (5) is,

$$
\Pr\{U_{(2\times t)1} = u_{(2\times t)1}, \ldots, U_{(2\times t)N} = u_{(2\times t)N} \mid (\Theta_j^{[1]}, \Theta_j^{[2]}) = (\theta_j^{[1]}, \theta_j^{[2]}),\ j = 1, \ldots, N;\ \beta_{(2\times t)}^{[1]}, \beta_{(t)}^{[2]}\}
$$

$$
\propto \exp\Big\{ \sum_{j=1}^{N} \theta_j^{[1]} \sum_{k=1}^{t} (u_{1jk} + u_{2jk}) - \sum_{k=1}^{t} \beta_{1k}^{[1]} \sum_{j=1}^{N} u_{1jk} - \sum_{k=1}^{t} \beta_{2k}^{[1]} \sum_{j=1}^{N} u_{2jk} + \sum_{j=1}^{N} \theta_j^{[2]} \sum_{k=1}^{t} u_{1jk}u_{2jk} - \sum_{k=1}^{t} \beta_{k}^{[2]} \sum_{j=1}^{N} u_{1jk}u_{2jk} \Big\}. \tag{12}
$$

Thus, the exponential structure in Eq. (12) links each first-order ability parameter $\theta_j^{[1]}$ with the number of items passed by each individual, and it links each first-order difficulty parameter $\beta_{ik}^{[1]}$ with the number of individuals passing the item, just as in the conventional Rasch model. It departs from additive Rasch structure, however, by also linking second-order learning statistics to their corresponding learning parameters. The testlet item pairs that persons pass are linked to their $\theta_j^{[2]}$ parameters in Eq. (12), giving them a conjunctive learning ability interpretation. Also, among other individuals having the same number-correct scores, higher conjunctive scores indicate the ability to learn through reinforced item passing, and lower conjunctive scores indicate the ability to learn through reinforced item failing. In this way, the conjunctive testlet scores provide extra information about learning during testing that additive item scores cannot.

For example, suppose a test is made up of 20 testlets, each of which involves two comprehension items from a short test paragraph. The test is given by initially presenting the first item in each testlet and receiving the response, next indicating the correct answer to the student and then presenting the second related testlet item. Now consider the subsample of students who get number-correct scores of 20 on the test, that is, students who pass half the test items. Within the subsample, conjunctive scores of 0 through 10 are possible (a conjunctive score counts testlets in which both items are passed, so it cannot exceed half the number-correct score), with a score of 0 indicating students who learn only by failing first testlet items, a score of 10 indicating students who learn only by passing first testlet items and intermediate scores indicating mixed learning styles.

Conjunctive models such as Eq. (7) are useful for ability selection and diagnosis based on number-correct scores, just like the conventional Rasch model. The key difference is that conjunctive models can also measure learning abilities during testing while conventional models cannot. Conjunctive models are especially suitable for concurrent tutoring settings, where learning styles are measured during on-line tutoring and remaining tasks are tailored accordingly.

Related Locally Dependent Models

The testlet model in Eq. (7) is one among many special cases of the CIRT family [Eq. (1)] that have been reviewed elsewhere (Jannarone, 1991). These include Rasch Markov models having items linked into learning chains; pretest-posttest models for assessing learning style after pretest feedback and componential models for assessing individually necessary skills associated with composite tasks. As with the testlet model in Eq. (7), all such CIRT models are distinguished by their nonadditive statistics and justified by their exponential family structure. Other CIRT models that have been studied include concurrent speed and accuracy (CSA) measurement models (Jannarone, to appear, a) and concurrent information processing (CIP) neural network models (Jannarone, to appear, b). CSA models are based on measuring both item response quickness and correctness, with provisions for reflecting individual differences in cognitive processes. CIP models are designed to produce neurocomputing structures that can learn relationships among measurements, automatically and quickly. All such models rely on nonadditive sufficient statistics within the exponential family that violate local independence. Returning to the formal side, a variety of CIRT contrasts and connections with traditional test theory have been described elsewhere (Jannarone, 1990, 1991, in press), regarding local independence, sufficient statistic additivity, specific objectivity, item discrimination, dimensionality, item response function form and nonparametric alternatives. Some of these connections will be summarized in the remainder of this section.

Local Independence and Measurement Additivity

A straightforward analysis of Eq. (1) has shown that CIRT special cases are locally independent if and only if they exclude nonadditive sufficient statistics (Jannarone, 1990). This simple result has broad implications once CIRT family generality and its statistically sound (SS) properties are recognized, in terms of both exponential family membership and specific objectivity (Jannarone, in press). For example, the two-parameter (Birnbaum) model attempts to explain more than the additive Rasch model by including an extra parameter for each item that reflects discrimination power. However, the Birnbaum model falls outside the family of SS models, which poses an interesting question. Can a SS model be constructed that measures differential item discrimination? Interestingly, the answer is that such a model can be found but it must necessarily violate local independence. Similar conclusions follow for other extended IRT models that measure differential guessing ability, multivariate latent abilities and the like. Indeed, any SS alternative to the Rasch model that extracts extended (i.e., nonadditive) information from binary items must violate the local independence axiom (Jannarone, in press). Thus on the practical side, local independence restricts IRT to settings where people cannot learn during testing. On the formal side, locally independent and SS models are restricted to conventional Rasch models. From both formal and practical viewpoints, therefore, the local independence axiom stands in the way of interesting extensions to test theory and application.

Local Dependence and Dimensionality

Modern IRT has been greatly influenced by factor analysis models, for which a joint normal distribution among latent and observable variables is often assumed (Joreskog, 1978). Factor analysis models usually also assume that observable variables are uncorrelated if all latent variables are fixed, which with normality implies local independence. Indeed, the existence of a "complete" latent space that can satisfy local independence is a conventional test theory foundation (Lord and Novick, 1968, Sec. 16.3; McDonald, 1981; Yen, 1984). Given conjunctive family membership, however, latent variables that can account for all local, interitem dependencies may not exist. For example, if a testlet model satisfying Eq. (7) exists with any $\beta_k^{[2]} \neq \theta_j^{[2]}$, then no equivalent locally independent CIRT counterpart can be constructed. Thus, the traditional notion that conditioning on added dimensions can remove interitem dependencies breaks down in conjunctive IRT, placing it at odds with conventional test theory at the foundation level.

Local Independence and Nonparametrics

Nonparametrics offers a model-free alternative to parametric modeling, which is attractive to some because it avoids making false assumptions. (Nonparametrics is less attractive to others, because it avoids linking model parameters to sufficient statistics and producing sound alternatives in the process.) When properly conceived, nonparametrics admits useful alternatives that fall outside conventional model families. This subsection will describe one nonparametric approach to test theory that is improperly conceived, because it restricts admissible IRT models to only the locally independent variety.

When parametric family membership is not assumed, the local independence assumption is trivial unless other conditions are imposed. For any locally dependent models, Suppes and Zanotti (1981) showed how to construct an equivalent locally independent model, by placing latent variables in one-to-one correspondence with observations. In a similar development, Stout (1990) showed that alternative locally independent and unidimensional models can always explain test scores when items are discrete. However, these alternatives are vacuous as Stout pointed out, because statistical inference is meaningless when only one observation exists per parameter.

To exclude such trivial models from consideration, some authors (Holland, 1981; Holland and Rosenbaum, 1986; Rosenbaum, 1984; Stout, 1987; 1990) developed "nonparametric tests" based on test models that satisfied certain requirements. Unfortunately the requirements they imposed, namely response function (RF) monotonicity and local independence, restricted admissible IRT models to the exclusion of useful parametric models, namely CIRT models. Consequently, the unidimensionality tests that they based on these restrictions are not at all nonparametric. Understanding the rationale behind imposing the above restrictions is useful, because it clarifies some important distinctions between IRT and CIRT models.
Response function monotonicity is appealing because it allows latent variables to be regarded as abilities that increase with item passing probabilities (Holland and Rosenbaum, 1986). Monotonicity along with local independence also guarantees latent trait identifiability in a variety of settings. For example, consider the observable random variable $(U_1, U_2)$ for a two-item test that depends on a single latent trait $\Theta$. Given local independence in this case, a model would not be identifiable if for any $\theta_1 \neq \theta_2$,

$$
(\Pr\{U_1 = 1 \mid \Theta = \theta_1\}, \Pr\{U_2 = 1 \mid \Theta = \theta_1\}) = (\Pr\{U_1 = 1 \mid \Theta = \theta_2\}, \Pr\{U_2 = 1 \mid \Theta = \theta_2\}), \tag{13}
$$

but Eq. (13) would be impossible given RF monotonicity. Thus, besides improving interpretability, monotonicity can help guarantee statistical soundness. Response function monotonicity is restrictive, however, because it excludes useful models from consideration. For example, identifiable CIRT models can be constructed satisfying Eq. (13), much as analysis-of-variance models can be constructed having interactions but no main effects (Jannarone, 1990; Jannarone and Roberts, 1984).



In a related issue, CIRT models illustrate that "nonparametric tests for unidimensionality" (Holland and Rosenbaum, 1986) are misleading. For example, the testlet model in Eq. (7) with the $\theta_j^{[1]} = \theta_j^{[2]}$ is one-dimensional. However, data generated from such a model when the $\theta_j^{[2]} < \beta_k^{[2]}$ would produce negative interitem correlations within testlets, which are considered to be "rather strong evidence that the item responses themselves are not unidimensional" (Holland and Rosenbaum, 1986, p. 1541). Thus, excluding useful models from a restrictive "nonparametric" framework produces misleading conclusions when the framework is applied to them.

Another related issue is Stout's (1990) "essential local independence" requirement, which is satisfied if for each latent trait vector $\theta$,

$$
D_n(\theta) = \frac{\sum_{1 \le i < i' \le n} |\mathrm{Cov}(U_i, U_{i'} \mid \Theta = \theta)|}{\binom{n}{2}} \to 0 \tag{14}
$$

as $n \to \infty$. Stout's definition is misleading in that some special cases of Eq. (1) are "essentially locally independent," but they have measurable and interesting locally dependent properties that prevail asymptotically. In the testlet model (Eq. (7)) case for example, $\mathrm{Cov}(U_i, U_{i'} \mid \Theta = \theta)$ is a positive constant if $i = i'$, a different constant for values of $i$ and $i'$ corresponding to items within testlets and 0 otherwise. Thus, as $n \to \infty$, $D_n(\theta)$ approaches 0, because the denominator of Eq. (14) is of higher order in n than its numerator (2 vs. 1). Stout's conclusion based on Eq. (14) would therefore be that the testlet model in Eq. (7) is "essentially locally independent," which makes no sense in light of its useful locally dependent properties.

The incorrect conclusions that follow when the above results are applied to conjunctive IRT models are worthy of special note, because the above approach is improperly presented as nonparametric. A worthwhile nonparametric approach excludes as few possibilities from consideration as possible, especially those that have clear potential utility. In a broader sense, the same problem holds for the local independence axiom that prevails throughout test theory. The time is at hand for the axiom to be replaced by a more general alternative, now that useful test models have been shown to exist that violate it.

Parameter Estimation

Parameter estimation for binary item conjunctive IRT resembles conventional Rasch model estimation, in that person and item parameters are estimated from person and item sufficient statistics. The soundness and efficiency of conjunctive IRT estimates are also assured by their exponential family membership, just as in the conventional Rasch estimation case.


The simplest person-parameter estimation approach for conjunctive IRT models is to use person sufficient statistics themselves in place of their linked parameters. More efficient schemes are possible for item parameter estimation, but they are much more complicated. Conjunctive Rasch model estimation is more complex than conventional Rasch model estimation, because nonadditive statistic possibilities are restricted when additive statistics values are fixed, and vice versa. For example, if a test is made up of five testlets satisfying Eq. (7), person sufficient statistics of the form

$$
(s_j^{[1]}, s_j^{[2]}) = \Big( \sum_{k=1}^{5} (u_{1jk} + u_{2jk}),\ \sum_{k=1}^{5} u_{1jk}u_{2jk} \Big) \tag{15}
$$

have possible $s_j^{[1]}$ values ranging from 0 to 10 and possible $s_j^{[2]}$ values ranging from 0 to 5. However, the range of $s_j^{[2]}$ values is restricted when $s_j^{[1]}$ values are fixed, for example to 0 given that $s_j^{[1]} = 0$, to between 1 and 3 given that $s_j^{[1]} = 6$ and to 4 given that $s_j^{[1]} = 9$. With these and other such restrictions, the end result is that only 21 possible contingencies exist for $(s_j^{[1]}, s_j^{[2]})$ instead of the 66 (11 x 6) that might be expected, causing $s_j^{[2]}$ to be highly correlated with $s_j^{[1]}$.

Efficient schemes that have been most studied so far are two-stage maximum likelihood estimation (Jannarone, 1987), conditional maximum likelihood estimation (Kelderman and Jannarone, 1989) and "easy Bayes" estimation (Jannarone et al., 1989) for the Rasch Markov model. Besides complexity resulting from additive and nonadditive statistic contingencies, an added source for complexity is the need for iterative procedures such as the Newton-Raphson algorithm. A second added source for complexity is the need for recursive (elementary symmetric-like) functions to avoid counting all possible test patterns during each iteration. A third added source for complexity is the need to augment sample data with prior information so that useful estimates can be obtained for all possible contingencies. In the Eq. (7) case involving five testlets, for example, maximum likelihood person parameter estimation is possible for only 6 of the 21 possible $(s_j^{[1]}, s_j^{[2]})$ contingencies, because the other 15 occur at boundary values for one person-statistic when the other person-statistic is fixed. To give one such set of boundary values, $s_j^{[2]}$ values are restricted to between 1 and 3 given that $s_j^{[1]} = 6$, but among these possible contingencies only $(s_j^{[1]}, s_j^{[2]}) = (6,2)$ admits a maximum likelihood estimate, because $(s_j^{[1]}, s_j^{[2]})$ values of (6,1) and (6,3) are conditional boundary values for $s_j^{[2]}$. Such boundary-value contingencies can produce useful "easy Bayes" estimates instead of impossible maximum likelihood estimates, but deriving them is not easy and obtaining them takes iterative convergence time (Jannarone et al., 1990).
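The contingency counts quoted above can be reproduced by direct enumeration. The sketch below is illustrative only: the function names are hypothetical, and the interior test assumes the standard exponential-family condition that a finite maximum likelihood estimate exists only when the observed sufficient statistic lies strictly inside the convex hull of its attainable values (here a triangle with vertices (0, 0), (t, 0) and (2t, t)).

```python
def contingencies(t):
    """All attainable (s1, s2) person statistics for t two-item testlets."""
    pairs = set()
    for both in range(t + 1):            # testlets with both items passed
        for one in range(t - both + 1):  # testlets with exactly one item passed
            pairs.add((2 * both + one, both))
    return sorted(pairs)


def interior_contingencies(t):
    """Contingencies strictly inside the triangular hull of the support.

    Assumption of this sketch: these are the cases that admit finite
    maximum likelihood person estimates.
    """
    return [(s1, s2) for s1, s2 in contingencies(t)
            if s2 > 0 and s2 > s1 - t and 2 * s2 < s1]


if __name__ == "__main__":
    print(len(contingencies(5)))                              # 21
    print([s2 for s1, s2 in contingencies(5) if s1 == 6])     # [1, 2, 3]
    print(interior_contingencies(5))
```

For five testlets the enumeration gives 21 attainable pairs, of which 6 are interior, including (6, 2) but not (6, 1) or (6, 3).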



Because efficient estimation takes both effort to program and time to converge, the use of simple sufficient statistics in place of efficient parameter estimates is recommended, at least as a basis for initial study. For example, assessing the stepwise correlation between external measurements of interest and $s_j^{[2]}$ values, over and above $s_j^{[1]}$ values, is a natural and simple first step in evaluating the utility of testlet learning measurements. Also, using $s_j^{[2]}$ values instead of efficient $\theta_j^{[2]}$ estimates may be needed in tailored tutoring or testing settings where person estimates must be updated and used concurrently and quickly, before selecting a new task for presentation.


Results of all such assessments have shown that CIRT models have potential utility over and above additive IRT models. Significantly interacting item counts were found to far exceed chance values in personality data (Jannarone and Roberts, 1984), and significant conjunctive item parameter counts were found to far exceed chance levels in verbal analogies and personality data (Jannarone, 1986). Simulation studies have also been performed to assess CIRT model utility based on componential item pairs (Jannarone, 1986) and items linked by learning (Jannarone, 1987). Both studies showed that CIRT models produce conjunctive ability estimates with significant predictability over and above additive ability estimates.

Goodness of Fit

Efficient methods for assessing conjunctive parameters have been developed based on straightforward asymptotic theory (Jannarone, 1986; 1987; Kelderman and Jannarone, 1989), but the use of simple methods based on sufficient statistics is recommended for initial study. For example, second-order testlet parameters are worth estimating only if their linked measurements provide useful predictive power over and above additive sufficient statistics. As in the previous section, the easiest means for utility assessment is through stepwise correlation of second-order sufficient statistics with external measurements, controlling for first-order sufficient statistic correlations. Less simple methods for assessing goodness of fit have been developed based on consistent conjunctive parameter estimates and multinomial hypothesis testing (Jannarone, 1986). These methods produce asymptotic chi square tests that additive and conjunctive item parameters are zero, which provide global as well as nested assessment of additive Rasch models and their conjunctive extensions. Other asymptotic tests based on maximum likelihood estimates (Jannarone, 1987) and conditional maximum likelihood estimates (Kelderman and Jannarone, 1989) are simple to construct, because CIRT models belong in the exponential family.

Examples

Available empirical results for nonadditive test measurement include simple tests for item cross-product step-wise correlations from personality data (Jannarone and Roberts, 1984), and asymptotically efficient tests of hypotheses that second-order parameters are 0 from educational test data (Jannarone, 1986). Also, Monte Carlo studies have been performed for the Rasch-Markov model, verifying that useful predictability can be obtained from second-order statistics, over and above additive statistical information (Jannarone, 1987).

Discussion

A natural area for conjunctive IRT application is interactive computer testing and tutoring, where learning performance and task response times can quickly be assessed, reinforcements can quickly be provided and new tasks can quickly be selected accordingly. This area is especially promising now that concurrent information processing procedures are available for automatically learning and responding to input computer measurements as quickly as they arrive (Jannarone, to appear, b). For tailored tutoring and testing applications, this translates to receiving a student's response to an item, measuring its quickness and correctness, quickly updating that student's parameter estimates accordingly and automatically selecting the next task for presentation accordingly. Bringing IRT into computerized tutoring requires developing entirely new tests that are based on assessing reinforced learning, instead of assessing previously learned aptitude. To the author's knowledge this is an interesting but unexplored field for future IRT study. As human-computer interaction continues to accelerate its impact on human information processing, a prominent role for IRT in computing technology could bring new vitality to psychometrics. The most promising such role is for concurrent learning assessment during adaptive computerized tutoring. A necessary condition for IRT involvement in this vital new area, however, is that it break away from local independence as a fundamental requirement.





28 Mismatch Models for Test Formats that Permit Partial Information to be Shown

T.P. Hutchinson

Introduction

The conventional IRT models (that is, those for which the probability of examinee j correctly answering item i is $P_i(\theta_j)$, $\theta$ being ability and P being an increasing function) make no allowance for the examinee having partial information about the question asked. The models are unable to predict what might happen in situations which allow the partial information to be shown. Relations between probabilities of correctness in different formats of test (for example, with different numbers of options to choose from, or permitting a second attempt at items initially answered wrongly) do not fall within their scope. There are two reasons why the existence of partial information is credible:

1. Introspection. Often, one is conscious of having some information relating to the question posed, without being sure of being able to select the correct option.

2. Experiment. Among responses chosen with a high degree of confidence, a greater proportion are correct than among those chosen with a low degree of confidence. Second attempts at items initially answered wrongly have a higher-than-chance probability of being correct. There are other results of this type.

Much the same can be said concerning perception, e.g., of a faint visual or auditory stimulus. What has been done in that field is to develop signal detection theory (SDT). The model to be proposed in this chapter is SDT applied to multiple-choice tests. Just as SDT enables performance in different types of perceptual experiment to be interrelated, so mismatch theory enables performance in different formats of ability or achievement test to be interrelated. Over the years, many different formats have been tried at some time or other; among them: (1) varying the number of options; (2) including some items for which all options are wrong; (3) asking the examinee to accompany the response with a confidence rating; (4) having the examinee mark as many options as are believed to be wrong; and (5) permitting the examinee to make a second attempt at items initially answered wrongly. More information concerning mismatch theory can be found in Hutchinson (1991). The first publication of the theory was in Hutchinson (1977), with much further development being in Hutchinson (1982).

Presentation of the Models

The Main Ideas

The important features of mismatch theory can be described using a single parameter to represent the ability of the examinee relative to the difficulty of the item. The symbol $\lambda$ will be used. It is implicit that $\lambda$ could be more fully written as $\lambda_{ij}$ (where i labels the item and j labels the examinee); and that it would then be of interest to split $\lambda$ into item and examinee components, e.g., as $\lambda_{ij} = 1 + \exp(\theta_j - b_i)$.

The central idea is as follows: Each alternative response that is available in a multiple-choice item gives rise within the examinee to a continuously distributed random variable that reflects how inappropriate the alternative is to the question posed, that is, how much "mismatch" there is between the question and the alternative. The distribution of this random variable, which will be denoted X, is different in the case of the correct option from what it is for the incorrect options. Specifically, the correct answer tends to have a lower value of X than do the incorrect alternatives. The greater the difference between the distributions of X, the easier is the question (or the cleverer is the examinee). Denote the two distributions by F and G, $\Pr(X > x)$ being $F(x)$ for the correct option and $G(x)$ for any incorrect option, where $F < G$. The derivatives $-dF(x)/dx$ and $-dG(x)/dx$ are the respective probability density functions (p.d.f.'s).

It is supposed that for each examinee, there is a response threshold T, such that if the mismatch exceeds this for all the alternative options, no choice is made. If at least one alternative gives rise to a value of X less than T, the one with the lowest value of X is selected. Equations for the probabilities of a correct response, of a wrong response, and of omitting the item may readily be obtained. For a given item, ability is measured by how different F and G are. They are chosen so as to jointly contain a single parameter characterizing ability, $\lambda$. The value of $\lambda$ may then be found from the empirical probabilities of correct and wrong responses. To make any progress, it is necessary to make assumptions about F and G:



1. Exponential Model. One possible choice is to say that for the correct alternative, X has an exponential distribution over the range 0 to $\infty$, with scale parameter 1; and for the incorrect alternatives, X has an exponential distribution with scale parameter $\lambda$, with $\lambda > 1$. That is, we take $F(x) = \exp(-x)$ and $G(x) = \exp(-x/\lambda)$.

2. Normal Model. Another possible choice is to say that X has a normal distribution in both cases, but with different means. That is, $F(x) = 1 - \Phi(x)$ and $G(x) = 1 - \Phi(x - \lambda)$, with $\lambda > 0$.

3. All-or-Nothing Model. A third possibility is that, for the correct alternative, X has a uniform distribution over the range 0 to 1; and for the distractors, X has a uniform distribution over the range $1 - \lambda^{-1}$ to 1, with $\lambda > 1$. The significance of this proposal is that it embodies the conventional idea of all-or-nothing information: if mismatch is less than $1 - \lambda^{-1}$, the option generating it must have been the correct one; if mismatch exceeds $1 - \lambda^{-1}$, it carries no information about which sort of option it came from, and the examinee's response is a random guess. (Thus it is appropriate to call this a two-state theory.)

4. Symmetric Uniform Model. Take X to have a uniform distribution over the range 0 to 1, or a uniform distribution over the range $\lambda$ to $1 + \lambda$, with $0 < \lambda < 1$, for the correct and incorrect options, respectively. This model is almost identical to a model introduced by Garcia-Perez (1985), who supposes that the truth value of each option is either known (with probability $\lambda$) or not known (with probability $1 - \lambda$). That is, there are three possible states:
   a. Known to be correct: This state is entered with probability $\lambda$ for the correct option.
   b. Status not known: This state is entered with probability $1 - \lambda$ for any option.
   c. Known to be incorrect: This state is entered with probability $\lambda$ for the incorrect options.
   If no option is in the first state and at least two options are in the second state, then the examinee may (with probability g) guess at random between those options in the second state, or may (with probability $1 - g$) leave the item unanswered.

However, the assumptions about F and G need not be arbitrary, but can be guided by what empirical results are found. For instance, a choice of F and G that enables second-attempt performance to be correctly predicted from first-attempt performance is preferable to a choice that does not. Naturally, mathematical tractability also plays a part in the choice of F and G; the Exponential and All-or-Nothing Models are most convenient in this respect.



An alternative way of viewing this theory is in terms of "association" between options and the questions asked, instead of "mismatch." Association will be the opposite of mismatch, and the examinee will choose the option generating the highest association, rather than the lowest mismatch. There is no compelling reason to prefer either the language of mismatch or that of association.

Application to the Omission of Responses

In what follows, m will be the number of options per item, and c, w, and u will be the probabilities of items being correctly answered, wrongly attempted, and omitted. Putting the model's assumptions that were described above into mathematical notation,

$$
u = F(T)[G(T)]^{m-1}, \tag{1}
$$

$$
c = \int_{-\infty}^{T} \left(-\frac{dF(x)}{dx}\right)[G(x)]^{m-1}\,dx. \tag{2}
$$

As $w = 1 - u - c$, there are two equations connecting the two unknowns, $\lambda$ and T, with the two observables, c and w. In words, Eq. (1) says that the probability of not attempting the item is the probability that the mismatches from all options exceed T; and this is obtained by multiplying together the probabilities of the individual mismatches exceeding T, it being assumed these are independent. And Eq. (2) says that to get the probability of answering correctly, we must consider all values less than T of the mismatch from the correct alternative; if this value happens to be x, the mismatches from all incorrect options need to be greater than x if the correct option is to be chosen. Now, the probability that the correct option generated a mismatch between x and x + dx is $(-dF(x)/dx)\,dx$, the probability that all the other mismatches are greater is $[G(x)]^{m-1}$, and we need to multiply these together and sum over all values of x that are less than T.

Omissions have long been a practical problem in the scoring of tests. One can instruct examinees to attempt all items, and tolerate any error resulting from failure to fully follow this. Or one can use the "correction for guessing" in scoring, and put up with the lack of realism in the random guessing model on which this is based. But it would be better to understand more fully what is happening, and to have a greater variety of actions available in the face of omissions. It is a simple matter to derive the correction for guessing $c - w/(m-1)$ from Eqs. (1) and (2) using the All-or-Nothing Model. If the Exponential Model is assumed instead, the ratio $c/[w/(m-1)]$ is found to be the appropriate measure of ability.
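The two scoring results stated above can be checked numerically from Eqs. (1) and (2). The following Python sketch is illustrative rather than definitive: the function names, the midpoint-rule integrator and the argument name `lam` (for the ability parameter) are assumptions of this example. For the All-or-Nothing Model the check additionally assumes that the threshold T lies at or above the cutoff $1 - 1/\lambda$, so that a known correct option is always chosen.

```python
import math


def integrate(f, a, b, n=20000):
    """Midpoint rule; adequate for the continuous integrands used here."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def response_probabilities(F, G, pdf_correct, T, m, lower):
    """Return (c, w, u) from Eqs. (1) and (2) for survivor functions F and G."""
    u = F(T) * G(T) ** (m - 1)                                    # Eq. (1)
    c = integrate(lambda x: pdf_correct(x) * G(x) ** (m - 1), lower, T)  # Eq. (2)
    return c, 1.0 - u - c, u


def exponential_model(lam, T, m):
    """Exponential Model: F(x) = exp(-x), G(x) = exp(-x / lam)."""
    return response_probabilities(
        F=lambda x: math.exp(-x),
        G=lambda x: math.exp(-x / lam),
        pdf_correct=lambda x: math.exp(-x),
        T=T, m=m, lower=0.0)


def all_or_nothing_model(lam, T, m):
    """All-or-Nothing Model: distractor mismatch uniform on (1 - 1/lam, 1)."""
    known = 1.0 - 1.0 / lam
    return response_probabilities(
        F=lambda x: 1.0 - x,
        G=lambda x: 1.0 if x < known else lam * (1.0 - x),
        pdf_correct=lambda x: 1.0,
        T=T, m=m, lower=0.0)
```

Under these assumptions, `c / (w / (m - 1))` from `exponential_model` should return `lam`, and `c - w / (m - 1)` from `all_or_nothing_model` should return `1 - 1 / lam`, the probability that the correct option is recognized outright.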


Application to Second-Choice Performance

It is now presumed that all items are attempted and that a second attempt is permitted at items wrongly answered at first. Without loss of generality, X can be taken to have a uniform distribution over the range 0 to 1 in the case of correct options. That is, $-dF/dx$ is 1 for x between 0 and 1 and is 0 elsewhere. Thus all the flexibility of choosing a model is embodied in the choice of G(x). It will be convenient to adopt this formulation now. It is evident that according to the mismatch theory, the probability that the mismatch takes a value between x and x + dx in the case of the correct option, that exactly k of the wrong options have a higher mismatch, and that exactly m - 1 - k of them have a lower mismatch is

$$\binom{m-1}{k}\,[G(x)]^{k}\,[1 - G(x)]^{m-1-k}\,dx. \tag{3}$$

Consequently, the probability of making the correct choice at first attempt can be seen to be (on setting k = m - 1 and integrating over all possible values of X)

$$c_1 = \int_0^1 [G(x)]^{m-1}\,dx, \tag{4}$$

and similarly the probability of the correct alternative having the second-lowest mismatch is

$$c_2^{*} = \int_0^1 (m-1)\,[G(x)]^{m-2}\,[1 - G(x)]\,dx. \tag{5}$$

The (conditional) probability of giving the correct answer when the second choice is made is then $c_2 = c_2^{*}/(1 - c_1)$. For some choices of model, the expressions for $c_1$ and $c_2$ in terms of λ are simple, and λ can be eliminated from these equations, enabling us to obtain a relationship between $c_1$ and $c_2$:

1. If the All-or-Nothing Model is used, $c_2 = 1/(m-1)$, independent of $c_1$, is obtained.
2. If the Exponential Model is used, $c_2 = (m-1)c_1/(m-2+c_1)$ is obtained.
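The relationship between first-attempt and second-attempt success can be checked with a short calculation. The sketch below evaluates the integrals for $c_1$ and $c_2^{*}$ by the midpoint rule for an arbitrary G, and compares the result with the two closed-form relations stated above. The choice G(x) = (1 - x)**a with 0 < a <= 1 is an assumption about how the Exponential Model looks once the correct option's mismatch is made uniform on (0, 1); the function names are hypothetical.

```python
def first_and_second_choice(G, m, steps=100000):
    """Midpoint-rule evaluation of c1 and c2* on [0, 1].

    Returns (c1, c2), where c2 = c2* / (1 - c1) is the conditional
    probability of being correct at the second attempt.
    """
    h = 1.0 / steps
    c1 = 0.0
    c2_star = 0.0
    for i in range(steps):
        g = G((i + 0.5) * h)
        c1 += g ** (m - 1) * h
        c2_star += (m - 1) * g ** (m - 2) * (1.0 - g) * h
    return c1, c2_star / (1.0 - c1)

def c2_all_or_nothing(m):
    """Second-attempt success under the All-or-Nothing Model."""
    return 1.0 / (m - 1)

def c2_exponential(c1, m):
    """Second-attempt success under the Exponential Model, given c1."""
    return (m - 1) * c1 / (m - 2 + c1)

# ASSUMED form of G for the Exponential Model (illustration only).
a = 0.25
m = 5
c1, c2 = first_and_second_choice(lambda x: (1.0 - x) ** a, m)
print(c1, c2, c2_exponential(c1, m))

# An examinee with c1 = 0.5 on five-alternative items and 28 items
# left for a second attempt:
print(28 * c2_all_or_nothing(5))      # expected number under All-or-Nothing
print(28 * c2_exponential(0.5, 5))    # expected number under Exponential
```

With these inputs the numerical $c_2$ agrees with the closed-form Exponential relation, which is the point of the check; a different G would generally give a different relation between $c_1$ and $c_2$.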

What if All Options are Wrong?

Items for which all the options are wrong are, perhaps, not revealing of partial information as such, but the proportion which examinees attempt is a measure of their willingness to answer, and thus helps interpret behavior on normal items. The most straightforward assumption to make is that the probability distribution of mismatch for all the alternatives listed for any nonsense item is the same as for the incorrect alternatives in the genuine items. Then the probability of leaving a nonsense item unanswered is $[G(T)]^{m}$, and the probability of answering is $a = 1 - [G(T)]^{m}$. Once one has made a choice of model, a can be expressed in terms of T; T can be found from responses to conventional items (i.e., from u, c, and w); finally, an equation can be written to predict a from (any two of) u, c, and w. In the case of the All-or-Nothing Model, a may be found to be

$$a = \frac{mw}{(m-1)u + mw}. \tag{6}$$

This equation has been given by Ziller (1957). In the case of the Exponential Model, a may be found to be

$$a = 1 - u^{\,mw/[(m-1)(1-u)]}. \tag{7}$$
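Both predictions of the proportion of nonsense items attempted can be computed directly from an examinee's omission and wrong-answer rates on genuine items. The sketch below is illustrative only. The first function is the All-or-Nothing (Ziller) expression. The second follows from eliminating the ability parameter and the threshold under the assumed forms F(x) = exp(-lam * x) and G(x) = exp(-x), which is one way of writing the Exponential Model and should be treated as an assumption; the function names are hypothetical.

```python
import math

def attempt_rate_all_or_nothing(u, w, m):
    """Predicted proportion of nonsense items attempted, All-or-Nothing Model."""
    return m * w / ((m - 1) * u + m * w)

def attempt_rate_exponential(u, w, m):
    """Predicted proportion of nonsense items attempted, Exponential Model.

    ASSUMES F(x) = exp(-lam * x), G(x) = exp(-x); requires 0 < u < 1.
    """
    exponent = m * w / ((m - 1) * (1.0 - u))
    return 1.0 - u ** exponent

# Consistency check for the All-or-Nothing case: an examinee who knows the
# answer with probability 0.4 and otherwise omits with probability 0.3.
m = 4
u_aon = 0.6 * 0.3
w_aon = 0.6 * 0.7 * (m - 1) / m
print(attempt_rate_all_or_nothing(u_aon, w_aon, m))   # compare with 1 - 0.3

# Consistency check for the Exponential case with lam = 2.0, T = 0.3.
lam, T = 2.0, 0.3
rate = lam + (m - 1)
u_exp = math.exp(-rate * T)
w_exp = (m - 1) / rate * (1.0 - u_exp)
print(attempt_rate_exponential(u_exp, w_exp, m))      # compare with 1 - exp(-m * T)
```

In both checks the inputs u and w are generated from the model itself, so the predicted attempt rate should reproduce $1 - [G(T)]^{m}$ for that model.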




"None of the Above"

Suppose that in each item, the final option is "none of the above" (NOTA), and that examinees attempt every item. Let NOTA− and NOTA+ items be those where NOTA is wrong and correct, respectively. There will be five types of response: for NOTA− items, correct content option, incorrect content option, and NOTA option; and for NOTA+ items, incorrect content option and NOTA option. Then mismatch theory will suggest that:

$$n_- = F(T)\,[G(T)]^{m-2}, \tag{8}$$

$$c_- = \int_{-\infty}^{T} \left(-\frac{dF(x)}{dx}\right) [G(x)]^{m-2}\,dx, \tag{9}$$

$$n_+ = [G(T)]^{m-1}, \tag{10}$$

where $n_-$ is the probability of a NOTA response in a NOTA− item, $c_-$ is the probability of a correct response in a NOTA− item, and $n_+$ is the probability of a NOTA response in a NOTA+ item. (The response threshold T is not necessarily the same as with ordinary items not having the NOTA option.) The probabilities of wrong responses are $w_- = 1 - n_- - c_-$ and $w_+ = 1 - n_+$. If the Exponential Model holds, it may be shown that the relation

$$\log n_+ = \frac{m-1}{m-2}\cdot\frac{w_-}{c_- + w_-}\,\log n_- \tag{11}$$

should hold (see Hutchinson, 1991, Sect. 5.12, which includes a discussion of the paper by Garcia-Perez and Frary, 1989).

Separate Item and Examinee Parameters

In Hutchinson (1991, Chap. 9), there are some proposals about how λ can be decomposed into separate item and examinee parameters, $b_i$ and $\theta_j$. For example, in the Exponential Model, λ can be anything from 1 to ∞. One might try $\log\log\lambda_{ij} = \theta_j - b_i$, or alternatively $\log(\lambda_{ij} - 1) = \theta_j - b_i$. This reference also suggests how different wrong options may be allowed for, and the probability of examinee j selecting option k in item i modelled:

1. Take the mismatch distribution to be $G(x; \lambda_{ijk})$ for the jth examinee evaluating the kth option in the ith item. (For compactness of notation, forget about using the symbol F for the distribution for the correct option; let this be one of the G's.)
2. The probability of selecting option k is (if it is assumed that omitting the item is not a possibility, so that the response threshold T can be taken as ∞):

$$\int_{-\infty}^{\infty} \left(-\frac{\partial G(x;\lambda_{ijk})}{\partial x}\right) \prod_{l \neq k} G(x;\lambda_{ijl})\,dx. \tag{12}$$

3. When λ is split into option $b_{ik}$ and examinee $\theta_j$ components, Eq. (12) (regarded as a function of $\theta_j$) is an option response function (Bock, 1972; this volume). Indeed, a connection may be made between the Exponential Model and the particular equation suggested by Bock (Hutchinson, 1991, Sec. 9.5).

However, no practical trials of splitting λ into $b_i$ and $\theta_j$ have yet been performed.
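The relation between the NOTA response rates under the Exponential Model can be checked numerically. The sketch below assumes F(x) = exp(-lam * x) and G(x) = exp(-x) for x >= 0 as the form of the Exponential Model; that parameterization and the function names are illustrative assumptions, not something given in the text. It generates the NOTA− and NOTA+ response probabilities for a chosen ability and threshold, then compares the logarithm of the NOTA+ rate with the value predicted from the NOTA− rates alone.

```python
import math

def nota_probabilities(lam, T, m):
    """Return (n_minus, c_minus, w_minus, n_plus) for NOTA items.

    ASSUMES F(x) = exp(-lam * x) and G(x) = exp(-x), x >= 0, with m > 2
    options per item, the last of which is "none of the above".
    """
    rate = lam + (m - 2)                  # one correct + (m - 2) incorrect content options
    n_minus = math.exp(-rate * T)         # every content mismatch exceeds T
    c_minus = (lam / rate) * (1.0 - n_minus)
    w_minus = 1.0 - n_minus - c_minus
    n_plus = math.exp(-(m - 1) * T)       # (m - 1) incorrect content options
    return n_minus, c_minus, w_minus, n_plus

def predicted_log_n_plus(n_minus, c_minus, w_minus, m):
    """Right-hand side of the Exponential Model relation for log n_plus."""
    return (m - 1) / (m - 2) * w_minus / (c_minus + w_minus) * math.log(n_minus)

n_minus, c_minus, w_minus, n_plus = nota_probabilities(lam=2.5, T=0.6, m=5)
print(math.log(n_plus))
print(predicted_log_n_plus(n_minus, c_minus, w_minus, 5))
```

The two printed values should agree up to rounding, since the inputs were generated from the same assumed model; with empirical proportions the comparison becomes a test of the model.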

Parameter Estimation

In the empirical studies performed to date (see below), no attempt has been made to split λ into separate examinee and item components. Instead, some expedient like taking a group of items of similar difficulty, and then regarding λ as characteristic of the examinee only, has been adopted. That is, it has not been necessary to fit something like 2n + N parameters to a data matrix of size nN (n = number of items, and N = number of examinees); instead, the size of the data matrix has been something like 3N (counts of the numbers of items responded to in each of three ways, for each examinee), with the number of parameters being about N. Consequently, parameter estimation has not been the major technical problem that it is in usual circumstances. It has been satisfactory to take a straightforward likelihood maximization or chi-squared minimization approach, using readily available software (e.g., that of the Numerical Algorithms Group) to find the optimum. The NAG subroutines require that the user supply a subroutine for calculating the criterion statistic (for example, the log-likelihood) for a given value of the parameter being sought (that is, λ). So all that is needed are formulae for the theoretical probabilities of the different types of response, followed by calculation of the criterion statistic from these theoretical probabilities and the numbers observed in the empirical data.

By way of example, the data might be the number of 5-alternative items answered correctly at first attempt, at second attempt, and in three or more attempts, for each examinee. The relative proportions of these are predicted by Eqs. (4) and (5); naturally, it is necessary to choose the functional form of G(x), and this will have an ability parameter λ.

Occasionally, we may be lucky enough to get an explicit expression for our estimate of λ. Continuing the example of the previous paragraph, suppose we are assuming the All-or-Nothing Model, and using minimum chi-squared as the estimation method. According to the model, the expected number of items in the three categories will be $cn$, $\tfrac{1}{4}(1-c)n$, and $\tfrac{3}{4}(1-c)n$, for any particular examinee whose probability of being correct at first attempt is c (this is what is called $c_1$ in Eq. (4), and could readily be written in terms of λ). The statistic to be minimized is

$$\frac{(O_1 - cn)^2}{cn} + \frac{\left[O_2 - \tfrac{1}{4}(1-c)n\right]^2}{\tfrac{1}{4}(1-c)n} + \frac{\left[O_3 - \tfrac{3}{4}(1-c)n\right]^2}{\tfrac{3}{4}(1-c)n},$$

where the O's are the observed numbers of items in the three categories, for this examinee. Differentiating with respect to c and setting the result equal to 0, the estimated c is found to be the solution of

$$\frac{c^2}{(1-c)^2} = \frac{3}{4}\cdot\frac{O_1^2}{3\,O_2^2 + O_3^2}.$$

As might be expected on commonsense grounds, the resulting estimate is not very different from the simple estimate, $O_1/n$. For example, if $O_1/n$ is 0.5, $O_2/n$ is 0.2, and $O_3/n$ is 0.3, then c is estimated by this method to be 0.486.

In the future, methods will need to be developed for estimating examinee and item parameters when λ is split into these. It is not clear whether it would be easy to adapt the methods that have been developed for conventional IRT, or whether new difficulties would be encountered.
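The minimum chi-squared estimate for the All-or-Nothing Model with three response categories (expected counts cn, (1 - c)n/4, and 3(1 - c)n/4) can be reproduced in a few lines. The sketch below is a plain-Python illustration with hypothetical function names, not the NAG routine mentioned in the text: it codes the criterion statistic, the explicit solution obtained by setting the derivative to zero, and a crude grid search as a cross-check.

```python
import math

def chi_squared(c, O1, O2, O3):
    """Criterion statistic for one examinee under the All-or-Nothing Model."""
    n = O1 + O2 + O3
    expected = (c * n, 0.25 * (1.0 - c) * n, 0.75 * (1.0 - c) * n)
    return sum((o - e) ** 2 / e for o, e in zip((O1, O2, O3), expected))

def min_chi_squared_c(O1, O2, O3):
    """Explicit minimizer: solves c^2/(1-c)^2 = (3/4) * O1^2 / (3*O2^2 + O3^2).

    Requires O1 > 0 and O2, O3 not both zero.
    """
    ratio = math.sqrt(0.75 * O1 ** 2 / (3.0 * O2 ** 2 + O3 ** 2))
    return ratio / (1.0 + ratio)

def grid_minimizer(O1, O2, O3, steps=100000):
    """Brute-force search over c in (0, 1), used only as a cross-check."""
    best_c, best_value = None, float("inf")
    for i in range(1, steps):
        c = i / steps
        value = chi_squared(c, O1, O2, O3)
        if value < best_value:
            best_c, best_value = c, value
    return best_c

# The example from the text, scaled to n = 100 items.
c_hat = min_chi_squared_c(50, 20, 30)
print(round(c_hat, 3))
print(grid_minimizer(50, 20, 30))
```

For a model without an explicit solution, the same `chi_squared`-style function is what a general-purpose optimizer would be asked to minimize over the ability parameter.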

Goodness of Fit

As for parameter estimation, the issues with mismatch theory are rather different from those for conventional IRT; since λ is not split into $b_i$ and $\theta_j$, questions about whether this is valid or whether there is interaction between examinee and item have not yet been faced.

A central feature of mismatch theory is that it allows performance in one format of the test to be predicted from performance in another. And it is in evaluating predictions of this type that goodness of fit is assessed. The goodness of fit question has two sides to it. First, does the theory have any success? This can be answered by looking in the data for qualitative features predicted by the theory, and by correlating observed with predicted quantities. Second, is the theory completely successful, in the sense that the deviations of the data from the theory can reasonably be ascribed to chance? This can be answered by calculating some statistic like $\sum(\text{observed} - \text{expected})^2/\text{expected}$. (Indeed, this may be the criterion that is optimized in parameter estimation, in which case it already has been calculated.) These two sides to the question will be illustrated in the empirical examples below.
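Both sides of the goodness of fit question reduce to simple computations. The following sketch shows a correlation between predicted and observed quantities (the first side) and the summed (observed - expected)^2/expected statistic (the second side). The function names are hypothetical and the numbers are made up purely to show the calls; they are not data from any study.

```python
import math

def pearson_statistic(observed, expected):
    """Sum of (observed - expected)**2 / expected over response categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def correlation(xs, ys):
    """Product-moment correlation between predicted and observed quantities."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up numbers, for illustration only.
stat = pearson_statistic([12, 5, 3], [10.0, 6.0, 4.0])
r = correlation([0.2, 0.4, 0.6, 0.8], [0.25, 0.35, 0.65, 0.75])
print(stat, r)
```

Whether a given value of the statistic is "close to what it would be expected to be" still requires a reference distribution, which this sketch does not attempt to supply.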

Examples

Second-Choice Performance

If an examinee gets an item wrong and is then permitted a second attempt at it, is the probability of correctly answering it greater than chance? This is an effective contrast between the assumption of all-or-nothing information and theories that incorporate partial information. Suppose an examinee responds to a 56-item test, each item having five alternatives, and answers 28 correctly at first attempt, leaving 28 to be attempted a second time. Then the All-or-Nothing Model will predict that about 7 items ($28 \times \tfrac{1}{4}$) will be correctly answered on the second attempt. But the Exponential Model will predict that about 16 items, that is, $28 \times (5-1) \times \tfrac{1}{2} / (5 - 2 + \tfrac{1}{2})$, will be correctly answered at second attempt.

In terms of the two aspects of goodness of fit described above, theories incorporating partial information will have some success if $c_2$ is above $1/(m-1)$ for most examinees; a model will be completely successful if $\sum(\text{observed} - \text{expected})^2/\text{expected}$ is close to what it would be expected to be if all deviations of data from the model were chance ones.

These types of analyses were performed in Hutchinson (1991, Chap. 6) [originally, in Hutchinson (1986) and Hutchinson and Barton (1987)]. The data were from answer-until-correct tests of spatial reasoning and of mechanical reasoning, taken by secondary school pupils. In only a few cases were 4 or 5 attempts necessary in 5-alternative items, so the data for each examinee consisted of the numbers of items for which 1, 2, or 3+ attempts were necessary. These observed numbers can be denoted $O_1$, $O_2$, $O_3$. For each of several choices of the model (that is, for each of several choices of G(x)), expressions for expected numbers $E_1$, $E_2$, $E_3$ were written (with λ being included as an unknown parameter). The goodness of fit statistic that was minimized was $\sum_{i=1}^{3}(O_i - E_i)^2/E_i$. To arrive at an overall measure of the goodness of fit of a model, the best (i.e., minimum obtainable) values of this for each examinee were summed over examinees. Models having small totals are obviously preferable to models having larger totals.

TABLE 1. Comparison of the Goodness of Fit of Three Models, Using Three Datasets.

                          Model
Dataset    Exponential    Normal    All-or-Nothing
(1)            453          559          1074
(2)            442          428           803
(3)            370          360           420

Table 1 compares the goodness of fit of the Exponential, Normal, and All-or-Nothing Models using three datasets:


1. 11 five-alternative spatial reasoning items of medium difficulty;
2. 11 five-alternative mechanical reasoning items of medium difficulty;
3. 9 three-alternative mechanical reasoning items of medium difficulty.

As can be seen in the table, one important aspect of the results was that the All-or-Nothing Model was clearly the worst. Three further features of the data were:

• There was above-chance performance when making second and subsequent attempts at items initially answered wrongly; this is shown in Table 3.
• Positive correlation was found between an examinee's probability of success at first attempt and at second attempt; see Table 2. (And there was also a positive correlation between an examinee's probability of success within two attempts and at the third attempt if that was necessary.)
• Correlations between examinees' abilities estimated from easy, difficult, and medium items were rather higher when the abilities were estimated using the Exponential Model than when using the All-or-Nothing Model.

TABLE 2. Correlation Between Examinees' Probabilities of Success at First Attempt and Second Attempt.

Dataset    Correlation
(1)           0.46
(2)           0.33
(3)           0.05

TABLE 3. Conditional Probabilities of Answering Correctly at the qth Attempt, Given that the Previous q - 1 Attempts Were Wrong.

Dataset     q = 1    q = 2    q = 3    q = 4
(1)          0.39     0.41     0.39     0.57
(2)          0.49     0.39     0.42     0.58
(3)          0.65     0.66
Guessing     0.20     0.25     0.33     0.50

Note to Table 3. The last line gives the guessing probabilities for five-alternative items.

When All Options Are Wrong

As discussed above, an assumption about what G(x) is will imply some particular relation between (on the one hand) the proportions of genuine items answered wrongly and left unanswered, and (on the other hand) the proportion of nonsense items responded to. The analysis in Hutchinson (1991, Chap. 7) (originally, in Frary and Hutchinson, 1982) was of a test of chemistry that included four nonsense items having no correct answer. There being so few nonsense items per examinee, the approach taken was as follows. Examinees were grouped into ranges according to their value of Eq. (6); this is their predicted proportion of nonsense items responded to. Then the actual mean proportion responded to by the examinees in each group was found, and the predicted and actual proportions compared. The process was repeated, but with examinees now being grouped according to their value of Eq. (7).

Both Eqs. (6) and (7) enjoyed some degree of success, in the sense that the actual proportions tended to be high in the groups for which the predicted proportions were high, and low in the groups for which the predicted proportions were low. Both Eqs. (6) and (7) were less than perfect, in that they tended to overestimate the proportion of nonsense items that were attempted. See Table 4 for quantification of this. Equation (7) was a little more successful than Eq. (6).

Despite the few opportunities each examinee had for responding to nonsense items, an analysis of individual examinees' data was also undertaken: the observed values of a (which could only be 0, 1/4, 1/2, 3/4, or 1) were correlated with the values predicted by Eqs. (6) and (7). The correlations were found to be 0.46 and 0.52, respectively.

TABLE 4.

[Grouping by Eq. (6); predicted ranges missing]
Actual Mean Response Probability    0.84  0.67  0.56  0.49  0.48  0.29

Proportion Predicted by Eq. (7)    Actual Mean Response Probability
0.90-1.00                                      0.88
0.80-0.90                                      0.68
0.70-0.80                                      0.66
0.60-0.70                                      0.61
0.50-0.60                                      0.50
0.40-0.50                                      0.42
0.00-0.40                                      0.30

Note to Table 4. The last line gives the column means.
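The grouping procedure used for the nonsense items (sort examinees into ranges of the predicted proportion, then average the actual proportion within each range) can be sketched as below. The function name, the bin edges, and the five example examinees are hypothetical and serve only to show the mechanics; they are not the data behind Table 4.

```python
from collections import defaultdict

def group_means(predicted, actual, edges):
    """Group examinees by predicted proportion and average the actual one.

    `edges` is an increasing list of bin boundaries; bins are [lo, hi),
    except that the top bin also includes its upper boundary.
    Returns {(lo, hi): mean actual proportion} for the non-empty bins.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for p, a in zip(predicted, actual):
        for lo, hi in zip(edges[:-1], edges[1:]):
            in_top_bin = hi == edges[-1] and p == hi
            if lo <= p < hi or in_top_bin:
                sums[(lo, hi)][0] += a
                sums[(lo, hi)][1] += 1
                break
    return {key: total / count for key, (total, count) in sorted(sums.items())}

# Hypothetical examinees: predicted attempt rate and observed proportion
# of four nonsense items attempted.
edges = [0.0, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
predicted = [0.95, 0.92, 0.45, 0.30, 1.0]
actual = [1.0, 0.75, 0.5, 0.25, 0.75]
result = group_means(predicted, actual, edges)
for (lo, hi), mean in result.items():
    print(f"{lo:.2f}-{hi:.2f}: {mean:.2f}")
```

Comparing each bin's mean actual proportion with the bin's own range is what reveals systematic over- or under-prediction.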





Confidence Rating

If examinees are permitted to express their degree of confidence in the response they have selected, it is often found that the higher the level of confidence, the greater the probability of being correct. See Hutchinson (1991, Sec. 5.4) for references. Similarly, if examinees are persuaded to attempt items they first left unanswered, their performance is typically below their performance on the items they first did attempt, but is above the chance level. See Hutchinson (1991, Sec. 5.5) for references.

Multiple True-False Items

Sometimes an item consists of an introductory statement, together with M further statements, each of which the examinee has to classify as true or false. It ought to be possible to compare performance on a conventional item with that on the equivalent multiple true-false item (which would have 1 true statement and M - 1 false ones). Hutchinson (1991, Sec. 5.8) shows that it is indeed possible. (On one set of data published by Kolstad and Kolstad (1989), the Exponential Model was superior to a two-state model. But the comparison is of limited usefulness because the data were aggregated over all examinees and all items.)

Discussion

First, the empirical results that demonstrate partial information are not surprising. Unless one is committed to some strong form of all-or-nothing learning theory, they are qualitatively pretty much as expected. What is surprising is that no way of describing this mathematically has been developed previously: that is, why $c - w/(m-1)$ is well known as "the" correction for guessing, but not $c/[w/(m-1)]$, and why Eq. (6) was proposed in 1957, but not Eq. (7).

Second, the mismatch model does not tackle the question of why items are answered wrongly: the item characteristics that might explain the $b_i$, and the characteristics of cognitive development, learning experiences, and personality that might explain the $\theta_j$, are not its concern.

Third, notice that there are ideas from cognitive psychology that provide detailed mechanistic accounts of problem-solving; for a review of such models, see Snow and Lohman (1989). Mismatch theory is much more of a broad brush approach than these, while being more explicit about psychological processes than standard IRT is. Since both less abstract and more abstract theories flourish, the mismatch idea falls within the range of what might be useful.

Fourth, as to future research needs, the most urgent line of future development is that of performing a greater number and variety of comparisons of the theoretical predictions with empirical results. Among the formats that ought to be studied are: second attempt permitted (e.g., answer-until-correct), nonsense items, NOTA items, confidence ratings, and multiple true-false items. These are not highly obscure types of test that are impracticable to administer; indeed, there is probably suitable data already in existence that could be confronted with theoretical predictions.


