# 9 Validity Studies

The preceding chapters and the Dynamic Learning Maps® (DLM®) Alternate Assessment System 2014–2015 Technical Manual—Integrated Model provide evidence in support of the overall validity argument for results produced by the DLM assessment. This chapter presents additional evidence collected during 2019–2020 for four of the five critical sources of evidence described in the Standards for Educational and Psychological Testing: evidence based on test content, response processes, internal structure, and relations to external variables. Additional evidence can be found in Chapter 9 of the 2014–2015 Technical Manual—Integrated Model and the subsequent annual technical manual updates.

## 9.1 Evidence Based on Test Content

Evidence based on test content is “obtained from an analysis of the relationship between the content of the test and the construct it is intended to measure.”

This section presents results from data collected during 2019–2020 regarding blueprint coverage. For additional evidence based on test content, including the alignment of test content to content standards via the DLM maps (which underlie the assessment system), see Chapter 9 of the 2014–2015 Technical Manual—Integrated Model.

### 9.1.1 Evaluation of Blueprint Coverage

While the external alignment study summarized in Chapter 9 of the 2014–2015 Technical Manual—Integrated Model provided evidence of the alignment of available testlets, the study did not address the alignment of assessment content administered to individual students. The instructionally embedded model blueprints are unique in that they specify a pool of Essential Elements (EEs) available for assessment; teachers are responsible for choosing EEs from the pool that meet a pre-specified set of criteria (e.g., “Choose three EEs from within Claim 1”). For additional information about selection procedures, see Chapter 4 in the 2014–2015 Technical Manual—Integrated Model. Teachers are responsible for making sure blueprint coverage is attained during both the fall and spring embedded windows; they can also test beyond what is required by the blueprint to support instruction if they choose. Responses to fall and spring assessments are combined to calculate results used for summative purposes.

In 2019–2020, the fall window was available from September 2019 through December 2019, while the spring window was available from February 2020 through May 2020. However, due to the COVID-19 pandemic, the instructionally embedded spring window was significantly impacted, and student responses during the spring window were limited. Because the limited sample from the spring window may not be representative of the full DLM student population, blueprint coverage results are summarized for assessments taken during the instructionally embedded fall window only. Using the same procedure as in prior years, teachers selected the EEs on which their students would be tested from among those available on the English language arts (ELA) and mathematics blueprints.

Table 9.1 summarizes the expected number of EEs required to meet blueprint coverage and the total number of EEs available for instructionally embedded assessment for each grade and subject. A total of 255 EEs (148 in ELA and 107 in mathematics) for grades 3 through high school were available; 7,890 students in those grades participated in the instructionally embedded fall window. Histograms in Appendix A summarize the distribution of total unique EEs assessed per student in each grade and subject.

Table 9.1: EEs Expected for Blueprint Coverage and Total Available, by Grade and Subject

| Grade | ELA expected (n) | ELA available (N) | Mathematics expected (n) | Mathematics available (N) |
|:------|---:|---:|---:|---:|
| 3     |  8 | 17 |  6 | 11 |
| 4     |  9 | 17 |  8 | 16 |
| 5     |  8 | 19 |  7 | 15 |
| 6     |  9 | 19 |  6 | 11 |
| 7     | 11 | 18 |  7 | 14 |
| 8     | 11 | 20 |  7 | 14 |
| 9–10  | 10 | 19 |  6 | 26 |
| 11–12 | 10 | 19 |  — |  — |

Note. ELA = English language arts. High school mathematics is reported in the 9–10 row; 26 EEs were available for the 9–11 band. While EEs were assigned to specific grades in the mathematics blueprint (eight EEs in grade 9, nine EEs in grade 10, and nine EEs in grade 11), a teacher could choose to test on any of the high school EEs, as all were available in the system.

Table 9.2 summarizes the number and percentage of students in three categories: students who did not meet all blueprint requirements, students who met all blueprint requirements exactly, and students who exceeded the blueprint requirements during the instructionally embedded fall window. In total, 84% of students in ELA and 74% of students in mathematics met or exceeded blueprint coverage requirements.

Table 9.2: Number and Percentage of Students in Each Blueprint Coverage Category During the Fall Window, by Subject

| Coverage category | ELA (n) | ELA (%) | Mathematics (n) | Mathematics (%) |
|:------------------|---:|---:|---:|---:|
| Not met           | 1,256 | 15.9 | 1,977 | 26.0 |
| Met               | 5,350 | 67.9 | 4,147 | 54.5 |
| Exceeded          | 1,269 | 16.1 | 1,485 | 19.5 |
| Met or exceeded   | 6,619 | 84.0 | 5,632 | 74.0 |

Note. ELA = English language arts.
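The three coverage categories amount to a simple check of each student's tested EEs against the blueprint requirements. The sketch below is illustrative only, not the operational DLM logic; the dictionary representation of per-claim requirements is an assumption based on the selection criteria described above (e.g., “Choose three EEs from within Claim 1”).

```python
# Illustrative sketch of blueprint-coverage classification; data structures
# are assumptions, not the operational DLM code.

def coverage_category(assessed, required):
    """Classify coverage given counts of unique EEs tested per blueprint
    criterion (e.g., per claim) and the counts the blueprint requires."""
    # Any unmet criterion means the blueprint was not covered.
    if any(assessed.get(c, 0) < n for c, n in required.items()):
        return "Not met"
    # All criteria met; testing beyond the requirements is "Exceeded".
    if sum(assessed.values()) > sum(required.values()):
        return "Exceeded"
    return "Met"

required = {"Claim 1": 3, "Claim 2": 3}  # hypothetical requirements
print(coverage_category({"Claim 1": 3, "Claim 2": 2}, required))  # Not met
print(coverage_category({"Claim 1": 3, "Claim 2": 3}, required))  # Met
print(coverage_category({"Claim 1": 5, "Claim 2": 3}, required))  # Exceeded
```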

Before taking any DLM assessments, educators complete the First Contact survey, a survey of learner characteristics, for each student. Responses from the ELA, mathematics, and expressive communication portions of the survey were included in an algorithm that calculated the student’s complexity band for each subject; for more information, see Chapter 4 of the 2014–2015 Technical Manual—Integrated Model. The complexity band was used to recommend the appropriate, corresponding linkage level during instructionally embedded assessment. Table 9.3 summarizes the percentage of students in each blueprint coverage category based on their complexity band for each subject. For the Foundational complexity band, the rate of students not meeting coverage appears higher in both ELA and mathematics. For complexity band 1, blueprint coverage was distributed more evenly across categories, while more students in complexity bands 2 and 3 exceeded coverage.

Table 9.3: Percentage of Students in Each Blueprint Coverage Category by Complexity Band and Subject During the Fall Window

| Complexity band | ELA: Not met | ELA: Met | ELA: Exceeded | Mathematics: Not met | Mathematics: Met | Mathematics: Exceeded |
|:----------------|---:|---:|---:|---:|---:|---:|
| Foundational    | 17.3 | 22.5 | 15.5 | 18.9 | 20.4 | 16.8 |
| Band 1          | 38.5 | 42.1 | 40.2 | 41.3 | 45.0 | 43.2 |
| Band 2          | 33.9 | 29.0 | 34.7 | 35.3 | 31.0 | 36.2 |
| Band 3          | 10.4 |  6.4 |  9.6 |  4.5 |  3.6 |  3.7 |

Note. Values are percentages of students within each coverage category; columns sum to 100 within each subject. ELA = English language arts.

## 9.2 Evidence Based on Response Processes

The study of test takers’ response processes provides evidence about the fit between the test construct and the nature of how students actually experience test content. Due to the COVID-19 pandemic, teacher survey responses and test administration observations were significantly reduced from prior years. The data collected from the limited samples of survey responses and test administration observations are not included in this chapter, as they may not accurately represent the full DLM teacher population. Information on the number of test administration observations collected, as well as the number of writing samples collected for interrater agreement, is presented in this section. For additional evidence based on response processes, including studies on student and teacher behaviors during testlet administration and evidence of fidelity of administration, see Chapter 9 of the 2014–2015 Technical Manual—Integrated Model.

### 9.2.1 Test Administration Observations

Prior to the onset of the COVID-19 pandemic, test administration observations were conducted in multiple states during 2019–2020 to further understand student response processes. Students’ typical test administration process with their actual test administrator was observed. Test administration observations were collected by state and local education agency staff.

In 2019–2020, there were 223 observations collected in six states.

### 9.2.2 Interrater Agreement of Writing Sample Scoring

All students are assessed on writing EEs as part of the ELA blueprint. Teachers administer writing testlets at two levels: emergent and conventional. Emergent testlets measure nodes at the Initial Precursor and Distal Precursor levels, while conventional testlets measure nodes at the Proximal Precursor, Target, and Successor levels. All writing testlets include items that require teachers to evaluate students’ writing processes; some testlets also include items that require teachers to evaluate students’ writing samples. Evaluation of students’ writing samples does not use a high-inference process common in large-scale assessment, such as applying analytic or holistic rubrics. Instead, writing samples are evaluated for text features that are easily perceptible to a fluent reader and require little or no inference on the part of the rater (e.g., correct syntax, orthography). The test administrator is presented with an onscreen selected-response item and is instructed to choose the option(s) that best match the student’s writing sample. Only test administrators rate writing samples, and their item responses are used to determine students’ mastery of linkage levels for writing and some language EEs on the ELA blueprint. Student writing samples are collected annually to evaluate how reliably teachers rate them. However, due to the COVID-19 pandemic, interrater reliability ratings for writing samples collected during the 2019–2020 administration were postponed until 2021. For a complete description of writing testlet design and scoring, including example items, see Chapter 3 of the 2015–2016 Technical Manual Update—Integrated Model.

During the spring 2020 administration, teachers were asked to submit student writing samples within Educator Portal. Requested submissions included papers that students used during testlet administration, copies of student writing samples, or printed photographs of student writing samples. Each sample was submitted with limited identifying information so that it could be matched with test administrator response data from the spring 2020 administration.

A total of 379 student writing samples were submitted from districts in eight states. In several grades, the emergent writing testlet does not include any tasks that evaluate the writing sample; therefore, emergent samples submitted for these grades are not eligible to be included in the interrater reliability analysis (e.g., grade 3 emergent writing samples). Additionally, writing samples that could not be matched with student data were excluded (e.g., student name or identifier was not provided). These exclusion criteria resulted in the availability of 226 writing samples for evaluation of interrater agreement, which will be conducted in 2021.

## 9.3 Evidence Based on Internal Structure

Analyses of an assessment’s internal structure indicate the degree to which “relationships among test items and test components conform to the construct on which the proposed test score interpretations are based.” Given the heterogeneous nature of the DLM student population, statistical analyses can examine whether particular items function differently for specific subgroups (e.g., male versus female). Additional evidence based on internal structure is provided across the linkage levels that form the basis of reporting.

### 9.3.1 Evaluation of Item-Level Bias

Differential item functioning (DIF) addresses the challenge created when some test items are “asked in such a way that certain groups of examinees who are knowledgeable about the intended concepts are prevented from showing what they know.” DIF analyses can uncover internal inconsistency if particular items function differently in a systematic way for identifiable subgroups of students. While identification of DIF does not always indicate weakness in a test item, it can point to construct-irrelevant variance, posing considerations for validity and fairness.

#### 9.3.1.1 Method

DIF analyses for 2020 followed the same procedure used in previous years and examined ethnicity in addition to gender. Analyses included data from 2015–2016 through 2018–2019 to flag items for evidence of DIF. (DIF analyses are conducted on the sample of data used to update the model calibration, which uses data through the previous operational assessment; see Chapter 5 of this manual for more information.) Items were selected for inclusion in the DIF analyses based on minimum sample-size requirements for the two gender subgroups (male and female) and five ethnicity subgroups (White, African American, American Indian, Asian, and two or more races).

The DLM student population is unbalanced in both gender and ethnicity. The number of female students responding to items is smaller than the number of male students by a ratio of approximately 1:2. Similarly, the number of non-white students responding to items is smaller than the number of white students by a ratio of approximately 1:2. Therefore, a threshold for item inclusion was retained from previous years whereby the focal group must have at least 100 students responding to the item. The threshold of 100 was selected to balance the need for a sufficient sample size in the focal group with the relatively low number of students responding to many DLM items. Writing items were excluded from the DIF analyses described here because they include non-independent response options. See Chapter 3 of the 2016–2017 Technical Manual Update—Integrated Model for more information on the process of scoring writing items.

Consistent with previous years, additional criteria were included to prevent estimation errors. Items with an overall proportion correct (p-value) greater than .95 or less than .05 were removed from the analyses. Items for which the p-value for one gender or ethnicity group was greater than .97 or less than .03 were also removed from the analyses.
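Taken together, the inclusion rules amount to three screens applied to each item and focal group comparison. The following hedged sketch restates them in code; the argument names and data shapes are assumptions for illustration, not the operational pipeline.

```python
# Hedged sketch of the DIF inclusion screen described above.

def eligible_for_dif(focal_n, p_overall, p_subgroups):
    """Apply the three inclusion criteria for one item and focal group.

    focal_n     -- number of focal-group students responding to the item
    p_overall   -- overall proportion correct (p-value) for the item
    p_subgroups -- p-values for each group in the comparison
    """
    if focal_n < 100:                  # minimum focal-group sample size
        return False
    if not 0.05 <= p_overall <= 0.95:  # overall p-value bounds
        return False
    # Each group's p-value must fall within [.03, .97].
    return all(0.03 <= p <= 0.97 for p in p_subgroups)

print(eligible_for_dif(236, 0.60, [0.55, 0.65]))  # True
print(eligible_for_dif(99, 0.60, [0.55, 0.65]))   # False: focal n < 100
print(eligible_for_dif(236, 0.96, [0.55, 0.65]))  # False: item too easy
```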

Using the above criteria for inclusion, 4,764 (43%) items were selected for gender, and 2,351 (21%) items were selected for at least one ethnicity group comparison. The number of items evaluated by grade and subject for gender ranged from 206 in grade 11–12 ELA to 357 in grade 5 mathematics. The number of items evaluated by grade and subject for ethnicity ranged from 88 in grade 5 ELA to 211 in grade 5 mathematics. Because students taking DLM assessments can be categorized into seven ethnicity groups (see Chapter 7 of this manual for a summary of participation by ethnicity and other demographic variables), there are up to six comparisons that can be made for each item, with the White ethnic group as the reference group and each of the other six ethnic groups (i.e., African American, Asian, American Indian, Native Hawaiian or Pacific Islander, Alaska Native, two or more races) as the focal group. Across all items, this results in 65,730 possible comparisons. Using the inclusion criteria specified above, 3,254 (5%) item and focal group comparisons were selected for analysis. Overall, 1,680 items were evaluated for one ethnic group, 496 items were evaluated for two ethnic groups, 118 items were evaluated for three ethnic groups, and 57 items were evaluated for four ethnic groups. Table 9.4 shows the number of items that were evaluated for each ethnic focal group. Across all gender and ethnicity comparisons, sample sizes for each comparison ranged from 236 to 5,227 for gender and from 416 to 4,221 for ethnicity.

Table 9.4: Number of Items Evaluated for Each Ethnicity

| Focal group       | Items (n) |
|:------------------|---:|
| Asian             |   171 |
| African American  | 2,351 |
| American Indian   |    61 |
| Two or more races |   671 |

Of the 6,360 items that were not included in the DIF analysis for gender, 5,943 (93%) had a focal group sample size of less than 100, 116 (2%) had an item p-value greater than .95, and 301 (5%) had a subgroup p-value greater than .97. A total of 8,604 items were not included in the DIF analysis for ethnicity for any of the subgroups. Of the 62,476 item and focal group comparisons that were not included in the DIF analysis for ethnicity, 62,201 (> 99%) had a focal group sample size of less than 100, 43 (< 1%) had an item p-value greater than .95, and 232 (< 1%) had a subgroup p-value greater than .97. Table 9.5 and Table 9.6 show the number and percentage of comparisons that did not meet each inclusion criterion for gender and ethnicity, respectively, by subject and the linkage level the items assess. The majority of non-included comparisons came from ELA for both gender (n = 4,019; 63%) and ethnicity (n = 35,305; 57%).

Table 9.5: Comparisons Not Included in DIF Analysis for Gender, by Subject and Linkage Level

| Subject and linkage level | Sample size (n) | Sample size (%) | Item proportion correct (n) | Item proportion correct (%) | Subgroup proportion correct (n) | Subgroup proportion correct (%) |
|:--------------------------|---:|---:|---:|---:|---:|---:|
| **English language arts** |     |      |    |      |    |      |
| Initial Precursor         | 596 | 16.1 |  0 |  0.0 |  0 |  0.0 |
| Distal Precursor          | 796 | 21.4 |  0 |  0.0 | 17 |  7.3 |
| Proximal Precursor        | 728 | 19.6 |  7 |  9.3 | 92 | 39.7 |
| Target                    | 729 | 19.6 | 31 | 41.3 | 90 | 38.8 |
| Successor                 | 863 | 23.2 | 37 | 49.3 | 33 | 14.2 |
| **Mathematics**           |     |      |    |      |    |      |
| Initial Precursor         | 247 | 11.1 |  0 |  0.0 |  0 |  0.0 |
| Distal Precursor          | 249 | 11.2 |  0 |  0.0 | 14 | 20.3 |
| Proximal Precursor        | 280 | 12.6 | 19 | 46.3 | 13 | 18.8 |
| Target                    | 585 | 26.2 |  8 | 19.5 | 33 | 47.8 |
| Successor                 | 870 | 39.0 | 14 | 34.1 |  9 | 13.0 |

Table 9.6: Comparisons Not Included in DIF Analysis for Ethnicity, by Subject and Linkage Level

| Subject and linkage level | Sample size (n) | Sample size (%) | Item proportion correct (n) | Item proportion correct (%) | Subgroup proportion correct (n) | Subgroup proportion correct (%) |
|:--------------------------|---:|---:|---:|---:|---:|---:|
| **English language arts** |       |      |    |      |    |      |
| Initial Precursor         | 6,981 | 19.9 |  0 |  0.0 |  3 |  2.1 |
| Distal Precursor          | 8,090 | 23.0 |  0 |  0.0 | 10 |  7.0 |
| Proximal Precursor        | 8,100 | 23.0 |  0 |  0.0 | 69 | 48.3 |
| Target                    | 6,433 | 18.3 | 10 | 62.5 | 29 | 20.3 |
| Successor                 | 5,542 | 15.8 |  6 | 37.5 | 32 | 22.4 |
| **Mathematics**           |       |      |    |      |    |      |
| Initial Precursor         | 5,458 | 20.2 |  0 |  0.0 |  5 |  5.6 |
| Distal Precursor          | 4,699 | 17.4 |  0 |  0.0 | 10 | 11.2 |
| Proximal Precursor        | 5,234 | 19.3 | 12 | 44.4 | 37 | 41.6 |
| Target                    | 5,778 | 21.4 |  7 | 25.9 | 22 | 24.7 |
| Successor                 | 5,886 | 21.8 |  8 | 29.6 | 15 | 16.9 |

For each item, logistic regression was used to predict the probability of a correct response, given group membership and performance in the subject. Specifically, the logistic regression equation for each item included a matching variable, comprised of the student’s total linkage levels mastered in the subject of the item, and a group membership variable, with the reference group (i.e., males for gender, White for ethnicity) coded as 0 and the focal group (i.e., females for gender; African American, Asian, American Indian, Native Hawaiian or Pacific Islander, Alaska Native, or two or more races for ethnicity) coded as 1. An interaction term was included to evaluate whether non-uniform DIF was present for each item; the presence of non-uniform DIF indicates that the item functions differently because of the interaction between total linkage levels mastered and the student’s group (i.e., gender or ethnic group). When non-uniform DIF is present, the group with the higher probability of a correct response to the item differs along the range of total linkage levels mastered; thus, one group is favored at the low end of the spectrum and the other group is favored at the high end.

Three logistic regression models were fitted for each item:

\begin{align} \text{M}_0\text{: } \text{logit}(\pi_i) &= \beta_0 + \beta_1\text{X} \tag{9.1} \\ \text{M}_1\text{: } \text{logit}(\pi_i) &= \beta_0 + \beta_1\text{X} + \beta_2G \tag{9.2} \\ \text{M}_2\text{: } \text{logit}(\pi_i) &= \beta_0 + \beta_1\text{X} + \beta_2G + \beta_3\text{X}G \tag{9.3} \end{align}

where $$\pi_i$$ is the probability of a correct response to the item for group $$i$$, $$\text{X}$$ is the matching criterion, $$G$$ is a dummy coded grouping variable (0 = reference group, 1 = focal group), $$\beta_0$$ is the intercept, $$\beta_1$$ is the slope, $$\beta_2$$ is the group-specific parameter, and $$\beta_3$$ is the interaction term.

Because of the number of items evaluated for DIF, Type I error rates were susceptible to inflation. The incorporation of an effect-size measure can be used to distinguish practical significance from statistical significance by providing a metric of the magnitude of the effect of adding group and interaction terms to the regression model.

For each item, the change in the Nagelkerke pseudo $$R^2$$ measure of effect size was captured, from $$M_0$$ to $$M_1$$ or $$M_2$$, to account for the effect of the addition of the group and interaction terms to the equation. All effect-size values were reported using both the Zumbo and Thomas (1997) and Jodoin and Gierl (2001) indices for reflecting a negligible, moderate, or large effect. The Zumbo and Thomas thresholds for classifying DIF effect size are based on Cohen’s (1992) guidelines for identifying a small, medium, or large effect. The thresholds for each level are .13 and .26; values less than .13 have a negligible effect, values between .13 and .26 have a moderate effect, and values of .26 or greater have a large effect. The Jodoin and Gierl thresholds are more stringent, with lower threshold values of .035 and .07 to distinguish between negligible, moderate, and large effects.
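To make the procedure concrete, the following self-contained sketch fits M0 and M1 on simulated data and classifies the change in Nagelkerke pseudo $$R^2$$ using the Jodoin and Gierl thresholds. The data, sample size, and coefficients are invented for illustration; this is not the operational DLM analysis code.

```python
# Hedged sketch of the uniform-DIF screen on simulated data.
import numpy as np

def fit_logit(X, y, n_iter=50):
    """Fit logistic regression by Newton-Raphson; return (beta, log-likelihood)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        hess = X.T @ ((p * (1 - p))[:, None] * X) + 1e-9 * np.eye(X.shape[1])
        beta += np.linalg.solve(hess, X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return beta, float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

def nagelkerke(ll_model, ll_null, n):
    """Nagelkerke pseudo R^2 relative to the intercept-only model."""
    r2_cs = 1.0 - np.exp(2.0 * (ll_null - ll_model) / n)
    return r2_cs / (1.0 - np.exp(2.0 * ll_null / n))

rng = np.random.default_rng(9)
n = 600
x = rng.integers(0, 30, n).astype(float)    # total linkage levels mastered
g = (rng.random(n) < 0.5).astype(float)     # 0 = reference, 1 = focal
true_logit = -2.0 + 0.15 * x + 0.8 * g      # build uniform DIF into the data
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)

ones = np.ones(n)
_, ll_int = fit_logit(ones[:, None], y)                 # intercept only
_, ll_m0 = fit_logit(np.column_stack([ones, x]), y)     # M0: matching variable
_, ll_m1 = fit_logit(np.column_stack([ones, x, g]), y)  # M1: + group term

# Effect size: change in Nagelkerke pseudo R^2 from M0 to M1, classified with
# the Jodoin and Gierl (2001) thresholds (.035, .07). M2 (non-uniform DIF) is
# handled analogously by also adding an x*g interaction column.
delta_r2 = nagelkerke(ll_m1, ll_int, n) - nagelkerke(ll_m0, ll_int, n)
effect = ("negligible" if delta_r2 < .035
          else "moderate" if delta_r2 < .07 else "large")
print(f"LR chi-square = {2 * (ll_m1 - ll_m0):.2f}, "
      f"delta R^2 = {delta_r2:.3f} ({effect})")
```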

#### 9.3.1.2 Results

##### 9.3.1.2.1 Uniform DIF Model

A total of 388 items were flagged for evidence of uniform DIF for gender when comparing $$\text{M}_0$$ to $$\text{M}_1$$. Additionally, 269 item and focal group combinations across 249 items were flagged for evidence of uniform DIF for ethnicity. Table 9.7 and Table 9.8 summarize the total number of combinations flagged for evidence of uniform DIF by subject and grade for gender and ethnicity, respectively. The percentage of combinations flagged for uniform DIF ranged from 4% to 12% for gender and from 5% to 11% for ethnicity.

Table 9.7: Combinations Flagged for Evidence of Uniform DIF for Gender

| Grade | Items flagged (n) | Total items (N) | Items flagged (%) | Items with moderate or large effect size (n) |
|:------|---:|---:|---:|---:|
| **English language arts** | | | | |
| 3     | 21 | 287 |  7.3 | 0 |
| 4     | 24 | 273 |  8.8 | 0 |
| 5     | 29 | 295 |  9.8 | 1 |
| 6     | 29 | 313 |  9.3 | 0 |
| 7     | 13 | 310 |  4.2 | 0 |
| 8     | 24 | 287 |  8.4 | 2 |
| 9–10  | 19 | 208 |  9.1 | 0 |
| 11–12 | 19 | 206 |  9.2 | 0 |
| **Mathematics** | | | | |
| 3     | 23 | 235 |  9.8 | 0 |
| 4     | 26 | 319 |  8.2 | 0 |
| 5     | 19 | 357 |  5.3 | 0 |
| 6     | 19 | 287 |  6.6 | 0 |
| 7     | 23 | 311 |  7.4 | 0 |
| 8     | 26 | 344 |  7.6 | 1 |
| 9     | 19 | 209 |  9.1 | 0 |
| 10    | 21 | 251 |  8.4 | 1 |
| 11    | 34 | 272 | 12.5 | 0 |

Table 9.8: Combinations Flagged for Evidence of Uniform DIF for Ethnicity

| Grade | Items flagged (n) | Total items (N) | Items flagged (%) | Items with moderate or large effect size (n) |
|:------|---:|---:|---:|---:|
| **English language arts** | | | | |
| 3     | 15 | 137 | 10.9 | 0 |
| 4     | 15 | 157 |  9.6 | 0 |
| 5     | 11 | 123 |  8.9 | 0 |
| 6     | 14 | 151 |  9.3 | 0 |
| 7     | 12 | 153 |  7.8 | 0 |
| 8     | 10 | 141 |  7.1 | 0 |
| 9–10  | 12 | 135 |  8.9 | 0 |
| 11–12 | 14 | 136 | 10.3 | 0 |
| **Mathematics** | | | | |
| 3     | 20 | 243 |  8.2 | 0 |
| 4     | 23 | 262 |  8.8 | 0 |
| 5     | 27 | 332 |  8.1 | 0 |
| 6     | 19 | 276 |  6.9 | 0 |
| 7     | 19 | 225 |  8.4 | 0 |
| 8     | 18 | 240 |  7.5 | 0 |
| 9     | 14 | 180 |  7.8 | 0 |
| 10    | 15 | 153 |  9.8 | 0 |
| 11    | 11 | 210 |  5.2 | 0 |

For gender, using the Zumbo and Thomas (1997) effect-size classification criteria, all combinations were found to have a negligible effect-size change after the gender term was added to the regression equation. When using the Jodoin and Gierl (2001) effect-size classification criteria, all but five combinations were found to have a negligible effect-size change after the gender term was added to the regression equation.

The results of the DIF analyses for ethnicity were similar to those for gender. When using the Zumbo and Thomas (1997) effect-size classification criteria, all combinations were found to have a negligible effect-size change after the ethnicity term was added to the regression equation. Similarly, when using the Jodoin and Gierl (2001) effect-size classification criteria, all combinations were found to have a negligible effect-size change after the ethnicity term was added to the regression equation.

Table 9.9 provides information about the flagged items with a non-negligible effect-size change after the addition of the group term, as represented by a value of B (moderate) or C (large). The $$\beta_2G$$ values in Table 9.9 indicate which group was favored on the item after accounting for total linkage levels mastered, with positive values indicating that the focal group had a higher probability of success on the item and negative values indicating that the focal group had a lower probability of success on the item. The focal group was favored on two combinations.

Table 9.9: Combinations Flagged for Uniform DIF With Moderate or Large Effect Size

| Item ID | Focal group | Grade | EE | $$\chi^2$$ | $$p$$-value | $$\beta_2G$$ | $$R^2$$ | Z&T* | J&G* | Window |
|:--------|:-----|:------|:---|---:|---:|---:|---:|:--|:--|:-------|
| **English language arts** | | | | | | | | | | |
| 54104 | Female | 5  | ELA.EE.RL.5.2  | 15.16 | < .001 | −0.79 | .041 | A | B | Spring |
| 56991 | Female | 8  | ELA.EE.RI.8.5  |  8.63 | .003   | −0.99 | .040 | A | B | Spring |
| 32251 | Female | 8  | ELA.EE.L.8.5.a |  8.93 | .003   | −0.78 | .038 | A | B | Spring |
| **Mathematics** | | | | | | | | | | |
| 11704 | Female | 8  | M.EE.8.G.4     |  6.93 | .008   |  0.78 | .036 | A | B | Spring |
| 41485 | Female | 10 | M.EE.HS.S.ID.4 | 12.44 | < .001 |  0.93 | .051 | A | B | Fall   |

Note. EE = Essential Element; Z&T = Zumbo & Thomas; J&G = Jodoin & Gierl.
* Effect-size measure.

##### 9.3.1.2.2 Combined Model

A total of 547 items were flagged for evidence of DIF when both the gender and interaction terms were included in the regression equation, as shown in equation (9.3). Additionally, 379 item and focal group combinations across 346 items were flagged for evidence of DIF when both the ethnicity and interaction terms were included in the regression equation. Table 9.10 and Table 9.11 summarize the number of combinations flagged by subject and grade. The percentage of combinations flagged ranged from 8% to 16% for gender and 5% to 20% for ethnicity.

Table 9.10: Items Flagged for Evidence of DIF for the Combined Model for Gender

| Grade | Items flagged (n) | Total items (N) | Items flagged (%) | Items with moderate or large effect size (n) |
|:------|---:|---:|---:|---:|
| **English language arts** | | | | |
| 3     | 24 | 287 |  8.4 | 0 |
| 4     | 30 | 273 | 11.0 | 0 |
| 5     | 35 | 295 | 11.9 | 2 |
| 6     | 30 | 313 |  9.6 | 0 |
| 7     | 26 | 310 |  8.4 | 0 |
| 8     | 29 | 287 | 10.1 | 4 |
| 9–10  | 25 | 208 | 12.0 | 0 |
| 11–12 | 28 | 206 | 13.6 | 1 |
| **Mathematics** | | | | |
| 3     | 27 | 235 | 11.5 | 0 |
| 4     | 51 | 319 | 16.0 | 2 |
| 5     | 39 | 357 | 10.9 | 1 |
| 6     | 26 | 287 |  9.1 | 2 |
| 7     | 45 | 311 | 14.5 | 0 |
| 8     | 40 | 344 | 11.6 | 2 |
| 9     | 23 | 209 | 11.0 | 2 |
| 10    | 30 | 251 | 12.0 | 2 |
| 11    | 39 | 272 | 14.3 | 0 |

Table 9.11: Items Flagged for Evidence of DIF for the Combined Model for Ethnicity

| Grade | Items flagged (n) | Total items (N) | Items flagged (%) | Items with moderate or large effect size (n) |
|:------|---:|---:|---:|---:|
| **English language arts** | | | | |
| 3     | 17 | 137 | 12.4 | 0 |
| 4     | 21 | 157 | 13.4 | 0 |
| 5     | 25 | 123 | 20.3 | 0 |
| 6     | 16 | 151 | 10.6 | 0 |
| 7     | 13 | 153 |  8.5 | 0 |
| 8     | 17 | 141 | 12.1 | 0 |
| 9–10  | 25 | 135 | 18.5 | 0 |
| 11–12 | 23 | 136 | 16.9 | 0 |
| **Mathematics** | | | | |
| 3     | 32 | 243 | 13.2 | 0 |
| 4     | 26 | 262 |  9.9 | 0 |
| 5     | 34 | 332 | 10.2 | 0 |
| 6     | 31 | 276 | 11.2 | 0 |
| 7     | 24 | 225 | 10.7 | 0 |
| 8     | 23 | 240 |  9.6 | 0 |
| 9     | 27 | 180 | 15.0 | 0 |
| 10    |  8 | 153 |  5.2 | 0 |
| 11    | 17 | 210 |  8.1 | 0 |

Using the Zumbo and Thomas (1997) effect-size classification criteria, all combinations were found to have a negligible effect-size change after the gender and interaction terms were added to the regression equation. When using the Jodoin and Gierl (2001) effect-size classification criteria, all but 18 combinations were found to have a negligible effect-size change after the gender and interaction terms were added to the regression equation.

The results of the DIF analyses for ethnicity were similar to those for gender. When using the Zumbo and Thomas (1997) effect-size classification criteria, all combinations were found to have a negligible effect-size change after the ethnicity and interaction terms were added to the regression equation. Similarly, when using the Jodoin and Gierl (2001) effect-size classification criteria, all combinations were found to have a negligible effect-size change after the ethnicity and interaction terms were added to the regression equation.

Information about the flagged items with a non-negligible change in effect size after adding both the group and interaction term is summarized in Table 9.12, where B indicates a moderate effect size, and C a large effect size. In total, 18 combinations had a moderate effect size.
The $$\beta_3\text{X}G$$ values in Table 9.12 indicate which group was favored at lower and higher numbers of linkage levels mastered. A total of 10 combinations favored the focal group at higher numbers of total linkage levels mastered and the reference group at lower numbers of total linkage levels mastered.

Table 9.12: Combinations Flagged for DIF With Moderate or Large Effect Size for the Combined Model

| Item ID | Focal group | Grade | EE | $$\chi^2$$ | $$p$$-value | $$\beta_2G$$ | $$\beta_3\text{X}G$$ | $$R^2$$ | Z&T* | J&G* | Window |
|:--------|:-----|:------|:---|---:|---:|---:|---:|---:|:--|:--|:-------|
| **English language arts** | | | | | | | | | | | |
| 38067 | Female | 5     | ELA.EE.RI.5.9     |  9.79 | .007   |  0.26 | −0.03 | .047 | A | B | Fall   |
| 54104 | Female | 5     | ELA.EE.RL.5.2     | 15.27 | < .001 | −0.56 | −0.01 | .042 | A | B | Spring |
| 32251 | Female | 8     | ELA.EE.L.8.5.a    |  9.61 | .008   |  0.39 | −0.03 | .040 | A | B | Spring |
| 34022 | Female | 8     | ELA.EE.RI.8.1     | 17.75 | < .001 | −1.13 |  0.46 | .060 | A | B | Spring |
| 38518 | Female | 8     | ELA.EE.RI.8.4     | 11.11 | .004   | −2.00 |  0.05 | .037 | A | B | Fall   |
| 56991 | Female | 8     | ELA.EE.RI.8.5     | 13.05 | .001   |  2.83 | −0.10 | .060 | A | B | Spring |
| 55590 | Female | 11–12 | ELA.EE.RL.11-12.2 | 12.54 | .002   | −0.84 |  0.26 | .039 | A | B | Fall   |
| **Mathematics** | | | | | | | | | | | |
| 12516 | Female | 4     | M.EE.4.MD.2.d     | 20.66 | < .001 | −0.89 |  0.60 | .041 | A | B | Spring |
| 21036 | Female | 4     | M.EE.4.OA.1-2     | 21.18 | < .001 | −0.50 |  0.26 | .042 | A | B | Spring |
| 8272  | Female | 5     | M.EE.5.MD.3       | 10.84 | .004   |  1.96 | −0.15 | .051 | A | B | Spring |
| 14901 | Female | 6     | M.EE.6.NS.5-8     | 18.64 | < .001 | −1.27 |  0.26 | .036 | A | B | Fall   |
| 37348 | Female | 6     | M.EE.6.NS.1       |  9.07 | .011   |  1.31 | −0.27 | .036 | A | B | Spring |
| 11704 | Female | 8     | M.EE.8.G.4        |  7.84 | .020   |  1.10 | −0.09 | .041 | A | B | Spring |
| 43243 | Female | 8     | M.EE.8.NS.2.b     |  9.15 | .010   |  1.83 | −0.09 | .039 | A | B | Fall   |
| 41185 | Female | 9     | M.EE.HS.N.CN.2.a  |  8.81 | .012   | −3.27 |  0.28 | .044 | A | B | Spring |
| 41683 | Female | 9     | M.EE.HS.N.CN.2.a  | 11.95 | .003   | −2.24 |  0.12 | .046 | A | B | Fall   |
| 21354 | Female | 10    | M.EE.HS.S.ID.4    | 11.85 | .003   | −0.77 |  0.17 | .035 | A | B | Spring |
| 41485 | Female | 10    | M.EE.HS.S.ID.4    | 13.11 | .001   |  0.61 |  0.05 | .054 | A | B | Fall   |

Note. EE = Essential Element; Z&T = Zumbo & Thomas; J&G = Jodoin & Gierl; ELA = English language arts.
* Effect-size measure.

Appendix B includes plots labeled by item ID, which display the best-fitting regression line for each subgroup, with jitter plots representing the total linkage levels mastered for individuals in each subgroup. Plots are included for the five combinations with a non-negligible effect-size change in the uniform DIF model (Table 9.9), as well as the 18 combinations with non-negligible effect-size changes in the combined model (Table 9.12).

#### 9.3.1.3 Test Development Team Review of Flagged Items

The test development teams for each subject were provided with data files that listed all items flagged with a moderate effect size. To avoid biasing the review of items, these files did not indicate which group was favored.

During their review of the flagged items, test development teams were asked to consider facets of each item that may lead one gender group to provide correct responses at a higher rate than the other. Because DIF is closely related to issues of fairness, the bias and sensitivity external review criteria were provided for the test development teams to consider as they reviewed the items. After reviewing a flagged item and considering its context in the testlet, including the ELA text or the engagement activity in mathematics, test development teams were asked to provide one of three decision codes for each item.

1. Accept: There is no evidence of bias favoring one group or the other. Leave item as is.
2. Minor revision: There is a clear indication that a fix will correct the item if the edit can be made within the allowable edit guidelines.
3. Reject: There is evidence the item favors one gender group over the other. There is no allowable edit to correct the issue. The item is slated for retirement.

After review, all ELA items flagged with a moderate effect size were given a decision code of 1 by the test development teams. Three mathematics items were given a decision code of 3 and retired, while the ten remaining mathematics items flagged with a moderate effect size were given a decision code of 1. No evidence could be found in any of the items with a decision code of 1 indicating the content favored one gender group over the other.

As additional data are collected in subsequent operational years, the scope of DIF analyses will be expanded to include additional items and approaches to detecting DIF.

### 9.3.2 Internal Structure Within Linkage Levels

Internal structure traditionally indicates the relationships among items measuring the construct of interest. However, for DLM assessments, the level of scoring is the linkage level, and all items measuring a linkage level are assumed to be fungible. Therefore, DLM assessments instead present evidence of internal structure across linkage levels, rather than across items. Further, traditional evidence, such as item-total correlations, is not presented because DLM assessment results consist of the set of mastered linkage levels rather than a scaled score or raw total score.

Chapter 5 of this manual includes a summary of the parameters used to score the assessment, which includes the probability of a master providing a correct response to items measuring the linkage level and the probability of a non-master providing a correct response to items measuring the linkage level. Because a fungible model is used for scoring, these parameters are the same for all items measuring the linkage level. Chapter 5 also provides a description of the linkage level discrimination (i.e., the ability to differentiate between masters and non-masters).

Chapter 3 of this manual includes additional evidence of internal consistency in the form of standardized difference figures. Standardized difference values indicate how far each item’s p-value falls from the linkage level mean. Across all linkage levels, 11,246 items (94%) fell within two standard deviations of the mean for their linkage level.
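The standardized difference computation can be sketched as follows. The two-standard-deviation criterion mirrors the description above; the sample p-values, the function names, and the use of the sample standard deviation are illustrative assumptions.

```python
import statistics

def standardized_differences(p_values):
    """Standardized difference of each item's p-value from the
    linkage level mean, in standard deviation units."""
    mean = statistics.mean(p_values)
    sd = statistics.stdev(p_values)  # sample SD; an assumption here
    return [(p - mean) / sd for p in p_values]

def flag_items(p_values, threshold=2.0):
    """Indices of items whose p-value falls more than `threshold`
    standard deviations from the linkage level mean."""
    zs = standardized_differences(p_values)
    return [i for i, z in enumerate(zs) if abs(z) > threshold]

# Nine items with typical p-values and one unusually hard item:
flagged = flag_items(
    [0.68, 0.70, 0.72, 0.69, 0.71, 0.70, 0.68, 0.72, 0.70, 0.10]
)
```

Items outside the two-standard-deviation band would be the ones reviewed by the test development teams as described below.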

These sources, combined with procedural evidence for developing fungible testlets at the linkage level, provide evidence of the consistency of measurement at the linkage levels. For more information on the development of fungible testlets, see the 2014–2015 Technical Manual—Integrated Model . In instances where linkage levels and the items measuring them do not perform as expected, test development teams review flags and prioritize content for revision and re-field testing, or for retirement, to ensure the content measures the construct as expected.

## 9.4 Evidence Based on Relation to Other Variables

According to Standards for Educational and Psychological Testing, “analyses of the relationship of test scores to variables external to the test provide another important source of validity evidence” . Results from the assessment should be related to other external sources of evidence measuring the same construct.

### 9.4.1 Postsecondary Opportunities

During 2019–2020, evidence was collected to evaluate the extent to which the DLM alternate academic achievement standards are aligned to ensure that a student who meets these standards is on track to pursue postsecondary education or competitive integrated employment. The 2014–2015 Technical Manual—Integrated Model provides evidence of vertical alignment for the alternate academic achievement standards.

Further evidence describes the relationship of the DLM alternate academic achievement standards to the knowledge, skills, and understandings needed for pursuit of postsecondary opportunities. We developed two hypotheses about the expected relationship between meeting DLM alternate academic achievement standards and being prepared for a variety of postsecondary opportunities.

1. Nearly all academic skills will be associated with performance level descriptors at a variety of grades between grade 3 and high school. Few, if any, academic skills will first be associated with At Target before grade 3 or after high school.
2. Because academic skills may be associated with multiple opportunities and with soft skills needed for employment and education, we expected Hypothesis 1 to hold for academic skills associated with employment opportunities, education opportunities, and soft skills.

Similar to academic education for all students, academics for students with significant cognitive disabilities develops across grades. Individuals use academic skills at varying levels of complexity, depending on specific employment or postsecondary education settings. Therefore, academic skills associated with achieving At Target in lower grades demonstrate where students are able to apply the least-complex version of the skill. Given the vertical alignment of DLM content and achievement standards, students are expected to continue learning new skills in subsequent grades and be prepared for more-complex applications of the academic skills by the time they transition into postsecondary education and employment.

A panel of experts on secondary transition and/or education of students with significant cognitive disabilities identified postsecondary competitive integrated employment and education opportunities. Their goal was to identify an extensive sampling of opportunities rather than an exhaustive list. Panelists also considered the types of educational and employment opportunities currently available to students with significant cognitive disabilities as well as opportunities that may be more aspirational (i.e., opportunities that may become available in the future). Panelists identified 57 employment opportunities and seven postsecondary education opportunities. Employment opportunities spanned sectors including agriculture, business, arts, education, health sciences, hospitality, information technology, manufacturing, and transportation.

Panelists next identified the knowledge, skills, and understandings needed to fulfill the responsibilities for the employment opportunities as well as eight common responsibilities across all postsecondary education opportunities. Finally, the panel identified the knowledge, skills, and understandings within soft skills (e.g., social skills, self-advocacy) applicable across multiple postsecondary settings. Subject-matter experts in English language arts and mathematics reviewed and refined the academic skill statements to provide clarity and consistency across skills. This resulted in 50 English language arts academic skills and 41 mathematics academic skills to be used in the next phase of the study.

The second set of panels, one for each subject, examined the relationship between the academic skills and the types of academic knowledge, skills, and understandings typically associated with meeting the DLM alternate academic achievement standards (i.e., achieving At Target). By identifying the lowest grade where a student achieving At Target is likely to consistently demonstrate the academic skill, the second panel identified the first point where students would be ready to pursue postsecondary opportunities that required the least-complex application of the skill.

Panels consisted of general educators and special educators who administered DLM assessments from across DLM states. Most panelists had expertise across multiple grade bands, and some had certification in both an academic subject and special education. Panels completed training and calibration activities prior to making independent ratings. When there was no initial majority agreement, panels discussed ratings until they reached consensus.

Panels identified the lowest grade in which students who achieve At Target on the DLM alternate assessment are at least 80% likely to be able to demonstrate each skill, showing the first point of readiness to pursue postsecondary opportunities that require the least-complex application of academic skills. In ELA, students achieving At Target are expected to first demonstrate 96% of those skills by grade 5. In mathematics, students meeting achievement standards are expected to first demonstrate 72% of the academic skills by grade 5 and 24% of skills in middle grades (grades 6–8).

Overall, findings from panels indicate that most academic skills needed to pursue postsecondary opportunities are first associated with meeting the DLM academic achievement expectations in earlier grades (i.e., 3–5). Given the vertical alignment of the DLM academic achievement standards, students who achieve At Target in early grades further develop these skills so that, by the time they leave high school, they are ready to pursue postsecondary opportunities that require more-complex applications of the academic skills.

Panelists also participated in focus groups to share their perceptions of opportunities, skills, and expectations for students with significant cognitive disabilities. Panelists believed the academic skills were important to postsecondary opportunities for all students, not only those who take the DLM assessment. Panelists indicated that students who were At Target in high school on the DLM assessment were likely to possess the needed academic knowledge, skills, and understandings to pursue a range of postsecondary opportunities.

Evaluations of panelists’ experiences from both panels and DLM Technical Advisory Committee members’ review of the processes and evaluation results provide evidence that the methods and processes used achieved the goals of the study. See the postsecondary opportunities technical report for the full version of the study.

## 9.5 Conclusion

This chapter presents additional studies as evidence for the overall validity argument for the DLM Alternate Assessment System. The studies are organized by source of evidence (content, response process, internal structure, and relation to other variables), as defined by the Standards for Educational and Psychological Testing , the professional standards used to evaluate educational assessments.

The final chapter of this manual, Chapter 11, references evidence presented throughout the technical manual, including Chapter 9, and expands the discussion of the overall validity argument. Chapter 11 also provides areas for further inquiry and ongoing evaluation of the DLM Alternate Assessment System, building on the evidence presented in the 2014–2015 Technical Manual—Integrated Model and the subsequent annual technical manual updates , in support of the assessment’s validity argument.