It’s hard to believe it’s March…and time for Research Tuesday (again!). I remember being in grad school (or maybe undergrad) and doing a group project on articulation assessments. We were split into groups (or we split ourselves into groups, I don’t remember), and we were required to study an assigned articulation assessment, determine validity, specificity, etc… and critique the administration, picture stimuli, results, etc. It was an interesting assignment. I remember making the PowerPoint for it, and I remember talking with the Dean of the department and telling him I wanted to update the test we had (it was horribly out of date), and he said that’d be a great thesis or dissertation project. LOL. Too bad I can’t remember the name of the test now (I could go look it up, but it’d take too much energy).
Anyway…When I came upon today’s article, it reminded me of that assignment…and I thought it was important to review. So, without further ado, I bring you Research Tuesday!
Today’s article was recently accepted (February 2015) into ASHA’s journal Language, Speech, and Hearing Services in Schools.
Article: Psychometric Characteristics of Single-Word Tests of Children’s Speech Sound Production
Authors: Peter Flipsen, Jr. and Diane A. Ogiela
Background: Back in 1984, McCauley and Swisher reviewed the characteristics of single-word articulation assessments and how they were constructed. They used the criteria established for norm-referenced tests in Standards for Educational and Psychological Tests by the American Psychological Association. None of the tests were found to provide the stringent evidence needed to show that they could reliably (and with validity) indicate the absence or presence of an articulation disorder. [Oh! That can’t be good.]
A survey by Skahan, Watson, and Lof (2007) reported that nearly 75% of clinicians “always” included norm-referenced tests in their articulation assessments. Another 15% used these tests “sometimes.” [I think most states require a norm-referenced assessment of some sort. I know North Dakota does.] Obviously, norm-referenced articulation assessments are used widely.
Purpose: The purpose of this article was to review the psychometric characteristics of articulation assessments (single-word). The study was to determine if tests have improved since the original study in 1984 where no tests passed the review.
The original study looked at:
- Validity: does the test actually do what it’s supposed to do?
- Construct validity: “the degree to which a test maps onto the theoretical construct it is supposed to assess.”
- Content validity: “the degree to which a test measures the relevant behavior.”
- Concurrent validity: a determination of whether or not “categorizations of children as normal or impaired using the test agree closely with categorizations obtained by other methods.” [basically, if test A shows the child is disordered I should be able to document it another way too – language sample, clinician’s judgement, another test…you get the picture.]
- Predictive validity: “The presence of ‘empirical evidence that could be used to predict later performance on another, valid criterion…’” [Basically, does the performance on the test predict speech production accuracy in adolescence? or later reading skills? or … ]
- Reliability: how consistently did the tests measure what they were supposed to be measuring? (test-retest, and inter-examiner reliability)
- Comparison sample: who was used as the normative population? Size? Cultural influences? Boys vs Girls, etc…
Questions: The purpose of the study was to examine currently available articulation assessments (ten of them) to determine if they met the requirements above. A second purpose of the study was to provide a listing of how each test measured up so that clinicians could choose tests wisely, and assessment developers would improve the tests.
What they did: The authors chose 10 single-word articulation assessments to systematically review.
The authors examined 17 different criteria: the 10 originally included with the McCauley – Swisher study, and seven more related to validity. Seven of the 17 criteria were related to validity, five were related to reliability, and five were related to the test norms.
Results: This is a bit convoluted and I sincerely hope that you take the time to read the whole article for yourself. I know I will be missing some important information here, simply because I can’t replicate the whole article and you won’t read this post if it gets too terribly long.
Validity findings: 5 tests had a formal process of item analysis; 6 tests provided concurrent validity; diagnostic accuracy data was available for 6 tests; and 4 tests included findings from group-wise comparisons (typical vs non-typical). 4 of the tests did not address dialect, 3 tests included formal vowel analysis, and 6 of the tests had phonological processes/patterns analysis.
Reliability findings: 8 tests included test-retest reliability findings (but only 3 met the rule that all coefficients were at least .90), and 7 of the tests included inter-examiner reliability (with 4 of them meeting the .90 rule). All 10 tests were determined to have clear administration, and 9 of them specifically stated examiner qualifications.
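If you'd like to see that ".90 rule" spelled out, here is a minimal sketch in Python. The function name and the coefficients are made up for illustration (they are not from the article or from any test manual); the only thing taken from the article is the idea that every reported coefficient has to be at least .90.

```python
def meets_reliability_rule(coefficients, threshold=0.90):
    """Return True only if every reported coefficient is at least `threshold`.

    A test that reports no coefficients at all cannot meet the rule.
    """
    return bool(coefficients) and all(c >= threshold for c in coefficients)


# Made-up coefficients for two imaginary tests (NOT taken from any manual).
hypothetical_test_a = [0.92, 0.95, 0.90]
hypothetical_test_b = [0.93, 0.88, 0.96]

print(meets_reliability_rule(hypothetical_test_a))  # True
print(meets_reliability_rule(hypothetical_test_b))  # False: 0.88 is below .90
```

The point is that one low coefficient is enough to fail, which is why fewer tests "met the rule" than reported reliability data at all.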
Normative findings: 6 of the tests provided descriptions of their normative samples, two tests failed because they did not include enough detail, 3 manuals did not include SES information, and 8 of the manuals provided enough information to allow classification of samples. [Essentially, it’s important to look at the normative values to see if the sample included those students with known language disorders as well as typical students. The article states research indicates a strong co-morbidity of language and articulation disorders.]
Discussion: The review specifically stated tests have improved over the years. Of the original study criteria, two tests met 7 of the criteria (CAAP2, SPAT-II-D), 4 tests met 6 of them (AAPS-3, HAPP-3, LAT, PAT-3), 3 tests met 5 (DEAP-A, GFTA-2, SHAPE), and one test met only 3 (BBTOP). [I have to say I love that the HAPP-3 and PAT-3 “beat” the GFTA-2 here…While I use the GFTA, it is NOT my go-to test!] Since none of the tests reviewed in 1984 passed, we have definitely improved! Obviously, there’s a bit further to go to make sure that we’re using stringent, reliable, and valid assessments.
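For anyone who wants the tally above in a form they can sort or add to, here is a small Python sketch. The counts are the ones I listed in the Discussion (criteria met out of the 10 original McCauley and Swisher criteria); the variable and function names are just my own illustration, not anything from the article.

```python
from collections import defaultdict

# Number of the 10 original criteria each test met, as listed above.
criteria_met = {
    "CAAP2": 7, "SPAT-II-D": 7,
    "AAPS-3": 6, "HAPP-3": 6, "LAT": 6, "PAT-3": 6,
    "DEAP-A": 5, "GFTA-2": 5, "SHAPE": 5,
    "BBTOP": 3,
}


def group_by_score(scores):
    """Group test names by how many criteria they met, highest score first."""
    groups = defaultdict(list)
    for test, n_met in scores.items():
        groups[n_met].append(test)
    return {n: sorted(tests) for n, tests in sorted(groups.items(), reverse=True)}


tally = group_by_score(criteria_met)
for n_met, tests in tally.items():
    print(f"Met {n_met} of the 10 original criteria: {', '.join(tests)}")
```

It also makes a quick sanity check easy: the groups add up to the 10 tests the authors reviewed.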
Suggested further research: Not really …. research, per se…but more suggestions to assessment developers. 1) include normative vowel information, 2) consider developmental milestones (in other words, quit counting /r/ as wrong at age 3) [I’m honestly not sure HOW I feel about this since we can’t agree on when those phonemes are developmentally appropriate…see my blog post on articulation norms], 3) increase the criterion requirement to .90 for maximum stringency when determining coefficients in reliability measures.
Your thoughts: One of the questions the article brings up is whether manuals should include separate scoring for males and females. Six of the tests did. Others included it for part (some ages) but not all…others didn’t include it at all. Is it important?
So…what are your “go-to” articulation assessments? Do you use other criteria as well (language sample, etc.)? When was the last time you looked at the manual of a test and really delved deeply into the reliability – validity – and normative sample? I can’t wait to hear your responses…
For other great Research Tuesday posts, be sure to check out Gray Matter Therapy’s Research Tuesday page.
Until then….Adventure on!
Mary
I wonder if there have been many changes for the GFTA-3 that is about to come out? We will probably be getting that and I’m hoping to get the app version which, I think, will have the KLPA-3 alongside it – I noticed they didn’t include that one but if you do both together, I wonder if it’s more valid/reliable/etc or not?
I don’t think so since the scoring was based on validity, reliability, etc… What the article said was “Tests which focused mainly on speech motor skills such as the Kaufman Speech Praxis Test for Children (Kaufman, 1995) or which involved secondary analysis of stimuli developed for other tests such as the Khan-Lewis Phonological Analysis (Khan & Lewis, 2002) were excluded.” I don’t think the Khan Lewis validity would have impacted the GFTA’s validity… It’ll be interesting to see if the GFTA-3 has DIFFERENT validity/reliability, etc. since it’s going to have new normative standards, etc.