Liam Healy & Associates
Chartered Occupational Psychologists

Reliability of Selection and Assessment Tools: A Discussion

In the context of assessment, Reliability has a specific technical definition and meaning which is quite different from the meaning it has in everyday use. Any observed assessment score consists of a true score (which we can never accurately know) and some measurement error. Different assessment tools will measure true scores with different degrees of accuracy, and the term we use to refer to this is Reliability. We can define Reliability as the accuracy with which a tool can measure true scores. If there were no measurement error, then the scores we obtained would be perfectly reliable and would always represent true scores, i.e. they would measure whatever it was they were measuring with total accuracy.

There are a number of ways of estimating reliability, and they are all based on correlation. Because a reliability coefficient is essentially a correlation coefficient, it can only take a value between zero and plus one, where zero means no reliability and plus one means perfect reliability.
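The true-score idea above can be made concrete with a small simulation. Assuming an observed score is a true score plus independent random error (all figures below are illustrative), the reliability is the proportion of observed variance due to true scores, and it can be estimated as the correlation between two parallel administrations of the measure:

```python
import random
import statistics

random.seed(1)

# Classical test theory: observed score = true score + error.
n = 10_000
true_scores = [random.gauss(0, 1) for _ in range(n)]
# Two parallel forms: same true score, independent error (SD 0.5).
form_a = [t + random.gauss(0, 0.5) for t in true_scores]
form_b = [t + random.gauss(0, 0.5) for t in true_scores]

def pearson(x, y):
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sx, sy = statistics.pstdev(x), statistics.pstdev(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) * sx * sy)

# Theoretical reliability: var(true) / var(observed) = 1 / (1 + 0.5**2) = 0.8
print(round(pearson(form_a, form_b), 2))
```

With no measurement error (error SD of zero) the correlation, and hence the reliability, would be a perfect 1.0.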
A reliability coefficient is reported as, e.g., "the reliability of this measure is r = 0.60".

There are three main methods of estimating reliability: test-retest measures, alternate form measures and internal consistency measures. Each of these methods is sensitive to a different source of error, and so each will produce a slightly different reliability estimate.

What Values for Reliability are Acceptable?

There are no hard and fast rules about what a reliability coefficient should be, although some argue that 0.70 or more is acceptable; the following may give you a guide to the sort of values to expect. For stable traits, short-term equivalence (two-week test-retest) between alternate forms of the same measure should be in the range 0.65 to 0.80, and same-form retest should be in the range 0.75 to 0.90. These values may be lower for measures which assess states as opposed to stable traits, e.g. anxiety (although anxiety can be a trait as well). Trait measures (mainly personality instruments) are less stable than abilities: for a two-year retest coefficient we would expect a value of 0.40 to 0.50, whereas we would expect a value of 0.60 or better for an ability measure.

Internal consistency reliability values depend not only on the breadth of the characteristic being measured, but also on how well the measurement tool's items sample it. The higher the value obtained, the narrower we would expect the construct being measured to be. We should treat very high values (greater than 0.95) and very low values (0.70 or less) with caution. Reliability is also affected by the number of people in the sample on which the reliability estimate is based: the larger the sample, the smaller the error surrounding the reliability estimate will be. The longer a measure is, i.e. the more items it has, the more accurately it will sample the domain being measured. As the length increases, so should the reliability, but only if the items are still sampling the domain in question.
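Of the three methods above, internal consistency is the one usually computed directly from a single administration, most commonly as Cronbach's alpha. A minimal sketch, using made-up item scores for a hypothetical three-item scale answered by five respondents:

```python
import statistics

def cronbach_alpha(items):
    """Cronbach's alpha for a list of per-item score lists (one score per respondent)."""
    k = len(items)  # number of items
    sum_item_vars = sum(statistics.pvariance(item) for item in items)
    # Total scale score for each respondent.
    totals = [sum(scores) for scores in zip(*items)]
    total_var = statistics.pvariance(totals)
    return (k / (k - 1)) * (1 - sum_item_vars / total_var)

# Hypothetical data: three items, five respondents.
items = [
    [2, 4, 3, 5, 1],
    [3, 5, 3, 4, 2],
    [2, 4, 4, 5, 1],
]
print(round(cronbach_alpha(items), 2))  # 0.94
```

Note that a very high value like this on a short scale would, as discussed above, prompt us to ask whether the items are near-duplicates measuring a very narrow construct.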
The Correlations between Scale Scores and Reliability

As the correlation between two scales changes, so does the reliability of their sum and of their difference. As the correlation between two scales (for example, a predictor score and a success criterion score) increases:

For sum scores (and for difference scores when the correlation decreases):
· Variance and reliability increase
· A decreased amount of the variance is accounted for by error

For difference scores (and for sum scores when the correlation decreases):
· Variance and reliability
decrease
· An increased amount of the variance is accounted for by error

Remember that a strong correlation between two measures means that they overlap to some degree. With a difference score, the more overlap there is, the less difference there will be between the two scores. With sum scores, we are in effect adding the two measures together to produce one long test; if the two measures are similar, the combined measure takes on the psychometric characteristics that give longer tests their reliability, and so as the correlation increases, so will the reliability of the sum score.

We also need to consider Range Restriction. If two samples of people took the same measure, and one sample produced a set of scores that covered the whole range of possible scores (low through to high) while the other produced a set of scores which covered a narrower range, then the variance of the first set would be greater than that of the second. The term we would use to describe what has happened to the second set of scores is Range Restriction. Range restriction can happen by chance, or because of some bias which has been introduced to the process. The commonest source of range restriction is something that happens all the time in personnel selection: the practice of basing selection on the top 10% (or some other proportion) of scorers. In this case we will find that the variance of scores is reduced, because the top 10% of scores will all be quite high.

For the statisticians out there, we can use the following formula to correct for range restriction:

R1 = 1 - [(SD2² / SD1²) x (1 - R2)]

Where R2 and SD2 are the reliability and standard deviation observed in the range-restricted sample, SD1 is the standard deviation in the unrestricted sample, and R1 is the estimated reliability in the unrestricted sample.
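The correction can be sketched in code. The figures below are hypothetical, with subscript 2 referring to the range-restricted sample and subscript 1 to the full-range sample:

```python
def correct_for_range_restriction(r_restricted, sd_restricted, sd_unrestricted):
    # R1 = 1 - [(SD2^2 / SD1^2) x (1 - R2)]
    # r_restricted (R2): reliability observed in the range-restricted sample.
    # sd_restricted (SD2), sd_unrestricted (SD1): the two standard deviations.
    variance_ratio = (sd_restricted ** 2) / (sd_unrestricted ** 2)
    return 1 - variance_ratio * (1 - r_restricted)

# Hypothetical figures: reliability of 0.70 among a shortlist of top scorers
# whose SD is 4, against a full applicant-pool SD of 10.
print(round(correct_for_range_restriction(0.70, 4, 10), 2))  # 0.95
```

Because the restricted sample's variance is smaller, the corrected estimate is always at least as high as the restricted-sample reliability.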

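The earlier claims about sum and difference scores can also be checked numerically. Assuming two scales with equal variances, classical test theory gives the reliability of their sum (or difference) as (r11 + r22 ± 2·r12) / (2 ± 2·r12); a minimal sketch:

```python
def composite_reliability(r11, r22, r12, kind="sum"):
    # Reliability of the sum or difference of two equal-variance scales:
    # r11, r22 are the scales' reliabilities, r12 the correlation between them.
    sign = 1 if kind == "sum" else -1
    return (r11 + r22 + sign * 2 * r12) / (2 + sign * 2 * r12)

for r12 in (0.0, 0.3, 0.6):
    print(r12,
          round(composite_reliability(0.8, 0.8, r12, "sum"), 2),   # rises with r12
          round(composite_reliability(0.8, 0.8, r12, "diff"), 2))  # falls with r12
```

As the correlation r12 rises, sum-score reliability climbs towards 1 while difference-score reliability falls, matching the bullet points above.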