
Reliability of Selection and Assessment Tools: A Discussion

In the context of assessment, Reliability has a specific technical definition and meaning, one quite different from the meaning the word has in everyday use.

Any observed assessment score consists of a true score (which we can never know exactly) and some measurement error. Different assessment tools measure true scores with different degrees of accuracy; the term we use to refer to this is Reliability.

We can define Reliability as the accuracy with which a tool can measure true scores.

If there were no measurement error, then the scores we obtained would be perfectly reliable and would always represent true scores, i.e. they would measure whatever they were measuring with total accuracy.
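
To make the true score model concrete, here is a minimal simulation sketch in Python with NumPy (the article itself names no software, and the figures below are purely illustrative):

    # A minimal sketch of the true-score model: observed = true + error.
    # Reliability is the proportion of observed variance due to true scores.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000

    true_scores = rng.normal(50, 10, n)   # true scores (unknowable in practice)
    errors = rng.normal(0, 5, n)          # random measurement error
    observed = true_scores + errors       # what the assessment actually yields

    reliability = true_scores.var() / observed.var()
    print(f"reliability ~ {reliability:.2f}")  # ~ 10**2 / (10**2 + 5**2) = 0.80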

There are a number of ways of estimating reliability, all of them based on correlation. Because a reliability coefficient is essentially a correlation coefficient, it can only take a value between zero and plus one, where zero means no reliability and plus one means perfect reliability, e.g. "The reliability of this measure is r = 0.60." If a negative figure is obtained, this usually indicates that one of the measurement scales has been reversed.

There are three main methods of estimating reliability: test-retest measures, alternate-form measures and internal consistency measures. Each of these methods is sensitive to a different source of error, and so each will produce slightly different reliability estimates.
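
As a sketch of how two of these estimates are computed in practice (Python/NumPy assumed; the function names below are our own, not the article's):

    import numpy as np

    # Test-retest (or alternate-form): correlate scores from two administrations.
    def test_retest_reliability(scores_time1, scores_time2):
        return np.corrcoef(scores_time1, scores_time2)[0, 1]

    # Internal consistency: Cronbach's alpha over a persons x items score matrix.
    def cronbach_alpha(item_scores):
        k = item_scores.shape[1]                           # number of items
        item_variances = item_scores.var(axis=0, ddof=1).sum()
        total_variance = item_scores.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_variances / total_variance)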

What Values for Reliability are Acceptable?  

Some argue that an acceptable reliability coefficient is 0.70 or more, but there is no straightforward definition of what it should be. The following may, however, give you a guide to the sort of values to expect.

For stable traits, short-term equivalence (a two-week test-retest) between alternate forms of the same measure should be in the range 0.65 to 0.80, while a same-form retest should be in the range 0.75 to 0.90. These values may be lower for state measures, which assess transient states as opposed to stable traits, e.g. anxiety (although anxiety can be a trait as well).

Trait measures (mainly personality instruments) are less stable than abilities: for a two-year retest coefficient we would expect a value of 0.40 to 0.50 for a trait measure, whereas we would expect a value of 0.60 or better for an ability measure.

Internal consistency reliability values depend not only on the breadth of the characteristic being measured, but also on how well the measurement tool's items sample it. The higher the value obtained, the narrower we would expect the construct being measured to be. We should treat very high values (greater than 0.95) and very low values (0.70 or less) with caution.

Reliability is also affected by the number of people in the sample on which the reliability estimate is based: the larger the sample, the smaller the error surrounding the reliability estimate will be.

The longer a measure is, i.e. the more items it has, the more accurately it will sample the domain being measured. As the length increases, so should the reliability, but only if the items are still sampling the domain in question.
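
The standard way of quantifying this length effect (not quoted in the article, but consistent with it) is the Spearman-Brown prophecy formula; a sketch:

    # Spearman-Brown prophecy formula: predicted reliability when a test
    # is lengthened by a factor k with items sampling the same domain.
    def spearman_brown(reliability, k):
        return (k * reliability) / (1 + (k - 1) * reliability)

    print(spearman_brown(0.70, 2))   # doubling a 0.70-reliable test -> ~0.82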

The Correlation between Scale Scores and Reliability

As the correlation between two scales changes, so does the reliability of their sum and of their difference. As the correlation between two scales (e.g. a predictor score and a success criterion score) increases:

For sum scores (and for difference scores when the correlation decreases):

• Variance and reliability increase
• A decreased amount of the variance is accounted for by error

For difference scores (and for sum scores when the correlation decreases):

• Variance and reliability decrease
• An increased amount of the variance is accounted for by error

Remember that a strong correlation between two measures means that they overlap to some degree. With a difference score, the more the two measures overlap, the less difference there will be between the two scores. With sum scores, we are in effect adding the two measures together to produce one long test. If the two measures are similar, this longer test takes on the psychometric characteristics that give tests their reliability, and so as the correlation increases, so will the reliability of the sum score.
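
The textbook formulas for sum and difference score reliability make this pattern explicit. A sketch, under the simplifying assumption that the two scales have equal variances (the formulas are standard psychometric results, not taken from the article):

    # Reliability of sum and difference scores for two equal-variance scales
    # with reliabilities r11 and r22 and inter-scale correlation r12.
    def sum_reliability(r11, r22, r12):
        return (r11 + r22 + 2 * r12) / (2 + 2 * r12)

    def diff_reliability(r11, r22, r12):
        return (r11 + r22 - 2 * r12) / (2 - 2 * r12)

    # As r12 rises, sum reliability rises and difference reliability falls.
    for r12 in (0.3, 0.5, 0.7):
        print(r12,
              round(sum_reliability(0.8, 0.8, r12), 2),
              round(diff_reliability(0.8, 0.8, r12), 2))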

We also need to consider Range Restriction. If two samples of people took the same measure and one sample produced a set of scores that covered the whole range of possible scores (low through to high), while the other produced a set of scores covering a narrower range, then the variance of the first set would be greater than that of the second.

The term we would use to describe what has happened to the second set of scores is Range Restriction. Range restriction can happen by chance, or because some bias has been introduced into the process. The commonest source of range restriction is something that happens all the time in personnel selection: the practice of basing selection on the top 10% (or some other proportion) of scorers. In this case the variance of the scores is reduced, because the top 10% of scores will all be quite high.

For the statisticians out there, we can use the following formula to correct for range restriction:

R1 = 1 − [(SD2² / SD1²) × (1 − R2)]

where

  • R1 = reliability corrected for range restriction
  • SD2 = standard deviation of the restricted sample
  • SD1 = standard deviation of the unrestricted sample
  • R2 = reliability of the restricted sample
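
A sketch of this correction in code, applied to the top-10% selection scenario described above (Python/NumPy assumed; all figures are illustrative):

    import numpy as np

    def correct_for_range_restriction(r2, sd2, sd1):
        # R1 = 1 - [(SD2^2 / SD1^2) x (1 - R2)]
        return 1 - (sd2 ** 2 / sd1 ** 2) * (1 - r2)

    rng = np.random.default_rng(0)
    applicants = rng.normal(100, 15, 10_000)   # unrestricted applicant pool
    selected = np.sort(applicants)[-1_000:]    # top 10% of scorers only

    sd1 = applicants.std(ddof=1)               # SD of the unrestricted sample
    sd2 = selected.std(ddof=1)                 # SD of the restricted sample

    # Corrected upwards from the restricted-sample estimate of 0.60.
    print(correct_for_range_restriction(0.60, sd2, sd1))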