Liam Healy & Associates

chartered occupational psychologists

Bespoke Psychometric Test Construction

We are qualified psychometric test developers and have developed a range of bespoke interest, ability and personality assessments for clients. Test development takes months, not days. It represents a significant time and financial investment for an organisation.

Commissioning us to develop your own in-house test has a number of advantages:

  1. It allows you to use a psychometric test which focuses only on those traits that are relevant to you.
  2. The administration, interpretation and reporting procedures can be customised to integrate with your wider HR processes and functions.
  3. You own the copyright.

A psychometric test looks deceptively simple - after all, it's just a list of questions and answers, isn't it? Before contacting us to enquire about bespoke psychometric test development, you should consider the following:

Designing a Psychometric Test

It is a simple enough task to write a 'test' that looks like a test - to the untrained eye it may look plausible enough. However, the quality of a test is determined by its psychometric and scaling properties, not by what the test items look like. Our tests are developed according to the guidelines laid down by Kline (1995) and other experts in the test development field.

Is it Easy to Write a Psychometric Test?

It is easy to write a poor quality one. If you have ever seen an aptitude test you may well have thought ‘This looks simple enough, I could write one of these!’ You would be correct that they look simple to produce, but the illusion of simplicity ends there. It can take months or years to develop even a single basic-level test, and the process can involve hundreds, or even thousands, of trial subjects.

Why is There So Much Involved in Developing a Test?

A properly designed and constructed test must have certain technical properties. A test item will only be included once it has passed through a number of stringent quality control processes. There are tests produced by unqualified developers which do not reach these standards, but the major test publishers have worked closely with the various professional associations, the main one being the British Psychological Society, to produce standards to which test development will ideally adhere.

Here are some considerations to take into account:

1. Degree of Difficulty

If all of the questions in a test were so easy that everyone could answer them correctly, or so difficult that no-one could answer them correctly, then that test would tell us nothing about the differences between people on the particular ability or aptitude the test was measuring.

We need to avoid this since, in selection and recruitment particularly, it is those very differences in which we are interested. So initially a very large number of test items are written, often five or ten times as many as will eventually be used. This is because most individual test items will fail a crucial screening: roughly half of the people attempting an item should answer it correctly and half incorrectly (an item difficulty, or 'p-value', of around 0.5). Several hundred people need to attempt each test question before we can determine whether this is the case.
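
To make this concrete, here is a minimal sketch in Python of the kind of item difficulty check described above. The response data are invented, and the 0.5 target and the 0.2-0.8 retention band shown are illustrative rather than fixed rules; real item analysis applies further criteria.

```python
# Minimal sketch of an item difficulty check (illustrative data and thresholds).
# Each item's responses are scored 1 for correct and 0 for incorrect.

def item_difficulty(scores: list[int]) -> float:
    """Proportion of people answering the item correctly (the item 'p-value')."""
    return sum(scores) / len(scores)

# Hypothetical trial data: three items answered by six people.
responses = {
    "item_1": [1, 1, 1, 1, 1, 1],   # everyone correct: too easy, rejected
    "item_2": [1, 0, 1, 0, 1, 0],   # p = 0.5: maximally discriminating, kept
    "item_3": [0, 0, 0, 0, 0, 1],   # almost no-one correct: too hard, rejected
}

for name, scores in responses.items():
    p = item_difficulty(scores)
    verdict = "keep" if 0.2 <= p <= 0.8 else "reject"
    print(f"{name}: p = {p:.2f} -> {verdict}")
```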

2. Degree of Accuracy (Reliability)

Test developers agree that tests and test use are prone to error. This error can come from the test itself, e.g. poorly written or easily misunderstood items, or from the test administration process, e.g. instructions not being adhered to properly, or time limits not being followed.

This error affects how consistently the test will measure the characteristic it is designed to measure. This is known as Reliability: the more reliable the test, the more stable and accurate it is.

Think of reliability like a ruler. If the ruler is made from wood, we would not expect the measurements of length it provides to vary much. If we measured something such as a person’s height one day, and then measured it again the next day with the same ruler, we would expect the two measurements to agree closely. If, on the other hand, the ruler was made of rubber, we would see a large variation from one day to the next: the ruler might measure the same person's height (which we know has not had time to change) on two separate occasions and produce two very different values. Exactly the same principle applies to tests: they have to be stable and accurate before they can be used.

The easiest way to establish whether a test possesses reliability is to administer it to a group of people, and then administer it to the same group again a few weeks later. If the test is stable and reliable, we would expect a high positive correlation between the scores obtained on the two occasions (for the mathematically inclined, the accepted criterion is r = 0.7 or higher). This notion of reliability again comes down to the quality of the test questions.
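
For illustration, here is a minimal sketch of such a test-retest check in Python. The scores are invented; the only assumption is that each person's totals from the two sittings have been paired up.

```python
# Minimal sketch of a test-retest reliability check using Pearson's r.
# Each list holds total test scores for the same six people (hypothetical data).

from statistics import mean, stdev

def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson product-moment correlation between two paired score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

first_sitting  = [42, 55, 61, 38, 70, 49]   # scores at time 1
second_sitting = [45, 52, 63, 40, 68, 50]   # same people, a few weeks later

r = pearson_r(first_sitting, second_sitting)
print(f"test-retest r = {r:.2f} ({'acceptable' if r >= 0.7 else 'below criterion'})")
```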

3. Degree of Relevance to the Job (Validity)

The fundamental question here is this - ‘Does the test measure what it claims to measure?’ This may seem like a strange thing to ask. Surely, a test which contains numerical problems which a person is required to solve is measuring numerical ability? This is not necessarily the case.

Many tests rely on a person having good verbal comprehension in order to complete them successfully - even numerical or abstract reasoning tests, if the instructions are complex. In these cases, although the test claims to measure numerical ability, and the employer may well interpret the test scores in that light, the test is to some unknown extent also a measure of verbal ability. This is quite a challenge to overcome during the test development process, and can be a common source of indirect discrimination.

This concept of whether a test measures what it claims to measure is known as the ‘Validity’ of a test, and it is most often established by statistically examining the degree of correlation between one test and another established test of the same characteristic or ability. In the case of validity, the accepted degree of correlation is r = 0.3 or higher.
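
The validity check looks much the same in practice: correlate scores on the new test with scores on an established measure of the same ability. A brief, illustrative sketch (all figures invented):

```python
# Minimal sketch of a concurrent validity check against an established test.
import numpy as np

new_test         = np.array([12, 19, 25, 31, 22, 15, 28])  # hypothetical scores
established_test = np.array([34, 41, 52, 60, 47, 38, 55])  # same people, older test

r = np.corrcoef(new_test, established_test)[0, 1]  # Pearson correlation
print(f"validity coefficient r = {r:.2f} "
      f"({'meets' if r >= 0.3 else 'fails'} the 0.3 criterion)")
```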

4. Establishing a Benchmark Against Which to Interpret Test Scores (Norms)

If you scored 25 on a test, what would you think? Would you think that was a good score, a poor score, or average? In truth, a standalone 'raw score' like that means nothing, because you have no context in which to interpret it. If you knew your score was 25 out of 100, what would you think then? You might think that was not such a good score. But why? If the test was particularly difficult, your score might be amongst the best.

If you then discovered that your score of 25 out of 100 was an average score, you might now know how good or bad it was - but not quite. There is still one piece of information missing. You now know your score is average, but average compared with whom? Sixteen-year-old school leavers? Graduates? You still do not have a definite idea of what your score means.

If you finally discovered that your score of 25 out of 100 was average compared with the scores of graduates on the same test, you would at last know what your score meant.

This is what happens in ability testing - a test score is interpreted in relation to some comparison group. The aim is to produce a set of these ‘norm groups’ to enable the employer to compare a candidate’s test score with the performance of a known group of people. Norm groups take a long time to produce, and much of the work is done by test publishers prior to releasing the test, although it is a never-ending task and norm groups are constantly being updated.
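
As a rough illustration of norm-referenced interpretation, the sketch below places a raw score of 25 within a hypothetical graduate norm group, assuming (as a simplification) that the norm group's scores are roughly normally distributed.

```python
# Minimal sketch of interpreting a raw score against a norm group.
# The graduate norm data are invented; real norm groups contain hundreds of
# scores, and publishers often use standardised scales rather than the raw
# normal approximation used here.

from statistics import NormalDist, mean, stdev

graduate_norms = [18, 22, 25, 25, 27, 29, 31, 24, 26, 23]
raw_score = 25

mu, sigma = mean(graduate_norms), stdev(graduate_norms)
z = (raw_score - mu) / sigma              # distance from the norm mean, in SDs
percentile = NormalDist().cdf(z) * 100    # approximate percentile rank

print(f"raw score {raw_score}: z = {z:.2f}, "
      f"around the {percentile:.0f}th percentile of the graduate norm group")
```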

5. Fairness and Discrimination in Test Use

As well as a formal legal requirement that a test should not unfairly discriminate against particular ethnic or gender groups, there are ethical and practical reasons why employers should use tests that are fair. 

We know that males and females, and different ethnic groups, all have the same overall level of intellectual ability. This means that if a test systematically suggested that men scored lower than women, and those test results were used to select candidates for a job, a disproportionate number of women would be selected. This would be fine if what the test scores suggested about men, i.e. that they had less intellectual ability than women, were true - but it is not. We would find that the higher levels of work success the test predicted for women simply would not appear.

The purpose of a test is to discriminate - but only between people who have differing levels of the ability or characteristic in question, and not on the basis of irrelevant characteristics such as gender. Consequently, ability and aptitude tests need to be carefully constructed and statistically analysed to make sure that they do not discriminate between people on anything other than the actual ability or aptitude in question.
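
One simple screen used in this kind of statistical analysis (among several) is to compare the average scores of two groups as a standardised mean difference. The sketch below uses Cohen's d with invented figures; a large d on a test of an ability the groups are known to share equally would flag the test for closer scrutiny.

```python
# Minimal sketch of a group-difference screen using Cohen's d (illustrative data).

from math import sqrt
from statistics import mean, stdev

def cohens_d(a: list[float], b: list[float]) -> float:
    """Standardised mean difference between two groups, using the pooled SD."""
    na, nb = len(a), len(b)
    pooled_sd = sqrt(((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2)
                     / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled_sd

group_a = [52, 48, 55, 60, 47, 51]   # hypothetical scores, group A
group_b = [50, 49, 53, 58, 46, 52]   # hypothetical scores, group B

d = cohens_d(group_a, group_b)
flag = "investigate further" if abs(d) >= 0.2 else "no obvious group effect"
print(f"Cohen's d = {d:.2f} ({flag})")
```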

With personality tests, this is much less of an issue, because we know that differences in personality test results may well reflect real differences between males and females. For instance, females tend to be reported as being more sensitive to other people’s feelings and more socially oriented than men. Remember that personality is not the same thing as ability, so the issue is much less contentious.

A Standard Development Process Involves

  1. Defining the trait or characteristic to be measured in psychological terms.
  2. Item writing - writing good test items is exceptionally difficult; the guidelines we use are based on published research. A large number of items needs to be written initially, and perhaps only 10% of them will pass the various quantitative and qualitative quality control processes and make it into the final test.
  3. Response format - this needs to be chosen on the basis of the test's function, and must avoid range restriction and allow the analysis of data at the Interval level. In practice we deal with Interval data rather than Ratio data, as there is no absolute zero value in ability/personality assessment.
  4. Trial group choice - the trial group generally needs to be screened and stratified.
  5. First trial - standard Item Analysis is carried out to reduce the number of items (see any good textbook on the subject for item difficulty (p-value) cut-offs and mean and SD parameters).
  6. Second trial - the item analysis is repeated, and a First Order Factor Analysis carried out. Oblique or Orthogonal Exploratory Factor Analysis is used depending on the characteristics being measured. Second Order Factor Analysis is also carried out at this stage to establish the macro structure of the characteristic being measured and to ensure internal scale coherence.
  7. Standardisation and Reliability Analysis are then carried out. The Alpha Coefficient, or Cronbach's Alpha, is the most widely used reliability analysis method (a minimal sketch of the calculation appears after this list). The standardisation and production of normative data are straightforward, and done according to accepted methods. In calculating reliability, further items may be moved between scales, or removed, so that all of the items in a particular scale contribute to that scale's reliability - in other words, to the degree to which it is free of error.
  8. Finally, the administration, scoring and reporting functions are finalised, and organisational users are trained in the test's use.
  9. One more important activity occurs at this stage - we decide upon success criteria, i.e. the standards against which the predictive value (validity) of the test will subsequently be measured, and the analysis method to be used.
  10. Now your test is ready to use - but the development of the test will be an ongoing task with norms being added and updated as data from test takers is amassed.
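
Below is a minimal sketch of the Alpha Coefficient mentioned in step 7, computed as alpha = k/(k-1) * (1 - (sum of item variances) / (variance of total scores)). The response matrix is invented; real reliability analysis would use far more respondents and items.

```python
# Minimal sketch of Cronbach's alpha (illustrative data).
# alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))

from statistics import pvariance

def cronbach_alpha(item_scores: list[list[float]]) -> float:
    """item_scores: one inner list per item, scores for the same people in order."""
    k = len(item_scores)
    sum_item_vars = sum(pvariance(item) for item in item_scores)
    totals = [sum(person) for person in zip(*item_scores)]
    return (k / (k - 1)) * (1 - sum_item_vars / pvariance(totals))

items = [
    [3, 4, 2, 5, 4],   # item 1, scores for five respondents
    [2, 4, 2, 4, 5],   # item 2
    [3, 5, 1, 4, 4],   # item 3
]
print(f"alpha = {cronbach_alpha(items):.2f}")  # values of ~0.7+ are typically sought
```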