Error, Part 1: Type I and reliability

Stats Talk
Jan 23, 2015

Statistics is far, far simpler than normal life. In most spheres of daily existence, there are hundreds of things that could go wrong, whereas in statistics there are just two, very simply named Type I and Type II. If you can avoid both, you’ll do just fine. Type I error is essentially seeing something that isn’t there, and Type II error is failing to spot something that is. This post will take you through Type I; next week’s will cover Type II, all with the help of Captain Statto and the crew of the pirate ship Regressor.

Imagine, if you will, the day Captain Statto sent Avery up to the crow’s nest with his telescope to have a look around. Statto thought there were no other ships around but wanted Avery to check before settling down for the evening with a bottle of rum. After a few minutes, Avery proudly shouts down that there’s a ship flying a Jolly Roger two points abaft the beam, starboard side. Battle stations! The crew drag themselves away from their games of Liar’s Dice and load the cannon. They patiently scan the horizon, but there’s no sign of another ship. It turns out that Avery’s Jolly Roger was in fact a seagull. Avery mistakenly rejected the null hypothesis (that there were no other ships in the area), committing a Type I error because of a problem with the data collection instrument, namely Avery’s rum-soaked eyes.

Unreliable eyes

Avery’s eyes proved to be neither a valid nor a reliable method of data collection, and Type I errors often have to do with the instruments used. Validity was covered in a previous post, and basically refers to whether a scale measures what it claims to measure. Reliability, the other essential property of good measurement, has to do with whether a scale works consistently for different people and for the same people over time.

In quantitative analysis, there are two kinds of reliability: internal and test-retest. Internal reliability is used for multi-item scales and tests whether people give consistent patterns of answers. For example, internal reliability is high when everyone who ticks A on question 1 also ticks B on question 2. Internal reliability is measured using Cronbach’s α statistic, which is based on the number of scale items and the average correlation between them, and values above .7 are indicative of good reliability. One thing to consider is the population on which reliability analysis is based. For example, there is a tendency to standardise assessment tools using undergraduate students as participants, often for course credit. Undergraduate students are systematically different from the normal population, in their age profile and average weekly consumption of alcohol among other things, and this leaves the instrument open to the criticism that it is not reliable for the normal population. There’ll be more on the perils of sampling next week.
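As a rough illustration of how Cronbach’s α might be computed, here is a minimal Python sketch using the usual variance-based formula, α = k/(k − 1) × (1 − sum of item variances / variance of total scores). The function name cronbach_alpha and the response data are made up for this example and do not come from any real scale.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(responses[0])                    # number of items in the scale
    items = list(zip(*responses))            # one tuple of scores per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


# Hypothetical data: five respondents answering a three-item scale (1-5).
example_responses = [
    [4, 5, 4],
    [2, 2, 3],
    [5, 5, 5],
    [1, 2, 1],
    [3, 3, 4],
]

alpha = cronbach_alpha(example_responses)
print(f"Cronbach's alpha: {alpha:.2f}")
```

With these invented answers, respondents who score high on one item score high on the others, so α comes out well above the .7 threshold. Real data are rarely this tidy.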

Test-retest reliability is concerned with whether the same person gives the same answer when they respond to items on a scale more than once. For anything concerning humans, the convention is to measure at two time-points two weeks apart. If the gap is too short, scores may be inflated owing to memory effects, while too long a gap opens the possibility of different scores due to fluctuations in the intensity of whatever is being measured; this is especially true of psychological problems like depression and anxiety. Test-retest reliability can use Pearson correlations between items or between scale total scores for each participant, and correlations of the order of .85 or .9 can be expected.
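To make the test-retest idea concrete, the following Python sketch computes a Pearson correlation between scale totals at two time-points. The helper pearson_r and the scores are invented for illustration; in practice you would probably reach for a statistics package instead.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical scale totals for six participants, measured two weeks apart.
time_1 = [12, 18, 25, 9, 30, 21]
time_2 = [14, 17, 27, 10, 28, 22]

r = pearson_r(time_1, time_2)
print(f"Test-retest correlation: {r:.2f}")
```

Each participant appears at the same position in both lists, so the correlation reflects how well people keep their rank order between the two sittings.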

Risk of bias

In qualitative research, inter-coder reliability is used as a safeguard against bias in analysis. Analysis of interviews, for example, usually involves the development of a coding frame, a list of themes relevant to the study that might arise in the interview. The first step is for one researcher to note occurrences of all the codes in all the interviews. Using the same coding frame, a second rater then independently analyses a sample of interviews. A percentage agreement between the original and the second ratings is calculated and adjusted for chance using some version of the κ statistic. If the minimum accepted κ coefficient of about .7 is reached for a code, it is considered reliable.
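One common version of the chance adjustment is Cohen’s κ, which compares the observed agreement with the agreement two coders would reach by chance given how often each uses the code. The Python sketch below assumes a single code marked present (1) or absent (0) in ten interview segments; the function name and the ratings are hypothetical.

```python
def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders rating the same segments."""
    n = len(coder_a)
    # Raw proportion of segments on which the two coders agree.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, from each coder's own label frequencies.
    labels = set(coder_a) | set(coder_b)
    expected = sum(
        (coder_a.count(label) / n) * (coder_b.count(label) / n)
        for label in labels
    )
    # Undefined (division by zero) if both coders always use the same label.
    return (observed - expected) / (1 - expected)


# Hypothetical ratings: 1 = code present in the segment, 0 = absent.
first_coder = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
second_coder = [1, 1, 0, 0, 1, 0, 1, 1, 1, 1]

kappa = cohens_kappa(first_coder, second_coder)
print(f"Raw agreement: 0.90, kappa: {kappa:.2f}")
```

The two coders agree on nine of ten segments, but because both mark the code as present most of the time, a good share of that agreement would happen by chance, which is why κ lands below the raw 90%.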

The level of significance of a statistical test result means the level of risk of Type I error that you’re prepared to live with. Most people are happy with 5%, about a one-in-twenty chance of finding something that isn’t actually there. Even once you’ve minimised the risk of measurement error, countless extraneous variables remain in even the best-designed studies, but being 95% sure of something is usually enough. So, having relieved Avery of look-out duties, Captain Statto can sail happily onwards for a further 19 days before expecting a similar seagull fail. Unless someone makes a Type II error next week…
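For anyone who wants to see the one-in-twenty figure appear, here is a small Python simulation, offered as a sketch under simplifying assumptions: every "study" samples from a population where the null hypothesis really is true (mean 0, standard deviation 1, both known), and a two-sided z-test at the 5% level decides whether a ship has been spotted. The function name, sample size and seed are arbitrary choices for the example.

```python
import random
from math import sqrt


def false_alarm_rate(trials=10_000, sample_size=30, critical_z=1.96, seed=1):
    """Share of null-true studies in which a two-sided z-test rejects the null."""
    rng = random.Random(seed)
    false_alarms = 0
    for _ in range(trials):
        # The null is true by construction: the population mean really is 0.
        sample = [rng.gauss(0, 1) for _ in range(sample_size)]
        z = (sum(sample) / sample_size) * sqrt(sample_size)
        if abs(z) > critical_z:
            false_alarms += 1  # a seagull mistaken for a Jolly Roger
    return false_alarms / trials


rate = false_alarm_rate()
print(f"Type I error rate over 10,000 simulated studies: {rate:.3f}")
```

The printed rate should hover around .05, since there is never anything to find and the test still raises the alarm in roughly one study out of twenty.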
