Learn More about Outliers
Outliers are extreme values in a dataset. They are numerically distant from the remainder of the data and therefore seem out of place. Outliers can occur because of the ever-present possibility of very high or low dietary intakes, but may also indicate errors in reporting, coding, or the underlying databases used to estimate intakes. Outliers are important because they can have a large influence on statistics derived from the dataset. For example, the mean intake of energy or some nutrient may be skewed upward or downward by one or a few extreme values (see Learn More about Normal Distributions).
Outliers can be detected visually by plotting the observations, for example in a scatterplot. Further, statistical thresholds or cut points based on the study sample distributions can be used. Examples include intakes below the 25th percentile minus, or above the 75th percentile plus, two or three times the interquartile range, or a priori cut points established based on extremes of the distribution, such as above the 99th or below the 1st percentile. Another statistical procedure is to identify individual intakes that have an undue influence on estimates of the mean of the sample.
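The two kinds of distribution-based cut points described above can be sketched in a few lines of Python. This is a minimal illustration, not a recommended procedure: the function names, the multiplier `k`, the choice of the "inclusive" quantile method, and the energy values are all assumptions made for the example.

```python
from statistics import quantiles

def iqr_fences(values, k=2.0):
    """Return (low, high) cut points: Q1 - k*IQR and Q3 + k*IQR."""
    # "inclusive" treats the sample minimum and maximum as the 0th and
    # 100th percentiles; other quantile definitions give slightly
    # different cut points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def percentile_cut_points(values):
    """Return the (1st, 99th) percentiles of the sample."""
    cuts = quantiles(values, n=100, method="inclusive")
    return cuts[0], cuts[98]

def flag_outside(values, low, high):
    """Return the values that fall below `low` or above `high`."""
    return [v for v in values if v < low or v > high]

# Hypothetical energy intakes (kcal/day), invented for illustration only.
energy_kcal = [1800, 1950, 2100, 2200, 2300, 2400, 2500, 2650, 2800, 9500]

low, high = iqr_fences(energy_kcal, k=2.0)
print(low, high)                              # 1150.0 3587.5
print(flag_outside(energy_kcal, low, high))   # [9500]
```

Note that percentile-based cut points flag the tails of any sample, whether or not the values are implausible, so flagged values are candidates for review rather than automatic exclusions.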
Researchers should carefully review data to identify outliers and exercise caution in discarding data. It is important to consider that the application of arbitrary cutoffs may result in the loss of a substantial number of subjects whose data contain no more measurement error than those within the "acceptable" range of self-reported dietary data (see Key Concepts about Measurement Error). In considering whether to exclude extreme data points, sensitivity analyses with and without the identified outliers may help to determine if findings and statistical tests are appreciably altered by their presence.
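A sensitivity analysis of this kind can be as simple as computing the same statistic with and without the flagged observations and comparing the results. The sketch below assumes a hypothetical helper, `sensitivity_summary`, an invented set of energy intakes, and an arbitrary 5000 kcal cut point chosen only to make the example concrete.

```python
from statistics import mean

def sensitivity_summary(values, is_outlier):
    """Compare the sample mean with and without flagged observations."""
    kept = [v for v in values if not is_outlier(v)]
    return {
        "n_all": len(values),
        "mean_all": mean(values),
        "n_excluded": len(values) - len(kept),
        "mean_kept": mean(kept),
    }

# Hypothetical energy intakes (kcal/day), invented for illustration only.
energy_kcal = [1800, 1950, 2100, 2200, 2300, 2400, 2500, 2650, 2800, 9500]

# Illustrative rule only: flag intakes above an arbitrary 5000 kcal cut point.
summary = sensitivity_summary(energy_kcal, lambda v: v > 5000)
print(summary["mean_all"], summary["mean_kept"], summary["n_excluded"])
# 3020 2300 1
```

Here a single extreme value shifts the mean by 720 kcal. In practice the same comparison would be repeated for the model estimates and statistical tests of interest, not only for the mean.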
In dietary intake data collected using short-term instruments, such as 24-hour dietary recalls (24HRs) (see 24-hour Dietary Recall Profile), some values may appear to be implausibly high or low. However, it is possible on any given day to have intakes of energy, nutrients, or food groups that are extremely high or low. For example, the review of recall data for a given respondent with a high intake of vitamin A might reveal a high consumption of carrots on a single day. Deciding how to identify outliers and when to exclude data from analyses requires careful consideration. Because long-term instruments are intended to measure long-term intakes, the possibility of extreme intakes on any given day, which must be considered for short-term instruments, is not a concern for them.
In dietary intake data collected using long-term instruments such as food frequency questionnaires (FFQs) (see Food Frequency Questionnaire Profile), determining outliers has generally been conducted using one of two methods:
- Establish a priori high and low sex-specific cut points for energy only. Exclude data outside of the cut points.
- Consider all dietary constituents and use statistical cut points based on the distribution of reported intake estimates of each dietary variable to establish, review, and possibly exclude outliers in analyses.
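The first approach above can be sketched as a lookup of sex-specific energy limits followed by a filter. The cut point values, field names, and records below are placeholders invented for illustration; they are not taken from this text, and each study would need to define and justify its own limits.

```python
# Placeholder a priori energy cut points (kcal/day) as (low, high) by sex.
# These numbers are assumptions for the example, not recommendations.
ENERGY_CUT_POINTS_KCAL = {
    "female": (500, 3500),
    "male": (800, 4200),
}

def within_energy_range(record, cut_points=ENERGY_CUT_POINTS_KCAL):
    """Return True if reported energy lies within the limits for that sex."""
    low, high = cut_points[record["sex"]]
    return low <= record["energy_kcal"] <= high

# Hypothetical FFQ respondents, invented for illustration only.
records = [
    {"id": 1, "sex": "female", "energy_kcal": 1900},
    {"id": 2, "sex": "female", "energy_kcal": 420},
    {"id": 3, "sex": "male", "energy_kcal": 2600},
    {"id": 4, "sex": "male", "energy_kcal": 5100},
]

kept = [r for r in records if within_energy_range(r)]
excluded = [r for r in records if not within_energy_range(r)]
print([r["id"] for r in kept])      # [1, 3]
print([r["id"] for r in excluded])  # [2, 4]
```

Keeping the excluded records in a separate list, rather than discarding them, makes it straightforward to report how many respondents were dropped and to rerun analyses with them included.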
For More Information
Kipnis V, Subar AF, Midthune D, Freedman LS, Ballard-Barbash R, Troiano RP, Bingham S, Schoeller DA, Schatzkin A, Carroll RJ. Structure of dietary measurement error: results of the OPEN biomarker study. Am J Epidemiol 2003 Jul 1;158(1):14-21; discussion 22-6.
Rimm EB, Giovannucci EL, Stampfer MJ, Colditz GA, Litin LB, Willett WC. Reproducibility and validity of an expanded self-administered semiquantitative food frequency questionnaire among male health professionals. Am J Epidemiol 1992 May 15;135(10):1114-26; discussion 1127-36.
Subar AF, Thompson FE, Kipnis V, Midthune D, Hurwitz P, McNutt S, McIntosh A, Rosenfeld S. Comparative validation of the Block, Willett, and National Cancer Institute food frequency questionnaires: the Eating at America's Table Study. Am J Epidemiol 2001 Dec 15;154(12):1089-99.