One may be hard-pressed to find a topic in the world of Equal Employment Opportunity and Affirmative Action that is more disliked than adverse impact. Not only does an adverse impact analysis (a.k.a. impact ratio analysis and/or disparity analysis) involve complicated statistical calculations to arrive at its results, but these results are often used in investigations as the foundation for charges of discrimination. There is little wonder why HR practitioners sometimes avoid – or perhaps revile – the practice. It’s difficult, often poorly understood, and fraught with liability.
Despite these challenges, an adverse impact analysis is a tremendous diagnostic tool for evaluating employment practices and helping to ensure that fair treatment is commonplace at work. Just as nearly anyone can navigate the internet without knowing the coding that goes into it, a functional understanding of adverse impact is more readily available than many believe. Computer programs make the calculations manageable, and the foundational concepts are not overly complex. It is the goal of this article to demystify adverse impact and present a concise understanding of its concepts.
Defining Adverse Impact
As it is used today, the term adverse impact (AI) essentially means the same as when it was first written: a substantially different rate of selection in hiring, promotion or other employment decision which works to the disadvantage of members of a race, sex or ethnic group (Uniform Guidelines Questions & Answers #10).[i] In essence, AI indicates whether decisions made regarding a protected group left them at a substantive disadvantage. It should be noted that adverse impact simply describes differences between groups on a testing process. It is not a legal term that implies guilt, nor is it a psychometric term that implies unfairness or test bias.
The three most common methods for determining adverse impact are the 80% Rule, statistical significance tests, and practical significance tests. While the 80% Rule and practical significance tests each have their merits[ii], modern compliance proceedings and legal battles are waged primarily along the “statistical significance” front.
This deference to a compliance/legal framework influences related choices as well. Both descriptive statistics and statistical significance tests can be applied to adverse impact analyses; however, the latter are preferred. Descriptive statistics merely show the mathematical difference relevant to the comparison being made. Statistical significance tests are more relevant for adverse impact analyses because they indicate whether the descriptive statistic is statistically meaningful and whether it can be regarded as a “beyond chance” occurrence.
The various approaches to adverse impact are often broken out into two primary types: Availability Comparisons and Selection Rate Comparisons. Availability Comparisons can be highly useful in determining whether one group may be underutilized, but additional details are required for a finding of adverse impact. The Selection Rate Comparison is the only type of analysis that alone can demonstrate adverse impact. For this reason, this article will focus on adverse impact as indicated by Selection Rate Comparisons.
Selection Rate Comparisons
A Selection Rate Comparison evaluates the selection rates between two groups (e.g., women and men, minorities and whites) on a selection procedure. Selection Rate Comparisons are most typically used in litigation settings, as they relate specifically to the type of adverse impact analysis called for in the Uniform Guidelines. These analyses can be used to evaluate a single event or multiple events; however, particular caution must be used when combining multiple events (discussed below). There are four variables that are entered into any adverse impact analysis of this type:
- The number of focal group members selected (e.g., women hired)
- The number of focal group members not selected (e.g., women not hired)
- The number of reference group members selected (e.g., men hired)
- The number of reference group members not selected (e.g., men not hired)
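As a minimal sketch of how these four inputs turn into selection rates, the Python snippet below uses hypothetical counts (they are not taken from any real analysis). The function name `selection_rates` is an illustrative assumption. The final line applies the 80% Rule mentioned earlier in its commonly described form (a focal rate below 80% of the comparison rate), assuming the reference group is the one with the higher rate.

```python
def selection_rates(focal_selected, focal_not_selected,
                    ref_selected, ref_not_selected):
    """Return (focal_rate, reference_rate, impact_ratio) from the four counts."""
    focal_rate = focal_selected / (focal_selected + focal_not_selected)
    ref_rate = ref_selected / (ref_selected + ref_not_selected)
    # Impact ratio: the focal group's rate as a fraction of the reference rate.
    impact_ratio = focal_rate / ref_rate
    return focal_rate, ref_rate, impact_ratio


# Hypothetical counts: 20 of 50 women hired, 30 of 50 men hired.
focal_rate, ref_rate, impact_ratio = selection_rates(20, 30, 30, 20)
print(f"Focal group selection rate:     {focal_rate:.1%}")
print(f"Reference group selection rate: {ref_rate:.1%}")
print(f"Impact ratio:                   {impact_ratio:.2f}")
print("Below the 80% Rule threshold:", impact_ratio < 0.80)
```

These figures are descriptive only. Whether the gap is statistically significant is a separate question, addressed by the tests discussed in the sections that follow.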
Selection Rate Comparison for a Single Event
A Single Event Selection Rate Comparison is the most typical type of adverse impact analysis, and it is specifically explained in the Uniform Guidelines as a “rates comparison” (Section 4D) that compares the passing rates between two groups (e.g., men and women) on a selection procedure. This type of analysis can also be used to analyze the outcome of layoffs, demotions, or other similar personnel transactions where there are only two possible outcomes (e.g., promoted/not promoted; hired/not hired, etc.).
Two categories of statistical significance tests exist that can be used for analyzing adverse impact for Selection Rate Comparisons: exact and estimated. Exact tests provide the precise probability value of the analysis. Estimated techniques approximate the exact results without requiring lengthy calculations. Both exact and estimation techniques require use of a 2 x 2 contingency table, as displayed in Table 1.
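|2 x 2 Contingency Table|
|Group|Selected|Not Selected|
|Focal group (e.g., women)|# selected|# not selected|
|Reference group (e.g., men)|# selected|# not selected|
Table 1. 2 x 2 Contingency Table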
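To make the exact/estimated distinction concrete, here is a small Python sketch that runs both kinds of test on the same 2 x 2 table. It assumes Fisher's exact test as the exact method and an uncorrected chi-square test as the estimated method; the article does not name specific tests at this point, so both choices are illustrative. The function names and counts are hypothetical.

```python
from math import comb, erfc, sqrt


def fisher_exact_two_tail(a, b, c, d):
    """Exact two-tail p-value for the table [[a, b], [c, d]].

    a, b = focal group selected / not selected
    c, d = reference group selected / not selected
    Sums the hypergeometric probability of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Small tolerance so ties with the observed table are counted.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))


def chi_square_p(a, b, c, d):
    """Estimated two-tail p-value from the chi-square statistic (1 df)."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))


# Hypothetical table: 20 of 50 women selected, 30 of 50 men selected.
print(f"Exact (Fisher) p-value:         {fisher_exact_two_tail(20, 30, 30, 20):.4f}")
print(f"Estimated (chi-square) p-value: {chi_square_p(20, 30, 30, 20):.4f}")
```

The two p-values will generally differ somewhat, which is why the choice of test can matter when a result sits close to the significance threshold.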
Selection Rate Comparisons for Multiple Events
Proper methodology also exists for comparing the passing rates of gender and ethnic groups on several combined “events” or administrations of various practices, procedures, or tests. This technique may also be used to complete an overall adverse impact analysis on several jobs or groups of jobs with similar skill sets, or for comparing group passing rates on an overall selection or promotion process for multiple years. A Multiple Events Selection Rate Comparison is necessary when multiple years or tests are placed into a combined analysis. This is because statistical anomalies can occur when combining data across multiple strata.
While it may be tempting to simply aggregate several years of a particular testing practice into a combined adverse impact analysis, the results will sometimes be misleading unless a special “multiple events” technique is used. A statistical phenomenon called “Simpson’s Paradox”[iii] shows why this can be a problem. Note in Table 2 that although the selection rates for each group match within a given year, the combined data shows a 9% disparity in selection rates.
|Simpson's Paradox Example|
|Testing Year|Group|# Applicants|# Selected|Selection Rate %|
|2017 + 2018|
Table 2. Simpson's Paradox Example
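The effect can be reproduced with a few lines of Python. The counts below are hypothetical and chosen only to show the mechanism; they are not the Table 2 values. Within each year both groups have identical selection rates, yet pooling the years produces a gap because the groups applied in very different proportions across a low-rate year and a high-rate year.

```python
# Hypothetical data: (applicants, selected) per group per year.
events = {
    "Year A": {"women": (100, 20), "men": (400, 80)},   # both groups 20%
    "Year B": {"women": (400, 200), "men": (100, 50)},  # both groups 50%
}

# Selection rates within each year.
for year, groups in events.items():
    for group, (applicants, selected) in groups.items():
        print(f"{year} {group:>5}: {selected / applicants:.0%}")

# Selection rates after naively pooling the two years.
combined = {}
for group in ("women", "men"):
    applicants = sum(events[year][group][0] for year in events)
    selected = sum(events[year][group][1] for year in events)
    combined[group] = selected / applicants
    print(f"Combined {group:>5}: {combined[group]:.0%}")

gap = combined["women"] - combined["men"]
print(f"Gap in combined rates: {gap:.0%}")
```

In this sketch the pooled rates differ by 18 percentage points even though neither year shows any difference at all, which is exactly the kind of misleading result a simple aggregation can produce.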
To avoid pitfalls such as Simpson’s Paradox, two steps are necessary to properly aggregate data and conduct a Multiple Event Selection Rate Comparison:
- Evaluate the events for pattern consistency. One must determine whether the “trend” in the passing rates of a group is consistently unfavorable. Different data “events” showing a group as both favored and unfavored are inappropriate to aggregate.
- Calculate the statistical test results. This will assess whether adverse impact occurred in the overall analysis for all events combined using a test such as Mantel-Haenszel[iv].
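The two steps can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not a reference one: the consistency check simply compares the direction of the rate difference in each event, and the test is the continuity-corrected Mantel-Haenszel chi-square with one degree of freedom. The function names and the event counts are hypothetical.

```python
from math import erfc, sqrt


def consistent_direction(tables):
    """True if the focal group's rate is never above, or never below,
    the reference group's rate across all events."""
    diffs = [a / (a + b) - c / (c + d) for a, b, c, d in tables]
    return all(x <= 0 for x in diffs) or all(x >= 0 for x in diffs)


def mantel_haenszel(tables):
    """Return (chi-square, two-tail p-value) across several 2 x 2 tables.

    Each table is (focal selected, focal not selected,
                   reference selected, reference not selected).
    """
    diff_sum = 0.0  # sum of observed minus expected focal selections
    var_sum = 0.0   # sum of hypergeometric variances
    for a, b, c, d in tables:
        n = a + b + c + d
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        diff_sum += a - row1 * col1 / n
        var_sum += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    # Continuity correction, applied only when the difference is large enough.
    correction = 0.5 if abs(diff_sum) >= 0.5 else 0.0
    chi2 = (abs(diff_sum) - correction) ** 2 / var_sum
    return chi2, erfc(sqrt(chi2 / 2))


# Hypothetical events: the focal group trails in both administrations.
events = [(10, 40, 20, 30), (15, 35, 25, 25)]
if consistent_direction(events):
    chi2, p_value = mantel_haenszel(events)
    print(f"Mantel-Haenszel chi-square = {chi2:.2f}, p = {p_value:.4f}")
else:
    print("Events point in different directions; do not aggregate.")
```

Because each event contributes its own expected value and variance, the statistic compares groups within events rather than across the pooled totals, which is what guards against the aggregation problem described above.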
Determining Statistical Significance
No matter which of the two Selection Rate Comparisons may be used, the resulting value still requires context. After all, how unexpected must an outcome be to be deemed “unusual” or “rare”? At what point would a court or other oversight agency determine the results are enforceable? This conceptual tipping point is referred to as statistical significance.
Statistically significant results of a selection process or test are extremely unlikely to occur by chance. Such a result signifies a point at which it can be stated – with a reasonable level of certainty – that a legitimate trend, and not a chance relationship, actually exists. Statistical significance tests result in a p-value (for probability). P-values of .05 (i.e. 5%) or less are said to be “statistically significant” in the realm of AI analyses. In practical terms, this is comparable to correctly selecting a single chosen card from a standard deck of 52 cards in no more than 2-3 attempts (2.6 attempts represent a 5% chance).
When a statistical test is conducted to evaluate whether an event is statistically significant, there is always a “power” associated with the test. Power describes the test’s ability to reveal a statistically significant finding if there is one to be found. Said another way, the “power” indicates how likely the test is to detect a real disparity when one exists. Three factors determine statistical power:
- Effect size. For Selection Rate Comparisons, this pertains to the size of the “gap” between the selection rates of the two groups. A larger gap more readily reveals statistical significance.
- Sample size. The number of members in each group plays a key role in adverse impact analyses. Just as in a straw poll, a larger sample size improves reliability.
- The type of statistical test used. This includes the actual formula of the adverse impact analyses (some tests are more powerful than others) and whether a one-tail or a two-tail test for significance is used (see discussion on one-tail versus two-tail tests below).
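A rough way to see the sample size factor at work is to hold the selection rates fixed and change only the number of people in each group. The Python sketch below is not a formal power calculation; it assumes an uncorrected chi-square test and uses hypothetical counts in which the focal group is selected at 40% and the reference group at 50% in both scenarios.

```python
from math import erfc, sqrt


def chi_square_p(a, b, c, d):
    """Estimated two-tail p-value for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))


# Same 40% vs. 50% selection rates, different group sizes (hypothetical).
p_small = chi_square_p(20, 30, 25, 25)      # 50 people per group
p_large = chi_square_p(200, 300, 250, 250)  # 500 people per group

print(f"50 per group:  p = {p_small:.3f}, significant at .05: {p_small <= 0.05}")
print(f"500 per group: p = {p_large:.4f}, significant at .05: {p_large <= 0.05}")
```

The identical ten-point gap is not statistically significant with 50 people per group but is with 500 per group, which illustrates why sample size has such a strong effect on what an analysis can detect.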
Researchers and practitioners generally have little control over the measured differences (i.e. effect size) of the groups being analyzed. As such, amassing a large sample size is perhaps the single most effective way to increase the power of an adverse impact analysis, thereby increasing the likelihood of a statistically significant result. Below are at least five ways this can be accomplished. It is important to note that the first four of these aggregation techniques require use of the appropriate multiple events type of analyses because statistical anomalies can occur when combining data, as discussed above.
- Widen the timeframe.
- Combine various geographic areas together.
- Combine events from several jobs, job groups, or divisions.
- Combine various selection procedures.
- Combine different ethnic groups.
Despite years of debate, there is no absolute, bottom-line threshold regarding the minimum sample size necessary for conducting statistical investigations. Courts frequently take the stance that there is no clear minimum sample size. However, if one had to pick a firm, minimum number for adverse impact analyses, the consensus seems to be 30 with at least five expected for selection. It is important to note that statistical analyses where small numbers are involved suffer from a higher “sampling error,” thereby making the results less reliable than analyses involving larger data sets.
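For readers who want to screen a data set against this rule of thumb before running a test, the Python sketch below shows one possible reading of it. The interpretation is an assumption: "30" is taken as the total number of people in the analysis, and "at least five expected for selection" is taken as the expected number of selections in each group if both groups were selected at the overall rate. The function name and default thresholds are hypothetical.

```python
def meets_rule_of_thumb(focal_selected, focal_not_selected,
                        ref_selected, ref_not_selected,
                        min_total=30, min_expected=5):
    """Screen a 2 x 2 table against one reading of the 30 / 5 rule of thumb."""
    focal_total = focal_selected + focal_not_selected
    ref_total = ref_selected + ref_not_selected
    total = focal_total + ref_total
    if total == 0:
        return False
    # Expected selections per group if both shared the overall rate.
    overall_rate = (focal_selected + ref_selected) / total
    expected_focal = focal_total * overall_rate
    expected_ref = ref_total * overall_rate
    return (total >= min_total
            and expected_focal >= min_expected
            and expected_ref >= min_expected)


# Hypothetical tables.
print(meets_rule_of_thumb(20, 30, 30, 20))  # 100 people, 25 expected per group
print(meets_rule_of_thumb(1, 9, 4, 6))      # only 20 people in total
print(meets_rule_of_thumb(2, 38, 3, 57))    # 100 people, but few selections
```

A table that fails this screen can still be analyzed, but as noted above the results carry more sampling error and deserve more cautious interpretation.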
When considering the type of statistical test to use, there are both “estimated” and “exact” tests. Estimated tests provide an approximate probability of a circumstance. Exact tests, which calculate the exact probability of a circumstance, are considered the most powerful statistical tests for adverse impact calculations. While an exact test provides a more refined result, an estimated test may be more readily applied in some situations (e.g., very large samples, where exact calculations become lengthy).
One last methodology to note when determining statistical significance in AI analyses is use of a one-tail versus a two-tail test. A one-tail statistical test investigates the possibility of discrimination occurring in just one direction (e.g., against women). A two-tail test assumes that discrimination could have occurred in either direction (e.g., against men or against women) and spends its statistical power investigating discrimination in both directions. The courts have been almost totally consistent in their requirement of using a two-tail test for significance.
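The difference between the two approaches can be seen by computing both p-values from the same table. The Python sketch below assumes Fisher's exact test as the underlying method, with the one-tail value covering only outcomes in which the focal group is selected as rarely as, or more rarely than, it actually was. The function name and counts are hypothetical.

```python
from math import comb


def fisher_one_and_two_tail(a, b, c, d):
    """Return (one_tail, two_tail) exact p-values for [[a, b], [c, d]].

    a, b = focal group selected / not selected
    c, d = reference group selected / not selected
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Probability of each possible number of focal selections.
    probs = {x: comb(row1, x) * comb(row2, col1 - x) / denom
             for x in range(lowest, highest + 1)}
    # One tail: focal group selected as few times as observed, or fewer.
    one_tail = sum(p for x, p in probs.items() if x <= a)
    # Two tails: every outcome no more likely than the observed one.
    two_tail = sum(p for p in probs.values() if p <= probs[a] * (1 + 1e-9))
    return one_tail, two_tail


# Hypothetical table: 1 of 6 women selected, 8 of 10 men selected.
one_tail, two_tail = fisher_one_and_two_tail(1, 5, 8, 2)
print(f"One-tail p = {one_tail:.4f}")  # about 0.024
print(f"Two-tail p = {two_tail:.4f}")  # about 0.035
```

The two-tail value is larger because it also counts extreme outcomes in the opposite direction, so a result near the .05 threshold can be significant under a one-tail test and not under a two-tail test.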
Adverse impact analyses are complex in nature and varied in form, but that need not dissuade practitioners from including them as an option in their tool belt. The insights gleaned from AI analyses are exceptionally useful in identifying areas of potential liability. They also provide key direction in the marshalling of resources to address concerns raised. While a number of resources are available to assist with adverse impact analyses, Biddle has provided a free online tool for calculating simple AI analyses at http://www.biddle.com/adverseimpacttoolkit/SelectionRateComparison.aspx.
Conducting adverse impact analyses is an invaluable step for organizations investigating their selection processes and cleaning up areas of those processes that may not be equitable. For the latter to happen, however, one must recognize that AI analyses are only indicators of what has occurred. Simply identifying an issue will not resolve it; additional steps must be taken if long-lasting change is to take hold. Proper interpretation of the AI results and formulation of an action plan are critical. As such, one could rightly consider the conclusion of an adverse impact analysis the point at which the “real work” truly begins.
[i] The Uniform Guidelines on Employee Selection Procedures and the related Questions & Answers can be found at www.uniformguidelines.com.
[ii] See Biddle, D. A. (2011). Adverse Impact and Test Validation: A Practitioner’s Handbook (3rd ed.). Scottsdale, AZ: Infinity Publishing (pp. 3-5).
[iii] See Finkelstein, M. O., & Levin, B. (2001). Statistics for Lawyers (2nd ed.). New York, NY: Springer (p. 237).
[iv] The Mantel-Haenszel technique was originally developed for aggregating data sets for cancer research. See Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22, 719-748.