This issue is not an easy read; brace yourself for some boring definitions and heavy references.
I came across the term endogeneity while reading through a paper a few years back. Thinking that I had found a novel term that answered some of the skepticism we have about behavioural psychology, I started to dig deeper. Obviously I was late to the party (again), and there was already a pile of literature about endogeneity, or endogeneity bias.
Endogeneity arises in empirical research in a broad range of business and management fields, including strategy, international business research, initial public offerings and venture capital, corporate diversification and corporate governance. [1]
Definitions:
I give the definitions at three different levels of complexity:
My own take: a simplistic, borderline ignorant definition (for warm-up).
A YouTube video that summarises the topic in 4 minutes with a good, simple example, plus a Medium article with some more examples.
A scientific (statistics-heavy) explanation to make up for the simplicity of the first definition.
Warm Up: My own take
Endogeneity happens when you leave a variable out of your analysis even though it has a direct impact on the main variable. With that variable omitted, another variable seems to be the cause of the trend. (This is such a simplification that I am sure I have really annoyed a few people.)
Youtube Video:
Example in the video below:
Education seems to be a major driver of wages: the higher the education, the higher the salary. But the research omitted the personality trait “competitiveness”, which drives both education and wages. Education is simply an easier variable that the researcher stumbled upon; it shows correlation but not causality.
Here is the video:
This article also has some good examples.
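For readers who prefer to see it in numbers, the education and wages example can be sketched as a small simulation. This is only an illustration with made-up data: the coefficients 0.8 and 1.5, the sample size and the function name simulate_wage_regressions are all my assumptions, and I deliberately set the true effect of education on wages to zero to make the bias easy to spot.

```python
import numpy as np

def simulate_wage_regressions(n=50_000, seed=0):
    """Fit wages on education with and without the omitted trait (synthetic data)."""
    rng = np.random.default_rng(seed)

    # Hypothetical data-generating process (all numbers are assumptions):
    # competitiveness drives both education and wages,
    # while education itself has NO direct effect on wages.
    competitiveness = rng.normal(size=n)
    education = 0.8 * competitiveness + rng.normal(size=n)
    wages = 1.5 * competitiveness + rng.normal(size=n)

    ones = np.ones(n)

    # "Short" regression: wages on education only (competitiveness omitted).
    X_short = np.column_stack([ones, education])
    beta_short, *_ = np.linalg.lstsq(X_short, wages, rcond=None)

    # "Long" regression: wages on education AND competitiveness.
    X_long = np.column_stack([ones, education, competitiveness])
    beta_long, *_ = np.linalg.lstsq(X_long, wages, rcond=None)

    # Return the education coefficient from each regression.
    return beta_short[1], beta_long[1]

naive, controlled = simulate_wage_regressions()
print(f"education coefficient, competitiveness omitted:  {naive:.3f}")
print(f"education coefficient, competitiveness included: {controlled:.3f}")
```

Under these assumptions the first coefficient comes out at about 0.73 even though education does nothing for wages in this toy world, and it drops to roughly zero once competitiveness is included.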
Scientific explanation:
Endogeneity occurs when a variable, observed or unobserved, that is not included in our models, is related to a variable we incorporated in our model. [2]
Technically, endogeneity occurs when a predictor variable (x) in a regression model is correlated with the error term (e) in the model. This can occur under a variety of conditions, but two cases are especially common in inequality research: (1) when important variables are omitted from the model (called “omitted variable bias”) and (2) when the outcome variable is a predictor of x and not simply a response to x (called “simultaneity bias”). At least part of the latter problem is often called “selection.” [3]
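The second case, simultaneity, can also be sketched with synthetic data. In this toy setup (my assumption, not something from the quoted source) y depends on x with a true effect beta, while x also depends on y through a feedback term gamma. Solving the two equations y = beta*x + u and x = gamma*y + v for x and y gives the formulas used in the code; u plays the role of the error term e in the definition above.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50_000

beta = 0.5   # assumed true effect of x on y
gamma = 0.4  # assumed feedback of y on x (this creates the simultaneity)

u = rng.normal(size=n)  # error term of the y equation
v = rng.normal(size=n)  # error term of the x equation

# Reduced form of the system  y = beta*x + u,  x = gamma*y + v
denom = 1.0 - beta * gamma
x = (gamma * u + v) / denom
y = (u + beta * v) / denom

# x is now correlated with the error term u: the textbook sign of endogeneity.
corr_x_error = np.corrcoef(x, u)[0, 1]

# Simple regression slope of y on x: cov(x, y) / var(x).
ols_slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

print(f"correlation between x and the error term: {corr_x_error:.3f}")
print(f"true effect of x on y: {beta}, estimated slope: {ols_slope:.3f}")
```

With these made-up numbers the estimated slope lands at about 0.78 instead of the true 0.5, purely because x and the error term move together.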
Seriousness of the issue:
Based on a study of over 100 articles in top journals, it is claimed that “researchers fail to address at least 66% and up to 90% of design and estimation conditions that make causal claims invalid” (Antonakis et al., 2010, p. 1086).
Real life implications:
Many of the correlations we find when analysing basic trends in business and HR omit one or two important variables, because we focus on the data we already have and disregard complex behavioural variables. We also forget that correlation is not causation: when we see a strong correlation, we fixate on that variable and end up trying to fix the wrong one.
A possible solution:
People analytics teams empowered by serious data science. (Mostly we are talking about one genius guy or gal who knows their stuff.)
They can look at the underlying trends, put every analysis through statistical tests and uncover a lot of these endogeneity biases. It is not long or hard work; it just requires data science experts working side by side with HR and business.
Here is a good paper about how to avoid endogeneity bias, if you are interested.
https://onlinelibrary.wiley.com/doi/10.1111/1467-8551.12113
https://www.stata.com/meeting/spain16/slides/pinzon-spain16.pdf
https://www.sciencedirect.com/topics/psychology/endogeneity