Learn more about statistical modeling at https://www.datacamp.com/courses/statistical-modeling-in-r-part-2
In thinking about effect size, keep in mind that there is not necessarily a single effect size for each explanatory variable. Often, the effect size of one variable depends on the value of the other variables.
This plot shows the probability of being married for the people in the CPS85 data set as a function of age, education, and sector. The effect size of sex --- that is, the difference between marriage rates for men and women --- is big in the clerical sector, but small in the service sector. Young men in the professional sector are more likely to be married than young men in the other sectors.
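A model like the one behind this plot can be sketched in a couple of lines. This is only a sketch, assuming the CPS85 data from the mosaicData package and the binary married variable it contains:

```r
# Sketch: probability of being married as a function of
# age, education, sector, and sex (CPS85 from mosaicData)
library(mosaicData)
mod <- glm(married == "Married" ~ age + educ + sector + sex,
           data = CPS85, family = binomial)
```

Evaluating this model at different combinations of sector and sex is what reveals that the effect size of sex varies across sectors.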
It's as if the various explanatory variables work together to determine the effect size of sex.
There's a name for this: statisticians call it an interaction effect.
It might seem obvious that the effect size of one explanatory variable might depend on the levels of the other explanatory variables. But it's easy to get confused. Many people end up thinking that an interaction describes how one explanatory variable shapes or causes another explanatory variable. It doesn't: an interaction describes how explanatory variables work together in shaping the response.
In some model architectures such as lm(), you will only see an interaction effect if you ask specifically for it. In some other architectures, such as rpart(), the interaction is just an ordinary part of the story. Historically, the lm() architecture has dominated scientific research, and so the decision of whether to include an interaction effect is relevant. Let's take a moment to examine how interaction effects are included in lm() models.
As an example, consider a simple story: world records in the 100-m freestyle swim race. The graph shows these records over the course of the 20th century. Two features are evident from the data points:
Swimmers have gotten faster over the years.
Men's records are faster than women's.
And, even without formally training a model, you may be able to see that the effect size of sex --- the difference between men and women --- has gotten smaller over the years. That change in effect size is an interaction effect.
Let's look at two different models of the swim-record data: one having an lm() architecture and the other an rpart() architecture. The rpart model clearly shows an interaction effect; the difference between the sexes changes over the years. But the step-wise nature of the model is jarring. The model output changes only over decades, not years.
The linear model is more satisfactory, showing a gradual improvement in record times. But there's no interaction effect: the two model lines are parallel; the model says records are getting better at the same rate for men and women.
The linear model doesn't show the interaction because lm() never includes one unless you ask for it specifically. You ask for an interaction effect by using a model formula with a star connecting the variables you want involved in the interaction.
So, here, the model formula says "sex star year" rather than "sex plus year". You can see the interaction in the graph in two ways: the effect size of sex decreases with year and, equivalently, the slope giving the effect size with year is different for the two sexes.
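In R formula notation, that difference is a single character. A sketch, assuming the SwimRecords data from the mosaicData package:

```r
library(mosaicData)

# No interaction: parallel lines, one slope shared by both sexes
mod_plain <- lm(time ~ year + sex, data = SwimRecords)

# With an interaction: "sex star year" lets the slope differ by sex
mod_inter <- lm(time ~ year * sex, data = SwimRecords)
```

The star expands to the main effects plus their interaction, so `year * sex` is shorthand for `year + sex + year:sex`.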
There are good reasons why the lm() architecture includes interactions only if you specifically ask for them. To a large extent, this has to do with the demands of small data sets. That's a subject for another course. But at this point it's entirely adequate to work with some rules of thumb:
rpart() includes interactions naturally as part of the way it works.
lm() and other model construction methods that we haven't yet discussed (such as glm()) will include interactions only if you ask for them.
Including interactions sometimes helps, sometimes doesn't help, and sometimes hurts the performance of a model. When in doubt, cross validation is a good way to assess performance, so you always have a way to decide whether an interaction is helping or not.
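That comparison can be carried out directly. Here is a minimal k-fold cross-validation sketch in base R, again assuming the SwimRecords data from mosaicData (the course may provide its own helper for this):

```r
library(mosaicData)
set.seed(1)

# Assign each row of SwimRecords to one of 5 folds
folds <- sample(rep(1:5, length.out = nrow(SwimRecords)))

# Mean squared prediction error, averaged over held-out folds
cv_mse <- function(formula) {
  errs <- sapply(1:5, function(k) {
    fit <- lm(formula, data = SwimRecords[folds != k, ])
    mean((SwimRecords$time[folds == k] -
            predict(fit, SwimRecords[folds == k, ]))^2)
  })
  mean(errs)
}

cv_mse(time ~ year + sex)  # without the interaction
cv_mse(time ~ year * sex)  # with it
```

Whichever formula produces the lower cross-validated error is the one the data support.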