Dummy Coding for Linear Regression
In computer programs, an enum is typically used to represent a field that can take one of several distinct values. For instance, in a data set of athletes, the field stating the sport played would be represented by an enum.
However, enums number their elements in serial order.
#include <stdio.h>

enum sport { cricket, football, basketball };

int main() {
    printf("Cricket = %d\n", cricket);
    printf("Football = %d\n", football);
    printf("Basketball = %d\n", basketball);
    return 0;
}

OUTPUT:
Cricket = 0
Football = 1
Basketball = 2
Although this encoding is convenient for programmers, it would break a regression model, because the model treats the numbers representing the sports as magnitudes. In the athletes example, a linear regression model would treat the sports as ordered: the first sport in the enum, cricket (0), would count as less than every other sport, the second sport, football (1), as less than every sport except the first, and so on. The model would also assume that the step from cricket to football has exactly the same effect as the step from football to basketball.
One Hot Form
Using dummy variables, we represent the sport field not as an enum but as a table. Each element gets its own column in the table. For each record, the column corresponding to the selected element is set to high (1), while all the other columns are set to low (0).
Reference Category
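However, if all the columns are given to a regression model that includes an intercept, the result is perfect multicollinearity, which breaks the model: the columns of every record sum to 1, so any one column can be computed from the rest. We therefore choose a reference category. One of the elements, say football, is left out of the table, since its value can be inferred from the others: a record in which every remaining column is 0 must belong to the reference category.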