Detection of Multicollinearity in Data sets using Statistical Testing. | by Erdogan Taskesen | Oct, 2023

Detecting multicollinearity in data sets is an important but challenging step. I will demonstrate how to detect variables with similar behavior in mixed data sets and how to examine the relationships in more depth with interactive charts.

Erdogan Taskesen
Towards Data Science

Understanding the strength of relationships between variables in a data set is important because variables with statistically similar behavior can affect the reliability of models. To identify and remove this so-called multicollinearity, we can use correlation measures for continuous variables. However, when we also have categorical variables, and thus a mixed data set, testing for multicollinearity becomes more challenging. Statistical tests, such as the hypergeometric test and the Mann-Whitney U test, can be used to test for associations across variables in mixed data sets. Although this works well, it requires various intermediate steps, such as typing the variables, one-hot encoding, and multiple test correction, among others. This entire pipeline is readily implemented in a method named HNet. In this article, I will demonstrate how to detect variables with similar behavior so that multicollinearity can be easily detected.
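To make the steps above more concrete, here is a minimal sketch of the categorical part of such a pipeline: a hypergeometric test for every pair of states of two categorical variables, followed by a Benjamini-Hochberg correction for multiple testing. This is not HNet's implementation. The function names (`hypergeom_pvalue`, `benjamini_hochberg`, `state_associations`) and the toy data are made up for illustration, and the choice of Benjamini-Hochberg as the correction method is an assumption.

```python
from collections import Counter
from math import comb


def hypergeom_pvalue(N, K, n, x):
    """Return P(X >= x) for a hypergeometric draw.

    N: population size, K: successes in the population,
    n: number of draws, x: observed successes among the draws.
    """
    upper = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(x, upper + 1))
    return tail / comb(N, n)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def state_associations(rows, alpha=0.05):
    """Test each pair of states from two categorical columns for over-representation."""
    N = len(rows)
    count_a = Counter(a for a, _ in rows)   # occurrences of each state in column 1
    count_b = Counter(b for _, b in rows)   # occurrences of each state in column 2
    count_ab = Counter(rows)                # co-occurrences of each state pair
    pairs = [(a, b) for a in sorted(count_a) for b in sorted(count_b)]
    pvalues = [
        hypergeom_pvalue(N, count_a[a], count_b[b], count_ab[(a, b)])
        for a, b in pairs
    ]
    adjusted = benjamini_hochberg(pvalues)
    return [(pair, p_adj) for pair, p_adj in zip(pairs, adjusted) if p_adj <= alpha]


# Made-up toy data: (color, shape) for 20 observations.
rows = ([("red", "circle")] * 8 + [("red", "square")] * 2
        + [("blue", "circle")] * 2 + [("blue", "square")] * 8)

significant = state_associations(rows)
for (a, b), p_adj in significant:
    print(f"{a} ~ {b}: adjusted p = {p_adj:.4f}")
```

Counting co-occurrences per state pair is equivalent to one-hot encoding both columns and testing each pair of resulting binary columns. A continuous variable would need a different test for each state, such as the Mann-Whitney U test mentioned above, which this sketch leaves out.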

Real-world data often contains measurements with both continuous and discrete values. We need to look at each variable and use common sense to determine whether variables can be related to each other. But when there are tens of variables (or more), each of which can have multiple states per category, checking all of them manually becomes time-consuming and error-prone. We can automate this task by combining intensive pre-processing steps with statistical testing methods. This is where HNet [1, 2] comes into play: it uses statistical tests to determine the significant relationships across all variables in a data set. It allows you to input your raw unstructured data into the model and outputs a network that sheds light on the complex relationships across variables. Let’s go to the next section, where I will explain how to detect variables with similar behavior using statistical testing.
