The Effects of Random Undersampling with Simulated Class Imbalance for Big Data

Abstract: With the recent explosion of big data, real-world data are increasingly being affected by larger degrees of class imbalance, likely hindering Machine Learning algorithm performance.
The contribution of our work is to show that good classification performance on big data, across different application domains, can be achieved without too much alteration to the imbalanced dataset. In order to demonstrate good classification performance with big data, we process four datasets, from different domains, generating several imbalanced variants.
Random undersampling is applied to balance the binary class in each of the created imbalanced datasets, generating several class ratios.
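The random undersampling step described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function name, the ratio convention (minority:majority, so 1.0 means a 50:50 balance), and the toy data are all assumptions made here for clarity.

```python
import random

def random_undersample(labels, majority_label, ratio, seed=0):
    """Randomly keep a subset of majority-class indices so the dataset
    reaches the requested minority:majority ratio (1.0 -> 50:50).

    `labels` is a list of binary class labels; returns kept indices.
    Illustrative sketch only; names and the ratio convention are
    assumptions, not the paper's exact procedure.
    """
    rng = random.Random(seed)
    majority = [i for i, y in enumerate(labels) if y == majority_label]
    minority = [i for i, y in enumerate(labels) if y != majority_label]
    # Target majority count so that len(minority) / target == ratio.
    target = min(len(majority), int(len(minority) / ratio))
    kept_majority = rng.sample(majority, target)
    return sorted(minority + kept_majority)

# Example: 90 negatives vs 10 positives, balanced to a 50:50 ratio.
labels = [0] * 90 + [1] * 10
kept = random_undersample(labels, majority_label=0, ratio=1.0)
```

Choosing a less aggressive ratio (e.g. a 40:60 target) discards fewer majority-class examples, which is the trade-off the experiments explore.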
All models were built with the Random Forest classifier, using the Spark and H2O machine learning libraries, and performance was recorded to identify undersampling ratios that work well for big data without discarding too much of the majority class. We provide a comparison of all models created from the prepared datasets, generating 4, models.
We conclude that good classification performance can be maintained on imbalanced big data. Moreover, when the negative class is randomly undersampled, the average performance is similar to that obtained using the entire big dataset.