Challenges with Extreme Class-Imbalance Issue

Mar 10, 2019 - Nov 20, 2019

"Challenges with Extreme Class-Imbalance and Temporal Coherence: A Study on Solar Flare Data" [arXiv]

Azim Ahmadzadeh, Maxwell Hostetter, Berkay Aydin, Manolis K. Georgoulis, Dustin J. Kempton, Sushant Mahajan, and Rafal A. Angryk

IEEE International Conference on Big Data 2019


In analyses of rare events, regardless of the domain of application, the class-imbalance issue is intrinsic. Although the challenges are known to data experts, their explicit impact on the analytics and on the decisions made based on the findings is often overlooked. This is particularly prevalent in interdisciplinary research, where the theoretical aspects are sometimes overshadowed by the challenges of the application. To showcase these undesirable impacts, we conduct a series of experiments on a recently created benchmark dataset, named Space Weather ANalytics for Solar Flares (SWAN-SF). This is a multivariate time series dataset of magnetic parameters of active regions. As a remedy for the imbalance issue, we study the impact of data manipulation (undersampling and oversampling) and model manipulation (using class weights). Furthermore, we bring into focus the auto-correlation of time series that is inherited from the use of a sliding window for monitoring flares' history. Temporal coherence, as we call this phenomenon, invalidates the randomness assumption, thus impacting all sampling practices, including different cross-validation techniques. We illustrate how failing to notice this concept could give an artificial boost in the forecast performance and result in misleading findings. Throughout this study we utilize a Support Vector Machine as the classifier and the True Skill Statistic as the verification metric for comparison of experiments. We conclude our work by specifying the correct practice in each case, and we hope that this study could benefit researchers in other domains where time series of rare events are of interest.

Snapshots of an X-class flare, peaking at 7:49 p.m. EST on Feb. 24, 2014, observed by NASA's Solar Dynamics Observatory, in the 304 Å wavelength channel. (Images source:


In this study, we use a benchmark dataset named Space Weather ANalytics for Solar Flares (SWAN-SF), recently released by [1] and made entirely of multivariate time series, with the aim of carrying out unbiased flare forecasting. We show the impact of disregarding an interesting phenomenon called “temporal coherence” on the classification of events in spatiotemporal datasets.

We demonstrate the impact and biases of different class-imbalance remedies and discuss how they should be interpreted, with the involvement of domain experts, from the perspective of the subject under study. We hope that this work raises awareness among interdisciplinary researchers and enables them to spot and tackle similar problems in their respective areas.

Fig. 1 shows the daily average number of sunspots in relation to the time span of each partition.

Figure 1. Time span considered for each partition of the SWAN-SF dataset, the monthly report of the average number of sunspots per day, and the daily variance represented with a gray ribbon (Source: WDC-SILSO, Royal Observatory of Belgium, Brussels).

From the multivariate time series in SWAN-SF, we extract a set of 6 statistical features, namely mean, median, variance, skewness, kurtosis, and last-value. Using these features we build a new dataset on which we run our experiments. The resulting forecast dataset has a dimensionality of 144, obtained by computing the 6 above-mentioned features on each of the 24 physical parameters of the SWAN-SF dataset. Data points of this dataset are labeled by 5 different classes of flares, namely (from strongest to weakest) GOES X, M, C, B, and N. The last of these represents flare-quiet instances or GOES A-class events.
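To make the 6 x 24 = 144 arithmetic concrete, here is a minimal sketch of the feature extraction step. It is not the code used in the paper: the function names, the toy input, the use of population (rather than sample) variance, the non-excess form of kurtosis, and the 60-step series length are all assumptions made for illustration.

```python
import statistics


def summarize(series):
    """Return [mean, median, variance, skewness, kurtosis, last-value]."""
    n = len(series)
    mean = statistics.fmean(series)
    var = statistics.pvariance(series, mu=mean)  # population variance (assumed)
    std = var ** 0.5
    if std == 0:
        # Convention chosen here for constant series, to avoid dividing by zero.
        skew, kurt = 0.0, 0.0
    else:
        # Standardized third and fourth central moments.
        skew = sum((x - mean) ** 3 for x in series) / (n * std ** 3)
        kurt = sum((x - mean) ** 4 for x in series) / (n * std ** 4)
    return [mean, statistics.median(series), var, skew, kurt, series[-1]]


def extract_features(mvts):
    """Flatten a multivariate time series (one list per parameter) into one vector."""
    features = []
    for series in mvts:
        features.extend(summarize(series))
    return features


# Toy input: 24 hypothetical parameters with 60 time steps each.
toy_mvts = [[float((p + 1) * (t % 7)) for t in range(60)] for p in range(24)]
vector = extract_features(toy_mvts)
print(len(vector))  # 24 parameters x 6 statistics = 144
```

Each multivariate time series becomes a single 144-dimensional point, which is what makes a standard classifier such as SVM applicable.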

The SWAN-SF dataset is made of 5 partitions. These partitions are temporally non-overlapping and divided in such a way that each of them contains approximately the same number of X- and M-class flares. The class-imbalance ratios of the 5 partitions, as well as the sample sizes of the 5 classes, are illustrated in Fig. 2.

Figure 2. Counts of the five flare classes across different partitions.


In this study we run a series of experiments. In this post, I only briefly list the settings of each experiment and show the corresponding figures.

Moreover, in some of these experiments (e.g. experiment A and B) we utilize two particular sampling methodologies from a variety of proposed ones. Fig. 3 visualizes all those sampling approaches. For more details, please read the paper.

Figure 3. Visualization of different undersampling and oversampling approaches. On the left, the original proportion of the sub-classes (X, M, C, B, and N) and super-classes (XM and CBN) is depicted. This is a generic illustration and does not reflect the exact imbalance ratios in SWAN-SF.

Experiment Z. [Baseline]

This experiment is as simple as training SVM on all instances of one partition and testing the model on another partition. We try this on all possible partition pairs, resulting in 20 different trials, to illustrate how the difficulty of the prediction task varies as the partitions are chosen from different parts of the solar cycle. The results are visualized in Fig. 4, along with the impact of the class-imbalance remedies discussed in the following sections.
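The sketch below shows where the 20 trials come from and how the verification metric used in every figure can be computed. It uses the standard definition of the True Skill Statistic, TSS = TP/(TP+FN) - FP/(FP+TN); the function name and the toy labels are mine and do not come from the paper.

```python
from itertools import permutations


def true_skill_statistic(y_true, y_pred, positive=1):
    """TSS = TP / (TP + FN) - FP / (FP + TN), ranging from -1 to 1."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("TSS needs at least one positive and one negative instance")
    return tp / (tp + fn) - fp / (fp + tn)


# Ordered (train, test) pairs of the 5 partitions: 5 * 4 = 20 trials.
pairs = list(permutations([1, 2, 3, 4, 5], 2))
print(len(pairs))

# Toy labels: 5 flaring instances among 100, and a model that never predicts a flare.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, true_skill_statistic(y_true, y_pred))
```

In the toy case, the always-negative model reaches an accuracy of 0.95 while its TSS is 0, which is why a metric like TSS is preferable on imbalanced data.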

Experiment A. [Undersampling]

During the training phase, we apply Undersampling 2 (US2 from Fig. 3), which is an X-class-based undersampling. This enforces a 1:1 balance, not only at the super-class level (i.e., |XM|=|CBN|) but also at the sub-class level (i.e., |X|=|M| and |C|=|B|=|N|). The trained model is then tested against all other partitions one by one to examine the robustness of the model. The undersampling step is only taken in the training partition, as undersampling of the test partition distorts reality and would not reflect the model's true performance. The consistent and significant impact of this remedy is evident in Fig. 4.
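One way to satisfy the three constraints above is sketched here. The per-class targets are my own derivation from those constraints (keep every X instance, match M to X, and split the same total evenly over C, B, and N); the exact procedure in the paper may differ, and the function name and toy counts are hypothetical.

```python
import random


def undersample_x_based(instances_by_class, seed=0):
    """Undersample so that |X| = |M|, |C| = |B| = |N|, and |XM| is about |CBN|."""
    rng = random.Random(seed)
    n_x = len(instances_by_class["X"])
    per_minor = n_x             # every X instance is kept; M is cut down to match
    per_major = (2 * n_x) // 3  # exact super-class balance only if 3 divides 2 * n_x
    targets = {"X": per_minor, "M": per_minor,
               "C": per_major, "B": per_major, "N": per_major}
    sample = {}
    for label, target in targets.items():
        pool = instances_by_class[label]
        # Sample without replacement; never ask for more than the pool holds.
        sample[label] = rng.sample(pool, min(target, len(pool)))
    return sample


# Toy training partition with made-up class counts (instance ids only).
counts = {"X": 30, "M": 200, "C": 1000, "B": 800, "N": 10000}
toy = {label: [f"{label}{i}" for i in range(n)] for label, n in counts.items()}
balanced = undersample_x_based(toy)
print({label: len(items) for label, items in balanced.items()})
```

Only the training partition goes through this function; the test partition is left with its natural class distribution.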

Figure 4. [Experiments Z, A, B, and C] Average TSS of SVM trained and tested on all possible permutations of partition pairs, using three different remedies for the class-imbalance issue (undersampling, oversampling, and misclassification weights), compared to a baseline performance where no remedy is employed. In all cases, global normalization is used.

Experiment B. [Oversampling]

Here, we use Oversampling 3 (OS3 from Fig. 3) and perform an experiment similar to experiment A. Again, no over- or undersampling takes place in the testing set. Comparing the results of oversampling with those of Undersampling 2 in Fig. 4 shows a close correspondence between the two models in terms of their mean TSS values; typically, the differences are within the applicable uncertainties.

Experiment C. [Mis-classification Weights]

Here, we apply class weights to the cost function of SVM. We use the imbalance ratio of the super-classes as the weights. For instance, when working with Partition 3, since the minority-to-majority ratio is 1:20, we set w_XM=20 and w_CBN=1. As shown in Fig. 4, this solution outperforms both the undersampling and oversampling approaches in terms of TSS. It is worth pointing out that misclassification weights have the advantage of data-driven tunability, which may make them better suited than over- and undersampling for achieving more robust forecast models.
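The weights can be derived directly from the class counts of the training partition, as in the sketch below. The counts are made up so that the ratio comes out at 1:20, and the function name is hypothetical.

```python
def super_class_weights(counts):
    """Weight the minority super-class (XM) by the majority-to-minority ratio."""
    minority = counts["X"] + counts["M"]
    majority = counts["C"] + counts["B"] + counts["N"]
    return {"XM": majority / minority, "CBN": 1.0}


# Made-up counts giving a 1:20 minority-to-majority ratio.
toy_counts = {"X": 10, "M": 90, "C": 500, "B": 500, "N": 1000}
weights = super_class_weights(toy_counts)
print(weights)  # {'XM': 20.0, 'CBN': 1.0}
```

This post does not say which SVM implementation was used. As one possibility, scikit-learn's `SVC` accepts a dictionary like this through its `class_weight` parameter, which scales the misclassification penalty per class.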

Experiment D. [Data Splits]

To demonstrate the overfitting that occurs when we do not account for temporal coherence, we train and test SVM on instances randomly chosen from the same partition. Technically, this is a k-fold cross validation using a random sub-sampling method with k=10. The results are then juxtaposed with those obtained by training SVM on one partition and testing it on another. In both scenarios we equipped SVM with misclassification weights (the same as in Experiment C) to eliminate the need for an additional sampling layer. Therefore, the only determining factor is whether or not the training and testing instances are sampled from the same partition. To be clear, sampling from a single partition does not imply any overlap between the training and testing sets.
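The toy sketch below illustrates why a random split inside one partition is problematic even though the two sets share no instance. Each instance stands for one sliding-window slice of one active region; the data layout, the names, and the 90/10 split are assumptions for illustration and do not reproduce the SWAN-SF structure.

```python
import random


def make_partition(partition_id, n_regions=20, n_windows=30):
    """Toy sliding-window instances as (partition, region, window_start) tuples."""
    return [(partition_id, f"p{partition_id}-r{r}", w)
            for r in range(n_regions) for w in range(n_windows)]


def count_coherent(train, test):
    """Count test instances whose temporal neighbor from the same region is in train."""
    train_keys = {(region, start) for _, region, start in train}
    return sum(
        1 for _, region, start in test
        if (region, start - 1) in train_keys or (region, start + 1) in train_keys
    )


rng = random.Random(0)
p1, p2 = make_partition(1), make_partition(2)

# Unifold: one random 90/10 split inside a single partition.
shuffled = p1[:]
rng.shuffle(shuffled)
cut = int(0.9 * len(shuffled))
uni_train, uni_test = shuffled[:cut], shuffled[cut:]

print("shared instances:", len(set(uni_train) & set(uni_test)))
print("unifold:", count_coherent(uni_train, uni_test), "of", len(uni_test))
print("multifold:", count_coherent(p1, p2), "of", len(p2))
```

In the unifold case the test slices sit right next to training slices of the same region, so the model is largely tested on near-copies of what it has seen. In the multifold case no such neighbors exist, because in this toy setup the partitions contain different regions.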

Figure 5. [Experiment D] Average TSS of SVM performance when trained and tested either using a 10-fold cross validation sub-sampling on a single partition (unifold) or assigning different partitions for training and testing (multifold). In all cases, global normalization is used and the SVM has been equipped with misclassification weights.

Experiment E. [Normalization]

To transform the feature space to a normalized space (using a zero-one normalization method), we apply global normalization on a pair of partitions by using the global extrema (min and max). We then train SVM on one partition and test on the other. In another attempt, we apply normalization on each partition separately, using the local extrema of the corresponding partition. In order to minimize the impact of other factors on our experiment, we avoid employing any remedies for handling the class-imbalance issue.

Figure 6. [Experiment E] Average TSS of SVM performance impacted by two different data normalization approaches: global and local. In all 20 trials shown, no special remedy to the class-imbalance problem has been employed.
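The difference between the two approaches is easiest to see on a single feature. In the sketch below (function names and values are mine), the same raw value appears in both partitions and is mapped consistently only under global normalization.

```python
def min_max(values, lo, hi):
    """Zero-one normalization of a list of values given the extrema."""
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def normalize_global(train, test):
    """Use the extrema of both partitions together."""
    lo, hi = min(train + test), max(train + test)
    return min_max(train, lo, hi), min_max(test, lo, hi)


def normalize_local(train, test):
    """Use each partition's own extrema."""
    return (min_max(train, min(train), max(train)),
            min_max(test, min(test), max(test)))


# One toy feature in two partitions; the raw value 10.0 occurs in both.
train = [0.0, 5.0, 10.0]
test = [10.0, 20.0, 30.0]

g_train, g_test = normalize_global(train, test)
l_train, l_test = normalize_local(train, test)
print(g_train[2], g_test[0])  # global: 10.0 maps to the same value in both
print(l_train[2], l_test[0])  # local: 10.0 maps to 1.0 in train and 0.0 in test
```

Under local normalization the classifier sees the same physical measurement at opposite ends of the feature range, depending on the partition it came from.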

Experiment F. [Oversampling with or without Sub-Class Balance]

We use Oversampling 1 and 3 (OS1 and OS3 from Fig. 3) in the training phase to remedy the class-imbalance problem, and then we test the trained model against all other partitions. We chose OS1 and OS3 since their differences make them an interesting pair: OS1, compared to OS3, replicates M-class instances by a significantly larger factor than it does X-class instances, allowing a large number of 'easier' instances in terms of classification. Therefore, it is naturally expected that OS1 results in an 'easier' dataset, hence a higher classification performance. Here we use global normalization, without meaning to imply that it necessarily performs better. Our results are shown in Fig. 7. Both methods show a relatively similar, consistent performance, although the climatology-preserving Oversampling 1 gives a statistically higher performance. That said, it becomes clear that different oversampling methods give non-identical performances. Therefore, a comparison of any two forecasting models on similar datasets will be fair only if the employed sampling methodologies are identical.

Figure 7. [Experiment F] Average TSS of SVM performance impacted by two different oversampling methods from Fig. 3: Oversampling 1 (OS1), where the climatology of sub-classes is preserved, and Oversampling 3 (OS3), where the sub-classes are forced to reach a 1:1 balance ratio by considering C-class to be the base. In all cases, global normalization is used.

Experiment G. [SVM with Other Statistical Features]

During training in this experiment, we apply Undersampling 2 from Fig. 3. We train and test SVM on all partition pairs using (i) last-value, (ii) standard-deviation, and (iii) {median, standard-deviation, skewness, kurtosis}. As illustrated in Fig. 8, the statistic standard-deviation results in statistically better performance than last-value, and using the four-number summary seems to outperform standard-deviation. This is a very good indication that different characteristics of time series carry some important pieces of information that may significantly improve reliability of a forecast model.

Figure 8. [Experiment G] Average TSS of SVM performance on 3 different feature spaces: 1) last-value, 2) standard-deviation, and 3) four-number summary of time series, namely {median, standard-deviation, skewness, kurtosis}. Undersampling 2 (i.e., US2 from Fig. 3) is used to remedy the class-imbalance issue. In all cases, global normalization is used.


We used the SWAN-SF benchmark dataset as a case study to highlight some of the challenges in working with imbalanced datasets, which are very often overlooked by scientists of the domain. We also addressed an interesting characteristic of some datasets that we called “temporal coherence”, inherited from the spatial and temporal dimensions of the data. Using several different experiments, we showcased some pitfalls and overlooked consequences of disregarding those peculiarities, and we discussed the impact of different remedies in the context of the flare forecasting problem.

For more on the challenges with the class-imbalance issue in the flare forecasting problem, for details on the SWAN-SF dataset, and for access to the cited publications, I encourage you to read the original manuscript in the BigData 2019 proceedings. A pre-print is available on arXiv for non-commercial use only.