Performance Metrics on Flare Prediction

In progress since Jul. 2019

Motivation. This is a short report on the challenges of working with the performance metrics widely used in flare prediction studies. Any realistic flare forecast dataset inherits an extreme class-imbalance problem that must be dealt with properly. Otherwise, impressively high numbers can be reported as a model's performance simply by choosing the wrong performance metric or a "bad" sampling methodology.

Objective. This short write-up scratches the surface of some issues with the choice of a performance metric for an imbalanced dataset, and juxtaposes the results with those for a balanced dataset.

Review:

Let's start by listing the performance metrics that we are interested in. Below you can see their formulas, definitions, and in some cases a link to a relevant study.

Just to make sure the terminology used in this write-up is clear: "... true or false refers to the assigned classification being correct or incorrect, while positive or negative refers to assignment to the positive or the negative category." [Source]

Accuracy

One of the most popular metrics for measuring a model's performance, ranging within [0, 1]. This report clearly illustrates how the class-imbalance issue renders this metric powerless, to the extent that a model with a fixed prediction (always positive or always negative) may get a near-perfect accuracy (close to 1.0).
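
For reference, in terms of the confusion-matrix entries:

Accuracy = (TP + TN) / (TP + FN + FP + TN)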


Precision (PPV) & Recall (TPR)

Two measures, each ranging within [0, 1], often used together to show a model's performance in the presence of the class-imbalance issue. Precision (aka Positive Predictive Value) shows the fraction of correct positive guesses out of all positive guesses, while recall (aka True Positive Rate) measures the fraction of actual positive observations that are guessed correctly.
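
For reference:

Precision (PPV) = TP / (TP + FP)
Recall (TPR) = TP / (TP + FN)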


F1-Score (F-Measure, F-Score)

A classic metric, varying within the range [0, 1], widely used for imbalanced datasets in all domains. As shown below, it is the harmonic mean of precision and recall.
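
For reference:

F1 = 2 × (Precision × Recall) / (Precision + Recall)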


Negative Predictive Value (NPV) & True Negative Rate (TNR)

Negative Predictive Value is simply precision computed for the negative class. Similarly, True Negative Rate is recall computed for the negative class. Both lie within the interval [0, 1].
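
For reference:

NPV = TN / (TN + FN)
TNR = TN / (TN + FP)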


True Skill Statistic (TSS)

Bloomfield et al. (2012)

A metric that Bloomfield et al. (2012) recommended for reporting the performance of flare forecasting models as well. As shown below, it is simply the True Positive Rate minus the False Positive Rate (i.e., recall − FPR). TSS ranges within [-1, 1]; random or constant forecasts score 0, perfect forecasts score 1, and forecasts that are always wrong score −1.
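
For reference:

TSS = TPR − FPR = TP / (TP + FN) − FP / (FP + TN)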



Heidke Skill Score (HSS1)

Barnes & Leka (2008)

This is one definition of HSS, used by Barnes & Leka (2008), that measures how much better a model performs compared to a model that always predicts the negative class. It is often quoted as ranging within (-Inf, 1]; to be exact, for a fixed dataset it ranges from -N/P to 1 (assuming that the majority class is the negative class).
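
In terms of the confusion-matrix entries (consistent with the worked examples in the side notes below), with P = TP + FN and N = FP + TN:

HSS1 = (TP + TN − N) / P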


Heidke Skill Score (HSS2)

Mason & Hoeksema (2010)

This is another definition of HSS, as specified by the Space Weather Prediction Center (SWPC 2007) in their Forecast Verification Glossary. It quantifies a model's performance by comparing it with a model that predicts randomly. It ranges within the interval [-1, 1].
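
In terms of the confusion-matrix entries (the standard SWPC form, consistent with the examples in the side notes below):

HSS2 = 2 × (TP × TN − FN × FP) / [ (TP + FN) × (FN + TN) + (TP + FP) × (FP + TN) ]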


Experiments:

The objective here is to design some simple experiments which can show us how different metrics reflect different models' performance. We do this in two rounds: once for an imbalanced dataset, and then for a perfectly (1:1) balanced dataset.

These experiments are conducted on a subset of all possible confusion matrices that a model could obtain. Therefore, we do not need any actual data for these experiments; we simply create a table containing a (discrete) subset of all possible confusion matrices.

A: All Confusion Matrices for an Imbalanced Dataset

Given a dataset of 5100 instances with:

  • the total number of positive instances: |P| = 100

  • the total number of negative instances: |N| = 5000

A subset of all possible confusion matrices is generated (shown on the right) as follows:

We cover a subset of the entire space of confusion matrices by changing TP and FN with a step-size of 10, in opposite directions, and then, for each pair of those values, we vary FP and TN with a step-size of 500, again in opposite directions.

This keeps the sums TP+FN and FP+TN constant across all iterations, while ensuring that a uniform and discrete subset of the entire space is covered.
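
A minimal Python sketch of this enumeration (the function name, column names, and iteration order are illustrative assumptions, not taken from the original table):

# Enumerate a discrete subset of confusion matrices for a dataset
# with |P| positive and |N| negative instances.
def confusion_grid(P, N, pos_step, neg_step):
    rows, it = [], 0
    for tp in range(0, P + 1, pos_step):        # TP increases while FN decreases
        fn = P - tp
        for fp in range(0, N + 1, neg_step):    # FP increases while TN decreases
            tn = N - fp
            rows.append({"Iter": it, "TP": tp, "FN": fn, "FP": fp, "TN": tn})
            it += 1
    return rows

grid_imbalanced = confusion_grid(P=100, N=5000, pos_step=10, neg_step=500)
print(len(grid_imbalanced))  # 11 x 11 = 121 confusion matrices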

Based on the above table, we can compute any of the performance metrics.
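
A minimal sketch of these metrics in Python, written directly in terms of the confusion-matrix entries (zero-division cases, e.g., a model that never predicts positive, would need guarding in practice):

def metrics(tp, fn, fp, tn):
    p, n = tp + fn, fp + tn                      # actual positives / negatives
    tpr = tp / p                                 # recall (sensitivity)
    fpr = fp / n                                 # false positive rate
    tnr = tn / n                                 # specificity
    ppv = tp / (tp + fp)                         # precision
    npv = tn / (tn + fn)
    acc = (tp + tn) / (p + n)
    f1 = 2 * ppv * tpr / (ppv + tpr)             # harmonic mean of precision and recall
    tss = tpr - fpr                              # True Skill Statistic
    hss1 = (tp + tn - n) / p                     # skill relative to the always-negative forecast
    hss2 = 2 * (tp * tn - fn * fp) / ((tp + fn) * (fn + tn) + (tp + fp) * (fp + tn))
    return {"ACC": acc, "PPV": ppv, "TPR": tpr, "F1": f1, "NPV": npv,
            "TNR": tnr, "TSS": tss, "HSS1": hss1, "HSS2": hss2}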

The plot on the right illustrates the performance of all 121 models listed above. The X axis represents the Iter column.

Observation:

  • While both TSS and HSS2 range within the interval [-1, 1], HSS2 approaches -1 much more slowly than TSS.

  • The TSS metric has a linear characteristic, unlike the F1-Score and HSS2.

  • Note how HSS2 and F1-Score behave similarly, except that F1-Score is always positive and it does not account for TN.


Let's go further and see how other metrics reflect these models' performance.

Observation:

This crystallizes the fact that:

  • Accuracy is indeed misleading in the presence of the class-imbalance issue.

  • PPV (Precision) and TPR (Recall) should be used together to show a model's overall performance, owing to the fact that the former does not account for TN or FN, and the latter excludes TN and FP; hence the step-shaped result.

  • The extreme, exponential-like growth of NPV, which suggests the model is rapidly getting better at predicting negative instances, is partly rooted in the imbalance of the data. That is, in such cases, overall improvement in prediction is often achieved by correctly predicting the negative class, and may have nothing to do with prediction of the positive class.

The above observations become more interesting when juxtaposed with similar results on a balanced dataset. Let's design a new experiment to have a side-by-side comparison!

B: All Confusion Matrices for a Balanced Dataset

Given a dataset of 200 instances with:

  • the total number of positive instances: |P| = 100

  • the total number of negative instances: |N| = 100

A subset of all possible confusion matrices is generated as follows:

Similar to what we did before, we cover a subset of the entire space of confusion matrices by changing TP and FN with a step-size of 10, in opposite directions, and then, for each pair of those values, changing FP and TN with the same step-size, again in opposite directions.
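
With the hypothetical confusion_grid helper sketched in experiment A, this would be:

grid_balanced = confusion_grid(P=100, N=100, pos_step=10, neg_step=10)  # again 11 x 11 = 121 matrices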


The plot on the right illustrates the performance of all the models listed above. The X axis represents the Iter column.

Observation:

  • Note how TSS and HSS2 perfectly coincide (blue and yellow dotted lines). This is simply because their definitions become algebraically identical when the data is perfectly balanced (i.e., |P| = |N|).

  • It seems that the only differences between the F1-Score and the other two are its non-linear behavior and its range. Other than that, all three successfully reflect the models' performance.

The plot on the right illustrates the performance of all the models listed above. The X axis represents the Iter column.


Observation:

  • Since the data is balanced, Accuracy is now capable of reflecting the models' actual performance.

  • Note that TPR (Recall) is not affected by the new ratio and shows the exact same results.

  • For both PPV and NPV, due to the balance in the data, the growth factor is much lower (compared to the imbalanced case).

So far, we have compared the performances that could be achieved by predictive models in two scenarios: once on an [imbalanced] and then on a [balanced] dataset. We can now keep the model fixed, with an arbitrary predictive power, and play with the balance ratio of the data. This allows us to observe how different metrics react to the degree of imbalance.

C: One Model, Different Balance Ratios

Given a dataset of 200 instances and a model that correctly classifies 75% of the positive instances and 25% of the negative instances.

While the forecast model and the data-size are fixed, we change the balance ratio, varying from P=200 & N=0, all the way to P=0 & N=200.

On the right, the resultant 21 confusion matrices are shown.
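
A minimal sketch of this sweep in Python (the rounding convention for fractional counts is an assumption; the original table may have handled them differently):

# Fixed skill: 75% of positives and 25% of negatives classified correctly;
# fixed size (200 instances); balance ratio swept from all-positive to all-negative.
ratio_sweep = []
for it, p in enumerate(range(200, -1, -10)):     # P: 200, 190, ..., 0  (21 iterations)
    n = 200 - p
    tp = round(0.75 * p); fn = p - tp
    tn = round(0.25 * n); fp = n - tn
    ratio_sweep.append({"Iter": it, "TP": tp, "FN": fn, "FP": fp, "TN": tn})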

The plots here depict the performance of the above-mentioned model across the 21 iterations, as the balance ratio progressively changes. The X axis represents the Iter column.

Observation:

  • As shown previously by Bloomfield et al. (2012), TSS is indeed unbiased to the changes in the balance ratio, while neither HSS nor the F1-Score shows such a characteristic.

  • Once again, we can see TSS and HSS2 converge as the imbalance ratio approaches 1:1 (in the middle).

  • TSS, TPR, and TNR seem completely unbiased to the changes in the imbalance ratio.

  • It is interesting to observe that the F1-Score indicates a drop in performance, while the model's actual skill is in fact unchanged.

  • Note how TPR (recall) shows the 75% success in predicting positive instances and is unbiased to the balance ratio. Similarly, TNR (specificity) shows no bias, while reflecting the 25% success of the model on the negative class.





Question: Can we use TNR and TPR to create a new metric, since neither of them is biased to the imbalance ratio?

Answer: Yes. In fact, there is a metric called Youden's J statistic (J) [6], which does exactly this: J = TPR + TNR - 1. A perfect model that detects all positive and negative instances correctly (TPR = 1 & TNR = 1) gets J = 1, a model with no discriminating power gets J = 0, and a model that gets every guess wrong gets J = -1 (see side note 3).

However, this measure is indeed nothing but the TSS, which can be shown as follows:

TSS = TPR - FPR = TPR - (1 - TNR) = TPR + TNR - 1 = J

Lessons Learned:

Why is the F1-Score not desirable?

    • [-] While this is generally a good metric for imbalanced datasets, it is biased to the imbalance ratio of the classes (see the first plot in the last experiment). That is, the success of a model is not comparable to that of any other model, or even of that very model, unless the imbalance ratio is preserved across all trials. For instance, it should not be used when a model is evaluated on samples with different positive-to-negative ratios to show how the model performs on differently sampled data.

When should/shouldn't TSS be used?

    • [+] TSS is unbiased to the class imbalance ratio and takes into account all entries of the confusion matrix. It should be used when we want to compare different models in the presence of different imbalance ratios.

    • [-] TSS treats TPR and FPR the same. That is, the penalty for misclassifying a positive instance is no different from that for a negative instance. In real-world data, however, the costs of misclassifying the two classes are far from equal. In the flare prediction task, for instance, as explained in Bobra & Couvidat (2015) [5], "... not predicting a flare that occurs (false negative) may be more costly than predicting a flare that does not occur (false alarm). Indeed, in the case of a satellite partly shielded to withstand an increase in energetic particles following a solar flare, the cost of a false alarm is the price paid to rotate the satellite so that the shielded part faces the particle flow, while the cost of a false negative may be the breakdown of the satellite."

When should/shouldn't HSS be used?

    • [!] In the case of a perfect 1:1 balance between classes, HSS1, HSS2, and TSS are all equivalent. Therefore, while it is safe to use HSS, TSS can simply replace it.

    • [-] Under the same circumstances in which the F1-Score should not be used, HSS1 and HSS2 should also be avoided.

Side notes

1

Despite what was said in Bobra & Couvidat (2015) [5], it is not true that for HSS1 "negative scores denote a worse performance than always predicting no flare". Although the paper does not provide a clear definition of a random model, the example below can help explain why this is not a valid conclusion.

Example: The three confusion matrices below, for a dataset with 100 positive instances and 5000 negative ones, can be described as follows:

  1. The first one represents a model that always predicts "no-flare", hence HSS1 = 0. This seems to be a meaningful point to compare other models to.

  2. The second one, however, is the opposite model, always predicting "flare", and results in a value of HSS1 near its minimum of -N/P for this imbalance ratio (100:5000).

  3. And finally, the last one shows a more moderate model, with 60% success on positive instances and 30% success on the negative class.

A ----[ TP:0 | FN:100 | FP:0 | TN:5000 ]----> HSS1 = 0.0

B ----[ TP:100 | FN:0 | FP:5000 | TN:0 ]----> HSS1 = -49.0

C ----[ TP:60 | FN:40 | FP:3500 | TN:1500 ]----> HSS1 = -34.4
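
These values follow directly from HSS1 = (TP + TN − N) / P with P = 100 and N = 5000: (0 + 5000 − 5000) / 100 = 0.0, (100 + 0 − 5000) / 100 = −49.0, and (60 + 1500 − 5000) / 100 = −34.4.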

If the quoted statement were correct, then both models B and C would have to be understood as worse than A. While we are not given a definition by which to compare A and B, it is clear that model C should always be preferred over A, which lacks any knowledge about the data.

2

HSS2 measures the improvement of the forecast over a random forecast. Let's see this through an example.

Example: Given a dataset with 100 positive instances and 5000 negative ones, HSS2 would reach 0 in three situations as follows:

  1. When the prediction is similar to a random guess, i.e., 50% of positive instances and 50% of negative instances are predicted correctly.

    • [TP:50 | FN:50 | FP:2500 | TN:2500] --> HSS2 = 0.0

  2. When the model always predicts positive, or always predicts negative.

    • [TP:0 | FN:100 | FP:0 | TN:5000] --> HSS2 = 0.0

    • [TP:100 | FN:0 | FP:5000 | TN:0 ] --> HSS2 = 0.0

  3. When the following condition holds: TP × TN = FN × FP.

The first situation simply comes from the definition of HSS2. The second situation gives HSS2 a big advantage over HSS1, since it places the all-positive and all-negative models right next to a random guess, which is meaningful. The third case is even more interesting: the numerator of HSS2 is proportional to (TP × TN − FN × FP), so the score vanishes exactly when TP × TN = FN × FP. The condition can be restated as TP / FN = FP / TN to make it more intuitive. All models with such behavior are doomed to random-guess-like performance, since any improvement on one class is always tied to doing worse on the other class.

3

The range of Youden's J statistic [6], contrary to what is stated in the original paper and the Wikipedia page, is not [0, 1]. It in fact ranges from -1 to 1.

Example: See the confusion matrix below, which results in J < 0.

[ TP:90 | FN:10 | FP:5000 | TN:0 ] ----> J = TSS = -0.1

References:

[1] Woodcock, Frank. "The evaluation of yes/no forecasts for scientific and administrative purposes." Monthly Weather Review 104.10 (1976): 1209-1214. PDF

[2] Bloomfield, D. Shaun, et al. "Toward reliable benchmarking of solar flare forecasting methods." The Astrophysical Journal Letters 747.2 (2012): L41. PDF

[3] Barnes, G., and K. D. Leka. "Evaluating the performance of solar flare forecasting methods." The Astrophysical Journal Letters 688.2 (2008): L107. PDF

[4] SWPC (Space Weather Prediction Center): https://www.swpc.noaa.gov/

[5] Bobra, Monica G., and Sebastien Couvidat. "Solar flare prediction using SDO/HMI vector magnetic field data with a machine-learning algorithm." The Astrophysical Journal 798.2 (2015): 135. PDF

[6] Youden, W. J. "Index for rating diagnostic tests." Cancer 3.1 (1950): 32-35. PDF

Related Studies

[*] Mason, J. P., and J. T. Hoeksema. "Testing automated solar flare forecasting with 13 years of Michelson Doppler Imager magnetograms." The Astrophysical Journal 723.1 (2010): 634. PDF

[*] Hyvärinen, Otto. "A probabilistic derivation of Heidke skill score." Weather and Forecasting 29.1 (2014): 177-181. PDF

[*] Cohen, Jacob. "A coefficient of agreement for nominal scales." Educational and Psychological Measurement 20.1 (1960): 37-46. PDF