## Performance Metrics on Flare Prediction

In progress since Jul. 2019

**Motivation.** This is a short report on the challenges of working with the performance metrics widely used in flare prediction studies. Any realistic flare forecast dataset inherits an extreme class-imbalance problem that must be dealt with properly. Otherwise, an impressive number can be reported as the performance of a model simply by choosing the wrong performance metric, or a "bad" sampling methodology.

**Objective.** This short write-up touches the surface of some issues with the choice of a performance metric for an imbalanced dataset, and juxtaposes the results with those of a balanced dataset.

# Review:

Let's start by listing the performance metrics that we are interested in. Below you can see their formulas and definitions, and in some cases a link to a relevant study.

To make sure that the terminology used here is clear: in this write-up, the terms "... *true* or *false* refers to the assigned classification being correct or incorrect, while *positive* or *negative* refers to assignment to the positive or the negative category." [Source]

## Accuracy

One of the most popular metrics for measuring models' performance, ranging within `[0, 1]`. This report perfectly visualizes the fact that the class-imbalance issue renders this metric almost powerless, to the extent that a model with a fixed prediction (always positive or always negative) may get a near-perfect accuracy (i.e., approaching `1.0`).
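In terms of the confusion-matrix entries, accuracy is defined as:

`Accuracy = (TP + TN) / (TP + FN + FP + TN)`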

## Precision (PPV) & Recall (TPR)

Two measures, each ranging within `[0, 1]`, often used together to show a model's performance in the presence of the class-imbalance issue. *Precision* (aka *Positive Predictive Value*) shows the fraction of correct positive guesses out of all positive guesses, while *recall* (aka *True Positive Rate*) measures the fraction of correct guesses out of all positive observations.
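Their standard definitions are:

`Precision (PPV) = TP / (TP + FP)`

`Recall (TPR) = TP / (TP + FN)`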

## F1-Score (F-Measure, F-Score)

A classic metric, varying within the range `[0, 1]`, widely used for imbalanced datasets in all domains. It is the harmonic mean of *precision* and *recall*.
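That is:

`F1 = 2 · (Precision · Recall) / (Precision + Recall)`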

## Negative Predictive Value (NPV) & True Negative Rate (TNR)

*Negative Predictive Value* is simply *precision* computed for the negative class. Similarly, *True Negative Rate* is *recall* for the negative class. They both lie within the interval `[0, 1]`.
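Their definitions mirror those of precision and recall:

`NPV = TN / (TN + FN)`

`TNR = TN / (TN + FP)`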

## True Skill Statistic (TSS)

A metric suggested by *Bloomfield et al.* (2012) to be used for reporting the performance of flare forecasting models. It is simply the *True Positive Rate* minus the *False Positive Rate* (or, *recall* − *FPR*). TSS ranges within `[-1, 1]`; random or constant forecasts score `0`, perfect forecasts score `1`, and forecasts that are always wrong score `-1`.
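In terms of the confusion-matrix entries:

`TSS = TPR - FPR = TP / (TP + FN) - FP / (FP + TN)`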

## Heidke Skill Score (HSS1)

This is one definition of HSS, used by *Barnes & Leka* (2008), that measures how much better a model performs compared to the model that always predicts the negative class. It ranges within `(-Inf, 1]`; to be exact, it ranges from `-N/P` to `1` (assuming that the majority class is the negative class).
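Written in terms of the confusion-matrix entries (with `P = TP + FN` and `N = FP + TN`), the form below is consistent with this range and with the example values given later in this write-up:

`HSS1 = (TP + TN - N) / P`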

## Heidke Skill Score (HSS2)

This is another definition of HSS, as specified by the Space Weather Prediction Center (SWPC 2007) in their Forecast Verification Glossary and used by *Mason & Hoeksema* (2010). It quantifies the performance of a model by comparing it with a model that predicts randomly. It ranges within the interval `[-1, 1]`.
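This is the standard Heidke skill score, which in terms of the confusion-matrix entries reads:

`HSS2 = 2 · (TP·TN - FN·FP) / [ (TP + FN)(FN + TN) + (TP + FP)(FP + TN) ]`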

# Experiments:

The objective here is to design some simple experiments that can show us how different metrics reflect different models' performance. We do this in two rounds: once for an imbalanced dataset, and then for a perfectly (1:1) balanced dataset.

These experiments are conducted on a subset of all confusion matrices that a model could possibly obtain. Therefore, we do not need any actual data for these experiments; we simply create a table as a (discrete) subset of all possible confusion matrices.

## A: All Confusion Matrices for an Imbalanced Dataset

Given a dataset of 5100 instances with:

- the total number of **positive** instances: `|P| = 100`
- the total number of **negative** instances: `|N| = 5000`

A subset of all possible confusion matrices is generated as follows:

We cover a subset of the entire space of confusion matrices by changing `TP` and `FN` with a step-size of `10`, in opposite directions, and then for each pair of those values, varying `FP` and `TN` with a step-size of `500`, again in opposite directions.

This keeps the sums `TP+FN` and `FP+TN` constant across all iterations, while ensuring that a uniform and discrete subset of the entire space is covered.

Based on the above table, we can compute any of the performance metrics.
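Since no actual data is involved, the whole table can be reproduced in a few lines. Below is a minimal Python sketch (an assumed reconstruction, not the original implementation; the `metrics` helper and variable names are mine) that enumerates the 11 × 11 = 121 confusion matrices and computes the metrics defined in the review section:

```python
# Experiment A: enumerate a discrete, uniform subset of all confusion
# matrices for |P| = 100 and |N| = 5000, then compute the metrics above.
P, N = 100, 5000

def metrics(TP, FN, FP, TN):
    p, n = TP + FN, FP + TN                       # class sizes implied by the matrix
    TPR = TP / p                                  # recall
    FPR = FP / n
    PPV = TP / (TP + FP) if TP + FP else 0.0      # precision (0 when undefined)
    NPV = TN / (TN + FN) if TN + FN else 0.0
    acc = (TP + TN) / (p + n)
    f1 = 2 * PPV * TPR / (PPV + TPR) if PPV + TPR else 0.0
    tss = TPR - FPR
    hss1 = (TP + TN - n) / p
    hss2 = 2 * (TP * TN - FN * FP) / (p * (FN + TN) + (TP + FP) * n)
    return acc, PPV, TPR, NPV, f1, tss, hss1, hss2

table = []
for TP in range(0, P + 1, 10):          # TP and FN change in opposite directions
    FN = P - TP
    for FP in range(0, N + 1, 500):     # FP and TN change in opposite directions
        TN = N - FP
        table.append((TP, FN, FP, TN) + metrics(TP, FN, FP, TN))

print(len(table))  # 121 rows, one per iteration (the Iter column)
```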

The plot on the right illustrates the performance of all 121 models listed above. The X axis represents the `Iter` column.

**Observation:**

- While both `TSS` and `HSS2` range within the interval `[-1, 1]`, `HSS2` approaches `-1` much more slowly than `TSS`.
- The `TSS` metric has a linear characteristic, unlike `F1-Score` and `HSS2`.
- Note how `HSS2` and `F1-Score` behave similarly, except that `F1-Score` is always positive and does not account for `TN`.

Let's go further and see how other metrics reflect these models' performance.

**Observation:**

This crystallizes the fact that:

- *Accuracy* is indeed misleading in the presence of the class-imbalance issue.
- *PPV* (*Precision*) and *TPR* (*Recall*) should be used together to show a model's overall performance, owing to the fact that the former does not account for `TN` and `FN`, while the latter excludes `TN` and `FP`; hence the step-shaped result.
- The extreme exponential growth of *NPV*, which suggests that the model is getting exponentially better at predicting negative instances, is partly rooted in the imbalance of the data. That is, in such cases, overall improvement in prediction is often achieved by correct prediction of the negative class, and may have nothing to do with prediction of the positive class.

The above observations become more interesting when juxtaposed with similar results on a balanced dataset. Let's design a new experiment to have a side-by-side comparison!

## B: All Confusion Matrices for a Balanced Dataset

Given a dataset of 200 instances with:

- the total number of **positive** instances: `|P| = 100`
- the total number of **negative** instances: `|N| = 100`

A subset of all possible confusion matrices is generated as follows:

Similar to what we did before, we cover a subset of the entire space of confusion matrices by changing `TP` and `FN` with a step-size of `10`, in opposite directions, and then for each pair of those values, changing `FP` and `TN` with the same step-size, again in opposite directions.
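Under the same assumptions as the sketch in Experiment A (and reusing its hypothetical `metrics` helper), only the class sizes and the `FP`/`TN` step-size change:

```python
# Experiment B: the same enumeration with a perfectly balanced dataset.
# Reuses the metrics() helper from the Experiment A sketch above.
P, N = 100, 100

table = []
for TP in range(0, P + 1, 10):
    FN = P - TP
    for FP in range(0, N + 1, 10):   # same step-size of 10 for FP and TN
        TN = N - FP
        table.append((TP, FN, FP, TN) + metrics(TP, FN, FP, TN))

print(len(table))  # again 11 x 11 = 121 confusion matrices
```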

The plot on the right illustrates the performance of all the models listed above. The X axis represents the `Iter` column.

**Observation:**

- Note how `TSS` and `HSS2` perfectly coincide (blue and yellow dotted lines). This is simply because their definitions are algebraically identical when the data is perfectly balanced (i.e., `|P| = |N|`); see the quick derivation after this list.
- It seems that the only differences between `F1-Score` and the other two are its non-linear behavior and its range. Other than that, all of them successfully reflect the model's performance.
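The equivalence in the first observation is quick to verify. With `|P| = |N| = n`, we have `TP + FN = n` and `FP + TN = n`, hence `TN = n - FP` and `FN = n - TP`, and the definition of `HSS2` collapses to `TSS`:

`HSS2 = 2·(TP·TN - FN·FP) / (n·(FN+TN) + n·(TP+FP)) = 2·n·(TP - FP) / (2n²) = (TP - FP) / n = TPR - FPR = TSS`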

The plot on the right illustrates the performance of all the models listed above. The X axis represents the `Iter` column.

**Observation:**

- Since the data is now balanced, *Accuracy* is now capable of reflecting the models' actual performance.
- Note that *TPR* (*Recall*) is not affected by the new ratio and shows the exact same results.
- For both *PPV* and *NPV*, due to the balance in the data, the growth factor is much lower (compared to the imbalanced case).

So far, we compared all performances that could be achieved by predictive models in two scenarios: once on an [imbalanced] and then on a [balanced] dataset. We can now keep the model fixed, with an arbitrary power of prediction, and play with the balance ratio of the data. This allows us to observe how different metrics react to the degree of imbalance.

## C: One Model, Different Balance Ratios

Given a dataset of 200 instances, and a model that correctly classifies 75% of positive instances and 25% of negative instances.

While the forecast model and the data size are fixed, we change the balance ratio, varying from `P=200 & N=0` all the way to `P=0 & N=200`.

On the right, the resultant 21 confusion matrices are shown.
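A minimal sketch of this setup, in the same assumed style as the earlier snippets (the rounding of fractional counts is my assumption):

```python
# Experiment C: one fixed model (75% of positives and 25% of negatives
# classified correctly) evaluated under 21 different balance ratios.
SIZE = 200

for it, P in enumerate(range(SIZE, -1, -10)):   # P: 200 -> 0, so N: 0 -> 200
    N = SIZE - P
    TP = round(0.75 * P); FN = P - TP           # 75% of positives correct
    TN = round(0.25 * N); FP = N - TN           # 25% of negatives correct
    TPR = TP / P if P else 0.0                  # endpoints have an empty class
    TNR = TN / N if N else 0.0
    tss = TPR + TNR - 1                         # equivalently TPR - FPR
    print(f"Iter {it:2d}: [TP:{TP:3d} | FN:{FN:2d} | FP:{FP:3d} | TN:{TN:2d}]  TSS={tss:+.2f}")
```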

The plots here depict the performance of the above-mentioned model in 21 iterations, where the balance ratio progressively changes. The X axis represents the `Iter` column.

**Observation:**

- As shown previously by *Bloomfield* et al. (2012), TSS is indeed unbiased to changes in the balance ratio, while neither *HSS* nor *F1-Score* shows such a characteristic.
- Once again, we can see `TSS` and `HSS2` converge as the imbalance ratio approaches `1:1` (in the middle).
- `TSS`, `TPR`, and `TNR` seem completely unbiased to changes in the imbalance ratio.
- It is interesting to observe that `F1-Score` indicates that the performance drops while it is in fact unchanged.
- Note how *TPR* (*recall*) shows the 75% success in prediction of positive instances and is unbiased to the balance ratio. Similarly, *TNR* (*Specificity*) shows no bias, while reflecting the model's 25% success on the negative class.

**Question:** Can we use `TNR` and `TPR` to create a new metric, since neither of them is biased to the imbalance ratio?

**Answer:** Yes. In fact, there is a metric called *Youden's J statistic* (*J*) [6], which does exactly this: `J = TPR + TNR - 1`. The best performance, detecting all positive and negative instances correctly (`TPR = 1` & `TNR = 1`), gets `J = 1`, and conversely, the worst model, which gets all guesses wrong (`TPR = 0` & `TNR = 0`), is assigned `J = -1`.

However, this measure is indeed nothing but the `TSS`, which can be shown as follows:

`TSS = TPR - FPR = TPR - (1 - TNR) = TPR + TNR - 1 = J`

## Lessons Learned:

### Why is F1-Score not desirable?

- [-] While this is generally a good metric for imbalanced datasets, it is biased to the imbalance ratio of the classes (see the first plot in the last experiment). That is, the success of a model is not comparable to that of any other model, or even of that very model, unless the imbalance ratio is preserved across all trials. For instance, it should not be used when a model is evaluated on samples with different positive-to-negative ratios, to show how the model performs on differently sampled data.

### When should/shouldn't TSS be used?

- [+] TSS is unbiased to the class-imbalance ratio and takes into account all entries of the confusion matrix. It should be used when we want to compare different models in the presence of different imbalance ratios.
- [-] TSS treats TPR and FPR the same. That is, the penalty for misclassification of a positive instance is no different from that of a negative instance. In real-world data, however, the costs of misclassifying these two classes are far from equal. In the flare prediction task, for instance, as explained in *Bobra et al.* (2015) [5], "... not predicting a flare that occurs (false negative) may be more costly than predicting a flare that does not occur (false alarm). Indeed, in the case of a satellite partly shielded to withstand an increase in energetic particles following a solar flare, the cost of a false alarm is the price paid to rotate the satellite so that the shielded part faces the particle flow, while the cost of a false negative may be the breakdown of the satellite."

### When should/shouldn't HSS be used?

- [!] In case of a perfect 1:1 balance between classes, `HSS1`, `HSS2`, and `TSS` are all equivalent. Therefore, while it is safe to use `HSS`, `TSS` can simply replace it.
- [-] Under the same circumstances where `F1-Score` shouldn't be used, `HSS1` and `HSS2` should be avoided as well.

## Side notes

### Note 1

Despite what was said in *Bobra et al.* 2015 [5], it is not true that for `HSS1` "negative scores denote a worse performance than always predicting no flare". Although the paper does not provide a clear definition of a random model, the example below can help explain why this is not a valid conclusion.

**Example:** The three confusion matrices below, for a dataset with `100` positive instances and `5000` negative ones, can be described as follows:

- The first one represents a model that always predicts "no-flare", hence `HSS1 = 0`. This seems to be a meaningful point to compare other models to.
- The second one, however, is the opposite model, always predicting "flare", which results in nearly the minimum value of `HSS1` (i.e., `-N/P = -50`) for this imbalance ratio (100:5000).
- And finally, the last one shows a more moderate model, with 60% success on positive instances and 30% success on the negative class.

`A ----[ TP:0 | FN:100 | FP:0 | TN:5000 ]----> HSS1 = 0.0`

`B ----[ TP:100 | FN:0 | FP:5000 | TN:0 ]----> HSS1 = -49.0`

`C ----[ TP:60 | FN:40 | FP:3500 | TN:1500 ]----> HSS1 = -34.4`

If the quoted statement were correct, then both models B and C should be understood as worse than A. While we are not given a definition by which to compare A and B, it is clear that model C should always be preferred over A, which lacks any knowledge about the data.

### Note 2

`HSS2` measures the improvement of the forecast over a random forecast. Let's see this through an example.

**Example:** Given a dataset with `100` positive instances and `5000` negative ones, `HSS2` reaches `0` in three situations:

- When the prediction is similar to a random guess, i.e., 50% of positive instances and 50% of negative instances are predicted correctly:
`[TP:50 | FN:50 | FP:2500 | TN:2500] --> HSS2 = 0.0`
- When the model always predicts a positive or always predicts a negative instance to happen:
`[TP:0 | FN:100 | FP:0 | TN:5000] --> HSS2 = 0.0`
`[TP:100 | FN:0 | FP:5000 | TN:0 ] --> HSS2 = 0.0`
- When the following condition holds: `TP × TN = FN × FP`.

The **first** situation follows directly from the definition of `HSS2`. The **second** situation gives `HSS2` a big advantage over `HSS1`, since it puts the all-positive and all-negative models right next to a random guess, which is meaningful. The **third** case is even more interesting. The condition can be restated as `TP / FN = FP / TN` to be more understandable. All models exhibiting such behavior are doomed to a random-guess-like performance, since their improvement on one class is always tied to doing worse on the other class.
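These three situations are easy to verify numerically; a quick check, using the same `HSS2` expression as in the earlier sketches:

```python
# Verify that HSS2 is 0 in all three situations described above.
def hss2(TP, FN, FP, TN):
    return 2 * (TP * TN - FN * FP) / (
        (TP + FN) * (FN + TN) + (TP + FP) * (FP + TN))

print(hss2(50, 50, 2500, 2500))   # random guess            -> 0.0
print(hss2(0, 100, 0, 5000))      # always negative         -> 0.0
print(hss2(100, 0, 5000, 0))      # always positive         -> 0.0
print(hss2(20, 80, 1000, 4000))   # TP*TN == FN*FP (80000)  -> 0.0
```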

### Note 3

The range of the metric *Youden's J statistic* [6], contrary to what was stated in the original paper [5] and the Wikipedia page, is not `[0, 1]`. It in fact ranges from `-1` to `1`.

**Example:** See the confusion matrix below, which results in `J < 0`.

`TP:90 | FN:10 | FP:5000 | TN:0 ----> J = TSS = -0.1`

## References:

[1] Woodcock, Frank. "The evaluation of yes/no forecasts for scientific and administrative purposes." *Monthly Weather Review* 104.10 (1976): 1209-1214. PDF

[2] Bloomfield, D. Shaun, et al. "Toward reliable benchmarking of solar flare forecasting methods." *The Astrophysical Journal Letters* 747.2 (2012): L41. PDF

[3] Barnes, G., and K. D. Leka. "Evaluating the performance of solar flare forecasting methods." *The Astrophysical Journal Letters* 688.2 (2008): L107. PDF

[4] SWPC (Space Weather Prediction Center): https://www.swpc.noaa.gov/

[5] Bobra, Monica G., and Sebastien Couvidat. "Solar flare prediction using SDO/HMI vector magnetic field data with a machine-learning algorithm." *The Astrophysical Journal* 798.2 (2015): 135. PDF

[6] Youden, W., "Index for rating diagnostic tests." PDF

## Related Studies

[*] Mason, J. P., and J. T. Hoeksema. "Testing automated solar flare forecasting with 13 years of Michelson Doppler Imager magnetograms." *The Astrophysical Journal* 723.1 (2010): 634. PDF

[*] Hyvärinen, Otto. "A probabilistic derivation of Heidke skill score." *Weather and Forecasting* 29.1 (2014): 177-181. PDF

[*] Cohen, Jacob. "A coefficient of agreement for nominal scales." *Educational and Psychological Measurement* 20.1 (1960): 37-46. PDF