A Curated Image Parameter Dataset from the Solar Dynamics Observatory Mission

Sep 2017 - Jul 2019


"A Curated Image Parameter Dataset from Solar Dynamics Observatory Mission" [article][pdf][media]

> [DOI: 10.3847/1538-4365/ab253a]

Azim Ahmadzadeh, Dustin J. Kempton, and Rafal A. Angryk

The Astrophysical Journal Supplement Series

ApJS-2019

Can we classify solar events based solely on a set of simple image parameters? In other words, how well can a supervised model learn to distinguish an Active Region (AR) from a Coronal Hole (CH) event by looking only at pixel intensities and regional textures?

The contributions of this study are as follows:

  • Analyzing each of the ten image parameters and determining their tuning variables (if any),

  • Tuning each of the ten image parameters using the F-test,

  • Utilizing supervised classifiers for detection of AR and CH events based on the feature space of the tuned image parameters,

  • Comparing the features extracted from JP2 images versus FITS files in terms of classification performance, and

  • Providing a public API for accessing the dataset of features extracted from more than six years of SDO images, at a cadence of 6 minutes.



Abstract

We conduct a thorough investigation of the performance of a set of ten image parameters on AIA solar images captured by the Solar Dynamics Observatory mission and present our dataset in the form of a publicly available online interface. We target two solar event types, namely active regions and coronal holes, which are of particular interest for the classification and prediction of solar flares. Since the AIA’s four telescopes provide narrow-band imaging of seven extreme ultraviolet band passes and two ultraviolet channels, we carefully tune the image parameters one by one to achieve higher performance on each of those channels in classifying active regions and coronal holes. In addition to taking the differences between the channels into account, we perform our experiments on two different image formats, FITS and JP2. The FITS format is used to store the original AIA images after digitization, and the JP2 format is utilized by Helioviewer to reduce the size of its repository and make processing of such large datasets more feasible. The AIA images in JP2 format, like those in FITS format, are 4k images, but they are approximately 5 to 14 times smaller in size, which makes any algorithm run significantly faster on them. Our comparison shows that, for our purpose, using the JP2 format yields the same level of classification accuracy for all of our image parameters as we could achieve with the FITS images.

To access the data, please visit our web API.



Background

In 2010, Banda et al. collected a set of ten image parameters while working toward creating a Content-Based Image Retrieval (CBIR) system for the SDO AIA images (Banda and Angryk, 2010b). These parameters were chosen based on their effectiveness in classifying solar events and on their processing time (Banda and Angryk, 2010a). The concern regarding the running time of the implemented parameters is rooted in the ultimate goal of near real-time processing of the data and prediction of solar events: the processing window is bounded by the rate of eight 4096 × 4096–pixel images being transmitted to Earth every 10 seconds. The performance of these parameters was further examined and confirmed by Banda et al. (2011, 2013). These parameters have also been used for classification of filaments in H-alpha images from the Big Bear Solar Observatory (BBSO), with similar success reported by Schuh and Angryk (2014). Schuh et al. later employed these ten image parameters to develop a trainable module for use in the CBIR system (Schuh et al., 2015).

In this research, we utilize machine learning models to tune a set of ten image parameters to reach their highest performance in classifying two important solar events, namely active regions and coronal holes. Along the way, we run several different experiments to pick the best settings for each of the parameters and use statistical measures to verify each of the steps.

Heatmap plots of the ten image parameters extracted from an AIA JP2 image captured on 2017-09-06 at 12:55:00, from the 171–Å channel.


Grid-based Segmentation

All parameters in the Table above, except for fractal dimension and Tamura directionality, capture some information about the distribution of the pixel intensity values of the images, and none of them preserves spatial information. However, the location and shape of solar phenomena, like the temporal information, are crucial aspects of our data. To preserve the spatial information, we apply a grid-based segmentation to the images. This is a widely used technique that has already been tested on AIA images by Banda and Angryk (2009, 2010a) with good results. Each 4096 × 4096–pixel AIA image is segmented by a fixed 64 × 64–cell grid. For each grid cell, which spans a square of 64 × 64 pixels of the image, the 10 image parameters are calculated. In Fig. 1, such a segmentation, together with the heat-map of the mean parameter as an example, is visualized. Since each image comes in 10 different wavelength channels, calculating each image parameter results in a data cube of size 64 × 64 × 10.
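To make the segmentation concrete, here is a minimal Python sketch (illustrative only; the dataset pipeline itself is implemented in Java) that splits a 4096 × 4096 array into a 64 × 64 grid of cells and computes the mean parameter per cell, producing one heat-map layer like the ones shown above. The array `aia_like` is synthetic stand-in data, not real AIA pixels.

```python
import numpy as np

def grid_cell_means(image: np.ndarray, grid: int = 64) -> np.ndarray:
    """Split a square image into a grid x grid mesh of equal cells and
    return the mean intensity of each cell."""
    h, w = image.shape
    cell_h, cell_w = h // grid, w // grid           # 4096 // 64 = 64 pixels per cell
    # Reshape so each cell becomes a (cell_h, cell_w) block, then average each block.
    cells = image[:grid * cell_h, :grid * cell_w].reshape(grid, cell_h, grid, cell_w)
    return cells.mean(axis=(1, 3))                  # shape: (64, 64)

# Synthetic 4096 x 4096 "image" standing in for one AIA channel.
aia_like = np.random.default_rng(0).integers(0, 16384, size=(4096, 4096)).astype(float)
heatmap = grid_cell_means(aia_like)                 # one 64 x 64 heat-map of the mean parameter
print(heatmap.shape)                                # (64, 64)
```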

The image parameters can be categorized into two main groups: those that describe purely statistical characteristics of an image and those that capture textural information. The former further divides into two subcategories: parameters such as mean, standard deviation, skewness, kurtosis, relative smoothness, and Tamura contrast depend solely on the pixel intensity values of the image, while uniformity and entropy, in addition to the pixel values, depend on the choice of bin size required for constructing the normalized histogram of the color intensities. The latter group captures the characteristics of the image texture within the regions of interest (i.e., solar events).

Entropy: the Monkey Model Entropy (MME), which is identical to the entropy Shannon introduced in the context of communication, is still the most popular model in the image processing community. In this model, the random variable i_(x, y), i.e., the intensity value of the pixel at position (x, y), is assumed to be independent and identically distributed (i.i.d.), and the entropy is therefore measured as shown below.
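For reference, this is the standard Shannon form presumably intended here, with p_i denoting the probability of the i-th bin of the n-bin normalized intensity histogram of a grid cell:

E = -\sum_{i=1}^{n} p_i \log_2 p_i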

Uniformity: Similar to entropy, uniformity is a popular statistical measure widely used to quantify the randomness of the color intensities and to characterize the textural properties of an image. Uniformity reaches its highest value when the gray-level distribution has either a constant or a periodic form. In other words, a smoother distribution will have a higher (i.e., closer to one) uniformity, because there are only a few dominant gray-tone transitions and therefore few intensity bins with counts much larger than the average.
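The standard definition, using the same n-bin normalized histogram (it equals one when all pixels share a single intensity bin):

U = \sum_{i=1}^{n} p_i^{2}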

Fractal Dimension: The box-counting method is one way of measuring the fractal dimension (or structural complexity) of a shape. The general approach can be described as follows. The fractal surface, in an n-dimensional space, is first partitioned by a grid of n-cubes with side length ε. Let N(ε) denote the number of n-cubes overlapping the fractal structure. If we repeat the counting process for n-cubes of different sizes, the slope β of the regression line fitted in log–log space gives an estimate of the fractal dimension, as shown below.
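A minimal reconstruction of that regression relation, i.e., the standard box-counting estimate, assuming the line is fitted to log N(ε) against log(1/ε):

\log N(\varepsilon) \approx \beta \log\frac{1}{\varepsilon} + c, \qquad \hat{D}_{\mathrm{box}} = \beta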


Fractal dimension provides a measure to quantify the complexity of a shape’s contour, with larger values indicating higher complexity. As a sanity check, we tested our implementation of the box-counting method against this expected behaviour. In this experiment, two groups of test signals are generated as our fractal-like shapes: one set is created by adding incrementally increasing random noise to a sine wave, and the other by adding a second sine wave of incrementally increasing frequency to the base sine wave. Measuring the dimension of each signal, we observe a roughly linear growth of the fractal dimension, which conforms to our expectation. The resulting comparison is illustrated on the right.

For more details, see my other study on this parameter.

Tamura Directionality: Tamura directionality measures the changes in direction that are visually perceivable in image textures. In statistical terms, this parameter calculates the weighted variance of the gradient angles, φ, for each peak, p, of the histogram of angles, h(φ), within that peak’s domain, ω_p, taking the angle corresponding to each peak to be the mean value of the angles within its domain.
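A common formulation consistent with this description is sketched below; normalization conventions vary between implementations, so treat this as an illustration rather than the exact expression used in the paper. Here φ_p is the angle at which peak p occurs:

F_{\mathrm{dir}} = \sum_{p}\sum_{\varphi \in \omega_p} (\varphi - \varphi_p)^{2}\, h(\varphi)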

See my other study on this parameter here.





Data Sources

  • HEK: This is the source of the spatiotemporal data used in this study. The HEK system, a centralized archive of solar event reports, is populated with the events detected by its Event Detection System (EDS) from SDO data. Eighteen different classes of events are considered, such as active region, coronal hole, and flare. For each event class, a unique set of required and optional attributes is defined. Each event must have a duration and a bounding box that contains the event in space and time. We use this information to map the metadata of the reported events to the corresponding AIA images. Among the existing classes, we are only interested in active region and coronal hole events for this study. Active regions are reported and assigned numbers daily by the Space Weather Prediction Center (SWPC) of NOAA (National Oceanic and Atmospheric Administration). Each active region observation, as Hurlburt et al. (2010) explain, is an event bounded within a 24-hour time interval. Therefore, HEK considers all NOAA active regions with the same active region number to be the same active region.

  • AIA: This is the source of the image data for our study. The AIA’s four telescopes provide narrow-band imaging of seven extreme ultraviolet (EUV) band passes (94–Å, 131–Å, 171–Å, 193–Å, 211–Å, 304–Å, and 335–Å) and two UV channels (1600–Å and 1700–Å). The captured 4k images of the Sun, which are full-disk snapshots with a cadence of 12 seconds, are compressed on board and, without being recorded in orbit, are transmitted to SDO ground stations. The received raw data (Level 0) are archived on magnetic tapes in the JSOC science-data processing facility. The uncompressed data are then exported as FITS files with the data represented as 32-bit floating-point values. At this point, the images are already calibrated; however, some corrections and cleaning are still required due to a small residual roll angle between the four AIA telescopes. At this stage (Level 1.5), the data are ready for analysis. In some repositories, including Helioviewer, the FITS files are converted to the JP2 format to reduce the volume of the database. In this study, we use the Level 1.5 FITS files and refer to them as the “raw FITS”, as opposed to the “clipped FITS”.

JP2 vs. FITS

As mentioned above, SDO image data are available in both JP2 and FITS formats. FITS, short for Flexible Image Transport System, is a data format for recording digital images of scientific observations; it was proposed as a solution to the data transport problem. I studied the distribution of pixel intensities of FITS files here. For processing the FITS files, we use the nom-tam-fits Java library.


A 3-D view of an AIA FITS image from the 171–Å channel, with values ranging from 0 to 16383
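As an aside, here is a rough Python sketch of loading one such FITS image and building an intensity histogram. This is an assumption on my part: the paper’s own pipeline uses the nom-tam-fits Java library, astropy is only a Python stand-in, and the file name below is hypothetical.

```python
import numpy as np
from astropy.io import fits  # Python stand-in for the nom-tam-fits Java library used in the paper

# Hypothetical Level 1.5 AIA file from the 171-Angstrom channel.
with fits.open("aia_lev15_171A_20170906_125500.fits") as hdul:
    data = hdul[-1].data.astype(np.float32)  # the image usually sits in the last (often compressed) HDU

# The 14-bit camera yields intensities in [0, 16383]; clip stray calibration values
# and build a normalized histogram of the pixel intensities.
clipped = np.clip(data, 0, 16383)
hist, edges = np.histogram(clipped, bins=256, range=(0, 16383), density=True)
print(clipped.shape)  # (4096, 4096) for a full-disk image
```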



Summary of Settings

In summary, for each image parameter we identified the variables, and their domains, that play a role in tuning that parameter. We will use these variables to find the best settings for the image parameters in order to obtain the highest accuracy in prediction of the solar events. The variables of interest for each of the four tunable image parameters are summarized below:

  • Uniformity: the number of bins, n,

  • Entropy: the number of bins, n,

  • Fractal dimension: the Gaussian smoothing parameter used in Canny edge detector, σ,

  • Tamura directionality: the threshold, t, the minimum distance, d, and the maximum number of peaks, n, used in our peak detection method. These are based on the edge detection method required to compute this parameter.






Tuning Methodology

For each parameter, we first find the set of n key constraints (or variables) and identify an appropriate numeric domain, d_i, for each constraint i ∈ {1, 2, ..., n}. As a result, we will have a feature space of dimension |d_1| × |d_2| × ··· × |d_n| for that particular image parameter, where |d_i| is the cardinality of the domain set d_i. In addition, to describe a particular event, a region of interest must be processed that spans a variable number of grid cells, and the image parameters extract features from this region. If the region spans k grid cells, it is represented by a vector of length k for each image parameter. To be able to compare vectors of different lengths, we apply the seven-number summary to the resultant vectors, which maps each feature computed on a region to seven values and therefore multiplies the dimensionality of our space by 7. Moreover, since the AIA images are captured in 9 ultraviolet (UV) and extreme ultraviolet (EUV) wavelength channels that produce significantly different images of the Sun, a higher-level tuning is expected to take such differences into consideration and look for the best setting per wavelength. This additional layer multiplies the dimension of the feature space by 9. Therefore, for each of the feature spaces defined by an image parameter with n different variables, the final dimension equals |d_1| × |d_2| × ··· × |d_n| × 9 × 7.
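A small sketch of the summarization step follows. The exact seven statistics are an assumption here (minimum, first quartile, median, third quartile, maximum, mean, and standard deviation), chosen to match the common meaning of a seven-number summary rather than quoted from the paper.

```python
import numpy as np

def seven_number_summary(cell_values: np.ndarray) -> np.ndarray:
    """Map a variable-length vector of per-cell parameter values to a fixed-length
    summary (assumed statistics: min, Q1, median, Q3, max, mean, std)."""
    q1, med, q3 = np.percentile(cell_values, [25, 50, 75])
    return np.array([cell_values.min(), q1, med, q3,
                     cell_values.max(), cell_values.mean(), cell_values.std()])

# A bounding box covering k = 37 grid cells yields 37 values of one image parameter;
# the summary always has length 7 regardless of k.
print(seven_number_summary(np.random.default_rng(1).random(37)).shape)   # (7,)
```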

Our methodology can be summarized in the following five steps:

  1. Determining the dimension of the feature space (i.e., identifying the constraints and their domains),

  2. Building the feature space for the period of one month (i.e., January 2012),

  3. Reducing the dimensionality of the feature space using the F-test (i.e., finding the best settings per wavelength),

  4. Building the (reduced) feature space for the period of one year (i.e., 2012),

  5. Measuring the quality of the parameter using supervised learning.




Dataset for Supervised Learning

For the learning and prediction phase, we employed the same data-collection methodology used by Schuh et al. (2017) to collect one year’s worth of AIA images spanning the entire 2012 calendar year, together with the spatiotemporal data for the solar events reported in this period. Here, we only briefly explain the data acquisition process and refer the interested reader to that article, where the entire process is explained in great detail.

We target two solar event types, namely active region (AR) and coronal hole (CH), which are of particular interest to heliophysicists and whose similar reporting characteristics make region identification easier. In 2012, HEK reported 13,518 AR and 10,780 CH event instances, at approximately a four-hour cadence. Since there are more AR instances, we first collect all of them and then look for CH instances within a time window of ±60 minutes around each AR report. AR instances that cannot be paired with a temporally close CH instance are dropped. The report of each event contains both temporal and spatial information. We use the time stamps of the reports to retrieve the corresponding AIA images (in JP2 and FITS formats). The spatial data of each instance consist of a center point for the reported event, its bounding box, and a polygonal outline. We use the bounding boxes to extract the image parameters on the region corresponding to each event instance in our training and test phases. With these constraints, we retrieved 2,116 unique pairs of AR and CH instances. Since our supervised learning model requires a control class, i.e., an event type that points to a region of the solar disk with no report of any other solar event, an artificial event called quiet sun (QS) is introduced. To collect a set of such instances temporally linked to our AR-CH collection, for each AR report the bounding box of that event is used to randomly search for regions that have no intersection with any reports of AR or CH events.
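A minimal sketch of the AR-CH pairing rule described above, assuming each event report is a dictionary with a `time` field (a `datetime`); the real collection works from the HEK metadata and additionally enforces uniqueness of the resulting pairs.

```python
from datetime import timedelta

def pair_ar_with_ch(ar_events, ch_events, window_minutes=60):
    """Keep an AR report only if a CH report exists within +/- window_minutes,
    pairing it with the temporally closest one; unpaired ARs are dropped."""
    window = timedelta(minutes=window_minutes)
    pairs = []
    for ar in ar_events:
        close = [ch for ch in ch_events if abs(ch["time"] - ar["time"]) <= window]
        if close:
            pairs.append((ar, min(close, key=lambda ch: abs(ch["time"] - ar["time"]))))
    return pairs
```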




Dimensionality Reduction

To reduce the dimensionality of the computed feature spaces, the F-test in one-way analysis of variance (ANOVA) is used to pick the feature (per wavelength) that has the highest rank in separating the three solar event types. The score of each feature is computed as the ratio of between-group variability to within-group variability, where all the instances of each solar event type form a single group. The ranking procedure is as follows: for each feature, or setting, all the instances of the three event types reported by HEK are collected. Using random under-sampling, we ensure that the number of instances in all three categories is the same, to avoid imbalanced data sets. After computing the feature of interest on the image cells spanning the bounding boxes of events, the results are summarized using the seven-number summary. After ten rounds of sampling, we use the F-test to rank the settings. We then aggregate the scores of each setting over its seven-number summary, and finally sort the settings by their scores and report the highest-ranked setting per wavelength.
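An illustrative Python sketch of this ranking loop (not the authors’ implementation): `X` holds one column per candidate setting, and `y` holds the AR/CH/QS labels.

```python
import numpy as np
from sklearn.feature_selection import f_classif

def rank_settings(X, y, n_rounds=10, seed=0):
    """Score each candidate setting (column of X) with a one-way ANOVA F-test,
    averaged over ten rounds of random under-sampling so that the AR, CH, and QS
    groups are equally sized."""
    rng = np.random.default_rng(seed)
    labels, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    scores = np.zeros(X.shape[1])
    for _ in range(n_rounds):
        idx = np.concatenate([rng.choice(np.where(y == lab)[0], size=n_min, replace=False)
                              for lab in labels])
        f_values, _ = f_classif(X[idx], y[idx])   # between-group vs. within-group variability
        scores += f_values
    return scores / n_rounds                      # higher score = better class separation
```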

This plot illustrates the difference between the distribution of statistics for the best setting of an image parameter (A) and for an arbitrary setting (B). Note how in A the three distributions are more distinguishable. In this particular case, the image parameter is Tamura directionality, the wavelength is 94–Å, and the statistic is the first quartile.




Supervised Learning

Naïve Bayes: To measure the performance of the four image parameters after finding the best setting for each of them, we employ two classifiers, namely Naïve Bayes and Random Forest. The Naïve Bayes classifier is a simple statistical model that learns by applying Bayes’ theorem, with a strong independence assumption, to the labeled data, and predicts based on the maximum a posteriori rule. In the context of our data points, for an event instance e_t reported at time t, which can be of type AR, CH, or QS, it calculates the feature vector v_t = {x_1, ..., x_n}, where n is the dimension of the defined feature space, and then predicts e_t’s event type, ŷ_t, as shown below.
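The maximum a posteriori rule referred to above, in its standard Naïve Bayes form:

\hat{y}_t = \underset{y \,\in\, \{\mathrm{AR},\,\mathrm{CH},\,\mathrm{QS}\}}{\arg\max}\; P(y)\prod_{i=1}^{n} P(x_i \mid y)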

Random Forest: Since the Naïve Bayes classifier relies only on the probability of the occurrences of the events, the model fails to predict the less trivial cases. For the sake of completeness, we also employ the Random Forest classifier for evaluation of the image parameters. This is an ensemble learning model that builds decision trees on samples of the data (a process called bootstrap aggregating) and predicts the class label by taking the majority vote of the trees classifying each data point. For our data, we generate a forest of 60 trees, each of which tries to predict the event types of the instances; in the end, the ensemble model makes the final decision by taking the majority vote of the trees.
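A minimal scikit-learn sketch with the 60 trees mentioned above; the feature matrix and labels are synthetic placeholders, not the actual dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((300, 7))                 # placeholder: seven-number summaries of one image parameter
y = rng.integers(0, 3, size=300)         # placeholder labels: 0 = AR, 1 = CH, 2 = QS

forest = RandomForestClassifier(n_estimators=60, bootstrap=True, random_state=0)
forest.fit(X, y)                         # bootstrap aggregating: each tree trains on a resampled subset
print(forest.predict(X[:5]))             # final label per instance = majority vote of the 60 trees
```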






For both classification models, we perform a k-fold cross-validation by sampling the events’ instances over all combinations of 4 months of the year 2012, resulting in k = 495 different trials. This keeps the test sets unbiased with respect to potential patterns in the occurrence of solar events.
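The 495 trials come from choosing 4 of the 12 months of 2012, since C(12, 4) = 495. A quick sketch, assuming the 4 chosen months form the test set and the remaining 8 the training set:

```python
from itertools import combinations

months = set(range(1, 13))
# Every choice of 4 test months out of 12 defines one trial: C(12, 4) = 495 folds.
folds = [(sorted(test), sorted(months - set(test))) for test in combinations(months, 4)]
print(len(folds))   # 495 (test-month, train-month) splits
```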

Evaluation of Parameters

The results of the above classifications, for the four image parameters that we have tuned, are shown below.

Note that:

  • the performance of the two models is based on single image parameters, not their combinations. The fact that such high confidence is reached using basic, non-domain-specific image parameters underscores the importance of our choice of image parameters.

  • the relatively poor performance of both models in classifying QS does not really matter, since it is only a synthesized event. In other words, the correct classification of the real events (AR and CH) should determine the quality of the models, and therefore the third plot in each row can simply be disregarded.

  • as one of our main contributions is tuning the ten image parameters, we expect a significant improvement in their quality when used for classification.

  • as the above plots show, for the Random Forest classifier, in almost all cases the JP2 format proves to be the better input for the model, compared to both FITS and clipped FITS. Even for the Naïve Bayes classifier, which does not perform as well as Random Forest, there is no consistent superiority of FITS or clipped FITS images.

Image Parameters Dataset: General Analysis

Mean of the ten image parameters extracted from images queried for a period of one month (2012-01). With a cadence of 6 minutes, the plot represents 7,440 AIA images from the 171–Å wavelength channel.

We hope that our public dataset interests researchers from different backgrounds and attracts more interdisciplinary studies to solar images. While we aim to keep our API data up to date with the stream of data coming from the SDO, we would also like to expand it by adding more image parameters, specifically computed for different solar events, which could lead to a better understanding of solar phenomena and higher classification accuracy.

For more ...

The text above is only a short version of the original article. Please read the article for more details on each step of the production of the Image Parameter dataset, a statistical analysis of the dataset, and an additional discussion, in the appendix, of the impact of camera degradation on the dataset.

Several Studies Based on These Parameters:

"IMAGE RETRIEVAL ON COMPRESSED IMAGES: CAN WE TELL THE DIFFERENCE?"

Juan Banda - Rafal Angryk - Michael Schuh - Petrus Martens

To make the SDO repository more compressed and thus more portable.

"Toward Using Sparse Coding in Appearance Models for Solar Events Tracking"

Dustin Kempton - Michael Schuh

A component of the tracking algorithm known as Multiple Hypothesis Tracking.

"Multivariate Time Series Attribute Selection for Flaring Active Region Classification"

Dustin Kempton - Thaddeus Gholston - Sneha Devireddy

To evaluate AIA images from SDO with the goal of predicting flaring active regions.

"A Large-Scale Solar Image Dataset With Labeled Event Regions"

Michael Schuh - Rafal Angryk - Karthik Pillai - Juan M. Banda - Petrus Martens

To provide a new public benchmark of 15,000 SDO images.

"Massive Labeled Solar Image Data Benchmarks for Automated Feature Recognition"

Michael Schuh - Rafal Angryk

To introduce standard benchmarks for automated feature recognition using solar image data from SDO mission.

"On Using SIFT Descriptors for Image Parameter Evaluation"

Michael Schuh - Rafal Angryk

To present a composite method for image parameter evaluation using SIFT descriptors and bag of words representation.

"Region-based Querying of Solar Data Using Descriptor Signatures"

Juan Banda - Rafal Angryk

To provide region-based querying capabilities to the existing Solar Dynamics Observatory (SDO) content-based image-retrieval (CBIR) system.

"Selection of Image Parameters as the First Step Towards Creating a CBIR System

for the Solar Dynamics Observatory"

Juan Banda - Rafal Angryk

For attribute evaluation that will be used in the CBIR system for SDO images.

[1] "Textural Features Corresponding to Visual Perception"

[2] "On Using SIFT Descriptors for Image Parameter Evaluation"

[3] "Selection of Image Parameters as the First Step Towards Creating a CBIR System for the Solar Dynamics Observatory"

Hideyuki Tamura, Shunji Mori, Takashi Yamawaki

Patrick M. McInerney, Juan M. Banda, Rafal Angryk

Juan M. Banda, Rafal Angryk

> PDF

> PDF

> PDF