Package somspace: Spatial classification of time series with Self-Organizing Maps

Yannis Markonis

2019-05-20

somspace is an R package for spatial classification of time series. Using Self-Organizing Map (SOM), a popular unsupervised learning algorithm [Kohonen 1984], somspace provides the tools to generate meaningful representations that capture the main spatial patterns of the analyzed time series. A quick introduction to SOM can be found in Wikipedia, while more demanding readers can get more insight by reading some of the numerous papers on the subject.

In this first presentation of the somspace package, we will follow the steps of our study on the spatial patterns of European hydroclimate. The aim of the study was to determine homogeneous regions of hydroclimatic variability, derived by the fluctuations of a single variable, namely the self-calibrated Palmer Drought Severity Index (scPDSI).

Data

The dataset used is the Old World Drought Atlas (OWDA), a tree-ring reconstruction of the scPDSI over Europe and parts of Northern Africa and Middle East for the last 2,000 summers Cook et al., 2016. The OWDA has been compiled by the spatial regression of 106 tree-ring records to a map with 5414 half-degree grid cells at a 0.5 x 0.5 resolution. Here, we use a subset [Markonis et al. 2018], which optimizes the temporal and spatial distribution of the dataset. The resulting data grid covers 35.25ºN – 62.75ºN and 4.25ºW – 36.25ºE (2647 grid cells) for the period 992–2012 and is freely available. Alternatively, the owda data object can also be used without any downloadingas. However, it covers an even smaller period of OWDA (500 years).

library(somspace)
str(owda)
## Classes 'data.table' and 'data.frame':   1355264 obs. of  4 variables:
##  $ Time  : int  1501 1501 1501 1501 1501 1501 1501 1501 1501 1501 ...
##  $ Lat   : num  37.2 37.8 38.2 38.8 39.2 ...
##  $ Lon   : num  -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 ...
##  $ scPDSI: num  -0.623 0.168 0.312 -0.722 -0.392 ...
##  - attr(*, "sorted")= chr  "Time" "Lon" "Lat"
##  - attr(*, ".internal.selfref")=<externalptr>

Data preparation

The package is built around the som() function of kohonen package. However to apply it in spatial data, we first have to create a somin object. The transformation of raw data to somin class is performed through the sominp() function. Its sole argument is the original dataset as a 4-column data.table object, containing: time, latitude, longitude and the values of the variable to be classified. Even though the input dataset can be gridded or not, with regular or irregular time step, the time series must have equal length and no missing values.

inp <- sominp(owda) #may take some time depending on dataset size 
str(inp)

The somin class is a list with three elements. The first one, input_for_som, is a matrix that will be used by kohonen::som() function to generate the SOM. The second one, coords, is a data.table with the point ids and their coordinates. The last one, input_dt is also a data.table with the raw data that were used to generate input_for_som.

SOM generation

The generation of SOM is quite straightforward:

owda_som <- somspa(inp)

This creates a somsp object, which contains the som, the nodes properties and the original data.

Since somspa() uses the function som(), it can also take som() arguments. For instance, to create a 6 x 6 map after 1000 iterations:

owda_som <- somspa(inp, grid = somgrid(6, 6), rlen = 1000)

SOM analysis

The somsp objects can be easily plotted by plot() or summarized by summary() functions.

plot(owda_som)

summary(owda_som)
##     node node_counts node_lat node_lon node_sd_lat node_sd_lon
##  1:    1         106    51.25     9.50   1.8785269    2.925263
##  2:    2         111    53.25    10.00   1.8710913    3.693948
##  3:    3         117    59.25    11.50   1.7241309    3.704834
##  4:    4          81    60.25    16.50   3.2110919    3.919325
##  5:    5         113    57.25    24.50   2.5492928    3.843344
##  6:    6         114    60.75    31.25   1.2543048    3.606703
##  7:    7         174    49.75     1.00   2.1143243    4.050415
##  8:    8          29    55.75    -3.00   3.3083526    6.507858
##  9:    9          83    53.25    22.00   1.4618018    2.787746
## 10:   10         110    56.25    29.00   2.1315621    3.052242
## 11:   11          32    61.25     7.25   2.4560656   11.346450
## 12:   12          86    53.50    33.25   1.2164091    2.334663
## 13:   13         104    46.25     2.00   1.8238319    2.478808
## 14:   14          67    50.25    15.50   1.0899711    2.045751
## 15:   15          29    51.25    20.00   0.6952513    1.479956
## 16:   16          34    52.25    28.00   0.8550208    1.085166
## 17:   17          33    57.25    33.50   3.5664020    3.505475
## 18:   18          30    50.75    32.75   0.5632418    1.679919
## 19:   19          87    43.25     1.50   1.0614947    3.586019
## 20:   20          46    48.25    16.50   0.9834991    1.863196
## 21:   21          32    51.25    24.50   0.8131976    2.101267
## 22:   22          44    49.75    25.00   1.8381959    2.604553
## 23:   23          59    48.25    31.00   0.9530167    1.827528
## 24:   24          63    47.75    29.50   2.1835277    4.240604
## 25:   25          51    37.75    -0.50   1.5264338    3.651376
## 26:   26          47    47.75    12.50   0.8872621    2.353231
## 27:   27          55    48.25    22.00   0.9447712    1.712747
## 28:   28          65    46.25    26.00   1.3095434    1.452989
## 29:   29          73    44.75    11.50   1.6308363    2.924191
## 30:   30          64    41.25    31.75   3.6701063    3.323647
## 31:   31          44    36.25     7.00   0.4726953    2.438352
## 32:   32          58    40.75    -2.00   1.5635606    3.125041
## 33:   33         106    45.25    19.00   1.1285023    2.116550
## 34:   34          73    40.75    20.00   2.2865499    5.240233
## 35:   35         104    42.75    24.00   1.4239414    3.245922
## 36:   36         123    38.75    32.00   1.1954198    4.008255
##     node node_counts node_lat node_lon node_sd_lat node_sd_lon

summary offers an overview of the classification results. In our example, we have generated a 6 x 6 classification resulting to 36 nodes. Each of them contains a varying number of grid cells ranging from 29 (node 8) to 174 (node 7). Their latitude and longitude are also available as the mean of their corresponding grid cells, as well as their standard deviations. The latter can be used as a preliminary measure of node homogeneity.

Additionaly, the average time series of specific SOMs can be plotted by plot_ts():

plot_ts(owda_som, n = 12)

The time series can also be manually extracted and further analysed or plotted:

node_12 <- owda_som$input_dt[node == 12, .(node, time, variable)] #in version 1.1.0 `variable` is changed to `values`

Multiple time series can also be plotted, by adding a vector argument in n.

plot_ts(owda_som, n = c(1, 7, 21, 33)) 

plot_ts(owda_som, 1:max(owda_som$summary$node)) #plots all soms

Further classification of SOMs in regions

The number of regions with similar characteristics can be further reduced by hierarchical cluster analysis implemented in somregs() function. Instead of returning a pre-defined number of regions, somregs() provides the series of homogeneous regions starting from 2 up to nregions and returns them as a regs object. In each new classification one of the prior regions is divided and the sub-region with the less nodes gets a new id. This can be useful for keeping track of the classification process and observing which regions remain undivided as the number of splits increases.

Except for the somsp and nregions, other arguments used in the hclust() function can be added, e.g., method for choosing the hierarchical cluster algorithm.

owda_regions <- somregs(owda_som, nregions = 13, method = "ward.D2") 

The output regs object are a two-element list. The regions element contains the corresponding information about the regions and the input_dt the time series of the original dataset. Similarly to somsp objects, regs can be plotted as maps or time series.

plot(owda_regions, regions = 2:13, nrow = 3, ncol = 4)

plot_ts(owda_regions, n = 9)

In this example, we can see that hydroclimatic variability over France and Germany is quite homogeneous, as this region remains undivided for a large number of classifications. On the contrary, Mediterranean appears to be divided into smaller regions in almost every step of the classification process.

Except for a qualitative comparison of the classification, we can apply various statistical criteria to see how the heterogeneity between regions varies as the number of regions increases (here using the clusterCrit package).

library(clusterCrit)
x <- owda_som$som$codes[[1]]
setorder(owda_regions$regions, node) 
classifications <- unique(data.frame(node = owda_regions$regions$node, owda_regions$regions[, 11:22]))
cluster_comparison <- as.matrix(intCriteria(x, as.integer(classifications[, 2]), "all"))

for(i in 3:13){
  cluster_comparison <- cbind(cluster_comparison, 
                              intCriteria(x, as.integer(classifications[, i]), "all"))
} 
colnames(cluster_comparison) <- 2:13

cluster_comparison <- apply(cluster_comparison, 2, unlist)
colnames(cluster_comparison) <- 2:13
cluster_comparison <- data.table(melt(cluster_comparison, variable.factor = T)) 
colnames(cluster_comparison) <- c("Method", "nRegions", "Value") 

homo_methods <- c("davies_bouldin", "c_index", 
                  "calinski_harabasz", "sd_scat", 
                  "sd_dis", "wemmert_gancarski")

ggplot(cluster_comparison[Method %in% homo_methods], aes(x = nRegions, y = Value)) +
  geom_point() +
  geom_line() +
  facet_wrap(~Method, scales = "free") +
  theme_bw()

For OWDA, most of the selected measures present milder decline in heterogeneity after the classification in 9 regions. This implies that this number might sufficiently represent the spatial heterogeneity of European hydroclimate. Of course, the suggested division should be carefully studied to see whether it agrees with similar findings in other studies and other measures as well.

Complex network analysis

Another useful feature of somspace is the analysis of spatial dependencies between regions with the complex networks methodology [Tsonis et al. 2006]. Any of the classificationa stored in the regs object can be analysed through complex networks and plotted as map with cnet(). The threshold, thres, refers to the cross-correlation coefficient between the aggregated time series of each region; only values above thres are considered to be network connections.

cnet(owda_regions, n = 9, thres = 0.3)

We can see that for a classification of n = 9 regions there is a clear zonal (west-to east) orientation of the spatial dependencies in European hydroclimate. The Iberian Peninsula, appears not to be possitively correlated with any of the other regions. If the cross-correlation matrix is studied, it can be seen that it is actually negatively correlated. This is one of the limitations of the complex networks approach and this is why the cross-correlation matrix should always be checked.

References

Users of the somspace package or readers interested in European hydroclimatic variability might also find the following references useful.

Cook, E. R., Seager, R., Kushnir, Y., Briffa, K. R., Büntgen, U., Frank, D., … & Baillie, M. (2015). Old World megadroughts and pluvials during the Common Era. Science Advances, 1(10), e1500561. Link

Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological cybernetics, 43(1), 59-69. Link

Markonis, Y., Hanel, M., Máca, P., Kyselý, J., & Cook, E. R. (2018). Persistent multi-scale fluctuations shift European hydroclimate to its millennial boundaries. Nature communications, 9(1), 1767. Link

Markonis, Y., & Strnad, F. (2019). Representation of European Hydroclimatic Patterns with Self-organizing Maps. EarthArXiv. May, 14. Link

Tsonis, A. A., Swanson, K. L., & Roebber, P. J. (2006). What do networks have to do with climate?. Bulletin of the American Meteorological Society, 87(5), 585-596. Link

Wehrens & J., & Kruisselbrink, J. (2018). Flexible Self-Organising Maps in kohonen 3.0. Journal of Statistical Software, 87, 7. Link