Misanthrope's Thoughts: Classification of the Hyper-Spectral and LiDAR Imagery using R (mostly). Part 1: Result Evaluation

Introduction

This us the first post of my series notes about hyperspectral imagery classification. See other parts: Part 2: Classification Approach and Spectral Profiles Creation

There was the EEEI Data Fusion Contest this spring. This year they wanted people to elaborate about hyper-spectral (142-bands imagery) and LiDAR data. The resolution of the data-set was about 5 m. There were 2 nominations: best classification and the best scientific paper.

I work with high-resolution imagery quite often, but classification is a very rear task for me though. I thought that this contest was a great opportunity to develop my skills. And not just a classification skills, but R skills as well... I decided to participate in best classification contest, and to use R for the most part.

I learned a lot and I will share my knowledge with you in a series of posts starting with this one. And like in some great novels, I will start from the very end - evaluation of my results.

Results of my classification (created in R, designed in QGIS)

Contest results

...are available here. As you may notice, I'm not on the list of the 10 best classification ;-) But there is almost unnoticeable 0.03% difference between my result (85.93% accuracy) and the result of the 10-th place. Not a bad result for a relatively inexperienced guy, don't you think? And I know, that I could have done better - I had 99% prediction accuracy for the training samples. It's funny, but my classification map looks better than map that took 7-th place!

How to evaluate classification using R

Due to I was not on the top ten list, I had to evaluate the result on my own. The organisers finally disclosed evaluation samples and I got a chance for a self assessment. So we have a set of .shp-files - each contains ground-truth polygons for one of the 16 classes and a classification map. We need to accomplish 3 goals:

Create a visual representation of missclassification.
Assess accuracy.
Create a confusion matrix.
Visualise classification map using EEEI colour palette.

Lets get a palette!

Official EEEI palette

To extract colour values from the palette above you may use GIMP. But I used a widget that every KDE-user (Linux) should have by default. You can probe and save colour values from any part of your screen. Quite useful!

'Color picker' in work

Now let's see the code for our tasks.

Load needed libraries

library(rgdal)
library(raster)
library(reshape2)
library(caret)
library(ggplot2)

Load and process data

# get classification raster
ras <- raster('~/GIS/IEEE_contest_2013/2013_IEEE_GRSS_DF_Contest/raster.tif',
              verbose = T)
# get list of shp-files for evaluation
shapes <- list.files(path = '~/GIS/IEEE_contest_2013/2013_IEEE_GRSS_DF_Contest/roi', pattern = '*shp')

# a list for accuracy assessment
accuracy_list <- list()

# create an empty dataframe to be filled vith evaluation results
# field names are not arbitrary!!!
eval_df <- data.frame(variable = character(),
                      value = character())

# create an empty dataframe to be used for plotting
# field names are not arbitrary!!!
plot_data <- data.frame(variable = character(),
                        value = character(),
                        Freq = integer())


for (f_name in shapes) {
  # delete '.shp' from the filename
  layer_name <- paste(sub('.shp', '', f_name))
  class <- readOGR("~/GIS/IEEE_contest_2013/2013_IEEE_GRSS_DF_Contest/roi",
                   layer = layer_name,
                   verbose = F)

  # extract values from raster
  probe <- extract(ras, class)

  # replace class numbers with names 
  samples <- list()
  for (lis in probe) {
    for (value in lis) {
      if (value == 0) {c_name <- Unclassified
      } else if (value == 1) {c_name <- 'Healthy grass'
      } else if (value == 2) {c_name <- 'Stressed grass'
      } else if (value == 3) {c_name <- 'Synthetic grass'
      } else if (value == 4) {c_name <- 'Trees'
      } else if (value == 5) {c_name <- 'Soil'
      } else if (value == 6) {c_name <- 'Water'
      } else if (value == 7) {c_name <- 'Residential'
      } else if (value == 8) {c_name <- 'Commercial'
      } else if (value == 9) {c_name <- 'Road'
      } else if (value == 10) {c_name <- 'Highway'
      } else if (value == 11) {c_name <- 'Railway'
      } else if (value == 12) {c_name <- 'Parking Lot 1'
      } else if (value == 13) {c_name <- 'Parking Lot 2'
      } else if (value == 14) {c_name <- 'Tennis Court'
      } else if (value == 15) {c_name <- 'Running Track'} 
      samples <- c(samples, c = c_name)
    }
  }

  # make layer_name match sample name
  if (layer_name == 'grass_healthy') {layer_name <- 'Healthy grass'
  } else if (layer_name == 'grass_stressed') {layer_name <- 'Stressed grass'
  } else if (layer_name == 'grass_syntethic') {layer_name <- 'Synthetic grass'
  } else if (layer_name == 'tree') {layer_name <- 'Trees'
  } else if (layer_name == 'soil') {layer_name <- 'Soil'
  } else if (layer_name == 'water') {layer_name <- 'Water'
  } else if (layer_name == 'residental') {layer_name <- 'Residential'
  } else if (layer_name == 'commercial') {layer_name <- 'Commercial'
  } else if (layer_name == 'road') {layer_name <- 'Road'
  } else if (layer_name == 'highway') {layer_name <- 'Highway'
  } else if (layer_name == 'railway') {layer_name <- 'Railway'
  } else if (layer_name == 'parking_lot1') {layer_name <- 'Parking Lot 1'
  } else if (layer_name == 'parking_lot2') {layer_name <- 'Parking Lot 2'
  } else if (layer_name == 'tennis_court') {layer_name <- 'Tennis Court'
  } else if (layer_name == 'running_track') {layer_name <- 'Running Track'} 

  # create a dataframe with classification results
  df <- as.data.frame(samples)
  dfm <- melt(df, id = 0)
  dfm['variable'] <- layer_name

  # add data to evaluation dataframe
  eval_df <- rbind(eval_df, dfm)

  # assess accuracy of current class
  mytable <- table(dfm)
  dmt <- as.data.frame(mytable)
  total_samples <- 0
  correct_predictions <- 0
  for (i in 1:nrow(dmt)) {
    predict_class <- toString(dmt[i,2])
    pc_frequency <- dmt[i,3]
    if (predict_class == layer_name) {
      correct_predictions <- dmt[i,3]
    }
    total_samples <- total_samples + pc_frequency
  }
  accuracy <- round(correct_predictions/total_samples, 2)
  accuracy_list <- c(accuracy_list, c = accuracy)


  # append data for plotting
  plot_data <- rbind(plot_data, dmt)
}

Create a fancy graph (that is shown on the map)

# create facets plot
EEEI_palette <- c('#D4D4F6',
                  '#5F5F5F',
                  '#710100',
                  '#00B300',
                  '#007900',
                  '#014400',
                  '#008744',
                  '#D0B183',
                  '#BAB363',
                  '#DAB179',
                  '#737373',
                  '#A7790A',
                  '#00EA01',
                  '#CA1236',
                  '#00B9BB')
plot_data <- plot_data[order(plot_data$variable),]
p = ggplot(data = plot_data,
           aes(x = factor(1),
               y = Freq,
               fill = factor(value)
           )
)
p <- p + facet_grid(facets = . ~ variable)
p <- p + geom_bar(width = 1) +
  xlab('Classes') +
  ylab('Classification rates') +
  guides(fill=guide_legend(title="Classes"))+
  scale_fill_manual(values = EEEI_palette)+
  ggtitle('Classification Accuracy')+
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
p

Let't finally get our accuracy result!

# accuracy assessment
fin_accuracy <- mean(unlist(accuracy_list))
fin_accuracy <- paste(round(fin_accuracy*100, 2), '%', sep = '')
print(paste('Total accuracy:', fin_accuracy), sep = ' ')

[1] "Total accuracy: 85.93%"

Confusion matrix

# confusion matrix creation
true <- as.factor(eval_df$variable)
predict <- as.factor(eval_df$value)
confusionMatrix(predict, true)

Enjoy statistics!

Confusion Matrix and Statistics

                 Reference
Prediction        Commercial Healthy grass Highway Parking Lot 1
  Commercial             850             0      14           151
  Healthy grass            0           868       0             0
  Highway                  0             0     718            14
  Parking Lot 1           41             0      54           641
  Parking Lot 2            0             0      19            73
  Railway                  0             0      11            44
  Residential            155             0     191             0
  Road                     0             0      20           107
  Running Track            0             0       0             0
  Soil                     0             0       1             0
  Stressed grass           0            61       0             0
  Synthetic grass          0             0       0             0
  Tennis Court             0             0       0             0
  Trees                    0           117       0             0
  Water                    0             0       0             0
                 Reference
Prediction        Parking Lot 2 Railway Residential Road Running Track
  Commercial                  0       0          74    2             0
  Healthy grass               0       0           0    0             0
  Highway                     0       9           0    0             0
  Parking Lot 1             104       8           0   11             0
  Parking Lot 2             155       4           0    3             0
  Railway                     2     917          11   61             0
  Residential                 0      40         918    0             0
  Road                       12      14           0  930             0
  Running Track               0       0           1    0           465
  Soil                        6       2           0   27             1
  Stressed grass              0      11           4    0             0
  Synthetic grass             0       0           0    0             3
  Tennis Court                0       3           0    0             0
  Trees                       0       0          10    0             0
  Water                       0       0           0    0             0
                 Reference
Prediction        Soil Stressed grass Synthetic grass Tennis Court Trees
  Commercial         0              0               0            0     0
  Healthy grass      0             14               0            0    34
  Highway            0              0               0            0     0
  Parking Lot 1      0              0               0            0     0
  Parking Lot 2      0              0               0            0     0
  Railway            0              0               0            0     0
  Residential        0              1               0            0     0
  Road              14              0               0            0     0
  Running Track      0              0               0            0     0
  Soil            1040              0               0            0     0
  Stressed grass     0            931               0            0    17
  Synthetic grass    0              0             503            0     0
  Tennis Court       0              0               0          245     0
  Trees              0             85               0            0  1004
  Water              0              0               0            0     0
                 Reference
Prediction        Water
  Commercial          0
  Healthy grass       3
  Highway             0
  Parking Lot 1      22
  Parking Lot 2       0
  Railway             0
  Residential         0
  Road                0
  Running Track       0
  Soil                0
  Stressed grass      0
  Synthetic grass     0
  Tennis Court        0
  Trees               0
  Water             118

Overall Statistics

               Accuracy : 0.859         
                 95% CI : (0.853, 0.866)
    No Information Rate : 0.088         
    P-Value [Acc > NIR] : <2e-16        

                  Kappa : 0.847         
 Mcnemar's Test P-Value : NA            

Statistics by Class:

                     Class: Commercial Class: Healthy grass Class: Highway
Sensitivity                     0.8126               0.8298         0.6984
Specificity                     0.9780               0.9953         0.9979
Pos Pred Value                  0.7791               0.9445         0.9690
Neg Pred Value                  0.9820               0.9839         0.9724
Prevalence                      0.0872               0.0872         0.0857
Detection Rate                  0.0709               0.0724         0.0599
Detection Prevalence            0.0910               0.0767         0.0618
                     Class: Parking Lot 1 Class: Parking Lot 2
Sensitivity                        0.6223               0.5556
Specificity                        0.9781               0.9915
Pos Pred Value                     0.7276               0.6102
Neg Pred Value                     0.9650               0.9894
Prevalence                         0.0859               0.0233
Detection Rate                     0.0535               0.0129
Detection Prevalence               0.0735               0.0212
                     Class: Railway Class: Residential Class: Road
Sensitivity                  0.9097             0.9018      0.8994
Specificity                  0.9883             0.9647      0.9848
Pos Pred Value               0.8767             0.7034      0.8478
Neg Pred Value               0.9917             0.9906      0.9905
Prevalence                   0.0841             0.0849      0.0862
Detection Rate               0.0765             0.0766      0.0776
Detection Prevalence         0.0872             0.1088      0.0915
                     Class: Running Track Class: Soil
Sensitivity                        0.9915      0.9867
Specificity                        0.9999      0.9966
Pos Pred Value                     0.9979      0.9656
Neg Pred Value                     0.9997      0.9987
Prevalence                         0.0391      0.0879
Detection Rate                     0.0388      0.0867
Detection Prevalence               0.0389      0.0898
                     Class: Stressed grass Class: Synthetic grass
Sensitivity                         0.9030                 1.0000
Specificity                         0.9915                 0.9997
Pos Pred Value                      0.9092                 0.9941
Neg Pred Value                      0.9909                 1.0000
Prevalence                          0.0860                 0.0420
Detection Rate                      0.0777                 0.0420
Detection Prevalence                0.0854                 0.0422
                     Class: Tennis Court Class: Trees Class: Water
Sensitivity                       1.0000       0.9517      0.82517
Specificity                       0.9997       0.9806      1.00000
Pos Pred Value                    0.9879       0.8257      1.00000
Neg Pred Value                    1.0000       0.9953      0.99789
Prevalence                        0.0204       0.0880      0.01193
Detection Rate                    0.0204       0.0837      0.00984
Detection Prevalence              0.0207       0.1014      0.00984

As you may see, the main source of misclassification are Parking Lot 1 and Parking Lot 2. The accuracy for other classes is above 90%, and it is great. Frankly, I still don't understand what is the difference between Parking Lots 1 and 2... Official answer was that Parking Lot 2 is cars (isn't detecting them using 5 m resolution imagery is a questionable task???)... But it seems that it is something else. It is hard to classify something that you don't understand what it is...