Misanthrope's Thoughts

What is the cause for the degradation of environment?
Capitalism, corruption, consuming society? - OVERPOPULATION!
Please, save the Planet - kill yourself...

Monday, April 21, 2014

Unify Extent and Resolution Script Updated

The script for unifying extent and resolution is now able to work with multi-band rasters. I also fixed an error that caused output rasters to have different resolutions in some cases.

Sunday, March 16, 2014

Analysis of the Time Aspect of the Matches at The International 3 (Dota 2 Tournament)

It's been a while since I analysed something with R, so I decided to dig into Dota 2 statistics. This game is real eSport now, with players fighting for respectable prize pools. And the main Dota 2 event is The International, where the best teams compete with each other in astonishing battles.

So I grabbed some stats on The International 3 from here and started to think about what I could do with them... One major aspect of the game that was actively discussed prior to the tournament was that the Dire side has an advantage over the Radiant by having the Roshan Pit in its possession. Many people still think that the Dire side is preferable in competitive games (they say that Roshan is killed by the Dire in 70% of cases). In fact, The Alliance (the TI3 winner) played something like 90% of their games on the Dire side at the tournament. But is that valid proof of a Dire side advantage? I doubt it. I think that, on the contrary, the Radiant side has an advantage over the Dire, but I will add my arguments after I show that I'm right.

OK, here is my hypothesis. There is no time limit for a game: a match in Dota 2 lasts until the main building of one of the sides is destroyed (or one of the teams gives up). So if one of the sides (all other things being equal) has an advantage, the median time needed to win a game should be lower for that side than for the other (and vice versa for the time to lose). So let's take a look at the boxplot below:

Dire vs Radiant winning time comparison

Code:
library(ggplot2)

TI3 <- read.csv("~/R/Dota2_Stats/The_International-3")

# create winning time per side plot
p <- ggplot(TI3, aes(TI3[,5], TI3[,8])) +
  geom_boxplot(aes(fill = TI3[,5]))+
  labs(title = 'Winning time per side at TI-3',
       x = 'Side',
       y = 'Time, min.',
       fill = 'Side')
  
print(p)
Clearly, the Radiant side generally wins slightly more quickly than the Dire (and has a higher number of wins: 82 against 76). This means that it is the Radiant, not the Dire, that has an advantage in the game. But why? (You may skip the rest of the paragraph if you have never played Dota.) There are several reasons. The Radiant advantages are: easier rune control, the ability to deny creeps at the hard lane, and the camera angle (it is not parallel to the terrain surface and faces north towards the Dire side). The camera angle was never discussed as an advantage/disadvantage because most people simply got used to it, but the Radiant has a slight but real vision advantage. It seems that Roshan accessibility and a safer ancient camp do not help the Dire that much.
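The same comparison can be made numerically without a plot. Here is a minimal sketch (in Python rather than R, with hypothetical toy records standing in for the TI3 table) of how the per-side win counts and median winning times are derived:

```python
from statistics import median

# hypothetical (winning_side, match_length_minutes) records,
# standing in for columns 5 and 8 of the TI3 table
matches = [('RADIANT', 32), ('RADIANT', 41), ('DIRE', 45),
           ('RADIANT', 38), ('DIRE', 52), ('DIRE', 47)]

for side in ('RADIANT', 'DIRE'):
    # collect winning times for this side, then summarise
    times = [t for s, t in matches if s == side]
    print(side, len(times), median(times))  # RADIANT 3 38 / DIRE 3 47
```

On the real TI3 data the same count/median summary gives the 82 vs 76 win split discussed above.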

What else can we do with time analysis? We can compare win and loss times for all the teams that competed at TI3:
Teams' winning and losing time comparison
Code:
# get the list of teams
Teams <- unique(TI3[,3])

# create a data frame to store winning/losing times
A_time <- data.frame('Team' = character(),
                     'Result' = character(),
                     'Time' = numeric(),
                     stringsAsFactors = FALSE)

# extract time data and append it to the A_time data frame
for (team in Teams) {
  A <- subset(TI3, TI3[,3] == team | TI3[,4] == team)

  for (i in 1:nrow(A)) {
    winner  <- A[i, 5]
    dire    <- A[i, 4]
    radiant <- A[i, 3]
    time    <- A[i, 8]
    if ((winner == 'DIRE' & dire == team) | (winner == 'RADIANT' & radiant == team)) {
      result <- paste(team, 'WIN')
    } else {
      result <- paste(team, 'LOSS')
    }
    A_time[nrow(A_time) + 1, ] <- c(team, result, time)
  }
}

# create plot for winning time per team
p <- ggplot(A_time, aes(A_time[,2], A_time[,3])) +
     geom_boxplot(aes(fill = A_time[,1]))+
     theme(axis.text.x = element_text(angle=90, hjust = 0),
          axis.title = element_text(face = 'bold', size = 14),
          title = element_text(face = 'bold', size = 16),
          legend.position = 'right'
          ) +
     labs(title = 'Win and loss times at The International 3',
          x = 'Results',
          y = 'Time, min.',
          fill = 'Teams')
  
print(p)

Generally my assumption was correct: better teams win more quickly. Alliance and NaVi (1st and 2nd places) are a nice confirmation of it (DK and IG (the TI-2 champion) show a similar pattern as well, despite their shared 5th place). But Orange and TongFu (3rd and 4th places) tend to lose more quickly than they win. This could be explained by the general playstyle of these two Asian teams, which often aims at the late game. This infamous style, with its prolonged no-action farming stage, is often referred to as 'Chinese Dota'. But DK and IG are Chinese teams too. It seems that both TongFu and Orange were able to overcome the odds and play above their level in this tournament. They took the places that DK and IG should have got (DK and IG were stronger favourites than Orange and TongFu before the tournament).

Monday, February 3, 2014

About Corruption and Economical Health of Countries

A quite interesting article about corruption levels in EU countries was recently published by the BBC. Of course, the map is the most interesting part.


The thing that I noticed at the very first moment of observing it is that the countries with the highest corruption levels have the lowest credit ratings (see this interactive map).


When will all these bastards understand that corruption hurts everyone?

Saturday, January 4, 2014

Unifying Extent and Resolution of Rasters Using Processing Framework in QGIS

My post about modification of the extent and resolution of rasters drew quite a bit of attention, so I decided to make a small New Year's present to the community and create a QGIS Processing script to automate the process.

The script was designed to be used within the Processing module of QGIS. It will make two or more rasters of your choice have the same spatial extent and pixel resolution, so that you will be able to use them simultaneously in the raster calculator. No interpolation will be made: new pixels will get a predefined value. Here is a simple illustration of what it does:
Modifications to rasters A and B
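To make the idea concrete, here is a minimal sketch (in Python; the function names are mine, not the script's) of the two computations involved: taking the union of the input extents and deriving the output grid size at the first raster's resolution:

```python
import math

def union_extent(extents):
    # extents are (xmin, ymin, xmax, ymax) tuples, all in the same CRS
    xmin = min(e[0] for e in extents)
    ymin = min(e[1] for e in extents)
    xmax = max(e[2] for e in extents)
    ymax = max(e[3] for e in extents)
    return (xmin, ymin, xmax, ymax)

def grid_size(extent, res):
    # number of columns and rows at the given pixel resolution
    xmin, ymin, xmax, ymax = extent
    cols = int(math.ceil((xmax - xmin) / res))
    rows = int(math.ceil((ymax - ymin) / res))
    return cols, rows

ext = union_extent([(0, 0, 10, 10), (5, 5, 20, 20)])
print(ext)                # (0, 0, 20, 20)
print(grid_size(ext, 5))  # (4, 4)
```

Each input raster is then padded out to that shared extent, with the added pixels receiving the predefined fill value.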


To use my humble gift, simply download this archive and unpack the files from the 'scripts' folder to your .../.qgis2/processing/scripts folder (or whatever folder you configured in the Processing settings). At the next start of QGIS you will find the 'Unify extent and resolution' script in the 'Processing Toolbox', in 'Scripts' under the 'Raster processing' category:

If you launch it you will see this dialogue:
Main window


Note that 'Help' is available:
Help tab
Let's describe the parameters. 'rasters' are the rasters that you want to unify. They must have the same CRS. Note that both output rasters will have the same pixel resolution as the first raster in the input.
Raster selection window
'replace No Data values with' provides the value for the pixels that will be added to the rasters and replaces the original No Data values. Note that in the output rasters this value will not be saved as a No Data value, but as a regular one. This is done to ease further calculations involving both of these rasters, but I'm open to suggestions: if you think a No Data attribute should be assigned, I can add the corresponding functionality.

Finally, you must provide a path to the folder where the output rasters will be stored in the 'output directory' field. A '_unified' postfix will be added to the derived rasters' file names: 'raster_1.tif' -> 'raster_1_unified.tif'
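The renaming rule is a one-liner; here is a standalone sketch of it (the helper name is mine, not the script's):

```python
import os

def unified_name(path):
    # insert the '_unified' postfix before the file extension
    root, ext = os.path.splitext(path)
    return root + '_unified' + ext

print(unified_name('raster_1.tif'))  # raster_1_unified.tif
```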

If the CRSs of the rasters do not match (and you bypass the built-in check), or an invalid path is provided, a warning message will pop up and the execution will be cancelled:
Example of the warning message
When the execution is finished you will be notified and asked if rasters should be added to TOC:


Happy New Year!

P.S. I would like to thank the Australian government for making the code they create open source. Their kindness saved me a couple of hours.

Wednesday, September 18, 2013

Count Unique Values In Raster Using SEXTANTE and QGIS

I decided to make my script for counting unique values in a raster more usable. Now you can use it via SEXTANTE in QGIS. Download the script for SEXTANTE and extract the archive to the folder that is intended for your Python SEXTANTE scripts (for example ~/.qgis2/processing/scripts). If you don't know where this folder is located, go to Processing -> Options and configuration -> Scripts -> Scripts folder, see the screenshot:


Processing options

Now restart QGIS and in the SEXTANTE Toolbox go to Scripts. You will notice a new group named Raster processing. The script named Unique values count will be located there:


Launch it and you will see its dialogue window:

Unique values count script main window
Main window

Note that Help is available:
Unique values count script help tab

Either single- or multi-band rasters are accepted for processing. After the raster is chosen (the input field), one needs to decide whether to round the values for counting or not. If no rounding is needed, leave the 'round values to ndigits' field blank. Otherwise, enter the needed value there. Note that negative values are accepted: this will round values to ndigits before the decimal point.
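For reference, this is exactly how Python's built-in round() behaves with the ndigits argument, including negative values (a standalone illustration, not code from the script):

```python
# positive ndigits round after the decimal point,
# negative ndigits round before it
print(round(123.456, 1))   # 123.5
print(round(123.456, 0))   # 123.0
print(round(123.456, -1))  # 120.0
print(round(123.456, -2))  # 100.0
```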

When the setup is finished, hit the Run button. You will get something like this:
Result window

Saturday, August 17, 2013

How to Count Unique Values in Raster

Recently I demonstrated how to get histograms for rasters in QGIS. But what if one needs to count the exact number of cells with a given value in a raster (for example, for the assessment of classification results)? In this case scripting is required.

UPD: the script below was transformed into a more usable SEXTANTE script, see this post.

We'll use Python-GDAL for this task (thanks to tutorials such as this one, it is extremely easy to learn how to use it). The script I wrote may not be optimal (though it works fine and quickly, so you may use it freely), because it is not clear to me whether anyone would want to use it for floating-point data types (which seems infeasible due to the great number of unique values in that case) or whether such a task is performed only for integers (in which case the code could be optimised). It works with single- and multi-band rasters.

How to use

The code provided below should be copied into a text file and saved with the '.py' extension. Enable the Python console in QGIS, hit the 'Show editor' button in the console and open the script file. Then replace 'raster_path' with the actual path to the raster and hit the 'Run script' button. In the console output you will see a sorted list of the unique values of the raster, per band:

QGIS Python console view
Script body (to the right) and its output (to the left)

Script itself

#!/usr/bin/python -tt
#-*- coding: UTF-8 -*-

'''
#####################################################################
This script is designed to compute unique values of the given raster
#####################################################################
'''

from osgeo import gdal
import sys
import math

path = "raster_path"

gdalData = gdal.Open(path)
if gdalData is None:
  sys.exit( "ERROR: can't open raster" )

# get width and heights of the raster
xsize = gdalData.RasterXSize
ysize = gdalData.RasterYSize

# get number of bands
bands = gdalData.RasterCount

# process the raster
for i in xrange(1, bands + 1):
  band_i = gdalData.GetRasterBand(i)
  raster = band_i.ReadAsArray()

  # create dictionary for unique values count
  count = {}

  # count unique values for the given band
  for col in range( xsize ):
    for row in range( ysize ):
      cell_value = raster[row, col]

      # check if cell_value is NaN
      if math.isnan(cell_value):
        cell_value = 'Null'

      # add cell_value to dictionary
      count[cell_value] = count.get(cell_value, 0) + 1

  # print results sorted by cell_value
  for key in sorted(count.iterkeys()):
    print "band #%s - %s: %s" %(i, key, count[key])
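The per-band counting loop above can also be written more compactly with collections.Counter; here is a sketch of that idea (with a plain nested list standing in for the array returned by ReadAsArray()):

```python
from collections import Counter

def count_unique(band):
    # band: 2D sequence of cell values, one inner sequence per row
    counts = Counter()
    for row in band:
        counts.update(row)
    return dict(counts)

band = [[1, 1, 2],
        [2, 2, 3]]
print(count_unique(band))  # {1: 2, 2: 3, 3: 1}
```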

Saturday, August 3, 2013

Classification of the Hyper-Spectral and LiDAR Imagery using R (mostly). Part 2: Classification Approach and Spectral Profile Creation

Introduction

This is the second part of my post series related to hyper-spectral and LiDAR imagery using R. See other parts:

In this part I will describe my general approach to the classification process and then I will show you how to create cool spectral response plots like this:
spectral response graph
Colour of boxes corresponds to the band wavelength


Classification approach

I decided to use the Random Forest algorithm for the per-pixel classification. Random Forest is a reliable machine-learning algorithm and I have already used it successfully in my projects. Notice that the winner of the contest used Random Forest too, but with a per-object classification approach.

Random Forest is implemented in R in the 'randomForest' package, and the cool thing is that the 'raster' package is able to apply it to rasters. Not only the initial 144 hyper-spectral bands + LiDAR were fed to Random Forest: several indices and DSM derivatives were used in addition. Here is what I did for the classification:
  1. Cloud shadow identification and removal.
  2. Self-casted shadow identification and removal.
  3. Creation of the spectral profiles for the 16 classes. This is needed to choose which bands will be used for index calculation.
  4. Indices calculation: NDVI, NHFD, NDWI.
  5. DSM processing: elevated objects extraction, DSM to DTM conversion, calculation of slope, aspect, normalised DEM, topographic position and terrain roughness indices.
  6. Random Forest model creation and adjustment. Classification itself.
  7. Final filtering using python-GDAL.
  8. Result evaluation.

Spectral profiles

For the calculation of indices one needs to choose the bands to use as input. Which ones to pick if there are 144 bands in the hyper-spectral image? The spectral profile graphs should help with that. But these won't be the common spectral profiles we are all used to, like this one:
Spectral profile from one of my researches
We need to see the dispersion of band values for each class much more than the mean values, because dispersion will significantly affect the classification. If some classes have overlapping values in the same band, it will decrease the accuracy of the classification. OK, in our spectral profile we should see the wavelength, the band number and the dispersion of the given classes, to decide which bands to pick for index calculation. This means that box plots will be used instead of line plots. Also, the boxes should be coloured according to the corresponding wavelength.

Needed libraries

library(ggplot2)
library(reshape2)
library(raster)
library(rgdal)

Load raster and take samples using .shp-file with classes

# Load raster
image <- stack('~/EEEI_contest/all_layers.tif')

# Load point .shp-file with classes
s_points_dFrame <- readOGR( '~/EEEI_contest',
                    layer="sampling_points",
                    p4s="+proj=utm +zone=15 +datum=NAD83 +units=m +no_defs +ellps=GRS80 +towgs84=0,0,0")
s_points <- SpatialPoints(s_points_dFrame)
dFrame <- as.data.frame(s_points_dFrame)

# Extract data at sampling points
probe <- extract(image, s_points, method='bilinear')

# Combine sampling results with original data 
sample <- cbind(dFrame, probe)

Spectral profile creation

At this point we have 'sample' dataframe that looks like this:
'pattern' is a class name and 'pattern_id' is the integer that corresponds to each class. These fields, as well as 'lat' and 'lon', belong to the .shp-file. The other fields were created by the 'extract' function.
# get numbers for bands instead of names in headers of columns
for (i in c(7:ncol(sample))){
  colnames(sample)[i] <- i-6
}

Creation of the proper names for bands and palette creation

# preparation for palette creation (establish wavelengths)
palette <- c()
violet <- c(380:450)
blue <- c(452:495)
green <- c(496:570)
yellow <- c(571:590)
orange <- c(591:620)
red <- c(621:750)
NIR <- c(750:1050)

# process band names (future captions) and palette
for (i in c(7:150)){
  # calculate wavelength for a given band
  wave <- (i-7)*4.685315 + 380
  wave <- round(wave, digits = 0)
  
  # rename colunms in 'sample'
  band_num <- paste('(band', i-6, sep = ' ')
  band_num <- paste(band_num, ')', sep = '')
  colnames(sample)[i] <- paste(wave, band_num, sep = ' ')
  
  # add value to palette
  if (is.element(wave, violet)) {palette <- c(palette, '#8F00FF')   
  } else if (is.element(wave, blue)) {palette <- c(palette, '#2554C7')   
  } else if (is.element(wave, green)) {palette <- c(palette, '#008000')   
  } else if (is.element(wave, yellow)) {palette <- c(palette, '#FFFF00')   
  } else if (is.element(wave, orange)) {palette <- c(palette, '#FF8040')   
  } else if (is.element(wave, red)) {palette <- c(palette, '#FF0000')   
  } else if (is.element(wave, NIR)) {palette <- c(palette, '#800000')   
  }
  
}
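The wavelength formula above maps the 144 bands linearly onto the 380-1050 nm range. A quick sanity check of that mapping (a standalone sketch; the step of ~4.685 nm per band is taken from the R code above, and band numbers here run from 1 to 144, matching i-6 in the loop):

```python
def band_wavelength(band):
    # linear mapping: band 1 -> 380 nm, step ~4.685 nm per band
    return round((band - 1) * 4.685315 + 380)

print(band_wavelength(1))    # 380 (violet end of the range)
print(band_wavelength(144))  # 1050 (NIR end of the range)
```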

# Remove unneeded fields
sample <- subset(sample, select = 1:150)
sample[, 5] <- NULL
sample[, 5] <- NULL
sample <- sample[,3:148]
Now our dataframe looks like this:

Reshape dataframe for future plotting:
sample <- melt(sample, id = c('pattern', 'pattern_id'))
Now our dataframe looks like this:

We want to create spectral profiles for each class, which means we need a plotting function. Keep in mind that 'ggplot2' doesn't work with local variables, so we need to create a generic plotting function and run it in a loop that creates global subsets of our dataframe:
plotting <- function(data_f, plot_name){
  p <- ggplot(data_f, aes(data_f[,3], data_f[,4])) +
       geom_boxplot(aes(fill = data_f[,3]),
                    colour = palette,
                    # make outliers have the same colour as lines    
                    outlier.colour = NULL) +
       theme(axis.text.x = element_text(angle=90, hjust = 0),
             axis.title = element_text(face = 'bold', size = 14),
             title = element_text(face = 'bold', size = 16),
             legend.position = 'none') +
       labs(title = paste('Spectral response for class\n',
                          plot_name,
                          sep = ' '),
            x = 'Wavelength, nm',
            y = 'Response')+
       scale_fill_manual(values = palette)
  print(p)
}

Final stage: plotting (EDIT: name of the dataframe was fixed to match previous code parts)
# get unique classes

# unique class numbers:
u <- unique(sample[,'pattern_id'])
# unique class names:
u_names <- unique(unlist(sample[,'pattern']))

for (i in u){
  # works only if class numbers (u) are consecutive integers from 1 to u
  data_f <- subset(sample, pattern_id == i)  
  plot_name <- u_names[i]
  plotting(data_f, plot_name)
}

If you are curious, here are the rest of the plots: