It is an amusing coincidence that another MOOC I took this week (Geospatial Intelligence & the Geospatial Revolution) also mentioned [natural] disasters. For the other course, see my recent post Disasters: Myth or the Reality.
In Geospatial Intelligence they gave a weird assignment: one needs to mark the location on the world map where the next international natural disaster will occur O_o. This is not an easy task by any means, and the lecturer suggested using one's 'gut feeling' if one's knowledge is insufficient (I suppose it is close to impossible to find someone who can make such a prediction taking all types of disasters into account). A link to the International Disasters Database was given, though, so I accepted the challenge (to make a data-driven prediction). To predict the exact location of the next disaster one would need a lot of data, far more than you can get out of that database, so my goal was to make a prediction at the country level. (BTW, the graphs from my post about disasters seem to be based on the data from this database: I saw one of them on that site.)
I passed a query to the database and saved the output to process it with R. The dataframe looks like this:
Example of disasters dataset
So how do you predict the country with the next disaster? I came up with the idea of calculating the cumulative average occurrence of disasters per country per year and plotting it to see the trends. If I just calculated the average occurrence of disasters per country over the whole observation period, I would have significant trouble choosing between countries with close numbers. Plus, the overall average per year can be misleading by itself: it can be high because of a large number of disasters at the beginning of the 20th century even if the number in the 21st century is relatively low.
The formula for the calculation of the cumulative average for the given year that I used was:
Cumulative_Average = Total_Occurrences / ( Given_Year - (Starting_Year - 1) ) ,
where Total_Occurrences is the sum of occurrences of disasters for the given country in the time interval between the starting year and the given year (inclusive).
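To make the formula concrete, here is a minimal sketch in R on made-up yearly counts for a single country. The numbers and the names `years` and `occurrences` are hypothetical and are not taken from the database.

```r
# Hypothetical yearly disaster counts for one country (made-up numbers)
years <- 1900:1904
occurrences <- c(2, 0, 1, 0, 3)

starting_year <- min(years)

# Total_Occurrences: running sum from the starting year up to each given year
total_occurrences <- cumsum(occurrences)

# Cumulative_Average = Total_Occurrences / (Given_Year - (Starting_Year - 1))
cumulative_average <- total_occurrences / (years - (starting_year - 1))

print(data.frame(year = years, cumulative_average = cumulative_average))
```

Subtracting `Starting_Year - 1` rather than `Starting_Year` keeps the divisor at 1 for the first year, so the first value is simply that year's count.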
Here is the plot I got for the short-list countries (plotting the results for all 180 countries from the dataset makes the plot unreadable):
Cumulative average number of disasters
It is clear that China and Indonesia are the two most likely candidates for the next disaster to strike, with China in the lead. I'm not ready to provide insight into the reasons for the increasing number of natural disasters in the countries on the plot (especially Turkey and Iran). Maybe it is just that events are documented more often now?... It should be investigated further.
The code
Here is the code to create the plot above. The 'sqldf' package was really helpful for separating the data for the short-list countries from the rest of the 180 countries.
library(ggplot2)
library(sqldf)
library(grid)   # provides unit() for the legend key sizes
#library(gridExtra)

# Load natural disasters data ---------------------------------------------
dis <- read.csv("~/R/Disasters/Natural_disasters.csv")

# Create data frame with average number of disasters per year -------------
average_events <- data.frame(country = character(),
                             year = numeric(),
                             disasters_per_year = numeric(),
                             stringsAsFactors = FALSE)

countries <- unique(dis$country)
starting_year <- min(dis$year) - 1 # we subtract 1 year to have numbers greater than 0 further on

for (country in countries) {
  data <- dis[dis$country == country, ] # we need data for one country at a time
  disasters_count <- 0
  years <- sort(unique(data$year)) # process years in order so the running total is correct
  for (year in years) {
    total_years <- year - starting_year
    y_data <- data[data$year == year, ]
    n_disasters <- sum(y_data$occurrence)
    disasters_count <- disasters_count + n_disasters
    average_disasters <- disasters_count / total_years
    row <- data.frame(country = country,
                      year = year,
                      disasters_per_year = average_disasters,
                      stringsAsFactors = FALSE)
    average_events <- rbind(average_events, row)
  }
}

# Plot data about average number of disasters per country per year --------
# Data for 180 countries is hard to plot, let's filter the most affected.
# Let's use SQL to query data: subset data for countries that had at least 0.6 disasters per year
# in any year after 2000
danger <- sqldf('SELECT * FROM average_events
                 WHERE country IN
                   (SELECT DISTINCT country
                    FROM average_events
                    WHERE disasters_per_year >= 0.6 AND year > 2000)')

p <- ggplot(danger, aes(x = year, y = disasters_per_year)) +
  geom_line(size = 1.2, aes(colour = country, linetype = country)) +
  labs(title = 'Cumulative average number of disasters per year',
       x = 'Year',
       y = 'Average number of disasters cumulative') +
  # guides() needs named arguments, one per aesthetic shown in the legend
  guides(colour = guide_legend(keywidth = 3, keyheight = 1),
         linetype = guide_legend(keywidth = 3, keyheight = 1)) +
  theme(axis.text.x = element_text(angle = 0, hjust = NULL),
        axis.title = element_text(face = 'bold', size = 14),
        title = element_text(face = 'bold', size = 16),
        legend.position = 'right',
        legend.title = element_blank(),
        legend.text = element_text(size = 12),
        legend.key.width = unit(1.5, 'cm'),
        legend.key.height = unit(1, 'cm'))

plot(p)
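The same per-country cumulative average and short-list filter can also be sketched without explicit loops or SQL, using base R only. This is an alternative, not the code used for the plot. It assumes the same column names (`country`, `year`, `occurrence`); the toy data frame `dis_toy`, the threshold and the cutoff year are made up for illustration.

```r
# Toy stand-in for the disasters data (made-up values; same column names as `dis`)
dis_toy <- data.frame(
  country    = c("A", "A", "A", "B", "B"),
  year       = c(1900, 1902, 1902, 1901, 1903),
  occurrence = c(1, 2, 1, 4, 2),
  stringsAsFactors = FALSE
)

# Total occurrences per country and year, ordered by country and then year
yearly <- aggregate(occurrence ~ country + year, data = dis_toy, FUN = sum)
yearly <- yearly[order(yearly$country, yearly$year), ]

# Years elapsed since the first year in the whole dataset (inclusive)
starting_year <- min(dis_toy$year) - 1
total_years <- yearly$year - starting_year

# Running total within each country, divided by the elapsed years
running_total <- ave(yearly$occurrence, yearly$country, FUN = cumsum)
yearly$disasters_per_year <- running_total / total_years

# Short list: countries that reach the threshold in any year after a cutoff
# (threshold 1 and cutoff 1901 are arbitrary values chosen for the toy data)
hit <- yearly$disasters_per_year >= 1 & yearly$year > 1901
danger_toy <- yearly[yearly$country %in% unique(yearly$country[hit]), ]
print(danger_toy)
```

Like the loop version, this only produces a value for years in which a country has at least one record, so the lines on the plot connect those years.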