As access to GIS and mapping tools becomes easier every year, more people and companies create maps. Unfortunately, they often do not know what they are actually showing on their maps. This issue gets mentioned over and over again.
Here is an example that I discovered recently: the Cyberthreat Real-Time Map by the antivirus company Kaspersky. Here is how it looks:
Among other info, they show the Infection rank for each country... based on total threats detected.... You may have already guessed what the fail is, but let me explain it anyway.
See, the №1 infected country is Russia, which is Kaspersky's home country and a place where this antivirus is quite popular. So we can conclude that the rankings that are supposed to demonstrate the severity of virus activity merely demonstrate the number of Kaspersky software installations across the globe.
Let's test this hypothesis. I don't have data on the number of Kaspersky installations per country, but it is safe to assume that this number is proportional to the population of the given country. It is also easier to get infection rankings for countries from the map than the number of threats detected. If I had the total threats per country, I would compare them to population. With only infection rankings available, it is more rational to compare them to population rankings instead. So I picked 27 random countries and compared their infection and population rankings. The result is shown in the plot below:
Infection rank vs. Population rank
The linear model is fairly close to Infection rank = Population rank. It is clear that what is presented as an Infection rank just reflects the total number of software installations per country and not the severity of the 'cyberthreat'. To get the actual Infection rank, the number of detected threats has to be normalised by the number of software installations.
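To make the normalisation argument concrete, here is a minimal sketch in Python. The numbers are made up purely for illustration (they are not Kaspersky data), and the country labels, the `SAMPLE` table and the helper names `rank_desc` and `spearman` are my own assumptions. It ranks countries by raw threat totals, then by threats per installation, and uses Spearman's rank correlation (the simple no-ties formula) to show how closely each ranking follows the installation ranking.

```python
# Illustrative, made-up numbers: NOT Kaspersky data.
# country label: (threats_detected, installations)
SAMPLE = {
    "A": (900_000, 3_000_000),
    "B": (400_000, 800_000),
    "C": (50_000, 50_000),
    "D": (120_000, 600_000),
}


def rank_desc(values):
    """Rank keys from 1 (largest value) downwards; ties are broken by name."""
    ordered = sorted(values, key=lambda k: (-values[k], k))
    return {k: i + 1 for i, k in enumerate(ordered)}


def spearman(rank_a, rank_b):
    """Spearman's rho for two rankings of the same keys, assuming no ties."""
    n = len(rank_a)
    d2 = sum((rank_a[k] - rank_b[k]) ** 2 for k in rank_a)
    return 1 - 6 * d2 / (n * (n * n - 1))


totals = {c: t for c, (t, _) in SAMPLE.items()}
installs = {c: i for c, (_, i) in SAMPLE.items()}
# Normalised measure: threats per installation.
rates = {c: t / i for c, (t, i) in SAMPLE.items()}

raw_rank = rank_desc(totals)        # what a "total threats" ranking shows
install_rank = rank_desc(installs)  # how widely the software is installed
rate_rank = rank_desc(rates)        # ranking after normalisation

print("raw rank vs installation rank:", spearman(raw_rank, install_rank))
print("normalised rank vs installation rank:", spearman(rate_rank, install_rank))
print("normalised ranking:", sorted(rate_rank, key=rate_rank.get))
```

In this toy table the raw ranking is identical to the installation ranking, while the normalised ranking puts the smallest country first. With real data, the installation counts (or, failing that, a proxy such as the number of computers or the population) would have to come from the vendor or another source.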
It's super-awesome having someone outside the infosec echo chamber pointing this out! My co-author, Jay Jacobs, and I did a similar analysis of the ZeroAccess botnet a couple years ago - http://rud.is/b/2012/10/08/diy-zeroaccess-geoip-analysis-so-what/ and also cover this pretty well in our book, Data-Driven Security http://dds.ec/amzn
If you do want some security data to play with, hit us up at @hrbrmstr & @jayjacobs on Twitter or bob at rudis dot net.
I don't think that software installations per country have much to do with this graph. Without any Kaspersky installations, we would also expect to see the same graph: more people in a country -> more computers -> more infections.
You have a point, and this logic was implied in my post. See... Does the total number of infections give us any useful information? No! It only tells us about the size of the population (and you mentioned it yourself). So the number of infections has to be normalised by population (by the number of computers, to be more precise) to be meaningful. In this particular case the data was acquired from a single source, Kaspersky antivirus, so instead of normalising by computers or population we have to use the number of Kaspersky installations to get the correct result (otherwise you won't be able to explain why Russia is the most infected country).