Wednesday, April 1, 2015

On why ArcMap gets quantiles wrong

Recently, I discovered that ArcMap gets quantiles wrong, sometimes badly wrong.  I'm sure there are many cases where ArcMap gets quantiles right, but nearly every one I've encountered lately has been wrong.  I think it is important that every Esri user be aware of this limitation and the potential remedy.

When I dug deeper, I found that ArcMap seems to get it wrong when distributions are highly skewed.  That leaves me wondering whether Esri is aware that its quantiles algorithm just isn't cutting it for most scientific applications, in which skewed data is common, if not the norm.

Below I show an example from a colleague of mine.
The image to the right is the result of four Brownian bridge movement models, one for each of four individual deer, combined into one average. Purple indicates high values: areas where the biologists are nearly 100% certain that the deer traveled. Yellow shows less probable paths.

When I ran a quantiles classification after removing zeros and isolated the top 10% of values, I noticed that the class held only 2,676 cells out of 474,135 total.  That is less than 1% (about 0.56%), not even close to 10%.
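As a quick sanity check on those numbers, here is a minimal Python sketch of the arithmetic. The variable names are mine, chosen for illustration; the two counts are the ones reported above.

```python
# Counts reported above (variable names are illustrative only)
top_class_cells = 2676     # cells ArcMap put in the "top 10%" class
total_cells = 474135       # non-zero cells in the raster

# Share of cells that actually landed in the top class
share = top_class_cells / total_cells

# Number of cells a true top-10% class should hold (rounded down)
expected_cells = total_cells // 10

print(f"ArcMap top class: {share:.2%} of cells")
print(f"A true top 10% would hold about {expected_cells} cells")
```

The gap between the two counts is what tipped me off that something was wrong.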

The remedy for this situation is to convert the raster data to vector.  You can use either points or polygons.  I tend to use polygons.  The workflow goes something like this:

1.  Remove zero values using SetNull
2.  Multiply by 10,000,000
3.  Convert to integer
4.  Convert from raster to polygon or point
5.  Use the sort tool to sort descending by the grid_code
6.  Calculate the cumulative values using the following Python code block in the field calculator:
total = 0  # running total, kept across rows
def cumsum(inc):
    # add this row's value to the running total and return it
    global total
    total += inc
    return total
7.  Select the top 10%, 25%, 50%, 90%, etc.
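For readers who would rather check the idea outside ArcMap, the same logic can be sketched with NumPy. This is only a sketch under assumptions, not the workflow above: it assumes the raster is already loaded as a NumPy array, it takes "top 10%" to mean the 10% of non-zero cells with the highest values, and the function name `top_fraction_mask` and the synthetic lognormal data are made up for illustration. Ties at the cutoff are broken by cell order.

```python
import numpy as np

def top_fraction_mask(values, fraction=0.10):
    """Mark the highest-valued `fraction` of non-zero cells (hypothetical helper)."""
    flat = np.asarray(values, dtype=float).ravel()
    keep = np.flatnonzero(flat != 0)                       # like step 1: drop zeros
    order = keep[np.argsort(-flat[keep], kind="stable")]   # like step 5: sort descending
    # like step 6: running share of cells seen so far, from highest value down
    running_share = np.arange(1, order.size + 1) / order.size
    selected = order[running_share <= fraction]            # like step 7: keep the top share
    mask = np.zeros(flat.size, dtype=bool)
    mask[selected] = True
    return mask.reshape(np.shape(values))

# Synthetic, highly skewed stand-in for a real raster (not the deer data)
rng = np.random.default_rng(0)
demo = rng.lognormal(mean=0.0, sigma=2.0, size=(100, 100))
demo[rng.random(demo.shape) < 0.3] = 0.0                   # sprinkle in zero cells

mask = top_fraction_mask(demo, fraction=0.10)
print(mask.sum(), "of", np.count_nonzero(demo), "non-zero cells selected")
```

Because the selection is made by rank rather than by class breaks, the selected share stays close to the requested fraction no matter how skewed the values are.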

The resulting map (left) illustrates the difference. Black areas show the total movement paths, red is the top 10% using the vector-based approach, and yellow (barely visible) is ArcMap's quantiles classification.  The new cell count is 47,413, which is the top 9.99% of cells.  This is pretty good in my opinion!

The lesson here: beware of quantiles when your data is highly skewed, and always double-check your work.  Thanks to Marcus Blum for providing the data for this example.



2 comments:

  1. Ran into the same problem. Monumental waste of time. I managed to deal with it in R using raster package.
    z<-raster(testrast)
    r<-quantile(z,probs = c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1))

  2. Hi Michelle,

    Thanks for providing the R solution. Much cleaner with only two lines of code. Did you find my solution to be a waste of time, or was it the fact that quantiles in ArcGIS provided the wrong answer that led to the monumental waste of time?
