A visual introduction to rollply

library(proj4) 
library(ggplot2) 
library(rgdal)
library(tidyr) 
library(rollply) 
library(plyr) 

Rollply is a small function built upon plyr’s ddply function to facilitate moving-window-based computations. If you have a data.frame, give the dimensions over which the window should move, and rollply will make the subsets, apply the function on them and then combine the results into a data.frame.

In short, rollply extends the split-apply-combine strategy to moving-window computations, using a similar syntax. This tutorial thus assumes some basic familiarity with plyr, and in particular the function ddply. Let’s start with a simple example.

Examples

A time-series example

A simple use of moving-windows is adding a trendline to a time series plot. We will use the CO2 data from the Mauna Loa NOAA Observatory as a environmentally-conscious example.

# Download and format data
url <- "ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_mm_mlo.txt"
hawaii <- read.table(url)[ ,c(3,4)]
names(hawaii) <- c('date','CO2')
hawaii[hawaii$CO2 < 0, "CO2"] <- NA # mark NAs as such

# Display original trend
CO2.plot <- ggplot(hawaii) + geom_line(aes(date, CO2)) + ylab("CO2 (ppm)")
print(CO2.plot)

There is a clear trend here! Let’s smooth out the season effect (the wiggles in the black curve). We’ll use a window with a size of one year to compute a yearly average.

# with smoothed trend
hawaii.smoothed <- rollply(hawaii, ~ date, wdw.size = 1,
                           summarize, CO2.mean = mean(CO2, na.rm = TRUE), )
CO2.plot + geom_line(aes(date, CO2.mean), data = hawaii.smoothed, color = 'red')

And voilà! A rather nice, although a bit depressing trend line for our data. When working on time-series, this represents an alternative to specialized packages such as zoo that also provide tools to apply functions over rolling windows.

Let’s take a more complex example that works on two-dimensional data.

Exploring french town names

If you open a map of France, you’ll notice that towns and villages tend to have names that follow patterns. For example, Brittany’s towns are famous for having names starting with a “ker-”. Many towns in Lorraine end in “-ange” (a legacy from the german ending “-ingen”).

Can we visually explore the distribution of french towns names ? rollply can help here.

A moving-window approach essentially boils down to the following steps:

  1. Build a grid over the whole country
  2. For each point of the grid, take all the data poitns (towns) that are less than xx kilometers from it.
  3. Check the names of the towns and count the ones matching a pattern. Return the number of matching towns.
  4. Combine the results in a data.frame

Like ddply (in package plyr), rollply takes care of points 1,2 and 4. We just need to define a function that does number 3.

Let’s download a dataset of town names with their geographical coordinates:

# Download and prepare dataset
# Source and decription: 
# https://publicdata.eu/dataset/listes-des-communes-par-rgions-dpartements-circonscriptions

tmpfile <- tempfile()
url <- paste0('http://www.nosdonnees.fr/wiki/images/b/b5/',
         'EUCircos_Regions_departements_circonscriptions_communes_gps.csv.gz')
download.file(url, destfile = tmpfile)
dat <- read.csv2(tmpfile, stringsAsFactors = FALSE)
file.remove(tmpfile)
dat <- dat[with(dat, latitude != "" | ! grepl(",", latitude)), 
           c('nom_commune', 'latitude', 'longitude')]
colnames(dat) <- c('name', 'latitude', 'longitude')

dat[ ,'name']      <- as.factor(tolower(dat[ ,'name']))
dat[ ,'latitude']  <- as.numeric(dat[ ,'latitude'])
dat[ ,'longitude'] <- as.numeric(dat[ ,'longitude'])

# We use an equirectangular projection to work on true distances
dat <- na.omit(dat)
dat <- data.frame(dat, proj4::project(dat[ ,c('longitude','latitude')],
                                      '+proj=eqc'))
dat <- dat[ ,c('name','x','y')]
# Visualise distribution of towns
str(dat)
## 'data.frame':    33814 obs. of  3 variables:
##  $ name: Factor w/ 34141 levels "aast","abainville",..: 1264 2487 2798 2821 3515 4005 4708 5689 5730 6645 ...
##  $ x   : num  574507 585626 587480 561534 600452 ...
##  $ y   : num  5146469 5159442 5152029 5155736 5129790 ...
ggplot(dat) + geom_point(aes(x, y), alpha=.1)

Nice, let’s see whether “ker”-named towns mainly occur in Brittany.

# This is our custom function : it accepts a data frame and a regular 
# expression and returns the number of matches in the column "name", formatted
# withing a data.frame (this plays well with ddply).
how_many_with_name <- function(df, regexp) { 
  data.frame(ker = sum(grepl(regexp, df[ ,'name'])))
}

dat_roll <- rollply(dat, ~ x + y, wdw.size = 1e4, grid_npts = 10e3,
                    how_many_with_name, regexp = "^ker")

ggplot(dat_roll) +
  geom_raster(aes(x, y, fill = ker)) +
  scale_fill_distiller(palette = 'Greys')

It seems there are indeed many towns with a name starting with “ker” in Brittany (and a couple in Alsace/Lorraine, too). However, our plot is pretty ugly: we cannot see the actual country shape! When nothing is specified, rollply computes its values over a rectangular grid than spans the maximum width and height of the original dataset.

Here, the spatial distribution of towns reflects pretty nicely the overall shape of the country, so instead of a building a rectangular grid, we can choose to build it only within the alpha-hull of the set of towns.

dat_roll <- rollply(dat, ~ x + y, wdw.size = 1e4,
                    grid_npts = 10e3, # number of grid points 
                    grid_type = "ahull_fill", # grid type: fills an alpha-hull with the given number of points
                    grid_opts = list(alpha = .05, # shape parameter of the hull
                                     verbose = TRUE), 
                    how_many_with_name, regexp = "^ker")
## Building alphahull...
## run 1, error=-0.53% (4713 points)
## run 2, error=-0.27% (7267 points)
## run 3, error=-0.08% (9229 points)
## run 4, error=-0.01% (9942 points)

Note that building an alpha hull-based grid can be quite computationally expensive, as it requires iterating to find the suitable number of points. However, one can pregenerate grids using the build_grid_* functions family and supply them directly to rollply using the grid argument, so this computation can be only done once (see below).

So, are there really more town named ker-something in Brittany than elsewhere? I’ll let you judge (spoiler: yes!):

ggplot(dat_roll) +
  geom_raster(aes(x, y, fill = ker)) +
  scale_fill_distiller(palette = 'Greys')

Building grids: helper functions

As seen in the french towns example, rollply uses internally a grid of coordinates. For each points of this grid it selects the observations within the window, then applies the function on this subset. The user can either provide a grid as a data.frame or rollply will take care of building one automatically.

Several helper functions are provided to build nice grids, they all start with build_grid_.

For this example, we will use samples from a vegetation survey in a meadow in Yosemite National Park, California.

# We request a grid with approximately this number of points:
npts <- 500
base.plot <- ggplot(NULL, aes(x,y)) +
               geom_point(data = meadow, shape='+') +
               xlab('UTM X') +
               ylab('UTM Y') 

grids <- list(identical  = build_grid_identical(meadow[ ,c('x','y')], npts),
              squaretile = build_grid_squaretile(meadow[ ,c('x','y')], npts),
              ahull_crop = build_grid_ahull_crop(meadow[ ,c('x','y')], npts),
              ahull_fill = build_grid_ahull_fill(meadow[ ,c('x','y')], npts))

plot_grid <- function(grid_type) {
  base.plot +
    geom_point(data=grids[[grid_type]]) +
    annotate('text', x = min(meadow$x), y = min(meadow$y),
            label = paste(nrow(grids[[grid_type]]), "points"),
            hjust = 0, vjust = 1)
}
plot_grid('identical')

plot_grid('squaretile')

plot_grid('ahull_crop')

plot_grid('ahull_fill')

Note that the outline as given by the alpha-hull of a set of points depends on a parameter, alpha that determines its intrincateness. A reasonable default is provided, but it is a good idea to try different values to get an idea of the best fit. For more information on the alpha hull, please refer to the package documentation.

Notes

Performance considerations

rollply inherits plyr’s pros and cons. As in the latter’s functions, parallelism or progress report is just one argument away (set .parallel = TRUE or .progress = "time"). However, rollply does a lot of data.frame subsetting which remains an expensive operation in R.

Bugs & comments

rollply has well-known bugs and is still in active development! Development happens at github. Do not hesitate to post issues or pull requests.