Using PRIO-GRID in R

Walkthrough for using the PRIO-GRID data in R.

Using PRIO-GRID in R

Walkthrough for using the PRIO-GRID data in R.

Introduction

I’ve started working on a new project that relies on the PRIO-GRID data, a truly unique and kinda mind-boggling effort to systematize spatial data analysis. Basically, PRIO-GRID breaks up the Earth into a grid of little squares, or cells, each of which is 0.5 x 0.5 decimal degrees (roughly 50 x 50 km at the equator; you can read more about the data here). They then conformed a wide variety of available spatial data to this grid system, giving you access to things like the population of each 50 x 50 km square everywhere in the globe. Very cool.

The data come in three flavors:

  1. Shapefiles: information for plotting and doing spatial analysis on the grid system
  2. Static variables: data at the cell-level on stuff that doesn’t change (e.g., how mountainous that cell is)
  3. Yearly variables: data at the cell-level that changes over time (e.g., rain, population, etc.)

Interacting with the shapefiles will require some specific R packages.

R for Spatial Data

There are two main packages for dealing with spatial data in R:

  • spdep: older, richer set of functions, not-tidy
  • sf: newer, fewer functions, tidy and beautiful

I would rather only use sf, but there are some nice functions in spdep we’ll need to use in this example that I’m not sure I can replicate in sf.

First, we read the shape files into R using st_read. For some reason, the function call separates the file path into the folder the shape file is in (dsn), and then the shape file itself (layer).

# read data
prio = st_read(dsn = "../../../../../data/priogrid_cellshp", 
                layer = "priogrid_cell", 
                stringsAsFactors = F) %>% 
  mutate(gid = as.character(gid))
## Reading layer `priogrid_cell' from data source `/Users/JuamnTellez/Dropbox/data/priogrid_cellshp' using driver `ESRI Shapefile'
## Simple feature collection with 259200 features and 5 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: -180 ymin: -90 xmax: 180 ymax: 90
## CRS:            4326

Let’s look at the data:

prio %>% 
  head()
## Simple feature collection with 6 features and 5 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 163.5 ymin: 89.5 xmax: 166.5 ymax: 90
## CRS:            4326
##      gid xcoord ycoord col row                       geometry
## 1 259168 163.75  89.75 688 360 POLYGON ((163.5 89.5, 163.5...
## 2 259169 164.25  89.75 689 360 POLYGON ((164 89.5, 164 90,...
## 3 259170 164.75  89.75 690 360 POLYGON ((164.5 89.5, 164.5...
## 4 259171 165.25  89.75 691 360 POLYGON ((165 89.5, 165 90,...
## 5 259172 165.75  89.75 692 360 POLYGON ((165.5 89.5, 165.5...
## 6 259173 166.25  89.75 693 360 POLYGON ((166 89.5, 166 90,...

This is where sf really shines. As you can see, the data is just a tibble, with the spatial polygons stored as a column of lists in geometry. What this does is foreground the stuff you care about (i.e., the unit level characteristics, such as covariates, and outcomes) and sticks the stuff you mostly use for plotting and other stuff in that last column.

Compare this to what you get from rgdal::readOGR, which is a “SpatialPolygonsDataFrame” class object that is also slower to work with.

The PRIO-GRID data is huge because it covers the whole globe. This makes it harder to work with, especially once you are trying to do any kind of modeling. One way to make the data more manageable is to just focus on one region of the globe (e.g., Africa), or, as I will show here, just get rid of all the water.

Getting Rid of the Water

Yes, the PRIO-GRID includes all of the world’s oceans, which means who-knows-how-many grids that you have no use for (I guess unless you are maybe studying maritime conflict?). Unfortunately, the PRIO-GRID data (at time of writing) does not include a simple way to distinguish grids based on whether or not they are water or land.

One quick way to exclude irrelevant cells is to merge in the yearly PRIO-GRID variables on gid, and use the Gleditch and Gleditsch and Ward country codes (gwno) to identify what country each cell belongs to. Any cell without a gwno is water, so we drop it.

# read in PRIO yearly data
prio_year = 
  # just want gid and gwno to identify water
  vroom("../../../../../data/priogrid_cellshp/prio-grid-yearly.csv", 
        col_select = c("gid", "gwno", "year", "pop_gpw_sum")) %>% 
  # we got multipe years of data and changing country, 
  # so i'm arbitrarily picking all countries in most recent year
  filter(year == 2005) %>% 
  mutate(gid = as.character(gid)) %>% 
  select(-year)


# merge in gwno to identify country grids
prio = left_join(prio, prio_year)
head(prio)
## Simple feature collection with 6 features and 7 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 163.5 ymin: 89.5 xmax: 166.5 ymax: 90
## CRS:            4326
##      gid xcoord ycoord col row gwno pop_gpw_sum                       geometry
## 1 259168 163.75  89.75 688 360   NA          NA POLYGON ((163.5 89.5, 163.5...
## 2 259169 164.25  89.75 689 360   NA          NA POLYGON ((164 89.5, 164 90,...
## 3 259170 164.75  89.75 690 360   NA          NA POLYGON ((164.5 89.5, 164.5...
## 4 259171 165.25  89.75 691 360   NA          NA POLYGON ((165 89.5, 165 90,...
## 5 259172 165.75  89.75 692 360   NA          NA POLYGON ((165.5 89.5, 165.5...
## 6 259173 166.25  89.75 693 360   NA          NA POLYGON ((166 89.5, 166 90,...

With the gwno identifier merged in, we can now identify land/water and even subset to specific countries or sets of countries. Note that I also merged in a population estimate for plotting below. The sf package also works nicely with ggplot. For instance, here’s GRID-level population in Venezuela, below:

# plot Venezuela as sanity check
prio %>%
  filter(gwno == 101) %>% 
  #mutate(pop_cut = cut_number(pop_gpw_sum, 10)) %>% 
  ggplot(aes(fill = pop_gpw_sum)) + 
  geom_sf() + 
  scale_fill_viridis_c(option = "inferno", 
                       end = .9, 
                       trans = "log10", 
                       labels = scales::comma, 
                       name = "") + 
  labs(title = "GRID-level population in Venezuela") + 
  theme(panel.grid.major = element_blank())

ggsave(filename = "featured.png", device = "png")

We can also drop all the water, as promised:

## throw out water tiles: places with no gwno == not countries
prio_land = 
  prio %>% 
  drop_na(gwno)

We’re now down to a dataset that is 25% of its original size, which will help a lot down the road.

Finding Neighbors

Doing most kinds of spatial analysis requires knowning how the different units in your data relate to one another. With spatial data, it’s common to look at whether two units are neighbors or not. Being a neighbor might be defined as sharing a common border, the distances between the capitals, or centroids, or… whatever. The often arbitrary nature of this choice is one of the tricky aspects of spatial modeling.

To describe these relationships, we use an N by N adjacency matrix, where N is the number of units in our data and any entry (i,j) tells us whether units i and j are neighbors or not.

To do this in R, I find the easiest thing is (unfortunately) to switch from sf to spdep package and use the spdep::poly2nb() function.

# convert back to sp object to use nice sp dep functions
prio_land_sp = as(prio_land, Class = "Spatial")

Below, we specify queen == FALSE because we want to use Rook’s contiguity (as opposed to Queen’s). You should look into the difference between these two (also, perhaps, another arbitrary decision point). Note that this step takes a long time.

# get neighbors
nb = poly2nb(prio_land_sp, 
             queen = FALSE, 
             row.names = prio_land_sp$gid)

The output of our nb object looks like this, and tells us a bunch of information about what neighborhood relations among our units looks like, including whether any units were found to have no neighbors:

load("../../../static/nb.rda")
nb
## Neighbour list object:
## Number of regions: 64818 
## Number of nonzero links: 249856 
## Percentage nonzero weights: 0.005947008 
## Average number of links: 3.854732 
## 39 regions with no links:
## 211700 209569 165491 154315 147345 146125 144700 144508 143252 143106 142536 141627 140378 140383 140186 139637 138217 137308 133811 132987 132268 132135 131746 127298 122245 121546 120953 120955 119518 119418 114426 113014 105121 101287 100780 98650 90142 62356 58318

Doing this with sf package

There’s actually a way to do this without leaving sf, though I found it on a Github comment thread, and without digging more into it am not sure how robust it is (though it does pass general sniff test).

# let's do this again, but with sf package; faster, but seems dodgier, based off a github comment:
# https://github.com/r-spatial/sf/issues/234
st_rook = function(a, b = a) st_relate(a, b, pattern = "F***1****")
prio_land = prio_land %>% mutate(NB_ROOK = st_rook(.))

Checking things turned out alright

Something prudent at this point would be to check whether the neighborhood list we got actually makes sense.

First, let’s look at a place that the nb object tells us has no neighbors:

# plot place with no neighbors, Alaskan island
prio_land %>% 
  filter(gwno == 2) %>% 
  ggplot() + 
  geom_sf() + 
  # add the place with no neighbors
  geom_sf(data = filter(prio_land, gid == 211700), fill = "red") + 
  # zoom in on location
  coord_sf(xlim = c(-180, -140), ylim = c(50, 70), expand = FALSE)

Checks out: looks like some tiny Alaskan island that shouldn’t have neighbors.

Now, how about a place that should have neighbors? Let’s look at a random cell in Canada:

# plot place and its neighbors (in canada) # CHECKS OUT!
prio_land %>% 
  filter(gwno == 20) %>% 
  ggplot() + geom_sf() + 
  # random place in canada
  geom_sf(data = filter(prio_land, gid == 249328), fill = "red") + 
  # neighbors from nb object
  geom_sf(data = filter(prio_land, gid %in% c(249329,249327, 248608)), fill = "blue") + 
  # zoom in on location
  coord_sf(xlim = c(-100, -50), ylim = c(75, 85), expand = FALSE)

A bit hard to make out, but if you zoom in you will see that the nb object correctly identified this GRID’s neighbors.

Autocorrelation

Having an adjacency matrix of relations between units opens up possibilities for fitting different kinds of spatial models to your data. A common first step is to test whether the variable you are interested in exhibits a high degree of spatial clustering, or “autocorrelation”, which has implications for what kind of model you should fit to your data.

The first step is converting our nb object to a list using nblistw(). We include zero.policy = TRUE because there are a few unconnected cells in our data. The style argument is important here: it defines how you weigh the connections in your data. A common approach is to row-standardize the adjacency matrix using style = "W". The strength of this approach vs. others is another decision-point in spatial analysis.

# store as list (most modeling packages require this)
lw = nb2listw(nb, style = "W", zero.policy = TRUE)
print(lw, zero.policy = TRUE) ## to look at lw contents
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 64818 
## Number of nonzero links: 249856 
## Percentage nonzero weights: 0.005947008 
## Average number of links: 3.854732 
## 39 regions with no links:
## 211700 209569 165491 154315 147345 146125 144700 144508 143252 143106 142536 141627 140378 140383 140186 139637 138217 137308 133811 132987 132268 132135 131746 127298 122245 121546 120953 120955 119518 119418 114426 113014 105121 101287 100780 98650 90142 62356 58318
## 
## Weights style: W 
## Weights constants summary:
##       n         nn    S0      S1       S2
## W 64779 4196318841 64779 34305.1 260102.1

A common test of spatial autocorrelation is the Moran’s I test, which can be implemented with the spdep::morant.test() function. Below, we test whether there is spatial autocorrelation in population:

moran.test(prio_land_sp$pop_gpw_sum, listw = lw, zero.policy = TRUE)
## 
##  Moran I test under randomisation
## 
## data:  prio_land_sp$pop_gpw_sum  
## weights: lw  n reduced by no-neighbour observations
##   
## 
## Moran I statistic standard deviate = 212.41, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      6.055005e-01     -1.543734e-05      8.126835e-06

There’s clearly strong autocrrelation (.61) which is statistically significant (p-value is basically zero). You can also apply the Moran’s I test to a vanilla model you’ve already fit as a test of whether a spatial econometric model would be more appropriate.

Conclusion

That’s just the tip of the iceberg but a good place to get started. Big thanks to Rob Williams and Jeremy Springman for help getting me started!

Avatar
Juan F. Tellez
Assistant Professor of
Political Science

I research and teach on armed conflict, security and development, and quantitative methods.