I’ve started working on a new project that relies on the PRIO-GRID data, a truly unique and kinda mind-boggling effort to systematize spatial data analysis. Basically, PRIO-GRID breaks up the Earth into a grid of little squares, or cells, each of which is 0.5 x 0.5 decimal degrees (roughly 50 x 50 km at the equator; you can read more about the data here). They then conformed a wide variety of available spatial data to this grid system, giving you access to things like the population of each 50 x 50 km square everywhere in the globe. Very cool.
The data come in three flavors:
Shapefiles: information for plotting and doing spatial analysis on the grid system
Static variables: data at the cell-level on stuff that doesn’t change (e.g., how mountainous that cell is)
Yearly variables: data at the cell-level that changes over time (e.g., rain, population, etc.)
Interacting with the shapefiles will require some specific R packages.
R for Spatial Data
There are two main packages for dealing with spatial data in R:
spdep: older, richer set of functions, not-tidy
sf: newer, fewer functions, tidy and beautiful
I would rather only use
sf, but there are some nice functions in
spdep we’ll need to use in this example that I’m not sure I can replicate in
First, we read the shape files into R using
st_read. For some reason, the function call separates the file path into the folder the shape file is in (
dsn), and then the shape file itself (
# read data prio = st_read(dsn = "../../../../../data/priogrid_cellshp", layer = "priogrid_cell", stringsAsFactors = F) %>% mutate(gid = as.character(gid))
## Reading layer `priogrid_cell' from data source `/Users/JuamnTellez/Dropbox/data/priogrid_cellshp' using driver `ESRI Shapefile' ## Simple feature collection with 259200 features and 5 fields ## geometry type: POLYGON ## dimension: XY ## bbox: xmin: -180 ymin: -90 xmax: 180 ymax: 90 ## CRS: 4326
Let’s look at the data:
prio %>% head()
## Simple feature collection with 6 features and 5 fields ## geometry type: POLYGON ## dimension: XY ## bbox: xmin: 163.5 ymin: 89.5 xmax: 166.5 ymax: 90 ## CRS: 4326 ## gid xcoord ycoord col row geometry ## 1 259168 163.75 89.75 688 360 POLYGON ((163.5 89.5, 163.5... ## 2 259169 164.25 89.75 689 360 POLYGON ((164 89.5, 164 90,... ## 3 259170 164.75 89.75 690 360 POLYGON ((164.5 89.5, 164.5... ## 4 259171 165.25 89.75 691 360 POLYGON ((165 89.5, 165 90,... ## 5 259172 165.75 89.75 692 360 POLYGON ((165.5 89.5, 165.5... ## 6 259173 166.25 89.75 693 360 POLYGON ((166 89.5, 166 90,...
This is where
sf really shines. As you can see, the data is just a
tibble, with the spatial polygons stored as a column of lists in
geometry. What this does is foreground the stuff you care about (i.e., the unit level characteristics, such as covariates, and outcomes) and sticks the stuff you mostly use for plotting and other stuff in that last column.
Compare this to what you get from
rgdal::readOGR, which is a “SpatialPolygonsDataFrame” class object that is also slower to work with.
The PRIO-GRID data is huge because it covers the whole globe. This makes it harder to work with, especially once you are trying to do any kind of modeling. One way to make the data more manageable is to just focus on one region of the globe (e.g., Africa), or, as I will show here, just get rid of all the water.
Getting Rid of the Water
Yes, the PRIO-GRID includes all of the world’s oceans, which means who-knows-how-many grids that you have no use for (I guess unless you are maybe studying maritime conflict?). Unfortunately, the PRIO-GRID data (at time of writing) does not include a simple way to distinguish grids based on whether or not they are water or land.
One quick way to exclude irrelevant cells is to merge in the yearly PRIO-GRID variables on
gid, and use the Gleditch and Gleditsch and Ward country codes (
gwno) to identify what country each cell belongs to. Any cell without a
gwno is water, so we drop it.
# read in PRIO yearly data prio_year = # just want gid and gwno to identify water vroom("../../../../../data/priogrid_cellshp/prio-grid-yearly.csv", col_select = c("gid", "gwno", "year", "pop_gpw_sum")) %>% # we got multipe years of data and changing country, # so i'm arbitrarily picking all countries in most recent year filter(year == 2005) %>% mutate(gid = as.character(gid)) %>% select(-year) # merge in gwno to identify country grids prio = left_join(prio, prio_year) head(prio)
## Simple feature collection with 6 features and 7 fields ## geometry type: POLYGON ## dimension: XY ## bbox: xmin: 163.5 ymin: 89.5 xmax: 166.5 ymax: 90 ## CRS: 4326 ## gid xcoord ycoord col row gwno pop_gpw_sum geometry ## 1 259168 163.75 89.75 688 360 NA NA POLYGON ((163.5 89.5, 163.5... ## 2 259169 164.25 89.75 689 360 NA NA POLYGON ((164 89.5, 164 90,... ## 3 259170 164.75 89.75 690 360 NA NA POLYGON ((164.5 89.5, 164.5... ## 4 259171 165.25 89.75 691 360 NA NA POLYGON ((165 89.5, 165 90,... ## 5 259172 165.75 89.75 692 360 NA NA POLYGON ((165.5 89.5, 165.5... ## 6 259173 166.25 89.75 693 360 NA NA POLYGON ((166 89.5, 166 90,...
gwno identifier merged in, we can now identify land/water and even subset to specific countries or sets of countries. Note that I also merged in a population estimate for plotting below. The
sf package also works nicely with
ggplot. For instance, here’s GRID-level population in Venezuela, below:
# plot Venezuela as sanity check prio %>% filter(gwno == 101) %>% #mutate(pop_cut = cut_number(pop_gpw_sum, 10)) %>% ggplot(aes(fill = pop_gpw_sum)) + geom_sf() + scale_fill_viridis_c(option = "inferno", end = .9, trans = "log10", labels = scales::comma, name = "") + labs(title = "GRID-level population in Venezuela") + theme(panel.grid.major = element_blank())
ggsave(filename = "featured.png", device = "png")
We can also drop all the water, as promised:
## throw out water tiles: places with no gwno == not countries prio_land = prio %>% drop_na(gwno)
We’re now down to a dataset that is 25% of its original size, which will help a lot down the road.
Doing most kinds of spatial analysis requires knowning how the different units in your data relate to one another. With spatial data, it’s common to look at whether two units are neighbors or not. Being a neighbor might be defined as sharing a common border, the distances between the capitals, or centroids, or… whatever. The often arbitrary nature of this choice is one of the tricky aspects of spatial modeling.
To describe these relationships, we use an N by N adjacency matrix, where N is the number of units in our data and any entry (i,j) tells us whether units i and j are neighbors or not.
To do this in R, I find the easiest thing is (unfortunately) to switch from
spdep package and use the
# convert back to sp object to use nice sp dep functions prio_land_sp = as(prio_land, Class = "Spatial")
Below, we specify
queen == FALSE because we want to use Rook’s contiguity (as opposed to Queen’s). You should look into the difference between these two (also, perhaps, another arbitrary decision point). Note that this step takes a long time.
# get neighbors nb = poly2nb(prio_land_sp, queen = FALSE, row.names = prio_land_sp$gid)
The output of our
nb object looks like this, and tells us a bunch of information about what neighborhood relations among our units looks like, including whether any units were found to have no neighbors:
## Neighbour list object: ## Number of regions: 64818 ## Number of nonzero links: 249856 ## Percentage nonzero weights: 0.005947008 ## Average number of links: 3.854732 ## 39 regions with no links: ## 211700 209569 165491 154315 147345 146125 144700 144508 143252 143106 142536 141627 140378 140383 140186 139637 138217 137308 133811 132987 132268 132135 131746 127298 122245 121546 120953 120955 119518 119418 114426 113014 105121 101287 100780 98650 90142 62356 58318
Doing this with sf package
There’s actually a way to do this without leaving
sf, though I found it on a Github comment thread, and without digging more into it am not sure how robust it is (though it does pass general sniff test).
# let's do this again, but with sf package; faster, but seems dodgier, based off a github comment: # https://github.com/r-spatial/sf/issues/234 st_rook = function(a, b = a) st_relate(a, b, pattern = "F***1****") prio_land = prio_land %>% mutate(NB_ROOK = st_rook(.))
Checking things turned out alright
Something prudent at this point would be to check whether the neighborhood list we got actually makes sense.
First, let’s look at a place that the
nb object tells us has no neighbors:
# plot place with no neighbors, Alaskan island prio_land %>% filter(gwno == 2) %>% ggplot() + geom_sf() + # add the place with no neighbors geom_sf(data = filter(prio_land, gid == 211700), fill = "red") + # zoom in on location coord_sf(xlim = c(-180, -140), ylim = c(50, 70), expand = FALSE)
Checks out: looks like some tiny Alaskan island that shouldn’t have neighbors.
Now, how about a place that should have neighbors? Let’s look at a random cell in Canada:
# plot place and its neighbors (in canada) # CHECKS OUT! prio_land %>% filter(gwno == 20) %>% ggplot() + geom_sf() + # random place in canada geom_sf(data = filter(prio_land, gid == 249328), fill = "red") + # neighbors from nb object geom_sf(data = filter(prio_land, gid %in% c(249329,249327, 248608)), fill = "blue") + # zoom in on location coord_sf(xlim = c(-100, -50), ylim = c(75, 85), expand = FALSE)
A bit hard to make out, but if you zoom in you will see that the
nb object correctly identified this GRID’s neighbors.
Having an adjacency matrix of relations between units opens up possibilities for fitting different kinds of spatial models to your data. A common first step is to test whether the variable you are interested in exhibits a high degree of spatial clustering, or “autocorrelation”, which has implications for what kind of model you should fit to your data.
The first step is converting our
nb object to a list using
nblistw(). We include
zero.policy = TRUE because there are a few unconnected cells in our data. The
style argument is important here: it defines how you weigh the connections in your data. A common approach is to row-standardize the adjacency matrix using
style = "W". The strength of this approach vs. others is another decision-point in spatial analysis.
# store as list (most modeling packages require this) lw = nb2listw(nb, style = "W", zero.policy = TRUE) print(lw, zero.policy = TRUE) ## to look at lw contents
## Characteristics of weights list object: ## Neighbour list object: ## Number of regions: 64818 ## Number of nonzero links: 249856 ## Percentage nonzero weights: 0.005947008 ## Average number of links: 3.854732 ## 39 regions with no links: ## 211700 209569 165491 154315 147345 146125 144700 144508 143252 143106 142536 141627 140378 140383 140186 139637 138217 137308 133811 132987 132268 132135 131746 127298 122245 121546 120953 120955 119518 119418 114426 113014 105121 101287 100780 98650 90142 62356 58318 ## ## Weights style: W ## Weights constants summary: ## n nn S0 S1 S2 ## W 64779 4196318841 64779 34305.1 260102.1
A common test of spatial autocorrelation is the Moran’s I test, which can be implemented with the
spdep::morant.test() function. Below, we test whether there is spatial autocorrelation in population:
moran.test(prio_land_sp$pop_gpw_sum, listw = lw, zero.policy = TRUE)
## ## Moran I test under randomisation ## ## data: prio_land_sp$pop_gpw_sum ## weights: lw n reduced by no-neighbour observations ## ## ## Moran I statistic standard deviate = 212.41, p-value < 2.2e-16 ## alternative hypothesis: greater ## sample estimates: ## Moran I statistic Expectation Variance ## 6.055005e-01 -1.543734e-05 8.126835e-06
There’s clearly strong autocrrelation (.61) which is statistically significant (p-value is basically zero). You can also apply the Moran’s I test to a vanilla model you’ve already fit as a test of whether a spatial econometric model would be more appropriate.