- Geospatial data visualization
- Merging GeoDataFrames and Pandas DataFrames
Choropleth maps color code geographic regions according to numeric values. They are an impactful way to display variation in a variable between regions. The Python package GeoPandas combines pandas and Shapely, a package for manipulating points, lines and polygons. It also depends on Fiona for file access, and Descartes and Matplotlib for plotting. With GeoPandas we can carry out geospatial operations using Python and Pandas that would normally require a spatial database.
Brief introduction to GeoPandas
GeoPandas has three basic classes of geometric objects:
and two main data structures that are subclasses of pandas Series and DataFrame. GeoSeries is a vector where each entry is a set of shapes representing one observation. A GeoDataFrame has the same tabular structure as a pandas DataFrame with one important difference, it contains a “geometry” column. The geometry column holds a series of latitude-longitude pairs that define the geometric object, be it a point, line or polygon. GIS Shapefile’s containing geometric location and attribute information of geographic features, can easily be converted to GeoDataFrames in Python by reading in the shapefile using
gpd.read_file(), gpd being the abbreviation for GeoPandas. For more information and examples of GeoPandas in action I recommend looking at the GeoPandas documentation. Lets now look at how to make a choropleth using the FAO land use data from an earlier post. The code for this post is in the GitHub repository.
Creating a Choropleth
Importing World Map shapefile
The first thing we need is a basemap on which to plot our data. For this I downloaded a 1:110m Cultural Vectors Countries shapefile from Natural Earth. The shapefile consists of a .shp file containing the geometry coordinates, a .dbf file containing attributes for each geometry and a .shx file that is a shape index linking the attributes to the geometry. The .shp file depends on the other two, so all three files must be imported into the working directory although only the .shp file is read into the notebook.
In addition to the usual Python packages we will import GeoPandas and Descartes.
Assign the shapefile a name and read in the file as a GeoDataFrame (gdf) with
gpd.read_file() function. I only want to read in the columns containing the country name and the all important geometry column, so add these as a list to the read_file function. For clarity rename the ‘ADMIN‘ and ‘Name‘ columns to ‘Country‘ and ‘Country_short‘. Call
.plot() on the world_gdf to plot the basemap.
Antarctica takes up quite a bit of the map and we don’t have any data for it. We can drop the Antarctica polygon creating more space for the rest of the map. Print the GeoDataFrame row for Antarctica and use
.drop() to drop that row from the gdf.
Now our basemap is ready we need some data to plot on it.
Import pandas DataFrame
Import the countries_land_use.csv file created in the EDA Data viz notebook. This contains data on the number of hectares used for different land use categories (e.g. Arable, Pasture), in most world countries from 1961 to 2017. What we want to create is a GeoDataFrame that contains the world map and the country land use data, so that we can plot hectare values for a specific land use in each country.
To combine the GeoDataFrame and the DataFrame use
.merge() to merge the DataFrame on a shared column, in this case we can use the ‘Country‘ column. Before doing so we must make sure that the ‘Country‘ column values are the same in both the datasets, otherwise there will be rows in the DataFrame that will not map to rows in the GeoDataFrame. To check that the country names in the land use DataFrame are the same as they appear in the world GeoDataFrame use
.to_list() to return a list of country names from both the GeoDataFrame and DataFrame.
For those countries where the name differs between the World_gdf and the countries_land_use DataFrame I have renamed the country in the ‘Country‘ column of countries_land_use DataFrame. This is done by first making a dictionary with the key:value pair set to ‘old_name’:’new_name’. The new names are mapped to the ‘Country’ column using the
.map() function and passing the dictionary to the function. The
.fillna() function is called to fill the ‘Country‘ column values that are not in the dictionary with the existing values in the DataFrame ‘Country’ column. If we don’t do this pandas will map the values in the dictionary to the key in the DataFrame ‘Country’ column and fill all other column rows with NaN.
Renaming has generated duplicate values in the ‘Country‘ column, the USSR was renamed as Russia so that now occurs twice. Removing duplicates and appending new rows to the DataFrame was discussed in Raw Materials and Raw Data I. The code is shown below but for more explanation I suggest you look at the earlier blog post.
As well as dropping the duplicates I’ve subset the countries_land_use DataFrame by Arable land use and Permanent meadows & pasture, as it is these two categories that will be used for the choropleths.
Merge GeoDataFrame and pandas DataFrame
Merging a GeoDataFrame and a pandas DataFrame is for the most part the same as merging two pandas DataFrames. An outer merge is used so that all rows in the world_gdf are retained even if those countries are not in the arable DataFrame. If these rows are dropped the choropleth will have gaps for the missing countries.
There is an important difference when merging a GeoDataFrame and DataFrame compared to merging two DataFrame’s. The merged dataset should be a GeoDataFrame and this depends on which dataset the merge function is called, the spatial or DataFrame dataset. It is recommended to call the merge function on the spatial dataset (
gdf.merge()). The merge function will work if called on the DataFrame and the GeoDataFrame is passed into the function but the result will be a DataFrame not a GeoDataFrame.
Another important point is that choropleths will not display a country for which the value is NaN. To display all countries, even those with zero value, use
.fillna(0) to fill any NaN with zero.
Plotting the Choropleth
To plot the Choropleth call
.plot() on the GeoDataFrame and pass the variable column we want to plot as values on the map. Here, I’ve plotted the number of hectares used for arable land in 2017.
The default colorbar is out of scale with the map, and we also need a title on the plot and colorbar. To customize the plot and colorbar define the plot axes as
ax and the legend axes as
cax, then pass these to the
.plot() function. The below example uses the
mpl_toolkits make_axes_locatable function to vertically align the plot and legend axes. More examples of choropleth customization are provided in the GeoPandas documentation.
I followed the same workflow to create a Permanent meadows & pasture land DataFrame and plot the 2017 data on a Choropleth.
Change in land use over time
To visualize how land usage has changed since 1961 I’ve created choropleths displaying the difference between hectares in 1961 and 2017 by subtracting the 1961 hectare values from the 2017 values, and storing the returned values in a new column in the GeoDataFrame.
By default the colorbar applied in matplotlib will diverge from the midpoint between the max and min values of the plotted column e.g. 1000 to -4500. This is not very useful when using diverging colormaps to show positive and negative values as the plot above shows. The zero point is represented by blue and some negative values are also blue. We want the colourmap to diverge from zero, the grey midpoint color, positive values to be blue and negative values to be red.
To set the midpoint to zero use the
DivergingNorm function and set the vcenter value to zero, as shown below.
The choropleth now clearly illustrates how land use has increased, decreased or remained unchanged since 1961. Overall the amount of hectares in use for arable land has remained largely static since 1961. There has been an increase in arable hectares in Brazil. Kazakhstan and Ukraine also show an increase, but land use was not recorded for either of these countries in 1961. Looking at the actual data for these countries we see that there has been no significant change in arable hectares. Likewise, in the plot below the increase in Permanent meadows & pasture land in Kazakhstan is due to no data being recorded for 1961. The changes in Russia, Australia, Iran, Brazil, Mongolia and China are real. Which demonstrates the importance of questioning and being familiar with the data, rather than blindly making plots.
Geopandas makes it easy to combine geospatial data with the functionality of Pandas and quickly create Choropleths; an excellent way to visualize geospatial data and convey a lot of information in one picture.