- DataFrame filtering & subsetting
- Dictionary from lists
- Relational plots using .relplot()
- Categorical plots using .catplot()
- Distribution, swarm and scatterplots
Ensuring global food security is number 2 of the 17 UN Sustainable Development Goals that are part of the 2030 Agenda for Sustainable Development. The world’s population is currently 7.8 billion; 60 years ago it was approximately 3 billion, and by 2050 it is expected to increase to 9.7 billion. How will the world sustain food production for this increasing population? Where does our food come from? Which countries produce the most food and what is it? Using data from the Food and Agriculture Organization of the United Nations, I will look at global agricultural land use and crop production, with the Python library Pandas to filter and manipulate the data and Seaborn for data visualization.
Seaborn is a Python data visualization library for making statistical graphs. It is built on top of Matplotlib and provides a high-level interface that allows you to make high-quality statistical plots in much less code than would be required in Matplotlib. Seaborn is also integrated with pandas, which makes plotting DataFrames very straightforward. Generally the default Seaborn plots are attractive and have informative labels, legends etc. Plots can be customized by switching between Seaborn styles, creating custom color palettes or supplying keyword arguments that are passed down to the Matplotlib layer. Plotting in Matplotlib was covered in an earlier post.
The data comes from FAOSTAT, a data bank containing food and agricultural data for over 245 countries from 1961 to the most recent years available. I’ve downloaded data from the Land Use domain that contains data on 47 categories of land use, irrigation and agricultural practices from 1961 to 2017. The dataset contains 5824 rows and 64 columns, and is in this GitHub repository.
The raw data csv file has been imported into a Jupyter notebook and cleaned using the workflow described in the post Raw Materials & Raw Data, II. The Jupyter notebook to create the clean land use csv file is here.
I’ve imported Pandas, NumPy, Matplotlib and Seaborn, and the clean land use csv file. Using .info() we can view the structure of the DataFrame.
The DataFrame is composed of 61 columns and 5823 rows. There are columns for each year and rows for each category of land use recorded for a country. Calling the functions .nunique() and .unique() on the ‘Country’ column returns the number of unique values in the column and an array of those unique values, respectively.
There are 274 different individual country names in the ‘Country’ column. Scrolling to the end of the list we see that there are rows for regional country groups, such as Western Asia, Eastern Europe, and also rows for World, which contains global values for land use categories.
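The inspection calls described above can be sketched on a made-up four-row frame standing in for the real land use file. The values are invented for illustration, and the count is assumed to come from .nunique().

```python
import pandas as pd

# Hypothetical miniature of the land use data; the numbers are made up.
landuse = pd.DataFrame({
    "Country": ["Kenya", "Kenya", "Peru", "World"],
    "Land use": ["Arable land", "Land under permanent crops",
                 "Arable land", "Arable land"],
    "1961": [3.4, 0.4, 1.7, 1280.0],
    "2017": [5.8, 0.5, 3.5, 1390.0],
})

landuse.info()                          # column names, dtypes and non-null counts
n_names = landuse["Country"].nunique()  # number of distinct names
names = landuse["Country"].unique()     # array of distinct names, in order of appearance
print(n_names, list(names))
```

Note that .unique() returns a NumPy array rather than a Python list, which is why it is wrapped in list() for printing.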
Filtering, subsetting and shaping the DataFrame
The Land Use DataFrame contains values at a range of scales from World to individual country. Analyzing the data will be easier if the DataFrame is broken down into these different levels, so that we can drill down into the data from the global scale to individual country level.
To filter the DataFrame I’ve created 6 lists to create 6 DataFrames for world, world region, continent, continental region, countries and economic area.
Calling .isin() on the landuse DataFrame ‘Country’ column, with one of the list names as an argument, returns a Boolean mask that, when used to index the DataFrame, selects the rows that contain the values in the list. Assigning the result to a new name saves the selection as a new DataFrame. We can check that the new DataFrame contains the list values by calling .unique() on its ‘Country’ column.
The same process is used to create the world, economic area and world region DataFrames. To create the DataFrame containing only countries I created a list of all values that are not countries, as this is a shorter list. I then negate the .isin() call by putting ~ in front of it, which reads as: select all rows whose value is not in the list.
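A minimal sketch of both filters, assuming a toy frame and two short hypothetical lists (the real lists are much longer):

```python
import pandas as pd

# Hypothetical rows mixing aggregate names with individual countries.
landuse = pd.DataFrame({
    "Country": ["Africa", "Asia", "Kenya", "Peru", "World"],
    "2017": [1.0, 2.0, 3.0, 4.0, 5.0],
})

continents = ["Africa", "Asia"]              # names to keep
not_countries = ["Africa", "Asia", "World"]  # names to exclude

# Rows whose 'Country' value is in the list.
continent_df = landuse[landuse["Country"].isin(continents)]

# ~ flips the Boolean mask: rows whose 'Country' value is NOT in the list.
countries_df = landuse[~landuse["Country"].isin(not_countries)]

print(continent_df["Country"].unique())
print(countries_df["Country"].unique())
```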
Once we have the 6 DataFrames we can reshape them into a more easily analysed format by using .melt(). A discussion on melting data and tidy data can be found in a previous blog post, Raw Material & Raw Data, II. The image below shows the code for melting the world DataFrame. I’ve then used .pivot_table() to pivot the melted DataFrame, so that each land use category becomes a column. As we’ll see when we start plotting the data with Seaborn, it’s useful to be able to call a specific variable column.
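The melt-then-pivot reshaping might look roughly like this. The frame, the id_vars and the new column names are assumptions for illustration and may differ from the notebook.

```python
import pandas as pd

# Hypothetical wide frame: one row per land use category, one column per year.
world = pd.DataFrame({
    "Country": ["World", "World"],
    "Land use": ["Arable land", "Land under permanent crops"],
    "1961": [10.0, 1.0],
    "1962": [11.0, 2.0],
})

# Wide to long: every year column becomes a (year, hectares) row.
melt_world = world.melt(id_vars=["Country", "Land use"],
                        var_name="year", value_name="hectares")

# Long to "one column per land use category".
world_pivot = (
    melt_world
    .pivot_table(index=["Country", "year"], columns="Land use", values="hectares")
    .reset_index()
)
world_pivot.columns.name = None  # drop the leftover 'Land use' label on the columns

print(world_pivot)
```

By default .pivot_table() averages duplicate entries; here each country, year and category combination occurs once, so the values pass through unchanged.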
The world DataFrame contains 18 land use categories; category definitions are provided on the FAOSTAT site. What I’m interested in looking at is land used for food production, namely crop production, as plants provide 90% of the world’s energy intake. The most commonly eaten crops are known as staple crops and include rice, corn and wheat. The ‘Arable land’ category combines land under temporary crops (<1 year growing cycle), temporary meadows, pastures and fallow. ‘Land under permanent crops’ is land cultivated with long-term crops that do not have to be replanted annually, e.g. coffee. ‘Land under permanent meadows and pastures’ is land used permanently (>5 years) to grow herbaceous forage crops.
To create a subset of the DataFrame containing only these three columns of interest, put the column names in a list inside square brackets and place that list inside the selection brackets, giving double brackets [[ ]]. If we were selecting only one column we could use one set of square brackets, which returns the column as a Series (each column in a DataFrame is a Series), or place the column name in double square brackets [[ ]] to return it as a single-column DataFrame. You can check the data structure type using type().
In addition to the land use categories I’ve included the Country, Land area and year variables in the new DataFrame. I’ve also created a melted version of this reduced DataFrame to use for certain Seaborn plot types.
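To make the single versus double bracket distinction concrete, here is a small sketch on a hypothetical pivoted frame (the column values and the extra ‘Forest land’ column are invented):

```python
import pandas as pd

# Hypothetical pivoted frame with one column per land use category.
world_pivot = pd.DataFrame({
    "Country": ["World", "World"],
    "year": [1961, 1962],
    "Arable land": [10.0, 11.0],
    "Land under permanent crops": [1.0, 2.0],
    "Forest land": [40.0, 39.0],
})

# A list of column names inside the selection brackets returns a DataFrame.
cols = ["Country", "year", "Arable land", "Land under permanent crops"]
world_cols = world_pivot[cols]

as_series = world_pivot["Arable land"]   # single brackets: Series
as_frame = world_pivot[["Arable land"]]  # double brackets: one-column DataFrame

print(type(as_series).__name__, type(as_frame).__name__)
```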
Data visualisation with Seaborn
Seaborn is an extremely useful library for data exploration as it is easy to create common statistical plot types e.g. scatter plots, boxplots, from pandas DataFrames. It’s also simple to display third variables or subgroup data using the hue parameter.
There are two kinds of plotting functions in Seaborn: axes-level functions that are called directly, e.g. sns.scatterplot(), and figure-level functions such as sns.relplot() and sns.catplot(), where the plot type is chosen with the kind parameter. Figure-level functions allow you to make subplots; axes-level functions produce a single graph. For a more detailed explanation look at the Seaborn documentation.
Let’s start by creating a distribution plot, which is neither a relplot nor a catplot plot type. The full list of Seaborn plot types can be viewed in the Seaborn documentation.
Distribution plot .distplot()
Seaborn distribution plots combine the Matplotlib histogram plot with the Gaussian kernel density estimate plot (kdeplot) and rugplot; the latter two can be disabled when calling the .distplot() function. In addition, distplot automatically labels the x axis and generates narrower bins than the Matplotlib histogram.
Below I’ve plotted the distribution of hectares globally for each land use type from the world_cols DataFrame. The first set of plots includes the kde and rugplot; the second set of plots has kde=False, creating a simple histogram. For a white plot background I’ve used the 'ticks' style, and to scale the plot elements and labels I’ve set the context to 'notebook'. Scale parameters from smallest to largest are 'paper', 'notebook', 'talk', 'poster'. The sns.despine() function removes the top and right boundary of the plot.
From the above plots we can see that the areas under arable land and permanent meadows & pastures are much larger than areas under permanent crops. Let’s use the melt_world_cols_cp DataFrame with catplot() and relplot() to look at global land use in more detail.
Relational .relplot() and categorical .catplot() plots
Relational plots are used to examine the relationship between two quantitative variables, e.g. height vs weight on a scatter plot. Categorical plots are used to make comparisons between groups or categories of variables, e.g. smoker or non-smoker. The advantage of using relplot() and catplot() is that it is easy to create subgroups of plots using the col= and row= parameters, providing a more detailed breakdown of the data. The example below shows how relplot() makes use of col and row to subgroup data.
If we want to look at the number of hectares globally that are used for the three land use types we can make a bar graph using .catplot(). It would also be interesting to see what percentage of global land area is being used as arable land, permanent crops and permanent meadows & pasture. To do this we need to create a new column for land use hectares as a percentage of total global land area.
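The percentage column is a plain column-wise calculation. A minimal sketch, assuming the melted frame has ‘hectares’ and ‘Land area’ columns in the same units (the numbers here are invented):

```python
import pandas as pd

# Hypothetical melted frame; 'Land area' repeats the total land area on every row.
melt_world_cols_cp = pd.DataFrame({
    "year": [2016, 2016, 2017, 2017],
    "Land area": [1000.0, 1000.0, 1000.0, 1000.0],
    "Land use": ["Arable land", "Land under permanent crops"] * 2,
    "hectares": [110.0, 12.0, 120.0, 13.0],
})

# Hectares under each land use as a percentage of total land area.
melt_world_cols_cp["hect_pct"] = (
    melt_world_cols_cp["hectares"] / melt_world_cols_cp["Land area"] * 100
)

print(melt_world_cols_cp)
```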
Using sns.catplot() and specifying 'bar' in the kind parameter we can make a bar plot of ‘Land use’ categories against ‘hectares’ and a second plot against ‘hect_pct’. Certain aspects of the plot can be set within the function, such as palette. There are numerous Matplotlib color palettes, or you can make your own palette. Customizing plot labels etc. is done using methods prefixed with .set_ (for example .set_axis_labels()), which requires the plot to be assigned to a variable; the custom is to use g. As I have two plots, I’ve named the second h.
We can see the power of catplot() to quickly make categorical plots. It’s also simple to try out other plot types by changing the kind= parameter, as I’ve done below to quickly change the bar plot to a box plot.
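As an illustration of the pattern, the sketch below draws the same hypothetical data first as a bar plot and then as a box plot by changing only kind. The data, the 'Set2' palette and the axis label text are assumptions. hue is set to the same column as x so that the palette applies per category across Seaborn versions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import pandas as pd
import seaborn as sns

# Hypothetical long-format data: three yearly values per land use type.
df = pd.DataFrame({
    "Land use": ["Arable land"] * 3 + ["Land under permanent crops"] * 3,
    "hectares": [110.0, 115.0, 120.0, 10.0, 12.0, 13.0],
})

# Arguments shared by both plots.
shared = dict(data=df, x="Land use", y="hectares", hue="Land use",
              palette="Set2", dodge=False, legend=False)

# Bar plot: bar height is the mean of 'hectares' per category.
g = sns.catplot(kind="bar", **shared)
g.set_axis_labels("Land use type", "Hectares (hypothetical)")

# Same call with kind="box" shows the spread instead of the mean.
h = sns.catplot(kind="box", **shared)
h.set_axis_labels("Land use type", "Hectares (hypothetical)")
```

Both g and h are FacetGrid objects, which is why the .set_ methods are called on them rather than on a Matplotlib axes.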
The bar plot shows that ~37% of land area globally is used for crop production or pasture land, with the greatest percentage being under meadows & pasture land. From the boxplot we can see that the range of hectares in each land use category is very narrow, suggesting little change in the size of area cultivated since 1961. Using relplot() we can make a line plot of ‘year’ against ‘hectares’ to examine how land use area has changed over 60 years.
A swarmplot plots a point for each observation in a category. The points are adjusted along the categorical axis so that no points overlap. When a hue variable is used, the points for all hue levels are drawn in the same swarm by default (dodge=False); setting dodge=True separates them. This can be a good way to visualize the distribution of observations within a category. The swarmplots below plot hectares per year and hectares as percentage of total land area for each of the land use types, and color code the data by continent.
Each point in the swarmplot represents the number of hectares under a specific land use in a particular year, so the plot shows land use over time. As the previous global land use line graph showed, the amount of agricultural land has remained relatively static across continents over the past 60 years; only Australia and Asia record a change in hectares in agricultural use. The continent with the most hectares of agricultural land is Asia, which is unsurprising as it’s the biggest land mass, but other continents come a close second depending on the land use type.
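For reference, a swarmplot colored by continent can be sketched like this. The frame and its ‘Continent’ column are hypothetical stand-ins for the continent DataFrame.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical yearly values for two continents and two land use types.
df = pd.DataFrame({
    "Land use": ["Arable land"] * 4 + ["Land under permanent crops"] * 4,
    "Continent": ["Asia", "Asia", "Africa", "Africa"] * 2,
    "hectares": [450.0, 470.0, 160.0, 230.0, 40.0, 80.0, 15.0, 35.0],
})

fig, ax = plt.subplots()
# One point per observation; hue colors the points by continent.
# dodge=False keeps all continents in the same swarm for each category.
sns.swarmplot(data=df, x="Land use", y="hectares", hue="Continent",
              dodge=False, ax=ax)
```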
Scatterplot() and relplot(scatterplot)
In each continent, which countries have the most arable or pasture land? One way to visualize this is to make a scatter plot of country land area against hectares, and to set the hue parameter to the ‘Country’ column to color code points for each country. Setting style='Land use' uses a different marker for each land use type. The resultant graph below shows data for more than 200 countries, and if you zoom right in you will see that the largest hectare values are plotted with an x marker for land under permanent meadows & pasture. Unfortunately a number of countries share the same red, blue and brown colors, so we don’t know which is which on the plot. Plotting countries by continent would make the graphs more readable.
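A minimal sketch of the axes-level scatter plot with hue and style, using three invented countries instead of the full countries DataFrame:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical country-level rows (values are made up).
df = pd.DataFrame({
    "Country": ["China", "China", "Kenya", "Kenya", "Peru", "Peru"],
    "Land use": ["Arable land", "Land under permanent crops"] * 3,
    "Land area": [940.0, 940.0, 57.0, 57.0, 128.0, 128.0],
    "hectares": [120.0, 16.0, 6.0, 0.5, 3.5, 1.4],
})

fig, ax = plt.subplots()
# hue colors the points by country; style picks a marker per land use type.
sns.scatterplot(data=df, x="Land area", y="hectares",
                hue="Country", style="Land use", ax=ax)
```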
Dictionaries from lists and DataFrame .query()
We could subset the melt_coun_cols_cp DataFrame into continents, as we did with the land use DataFrame to create the countries DataFrame initially. An alternative method is to use the pandas .query() method to select countries by continent. To do this we would call:
melt_coun_cols_cp.query('Continent == "Asia"')
This pulls out all rows containing the value Asia, but there is no column containing continent names in the countries dataset. To add a continent column we can use a dictionary to map key:value pairs to the ‘Country’ column.
To make a dictionary we first need to create lists of all countries in each continent. The lists for Asia and Africa are shown below.
Using the Python dict.fromkeys() function we can transform the lists into dictionaries. The function takes two arguments: seq, the list of values which will be used as dictionary keys, and value, the value of each key:value dictionary pair. For more detail on Python dictionaries look at the Python documentation. Pass the list of countries for each continent as the seq argument so the countries become the dictionary keys. The value for each key will be the same: the name of the continent the country is located in. The Asia dictionary created from the Asia list is shown below.
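In code, the call looks like this. The three-country list is a shortened, hypothetical stand-in for the full Asia list.

```python
# Shortened, hypothetical stand-in for the full list of Asian countries.
asia = ["China", "India", "Japan"]

# Every country in the list becomes a key; all keys share the value "Asia".
# Both arguments are passed by position.
asia_dict = dict.fromkeys(asia, "Asia")

print(asia_dict)  # {'China': 'Asia', 'India': 'Asia', 'Japan': 'Asia'}
```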
To map continent names to the entire ‘Country’ column in a single command we can merge the 6 dictionaries into a single dictionary. The .map() function is then called; the dictionary keys map to the corresponding name in the ‘Country’ column and the dictionary value, the continent name, is placed in a newly created ‘Continent’ column.
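One way to merge the dictionaries and apply the mapping is sketched below with two small hypothetical dictionaries. Dictionary unpacking is just one of several ways to merge and may not be what the notebook used.

```python
import pandas as pd

# Two small hypothetical continent dictionaries.
asia_dict = dict.fromkeys(["China", "India"], "Asia")
africa_dict = dict.fromkeys(["Kenya", "Egypt"], "Africa")

# Unpack both dictionaries into one combined lookup table.
continent_map = {**asia_dict, **africa_dict}

df = pd.DataFrame({"Country": ["Kenya", "China", "Peru"]})

# Look up each country name in the dictionary; names without a key become NaN.
df["Continent"] = df["Country"].map(continent_map)

print(df)
```

Checking the new column for NaN values is a quick way to spot country names that are missing from the lists or spelled differently.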
Now we can use .query() to select a specific continent inside the .relplot() call and plot data for that continent only. Subplots for each of the land use types are created using the col= parameter. I’ve also used the size= parameter to scale the markers to the country land area and the sizes=() parameter to specify the min and max marker size. Land area is already plotted on the x axis, but this is an opportunity to demonstrate how many variables can be plotted in a single Seaborn function.
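Putting the pieces together, a hedged sketch of the query plus relplot() call might look like this. The data, the sizes range and the col_wrap value are illustrative assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import pandas as pd
import seaborn as sns

# Hypothetical country rows that already carry a 'Continent' column.
df = pd.DataFrame({
    "Country": ["China", "China", "India", "India", "Kenya", "Kenya"],
    "Continent": ["Asia"] * 4 + ["Africa"] * 2,
    "Land use": ["Arable land", "Land under permanent crops"] * 3,
    "Land area": [940.0, 940.0, 300.0, 300.0, 57.0, 57.0],
    "hectares": [120.0, 16.0, 156.0, 13.0, 6.0, 0.5],
})

# Keep only the rows for one continent.
asia = df.query('Continent == "Asia"')

g = sns.relplot(
    data=asia, x="Land area", y="hectares",
    hue="Country",                      # one color per country
    size="Land area", sizes=(40, 400),  # marker area scaled between min and max
    col="Land use", col_wrap=2,         # one subplot per land use type
    kind="scatter",
)
```

Swapping "Asia" for another continent name in the query string reuses the same plotting call unchanged.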
Using the same code as above but replacing the query term with Africa produces the same graphs for all African countries in the DataFrame.
A list of countries can also be passed to the DataFrame query function. Here the list is composed of all the countries in which land use has changed since 1961.
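Inside a query string, a Python list can be referenced with the @ prefix. A small sketch with a hypothetical two-country list (the real list has 12 countries):

```python
import pandas as pd

# Hypothetical rows; values are made up.
df = pd.DataFrame({
    "Country": ["China", "Australia", "Kenya", "Peru"],
    "hectares": [390.0, 340.0, 21.0, 19.0],
})

# Shortened, hypothetical list of countries to keep.
changed = ["China", "Australia"]

# @changed refers to the Python variable defined above.
subset = df.query("Country in @changed")

print(subset)
```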
Only 12 countries show any significant change in the number of hectares used for agricultural land. If we want to look at the nature of this change, whether the number of hectares is increasing or decreasing, we can take the same code as above and change the x variable from 'Land area' to 'year'. For aesthetics I’ve changed the col_wrap parameter from 2 to 1 so the plots are in a single column and increased the aspect from 1 to 2.
Of the three land use types only Land under permanent meadows and pastures shows any substantial changes over the past 60 years, with Australia recording a decrease of ~150,000 ha and China recording an increase of ~160,000 ha. Saudi Arabia has increased land under permanent meadows by ~50,000 since the early 1980s.
Seaborn is an excellent Python library for statistical data visualization. It’s simple and quick to make informative graphs, subgroup data and switch between plot types to find the best visualization.
Interestingly, the analysis of agricultural land use shows that over the past 57 years the area of land used for crop cultivation and animal pasture has changed very little, despite the world’s population having grown by 4.8 billion over the same period. What is the reason for this? Increased productivity per hectare? Decreased consumption? Another dataset from the FAOSTAT site could help answer these questions.