In Part I we took raw CSV files and merged, cleaned and shaped them. Now we're ready for the fun part: exploratory data analysis (EDA). Using different data visualizations we can start to answer questions such as: where is copper mined, and how much? We can also reveal other interesting aspects of the data. Data visualization is key to understanding your data and communicating important insights to your audience.
There are numerous data visualization packages, such as ggplot, Bokeh, Plotly, Altair and Seaborn; some are designed for specific tasks, while others can be used for a wide range of plots. Matplotlib was the first data visualization library for Python and remains the most commonly used, and many other libraries are built on top of it, giving you access to Matplotlib's functionality with less code. Seaborn, for example, is a statistical data visualization library built on Matplotlib that lets you create attractive statistical graphs with much less code than Matplotlib alone. These wrapper libraries have limitations, such as restricted fine-grained customization of graphs, which is why knowing how to use Matplotlib directly is so useful for creating customized data visualizations.
We start as always by importing the necessary Python packages, including Matplotlib. Matplotlib includes stylesheets, templates for plot styles, which allow you to set a style for all charts created in the notebook. Here I'll use the seaborn-whitegrid style. Import the clean_cu_mined_70_18 CSV file created in Part I and check the DataFrame head. The Jupyter notebook for the analysis can be accessed in my GitHub repository.
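A minimal setup sketch for the step above (note the whitegrid stylesheet was renamed in Matplotlib 3.6, so this checks which name is available; the read_csv lines assume the file from Part I is present):

```python
import pandas as pd
import matplotlib.pyplot as plt

# 'seaborn-whitegrid' became 'seaborn-v0_8-whitegrid' in Matplotlib 3.6+
style = ('seaborn-whitegrid' if 'seaborn-whitegrid' in plt.style.available
         else 'seaborn-v0_8-whitegrid')
plt.style.use(style)  # applies to every chart created in the notebook

# cu_mined = pd.read_csv('clean_cu_mined_70_18.csv')  # file created in Part I
# cu_mined.head()
```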
New Columns, Tidy Data & Melted Data
Adding Total Column
Using the `.describe()` method we can view some basic statistics of the DataFrame's numeric columns. With the DataFrame in its current shape we get statistics for copper mined globally each year, but what about the total copper mined per country?
To get the total copper mined per country since 1970 we sum the values in each country row and create a column for 'Total mined per country'. Pass the argument `axis=1` into the `.sum()` function to apply it along each row.
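As a minimal sketch (the countries and tonnage values here are illustrative stand-ins for the real cu_mined data), the total column can be added like this:

```python
import pandas as pd

# Illustrative stand-in for the cleaned cu_mined DataFrame from Part I
cu_mined = pd.DataFrame({
    'Country': ['Chile', 'Peru', 'USA'],
    'yr1970': [685, 220, 1560],
    'yr1971': [708, 207, 1381],
    'yr1972': [717, 219, 1510],
})

# axis=1 sums across each row; numeric_only=True skips the 'Country' strings
cu_mined['Total mined per country'] = cu_mined.sum(axis=1, numeric_only=True)
print(cu_mined[['Country', 'Total mined per country']])
```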
To get the total amount mined per year globally create a separate single row DataFrame rather than append a row to the cu_mined DataFrame. Keeping the yearly total separate makes reshaping and plotting the cu_mined DataFrame simpler.
To create a separate DataFrame for the global annual amount mined, first select all the year columns using `.iloc[:, 1:]`: the first argument, `:`, selects all rows, and the second, `1:`, selects the second column, yr1970, and every column to the end of the DataFrame. Summing these columns creates a Series containing the total tonnes per year, which we transpose (`.T`) into a single-row DataFrame. Add a column for Country by reindexing the DataFrame using the column names of the cu_mined DataFrame, and use `.fillna()` to fill the 'Country' column with the label 'All countries'. I've also used `.drop()` to drop the last column, which is the total of the 'Total mined per country' column.
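The steps above can be sketched as follows (again with illustrative stand-in data; the real DataFrame has a 'Total mined per country' column to drop at the end, which this toy frame omits):

```python
import pandas as pd

cu_mined = pd.DataFrame({
    'Country': ['Chile', 'Peru', 'USA'],
    'yr1970': [685, 220, 1560],
    'yr1971': [708, 207, 1381],
    'yr1972': [717, 219, 1510],
})

# Sum the year columns (everything after 'Country') into a Series of yearly
# totals, then transpose it into a single-row DataFrame
total_mined = cu_mined.iloc[:, 1:].sum().to_frame().T

# Reindex with cu_mined's column names so a (NaN) 'Country' column appears,
# then label the row 'All countries'
total_mined = total_mined.reindex(columns=cu_mined.columns)
total_mined['Country'] = total_mined['Country'].fillna('All countries')
print(total_mined)
```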
Tidy Data & Melted Data
We now know the total copper mined per country and the total amount of copper mined globally each year since 1970, but how can we easily examine how the amount of copper mined has changed over time or which countries produce the most copper?
Right now the DataFrame is in a human-readable form, with a row for each country and a separate column for each year. This structure is not good for analysis using pandas or for plotting, as we can't call a single variable column such as year. We need to rearrange the DataFrame into a Tidy Data format, where each column represents a separate variable and each row represents an individual observation. In this case we want separate columns for 'Country', 'Year' and 'Tonnes mined per year by country', so that each row records the amount mined by a country in a particular year. This structure will allow us to use functions such as groupby to look at country output and to plot trends of production against time.
To reshape the DataFrame into a tidy structure use `pd.melt()`. This function takes the following arguments:
- frame = the DataFrame to melt
- id_vars (identifier variables) = the columns to hold constant
- value_vars = columns to melt into a single column. If you don’t specify columns to melt the function will use all columns not specified in id_vars.
- var_name = name of the new variable column
- value_name = name of the new column containing the values.
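With the stand-in data from above, the melt call looks something like this (column names follow the post; the values are illustrative):

```python
import pandas as pd

cu_mined = pd.DataFrame({
    'Country': ['Chile', 'Peru', 'USA'],
    'yr1970': [685, 220, 1560],
    'yr1971': [708, 207, 1381],
    'yr1972': [717, 219, 1510],
})

cu_melt = pd.melt(
    frame=cu_mined,
    id_vars=['Country'],   # hold Country constant
    var_name='Year',       # melted column names go into a 'Year' column
    value_name='Tonnes mined per year by country',
)
print(cu_melt.head())
```

Because value_vars is not specified, every column except 'Country' is melted, which is what saves typing all 48 year column names.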
By only specifying the id_vars columns the melt function merged all the year columns, which saved having to type in all 48 year column names. Now the descriptive statistics are for the total data set not just each year. We can see the maximum total amount of copper mined by a country is 150 million tonnes and the average amount mined per year by a country is 141,407 tonnes.
Filtering the melted data is straightforward; the example above uses `.sort_values()` to sort the data in the 'Tonnes mined per year by country' column from largest to smallest. We can quickly see that Chile produces the largest amount of copper and is the country that has mined the maximum total of 150 million tonnes.
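A sketch of the sort, using a small hand-built melted frame in place of the real data:

```python
import pandas as pd

# Illustrative melted data (Country / Year / tonnes)
cu_melt = pd.DataFrame({
    'Country': ['Chile', 'Chile', 'Peru', 'Peru', 'USA', 'USA'],
    'Year': ['yr1970', 'yr1971'] * 3,
    'Tonnes mined per year by country': [685, 708, 220, 207, 1560, 1381],
})

# Largest producers first
ranked = cu_melt.sort_values('Tonnes mined per year by country',
                             ascending=False)
print(ranked.head())
```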
Plotting the data
Once the DataFrame is in the right shape and we have all the columns we need, we can start plotting and examining the data. There are two approaches to creating graphs in Matplotlib, using the MATLAB-style state-based interface or the object-oriented interface.
The MATLAB-style tools are accessed via the pyplot (`plt`) interface, which creates active figure and axes objects to which all plt commands are applied. Plotting simple plots quickly is easy with the plt interface; for more complicated plots, and the ability to alter plots after they are created, the object-oriented interface is preferable.
In the object-oriented interface a figure and axes are created, and the plotting functions are called on them; this means you can access and change individual plot elements such as the size of tick labels, the plot labels or the legend. A detailed explanation of the different styles, with examples, can be found in Jake VanderPlas's Python Data Science Handbook. Most of the graphs here are made using the object-oriented interface.
The graph below shows how the total amount of copper mined per year globally has changed since 1970. To create the plot we call the `plt.subplots()` function, which creates two objects: a figure object, the container that holds everything on the page, and an axes object, the canvas onto which the data is drawn. These objects can be given any names, but the convention is to call them fig and ax. Data is added to the axes object by calling methods on it.
To add column data to the axes object we call the `.plot()` method on ax. In the above code the parameters passed into `.plot()` are the name of the DataFrame and the columns we want to plot on the x and y axes respectively. The line is formatted using the format string `'g--'`, for a green dashed line. Calling various methods allows us to add a graph title (`set_title`) and a label on the y axis (`set_ylabel`), change the tick labels on the y axis (`set_yticklabels`) and rotate the x axis tick labels.
Next let's look at the distribution of copper mined per country, by year and in total, by plotting the data as histograms. Here, aspects of both plots are set at once, rather than individually, by setting the runtime configuration (rc) options. Calling `plt.rc` we can set the font size of the plot title, the axes labels, the tick label size and more. See the Matplotlib documentation for more plot customization parameters.
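For example, rc options might be set like this (the specific sizes are illustrative choices, not the post's values):

```python
import matplotlib.pyplot as plt

# Set rc options once so they apply to every subsequent plot
plt.rc('axes', titlesize=16, labelsize=12)  # title and axis-label font sizes
plt.rc('xtick', labelsize=10)               # x tick label size
plt.rc('ytick', labelsize=10)               # y tick label size
plt.rc('figure', figsize=(10, 6))           # default figure size
```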
The histograms show that the majority of countries produce less than 250,000 tonnes of copper a year and have produced less than 5 million tonnes since 1970. A small number of countries produce volumes greater than 1 million tonnes a year. To dive deeper into each country's production over time we can plot the data for each country as a boxplot.
Boxplots are a great way to visualize the range of data, in this case the range of volumes mined by each country over 48 years. Clearly for the most part the level of production has been small and constant over the period, but there are a number of countries with high levels of production that have changed over time.
To pull out these countries for further analysis I've used the `.groupby()` function to group the values by country, applied `.median()` to find the median production for each country, and sorted the values by 'Tonnes mined per year by country'.
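A sketch of that groupby-and-sort, again on a small illustrative melted frame:

```python
import pandas as pd

cu_melt = pd.DataFrame({
    'Country': ['Chile', 'Chile', 'Peru', 'Peru', 'USA', 'USA'],
    'Year': ['yr1970', 'yr1971'] * 3,
    'Tonnes mined per year by country': [685, 708, 220, 207, 1560, 1381],
})

# Median annual production per country, largest first
medians = (cu_melt
           .groupby('Country')['Tonnes mined per year by country']
           .median()
           .sort_values(ascending=False))
print(medians)
```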
Looking at the data in this way shows there is a break in the median value for ‘Total mined per country’ between Kazakhstan and South Africa. All countries above South Africa have mined more than 11 million tonnes of copper in total, and mine more than 250,000 tonnes annually. These are the top producers and are likely to produce most of the world’s copper. We’ll create a DataFrame containing these top producing countries so we can analyse the data in more detail.
Selecting countries that have produced >11 million tonnes in total returns a DataFrame containing 13 countries. If we want to see what proportion of the world's copper has been produced by these 13 countries, we can group the data by year and sum each of these groups, getting the total copper mined annually. Pandas' groupby is one of its most useful functions. It's well worth getting to grips with, and a good place to start is Python for Data Analysis by Wes McKinney, the creator of Pandas.
The `.groupby()` function creates a DataFrame with the index set to the grouped parameter, in this case year. For convenience when plotting the data, reset the index to integers by calling `.reset_index(inplace=True)` on the DataFrame.
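The group-by-year step might look like this (illustrative data again):

```python
import pandas as pd

cu_melt = pd.DataFrame({
    'Country': ['Chile', 'Chile', 'Peru', 'Peru', 'USA', 'USA'],
    'Year': ['yr1970', 'yr1971'] * 3,
    'Tonnes mined per year by country': [685, 708, 220, 207, 1560, 1381],
})

# Sum across countries for each year; 'Year' becomes the index
annual = cu_melt.groupby('Year')[['Tonnes mined per year by country']].sum()

# Reset the index back to integers for easier plotting
annual.reset_index(inplace=True)
print(annual)
```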
Now we can add the production data for the top 13 to the line plot of annual global production, by plotting the top-13 data on the same ax using `.plot()`. The graph shows that the majority of the world's copper is mined in these 13 countries.
If we want a better look at the total amount produced by these top 13 we can again use `.groupby()` to group 'Tonnes mined per year by country' by 'Country', and sum the values for each country. Plotting this data on a bar chart shows the total amount mined by each of the top 13 countries since 1970.
To look at how the level of production has changed over time for each of these countries, plot 'Tonnes mined per year by country' for each country against 'Year'. There are a number of ways to make this plot.
We can set the 'Year' column as the DataFrame index using `.set_index()`, then use `.groupby()` to group the data by country and call `.plot()` on the 'Tonnes mined per year by country' column.
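A minimal sketch of the set_index/groupby/plot chain (illustrative data; each country group is drawn as its own line on the shared axes):

```python
import pandas as pd
import matplotlib.pyplot as plt

cu_melt = pd.DataFrame({
    'Country': ['Chile', 'Chile', 'Peru', 'Peru', 'USA', 'USA'],
    'Year': ['yr1970', 'yr1971'] * 3,
    'Tonnes mined per year by country': [685, 708, 220, 207, 1560, 1381],
})

fig, ax = plt.subplots()
# With 'Year' as the index, .plot() on each country group draws one line
(cu_melt.set_index('Year')
        .groupby('Country')['Tonnes mined per year by country']
        .plot(ax=ax))
```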
Another approach is to pivot the Top_cu_mines DataFrame so that each country row becomes a column containing the values for tonnes mined per year. This can be plotted simply by calling `.plot()` on the pivoted DataFrame.
If we reset the index of the pivoted DataFrame, so year is a variable column again, we can plot each country column against ‘Year’ as separate lines, as shown below; this gives greater control over line style.
Lists of country columns can also be created and plotted in a single color, as I've done below with the low_5 producing countries; plotting these as grey lines simplifies the graph somewhat. These and a few other plotting variations are available in the Jupyter Notebook.
What does our graph tell us about copper production? Since the early 1990s Chile has been the world's top copper-mining country, rapidly increasing the tonnes mined between 1995 and 2003. In 2018 Chile produced almost 6 million tonnes of copper, approximately 30% of all copper mined that year. The USA's production has declined since 1998, while Peru and China have mined increasing amounts of copper in recent years. For the remaining countries, production levels have stayed relatively constant.
In this post you have seen some of the many useful tools and techniques Pandas and Matplotlib provide to manipulate, analyse and visualize data. Matplotlib allows a high level of plot customization, but that can make producing good-quality plots time consuming. In future posts we'll look at other Python libraries that produce high-quality graphs in only a few lines of code.