- Import data to Pandas DataFrame
- Merge DataFrames
- Data cleaning & shaping – missing values, duplicates, dictionaries, for loops, etc.
Copper is the third most consumed industrial metal in the world, with 75% of copper used in electrical wires, telecommunication cables and electronics. Amazingly, there are about 30 miles of copper wire in an electronic chip the size of your fingernail. Copper is a key metal in the transition to a low carbon future. The low-carbon electrification of heat, transport and industry requires conductive copper wire for wind turbines, solar photovoltaics, generators, transformers, inverters and electrical cables. Given its importance, how ubiquitous is copper as a raw material? Where does the world’s copper come from? How much has been mined and how has production changed over time?
Before we can answer any of these questions we need data, and more often than not we will need to do some (or a lot of) work to get the raw data into a format that we can analyse. This post goes through the steps to merge, clean and shape a DataFrame so that it’s ready to visualize. All the analysis has been done in Jupyter notebooks; the notebooks and data files can be accessed in my GitHub repository.
Making the DataFrame
The data used is from the British Geological Survey World Mineral Statistics archive. The BGS have compiled data on mining, smelting, refining and trading for a wide range of mineral commodities. Their online portal provides access to data from 1970 to 2018; data can be selected for a maximum of 10 years at a time and exported as Excel files, which means that the data for tonnes of copper mined since 1970 is contained in 5 separate Excel files that we will have to merge. Also, since 1970, data for new countries has been added and other countries have been dropped as they no longer mine copper. Consequently, each file does not contain data for the same countries or the same number of rows. Merging these files requires the year columns to be joined sequentially left to right, and rows inserted for new countries as their production data is added in later years.
Merging files in Python is a relatively simple process, but first we have to import the necessary Python packages. Python packages are directories of files with specific functionality; the wide range of packages available is what makes Python so useful for data science. The packages we’ll use here are:
- NumPy, the fundamental package for scientific computing enabling manipulation of homogeneous array-based data.
- Pandas, provides easy-to-use data structures (Series and DataFrames) and data analysis tools for heterogeneous and labeled data.
- Matplotlib, for static, animated and interactive publication-quality visualisations.
Once imported, use the pandas pd.read_csv function to import each of the CSV files into a separate pandas DataFrame. The Excel files here have been saved as CSV files, but you can import Excel files directly using pd.read_excel. We can view the first 5 rows of a DataFrame by calling .head().
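As a minimal sketch of this step, the snippet below reads a tiny in-memory CSV instead of the real BGS export files; the country names, figures and file names are illustrative, not actual BGS data.

```python
import pandas as pd
from io import StringIO

# A tiny in-memory CSV stands in for one of the BGS export files
# (values are illustrative, not real BGS data).
csv_text = "Country,1970,1971\nChile,692000,708000\nZambia,684000,651000\n"
cu_mined_70_80 = pd.read_csv(StringIO(csv_text))

# For a real file you would point pd.read_csv at the file path, e.g.
# cu_mined_70_80 = pd.read_csv("copper_1970_1980.csv"), or use
# pd.read_excel for the original Excel exports.

print(cu_mined_70_80.head())  # .head() shows the first 5 rows
```

The same call is repeated for each of the five files, giving five separate DataFrames.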
We want to merge these 5 files into one DataFrame using the pd.merge() function. The merge function lets you select which columns the two DataFrames (the “left” and the “right”) will be merged on, and how they will be merged. There are four types of merge in Pandas:
- Inner Merge, the default Pandas setting, keeps only rows that are matched in both DataFrames.
- Outer Merge, returns all the rows from both DataFrames, matching rows where possible and filling in NaN where data is missing.
- Left Merge or Right Merge, keeps every row in the left or right DataFrame; rows with no match in the opposite DataFrame are filled with NaN.
A more detailed explanation of the merge function can be found in the Pandas documentation. To create a master DataFrame we merge the first two DataFrames, creating cu_mined_70_91. The third is merged with cu_mined_70_91, creating cu_mined_70_02, and so on until there is one DataFrame containing all production data for all countries from 1970 to 2018.
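The step above can be sketched with two toy stand-ins for the BGS exports (countries and figures are illustrative). An outer merge on the Country column is what inserts rows for countries that only appear in later files:

```python
import pandas as pd

# Toy stand-ins for two of the five export files (values illustrative).
cu_70_80 = pd.DataFrame({"Country": ["Chile", "Zambia"],
                         "1970": [692000.0, 684000.0]})
cu_81_91 = pd.DataFrame({"Country": ["Chile", "Poland"],
                         "1981": [1081000.0, 327000.0]})

# An outer merge on Country keeps every country found in either
# DataFrame and fills NaN where a country has no data in one of them.
cu_70_91 = pd.merge(cu_70_80, cu_81_91, on="Country", how="outer")
print(cu_70_91)
```

Repeating the merge with each subsequent file chains the year columns together left to right.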
The final merged cu_mined_70_18 DataFrame has 90 rows and 50 columns, as shown by calling the .shape attribute. We can also see that there is no production data for Algeria and Austria after a certain period, and these empty values have been filled with NaN.
Now that we have the merged DataFrame, the process of data cleaning begins, a task that consumes most of a data scientist’s time.
“80% of a data scientist’s valuable time is spent simply finding, cleaning, and organizing data, leaving only 20% to actually perform analysis…” (IBM Data Analytics)
First, let’s get an overview of the DataFrame using the .info() method. The information returned shows the column label and the non-null count for each column, i.e. the number of values present in each column. In the image below I’ve cut out the middle section to shorten the figure, but all of the columns have a number of values missing. The Dtype column describes the data type in each column.
Pandas has a number of data types:
- Object – string text (str) or mixed numeric and non-numeric values
- float64 – floating point numbers
- int64 – integer numbers
- bool – True/False values
- datetime64 – date and time values
- category – finite list of text values
When doing data analysis it’s important to use the correct data types, otherwise you may get unexpected results or errors. In many cases Pandas will infer the correct data types, but inevitably there will be times when you need to convert data from one type to another. A detailed look at Pandas data types can be found in the Pandas documentation. In our case all data types are as expected: object for Country and floats for tonnes of copper mined. That leaves the NaN values. In this DataFrame we’ll replace NaN with 0; in other cases you may want to fill the value with ‘missing’. We use .loc to select all rows in the columns from 1970 to 2017 and .fillna(0) to replace NaN with 0. We now have a DataFrame with no missing values.
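A minimal sketch of the .loc selection and .fillna(0) fill, on an illustrative frame with the same kind of gaps as the merged copper data:

```python
import numpy as np
import pandas as pd

# Illustrative frame with gaps like the merged copper data.
df = pd.DataFrame({"Country": ["Algeria", "Chile"],
                   "1970": [np.nan, 692000.0],
                   "2017": [np.nan, 5500000.0]})

# .loc selects all rows of the year columns by label;
# .fillna(0) replaces every NaN in that slice with 0.
df.loc[:, "1970":"2017"] = df.loc[:, "1970":"2017"].fillna(0)
print(df)
```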
The data extends back to 1970, and over that period country names have changed. We can see this in the list of country names below. If we want to use our data to create maps from world map shapefiles, we should ensure that the country names align with current official names; otherwise we will have difficulty merging DataFrames and GeoDataFrames.
Country names include Congo, German Democratic Rep, German Federal Republic and Yugoslavia. To change these to the current country names we create a dictionary with the old name as the key and the new name as the value. The .map() function takes each value in the ‘Country’ column, looks it up as a key in the dictionary and replaces it with the corresponding dictionary value. It’s important to chain .fillna() with the ‘Country’ column, as .map() replaces names that are not in the dictionary with NaN. Call .tolist() on the country column to check that the mapping has worked; only the upper part of the list is shown below.
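A sketch of the mapping step; the dictionary below is illustrative (only two of the renamed countries are shown) rather than the post’s full mapping:

```python
import pandas as pd

df = pd.DataFrame({"Country": ["Zaire", "Chile", "German Federal Republic"]})

# Old name -> current name (illustrative subset of the full mapping).
name_map = {"Zaire": "Democratic Republic of the Congo",
            "German Federal Republic": "Germany"}

# .map returns NaN for any name not in the dict, so we chain .fillna
# with the original column to keep the names that didn't change.
df["Country"] = df["Country"].map(name_map).fillna(df["Country"])
print(df["Country"].tolist())
```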
The Country column now contains the official country names in the correct format, but in some instances this has created rows with duplicated country names; for example, Democratic Republic of the Congo now replaces Congo, Democratic Republic Congo, and Zaire. We can count the number of duplicates in the ‘Country’ column using .value_counts(). This returns a count of occurrences in descending order and shows that 5 of the country labels occur more than once in the DataFrame.
As I want to look at the total amount of copper mined in each country, I need to sum the production values for each of these countries so that the production values for the entire period, before and after the name change, sit in one row. This requires a couple of steps; to reduce the amount of code we can write a for loop for some of them.
First, create a list containing the duplicate countries in the DataFrame and an empty list to store the new DataFrames in. The for loop below iterates over the list; on the first iteration the rows containing Democratic Republic of the Congo are extracted and then summed together. Summing the rows produces a Series of years and production values for the DRC. As we want to append this to our original DataFrame, the Series is converted to a DataFrame in the next line using pd.DataFrame(). The final step is to append the DataFrame to the empty list with .append(). The output of the for loop is a list of DataFrames that we can access using the list index.
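A minimal sketch of that loop on an illustrative frame; numeric_only=True is my addition here, so that the text in the Country column is left out of the sum:

```python
import pandas as pd

df = pd.DataFrame({"Country": ["Democratic Republic of the Congo",
                               "Democratic Republic of the Congo",
                               "Chile"],
                   "1970": [100.0, 200.0, 692000.0]})

dup_countries = ["Democratic Republic of the Congo"]  # illustrative list
subdf_list = []                                       # empty list for results

for country in dup_countries:
    rows = df[df["Country"] == country]       # extract the duplicate rows
    summed = rows.sum(numeric_only=True)      # Series of summed year values
    subdf_list.append(pd.DataFrame(summed))   # convert back to a DataFrame

print(subdf_list[0])  # access each result via the list index
```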
Converting the Series has returned a DataFrame representation of the Series. Transposing the DataFrame will pivot the data into a row format that can be appended to the original DataFrame.
Rather than append each DataFrame individually, we can concatenate them using pd.concat() into one DataFrame that can be joined to the cu_mined_70_18 DataFrame. We also have to change the country names in the ‘Country’ column from the repeated names to a single name. As there are only 5 names, we can use .replace() instead of creating a dictionary and mapping the names as we did earlier.
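A sketch of the concatenate-and-rename step; the two sub-DataFrames below stand in for the transposed per-country results (names and values illustrative):

```python
import pandas as pd

# Two of the per-country DataFrames from the loop, already transposed
# into row form (values illustrative).
sub_drc = pd.DataFrame({"Country": ["Zaire"], "1970": [300.0]})
sub_deu = pd.DataFrame({"Country": ["German Federal Republic"],
                        "1970": [50.0]})

# pd.concat stacks them into one DataFrame with a fresh 0..n index.
subdf_concat = pd.concat([sub_drc, sub_deu], ignore_index=True)

# With only a handful of names, .replace is simpler than dict + .map.
subdf_concat["Country"] = subdf_concat["Country"].replace(
    {"Zaire": "Democratic Republic of the Congo",
     "German Federal Republic": "Germany"})
print(subdf_concat)
```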
Remember we now have two DataFrames, the subdf_concat containing summed rows of values for 5 countries and the original cu_mined_70_18 DataFrame that contains the individual rows for these countries. We need to remove the individual country rows from the cu_mined_70_18 DataFrame before joining the two DataFrames or we will double the values for these countries.
Use .isin() to view the rows for the relevant countries in cu_mined_70_18.
Pass the index of each of these rows to the .drop() function to remove them from cu_mined_70_18. We can see that the number of rows in the DataFrame has decreased from 90 to 78. Now append subdf_concat to cu_mined_70_18, setting ignore_index=True so that the index of the appended rows continues on from the DataFrame index. The new rows are appended to the end of the DataFrame.
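A sketch of the drop-and-append step on toy frames (values illustrative). Note that DataFrame.append was removed in pandas 2.0, so this sketch joins the frames with pd.concat instead:

```python
import pandas as pd

cu = pd.DataFrame({"Country": ["Zaire", "Chile"],
                   "1970": [100.0, 692000.0]})
subdf_concat = pd.DataFrame({"Country": ["Democratic Republic of the Congo"],
                             "1970": [300.0]})

# .isin() builds a boolean mask for the individual country rows,
# whose index is then passed to .drop().
mask = cu["Country"].isin(["Zaire"])
cu = cu.drop(cu[mask].index)

# DataFrame.append was removed in pandas 2.0; pd.concat with
# ignore_index=True gives the same continuous index.
cu = pd.concat([cu, subdf_concat], ignore_index=True)
print(cu)
```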
Let’s do a quick check to make sure there are no duplicate names in the Country column by calling .value_counts() on the ‘Country’ column and confirming every count is 1. Check that there are no NaNs in any of the year columns by selecting the year columns with .iloc and calling .notnull().all().all(). If we don’t chain two .all() calls, the output will be a Series with True/False for each column; .all().all() returns a single True/False for all columns. The final step is to save the clean cu_mined_70_18 DataFrame to a CSV file using .to_csv().
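The final checks and save can be sketched as follows, on an illustrative stand-in for the cleaned DataFrame (the file name is also illustrative):

```python
import pandas as pd

# Illustrative cleaned frame standing in for cu_mined_70_18.
cu = pd.DataFrame({"Country": ["Chile", "Zambia"],
                   "1970": [692000.0, 684000.0],
                   "2017": [5500000.0, 790000.0]})

# Every country label should occur exactly once...
assert (cu["Country"].value_counts() == 1).all()

# ...and no year column should contain NaN. One .all() gives a
# True/False per column; the second collapses that to a single bool.
assert cu.iloc[:, 1:].notnull().all().all()

# Save the cleaned DataFrame without the index column.
cu.to_csv("cu_mined_70_18_clean.csv", index=False)
```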
There we have it: fewer than 25 lines of Python code to import 5 CSV files, merge and clean the DataFrame, and produce a single CSV file ready for analysis, which will be the subject of the second part of this post.