Pieces of Py #3: Clean up data from csv and save to json in a simple way with pandas
Posted on Mon 05 August 2019 in Python • 4 min read
At work I once had to take a csv file that was exported from a large list in SharePoint, clean it up a bit and export it to JSON for importing into a data layer. The csv file contained all the columns in the SharePoint list, which were quite a few, and some of them contained information that was not useful in the data layer.
I could have used the built-in csv library in Python, but I had read a little bit about pandas and thought it would be faster and easier to use that. It was also an opportunity for me to learn a little about pandas.
The data file
For this example I have generated a representative sample csv with the help of mockaroo.com. I've put the data into a file called SERVER_DATA.csv in a data folder. Here is a preview of how the top of the csv file looks.
id;server_name;ip_adress;mac_address;misc_info;status;size
1;swe_oefg_767;28.118.94.105;D8-8F-4A-21-FA-E3;;production;small
2;swe_uqyu_254;248.221.85.237;AA-05-A5-D3-5D-8C;4.1;production;small
3;swe_dkhr_247;188.169.9.5;30-19-FD-98-39-D0;;discontinued;large
4;swe_fini_035;42.30.234.149;5F-DE-2E-33-36-11;;discontinued;small
5;swe_wyff_450;198.113.163.165;F4-55-D6-AF-FC-7C;5.5.6;discontinued;medium
6;swe_opha_769;231.229.91.157;55-2A-38-75-8F-51;;discontinued;large
7;swe_nkeu_365;50.138.198.80;28-BD-EA-D0-83-F6;0.9.6;production;small
8;swe_zwsu_221;158.199.8.235;EC-3C-BE-C9-AA-B7;;discontinued;medium
9;swe_txjz_516;141.195.246.105;4E-E0-42-79-30-2F;1.54;discontinued;medium
10;swe_ccdm_053;150.107.67.30;33-57-6B-6F-AB-C2;;production;medium
11;swe_lerd_015;253.152.33.116;82-CA-F8-60-98-A0;8.2.3;discontinued;medium
12;swe_qjvb_567;78.118.62.146;95-FA-53-0F-A5-A4;;production;large
Load the file into a dataframe
First I load the file into a pandas dataframe (df) and use head() to have a look at the first five rows.
import pandas
# read the csv file into a dataframe
file = 'data/SERVER_DATA.csv'
df = pandas.read_csv(file, sep=";")
df.head()
|   | id | server_name | ip_adress | mac_address | misc_info | status | size |
|---|---|---|---|---|---|---|---|
| 0 | 1 | swe_oefg_767 | 28.118.94.105 | D8-8F-4A-21-FA-E3 | NaN | production | small |
| 1 | 2 | swe_uqyu_254 | 248.221.85.237 | AA-05-A5-D3-5D-8C | 4.1 | production | small |
| 2 | 3 | swe_dkhr_247 | 188.169.9.5 | 30-19-FD-98-39-D0 | NaN | discontinued | large |
| 3 | 4 | swe_fini_035 | 42.30.234.149 | 5F-DE-2E-33-36-11 | NaN | discontinued | small |
| 4 | 5 | swe_wyff_450 | 198.113.163.165 | F4-55-D6-AF-FC-7C | 5.5.6 | discontinued | medium |
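As a side note, info() gives a quick overview of what pandas inferred from the file: column dtypes, non-null counts and memory usage. A minimal sketch of that check:

# summary of columns, dtypes and non-null counts
df.info()

# number of rows and columns
print(df.shape)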
Remove columns
In this case I know that the misc_info column is not needed in the destination database, so I want to remove it. This is easily done with the drop() method of the dataframe object. I pass the name of the column (or columns) that I want to remove as a list. Then let's use head() again to see how the data looks.
# Remove unwanted columns
df = df.drop(columns=[
'misc_info',
])
df.head()
|   | id | server_name | ip_adress | mac_address | status | size |
|---|---|---|---|---|---|---|
| 0 | 1 | swe_oefg_767 | 28.118.94.105 | D8-8F-4A-21-FA-E3 | production | small |
| 1 | 2 | swe_uqyu_254 | 248.221.85.237 | AA-05-A5-D3-5D-8C | production | small |
| 2 | 3 | swe_dkhr_247 | 188.169.9.5 | 30-19-FD-98-39-D0 | discontinued | large |
| 3 | 4 | swe_fini_035 | 42.30.234.149 | 5F-DE-2E-33-36-11 | discontinued | small |
| 4 | 5 | swe_wyff_450 | 198.113.163.165 | F4-55-D6-AF-FC-7C | discontinued | medium |
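If you prefer the opposite approach, you can also select the columns to keep instead of dropping the ones you don't want. A small equivalent sketch, using the column names from this file:

# keep only the columns needed in the destination database
df = df[['id', 'server_name', 'ip_adress', 'mac_address', 'status', 'size']]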
Filter out unwanted rows
The misc_info column is removed. If I look at the status column, there are servers that are no longer in production. I do not want those in the destination database either, so let's filter them out with the following command, and then use head() again.
# filter out discontinued servers
df = df[df['status'] != 'discontinued']
df.head()
|   | id | server_name | ip_adress | mac_address | status | size |
|---|---|---|---|---|---|---|
| 0 | 1 | swe_oefg_767 | 28.118.94.105 | D8-8F-4A-21-FA-E3 | production | small |
| 1 | 2 | swe_uqyu_254 | 248.221.85.237 | AA-05-A5-D3-5D-8C | production | small |
| 6 | 7 | swe_nkeu_365 | 50.138.198.80 | 28-BD-EA-D0-83-F6 | production | small |
| 9 | 10 | swe_ccdm_053 | 150.107.67.30 | 33-57-6B-6F-AB-C2 | production | medium |
| 11 | 12 | swe_qjvb_567 | 78.118.62.146 | 95-FA-53-0F-A5-A4 | production | large |
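The same boolean-indexing pattern extends to more than one condition: wrap each comparison in parentheses and combine them with & (and) or | (or). A hypothetical example that would also filter out the small servers:

# hypothetical: keep production servers that are not small
df = df[(df['status'] != 'discontinued') & (df['size'] != 'small')]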
Export to json
Now the data seems to be clean and ready to export to json. The dataframe has a to_json() method for that task. It can output the data in several different formats (look in the pandas documentation for the formats available). I am using the records format to get a list-like output.
df.to_json('data/server_data.json', orient='records')
The top rows of the json file are shown below.
[
{
"id": 1,
"server_name": "swe_oefg_767",
"ip_adress": "28.118.94.105",
"mac_address": "D8-8F-4A-21-FA-E3",
"status": "production",
"size": "small"
},
{
"id": 2,
"server_name": "swe_uqyu_254",
"ip_adress": "248.221.85.237",
"mac_address": "AA-05-A5-D3-5D-8C",
"status": "production",
"size": "small"
},
{
"id": 7,
"server_name": "swe_nkeu_365",
"ip_adress": "50.138.198.80",
"mac_address": "28-BD-EA-D0-83-F6",
"status": "production",
"size": "small"
},
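As a quick sanity check, the file can be read back with the standard json library to confirm that the records format really produced a plain list of objects. A minimal sketch:

import json

# load the exported file and look at the first record
with open('data/server_data.json') as f:
    servers = json.load(f)

print(len(servers))               # number of production servers
print(servers[0]['server_name'])  # swe_oefg_767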
Conclusion
In this article I used pandas to load and clean up a csv file, and then exported it to a json file. All of it was done in around six lines of code. I don't know how many lines of code it would have taken to do this with the built-in csv library, but pandas provides some really easy-to-use tools for doing work like this on different data files, like csv, excel, json and other formats.
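Just for comparison, here is a rough sketch of how the same pipeline might look with only the standard library. I haven't run this against the real file, so treat it as an approximation:

import csv
import json

# read the semicolon-separated file into a list of dicts
with open('data/SERVER_DATA.csv', newline='') as f:
    cleaned = []
    for row in csv.DictReader(f, delimiter=';'):
        # skip servers that are no longer in production
        if row['status'] == 'discontinued':
            continue
        # remove the unwanted column
        row.pop('misc_info')
        # DictReader gives strings, so cast the id back to an int
        row['id'] = int(row['id'])
        cleaned.append(row)

# write the cleaned rows to json
with open('data/server_data.json', 'w') as f:
    json.dump(cleaned, f)

It is not terrible, but the pandas version is shorter and reads more like a plain description of the steps.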
These basic tasks are so straightforward that you really don't need to be a Python developer to do them. Any tech-savvy person could do this using a hosted Jupyter notebook (e.g. notebooks.azure.com), and it might even be easier than doing it in Excel.
Resources
- Pandas home page
- Pandas.read_csv - Pandas documentation
- Pandas.DataFrame.to_json - Pandas documentation
Let me know on Twitter if I can improve this article, or if you have other resources to help out with understanding this topic.