Pieces of Py #3: Clean up data from csv and save to json in a simple way with pandas

Posted on Mon 05 August 2019 in Python • 4 min read

At work I once had to take a csv file that was exported from a large list in SharePoint, clean it up a bit and export it to a JSON format for importing into a data layer. The csv file contained all columns in the SharePoint list, which was quite a few, and some of them contained information that was not useful in the data layer.

I could have used the built-in csv library in Python, but I had read a little about pandas and thought it would be faster and easier to use that. It was also an opportunity for me to learn a little about pandas.

The data file

For this example I have generated a representative sample csv with the help of mockaroo.com. I've put the data into a file called SERVER_DATA.csv in a data folder. Here is a preview of how the top of the csv file looks.

id;server_name;ip_adress;mac_address;misc_info;status;size
1;swe_oefg_767;28.118.94.105;D8-8F-4A-21-FA-E3;;production;small
2;swe_uqyu_254;248.221.85.237;AA-05-A5-D3-5D-8C;4.1;production;small
3;swe_dkhr_247;188.169.9.5;30-19-FD-98-39-D0;;discontinued;large
4;swe_fini_035;42.30.234.149;5F-DE-2E-33-36-11;;discontinued;small
5;swe_wyff_450;198.113.163.165;F4-55-D6-AF-FC-7C;5.5.6;discontinued;medium
6;swe_opha_769;231.229.91.157;55-2A-38-75-8F-51;;discontinued;large
7;swe_nkeu_365;50.138.198.80;28-BD-EA-D0-83-F6;0.9.6;production;small
8;swe_zwsu_221;158.199.8.235;EC-3C-BE-C9-AA-B7;;discontinued;medium
9;swe_txjz_516;141.195.246.105;4E-E0-42-79-30-2F;1.54;discontinued;medium
10;swe_ccdm_053;150.107.67.30;33-57-6B-6F-AB-C2;;production;medium
11;swe_lerd_015;253.152.33.116;82-CA-F8-60-98-A0;8.2.3;discontinued;medium
12;swe_qjvb_567;78.118.62.146;95-FA-53-0F-A5-A4;;production;large

Load the file into a dataframe

First, I load the file into a pandas dataframe (df) and use head() to have a look at the first five rows.

import pandas

# read the csv file into a dataframe
file = 'data/SERVER_DATA.csv'
df = pandas.read_csv(file, sep=";")

df.head()
id server_name ip_adress mac_address misc_info status size
0 1 swe_oefg_767 28.118.94.105 D8-8F-4A-21-FA-E3 NaN production small
1 2 swe_uqyu_254 248.221.85.237 AA-05-A5-D3-5D-8C 4.1 production small
2 3 swe_dkhr_247 188.169.9.5 30-19-FD-98-39-D0 NaN discontinued large
3 4 swe_fini_035 42.30.234.149 5F-DE-2E-33-36-11 NaN discontinued small
4 5 swe_wyff_450 198.113.163.165 F4-55-D6-AF-FC-7C 5.5.6 discontinued medium
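
As an optional sanity check (not something needed for the export itself), the dataframe's info() method summarizes column names, dtypes and non-null counts, which makes missing values easy to spot.

# optional: column dtypes and non-null counts per column
df.info()

# number of rows and columns
print(df.shape)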

Remove columns

In this case I know that the column misc_info is not needed in the destination database, so I want to remove that column. This is easily done with the drop() method of the dataframe object. I pass the name of the column (or columns) that I want to remove as a list.

Then let's use head() again to see how the data looks.

# Remove unwanted columns
df = df.drop(columns=[
        'misc_info',
    ])

df.head()
id server_name ip_adress mac_address status size
0 1 swe_oefg_767 28.118.94.105 D8-8F-4A-21-FA-E3 production small
1 2 swe_uqyu_254 248.221.85.237 AA-05-A5-D3-5D-8C production small
2 3 swe_dkhr_247 188.169.9.5 30-19-FD-98-39-D0 discontinued large
3 4 swe_fini_035 42.30.234.149 5F-DE-2E-33-36-11 discontinued small
4 5 swe_wyff_450 198.113.163.165 F4-55-D6-AF-FC-7C discontinued medium
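
As a side note, the same result could also be reached already at load time by telling read_csv which columns to keep with the usecols parameter. This is just an alternative, not the approach used in this article.

# alternative: only read the columns that are actually needed
df = pandas.read_csv(file, sep=';', usecols=[
        'id', 'server_name', 'ip_adress', 'mac_address', 'status', 'size',
    ])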

Filter out unwanted rows

The misc_info column is removed. Looking at the status column, there are servers that are no longer in production. I do not want these in the destination database either, so let's filter them out with the following command and then use head() again.

# filter out discontinued servers
df = df[df['status'] != 'discontinued']

df.head()
id server_name ip_adress mac_address status size
0 1 swe_oefg_767 28.118.94.105 D8-8F-4A-21-FA-E3 production small
1 2 swe_uqyu_254 248.221.85.237 AA-05-A5-D3-5D-8C production small
6 7 swe_nkeu_365 50.138.198.80 28-BD-EA-D0-83-F6 production small
9 10 swe_ccdm_053 150.107.67.30 33-57-6B-6F-AB-C2 production medium
11 12 swe_qjvb_567 78.118.62.146 95-FA-53-0F-A5-A4 production large
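
The expression df['status'] != 'discontinued' builds a boolean mask, and indexing the dataframe with that mask keeps only the rows where it is True. The same filter could also be written with query(), shown here purely as an equivalent alternative.

# equivalent filter written with query()
df = df.query("status != 'discontinued'")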

Export to json

Now the data seems to be clean and ready to export to JSON. The dataframe has a to_json() method for that task. It can output the data in several different formats (see the pandas documentation for the formats available).

I am using the records format to get a list-like output.

df.to_json('data/server_data.json', orient='records')

The top rows of the resulting JSON file are shown below.

[
    {
        "id": 1,
        "server_name": "swe_oefg_767",
        "ip_adress": "28.118.94.105",
        "mac_address": "D8-8F-4A-21-FA-E3",
        "status": "production",
        "size": "small"
    },
    {
        "id": 2,
        "server_name": "swe_uqyu_254",
        "ip_adress": "248.221.85.237",
        "mac_address": "AA-05-A5-D3-5D-8C",
        "status": "production",
        "size": "small"
    },
    {
        "id": 7,
        "server_name": "swe_nkeu_365",
        "ip_adress": "50.138.198.80",
        "mac_address": "28-BD-EA-D0-83-F6",
        "status": "production",
        "size": "small"
    },
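
As a side note, to_json() returns the JSON as a string when no file path is given, which makes it easy to compare the different orient options before writing anything to disk. The snippet below is only an illustration; records is still the format used for the export above.

# to_json() returns a string when no path is given
print(df.head(2).to_json(orient='records'))  # a list with one object per row
print(df.head(2).to_json(orient='index'))    # one object keyed by the dataframe index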

Conclusion

In this article I used pandas to load and clean up a csv file, and then exported it to a json file. All was done in around 6 lines of code. I don't know exactly how many lines of code it would have taken with the built-in csv library; a rough sketch of one possible equivalent follows below. But pandas provides some really easy-to-use tools for work like this on different data files, such as csv, excel, json and other formats.
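
For comparison, here is a rough sketch of how the same cleanup might look with the built-in csv and json modules. It is only my guess at an equivalent, not something taken from the original workflow.

# the same cleanup with the standard library (a sketch for comparison)
import csv
import json

with open('data/SERVER_DATA.csv', newline='') as f:
    rows = list(csv.DictReader(f, delimiter=';'))

cleaned = []
for row in rows:
    if row['status'] == 'discontinued':
        continue                    # skip servers no longer in production
    row.pop('misc_info')            # drop the unwanted column
    row['id'] = int(row['id'])      # DictReader reads every field as a string
    cleaned.append(row)

with open('data/server_data.json', 'w') as f:
    json.dump(cleaned, f)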

The basic tasks I've done in this article are so straightforward that you really don't need to be a Python developer to do them. Any tech-savvy person could do this using a hosted Jupyter notebook (e.g. notebooks.azure.com), and it might even be easier than doing it in Excel.

Resources

Let me know on Twitter if I can improve this article, or if you have other resources to help out with understanding this topic.