JSON is the de facto standard for information exchange, but working with it can be a headache for data scientists who want to get the data into a format they can easily explore and manipulate. This is a beginner's guide to reading JSON data into pandas DataFrames. We'll also import publicly available data that doesn't require any authentication, then parse and flatten nested structures.
Python comes with a built-in package called json for encoding and decoding JSON data.
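The core of the json package is a pair of functions: json.loads( ) decodes a JSON string into Python objects, and json.dumps( ) encodes them back. A minimal round trip (the sample record here is made up for illustration):

```python
import json

# Decode a JSON string into a Python dict, then encode it back
record = json.loads('{"name": "Marie Curie", "prizes": 2}')
print(record["prizes"])    # a plain Python int now
print(json.dumps(record))  # round-trips back to a JSON string
```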
A few other imports are necessary for the context of this guide:
json_normalize( ) is a pandas function for flattening nested JSON into a flat DataFrame.
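A typical import block for this guide might look like the following. Note that json_normalize lives in pandas itself from version 1.0 onwards; in older versions it was under pandas.io.json:

```python
import json

import pandas as pd
import requests  # third-party; install with `pip install requests`
from pandas import json_normalize  # pandas >= 1.0
```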
The generic way to load JSON to a DataFrame is to use pandas read_json( ). Take an example from the pandas documentation:
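A sketch along the lines of the pandas documentation example (in pandas 2.x a literal JSON string should be wrapped in StringIO, since passing raw strings directly is deprecated):

```python
from io import StringIO

import pandas as pd

# A simple, flat JSON string converts cleanly with read_json
json_str = '{"col A": {"row 1": "a", "row 2": "c"}, "col B": {"row 1": "b", "row 2": "d"}}'
df = pd.read_json(StringIO(json_str))
print(df)
```

The top-level keys become columns and the inner keys become the row index.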
Great! We have created a simple JSON string and converted it into a pandas DataFrame. However, most real-world JSON strings are not so nicely formatted; they are often nested, meaning we must deal with lists of dictionaries containing further dictionaries. The deeper a JSON string is nested, the more difficult it can be to deserialise.
One can use json_normalize to create dataframes from JSON files with any nested attributes.
I have a json file ready for convenience, which lists nobel prize winning countries. Using Python’s context manager, we can open the file and use json.load( ) to turn JSON-encoded data into Python objects.
Our data has some nested columns, which we unpack with json_normalize( ).
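For example, given records whose values are themselves dictionaries (the data below is invented for illustration), json_normalize( ) expands each nested field into its own dotted column:

```python
import pandas as pd

# Hypothetical nested records: each country dict contains a nested "capital" dict
data = [
    {"country": "Sweden", "capital": {"name": "Stockholm", "population": 975000}},
    {"country": "Norway", "capital": {"name": "Oslo", "population": 709000}},
]
df = pd.json_normalize(data)
print(df.columns.tolist())  # nested keys become 'capital.name', 'capital.population'
```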
Awesome! We now have a dataset that we can begin to easily manipulate and plot to get a better understanding of it.
A lot of the time one usually imports JSON data directly from somewhere on the web. Python's requests library allows us to send HTTP requests without having to manually add query strings to our URLs.
Passing a URL to requests.get( ) gives us a Response object called r, from which we can get all the information we need. We then pass r.text, the content of the server's response, to json.loads( ) to convert the data into a Python dictionary.
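The pattern is sketched below. The URL is a placeholder rather than a real endpoint, and the response text is a made-up stand-in so the example runs offline; with a live endpoint you would use r.text (or the r.json( ) convenience method) in its place:

```python
import json

# The general pattern (URL is a placeholder, not a real endpoint):
# r = requests.get("https://example.com/api/data.json")
# data = json.loads(r.text)   # or simply: data = r.json()

# Stand-in for r.text so the sketch runs offline
response_text = '{"data": {"2020-01-01": 3.1, "2020-01-02": 2.7}}'
data = json.loads(response_text)
print(data["data"]["2020-01-01"])
```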
Again, we use json_normalize( ) on the 'data' item within our new dictionary, then transpose the result so that our index is the timeline rather than the temperature readings.
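Sketching that step with the same made-up temperature dictionary as above (the column name "temperature" is ours, not part of the data):

```python
import json

import pandas as pd

# Offline stand-in for the decoded response: temperatures keyed by date
data = json.loads('{"data": {"2020-01-01": 3.1, "2020-01-02": 2.7}}')

df = pd.json_normalize(data["data"]).T  # transpose: dates become the index
df.columns = ["temperature"]            # hypothetical column name for readability
print(df)
```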
We see one nested column, laureates. record_path tells json_normalize( ) what path of keys leads to each individual record in our JSON object. The meta parameter specifies what data we want to include from the rest of the JSON object.
So let's unpack our data in its entirety!
To recap what we just accomplished there, we:
passed the path to the records in our JSON object: data['prizes']
passed the record path within each object we wanted to parse: laureates
passed the parent metadata we wanted to append: category
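The three steps above can be sketched on a miniature, made-up version of the Nobel prizes structure:

```python
import pandas as pd

# Hypothetical miniature of the Nobel prizes JSON structure
data = {
    "prizes": [
        {
            "year": "1903",
            "category": "physics",
            "laureates": [
                {"firstname": "Marie", "surname": "Curie"},
                {"firstname": "Pierre", "surname": "Curie"},
            ],
        }
    ]
}

df = pd.json_normalize(
    data["prizes"],           # the path to the list of records
    record_path="laureates",  # each laureate becomes one row
    meta="category",          # parent field appended to every row
)
print(df)
```

Each laureate gets its own row, with the parent prize's category repeated alongside it.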
Sometimes, the above methods won't work on certain JSON objects. For example, the imported data below we want to parse is a nested list of dictionaries within a list, so we first need to strip away the upper layer (which is just a description of the data). Once we do that we have another list of dictionaries we'd like to express in dataframe format.
json_normalize( ) may not be able to help us here, as we are dealing with a list rather than a dictionary. It looks like it is much easier to extract that data and then load it into a dataframe directly.
This can now be done by passing our data directly to pd.DataFrame( ), at which point we can again quickly manipulate and plot.
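With the outer layer stripped away, a plain list of records goes straight into the DataFrame constructor. The GDP figures below are invented placeholders, but the shape matches the scenario described above:

```python
import pandas as pd

# Hypothetical records left over after stripping the descriptive outer layer
records = [
    {"country": "India", "year": 2019, "gdp": 2.87},
    {"country": "China", "year": 2019, "gdp": 14.28},
]
df = pd.DataFrame(records)
print(df)
```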
Awesome! It looks like India's economy is slowly catching up with its mega-neighbour.
All datasets in this guide have been sourced from this GitHub repository, a curated list of JSON datasets that don't require any authentication.
Head over to test out the skills you have learnt in this guide!