Quick-Fire Guide: Reading JSON Data into Pandas DataFrames

#Quick-Fire-Guide:-Reading-JSON-Data-into-Pandas-DataFrames

JSON is the de facto standard for information exchange and working with it can be a headache for data scientists wishing to access and convert data into a format within which they can better understand and manipulate the data. So this is a beginner's guide to reading in json data into Pandas dataframes. We'll also look at importing publicly available data that don't require any form of authentication and continue to parse and flattern nested data.

Python comes with a built-in package called json for encoding and decoding JSON data.

A few other imports are necessary for the context of this guide:

Loading output library...

json_normalize( ) is a package for flattening json in a pandas dataframe.

Reading JSON to Pandas - A Simple Example

#Reading-JSON-to-Pandas---A-Simple-Example

The generic way to load JSON to a DataFrame is to use pandas read_json( ). Take an example from the pandas documentation:

Loading output library...
Loading output library...

Great! We have created a simple JSON string and converted it into a pandas dataframe. However, most real-world json strings are not so nicely formatted, and are often nested, meaning we must deal with lists of dictionries with dictionaries. The deeper a JSON string is nested the more difficult it can be to deserialise.

One can use json_normalize to create dataframes from JSON files with any nested attributes.

Flattening a JSON File

#Flattening-a-JSON-File

I have a json file ready for convenience, which lists nobel prize winning countries. Using Python’s context manager, we can open the file and use json.load( ) to turn JSON-encoded data into Python objects.

We also use json_load( ) to turn our JSON-encoded data into Python objects.

Our data has some nested columns, which we unpack with json_normalize( ).

Loading output library...

Awesome! We now have a dataset that we can begin to easily manipulate and plot to get a better understanding of it.

Loading output library...
Loading output library...

Importing Raw JSON Data and Converting to Pandas DataFrame

#Importing-Raw-JSON-Data-and-Converting-to-Pandas-DataFrame

A lot of the time one usually imports JSON data directly from somewhere on the web. Python's requests library allows us to send HTTP requests without having to manually add query strings to our URLs.

Passing our raw json data to requests.get( ) gives us a Response object called r. We can get all the information we need from this object. Passing r.text, which allows us to read the content of the server’s response, to json_loads( ) to convert the data to a Python dictionary.

Again, we use json_normalize( ) on the 'data' item within our new dictionary, transpose it so that our index is our timeline rather than temperature.

Loading output library...
Loading output library...

Working with Nested Data

#Working-with-Nested-Data
Loading output library...
Loading output library...

We see one nested column, laureates. record_path tells json_normalize( ) what path of keys leads to each individual record in our JSON object. The meta parameter specifies what data we want to include from the rest of the JSON object.

So let's unpack our data in its entirety!

To recap what we just accomplished there, we:

  • passed the json object data path d'prizes'

  • passed the record path within the object we wanted to parse; laureates

  • passed the parent metadata we wanted to append; category

Loading output library...
Loading output library...

Other Formats

#Other-Formats

Sometimes, the above methods won't work on certain JSON objects. For example, the imported data below we want to parse is a nested list of dictionaries within a list, so we first need to strip away the upper layer (which is just a description of the data). Once we do that we have another list of dictionaries we'd like to express in dataframe format.

json_normalize( ) may not be able to help us here, as we are dealing with a list rather than a dictionary. It looks like it is much easier to extract that data and then load it into a dataframe directly.

This can now be done by passing our data directly to pd.DataFrame( ), at which point we can again quickly manipulate and plot.

Loading output library...

Awesome! Looks like India's economy is slowly catching up with it's mega-neighbour.

All datasets in this guide have been sourced from this Github repository, a curated list of JSON datasets that don't require any authentication.

Head over to test out the skills you have learnt in this guide!