Unnest json bigquery

Working with nested JSON data in BigQuery analytics database might be confusing for people new to BigQuery. Yet if done well, nested data structure (JSON) is a very powerful mechanism to better express hierarchical relationships between entities comparing to the conventional flat structure of tables.

In this article, we will show you how to handle nested data in BigQuery. Beside some obvious naming and syntax differences, you can apply the same approaches to handle nested data in many other SQL databases like PostgreSQL, MySQL...

The examples below use the ga_sessions sample dataset provided by Google. All queries are written with BigQuery's #StandardSQL mode.

How JSON data is stored in BigQuery

In BigQuery, JSON data may be stored in two ways:

  • In a column of type "RECORD": This data type is specifically designed to store nested structure data (JSON) in BigQuery.
  • In a column of type "STRING": The JSON value is treated just like a normal string that happens to have JSON format. Let's call this "stringified JSON".

Each scenario needs to be handled differently as follow.

Handling BigQuery's JSON data as RECORD type

What is RECORD type?

While PostgreSQL and MySQL have JSON type, BigQuery uses the RECORD (or STRUCT) type to represent nested structure. A column of RECORD type is in fact a large column containing multiple child columns.

Storing your data in RECORD columns takes a bit more effort as you need to define the data structure up-front when creating your table. For example, this is from the Create table dialogue in BigQuery:

Unnest json bigquery
Define the table schema, including schema of nested fields

On the other hand, the explicit structure brings you several benefits:

  • Consistency: Your future data is made sure to conform to the pre-defined structure → not have to worry that a valid query today will be invalid tomorrow.
  • Performance: You can selectively scan child columns instead of scanning the whole parent RECORD column, which will save you query money.
  • Ease of use: You can use dot notation (parent_col.child_col) to query from RECORD columns, which is quite convenient.

Your RECORD column can have a simple nested structure (each row only holds a single record), or it can hold nested and repeated data (each row holds multiple records). In this case, the RECORD field will have REPEATED mode:

Unnest json bigquery
`hits` is a REPEATED RECORD field

Querying key-value JSON object (non-repeated RECORD field)

Querying single (non-repeated) RECORD field is like querying key-value JSON objects. You can simply use the dot notations to access the child columns:

#standardsql select visitId , totals.hits , totals.pageviews , totals.timeOnScreen , trafficSource.campaign , trafficSource.isTrueDirect , trafficSource.adwordsClickInfo.adGroupId , trafficSource.adwordsClickInfo.campaignId from test.ga_sessions_20170801

You can chain your dot notation to query deeper field values, for example trafficSource.adwordsClickInfo.adGroupId.

Result:

Unnest json bigquery

Querying JSON array (REPEATED RECORD fields)

As briefly mentioned above, REPEATED RECORD field is an array that hold multiple records in a single row. To access the records inside, you have to:

  • Use UNNEST to flatten the field into a table where each row is an element of the array
  • Join the flattened table back to the base table.

Sample code:

#standardsql select visitId , h.hitNumber , h.time , h.hour , h.isInteraction , h.isEntrance , h.isExit , h.referer , h.page.pagePath from test.ga_sessions_20170801 left join unnest(hits) as h -- No need to specify join key

Now we have one row for each hit (interaction) in the visit. Notice that visitId value is repeated across four rows:

Unnest json bigquery

What if the child column is also a nested column?

Unnest json bigquery

In the following example, you can see that hit number three interacted with multiple product SKUs. hits.product is also a REPEATED RECORD:

Unnest json bigquery

We can continue to unnest hits.product and join back to the base table:

#standardsql select visitId , h.hitNumber , h.time , h.hour , h.isInteraction , p.productSKU , p.v2ProductName , p.productPrice from test.ga_sessions_20170801 left join unnest(hits) as h left join unnest(h.product) as p -- No need to specify join key where visitId = 1501589858

As you can see in the results, each row now represents a product interacted, and values of visitId, hitNumber, time, hour are repeated in each row.

Unnest json bigquery

Handling "stringified JSON" in BigQuery

The other common way JSON data is stored in BigQuery is as STRING data type. For example:

Unnest json bigquery
`hits` is now a string that has array format

Storing nested data as plain string requires no pre-defined schema, but it will bring you headaches in the future:

  • Inconsistency: No explicit schema means that you can have a column holding both JSON-formatted strings and normal free-form strings. This may make your query result unpredictable.
  • Bad performance and high cost: When extracting data from stringified JSON, you have to scan the whole text, which result in high query cost (more money) if your JSON string is large.
  • Lower ease of use: As demonstrated below, you will need to specify the exact JSON path to the data you want. While this is not much of a hurdle, it still takes some time to get used to.

Similar to RECORD type, stringified JSON can be a single JSON object, or a JSON array. You can check this for BigQuery's documentation on JSON functions.

Handle single-value JSON string in BigQuery

BigQuery has dedicated functions to handle this format:

  • json_extract(json_expression, json_path) will return all JSON values
  • json_extract_scalar(json_expression, json_path) will return only scalar values (string, number, boolean)

Sample code:

#standardsql SELECT visitId , json_extract_scalar(totals, '$.hits') as total_hits_scalar -- returns a number , json_extract(totals, '$.hits') as total_hits_json -- returns a number , json_extract_scalar(trafficSource, '$.adwordsClickInfo') as adwords_scalar -- returns nothing , json_extract(trafficSource, '$.adwordsClickInfo') as adwords_json -- returns a JSON object FROM test.test_json_string

As the value of hits in totals is a scalar, both functions return the same thing. The value of adwordsClickInfo in trafficSource is a JSON object so json_extract_scalar() returns nothing.

Unnest json bigquery

Handle stringified JSON array in BigQuery

With this format, you can use json_extract_array(json_expression[, json_path]) to extract array elements (json_path is optional).

In the example above, hits is a stringified JSON array:

#standardsql SELECT visitId , json_extract_array(hits) as hits FROM test.test_json_string
Unnest json bigquery

As you can see json_extract_array(hits) returns a repeated column. To access elements in the new hits column, you need to unnest it just like with a REPEATED RECORD column:

#standardsql SELECT visitId , json_extract_scalar(h, '$.hitNumber') as hit_number , json_extract_scalar(p, '$.productSKU') as product_sku , json_extract_scalar(p, '$.productBrand') as product_brand , json_extract_scalar(p, '$.productPrice') as product_price FROM test.test_json_string left join unnest(json_extract_array(hits)) as h left join unnest(json_extract_array(h, '$.product')) as p

The result is a nice flattened table:

Unnest json bigquery

If you are using Holistics BI, there are a few things you can do to enable your end-users to explore BigQuery's nested JSON structures themselves.

Create Calculated Fields to access child columns at run time

If you have created a Table Model out on top of the source table, you can create calculated fields using the same dot notation mentioned above, for examples:

// total_hits {{#THIS.totals}}.hits // traffic_adw_google_click_id {{#THIS.traffic_source}}.adwordsClickInfo.gclId

This way your end-user can drag and drop the child columns like normal during exploration:

Sorry, your browser doesn't support embedded videos.

Write SQL transformation to flatten arrays into a table

With Holistics's modeling layer, you can let your end-user have access to data in nested JSON arrays by:

  • Write a SQL model to unnest repeated columns in BigQuery into a flat table
  • Set a relationship between this derived SQL model with the base model
  • Add the derived SQL model in a dataset to expose it to your end user
Sorry, your browser doesn't support embedded videos.

Conclusion

With this, we conclude our guide to handling JSON/nested data structures in BigQuery. While we only wrote this in the context of BigQuery, you can apply the same approach to handle JSON fields and arrays in other SQL databases (like PostgreSQL, MySQL 8, Snowflake...)

If you think we have missed any frequently encountered issue when dealing with nested data, feel free to drop us a message.

Happy unnesting!

What's happening in the BI world?

Join 15k+ people to get insights from BI practitioners around the globe. In your inbox. Every week. Learn more

No spam, ever. We respect your email privacy. Unsubscribe anytime.

How do you Unnest in BigQuery?

To convert an ARRAY into a set of rows, also known as "flattening," use the UNNEST operator. UNNEST takes an ARRAY and returns a table with a single row for each element in the ARRAY . Because UNNEST destroys the order of the ARRAY elements, you may wish to restore order to the table.

Can I store JSON in BigQuery?

By ingesting JSON data as a JSON data type, BigQuery can encode and process each JSON field individually. You can then query the values of fields and array elements within the JSON data by using the field access operator, which makes JSON queries easy to use and cost efficient.

How do you flatten a JSON in Python?

There are many ways to flatten JSON. There is one recursive way and another by using the json-flatten library. Now we can flatten the dictionary array by a recursive approach which is quite easy to understand. The recursive approach is a bit slower than using the json-flatten library.

How do you use structs in BigQuery?

Struct, being the Record data type, doesn't need to be unnested. Only unnested Array of Structs (Record, Repeated) will result in multiple rows with all Struct key-value pairs. You can also select few columns from Array of Structs by using unnest and selecting those particular columns with “.”