Creating and querying a Snowflake semi-structured OBJECT

Snowflake has a lot of features. One very handy feature is that it can handle semi-structured data natively. This can be done with a semi-structured OBJECT. This makes Snowflake very versatile and flexible.

We will take a look at 2 Snowflake functions that enable us to create and query(!) semi-structured data all from within Snowflake. We will also briefly discuss why semi-structured data can speed up your queries sometimes.

Using the OBJECT Data Type in Snowflake

Snowflake handles semi-structured data with a couple of semi-structured data types. We will be focussing on just one of those data types: the OBJECT. Be sure to check out my blog if you want to know more about OBJECTs.

There are multiple Snowflake functions for working with semi-structured data. These are called Both are what are called Semi-Structured Data Functions. We will be discussing 2 of those:

  1. (semi-structured) GET (not to be confused with the DML Command GET)
  2. OBJECT_AGG

These 2 functions are complementary to one another in a sense. GET is used to create an OBJECT. OBJECT_AGG is used to get a specific value from that OBJECT.

Aggregating into a semi-structured data type

Say, we have a table with rows that contain unique names with 1 associated phone number. With OBJECT_AGG we are able to create an OBJECT from that table, with the names as keys and the phone numbers as values:

WITH phonenumbers AS (
    SELECT       'Cleopatra' AS name, '+31 6 11 11 11 11' AS phonenumber
    UNION SELECT 'Patrick'          , '+31 6 22 22 22 22'
    UNION SELECT 'Nellie'           , '+31 6 33 33 33 33'
    UNION SELECT 'Arnold'           , '+31 6 44 44 44 44'
    UNION SELECT 'Suus'             , '+31 6 55 55 55 55'
    UNION SELECT 'Guus'             , '+31 6 66 66 66 66'
    )
SELECT 
    OBJECT_AGG(name, phonenumber::VARIANT) 
    AS obj 
    FROM phonenumbers
    ;

Notice the ::VARIANT? This is an important part, since OBJECT_AGG only accepts a VARIANT for the values (in the second argument). VARIANT is another semi-structured data type used by Snowflake. We won’t go into VARIANTs any further. You can check out the Snowflake Docs if you want to know more.

The first argument of the OBJECT_AGG function, is the place where the keys go. These keys must be strings. name in the above example is already a string, so we don’t have to do anything with that.

A very important aspect of working with OBJECT_AGG is that you cannot enter duplicate keys:

SELECT 
    OBJECT_AGG(name, phonenumber::VARIANT) 
    AS obj 
    FROM (
        SELECT 'Cleopatra' AS name, '+31 6 11 11 11 11'AS phonenumber
        UNION SELECT 'Cleopatra'  , '+31 6 22 22 22 22'
        )
    ;

This code block will throw an error:

Duplicate field key 'Cleopatra'

The main point of OBJECT_AGG is to create a single semi-structured value of the semi-structured data type OBJECT from something that contains multiple rows. In other words, you make 1 row with an OBJECT from multiple rows containing multiple unique values.

GETting the values of the keys

Now that we have created a semi-structured OBJECT, we also want to be able to query it. We could use : as in the following example. (There are multiple ways of passing a key to an OBJECT in Snowflake. The Traversing Semi-Structured Data article on the Snowflake Docs explains several of those.)

WITH phonenumbers AS (SELECT {
    'Cleopatra': '+31 6 11 11 11 11'
    , 'Patrick': '+31 6 22 22 22 22'
    , 'Nellie' : '+31 6 33 33 33 33'
    , 'Arnold' : '+31 6 44 44 44 44'
    , 'Suus'   : '+31 6 55 55 55 55'
    , 'Guus'   : '+31 6 66 66 66 66'
    } AS obj)
SELECT 
    obj:Cleopatra::VARCHAR AS "Cleopatra"
    FROM phonenumbers
    ;

But we can also use the semi-structured function GET for this, which is actually pretty straightforward: just pass the created OBJECT and the desired key to the GET function and you will get the value you are looking for! GET takes 2 arguments: the first can be of a semi-structured data type like OBJECT, the second one can be a string constant.

GET example

Let’s build up the previous example in a slightly different way with GET:

WITH phonenumbers AS (SELECT {
    'Cleopatra': '+31 6 11 11 11 11'
    , 'Patrick': '+31 6 22 22 22 22'
    , 'Nellie' : '+31 6 33 33 33 33'
    , 'Arnold' : '+31 6 44 44 44 44'
    , 'Suus'   : '+31 6 55 55 55 55'
    , 'Guus'   : '+31 6 66 66 66 66'
    } AS obj)
SELECT GET(obj, 'Cleopatra') AS "Cleopatra"
    FROM phonenumbers
    ;

This gives us:

get

GET and NULLs

Now, if we were to give the (as far as I know) non-existent, but beautiful name Cleopatrick to GET we would get:

cleopatrick
WITH phonenumbers AS (SELECT {
    'Cleopatra': '+31 6 11 11 11 11'
    } AS obj)
SELECT GET(obj, 'Cleopatrick') AS "Cleopatrick"
    FROM phonenumbers
    ;

This is very important! If GET does not find the key in the OBJECT you get NULL. Don’t take my word for it:

“If the OBJECT does not contain the specified key, this function returns NULL.” (Snowflake Docs accessed 2022-01-05)

This is specific behavior not necessarily found in other tools: In Python, for example, if you pass a key to a dictionary that is not in that particular dictionary, you will get a KeyError.

Another thing to note is that case is important when passing a key to GET: the following block will also return NULL:

WITH phonenumbers AS (SELECT {
    'Cleopatra': '+31 6 11 11 11 11'
    } AS obj)
SELECT GET(obj, 'cleopatra') AS "cleopatra"
    FROM phonenumbers
    ;

Conclusion of the example

We have created a semi-structured OBJECT with the OBJECT_AGG function. Then we queried it with GET. query succes

We know how to create and query semi-structured OBJECTs in Snowflake. Now, it is time to take a look at why you would actually want to use them.

First of all, you might not have a choice. If you have semi-structured data in your orgranization, you may need to use it. The fact that Snowflake directly and natively supports this, can make a huge difference.

There might also be scenario’s where you need to do a lookup for every row in a table (think row-level access policies). These lookups are generally fast in Snowflake anyway, but could become slow when repeated often in a dynamic way that cannot leverage caching. Semi-structured OBJECTs can come to the rescue in these situations, because they can be really fast.

Why are OBJECTS fast?

DISCLAIMER: I do not know how look-ups with keys in a semi-structured OBJECT are actually implemented in Snowflake. There are many different approaches.

However, working with collections of key-value pairs in a structure like a Snowflake’s OBJECT, JSON object, or a Python dict, often relies on the concept of a hash table.

Hash tables

With a hash function you can basically take a key and turn it into something you can use to decide where to store that key—with its value. Then, you can use that same function together with that key, to quickly find where you stored the key-value pair.

You could also just go through all key-value pairs one by one, and then stop when you find the desired key. This could happen really quickly, if you are lucky. It could also mean you have to go though all the key-value pairs. This is not a problem if there are just a few pairs and you don’t have to do it often. However, if there are many and/or you have to do it many times, it could take a lot of time. Whereas, if you have a hash table, you can pass the key to the hash function and use the result to determine where you can find the key-value pair.

This is a very simplistic representation, and hash tables and functions are a very complex topic way beyond the scope of this blog. If you want to learn more about it, I could definitely recommend the following YouTube video Hash Tables and Hash Functions.

For me, the most important takeaway from this is that semi-structured data can allow you to find something, without having to search through everything.

Conclusion

I hope you enjoyed taking a look at how to use some of the Snowflake features for working with semi-structured data. We looked at the semi-structured OBJECT and how to both create and query it. Using semi-structured data in Snowflake can sometimes speed up your queries. This makes it worthwhile to consider in some circumstances. A notable example being a dynamic query that needs to run many times in slightly different or unpredictable forms. When you are not able to leverage Snowflake’s cache or clustering features, going semi-structured might be the way to go.

Make sure to visit our site The Information Lab NL to see more blog posts, training and consultancy services regarding Tableau, Alteryx and Snowflake.

Bel ons

Afspraak

Mail ons