Our frosty journey continues with Frosty Friday challenge #3, where we keep nourishing our Snowflake skills. In this challenge we will be ingesting S3 data as we did in challenge #1, but this time with a twist.
The main task in this challenge is to track the S3 bucket files and create a table that lists all the files that contain any of the keywords in the keywords.csv file.
Time to get our hands dirty…
Environment Configuration and Staging Files
As always, we will first set up our environment, create a CSV file format, and stage the files.
// Environment Configuration
USE WAREHOUSE COMPUTE_WH;
USE DATABASE ATZMON_DB;
USE SCHEMA challenges;
// Creating a CSV file format
CREATE OR REPLACE FILE FORMAT ch3_csvformat
type = 'CSV'
field_delimiter = ','
skip_header = 1;
// Creating a Stage Object to reference data files stored in a s3 bucket
CREATE OR REPLACE STAGE ch3_stage
file_format = ch3_csvformat
url = 's3://frostyfridaychallenges/challenge_3/';
Let’s take a look at the stage files:
// View files in stage
LIST @ch3_stage;
Stage Files Metadata
Among the files listed, we can locate a ‘keywords’ CSV file. We would like to examine it, but this time we will also get acquainted with file metadata.
Snowflake automatically generates metadata for files in internal (i.e. Snowflake) stages or external (Amazon S3, Google Cloud Storage, or Microsoft Azure) stages.
https://docs.snowflake.com/en/user-guide/querying-metadata.html
Currently, the following metadata columns can be queried or copied into tables:
- METADATA$FILENAME – Name of the staged data file the current row belongs to. Includes the path to the data file in the stage.
- METADATA$FILE_ROW_NUMBER – Row number for each record in the staged data file.
SELECT
METADATA$FILENAME,
METADATA$FILE_ROW_NUMBER,
$1,
$2,
$3
FROM @ch3_stage/keywords (file_format => 'ch3_csvformat');
The first two columns are the generated metadata columns, and $1, $2 and $3 are the file’s columns. In column 1 we can see the keywords we want to filter the rest of the files on. Column 2 probably lists the name of the person who uploaded each keyword…
Keywords Table
As a first step, we will create a table and load all the keywords into it.
// Creating an empty table 'ch3_keywords_tbl'
CREATE OR REPLACE TABLE ch3_keywords_tbl (keyword VARCHAR, added_by VARCHAR);
// Load data into the keywords table
COPY INTO ch3_keywords_tbl
FROM
(
SELECT $1, $2
FROM @ch3_stage/keywords (file_format => 'ch3_csvformat')
);
// Checking the newly created table with 3 records
SELECT * FROM ch3_keywords_tbl;
As a reminder, our objective is to create a table that lists all the files in our stage that contain any of the keywords in the keywords.csv file. So our next and final step will be to filter out all file names that don’t contain any of the keywords we loaded into our table.
Filtering
// Query the stage for all distinct files containing the requested keywords
SELECT DISTINCT(METADATA$FILENAME::STRING) as FILE_NAME
FROM @ch3_stage (file_format => 'ch3_csvformat')
WHERE
CONTAINS(METADATA$FILENAME::STRING, 'week3')
AND METADATA$FILENAME::STRING LIKE ANY (SELECT CONCAT('%', keyword, '%')
FROM ch3_keywords_tbl);
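The query above only displays the matching file names. If you would also like to persist that list as an actual table, as the challenge wording suggests, a minimal sketch is to wrap the same query in a CREATE TABLE … AS SELECT (the table name ch3_keyword_files_tbl below is just a placeholder):
// Optional: materialise the result as a table (table name is a placeholder)
CREATE OR REPLACE TABLE ch3_keyword_files_tbl AS
SELECT DISTINCT(METADATA$FILENAME::STRING) as FILE_NAME
FROM @ch3_stage (file_format => 'ch3_csvformat')
WHERE
CONTAINS(METADATA$FILENAME::STRING, 'week3')
AND METADATA$FILENAME::STRING LIKE ANY (SELECT CONCAT('%', keyword, '%')
FROM ch3_keywords_tbl);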
OK, let’s be a bit clearer about what is going on in the filtering query.
In general, we would like a list of distinct file names containing the requested keywords. Since we have created a keyword table, we can use the ‘keyword’ column in our query.
Checking out the WHERE clause, there are 2 basic conditions:
- File name strings should contain ‘week3’.
- File name strings should be like any of the strings captured in the ‘keyword’ column that we have in our keyword table (ch3_keywords_tbl).
In the second condition we are using LIKE ANY, which allows case-sensitive matching of strings against one or more patterns – in our case, the patterns built from the keyword table. Our patterns also make use of wildcards (%): SELECT CONCAT('%', keyword, '%') FROM ch3_keywords_tbl.
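To see LIKE ANY in isolation, here is a small standalone sketch with made-up file names and hard-coded patterns (none of these values come from the challenge data); it keeps only the rows that match at least one pattern:
// Hypothetical, standalone LIKE ANY example (values are made up for illustration)
SELECT column1 AS file_name
FROM VALUES
('week3_data_alpha.csv'), // matches '%alpha%'
('week3_data_beta.csv'), // matches '%beta%'
('week3_data_other.csv') // matches neither pattern and is filtered out
WHERE column1 LIKE ANY ('%alpha%', '%beta%');
Since the matching is case-sensitive, a value like ‘week3_data_ALPHA.csv’ would not match ‘%alpha%’.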
And we are Done!
I hope this blog helped you clarify and better understand the challenge.
Check out our blogs page to view more blogs on Tableau, Alteryx and Snowflake.
Work together with one of our consultants and maximise the effects of your data.
Contact us, and we’ll help you right away.