Machine Learning in Snowflake

In the realm of machine learning, local environments are often sufficient for small-scale projects and experimentation. However, when it comes to enterprises dealing with large data sets or stringent security requirements, local setups may not be feasible. There may be data protection laws or policies that prevent the data from being exported. Additionally, locally developed solutions may not be easy to run in production environments. A single laptop might not even have enough resources for the task at hand. There are a number of potential solutions to this: organizations can opt for cloud-based solutions like Databricks and AWS SageMaker, or perform machine learning directly in their data warehouses. In this blog, we will focus on the latter approach, looking at the practicalities of implementing machine learning within Snowflake.

Snowpark

Snowpark, a feature offered by Snowflake, enables developers to write code in languages such as Python and execute it directly within Snowflake. By pushing all the transformations down to the Snowflake platform, it removes the need for separate compute infrastructure while ensuring that the data stays within Snowflake for security and governance purposes. The compute itself comes from Snowflake virtual warehouses, which supply the resources the algorithms need to run. The developer doesn't need to manage a local environment or worry that the code might not work in production.
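As a minimal sketch of what "pushing transformations down" looks like, assume a handler that receives a session and a hypothetical WEATHER table: the Snowpark DataFrame API builds a lazy query plan, and SQL is only generated and executed in the virtual warehouse when results are requested.

from snowflake.snowpark.functions import col, avg

def daily_summary(session):
    # No data is pulled yet; each call just extends the query plan
    df = (
        session.table('WEATHER')
               .filter(col('HUMIDITY') < 90)
               .group_by(col('T_MAX'))
               .agg(avg(col('TEMP')).alias('AVG_TEMP'))
    )
    # Only now does Snowflake generate and run the SQL inside a virtual warehouse
    return df.collect()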

A Practical Example

Working inside the Snowflake web interface, we can write a simple Python function that uses a scikit-learn model. Our goal is to forecast the temperature one day in advance. We start by importing the Snowpark module and any other modules we want to use.

import snowflake.snowpark as snowpark
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split as tts
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import mean_absolute_error as mae
import pandas as pd

We must then accept 'session' as a parameter to our function. This session object is provided by the runtime, so we don't need to create it ourselves, and the runtime calls the function automatically.
We can now query our database directly using the session.sql function. The resulting table can be converted to a pandas DataFrame using the to_pandas function. From that point on it's plain Python, have a blast!

def main(session):
    # We have two views, X and y, with temperature data
    X = session.sql('select temp, t_max, t_min, "temp-1", humidity, clouds, pressure, wind_deg, wind_speed from X').to_pandas()
    y = session.sql('select "temp+1" from y').to_pandas()

    # Keep the time order intact by disabling shuffling
    xtrain, xtest, ytrain, ytest = tts(X, y, test_size=0.3, shuffle=False)

    model = RandomForestRegressor()
    model.fit(xtrain, ytrain.values.ravel())
    forecast = model.predict(xtest)

    # Naive baseline: predict tomorrow's temperature with today's temperature
    naive_error = str(mae(ytest, xtest[['TEMP']]))
    # Actual error of our forecast
    error = str(mae(ytest, forecast))

    xtest['forecast'] = forecast

    performance = error + " | " + naive_error
    print(f'our forecast vs naive {performance}')

    # Write our forecast back to a Snowflake table
    forecast_df = session.create_dataframe(xtest)
    forecast_df.write.mode('overwrite').save_as_table('forecast')
    return forecast_df

After training our model it's time to save the results. We convert our pandas DataFrame back to a Snowpark DataFrame using the create_dataframe function, and then save it as a table. The entire process ran on the Snowflake cloud; at no point did we have to manage a Python environment, and the data never left Snowflake.
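If we want to sanity-check the results, we can read the saved table back. The helper below is a hypothetical follow-up rather than part of the procedure above; it assumes it runs as another worksheet handler that receives a session, and that the forecast table was written as shown.

def check(session):
    # Read the forecast table we just wrote and inspect the first rows
    results = session.table('forecast').to_pandas()
    print(results.head())
    return results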

Challenges

While in principle this should solve our problems, in practice things are never that simple. The Snowpark functionality still leaves much to be desired: there are quite a few gotchas, and operations that work fine in plain Python often don't work in Snowpark. Working with Snowpark tables is also a pain, since there is an entire DataFrame API for table operations, with its own syntax that one has to learn.
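To give a feel for that syntax gap, here is a rough sketch of the same small step written once with pandas and once with the Snowpark DataFrame API, assuming a session is available and a hypothetical WEATHER table with T_MAX, T_MIN and HUMIDITY columns.

from snowflake.snowpark.functions import col

# pandas: pull the data and work on an in-memory DataFrame
pdf = session.table('WEATHER').to_pandas()
pdf = pdf[pdf['HUMIDITY'] < 90]
pdf['TEMP_RANGE'] = pdf['T_MAX'] - pdf['T_MIN']

# Snowpark: build a lazy DataFrame that Snowflake turns into SQL
sdf = (
    session.table('WEATHER')
           .filter(col('HUMIDITY') < 90)
           .with_column('TEMP_RANGE', col('T_MAX') - col('T_MIN'))
)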

Conclusion

Snowpark is a very interesting development for data engineers and, in my opinion, the future. The benefits of not having to develop locally are legion. However, there are still a number of limitations to building a complex ML pipeline this way. I believe that as it matures it will get much better and become much easier to use. To read more about Python in Snowflake, check out this blog. May the future data warehouse be a backend to the pandas API!