Numerai
In this tutorial we will use Snowflake and its Python integration, Snowpark, to participate in the Numerai tournament. Numerai is a data-driven hedge fund that runs a forecasting competition. Data scientists compete to build the best predictive models using obfuscated financial data provided by Numerai. The models are ranked based on their future performance.
Snowflake
Snowflake, on the other hand, is a cloud-based data warehousing platform for storing, managing, and processing data. One of Snowflake’s features, Snowpark, provides a Python integration that lets you execute code directly on the Snowflake cloud. This gives you access to Snowflake’s scalable and flexible infrastructure, which can mitigate hardware constraints when training and deploying models.
The RAM problem
The large size of the Numerai dataset presents a challenge, as RAM can quickly become an issue, especially when using pandas. Training a model on Snowflake can mitigate this, although if your model requires a GPU you will still need a separate service.
Step 1
Create a Snowflake account.
Download the data from Numerai to a local folder.
Create the required objects in Snowflake.
use role sysadmin;
-- replace with your database name
use lukas;
-- replace with your schema name
use schema public;
CREATE OR REPLACE WAREHOUSE snowpark_opt_wh WITH
    WAREHOUSE_SIZE = 'MEDIUM'
    WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'
    MAX_CLUSTER_COUNT = 10
    MIN_CLUSTER_COUNT = 1
    SCALING_POLICY = STANDARD
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE
    INITIALLY_SUSPENDED = TRUE
    COMMENT = 'created by lukas for machine learning';
create or replace file format numerai_parquet
    type = parquet
    compression = snappy;

create or replace file format numerai_json
    type = json
    compression = auto;

-- create internal stages
create or replace stage numerai_stage
    file_format = numerai_parquet;

create or replace stage numerai_json_stage
    file_format = numerai_json;

create stage numerai_models;
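Later, when you train the model, you will want your session running on the Snowpark-optimized warehouse created above. Assuming you kept the name snowpark_opt_wh, you can switch to it with:
use warehouse snowpark_opt_wh;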
Step 2 – upload the data
Do the following in SnowSQL to upload the files to Snowflake. Upload the JSON data to the JSON stage and the Parquet files to the numerai stage. Keep in mind that this can take a while.
put 'file:///Users/path_to_downloaded_data/train.parquet' @numerai_stage
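The remaining files are uploaded the same way. For example, assuming the default Numerai file names used elsewhere in this tutorial (validation.parquet and features.json), something like the following should work:
put 'file:///Users/path_to_downloaded_data/validation.parquet' @numerai_stage
put 'file:///Users/path_to_downloaded_data/features.json' @numerai_json_stage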
Step 3 – move the files into a table
Now that you have uploaded the files, you need to copy them into a table. You can run the following in a Python worksheet. Make sure to run it twice, once for the train data and once for the validation data.
import snowflake.snowpark as snowpark

# change the location to load both the validation and the train data
table_name = 'NUMERAI'
location = '@numerai_stage/validation.parquet'

def main(session):
    session.sql('use role sysadmin').collect()
    session.use_schema('PUBLIC')
    # read the staged parquet file and copy it into the target table
    df = session.read.option("compression", "snappy").parquet(location)
    df.copy_into_table(table_name)
    return 'success'
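As a quick sanity check, you can confirm that both loads landed in the table with a query along these lines (run it in a SQL worksheet):
select "data_type", count(*) from numerai group by 1;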
Step 4 – create view from json data
The JSON file contains the names of the columns in the data, including the list of targets and the feature sets. We need to convert it into easily queryable views in Snowflake.
create or replace view numerai_targets as
select VALUE as target_list
from @numerai_json_stage/features.json,
LATERAL FLATTEN( INPUT => $1:targets );
select * from numerai_targets;
create or replace view numerai_features_medium as
select value as feature
from @numerai_json_stage/features.json,
LATERAL FLATTEN( INPUT => $1:feature_sets.medium );
create or replace view numerai_features_small as
select value as feature
from @numerai_json_stage/features.json,
LATERAL FLATTEN( INPUT => $1:feature_sets.small );
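You can sanity-check the feature views the same way as the targets view, for example:
select count(*) from numerai_features_medium;
select * from numerai_features_small limit 10;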
Step 5 – train your model
Now that everything is set up, we can start training our model. For this tutorial we will build a simple model using the parameters supplied by Numerai in their example script. The full_df function gets the training data from the table, and the makemodel function trains the model and saves it to a stage. Make sure to use the Snowpark-optimized warehouse we created earlier.
import snowflake.snowpark as snowpark
from snowflake.snowpark.functions import col
from lightgbm import LGBMRegressor
from joblib import dump

def full_df(session):
    # build a pandas dataframe with the target column plus the medium feature set
    features = session.table("NUMERAI_FEATURES_MEDIUM")
    features = features.to_local_iterator()
    feature_list = [f[0] for f in features]
    feature_list = ['"target_nomi_v4_20"'] + feature_list
    data = session.table('numerai').select(feature_list).filter(col('"data_type"') == 'train')
    return data.to_pandas()

def makemodel(session):
    data = full_df(session)
    y = data[["target_nomi_v4_20"]]
    features = session.table("NUMERAI_FEATURES_MEDIUM")
    features = features.to_local_iterator()
    feature_list = [f[0].strip('"') for f in features]
    x = data[feature_list]
    # parameters taken from Numerai's example script
    params = {"n_estimators": 2000,
              "learning_rate": 0.01,
              "max_depth": 5,
              "num_leaves": 2 ** 5,
              "colsample_bytree": 0.1}
    model = LGBMRegressor(**params)
    print('fitting model ...')
    model.fit(x, y)
    print('done ...')
    # save the model locally, then upload it to the models stage
    dump(model, '/tmp/model2')
    upload = session.file.put('/tmp/model2', '@NUMERAI_MODELS', auto_compress=False, overwrite=True)
    return upload[0].status

def main(session):
    status = makemodel(session)
    return status
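Once the run finishes, you can confirm that the model file landed in the stage, for example:
list @NUMERAI_MODELS;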
Step 6 – predict
Now we can test our model on the validation data.
import snowflake.snowpark as snowpark
from snowflake.snowpark.functions import col
from joblib import load
from lightgbm import LGBMRegressor

# feature_set, modelname = "NUMERAI_FEATURES_SMALL", 'model1'
feature_set, modelname = "NUMERAI_FEATURES_MEDIUM", 'model2'

def full_df(session):
    # pull the id, target and feature columns for the validation rows
    features = session.table(feature_set)
    features = features.to_local_iterator()
    features = [f[0] for f in features]
    cols = ['"id"', '"target_nomi_v4_20"'] + features
    data = session.table('numerai').select(cols).filter(col('"data_type"') == 'validation').to_pandas()
    return data

def predict(session):
    data = full_df(session)
    y = data[["id", "target_nomi_v4_20"]].copy()
    x = data.drop(columns=["id", "target_nomi_v4_20"])
    del data
    # download the trained model from the stage and load it
    session.file.get(f"@NUMERAI_MODELS/{modelname}", '/tmp/')
    with open(f'/tmp/{modelname}', "rb") as f:
        model = load(f)
    prediction = model.predict(x)
    y['prediction'] = prediction
    diagnostic = y[["target_nomi_v4_20", "prediction"]]
    corr = diagnostic.corr(method="spearman")
    print('spearman', corr)
    corr = diagnostic.corr(method="pearson")
    print('pearson', corr)
    return session.create_dataframe(y[['id', 'prediction']])

def main(session):
    # worksheet entry point, mirroring the training step
    return predict(session)
Evaluation
We can download the predictions as a CSV directly from the web interface. Under the output tab we can see the correlation between our model’s predictions and the targets. I created two models, one using the small feature set and the other using the medium. The medium appears to perform slightly better.
After you have downloaded the results you can upload them to Numerai to get a more complete analysis of your model.
You can also download the models from Snowflake to run elsewhere:
get @NUMERAI_MODELS "file:///Users/path_to_download_location/";
Summary
Snowflake’s Python integration allows you to run Python code directly on your data. This can alleviate some of the performance issues of building models locally. If you are interested in a brief overview of running Python on Snowflake, check out my other blog, Machine Learning in Snowflake. For more about the Numerai tournament, check out their website.