Programs like Alteryx have opened up the world of ETL to people other than programmers and have made it possible for people with a lot of domain knowledge to start exploring their data.
One of the things that Alteryx has done, besides giving you basic tools like Select and Join, is give you “catch-all” tools like Cleansing and Auto Field. These tools are easy to use for people who might not know which data type would best suit their data.
We’ve always told our clients to be careful with tools like Auto Field because they might give unexpected results due to data errors, but setting those situations aside:
Can or should these training wheel tools still be used by more experienced users?
We’ve already read multiple topics and murmurings from more experienced users letting others know that the Cleansing tool isn’t as efficient as it could be (link), but how much is it really costing in terms of time? To figure that out, we’ve set up “The Experiment”.
Preparing The Experiment
We’ve taken a clean dataset consisting of 34.959.672 rows and 6 columns.
Using a Random Sample Tool, we’ve taken 20% of the set and had the Multi-Field Formula Tool make a mess of it for us:
IF (Rand()*10) < 2 THEN Null() ELSE [CurrentField] ENDIF
Since Rand() returns a value between 0 and 1, this randomly turned roughly 20% of the values in each column of the sample into nulls for the flow to clean up. After this, we unioned the messy sample back onto the full set, giving us 41.951.606 records to work through.
But to make the differences a bit more obvious, I used the Append Tool to blow this dataset up to 839.032.120 rows and 8 columns.
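For those checking the math: adding a 20% sample back onto the original 34.959.672 rows gives the 41.951.606 records mentioned above, and 839.032.120 is exactly 20 times that, so the Append Tool presumably cross-joined the set against a 20-row source that also carried in the two extra columns.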
After caching our preparations to account for the randomness in this build, we can finally start our flow with the data:
Year | Age | Ethnic | Sex | Area | count | Extra String | Extra Int |
2018 | 0 | [NULL] | 1 | 1 | 795 | yes | 1 |
2018 | 0 | 1 | 1 | 1 | 795 | yes | [NULL] |
2018 | 0 | 1 | 1 | 1 | [NULL] | yes | 3 |
The Contenders
The challenge consists of two competing teams:
– The Training Wheels, consisting of an Auto Field tool on all of the columns and a Cleansing tool set up to exchange nulls for either empty strings or 0s.
– The “Advanced Approach”, consisting of a Select tool that is manually set up to change the columns to various Int or String types, followed by a Formula tool with a calculation for every column using the same type of formula:
IF ISNULL([FIELD]) THEN "" ELSE [FIELD] ENDIF
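For the numeric columns the replacement value would be a 0 rather than an empty string; a variant along those lines (my own sketch, not necessarily the exact formula used in the workflow) would be:
IF ISNULL([FIELD]) THEN 0 ELSE [FIELD] ENDIF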
The Results
I’ve run the setup twice to filter out any lagging issues, and the results are pretty close between the attempts, with a surprising twist:
Training wheels: Attempt 1 | ms | minutes | Percentage |
Data Cleansing | 2973795 | 49.56325 | 63.58 |
Auto Field | 1583974 | 26.39957 | 33.87 |
Cache | 119379 | 1.98965 | 2.55 |
Finished running in 1:17 hours |
Training wheels: Attempt 2 | ms | minutes | Percentage |
Data Cleansing | 2879006 | 47.98343 | 62.65 |
Auto Field | 1586017 | 26.43362 | 34.51 |
Cache | 130476 | 2.1746 | 2.84 |
Finished running in 1:16 hours |
Both runs are pretty even, taking a total of 1 hour and 16 to 17 minutes, with the Cleansing tool taking up about two-thirds of that time.
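As a quick sanity check on attempt 1: 2973795 + 1583974 + 119379 = 4677148 ms, or roughly 78 minutes, which is in line with the reported 1:17 run time.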
Advanced Approach: Attempt 1 | ms | min | Percentage |
Formula | 691633 | 11.52722 | 55.91 |
Select | 431019 | 7.18365 | 34.85 |
Cache | 114305 | 1.905083 | 9.24 |
Finished running in 20:37 minutes |
Advanced Approach: Attempt 2 | ms | min | Percentage |
Formula | 697138 | 11.61897 | 56.23 |
Select | 430259 | 7.170983 | 34.70 |
Cache | 112429 | 1.873817 | 9.07 |
Finished running in 20:40 minutes |
Just like the training-wheels runs, the two attempts barely differ, with the Formula tool taking a bit more than half of the time.
If we look at the overall results we can see a clear winner: 1h17m vs 20m!
The tools that we can really compare are the Formula and Data Cleansing tools; they perform the exact same function, but the Formula tool is more than 4 times as fast. Talk about inefficiency!
I was personally quite surprised that switching over the columns with the Select tool took a bit more than 7 minutes, but here too it’s a big win over the Auto Field.
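Putting rough numbers on it with the attempt 1 timings: 2973795 ms versus 691633 ms works out to a factor of about 4.3 for Data Cleansing versus Formula, and 1583974 ms versus 431019 ms to a factor of about 3.7 for Auto Field versus Select.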
The Twist
Timing, however, isn’t everything.
The Select tool was set as follows to avoid any size issues:
Field | Type | Size |
Year | Int64 | 8 |
Age | Int64 | 8 |
Ethnic | Int64 | 8 |
Sex | Bool | 1 |
Area | V_WString | 254 |
count | Double | 8 |
Extra String | V_WString | 3 |
Extra Int | Int64 | 8 |
and the Auto Field gave the following output:
Info: Auto Field (31): The FieldType of “Year” changed to: Int16 |
Info: Auto Field (31): The FieldType of “Age” changed to: String(6) |
Info: Auto Field (31): The FieldType of “Ethnic” changed to: Int16 |
Info: Auto Field (31): The FieldType of “Sex” changed to: Byte |
Info: Auto Field (31): The FieldType of “Area” changed to: String(8) |
Info: Auto Field (31): The FieldType of “count” changed to: Double |
Info: Auto Field (31): The FieldType of “Extra Int” changed to: String(3) |
The difference between these settings is pretty clear, but the impact shouldn’t be underestimated!
We started with a dataset that was 35.8GB.
The Auto Field decreased this to 30.4GB.
The Select Tool increased this to 59.6GB!
Now we all know that drive space is the easiest issue to fix (by buying bigger drives), but with the “safe” settings on the Select tool we nearly doubled the needed space! I do have to admit that I didn’t give the Select tool that much attention during the setup; I quickly scanned the data, ran the flow and quickly increased the sizes due to warnings about size constraints.
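A quick back-of-the-envelope check (assuming 1 GB here means 1024³ bytes): 59.6 GB spread over 839032120 rows comes down to roughly 76 bytes per record with the Select settings, versus roughly 39 bytes per record after the Auto Field, so the overly wide V_WString and Int64 choices cost a few dozen extra bytes on every single row.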
The Conclusion
For me, we’ve got a very clear winner in the Advanced Approach, but the “safe” settings in the Select tool are surely something to keep an eye on!
We can still see a clear use for the Auto Field as an exploration tool, but I wouldn’t keep it in a normal workflow that’s been placed in production. Our number one reason for this would normally be that an error in your data might cause columns to change type, but another clear reason is that I wouldn’t want to add such a huge time consumer to the flow.