Alteryx: Should you lose the training wheels?

2 oktober 2021
Mike Droog
Alteryx
0

Programs like Alteryx have opened up the world of ETL to other people besides programmers and have made it possible for people with a lot of domain knowledge to start exploring their data.

One of the things that Alteryx has done, besides giving you some basic tools like Select and Join, is give “catch-all”-tools like Cleansing and Auto-Field. These tools are easily implemented by people who might not know which data type would be best suited for their data.
We’ve always told our clients to be careful with tools like Auto-Field because they might give unexpected results due to data errors but discounting these situations:

Can or should these training wheel tools still be used by more experienced users?

We’ve already read multiple topics and murmurings of more experienced users, letting others know that the Cleansing Tool isn’t as efficient as it could (link) be but how much is it really costing in terms of time? To figure that out, we’ve set up “The Experiment”.

Preparing The Experiment

We’ve taken a clean dataset consisting of 34.959.672 rows and 6 columns
Using a Random Sample Tool, we’ve taken 20% of the set and had the Multi-Field Formula Tool make a mess of it for us :

IF (rand()*10)<2 then null() else [CurrentField] endif

This randomly introduced the needed nulls in our data that the flow had to clean up. After this, we’ve re-unioned the data back onto the entire set, giving us 41.951.606 records to work through.

But to make it a bit more obvious I used the Append Tool to blow this dataset up to 839.032.120 rows and 8 columns.

After Caching our Preparations to account for the randomness in this build, we can finally start our flow with the data :

Year	Age	Ethnic	Sex	Area	count	Extra String	Extra Int
2018	0	[NULL]	1	1	795	yes	1
2018	0	1	1	1	795	yes	[NULL]
2018	0	1	1	1	[NULL]	yes	3

A sample of the used data

The Contenders

The Challenge consist of 2 challenging teams :
– The Training wheels, consisting of the Auto-field toolset on all of the columns and a Cleansing tool, set up to exchange nulls for either empty strings or 0’s.
– The “Advanced Approach”, consisting of a Select Tool, is manually set up to change the columns to various Int or Str types. After this, a Formula tool with a calculation for every column consisting of the same type of formula :

IF isnull([FIELD] then "" else [FIELD] ENDIF

The Results

I’ve run the setup twice to filter out any lagging issues and the results are pretty close between the attempts with a pretty surprising twist :

Training wheels: Attempt 1	Ms	Minutes	Percentage
Data Cleansing	2973795	49.56325	63.58
Auto Field	1583974	26.39957	33.87
Cache	119379	1.98965	2.55
Finished running in 1:17 hours

Training wheels: Attempt 2	ms	minutes	Percentage
Data Cleansing	2879006	47.98343	62.65
Auto Field	1586017	26.43362	34.51
Cache	130476	2.1746	2.84
Finished running in 1:16 hours

Both runs are pretty even, taking a total of 1 hour and 16/17 minutes with the Cleansing tool taking up about 2/3 of that time.

Advanced Approach: Attempt 1	ms	min	Percentage
Formula	691633	11.52722	55.91
Select	431019	7.18365	34.85
Cache	114305	1.905083	9.24
Finished running in 20:37 minutes

Advanced Approach: Attempt 2	ms	min	Percentage
Formula	697138	11.61897	56.23
Select	430259	7.170983	34.70
Cache	112429	1.873817	9.07
Finished running in 20:40 minutes

Same as the previous 2; the timings aren’t very different with the Formula tool taking a bit more than half of the time.

If we look at the overall results we can see a clear winner: 1h17m vs 20m!
The tools that we can really compare are the Formula and Data Cleansing tools; they’ve got the exact same function but the formula tool is 4 times as fast in comparison, talk about inefficiency!

I was personally quite surprised by the fact that the Select tool switching over the columns took a bit more than 7 minutes but also here, a big win from the Auto Field.

The Twist

Timing, however, isn’t everything.
The select tool was set as follows to avoid any size issues:

Field	Type	Size
Year	Int64	8
Age	Int64	8
Ethnic	Int64	8
Sex	Bool	1
Area	V_WString	254
count	Double	8
Extra String	V_WString	3
Extra Int	Int64	8

and the outfield gave the following output :

Info: Auto Field (31): The FieldType of “Year” changed to: Int16

Info: Auto Field (31): The FieldType of “Age” changed to: String(6)

Info: Auto Field (31): The FieldType of “Ethnic” changed to: Int16

Info: Auto Field (31): The FieldType of “Sex” changed to: Byte

Info: Auto Field (31): The FieldType of “Area” changed to: String(8)

Info: Auto Field (31): The FieldType of “count” changed to: Double

Info: Auto Field (31): The FieldType of “Extra Int” changed to: String(3)

The difference between these settings is pretty clear but the results can’t be underestimated!

We started with a dataset that was 35.8GB.
The Auto Field decreased this to 30.4GB.
The Select Tool increased this to 59.6GB!

Now we all know that drive space is the easiest issue to fix (with buying bigger drives) but with the “safe” settings on the select tool, we nearly doubled the needed space! I do have to admit that I didn’t give the Select tool that much attention during the setup ; I quickly scanned the data , ran the data and quickly increased the size due to warnings about size constraints.

The Conclusion

For me, we’ve got a very clear winner with the Advanced Approach but keeping an eye out for the “safe” settings in the Select tool is surely something to look out for!

We can still see a clear use of the Auto Field as an exploration tool but I wouldn’t keep it in a normal workflow that’s been placed in production. Our number 1 reason for this would normally be that an error in your data might cause columns to change type but another clear reason is the fact that I wouldn’t want to add such a huge time consumer into the flow.