Deduplicate in Prep using one of two methods

22 april 2022
Arno Jissink
Tableau Prep
0

Data Preparation using Builder: removing duplicates

When starting with a new dataset in Tableau one of the first actions should be deduplicate in Prep. Deduplication is part of the first Normal Form to remove redundancy. For this example I have created a small dummy dataset inspired by Kaggle data set containing all F1 Races between 1955 and 2022.

The dummy data set

The dummy data set to illustrate the two methods is kept simple.

raceId	season	name	date
1051	2021	Qatar Grand Prix	21/11/2021
1052	2021	Bahrain Grand Prix	28/3/2021
1052	2021	Bahrain Grand Prix	28/3/2021
1053	2021	Emilia Romagna Grand Prix	18/4/2021
1054	2021	Portuguese Grand Prix	2/5/2021
1055	2021	Spanish Grand Prix	9/5/2021
1056	2021	Monaco Grand Prix	23/5/2021

Method 1 – Deduplication with Aggregate

Within Prep the classic method uses Aggregate to deduplicate.

After the data file create an Aggregate step, or as in below flow example a Clean step can created first to inspect and remove any unwanted fields.

Note that while using that method you ensure numeric Fields like round, year are aggregated as average, maximum or minimum. I have used MAX.

Method 2 – an alternative to deduplication with ROW_NUMBER

Another way to deduplicate is with the function ROW_NUMBER. First create a Clean step, and then Create a Calculated Field using a custom calculation.

ROW_NUMBER can be used to check for duplicates:

{ PARTITION [raceId]: { ORDERBY[raceId]:ROW_NUMBER() }} // a single field deduplication.

The ROW_NUMBER function in the code above assigns a sequential row ID to each unique row. ORDERBY sequences the rows by the selected field. PARTITION will then return the number of duplicate rows by the selected Field. In effect it Groups each value of the Field.

Note that ORDERBY can have multiple fields and can be changed to ascending by adding the ASC expression behind the field. Assess if a single or multiple calculation is needed on your data set. In this dataset the following can be done:

{ PARTITION [date], [raceId]: {ORDERBY [date], [raceId] ASC: ROW_NUMBER() } }

The flow would be one clean step with one ROW_number calculation. In below image I have added both options.

Deduplicated Data set

When you would use either method in your flow with periodic data refresh, and expect possible duplicates to occur, then add a step to remove duplicates. Remove duplicates by adding a Filter change in your existing Clean step. An easy way to do this is in the Profile Pane of the deduplication step and licking on number 1, selecting Keep Only.

Note that in Alteryx you can use the Unique Tool, you can read more about it in this blog post.

I hope this small guide on deduplication was helpful. Check my blogs for more Tableau and Tableau Prep content.

Which option would you use?

If you want to read other blogs about software like Tableau and Alteryx, follow a training course or book a consultant please visit The Information Lab NL for more information.