Skip to main content

Command Palette

Search for a command to run...

PB51: Clean Pipelines Mean Nothing

Updated
•8 min read

Clean is not a compliment. It just means your pipeline ran without crashing. The UFO data made it through, the transform applied, the output wrote. Clean. But some of those sightings are duplicates. The pipeline did not care. It just moved the water. This lesson is about something harder than clean: data quality and data integrity.


Removing Duplicate Rows

A duplicate is a row that should not exist twice. In Pipeline Builder, the Keep Duplicates transform will surface them for you. Any row that appears more than once comes through. From there you can see exactly what you are dealing with. The best way to learn is by building, so let's build!


Insert a transform after ufo_sightings_raw


Type keep and select Keeps duplicates


You can either leave the column subset selection empty, which selects all of them, or explicitly add every column yourself. Either way gets you to the same place.


Press Apply and Close


If you click the transform you should see 18 rows, which means 9 duplicate pairs.


To remove the duplicates. Click Edit on the transform.


Click the Delete Icon on the board for the KEEPS DUPLICATES


Type dup, and choose Drop duplicates


You can either leave the column subset selection empty, which selects all of them, or explicitly add every column yourself. Either way gets you to the same place.


Press Apply and Close


Rename the node to Drop Duplicates and change the color to Transform Teal


You will notice the Drop Duplicates node needs to be reconnected. Click the circle on the right side of Drop Duplicates and drag it to Shape to Uppercase.


Press the Layout nodes icon to tidy up the arrangement.


Verify duplicate removal

Let's double check our duplicate removal. Click ufo_sightings_raw and then click Calculate row count in the bottom right of the preview window.


It should display 80,332 total rows


Next, click the Drop Duplicate node and Calculate row count in that preview window. It should display 80,323

80,332 - 80,323 = 9 Duplicate rows removed.


Creating Primary Keys

Now that the duplicates are gone, every remaining row is unique. That means we can give each one a permanent identifier. That is all a primary key is: a unique ID per row so you can reference it, join on it, and trust it downstream.


To create a primary key we are going to concatenate (it just means combining) all of the fields together. To do so, click to add a transform after the Drop Duplicate node. And search for concatenate, choose Concatenate strings


Type the pipe | (on the US keyboard the key above enter) we use the pipe because it is a character that is unlikely to be in our data. Name the output field primary_key. And leave the Null output unchecked

Press Apply


Press Preview to see the results of the transform


You can see that the transform combined all the fields in the dataset


We are next going to take those combined fields and 'hash' them to create the unique identifier of the row, the primary key.

Imagine you put any piece of text into a machine. It does not matter if it is one word or an entire book. The machine always spits out a fixed length string of random looking characters. Same input always produces the same output. Different input always produces different output.


Add another step to the transform under the Concatenate strings. Type hash and choose Hash sha256


Choose primary key and press Apply


Rename the transform node to Create pk. Click the Create pk node and view the glorious primary key that was created.



Primary keys are important

Primary keys are important. A duplicate primary key does not just cause a problem in this dataset, it breaks everything downstream that relies on it. To make sure that never happens we are going to apply a Data Expectation, which is Foundry's way of encoding a rule that the data must always follow.

First, we need to add the primary key to the schema. To do so, click and edit the Set Schema node


In the columns to select, click the primary_key press apply. Drag it from the very end to the front. Press Apply and Close


Now click the ufo_sightings_clean and Pipeline outputs icon. Notice the alert telling us the 1 dropped. In order to add the primary key, click the Use upstream schema


Press Save


Setting up Data Expectations


In the Pipeline outputs, click Configure expectations


Click Add data expectation


Click Primary Key


Leave it on Fail, and add the primary_key and press Apply


Adding additional data

The data is solid now. Duplicates removed, primary keys assigned, expectations in place. But let us make things more interesting. We are going to bring in some current UFO sightings and union them to what we already have, extending our pipeline to handle multiple sources.


There are many ways to add data to a Pipeline Builder. Let us use the Enter Data Manually option. Click Add Data and then click Enter data manually.


We want our manually entered data to have the same columns as the existing dataset. The quickest way to match them is to load the schema definition from the ufo_sightings_raw file.


To do so click From dataset


Navigate to UFO Sightings \ data \ raw and choose ufo_sightings_raw and then click Select


Rename it to Manually entered sightings notice
Notice the columns that were added from the schema


Let's navigate to https://nuforc.org/ so we can copy and paste some new data into the manually entered sightings. On the nuforc.org page click Data Bank


Scroll down and click Index by EVENT DATE


Choose a month


Copy one of the rows, from OCCURRED to SHAPE


And paste it into cell values of the manual entry


Copy the SUMMARY

into the comments cell.


Copy this row twice like this, and click Apply and Close


Testing the primary key data expectation

To test the primary key data expectation let's union after the Drop Duplicates. To do so, Click Drop Duplicates and choose Union.


Click the Manually entered sighting dataset and then click Start


Keep the default type of Union by name, Click Apply and then Close


Drag the connection from the Union to Create pk


Press Save and then Deploy the pipeline

There are many visual indicators that the build failed. A toast alert on the top, a toast alert from the bottom left.


Click the monitored items, the red x next to deploy. Then click the failed build ufo_sightings_clean to see the details.


As was expected, the build failed the Primary key check.



Delete the union node by right clicking on it and choosing Remove node


Create a new union off the ufo_sightings_raw and connect it to the Manually entered sightings


Connect it to the Drop Duplicates node


Connect from Drop Duplicates to Create pk, and press the layout icon


Press Save and Deploy


And now it deploys successfully


You started this lesson with a pipeline that ran. You are ending it with a pipeline you can trust. Duplicates removed. Primary keys assigned. Data expectations enforced. The sluice is still moving water, but now it knows what good water looks like, and it will not let bad water through.

If you get stuck at any point, message me on LinkedIn.

4 views