PB51: Clean Pipelines Mean Nothing

Clean is not a compliment. It just means your pipeline ran without crashing. The UFO data made it through, the transform applied, the output wrote. Clean. But some of those sightings are duplicates. The pipeline did not care. It just moved the water. This lesson is about something harder than clean: data quality and data integrity.

Removing Duplicate Rows

A duplicate is a row that should not exist twice. In Pipeline Builder, the Keep Duplicates transform will surface them for you. Any row that appears more than once comes through. From there you can see exactly what you are dealing with. The best way to learn is by building, so let's build!

Insert a transform after ufo_sightings_raw

Type keep and select Keeps duplicates

You can either leave the column subset selection empty, which selects all of them, or explicitly add every column yourself. Either way gets you to the same place.

Press Apply and Close

If you click the transform you should see 18 rows, which means 9 duplicate pairs.

To remove the duplicates. Click Edit on the transform.

Click the Delete Icon on the board for the KEEPS DUPLICATES

Type dup, and choose Drop duplicates

You can either leave the column subset selection empty, which selects all of them, or explicitly add every column yourself. Either way gets you to the same place.

Press Apply and Close

Rename the node to Drop Duplicates and change the color to Transform Teal

You will notice the Drop Duplicates node needs to be reconnected. Click the circle on the right side of Drop Duplicates and drag it to Shape to Uppercase.

Press the Layout nodes icon to tidy up the arrangement.

Verify duplicate removal

Let's double check our duplicate removal. Click ufo_sightings_raw and then click Calculate row count in the bottom right of the preview window.

It should display 80,332 total rows

Next, click the Drop Duplicate node and Calculate row count in that preview window. It should display 80,323

80,332 - 80,323 = 9 Duplicate rows removed.

Creating Primary Keys

Now that the duplicates are gone, every remaining row is unique. That means we can give each one a permanent identifier. That is all a primary key is: a unique ID per row so you can reference it, join on it, and trust it downstream.

To create a primary key we are going to concatenate (it just means combining) all of the fields together. To do so, click to add a transform after the Drop Duplicate node. And search for concatenate, choose Concatenate strings

Type the pipe | (on the US keyboard the key above enter) we use the pipe because it is a character that is unlikely to be in our data. Name the output field primary_key. And leave the Null output unchecked

Press Apply

Press Preview to see the results of the transform

You can see that the transform combined all the fields in the dataset

We are next going to take those combined fields and 'hash' them to create the unique identifier of the row, the primary key.

Imagine you put any piece of text into a machine. It does not matter if it is one word or an entire book. The machine always spits out a fixed length string of random looking characters. Same input always produces the same output. Different input always produces different output.

Add another step to the transform under the Concatenate strings. Type hash and choose Hash sha256

Choose primary key and press Apply

Rename the transform node to Create pk. Click the Create pk node and view the glorious primary key that was created.

Primary keys are important

Primary keys are important. A duplicate primary key does not just cause a problem in this dataset, it breaks everything downstream that relies on it. To make sure that never happens we are going to apply a Data Expectation, which is Foundry's way of encoding a rule that the data must always follow.

First, we need to add the primary key to the schema. To do so, click and edit the Set Schema node

In the columns to select, click the primary_key press apply. Drag it from the very end to the front. Press Apply and Close

Now click the ufo_sightings_clean and Pipeline outputs icon. Notice the alert telling us the 1 dropped. In order to add the primary key, click the Use upstream schema

Press Save

Setting up Data Expectations

In the Pipeline outputs, click Configure expectations

Click Add data expectation

Click Primary Key

Leave it on Fail, and add the primary_key and press Apply

Adding additional data

The data is solid now. Duplicates removed, primary keys assigned, expectations in place. But let us make things more interesting. We are going to bring in some current UFO sightings and union them to what we already have, extending our pipeline to handle multiple sources.

There are many ways to add data to a Pipeline Builder. Let us use the Enter Data Manually option. Click Add Data and then click Enter data manually.

We want our manually entered data to have the same columns as the existing dataset. The quickest way to match them is to load the schema definition from the ufo_sightings_raw file.

To do so click From dataset

Navigate to UFO Sightings \ data \ raw and choose ufo_sightings_raw and then click Select

Rename it to Manually entered sightings notice
Notice the columns that were added from the schema

Let's navigate to https://nuforc.org/ so we can copy and paste some new data into the manually entered sightings. On the nuforc.org page click Data Bank

Scroll down and click Index by EVENT DATE

Choose a month

Copy one of the rows, from OCCURRED to SHAPE

And paste it into cell values of the manual entry

Copy the SUMMARY

into the comments cell.

Copy this row twice like this, and click Apply and Close

Testing the primary key data expectation

To test the primary key data expectation let's union after the Drop Duplicates. To do so, Click Drop Duplicates and choose Union.

Click the Manually entered sighting dataset and then click Start

Keep the default type of Union by name, Click Apply and then Close

Drag the connection from the Union to Create pk

Press Save and then Deploy the pipeline

There are many visual indicators that the build failed. A toast alert on the top, a toast alert from the bottom left.

Click the monitored items, the red x next to deploy. Then click the failed build ufo_sightings_clean to see the details.

As was expected, the build failed the Primary key check.

Delete the union node by right clicking on it and choosing Remove node

Create a new union off the ufo_sightings_raw and connect it to the Manually entered sightings

Connect it to the Drop Duplicates node

Connect from Drop Duplicates to Create pk, and press the layout icon

Press Save and Deploy

And now it deploys successfully

You started this lesson with a pipeline that ran. You are ending it with a pipeline you can trust. Duplicates removed. Primary keys assigned. Data expectations enforced. The sluice is still moving water, but now it knows what good water looks like, and it will not let bad water through.

If you get stuck at any point, message me on LinkedIn.

PB51: Clean Pipelines Mean Nothing

Removing Duplicate Rows

Verify duplicate removal

Creating Primary Keys

Primary keys are important

Setting up Data Expectations

Adding additional data

Testing the primary key data expectation

Comments

Pipeline Builder

PB51 AI - Same pipeline. AI FDE finds what clean missed.

More from this blog

PB52 - Not Every Pipeline Needs to Use an LLM

PB51 AI - Same pipeline. AI FDE finds what clean missed.

PB50 AI - Same pipeline. You prompt. AI FDE builds.

PB50 - A Gentle Intro to Pipeline Builder

Command Palette

Removing Duplicate Rows

Verify duplicate removal

Creating Primary Keys

Primary keys are important

Setting up Data Expectations

Adding additional data

Testing the primary key data expectation

Comments

Pipeline Builder

PB51 AI - Same pipeline. AI FDE finds what clean missed.

More from this blog