PB51: Clean Pipelines Mean Nothing
Clean is not a compliment. It just means your pipeline ran without crashing. The UFO data made it through, the transform applied, the output wrote. Clean. But some of those sightings are duplicates. The pipeline did not care. It just moved the water. This lesson is about something harder than clean: data quality and data integrity.
Removing Duplicate Rows
A duplicate is a row that should not exist twice. In Pipeline Builder, the Keep Duplicates transform will surface them for you. Any row that appears more than once comes through. From there you can see exactly what you are dealing with. The best way to learn is by building, so let's build!
Insert a transform after ufo_sightings_raw
Type keep and select Keeps duplicates
You can either leave the column subset selection empty, which selects all of them, or explicitly add every column yourself. Either way gets you to the same place.
Press Apply and Close
If you click the transform you should see 18 rows, which means 9 duplicate pairs.
To remove the duplicates. Click Edit on the transform.
Click the Delete Icon on the board for the KEEPS DUPLICATES
Type dup, and choose Drop duplicates
You can either leave the column subset selection empty, which selects all of them, or explicitly add every column yourself. Either way gets you to the same place.
Press Apply and Close
Rename the node to Drop Duplicates and change the color to Transform Teal
You will notice the Drop Duplicates node needs to be reconnected. Click the circle on the right side of Drop Duplicates and drag it to Shape to Uppercase.
Press the Layout nodes icon to tidy up the arrangement.
Verify duplicate removal
Let's double check our duplicate removal. Click ufo_sightings_raw and then click Calculate row count in the bottom right of the preview window.
It should display 80,332 total rows
Next, click the Drop Duplicate node and Calculate row count in that preview window. It should display 80,323
80,332 - 80,323 = 9 Duplicate rows removed.
Creating Primary Keys
Now that the duplicates are gone, every remaining row is unique. That means we can give each one a permanent identifier. That is all a primary key is: a unique ID per row so you can reference it, join on it, and trust it downstream.
To create a primary key we are going to concatenate (it just means combining) all of the fields together. To do so, click to add a transform after the Drop Duplicate node. And search for concatenate, choose Concatenate strings
Type the pipe | (on the US keyboard the key above enter) we use the pipe because it is a character that is unlikely to be in our data. Name the output field primary_key. And leave the Null output unchecked
Press Apply
Press Preview to see the results of the transform
You can see that the transform combined all the fields in the dataset
We are next going to take those combined fields and 'hash' them to create the unique identifier of the row, the primary key.
Imagine you put any piece of text into a machine. It does not matter if it is one word or an entire book. The machine always spits out a fixed length string of random looking characters. Same input always produces the same output. Different input always produces different output.
Add another step to the transform under the Concatenate strings. Type hash and choose Hash sha256
Choose primary key and press Apply
Rename the transform node to Create pk. Click the Create pk node and view the glorious primary key that was created.
Primary keys are important
Primary keys are important. A duplicate primary key does not just cause a problem in this dataset, it breaks everything downstream that relies on it. To make sure that never happens we are going to apply a Data Expectation, which is Foundry's way of encoding a rule that the data must always follow.
First, we need to add the primary key to the schema. To do so, click and edit the Set Schema node
In the columns to select, click the primary_key press apply. Drag it from the very end to the front. Press Apply and Close
Now click the ufo_sightings_clean and Pipeline outputs icon. Notice the alert telling us the 1 dropped. In order to add the primary key, click the Use upstream schema
Press Save
Setting up Data Expectations
In the Pipeline outputs, click Configure expectations
Click Add data expectation
Click Primary Key
Leave it on Fail, and add the primary_key and press Apply
Adding additional data
The data is solid now. Duplicates removed, primary keys assigned, expectations in place. But let us make things more interesting. We are going to bring in some current UFO sightings and union them to what we already have, extending our pipeline to handle multiple sources.
There are many ways to add data to a Pipeline Builder. Let us use the Enter Data Manually option. Click Add Data and then click Enter data manually.
We want our manually entered data to have the same columns as the existing dataset. The quickest way to match them is to load the schema definition from the ufo_sightings_raw file.
To do so click From dataset
Navigate to UFO Sightings \ data \ raw and choose ufo_sightings_raw and then click Select
Rename it to Manually entered sightings notice
Notice the columns that were added from the schema
Let's navigate to https://nuforc.org/ so we can copy and paste some new data into the manually entered sightings. On the nuforc.org page click Data Bank
Scroll down and click Index by EVENT DATE
Choose a month
Copy one of the rows, from OCCURRED to SHAPE
And paste it into cell values of the manual entry
Copy the SUMMARY
into the comments cell.
Copy this row twice like this, and click Apply and Close
Testing the primary key data expectation
To test the primary key data expectation let's union after the Drop Duplicates. To do so, Click Drop Duplicates and choose Union.
Click the Manually entered sighting dataset and then click Start
Keep the default type of Union by name, Click Apply and then Close
Drag the connection from the Union to Create pk
Press Save and then Deploy the pipeline
There are many visual indicators that the build failed. A toast alert on the top, a toast alert from the bottom left.
Click the monitored items, the red x next to deploy. Then click the failed build ufo_sightings_clean to see the details.
As was expected, the build failed the Primary key check.
Delete the union node by right clicking on it and choosing Remove node
Create a new union off the ufo_sightings_raw and connect it to the Manually entered sightings
Connect it to the Drop Duplicates node
Connect from Drop Duplicates to Create pk, and press the layout icon
Press Save and Deploy
And now it deploys successfully
You started this lesson with a pipeline that ran. You are ending it with a pipeline you can trust. Duplicates removed. Primary keys assigned. Data expectations enforced. The sluice is still moving water, but now it knows what good water looks like, and it will not let bad water through.
If you get stuck at any point, message me on LinkedIn.