PB50 - A Gentle Intro to Pipeline Builder
Data Engineering
First, what is the purpose of the Pipeline Builder tool? One of the first steps to building in Foundry is bringing in outside data. One powerful way to do this is through the no-code tool Pipeline Builder. You can also bring in data with code, for example through code repositories, but the most rapid method is to drag and drop your way to a dataset in Pipeline Builder.
If you get stuck at any point, message me on LinkedIn.
Thinking about Pipeline Builder as a data sluice
Although it's called a Data Pipeline, I like to use the mental model of a Data Sluice. A sluice controls the flow of water and helps separate valuable material from everything else.
The purpose of the Pipeline Builder tool is to bring data together, transform it, filter it, and then produce a usable dataset at the end.
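If it helps to see the sluice idea in code, here is a minimal plain-Python sketch of those four stages. It is not how Pipeline Builder works internally; the function name run_sluice and the sample rows are made up for illustration.

```python
# A plain-Python picture of the "data sluice": combine, transform, filter, output.
# run_sluice and the sample rows below are hypothetical, for illustration only.

def run_sluice(sources, transform, keep):
    """Bring data together, transform it, filter it, and return what is left."""
    combined = [row for source in sources for row in source]  # bring together
    transformed = [transform(row) for row in combined]        # transform
    return [row for row in transformed if keep(row)]          # filter

batch_a = [{"city": "austin", "shape": "disk"}]
batch_b = [{"city": "reno", "shape": ""}]

clean = run_sluice(
    [batch_a, batch_b],
    transform=lambda row: {**row, "city": row["city"].title()},
    keep=lambda row: row["shape"] != "",
)
print(clean)
```

The row with an empty shape is washed away, and the remaining row comes out with a tidied city name.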
The best way to learn is to build! I'm going to assume that you have your own free Foundry account from https://build.palantir.com/
This is really one of the best free learning environments out there!
Go sign up!
Loading our raw UFO data
Once you have logged into your account, click Files on the left and then click New Project on the right.
Name it UFO sightings.
Inside your project click the New button, and then create a folder named data. Inside the data folder create another folder named raw.
Next, we need some data to work with. Let's go to kaggle.com, search for UFO sightings, and hit enter.
Choose Datasets, then choose the NUFORC one.
Download it and unzip it.
The archive folder contains two files.
Back in your project, drag and drop scrubbed.csv into the raw folder, or click New, then Upload Files, and choose scrubbed.csv.
Keep the Upload as a structured dataset option, and then click Upload.
After uploading, right-click the file and choose Rename. Rename it to ufo_sightings_raw.
Exploring the raw dataset
Click the dataset to see the data that it contains.
This is a very useful view of our data.
We can easily identify what kinds of data we have.
We have a date.
We have a string.
We have a double. A double is a number with a decimal point.
Compared to an integer (whole numbers only, like 1, 42, -7), a double can represent the values between whole numbers.
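For a quick feel of the difference, here is a small Python sketch. Python calls its double-precision type float, so that name stands in for Foundry's Double here; the values are arbitrary examples.

```python
# Integers hold whole numbers only; a float (Python's double) can hold
# the values in between. The numbers here are arbitrary examples.
whole = 42        # an integer
between = 42.5    # a double: sits between 42 and 43

print(type(whole).__name__)    # the type name of a whole number
print(type(between).__name__)  # the type name of a decimal number

# Any integer can be widened to a double, but going the other way
# throws away the fractional part.
as_double = float(whole)
truncated = int(between)
print(as_double, truncated)
```

Coordinates such as latitude and longitude are a natural fit for doubles, because they almost never land on a whole number.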
If you click a column, it will calculate statistics about the data in that column.
OK, let's bring this into Pipeline Builder.
At the top left of your screen, click the data folder.
Creating the clean pipeline
Create a new folder, name it clean, and navigate into it. Now the moment we have been waiting for! Click New, and choose Pipeline Builder.
Name it Clean UFO Sightings Data and click Create Pipeline.
The UFO data already exists, so click Add Foundry data.
Navigate to the UFO Sightings \ data \ raw folder and select the ufo_sightings_raw dataset.
This is your blank canvas, ready to work some data pipeline magic!
If you click a dataset, you get a preview of the data it contains. It should look very familiar: it's almost the same view we saw when we opened the dataset earlier.
To give ourselves a little bit more room, close the Pipeline outputs side panel.
To view our painter's palette, click our dataset.
Then move the cursor over any of the icons on the right.
One useful thing to do is to color our inputs. To do so, right-click the dataset, navigate to Color nodes, then New color, type Input, and pick a color. I like the second default option, a medium blue.
And now your input dataset is colored blue.
Applying our first transform
Let’s do our first transform: UPPER. This transform converts text to uppercase.
On the palette, choose Transform.
This brings you into the Transform window.
Type upper into the search box for transforms and columns.
Click Uppercase, then choose the shape column, which contains the reported UFO shape.
By default, this replaces the current shape field. If you want to keep the original field, enter a new field name instead.
Click Apply to apply the transform.
To preview what this transform will do, click the Preview button.
You will notice that the updated column now appears first. You will also notice that all of the values in shape are now uppercase.
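If you think in code, the Uppercase step can be pictured roughly like this. It is only a plain-Python sketch of the behavior described above, not how Pipeline Builder works internally; the function uppercase_column, the column name shape_upper, and the sample rows are all made up for illustration.

```python
# Sketch of the Uppercase transform on in-memory rows.
# uppercase_column, shape_upper, and the sample rows are hypothetical.

def uppercase_column(rows, column, new_column=None):
    """Uppercase `column`. Overwrite it, or write to `new_column` if given.

    The written column is placed first, mirroring what the preview shows.
    """
    target = new_column or column
    result = []
    for row in rows:
        value = row[column]
        upper = value.upper() if value is not None else None
        # Every other column keeps its original relative order.
        rest = {name: v for name, v in row.items() if name != target}
        result.append({target: upper, **rest})
    return result

rows = [
    {"city": "springfield", "shape": "disk"},
    {"city": "riverton", "shape": "triangle"},
]

replaced = uppercase_column(rows, "shape")                        # overwrite shape
kept = uppercase_column(rows, "shape", new_column="shape_upper")  # keep original

print(replaced[0])
print(kept[0])
```

The first call mirrors the default of replacing the field; the second mirrors entering a new field name so the original shape values survive.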
Let’s rename the transform. It defaults to Transform Path, which is not very explanatory. In the top left, click the name and change it to Shape to Uppercase.
Also note the double-star icon to the right. This means you can use AI to generate the transform name. In this case, it generated: Uppercases UFO shape names.
Now click Close to go back to the main canvas.
Now our canvas should look like this
You will notice it inherited our input color. Let's change that and use the fifth default color to represent a transform.
Let's click the Shape to Uppercase transform and look at the preview pane below. If the transform had multiple steps, we would see the final result here.
Delivering a clean dataset
Let's write our clean dataset out. To do so, click the Shape to Uppercase transform node, and then choose Add output.
This gives us many options; we want New dataset.
This creates a dataset with a default name based on the current date and time, such as New dataset Sat, Apr 4, 2026, 11:41:24 AM. Click in that field and rename it to ufo_sightings_clean.
This format is called snake case. Computers do not like spaces in names. Snake_case is the fix: swap every space for an underscore, keep everything lowercase. Readable by humans. Readable by machines. No drama.
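Here is a minimal Python sketch of that rule. The helper to_snake_case is a made-up name, and treating hyphens and runs of spaces as a single separator is my own assumption beyond the simple "swap every space" rule.

```python
import re

def to_snake_case(name: str) -> str:
    """Lowercase a name and replace spaces (and hyphens) with underscores."""
    # Assumption: a run of several spaces or hyphens becomes one underscore.
    return re.sub(r"[\s\-]+", "_", name.strip()).lower()

print(to_snake_case("UFO Sightings Clean"))
print(to_snake_case("Duration Seconds"))
```

Running it on UFO Sightings Clean gives ufo_sightings_clean, the same name we typed by hand.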
The dataset inherited the color from the transform node, so let's create a new color and name it Output. I like to choose the 10th default color.
You will notice that the fields keep the order from the original file, except that fields created by transforms move to the front.
This is usually not the order we want. To reorder the fields, we are going to insert a node between our transform and our output dataset.
To do so, drag the ufo_sightings_clean node to the right, and you will see a plus appear.
Click the + to insert a transform.
In this transform, choose the Select columns transform.
Then click into Search for columns, choose Select all, and click Apply.
This adds all of the columns and puts them in a mode that makes them easy to move. After you hit Apply, the transform collapses to a condensed view, so click Edit to show what is inside it.
This shows all the columns, each with a drag handle indicating you can drag to reorder.
Drag shape below country. Then hit Apply, and then hit Close on the top right.
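In code terms, that drag is just moving one name within an ordered list. A minimal Python sketch follows, assuming a shortened column list (the real dataset has more columns); the helper move_after is hypothetical.

```python
def move_after(columns, column, anchor):
    """Return a new column order with `column` placed right after `anchor`."""
    reordered = [name for name in columns if name != column]
    position = reordered.index(anchor) + 1
    reordered.insert(position, column)
    return reordered

# Assumed, shortened column list; shape sits first after the Uppercase step.
columns = ["shape", "datetime", "city", "state", "country", "duration_seconds"]

new_order = move_after(columns, "shape", "country")
print(new_order)
```

Selecting the columns in this new order is all the Select columns step is doing for us.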
The transform is auto-named Transform path (1), so we should rename it.
Another way to rename a node is to right-click it and choose Rename.
Name this one Set Schema.
Think of a schema as the blueprint of your data table. It defines:
What columns exist (shape, datetime, city, state...)
What type of data each column holds (String, Date, Double...)
The order the columns appear in
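Those three things fit naturally into an ordered list of name and type pairs. Here is a small Python sketch; the Field class is made up for illustration, the column list is shortened, and the String types for shape and city are my assumption.

```python
from typing import NamedTuple

class Field(NamedTuple):
    """One column in the blueprint: its name and the type of data it holds."""
    name: str
    dtype: str

# Assumed, shortened schema; list order is the column order.
schema = [
    Field("shape", "String"),
    Field("city", "String"),
    Field("latitude", "Double"),
    Field("longitude", "Double"),
]

names = [field.name for field in schema]                      # what columns exist
types = {field.name: field.dtype for field in schema}         # what each one holds
order = {field.name: i for i, field in enumerate(schema)}     # where each one sits

print(names)
print(types["latitude"])
print(order["city"])
```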
Let us save our work by pressing the Save button on the top right.
Before the pipeline can run, we need to deploy it. Deploy publishes the blueprint of your pipeline, or data sluice. Build is what actually opens the gates and runs the water through.
Deploying the pipeline: opening the sluice gate
To deploy, click the Deploy button in the top right. This opens the Deploy pane. Then click Deploy Pipeline.
This is the deploy phase
And we have our first deployment failure.
Welcome to Data Engineering: Diagnosing a Failed Pipeline
At the top right you can see a red X showing the failure. Clicking the red X lets us see the jobs.
Clicking the job takes us to the job log.
The error message reads: This job was aborted due to a malformed data record in the input dataset. So it looks like there is some bad data in our input.
Scrolling through the details, we can see that a 2 followed by a backtick is breaking it:
BadRecordException: java.lang.NumberFormatException: For input string: "2`"
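You can reproduce the same kind of failure in a few lines of Python. This is only an analogy: the job log shows a Java NumberFormatException, while Python's float() raises a ValueError, but the cause is the same stray backtick.

```python
# A clean numeric string parses without trouble.
good = float("2")

# The value from the job log has a trailing backtick, so parsing fails.
try:
    float("2`")
    message = None
except ValueError as err:
    message = str(err)

print(good)
print(message)
```

A single bad character in one record is enough to stop a strict number parser, which is why one row can abort the whole job.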
We need to go back and fix it at the source. Click the back arrow in the browser to go back to our pipeline. Right-click ufo_sightings_raw and choose Open. Foundry often provides an Open in new tab arrow in the interface, so click that.
When we brought in the UFO data, Foundry made a best guess at what the data types should be, as its message says:
"We've done our best to guess the structure of your data on the sample below. Check that the headers and column types appear correct.
If everything looks right, there is no need to edit the options here."
To change those types we need to edit the schema, so click Edit Schema.
The suspect columns are duration_seconds, latitude, and longitude.
Let's change these to the string data type. Click Double under duration_seconds and select String. Do the same for latitude and longitude.
Click Save and validate.
To see where the data problems were occurring, we can temporarily flip those fields back to double and hit Save and validate.
We can see four rows that have bad data.
To complete the exercise, change them back to string and press Save and validate.
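The same hunt for bad values can be sketched in Python: keep the column as text, try to parse each value, and collect the ones that fail. The helper find_bad_values is hypothetical, and apart from the 2` value from the job log, the sample rows are made up.

```python
def find_bad_values(rows, column):
    """Return (row index, raw text) for every value that is not a valid number."""
    bad = []
    for index, row in enumerate(rows):
        text = row[column]
        if text is None:
            continue  # a missing value is not a malformed one
        try:
            float(text)
        except ValueError:
            bad.append((index, text))
    return bad

# Made-up rows with the column held as strings, as after our schema edit.
rows = [
    {"duration_seconds": "2700"},
    {"duration_seconds": "2`"},
    {"duration_seconds": "20"},
    {"duration_seconds": None},
]

bad_rows = find_bad_values(rows, "duration_seconds")
print(bad_rows)
```

Reading the column as a string first is what makes this possible: the data loads no matter what, and the cleanup can happen later in the pipeline.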
Close that tab and head back to the Pipeline Builder tab. You might have to refresh the browser to reload the pipeline. It will now show an error because we have changed the schema.
To fix this error, click the output dataset ufo_sightings_clean and choose Edit.
This tells us explicitly what the problem is: the fields marked with a red X were previously doubles, and they are now strings.
The quickest way to resolve this is to click Use upstream schema, which applies the current schema of the connected table.
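Conceptually, the output was still expecting the old types while the upstream node now delivers new ones. Here is a plain-Python sketch of that comparison, with a shortened column list and a hypothetical helper schema_mismatches; it is not how Pipeline Builder checks schemas internally.

```python
def schema_mismatches(expected, upstream):
    """Return {column: (expected type, upstream type)} where the two disagree."""
    return {
        name: (expected[name], upstream[name])
        for name in expected
        if name in upstream and expected[name] != upstream[name]
    }

# What the output dataset was set up to expect (assumed, shortened).
expected = {
    "shape": "String",
    "duration_seconds": "Double",
    "latitude": "Double",
    "longitude": "Double",
}
# What the upstream node delivers after our schema edit.
upstream = {
    "shape": "String",
    "duration_seconds": "String",
    "latitude": "String",
    "longitude": "String",
}

mismatches = schema_mismatches(expected, upstream)
print(mismatches)

# "Use upstream schema" amounts to adopting the upstream types wholesale.
resolved = dict(upstream)
print(schema_mismatches(resolved, upstream))
```

After adopting the upstream schema the comparison comes back empty, which is why the error clears.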
After doing this, click Save at the top. Now we can deploy the pipeline.
Made it through? Have questions? Message me on LinkedIn.
