PB52 - Not Every Pipeline Needs to Use an LLM


One of the most enabling features of Foundry as a platform is the ability to use LLMs within the ecosystem. You are not calling an external API, managing credentials, or stitching together separate tools. The model is there, inside the pipeline, ready to do work. But that accessibility is exactly why it is worth slowing down. Just because the LLM is one step away does not mean it is always the right step.

First, some setup. We need more data: the previously loaded UFO sightings data contains only a short summary, not the full text of what was reported.

Head back to Kaggle, then search for and download the Enhanced UFO Sighting Dataset.

The .zip file contains 3 files. Drag and drop ufo_sightings_enhanced.csv into the data\raw folder in Foundry.

Add it to the Clean UFO Sightings Data pipeline that was created previously.

For the sake of the exercise, let's eliminate duplicates from ufo_sightings_enhanced. Add a Transform, then add a select and choose datetime, city, and description. Next, add a DROP DUPLICATES and choose datetime and city. Rename the transform to Drop Enhanced Duplicates.
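If it helps to see the logic outside the UI, here is a minimal Python sketch of what the select and DROP DUPLICATES steps do. The sample rows and the helper names select_columns and drop_duplicates are made up for illustration, and keeping the first row per key is an assumption. Pipeline Builder may keep a different row.

```python
# Hypothetical sample rows standing in for ufo_sightings_enhanced.csv.
rows = [
    {"datetime": "1949-10-10 20:30", "city": "san marcos", "shape": "cylinder",
     "description": "A silvery cylinder hovered over the field."},
    {"datetime": "1949-10-10 20:30", "city": "san marcos", "shape": "cylinder",
     "description": "Second report filed for the same sighting."},
    {"datetime": "1956-10-10 21:00", "city": "edna", "shape": "circle",
     "description": "An orange circle moved slowly to the north."},
]


def select_columns(rows, columns):
    """Keep only the named columns, like the select step."""
    return [{col: row[col] for col in columns} for row in rows]


def drop_duplicates(rows, keys):
    """Keep the first row seen for each combination of key values (assumption)."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[k] for k in keys)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


selected = select_columns(rows, ["datetime", "city", "description"])
deduped = drop_duplicates(selected, ["datetime", "city"])
print(len(rows), "rows in,", len(deduped), "rows out")
```

Deduplicating on datetime and city matters because those are the join keys in the next step. With one row per key on the right side, the join cannot multiply rows on the left.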

Click ufo_sightings_raw and choose join

Click Drop Enhanced Duplicates (this will be the right dataset), then click Start

Choose datetime to match datetime

city to match city

Then click Deselect all and then choose only description (This is the new data we are adding)

Connect the Join to the Drop Duplicates
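The join configured above can be sketched in plain Python as follows. This assumes a left join, which the steps do not state explicitly, and the left_join helper and sample rows are hypothetical. Only description is carried over from the right side, matching the Deselect all step.

```python
def left_join(left_rows, right_rows, keys, right_columns):
    """Add right_columns to each left row where all key columns match."""
    lookup = {}
    for row in right_rows:
        # The right side was deduplicated on the keys, so one row per key.
        lookup.setdefault(tuple(row[k] for k in keys), row)

    joined = []
    for row in left_rows:
        match = lookup.get(tuple(row[k] for k in keys))
        out = dict(row)
        for col in right_columns:
            out[col] = match[col] if match else None
        joined.append(out)
    return joined


# Hypothetical rows standing in for ufo_sightings_raw (left)
# and Drop Enhanced Duplicates (right).
raw = [
    {"datetime": "1949-10-10 20:30", "city": "san marcos", "shape": "cylinder"},
    {"datetime": "1960-10-10 20:00", "city": "pearl harbor", "shape": "light"},
]
enhanced = [
    {"datetime": "1949-10-10 20:30", "city": "san marcos",
     "description": "A silvery cylinder hovered over the field."},
]

joined = left_join(raw, enhanced, ["datetime", "city"], ["description"])
for row in joined:
    print(row["city"], "->", row["description"])
```

Rows on the left with no match keep their original columns and get an empty description, so no sightings are lost.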

Extract the UFO color: let's try regex

We joined the descriptions. Now suppose we want to run a color analysis. What color did witnesses report most often? To answer that, we need color as a column. You could open the original spreadsheet, read each description, and type in the reported color, but with thousands of rows that would take a long time. So we need the pipeline to do it for us. The first instinct is pattern matching. We write a set of rules that scan each description for known color words. If it finds "orange," it returns orange. If it finds "red," it returns red. This approach uses regular expressions, or regex for short. Let us try it first.

After the Join, add a Transform

First we want to lowercase the description, so that variations in capitalization don't break the pattern match.
Search for and choose Lowercase

Then choose the description column

Next type regex, and choose Extract all regex matches

Then choose description, Value, and paste in

We can see that the regex successfully pulls out some colors from the description

But if you scroll through the results, you will notice some interesting things. The regex is just blindly matching color words, so how do we know which one was the reported color of the UFO? Also, if a color is missing from our list, we won't get a match. You could spend an afternoon making the list longer and it would still miss entries. The descriptions are written in natural language by people who were not thinking about your color column when they filed their report.
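Both problems are easy to reproduce in a few lines of Python. This is only a sketch: the color list, the pattern, and the sample sentence are made up for illustration and are not the expression pasted into the pipeline.

```python
import re

# Hypothetical color list; any real list would be longer and still incomplete.
COLOR_WORDS = ["red", "orange", "yellow", "green", "blue", "white", "gray", "silver"]

# \b word boundaries stop "red" from matching inside words like "hovered".
COLOR_PATTERN = re.compile(r"\b(" + "|".join(COLOR_WORDS) + r")\b")


def extract_colors(description):
    """Lowercase the text, then return every color word found, in order."""
    return COLOR_PATTERN.findall(description.lower())


sample = "Gray skies that night, then a silvery spaceship with a Red glow."
print(extract_colors(sample))
```

The match on "gray" describes the sky, not the craft, and "silvery" is missed entirely because only "silver" is on the list. Adding "silvery" fixes this one sentence, not the next one.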

This is where the LLM earns its place in the pipeline.

The Use LLM node in Pipeline Builder offers a convenient method for executing large language models on your data at scale, allowing you to seamlessly incorporate LLM processing logic between data transformations with no coding required. Instead of a list of rules, you write a prompt.

There has been a lot written about prompt engineering, and we can't cover all of it here. But there is one concept worth understanding before you start writing prompts against thousands of rows of data: tokens.

Every time the LLM reads your description and generates a response, it costs tokens. Tokens are roughly the unit of text the model processes. Think of them as chunks of words. The longer your description, the more tokens it consumes. Run that across thousands of rows and it adds up fast.

Let's add a Use LLM

Choose Empty prompt

The Use LLM node has two main sections to configure.

The first is Describe the role the model will play and outline the task it will perform. This is your system prompt. It tells the model what it is and what you need from it before it reads a single row of data. For our color extraction, fill it in like this:

You are a data extraction assistant. Read the following UFO sighting description and extract any colors used to describe the observed object. Return only the color words as a list. If no color is mentioned, return an empty list. Do not explain your answer.

Paste it into Instructions. Note that to provide input data, you press the forward slash (/).

Press / and choose the description

We want to capture multiple color outputs. To do so, change the output type to an array.
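Conceptually, each row gets the same instructions plus its own description, and the answer comes back as a list. The sketch below illustrates that shape only. It does not call a model, it is not how the Use LLM node works internally, and the helpers build_prompt and parse_colors, along with the JSON-style response format, are assumptions.

```python
import json

INSTRUCTIONS = (
    "You are a data extraction assistant. Read the following UFO sighting "
    "description and extract any colors used to describe the observed object. "
    "Return only the color words as a list. If no color is mentioned, return "
    "an empty list. Do not explain your answer."
)


def build_prompt(description):
    """Combine the fixed instructions with one row's description."""
    return f"{INSTRUCTIONS}\n\nDescription: {description}"


def parse_colors(response_text):
    """Turn a JSON-style list response into a list of lowercase strings.

    Anything that is not a valid list is treated as "no colors found".
    """
    try:
        value = json.loads(response_text)
    except json.JSONDecodeError:
        return []
    if not isinstance(value, list):
        return []
    return [str(item).lower() for item in value]


prompt = build_prompt("Gray skies, then a silvery spaceship.")
print(prompt)

# Hypothetical model responses, hard-coded for illustration.
print(parse_colors('["Silver"]'))
print(parse_colors("[]"))
```

An array output type keeps "no color" (an empty list) distinct from "one color" and "several colors", which a single string column cannot do cleanly.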

Keep the model at GPT-5 nano. Model choice matters. AIP supports a wide range of LLMs from providers like OpenAI, Anthropic, Meta, and Google. That range exists for a reason. Extracting color words from a sentence is not a complex reasoning task. A lightweight, fast model like GPT-5 nano is the right tool. Reaching for a larger model here is like sending a freight train to deliver a postcard. It will work, but it is wildly more machinery than the job requires. Save the heavy models for tasks that actually need them.

Rename the Output column to colors_llm

Before you commit to running the LLM against thousands of rows, use the trial feature. In Foundry, tokens are the basic units of text that LLMs use to process and understand input. The size of the text will dictate the amount of compute used by the backing model to serve the response. Every description you send costs tokens, and longer descriptions cost more. Running a bad prompt against your entire dataset and then fixing it is an expensive way to learn. The trial lets you test against a small sample in seconds, see exactly what the model returns, and adjust your instructions before you scale. Get it right on ten rows first. Then run it on ten thousand.

At the bottom click on Trial run

Then click Select from input table

I will select the row that we were looking at previously, the one that mentions gray skies alongside a silvery spaceship.

Press run

We can then see that it took about 1.2K LLM tokens and 5 seconds.
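That single-row figure is enough for a back-of-the-envelope estimate before scaling up. The sketch below assumes every row costs about as many tokens as the trial row, which will not hold exactly since descriptions vary in length. The estimate_tokens helper is hypothetical.

```python
# From the single-row trial: about 1.2K LLM tokens.
TOKENS_PER_ROW = 1_200


def estimate_tokens(row_count, tokens_per_row=TOKENS_PER_ROW):
    """Rough total, assuming each row costs about the same as the trial row."""
    return row_count * tokens_per_row


for row_count in (10, 100, 10_000):
    print(f"{row_count:>6} rows -> about {estimate_tokens(row_count):,} tokens")
```

At this rate, 100 rows come to roughly 120,000 tokens and 10,000 rows to roughly 12 million, which is why a mistake in the prompt is much cheaper to catch on a small sample.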

By clicking on the Use LLM node, we can get a preview of 10 rows.

Test run 100 rows

Next, we want to run a test run of 100 rows. To do so, we sort the rows by date and then take the top 100.

Insert a transform in front of the Regex Extract

Add a TOP ROWS, enter 100, choose datetime and Ascending, then click Apply and close
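In Python terms, the TOP ROWS step amounts to a sort followed by a slice. This is a sketch with made-up rows and a hypothetical top_rows helper, and it assumes the datetime values are ISO-style strings. The real column may use a different format.

```python
from datetime import datetime


def top_rows(rows, n, sort_key):
    """Sort ascending by a datetime column and keep the first n rows."""
    ordered = sorted(rows, key=lambda row: datetime.fromisoformat(row[sort_key]))
    return ordered[:n]


# Hypothetical rows, deliberately out of order.
rows = [
    {"datetime": "1956-10-10 21:00", "city": "edna"},
    {"datetime": "1949-10-10 20:30", "city": "san marcos"},
    {"datetime": "1960-10-10 20:00", "city": "pearl harbor"},
]

limited = top_rows(rows, 2, "datetime")
print([row["city"] for row in limited])
```

Sorting before limiting makes the sample repeatable: the same 100 rows come back on every run, so prompt changes can be compared like for like.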

Rename the node to Limit 100

After the Use LLM add an output.

Choose New dataset

Name it 100 Test

Click Deploy, and then Deploy Pipeline

After it is deployed, right-click and choose Open (use the arrow to open it in a new tab)

It's very interesting to profile the data. See how the LLM extracts the color versus the regex.
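If you export the 100 Test dataset, the same comparison can be done in a few lines of Python. The rows below are invented to show the shape of the analysis and are not results from the pipeline. The column name colors_regex is an assumption, since the regex output column was not renamed in the steps above.

```python
from collections import Counter

# Invented example rows; real values come from the 100 Test dataset.
rows = [
    {"colors_regex": ["gray", "red"], "colors_llm": ["silver", "red"]},
    {"colors_regex": ["orange"], "colors_llm": ["orange"]},
    {"colors_regex": [], "colors_llm": ["amber"]},
]


def color_counts(rows, column):
    """Count how often each color appears across all rows in one column."""
    return Counter(color for row in rows for color in row[column])


regex_counts = color_counts(rows, "colors_regex")
llm_counts = color_counts(rows, "colors_llm")

# Rows where the two methods disagree are the ones worth reading by hand.
disagreements = [
    row for row in rows if set(row["colors_regex"]) != set(row["colors_llm"])
]

print("regex:", regex_counts.most_common(3))
print("llm:  ", llm_counts.most_common(3))
print("rows that disagree:", len(disagreements))
```

Counting answers the original question of which color witnesses reported most often, and the disagreement list points to the rows where the choice of method changes the answer.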

Wrapping up

Look at the two columns side by side. The regex gives you exactly what you asked for, every color word it could find, with no understanding of which one actually described the UFO. The LLM reads the sentence the way a person would and pulls out the color that matters.

That difference is the whole point of this lesson. Regex is fast, cheap, and predictable. When the pattern is simple and the rules are clear, it is the right tool. But language is messy, and the moment your data starts looking like something a human wrote, rules start breaking down. That is where an LLM earns its place.

A few things worth carrying forward:

Pick the right model for the job. GPT-5 nano handled a trial row in about 5 seconds and roughly 1.2K tokens. A larger model would likely have done the same work at a higher cost with no better result.

Trial before you scale. Ten rows will tell you almost everything you need to know about whether your prompt is working. Ten thousand rows will tell you the same thing, just with a much bigger bill.

Use the LLM where it actually adds value. Not every column needs one. The pipeline is strongest when regex, transforms, and LLMs each do the part of the work they are best suited for.
