# Eliminating Schema Hallucinations in AI Systems with DAGs

January 7th, 2026 | Written by Mordechai Worch | 10 Min 📖

<div data-with-frame="true"><figure><img src="https://695940432-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FI9zTS3Se7JKwAjYJD8Xc%2Fuploads%2F6cQW1LGGbtipGvhDkJx3%2FBattleprompting.jpg?alt=media&#x26;token=066a99e6-51fb-495f-b494-b48a5b1faf8a" alt=""><figcaption><p><strong>From Brittle to Bulletproof:</strong> Visualizing the shift from single-shot prompting (which often fails silently) to a directed acyclic graph (DAG) that validates user intent incrementally.</p></figcaption></figure></div>

Natural language is messy. Filters are precise.

At Kizen, we spent years building a powerful filter engine, composite query architecture, relationship filters, activity tracking and custom fields across dozens of object types. It handles everything from “show me active leads” to “contacts whose company has more than 100 employees AND submitted a form in the last 30 days OR triggered our welcome automation.”

Then we tried to let users build these filters with natural language.

Modern LLMs excel at grasping intent. If a user asks for ‘active leads,’ the model knows exactly what that means conceptually. The problem arises when that concept hits the rigid reality of a database schema.

LLMs are confident improvisers. They often generate payloads that *look* perfect to the human eye but are gibberish to a compiler, inventing field names that don’t exist or syntax that your engine never supported. The result isn’t a working filter; it’s a silent failure.

Our solution: decompose filter intent into a directed acyclic graph (DAG) of logic fragments, where every path through the graph produces a valid filter. No hallucinations. No invalid combinations. Just guaranteed-valid output.

***

### Why Dumping Your Schema into an LLM Doesn’t Work <a href="#id-8ce9" id="id-8ce9"></a>

Our first approach was obvious: give the LLM our complete filter schema and ask it to generate valid JSON. It failed spectacularly.

The complexity is combinatorial. We are bridging the gap between a probabilistic LLM and a deterministic query engine designed for exact precision. While the LLM deals in “likely” tokens, our engine requires strict type alignment to maintain data integrity:

* “`equals`” works on text fields.
* “`greater_than`” works on numbers.
* “`contains_any`” works on multi-select fields.
* “`between`” works on dates.
* And dozens more condition-field type combinations.

Multiply fields × conditions × values × boolean groupings and you get thousands of valid combinations — but millions of invalid ones. LLMs generate plausible-looking output that crashes our parser or, worse, fails silently:

* **Hallucinated fields**: “filter by `customer_tier`” when the field is actually called `membership_level`.
* **Invalid conditions**: Using `equals` on a date field instead of `on_date`.
* **Malformed groupings**: Nested AND/OR logic that doesn’t match our query structure.

When a user says “customers who signed up last week OR have premium status AND live in California,” the boolean logic must match our exact query format. One wrong bracket and the filter returns garbage.
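To make the mismatch concrete, the compatibility rules above can be written as a small lookup table. This is a hedged sketch — the field types and condition names are illustrative, not Kizen’s actual schema:

```python
# Illustrative sketch: which conditions are valid for which field types.
# The type and condition names are examples, not the production schema.
VALID_CONDITIONS = {
    "text": {"equals", "not_equals", "contains"},
    "number": {"equals", "greater_than", "less_than"},
    "multi_select": {"contains_any", "contains_all"},
    "date": {"between", "on_date", "before", "after"},
}

def is_valid(field_type: str, condition: str) -> bool:
    """True only if the condition applies to the field type."""
    return condition in VALID_CONDITIONS.get(field_type, set())
```

A one-shot prompt has to internalize this entire table at once; the approach described below only ever exposes one row of it at a time.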

***

### Not All Filters Are Created Equal <a href="#aac8" id="aac8"></a>

Before diving into the solution, it helps to understand what we’re dealing with. Our filter engine supports eight distinct filter types, each with its own structure:

#### Description of Different Filter Types

<table><thead><tr><th width="209.51953125">Filter</th><th>What It Does</th><th>Example</th></tr></thead><tbody><tr><td>Field Filters</td><td>Filter on record's own fields</td><td>status equals active</td></tr><tr><td>Relationship Filters</td><td>Filter on related object's fields</td><td>company.employee_count>100</td></tr><tr><td>Relationship Group Filters</td><td>Use saved filter groups from related objects</td><td>company in 'Enterprise Accounts' group</td></tr><tr><td>Logged Activity Filters</td><td>Filter by activities recorded on records.</td><td>had a meeting in last 30 days</td></tr><tr><td>Scheduled Activity Filters</td><td>Filter by upcoming activities</td><td>has call scheduled this week</td></tr><tr><td>Form Submission Filters</td><td>Filter by form responses</td><td>submitted 'Contact Us' form</td></tr><tr><td>Automation Filters</td><td>Filter by automation execution</td><td>triggered 'Welcome Email' automation</td></tr><tr><td>Team Interaction Filters</td><td>Filter by interactions with employees</td><td>contracts John reviewed in last 2 weeks</td></tr></tbody></table>

Each filter type has its own valid conditions, value formats, and nesting rules. A field filter on a text field accepts different conditions than a field filter on a date field. A relationship filter, by contrast, must also specify the related object and the field on that object.

This complexity is exactly why naive prompting fails and why DAG decomposition works.

***

### From Messy Input to Structured Intent <a href="#a4e6" id="a4e6"></a>

Before walking the DAG, we need to understand what the user actually wants. Consider this input:

> *“Show me customers who bought something last month and haven’t been contacted”*

This input contains two distinct filter conditions joined by AND, each of which must be processed independently.

**The Rewrite Phase** uses an LLM to decompose user intent into structured filter statements:

```
class RewritePromptResponse(BaseModel):
    filters: list[str]        # Individual filter intents
    logical_grouping: str     # "AND" or "OR" or "Complex"
```

For the input “active customers from California or premium members,” rewriting produces:

```
filters: [
    "status equals active",
    "state equals California",
    "membership_tier equals premium"
]
logical_grouping: "('status equals active' and 'state equals California') 
                    or ('membership_tier equals premium')"
```

Each filter in the list gets processed through the DAG independently, then combined at the end. This parallelizes the work and keeps each DAG walk focused on a single filter intent.
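Once each filter in the list has been walked through the DAG, the results are recombined under the logical grouping. A minimal sketch of that recombination step — the nested `{"and": [...]}` shape is an assumption for illustration, not our actual payload format:

```python
# Hedged sketch: recombine independently processed filters under a
# boolean grouping. The {"and": [...]} shape is illustrative only.
def combine(filters: list[dict], grouping: str) -> dict:
    if len(filters) == 1:
        return filters[0]
    key = "and" if grouping.upper() == "AND" else "or"
    return {key: filters}
```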

***

### Breaking Intent into Steps <a href="#id-3669" id="id-3669"></a>

Instead of asking the LLM to generate a complete filter in one shot, we break the problem into constrained steps:

1. **Choose filter type**: field, relationship, activity, etc.
2. **Choose field** (for field filters): only fields that exist on this object.
3. **Choose condition**: only conditions valid for that field’s type.
4. **Provide value**: only value formats that work with that condition.

At each step, the LLM sees only valid options. When you select a text field, you don’t see date conditions. When you select “equals,” you provide a string, not a date range.
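A stripped-down sketch of those four steps as linked nodes — the field names and options here are illustrative, not our production DAG:

```python
from dataclasses import dataclass, field

# Illustrative sketch: each step only exposes options that are valid
# given the choices made so far. Names are examples, not production data.
@dataclass
class StepNode:
    name: str
    options: list[str]
    next_steps: dict[str, "StepNode"] = field(default_factory=dict)

value_step = StepNode("value", options=[])

# A text field and a number field expose different condition sets
text_conditions = StepNode("condition", ["equals", "contains"],
                           {"equals": value_step, "contains": value_step})
number_conditions = StepNode("condition", ["equals", "greater_than"],
                             {"equals": value_step, "greater_than": value_step})

field_step = StepNode("field", ["status", "claim_count"],
                      {"status": text_conditions, "claim_count": number_conditions})
```

Selecting `status` means numeric conditions simply never appear in the schema the LLM sees at the next step.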

We built a lightweight custom router (not LangGraph or similar frameworks) to keep the logic simple and debuggable:

```
class Step:
    name: str                    # Human-readable name
    next_steps: list[Step]       # Valid transitions in DAG
    output_type: type            # What this step produces
    output_is_step: bool         # True if output determines routing

    def get_response_model(self):
        # Dynamically creates Pydantic model for LLM response
        # Only includes valid options for this step
        ...
```

The result is a system where invalid filters cannot be generated. Every path through the DAG produces output that is valid for our filter engine.

***

### Making the LLM Color Inside the Lines <a href="#id-7b0a" id="id-7b0a"></a>

Now that we have steps, how do we guarantee each step’s output is valid? Modern LLMs support **structured output**, forcing responses to match a specific JSON schema. Instead of free-form text that might hallucinate invalid values, the LLM must return valid JSON matching a Pydantic model.

This is the secret sauce: at each DAG step, we generate a *dynamic* Pydantic model containing only valid options for that specific point in the graph. This also saves tokens, since we’re not feeding the LLM the schema for date filters when it’s processing a text field.

```
def get_response_model(self) -> type[BaseModel]:
    # Dynamically create enums and typed fragments based on this step
    if self.has_choices:
        # Create dynamic enum - LLM must pick from exact valid options
        Options = StringEnum("Options", {opt: opt for opt in self.choices})
        return create_model("Response", value=(Options, ...))
    elif self.output_type == "date":  # sentinel marking date-valued steps
        # Break date into validated components - no string parsing errors
        class DateFragment(BaseModel):
            year: int = Field(ge=1900, le=2100)
            month: int = Field(ge=1, le=12)
            day: int = Field(ge=1, le=31)

            def get_value(self) -> str:
                return f"{self.year}-{self.month:02d}-{self.day:02d}"

        return DateFragment
    elif self.output_type == int:
        # Constrain integers with bounds
        return create_model("Response", value=(Annotated[int, Field(ge=0, le=1000)], ...))
    return create_model("Response", value=(str, ...))
```

The LLM cannot hallucinate invalid field names because dynamic enums constrain it to exact valid options. For dates, we break the value into typed components (year, month, day) with built-in bounds validation: no string parsing, no format errors.

Structured outputs + dynamic schemas = the LLM can only produce valid fragments, making invalid output impossible by construction.
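At its core, the dynamic-enum trick is plain Python: the standard `enum` functional API makes invalid values unrepresentable. (The Pydantic `create_model` wiring is omitted here.)

```python
from enum import Enum

# Only the fields that exist on this object at this DAG step
valid_fields = ["status", "state", "membership_level"]

# Build the enum dynamically - hallucinated names are not representable
FieldOptions = Enum("FieldOptions", {name: name for name in valid_fields})

assert FieldOptions("status").value == "status"
# FieldOptions("customer_tier") raises ValueError - the hallucinated
# field does not exist in the schema at all
```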

<div data-with-frame="true"><figure><img src="https://695940432-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FI9zTS3Se7JKwAjYJD8Xc%2Fuploads%2FGOLmfcfgf1BRLulC88Yv%2FDAG.jpg?alt=media&#x26;token=5a2dbaa8-45a3-452d-9dc6-48e79712ce20" alt=""><figcaption><p>This visualization shows how the DAG progressively narrows the output possibilities. At each step, the valid search space shrinks. By the time the LLM reaches the “Condition” step, it isn’t hallucinating from the entire universe of SQL operators; it is choosing from a strict list of five valid enums defined by the previous selection.</p></figcaption></figure></div>

***

### Executing the DAG <a href="#de6a" id="de6a"></a>

The Router walks the graph, accumulating fragments at each step:

```
class Router:
    def __init__(self, metadata, llm):
        self.llm = llm                                  # LLM client used at each step
        self.current_step = build_filter_dag(metadata)  # Root of DAG
        self.fragments = FilterResult()

    def process(self) -> list[str]:
        while self.current_step:
            # LLM picks from valid options only
            response = self.llm.predict(
                response_model=self.current_step.get_response_model(),
                context=self.fragments.to_string()
            )
            # Accumulate the fragment
            self.fragments.add(response.value)
            # Navigate to next step
            self.current_step = response.get_next_step()
        return self.fragments.to_list()
```

Fragments accumulate in a simple structure:

```
class FilterResult:
    def __init__(self):
        # 2D: groups × fragments
        self.filters: list[list[str]] = [[]]

    def add(self, fragment: str):
        self.filters[-1].append(fragment)

    def new_group(self):
        self.filters.append([])
```

Each completed filter becomes a list like `["field", "status", "equals", "active"]`, guaranteed valid, ready for conversion.

<div data-with-frame="true"><figure><img src="https://695940432-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FI9zTS3Se7JKwAjYJD8Xc%2Fuploads%2FpP5fOmjNfMgmlWLBwt6H%2Fworkflow.jpg?alt=media&#x26;token=7813d35c-0ea1-4a12-be5c-287714890c5c" alt=""><figcaption><p>The workflow isolates every atomic filter in the user’s request. By processing the three conditions, Status, State and Tier, through independent DAG routers, we eliminate cross-contamination and hallucinated outputs.</p></figcaption></figure></div>

***

### Walking Through Real Filters <a href="#f966" id="f966"></a>

Let’s trace some examples through the DAG:

**Example 1: Simple field filter**

```
User: "Find all policies where claim count exceeds 3"
```

```
DAG Walk:
├─ Step 1: Filter Type → "field"
├─ Step 2: Field Name → "claim_count"
├─ Step 3: Condition → "greater_than"
└─ Step 4: Value → "3"

Fragments: ["field", "claim_count", "greater_than", "3"]
```

**Example 2: Relationship filter**

```
User: "Show leads whose company has over 100 employees"
```

```
DAG Walk:
├─ Step 1: Filter Type → "relationship"
├─ Step 2: Relationship → "company"
├─ Step 3: Related Field → "employee_count"
├─ Step 4: Condition → "greater_than"
└─ Step 5: Value → "100"

Fragments: ["relationship", "company", "employee_count", "greater_than", "100"]
```

**Example 3: Activity filter**

```
User: "Contacts who had a meeting logged in the last 7 days"
```

```
DAG Walk:
├─ Step 1: Filter Type → "logged_activity"
├─ Step 2: Activity Type → "meeting"
├─ Step 3: Time Condition → "within_last"
└─ Step 4: Time Value → "7 days"

Fragments: ["logged_activity", "meeting", "within_last", "7_days"]
```

Each walk produces a valid fragment list. No hallucinations, no invalid combinations.
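Each walk can also be checked mechanically against a transition table. A self-contained sketch — the table below is illustrative, not the production DAG:

```python
# Hedged sketch: verify a fragment list against per-step valid options.
# The table is illustrative, not the production DAG.
STEP_OPTIONS = [
    ("filter_type", {"field", "relationship", "logged_activity"}),
    ("field", {"claim_count", "status"}),
    ("condition", {"greater_than", "equals"}),
]

def validate_walk(fragments: list[str]) -> bool:
    """True if each fragment is a valid option at its step."""
    for (name, options), fragment in zip(STEP_OPTIONS, fragments):
        if fragment not in options:
            return False
    return True
```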

***

### Trust, but Verify <a href="#db60" id="db60"></a>

Even with DAG constraints and structured outputs, we add guardrails for edge cases:

**Pre-flight checks:**

* Is this actually a filter request? (vs. a general question)
* Does it reference fields that exist on this object?
* Is it asking for something our system can express?

**Runtime validation:**

* Max iteration limits prevent infinite loops.
* Invalid responses trigger retry with error context.

```
class GuardrailResponse(BaseModel):
    is_valid_filter_request: bool
    reason: str | None
    suggested_clarification: str | None
```

Guardrails catch edge cases the DAG structure can’t prevent, such as requests that are technically valid filters but nonsensical (“filter where email equals 42”).

***

### Cleaning Up: From Fragments to System Format <a href="#f48f" id="f48f"></a>

With validated fragments in hand, the final step is converting them to your system’s expected format.

The DAG produces raw fragments as simple lists, for example: `["field", "status", "equals", "Active"]`. These are *semantically valid* but not yet in the format your system expects.

A **payload cleaner** step transforms fragments into your target schema:

```
def clean_fragment(fragment: list[str]) -> dict:
    filter_type = fragment[0]

    # Route to type-specific transformer
    transformers = {
        "field": transform_field_filter,
        "relationship": transform_relationship_filter,
        "logged_activity": transform_activity_filter,
    }
    return transformers[filter_type](fragment)
```

This separation is intentional: think of it like a compiler. The DAG produces an intermediate representation (IR): semantically valid fragments that capture user intent. The payload cleaner then compiles this IR to your target format, mapping field names to IDs, converting values to your schema’s expected types, and building nested structures.

Why separate them? Because the DAG logic stays clean and testable, while system-specific formatting lives in one place. When your schema changes, you update the cleaner, not the entire DAG.
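For concreteness, one transformer from that table might look like this. The output keys and the `FIELD_IDS` map are assumptions about a hypothetical target schema, not Kizen’s actual format:

```python
# Hedged sketch of a single transformer: fragment list -> target payload.
# FIELD_IDS and the output keys are illustrative assumptions.
FIELD_IDS = {"status": "fld_123", "claim_count": "fld_456"}

def transform_field_filter(fragment: list[str]) -> dict:
    _, field_name, condition, value = fragment
    return {
        "field_id": FIELD_IDS[field_name],  # human name -> system ID
        "condition": condition,
        "value": value,
    }
```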

***

### The Cost of Decomposition <a href="#e58b" id="e58b"></a>

Let’s be honest: this approach uses multiple LLM calls instead of one. That has costs.

**When this is overkill:**

* Simple schemas with few valid combinations.
* One-off scripts where validation errors are acceptable.
* Latency-critical applications where multiple round-trips hurt UX.

**Development cost:**

* Each filter type requires its own DAG branch, step definitions, and payload cleaner.
* Adding a new filter type means implementing the full pipeline, not just updating a prompt.
* Upfront investment is significant; payoff comes from reliability at scale.

**Inference tradeoffs:**

* More calls = more latency (each step waits for the previous steps).
* More calls = higher token costs.
* But: each call is simpler, more reliable, and easier to debug.

#### Latency Mitigation Strategies

| Strategy            | How it Helps                                                                                               |
| ------------------- | ---------------------------------------------------------------------------------------------------------- |
| Prompt Caching      | LLM providers such as OpenAI and Gemini offer native prompt caching, reducing latency and cost.            |
| Prompt Compression  | Minimize context passed to each step                                                                       |
| Step Merging        | Combine low-risk steps (e.g., field + condition in one call)                                               |
| Parallel Processing | Process multiple filters concurrently                                                                      |
| Fast-Forward Steps  | Skip LLM call when only one valid option exists                                                            |

```
def process_step(self, step: Step) -> str:
    # If only one valid option exists, skip the LLM call entirely
    if len(step.options) == 1:
        return step.options[0]
    # Otherwise fall back to a constrained LLM call
    return self.llm.predict(
        response_model=step.get_response_model()
    ).value
```

***

### Comparing Architectures: The Engineering Trade-offs <a href="#b0f8" id="b0f8"></a>

While the DAG approach solved our specific constraints, it isn’t the only way to build agentic systems. Here is how the architectures stack up:

<div data-with-frame="true"><figure><img src="https://695940432-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FI9zTS3Se7JKwAjYJD8Xc%2Fuploads%2F8QaWRai0pXv5jAOxVjC8%2FArchitectures.jpg?alt=media&#x26;token=431e035b-3fab-4beb-8803-0d1e5ea13473" alt=""><figcaption></figcaption></figure></div>

#### 1. One-Shot Prompting (Optimized for Speed) <a href="#cc6a" id="cc6a"></a>

* **Best for:** **Independent Variables.**
* **Why:** If you ask for `Name` and `Age`, the model can predict both simultaneously because your name doesn't change the valid format of your age.
* **The Trap:** It fails when variables are interdependent. If choosing “Field A” radically changes the valid options for “Operator B,” the one-shot model tries to hold the entire decision tree in its context window at once. It’s fast, but brittle.

#### 2. Specialist Sub-Agents (Optimized for Breadth) <a href="#id-2755" id="id-2755"></a>

* **Best for:** Disconnected Domains.
* **Why:** Great for keeping distinct logic separate, like routing “Sales Questions” vs. “Support Tickets” to different prompts.
* **The Trap:** Sub-agents struggle with shared state. If the “Date Agent” needs to know exactly what the “Field Agent” selected to validate its bounds, you end up passing massive context blobs back and forth, reintroducing the complexity you tried to avoid.

#### 3. Structured DAG Router (Optimized for Dependencies) <a href="#id-37f1" id="id-37f1"></a>

* **Best for:** Cascading Constraints.
* **Why:** This is the only architecture where Output A defines the Schema of Output B.
* **The Win:** When the LLM picks a “Date Field,” the DAG strictly prunes the universe of next steps. The “Text Operators” (like `contains`) literally cease to exist as options. The model can't hallucinate them because they aren't in the available schema.

**The Verdict:** If your outputs constrain each other (e.g. *Field Type* → determines *Condition* → determines *Value Format*), you cannot rely on probability. You need the architectural enforcement of a DAG.

***

### Where Else Can You Use This? <a href="#id-1f82" id="id-1f82"></a>

DAG decomposition isn’t just for filters. The pattern applies anywhere you need to translate fuzzy intent into precise structure:

* **Workflow engines**: Decompose “send email, wait 2 days, check response” into step DAGs.
* **Query builders**: Natural language → SQL/GraphQL with constrained options per step.
* **Form generators**: “Create a signup form with email validation” → structured form config.
* **Low-code platforms**: Translate business intent into executable configuration.

Anywhere you need to constrain LLM output to match a specific system’s requirements, DAG decomposition keeps the output space manageable.

***

### Key Takeaways <a href="#e51e" id="e51e"></a>

* **DAG decomposition guarantees validity:** Every output works with your existing system.
* **Constraining LLM choices at each step:** Dramatically reduces hallucinations.
* **Fragments provide composability:** Same logic, different output formats.
* **The pattern generalizes:** Beyond filters to any structured output generation.
* **Trade-offs exist**: More LLM calls, but each is simpler and more reliable.

At Kizen, this approach transformed our AI filter builder from a brittle experiment into a production-ready feature. The next time you’re tempted to dump your entire schema into an LLM prompt, consider breaking the problem into steps. Because ultimately, reliability isn’t a prompting problem; it’s an architecture problem.

***

{% columns %}
{% column width="25%" %}

<div data-with-frame="true"><figure><img src="https://695940432-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FI9zTS3Se7JKwAjYJD8Xc%2Fuploads%2FblW6Re9W507fY6ssVebu%2Fmord_64x64.jpg?alt=media&#x26;token=15ed5bc7-3961-4711-8be2-e8cd3595f0b7" alt=""><figcaption></figcaption></figure></div>
{% endcolumn %}

{% column width="75%" %}

#### Written by Mordechai Worch

ML/Data Staff Engineer @ Kizen
{% endcolumn %}
{% endcolumns %}
