# Eliminating Schema Hallucinations in AI Systems with DAGs

January 7th, 2026 | Written by Mordechai Worch | 10 Min 📖

<div data-with-frame="true"><figure><img src="https://695940432-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FI9zTS3Se7JKwAjYJD8Xc%2Fuploads%2F6cQW1LGGbtipGvhDkJx3%2FBattleprompting.jpg?alt=media&#x26;token=066a99e6-51fb-495f-b494-b48a5b1faf8a" alt=""><figcaption><p><strong>From Brittle to Bulletproof:</strong> Visualizing the shift from single-shot prompting (which often fails silently) to a directed acyclic graph (DAG) that validates user intent incrementally.</p></figcaption></figure></div>

Natural language is messy. Filters are precise.

At Kizen, we spent years building a powerful filter engine, composite query architecture, relationship filters, activity tracking and custom fields across dozens of object types. It handles everything from “show me active leads” to “contacts whose company has more than 100 employees AND submitted a form in the last 30 days OR triggered our welcome automation.”

Then we tried to let users build these filters with natural language.

Modern LLMs excel at grasping intent. If a user asks for ‘active leads,’ the model knows exactly what that means conceptually. The problem arises when that concept hits the rigid reality of a database schema.

LLMs are confident improvisers. They often generate payloads that *look* perfect to the human eye but are gibberish to a compiler, inventing field names that don’t exist or syntax that your engine never supported. The result isn’t a working filter; it’s a silent failure.

Our solution: decompose filter intent into a directed acyclic graph (DAG) of logic fragments, where every path through the graph produces a valid filter. No hallucinations. No invalid combinations. Just guaranteed-valid output.

***

### Why Dumping Your Schema into an LLM Doesn’t Work <a href="#id-8ce9" id="id-8ce9"></a>

Our first approach was obvious: give the LLM our complete filter schema and ask it to generate valid JSON. It failed spectacularly.

The complexity is combinatorial. We are bridging the gap between a probabilistic LLM and a deterministic query engine designed for exact precision. While the LLM deals in “likely” tokens, our engine requires strict type alignment to maintain data integrity:

* “`equals`” works on text fields.
* “`greater_than`” works on numbers.
* “`contains_any`” works on multi-select fields.
* “`between`” works on dates.
* And dozens more condition-field type combinations.

Multiply fields × conditions × values × boolean groupings and you get thousands of valid combinations — but millions of invalid ones. LLMs generate plausible-looking output that crashes our parser or, worse, fails silently:

* **Hallucinated fields**: “filter by `customer_tier`” when the field is actually called `membership_level`.
* **Invalid conditions**: Using `equals` on a date field instead of `on_date`.
* **Malformed groupings**: Nested AND/OR logic that doesn’t match our query structure.

When a user says “customers who signed up last week OR have premium status AND live in California,” the boolean logic must match our exact query format. One wrong bracket and the filter returns garbage.
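To make the mismatch concrete, the compatibility rules above can be written as a small lookup table. This is a hedged sketch — the field types and condition names are illustrative, not Kizen’s actual schema:

```python
# Illustrative sketch: which conditions are valid for which field types.
# The type and condition names are examples, not the production schema.
VALID_CONDITIONS = {
    "text": {"equals", "not_equals", "contains"},
    "number": {"equals", "greater_than", "less_than"},
    "multi_select": {"contains_any", "contains_all"},
    "date": {"between", "on_date", "before", "after"},
}

def is_valid(field_type: str, condition: str) -> bool:
    """True only if the condition applies to the field type."""
    return condition in VALID_CONDITIONS.get(field_type, set())
```

A one-shot prompt has to internalize this entire table at once; the approach described below only ever exposes one row of it at a time.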

***

### Not All Filters Are Created Equal <a href="#aac8" id="aac8"></a>

Before diving into the solution, it helps to understand what we’re dealing with. Our filter engine supports eight distinct filter types, each with its own structure:

#### Description of Different Filter Types

<table><thead><tr><th width="209.51953125">Filter</th><th>What It Does</th><th>Example</th></tr></thead><tbody><tr><td>Field Filters</td><td>Filter on record's own fields</td><td>status equals active</td></tr><tr><td>Relationship Filters</td><td>Filter on related object's fields</td><td>company.employee_count>100</td></tr><tr><td>Relationship Group Filters</td><td>Use saved filter groups from related objects</td><td>company in 'Enterprise Accounts' group</td></tr><tr><td>Logged Activity Filters</td><td>Filter by activities recorded on records.</td><td>had a meeting in last 30 days</td></tr><tr><td>Scheduled Activity Filters</td><td>Filter by upcoming activities</td><td>has call scheduled this week</td></tr><tr><td>Form Submission Filters</td><td>Filter by form responses</td><td>submitted 'Contact Us' form</td></tr><tr><td>Automation Filters</td><td>Filter by automation execution</td><td>triggered 'Welcome Email' automation</td></tr><tr><td>Team Interaction Filters</td><td>Filter by interactions with employees</td><td>contracts John reviewed in last 2 weeks</td></tr></tbody></table>

Each filter type has its own valid conditions, value formats, and nesting rules. A field filter on a text field accepts different conditions than a field filter on a date field. A relationship filter, by contrast, must also specify the related object and the field on that object.

This complexity is exactly why naive prompting fails and why DAG decomposition works.

***

### From Messy Input to Structured Intent <a href="#a4e6" id="a4e6"></a>

Before walking the DAG, we need to understand what the user actually wants. Consider this input:

> *“Show me customers who bought something last month and haven’t been contacted”*

This input contains two distinct filter conditions joined by AND, each of which must be processed independently.

**The Rewrite Phase** uses an LLM to decompose user intent into structured filter statements:

```
class RewritePromptResponse(BaseModel):
    filters: list[str]        # Individual filter intents
    logical_grouping: str     # "AND" or "OR" or "Complex"
```

For the input “active customers from California or premium members,” rewriting produces:

```
filters: [
    "status equals active",
    "state equals California",
    "membership_tier equals premium"
]
logical_grouping: "('status equals active' and 'state equals California') 
                    or ('membership_tier equals premium')"
```

Each filter in the list gets processed through the DAG independently, then combined at the end. This parallelizes the work and keeps each DAG walk focused on a single filter intent.
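Once each filter in the list has been walked through the DAG, the results are recombined under the logical grouping. A minimal sketch of that recombination step — the nested `{"and": [...]}` shape is an assumption for illustration, not our actual payload format:

```python
# Hedged sketch: recombine independently processed filters under a
# boolean grouping. The {"and": [...]} shape is illustrative only.
def combine(filters: list[dict], grouping: str) -> dict:
    if len(filters) == 1:
        return filters[0]
    key = "and" if grouping.upper() == "AND" else "or"
    return {key: filters}
```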

***

### Breaking Intent into Steps <a href="#id-3669" id="id-3669"></a>

Instead of asking the LLM to generate a complete filter in one shot, we break the problem into constrained steps:

1. **Choose filter type**: field, relationship, activity, etc.
2. **Choose field** (for field filters): only fields that exist on this object.
3. **Choose condition**: only conditions valid for that field’s type.
4. **Provide value**: only value formats that work with that condition.

At each step, the LLM sees only valid options. When you select a text field, you don’t see date conditions. When you select “equals,” you provide a string, not a date range.
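A stripped-down sketch of those four steps as linked nodes — the field names and options here are illustrative, not our production DAG:

```python
from dataclasses import dataclass, field

# Illustrative sketch: each step only exposes options that are valid
# given the choices made so far. Names are examples, not production data.
@dataclass
class StepNode:
    name: str
    options: list[str]
    next_steps: dict[str, "StepNode"] = field(default_factory=dict)

value_step = StepNode("value", options=[])

# A text field and a number field expose different condition sets
text_conditions = StepNode("condition", ["equals", "contains"],
                           {"equals": value_step, "contains": value_step})
number_conditions = StepNode("condition", ["equals", "greater_than"],
                             {"equals": value_step, "greater_than": value_step})

field_step = StepNode("field", ["status", "claim_count"],
                      {"status": text_conditions, "claim_count": number_conditions})
```

Selecting `status` means numeric conditions simply never appear in the schema the LLM sees at the next step.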

We built a lightweight custom router (not LangGraph or similar frameworks) to keep the logic simple and debuggable:

```
class Step:
    name: str                    # Human-readable name
    next_steps: list[Step]       # Valid transitions in DAG
    output_type: type            # What this step produces
    output_is_step: bool         # True if output determines routing

    def get_response_model(self):
        # Dynamically creates Pydantic model for LLM response
        # Only includes valid options for this step
        ...
```

The result is a system where invalid filters cannot be generated. Every path through the DAG produces output that is valid for our filter engine.

***

### Making the LLM Color Inside the Lines <a href="#id-7b0a" id="id-7b0a"></a>

Now that we have steps, how do we guarantee each step’s output is valid? Modern LLMs support **structured output**, forcing responses to match a specific JSON schema. Instead of free-form text that might hallucinate invalid values, the LLM must return valid JSON matching a Pydantic model.

This is the secret sauce: at each DAG step, we generate a *dynamic* Pydantic model containing only valid options for that specific point in the graph. This also saves tokens, since we’re not feeding the LLM the schema for date filters when it’s processing a text field.

```
def get_response_model(self) -> type[BaseModel]:
    # Dynamically create enums and typed fragments based on this step
    if self.has_choices:
        # Create dynamic enum - LLM must pick from exact valid options
        Options = StringEnum("Options", {opt: opt for opt in self.choices})
        return create_model("Response", value=(Options, ...))
    elif self.output_type == "date":  # sentinel marking date-valued steps
        # Break date into validated components - no string parsing errors
        class DateFragment(BaseModel):
            year: int = Field(ge=1900, le=2100)
            month: int = Field(ge=1, le=12)
            day: int = Field(ge=1, le=31)

            def get_value(self) -> str:
                return f"{self.year}-{self.month:02d}-{self.day:02d}"

        return DateFragment
    elif self.output_type == int:
        # Constrain integers with bounds
        return create_model("Response", value=(Annotated[int, Field(ge=0, le=1000)], ...))
    return create_model("Response", value=(str, ...))
```

The LLM cannot hallucinate invalid field names because dynamic enums constrain it to exact valid options. For dates, we break the value into typed components (year, month, day) with built-in bounds validation: no string parsing, no format errors.

Structured outputs + dynamic schemas = the LLM can only produce valid fragments, making invalid output impossible by construction.
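At its core, the dynamic-enum trick is plain Python: the standard `enum` functional API makes invalid values unrepresentable. (The Pydantic `create_model` wiring is omitted here.)

```python
from enum import Enum

# Only the fields that exist on this object at this DAG step
valid_fields = ["status", "state", "membership_level"]

# Build the enum dynamically - hallucinated names are not representable
FieldOptions = Enum("FieldOptions", {name: name for name in valid_fields})

assert FieldOptions("status").value == "status"
# FieldOptions("customer_tier") raises ValueError - the hallucinated
# field does not exist in the schema at all
```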

<div data-with-frame="true"><figure><img src="https://695940432-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FI9zTS3Se7JKwAjYJD8Xc%2Fuploads%2FGOLmfcfgf1BRLulC88Yv%2FDAG.jpg?alt=media&#x26;token=5a2dbaa8-45a3-452d-9dc6-48e79712ce20" alt=""><figcaption><p>This visualization shows how the DAG progressively narrows the output possibilities. At each step, the valid search space shrinks. By the time the LLM reaches the “Condition” step, it isn’t hallucinating from the entire universe of SQL operators; it is choosing from a strict list of five valid enums defined by the previous selection.</p></figcaption></figure></div>

***

### Executing the DAG <a href="#de6a" id="de6a"></a>

The Router walks the graph, accumulating fragments at each step:

```
class Router:
    def __init__(self, metadata, llm):
        self.llm = llm                                  # LLM client used at each step
        self.current_step = build_filter_dag(metadata)  # Root of DAG
        self.fragments = FilterResult()

    def process(self) -> list[str]:
        while self.current_step:
            # LLM picks from valid options only
            response = self.llm.predict(
                response_model=self.current_step.get_response_model(),
                context=self.fragments.to_string()
            )
            # Accumulate the fragment
            self.fragments.add(response.value)
            # Navigate to next step
            self.current_step = response.get_next_step()
        return self.fragments.to_list()
```

Fragments accumulate in a simple structure:

```
class FilterResult:
    def __init__(self):
        # 2D: groups × fragments
        self.filters: list[list[str]] = [[]]

    def add(self, fragment: str):
        self.filters[-1].append(fragment)

    def new_group(self):
        self.filters.append([])
```

Each completed filter becomes a list like `["field", "status", "equals", "active"]`, guaranteed valid, ready for conversion.

<div data-with-frame="true"><figure><img src="https://695940432-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FI9zTS3Se7JKwAjYJD8Xc%2Fuploads%2FpP5fOmjNfMgmlWLBwt6H%2Fworkflow.jpg?alt=media&#x26;token=7813d35c-0ea1-4a12-be5c-287714890c5c" alt=""><figcaption><p>The workflow isolates every atomic filter in the user’s request. By processing the three conditions, Status, State and Tier, through independent DAG routers, we eliminate cross-contamination and hallucinated outputs.</p></figcaption></figure></div>

***

### Walking Through Real Filters <a href="#f966" id="f966"></a>

Let’s trace some examples through the DAG:

**Example 1: Simple field filter**

```
User: "Find all policies where claim count exceeds 3"
```

```
DAG Walk:
├─ Step 1: Filter Type → "field"
├─ Step 2: Field Name → "claim_count"
├─ Step 3: Condition → "greater_than"
└─ Step 4: Value → "3"

Fragments: ["field", "claim_count", "greater_than", "3"]
```

**Example 2: Relationship filter**

```
User: "Show leads whose company has over 100 employees"
```

```
DAG Walk:
├─ Step 1: Filter Type → "relationship"
├─ Step 2: Relationship → "company"
├─ Step 3: Related Field → "employee_count"
├─ Step 4: Condition → "greater_than"
└─ Step 5: Value → "100"

Fragments: ["relationship", "company", "employee_count", "greater_than", "100"]
```

**Example 3: Activity filter**

```
User: "Contacts who had a meeting logged in the last 7 days"
```

```
DAG Walk:
├─ Step 1: Filter Type → "logged_activity"
├─ Step 2: Activity Type → "meeting"
├─ Step 3: Time Condition → "within_last"
└─ Step 4: Time Value → "7 days"

Fragments: ["logged_activity", "meeting", "within_last", "7_days"]
```

Each walk produces a valid fragment list. No hallucinations, no invalid combinations.
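Each walk can also be checked mechanically against a transition table. A self-contained sketch — the table below is illustrative, not the production DAG:

```python
# Hedged sketch: verify a fragment list against per-step valid options.
# The table is illustrative, not the production DAG.
STEP_OPTIONS = [
    ("filter_type", {"field", "relationship", "logged_activity"}),
    ("field", {"claim_count", "status"}),
    ("condition", {"greater_than", "equals"}),
]

def validate_walk(fragments: list[str]) -> bool:
    """True if each fragment is a valid option at its step."""
    for (name, options), fragment in zip(STEP_OPTIONS, fragments):
        if fragment not in options:
            return False
    return True
```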

***

### Trust, but Verify <a href="#db60" id="db60"></a>

Even with DAG constraints and structured outputs, we add guardrails for edge cases:

**Pre-flight checks:**

* Is this actually a filter request? (vs. a general question)
* Does it reference fields that exist on this object?
* Is it asking for something our system can express?

**Runtime validation:**

* Max iteration limits prevent infinite loops.
* Invalid responses trigger retry with error context.

```
class GuardrailResponse(BaseModel):
    is_valid_filter_request: bool
    reason: str | None
    suggested_clarification: str | None
```

Guardrails catch edge cases the DAG structure can’t prevent, such as requests that are technically valid filters but nonsensical (“filter where email equals 42”).

***

### Cleaning Up: From Fragments to System Format <a href="#f48f" id="f48f"></a>

With validated fragments in hand, the final step is converting them to your system’s expected format.

The DAG produces raw fragments as simple lists, for example: `["field", "status", "equals", "Active"]`. These are *semantically valid* but not yet in the format your system expects.

A **payload cleaner** step transforms fragments into your target schema:

```
def clean_fragment(fragment: list[str]) -> dict:
    filter_type = fragment[0]

    # Route to type-specific transformer
    transformers = {
        "field": transform_field_filter,
        "relationship": transform_relationship_filter,
        "logged_activity": transform_activity_filter,
    }
    return transformers[filter_type](fragment)
```

This separation is intentional: think of it like a compiler. The DAG produces an intermediate representation (IR): semantically valid fragments that capture user intent. The payload cleaner then compiles this IR to your target format, mapping field names to IDs, converting values to your schema’s expected types, and building nested structures.

Why separate them? Because the DAG logic stays clean and testable, while system-specific formatting lives in one place. When your schema changes, you update the cleaner, not the entire DAG.
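For concreteness, one transformer from that table might look like this. The output keys and the `FIELD_IDS` map are assumptions about a hypothetical target schema, not Kizen’s actual format:

```python
# Hedged sketch of a single transformer: fragment list -> target payload.
# FIELD_IDS and the output keys are illustrative assumptions.
FIELD_IDS = {"status": "fld_123", "claim_count": "fld_456"}

def transform_field_filter(fragment: list[str]) -> dict:
    _, field_name, condition, value = fragment
    return {
        "field_id": FIELD_IDS[field_name],  # human name -> system ID
        "condition": condition,
        "value": value,
    }
```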

***

### The Cost of Decomposition <a href="#e58b" id="e58b"></a>

Let’s be honest: this approach uses multiple LLM calls instead of one. That has costs.

**When this is overkill:**

* Simple schemas with few valid combinations.
* One-off scripts where validation errors are acceptable.
* Latency-critical applications where multiple round-trips hurt UX.

**Development cost:**

* Each filter type requires its own DAG branch, step definitions, and payload cleaner.
* Adding a new filter type means implementing the full pipeline, not just updating a prompt.
* Upfront investment is significant; payoff comes from reliability at scale.

**Inference tradeoffs:**

* More calls = more latency (each step waits for the previous steps).
* More calls = higher token costs.
* But: each call is simpler, more reliable, and easier to debug.

#### Latency Mitigation Strategies

| Strategy            | How it Helps                                                                                               |
| ------------------- | ---------------------------------------------------------------------------------------------------------- |
| Prompt Caching      | LLM providers such as OpenAI and Gemini offer native prompt caching, reducing latency and cost.            |
| Prompt Compression  | Minimize context passed to each step                                                                       |
| Step Merging        | Combine low-risk steps (e.g., field + condition in one call)                                               |
| Parallel Processing | Process multiple filters concurrently                                                                      |
| Fast-Forward Steps  | Skip LLM call when only one valid option exists                                                            |

```
def process_step(self, step: Step) -> str:
    # If only one valid option exists, skip the LLM call entirely
    if len(step.options) == 1:
        return step.options[0]
    # Otherwise fall back to a constrained LLM call
    return self.llm.predict(
        response_model=step.get_response_model()
    ).value
```

***

### Comparing Architectures: The Engineering Trade-offs <a href="#b0f8" id="b0f8"></a>

While the DAG approach solved our specific constraints, it isn’t the only way to build agentic systems. Here is how the architectures stack up:

<div data-with-frame="true"><figure><img src="https://695940432-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FI9zTS3Se7JKwAjYJD8Xc%2Fuploads%2F8QaWRai0pXv5jAOxVjC8%2FArchitectures.jpg?alt=media&#x26;token=431e035b-3fab-4beb-8803-0d1e5ea13473" alt=""><figcaption></figcaption></figure></div>

#### 1. One-Shot Prompting (Optimized for Speed) <a href="#cc6a" id="cc6a"></a>

* **Best for:** **Independent Variables.**
* **Why:** If you ask for `Name` and `Age`, the model can predict both simultaneously because your name doesn't change the valid format of your age.
* **The Trap:** It fails when variables are interdependent. If choosing “Field A” radically changes the valid options for “Operator B,” the one-shot model tries to hold the entire decision tree in its context window at once. It’s fast, but brittle.

#### 2. Specialist Sub-Agents (Optimized for Breadth) <a href="#id-2755" id="id-2755"></a>

* **Best for:** Disconnected Domains.
* **Why:** Great for keeping distinct logic separate, like routing “Sales Questions” vs. “Support Tickets” to different prompts.
* **The Trap:** Sub-agents struggle with shared state. If the “Date Agent” needs to know exactly what the “Field Agent” selected to validate its bounds, you end up passing massive context blobs back and forth, reintroducing the complexity you tried to avoid.

#### 3. Structured DAG Router (Optimized for Dependencies) <a href="#id-37f1" id="id-37f1"></a>

* **Best for:** Cascading Constraints.
* **Why:** This is the only architecture where Output A defines the Schema of Output B.
* **The Win:** When the LLM picks a “Date Field,” the DAG strictly prunes the universe of next steps. The “Text Operators” (like `contains`) literally cease to exist as options. The model can't hallucinate them because they aren't in the available schema.

**The Verdict:** If your outputs constrain each other (e.g. *Field Type* → determines *Condition* → determines *Value Format*), you cannot rely on probability. You need the architectural enforcement of a DAG.

***

### Where Else Can You Use This? <a href="#id-1f82" id="id-1f82"></a>

DAG decomposition isn’t just for filters. The pattern applies anywhere you need to translate fuzzy intent into precise structure:

* **Workflow engines**: Decompose “send email, wait 2 days, check response” into step DAGs.
* **Query builders**: Natural language → SQL/GraphQL with constrained options per step.
* **Form generators**: “Create a signup form with email validation” → structured form config.
* **Low-code platforms**: Translate business intent into executable configuration.

Anywhere you need to constrain LLM output to match a specific system’s requirements, DAG decomposition keeps the output space manageable.

***

### Key Takeaways <a href="#e51e" id="e51e"></a>

* **DAG decomposition guarantees validity:** Every output works with your existing system.
* **Constraining LLM choices at each step:** Dramatically reduces hallucinations.
* **Fragments provide composability:** Same logic, different output formats.
* **The pattern generalizes:** Beyond filters to any structured output generation.
* **Trade-offs exist**: More LLM calls, but each is simpler and more reliable.

At Kizen, this approach transformed our AI filter builder from a brittle experiment into a production-ready feature. The next time you’re tempted to dump your entire schema into an LLM prompt, consider breaking the problem into steps. Because ultimately, reliability isn’t a prompting problem; it’s an architecture problem.

***

{% columns %}
{% column width="25%" %}

<div data-with-frame="true"><figure><img src="https://695940432-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FI9zTS3Se7JKwAjYJD8Xc%2Fuploads%2FblW6Re9W507fY6ssVebu%2Fmord_64x64.jpg?alt=media&#x26;token=15ed5bc7-3961-4711-8be2-e8cd3595f0b7" alt=""><figcaption></figcaption></figure></div>
{% endcolumn %}

{% column width="75%" %}

#### Written by Mordechai Worch

ML/Data Staff Engineer @ Kizen
{% endcolumn %}
{% endcolumns %}
