Lexega turns SQL into signals, deterministically
If you’ve ever reviewed a SQL PR where “the diff is bigger than your screen”, you already know the failure mode:
- the change looks reasonable,
- tests pass,
- but a single dangerous statement slips through because nobody can realistically inspect everything.
Lexega is built around a simple idea: turn SQL into deterministic, actionable “signals” before it runs, then use policy to decide what to do with those signals. It’s a guardrail layer for SQL — a structural analysis engine that sits between “code written” and “code deployed”.
This post is a quick tour of what actually happens under the hood when you run lexega-sql analyze.
The pipeline: SQL → semantics → signals
At a high level, Lexega does four things:
- Tokenize and parse your SQL (including multi-statement scripts and Jinja/dbt templates)
- Walk the AST to extract semantic facts — tables read/written, grants, policy changes — and emit categorical signals that describe what actually happened
- Match rules (builtin + custom YAML) against those signals to assign severity and messaging
- Evaluate policy to produce a decision (block, warn, or allow)
You can think of it like this:
SQL text
→ lexer/parser
→ semantic extraction
→ signals (category/surface/condition)
→ rule matching (builtin + custom)
→ policy evaluation (env-aware)
→ decision.json (allow / warn / block)
The important bit is that nothing in that pipeline is probabilistic: the same input yields the same signals—every time.
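To make that concrete, here is a rough sketch in Python, purely illustrative (not Lexega's internals; the rule table is made up), of what a deterministic signal-to-rule stage looks like: signals are plain category/surface/condition facts, and matching is a pure lookup over them.

from dataclasses import dataclass

@dataclass(frozen=True)
class Signal:
    category: str   # e.g. "data_integrity"
    surface: str    # e.g. "write"
    condition: str  # e.g. "unbounded"

# Hypothetical rule table -- the real built-in rules and their names live in Lexega.
RULES = {
    ("data_integrity", "write", "unbounded"):
        ("CRITICAL", "Unbounded write operation detected - no WHERE clause."),
    ("governance", "schema", "dropped"):
        ("CRITICAL", "Schema dropped."),
}

def match_rules(signals):
    # Pure function of the input signals: same signals in, same findings out.
    findings = []
    for s in signals:
        hit = RULES.get((s.category, s.surface, s.condition))
        if hit:
            severity, message = hit
            findings.append((severity, message, s))
    return findings

print(match_rules([Signal("data_integrity", "write", "unbounded")]))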
Signals are semantic events
Signals in Lexega describe what a statement does, not how it looks. Examples:
- an unbounded write (e.g. `DELETE FROM t;` without a `WHERE`)
- a policy removed (e.g. a masking/row access policy dropped)
- a storage security change (e.g. encryption disabled on an external stage)
Those events are then mapped to rules that decide severity and messaging, and to policies that decide enforcement.
A concrete example: a bug that requires column lineage to detect
Here’s a query that looks perfectly reasonable:
WITH order_details AS (
SELECT
o.order_id,
o.total,
c.name,
c.tier
FROM orders o
LEFT JOIN customers c
ON o.customer_id = c.id
)
SELECT *
FROM order_details
WHERE tier = 'enterprise'
ORDER BY total DESC;
Run it through Lexega:
echo "WITH order_details AS (
SELECT o.order_id, o.total, c.name, c.tier
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.id
)
SELECT * FROM order_details
WHERE tier = 'enterprise'
ORDER BY total DESC;" | lexega-sql analyze --stdin --min-severity info
signals:
[CRITICAL] LEFT JOIN nullable side filtered in WHERE clause.
This effectively converts the LEFT JOIN to an INNER JOIN,
likely a bug.
↳ Line 7 • `data_integrity:join:nullable_table_filtered:table=customers,column=tier`
The WHERE tier = 'enterprise' in the outer query silently converts the LEFT JOIN inside the CTE into an INNER JOIN — any order without a matching customer is dropped instead of preserved with NULLs. This is one of the most common SQL bugs in analytics code, and the LEFT JOIN and the filter aren’t even in the same scope. Catching it requires:
- Parsing the LEFT JOIN inside the CTE and knowing `c` is the nullable (right) side
- Tracking that `tier` in the CTE’s SELECT list originates from `c.tier`
- Following that column lineage through the CTE boundary into the outer query
- Recognizing that `WHERE tier = 'enterprise'` filters on a nullable-origin column, negating the LEFT JOIN
That’s column lineage across CTE boundaries — the kind of structural analysis that makes Lexega’s signals useful for catching real bugs, not just flagging syntax.
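If you want to see the bug itself, independent of any tooling, here is a tiny self-contained reproduction in Python with sqlite3 and made-up sample data: the order with no matching customer survives the LEFT JOIN inside the CTE, then silently disappears once the outer query filters on the nullable-origin column.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, total REAL);
    CREATE TABLE customers (id INTEGER, name TEXT, tier TEXT);
    INSERT INTO orders VALUES (1, 10, 500.0), (2, NULL, 900.0);  -- order 2 has no customer
    INSERT INTO customers VALUES (10, 'Acme', 'enterprise');
""")

query = """
    WITH order_details AS (
        SELECT o.order_id, o.total, c.name, c.tier
        FROM orders o
        LEFT JOIN customers c ON o.customer_id = c.id
    )
    SELECT order_id, total FROM order_details
    {where}
    ORDER BY total DESC;
"""

# Without the outer filter: both orders come back, order 2 with a NULL tier.
print(conn.execute(query.format(where="")).fetchall())
# -> [(2, 900.0), (1, 500.0)]

# With the filter on the nullable-origin column: order 2 vanishes.
print(conn.execute(query.format(where="WHERE tier = 'enterprise'")).fetchall())
# -> [(1, 500.0)]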
Semantic diff: catching high-risk changes
If you care about changes (not just “does this SQL contain a JOIN”), you need semantic diff.
Example: a PR touches a revenue query. The text diff shows one line changed in a JOIN predicate. Is that a big deal? Depends which tables are involved and what the predicate used to be.
Before:
SELECT
o.order_id,
o.total,
c.customer_name,
c.tier
FROM ANALYTICS.ORDERS o
INNER JOIN ANALYTICS.CUSTOMERS c
ON o.customer_id = c.customer_id
WHERE o.region = 'NA'
ORDER BY o.total DESC;
After (one column name changed in the ON clause):
SELECT
o.order_id,
o.total,
c.customer_name,
c.tier
FROM ANALYTICS.ORDERS o
INNER JOIN ANALYTICS.CUSTOMERS c
ON o.billing_customer_id = c.customer_id
WHERE o.region = 'NA'
ORDER BY o.total DESC;
In a text diff, that’s a single-line change buried in context. Lexega’s semantic diff surfaces what it means:
lexega-sql diff main..HEAD models/ -r
─── models/revenue.sql ───
🟠 JoinConditionChanged JOIN condition changed: ANALYTICS.ORDERS ↔ ANALYTICS.CUSTOMERS
- ❌ removed: ANALYTICS.ORDERS.customer_id = ANALYTICS.CUSTOMERS.customer_id
- ✅ added: ANALYTICS.ORDERS.billing_customer_id = ANALYTICS.CUSTOMERS.customer_id
The key point isn’t that it can “parse JOIN”. It’s that it can answer the review question you actually care about: “did the join condition between ORDERS and CUSTOMERS change, and how?”
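Lexega’s diff engine isn’t Python and this isn’t its implementation, but to get a feel for the idea, here is a small sketch using the open-source sqlglot parser: compare the parsed join conditions of the two versions instead of their text, and the one-line edit shows up as a structural change.

import sqlglot
from sqlglot import exp

def join_conditions(sql):
    # Collect the ON-clause of every JOIN in the statement, as normalized SQL.
    tree = sqlglot.parse_one(sql, read="snowflake")
    conds = set()
    for join in tree.find_all(exp.Join):
        on = join.args.get("on")
        if on is not None:
            conds.add(on.sql(dialect="snowflake"))
    return conds

before = "SELECT * FROM ANALYTICS.ORDERS o JOIN ANALYTICS.CUSTOMERS c ON o.customer_id = c.customer_id"
after  = "SELECT * FROM ANALYTICS.ORDERS o JOIN ANALYTICS.CUSTOMERS c ON o.billing_customer_id = c.customer_id"

print("removed:", join_conditions(before) - join_conditions(after))
# -> something like {'o.customer_id = c.customer_id'}
print("added:  ", join_conditions(after) - join_conditions(before))
# -> something like {'o.billing_customer_id = c.customer_id'}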
Jinja/dbt-templated SQL
Templating is hard to get right, especially without a Python runtime in the loop. Even if the underlying SQL is valid, the file is no longer “just SQL”.
Lexega always attempts to render Jinja templates before analysis. The question is how much context it has to work with.
dbt project mode (automatic): When you run lexega-sql analyze from inside a dbt project, Lexega detects the project context and automatically renders templates using your project’s profiles, vars, and macros — including ref(), source(), and config() resolution. No flags required. This is the default behavior in CI when your working directory is a dbt project.
Explicit vars: Outside a dbt project, you can supply variables directly with --var, --var-file, or --dbt-profile. Undefined Jinja variables evaluate as falsy (standard Jinja behavior), so even without explicit vars, {% if %} branches resolve deterministically; you just might not get the branch you intended.
Fallback: If rendering fails entirely (e.g., templates use runtime-only dbt functions like run_query or adapter.*), Lexega falls back to analyzing the raw template structure. Coverage will be limited — Jinja in structurally important positions (table names, conditions) means less analysis. Rendering is always the stronger path.
Consider a template-guarded deletion:
{% if safe_mode %}
DELETE FROM {{ table }}
WHERE created_at < DATEADD('day', -90, CURRENT_TIMESTAMP());
{% else %}
DELETE FROM {{ table }};
{% endif %}
Run it with safe mode off:
cat model.sql | lexega-sql analyze --stdin \
--var "table=production.users" --var "safe_mode=false"
signals:
[CRITICAL] Unbounded write operation detected - no WHERE clause.
This affects ALL rows in the target table(s).
Now flip the variable:
cat model.sql | lexega-sql analyze --stdin \
--var "table=production.users" --var "safe_mode=true"
No signals. The bounded DELETE ... WHERE is structurally safe.
Same template, different variables, completely different risk profile — and the analysis is deterministic for each combination. This is the kind of thing you can wire into CI: render with your actual deployment context, then enforce policy on the result.
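Lexega’s renderer is its own, but since dbt templates are just Jinja, you can sanity-check the template semantics with the plain jinja2 library. Note the third render: an undefined safe_mode is falsy, so leaving the variable out silently selects the {% else %} branch, exactly the “you might not get the branch you intended” case from earlier.

from jinja2 import Template

template = Template("""
{% if safe_mode %}
DELETE FROM {{ table }}
WHERE created_at < DATEADD('day', -90, CURRENT_TIMESTAMP());
{% else %}
DELETE FROM {{ table }};
{% endif %}
""")

print(template.render(table="production.users", safe_mode=True))   # bounded DELETE
print(template.render(table="production.users", safe_mode=False))  # unbounded DELETE
print(template.render(table="production.users"))                   # safe_mode undefined -> also unbounded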
Your rules, same engine
The built-in rules ship with hundreds of signals, but the underlying engine is the same one you use to define your own.
Custom rules are declarative YAML — no plugins, no scripting, no code to deploy. You match on the same signal taxonomy that the built-in rules use, but you can combine them in ways that are specific to your org.
For example, say your revenue pipeline depends on INNER JOINs between ANALYTICS.ORDERS and ANALYTICS.CUSTOMERS — if someone changes one of those to a LEFT JOIN in a PR, you want that blocked in prod. A custom diff rule can match on exactly that:
rules:
- id: "ACME-002"
name: "block-revenue-join-change"
risk_level: Critical
message: "Join type changed on revenue-critical tables. Requires data-eng approval."
diff_triggers:
- change_type: JoinTypeChanged
match:
tables:
any_of: ["ANALYTICS.ORDERS", "ANALYTICS.CUSTOMERS"]
from_kind: "Inner"
to_kind: ["Left", "Cross"]
This matches only when a semantic diff detects a join type change from INNER to LEFT or CROSS, and only on those two tables. The signal itself reports the affected tables and the before/after join types, so the alert tells you exactly what changed.
You can also override built-in rules. The built-in SCHEMA-DROP rule fires at Critical severity when someone drops a schema. If your org uses ephemeral schemas as a standard part of your data pipeline (e.g., swap-and-drop patterns for zero-downtime deploys), Critical is the wrong default — it’s noise, not signal. You don’t have to disable detection entirely; you can downgrade it:
rules:
- id: "SCHEMA-DROP"
name: "schema-drop-downgrade"
risk_level: Low
message: "Schema dropped (downgraded per org policy -- swap-and-drop is expected)"
triggers:
- statement_type: DropSchemaStatement
categorical_signal:
category: GOVERNANCE
surface: schema
condition: dropped
The built-in Critical rule is suppressed in favor of your custom Low rule. No code changes, no recompilation, no forking. And if you get the signal category wrong? Schema validation catches it before the rule can silently do nothing.
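As a sketch of what that last point means in practice (hypothetical, cut-down schema; not Lexega’s actual one), an enum on the signal category is enough to reject a typo at load time instead of letting the rule silently match nothing:

import yaml
from jsonschema import validate, ValidationError

CATEGORY_ENUM = ["GOVERNANCE", "DATA_INTEGRITY", "SECURITY"]  # illustrative list, not Lexega's taxonomy
SCHEMA = {
    "type": "object",
    "required": ["category", "surface", "condition"],
    "properties": {
        "category": {"enum": CATEGORY_ENUM},
        "surface": {"type": "string"},
        "condition": {"type": "string"},
    },
}

signal = yaml.safe_load("""
category: GOVERNENCE    # typo
surface: schema
condition: dropped
""")

try:
    validate(instance=signal, schema=SCHEMA)
except ValidationError as e:
    print("rule rejected:", e.message)  # 'GOVERNENCE' is not one of [...]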
The point is that Lexega’s detection layer isn’t a black box you have to accept as-is. The built-in rules are a starting point; your org’s specific risks get the same treatment.
From signals to policy: CI-friendly decisions
Signals are informational on their own. Enforcement comes from policy.
In CI/CD you typically run analysis with a policy file and environment context:
lexega-sql analyze migrations/*.sql \
--policy .lexega/policy.yml \
--env prod \
--decision-out decisions/$GITHUB_RUN_ID/
That produces a machine-readable decision artifact (decision.json) that your pipeline can treat as a gate.
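The decision.json schema isn’t spelled out in this post, but a gate can be as small as a script like this (Python sketch for a GitHub Actions step; the top-level "decision" field name is an assumption, adjust to the actual artifact):

import json, os, pathlib, sys

# Matches the --decision-out layout above: decisions/$GITHUB_RUN_ID/
run_dir = pathlib.Path("decisions") / os.environ.get("GITHUB_RUN_ID", "local")

blocked = False
for path in sorted(run_dir.glob("**/decision.json")):
    verdict = json.loads(path.read_text()).get("decision", "allow")  # assumed field name
    if verdict == "block":
        print(f"::error::{path}: policy blocked this change")
        blocked = True
    elif verdict == "warn":
        print(f"::warning::{path}: policy raised warnings")

sys.exit(1 if blocked else 0)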
This separation is intentional:
- Rules define what gets detected (and at what severity)
- Policies define what to do about it (block in prod, warn in staging, allow in dev)
- Exceptions provide auditable, time-scoped overrides
Why determinism matters (especially with AI-generated SQL)
If your review volume is increasing because of AI-assisted changes, the worst possible property in a guardrail is “it depends”.
Deterministic guardrails buy you:
- Repeatability: the same PR produces the same findings every time
- Auditability: you can point to exactly why a deploy was blocked
- Low-noise enforcement: policies can be tuned per environment instead of relying on human judgment in every PR
It’s not a replacement for review. It’s a guardrail that makes review possible again when the diffs stop fitting inside a human brain.
Try it
curl -fsSL https://lexega.com/install.sh | sh
Then point it at something real:
export LEXEGA_LICENSE_KEY=
lexega-sql analyze your_migration.sql --min-severity medium
- Quick Start — first analysis in under a minute
- Signal Analysis — what gets detected and why
- Semantic Diff — catch risky changes, not just patterns
- Policy Reference — CI enforcement with environment-aware rules