DataHub's SQL parser silently drops lineage. Across every warehouse.
sqlglot falls back to an opaque Command node on procedural SQL —
stored procedures, control flow, dynamic SQL — losing column-level lineage
without warning. gsp-datahub-sidecar recovers it.
Your lineage graph is lying to you.
DataHub uses sqlglot to parse SQL. When it hits procedural constructs —
stored procedures, control flow, dynamic SQL — it silently falls back
to an opaque Command node. Zero lineage extracted. Zero warnings.
This isn't a BigQuery-only gap. The same silent fallback happens across Snowflake, SQL Server, Oracle, Databricks, and every dialect with procedural SQL.
Same query. Dramatically different results.
Same 4 tables. Same query. 0 column-level relationships from sqlglot vs 25 from the sidecar.
Same pattern, any dialect — the gap is universal.
Every warehouse. Every parser gap. One sidecar.
Click a dialect to see the deep dive. If a dialect is marked coming soon, email us to signal demand; it directly shapes our roadmap.
BigQuery
Procedural SQL — DECLARE, IF/END IF, temp tables — drops to Command node.
0 → 25 relationships recovered
Snowflake
Stored procedures and EXECUTE IMMEDIATE silently bypass parsing.
96 tracked issues
Databricks
Notebook SQL and Unity Catalog references fall through the parser.
69 tracked issues
Power BI
#(lf) encoding in M-language means -- comments swallow JOINs and WHERE clauses.
1 → 5 column lineages recovered
SQL Server
T-SQL stored procedures with TRY/CATCH, cursors, and dynamic SQL.
17 tracked issues
Oracle
PL/SQL packages, CONNECT BY, MODEL clause — full procedural support.
47 tracked issues
Hive
HiveQL UDFs and TRANSFORM clauses bypass standard parsing.
154 tracked issues
Spark SQL
Spark SQL extensions, DataFrame lineage, and catalog references.
265 tracked issues
DB2
DB2 SQL PL stored procedures and compound statements.
27 tracked issues
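The Power BI card above is easier to see with a concrete string. In Power Query M, `#(lf)` is the escape sequence for a line feed. If a native query is read out of the M expression without decoding that escape, the whole statement is effectively one line, so a `--` comment runs to the end of the string and hides every clause after it. A minimal Python sketch of the effect, assuming a hypothetical query and hypothetical helper names (this is an illustration, not the sidecar's code):

```python
import re

# Hypothetical SQL as it might appear inside an M expression: line breaks are
# stored as the M escape sequence #(lf) rather than as real newlines.
raw = (
    "SELECT o.id, c.name#(lf)"
    "FROM orders o -- main table#(lf)"
    "JOIN customers c ON c.id = o.customer_id#(lf)"
    "WHERE o.total > 100"
)

def decode_m_escapes(text: str) -> str:
    """Decode the common M character escapes into real characters."""
    replacements = {"#(cr,lf)": "\r\n", "#(lf)": "\n", "#(cr)": "\r", "#(tab)": "\t"}
    for escape, char in replacements.items():
        text = text.replace(escape, char)
    return text

def strip_line_comments(sql: str) -> str:
    """Remove -- comments up to the end of each line (ignores string literals)."""
    return re.sub(r"--[^\n]*", "", sql)

# Without decoding, the comment swallows the JOIN and the WHERE clause.
without_decoding = strip_line_comments(raw)

# With decoding, the comment ends at the real newline and both clauses survive.
with_decoding = strip_line_comments(decode_m_escapes(raw))
```

The comment stripper here is deliberately naive (it does not skip `--` inside string literals); it only demonstrates why decoding has to happen before any SQL parsing.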
3 steps. 60 seconds.
Install
One pip command. No Docker, no infra changes, no DataHub plugins.
pip install git+https://github.com/gudusoftware/gsp-datahub-sidecar.git
Detect & Re-parse
The sidecar identifies every SQL statement where sqlglot fell back to a Command node, then re-parses with the GSP engine.
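The routing described in this step can be sketched in a few lines of Python. Everything here is an assumption made for illustration: `ParseResult`, `reparse_fallbacks`, and the two toy parsers are hypothetical stand-ins, not the sidecar's or sqlglot's actual API. The point is only the control flow: statements the primary parser handles are left alone, and only those that came back as an opaque Command node go to the second parser.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ParseResult:
    sql: str
    node_type: str   # e.g. "Select", or "Command" for the opaque fallback
    parsed_by: str   # which parser produced this result

Parser = Callable[[str], ParseResult]

def reparse_fallbacks(
    statements: List[str], primary: Parser, secondary: Parser
) -> List[ParseResult]:
    """Run the primary parser; re-parse only the Command fallbacks."""
    results = []
    for sql in statements:
        first = primary(sql)
        if first.node_type == "Command":
            results.append(secondary(sql))
        else:
            results.append(first)
    return results

# Toy stand-ins so the sketch runs; real parsers would replace these.
def toy_primary(sql: str) -> ParseResult:
    procedural = sql.lstrip().upper().startswith(("DECLARE", "BEGIN"))
    node_type = "Command" if procedural else "Select"
    return ParseResult(sql, node_type, "primary")

def toy_secondary(sql: str) -> ParseResult:
    return ParseResult(sql, "Procedure", "secondary")

results = reparse_fallbacks(
    ["SELECT a FROM t", "DECLARE x INT64; SELECT a FROM t"],
    toy_primary,
    toy_secondary,
)
```

Injecting both parsers as callables keeps the routing independent of any one engine, which matches the idea that the primary parser keeps handling standard SQL unchanged.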
Emit to DataHub
Column-level lineage is emitted back into DataHub via the GMS API. Your lineage graph is complete — no fork, no redeploy.
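To make "column-level lineage" concrete: DataHub identifies datasets and columns by URN, and a column-level relationship links upstream column URNs to downstream ones. The sketch below builds those URNs with plain string formatting. The table and column names are hypothetical, and the returned dictionary is a simplified illustration, not the exact payload the sidecar sends to GMS.

```python
def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a DataHub dataset URN for a table on a given platform."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"

def schema_field_urn(dataset: str, column: str) -> str:
    """Build a DataHub schemaField URN for one column of a dataset."""
    return f"urn:li:schemaField:({dataset},{column})"

def column_edge(
    platform: str, src_table: str, src_col: str, dst_table: str, dst_col: str
) -> dict:
    """Describe one column-to-column relationship (simplified, illustrative)."""
    return {
        "upstreams": [schema_field_urn(dataset_urn(platform, src_table), src_col)],
        "downstreams": [schema_field_urn(dataset_urn(platform, dst_table), dst_col)],
    }

# Hypothetical example: reports.daily.revenue is derived from sales.orders.total.
edge = column_edge(
    "bigquery", "proj.sales.orders", "total", "proj.reports.daily", "revenue"
)
```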
gsp-sidecar emit --gms-url http://localhost:8080
Three backends. Pick your comfort level.
Every backend uses the same GSP SQLFlow engine. The only difference is where SQL gets parsed.
Anonymous
- Cloud-parsed, not logged
- Rate-limited (fair use)
- Great for evaluation
Authenticated
- Personal API key
- Higher per-minute quota
- Usage dashboard
- Priority processing
Self-Hosted
- SQL never leaves your network
- No rate limits
- Full audit trail
- Enterprise support
Common questions.
Which SQL databases are supported?
BigQuery is live today with a full deep-dive page. Snowflake, Databricks, SQL Server, Oracle, Hive, Spark SQL, and DB2 are coming soon. The underlying GSP SQLFlow engine already parses all 8 dialects (and 20+ more) — the sidecar integration is what we're shipping per dialect. Email us if you need a specific dialect prioritized.
Does this replace DataHub's lineage parser?
No. The sidecar complements DataHub's existing parser. DataHub still runs sqlglot for standard SQL — it handles straightforward queries well. The sidecar only re-parses statements where sqlglot falls back to a Command node, recovering the lineage that would otherwise be silently lost.
How is the sidecar different from what DataHub already does?
DataHub's native SQL parser (sqlglot) handles standard SELECT/INSERT/UPDATE queries. But procedural SQL — stored procedures, DECLARE blocks, TRY/CATCH, dynamic SQL, temp tables — produces an opaque Command node with zero lineage. The sidecar detects those gaps and re-parses with the GSP engine, which fully understands procedural constructs across all supported dialects.
Is my SQL sent to a third party?
Depends on your backend. Anonymous and Authenticated modes parse SQL via Gudu Software's cloud API (processed in memory, never logged or stored). Self-hosted mode runs the GSP engine on your infrastructure — SQL never leaves your network. Choose Self-hosted for regulated or sensitive environments.
What's the licensing model?
The sidecar tool itself is open source (Apache 2.0). The Anonymous backend is free with fair-use rate limits. The Authenticated backend provides higher limits with a personal API key. Self-hosted deployments require a commercial SQLFlow license — contact us for pricing.
When will Snowflake / Databricks / other dialects be available?
We're shipping dialect-specific sidecar integrations based on community demand. Snowflake and Databricks are the highest priority (Tier 1). If you need a specific dialect, email support@gudusoft.com — every email directly influences our roadmap sequencing.
Install in 60 seconds. See what you've been missing.
One command. Every missing column-level relationship back in DataHub.
Open source on GitHub · Apache 2.0 license · Read the deep dive blog post