## Too Many Candidates

There were too many options. For the metadata catalog alone: DataHub, Amundsen, Apache Atlas, OpenMetadata. Add commercial options and you get Collibra and Alation. Then there are three more axes to fill – NL2SQL engine, document knowledge engine, graph DB – and the number of combinations explodes.

I made a comparison spreadsheet: rows for candidate tools, columns for evaluation criteria. Three weeks in, the spreadsheet had grown to 7 tabs. When you have too many choices, the problem is not choosing. Pick one and the combinations with the rest shift, forcing you to compare from scratch.

## Four Components, Each With Its Own Requirements

In the previous post, I defined DataNexus's four components: metadata catalog, NL2SQL engine, document knowledge engine, and graph DB.

Three non-negotiable common criteria:

- It must be open source.
- It must support or enable multi-tenancy – data isolation per group company is mandatory.
- It must be production-ready – community activity, release cadence, and documentation quality all mattered.

Each component had additional requirements. The metadata catalog needed the ability to define relationships between terms in the Business Glossary and to emit change events in real time. The NL2SQL engine required per-user context isolation and Row-level Security. The document knowledge engine couldn't rely on vector search alone – it needed hybrid graph search. The graph DB required Multi-DB support and Cypher support as prerequisites.

With these criteria in hand, I filtered the candidates.

## Metadata Catalog

DataHub, OpenMetadata, Amundsen, Apache Atlas, and commercial (Collibra/Alation) – five options on the table.

Commercial was eliminated first. Licensing costs aside, what this project needed was to use the catalog's Glossary as an ontology store. Commercial Glossary features are powerful, but they are limited when it comes to accessing the internal data model for customization.
Apache Atlas is tied to the Hadoop ecosystem: you need to spin up HBase, Solr, and Kafka. It's a 2016-era design that's too heavy for cloud-native environments.

Amundsen is decent as a search-focused catalog, but its ability to define relationships between Glossary terms is lacking. I couldn't use it as an ontology store.

OpenMetadata was the one I deliberated on the longest. Clean architecture, built-in data quality measurement – excellent as a standalone catalog. The issue was that its Glossary relationships are primarily Parent-Child and RelatedTerms. That's not enough for an ontology representation where you need to clearly distinguish inheritance (IsA) from containment (HasA). Real-time event sync was also webhook-based, which is less reliable than Kafka-native delivery for large-scale streaming.

I went with DataHub. Its Glossary relationships come in four types: IsA (inheritance), HasA (containment), Values (value lists), and RelatedTo (general association). These four are enough to express hierarchies between business terms – "Net Revenue IsA Revenue", "Revenue HasA Gross Revenue, Returns, Discounts", that kind of thing.

The GraphQL API also played a role. You need to read and write metadata programmatically to auto-sync the ontology to the NL2SQL engine's RAG Store, and with GraphQL you can pick exactly the fields you need.

The biggest factor was Kafka MCL events. DataHub exports Metadata Change Logs to Kafka, so when a Glossary Term changes, an event is published. Subscribe to that topic and you can sync the graph DB ontology in real time. Manually propagating metadata changes will inevitably leave gaps as scale grows. This was a non-negotiable requirement.

## NL2SQL Engine

At first, I considered building it from scratch. I'd already built a conversational BI solution – connecting GPT and Gemini for NL2SQL, optimizing prompt engineering, even designing a multi-agent architecture. Two lessons came out of that.
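As an aside before the lessons: the subscribe-and-sync loop described in the Metadata Catalog section could look roughly like the sketch below. It assumes kafka-python and JSON-decodable payloads – a stock DataHub deployment actually publishes Avro-encoded events through the schema registry, and the topic and field names should be verified against your DataHub version. The fan-out target names are placeholders.

```python
import json
from typing import Optional

def is_glossary_term_change(event: dict) -> bool:
    """Filter MCL events down to Glossary Term upserts/deletes.
    Field names follow DataHub's MetadataChangeLog schema (entityType,
    changeType); verify them against your DataHub version."""
    return (
        event.get("entityType") == "glossaryTerm"
        and event.get("changeType") in {"UPSERT", "DELETE"}
    )

def handle_event(event: dict) -> Optional[str]:
    """Return the URN to re-sync, or None if the event is irrelevant.
    A real handler would update the DozerDB ontology graph and the
    NL2SQL engine's RAG Store here."""
    if is_glossary_term_change(event):
        return event.get("entityUrn")
    return None

def run_consumer(bootstrap: str = "localhost:9092") -> None:
    """Illustrative consumer loop (never called in this sketch).
    Assumes `pip install kafka-python`; a production consumer would
    plug in an Avro deserializer backed by the schema registry."""
    from kafka import KafkaConsumer
    consumer = KafkaConsumer(
        "MetadataChangeLog_Versioned_v1",  # DataHub's versioned MCL topic
        bootstrap_servers=bootstrap,
        group_id="datanexus-ontology-sync",
        value_deserializer=lambda raw: json.loads(raw),
    )
    for msg in consumer:
        urn = handle_event(msg.value)
        if urn:
            print(f"re-syncing {urn}")  # fan out to graph DB + RAG Store
```

The filter in `is_glossary_term_change` is what keeps the fan-out from firing on every dataset or lineage change; wiring `handle_event` to the actual upserts is where the real work lives.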
First, LLMs can't understand business context from DDL alone. Second, building from scratch means auxiliary features balloon endlessly – user auth, query logging, data filtering, response streaming, query learning. The estimate came out to over a month of work.

That's when Vanna hit 2.0.

Version 1.x was simple: inherit a Python class, call train() and ask(). Fine for prototyping, but not production-ready – no per-user context isolation, no security features.

2.0 is a different beast. It switched to an agent-based architecture where you compose independent components, and added a User-Aware structure where the user ID automatically propagates through all components. Row-level Security is supported at the framework level. Tool Memory, which auto-learns from successful queries, is built in. Responses stream with rich UI components (tables, charts) in real time.

User-Aware and Row-level Security were the most important. DataNexus needs to isolate data per group company, and having the NL2SQL engine support this at the framework level means significantly less custom code.

Tool Memory was also significant. One of the most reliable ways to improve NL2SQL accuracy is to accumulate successful queries and reuse them for similar questions – and this is built into the framework. Building it separately means handling query storage, similarity matching, and version management. All of that effort gone.

## Document Knowledge Engine

Vector search alone isn't enough. When searching business reports or internal policy documents, pulling chunks by vector similarity alone breaks context. You want to find "Business Unit A's revenue recognition criteria," but vector search just lists chunks containing "revenue" by similarity score. Graph-structural information – the relationship between Business Unit A and its revenue recognition criteria, or when the criteria changed – doesn't live in vectors.

ApeRAG solves this by combining three types of search. Vector Search handles embedding-based semantic search.
Full-text Search covers cases where the literal string matters, like proper nouns or code names. GraphRAG traverses relationships between entities extracted from documents. All three run simultaneously.

There's a specific reason this hybrid works especially well with DataNexus. If you inject DataHub's Glossary Terms as the taxonomy for ApeRAG's Entity Extraction, entities extracted from documents are automatically linked to business terms. Linking goes through four-stage Entity Resolution: Exact Match, Synonym Match, Fuzzy Match (threshold 0.85), and Context Match.

There's also MinerU integration. Enterprise documents commonly have complex tables, formulas, and multi-column layouts. Standard PDF parsers break table rows and columns – especially in documents like annual reports with lots of merged cells, parsing results are disastrous. MinerU preserves document structure during parsing, directly solving this problem.

## Graph DB

The biggest variable was the Neo4j license. The critical difference between Community Edition and Enterprise Edition is Multi-DB: Community allows one graph per instance, while Enterprise allows multiple databases within the same instance.

Multi-DB is mandatory for DataNexus. We need to isolate ontology graphs per group company – groupA_ontology_db, groupB_ontology_db – separate databases per tenant, with access controlled by user permissions. Shoving everything into a single Community database and distinguishing tenants by labels doesn't make sense from a security standpoint. But we couldn't buy a Neo4j Enterprise license either; that goes against the project's open-source principles.

DozerDB solved this dilemma. It's an open-source plugin that adds Enterprise features, including Multi-DB support, on top of Neo4j Community Edition. You can create per-tenant graphs with CREATE DATABASE, and Cypher queries work as-is.

I also looked at ArangoDB. The multi-model approach (document + graph + key-value) is appealing, but you can't use Cypher.
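For concreteness, the per-tenant provisioning described for DozerDB above can be sketched with the official Neo4j Python driver. The naming helper and the `IF NOT EXISTS` form are assumptions to verify against your DozerDB/Neo4j version – Neo4j restricts database names, so tenant ids get normalized first.

```python
import re
from typing import Tuple

def tenant_db_name(tenant: str) -> str:
    """Normalize a tenant id into a database name.
    Neo4j database names are restricted (lowercase letters, digits,
    dots, dashes; must start with a letter), so normalize aggressively
    and check the exact rules for your DozerDB/Neo4j version."""
    base = re.sub(r"[^a-z0-9]+", "-", tenant.lower()).strip("-")
    return f"{base}-ontology-db"

def create_tenant_database(uri: str, auth: Tuple[str, str], tenant: str) -> None:
    """Create an isolated per-tenant graph (illustrative; needs a
    running DozerDB instance and `pip install neo4j`)."""
    from neo4j import GraphDatabase
    db = tenant_db_name(tenant)
    with GraphDatabase.driver(uri, auth=auth) as driver:
        # Administrative DDL runs against the built-in `system` database.
        with driver.session(database="system") as session:
            session.run(f"CREATE DATABASE `{db}` IF NOT EXISTS")
```

Because DozerDB keeps standard Cypher semantics, everything downstream – ontology upserts, traversals – then runs against `driver.session(database=tenant_db_name(tenant))` with no tenant-specific query logic.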
ArangoDB's own query language, AQL, is fine for graph traversal, but you lose access to the Neo4j ecosystem's libraries and tools. Patterns and references for querying ontologies with Cypher are overwhelmingly more abundant, so I chose ecosystem compatibility.

I'm aware of DozerDB's limitations. Fabric – cross-DB queries – isn't supported yet, so querying across different databases in a single Cypher statement isn't possible. That's deferred to Phase 3; for now, single-tenant queries are sufficient.

## Connecting the Four

Listed like this, they're just four tools. What matters is how they connect. When a Glossary Term changes in DataHub, a Kafka MCL event is published. That event is reflected in real time in the DozerDB ontology graph, and simultaneously in Vanna's RAG Store, so the context injected into NL2SQL prompts refreshes automatically. Since ApeRAG's Entity Extraction references the DataHub Glossary as its taxonomy, document search results are also linked to the latest term system. Fix a term in one place and four places update simultaneously. Manual metadata propagation will inevitably miss something as scale grows.

## Next Post

The limitations and workarounds when using DataHub's Business Glossary as an ontology.

Documenting the process of designing and building DataNexus. GitHub | LinkedIn