## Why We Tried Using the Glossary as an Ontology

The core idea of DataNexus is simple: if you define relationships between business terms as a graph, the NL2SQL engine can reference that graph to convert natural language to SQL. That graph is the ontology.

“Ontology” sounds like something from an academic paper, but there’s nothing fancy about it. “Net revenue is a type of revenue (IsA).” “Revenue includes gross revenue, returns, and discounts (HasA).” It’s just business knowledge from people’s heads, written down in a format machines can read.

The question was where to store it. Spinning up a dedicated ontology system adds another management point. We were already using DataHub as the metadata platform, and it came with a Business Glossary. Term registration, relationship configuration – it’s all there. In the previous post, I’d assessed that the Glossary’s 4 relationship types (IsA, HasA, RelatedTo, Values) were sufficient to express business term hierarchies. Eliminating one system was appealing. Programmatic access via the GraphQL API was there, and term changes automatically trigger Kafka MCL (Metadata Change Log) events. A reasonable starting point.

## Started Putting Terms In

The first thing after setting up DataHub was registering Glossary Terms. “Net Revenue IsA Revenue,” “Revenue HasA Gross Revenue, Returns, Discounts.” We entered terms and set up relationships like this. Basic hierarchies went in cleanly: Revenue -> Gross Revenue, Net Revenue -> Actual Revenue.

Then the problems started.

## The Limits of 4 Relationship Types: “Factory and Product”

We hit a wall when modeling actual business data. Consider “Product B manufactured at Factory A” and “Product B in stock at Factory A.” Both are relationships between factory and product, but one is manufacturing (Manufactures) and the other is inventory (Stocks) – completely different meanings. How do you express this in DataHub’s Glossary? Both become RelatedTo.
You get two “Factory RelatedTo Product” entries, with no way to tell which is manufacturing and which is inventory.

Why this is fatal: DataNexus’s NL2SQL engine builds SQL by looking at the ontology. When a question like “Show me products manufactured at Factory A” comes in, the engine looks up the factory-product relationship to determine the relevant tables and JOIN paths.

```
User question: “What products are manufactured at Factory A?”
-> Ontology lookup: Factory -> RelatedTo -> Product (Manufacturing? Inventory? Unknown)
-> LLM might JOIN the inventory table instead of the production table
-> Wrong results returned
```

With only RelatedTo as the relationship type, the engine has no basis for judgment. A wrong JOIN means wrong data delivered to the user.

## Adding Types Requires Redeployment

Can’t you just add more granular relationships to DataHub? No. You’d need to define a new Aspect in PDL (Pegasus Data Language), declare the relationship type with @Relationship annotations, then build and redeploy DataHub. This cycle repeats for every new relationship type.

In practice, business modeling means relationships keep multiplying: “Supplies,” “Inspects,” “Returns”… business context can require dozens. Editing code and redeploying for each one isn’t realistic.

## Digging Deeper, More Issues Surfaced

Relationship types weren’t the only problem.

### Synonym Conflicts

I registered “Net Revenue” (순매출) and “Actual Revenue” (실매출) as synonyms – same concept, different names. But both terms had “Net Sales” as an English synonym. One English name mapped to two Korean terms, and DataHub just lets this pass. No warnings.

In NL2SQL, if synonym mappings get tangled, the engine references the wrong term. Once you’re past hundreds of terms, catching these conflicts by eye is impossible. You have to build custom validation logic separately.

### Visualization

DataHub’s UI is designed for data lineage exploration: a directional tree showing data flowing from Table A to Table B.
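Back to those synonym conflicts: the check DataHub doesn’t do for you is small enough to sketch in pure Python. The input shape here is a hypothetical simplification – in practice you’d fetch terms and their synonyms via DataHub’s GraphQL API:

```python
from collections import defaultdict

def find_synonym_conflicts(terms):
    """Return synonyms that map to more than one glossary term.

    `terms` is a {term_name: [synonyms]} dict — an assumed, simplified
    shape, not DataHub's actual API response format.
    """
    owners = defaultdict(set)
    for term, synonyms in terms.items():
        for syn in synonyms:
            owners[syn.lower()].add(term)  # case-insensitive match
    return {syn: sorted(ts) for syn, ts in owners.items() if len(ts) > 1}

conflicts = find_synonym_conflicts({
    "Net Revenue":    ["Net Sales", "NR"],
    "Actual Revenue": ["Net Sales"],   # same English synonym -> conflict
    "Gross Revenue":  ["Gross Sales"],
})
# conflicts == {"net sales": ["Actual Revenue", "Net Revenue"]}
```

Run as a CI step or a nightly job, this catches the “Net Sales” collision the moment the second term is registered, instead of after a report goes wrong.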
An ontology has a different structure: dozens to hundreds of nodes connected in many-to-many mesh networks. “Product” is connected to “Factory,” “Warehouse,” “Supplier,” “Category” – each with different relationships – and those nodes are interconnected among themselves. DataHub simply doesn’t have a screen for exploring this kind of graph. If you can’t see the big picture of what you’ve built, you can’t manage it.

### Can’t Attach Properties to Relationships

This was the biggest problem. When you set “A RelatedTo B” in DataHub’s Glossary, you can’t add anything more to that relationship. In practice, you often need metadata on the relationship itself.

Confidence is a prime example. An auto-extracted relationship might be 0.7, while an expert-defined one is 0.95 – the NL2SQL engine needs to know this difference. Validity period is similar. When organizational restructuring changes department-to-product mappings, you need to track when that relationship was valid. Without this, you end up querying current data with past organizational structures, which is a classic cause of mismatched report numbers. Cardinality directly affects JOIN strategy.

## Summary: What Works and What Doesn’t

| Works | Doesn’t Work |
| --- | --- |
| Term definitions (name, definition) | Granular relationship types (MANUFACTURES, STOCKS, etc.) |
| Synonym registration (custom fields) | Automatic synonym conflict detection |
| 4 relationship types (IsA, HasA, RelatedTo, Values) | Properties on relationships (confidence, validity period) |
| Programmatic access via GraphQL API | Complex graph exploration UI |
| Kafka MCL event stream | Real-time relationship type extension without redeployment |

As a term dictionary, it’s decent. As an ontology store, it lacked expressiveness.

## Split the Roles: DataHub + DozerDB

Throwing away the Glossary entirely wasn’t the answer. Nothing could replace DataHub as the Source of Truth for term definitions. GraphQL API, Kafka MCL events – rebuilding this infrastructure from scratch in another tool would be a waste of time.
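Keeping a second store in step with DataHub hinges on those Kafka MCL events. The core of the consumer is a transform from a term-change event to a graph upsert; here is a pure-Python sketch of that step, where the event shape is a hypothetical simplification of DataHub’s real MCL payload:

```python
def mcl_to_cypher(event):
    """Map a (simplified) Metadata Change Log event to a Cypher upsert.

    Returns a (query, params) pair for a graph-DB driver, or None for
    events about other entity types. The event fields used here are an
    assumed simplification, not DataHub's exact MCL schema.
    """
    if event.get("entityType") != "glossaryTerm":
        return None
    aspect = event["aspect"]
    # MERGE makes the sync idempotent: re-delivered events are harmless
    query = (
        "MERGE (t:Entity {urn: $urn}) "
        "SET t.name = $name, t.definition = $definition"
    )
    params = {
        "urn": event["entityUrn"],
        "name": aspect["name"],
        "definition": aspect.get("definition", ""),
    }
    return query, params

query, params = mcl_to_cypher({
    "entityType": "glossaryTerm",
    "entityUrn": "urn:li:glossaryTerm:net-revenue",
    "aspect": {"name": "Net Revenue",
               "definition": "Gross revenue minus returns and discounts"},
})
```

A Kafka consumer would call this per message and hand the result to the graph driver; using `MERGE` keyed on the URN keeps replays and out-of-order redeliveries from creating duplicate nodes.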
Each got assigned what it does best:

- DataHub Glossary -> Source of Truth for term definitions and basic relationships
- DozerDB -> granular relationships, property-annotated edges, and graph reasoning

We chose DozerDB because it supports Cypher queries. Properties can be freely attached to relationships (edges), and adding new relationship types doesn’t require schema changes or redeployment.

The sync flow is straightforward. When a Glossary Term changes in DataHub, a Kafka MCL event is emitted. A consumer subscribing to the event reflects it in DozerDB’s ontology graph. Basic information like names and definitions stays with DataHub; DozerDB adds granular relationships and properties on top.

## Relationship Definitions in DozerDB

Here’s how the “factory-product” problem from earlier resolves in DozerDB:

```cypher
// Entity creation (terms synced from DataHub)
CREATE (factory:Entity {name: 'Factory A', type: 'Factory'})
CREATE (product:Entity {name: 'Product B', type: 'Product'})

// Manufacturing relationship — start date and confidence as properties
CREATE (factory)-[:MANUFACTURES {
  since: '2024-01-01',
  confidence: 0.95
}]->(product)

// Inventory relationship — separate edge, quantity and update timestamp
CREATE (factory)-[:STOCKS {
  quantity: 500,
  last_updated: '2026-02-01'
}]->(product)
```

MANUFACTURES and STOCKS are separate relationship types. When the question “What products are manufactured at Factory A?” comes in, the engine finds MANUFACTURES and correctly JOINs to the production table. Fundamentally different from lumping everything under a single RelatedTo.

## Derived Metrics Go in the Graph Too

Managing derived metric definitions in Excel means that when a source term changes, the derived spreadsheet doesn’t follow. “Net Revenue = Gross Revenue - Returns - Discounts” – if the gross revenue definition changes but the net revenue side stays the same, that inconsistency propagates up to reports.
This time, we used CALCULATED_FROM relationships to put the formulas directly in the graph:

```cypher
// Expressing net revenue's calculation structure as relationships.
// The operator property says how each source term enters the formula:
// gross revenue is the additive base; returns and discounts are subtracted.
MATCH (net:Entity {name: 'Net Revenue'})
MATCH (gross:Entity {name: 'Gross Revenue'})
MATCH (returns:Entity {name: 'Returns'})
MATCH (discounts:Entity {name: 'Discounts'})
CREATE (net)-[:CALCULATED_FROM {operator: 'add'}]->(gross)
CREATE (net)-[:CALCULATED_FROM {operator: 'subtract'}]->(returns)
CREATE (net)-[:CALCULATED_FROM {operator: 'subtract'}]->(discounts)
```

When the formula changes, you update the relationship, and the graph DB tracks the change history. Better than having it buried in some Excel sheet where nobody knows who changed what or when.

## Doesn’t This Break the Source of Truth?

One thing to address here. In Post 1, I called the metadata catalog the “Source of Truth for the ontology.” In Post 2, I also said the Glossary’s 4 relationship types were sufficient. But now, in Post 3, I’ve added DozerDB. “So now there are two Sources of Truth?” A fair question.

The subject of the SoT changed. SoT doesn’t mean “put everything in one system.” It means “for a specific data category, which system has the final authority.” DataHub and DozerDB answer different questions.

- “What is net revenue?” -> DataHub answers. Name, definition, synonyms, owning department. DataHub has final authority on everything about the term’s identity.
- “How is net revenue connected to which tables, through which paths?” -> DozerDB answers. Granular relationships like CALCULATED_FROM and MANUFACTURES, confidence, validity periods. DozerDB owns the semantics of inter-term connections.

If there’s a conflict between the two? DataHub wins. If a node’s name or definition in DozerDB differs from the DataHub Glossary, DataHub is correct. Kafka MCL events flow in one direction only – DataHub to DozerDB. There is no reverse sync.

I should have made this distinction clear from the start.
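That precedence rule is mechanical enough to sketch. In a pure-Python illustration (field names assumed for this example, not an actual DataNexus schema), identity fields always come from the DataHub record, while DozerDB contributes only the relationship data:

```python
def merge_term(datahub_record, dozerdb_node):
    """Resolve one term's combined view across both stores.

    DataHub wins on identity (name, definition, synonyms); DozerDB
    contributes only typed, property-rich relationships. Field names
    here are assumed for illustration.
    """
    return {
        # Identity: always DataHub, even if the DozerDB copy has drifted
        "name": datahub_record["name"],
        "definition": datahub_record["definition"],
        "synonyms": datahub_record.get("synonyms", []),
        # Semantics: only DozerDB knows about the graph edges
        "relationships": dozerdb_node.get("relationships", []),
    }

merged = merge_term(
    {"name": "Net Revenue",
     "definition": "Gross revenue minus returns and discounts"},
    {"name": "Net Rev (stale)",   # drifted copy -> ignored
     "relationships": [{"type": "CALCULATED_FROM",
                        "target": "Gross Revenue"}]},
)
```

A periodic reconciliation job can run the same rule in bulk: any DozerDB node whose identity fields disagree with DataHub gets overwritten, never the other way around.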
When I wrote “Source of Truth for the ontology” in Post 1, what I really meant was “Source of Truth for term definitions.” The initial assumption that a single system could handle both term definitions and relationship semantics was wrong. Post 3 exposed that, and the DataHub + DozerDB dual structure is the result. The SoT didn’t break. Its scope narrowed.

## Remaining Issue: Standards Compatibility

DataHub’s Glossary model is a DataHub-specific structure. The industry has standard ontologies like FIBO (finance) and Schema.org (general-purpose). To import industry standards or export the DataNexus ontology, you need standard format support. Right now, it’s a proprietary system that only works inside DataNexus. An ontology that can’t be exchanged externally ends up being used only internally and then abandoned.

The next post covers why we added a SKOS compatibility layer.

---

Documenting the process of designing and building DataNexus. GitHub | LinkedIn