GEO Optimization Guide — Full Series

  1. What Is GEO - AI Citation Strategy Beyond SEO
  2. Each AI Cites Different Sources
  3. On-Site GEO Technical Architecture - From Product DB to JSON-LD ← current article
  4. Off-Site GEO - How to Win Over AI That Ignores Your Official Site
  5. AEO - Why Coding Agents Read Documentation Differently

Where Do You Build JSON-LD and Where Does It Go

In the previous article, we confirmed that each AI platform prefers different citation sources. Gemini favors official websites, ChatGPT leans on directories, and Perplexity gravitates toward community discussions. One thing they share: pages with structured data get cited more often across all platforms.

So the technical core of On-Site GEO boils down to one question: how do you transform product master DB data into JSON-LD and inject it into the HTML <head>?

It sounds simple, but once you dig in, the tangles pile up fast. Product DB field names are cryptic abbreviations. The attributes AI needs do not exist in the DB. Sites built as SPAs cannot serve JSON-LD to crawlers. This article covers how to solve these problems with a structured approach.

The Concentric Architecture of a GEO System

A GEO system expands outward through four layers.

Layer     Components                      Role
Core      Product Master DB               SSOT (Single Source of Truth). The origin of all data
Channel   Website / Mobile App            JSON-LD injection, SSR rendering
API       Product Query API               Interface for AI agent integration
Agent     ChatGPT / Gemini / Perplexity   End consumer touchpoint

Data flows from Core through Channel to Agent. Its shape changes at each layer. Raw DB fields become structured JSON-LD, which becomes the citation source in AI answers.

The API layer is easy to overlook. You might think just embedding JSON-LD is enough, but once you consider AI agent integrations like ChatGPT Plugins or MCP (Model Context Protocol), a separate API layer becomes necessary. Even if you do not need it right now, accounting for it in the design phase saves pain later.
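As a sketch of how thin that API layer can start out: a product-query function that returns the same Schema.org-shaped object the Channel layer embeds, so both layers draw from one source of truth. The function name, the in-memory catalog, and the data are illustrative, not from the article; in practice this would wrap the product master DB and sit behind an HTTP route or an MCP tool.

```typescript
// Illustrative sketch of the API layer: a product-query function an HTTP
// route (or MCP tool) could wrap. Names and sample data are hypothetical.
type SchemaProduct = {
  "@context": "https://schema.org";
  "@type": "Product";
  name: string;
  gtin13: string;
};

// Stand-in for the product master DB (the Core layer).
const catalog = new Map<string, { name: string; gtin: string }>([
  ["12345", { name: "Choco Stick Original", gtin: "8801234567890" }],
]);

function getProduct(id: string): SchemaProduct | null {
  const row = catalog.get(id);
  if (!row) return null;
  // Same shape the Channel layer injects as JSON-LD, so agents and
  // crawlers see consistent data regardless of entry point.
  return {
    "@context": "https://schema.org",
    "@type": "Product",
    name: row.name,
    gtin13: row.gtin,
  };
}
```

Because the API returns the Schema.org shape directly, adding an agent integration later is mostly transport plumbing, not a new data model.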

The 3-Stage Data Pipeline

Instead of managing product descriptions as monolithic blobs, decompose them into individual fields. AI cites more accurately when data is field-level structured. That is the core idea behind this pipeline.

Stage 1: DB Refinement - Field Mapping

This stage maps existing product master DB fields to Schema.org fields. You are not creating new data – just organizing what already exists.

DB Field                →  Schema.org Field
───────────────────────────────────────────
PROD_NM                 →  name
BRND_CD (code lookup)   →  brand.name
GTIN_13                 →  gtin13
PRC_AMT                 →  offers.price
STCK_YN                 →  offers.availability
IMG_URL                 →  image
CTG_NM                  →  category

The field count runs around 15-18 depending on the industry. Since most values already exist in the DB, development effort is modest. The catch is converting code values to human-readable text. You need to transform BRND_CD = P1042 into brand.name = "FoodCo" for AI to understand it.

The biggest stumbling block at this stage is GTIN. It is a GS1 standard identifier, and different variants of the same product (size, flavor) need different GTINs. If you lump “Choco Stick Original” and “Choco Stick Almond” under one master code, AI cannot tell them apart.
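A minimal sketch of the Stage 1 mapping, including the code-to-text lookup. The DB field names and the P1042 → "FoodCo" example come from the mapping above; the function shape and the lookup table are assumptions.

```typescript
// Stage 1 sketch: map cryptic product-master fields to Schema.org fields.
// Field names follow the mapping table above; the function is illustrative.
type DbRow = {
  PROD_NM: string;
  BRND_CD: string;
  GTIN_13: string;
  PRC_AMT: number;
  STCK_YN: "Y" | "N";
  IMG_URL: string;
  CTG_NM: string;
};

// Code-to-text lookup: AI needs "FoodCo", not "P1042".
const brandLookup: Record<string, string> = { P1042: "FoodCo" };

function mapRowToSchema(row: DbRow) {
  return {
    "@type": "Product",
    name: row.PROD_NM,
    brand: { "@type": "Brand", name: brandLookup[row.BRND_CD] ?? row.BRND_CD },
    gtin13: row.GTIN_13,
    image: row.IMG_URL,
    category: row.CTG_NM,
    offers: {
      "@type": "Offer",
      price: row.PRC_AMT,
      priceCurrency: "KRW",
      // Y/N flag becomes the Schema.org availability URL.
      availability:
        row.STCK_YN === "Y"
          ? "https://schema.org/InStock"
          : "https://schema.org/OutOfStock",
    },
  };
}
```

Note the fallback to the raw code when the lookup misses: better to surface `P1042` in QA than to silently drop the brand.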

Stage 2: LLM Extraction - AI-Generated Attributes

Some attributes AI needs for citation do not exist in the DB. Target users, usage occasions, sentiment keywords. Having humans write these manually becomes impractical when you have thousands of SKUs.

Instead, let an LLM read existing product descriptions, reviews, and category data to extract them automatically.

Source   Field        Description          Example
DB       @type        Schema.org type      Product
DB       name         Product name         Gram 16
DB       gtin13       GS1 identifier       8801056038800
LLM      targetUser   Target user          Students, professionals
LLM      occasion     Usage occasion       Graduation gift, work use
LLM      sentiment    Sentiment keywords   Lightweight, sleek
LLM      nutrition    Nutrition info       Sugar-free
LLM      safety       Safety info          CAS 9002-88-4

LLM extraction fields vary by industry. For food, nutrition facts and ingredients are key. For hotels, amenities and check-in times matter. For chemicals/B2B, it is material properties and certifications.

This stage adds 10-15 fields. Combined with Stage 1, each product ends up with 25-33 structured fields.
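The LLM call itself is vendor-specific, but the part worth pinning down is validating what comes back before it enters the pipeline. A sketch under two assumptions: the LLM is instructed to return a flat JSON object of strings, and the allowlist mirrors the field table above.

```typescript
// Stage 2 sketch: validate LLM-extracted attributes against an allowlist
// before they reach the JSON-LD output. The allowlist mirrors the table
// above; the flat-JSON contract with the LLM is an assumption.
const LLM_FIELDS = new Set([
  "targetUser",
  "occasion",
  "sentiment",
  "nutrition",
  "safety",
]);

function parseExtraction(raw: string): Record<string, string> {
  const parsed = JSON.parse(raw) as Record<string, unknown>;
  const out: Record<string, string> = {};
  for (const [key, value] of Object.entries(parsed)) {
    // Drop unknown keys and non-string values: hallucinated fields
    // must not leak into the structured data.
    if (LLM_FIELDS.has(key) && typeof value === "string") out[key] = value;
  }
  return out;
}
```

The allowlist is the safety net: whatever the model invents, only schema-known fields survive.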

Stage 3: JSON-LD Output - Automated Conversion and SSR Deployment

Fields from Stages 1 and 2 are converted into Schema.org-compliant JSON-LD and automatically injected into the HTML <head> via SSR.

{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Choco Stick Original",
  "gtin13": "8801234567890",
  "brand": {
    "@type": "Brand",
    "name": "FoodCo"
  },
  "description": "Chocolate-coated crispy stick snack. 200kcal per 46g serving.",
  "offers": {
    "@type": "Offer",
    "price": 1500,
    "priceCurrency": "KRW",
    "availability": "https://schema.org/InStock"
  },
  "nutrition": {
    "@type": "NutritionInformation",
    "calories": "200 calories",
    "servingSize": "1 pack (46g)"
  }
}

Once this JSON-LD sits inside the <head> tag, the Invisible GEO discussed in Part 1 is complete. Invisible to users, but parsed directly by AI and search engines.

See how the before and after compare for a food product in the demo.

Four Principles for Writing Descriptions

The most human-dependent part of the pipeline is product descriptions. Descriptions that AI cites well follow a pattern.

Fact-based - Include only objective information. AI ignores advertising copy like “industry-leading” or “customer satisfaction #1.”

100-300 characters - The sweet spot for AI reference. Too short lacks context; too long buries the key points.

Natural keywords - As confirmed by Princeton/Georgia Tech research, keyword stuffing actually decreases AI visibility. Weave keywords into natural sentences.

Unique per SKU - Copy-pasting the same template with only the product name swapped gets flagged as duplicate content by AI. Each product needs its own description.

<!-- Bad: vacuous boilerplate, no facts for AI to cite -->
<meta name="description" content="About Us"/>

<!-- Good: fact-based, natural language, proper length -->
<meta name="description"
  content="ChemCo is a global petrochemical company
  supplying PE/PP products to 50 countries with
  annual revenue of $11B. Leading ESG management
  and carbon neutrality by 2050."/>
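All four principles can be enforced mechanically at build time rather than by reviewer discipline. A sketch: the 100-300 character window comes from the guidelines above, while the banned-phrase list and function shape are illustrative.

```typescript
// Sketch of a build-time check for the four description principles.
// The 100-300 character window is from the guidelines above; the
// ad-copy phrase list is an illustrative starting point.
const AD_COPY = ["industry-leading", "customer satisfaction #1", "best in class"];

function checkDescription(desc: string, seen: Set<string>): string[] {
  const problems: string[] = [];
  // Principle: 100-300 characters.
  if (desc.length < 100 || desc.length > 300)
    problems.push("length outside 100-300 characters");
  // Principle: fact-based, no advertising copy.
  for (const phrase of AD_COPY)
    if (desc.toLowerCase().includes(phrase)) problems.push(`ad copy: "${phrase}"`);
  // Principle: unique per SKU.
  if (seen.has(desc)) problems.push("duplicate of another SKU's description");
  seen.add(desc);
  return problems;
}
```

Run it over the full catalog in CI and the duplicate check catches copy-pasted templates before AI does.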

Why SSR Is Non-Negotiable

Even if you build the JSON-LD perfectly, it is useless if AI crawlers cannot read it. This is where SPAs (Single Page Applications) become a bottleneck.

SPAs require JavaScript execution in the browser to render content. That looks fine to humans, but AI crawlers like GPTBot and Google-Extended mostly do not execute JS. If the server responds with an empty HTML shell, the JSON-LD your client-side code would eventually inject into the <head> never reaches the crawler.

Switching to SSR (Server-Side Rendering) means the server sends fully rendered HTML, so crawlers can read JSON-LD immediately without JS execution.

Here is how it looks with the Next.js App Router:

// app/product/[id]/page.tsx
// fetchProduct and ProductDetail are assumed to exist in your codebase;
// the import paths are illustrative.
import { fetchProduct } from "@/lib/products";
import ProductDetail from "@/components/ProductDetail";

export default async function ProductPage({
  params,
}: {
  params: { id: string };
}) {
  const product = await fetchProduct(params.id);

  // Built on the server, so it is present in the HTML the crawler receives.
  const jsonLd = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": product.name,
    "gtin13": product.gtin,
    "brand": { "@type": "Brand", "name": product.brand },
    "description": product.description,
    "image": product.imageUrl,
    "offers": {
      "@type": "Offer",
      "price": product.price,
      "priceCurrency": "KRW",
      "availability": "https://schema.org/InStock",
      "url": product.pageUrl
    }
  };

  return (
    <>
      <script
        type="application/ld+json"
        dangerouslySetInnerHTML={{
          __html: JSON.stringify(jsonLd)
        }}
      />
      <ProductDetail product={product} />
    </>
  );
}

The server fetches DB data via fetchProduct, builds the JSON-LD object, and injects it as a <script> tag. This HTML reaches crawlers as-is.

If SSR adoption feels too heavy, Google Tag Manager (GTM) can inject JSON-LD as a transitional approach. Less effective than full SSR, but viable when you cannot convert an SPA right away.

SSR Trade-offs

Aspect              Advantage                      Disadvantage               Mitigation
SEO optimization    Crawlers read without JS       Initial dev cost           SDK/shared module
Data reflection     Auto-updates on DB changes     Increased server load      Redis caching + ISR
Central management  Site-wide uniform deployment   Dev team dependency        Admin console for non-devs
Validation          Build-time schema validation   Legacy system migration    GTM hybrid fallback

Server load is largely mitigated by Redis caching and ISR (Incremental Static Regeneration). As long as product data has not changed, cached HTML is served directly.
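The caching idea can be sketched independently of Redis with an in-memory TTL cache in front of the DB query. A sketch, with the Map standing in for a Redis client (in production, `cache.get`/`cache.set` would be Redis calls, and Next.js ISR handles the HTML layer on top):

```typescript
// Sketch of the caching layer: a TTL cache in front of the DB query.
// The in-memory Map is a stand-in for Redis; names are illustrative.
type Entry<T> = { value: T; expires: number };
const cache = new Map<string, Entry<unknown>>();

async function cached<T>(
  key: string,
  ttlMs: number,
  fetcher: () => Promise<T>
): Promise<T> {
  const hit = cache.get(key) as Entry<T> | undefined;
  if (hit && hit.expires > Date.now()) return hit.value; // serve cached value
  const value = await fetcher(); // DB is hit only on a miss or expiry
  cache.set(key, { value, expires: Date.now() + ttlMs });
  return value;
}
```

With this in place, repeated crawler hits on the same product page resolve from cache, and the DB sees at most one query per TTL window per key.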

Data Freshness Drives Citations

Even well-structured data gets deprioritized when stale.

An analysis of pages with high Perplexity citation rates found that over three-quarters had been updated within the past month. ChatGPT Shopping refreshes feeds every 15 minutes (OpenAI). Pages untouched for over three months are likely to drop in AI citation rankings.

Freshness management guidelines:

  • Critical data (price, inventory, promotions): refresh within 24 hours
  • General data (descriptions, images): refresh within 7 days
  • Static data (brand info, company overview): monthly review

Keeping lastmod dates in sitemap.xml aligned with actual update timestamps, and using the IndexNow API to notify search engines of changes immediately, also makes a difference.

// next-sitemap.config.js
module.exports = {
  siteUrl: 'https://www.example.com',
  generateRobotsTxt: true,
  changefreq: 'daily',
  transform: async (config, path) => ({
    loc: path,
    changefreq: path.includes('/product/') ? 'daily' : 'weekly',
    priority: path.includes('/product/') ? 0.9 : 0.5,
    // In production, pull this from the page's real updated-at timestamp;
    // stamping every page with the build time misrepresents freshness.
    lastmod: new Date().toISOString(),
  }),
};
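For the IndexNow side, the protocol is a single JSON POST to api.indexnow.org carrying the host, your key, and the changed URLs. A sketch of building and sending that payload; the key value is hypothetical, and real keys must be verified via a key file hosted on your domain.

```typescript
// Sketch of an IndexNow notification. Per the IndexNow protocol, the
// payload is host + key + (optional) keyLocation + urlList; the key
// shown here is hypothetical and must match a key file on your domain.
function buildIndexNowPayload(host: string, key: string, urls: string[]) {
  return {
    host,
    key,
    keyLocation: `https://${host}/${key}.txt`,
    urlList: urls,
  };
}

async function notifyIndexNow(host: string, key: string, urls: string[]) {
  // One ping is enough: participating search engines share submissions.
  await fetch("https://api.indexnow.org/indexnow", {
    method: "POST",
    headers: { "Content-Type": "application/json; charset=utf-8" },
    body: JSON.stringify(buildIndexNowPayload(host, key, urls)),
  });
}
```

Call `notifyIndexNow` from the same job that updates product data, so the ping and the lastmod change stay in sync.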

Validation - If You Added It, Verify It

Inserting JSON-LD is not the finish line. You need to confirm crawlers can actually read it.

Google Rich Results Test - Enter your URL at search.google.com/test/rich-results to instantly check whether structured data is being recognized.

Crawler simulation with curl - Send requests with AI crawler User-Agents to verify JSON-LD is included in the HTML response.

# Request as GPTBot
curl -s -A "GPTBot" https://www.example.com/product/12345 | grep "application/ld+json"

# Extract JSON-LD from the HTML source (works when the script tag is on one line)
curl -s https://www.example.com/product/12345 \
  | grep -oP '<script type="application/ld\+json">.*?</script>'

If your site is still an SPA without SSR, curl results will likely show no JSON-LD. That is exactly why SSR is non-negotiable.

Try building JSON-LD yourself with the interactive builder to get a hands-on feel for the structure.

Common issues encountered in practice:

Symptom                     Cause                       Fix
JSON-LD not crawled         robots.txt blocking         Set GPTBot, Google-Extended to Allow
AI not citing data          Schema.org type error       Validate with Rich Results Test
Slow API response           No caching                  Apply Redis caching + minimize fields
Server overload after SSR   DB query on every request   ISR + Redis caching
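For the robots.txt fix in the first row, an explicit allow looks like this. Adjust to your own crawl policy (blocking AI crawlers is a legitimate business choice, but it forfeits citations); the PerplexityBot entry is an illustrative addition beyond the two crawlers named in the table.

```text
# robots.txt — allow AI crawlers to read product pages
User-agent: GPTBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /
```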