Executive Summary

ddigraph is a Python package. It turns DDI (Data Documentation Initiative) metadata into a queryable knowledge graph in Neo4j. The input is static XML. This lets organizations:

  • Understand complex survey instruments through visual exploration
  • Integrate seamlessly with other international standards like SDMX
  • Leverage AI and machine learning for metadata discovery
  • Scale metadata operations beyond manual documentation

The Goal: From Archive to Asset

The Problem We Solve

Survey metadata today lives in XML files that are:

  • Hard to navigate: Deeply nested hierarchies require specialized tools
  • Difficult to integrate: No standardized way to connect to other systems
  • Inaccessible to AI: Nested XML gives ML tools no explicit relationships to work with
  • Slow to analyze: A question like "which variables use this code list?" takes manual work

What We Deliver

ddigraph converts this:

<!-- 148KB of nested XML -->
<Fragment>
  <QuestionItem>
    <QuestionReference>
      <ID>q-001</ID>
    </QuestionReference>
  </QuestionItem>
</Fragment>

Into this:

// Simple, visual query
MATCH (q:QuestionItem)-[:USES_CODELIST]->(cl:CodeList)
WHERE cl.name = 'Age Groups'
RETURN q.question_text
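
As a rough illustration of the conversion step, the sketch below parses a fragment like the one above with Python's standard library and emits node and relationship records. The function name `fragment_to_records`, the `CONTAINS` relationship type, and the record layout are assumptions made for this example; they are not ddigraph's actual internals.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<Fragment>
  <QuestionItem>
    <QuestionReference>
      <ID>q-001</ID>
    </QuestionReference>
  </QuestionItem>
</Fragment>
"""

def fragment_to_records(xml_text):
    """Walk the XML tree and return (nodes, edges) as plain dicts.

    Hypothetical helper: each element becomes a node labelled with its tag,
    and each parent/child nesting becomes a CONTAINS edge.
    """
    root = ET.fromstring(xml_text)
    nodes, edges = [], []

    def visit(element, parent_index):
        index = len(nodes)
        text = (element.text or "").strip()
        nodes.append({"index": index, "label": element.tag, "text": text})
        if parent_index is not None:
            edges.append({"from": parent_index, "to": index, "type": "CONTAINS"})
        for child in element:
            visit(child, index)

    visit(root, None)
    return nodes, edges

nodes, edges = fragment_to_records(SAMPLE)
for node in nodes:
    print(node["index"], node["label"], node["text"])
```

Once the nesting is flattened into records like these, a question such as "which element contains q-001?" becomes a lookup over edges instead of a walk through the XML tree.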

Key Benefits

1. Instant Metadata Discovery

Before: Analysts read XML by hand to find which questions use a code list.

  • Time: 2-4 hours per query
  • Error-prone: Easy to miss references in nested structures
  • Limited: Can only answer questions you explicitly plan for

After: Graph queries return answers in real time.

// Find all questions using "Employment Status" codes
MATCH (q:QuestionItem)-[:USES_CODELIST]->(cl:CodeList {name: 'Employment Status'})
RETURN q.question_text, q.name

  • Time: < 1 second
  • Comprehensive: Returns all relationships automatically
  • Flexible: Can explore unexpected connections

Business Impact: Faster questionnaire reviews, reduced data collection errors, improved survey quality.

2. Survey Instrument Visualization

Before: Survey flow is documented in lengthy Word documents or spreadsheets.

After: Interactive graph visualization shows the complete structure at a glance.

[Entry Point] → [Sequence: Demographics]
    ├→ [Question: Age]
    ├→ [If Age ≥ 18]
    │   └→ [Sequence: Employment]
    │       ├→ [Question: Employment Status]
    │       └→ [Question: Hours Worked]
    └→ [If Age < 18]
        └→ [Sequence: Education]

Business Impact: Stakeholders can review survey logic visually, catching design flaws before field deployment.

3. Impact Analysis and Quality Assurance

Before: To change a code list, you check dozens of XML files by hand to see what depends on it.

After: Graph queries instantly show all affected elements.

// What will be affected if we change this code list?
MATCH (cl:CodeList {name: 'Industry Codes'})
MATCH (cl)<-[:USES_CODELIST]-(q:QuestionItem)
MATCH (q)<-[:ASKS_QUESTION]-(qc:QuestionConstruct)
MATCH (qc)<-[:HAS_CONSTRUCT*]-(seq:Sequence)
RETURN seq.name AS affected_section,
       count(DISTINCT q) AS question_count

Business Impact: Reduced risk of breaking changes, faster survey updates, improved data quality.
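
The variable-length pattern `HAS_CONSTRUCT*` in the query above walks up any number of nesting levels. A minimal in-memory sketch of that upward traversal follows, using a hypothetical child-to-parents map in place of the graph. The construct names are invented for illustration.

```python
from collections import deque

# Hypothetical child -> parents map standing in for HAS_CONSTRUCT edges.
PARENTS = {
    "qc_industry": ["seq_employment"],
    "seq_employment": ["seq_main"],
    "qc_age": ["seq_demographics"],
    "seq_demographics": ["seq_main"],
}

def ancestors(start, parents):
    """Return every node reachable by following parent links one or more times."""
    seen = set()
    queue = deque(parents.get(start, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(parents.get(node, []))
    return seen

affected = ancestors("qc_industry", PARENTS)
print(sorted(affected))
```

The `seen` set plays the same role as `DISTINCT` in Cypher: a section reached through several paths is reported once.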

4. Automated Documentation

Before: Survey documentation is manually written and quickly becomes outdated.

After: Documentation is generated from the graph, always in sync with the actual structure.

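# Generate natural language documentation
# Assumes `session` is an open neo4j driver session and
# `generate_documentation` is your own LLM wrapper (not shown).
def document_survey(session, instrument_id):
    results = session.run("""
        MATCH (i:Instrument {id: $id})-[:HAS_CONSTRUCT]->(seq:Sequence)
        MATCH (seq)-[:HAS_CONSTRUCT]->(c)
        RETURN seq.name AS section, labels(c)[0] AS construct_type,
               count(*) AS count
        ORDER BY section
    """, id=instrument_id)

    # Materialize the rows while the session is open, then feed them to the LLM
    rows = [record.data() for record in results]
    return generate_documentation(rows)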

Output: "The Labour Force Survey consists of 5 sections: Demographics (8 questions), Employment (12 questions)..."

Business Impact: Always-current documentation, reduced manual effort, consistent formatting.


Interoperability with International Standards (SDMX)

The Integration Challenge

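Organizations must report data to international bodies (Eurostat, OECD, UN). They use the SDMX (Statistical Data and Metadata eXchange) format. Mapping DDI survey metadata to SDMX is usually a manual process: variables are exported to a spreadsheet, matched to dimension codes by hand, and tracked in a separate mapping file.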

The Graph Solution

ddigraph makes DDI-SDMX integration explicit and queryable:

// Automatic mapping via shared identifiers
MATCH (v:Variable)
WHERE v.user_id IS NOT NULL
MATCH (d:Dimension {id: v.user_id})
MERGE (v)-[m:MAPS_TO_DIMENSION]->(d)
SET m.match_type = 'user_id',
    m.confidence = 1.0,
    m.mapped_on = datetime()

What This Means:

  1. DDI Variables (from your survey) link directly to SDMX Dimensions (international reporting structure)
  2. The mapping is documented in the graph with metadata about confidence and method
  3. Both directions are queryable:
       • "What SDMX components does this variable map to?"
       • "What DDI variables feed this SDMX dimension?"
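
The matching rule in the query above (a variable's `user_id` equals a dimension's `id`) can be sketched in plain Python. The sample variables and dimensions are invented, and `map_by_user_id` is a hypothetical helper, not part of ddigraph.

```python
def map_by_user_id(variables, dimensions):
    """Pair each variable with the dimension whose id equals its user_id."""
    dimension_ids = {d["id"] for d in dimensions}
    return [
        {"variable": v["name"], "dimension": v["user_id"],
         "match_type": "user_id", "confidence": 1.0}
        for v in variables
        if v.get("user_id") is not None and v["user_id"] in dimension_ids
    ]

variables = [
    {"name": "age_group", "user_id": "AGE"},
    {"name": "sex", "user_id": "SEX"},
    {"name": "interviewer_note", "user_id": None},
]
dimensions = [{"id": "AGE"}, {"id": "SEX"}, {"id": "GEO"}]

mappings = map_by_user_id(variables, dimensions)

# Both directions can be answered from the same mapping records.
by_variable = {m["variable"]: m["dimension"] for m in mappings}
by_dimension = {}
for m in mappings:
    by_dimension.setdefault(m["dimension"], []).append(m["variable"])

print(by_variable)
print(by_dimension)
```

Variables without a `user_id` and dimensions with no matching variable simply stay unmapped, which is what a coverage report later picks up.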

Real-World Example

Scenario: Your Labour Force Survey needs to report to Eurostat using their SDMX template.

Traditional Approach:

  1. Export DDI variables to Excel
  2. Manually match to Eurostat dimension codes
  3. Maintain Excel mapping file
  4. Hope nothing changes

ddigraph Approach:

  1. Load DDI metadata: ddigraph load survey.xml
  2. Load SDMX structure into same graph
  3. Run automated mapping query
  4. Review and validate mappings visually

Business Impact:

  • Faster international reporting setup
  • Little ongoing maintenance: re-running the mapping query refreshes the mappings
  • Full auditability of which variables map where and why
  • Reduced errors from manual transcription

Alignment Reports

Generate comprehensive alignment reports:

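// Show mapping coverage
// Count each variable once, even if it maps to several dimensions
MATCH (v:Variable)
OPTIONAL MATCH (v)-[:MAPS_TO_DIMENSION]->(d:Dimension)
WITH v, count(d) AS dimension_count
WITH count(v) AS total,
     sum(CASE WHEN dimension_count > 0 THEN 1 ELSE 0 END) AS mapped
RETURN total, mapped,
       round(100.0 * mapped / total) AS coverage_pct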

Output: "Coverage: 87% of variables mapped (1,400 of 1,609)"
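
The arithmetic behind that figure can be checked directly. The helper name `coverage_pct` is an assumption for this example, as is the zero-total guard.

```python
def coverage_pct(mapped, total):
    """Percentage of variables with at least one mapping, rounded to a whole number."""
    if total == 0:
        return 0
    return round(100.0 * mapped / total)

print(coverage_pct(1400, 1609))  # 87
```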


AI-Readiness: Unlocking Machine Learning Capabilities

Why Graph Structure Matters for AI

Traditional XML files are not AI-ready:

  • LLMs can't "see" relationships buried in nested tags
  • Retrieval-Augmented Generation (RAG) systems need queryable context
  • Embeddings don't capture structural connections

Graph databases make metadata AI-accessible:

  • Relationships are explicit and traversable
  • Context is preserved in graph structure
  • Pattern matching is native, not reconstructed

Capability 1: Intelligent Survey Chatbots

Use Case: Stakeholders ask questions about survey design in natural language.

User: "What questions come after the employment section?"
Bot: [Queries graph for sequences following employment constructs]
     "After the employment questions, respondents answer 5 questions 
      about job satisfaction, then branch based on employment status..."

How It Works:

  1. User question → LLM converts to graph query
  2. Graph returns structured results
  3. LLM converts results to natural language
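
The three steps above can be sketched as a small function that takes each stage as a callable. All names here are hypothetical, and the `fake_*` stand-ins exist only so the flow can be exercised without an LLM or a database.

```python
def answer(question, to_cypher, run_query, to_prose):
    """Three-step loop: question -> Cypher -> rows -> natural-language answer."""
    cypher = to_cypher(question)
    rows = run_query(cypher)
    return to_prose(question, rows)

# Stand-ins for the LLM and the graph database.
def fake_to_cypher(question):
    return "MATCH (s:Sequence) RETURN s.name AS name"

def fake_run_query(cypher):
    return [{"name": "Job Satisfaction"}]

def fake_to_prose(question, rows):
    names = ", ".join(row["name"] for row in rows)
    return f"Next section(s): {names}"

reply = answer("What comes after employment?",
               fake_to_cypher, fake_run_query, fake_to_prose)
print(reply)
```

Passing the stages in as arguments keeps the loop testable and lets the LLM provider or database client be swapped without touching the control flow.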

Business Impact: Non-technical stakeholders can explore survey structure without training.

Capability 2: Retrieval-Augmented Generation (RAG)

Use Case: Generate survey documentation or answer policy questions grounded in actual metadata.

# RAG pipeline with graph context
# Assumes `session` is an open neo4j driver session. extract_keywords
# (returning a list of strings), format_graph_results, and llm are your
# own helpers (not shown).
def answer_question(session, user_query):
    # 1. Retrieve matching questions and the sequence that contains them
    graph_results = session.run("""
        MATCH (q:QuestionItem)
        WHERE any(kw IN $keywords
                  WHERE toLower(q.question_text) CONTAINS toLower(kw))
        MATCH path = (q)<-[:ASKS_QUESTION]-(:QuestionConstruct)
                        <-[:HAS_CONSTRUCT]-(seq:Sequence)
        RETURN q.question_text, seq.name, path
    """, keywords=extract_keywords(user_query))

    # 2. Feed graph context to LLM
    context = format_graph_results(list(graph_results))
    response = llm.generate(user_query, context=context)

    return response

Example:

  • Query: "How does the survey measure unemployment?"
  • Graph retrieves: Questions about employment status, hours worked, job search
  • LLM generates: "The survey measures unemployment through three key questions in the Employment section..."

Business Impact: Automated response to stakeholder inquiries, consistent with actual survey design.

Capability 3: Quality Validation with AI

Use Case: Automatically detect survey design issues.

// Find questions without response options (potential error)
MATCH (q:QuestionItem)
WHERE NOT (q)-[:USES_CODELIST]->()
  AND q.response_type = 'code'
RETURN q.name, q.question_text AS potentially_missing_codelist

// Find unreachable constructs (dead code in survey)
MATCH (c)
WHERE (c:Sequence OR c:QuestionConstruct)
  AND NOT ()-[:HAS_CONSTRUCT|THEN|ELSE]->(c)
  AND NOT c:EntryPoint
RETURN labels(c)[0] AS type, c.name AS orphaned_construct

AI Enhancement: Train ML models to predict common issues based on graph patterns.

Business Impact: Catch survey design errors before field deployment, reducing costly revisions.

Capability 4: Semantic Search and Recommendations

Use Case: Find similar questions across different surveys for harmonization.

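# Generate embeddings for all questions
# The connection URI and credentials below are placeholders.
from neo4j import GraphDatabase
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Read all rows first so the writes below do not interleave with the read
    questions = session.run("""
        MATCH (q:QuestionItem)
        WHERE q.question_text IS NOT NULL
        RETURN q.fragment_id AS id, q.question_text AS text
    """).data()

    for q in questions:
        embedding = model.encode(q["text"])
        session.run("""
            MATCH (q:QuestionItem {fragment_id: $id})
            SET q.embedding = $embedding
        """, id=q["id"], embedding=embedding.tolist())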

Query similar questions:

// Find questions semantically similar to "What is your age?"
// Requires the Neo4j Graph Data Science library for gds.similarity.cosine
MATCH (target:QuestionItem {fragment_id: 'q-age'})
MATCH (similar:QuestionItem)
WHERE similar <> target
WITH similar, 
     gds.similarity.cosine(target.embedding, similar.embedding) AS similarity
WHERE similarity > 0.8
RETURN similar.question_text, similarity
ORDER BY similarity DESC
LIMIT 5
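
For readers unfamiliar with the measure, cosine similarity compares the direction of two embedding vectors and ignores their length. A minimal pure-Python sketch follows; the short vectors are toy values, not real question embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Vectors pointing the same way score close to 1 and unrelated ones close to 0, which is why the query above keeps only pairs above a threshold such as 0.8.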

Business Impact: Faster survey harmonization, better reuse of existing questions, improved comparability across time.

The AI-Ready Advantage

Capability | XML Archive | Graph + AI
Question answering | Manual lookup | Automated chatbot
Documentation | Static Word docs | Generated on-demand
Quality checks | Manual review | AI-powered validation
Harmonization | Spreadsheet comparison | Semantic search
Discovery | Limited to known queries | Exploratory analysis

Bottom Line: ddigraph + AI = Metadata that works for you, not the other way around.

Risk Mitigation

Technical Risks

Risk | Mitigation
Learning curve | Pre-built queries, visual tools, documentation
Infrastructure complexity | Use managed Neo4j Aura, Docker for development
Data migration | Incremental rollout, keep XML as source of truth
Tool integration | Adapter pattern supports export to CSV/JSON/pandas

Organizational Risks

Risk | Mitigation
Resistance to change | Start with PoC, demonstrate quick wins
Skill gaps | Training plan, external support available
Budget constraints | Open-source core, optional cloud scaling

Comparison: Graph vs. Traditional Approaches

vs. Relational Databases (SQL)

Feature | Relational DB | Graph DB (ddigraph)
Query complexity | Complex JOINs | Natural pattern matching
Variable-depth queries | Recursive CTEs (slow) | Native traversal (fast)
Schema flexibility | ALTER TABLE migrations | Add relationships dynamically
Relationship modeling | Foreign keys + junction tables | Direct relationships
Visualization | Not native | Built-in graph browser

Verdict: Graph databases are purpose-built for connected data like DDI metadata.

vs. XML Parsing Scripts

Feature | Custom Scripts | ddigraph
Maintenance | Scripts break with schema changes | Declarative schema, adapts automatically
Reusability | One-off scripts | Query library for all uses
Performance | Re-parse file each time | Query graph in milliseconds
Collaboration | Code reviews | Visual exploration

Verdict: ddigraph eliminates technical debt from brittle parsing scripts.

vs. Manual Documentation

Feature | Word/Excel Docs | ddigraph
Accuracy | Outdated immediately | Always reflects current structure
Discovery | Limited to documented paths | Explore any relationship
Updates | Manual editing | Automatic from data
Integration | Copy-paste | API/query access

Verdict: Graph data is self-documenting and always current.


Questions and Discussion

For Technical Managers

  • Q: How does this fit with our existing data infrastructure?
  • A: ddigraph exports to CSV, JSON, and pandas, so it works with any tool. The graph adds to your stack; it does not replace it.

  • Q: What if Neo4j becomes a bottleneck?
  • A: Neo4j scales to billions of nodes. Current metadata volumes are tiny by comparison.
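
To make the export answer concrete, the sketch below serializes query rows to CSV and JSON with the standard library only. It illustrates the idea and is not ddigraph's export API; `rows_to_csv` and the sample rows are assumptions.

```python
import csv
import io
import json

# Sample rows shaped like the records a graph query might return.
rows = [
    {"name": "q_age", "question_text": "What is your age?"},
    {"name": "q_emp", "question_text": "What is your employment status?"},
]

def rows_to_csv(rows):
    """Serialize a list of flat dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = rows_to_csv(rows)
json_text = json.dumps(rows, indent=2)
print(csv_text)
```

The same list of dicts can also be handed to `pandas.DataFrame(rows)` when pandas is available.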

For Non-Technical Managers

  • Q: Do staff need to learn programming?
  • A: No. You get a visual graph browser and ready-made queries for common tasks. Power users can learn Cypher, which tends to read more naturally than SQL for relationship queries.

Conclusion: Metadata as Strategic Asset

Survey metadata has long been a cost center. It is needed, but it takes manual work to keep up to date and to query.

ddigraph transforms metadata into a strategic asset by making it:

  • Queryable in real time
  • Integrated with international standards
  • AI-ready for modern capabilities
  • Scalable across the organization

The opportunity: Treat metadata as infrastructure. Teams that do gain an edge in data quality, compliance, and day-to-day work.

The choice: Continue with manual processes, or leverage modern graph technology to work smarter.


Thank you. Questions?