Executive Summary¶

ddigraph is a Python package that turns static DDI (Data Documentation Initiative) XML metadata into a queryable knowledge graph in Neo4j. This lets organizations:

Understand complex survey instruments through visual exploration
Integrate seamlessly with other international standards like SDMX
Leverage AI and machine learning for metadata discovery
Scale metadata operations beyond manual documentation

The Goal: From Archive to Asset¶

The Problem We Solve¶

Survey metadata today lives in XML files that are:

✗ Hard to navigate: Deeply nested hierarchies require specialized tools
✗ Difficult to integrate: No standardized way to connect to other systems
✗ Inaccessible to AI: Modern ML tools can't readily use relationships buried in nested XML
✗ Slow to analyze: A question like "which variables use this code list?" takes manual work

What We Deliver¶

ddigraph converts this:

<!-- 148KB of nested XML -->
<Fragment>
  <QuestionItem>
    <QuestionReference>
      <ID>q-001</ID>
    </QuestionReference>
  </QuestionItem>
</Fragment>

Into this:

// Simple, visual query
MATCH (q:QuestionItem)-[:USES_CODELIST]->(cl:CodeList)
WHERE cl.name = 'Age Groups'
RETURN q.question_text
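
To make the first half of that conversion concrete, here is a minimal sketch that reads the fragment above with Python's standard library and flattens it into a node record that could later be written to Neo4j. The function name `fragment_to_node` and the record layout are assumptions made for illustration. This is not ddigraph's actual parser, and real DDI files use XML namespaces that the sketch ignores.

```python
import xml.etree.ElementTree as ET

# The same fragment shown above, without namespaces
FRAGMENT = """
<Fragment>
  <QuestionItem>
    <QuestionReference>
      <ID>q-001</ID>
    </QuestionReference>
  </QuestionItem>
</Fragment>
"""

def fragment_to_node(xml_text):
    """Flatten one <Fragment> into a dict holding a label and referenced IDs."""
    root = ET.fromstring(xml_text)
    item = root[0]  # first child element of <Fragment>, here <QuestionItem>
    return {
        "label": item.tag,
        # Collect the text of every <ID> nested anywhere below the item
        "referenced_ids": [el.text.strip() for el in item.iter("ID")],
    }

node = fragment_to_node(FRAGMENT)
print(node)
```

Each such record would become one node, and each referenced ID one relationship, which is what makes the Cypher query below possible.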

Key Benefits¶

1. Instant Metadata Discovery¶

Before: Analysts read XML by hand to find which questions use a code list.

Time: 2-4 hours per query
Error-prone: Easy to miss references in nested structures
Limited: Can only answer questions you explicitly plan for

After: Graph queries return answers in real time.

// Find all questions using "Employment Status" codes
MATCH (q:QuestionItem)-[:USES_CODELIST]->(cl:CodeList {name: 'Employment Status'})
RETURN q.question_text, q.name

Time: < 1 second
Comprehensive: Returns all relationships automatically
Flexible: Can explore unexpected connections

Business Impact: Faster questionnaire reviews, reduced data collection errors, improved survey quality.
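
For teams that want to call the query above from Python, the sketch below wraps it in a small helper and passes the code list name as a query parameter instead of building the string by hand. The helper name `questions_using_codelist` is an assumption, and `FakeSession` with its single row is a hypothetical stand-in so the example runs without a database. With the official neo4j driver you would pass a real session object.

```python
FIND_QUESTIONS = """
MATCH (q:QuestionItem)-[:USES_CODELIST]->(cl:CodeList {name: $name})
RETURN q.question_text AS question_text, q.name AS name
"""

def questions_using_codelist(session, codelist_name):
    """Return (name, question_text) pairs for questions using a code list.

    `session` is any object with a neo4j-style run(query, **params) method
    that yields mapping-like records.
    """
    result = session.run(FIND_QUESTIONS, name=codelist_name)
    return [(record["name"], record["question_text"]) for record in result]

class FakeSession:
    """Hypothetical stand-in for a neo4j session, with made-up data."""
    def run(self, query, **params):
        rows = [{"name": "EMPSTAT",
                 "question_text": "What is your employment status?"}]
        return rows if params.get("name") == "Employment Status" else []

print(questions_using_codelist(FakeSession(), "Employment Status"))
```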

2. Survey Instrument Visualization¶

Before: Survey flow is documented in lengthy Word documents or spreadsheets.

After: Interactive graph visualization shows the complete structure at a glance.

[Entry Point] → [Sequence: Demographics]
    ├→ [Question: Age]
    ├→ [If Age ≥ 18]
    │   └→ [Sequence: Employment]
    │       ├→ [Question: Employment Status]
    │       └→ [Question: Hours Worked]
    └→ [If Age < 18]
        └→ [Sequence: Education]

Business Impact: Stakeholders can review survey logic visually, catching design flaws before field deployment.

3. Impact Analysis and Quality Assurance¶

Before: To change a code list, you check dozens of XML files by hand to see what depends on it.

After: Graph queries instantly show all affected elements.

// What will be affected if we change this code list?
MATCH (cl:CodeList {name: 'Industry Codes'})
MATCH (cl)<-[:USES_CODELIST]-(q:QuestionItem)
MATCH (q)<-[:ASKS_QUESTION]-(qc:QuestionConstruct)
MATCH (qc)<-[:HAS_CONSTRUCT*]-(seq:Sequence)
RETURN seq.name AS affected_section,
       count(DISTINCT q) AS question_count

Business Impact: Reduced risk of breaking changes, faster survey updates, improved data quality.
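
The `[:HAS_CONSTRUCT*]` step in the query above is a variable-length traversal: it follows parent links upward through any number of levels. The sketch below shows the same idea on a small in-memory structure. The function name, the node identifiers, and the `IfThenElse` label are all hypothetical; in practice Neo4j performs this walk for you.

```python
def affected_sequences(parents, labels, start):
    """Return every Sequence reachable by walking parent links upward from start."""
    seen, stack, found = set(), [start], set()
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent in seen:
                continue  # already visited via another path
            seen.add(parent)
            if labels.get(parent) == "Sequence":
                found.add(parent)
            stack.append(parent)  # keep climbing through any node type
    return found

# Hypothetical structure: child -> list of parents (reverse of HAS_CONSTRUCT)
parents = {
    "qc-hours": ["if-adult"],
    "if-adult": ["seq-employment"],
    "seq-employment": ["seq-main"],
}
labels = {
    "qc-hours": "QuestionConstruct",
    "if-adult": "IfThenElse",
    "seq-employment": "Sequence",
    "seq-main": "Sequence",
}

print(sorted(affected_sequences(parents, labels, "qc-hours")))
```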

4. Automated Documentation¶

Before: Survey documentation is manually written and quickly becomes outdated.

After: Documentation is generated from the graph, always in sync with the actual structure.

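# Generate natural language documentation
def document_survey(session, instrument_id):
    """Summarise an instrument's sections and pass the rows to an LLM step."""
    results = session.run("""
        MATCH (i:Instrument {id: $id})-[:HAS_CONSTRUCT]->(seq:Sequence)
        MATCH (seq)-[:HAS_CONSTRUCT]->(c)
        RETURN seq.name AS section, labels(c)[0] AS construct_type,
               count(*) AS count
        ORDER BY section
    """, id=instrument_id)

    # Materialise the rows before the session is closed
    rows = [record.data() for record in results]

    # Feed to LLM for natural language generation
    # (generate_documentation is a placeholder for your own LLM call)
    return generate_documentation(rows)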

Output: "The Labour Force Survey consists of 5 sections: Demographics (8 questions), Employment (12 questions)..."

Business Impact: Always-current documentation, reduced manual effort, consistent formatting.
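
A summary like the output above does not strictly need an LLM. As a minimal sketch, the rows returned by the query (section, construct type, count) can be turned into a sentence with a plain template. The function name `summarise_sections`, the sample rows, and the `IfThenElse` label are assumptions for illustration, and pluralisation is deliberately left simple.

```python
def summarise_sections(title, rows):
    """Build a one-sentence summary from (section, construct_type, count) rows."""
    questions = {}  # section name -> number of question constructs
    for section, construct_type, count in rows:
        questions.setdefault(section, 0)
        if construct_type == "QuestionConstruct":
            questions[section] += count
    parts = [f"{name} ({n} questions)" for name, n in questions.items()]
    return f"The {title} consists of {len(parts)} sections: {', '.join(parts)}."

# Hypothetical query output
rows = [
    ("Demographics", "QuestionConstruct", 8),
    ("Demographics", "IfThenElse", 2),
    ("Employment", "QuestionConstruct", 12),
]
print(summarise_sections("Labour Force Survey", rows))
```

A template keeps the wording fully reproducible. An LLM step can then be reserved for richer narrative where that trade-off is acceptable.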

Interoperability with International Standards (SDMX)¶

The Integration Challenge¶

Organizations must report data to international bodies (Eurostat, OECD, UN) in the SDMX (Statistical Data and Metadata eXchange) format. Mapping DDI survey metadata to SDMX is usually a manual, spreadsheet-based exercise that is hard to audit and hard to keep current.

The Graph Solution¶

ddigraph makes DDI-SDMX integration explicit and queryable:

// Automatic mapping via shared identifiers
MATCH (v:Variable)
WHERE v.user_id IS NOT NULL
MATCH (d:Dimension {id: v.user_id})
MERGE (v)-[m:MAPS_TO_DIMENSION]->(d)
SET m.match_type = 'user_id',
    m.confidence = 1.0,
    m.mapped_on = datetime()

What This Means:

DDI Variables (from your survey) link directly to SDMX Dimensions (international reporting structure)
The mapping is documented in the graph with metadata about confidence and method
Both directions are queryable:
"What SDMX components does this variable map to?"
"What DDI variables feed this SDMX dimension?"
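
The matching rule in the Cypher above is simple enough to sketch in plain Python: a variable maps to a dimension when its `user_id` equals the dimension's `id`, and the mapping carries the same metadata. The function name `map_by_user_id` and the sample identifiers are hypothetical. In ddigraph the join happens inside Neo4j, not in application code.

```python
from datetime import datetime, timezone

def map_by_user_id(variables, dimensions):
    """Pair each variable with the dimension whose id equals its user_id."""
    dimension_ids = {d["id"] for d in dimensions}
    mappings = []
    for v in variables:
        user_id = v.get("user_id")
        if user_id is not None and user_id in dimension_ids:
            mappings.append({
                "variable": v["name"],
                "dimension": user_id,
                "match_type": "user_id",   # how the match was found
                "confidence": 1.0,         # exact identifier match
                "mapped_on": datetime.now(timezone.utc).isoformat(),
            })
    return mappings

# Hypothetical sample data
variables = [
    {"name": "age_group", "user_id": "AGE"},
    {"name": "interview_notes", "user_id": None},
    {"name": "region", "user_id": "GEO_LOCAL"},
]
dimensions = [{"id": "AGE"}, {"id": "SEX"}]

for m in map_by_user_id(variables, dimensions):
    print(m["variable"], "->", m["dimension"], m["match_type"])
```

Variables without a matching identifier are simply left unmapped, which is what the coverage report in the next section measures.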

Real-World Example¶

Scenario: Your Labour Force Survey needs to report to Eurostat using their SDMX template.

Traditional Approach:

Export DDI variables to Excel
Manually match to Eurostat dimension codes
Maintain Excel mapping file
Hope nothing changes

ddigraph Approach:

Load DDI metadata: ddigraph load survey.xml
Load SDMX structure into same graph
Run automated mapping query
Review and validate mappings visually

Business Impact:

Faster international reporting setup
Minimal maintenance, since re-running the mapping query refreshes the mappings
Full auditability of which variables map where and why
Reduced errors from manual transcription

Alignment Reports¶

Generate comprehensive alignment reports:

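// Show mapping coverage (each variable counted once, even with several mappings)
MATCH (v:Variable)
OPTIONAL MATCH (v)-[:MAPS_TO_DIMENSION]->(d:Dimension)
WITH v, count(d) > 0 AS is_mapped
WITH count(v) AS total,
     sum(CASE WHEN is_mapped THEN 1 ELSE 0 END) AS mapped
RETURN total, mapped,
       round(100.0 * mapped / total) AS coverage_pct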

Output: "Coverage: 87% of variables mapped (1,400 of 1,609)"
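
The arithmetic behind that line is worth spelling out, since it is the figure reviewers tend to quote. The sketch below reproduces it from the two counts. The function name `coverage_report` is an assumption, and the guard for an empty graph is an addition the Cypher above does not have.

```python
def coverage_report(total, mapped):
    """Format mapping coverage as a one-line summary."""
    if total == 0:
        return "Coverage: no variables loaded"
    pct = round(100.0 * mapped / total)  # whole-number percentage
    return f"Coverage: {pct}% of variables mapped ({mapped:,} of {total:,})"

print(coverage_report(1609, 1400))
# prints: Coverage: 87% of variables mapped (1,400 of 1,609)
```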

AI-Readiness: Unlocking Machine Learning Capabilities¶

Why Graph Structure Matters for AI¶

Traditional XML files are not AI-ready:

LLMs can't "see" relationships buried in nested tags
Retrieval-Augmented Generation (RAG) systems need queryable context
Embeddings don't capture structural connections

Graph databases make metadata AI-accessible:

Relationships are explicit and traversable
Context is preserved in graph structure
Pattern matching is native, not reconstructed

Capability 1: Intelligent Survey Chatbots¶

Use Case: Stakeholders ask questions about survey design in natural language.

User: "What questions come after the employment section?"
Bot: [Queries graph for sequences following employment constructs]
     "After the employment questions, respondents answer 5 questions 
      about job satisfaction, then branch based on employment status..."

How It Works:

User question → LLM converts to graph query
Graph returns structured results
LLM converts results to natural language
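
The three steps above can be sketched as one function that takes each stage as a pluggable callable. Every name here (`chatbot_reply`, `to_cypher`, `run_query`, `to_text`) is hypothetical, and the stand-in functions return canned values so the example runs without an LLM or a database.

```python
def chatbot_reply(question, to_cypher, run_query, to_text):
    """Run the question -> query -> rows -> answer pipeline."""
    cypher = to_cypher(question)      # step 1: LLM drafts a graph query
    rows = run_query(cypher)          # step 2: graph returns structured rows
    return to_text(question, rows)    # step 3: LLM phrases the answer

# Stand-ins for the real LLM and Neo4j calls
def fake_to_cypher(question):
    return "MATCH (s:Sequence) RETURN s.name AS name"

def fake_run_query(cypher):
    return [{"name": "Job Satisfaction"}]

def fake_to_text(question, rows):
    names = ", ".join(row["name"] for row in rows)
    return f"Next sections: {names}"

reply = chatbot_reply(
    "What questions come after the employment section?",
    fake_to_cypher, fake_run_query, fake_to_text,
)
print(reply)
```

Keeping the stages separate makes it straightforward to log the generated query for review, and to run it under a read-only database account, which is a sensible precaution for machine-generated Cypher.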

Business Impact: Non-technical stakeholders can explore survey structure without training.

Capability 2: Retrieval-Augmented Generation (RAG)¶

Use Case: Generate survey documentation or answer policy questions grounded in actual metadata.

# RAG pipeline with graph context
# Assumes an open neo4j `session` and an `llm` client; extract_keywords and
# format_graph_results are placeholder helpers you supply.
def answer_question(session, llm, user_query):
    # 1. Retrieve questions matching any keyword, plus their enclosing sequence
    graph_results = session.run("""
        MATCH (q:QuestionItem)
        WHERE any(kw IN $keywords
                  WHERE toLower(q.question_text) CONTAINS toLower(kw))
        MATCH path = (q)<-[:ASKS_QUESTION]-(:QuestionConstruct)
                        <-[:HAS_CONSTRUCT]-(seq:Sequence)
        RETURN q.question_text, seq.name, path
    """, keywords=extract_keywords(user_query))  # expects a list of strings

    # 2. Feed graph context to LLM
    context = format_graph_results(graph_results)
    response = llm.generate(user_query, context=context)

    return response

Example:

Query: "How does the survey measure unemployment?"
Graph retrieves: Questions about employment status, hours worked, job search
LLM generates: "The survey measures unemployment through three key questions in the Employment section..."

Business Impact: Automated response to stakeholder inquiries, consistent with actual survey design.

Capability 3: Quality Validation with AI¶

Use Case: Automatically detect survey design issues.

// Find questions without response options (potential error)
MATCH (q:QuestionItem)
WHERE NOT (q)-[:USES_CODELIST]->()
  AND q.response_type = 'code'
RETURN q.name AS question_missing_codelist, q.question_text

// Find unreachable constructs (dead code in survey)
MATCH (c)
WHERE (c:Sequence OR c:QuestionConstruct)
  AND NOT ()-[:HAS_CONSTRUCT|THEN|ELSE]->(c)
  AND NOT c:EntryPoint
RETURN labels(c)[0] AS type, c.name AS orphaned_construct

AI Enhancement: Train ML models to predict common issues based on graph patterns.

Business Impact: Catch survey design errors before field deployment, reducing costly revisions.
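
The unreachable-construct check above comes down to one rule: a construct is orphaned if nothing points at it and it is not an entry point. A minimal in-memory sketch of that rule follows. The function name and sample identifiers are hypothetical, and the relationship names mirror those in the Cypher query.

```python
FLOW_RELATIONSHIPS = {"HAS_CONSTRUCT", "THEN", "ELSE"}

def orphaned_constructs(nodes, edges, entry_points):
    """Return nodes with no incoming flow relationship, excluding entry points."""
    incoming = {target for _source, rel, target in edges
                if rel in FLOW_RELATIONSHIPS}
    return sorted(n for n in nodes
                  if n not in incoming and n not in entry_points)

# Hypothetical survey structure as (source, relationship, target) triples
nodes = ["entry", "seq-demo", "qc-age", "seq-old-module"]
edges = [
    ("entry", "HAS_CONSTRUCT", "seq-demo"),
    ("seq-demo", "HAS_CONSTRUCT", "qc-age"),
    ("qc-age", "ASKS_QUESTION", "seq-old-module"),  # not a flow relationship
]

print(orphaned_constructs(nodes, edges, entry_points={"entry"}))
```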

Capability 4: Semantic Search and Recommendations¶

Use Case: Find similar questions across different surveys for harmonization.

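# Generate embeddings for all questions
from neo4j import GraphDatabase
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
# Placeholder connection details; replace with your own
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Materialise the rows first so the read result is not left open
    # while the write queries run
    questions = session.run("""
        MATCH (q:QuestionItem)
        WHERE q.question_text IS NOT NULL
        RETURN q.fragment_id AS id, q.question_text AS text
    """).data()

    for q in questions:
        embedding = model.encode(q["text"])
        session.run("""
            MATCH (q:QuestionItem {fragment_id: $id})
            SET q.embedding = $embedding
        """, id=q["id"], embedding=embedding.tolist())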

Query similar questions:

// Find questions semantically similar to "What is your age?"
// Requires the Neo4j Graph Data Science (GDS) library for gds.similarity.cosine
MATCH (target:QuestionItem {fragment_id: 'q-age'})
MATCH (similar:QuestionItem)
WHERE similar <> target
  AND similar.embedding IS NOT NULL
WITH similar,
     gds.similarity.cosine(target.embedding, similar.embedding) AS similarity
WHERE similarity > 0.8
RETURN similar.question_text, similarity
ORDER BY similarity DESC
LIMIT 5

Business Impact: Faster survey harmonization, better reuse of existing questions, improved comparability across time.
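
For readers unfamiliar with the similarity score used above, cosine similarity is the dot product of two vectors divided by the product of their lengths. Identical directions score 1.0 and unrelated (orthogonal) vectors score 0.0. The sketch below is a plain-Python illustration of the formula, not the GDS implementation, and the zero-vector guard is an assumption of this sketch.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; treat as 0
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

The 0.8 threshold in the query is a tuning choice. It is worth calibrating against a handful of question pairs your own analysts consider equivalent.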

The AI-Ready Advantage¶

Capability	XML Archive	Graph + AI
Question answering	Manual lookup	Automated chatbot
Documentation	Static Word docs	Generated on-demand
Quality checks	Manual review	AI-powered validation
Harmonization	Spreadsheet comparison	Semantic search
Discovery	Limited to known queries	Exploratory analysis

Bottom Line: ddigraph + AI = Metadata that works for you, not the other way around.

Risk Mitigation¶

Technical Risks¶

Risk	Mitigation
Learning curve	Pre-built queries, visual tools, documentation
Infrastructure complexity	Use managed Neo4j Aura, Docker for development
Data migration	Incremental rollout, keep XML as source of truth
Tool integration	Adapter pattern supports export to CSV/JSON/pandas

Organizational Risks¶

Risk	Mitigation
Resistance to change	Start with PoC, demonstrate quick wins
Skill gaps	Training plan, external support available
Budget constraints	Open-source core, optional cloud scaling

Comparison: Graph vs. Traditional Approaches¶

vs. Relational Databases (SQL)¶

Feature	Relational DB	Graph DB (ddigraph)
Query complexity	Complex JOINs	Natural pattern matching
Variable-depth queries	Recursive CTEs (slow)	Native traversal (fast)
Schema flexibility	ALTER TABLE migrations	Add relationships dynamically
Relationship modeling	Foreign keys + junction tables	Direct relationships
Visualization	Not native	Built-in graph browser

Verdict: Graph databases are purpose-built for connected data like DDI metadata.

vs. XML Parsing Scripts¶

Feature	Custom Scripts	ddigraph
Maintenance	Scripts break with schema changes	Declarative schema, adapts automatically
Reusability	One-off scripts	Query library for all uses
Performance	Re-parse file each time	Query graph in milliseconds
Collaboration	Code reviews	Visual exploration

Verdict: ddigraph eliminates technical debt from brittle parsing scripts.

vs. Manual Documentation¶

Feature	Word/Excel Docs	ddigraph
Accuracy	Outdated immediately	Always reflects current structure
Discovery	Limited to documented paths	Explore any relationship
Updates	Manual editing	Automatic from data
Integration	Copy-paste	API/query access

Verdict: Graph data is self-documenting and always current.

Questions and Discussion¶

For Technical Managers¶

Q: How does this fit with our existing data infrastructure?
A: ddigraph exports to CSV, JSON, and pandas, so it works with any tool. The graph adds to your stack; it does not replace it.
Q: What if Neo4j becomes a bottleneck?
A: Neo4j scales to billions of nodes. Current metadata volumes are tiny by comparison.
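
To illustrate the export path mentioned in the first answer, the sketch below serialises query rows (a list of dictionaries) to CSV and JSON using only the standard library. The function names and sample row are hypothetical, and this is not ddigraph's own export API, which may differ.

```python
import csv
import io
import json

def rows_to_csv(rows):
    """Serialise a list of dicts to CSV text, using the first row's keys as header."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

def rows_to_json(rows):
    """Serialise a list of dicts to indented JSON text."""
    return json.dumps(rows, indent=2)

# Hypothetical query result
rows = [{"name": "EMPSTAT", "codelist": "Employment Status"}]
print(rows_to_csv(rows))
print(rows_to_json(rows))
```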

For Non-Technical Managers¶

Q: Do staff need to learn programming?
A: No. You get a visual graph browser and ready-made queries for common tasks. Power users can learn Cypher, which many find easier to read than SQL for relationship-heavy questions.

Conclusion: Metadata as Strategic Asset¶

Survey metadata has long been a cost center. It is needed, but it takes manual work to keep up to date and to query.

ddigraph transforms metadata into a strategic asset by making it:

✓ Queryable in real time
✓ Integrated with international standards
✓ AI-ready for modern capabilities
✓ Scalable across the organization

The opportunity: Treat metadata as infrastructure. Teams that do so gain an edge in data quality, compliance, and day-to-day work.

The choice: Continue with manual processes, or leverage modern graph technology to work smarter.

Thank you. Questions?