Executive Summary¶
ddigraph is a Python package that turns static DDI (Data Documentation Initiative) XML metadata into a queryable knowledge graph in Neo4j. This lets organizations:
- Understand complex survey instruments through visual exploration
- Integrate seamlessly with other international standards like SDMX
- Leverage AI and machine learning for metadata discovery
- Scale metadata operations beyond manual documentation
The Goal: From Archive to Asset¶
The Problem We Solve¶
Survey metadata today lives in XML files that are:
- ✗ Hard to navigate: Deeply nested hierarchies require specialized tools
- ✗ Difficult to integrate: No standardized way to connect to other systems
- ✗ Inaccessible to AI: Relationships buried in nested tags are hard for ML tools to use
- ✗ Slow to analyze: A question like "which variables use this code list?" takes manual work
What We Deliver¶
ddigraph converts this:
<!-- 148KB of nested XML -->
<Fragment>
  <QuestionItem>
    <QuestionReference>
      <ID>q-001</ID>
    </QuestionReference>
  </QuestionItem>
</Fragment>
Into this:
// Simple, visual query
MATCH (q:QuestionItem)-[:USES_CODELIST]->(cl:CodeList)
WHERE cl.name = 'Age Groups'
RETURN q.question_text
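As a rough illustration of that conversion, the sketch below parses a DDI-like fragment with the Python standard library and emits node and relationship records. It is not ddigraph's actual loader. The sample XML is a simplified, namespace-free stand-in for real DDI, and the `extract_graph` function, the `REFERENCES` relationship name, and the record layout are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample (real DDI uses XML namespaces).
SAMPLE = """
<Fragment>
  <QuestionItem>
    <ID>q-001</ID>
    <CodeListReference>
      <ID>cl-age</ID>
    </CodeListReference>
  </QuestionItem>
</Fragment>
"""


def extract_graph(xml_text):
    """Return (nodes, edges) for elements that carry a direct <ID> child."""
    root = ET.fromstring(xml_text)
    nodes, edges = [], []

    def visit(element, owner_id):
        own_id = element.findtext("ID")  # direct children only
        if own_id is not None and element.tag.endswith("Reference"):
            # A *Reference element points at another item: record an edge.
            if owner_id is not None:
                edges.append((owner_id, "REFERENCES", own_id))
            return
        if own_id is not None:
            nodes.append({"label": element.tag, "id": own_id})
            owner_id = own_id
        for child in element:
            if child.tag != "ID":
                visit(child, owner_id)

    visit(root, None)
    return nodes, edges


nodes, edges = extract_graph(SAMPLE)
print(nodes)
print(edges)
```

Once references are stored as explicit edges like this, a question such as "which items point at this code list?" becomes a lookup rather than a tree walk.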
Key Benefits¶
1. Instant Metadata Discovery¶
Before: Analysts read XML by hand to find which questions use a code list.
- Time: 2-4 hours per query
- Error-prone: Easy to miss references in nested structures
- Limited: Can only answer questions you explicitly plan for
After: Graph queries return answers in real time.
// Find all questions using "Employment Status" codes
MATCH (q:QuestionItem)-[:USES_CODELIST]->(cl:CodeList {name: 'Employment Status'})
RETURN q.question_text, q.name
- Time: < 1 second
- Comprehensive: Returns all relationships automatically
- Flexible: Can explore unexpected connections
Business Impact: Faster questionnaire reviews, reduced data collection errors, improved survey quality.
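The lookup above could be wrapped in a small reusable helper. The sketch below assumes `session` behaves like a session from the official Neo4j Python driver (a `run(query, **parameters)` method whose result yields records that support key access). The helper name `questions_using_codelist` is hypothetical and is not part of ddigraph.

```python
CODELIST_QUERY = """
MATCH (q:QuestionItem)-[:USES_CODELIST]->(cl:CodeList {name: $name})
RETURN q.name AS name, q.question_text AS question_text
"""


def questions_using_codelist(session, codelist_name):
    """Run the code list lookup and return plain dictionaries."""
    # The code list name is passed as a query parameter, never spliced
    # into the Cypher text.
    result = session.run(CODELIST_QUERY, name=codelist_name)
    return [
        {"name": record["name"], "question_text": record["question_text"]}
        for record in result
    ]
```

With the official driver, a session would typically come from `GraphDatabase.driver(uri, auth=(user, password)).session()`; the connection details depend on your deployment.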
2. Survey Instrument Visualization¶
Before: Survey flow is documented in lengthy Word documents or spreadsheets.
After: Interactive graph visualization shows the complete structure at a glance.
[Entry Point] → [Sequence: Demographics]
  ├→ [Question: Age]
  ├→ [If Age ≥ 18]
  │   └→ [Sequence: Employment]
  │       ├→ [Question: Employment Status]
  │       └→ [Question: Hours Worked]
  └→ [If Age < 18]
      └→ [Sequence: Education]
Business Impact: Stakeholders can review survey logic visually, catching design flaws before field deployment.
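For reviewers without access to a graph browser, a text outline like the one above can be produced from nested flow data. The sketch below is one possible rendering routine; the `{"label": ..., "children": [...]}` layout and the `render_flow` name are assumptions for illustration, not a ddigraph format.

```python
def render_flow(node, prefix="", is_last=True, is_root=True):
    """Return the lines of a text tree for a nested {'label', 'children'} dict."""
    if is_root:
        lines = [node["label"]]
        child_prefix = ""
    else:
        connector = "└→ " if is_last else "├→ "
        lines = [prefix + connector + node["label"]]
        # Keep a vertical bar running while siblings remain below this node.
        child_prefix = prefix + ("    " if is_last else "│   ")
    children = node.get("children", [])
    for index, child in enumerate(children):
        last = index == len(children) - 1
        lines.extend(render_flow(child, child_prefix, last, is_root=False))
    return lines


flow = {
    "label": "[Entry Point] → [Sequence: Demographics]",
    "children": [
        {"label": "[Question: Age]"},
        {"label": "[If Age ≥ 18]", "children": [
            {"label": "[Sequence: Employment]", "children": [
                {"label": "[Question: Employment Status]"},
                {"label": "[Question: Hours Worked]"},
            ]},
        ]},
        {"label": "[If Age < 18]", "children": [
            {"label": "[Sequence: Education]"},
        ]},
    ],
}

print("\n".join(render_flow(flow)))
```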
3. Impact Analysis and Quality Assurance¶
Before: To change a code list, you check dozens of XML files by hand to see what depends on it.
After: Graph queries instantly show all affected elements.
// What will be affected if we change this code list?
MATCH (cl:CodeList {name: 'Industry Codes'})
MATCH (cl)<-[:USES_CODELIST]-(q:QuestionItem)
MATCH (q)<-[:ASKS_QUESTION]-(qc:QuestionConstruct)
MATCH (qc)<-[:HAS_CONSTRUCT*]-(seq:Sequence)
RETURN seq.name AS affected_section,
count(DISTINCT q) AS question_count
Business Impact: Reduced risk of breaking changes, faster survey updates, improved data quality.
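The idea behind the impact query is a reverse traversal: start at the code list and follow relationships backwards until nothing new turns up. The in-memory sketch below shows that logic on a handful of made-up edges. The identifiers and the `affected_by` function are hypothetical; in practice Neo4j performs this traversal for you.

```python
from collections import defaultdict, deque

# Hypothetical edges in (source, relationship, target) form.
EDGES = [
    ("seq-main", "HAS_CONSTRUCT", "seq-employment"),
    ("seq-employment", "HAS_CONSTRUCT", "qc-industry"),
    ("qc-industry", "ASKS_QUESTION", "q-industry"),
    ("q-industry", "USES_CODELIST", "cl-industry"),
    ("q-age", "USES_CODELIST", "cl-age"),
]


def affected_by(target, edges):
    """Return every node that reaches `target` by following edges forward."""
    incoming = defaultdict(list)
    for source, _relationship, dest in edges:
        incoming[dest].append(source)

    seen, queue = set(), deque([target])
    while queue:  # breadth-first walk against the edge direction
        current = queue.popleft()
        for source in incoming[current]:
            if source not in seen:
                seen.add(source)
                queue.append(source)
    return seen


print(sorted(affected_by("cl-industry", EDGES)))
```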
4. Automated Documentation¶
Before: Survey documentation is manually written and quickly becomes outdated.
After: Documentation is generated from the graph, always in sync with the actual structure.
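# Generate natural language documentation
# `session` is an open Neo4j session; generate_documentation is an
# LLM helper that is not defined here.
def document_survey(session, instrument_id):
    results = session.run("""
        MATCH (i:Instrument {id: $id})-[:HAS_CONSTRUCT]->(seq:Sequence)
        MATCH (seq)-[:HAS_CONSTRUCT]->(c)
        RETURN seq.name AS section, labels(c)[0] AS construct_type, count(*) AS count
        ORDER BY section
    """, id=instrument_id)
    # Materialize the rows, then feed them to the LLM for natural language generation
    return generate_documentation([record.data() for record in results])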
Output: "The Labour Force Survey consists of 5 sections: Demographics (8 questions), Employment (12 questions)..."
Business Impact: Always-current documentation, reduced manual effort, consistent formatting.
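An LLM is not required for a first version of this: the rows returned by the query can be turned into a sentence with a plain template. The sketch below assumes rows arrive as `(section, construct_type, count)` tuples and that questions carry the `QuestionConstruct` label used elsewhere in this document. The function names and the sample rows are illustrative only.

```python
def plural(n, word):
    """Format a count with a naive English plural."""
    return f"{n} {word}" + ("" if n == 1 else "s")


def summarize_instrument(instrument_name, rows):
    """Build a one-sentence summary from (section, construct_type, count) rows."""
    questions_per_section = {}  # dicts keep insertion order, so sections stay sorted
    for section, construct_type, count in rows:
        questions_per_section.setdefault(section, 0)
        if construct_type == "QuestionConstruct":
            questions_per_section[section] += count
    parts = [
        f"{section} ({plural(n, 'question')})"
        for section, n in questions_per_section.items()
    ]
    sections = plural(len(parts), "section")
    return f"{instrument_name} consists of {sections}: " + ", ".join(parts) + "."


rows = [
    ("Demographics", "QuestionConstruct", 8),
    ("Demographics", "StatementItem", 1),
    ("Employment", "QuestionConstruct", 12),
]
print(summarize_instrument("The Labour Force Survey", rows))
```

A template like this is deterministic, which makes it a useful baseline to compare LLM-generated text against.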
Interoperability with International Standards (SDMX)¶
The Integration Challenge¶
Organizations must report data to international bodies (Eurostat, OECD, UN) in the SDMX (Statistical Data and Metadata eXchange) format. Mapping DDI survey metadata to SDMX is usually a manual, spreadsheet-based exercise that is slow to set up and hard to keep current.
The Graph Solution¶
ddigraph makes DDI-SDMX integration explicit and queryable:
// Automatic mapping via shared identifiers
MATCH (v:Variable)
WHERE v.user_id IS NOT NULL
MATCH (d:Dimension {id: v.user_id})
MERGE (v)-[m:MAPS_TO_DIMENSION]->(d)
SET m.match_type = 'user_id',
m.confidence = 1.0,
m.mapped_on = datetime()
What This Means:
- DDI Variables (from your survey) link directly to SDMX Dimensions (international reporting structure)
- The mapping is documented in the graph with metadata about confidence and method
- Both directions are queryable:
- "What SDMX components does this variable map to?"
- "What DDI variables feed this SDMX dimension?"
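The matching rule in the query above is simple enough to prototype outside the database. The sketch below applies the same rule (a variable's `user_id` equals a dimension's `id`) to plain dictionaries and records the same mapping metadata. The sample records and the `map_variables_to_dimensions` name are assumptions for illustration.

```python
from datetime import datetime, timezone


def map_variables_to_dimensions(variables, dimensions):
    """Pair each variable with the dimension whose id equals its user_id."""
    dimension_ids = {dimension["id"] for dimension in dimensions}
    mappings = []
    for variable in variables:
        user_id = variable.get("user_id")
        if user_id is not None and user_id in dimension_ids:
            mappings.append({
                "variable": variable["name"],
                "dimension": user_id,
                "match_type": "user_id",
                "confidence": 1.0,  # exact identifier match
                "mapped_on": datetime.now(timezone.utc).isoformat(),
            })
    return mappings


variables = [
    {"name": "AGE", "user_id": "AGE"},
    {"name": "EMPSTAT", "user_id": None},
    {"name": "REGION", "user_id": "GEO"},
]
dimensions = [{"id": "AGE"}, {"id": "SEX"}]

mappings = map_variables_to_dimensions(variables, dimensions)
print([(m["variable"], m["dimension"]) for m in mappings])
```

Variables left unmatched by this exact rule (here `EMPSTAT` and `REGION`) are the ones that would need manual review or a fuzzier matching step.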
Real-World Example¶
Scenario: Your Labour Force Survey needs to report to Eurostat using their SDMX template.
Traditional Approach:
- Export DDI variables to Excel
- Manually match to Eurostat dimension codes
- Maintain Excel mapping file
- Hope nothing changes
ddigraph Approach:
- Load DDI metadata: ddigraph load survey.xml
- Load SDMX structure into same graph
- Run automated mapping query
- Review and validate mappings visually
Business Impact:
- Faster international reporting setup
- Low maintenance, because rerunning the mapping query refreshes the mappings
- Full auditability of which variables map where and why
- Reduced errors from manual transcription
Alignment Reports¶
Generate comprehensive alignment reports:
// Show mapping coverage
MATCH (v:Variable)
OPTIONAL MATCH (v)-[:MAPS_TO_DIMENSION]->(d:Dimension)
WITH count(DISTINCT v) AS total,
count(DISTINCT CASE WHEN d IS NOT NULL THEN v END) AS mapped
RETURN total, mapped,
round(100.0 * mapped / total) AS coverage_pct
Output: "Coverage: 87% of variables mapped (1,400 of 1,609)"
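The percentage in that output line is plain arithmetic on the two counts. A minimal sketch that reproduces it and also guards against an empty graph (the `mapping_coverage` name is hypothetical):

```python
def mapping_coverage(total_variables, mapped_variables):
    """Return coverage as a whole-number percentage, or None if there is nothing to map."""
    if total_variables == 0:
        return None  # avoid dividing by zero on an empty graph
    return round(100.0 * mapped_variables / total_variables)


total, mapped = 1609, 1400
pct = mapping_coverage(total, mapped)
print(f"Coverage: {pct}% of variables mapped ({mapped:,} of {total:,})")
```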
AI-Readiness: Unlocking Machine Learning Capabilities¶
Why Graph Structure Matters for AI¶
Traditional XML files are not AI-ready:
- LLMs can't "see" relationships buried in nested tags
- Retrieval-Augmented Generation (RAG) systems need queryable context
- Embeddings don't capture structural connections
Graph databases make metadata AI-accessible:
- Relationships are explicit and traversable
- Context is preserved in graph structure
- Pattern matching is native, not reconstructed
Capability 1: Intelligent Survey Chatbots¶
Use Case: Stakeholders ask questions about survey design in natural language.
User: "What questions come after the employment section?"
Bot: [Queries graph for sequences following employment constructs]
"After the employment questions, respondents answer 5 questions
about job satisfaction, then branch based on employment status..."
How It Works:
- User question → LLM converts to graph query
- Graph returns structured results
- LLM converts results to natural language
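The three steps above can be sketched as a small pipeline with pluggable parts. Everything here is an assumption made for illustration: `to_cypher` and `to_text` stand in for LLM calls, `run_query` stands in for a Neo4j session, and the keyword check is a crude heuristic for refusing write queries, not a security boundary.

```python
import re

# Heuristic only: clauses that can modify the graph.
WRITE_CLAUSES = re.compile(r"\b(CREATE|MERGE|DELETE|SET|REMOVE|DROP)\b", re.IGNORECASE)


def answer_with_graph(question, to_cypher, run_query, to_text):
    """Question -> Cypher -> rows -> natural-language answer."""
    query, params = to_cypher(question)
    if WRITE_CLAUSES.search(query):
        raise ValueError("refusing to run a query that could modify the graph")
    rows = run_query(query, params)
    return to_text(question, rows)


# Stand-ins for the LLM and the database, so the sketch runs on its own.
def fake_to_cypher(question):
    query = (
        "MATCH (s:Sequence {name: $name})-[:HAS_CONSTRUCT]->(c) "
        "RETURN c.name AS name"
    )
    return query, {"name": "Employment"}


def fake_run_query(query, params):
    return [{"name": "Employment Status"}, {"name": "Hours Worked"}]


def fake_to_text(question, rows):
    names = ", ".join(row["name"] for row in rows)
    return f"Found {len(rows)} constructs: {names}."


print(answer_with_graph(
    "What is in the employment section?",
    fake_to_cypher, fake_run_query, fake_to_text,
))
```

Keeping the three steps as separate callables makes it possible to test the pipeline without an LLM or a database, and to swap a generated query for a vetted, pre-built one where accuracy matters most.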
Business Impact: Non-technical stakeholders can explore survey structure without training.
Capability 2: Retrieval-Augmented Generation (RAG)¶
Use Case: Generate survey documentation or answer policy questions grounded in actual metadata.
# RAG pipeline with graph context
# `session` is an open Neo4j session; extract_keywords, format_graph_results,
# and llm are helpers that are not defined here.
def answer_question(session, user_query):
    # 1. Retrieve matching questions and the sequences that contain them.
    #    CONTAINS compares strings, so extract_keywords must return a single string.
    graph_results = session.run("""
        MATCH (q:QuestionItem)
        WHERE q.question_text CONTAINS $keywords
        MATCH path = (q)<-[:ASKS_QUESTION]-(:QuestionConstruct)
                     <-[:HAS_CONSTRUCT]-(seq:Sequence)
        RETURN q.question_text, seq.name, path
    """, keywords=extract_keywords(user_query))
    # 2. Feed graph context to LLM
    context = format_graph_results(graph_results)
    response = llm.generate(user_query, context=context)
    return response
Example:
- Query: "How does the survey measure unemployment?"
- Graph retrieves: Questions about employment status, hours worked, job search
- LLM generates: "The survey measures unemployment through three key questions in the Employment section..."
Business Impact: Automated response to stakeholder inquiries, consistent with actual survey design.
Capability 3: Quality Validation with AI¶
Use Case: Automatically detect survey design issues.
// Find questions without response options (potential error)
MATCH (q:QuestionItem)
WHERE NOT (q)-[:USES_CODELIST]->()
AND q.response_type = 'code'
RETURN q.name, q.question_text AS potentially_missing_codelist;
// Find unreachable constructs (dead code in survey)
MATCH (c)
WHERE (c:Sequence OR c:QuestionConstruct)
AND NOT ()-[:HAS_CONSTRUCT|THEN|ELSE]->(c)
AND NOT c:EntryPoint
RETURN labels(c)[0] AS type, c.name AS orphaned_construct
AI Enhancement: Train ML models to predict common issues based on graph patterns.
Business Impact: Catch survey design errors before field deployment, reducing costly revisions.
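The orphan check above is a rule, not a model, and its logic is easy to mirror in a unit test before it runs against a production graph. The sketch below applies the same rule to in-memory data. Like the Cypher query, it only looks for a direct incoming flow relationship and does not prove reachability from the entry point. The sample identifiers and the `find_orphans` name are hypothetical.

```python
FLOW_RELATIONSHIPS = {"HAS_CONSTRUCT", "THEN", "ELSE"}
CHECKED_LABELS = {"Sequence", "QuestionConstruct"}


def find_orphans(constructs, edges, entry_points):
    """Return ids of constructs with no incoming flow relationship.

    constructs: {id: label}, edges: (source, relationship, target) tuples,
    entry_points: ids that are allowed to have no parent.
    """
    has_parent = {
        target for _source, relationship, target in edges
        if relationship in FLOW_RELATIONSHIPS
    }
    return sorted(
        construct_id
        for construct_id, label in constructs.items()
        if label in CHECKED_LABELS
        and construct_id not in has_parent
        and construct_id not in entry_points
    )


constructs = {
    "seq-main": "Sequence",
    "seq-employment": "Sequence",
    "qc-hours": "QuestionConstruct",
    "qc-legacy": "QuestionConstruct",
}
edges = [
    ("seq-main", "HAS_CONSTRUCT", "seq-employment"),
    ("seq-employment", "HAS_CONSTRUCT", "qc-hours"),
]

print(find_orphans(constructs, edges, entry_points={"seq-main"}))
```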
Capability 4: Semantic Search and Recommendations¶
Use Case: Find similar questions across different surveys for harmonization.
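# Generate embeddings for all questions
# `driver` is an existing Neo4j driver instance (connection setup not shown).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

with driver.session() as session:
    # Read all questions first so the result is fully consumed before writing
    questions = list(session.run("""
        MATCH (q:QuestionItem)
        WHERE q.question_text IS NOT NULL
        RETURN q.fragment_id AS id, q.question_text AS text
    """))
    for q in questions:
        embedding = model.encode(q["text"])
        # Store the vector as a plain list of floats on the node
        session.run("""
            MATCH (q:QuestionItem {fragment_id: $id})
            SET q.embedding = $embedding
        """, id=q["id"], embedding=embedding.tolist())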
Query similar questions:
// Find questions semantically similar to "What is your age?"
// (gds.similarity.cosine requires the Neo4j Graph Data Science library)
MATCH (target:QuestionItem {fragment_id: 'q-age'})
MATCH (similar:QuestionItem)
WHERE similar <> target
WITH similar,
gds.similarity.cosine(target.embedding, similar.embedding) AS similarity
WHERE similarity > 0.8
RETURN similar.question_text, similarity
ORDER BY similarity DESC
LIMIT 5
Business Impact: Faster survey harmonization, better reuse of existing questions, improved comparability across time.
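Cosine similarity itself is a short formula: the dot product of two vectors divided by the product of their lengths. The pure-Python sketch below applies the same threshold and limit as the query above to toy two-dimensional vectors, which can help when checking results by hand. The vectors, identifiers, and function names are made up for illustration; real sentence embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(target_id, embeddings, threshold=0.8, limit=5):
    """Return (id, score) pairs above the threshold, best match first."""
    target = embeddings[target_id]
    scored = [
        (other_id, cosine_similarity(target, vector))
        for other_id, vector in embeddings.items()
        if other_id != target_id
    ]
    hits = [(other_id, score) for other_id, score in scored if score > threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)[:limit]


# Toy vectors standing in for real embeddings.
embeddings = {
    "q-age": [1.0, 0.0],
    "q-age-band": [0.9, 0.1],
    "q-income": [0.0, 1.0],
}

print(most_similar("q-age", embeddings))
```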
The AI-Ready Advantage¶
| Capability | XML Archive | Graph + AI |
|---|---|---|
| Question answering | Manual lookup | Automated chatbot |
| Documentation | Static Word docs | Generated on-demand |
| Quality checks | Manual review | AI-powered validation |
| Harmonization | Spreadsheet comparison | Semantic search |
| Discovery | Limited to known queries | Exploratory analysis |
Bottom Line: ddigraph + AI = Metadata that works for you, not the other way around.
Risk Mitigation¶
Technical Risks¶
| Risk | Mitigation |
|---|---|
| Learning curve | Pre-built queries, visual tools, documentation |
| Infrastructure complexity | Use managed Neo4j Aura, Docker for development |
| Data migration | Incremental rollout, keep XML as source of truth |
| Tool integration | Adapter pattern supports export to CSV/JSON/pandas |
Organizational Risks¶
| Risk | Mitigation |
|---|---|
| Resistance to change | Start with PoC, demonstrate quick wins |
| Skill gaps | Training plan, external support available |
| Budget constraints | Open-source core, optional cloud scaling |
Comparison: Graph vs. Traditional Approaches¶
vs. Relational Databases (SQL)¶
| Feature | Relational DB | Graph DB (ddigraph) |
|---|---|---|
| Query complexity | Complex JOINs | Natural pattern matching |
| Variable-depth queries | Recursive CTEs (slow) | Native traversal (fast) |
| Schema flexibility | ALTER TABLE migrations | Add relationships dynamically |
| Relationship modeling | Foreign keys + junction tables | Direct relationships |
| Visualization | Not native | Built-in graph browser |
Verdict: Graph databases are purpose-built for connected data like DDI metadata.
vs. XML Parsing Scripts¶
| Feature | Custom Scripts | ddigraph |
|---|---|---|
| Maintenance | Scripts break with schema changes | Declarative schema, adapts automatically |
| Reusability | One-off scripts | Query library for all uses |
| Performance | Re-parse file each time | Query graph in milliseconds |
| Collaboration | Code reviews | Visual exploration |
Verdict: ddigraph eliminates technical debt from brittle parsing scripts.
vs. Manual Documentation¶
| Feature | Word/Excel Docs | ddigraph |
|---|---|---|
| Accuracy | Outdated immediately | Always reflects current structure |
| Discovery | Limited to documented paths | Explore any relationship |
| Updates | Manual editing | Automatic from data |
| Integration | Copy-paste | API/query access |
Verdict: Graph data is self-documenting and always current.
Questions and Discussion¶
For Technical Managers¶
- Q: How does this fit with our existing data infrastructure?
- A: ddigraph exports to CSV, JSON, and pandas, so it works with any tool. The graph adds to your stack; it does not replace it.
- Q: What if Neo4j becomes a bottleneck?
- A: Neo4j scales to billions of nodes. Current metadata volumes are tiny by comparison.
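To make the export path from the first answer concrete: once query results are plain dictionaries, writing CSV or JSON needs only the standard library. The sketch below is not ddigraph's export API; the function names and sample row are assumptions for illustration.

```python
import csv
import io
import json


def rows_to_csv(rows):
    """Serialize a list of same-shaped dictionaries to CSV text."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def rows_to_json(rows):
    """Serialize the same rows to indented JSON, keeping non-ASCII text readable."""
    return json.dumps(rows, indent=2, ensure_ascii=False)


rows = [{"name": "AGE", "question_text": "What is your age?"}]
print(rows_to_csv(rows))
print(rows_to_json(rows))
```

The same list of dictionaries can also be handed to `pandas.DataFrame(rows)` if pandas is installed.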
For Non-Technical Managers¶
- Q: Do staff need to learn programming?
- A: No. You get a visual graph browser and ready-made queries for common tasks. Power users can learn Cypher, whose pattern-matching syntax reads naturally for relationship queries.
Conclusion: Metadata as Strategic Asset¶
Survey metadata has long been a cost center. It is needed, but it takes manual work to keep up to date and to query.
ddigraph transforms metadata into a strategic asset by making it:
- ✓ Queryable in real time
- ✓ Integrated with international standards
- ✓ AI-ready for modern capabilities
- ✓ Scalable across the organization
The opportunity: Treat metadata as infrastructure. Teams that do gain an edge in data quality, compliance, and day-to-day work.
The choice: Continue with manual processes, or leverage modern graph technology to work smarter.
Thank you. Questions?