AI Readiness: DDI-L in Neo4j
DDI-L (Data Documentation Initiative Lifecycle) metadata starts as a static XML archive. Load it into Neo4j and it becomes a knowledge graph you can query. That shift supports AI and machine learning use cases such as retrieval-augmented generation, embeddings, and agentic workflows.
Why Graph Structure Matters for AI
Traditional DDI-L files are deeply nested XML documents. They are thorough, but the format is hard for AI systems to use:
| XML Format | Graph Format |
|---|---|
| Sequential parsing required | Direct relationship traversal |
| Implicit connections via IDs | Explicit typed relationships |
| Difficult to query patterns | Native pattern matching |
| Context buried in hierarchy | Context visible in structure |
Survey instruments already have a graph shape. Load DDI-L into Neo4j and that shape becomes explicit. You can then walk it directly.
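Key AI-Ready Capabilities
1. Retrieval-Augmented Generation (RAG)
Neo4j can act as the retrieval layer over survey metadata in a RAG pipeline. RAG means the model fetches real data first, then writes its answer from it. The query below uses a plain substring match; meaning-based search needs the embeddings and vector index described later: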
// Find questions related to a concept
MATCH (q:QuestionItem)-[:USES_CODELIST]->(cl:CodeList)
WHERE q.question_text CONTAINS 'employment'
RETURN q.name, q.question_text, cl.name
This lets LLMs base answers on the real survey structure. They no longer have to guess at question text or response options.
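One way to pass those rows to a model is to format them into a grounded prompt. The sketch below assumes rows keyed by the un-aliased RETURN columns of the query above (`q.name`, `q.question_text`, `cl.name`). The function name `build_grounded_prompt`, the prompt wording, and the sample row are illustrative assumptions, not part of ddigraph.

```python
from typing import Iterable, Mapping


def build_grounded_prompt(user_question: str, rows: Iterable[Mapping[str, str]]) -> str:
    """Format retrieved survey rows into a prompt that cites only graph data."""
    lines = []
    for i, row in enumerate(rows, start=1):
        # Keys mirror the un-aliased RETURN columns of the Cypher query above.
        lines.append(
            f"[{i}] {row['q.name']}: {row['q.question_text']} "
            f"(code list: {row['cl.name']})"
        )
    context = "\n".join(lines) if lines else "(no matching questions found)"
    return (
        "Answer using only the survey metadata below.\n\n"
        f"Survey metadata:\n{context}\n\n"
        f"Question: {user_question}"
    )


# Hypothetical row, standing in for a real query result.
rows = [
    {
        "q.name": "EMP1",
        "q.question_text": "What is your employment status?",
        "cl.name": "cl_empstat",
    },
]
prompt = build_grounded_prompt("Which questions cover employment?", rows)
```

Numbering the rows gives the model something to cite, and the explicit empty-result line makes it harder for the model to invent questions when retrieval finds nothing.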
2. Context-Aware Question Answering
A flat file loses survey flow context. The graph keeps it:
// Get context for a question: its enclosing sequence, that sequence's parent, and the allowed responses
MATCH path = (prev)-[:HAS_CONSTRUCT]->(seq:Sequence)-[:HAS_CONSTRUCT]->(qc:QuestionConstruct)
WHERE qc.fragment_id = $question_id
MATCH (qc)-[:ASKS_QUESTION]->(q:QuestionItem)
OPTIONAL MATCH (q)-[:USES_CODELIST]->(cl:CodeList)-[:HAS_CATEGORY]->(cat:Category)
RETURN path, q, collect(cat.category_label) AS response_options
An AI assistant can now do more than say what a question asks. It can also explain where the question sits in the flow and what answers are allowed.
3. Survey Instrument Understanding
LLMs can reason about survey logic by running graph queries:
// Analyze conditional branching
// (COUNT { } subqueries assume Neo4j 5; size() over a pattern is not supported there)
MATCH (ite:IfThenElse)-[:THEN]->(then_seq:Sequence)
MATCH (ite)-[:ELSE]->(else_seq:Sequence)
RETURN ite.condition,
       COUNT { (then_seq)-[:HAS_CONSTRUCT]->() } AS then_path_length,
       COUNT { (else_seq)-[:HAS_CONSTRUCT]->() } AS else_path_length
This lets AI follow skip patterns, filter logic, and how respondents are routed.
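To make the routing idea concrete, here is a minimal sketch that walks a nested instrument and lists the questions a given respondent would reach. It does not read from Neo4j. The dict shape for a branch and the use of a Python callable as the condition are assumptions made for illustration; in the graph, `ite.condition` is stored data that would need its own evaluator.

```python
from typing import Dict, List, Union

# A construct is either a question name (str) or a branch (dict).
Construct = Union[str, dict]


def route(constructs: List[Construct], answers: Dict[str, str]) -> List[str]:
    """Return the question names a respondent with these answers reaches, in order."""
    reached: List[str] = []
    for construct in constructs:
        if isinstance(construct, str):
            reached.append(construct)
        else:
            # Hypothetical branch shape: {"condition": callable, "then": [...], "else": [...]}
            branch = "then" if construct["condition"](answers) else "else"
            reached.extend(route(construct.get(branch, []), answers))
    return reached


# Hypothetical instrument: one question, one branch on its answer, one closing question.
instrument = [
    "EMP_STATUS",
    {
        "condition": lambda a: a.get("EMP_STATUS") == "employed",
        "then": ["JOB_TITLE", "HOURS_WORKED"],
        "else": ["JOB_SEARCH"],
    },
    "AGE",
]

employed_path = route(instrument, {"EMP_STATUS": "employed"})
other_path = route(instrument, {"EMP_STATUS": "unemployed"})
```

Comparing the two resulting paths shows which questions a skip pattern hides from which respondents.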
4. Knowledge Graph Embeddings
Graph structure helps you build embeddings for:
- Question similarity: Find questions with close meaning across surveys
- Instrument comparison: Spot structural differences between survey versions
- Concept clustering: Group questions by the concepts behind them
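# Example: generate text embeddings for each question and store them on the node
from neo4j import GraphDatabase
from sentence_transformers import SentenceTransformer

# Connection details are placeholders; adjust them for your deployment
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
model = SentenceTransformer('all-MiniLM-L6-v2')  # produces 384-dimensional vectors

with driver.session() as session:
    # Materialize the rows before issuing writes in the same session
    questions = session.run("""
        MATCH (q:QuestionItem)
        WHERE q.question_text IS NOT NULL
        RETURN q.fragment_id AS id, q.question_text AS text
    """).data()
    for q in questions:
        embedding = model.encode(q["text"])
        session.run("""
            MATCH (q:QuestionItem {fragment_id: $id})
            SET q.embedding = $embedding
        """, id=q["id"], embedding=embedding.tolist())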
5. Agentic Workflows
AI agents can walk the survey graph to do hard tasks:
- Survey documentation: Write plain descriptions of instrument flow
- Quality assurance: Find orphaned questions or branches you cannot reach
- Translation assistance: Find every text element that needs localization
- Harmonization: Map questions across surveys to shared concepts
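One common way to let an agent walk the graph safely is to expose a small set of fixed, parameterized queries as tools rather than letting it write free-form Cypher. The sketch below shows only the validation step. The tool names, the registry shape, and `plan_tool_call` are hypothetical and not part of ddigraph; the Cypher templates reuse labels and relationships that appear elsewhere on this page.

```python
from typing import Any, Dict, Tuple

# Hypothetical tool registry: each tool is a fixed, parameterized Cypher template.
TOOLS: Dict[str, Dict[str, Any]] = {
    "find_questions_without_codelist": {
        "cypher": (
            "MATCH (q:QuestionItem) "
            "WHERE NOT (q)-[:USES_CODELIST]->() "
            "RETURN q.name AS name, q.question_text AS text"
        ),
        "params": (),
    },
    "questions_mentioning": {
        "cypher": (
            "MATCH (q:QuestionItem) "
            "WHERE q.question_text CONTAINS $term "
            "RETURN q.name AS name, q.question_text AS text"
        ),
        "params": ("term",),
    },
}


def plan_tool_call(tool_name: str, arguments: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Validate an agent's tool request and return (cypher, parameters) to execute."""
    if tool_name not in TOOLS:
        raise ValueError(f"unknown tool: {tool_name}")
    tool = TOOLS[tool_name]
    missing = [p for p in tool["params"] if p not in arguments]
    if missing:
        raise ValueError(f"missing parameters: {missing}")
    # Keep only declared parameters; values travel as query parameters, never as query text.
    params = {p: arguments[p] for p in tool["params"]}
    return tool["cypher"], params


cypher, params = plan_tool_call("questions_mentioning", {"term": "employment", "extra": 1})
```

The returned pair can be passed straight to a driver call such as `session.run(cypher, **params)`. Because the agent only chooses a tool name and its arguments, it cannot issue writes or deletes that are not in the registry.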
Graph-Native AI Patterns
Vector + Graph Search
Combine meaning-based similarity with structural constraints:
// Find similar questions within the same survey section
// (gds.similarity.cosine requires the Graph Data Science plugin)
MATCH (target:QuestionItem {fragment_id: $id})
MATCH (target)<-[:ASKS_QUESTION]-(qc:QuestionConstruct)<-[:HAS_CONSTRUCT]-(seq:Sequence)
MATCH (seq)-[:HAS_CONSTRUCT]->(other_qc:QuestionConstruct)-[:ASKS_QUESTION]->(similar:QuestionItem)
WHERE similar <> target
RETURN similar.name, similar.question_text,
gds.similarity.cosine(target.embedding, similar.embedding) AS similarity
ORDER BY similarity DESC
LIMIT 5
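The same idea can be sketched in plain Python, which may help when testing ranking logic without a database: restrict candidates by structure first, then order them by cosine similarity. The `questions` dict, its `section` and `embedding` keys, and the toy 3-dimensional vectors are assumptions for illustration; real embeddings from the model above have 384 dimensions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_in_section(
    target_id: str, questions: Dict[str, dict], k: int = 5
) -> List[Tuple[str, float]]:
    """Rank questions from the target's own section by embedding similarity."""
    target = questions[target_id]
    scored = [
        (qid, cosine(target["embedding"], q["embedding"]))
        for qid, q in questions.items()
        # Structural constraint: same section, and never the target itself.
        if qid != target_id and q["section"] == target["section"]
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical data: q_income has an identical vector but sits in another section.
questions = {
    "q_emp_status": {"section": "employment", "embedding": [1.0, 0.0, 0.0]},
    "q_job_title": {"section": "employment", "embedding": [0.9, 0.1, 0.0]},
    "q_hours": {"section": "employment", "embedding": [0.0, 1.0, 0.0]},
    "q_income": {"section": "finances", "embedding": [1.0, 0.0, 0.0]},
}
ranked = similar_in_section("q_emp_status", questions, k=2)
```

Here `q_income` is excluded even though its vector matches the target exactly, which is the point of applying the structural constraint before ranking.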
Graph Context Windows
Pull out subgraphs as context for LLM prompts:
// Get 2-hop neighborhood for context
MATCH (q:QuestionItem {fragment_id: $id})
MATCH path = (q)-[*1..2]-(related)
RETURN path
You can serialize this subgraph and add it to prompts. It gives LLMs a sense of structure that flat document retrieval cannot.
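A minimal sketch of one possible serialization: flatten the subgraph into (source, relationship, target) triples and render one line per relationship. The triple format, the `Label:id` naming, and the `max_lines` cap are assumptions chosen for illustration; extracting triples from the returned paths is left out.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]


def serialize_triples(triples: Iterable[Triple], max_lines: int = 50) -> str:
    """Render (source, relationship, target) triples as de-duplicated prompt lines."""
    seen = set()
    lines: List[str] = []
    for source, rel, target in triples:
        if len(lines) >= max_lines:
            break  # keep the context within a fixed prompt budget
        key = (source, rel, target)
        if key in seen:
            continue  # overlapping paths repeat the same relationship
        seen.add(key)
        lines.append(f"({source}) -[{rel}]-> ({target})")
    return "\n".join(lines)


# Hypothetical triples; the third repeats the first, as overlapping paths would.
triples = [
    ("QuestionConstruct:qc_1", "ASKS_QUESTION", "QuestionItem:q_1"),
    ("QuestionItem:q_1", "USES_CODELIST", "CodeList:cl_1"),
    ("QuestionConstruct:qc_1", "ASKS_QUESTION", "QuestionItem:q_1"),
]
context = serialize_triples(triples)
```

De-duplication matters because a variable-length match returns one path per route, so the same relationship often appears in several paths.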
Reasoning Chains
Graph paths support step-by-step reasoning:
// Trace how a respondent reaches a specific question
MATCH path = (entry:EntryPoint)-[:HAS_CONSTRUCT*]->(qc:QuestionConstruct)
WHERE (qc)-[:ASKS_QUESTION]->(:QuestionItem {fragment_id: $target})
RETURN [node IN nodes(path) | labels(node)[0]] AS node_types,
length(path) AS steps
Practical Applications
Survey Chatbots
Build chat interfaces that know the survey structure:
User: "What questions come after the employment section?"
Bot: [Queries graph for sequences following employment-related constructs]
"After the employment questions, respondents answer 5 questions about
job satisfaction, then branch based on employment status..."
Automated Documentation
Generate documentation from graph structure:
def describe_instrument(instrument_id: str) -> str:
    """Generate natural language description of survey instrument."""
    # Assumes `driver` (a neo4j driver) and `llm` (any client with a generate() method) exist
    with driver.session() as session:
        result = session.run("""
            MATCH (i:Instrument {fragment_id: $id})-[:HAS_CONSTRUCT]->(seq:Sequence)
            MATCH (seq)-[:HAS_CONSTRUCT]->(c)
            RETURN seq.name AS section, labels(c)[0] AS type, count(*) AS count
            ORDER BY section
        """, id=instrument_id)
        # Read the rows while the session is still open
        structure = result.data()
    # Feed to LLM for natural language generation
    return llm.generate(structure)
Quality Validation
Validate survey instruments with queries an AI assistant can run and interpret:
// Find questions without response options
MATCH (q:QuestionItem)
WHERE NOT (q)-[:USES_CODELIST]->()
RETURN q.name, q.question_text AS potentially_missing_codelist
// Find constructs with no incoming flow relationship (run as a separate query)
MATCH (c)
WHERE (c:Sequence OR c:QuestionConstruct)
  AND NOT ()-[:HAS_CONSTRUCT|THEN|ELSE]->(c)
  AND NOT c:EntryPoint
RETURN labels(c)[0] AS type, c.fragment_id AS orphaned
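The second query above only checks for a missing incoming relationship, so a construct that hangs under an orphaned sequence still looks connected. A fuller check is a reachability search from the entry points. The sketch below runs that search over a plain edge list. The node ids and the idea of exporting `HAS_CONSTRUCT`, `THEN`, and `ELSE` relationships as (source, target) pairs are assumptions for illustration.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple


def unreachable_constructs(
    nodes: Iterable[str],
    edges: Iterable[Tuple[str, str]],
    entry_points: Iterable[str],
) -> List[str]:
    """Return node ids that no entry point can reach by following edges forward."""
    adjacency: Dict[str, List[str]] = {}
    for source, target in edges:
        adjacency.setdefault(source, []).append(target)

    starts = list(entry_points)
    visited: Set[str] = set(starts)
    queue = deque(starts)
    # Breadth-first search from every entry point.
    while queue:
        current = queue.popleft()
        for nxt in adjacency.get(current, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(nxt)
    return sorted(set(nodes) - visited)


# Hypothetical export: qc_2 has an incoming edge, but only from an orphaned sequence.
nodes = ["entry", "seq_main", "qc_1", "seq_orphan", "qc_2"]
edges = [("entry", "seq_main"), ("seq_main", "qc_1"), ("seq_orphan", "qc_2")]
orphans = unreachable_constructs(nodes, edges, ["entry"])
```

In this toy data the one-hop check would report only `seq_orphan`, while the search also reports `qc_2`.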
From Archive to AI Asset
Moving from DDI-L XML to a Neo4j graph changes what the metadata is used for:
| Archive Perspective | AI-Ready Perspective |
|---|---|
| Storage format | Knowledge representation |
| Documentation | Queryable structure |
| Metadata catalog | Reasoning substrate |
| Static reference | Dynamic context source |
Load DDI-L into Neo4j and survey metadata becomes a first-class part of AI pipelines. It is no longer just data to fetch. It is structure the model can reason over.
Getting Started
- Load your DDI-L files:
ddigraph bootstrap
ddigraph load instrument.xml
- Add embeddings for semantic search (optional):
// After generating embeddings externally
CALL db.index.vector.createNodeIndex(
'question_embeddings', 'QuestionItem', 'embedding', 384, 'cosine'
)
- Query for AI applications:
// RAG context retrieval
MATCH (q:QuestionItem)
WHERE q.question_text CONTAINS $search_term
MATCH path = (q)<-[:ASKS_QUESTION]-(:QuestionConstruct)<-[:HAS_CONSTRUCT]-(s:Sequence)
RETURN q, s, path
Further Reading
- Neo4j Graph Data Science
- DDI Lifecycle Specification
- LangChain Neo4j Integration
- Architecture - ddigraph design and components
- Relationships - Graph schema and relationship types