Graph vs Relational Databases for DDI Metadata¶
ddigraph uses a graph database (Neo4j), not a relational one. This page explains why. It covers the benefits for DDI metadata. It also lists points to weigh if your team is evaluating this approach.
The Challenge: DDI's Inherent Complexity¶
DDI (Data Documentation Initiative) metadata describes survey instruments, questionnaires, and variables. It also describes how they relate. The result is a highly connected structure:
- Questions reference code lists and categories
- Variables link to questions, concepts, and universes
- Control flow constructs (sequences, conditionals, loops) form execution paths
- Instruments contain nested hierarchies of constructs
A Typical DDI Query¶
Consider answering: "Which questions use a particular code list, and what variables are derived from those questions?"
This requires traversing:
CodeList → QuestionItem → QuestionConstruct → Variable
Relational Database Approach¶
Relational Schema Design¶
In a relational database, you might model this with normalized tables:
CREATE TABLE code_lists (
id VARCHAR PRIMARY KEY,
agency VARCHAR,
version VARCHAR,
label TEXT
);
CREATE TABLE question_items (
id VARCHAR PRIMARY KEY,
agency VARCHAR,
version VARCHAR,
label TEXT,
question_text TEXT,
code_list_id VARCHAR REFERENCES code_lists(id)
);
CREATE TABLE question_constructs (
id VARCHAR PRIMARY KEY,
question_item_id VARCHAR REFERENCES question_items(id)
);
CREATE TABLE variables (
id VARCHAR PRIMARY KEY,
label TEXT,
question_construct_id VARCHAR REFERENCES question_constructs(id)
);
The Query Problem¶
Finding all variables using a specific code list requires multiple JOINs:
SELECT v.*
FROM variables v
JOIN question_constructs qc ON v.question_construct_id = qc.id
JOIN question_items qi ON qc.question_item_id = qi.id
WHERE qi.code_list_id = 'cl-age-groups';
This seems manageable, but DDI complexity grows quickly:
- Variable-depth paths: Survey flow can nest arbitrarily deep (Sequence → Sequence → IfThenElse → Sequence → QuestionConstruct)
- Multiple relationship types: A single entity may have many different relationships
- Polymorphic references: ControlConstructReference can point to Sequence, QuestionConstruct, IfThenElse, Loop, etc.
- Cross-cutting concerns: Finding all constructs affected by a code list change requires traversing multiple paths
Relational Limitations¶
| Challenge | Relational Impact |
|---|---|
| Variable-depth traversal | Requires recursive CTEs or multiple queries |
| Path finding | Complex self-joins, hard to optimize |
| Schema evolution | Adding relationship types requires ALTER TABLE |
| Polymorphic references | Requires type columns or multiple foreign keys |
| Impact analysis | Expensive multi-table scans |
Example: Variable-Depth Query in SQL¶
Finding all constructs reachable from an instrument:
WITH RECURSIVE reachable AS (
-- Base case: start from instrument
SELECT child_id, child_type, 1 as depth
FROM control_construct_refs
WHERE parent_id = 'instrument-1'
UNION ALL
-- Recursive case: follow references
SELECT ccr.child_id, ccr.child_type, r.depth + 1
FROM control_construct_refs ccr
JOIN reachable r ON ccr.parent_id = r.child_id
WHERE r.depth < 20 -- Safety limit
)
SELECT * FROM reachable;
This query:
- Becomes slow as depth increases
- Requires careful index tuning
- May hit database recursion limits
- Doesn't easily filter by relationship type
Graph Database Approach¶
Graph Schema Design¶
In Neo4j, the same structure is expressed as nodes and relationships:
// Nodes
(:CodeList {fragment_id: "cl-1", label: "Age Groups"})
(:QuestionItem {fragment_id: "q-1", question_text: "What is your age?"})
(:QuestionConstruct {fragment_id: "qc-1"})
(:Variable {fragment_id: "v-1", label: "Age"})
// Relationships
(q:QuestionItem)-[:USES_CODELIST]->(cl:CodeList)
(qc:QuestionConstruct)-[:ASKS_QUESTION]->(q:QuestionItem)
(v:Variable)-[:DERIVED_FROM]->(qc:QuestionConstruct)
The Same Query, Simplified¶
Finding all variables using a specific code list:
MATCH (cl:CodeList {fragment_id: 'cl-age-groups'})
<-[:USES_CODELIST]-(q:QuestionItem)
<-[:ASKS_QUESTION]-(qc:QuestionConstruct)
<-[:DERIVED_FROM]-(v:Variable)
RETURN v
Variable-Depth Traversal¶
Finding all constructs reachable from an instrument:
MATCH (inst:Instrument {fragment_id: 'instrument-1'})
-[:HAS_CONSTRUCT*]->(construct)
RETURN construct
The * operator handles arbitrary depth naturally, without recursion limits or complex CTEs.
Benefits of Graph Databases for DDI¶
1. Natural Data Model¶
DDI metadata is a graph. A fragment is a self-contained DDI object. A reference is a pointer from one object to another. The XML defines nodes (fragments) and edges (references) directly:
<QuestionConstruct>
<QuestionReference> <!-- This IS an edge -->
<ID>question-1</ID>
</QuestionReference>
</QuestionConstruct>
A graph database stores this shape as is. A relational database must encode it in an indirect way.
2. Query Performance for Traversals¶
| Operation | Relational | Graph |
|---|---|---|
| Single JOIN | Fast | Fast |
| 3-4 JOINs | Moderate | Fast |
| Variable-depth traversal | Slow/Complex | Fast |
| Path finding | Very slow | Optimized |
| Pattern matching | Manual coding | Native support |
Graph databases use index-free adjacency. Each node points straight to its neighbors. So traversal time scales with the result size, not the total data size.
3. Schema Flexibility¶
Adding a new relationship type in a relational database:
ALTER TABLE question_items
ADD COLUMN instruction_id VARCHAR REFERENCES instructions(id);
-- Plus migration scripts, ORM updates, etc.
Adding a new relationship type in Neo4j:
MATCH (q:QuestionItem {fragment_id: 'q-1'})
MATCH (i:Instruction {fragment_id: 'inst-1'})
CREATE (q)-[:HAS_INSTRUCTION]->(i)
No schema migration required. New relationship types can coexist with existing data.
4. Expressive Query Language¶
Cypher expresses graph patterns directly:
// Find questions with conditional branching that use a specific code list
MATCH (cl:CodeList {fragment_id: 'cl-yes-no'})
<-[:USES_CODELIST]-(q:QuestionItem)
<-[:ASKS_QUESTION]-(qc:QuestionConstruct)
<-[:HAS_CONSTRUCT]-(seq:Sequence)
<-[:THEN|ELSE]-(ite:IfThenElse)
RETURN q.fragment_id, q.question_text, ite.condition
5. Impact Analysis¶
Understanding what's affected by a change:
// Find everything that depends on a code list
MATCH (cl:CodeList {fragment_id: 'cl-occupation'})
MATCH path = (cl)<-[*1..5]-(dependent)
RETURN DISTINCT labels(dependent)[0] as type,
dependent.fragment_id as id,
length(path) as distance
ORDER BY distance
Potential Drawbacks and Considerations¶
1. Learning Curve¶
Challenge: Teams familiar with SQL must learn Cypher and graph thinking.
Mitigation:
- Cypher syntax is intuitive for pattern matching
- ddigraph provides pre-built queries in the audit scripts
- Documentation includes common query patterns
2. Operational Complexity¶
Challenge: Running Neo4j requires additional infrastructure.
Mitigation:
- Neo4j Aura provides managed cloud hosting
- Docker makes local deployment simple
- Neo4j Community Edition is free for single-instance use
3. Tooling Ecosystem¶
Challenge: Fewer BI tools natively support graph databases.
Mitigation:
- Neo4j has JDBC/ODBC connectors for BI tools
- ddigraph's adapter pattern allows export to pandas, JSON, CSV
- Graph results can be flattened for traditional reporting
4. Transaction Semantics¶
Challenge: Graph databases optimize for reads; write-heavy workloads may differ.
Mitigation:
- DDI metadata is read-heavy (load once, query many times)
- ddigraph uses batched writes with UNWIND for efficient ingestion
- Neo4j supports ACID transactions
5. Aggregate Queries¶
Challenge: Some aggregate operations are more natural in SQL.
Example: Counting variables per dataset is simpler in SQL:
SELECT dataset_id, COUNT(*) FROM variables GROUP BY dataset_id;
Cypher equivalent:
MATCH (d:Dataset)<-[:IN_DATASET]-(v:Variable)
RETURN d.id, count(v)
Both work, but SQL's GROUP BY may feel more familiar.
6. Data Volume Considerations¶
Challenge: Very large datasets (millions of nodes) require tuning.
Mitigation:
- ddigraph creates indexes on key fields
- Batch sizes are configurable
- Neo4j Enterprise offers clustering for scale
When to Use Each Approach¶
Graph Database (Neo4j) is Better When¶
- Queries frequently traverse relationships
- Path analysis is important (reachability, impact)
- Schema evolves with new relationship types
- You need to visualize metadata structure
- Integration with other graph data (SDMX, ontologies)
Relational Database May Be Better When¶
- Queries are primarily aggregations and counts
- Team has no graph database experience and limited learning time
- Existing infrastructure is purely relational
- Data volume is small and queries are simple
- Compliance requires specific database technologies
Hybrid Approaches¶
You don't have to choose exclusively. Common patterns include:
1. Graph for Navigation, Relational for Reporting¶
Use Neo4j for structural queries and path analysis, but sync summaries to a relational data warehouse for BI dashboards.
2. Graph as Metadata Layer¶
Keep survey response data in a relational database. Use Neo4j for the metadata that describes that data. Link the two by dataset and variable IDs.
3. Export for Analysis¶
Use ddigraph's adapter pattern to export graph data to pandas DataFrames or CSV files for teams more comfortable with those tools:
python demo/load_pandas.py # Export to DataFrames
python demo/export_files.py # Export to JSON/CSV
Query Comparison Reference¶
| Task | SQL | Cypher |
|---|---|---|
| Find by ID | SELECT * FROM questions WHERE id = 'q1' |
MATCH (q:QuestionItem {fragment_id: 'q1'}) RETURN q |
| Simple JOIN | SELECT * FROM questions q JOIN code_lists cl ON q.code_list_id = cl.id |
MATCH (q:QuestionItem)-[:USES_CODELIST]->(cl) RETURN q, cl |
| Variable depth | Recursive CTE (complex) | MATCH path = (a)-[*1..5]->(b) RETURN path |
| Shortest path | Not native | MATCH p = shortestPath((a)-[*]-(b)) RETURN p |
| All paths | Very complex | MATCH p = allShortestPaths((a)-[*]-(b)) RETURN p |
| Pattern match | Multiple JOINs + conditions | MATCH (a)-[:R1]->(b)-[:R2]->(c) WHERE ... |
Conclusion¶
For DDI metadata, graph databases win on three fronts. Queries are easier to write. Traversals run faster. The schema is more flexible. The trade-offs are a learning curve and more setup. Both stay manageable, especially given:
- DDI's inherently graph-like structure
- The importance of relationship traversal in metadata use cases
- Neo4j's mature tooling and cloud options
- ddigraph's adapter pattern for integration flexibility
Teams should evaluate based on their specific query patterns, existing infrastructure, and willingness to invest in graph database skills.