FAQ
Frequently asked questions about ddigraph.
What DDI formats are supported?
ddigraph supports three DDI formats:
| Format | DDI Version | Root Element | Use Case |
|---|---|---|---|
| DDI Codebook | 2.5 / 2.6 | <codeBook> | Survey archives, data catalogs, variable-level metadata |
| DDI Lifecycle | 3.2 / 3.3 | <FragmentInstance> | Questionnaire flows, CAPI/CAWI instruments, complex study designs |
| DDI-CDI | 1.0 | <Wrapper> | Cross-domain integration, variable harmonisation, concept mapping |
The CLI auto-detects the format by inspecting the XML root element and namespace. You can also
force a specific format with --format codebook or --format lifecycle.
# Auto-detect (default)
ddigraph load survey.xml --dataset-id demo
# Explicit format
ddigraph load survey.xml --format codebook --dataset-id demo
ddigraph load questionnaire.xml --format lifecycle
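Root-element detection of this kind can be approximated with the standard library. The following is a minimal sketch, not ddigraph's actual detection code: `detect_ddi_format` and `ROOT_TO_FORMAT` are hypothetical names, the `cdi` label is illustrative rather than a confirmed `--format` value, and only the root element's local name is checked (the namespace is ignored here).

```python
from __future__ import annotations

import io
import xml.etree.ElementTree as ET

# Root element names come from the table above; the labels on the right are
# illustrative and not confirmed ddigraph identifiers.
ROOT_TO_FORMAT = {
    "codeBook": "codebook",
    "FragmentInstance": "lifecycle",
    "Wrapper": "cdi",
}


def detect_ddi_format(source) -> str | None:
    """Return a format label based on the root element, or None if unknown.

    `source` may be a file path or a binary file-like object. Only the first
    start tag is inspected, so the rest of the file is never parsed.
    """
    for _event, elem in ET.iterparse(source, events=("start",)):
        # Namespaced tags look like "{namespace-uri}localname".
        local_name = elem.tag.rsplit("}", 1)[-1]
        return ROOT_TO_FORMAT.get(local_name)
    return None


# The namespace below is only there to show that it is stripped.
sample = io.BytesIO(b'<codeBook xmlns="ddi:codebook:2_5"><stdyDscr/></codeBook>')
print(detect_ddi_format(sample))
```

Returning `None` for an unknown root leaves the decision to the caller, which mirrors the option of forcing a format explicitly when detection is inconclusive.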
Do I need Neo4j to use ddigraph?
No. While Neo4j is the default and most fully supported backend, ddigraph supports a multi-backend adapter architecture. You can load DDI data into:
- Neo4j -- full support with Bolt driver, schema bootstrap, UNWIND batching
- RDF / SPARQL -- via rdflib, with export to Turtle, N-Triples, RDF/XML, JSON-LD
- Gremlin -- compatible with JanusGraph, Amazon Neptune, Azure Cosmos DB
- NetworkX -- in-memory graph for local analysis, no external services needed
- pandas -- DataFrames for tabular analysis and Excel/CSV export
Quick start without a database
For exploration and prototyping, the NetworkX backend requires no external services:
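import networkx as nx
from ddigraph import DDIFragmentParser

# Build an in-memory graph; no external services are needed.
G = nx.MultiDiGraph()
parser = DDIFragmentParser()
for fragment in parser.parse("survey.xml"):
    # Each parsed fragment becomes a node labelled with its element type.
    G.add_node(fragment.fragment_id, label=fragment.element_type)
print(f"Loaded {G.number_of_nodes()} nodes")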
See Graph Backends for detailed setup guides for each backend.
How do I handle large files?
ddigraph is designed for large files. The streaming iterparse-based parser processes XML
incrementally, so memory usage stays bounded regardless of file size.
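To illustrate why incremental parsing keeps memory roughly constant, here is a minimal standard-library sketch of the iterparse pattern. It is not ddigraph's parser; `count_elements` is a hypothetical helper that only demonstrates the process-then-clear idea.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(source) -> Counter:
    """Count elements by local tag name while streaming through the XML."""
    counts: Counter = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag.rsplit("}", 1)[-1]] += 1
        # Drop the element's text, attributes and children once processed so
        # fully parsed subtrees do not pile up in memory. (The emptied element
        # itself stays attached to its parent until the parent is cleared.)
        elem.clear()
    return counts


xml_bytes = b"<codeBook><var/><var/><var/></codeBook>"
print(count_elements(io.BytesIO(xml_bytes)))
```

The key point is that each element is handled on its `end` event and then released, rather than building the whole tree before doing any work.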
Key strategies for large files:
- Tune batch size: Increase --chunk-size to reduce transaction overhead for large files. Start with 500-1000 for files with 10,000+ elements.
  ddigraph load large_survey.xml --dataset-id demo --chunk-size 1000
- Increase writer concurrency (Codebook only): Allow multiple batches to flush in parallel.
  ddigraph load large_codebook.xml --dataset-id demo \
    --chunk-size 1000 --writer-concurrency 4 --queue-maxsize 4
- Monitor with batch metrics: Enable per-batch timing to identify bottlenecks.
  ddigraph load large_survey.xml --dataset-id demo --batch-metrics
- Increase transaction timeout: Large batches may exceed the default server-side timeout.
  export DDIGRAPH_TRANSACTION_TIMEOUT=60
See Performance Tuning for detailed recommendations.
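The interplay between `--queue-maxsize` and `--writer-concurrency` can be pictured as a bounded producer/consumer pipeline. The sketch below shows that general pattern with asyncio and is an assumption about how such flags typically behave, not a copy of ddigraph's internals; `run_pipeline` is a hypothetical name and the database write is replaced by a no-op.

```python
import asyncio


async def run_pipeline(batches, *, writer_concurrency: int, queue_maxsize: int) -> int:
    """Push batches through a bounded queue drained by several writer tasks."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=queue_maxsize)
    written = 0

    async def writer() -> None:
        nonlocal written
        while True:
            batch = await queue.get()
            try:
                await asyncio.sleep(0)  # stand-in for a real database write
                written += len(batch)
            finally:
                queue.task_done()

    workers = [asyncio.create_task(writer()) for _ in range(writer_concurrency)]
    for batch in batches:
        # put() suspends while the queue is full. That is the back-pressure:
        # the producer cannot get more than queue_maxsize batches ahead.
        await queue.put(batch)
    await queue.join()  # wait until every queued batch has been written
    for worker in workers:
        worker.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return written


batches = [list(range(10)) for _ in range(5)]
print(asyncio.run(run_pipeline(batches, writer_concurrency=2, queue_maxsize=2)))
```

In a pipeline shaped like this, a larger queue lets parsing run further ahead of the writers at the cost of holding more batches in memory.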
Why is ingestion slow?
Common causes of slow ingestion and their solutions:
1. Missing indexes or constraints
Always run schema bootstrap before loading data:
ddigraph bootstrap
Without indexes, Neo4j performs full scans for every MERGE operation.
2. Batch size too small
The default chunk_size of 200 is conservative. For large files, increase it:
ddigraph load file.xml --dataset-id demo --chunk-size 1000
3. Network latency to Neo4j
For cloud instances (Neo4j Aura, remote clusters), network round-trips dominate. Use larger batches and increase retry tolerance:
ddigraph load file.xml --dataset-id demo \
--chunk-size 500 \
--write-retry-attempts 5 \
--write-retry-base-delay 2.0
4. Neo4j server resources
Check that the Neo4j instance has sufficient memory and CPU. The Neo4j browser
(http://localhost:7474) shows active queries and resource usage.
DDI-L performance
DDI-L FragmentInstance ingestion uses UNWIND batching, reducing Neo4j queries to ~30 for a typical file.
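For readers unfamiliar with UNWIND batching, the idea is to send many rows as one query parameter instead of issuing one query per node. The sketch below only builds the Cypher text and parameters; it does not reflect ddigraph's real queries, and the `build_unwind_statements` helper, the `id` property, and the example labels are all hypothetical.

```python
from __future__ import annotations

from collections import defaultdict


def build_unwind_statements(nodes: list[dict]) -> list[tuple[str, dict]]:
    """Group nodes by label and emit one UNWIND statement per label."""
    rows_by_label: dict[str, list[dict]] = defaultdict(list)
    for node in nodes:
        rows_by_label[node["label"]].append(
            {"id": node["id"], "props": node["properties"]}
        )

    statements = []
    for label, rows in rows_by_label.items():
        # Labels cannot be passed as Cypher parameters, so the label is
        # interpolated; it must come from a trusted, fixed vocabulary.
        query = (
            f"UNWIND $rows AS row "
            f"MERGE (n:`{label}` {{id: row.id}}) "
            f"SET n += row.props"
        )
        statements.append((query, {"rows": rows}))
    return statements


nodes = [
    {"label": "Variable", "id": "v1", "properties": {"name": "age"}},
    {"label": "Variable", "id": "v2", "properties": {"name": "sex"}},
    {"label": "Question", "id": "q1", "properties": {"text": "How old are you?"}},
]
for query, params in build_unwind_statements(nodes):
    print(len(params["rows"]), query)
```

With this shape, the number of queries depends on the number of distinct labels (and batches) rather than on the number of nodes.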
How do I connect to Neo4j Aura (cloud)?
Neo4j Aura uses encrypted connections. Use the neo4j+s:// URI scheme, which enables TLS
automatically:
export DDIGRAPH_NEO4J_URI=neo4j+s://xxxx.databases.neo4j.io
export DDIGRAPH_NEO4J_USER=neo4j
export DDIGRAPH_NEO4J_PASSWORD=your-aura-password
Then run commands as usual:
ddigraph bootstrap
ddigraph load survey.xml --dataset-id demo
Aura tuning
Cloud instances have higher network latency. Consider these settings:
export DDIGRAPH_CONNECTION_TIMEOUT=10
export DDIGRAPH_CHUNK_SIZE=200
export DDIGRAPH_WRITE_RETRY_ATTEMPTS=5
export DDIGRAPH_WRITE_RETRY_BASE_DELAY=2.0
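The retry settings are easier to reason about with a concrete schedule in mind. The sketch below assumes exponential backoff (the delay doubles after each failed attempt) and retries only on `ConnectionError`; ddigraph's actual schedule and retryable errors are not documented here, and `retry_with_backoff` is a hypothetical helper.

```python
import time


def retry_with_backoff(operation, *, attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call operation(), retrying on ConnectionError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Assumed schedule: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * (2 ** attempt))


# Demo: fail twice, then succeed. Delays are recorded instead of slept.
state = {"calls": 0}


def flaky_write():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("transient network error")
    return "ok"


recorded_delays = []
print(retry_with_backoff(flaky_write, sleep=recorded_delays.append), recorded_delays)
```

Under this assumed schedule, five attempts with a base delay of 2.0 seconds would wait at most 2 + 4 + 8 + 16 = 30 seconds in total before giving up.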
You do not need --encrypted or --trusted-certificates when using the neo4j+s:// scheme,
as TLS is handled by the URI scheme itself.
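A quick way to check which behaviour a given URI implies is to look at its scheme. The helper below is a hypothetical sketch (`scheme_enables_tls` is not part of ddigraph); the scheme names themselves are the ones accepted by the official Neo4j drivers.

```python
from urllib.parse import urlparse

# "+s" enables TLS with full certificate verification;
# "+ssc" enables TLS and also accepts self-signed certificates.
ENCRYPTED_SCHEMES = {"neo4j+s", "neo4j+ssc", "bolt+s", "bolt+ssc"}
# Plain schemes do not enable TLS by themselves.
PLAIN_SCHEMES = {"neo4j", "bolt"}


def scheme_enables_tls(uri: str) -> bool:
    """Return True if the URI scheme alone switches on TLS."""
    scheme = urlparse(uri).scheme
    if scheme in ENCRYPTED_SCHEMES:
        return True
    if scheme in PLAIN_SCHEMES:
        return False
    raise ValueError(f"Unrecognised Neo4j URI scheme: {scheme!r}")


print(scheme_enables_tls("neo4j+s://xxxx.databases.neo4j.io"))
print(scheme_enables_tls("bolt://localhost:7687"))
```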
Can I use ddigraph without async?
The GraphWriteAdapter protocol accepts both synchronous and asynchronous implementations.
Your adapter can return None (sync) or Awaitable[None] (async):
from ddigraph.schema.adapter import GraphWriteAdapter
# Synchronous adapter
class SyncAdapter:
    def write_batch(self, graph, **kwargs) -> None:
        for node in graph.nodes():
            self.backend.create_node(node["label"], node["properties"])

    def purge_dataset(self, dataset_id, **kwargs) -> None:
        self.backend.delete_by_dataset(dataset_id)
The CLI and built-in loaders use asyncio.run() internally, so you do not need to manage an
event loop yourself when using the command line. The async machinery is transparent.
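One common way for a caller to accept both kinds of adapter is to await the return value only when it is awaitable. The sketch below shows that pattern under the assumption that ddigraph does something similar; `call_write_batch` and the two recording adapters are hypothetical and exist only for the demonstration.

```python
import asyncio
import inspect


async def call_write_batch(adapter, graph) -> None:
    """Invoke adapter.write_batch and await the result only if needed."""
    result = adapter.write_batch(graph)
    if inspect.isawaitable(result):
        await result


class RecordingSyncAdapter:
    def __init__(self) -> None:
        self.seen = []

    def write_batch(self, graph, **kwargs) -> None:
        self.seen.append(graph)


class RecordingAsyncAdapter:
    def __init__(self) -> None:
        self.seen = []

    async def write_batch(self, graph, **kwargs) -> None:
        await asyncio.sleep(0)  # stand-in for asynchronous I/O
        self.seen.append(graph)


async def demo():
    sync_adapter, async_adapter = RecordingSyncAdapter(), RecordingAsyncAdapter()
    await call_write_batch(sync_adapter, "batch-1")
    await call_write_batch(async_adapter, "batch-1")
    return sync_adapter.seen, async_adapter.seen


print(asyncio.run(demo()))
```

A synchronous adapter called this way still blocks the event loop while it writes, which is usually acceptable for scripts but worth knowing for concurrent pipelines.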
How do I add a custom graph backend?
Implement the GraphWriteAdapter protocol with two methods: write_batch and purge_dataset.
from ddigraph.schema.adapter import GraphWriteAdapter
from ddigraph.schema.ddi_graph import DDIIngestGraph
class MyBackendAdapter:
    # self.backend stands for your own client object (not shown here).
    async def write_batch(
        self,
        graph: DDIIngestGraph,
        *,
        session_config: dict[str, object] | None = None,
        transaction_config: dict[str, object] | None = None,
    ) -> None:
        # Write nodes before relationships so both endpoints already exist.
        for node in graph.nodes():
            await self.backend.create(node["label"], node["properties"])
        for rel in graph.relationships():
            await self.backend.link(rel["start"], rel["end"], rel["type"])

    async def purge_dataset(
        self,
        dataset_id: str,
        *,
        session_config: dict[str, object] | None = None,
        transaction_config: dict[str, object] | None = None,
    ) -> None:
        # Remove everything previously written for this dataset.
        await self.backend.delete_all(dataset_id)
Then pass your adapter to the loader:
from ddigraph.ingest.loader import DDILoader
adapter = MyBackendAdapter()
loader = DDILoader(driver=None, adapter=adapter)
See Adapter Architecture for complete examples and Contributing for guidelines on submitting new backends.
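Because the adapter is described as a protocol, a backend class only needs the right methods and does not have to inherit from anything. The sketch below illustrates that idea with a stand-in protocol; `WriteAdapterLike` is hypothetical and merely mirrors the two method names from this FAQ, so it is not the real `GraphWriteAdapter` definition.

```python
from __future__ import annotations

from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class WriteAdapterLike(Protocol):
    """Hypothetical stand-in with the two methods named in this FAQ."""

    def write_batch(self, graph: Any, **kwargs: Any) -> Any: ...

    def purge_dataset(self, dataset_id: str, **kwargs: Any) -> Any: ...


class InMemoryAdapter:
    """Keeps written batches in a list, which is handy for unit tests."""

    def __init__(self) -> None:
        self.batches: list[Any] = []

    def write_batch(self, graph: Any, **kwargs: Any) -> None:
        self.batches.append(graph)

    def purge_dataset(self, dataset_id: str, **kwargs: Any) -> None:
        self.batches.clear()


class IncompleteAdapter:
    def write_batch(self, graph: Any, **kwargs: Any) -> None:
        pass


# isinstance() on a runtime-checkable protocol only checks that the methods
# exist; it does not validate their signatures.
print(isinstance(InMemoryAdapter(), WriteAdapterLike))
print(isinstance(IncompleteAdapter(), WriteAdapterLike))
```

An in-memory adapter like this can be a convenient first target when developing a new backend, since it needs no external service.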
What Python versions are supported?
ddigraph supports Python 3.12-3.14. This is specified
in pyproject.toml:
requires-python = ">=3.12,<3.15"
The 3.12 minimum is required for:
- Modern type statement syntax and type aliases
- collections.abc improvements used in protocol definitions
- Performance improvements in asyncio
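If you want to check an interpreter against this range before installing, the comparison is a simple tuple test. This is a minimal sketch; `python_is_supported` is a hypothetical helper, and the bounds are copied from the `requires-python` line above.

```python
import sys

MIN_VERSION = (3, 12)            # inclusive lower bound
MAX_VERSION_EXCLUSIVE = (3, 15)  # exclusive upper bound


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the (major, minor) version falls in the supported range."""
    return MIN_VERSION <= tuple(version_info[:2]) < MAX_VERSION_EXCLUSIVE


print(python_is_supported())
```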
How do I migrate from neo4ddi to ddigraph?
The package was renamed from neo4ddi to ddigraph to reflect multi-backend support.
Step 1: Update the package
pip uninstall neo4ddi
pip install ddigraph
Step 2: Update imports
# Before
from neo4ddi.config import Settings
from neo4ddi.ingest.loader import DDILoader
# After
from ddigraph.config import Settings
from ddigraph.ingest.loader import DDILoader
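For a codebase with many modules, the import change can be scripted. The sketch below is a hypothetical helper, not a tool shipped with ddigraph: it rewrites only `import neo4ddi...` and `from neo4ddi...` at the start of a line and deliberately leaves strings, comments, and multi-module imports such as `import os, neo4ddi` alone, so review the result before committing.

```python
import re

# Matches the package name directly after a leading "import" or "from".
_IMPORT_RE = re.compile(r"^([ \t]*(?:from|import)[ \t]+)neo4ddi\b", re.MULTILINE)


def rewrite_imports(source: str) -> str:
    """Return source with leading neo4ddi imports renamed to ddigraph."""
    return _IMPORT_RE.sub(r"\g<1>ddigraph", source)


before = (
    "from neo4ddi.config import Settings\n"
    "from neo4ddi.ingest.loader import DDILoader\n"
)
print(rewrite_imports(before))
```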
Step 3: Update environment variables
The DDIGRAPH_ prefix is now preferred, but NEO4DDI_ is still recognized:
# Before
export NEO4DDI_NEO4J_URI=bolt://localhost:7687
# After (preferred)
export DDIGRAPH_NEO4J_URI=bolt://localhost:7687
Step 4: Update CLI usage
The CLI command changed from neo4ddi to ddigraph:
# Before
neo4ddi load survey.xml --dataset-id demo
# After
ddigraph load survey.xml --dataset-id demo
Backward compatibility
The NEO4DDI_ environment variable prefix and bare NEO4J_ variables are still recognized
for backward compatibility. The DDIGRAPH_ prefix takes precedence when multiple prefixes
are set.
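The precedence rule can be pictured as a lookup that tries each prefix in turn. This is a sketch under stated assumptions rather than ddigraph's settings code: `resolve_setting` is a hypothetical helper, and while the text above says DDIGRAPH_ wins, the relative order of the two legacy forms is assumed here.

```python
import os

# Highest priority first. Only "DDIGRAPH_ takes precedence" is documented;
# placing NEO4DDI_ ahead of the bare variable is an assumption of this sketch.
PREFIXES = ("DDIGRAPH_", "NEO4DDI_", "")


def resolve_setting(name, env=os.environ, default=None):
    """Look up e.g. NEO4J_URI under each prefix and return the first match."""
    for prefix in PREFIXES:
        value = env.get(prefix + name)
        if value is not None:
            return value
    return default


example_env = {
    "NEO4DDI_NEO4J_URI": "bolt://legacy-host:7687",
    "DDIGRAPH_NEO4J_URI": "bolt://localhost:7687",
}
print(resolve_setting("NEO4J_URI", env=example_env))
```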
Troubleshooting: Connection Errors
"ServiceUnavailable: Unable to retrieve routing information"
This usually means ddigraph cannot reach the Neo4j server.
- Check that Neo4j is running:
  # Docker
  docker ps | grep neo4j
  # Or try connecting directly
  curl -s http://localhost:7474
- Verify the URI: Make sure DDIGRAPH_NEO4J_URI uses the correct protocol and port:
  - Local: bolt://localhost:7687
  - Aura: neo4j+s://xxxx.databases.neo4j.io
- Check credentials: Verify username and password are correct.
- Firewall/network: Ensure port 7687 (Bolt) is open and accessible.
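The port check from the list above can also be done from Python, which helps when the failing process runs somewhere without `curl`. This is a generic TCP reachability sketch; `port_is_open` is a hypothetical helper, and a successful connection only proves that something is listening, not that it is Neo4j.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


print(port_is_open("localhost", 7687))
```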
"AuthError: The client is unauthorized"
The username or password is incorrect. Double-check:
echo $DDIGRAPH_NEO4J_USER
echo $DDIGRAPH_NEO4J_PASSWORD
"ClientError: database not found"
The specified database does not exist. Check DDIGRAPH_NEO4J_DATABASE (default: neo4j).
Neo4j Community Edition only supports the default neo4j database.
Troubleshooting: Memory Errors
"MemoryError" or process killed during parsing
ddigraph uses streaming XML parsing (iterparse), which should keep memory usage bounded. If
you still hit memory limits:
- Reduce batch size:
  ddigraph load file.xml --dataset-id demo --chunk-size 50
- Reduce queue size (Codebook format):
  ddigraph load file.xml --dataset-id demo --queue-maxsize 1
- Reduce writer concurrency:
  ddigraph load file.xml --dataset-id demo --writer-concurrency 1
- Check for non-streaming parsing: If you are using a custom adapter, make sure you are not accumulating all nodes in memory. Process and discard batches incrementally.
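Processing batches incrementally in a custom adapter usually comes down to consuming an iterator in fixed-size slices instead of building one large list. The following is a minimal sketch; `iter_batches` and `fake_nodes` are hypothetical helpers and not part of ddigraph.

```python
from itertools import islice
from typing import Iterable, Iterator


def iter_batches(items: Iterable, chunk_size: int) -> Iterator[list]:
    """Yield lists of at most chunk_size items without materialising the input."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(items)
    # islice pulls only the next chunk_size items; an empty list ends the loop.
    while batch := list(islice(iterator, chunk_size)):
        yield batch


def fake_nodes(count: int):
    """Stand-in for a streaming source of parsed nodes."""
    for index in range(count):
        yield {"id": index}


for batch in iter_batches(fake_nodes(7), chunk_size=3):
    # Write the batch to the backend here, then let it go out of scope.
    print(len(batch))
```

At any moment only one batch is held in memory, so peak usage scales with the chunk size rather than with the size of the file.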
Monitoring memory
Use --batch-metrics to see per-batch statistics. If batch sizes look unexpectedly large,
your chunk_size may be too high for the available memory.
How is ddigraph different from other DDI tools?
ddigraph is specifically designed for graph-based representations of DDI metadata. Here is how it compares to other approaches:
| Aspect | ddigraph | Colectica | NESSTAR | Custom scripts |
|---|---|---|---|---|
| Output | Knowledge graph (Neo4j, RDF, etc.) | Relational database | Proprietary format | Varies |
| DDI formats | Codebook + Lifecycle + CDI | Lifecycle | Codebook | Usually one |
| Graph backends | 5 (Neo4j, RDF, Gremlin, NetworkX, pandas) | N/A | N/A | Usually one |
| Streaming | Yes (bounded memory) | Yes | Varies | Usually no |
| Open source | Yes (MIT) | No (commercial) | No (discontinued) | Varies |
| Extensible | Adapter protocol | Plugin API | No | By definition |
Key differentiators:
- Graph-native: ddigraph preserves the inherent graph structure of DDI metadata (variables linked to questions linked to concepts linked to code lists), making traversals and path queries natural.
- Multi-backend: A single parsing pipeline feeds any graph backend through the GraphWriteAdapter interface.
- Streaming architecture: Memory usage is bounded regardless of file size, making it suitable for large national survey archives.
- Production-ready: Built-in retry logic, structured logging, and configurable back-pressure support automated pipelines and CI/CD workflows.
Still have questions? Open an issue on GitHub.