FAQ

Frequently asked questions about ddigraph.

What DDI formats are supported?

ddigraph supports three DDI formats:

| Format | DDI Version | Root Element | Use Case |
| --- | --- | --- | --- |
| DDI Codebook | 2.5 / 2.6 | `<codeBook>` | Survey archives, data catalogs, variable-level metadata |
| DDI Lifecycle | 3.2 / 3.3 | `<FragmentInstance>` | Questionnaire flows, CAPI/CAWI instruments, complex study designs |
| DDI-CDI | 1.0 | `<Wrapper>` | Cross-domain integration, variable harmonisation, concept mapping |

The CLI auto-detects the format by inspecting the XML root element and namespace. You can also force a specific format with --format codebook or --format lifecycle.

# Auto-detect (default)
ddigraph load survey.xml --dataset-id demo

# Explicit format
ddigraph load survey.xml --format codebook --dataset-id demo
ddigraph load questionnaire.xml --format lifecycle
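
Root-element detection of this kind can be sketched with the standard library alone. The snippet below is an illustration, not ddigraph's actual detection code: the function name `detect_ddi_format` and the `"cdi"` label are assumptions made for the example, and only the root element names are taken from the table above.

```python
import io
import xml.etree.ElementTree as ET

# Root element names come from the format table; the "cdi" label is a
# placeholder, since only "codebook" and "lifecycle" are documented values.
ROOT_TO_FORMAT = {
    "codeBook": "codebook",
    "FragmentInstance": "lifecycle",
    "Wrapper": "cdi",
}


def detect_ddi_format(source):
    """Return (format, namespace) based on the XML root element only."""
    # The first "start" event is always the root element, so the function
    # returns before the rest of a large file is read.
    for _event, elem in ET.iterparse(source, events=("start",)):
        if elem.tag.startswith("{"):
            namespace, _, local_name = elem.tag[1:].partition("}")
        else:
            namespace, local_name = "", elem.tag
        return ROOT_TO_FORMAT.get(local_name, "unknown"), namespace
    return "unknown", ""


sample = io.BytesIO(b'<codeBook xmlns="ddi:codebook:2_5"><stdyDscr/></codeBook>')
print(detect_ddi_format(sample))  # ('codebook', 'ddi:codebook:2_5')
```

Returning the namespace alongside the format makes it possible to tell DDI versions apart later without opening the file a second time.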

Do I need Neo4j to use ddigraph?

No. Neo4j is the default and most fully supported backend, but ddigraph has a multi-backend adapter architecture. You can load DDI data into:

Neo4j -- full support with Bolt driver, schema bootstrap, UNWIND batching
RDF / SPARQL -- via rdflib, with export to Turtle, N-Triples, RDF/XML, JSON-LD
Gremlin -- compatible with JanusGraph, Amazon Neptune, Azure Cosmos DB
NetworkX -- in-memory graph for local analysis, no external services needed
pandas -- DataFrames for tabular analysis and Excel/CSV export

Quick start without a database

For exploration and prototyping, the NetworkX backend requires no external services:

import networkx as nx
from ddigraph import DDIFragmentParser

G = nx.MultiDiGraph()
parser = DDIFragmentParser()
for fragment in parser.parse("survey.xml"):
    G.add_node(fragment.fragment_id, label=fragment.element_type)
print(f"Loaded {G.number_of_nodes()} nodes")

See Graph Backends for detailed setup guides for each backend.

How do I handle large files?

ddigraph is designed for large files. The streaming iterparse-based parser processes XML incrementally, so memory usage stays bounded regardless of file size.
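
The general iterparse pattern behind this claim looks roughly like the sketch below. It uses only the standard library and is not ddigraph's parser; `stream_elements` is a name invented for the example.

```python
import io
import xml.etree.ElementTree as ET


def stream_elements(source, local_name):
    """Yield each completed element with the given local name, then free it."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Tags look like "{namespace}local"; compare the local part only.
        if elem.tag.rpartition("}")[2] == local_name:
            yield elem
            # Runs when the caller asks for the next element: drop the
            # subtree so finished elements do not pile up in memory.
            elem.clear()


xml_bytes = b"<codeBook><dataDscr><var name='age'/><var name='sex'/></dataDscr></codeBook>"
names = [var.get("name") for var in stream_elements(io.BytesIO(xml_bytes), "var")]
print(names)  # ['age', 'sex']
```

Each element is cleared only after the caller has finished with it. The emptied element shells stay attached to their parent, so a production parser would usually detach them as well.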

Key strategies for large files:

Tune batch size: Increase --chunk-size to reduce transaction overhead for large files. Start with 500-1000 for files with 10,000+ elements.

```
ddigraph load large_survey.xml --dataset-id demo --chunk-size 1000
```

Increase writer concurrency (Codebook only): Allow multiple batches to flush in parallel.

ddigraph load large_codebook.xml --dataset-id demo \
    --chunk-size 1000 --writer-concurrency 4 --queue-maxsize 4

Monitor with batch metrics: Enable per-batch timing to identify bottlenecks.

```
ddigraph load large_survey.xml --dataset-id demo --batch-metrics
```

Increase transaction timeout: Large batches may exceed the default server-side timeout.

```
export DDIGRAPH_TRANSACTION_TIMEOUT=60
```
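
To see how writer concurrency and queue size interact, here is a minimal asyncio sketch of a bounded producer/consumer pipeline. It assumes the flags map onto a bounded queue feeding several writer tasks, which the option names suggest but this FAQ does not spell out. `run_pipeline` is a hypothetical helper, not part of ddigraph.

```python
import asyncio


async def run_pipeline(batches, writer_concurrency=4, queue_maxsize=4):
    """Feed batches to concurrent writers through a bounded queue."""
    queue = asyncio.Queue(maxsize=queue_maxsize)
    written = []

    async def writer():
        while True:
            batch = await queue.get()
            try:
                await asyncio.sleep(0)  # stand-in for a database write
                written.append(batch)
            finally:
                queue.task_done()

    workers = [asyncio.create_task(writer()) for _ in range(writer_concurrency)]
    for batch in batches:
        # put() waits while the queue is full, so the producer can never run
        # more than queue_maxsize batches ahead of the writers.
        await queue.put(batch)
    await queue.join()  # wait until every queued batch has been written
    for worker in workers:
        worker.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return written


result = asyncio.run(
    run_pipeline([[1, 2], [3, 4], [5]], writer_concurrency=2, queue_maxsize=1)
)
print(len(result), "batches written")
```

Under this model, the number of batches held in memory at once is roughly the queue size plus the number of writers. That is why lowering either value reduces peak memory at the cost of throughput.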

See Performance Tuning for detailed recommendations.

Why is ingestion slow?

Common causes of slow ingestion and their solutions:

1. Missing indexes or constraints

Always run schema bootstrap before loading data:

ddigraph bootstrap

Without indexes, each MERGE falls back to scanning every node with the matching label, so lookups get slower as the graph grows.

2. Batch size too small

The default chunk_size of 200 is conservative. For large files, increase it:

ddigraph load file.xml --dataset-id demo --chunk-size 1000
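
The effect of the chunk size on the number of write transactions is simple arithmetic. The sketch below assumes one transaction per batch. Both helper names are invented for the example.

```python
import math
from itertools import islice


def chunked(items, chunk_size):
    """Yield consecutive lists of at most chunk_size items."""
    iterator = iter(items)
    while batch := list(islice(iterator, chunk_size)):
        yield batch


def transactions_needed(element_count, chunk_size):
    """Number of batches, assuming one write transaction per batch."""
    return math.ceil(element_count / chunk_size)


print(list(chunked(range(5), 2)))        # [[0, 1], [2, 3], [4]]
print(transactions_needed(10_000, 200))   # 50
print(transactions_needed(10_000, 1000))  # 10
```

For a file with 10,000 elements, moving from the default of 200 to 1000 cuts the number of round-trips from 50 to 10 under that assumption.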

3. Network latency to Neo4j

For cloud instances (Neo4j Aura, remote clusters), network round-trips dominate. Use larger batches and increase retry tolerance:

ddigraph load file.xml --dataset-id demo \
    --chunk-size 500 \
    --write-retry-attempts 5 \
    --write-retry-base-delay 2.0
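
The retry flags are easiest to picture as a retry loop with a growing delay. The sketch below assumes exponential backoff (delay doubling on each attempt), which this FAQ does not confirm. `write_with_retry` is a hypothetical helper, and `ConnectionError` stands in for whatever transient error a backend raises.

```python
import asyncio


async def write_with_retry(write, attempts=5, base_delay=2.0, sleep=asyncio.sleep):
    """Call write() until it succeeds or the attempts are used up."""
    for attempt in range(attempts):
        try:
            return await write()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Assumed schedule: base_delay, 2x, 4x, ... between attempts.
            await sleep(base_delay * 2 ** attempt)


# Demo with a fake sleep so the example finishes instantly.
recorded_delays = []
state = {"calls": 0}


async def record_delay(seconds):
    recorded_delays.append(seconds)


async def flaky_write():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("transient network error")
    return "ok"


print(asyncio.run(write_with_retry(flaky_write, sleep=record_delay)))
print(recorded_delays)
```

With a base delay of 2.0 and five attempts, this schedule waits at most 2 + 4 + 8 + 16 = 30 seconds in total before giving up.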

4. Neo4j server resources

Check that the Neo4j instance has sufficient memory and CPU. The Neo4j browser (http://localhost:7474) shows active queries and resource usage.

DDI-L performance

DDI-L (DDI Lifecycle) FragmentInstance ingestion uses UNWIND batching, which sends many rows in a single parameterised query and reduces the number of Neo4j queries to about 30 for a typical file.
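
For readers unfamiliar with UNWIND batching, the sketch below builds one such query in Python. It is not the query ddigraph generates. The function name and the `id` and `props` row keys are assumptions chosen for the example.

```python
import re


def build_unwind_merge(label, rows):
    """Return (cypher, parameters) that upserts all rows in one query."""
    # Labels cannot be passed as Cypher parameters, so validate the label
    # before interpolating it into the query text.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", label):
        raise ValueError(f"unsafe label: {label!r}")
    query = (
        "UNWIND $rows AS row "
        f"MERGE (n:{label} {{id: row.id}}) "
        "SET n += row.props"
    )
    return query, {"rows": rows}


rows = [
    {"id": "v1", "props": {"name": "age"}},
    {"id": "v2", "props": {"name": "sex"}},
]
query, params = build_unwind_merge("Variable", rows)
print(query)
print(len(params["rows"]), "rows in one round-trip")
```

The saving comes from sending the row list as a single parameter. The server executes one query per batch instead of one per node.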

How do I connect to Neo4j Aura (cloud)?

Neo4j Aura uses encrypted connections. Use the neo4j+s:// URI scheme, which enables TLS automatically:

export DDIGRAPH_NEO4J_URI=neo4j+s://xxxx.databases.neo4j.io
export DDIGRAPH_NEO4J_USER=neo4j
export DDIGRAPH_NEO4J_PASSWORD=your-aura-password

Then run commands as usual:

ddigraph bootstrap
ddigraph load survey.xml --dataset-id demo

Aura tuning

Cloud instances have higher network latency. Consider these settings:

export DDIGRAPH_CONNECTION_TIMEOUT=10
export DDIGRAPH_CHUNK_SIZE=500
export DDIGRAPH_WRITE_RETRY_ATTEMPTS=5
export DDIGRAPH_WRITE_RETRY_BASE_DELAY=2.0

You do not need --encrypted or --trusted-certificates when using the neo4j+s:// scheme, as TLS is handled by the URI scheme itself.
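
A small pre-flight check can confirm that a configured URI already implies TLS. The scheme names below are the ones the Neo4j drivers accept. The helper itself is a hypothetical sketch and not part of ddigraph.

```python
from urllib.parse import urlparse

# "+s" verifies the server certificate against trusted CAs; "+ssc" also
# encrypts but accepts self-signed certificates.
TLS_SCHEMES = {"neo4j+s", "neo4j+ssc", "bolt+s", "bolt+ssc"}
PLAIN_SCHEMES = {"neo4j", "bolt"}


def uri_implies_tls(uri):
    """Return True if the URI scheme itself turns on encryption."""
    scheme = urlparse(uri).scheme.lower()
    if scheme in TLS_SCHEMES:
        return True
    if scheme in PLAIN_SCHEMES:
        return False
    raise ValueError(f"not a Neo4j URI scheme: {scheme!r}")


print(uri_implies_tls("neo4j+s://xxxx.databases.neo4j.io"))  # True
print(uri_implies_tls("bolt://localhost:7687"))              # False
```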

Can I use ddigraph without async?

Yes. The GraphWriteAdapter protocol accepts both synchronous and asynchronous implementations: an adapter method can return None (sync) or Awaitable[None] (async):

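from ddigraph.schema.adapter import GraphWriteAdapter

# Synchronous adapter
class SyncAdapter:
    def __init__(self, backend) -> None:
        # `backend` is your own client object; create_node and
        # delete_by_dataset are placeholders for its API.
        self.backend = backend

    def write_batch(self, graph, **kwargs) -> None:
        for node in graph.nodes():
            self.backend.create_node(node["label"], node["properties"])

    def purge_dataset(self, dataset_id, **kwargs) -> None:
        self.backend.delete_by_dataset(dataset_id)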

The CLI and built-in loaders use asyncio.run() internally, so you do not need to manage an event loop yourself when using the command line. The async machinery is transparent.
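
One common way for an async caller to accept both kinds of adapter is to await the result only when it is awaitable. The sketch below illustrates that pattern. It is an assumption about how such dispatch can work, not a copy of ddigraph's loader, and every name in it is invented for the example.

```python
import asyncio
import inspect


async def call_adapter_method(method, *args, **kwargs):
    """Call a sync or async adapter method from async code."""
    result = method(*args, **kwargs)
    if inspect.isawaitable(result):
        # Async adapters return a coroutine; sync adapters return None.
        result = await result
    return result


class SyncRecorder:
    def __init__(self):
        self.purged = []

    def purge_dataset(self, dataset_id, **kwargs):
        self.purged.append(dataset_id)


class AsyncRecorder:
    def __init__(self):
        self.purged = []

    async def purge_dataset(self, dataset_id, **kwargs):
        self.purged.append(dataset_id)


async def main():
    sync_adapter, async_adapter = SyncRecorder(), AsyncRecorder()
    for adapter in (sync_adapter, async_adapter):
        await call_adapter_method(adapter.purge_dataset, "demo")
    return sync_adapter.purged, async_adapter.purged


print(asyncio.run(main()))  # (['demo'], ['demo'])
```

A synchronous adapter called this way blocks the event loop for the duration of each call. That is acceptable for local backends but worth knowing for slow remote ones.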

How do I add a custom graph backend?

Implement the GraphWriteAdapter protocol with two methods: write_batch and purge_dataset.

from ddigraph.schema.adapter import GraphWriteAdapter
from ddigraph.schema.ddi_graph import DDIIngestGraph

class MyBackendAdapter:
    def __init__(self, backend) -> None:
        # `backend` stands in for your own async client; create, link and
        # delete_all below are placeholders for its API.
        self.backend = backend

    async def write_batch(
        self,
        graph: DDIIngestGraph,
        *,
        session_config: dict[str, object] | None = None,
        transaction_config: dict[str, object] | None = None,
    ) -> None:
        # Write nodes first so that relationships can reference them.
        for node in graph.nodes():
            await self.backend.create(node["label"], node["properties"])
        for rel in graph.relationships():
            await self.backend.link(rel["start"], rel["end"], rel["type"])

    async def purge_dataset(
        self,
        dataset_id: str,
        *,
        session_config: dict[str, object] | None = None,
        transaction_config: dict[str, object] | None = None,
    ) -> None:
        await self.backend.delete_all(dataset_id)

Then pass your adapter to the loader:

from ddigraph.ingest.loader import DDILoader

# my_backend_client is a placeholder for your own client object
adapter = MyBackendAdapter(backend=my_backend_client)
loader = DDILoader(driver=None, adapter=adapter)

See Adapter Architecture for complete examples and Contributing for guidelines on submitting new backends.
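
Before wiring a new adapter into a loader, it can help to exercise it against a stub graph. The sketch below uses a locally defined protocol and stub classes, all hypothetical and independent of ddigraph's own types, to show a recording adapter that is handy in unit tests.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class WriteAdapterLike(Protocol):
    """Local stand-in for the two-method adapter shape described above."""

    def write_batch(self, graph: Any, **kwargs: Any) -> Any: ...

    def purge_dataset(self, dataset_id: str, **kwargs: Any) -> Any: ...


class RecordingAdapter:
    """Keeps everything it is asked to write, for assertions in tests."""

    def __init__(self) -> None:
        self.nodes: list[dict] = []
        self.purged: list[str] = []

    def write_batch(self, graph: Any, **kwargs: Any) -> None:
        self.nodes.extend(graph.nodes())

    def purge_dataset(self, dataset_id: str, **kwargs: Any) -> None:
        self.purged.append(dataset_id)


class StubGraph:
    """Minimal object exposing the nodes() accessor used in the examples."""

    def __init__(self, nodes: list[dict]) -> None:
        self._nodes = nodes

    def nodes(self) -> list[dict]:
        return list(self._nodes)


adapter = RecordingAdapter()
adapter.write_batch(StubGraph([{"label": "Variable", "properties": {"name": "age"}}]))
adapter.purge_dataset("demo")
print(isinstance(adapter, WriteAdapterLike), len(adapter.nodes), adapter.purged)
```

The `isinstance` check only verifies that both methods exist. It does not validate their signatures, so a type checker is still the better guard.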

What Python versions are supported?

ddigraph supports Python 3.12-3.14. This is specified in pyproject.toml:

requires-python = ">=3.12,<3.15"

Python 3.12 is the minimum version because ddigraph relies on:

Modern type statement syntax and type aliases
collections.abc improvements used in protocol definitions
Performance improvements in asyncio
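
Scripts that wrap ddigraph can fail early on an unsupported interpreter. The check below mirrors the `requires-python` range shown above. The helper name is invented for the example.

```python
import sys

MIN_VERSION = (3, 12)    # inclusive lower bound from requires-python
MAX_EXCLUSIVE = (3, 15)  # exclusive upper bound from requires-python


def is_supported(version_info=sys.version_info):
    """Return True if the interpreter falls inside >=3.12,<3.15."""
    return MIN_VERSION <= tuple(version_info[:2]) < MAX_EXCLUSIVE


print(is_supported((3, 12, 0)))  # True
print(is_supported((3, 11, 9)))  # False
```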

How do I migrate from neo4ddi to ddigraph?

The package was renamed from neo4ddi to ddigraph to reflect multi-backend support.

Step 1: Update the package

pip uninstall neo4ddi
pip install ddigraph

Step 2: Update imports

# Before
from neo4ddi.config import Settings
from neo4ddi.ingest.loader import DDILoader

# After
from ddigraph.config import Settings
from ddigraph.ingest.loader import DDILoader

Step 3: Update environment variables

The DDIGRAPH_ prefix is now preferred, but NEO4DDI_ is still recognized:

# Before
export NEO4DDI_NEO4J_URI=bolt://localhost:7687

# After (preferred)
export DDIGRAPH_NEO4J_URI=bolt://localhost:7687

Step 4: Update CLI usage

The CLI command changed from neo4ddi to ddigraph:

# Before
neo4ddi load survey.xml --dataset-id demo

# After
ddigraph load survey.xml --dataset-id demo

Backward compatibility

The NEO4DDI_ environment variable prefix and bare NEO4J_ variables are still recognized for backward compatibility. The DDIGRAPH_ prefix takes precedence when multiple prefixes are set.
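
The precedence rule can be pictured as an ordered lookup. In the sketch below, only the first position (DDIGRAPH_ wins) comes from the text above. The relative order of NEO4DDI_ and the bare NEO4J_ variables is an assumption, and `resolve_setting` is a hypothetical helper.

```python
import os

# Checked in order; an empty prefix means the bare variable, e.g. NEO4J_URI.
# DDIGRAPH_ first is documented; the order of the other two is assumed.
PREFIXES = ("DDIGRAPH_", "NEO4DDI_", "")


def resolve_setting(name, environ=None):
    """Return the first value found for a name such as 'NEO4J_URI'."""
    environ = os.environ if environ is None else environ
    for prefix in PREFIXES:
        value = environ.get(prefix + name)
        if value is not None:
            return value
    return None


env = {
    "NEO4DDI_NEO4J_URI": "bolt://old-host:7687",
    "DDIGRAPH_NEO4J_URI": "bolt://localhost:7687",
}
print(resolve_setting("NEO4J_URI", env))  # bolt://localhost:7687
```

During a migration, a lookup like this makes it easy to spot stale NEO4DDI_ values that are silently shadowed by newer DDIGRAPH_ ones.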

Troubleshooting: Connection Errors

"ServiceUnavailable: Unable to retrieve routing information"

This usually means ddigraph cannot reach the Neo4j server.

Check that Neo4j is running:

# Docker
docker ps | grep neo4j

# Or try connecting directly
curl -s http://localhost:7474

Verify the URI: Make sure DDIGRAPH_NEO4J_URI uses the correct protocol and port:
- Local: bolt://localhost:7687
- Aura: neo4j+s://xxxx.databases.neo4j.io
Check credentials: Verify username and password are correct.
Firewall/network: Ensure port 7687 (Bolt) is open and accessible.
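
The firewall check can be scripted with a plain TCP connection attempt. The sketch below uses only the standard library and tests reachability of the port, not Bolt authentication. The helper name is invented for the example.

```python
import socket


def bolt_port_reachable(host="localhost", port=7687, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures alike.
        return False


print(bolt_port_reachable("localhost", 7687, timeout=1.0))
```

A `True` result with a failing ddigraph command points away from the network and towards the URI scheme or credentials.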

"AuthError: The client is unauthorized"

The username or password is incorrect. Double-check:

echo $DDIGRAPH_NEO4J_USER
echo $DDIGRAPH_NEO4J_PASSWORD

"ClientError: database not found"

The specified database does not exist. Check DDIGRAPH_NEO4J_DATABASE (default: neo4j). Neo4j Community Edition supports only one user database, which is named neo4j by default.

Troubleshooting: Memory Errors

"MemoryError" or process killed during parsing

ddigraph uses streaming XML parsing (iterparse), which should keep memory usage bounded. If you still hit memory limits:

Reduce batch size:

ddigraph load file.xml --dataset-id demo --chunk-size 50

Reduce queue size (Codebook format):

ddigraph load file.xml --dataset-id demo --queue-maxsize 1

Reduce writer concurrency:

ddigraph load file.xml --dataset-id demo --writer-concurrency 1

Check custom adapters for accumulation: If you are using a custom adapter, make sure it does not hold every node in memory. Write each batch and release it before accepting the next one.

Monitoring memory

Use --batch-metrics to see per-batch statistics. If batch sizes look unexpectedly large, your chunk_size may be too high for the available memory.

How is ddigraph different from other DDI tools?

ddigraph is specifically designed for graph-based representations of DDI metadata. Here is how it compares to other approaches:

| Aspect | ddigraph | Colectica | NESSTAR | Custom scripts |
| --- | --- | --- | --- | --- |
| Output | Knowledge graph (Neo4j, RDF, etc.) | Relational database | Proprietary format | Varies |
| DDI formats | Codebook + Lifecycle + CDI | Lifecycle | Codebook | Usually one |
| Graph backends | 5 (Neo4j, RDF, Gremlin, NetworkX, pandas) | N/A | N/A | Usually one |
| Streaming | Yes (bounded memory) | Yes | Varies | Usually no |
| Open source | Yes (MIT) | No (commercial) | No (discontinued) | Varies |
| Extensible | Adapter protocol | Plugin API | No | By definition |

Key differentiators:

Graph-native: ddigraph preserves the inherent graph structure of DDI metadata (variables linked to questions linked to concepts linked to code lists), making traversals and path queries natural.
Multi-backend: A single parsing pipeline feeds any graph backend through the GraphWriteAdapter interface.
Streaming architecture: Memory usage is bounded regardless of file size, making it suitable for large national survey archives.
Production-ready: Built-in retry logic, structured logging, and configurable back-pressure support automated pipelines and CI/CD workflows.

Still have questions? Open an issue on GitHub.