
Duffy SDK Vision

Status: Draft
Last updated: 2026-02-27


The Problem

A data scientist or AI engineer who wants to work with graph-structured data today faces a fragmented stack:

  1. Graph database (Neo4j, or NetworkX until it falls over at 5M nodes)
  2. Vector database (Pinecone, Weaviate, Chroma) — separate system, separate API
  3. Relational database (Postgres) — where their "real" data lives
  4. Glue code — bespoke ETL to move data between all three, convert results into DataFrames, wrangle embeddings, build feature tensors for training

Each boundary is a place where time is wasted, bugs hide, and semantic context is lost.

The AGE + pgvector + PostgreSQL combination eliminates the infrastructure fragmentation — graph, vector, relational, time-series, geospatial, documents, all in one engine. But that potential is unrealized because there is no SDK that makes it accessible. The current AGE Python driver (v0.0.7, last released Sep 2023) provides raw Cypher execution over psycopg2 with no DataFrame output, no async, no type hints, no vector awareness, no query builder, and no connection pooling.

Duffy is the SDK that makes the unified PostgreSQL stack usable for data science and AI workflows.


Target Personas

Primary: The Data Scientist / ML Engineer

Works in Jupyter. Thinks in DataFrames and tensors. Knows enough SQL to be productive. Has heard of Cypher but hasn't committed to a graph database because the setup cost is too high and integration with their existing tools (pandas, PyTorch, scikit-learn) is poor.

What they want:

  • Graph query results as DataFrames, not custom objects they have to parse
  • Embeddings managed alongside graph structure, not in a separate system
  • Subgraph extraction directly into PyTorch Geometric for GNN training
  • One pip install and a Postgres connection string, not three databases to operate

Secondary: The AI/RAG Engineer

Building retrieval-augmented generation pipelines. Currently using LangChain or LlamaIndex with a vector store and maybe a knowledge graph, but the two retrieval paths are duct-taped together.

What they want:

  • A single backend that serves as both knowledge graph AND vector store
  • Graph-aware retrieval: "find semantically similar nodes within this graph neighborhood"
  • Drop-in LangChain/LlamaIndex integration that replaces separate Neo4j + Pinecone configs
  • Entity resolution and knowledge graph construction pipelines that actually work

Tertiary: The Domain Expert

Knows their domain deeply. Recognizes that their problems are graph problems. Doesn't want to become a graph database expert to solve them.

What they want:

  • Domain-specific APIs that encode expert graph modeling patterns
  • "Model my attack surface," not "write Cypher to traverse your network topology"
  • Compliance frameworks, case law networks, financial dependency graphs — as library primitives


Architecture: Three Layers

┌─────────────────────────────────────────────────────────────┐
│                   LAYER 3: DOMAIN APIs                       │
│                                                              │
│  Security & Risk │ Knowledge Graphs │ Financial │ Legal      │
│  Attack graphs   │ Entity resolution│ Fraud     │ Case law   │
│  Vuln propagation│ Ontology mgmt    │ TBM/ROI   │ Precedent  │
│  Compliance      │ Graph-aware RAG  │ Lineage   │ Filing     │
├─────────────────────────────────────────────────────────────┤
│                 LAYER 2: TOOL INTEGRATION                    │
│                                                              │
│  pandas/Arrow │ PyTorch Geometric │ LangChain/LlamaIndex    │
│  NetworkX     │ DuckDB            │ MLflow/W&B              │
│  Jupyter viz  │ scikit-learn      │ Hugging Face            │
├─────────────────────────────────────────────────────────────┤
│                 LAYER 1: SMART PRIMITIVES                    │
│                                                              │
│  Graph ops    │ Vector ops    │ Hybrid ops   │ Bulk ops      │
│  Cypher query │ Embed mgmt   │ Graph+vector │ Ingest/export │
│  Query builder│ Index mgmt   │ Constrained  │ Schema mgmt   │
│  Traversal    │ Similarity   │ similarity   │ Migration     │
│  Algorithms   │ Versioning   │              │               │
├─────────────────────────────────────────────────────────────┤
│                       FOUNDATION                             │
│                                                              │
│  Connection pool │ Async (psycopg3) │ Type safety           │
│  Transaction mgmt│ Result parsing   │ Error handling         │
│  Schema intro-   │ AGE + pgvector   │ DataFrame/Arrow       │
│  spection        │ dialect mgmt     │ native output         │
└─────────────────────────────────────────────────────────────┘
                   PostgreSQL 18 + AGE + pgvector

Layer 1: Smart Primitives

The foundation. Every operation a data scientist does with graph + vector data, without glue code.

1.1 Modern Driver (replaces apache-age-python)

The current AGE Python driver is a thin psycopg2 wrapper that parses AGType results. Duffy's foundation replaces it entirely:

  • psycopg3 (async-native) — connection pooling, pipeline mode, COPY protocol
  • DataFrame-native output — query() returns pd.DataFrame by default, not custom objects. Optional pyarrow.Table output for zero-copy interop.
  • Type-safe — full type hints, dataclass result mapping, IDE completion
  • Query builder — programmatic Cypher construction with parameterization, not string concatenation
  • Schema introspection — discover graphs, labels, property types, indexes programmatically
  • Transaction management — context managers, savepoints, read/write routing
import duffy

db = duffy.connect("postgresql://localhost:5433/duffy")

# Cypher results as DataFrame — zero conversion
df = db.cypher("MATCH (p:Person)-[:KNOWS]->(f) RETURN p.name, f.name, f.age")
# Returns pd.DataFrame with columns: [p.name, f.name, f.age]

# Query builder
from duffy import match, node, rel
q = match(node("p", "Person")).rel("KNOWS").node("f").where("f.age > 30").ret("p", "f")
df = db.execute(q)

# Async
async with duffy.connect_async(dsn) as db:
    df = await db.cypher("MATCH (n:Asset) RETURN n")

Depends on upstream: agtype-to-JSONB cast (#350) would simplify the driver significantly but is not blocking — we can parse AGType directly like the current driver does, just better.
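The "parse AGType directly" path is simpler than it sounds. A minimal sketch, assuming AGE's wire format of JSON values annotated with `::vertex` / `::edge` / `::path` suffixes (the real parser must also handle edge cases this one ignores):

```python
import json
import re

# Assumption: AGType is JSON with ::vertex / ::edge / ::path annotations
# appended to composite values; stripping them leaves valid JSON.
_AGTYPE_ANNOTATION = re.compile(r"::(vertex|edge|path)")

def parse_agtype(raw: str):
    """Strip AGType type annotations, then decode the remaining JSON."""
    return json.loads(_AGTYPE_ANNOTATION.sub("", raw))

row = '{"id": 1, "label": "Person", "properties": {"name": "Ann"}}::vertex'
vertex = parse_agtype(row)
print(vertex["properties"]["name"])  # Ann
```

A production parser would track annotations per value (for typed result mapping) rather than discarding them, but the decode path stays this shape.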

1.2 Graph Operations

Graph algorithms and traversal patterns as callable functions, not raw Cypher that the user has to write and debug:

  • Centrality — degree, betweenness, closeness, PageRank (push computation to PG where possible)
  • Community detection — Louvain, label propagation
  • Path analysis — shortest path, all paths, weighted paths
  • Subgraph extraction — by label, by property filter, by neighborhood (k-hop)
  • Graph metrics — density, diameter, clustering coefficient, component analysis
# PageRank on a subgraph
ranks = db.graph("social").pagerank(
    label="Person",
    rel_type="KNOWS",
    damping=0.85,
    iterations=20
)
# Returns DataFrame: [node_id, name, rank]

# K-hop neighborhood extraction
subgraph = db.graph("network").neighborhood(
    start={"name": "router-01"},
    hops=3,
    rel_types=["CONNECTS_TO", "DEPENDS_ON"]
)
# Returns a Subgraph object with .nodes_df, .edges_df, .to_pyg(), .to_networkx()

Implementation strategy: Start with Python-side computation over query results. As we identify hot paths, push them into PG functions or AGE extensions. The API stays the same either way.
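To make the "Python-side first" strategy concrete, here is a hedged sketch of PageRank computed over an edge list pulled from a Cypher query — illustrative only, not the Duffy API, and it folds dangling-node mass into the base term for brevity:

```python
def pagerank(edges, damping=0.85, iterations=20):
    """Power-iteration PageRank over a list of (src, dst) edges."""
    nodes = {n for edge in edges for n in edge}
    out_deg = {n: 0 for n in nodes}
    for src, _ in edges:
        out_deg[src] += 1
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        incoming = {n: 0.0 for n in nodes}
        for src, dst in edges:
            incoming[dst] += rank[src] / out_deg[src]
        # dangling nodes simply stop contributing here; a faithful
        # implementation would redistribute their rank uniformly
        rank = {n: (1 - damping) / len(nodes) + damping * incoming[n]
                for n in nodes}
    return rank
```

The SDK version would return the same scores as a DataFrame keyed by node id, so swapping in an in-database implementation later changes nothing for the caller.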

1.3 Vector Operations

Embedding lifecycle management — generate, store, index, search, update — as a coherent API instead of raw pgvector SQL:

  • Embedding management — store, retrieve, update, delete embeddings on graph nodes/edges
  • Index management — create/manage HNSW and IVFFlat indexes, tune parameters
  • Similarity search — k-NN, filtered search, range search
  • Embedding generation — pluggable providers (sentence-transformers, OpenAI, Cohere, local models)
  • Versioning — track embedding model versions, A/B test different embeddings
# Attach embeddings to graph nodes
from duffy.embeddings import SentenceTransformerProvider

embedder = SentenceTransformerProvider("all-MiniLM-L6-v2")

# Embed a property of existing nodes
db.graph("legal").embed(
    label="Case",
    source_property="summary",
    embedding_property="summary_vec",
    provider=embedder,
    batch_size=256
)

# Similarity search
similar = db.vector.search(
    label="Case",
    embedding_property="summary_vec",
    query="wrongful termination in financial services",
    k=20,
    provider=embedder
)
# Returns DataFrame: [node_id, case_name, summary, similarity_score]

Depends on upstream: pgvector integration (#1121) at the AGE level would make hybrid queries cleaner, but the SDK can work today using SQL-bridge patterns (Cypher for graph structure, SQL for vector ops, joined at the SDK layer).
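For intuition about what `db.vector.search` delegates to pgvector, here is the brute-force equivalent in pure Python — cosine-similarity k-NN over stored embeddings. The helper names are illustrative; pgvector does this with an HNSW or IVFFlat index instead of a full scan:

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn(query_vec, rows, k):
    """rows: list of (node_id, embedding); returns top-k by similarity."""
    scored = [(node_id, cosine(query_vec, vec)) for node_id, vec in rows]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]
```

The DataFrame the SDK returns is just this ranking joined back onto the nodes' graph properties.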

1.4 Hybrid Operations (the differentiator)

The thing nobody else provides: graph traversal constrained by vector similarity, and vector search scoped by graph structure, in a single API call.

# "Find cases semantically similar to this query,
#  cited by cases in the same court"
results = db.hybrid.graph_then_vector(
    # Graph: traverse to candidates
    match="(c:Case {court: 'NYSC'})-[:CITES]->(p:Precedent)",
    # Vector: rank candidates by similarity
    embed_property="p.summary_vec",
    query="wrongful termination in financial services",
    k=10,
    provider=embedder
)

# "Find the 20 most similar assets to this compromised one,
#  then find everything within 3 hops of those"
results = db.hybrid.vector_then_graph(
    # Vector: find similar starting points
    label="Asset",
    embed_property="config_vec",
    query_node={"name": "compromised-server-01"},
    k=20,
    # Graph: expand neighborhood
    expand_hops=3,
    expand_rels=["CONNECTS_TO", "RUNS_ON"]
)

Implementation: Under the hood, this generates a SQL query that combines ag_catalog.cypher() with pgvector operators. The SDK handles the impedance mismatch so the user never writes that SQL.
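As a rough sketch of that generated SQL — not the actual output, and the table/column names and the agtype-to-bigint cast are assumptions — the bridge pattern looks like: the `ag_catalog.cypher()` call produces candidate node ids, which are joined against a pgvector column and ranked with the `<=>` cosine-distance operator:

```python
def graph_then_vector_sql(graph, cypher_pattern, embed_table, k):
    """Compose the graph-then-vector bridge query as a parameterized string.

    Hypothetical helper: assumes an `embed_table` with (node_id, embedding)
    columns and that the Cypher pattern binds the candidate node as `p`.
    """
    return (
        "SELECT c.node_id, 1 - (e.embedding <=> %(query_vec)s) AS similarity\n"
        f"FROM ag_catalog.cypher('{graph}', $$ MATCH {cypher_pattern} "
        "RETURN id(p) $$) AS c(node_id agtype)\n"
        f"JOIN {embed_table} e ON e.node_id = c.node_id::bigint\n"
        "ORDER BY e.embedding <=> %(query_vec)s\n"
        f"LIMIT {k}"
    )
```

Ordering by the raw `<=>` distance (rather than the derived similarity) lets pgvector use its index for the scan.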

1.5 Bulk Operations

Data scientists work with datasets, not individual rows. Bulk operations must be first-class:

  • Ingest from DataFrame — db.graph("g").load_nodes(df, label="Person", id_col="email")
  • Ingest from CSV/Parquet — direct file loading with schema inference
  • Upsert semantics — MERGE-based idempotent loading (depends on MERGE ON SET #1619)
  • Export — graph to CSV/Parquet/Arrow for downstream tools
  • Schema migration — add properties, rename labels, evolve the graph schema over time
import pandas as pd

# DataFrame -> graph
people_df = pd.read_csv("people.csv")
edges_df = pd.read_csv("relationships.csv")

db.graph("social").load_nodes(people_df, label="Person", id_col="email")
db.graph("social").load_edges(
    edges_df,
    rel_type="KNOWS",
    source_col="from_email", source_label="Person",
    target_col="to_email", target_label="Person"
)

# Graph -> DataFrame
nodes_df, edges_df = db.graph("social").export(labels=["Person"], rel_types=["KNOWS"])

Depends on upstream: Bulk load via client connection (#432) and edge creation performance (#2198) are relevant. The SDK can use COPY protocol via psycopg3 regardless, bypassing AGE's load_from_file.
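The COPY path reduces to serializing DataFrame rows into the CSV stream that `COPY ... FROM STDIN` consumes (via psycopg3's `copy()` API in the real driver). A minimal sketch of that serialization step, with illustrative column names:

```python
import csv
import io

def rows_to_copy_payload(rows, columns):
    """Serialize dict rows to an in-memory CSV payload for COPY FROM STDIN."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow([row[c] for c in columns])
    return buf.getvalue()

payload = rows_to_copy_payload(
    [{"email": "ann@example.com", "name": "Ann"}],
    columns=["email", "name"],
)
```

In practice the driver streams batches rather than buffering the whole payload, but the row-to-wire mapping is the same.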


Layer 2: Tool Integration

Meet data scientists where they already work. These aren't wrappers — they're native integrations that understand the semantics of graph + vector data.

2.1 PyTorch Geometric

Implement PyG's FeatureStore and GraphStore interfaces backed by AGE + pgvector. This makes Duffy a training data source for GNNs without loading the entire graph into memory.

from duffy.integrations.pyg import DuffyGraphStore, DuffyFeatureStore
from torch_geometric.loader import NeighborLoader

graph_store = DuffyGraphStore(db, graph_name="social")
feature_store = DuffyFeatureStore(db, graph_name="social")

# Neighbor sampling happens in-database, not in Python
loader = NeighborLoader(
    data=(feature_store, graph_store),
    num_neighbors=[10, 5],
    batch_size=128,
    input_nodes="Person"
)

for batch in loader:
    # batch is a standard PyG HeteroData object
    out = model(batch.x_dict, batch.edge_index_dict)

No upstream dependency. This is pure SDK work. Kuzu had the best implementation of this pattern before the project was archived — we can learn from their design.
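For reference, the sampling semantics NeighborLoader expects look like this layer-wise fanout — a pure-Python sketch of what Duffy would push into SQL rather than run client-side:

```python
import random

def sample_neighbors(adj, seeds, num_neighbors, rng):
    """Layer-wise neighbor sampling.

    adj: {node: [neighbors]}; num_neighbors: fanout per hop, e.g. [10, 5].
    Returns the set of sampled nodes (seeds plus sampled neighborhood).
    """
    frontier, visited = list(seeds), set(seeds)
    for fanout in num_neighbors:
        next_frontier = []
        for node in frontier:
            nbrs = adj.get(node, [])
            # sample without replacement up to the hop's fanout
            picked = nbrs if len(nbrs) <= fanout else rng.sample(nbrs, fanout)
            next_frontier.extend(n for n in picked if n not in visited)
            visited.update(picked)
        frontier = next_frontier
    return visited
```

The in-database version would express each hop as a Cypher/SQL expansion with a per-node LIMIT, returning only the sampled edge index and features to Python.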

2.2 LangChain / LlamaIndex

Drop-in backends that make AGE+pgvector a combined knowledge graph and vector store:

# LlamaIndex
from duffy.integrations.llamaindex import DuffyGraphVectorStore

store = DuffyGraphVectorStore(db, graph_name="knowledge")
index = KnowledgeGraphIndex.from_documents(documents, storage_context=store)
# Hybrid retrieval: graph structure + vector similarity in one query

# LangChain
from duffy.integrations.langchain import DuffyGraphStore, DuffyVectorStore

graph = DuffyGraphStore(db)
vectors = DuffyVectorStore(db, embedding=embedder)
# Both backed by the same PostgreSQL instance

Addresses AGE issue #1783 directly. Can be contributed upstream to LlamaIndex and used within Duffy.

2.3 pandas / Arrow (built into Layer 1)

Not a separate integration — this IS the default output format. Every query returns a DataFrame. Every input accepts a DataFrame. Arrow is available for zero-copy handoff to DuckDB, Polars, or any Arrow-compatible tool.

2.4 NetworkX / igraph

Export for visualization, analysis, and interop with the existing Python graph ecosystem:

G = db.graph("social").to_networkx(labels=["Person"], rel_types=["KNOWS"])
# Standard NetworkX graph, works with all nx algorithms and matplotlib viz

ig = db.graph("social").to_igraph(labels=["Person"], rel_types=["KNOWS"])
# igraph for large-graph analysis (faster than NetworkX)

2.5 Jupyter

Native notebook visualization — graph rendering in cells without exporting to external tools:

# In a Jupyter cell
db.graph("social").viz(
    labels=["Person", "Company"],
    rel_types=["WORKS_AT", "KNOWS"],
    layout="force",
    color_by="label",
    size_by="pagerank"
)
# Renders an interactive graph visualization inline

Implementation: Likely wraps PyGraphistry or pyvis for rendering. The value is the integration with the Duffy query layer, not the rendering engine.


Layer 3: Domain APIs

Thin, opinionated layers over Layers 1 and 2 that encode how an expert in each domain would model and query graph + vector data. The user doesn't need to become a graph expert — the API encodes the expertise.

3.1 Security & Risk (duffy.domains.security)

from duffy.domains.security import AttackGraph, ComplianceGraph

# Build attack graph from network topology + vulnerability scan
ag = AttackGraph(db, graph_name="attack_surface")
ag.ingest_topology(nmap_df)           # network scan results
ag.ingest_vulnerabilities(cve_df)     # vulnerability scan results
ag.compute_attack_paths(target="dc-01", max_depth=5)

# Query: what's the blast radius if this host is compromised?
blast = ag.blast_radius("web-server-03")
# Returns subgraph of reachable assets with propagation scores

# Compliance posture as a graph
cg = ComplianceGraph(db, graph_name="compliance")
cg.load_framework("NIST_CSF_2.0")    # reference layer
cg.map_controls(controls_df)          # instance layer
cg.assess_coverage()
# Returns DataFrame: [control_id, requirement, status, evidence_count, gaps]
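The blast-radius query above amounts to a reachability traversal with a per-hop propagation score. A hedged sketch of that core — the decay scoring model is an illustrative assumption, not the shipped algorithm:

```python
from collections import deque

def blast_radius(adj, start, decay=0.5):
    """BFS from a compromised node; score decays by `decay` per hop.

    adj: {asset: [dependent assets]}. Returns {asset: propagation score}.
    """
    scores = {start: 1.0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in adj.get(node, []):
            if nbr not in scores:  # first (shortest) path wins
                scores[nbr] = scores[node] * decay
                queue.append(nbr)
    return scores
```

The SDK version runs the traversal in Cypher and returns the scored subgraph; this shows only the scoring semantics.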

3.2 Knowledge Graphs & RAG (duffy.domains.knowledge)

from duffy.domains.knowledge import KnowledgeGraph

kg = KnowledgeGraph(db, graph_name="docs")

# Build KG from documents with entity resolution
kg.ingest_documents(
    documents,
    extractor="llm",                  # LLM-based entity/relation extraction
    embedder=embedder,                # embed chunks alongside graph structure
    resolve_entities=True             # deduplicate entities across documents
)

# Graph-aware RAG retrieval
results = kg.retrieve(
    query="What are the compliance requirements for PCI DSS in cloud environments?",
    strategy="hybrid",                # vector similarity + graph traversal
    graph_hops=2,                     # expand to 2-hop neighbors of vector matches
    k=10
)

3.3 Financial (duffy.domains.financial)

from duffy.domains.financial import DependencyGraph, CostModel

# Technology dependency graph with cost propagation
dg = DependencyGraph(db, graph_name="tech_stack")
dg.load_assets(assets_df)
dg.load_dependencies(deps_df)
dg.load_costs(costs_df)              # star schema: measures stay relational

# What-if analysis: "what happens if we decommission Oracle?"
impact = dg.decommission_impact("Oracle Database", cascade=True)
# Returns: affected_assets, cost_delta, migration_paths, risk_score

3.4 Legal (duffy.domains.legal)

from duffy.domains.legal import CaseLawGraph

clg = CaseLawGraph(db, graph_name="case_law")
clg.ingest_cases(cases_df, embed_property="summary")

# Find persuasive precedent
precedents = clg.find_precedent(
    query="wrongful termination constructive dismissal",
    court="NYSC",
    strategy="citation_weighted",     # weight by citation graph PageRank
    k=10
)
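The "citation_weighted" strategy blends vector similarity with authority derived from the citation graph. A sketch under stated assumptions — the 0.7/0.3 blend and min-free normalization are illustrative, not the shipped formula:

```python
def citation_weighted(candidates, pagerank, alpha=0.7):
    """Blend similarity with normalized citation-graph PageRank.

    candidates: list of (case_id, similarity); pagerank: {case_id: score}.
    Returns candidates re-ranked by the blended score, highest first.
    """
    max_pr = max(pagerank.values(), default=0.0) or 1.0
    scored = [
        (cid, alpha * sim + (1 - alpha) * pagerank.get(cid, 0.0) / max_pr)
        for cid, sim in candidates
    ]
    return sorted(scored, key=lambda t: t[1], reverse=True)
```

A heavily cited precedent can therefore outrank a marginally more similar but rarely cited one, which matches how lawyers actually weigh authority.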

What Must Exist in AGE for This to Work

The SDK can work around AGE limitations at the Python layer, but certain upstream improvements make the SDK dramatically better. These are our Track B feature contributions — AGE PRs driven by what the SDK needs:

  SDK need                     AGE gap                                     Issue      Priority
  Idempotent graph loading     MERGE ON CREATE/MATCH SET                   #1619      Immediate (we own this)
  Graph traversal filtering    Predicate functions (all/any/none/single)   #552-556   Next sprint
  Pattern-based filtering      Pattern matching in WHERE                   #1577      Next sprint
  Complex ETL pipelines        CALL subquery                               #1690      Medium-term
  Data integrity               Unique constraints                          #45        Medium-term
  Driver simplification        agtype → JSONB cast                         #350       Quick win
  Hybrid graph+vector queries  pgvector integration                        #1121      Strategic (our differentiator)
  Bulk data loading            Client-side bulk load                       #432       SDK can work around
  Real-world graph modeling    Multiple labels                             #2082      Help existing PR land

Build Sequence

Phase 1: Foundation + First Proof of Value (Months 1-2)

Build the driver and one integration that proves the concept.

Milestone 1a — Modern Driver:

  • [ ] psycopg3 async connection pool
  • [ ] Cypher execution with DataFrame output (pd.DataFrame, pyarrow.Table)
  • [ ] AGType parsing (port + modernize existing parser)
  • [ ] Type hints, schema introspection
  • [ ] Basic query builder
  • [ ] Package structure, CI, pyproject.toml

Milestone 1b — Hybrid Query Proof of Concept:

  • [ ] Vector operations: embed, store, search on graph nodes
  • [ ] One hybrid query pattern working end-to-end (graph_then_vector)
  • [ ] Jupyter notebook demo: "Graph + Vector search in one PostgreSQL instance"

Milestone 1c — Parallel AGE Contributions:

  • [ ] MERGE ON CREATE/MATCH SET PR submitted (#1619)
  • [ ] Predicate functions PR submitted (#552-556)
  • [ ] agtype → JSONB cast PR submitted (#350)

Ship: pip install duffy with basic graph queries returning DataFrames, vector search on graph nodes, and one hybrid query pattern. Enough for a blog post and a demo.

Phase 2: Integrations (Months 3-4)

Make Duffy work with the tools people already use.

  • [ ] PyTorch Geometric FeatureStore/GraphStore backend
  • [ ] LlamaIndex GraphStore integration (contributes to issue #1783)
  • [ ] LangChain vector store + graph store backends
  • [ ] NetworkX/igraph export
  • [ ] Bulk operations (DataFrame → graph, graph → DataFrame)
  • [ ] Graph algorithm library (PageRank, community detection, centrality)

Ship: Duffy as a viable alternative to Neo4j's Python ecosystem for data science and RAG workflows.

Phase 3: Domain APIs (Months 5-8)

Build the opinionated layers that make specific verticals productive immediately.

  • [ ] duffy.domains.security — attack graphs, compliance graphs, vulnerability propagation
  • [ ] duffy.domains.knowledge — entity resolution, KG construction, graph-aware RAG
  • [ ] duffy.domains.financial — dependency graphs, cost models, what-if analysis
  • [ ] duffy.domains.legal — case law graphs, precedent search, citation analysis

Ship: Domain-specific demos and documentation that show Duffy solving real problems, not just providing infrastructure.

Phase 4: Production Hardening (Months 9-12)

  • [ ] Benchmark suite (vs Neo4j Python driver, vs raw psycopg3, vs standalone pgvector)
  • [ ] Connection pool tuning, query caching, materialized view integration
  • [ ] Observability (query logging, performance metrics, OpenTelemetry)
  • [ ] Migration tools (Neo4j → Duffy, standalone pgvector → Duffy)
  • [ ] Comprehensive documentation, tutorials, example notebooks

Competitive Position

  Capability            Duffy                           Neo4j Python           Kuzu (archived)          Raw psycopg3 + pgvector
  Graph queries         Cypher via AGE                  Cypher (native)        Cypher (embedded)        No graph support
  Vector search         pgvector (same DB)              Bolt-on                No                       pgvector only
  Hybrid graph+vector   First-class API                 Limited                No                       Manual SQL
  DataFrame output      Default                         Supported              Manual                   Manual
  PyG integration       FeatureStore/GraphStore         Via GDS (enterprise)   Had it (best in class)   No
  LangChain/LlamaIndex  Drop-in backend                 Drop-in (Neo4j only)   No                       Separate vector store
  Domain APIs           Security, Legal, Financial, KG  No                     No                       No
  Operational burden    One PostgreSQL instance         Separate DB server     Embedded (no server)     One PostgreSQL instance
  License               Apache 2.0                      GPL/Commercial         MIT (archived)           PostgreSQL + Apache 2.0
  Async support         Yes (psycopg3)                  Yes                    No                       Yes

The pitch: Neo4j's Python ecosystem is the gold standard, but it requires a separate database engine and the commercial license is expensive. Duffy provides the same developer experience on the PostgreSQL stack you already run, with the added capability of hybrid graph+vector queries that Neo4j can't match natively.


Open Design Questions

  1. Package name: duffy on PyPI? Or age-toolkit / pg-graph for discoverability? The name needs to signal "graph + vector + PostgreSQL" to someone scanning PyPI.
  2. Minimum viable driver scope: How thin can Milestone 1a be while still being useful? The temptation is to build everything — resist it.
  3. Graph algorithm execution: Python-side vs PG function vs AGE extension? Start Python-side for speed, push to PG for performance-critical paths. But where's the line?
  4. Embedding provider abstraction: How tightly should we couple to specific embedding providers? Pluggable interface, but which providers are first-class?
  5. Domain API granularity: Each domain could be its own package (duffy-security, duffy-legal) or part of the monorepo. Separate packages reduce install size but increase maintenance.

This document is the product vision for the Duffy SDK. It should be updated as design decisions are made and milestones are completed. See docs/research/age-feature-prioritization.md for the AGE upstream backlog that feeds Track B contributions.