Duffy SDK Vision¶
Status: Draft | Last updated: 2026-02-27
The Problem¶
A data scientist or AI engineer who wants to work with graph-structured data today faces a fragmented stack:
- Graph database (Neo4j, or NetworkX until it falls over at 5M nodes)
- Vector database (Pinecone, Weaviate, Chroma) — separate system, separate API
- Relational database (Postgres) — where their "real" data lives
- Glue code — bespoke ETL to move data between all three, convert results into DataFrames, wrangle embeddings, build feature tensors for training
Each boundary is a place where time is wasted, bugs hide, and semantic context is lost.
The AGE + pgvector + PostgreSQL combination eliminates the infrastructure fragmentation — graph, vector, relational, time-series, geospatial, documents, all in one engine. But that potential is unrealized because there is no SDK that makes it accessible. The current AGE Python driver (v0.0.7, last released Sep 2023) provides raw Cypher execution over psycopg2 with no DataFrame output, no async, no type hints, no vector awareness, no query builder, and no connection pooling.
Duffy is the SDK that makes the unified PostgreSQL stack usable for data science and AI workflows.
Target Personas¶
Primary: The Data Scientist / ML Engineer¶
Works in Jupyter. Thinks in DataFrames and tensors. Knows enough SQL to be productive. Has heard of Cypher but hasn't committed to a graph database because the setup cost is too high and integration with their existing tools (pandas, PyTorch, scikit-learn) is poor.
What they want:
- Graph query results as DataFrames, not custom objects they have to parse
- Embeddings managed alongside graph structure, not in a separate system
- Subgraph extraction directly into PyTorch Geometric for GNN training
- One pip install and a Postgres connection string, not three databases to operate
Secondary: The AI/RAG Engineer¶
Building retrieval-augmented generation pipelines. Currently using LangChain or LlamaIndex with a vector store and maybe a knowledge graph, but the two retrieval paths are duct-taped together.
What they want:
- A single backend that serves as both knowledge graph AND vector store
- Graph-aware retrieval: "find semantically similar nodes within this graph neighborhood"
- Drop-in LangChain/LlamaIndex integration that replaces separate Neo4j + Pinecone configs
- Entity resolution and knowledge graph construction pipelines that actually work
Tertiary: The Domain Engineer (Security, Legal, Financial)¶
Knows their domain deeply. Recognizes that their problems are graph problems. Doesn't want to become a graph database expert to solve them.
What they want:
- Domain-specific APIs that encode expert graph modeling patterns
- "Model my attack surface" not "write Cypher to traverse your network topology"
- Compliance frameworks, case law networks, financial dependency graphs — as library primitives
Architecture: Three Layers¶
┌─────────────────────────────────────────────────────────────┐
│ LAYER 3: DOMAIN APIs │
│ │
│ Security & Risk │ Knowledge Graphs │ Financial │ Legal │
│ Attack graphs │ Entity resolution│ Fraud │ Case law │
│ Vuln propagation│ Ontology mgmt │ TBM/ROI │ Precedent │
│ Compliance │ Graph-aware RAG │ Lineage │ Filing │
├─────────────────────────────────────────────────────────────┤
│ LAYER 2: TOOL INTEGRATION │
│ │
│ pandas/Arrow │ PyTorch Geometric │ LangChain/LlamaIndex │
│ NetworkX │ DuckDB │ MLflow/W&B │
│ Jupyter viz │ scikit-learn │ Hugging Face │
├─────────────────────────────────────────────────────────────┤
│ LAYER 1: SMART PRIMITIVES │
│ │
│ Graph ops │ Vector ops │ Hybrid ops │ Bulk ops │
│ Cypher query │ Embed mgmt │ Graph+vector │ Ingest/export │
│ Query builder│ Index mgmt │ Constrained │ Schema mgmt │
│ Traversal │ Similarity │ similarity │ Migration │
│ Algorithms │ Versioning │ │ │
├─────────────────────────────────────────────────────────────┤
│ FOUNDATION │
│ │
│ Connection pool │ Async (psycopg3) │ Type safety │
│ Transaction mgmt│ Result parsing │ Error handling │
│ Schema intro- │ AGE + pgvector │ DataFrame/Arrow │
│ spection │ dialect mgmt │ native output │
└─────────────────────────────────────────────────────────────┘
│
PostgreSQL 18 + AGE + pgvector
Layer 1: Smart Primitives¶
The foundation. Every operation a data scientist does with graph + vector data, without glue code.
1.1 Modern Driver (replaces apache-age-python)¶
The current AGE Python driver is a thin psycopg2 wrapper that parses AGType results. Duffy's foundation replaces it entirely:
- psycopg3 (async-native) — connection pooling, pipeline mode, COPY protocol
- DataFrame-native output — query() returns pd.DataFrame by default, not custom objects. Optional pyarrow.Table output for zero-copy interop.
- Type-safe — full type hints, dataclass result mapping, IDE completion
- Query builder — programmatic Cypher construction with parameterization, not string concatenation
- Schema introspection — discover graphs, labels, property types, indexes programmatically
- Transaction management — context managers, savepoints, read/write routing
import duffy
db = duffy.connect("postgresql://localhost:5433/duffy")
# Cypher results as DataFrame — zero conversion
df = db.cypher("MATCH (p:Person)-[:KNOWS]->(f) RETURN p.name, f.name, f.age")
# Returns pd.DataFrame with columns: [p.name, f.name, f.age]
# Query builder
from duffy import match, node
q = match(node("p", "Person")).rel("KNOWS").node("f").where("f.age > 30").ret("p", "f")
df = db.execute(q)
# Async
async with duffy.connect_async(dsn) as db:
    df = await db.cypher("MATCH (n:Asset) RETURN n")
Depends on upstream: agtype-to-JSONB cast (#350) would simplify the driver significantly but is not blocking — we can parse AGType directly like the current driver does, just better.
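Parsing AGType directly is tractable because, for vertices, edges, and scalars, the wire text is JSON with a trailing `::type` annotation. A minimal sketch of the idea — not the actual driver code; paths, which nest annotated values, need recursive handling:

```python
import json
import re

# AGE returns results as AGType text, e.g.
#   {"id": 844424930131969, "label": "Person", "properties": {"name": "Ada"}}::vertex
# For vertices, edges, and numerics, stripping the trailing ::type annotation
# leaves valid JSON. (::path values nest annotated items and need more care.)
_AGTYPE_SUFFIX = re.compile(r"::(vertex|edge|numeric)$")

def parse_agtype(text: str):
    """Parse a single non-path AGType value into a Python object."""
    return json.loads(_AGTYPE_SUFFIX.sub("", text.strip()))
```

A DataFrame column of such values can then be mapped through `parse_agtype` before being handed to the user.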
1.2 Graph Operations¶
Graph algorithms and traversal patterns as callable functions, not raw Cypher that the user has to write and debug:
- Centrality — degree, betweenness, closeness, PageRank (push computation to PG where possible)
- Community detection — Louvain, label propagation
- Path analysis — shortest path, all paths, weighted paths
- Subgraph extraction — by label, by property filter, by neighborhood (k-hop)
- Graph metrics — density, diameter, clustering coefficient, component analysis
# PageRank on a subgraph
ranks = db.graph("social").pagerank(
label="Person",
rel_type="KNOWS",
damping=0.85,
iterations=20
)
# Returns DataFrame: [node_id, name, rank]
# K-hop neighborhood extraction
subgraph = db.graph("network").neighborhood(
start={"name": "router-01"},
hops=3,
rel_types=["CONNECTS_TO", "DEPENDS_ON"]
)
# Returns a Subgraph object with .nodes_df, .edges_df, .to_pyg(), .to_networkx()
Implementation strategy: Start with Python-side computation over query results. As we identify hot paths, push them into PG functions or AGE extensions. The API stays the same either way.
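As a concrete illustration of the Python-side starting point: PageRank needs nothing beyond the edge list a query already returns. A dependency-free power-iteration sketch (the real library would operate on the edges DataFrame; this version takes plain `(src, dst)` pairs):

```python
def pagerank(edges, damping=0.85, iterations=20):
    """PageRank by power iteration over an edge list of (src, dst) pairs."""
    nodes = {n for edge in edges for n in edge}
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        nxt = {n: (1 - damping) / len(nodes) for n in nodes}
        for src, targets in out.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    nxt[dst] += share
            else:
                # Dangling node: redistribute its mass uniformly
                for n in nodes:
                    nxt[n] += damping * rank[src] / len(nodes)
        rank = nxt
    return rank
```

When profiling shows this to be a hot path, the same signature can be backed by a PG-side implementation without API changes.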
1.3 Vector Operations¶
Embedding lifecycle management — generate, store, index, search, update — as a coherent API instead of raw pgvector SQL:
- Embedding management — store, retrieve, update, delete embeddings on graph nodes/edges
- Index management — create/manage HNSW and IVFFlat indexes, tune parameters
- Similarity search — k-NN, filtered search, range search
- Embedding generation — pluggable providers (sentence-transformers, OpenAI, Cohere, local models)
- Versioning — track embedding model versions, A/B test different embeddings
# Attach embeddings to graph nodes
from duffy.embeddings import SentenceTransformerProvider
embedder = SentenceTransformerProvider("all-MiniLM-L6-v2")
# Embed a property of existing nodes
db.graph("legal").embed(
label="Case",
source_property="summary",
embedding_property="summary_vec",
provider=embedder,
batch_size=256
)
# Similarity search
similar = db.vector.search(
label="Case",
embedding_property="summary_vec",
query="wrongful termination in financial services",
k=20,
provider=embedder
)
# Returns DataFrame: [node_id, case_name, summary, similarity_score]
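The "pluggable providers" bullet above implies a small structural interface that any embedder can satisfy. One possible shape — the names and fields here are assumptions, not a settled API:

```python
from typing import Protocol, Sequence

class EmbeddingProvider(Protocol):
    """Anything that turns text into fixed-width vectors.

    model_name and dim support the versioning bullet: stored embeddings
    can be tagged with the model that produced them.
    """
    model_name: str
    dim: int

    def embed(self, texts: Sequence[str]) -> list[list[float]]: ...

class HashingProvider:
    """Toy deterministic provider for tests and offline development."""
    def __init__(self, dim: int = 8):
        self.model_name = f"hashing-{dim}"
        self.dim = dim

    def embed(self, texts: Sequence[str]) -> list[list[float]]:
        out = []
        for text in texts:
            vec = [0.0] * self.dim
            for i, byte in enumerate(text.encode()):
                vec[(i + byte) % self.dim] += 1.0
            out.append(vec)
        return out
```

Because the interface is a Protocol, sentence-transformers, OpenAI, or local-model wrappers plug in without inheriting from a Duffy base class.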
Depends on upstream: pgvector integration (#1121) at the AGE level would make hybrid queries cleaner, but the SDK can work today using SQL-bridge patterns (Cypher for graph structure, SQL for vector ops, joined at the SDK layer).
1.4 Hybrid Operations (the differentiator)¶
The thing nobody else provides: graph traversal constrained by vector similarity, and vector search scoped by graph structure, in a single API call.
# "Find cases semantically similar to this query,
# cited by cases in the same court"
results = db.hybrid.graph_then_vector(
# Graph: traverse to candidates
match="(c:Case {court: 'NYSC'})-[:CITES]->(p:Precedent)",
# Vector: rank candidates by similarity
embed_property="p.summary_vec",
query="wrongful termination in financial services",
k=10,
provider=embedder
)
# "Find the 20 most similar assets to this compromised one,
# then find everything within 3 hops of those"
results = db.hybrid.vector_then_graph(
# Vector: find similar starting points
label="Asset",
embed_property="config_vec",
query_node={"name": "compromised-server-01"},
k=20,
# Graph: expand neighborhood
expand_hops=3,
expand_rels=["CONNECTS_TO", "RUNS_ON"]
)
Implementation: Under the hood, this generates a SQL query that combines ag_catalog.cypher() with pgvector operators. The SDK handles the impedance mismatch so the user never writes that SQL.
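To make the impedance-mismatch claim concrete, here is a sketch of the kind of SQL a graph_then_vector call might generate. Every table name, column name, and cast here is an assumption for illustration, not Duffy's actual codegen:

```python
def graph_then_vector_sql(graph: str, pattern: str, ret_var: str,
                          embed_table: str, embed_col: str, k: int) -> str:
    """One statement: Cypher selects candidate node ids, pgvector's
    <=> operator ranks them by cosine distance."""
    return (
        f"SELECT e.node_id, e.{embed_col} <=> %(query_vec)s AS distance\n"
        f"FROM ag_catalog.cypher('{graph}',\n"
        f"  $$ MATCH {pattern} RETURN id({ret_var}) $$) AS c(node_id agtype)\n"
        f"JOIN {embed_table} e ON e.node_id = c.node_id::text::bigint\n"
        f"ORDER BY distance\n"
        f"LIMIT {k}"
    )
```

A production version would parameterize the graph name and pattern safely rather than interpolating strings, but the shape of the join is the point: the user calls one method, the SDK emits one statement.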
1.5 Bulk Operations¶
Data scientists work with datasets, not individual rows. Bulk operations must be first-class:
- Ingest from DataFrame — db.graph("g").load_nodes(df, label="Person", id_col="email")
- Ingest from CSV/Parquet — direct file loading with schema inference
- Upsert semantics — MERGE-based idempotent loading (depends on MERGE ON CREATE/MATCH SET, #1619)
- Export — graph to CSV/Parquet/Arrow for downstream tools
- Schema migration — add properties, rename labels, evolve the graph schema over time
import pandas as pd
# DataFrame -> graph
people_df = pd.read_csv("people.csv")
edges_df = pd.read_csv("relationships.csv")
db.graph("social").load_nodes(people_df, label="Person", id_col="email")
db.graph("social").load_edges(
edges_df,
rel_type="KNOWS",
source_col="from_email", source_label="Person",
target_col="to_email", target_label="Person"
)
# Graph -> DataFrame
nodes_df, edges_df = db.graph("social").export(labels=["Person"], rel_types=["KNOWS"])
Depends on upstream: Bulk load via client connection (#432) and edge creation performance (#2198) are relevant. The SDK can use COPY protocol via psycopg3 regardless, bypassing AGE's load_from_file.
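A sketch of that COPY path. psycopg3's `Cursor.copy` and `Copy.write_row` are real APIs; the staging-table pattern and function names are assumptions about how Duffy might wire them up:

```python
def copy_sql(table: str, columns: list[str]) -> str:
    """COPY statement for streaming rows over the wire protocol."""
    return f'COPY {table} ({", ".join(columns)}) FROM STDIN'

def bulk_load(conn, table: str, columns: list[str], rows) -> int:
    """Stream an iterable of row tuples into a staging table via COPY.
    A follow-up Cypher MERGE (run elsewhere) promotes staged rows into
    the graph, keeping the load idempotent."""
    count = 0
    with conn.cursor() as cur:
        with cur.copy(copy_sql(table, columns)) as copy:
            for row in rows:
                copy.write_row(row)
                count += 1
    return count

# Usage (requires a live psycopg connection):
# import psycopg
# with psycopg.connect(dsn) as conn:
#     bulk_load(conn, "staging_person", ["email", "name"],
#               people_df[["email", "name"]].itertuples(index=False))
```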
Layer 2: Tool Integration¶
Meet data scientists where they already work. These aren't wrappers — they're native integrations that understand the semantics of graph + vector data.
2.1 PyTorch Geometric¶
Implement PyG's FeatureStore and GraphStore interfaces backed by AGE + pgvector. This makes Duffy a training data source for GNNs without loading the entire graph into memory.
from duffy.integrations.pyg import DuffyGraphStore, DuffyFeatureStore
from torch_geometric.loader import NeighborLoader
graph_store = DuffyGraphStore(db, graph_name="social")
feature_store = DuffyFeatureStore(db, graph_name="social")
# Neighbor sampling happens in-database, not in Python
loader = NeighborLoader(
data=(feature_store, graph_store),
num_neighbors=[10, 5],
batch_size=128,
input_nodes="Person"
)
for batch in loader:
    # batch is a standard PyG HeteroData object
    out = model(batch.x_dict, batch.edge_index_dict)
No upstream dependency. This is pure SDK work. Kuzu had the best implementation of this pattern before the project was archived — we can learn from their design.
2.2 LangChain / LlamaIndex¶
Drop-in backends that make AGE+pgvector a combined knowledge graph and vector store:
# LlamaIndex
from llama_index.core import KnowledgeGraphIndex
from duffy.integrations.llamaindex import DuffyGraphVectorStore
store = DuffyGraphVectorStore(db, graph_name="knowledge")
index = KnowledgeGraphIndex.from_documents(documents, storage_context=store)
# Hybrid retrieval: graph structure + vector similarity in one query
# LangChain
from duffy.integrations.langchain import DuffyGraphStore, DuffyVectorStore
graph = DuffyGraphStore(db)
vectors = DuffyVectorStore(db, embedding=embedder)
# Both backed by the same PostgreSQL instance
Addresses AGE issue #1783 directly. Can be contributed upstream to LlamaIndex and used within Duffy.
2.3 pandas / Arrow (built into Layer 1)¶
Not a separate integration — this IS the default output format. Every query returns a DataFrame. Every input accepts a DataFrame. Arrow is available for zero-copy handoff to DuckDB, Polars, or any Arrow-compatible tool.
2.4 NetworkX / igraph¶
Export for visualization, analysis, and interop with the existing Python graph ecosystem:
G = db.graph("social").to_networkx(labels=["Person"], rel_types=["KNOWS"])
# Standard NetworkX graph, works with all nx algorithms and matplotlib viz
ig = db.graph("social").to_igraph(labels=["Person"], rel_types=["KNOWS"])
# igraph for large-graph analysis (faster than NetworkX)
2.5 Jupyter¶
Native notebook visualization — graph rendering in cells without exporting to external tools:
# In a Jupyter cell
db.graph("social").viz(
labels=["Person", "Company"],
rel_types=["WORKS_AT", "KNOWS"],
layout="force",
color_by="label",
size_by="pagerank"
)
# Renders an interactive graph visualization inline
Implementation: Likely wraps PyGraphistry or pyvis for rendering. The value is the integration with the Duffy query layer, not the rendering engine.
Layer 3: Domain APIs¶
Thin, opinionated layers over Layers 1 and 2 that encode how an expert in each domain would model and query graph + vector data. The user doesn't need to become a graph expert — the API encodes the expertise.
3.1 Security & Risk (duffy.domains.security)¶
from duffy.domains.security import AttackGraph, ComplianceGraph
# Build attack graph from network topology + vulnerability scan
ag = AttackGraph(db, graph_name="attack_surface")
ag.ingest_topology(nmap_df) # network scan results
ag.ingest_vulnerabilities(cve_df) # vulnerability scan results
ag.compute_attack_paths(target="dc-01", max_depth=5)
# Query: what's the blast radius if this host is compromised?
blast = ag.blast_radius("web-server-03")
# Returns subgraph of reachable assets with propagation scores
# Compliance posture as a graph
cg = ComplianceGraph(db, graph_name="compliance")
cg.load_framework("NIST_CSF_2.0") # reference layer
cg.map_controls(controls_df) # instance layer
cg.assess_coverage()
# Returns DataFrame: [control_id, requirement, status, evidence_count, gaps]
3.2 Knowledge Graphs & RAG (duffy.domains.knowledge)¶
from duffy.domains.knowledge import KnowledgeGraph
kg = KnowledgeGraph(db, graph_name="docs")
# Build KG from documents with entity resolution
kg.ingest_documents(
documents,
extractor="llm", # LLM-based entity/relation extraction
embedder=embedder, # embed chunks alongside graph structure
resolve_entities=True # deduplicate entities across documents
)
# Graph-aware RAG retrieval
results = kg.retrieve(
query="What are the compliance requirements for PCI DSS in cloud environments?",
strategy="hybrid", # vector similarity + graph traversal
graph_hops=2, # expand to 2-hop neighbors of vector matches
k=10
)
3.3 Financial (duffy.domains.financial)¶
from duffy.domains.financial import DependencyGraph, CostModel
# Technology dependency graph with cost propagation
dg = DependencyGraph(db, graph_name="tech_stack")
dg.load_assets(assets_df)
dg.load_dependencies(deps_df)
dg.load_costs(costs_df) # star schema: measures stay relational
# What-if analysis: "what happens if we decommission Oracle?"
impact = dg.decommission_impact("Oracle Database", cascade=True)
# Returns: affected_assets, cost_delta, migration_paths, risk_score
3.4 Legal (duffy.domains.legal)¶
from duffy.domains.legal import CaseLawGraph
clg = CaseLawGraph(db, graph_name="case_law")
clg.ingest_cases(cases_df, embed_property="summary")
# Find persuasive precedent
precedents = clg.find_precedent(
query="wrongful termination constructive dismissal",
court="NYSC",
strategy="citation_weighted", # weight by citation graph PageRank
k=10
)
What Must Exist in AGE for This to Work¶
The SDK can work around AGE limitations at the Python layer, but certain upstream improvements make the SDK dramatically better. These are our Track B feature contributions — AGE PRs driven by what the SDK needs:
| SDK need | AGE gap | Issue | Priority |
|---|---|---|---|
| Idempotent graph loading | MERGE ON CREATE/MATCH SET | #1619 | Immediate (we own this) |
| Graph traversal filtering | Predicate functions (all/any/none/single) | #552-556 | Next sprint |
| Pattern-based filtering | Pattern matching in WHERE | #1577 | Next sprint |
| Complex ETL pipelines | CALL subquery | #1690 | Medium-term |
| Data integrity | Unique constraints | #45 | Medium-term |
| Driver simplification | agtype → JSONB cast | #350 | Quick win |
| Hybrid graph+vector queries | pgvector integration | #1121 | Strategic (our differentiator) |
| Bulk data loading | Client-side bulk load | #432 | SDK can work around |
| Real-world graph modeling | Multiple labels | #2082 | Help existing PR land |
Build Sequence¶
Phase 1: Foundation + First Proof of Value (Months 1-2)¶
Build the driver and one integration that proves the concept.
Milestone 1a — Modern Driver:
- [ ] psycopg3 async connection pool
- [ ] Cypher execution with DataFrame output (pd.DataFrame, pyarrow.Table)
- [ ] AGType parsing (port + modernize existing parser)
- [ ] Type hints, schema introspection
- [ ] Basic query builder
- [ ] Package structure, CI, pyproject.toml
Milestone 1b — Hybrid Query Proof of Concept:
- [ ] Vector operations: embed, store, search on graph nodes
- [ ] One hybrid query pattern working end-to-end (graph_then_vector)
- [ ] Jupyter notebook demo: "Graph + Vector search in one PostgreSQL instance"
Milestone 1c — Parallel AGE Contributions:
- [ ] MERGE ON CREATE/MATCH SET PR submitted (#1619)
- [ ] Predicate functions PR submitted (#552-556)
- [ ] agtype → JSONB cast PR submitted (#350)
Ship: pip install duffy with basic graph queries returning DataFrames, vector search on graph nodes, and one hybrid query pattern. Enough for a blog post and a demo.
Phase 2: Integrations (Months 3-4)¶
Make Duffy work with the tools people already use.
- [ ] PyTorch Geometric FeatureStore/GraphStore backend
- [ ] LlamaIndex GraphStore integration (contributes to issue #1783)
- [ ] LangChain vector store + graph store backends
- [ ] NetworkX/igraph export
- [ ] Bulk operations (DataFrame → graph, graph → DataFrame)
- [ ] Graph algorithm library (PageRank, community detection, centrality)
Ship: Duffy as a viable alternative to Neo4j's Python ecosystem for data science and RAG workflows.
Phase 3: Domain APIs (Months 5-8)¶
Build the opinionated layers that make specific verticals productive immediately.
- [ ] duffy.domains.security — attack graphs, compliance graphs, vulnerability propagation
- [ ] duffy.domains.knowledge — entity resolution, KG construction, graph-aware RAG
- [ ] duffy.domains.financial — dependency graphs, cost models, what-if analysis
- [ ] duffy.domains.legal — case law graphs, precedent search, citation analysis
Ship: Domain-specific demos and documentation that show Duffy solving real problems, not just providing infrastructure.
Phase 4: Production Hardening (Months 9-12)¶
- [ ] Benchmark suite (vs Neo4j Python driver, vs raw psycopg3, vs standalone pgvector)
- [ ] Connection pool tuning, query caching, materialized view integration
- [ ] Observability (query logging, performance metrics, OpenTelemetry)
- [ ] Migration tools (Neo4j → Duffy, standalone pgvector → Duffy)
- [ ] Comprehensive documentation, tutorials, example notebooks
Competitive Position¶
| Capability | Duffy | Neo4j Python | Kuzu (archived) | Raw psycopg3 + pgvector |
|---|---|---|---|---|
| Graph queries | Cypher via AGE | Cypher (native) | Cypher (embedded) | No graph support |
| Vector search | pgvector (same DB) | Bolt-on | No | pgvector only |
| Hybrid graph+vector | First-class API | Limited | No | Manual SQL |
| DataFrame output | Default | Supported | Manual | Manual |
| PyG integration | FeatureStore/GraphStore | Via GDS (enterprise) | Had it (best in class) | No |
| LangChain/LlamaIndex | Drop-in backend | Drop-in (Neo4j only) | No | Separate vector store |
| Domain APIs | Security, Legal, Financial, KG | No | No | No |
| Operational burden | One PostgreSQL instance | Separate DB server | Embedded (no server) | One PostgreSQL instance |
| License | Apache 2.0 | GPL/Commercial | MIT (archived) | PostgreSQL + Apache 2.0 |
| Async support | Yes (psycopg3) | Yes | No | Yes |
The pitch: Neo4j's Python ecosystem is the gold standard, but it requires a separate database engine and the commercial license is expensive. Duffy provides the same developer experience on the PostgreSQL stack you already run, with the added capability of hybrid graph+vector queries that Neo4j can't match natively.
Open Design Questions¶
- Package name: duffy on PyPI? Or age-toolkit / pg-graph for discoverability? The name needs to signal "graph + vector + PostgreSQL" to someone scanning PyPI.
- Minimum viable driver scope: How thin can Milestone 1a be while still being useful? The temptation is to build everything — resist it.
- Graph algorithm execution: Python-side vs PG function vs AGE extension? Start Python-side for speed, push to PG for performance-critical paths. But where's the line?
- Embedding provider abstraction: How tightly should we couple to specific embedding providers? Pluggable interface, but which providers are first-class?
- Domain API granularity: Each domain could be its own package (duffy-security, duffy-legal) or part of the monorepo. Separate packages reduce install size but increase maintenance.
This document is the product vision for the Duffy SDK. It should be updated as design decisions are made and milestones are completed. See docs/research/age-feature-prioritization.md for the AGE upstream backlog that feeds Track B contributions.