Neo4j Capacity Planning Report
Date: 2026-02-12
Instance: EC2 (i-0f6e375c5dc124526, 10.4.19.205)
Neo4j Version: 5.x (Community/Enterprise)
Branch: feature/relevancy-airflow
1. Current Graph State
Data queried from production Neo4j on 2026-02-12.
Node Counts
| Label | Count |
|---|---|
| Product | 40,211 |
| Community | 31,359 |
| Category | 18,121 |
| User | 16,367 |
| Retailer | 1,936 |
| Total Nodes | 107,994 |
Relationship Counts
| Type | Count |
|---|---|
| PURCHASED | 95,650 |
| SHOPS_IN | 93,223 |
| MEMBER_OF | 42,602 |
| SHOPS_AT | 23,389 |
| Total Relationships | 254,864 |
Current Neo4j Memory Configuration
| Setting | Value |
|---|---|
| Heap (initial) | 4 GiB |
| Heap (max) | 8 GiB |
| Page cache | 4 GiB |
| Transaction memory limit | Unlimited (0B) |
2. Per-User Profile (Measured)
Relationship Distribution
| Relationship | Avg/User | Median | p95 | Max |
|---|---|---|---|---|
| PURCHASED | 5.8 | 3 | 20 | 103 |
| SHOPS_IN | 5.7 | — | 17 | — |
| MEMBER_OF | 2.6 | — | 6 | — |
| SHOPS_AT | 1.6 | — | 4 | — |
| Total | ~15.7 | — | ~47 | — |
Property Sizes (Sampled)
User node (4 properties):
- user_id (string), timezone (string), created_at (datetime), last_updated_at (datetime)
- Estimated: ~250 bytes
PURCHASED relationship (6 properties):
- times (int), last (datetime), avg_interval_days (int), repurchase_likelihood (float), timestamps[] (datetime array), receipt_ids[] (string array)
- Array growth: avg 2.1 entries for repeat purchases, max 9, p95 = 3
- Estimated: ~400 bytes per relationship
Product node (5 properties):
- product_id, name, category, brand, created_at
- Estimated: ~200 bytes
SHOPS_AT relationship (2 properties):
- last_visit (datetime), frequency (int)
- Estimated: ~120 bytes
SHOPS_IN relationship (~2 properties):
- Estimated: ~120 bytes
MEMBER_OF relationship (no properties):
- Estimated: ~60 bytes
Community node (5 properties):
- community_id, primary_category, name, member_count, zip_code
- Estimated: ~150 bytes
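The sizes above combine with the per-user averages from the Relationship Distribution table into a per-user footprint. The following is a minimal Python sketch of that arithmetic, not the script used for the report. The amortized node counts (2.5 Product, 1.6 Community) are taken from the Per-User Storage Estimate that follows, and the names `COMPONENTS` and `per_user_bytes` are illustrative.

```python
# Per-user storage estimate from the sampled sizes and per-user averages
# reported in this section. All identifiers here are illustrative.

# component -> (average count per user, estimated bytes each)
COMPONENTS = {
    "User node": (1.0, 250),
    "PURCHASED rels": (5.8, 400),
    "Product nodes (amortized)": (2.5, 200),
    "SHOPS_IN rels": (5.7, 120),
    "MEMBER_OF rels": (2.6, 60),
    "Community nodes (amortized)": (1.6, 150),
    "SHOPS_AT rels": (1.6, 120),
}


def per_user_bytes(components):
    """Sum count x size over all components."""
    return sum(count * size for count, size in components.values())


total = per_user_bytes(COMPONENTS)
print(f"Average per user: ~{total / 1000:.1f} KB")
```

Swapping the average counts for the p95 counts in the same dictionary gives a comparable estimate for heavy purchasers.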
Per-User Storage Estimate
| Component | Est. Size |
|---|---|
| User node | ~250 bytes |
| 5.8 PURCHASED rels × 400 bytes | ~2,320 bytes |
| 2.5 Product nodes (amortized, shared across users) | ~500 bytes |
| 5.7 SHOPS_IN rels × 120 bytes | ~684 bytes |
| 2.6 MEMBER_OF rels × 60 bytes | ~156 bytes |
| 1.6 Community nodes (amortized, shared) × 150 bytes | ~240 bytes |
| 1.6 SHOPS_AT rels × 120 bytes | ~192 bytes |
| Average per user | ~4.3 KB |
| p95 per user (heavy purchasers) | ~12 KB |
3. Data Source Inventory
Currently Ingested (Kafka — Real-Time)
| Source | Topic | Graph Entities | Status |
|---|---|---|---|
| Receipt events | pipeline-v1-processed-receipt-events | User, Product, Category, Retailer + PURCHASED, SHOPS_IN, SHOPS_AT, MEMBER_OF | Active (~247 writes/10min) |
| Factoid events | factoid-stream | Same as receipts (purchase data from order/reward events) | Active (~184 writes/10min, backfilling from 2025-11-13) |
Planned (Airflow — Daily Batch from Snowflake)
| # | Source | Table | Graph Target | Est. Records | Per-User Impact |
|---|---|---|---|---|---|
| 1 | User Brand Affinity | RELEVANCY.USER_BRAND_AFFINITY_SCD | (User)-[:HAS_BRAND_AFFINITY]->(Brand) | 562M flattened | ~10 rels/user, ~1.5 KB |
| 2 | User Category Affinity | RELEVANCY.USER_CATEGORY_AFFINITY_SCD | (User)-[:HAS_CATEGORY_AFFINITY]->(Category) | 190M flattened | ~5 rels/user, ~750 bytes |
| 3 | User Reward Features | RELEVANCY.USER_OFFER_AWARD_FEATURES | User node properties + (User)-[:IS_ELIGIBLE]->(Offer) | — | ~500 bytes (node props) |
| 4-5 | Offer Redemptions / Engagement | RELEVANCY.OFFER_IMPRESSIONS, WEBSOCKET_SERVICE.WS_OFFER_BRAND_IMPRESSIONS_STAGE | TBD | — | Depends on design |
| 6 | User Retailer Affinity | RELEVANCY.USER_RETAILER_AFFINITY_SCD | Properties on retailer rels | — | ~200 bytes |
| 7 | User Receipt Scan Features | RELEVANCY.USER_RECEIPT_SCAN_FEATURES | User node properties | — | ~200 bytes |
| 8 | Offer Brand/Category Affinity | RELEVANCY.OFFER_BRAND_AFFINITY_SCD / OFFER_CATEGORY_AFFINITY_SCD | (Offer)-[:APPLIES_TO]->(Brand/Category) | — | Amortized |
| 9 | Offer Retailer Affinity | RELEVANCY.OFFER_RETAILER_AFFINITY_UNIFIED | (Offer)-[:AVAILABLE_AT]->(Retailer) | — | Amortized |
Overlap Analysis: Kafka vs. Airflow
No true duplication. Kafka handles real-time individual purchase events; Airflow provides pre-computed aggregate features (affinity scores, engagement metrics). They target different relationship types.
One naming inconsistency to resolve: Kafka uses SHOPS_AT for user-retailer; Airflow doc references PURCHASED_AT.
Per-User Storage After Airflow Sources (1-3)
| Component | Est. Size |
|---|---|
| Current graph data | ~4.3 KB |
| + Brand affinity (~10 rels × 150 bytes) | ~1.5 KB |
| + Category affinity (~5 rels × 150 bytes) | ~750 bytes |
| + Reward features (node properties) | ~500 bytes |
| + Retailer affinity (rel properties) | ~200 bytes |
| + Receipt scan features (node props) | ~200 bytes |
| With Airflow Phase 1 (avg per user) | ~7.5 KB |
| With Airflow Phase 1 (p95) | ~20 KB |
4. Capacity Projections — Graph Only (No Vectors)
Current Schema
| Users | Est. Store Size | Fits in 4 GB Page Cache? | Instance Needed |
|---|---|---|---|
| 16K (current) | ~70 MB | Yes | Current setup fine |
| 100K | ~430 MB | Yes | Current setup fine |
| 500K | ~2.1 GB | Yes | Current setup fine |
| 1M | ~4.3 GB | Borderline | Increase page cache to 8 GB |
| 5M | ~21 GB | No | r6g.xlarge (32 GB RAM, ~16 GB page cache) |
| 10M | ~43 GB | No | r6g.2xlarge (64 GB RAM, ~40 GB page cache) |
| 50M | ~215 GB | No | r6g.8xlarge (256 GB RAM) or AuraDB |
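The store sizes in this table are the user count multiplied by the ~4.3 KB per-user average. A short Python sketch of that projection, in decimal units, is given below. The function names are illustrative, and the 4 GB page cache is the current setting from Section 1.

```python
# Projected store size = users x average KB per user (decimal units).
PAGE_CACHE_GB = 4.0  # current page cache setting from Section 1


def store_size_gb(users, per_user_kb):
    """Estimated store size in GB for a given user count."""
    return users * per_user_kb * 1_000 / 1e9


def fits_in_page_cache(size_gb, page_cache_gb=PAGE_CACHE_GB):
    """True when the whole store can be held in the page cache."""
    return size_gb <= page_cache_gb


for users in (16_367, 100_000, 500_000, 1_000_000, 5_000_000):
    size = store_size_gb(users, per_user_kb=4.3)
    print(f"{users:>10,} users: ~{size:.2f} GB, fits: {fits_in_page_cache(size)}")
```

Passing `per_user_kb=7.5` instead gives the store sizes for the Airflow Phase 1 schema.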
With Airflow Affinity Data (Phase 1)
| Users | Est. Store Size | Instance Needed |
|---|---|---|
| 100K | ~750 MB | Current setup fine |
| 500K | ~3.7 GB | Current setup (borderline) |
| 1M | ~7.5 GB | r6g.large (16 GB) — increase page cache |
| 5M | ~37.5 GB | r6g.2xlarge (64 GB RAM) |
| 10M | ~75 GB | r6g.4xlarge (128 GB RAM) |
5. Capacity Projections — With Vector Embeddings
Vector Storage Cost
A 1024-dimensional float32 vector = 1024 × 4 bytes = 4,096 bytes (~4 KB) per embedding, plus Neo4j array property overhead (~100 bytes) = ~4.2 KB per embedding.
HNSW Vector Index Overhead
Neo4j 5.x vector indexes use HNSW. Index memory ≈ num_vectors × dimensions × 4 bytes × ~1.5 (for HNSW graph structure). This must fit in memory for fast similarity search.
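The two formulas above can be written as small helper functions. This is a sketch under the stated assumptions (float32 values, ~100 bytes of array overhead, a 1.5× HNSW multiplier), and the function names are illustrative. Applied as written, the index formula gives lower figures than the Vector Index column of the Scenario A table, so its output is better read as a floor than as a reproduction of that column.

```python
BYTES_PER_FLOAT32 = 4
ARRAY_OVERHEAD_BYTES = 100  # approximate per-property overhead quoted above
HNSW_FACTOR = 1.5           # multiplier for the HNSW graph structure


def embedding_bytes(dims):
    """Bytes stored per embedding, including array property overhead."""
    return dims * BYTES_PER_FLOAT32 + ARRAY_OVERHEAD_BYTES


def vector_storage_gb(num_vectors, dims):
    """On-disk vector storage in GB (decimal)."""
    return num_vectors * embedding_bytes(dims) / 1e9


def hnsw_index_gb(num_vectors, dims):
    """Index memory in GB per the formula: n x dims x 4 bytes x 1.5."""
    return num_vectors * dims * BYTES_PER_FLOAT32 * HNSW_FACTOR / 1e9


print(embedding_bytes(1024))                # bytes per 1024-dim embedding
print(vector_storage_gb(2_000_000, 1024))   # 1M users + 1M products
print(hnsw_index_gb(2_000_000, 1024))
```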
Scenario A: Embeddings on Users + Products
| Users | Products (est.) | Vector Storage | Vector Index | Graph Storage | Total | Instance Needed |
|---|---|---|---|---|---|---|
| 16K | 40K | 231 MB | 500 MB | 70 MB | ~800 MB | Current (16 GB) |
| 100K | 150K | 1 GB | 4 GB | 430 MB | ~5.4 GB | r6g.xlarge (32 GB) |
| 500K | 500K | 4.1 GB | 18 GB | 2.1 GB | ~24 GB | r6g.2xlarge (64 GB) |
| 1M | 1M | 8.2 GB | 30 GB | 4.3 GB | ~42 GB | r6g.2xlarge (64 GB, tight) |
| 5M | 2.5M | 33 GB | 90 GB | 21 GB | ~144 GB | r6g.4xlarge (128 GB, tight) or r6g.8xlarge (256 GB) |
Scenario B: Embeddings on Users + Products + Airflow Affinity Data
At 1M users: graph (~7.5 GB) + vectors (~8.2 GB) + vector index (~30 GB) = ~46 GB → r6g.4xlarge (128 GB).
Scenario C: Reduced Dimensions (512-dim instead of 1024)
Halving the dimensions halves vector storage and index overhead. At 1M users with User + Product embeddings: vectors (~4.1 GB) + vector index (~15 GB) + graph (~4.3 GB) = ~23 GB total → r6g.xlarge (32 GB) may suffice.
6. AWS Vector Storage Alternatives
If vectors are stored externally (not in Neo4j), graph-only sizing from Section 4 applies.
| Service | Type | Max Dims | Index | Serverless | Latency | Cost (1M vectors, 1024-dim) |
|---|---|---|---|---|---|---|
| OpenSearch Serverless | Dedicated vector DB | 16,000 | HNSW, IVF, FAISS | Yes | ~10-50ms | ~$350/mo base (4 OCUs) |
| Aurora pgvector | Extension on PostgreSQL | 2,000 | HNSW, IVFFlat | Yes (v2) | ~5-20ms | ~$200-400/mo (depends on ACU) |
| Neptune Analytics | Graph-native vectors | 65,535 | HNSW | Yes | ~10-30ms | Usage-based |
| MemoryDB (Valkey VSS) | In-memory vector search | Unlimited | HNSW, FLAT | No | <1ms | Incremental (need ~10 GB more RAM) |
| DocumentDB | Native vector support | 2,000 | HNSW | Elastic | ~5-20ms | ~$200/mo |
Recommended Architecture by Scale
Current → 100K users: Keep vectors in Neo4j
- Simplest. Single query combines graph traversal + vector similarity.
- Instance: r6g.xlarge (32 GB) handles graph + vectors + index.
100K → 1M users: Neo4j (graph) + Valkey VSS (vectors)
- Reuses existing Valkey infrastructure.
- Sub-millisecond vector search latency.
- Query pattern: Valkey kNN → candidate IDs → Neo4j graph enrichment.
- Requires sizing up Valkey node by ~10 GB for 1M × 1024-dim vectors.
1M+ users: Neo4j (graph) + OpenSearch Serverless (vectors)
- Scales independently. FAISS-backed, no capacity planning for vectors.
- Slightly higher latency (~10-50ms) but fully managed.
- Query pattern: OpenSearch kNN → candidate IDs → Neo4j graph enrichment.
- Best for production at scale.
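Both external-vector options share the same two-stage query pattern: kNN in the vector store, then graph enrichment of the candidate IDs in Neo4j. The sketch below illustrates only the control flow in plain Python. The brute-force `knn` function and the `graph` dictionary are stand-ins for the vector service and for a Neo4j lookup, not real client calls, and the toy data is invented for illustration.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn(query, vectors, k):
    """Stage 1: brute-force top-k IDs by cosine similarity (vector store stand-in)."""
    return heapq.nlargest(k, vectors, key=lambda item_id: cosine(query, vectors[item_id]))


def enrich(candidate_ids, graph):
    """Stage 2: attach graph context to each candidate (Neo4j stand-in)."""
    return [{"id": cid, **graph.get(cid, {})} for cid in candidate_ids]


# Toy data: 3-dim vectors instead of 1024-dim, a dict instead of a graph.
vectors = {"p1": [1.0, 0.0, 0.0], "p2": [0.9, 0.1, 0.0], "p3": [0.0, 1.0, 0.0]}
graph = {"p1": {"category": "snacks"}, "p2": {"category": "snacks"}, "p3": {"category": "dairy"}}

results = enrich(knn([1.0, 0.0, 0.0], vectors, k=2), graph)
print(results)
```

Keeping the candidate list small (the `k` in stage 1) bounds the work done in the graph enrichment step, which is the main reason for splitting the query this way.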
Alternative: Neptune Analytics (graph + vectors unified)
- Eliminates Neo4j entirely. Graph traversal + vector similarity in one query.
- Requires migration from Neo4j (openCypher compatible but not identical).
- Worth evaluating if starting fresh or planning a major migration.
7. Instance Sizing Reference
| EC2 Instance | vCPU | RAM | Network | On-Demand $/hr | Reserved $/hr (1yr) |
|---|---|---|---|---|---|
| r6g.large | 2 | 16 GB | Up to 10 Gbps | $0.1008 | ~$0.063 |
| r6g.xlarge | 4 | 32 GB | Up to 10 Gbps | $0.2016 | ~$0.126 |
| r6g.2xlarge | 8 | 64 GB | Up to 10 Gbps | $0.4032 | ~$0.252 |
| r6g.4xlarge | 16 | 128 GB | Up to 10 Gbps | $0.8064 | ~$0.504 |
| r6g.8xlarge | 32 | 256 GB | 12 Gbps | $1.6128 | ~$1.008 |
Memory allocation rule of thumb:
- JVM Heap: 8-16 GB (larger heap = longer GC pauses)
- Page cache: as large as possible (should cover store size)
- Vector index: must fit in remaining memory
- OS overhead: ~2-4 GB
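The rule of thumb can be turned into a rough instance picker: subtract heap, OS overhead, and vector index from RAM, then check whether the remainder covers the store. The Python sketch below assumes an 8 GB heap and 4 GB of OS overhead as defaults, both chosen from the ranges above for illustration, and the function names are hypothetical. It applies the rule strictly, so it can suggest a larger instance than the projection tables, which at some scales accept a page cache smaller than the store.

```python
# (name, RAM in GB) from the instance table above, smallest first.
INSTANCES = [
    ("r6g.large", 16),
    ("r6g.xlarge", 32),
    ("r6g.2xlarge", 64),
    ("r6g.4xlarge", 128),
    ("r6g.8xlarge", 256),
]


def page_cache_budget_gb(ram_gb, heap_gb=8, os_gb=4, vector_index_gb=0.0):
    """RAM left for the page cache after heap, OS, and vector index."""
    return ram_gb - heap_gb - os_gb - vector_index_gb


def smallest_instance(store_gb, vector_index_gb=0.0, heap_gb=8, os_gb=4):
    """First instance whose page cache budget covers the store, else None."""
    for name, ram_gb in INSTANCES:
        if page_cache_budget_gb(ram_gb, heap_gb, os_gb, vector_index_gb) >= store_gb:
            return name
    return None


print(smallest_instance(store_gb=2.1))                       # graph only, 500K users
print(smallest_instance(store_gb=4.3, vector_index_gb=30))   # 1M users with vectors
```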
Disk recommendations:
- gp3: Baseline 3,000 IOPS, 125 MB/s. Sufficient for current write volume (~430 writes/10min, under 1 write/s).
- io2: Provision higher IOPS if write throughput increases significantly with Airflow bulk syncs.
8. Recommendations
Short Term (Current → 100K users)
- No changes needed for the graph itself; the current instance handles it at this scale. If embeddings are kept in Neo4j, Scenario A calls for r6g.xlarge (32 GB) by 100K users.
- Monitor store size growth as Kafka consumers backfill and Airflow sync begins.
- Run MATCH (u:User) RETURN count(u) periodically to track user growth.
Medium Term (100K → 1M users)
- Upgrade to r6g.xlarge (32 GB RAM). Increase page cache to 16 GB.
- If adding vectors: evaluate Valkey VSS to offload vector storage from Neo4j.
- Implement Airflow Phase 1 sources (brand/category/retailer affinity, reward features).
Long Term (1M+ users)
- Upgrade to r6g.2xlarge or r6g.4xlarge depending on vector strategy.
- Move vectors to OpenSearch Serverless or Valkey VSS.
- Consider Neptune Analytics if a full migration is acceptable.
- Evaluate AuraDB (managed Neo4j) for reduced operational overhead.
Key Metrics to Monitor
- Neo4j store size on disk (ls -lh /var/lib/neo4j/data/databases/neo4j/)
- Page cache hit ratio (target > 98%; a lower ratio means more disk reads and latency spikes)
- JVM heap usage and GC pause times
- Query latency p95 (current baseline: measure and record)
- User count growth rate (to forecast when upgrades are needed)
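For the last metric, a compound-growth estimate is one simple way to turn a measured growth rate into a lead time for upgrades. The Python sketch below assumes steady month-over-month growth. The 10% rate is a placeholder rather than a measured value, and `months_until` is an illustrative name.

```python
import math


def months_until(current_users, target_users, monthly_growth):
    """Months of compound growth needed to reach target_users.

    monthly_growth is a fraction, e.g. 0.10 for 10% per month.
    """
    if target_users <= current_users:
        return 0.0
    if monthly_growth <= 0:
        return math.inf
    return math.log(target_users / current_users) / math.log(1 + monthly_growth)


# Thresholds follow the recommendation tiers; the growth rate is a placeholder.
for target in (100_000, 1_000_000):
    months = months_until(16_367, target, 0.10)
    print(f"{target:>9,} users in ~{months:.0f} months")
```

Replacing the placeholder with the observed rate from the periodic user count gives the approximate runway before each tier's upgrade is due.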
Appendix: Queries Used for This Analysis
// Node counts by label
MATCH (n)
RETURN labels(n)[0] AS label, count(n) AS count
ORDER BY count DESC

// Relationship counts by type
MATCH ()-[r]->()
RETURN type(r) AS type, count(r) AS count
ORDER BY count DESC

// Per-user relationship distribution
MATCH (u:User)-[r:PURCHASED]->(p:Product)
WITH u, count(r) AS purchases
RETURN avg(purchases), max(purchases), min(purchases),
       percentileCont(purchases, 0.5) AS median,
       percentileCont(purchases, 0.95) AS p95

// Timestamp array growth (repeat purchases)
MATCH (u:User)-[r:PURCHASED]->(p:Product)
WHERE size(r.timestamps) > 1
WITH u, r, size(r.timestamps) AS ts_count
RETURN avg(ts_count), max(ts_count), percentileCont(ts_count, 0.95)

// Memory configuration
CALL dbms.listConfig() YIELD name, value
WHERE name IN ['server.memory.heap.max_size', 'server.memory.pagecache.size']
RETURN name, value

// Measure actual store size (run on EC2)
// ls -lh /var/lib/neo4j/data/databases/neo4j/
// Or: CALL apoc.monitor.store() YIELD totalStoreSize