Neo4j Capacity Planning Report

Date: 2026-02-12
Instance: EC2 (i-0f6e375c5dc124526, 10.4.19.205)
Neo4j Version: 5.x (Community/Enterprise)
Branch: feature/relevancy-airflow

1. Current Graph State

Data queried from production Neo4j on 2026-02-12.

Node Counts

Label	Count
Product	40,211
Community	31,359
Category	18,121
User	16,367
Retailer	1,936
Total Nodes	107,994

Relationship Counts

Type	Count
PURCHASED	95,650
SHOPS_IN	93,223
MEMBER_OF	42,602
SHOPS_AT	23,389
Total Relationships	254,864

Current Neo4j Memory Configuration

Setting	Value
Heap (initial)	4 GiB
Heap (max)	8 GiB
Page cache	4 GiB
Transaction memory limit	Unlimited (0B)

2. Per-User Profile (Measured)

Relationship Distribution

Relationship	Avg/User	Median	p95	Max
PURCHASED	5.8	3	20	103
SHOPS_IN	5.7	—	17	—
MEMBER_OF	2.6	—	6	—
SHOPS_AT	1.6	—	4	—
Total	~15.7	—	~47	—
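As a cross-check, the averages can be recomputed from the totals in Section 1. This is a minimal sketch that divides each relationship count by the user count, so it averages over all users, including those with no relationships of a given type. That may explain why the SHOPS_AT ratio (~1.4) comes out lower than the 1.6 in the table, if the measured figure covers only users who have at least one such relationship, as the Appendix query does for PURCHASED.

```python
# Counts copied from the Node Counts and Relationship Counts tables in Section 1.
USER_COUNT = 16_367
REL_COUNTS = {
    "PURCHASED": 95_650,
    "SHOPS_IN": 93_223,
    "MEMBER_OF": 42_602,
    "SHOPS_AT": 23_389,
}


def rels_per_user(rel_counts, user_count):
    """Mean relationships per user, averaged over ALL users (including users with none)."""
    return {rel: count / user_count for rel, count in rel_counts.items()}


averages = rels_per_user(REL_COUNTS, USER_COUNT)
for rel, avg in averages.items():
    print(f"{rel:<10} {avg:.2f}")
print(f"{'Total':<10} {sum(averages.values()):.2f}")
```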

Property Sizes (Sampled)

User node (4 properties):

user_id (string), timezone (string), created_at (datetime), last_updated_at (datetime)
Estimated: ~250 bytes

PURCHASED relationship (6 properties):

times (int), last (datetime), avg_interval_days (int), repurchase_likelihood (float), timestamps[] (datetime array), receipt_ids[] (string array)
Array growth: avg 2.1 entries for repeat purchases, max 9, p95 = 3
Estimated: ~400 bytes per relationship

Product node (5 properties):

product_id, name, category, brand, created_at
Estimated: ~200 bytes

SHOPS_AT relationship (2 properties):

last_visit (datetime), frequency (int)
Estimated: ~120 bytes

SHOPS_IN relationship (~2 properties):

Estimated: ~120 bytes

MEMBER_OF relationship (no properties):

Estimated: ~60 bytes

Community node (5 properties):

community_id, primary_category, name, member_count, zip_code
Estimated: ~150 bytes

Per-User Storage Estimate

User node                                                    ~250 bytes
5.8 PURCHASED rels × 400 bytes                             ~2,320 bytes
2.5 Product nodes (amortized, shared) × 200 bytes            ~500 bytes
5.7 SHOPS_IN rels × 120 bytes                               ~684 bytes
2.6 MEMBER_OF rels × 60 bytes                               ~156 bytes
1.6 Community nodes (amortized, shared) × 150 bytes          ~240 bytes
1.6 SHOPS_AT rels × 120 bytes                               ~192 bytes
────────────────────────────────────────────────────────────────────────
Average per user:                                           ~4.3 KB
p95 per user (heavy purchasers):                            ~12 KB
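The average can be reproduced by summing quantity × size for each component. The sketch below uses only the quantities and byte estimates listed in this section; the dictionary layout and function name are illustrative.

```python
# (quantity per user, estimated bytes each), taken from the estimate above.
COMPONENTS = {
    "User node": (1.0, 250),
    "PURCHASED rels": (5.8, 400),
    "Product nodes (amortized)": (2.5, 200),
    "SHOPS_IN rels": (5.7, 120),
    "MEMBER_OF rels": (2.6, 60),
    "Community nodes (amortized)": (1.6, 150),
    "SHOPS_AT rels": (1.6, 120),
}


def per_user_bytes(components):
    """Sum quantity x size over all components of the per-user footprint."""
    return sum(qty * size for qty, size in components.values())


total = per_user_bytes(COMPONENTS)
print(f"Average per user: {total:,.0f} bytes (~{total / 1000:.1f} KB)")
```

The report uses decimal units throughout (1 KB = 1,000 bytes), which is how 4,342 bytes rounds to ~4.3 KB.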

3. Data Source Inventory

Currently Ingested (Kafka — Real-Time)

Source	Topic	Graph Entities	Status
Receipt events	`pipeline-v1-processed-receipt-events`	User, Product, Category, Retailer + PURCHASED, SHOPS_IN, SHOPS_AT, MEMBER_OF	Active (~247 writes/10min)
Factoid events	`factoid-stream`	Same as receipts (purchase data from order/reward events)	Active (~184 writes/10min, backfilling from 2025-11-13)

Planned (Airflow — Daily Batch from Snowflake)

#	Source	Table	Graph Target	Est. Records	Per-User Impact
1	User Brand Affinity	`RELEVANCY.USER_BRAND_AFFINITY_SCD`	`(User)-[:HAS_BRAND_AFFINITY]->(Brand)`	562M flattened	~10 rels/user, ~1.5 KB
2	User Category Affinity	`RELEVANCY.USER_CATEGORY_AFFINITY_SCD`	`(User)-[:HAS_CATEGORY_AFFINITY]->(Category)`	190M flattened	~5 rels/user, ~750 bytes
3	User Reward Features	`RELEVANCY.USER_OFFER_AWARD_FEATURES`	User node properties + `(User)-[:IS_ELIGIBLE]->(Offer)`	—	~500 bytes (node props)
4-5	Offer Redemptions / Engagement	`RELEVANCY.OFFER_IMPRESSIONS`, `WEBSOCKET_SERVICE.WS_OFFER_BRAND_IMPRESSIONS_STAGE`	TBD	—	Depends on design
6	User Retailer Affinity	`RELEVANCY.USER_RETAILER_AFFINITY_SCD`	Properties on retailer rels	—	~200 bytes
7	User Receipt Scan Features	`RELEVANCY.USER_RECEIPT_SCAN_FEATURES`	User node properties	—	~200 bytes
8	Offer Brand/Category Affinity	`RELEVANCY.OFFER_BRAND_AFFINITY_SCD` / `OFFER_CATEGORY_AFFINITY_SCD`	`(Offer)-[:APPLIES_TO]->(Brand/Category)`	—	Amortized
9	Offer Retailer Affinity	`RELEVANCY.OFFER_RETAILER_AFFINITY_UNIFIED`	`(Offer)-[:AVAILABLE_AT]->(Retailer)`	—	Amortized

Overlap Analysis: Kafka vs. Airflow

No true duplication. Kafka handles real-time individual purchase events; Airflow provides pre-computed aggregate features (affinity scores, engagement metrics). They target different relationship types.

One naming inconsistency to resolve: Kafka writes `SHOPS_AT` for the user-retailer relationship, while the Airflow doc references `PURCHASED_AT`.

Per-User Storage After Airflow Sources (1-3, 6, 7)

Current graph data:                      ~4.3 KB
+ Brand affinity (~10 rels × 150 bytes): ~1.5 KB
+ Category affinity (~5 rels × 150 bytes): ~750 bytes
+ Reward features (node properties):      ~500 bytes
+ Retailer affinity (rel properties):     ~200 bytes
+ Receipt scan features (node props):     ~200 bytes
─────────────────────────────────────────────────
With Airflow Phase 1:                    ~7.5 KB avg per user
                                         ~20 KB at p95
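The Phase 1 average is the current footprint plus each Airflow addition. A small sketch of that sum, using the figures from the block above; note that it includes the retailer affinity and receipt scan rows (sources 6 and 7) alongside sources 1-3.

```python
CURRENT_BYTES = 4_300  # current per-user average from Section 2

# Per-user additions in bytes, as listed in the estimate above.
AIRFLOW_BYTES = {
    "Brand affinity (source 1)": 10 * 150,
    "Category affinity (source 2)": 5 * 150,
    "Reward features (source 3)": 500,
    "Retailer affinity (source 6)": 200,
    "Receipt scan features (source 7)": 200,
}

phase1_bytes = CURRENT_BYTES + sum(AIRFLOW_BYTES.values())
growth = phase1_bytes / CURRENT_BYTES

print(f"Phase 1 per user: {phase1_bytes:,} bytes")
print(f"Growth over current footprint: {growth:.2f}x")
```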

4. Capacity Projections — Graph Only (No Vectors)

Current Schema

Users	Est. Store Size	Fits in 4 GB Page Cache?	Instance Needed
16K (current)	~70 MB	Yes	Current setup fine
100K	~430 MB	Yes	Current setup fine
500K	~2.1 GB	Yes	Current setup fine
1M	~4.3 GB	Borderline	Increase page cache to 8 GB
5M	~21 GB	No	r6g.xlarge (32 GB RAM, ~16 GB page cache)
10M	~43 GB	No	r6g.2xlarge (64 GB RAM, ~40 GB page cache)
50M	~215 GB	No	r6g.8xlarge (256 GB RAM) or AuraDB

With Airflow Affinity Data (Phase 1)

Users	Est. Store Size	Instance Needed
100K	~750 MB	Current setup fine
500K	~3.7 GB	Current setup (borderline)
1M	~7.5 GB	r6g.large (16 GB) — increase page cache
5M	~37.5 GB	r6g.2xlarge (64 GB RAM)
10M	~75 GB	r6g.4xlarge (128 GB RAM)
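Both tables follow from users × per-user bytes. The sketch below regenerates the store sizes and a page cache fit label. The 10% margin used to decide "Borderline" is an assumption made for illustration, not a threshold stated in this report, and the 4 GiB page cache is treated as ~4 GB.

```python
PAGE_CACHE_GB = 4.0  # current page cache setting from Section 1


def store_size_gb(users, bytes_per_user):
    """Estimated store size in decimal GB."""
    return users * bytes_per_user / 1e9


def fit_label(store_gb, page_cache_gb=PAGE_CACHE_GB, margin=0.1):
    """Classify fit; `margin` is a hypothetical tolerance around the cache size."""
    if store_gb <= page_cache_gb * (1 - margin):
        return "Yes"
    if store_gb <= page_cache_gb * (1 + margin):
        return "Borderline"
    return "No"


for label, per_user in (("Current schema", 4_300), ("Airflow Phase 1", 7_500)):
    for users in (100_000, 500_000, 1_000_000, 5_000_000, 10_000_000):
        size = store_size_gb(users, per_user)
        print(f"{label:<16} {users:>10,} users  {size:>6.2f} GB  fits: {fit_label(size)}")
```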

5. Capacity Projections — With Vector Embeddings

Vector Storage Cost

A 1024-dimensional float32 vector = 1024 × 4 bytes = 4,096 bytes (~4 KB) per embedding, plus Neo4j array property overhead (~100 bytes) = ~4.2 KB per embedding.

HNSW Vector Index Overhead

Neo4j 5.x vector indexes use HNSW. Estimated index memory ≈ num_vectors × dimensions × 4 bytes × ~1.5, where the ~1.5 factor accounts for the HNSW graph structure. The index must fit in memory for similarity search to stay fast.
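The two formulas can be written out as a short sketch. The constants come from this section; the function names are illustrative. Applied as written, the index formula gives about 12 GB for 2M vectors, which is lower than the 30 GB listed for 1M users + 1M products in Scenario A, so the table figures presumably include headroom or assumptions not stated in this report.

```python
FLOAT32_BYTES = 4
ARRAY_OVERHEAD_BYTES = 100  # per-embedding array property overhead quoted above
HNSW_FACTOR = 1.5           # multiplier for HNSW graph structure quoted above


def embedding_bytes(dims):
    """Stored size of one float32 embedding, including property overhead."""
    return dims * FLOAT32_BYTES + ARRAY_OVERHEAD_BYTES


def vector_storage_gb(num_vectors, dims):
    """Total embedding storage in decimal GB."""
    return num_vectors * embedding_bytes(dims) / 1e9


def hnsw_index_gb(num_vectors, dims):
    """Index memory per the formula above: vectors x dims x 4 bytes x ~1.5."""
    return num_vectors * dims * FLOAT32_BYTES * HNSW_FACTOR / 1e9


n = 2_000_000  # e.g. 1M users + 1M products
print(f"Per embedding: {embedding_bytes(1024):,} bytes")
print(f"Storage at {n:,} vectors: {vector_storage_gb(n, 1024):.1f} GB")
print(f"Index at {n:,} vectors:   {hnsw_index_gb(n, 1024):.1f} GB")
```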

Scenario A: Embeddings on Users + Products

Users	Products (est.)	Vector Storage	Vector Index	Graph Storage	Total	Instance Needed
16K	40K	231 MB	500 MB	70 MB	~800 MB	Current (16 GB)
100K	150K	1 GB	4 GB	430 MB	~5.4 GB	r6g.xlarge (32 GB)
500K	500K	4.1 GB	18 GB	2.1 GB	~24 GB	r6g.2xlarge (64 GB)
1M	1M	8.2 GB	30 GB	4.3 GB	~42 GB	r6g.2xlarge (64 GB, tight)
5M	2.5M	33 GB	90 GB	21 GB	~144 GB	r6g.8xlarge (256 GB); ~144 GB exceeds the 128 GB of r6g.4xlarge

Scenario B: Embeddings on Users + Products + Airflow Affinity Data

At 1M users: graph (~7.5 GB) + vectors (~8.2 GB) + vector index (~30 GB) = ~46 GB → r6g.4xlarge (128 GB).
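The scenario total and the instance choice can be sketched as a sum followed by a lookup against the RAM figures in Section 7. The 16 GB heap and 4 GB OS reservation are assumptions taken from the upper end of the rule-of-thumb ranges in Section 7; with smaller reservations a smaller instance would pass the check.

```python
# RAM per instance type, from the Instance Sizing Reference in Section 7.
INSTANCE_RAM_GB = {
    "r6g.large": 16,
    "r6g.xlarge": 32,
    "r6g.2xlarge": 64,
    "r6g.4xlarge": 128,
    "r6g.8xlarge": 256,
}
HEAP_GB = 16  # assumption: upper end of the 8-16 GB heap range
OS_GB = 4     # assumption: upper end of the 2-4 GB OS overhead range


def scenario_total_gb(graph_gb, vectors_gb, index_gb):
    """Data that should be memory-resident: graph store + vectors + vector index."""
    return graph_gb + vectors_gb + index_gb


def smallest_instance(data_gb, heap_gb=HEAP_GB, os_gb=OS_GB):
    """Smallest listed instance whose RAM covers data + heap + OS, or None."""
    required = data_gb + heap_gb + os_gb
    for name, ram in sorted(INSTANCE_RAM_GB.items(), key=lambda kv: kv[1]):
        if ram >= required:
            return name
    return None


scenario_b = scenario_total_gb(graph_gb=7.5, vectors_gb=8.2, index_gb=30)
print(f"Scenario B at 1M users: {scenario_b:.1f} GB -> {smallest_instance(scenario_b)}")
```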

Scenario C: Reduced Dimensions (512-dim instead of 1024)

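Halves vector storage and index overhead; graph storage is unchanged. At 1M users with User + Product embeddings: ~4.1 GB vectors + ~15 GB index + ~4.3 GB graph = ~23 GB total → r6g.xlarge (32 GB) may suffice, but it is tight once heap and OS overhead are included.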

6. AWS Vector Storage Alternatives

If vectors are stored externally (not in Neo4j), graph-only sizing from Section 4 applies.

Service	Type	Max Dims	Index	Serverless	Latency	Cost (1M vectors, 1024-dim)
OpenSearch Serverless	Dedicated vector DB	16,000	HNSW, IVF, FAISS	Yes	~10-50ms	~$350/mo base (4 OCUs)
Aurora pgvector	Extension on PostgreSQL	2,000	HNSW, IVFFlat	Yes (v2)	~5-20ms	~$200-400/mo (depends on ACU)
Neptune Analytics	Graph-native vectors	65,535	HNSW	Yes	~10-30ms	Usage-based
MemoryDB (Valkey VSS)	In-memory vector search	Unlimited	HNSW, FLAT	No	`<1ms`	Incremental (need ~10 GB more RAM)
DocumentDB	Native vector support	2,000	HNSW	Elastic	~5-20ms	~$200/mo

Recommended Architecture by Scale

Current → 100K users: Keep vectors in Neo4j

Simplest. Single query combines graph traversal + vector similarity.
Instance: r6g.xlarge (32 GB) handles graph + vectors + index.

100K → 1M users: Neo4j (graph) + Valkey VSS (vectors)

Reuses existing Valkey infrastructure.
Sub-millisecond vector search latency.
Query pattern: Valkey kNN → candidate IDs → Neo4j graph enrichment.
Requires sizing up Valkey node by ~10 GB for 1M × 1024-dim vectors.
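The two-stage query pattern above (kNN for candidates, then graph enrichment) can be illustrated with in-memory stand-ins. Nothing here calls Valkey or Neo4j: `knn_candidates` is a brute-force cosine search standing in for the vector store, `enrich` is a dictionary lookup standing in for a Cypher query, and all names and data are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_candidates(query_vec, vector_store, k):
    """Stage 1 stand-in for the vector store: brute-force top-k by cosine similarity."""
    ranked = sorted(vector_store, key=lambda pid: cosine(query_vec, vector_store[pid]), reverse=True)
    return ranked[:k]


def enrich(user_id, candidate_ids, purchases):
    """Stage 2 stand-in for the graph lookup: flag candidates the user already purchased."""
    bought = purchases.get(user_id, set())
    return [{"product_id": pid, "already_purchased": pid in bought} for pid in candidate_ids]


# Toy data (hypothetical): product embeddings and PURCHASED edges.
VECTORS = {"p1": [1.0, 0.0], "p2": [0.9, 0.1], "p3": [0.0, 1.0]}
PURCHASES = {"u1": {"p2"}}

candidates = knn_candidates([1.0, 0.0], VECTORS, k=2)
results = enrich("u1", candidates, PURCHASES)
print(results)
```

Only the candidate IDs cross the boundary between the two stores, which keeps the graph query small regardless of how many vectors are indexed.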

1M+ users: Neo4j (graph) + OpenSearch Serverless (vectors)

Scales independently. FAISS-backed, no capacity planning for vectors.
Slightly higher latency (~10-50ms) but fully managed.
Query pattern: OpenSearch kNN → candidate IDs → Neo4j graph enrichment.
Best for production at scale.

Alternative: Neptune Analytics (graph + vectors unified)

Eliminates Neo4j entirely. Graph traversal + vector similarity in one query.
Requires migration from Neo4j (openCypher compatible but not identical).
Worth evaluating if starting fresh or planning a major migration.

7. Instance Sizing Reference

EC2 Instance	vCPU	RAM	Network	On-Demand $/hr	Reserved $/hr (1yr)
r6g.large	2	16 GB	Up to 10 Gbps	$0.1008	~$0.063
r6g.xlarge	4	32 GB	Up to 10 Gbps	$0.2016	~$0.126
r6g.2xlarge	8	64 GB	Up to 10 Gbps	$0.4032	~$0.252
r6g.4xlarge	16	128 GB	Up to 10 Gbps	$0.8064	~$0.504
r6g.8xlarge	32	256 GB	12 Gbps	$1.6128	~$1.008

Memory allocation rule of thumb:

JVM Heap: 8-16 GB (larger heap = longer GC pauses)
Page cache: as large as possible (should cover store size)
Vector index: must fit in remaining memory
OS overhead: ~2-4 GB
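The rule of thumb can be turned into a simple budget: reserve heap and OS overhead, give the page cache enough to cover the store, and see what remains for the vector index. This is a sketch only; the default heap (8 GB) and OS reservation (4 GB) are assumptions picked from the ranges above.

```python
def memory_plan(ram_gb, store_gb, heap_gb=8, os_gb=4):
    """Split instance RAM into heap, OS, page cache, and leftover for the vector index."""
    available = max(ram_gb - heap_gb - os_gb, 0)
    page_cache = min(store_gb, available)      # cover the store if possible
    vector_index = available - page_cache      # whatever is left over
    return {
        "heap_gb": heap_gb,
        "os_gb": os_gb,
        "page_cache_gb": page_cache,
        "vector_index_gb": vector_index,
        "store_covered": page_cache >= store_gb,
    }


# Example: r6g.2xlarge (64 GB) with ~12.5 GB of store (4.3 GB graph + 8.2 GB vectors).
plan = memory_plan(ram_gb=64, store_gb=12.5)
for key, value in plan.items():
    print(f"{key:<16} {value}")
```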

Disk recommendations:

gp3: Baseline 3,000 IOPS, 125 MB/s. Sufficient for current write volume (~430 writes/10min).
io2: Provision higher IOPS if write throughput increases significantly with Airflow bulk syncs.
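For a sense of scale, the current write volume can be converted to writes per second and compared with the gp3 baseline. The sketch treats one graph write as one disk I/O operation, which is a simplifying assumption; real transactions touch the transaction log and several store files.

```python
GP3_BASELINE_IOPS = 3_000


def writes_per_second(writes, window_minutes):
    """Convert a write count over a window into an average per-second rate."""
    return writes / (window_minutes * 60)


# Receipt events (~247) plus factoid events (~184) per 10 minutes, from Section 3.
kafka_rate = writes_per_second(247 + 184, window_minutes=10)
utilization = kafka_rate / GP3_BASELINE_IOPS

print(f"Average write rate: {kafka_rate:.2f} writes/s")
print(f"Share of gp3 baseline IOPS: {utilization:.4%}")
```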

8. Recommendations

Short Term (Current → 100K users)

No changes needed for the graph itself. If embeddings are stored in Neo4j, plan for r6g.xlarge (32 GB) as usage approaches 100K users (Section 5, Scenario A).
Monitor store size growth as Kafka consumers backfill and Airflow sync begins.
Run `MATCH (u:User) RETURN count(u)` periodically to track user growth.

Medium Term (100K → 1M users)

Upgrade to r6g.xlarge (32 GB RAM). Increase page cache to 16 GB.
If adding vectors: evaluate Valkey VSS to offload vector storage from Neo4j.
Implement Airflow Phase 1 sources (brand/category/retailer affinity, reward features).

Long Term (1M+ users)

Upgrade to r6g.2xlarge or r6g.4xlarge depending on vector strategy.
Move vectors to OpenSearch Serverless or Valkey VSS.
Consider Neptune Analytics if a full migration is acceptable.
Evaluate AuraDB (managed Neo4j) for reduced operational overhead.

Key Metrics to Monitor

Neo4j store size on disk (`du -sh /var/lib/neo4j/data/databases/neo4j/`)
Page cache hit ratio (target > 98%; below that, reads go to disk and latency spikes)
JVM heap usage and GC pause times
Query latency p95 (current baseline: measure and record)
User count growth rate (to forecast when upgrades are needed)
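Once a growth rate has been measured, the time until a sizing threshold is reached can be estimated with compound growth. The 10% monthly rate in the example is a placeholder, not a measured value, and the function name is illustrative.

```python
import math


def months_until(current_users, target_users, monthly_growth_rate):
    """Months of compound growth needed to go from current_users to target_users."""
    if target_users <= current_users:
        return 0.0
    if monthly_growth_rate <= 0:
        return math.inf
    return math.log(target_users / current_users) / math.log(1 + monthly_growth_rate)


# Example: current user count from Section 1, 100K threshold from Section 8,
# and a hypothetical 10% month-over-month growth rate.
eta = months_until(16_367, 100_000, monthly_growth_rate=0.10)
print(f"Months until 100K users at 10%/month: {eta:.1f}")
```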

Appendix: Queries Used for This Analysis

// Node counts by label
MATCH (n) RETURN labels(n)[0] AS label, count(n) AS count ORDER BY count DESC

// Relationship counts by type
MATCH ()-[r]->() RETURN type(r) AS type, count(r) AS count ORDER BY count DESC

// Per-user relationship distribution (only users with at least one PURCHASED rel are matched)
MATCH (u:User)-[r:PURCHASED]->(p:Product)
WITH u, count(r) AS purchases
RETURN avg(purchases), max(purchases), min(purchases),
       percentileCont(purchases, 0.5) AS median,
       percentileCont(purchases, 0.95) AS p95

// Timestamp array growth (repeat purchases)
MATCH (u:User)-[r:PURCHASED]->(p:Product) WHERE size(r.timestamps) > 1
WITH u, r, size(r.timestamps) AS ts_count
RETURN avg(ts_count), max(ts_count), percentileCont(ts_count, 0.95)

// Memory configuration
CALL dbms.listConfig() YIELD name, value
WHERE name IN ['server.memory.heap.max_size', 'server.memory.pagecache.size']
RETURN name, value

// Measure actual store size (run on EC2)
// ls -lh /var/lib/neo4j/data/databases/neo4j/
// Or: CALL apoc.monitor.store() YIELD totalStoreSize