Neo4j Capacity Planning Report

Date: 2026-02-12
Instance: EC2 (i-0f6e375c5dc124526, 10.4.19.205)
Neo4j Version: 5.x (Community/Enterprise)
Branch: feature/relevancy-airflow


Data queried from production Neo4j on 2026-02-12.

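| Label | Count |
|---|---|
| Product | 40,211 |
| Community | 31,359 |
| Category | 18,121 |
| User | 16,367 |
| Retailer | 1,936 |
| Total Nodes | 107,994 |

| Type | Count |
|---|---|
| PURCHASED | 95,650 |
| SHOPS_IN | 93,223 |
| MEMBER_OF | 42,602 |
| SHOPS_AT | 23,389 |
| Total Relationships | 254,864 |

| Setting | Value |
|---|---|
| Heap (initial) | 4 GiB |
| Heap (max) | 8 GiB |
| Page cache | 4 GiB |
| Transaction memory limit | Unlimited (0B) |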

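| Relationship | Avg/User | Median | p95 | Max |
|---|---|---|---|---|
| PURCHASED | 5.8 | 3 | 20 | 103 |
| SHOPS_IN | 5.7 | | 17 | |
| MEMBER_OF | 2.6 | | 6 | |
| SHOPS_AT | 1.6 | | 4 | |
| Total | ~15.7 | | ~47 | |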

User node (4 properties):

  • user_id (string), timezone (string), created_at (datetime), last_updated_at (datetime)
  • Estimated: ~250 bytes

PURCHASED relationship (6 properties):

  • times (int), last (datetime), avg_interval_days (int), repurchase_likelihood (float), timestamps[] (datetime array), receipt_ids[] (string array)
  • Array growth: avg 2.1 entries for repeat purchases, max 9, p95 = 3
  • Estimated: ~400 bytes per relationship

Product node (5 properties):

  • product_id, name, category, brand, created_at
  • Estimated: ~200 bytes

SHOPS_AT relationship (2 properties):

  • last_visit (datetime), frequency (int)
  • Estimated: ~120 bytes

SHOPS_IN relationship (~2 properties):

  • Estimated: ~120 bytes

MEMBER_OF relationship (no properties):

  • Estimated: ~60 bytes

Community node (5 properties):

  • community_id, primary_category, name, member_count, zip_code
  • Estimated: ~150 bytes

User node                                               ~250 bytes
5.8 PURCHASED rels × 400 bytes                          ~2,320 bytes
2.5 Product nodes (amortized, shared across users)      ~500 bytes
5.7 SHOPS_IN rels × 120 bytes                           ~684 bytes
2.6 MEMBER_OF rels × 60 bytes                           ~156 bytes
1.6 Community nodes (amortized, shared) × 150 bytes     ~240 bytes
1.6 SHOPS_AT rels × 120 bytes                           ~192 bytes
────────────────────────────────────────────────────────────────────────
Average per user:                                       ~4.3 KB
p95 per user (heavy purchasers):                        ~12 KB
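
The per-user average above is a weighted sum of the per-entity estimates. A minimal Python sketch of that arithmetic follows, so the figure can be recomputed when the averages change. The names `AVG_PER_USER` and `per_user_bytes` are illustrative, not part of any existing tooling. The byte sizes are the rough estimates from this report, not measured values.

```python
# Each entry: (description, average count per user, estimated bytes per item).
# Counts and sizes are copied from the breakdown above; they are estimates.
AVG_PER_USER = [
    ("User node", 1.0, 250),
    ("PURCHASED rels", 5.8, 400),
    ("Product nodes (amortized)", 2.5, 200),
    ("SHOPS_IN rels", 5.7, 120),
    ("MEMBER_OF rels", 2.6, 60),
    ("Community nodes (amortized)", 1.6, 150),
    ("SHOPS_AT rels", 1.6, 120),
]


def per_user_bytes(items):
    """Sum count × size over all entries to get estimated bytes per user."""
    return sum(count * size for _, count, size in items)


total = per_user_bytes(AVG_PER_USER)
print(f"Average per user: {total:,.0f} bytes (~{total / 1000:.1f} KB)")
```

Swapping in p95 counts instead of averages would give a heavy-user estimate in the same way, although this report's ~12 KB p95 figure may also reflect larger timestamp and receipt arrays.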

| Source | Topic | Graph Entities | Status |
|---|---|---|---|
| Receipt events | pipeline-v1-processed-receipt-events | User, Product, Category, Retailer + PURCHASED, SHOPS_IN, SHOPS_AT, MEMBER_OF | Active (~247 writes/10min) |
| Factoid events | factoid-stream | Same as receipts (purchase data from order/reward events) | Active (~184 writes/10min, backfilling from 2025-11-13) |

Planned (Airflow — Daily Batch from Snowflake)

| # | Source | Table | Graph Target | Est. Records | Per-User Impact |
|---|---|---|---|---|---|
| 1 | User Brand Affinity | RELEVANCY.USER_BRAND_AFFINITY_SCD | (User)-[:HAS_BRAND_AFFINITY]->(Brand) | 562M flattened | ~10 rels/user, ~1.5 KB |
| 2 | User Category Affinity | RELEVANCY.USER_CATEGORY_AFFINITY_SCD | (User)-[:HAS_CATEGORY_AFFINITY]->(Category) | 190M flattened | ~5 rels/user, ~750 bytes |
| 3 | User Reward Features | RELEVANCY.USER_OFFER_AWARD_FEATURES | User node properties + (User)-[:IS_ELIGIBLE]->(Offer) | | ~500 bytes (node props) |
| 4-5 | Offer Redemptions / Engagement | RELEVANCY.OFFER_IMPRESSIONS, WEBSOCKET_SERVICE.WS_OFFER_BRAND_IMPRESSIONS_STAGE | TBD | | Depends on design |
| 6 | User Retailer Affinity | RELEVANCY.USER_RETAILER_AFFINITY_SCD | Properties on retailer rels | | ~200 bytes |
| 7 | User Receipt Scan Features | RELEVANCY.USER_RECEIPT_SCAN_FEATURES | User node properties | | ~200 bytes |
| 8 | Offer Brand/Category Affinity | RELEVANCY.OFFER_BRAND_AFFINITY_SCD / OFFER_CATEGORY_AFFINITY_SCD | (Offer)-[:APPLIES_TO]->(Brand/Category) | | Amortized |
| 9 | Offer Retailer Affinity | RELEVANCY.OFFER_RETAILER_AFFINITY_UNIFIED | (Offer)-[:AVAILABLE_AT]->(Retailer) | | Amortized |

There is no true duplication between the Kafka and Airflow pipelines. Kafka handles real-time individual purchase events; Airflow provides pre-computed aggregate features (affinity scores, engagement metrics). They target different relationship types.

One naming inconsistency to resolve: Kafka uses SHOPS_AT for user-retailer; Airflow doc references PURCHASED_AT.

Per-User Storage After Airflow Sources (1-3, 6-7)

Current graph data:                           ~4.3 KB
+ Brand affinity (~10 rels × 150 bytes):      ~1.5 KB
+ Category affinity (~5 rels × 150 bytes):    ~750 bytes
+ Reward features (node properties):          ~500 bytes
+ Retailer affinity (rel properties):         ~200 bytes
+ Receipt scan features (node props):         ~200 bytes
─────────────────────────────────────────────────────────
With Airflow Phase 1:                         ~7.5 KB avg per user
                                              ~20 KB at p95

4. Capacity Projections — Graph Only (No Vectors)

Current graph data (~4.3 KB per user):

| Users | Est. Store Size | Fits in 4 GB Page Cache? | Instance Needed |
|---|---|---|---|
| 16K (current) | ~70 MB | Yes | Current setup fine |
| 100K | ~430 MB | Yes | Current setup fine |
| 500K | ~2.1 GB | Yes | Current setup fine |
| 1M | ~4.3 GB | Borderline | Increase page cache to 8 GB |
| 5M | ~21 GB | No | r6g.xlarge (32 GB RAM, ~16 GB page cache) |
| 10M | ~43 GB | No | r6g.2xlarge (64 GB RAM, ~40 GB page cache) |
| 50M | ~215 GB | No | r6g.8xlarge (256 GB RAM) or AuraDB |

With Airflow Phase 1 data (~7.5 KB per user):

| Users | Est. Store Size | Instance Needed |
|---|---|---|
| 100K | ~750 MB | Current setup fine |
| 500K | ~3.7 GB | Current setup (borderline) |
| 1M | ~7.5 GB | r6g.large (16 GB), increase page cache |
| 5M | ~37.5 GB | r6g.2xlarge (64 GB RAM) |
| 10M | ~75 GB | r6g.4xlarge (128 GB RAM) |
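
The store-size figures in these projections are the user count multiplied by the per-user estimate. The sketch below reproduces that calculation in Python. The function names and the decimal unit convention (1 GB = 1,000,000 KB) are assumptions made for illustration. The page-cache check is a strict comparison with no headroom, so it reports the 1M-user case as not fitting where this report calls it borderline.

```python
def store_size_gb(users: int, kb_per_user: float) -> float:
    """Estimated store size in GB, using decimal units (1 GB = 1,000,000 KB)."""
    return users * kb_per_user / 1_000_000


def fits_in_page_cache(size_gb: float, page_cache_gb: float = 4.0) -> bool:
    """True when the whole store would fit in the page cache (no headroom)."""
    return size_gb <= page_cache_gb


# Per-user estimates from this report: current graph vs. with Airflow Phase 1.
for label, kb in (("graph only", 4.3), ("with Airflow", 7.5)):
    for users in (100_000, 500_000, 1_000_000, 5_000_000, 10_000_000):
        size = store_size_gb(users, kb)
        print(f"{label:>12} {users:>10,} users: {size:6.2f} GB, "
              f"fits in 4 GB cache: {fits_in_page_cache(size)}")
```

Measured store size on disk should replace these estimates once it is available, since the per-user byte counts are approximations.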

5. Capacity Projections — With Vector Embeddings


A 1024-dimensional float32 vector = 1024 × 4 bytes = 4,096 bytes (~4 KB) per embedding, plus Neo4j array property overhead (~100 bytes) = ~4.2 KB per embedding.

Neo4j 5.x vector indexes use HNSW. Index memory ≈ num_vectors × dimensions × 4 bytes × ~1.5 (for HNSW graph structure). This must fit in memory for fast similarity search.
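
The two formulas above can be written as a small Python sketch. The constant and function names are illustrative. The ~100-byte property overhead and the ~1.5 HNSW factor are the rough values stated in this section, not measurements. The index results will not necessarily match the Vector Index column in the scenario tables, which appear to use larger estimates than this formula yields.

```python
BYTES_PER_FLOAT32 = 4
PROPERTY_OVERHEAD_BYTES = 100  # approximate per-embedding array overhead
HNSW_FACTOR = 1.5              # approximate multiplier for HNSW graph structure


def vector_storage_bytes(num_vectors: int, dims: int = 1024) -> int:
    """Bytes needed to store the embeddings as array properties."""
    return num_vectors * (dims * BYTES_PER_FLOAT32 + PROPERTY_OVERHEAD_BYTES)


def vector_index_bytes(num_vectors: int, dims: int = 1024,
                       hnsw_factor: float = HNSW_FACTOR) -> float:
    """Approximate in-memory size of the HNSW index."""
    return num_vectors * dims * BYTES_PER_FLOAT32 * hnsw_factor


# Example: 1M users + 1M products, at 1024 and 512 dimensions.
for dims in (1024, 512):
    n = 2_000_000
    print(f"{dims}-dim: storage {vector_storage_bytes(n, dims) / 1e9:.1f} GB, "
          f"index {vector_index_bytes(n, dims) / 1e9:.1f} GB")
```

Passing `dims=512` shows the effect of the reduced-dimension option: both terms scale linearly with the dimension count, apart from the fixed per-embedding overhead.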

Scenario A: Embeddings on Users + Products

| Users | Products (est.) | Vector Storage | Vector Index | Graph Storage | Total | Instance Needed |
|---|---|---|---|---|---|---|
| 16K | 40K | 231 MB | 500 MB | 70 MB | ~800 MB | Current (16 GB) |
| 100K | 150K | 1 GB | 4 GB | 430 MB | ~5.4 GB | r6g.xlarge (32 GB) |
| 500K | 500K | 4.1 GB | 18 GB | 2.1 GB | ~24 GB | r6g.2xlarge (64 GB) |
| 1M | 1M | 8.2 GB | 30 GB | 4.3 GB | ~42 GB | r6g.2xlarge (64 GB, tight) |
| 5M | 2.5M | 33 GB | 90 GB | 21 GB | ~144 GB | r6g.4xlarge (128 GB, tight) or r6g.8xlarge (256 GB) |

Scenario B: Embeddings on Users + Products + Airflow Affinity Data


At 1M users: graph (~7.5 GB) + vectors (~8.2 GB) + vector index (~30 GB) = ~46 GB → r6g.4xlarge (128 GB).

Scenario C: Reduced Dimensions (512-dim instead of 1024)


Halves vector storage and index overhead. At 1M users with User + Product embeddings: ~21 GB total → r6g.xlarge (32 GB) may suffice.


If vectors are stored externally (not in Neo4j), graph-only sizing from Section 4 applies.

| Service | Type | Max Dims | Index | Serverless | Latency | Cost (1M vectors, 1024-dim) |
|---|---|---|---|---|---|---|
| OpenSearch Serverless | Dedicated vector DB | 16,000 | HNSW, IVF, FAISS | Yes | ~10-50ms | ~$350/mo base (4 OCUs) |
| Aurora pgvector | Extension on PostgreSQL | 2,000 | HNSW, IVFFlat | Yes (v2) | ~5-20ms | ~$200-400/mo (depends on ACU) |
| Neptune Analytics | Graph-native vectors | 65,535 | HNSW | Yes | ~10-30ms | Usage-based |
| MemoryDB (Valkey VSS) | In-memory vector search | Unlimited | HNSW, FLAT | No | <1ms | Incremental (need ~10 GB more RAM) |
| DocumentDB | Native vector support | 2,000 | HNSW | Elastic | ~5-20ms | ~$200/mo |

Current → 100K users: Keep vectors in Neo4j

  • Simplest. Single query combines graph traversal + vector similarity.
  • Instance: r6g.xlarge (32 GB) handles graph + vectors + index.

100K → 1M users: Neo4j (graph) + Valkey VSS (vectors)

  • Reuses existing Valkey infrastructure.
  • Sub-millisecond vector search latency.
  • Query pattern: Valkey kNN → candidate IDs → Neo4j graph enrichment.
  • Requires sizing up Valkey node by ~10 GB for 1M × 1024-dim vectors.

1M+ users: Neo4j (graph) + OpenSearch Serverless (vectors)

  • Scales independently. FAISS-backed, no capacity planning for vectors.
  • Slightly higher latency (~10-50ms) but fully managed.
  • Query pattern: OpenSearch kNN → candidate IDs → Neo4j graph enrichment.
  • Best for production at scale.
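
Both external-vector options above share the same two-stage query pattern: a kNN search returns candidate IDs, and the graph then enriches those candidates. The Python sketch below illustrates only the control flow, using in-memory stand-ins. `EMBEDDINGS`, `GRAPH`, `knn_candidates`, and `enrich` are hypothetical names, and the brute-force cosine ranking stands in for whatever kNN call the chosen vector store provides.

```python
import math

# Hypothetical stand-ins: a tiny embedding table and a tiny "graph" lookup.
EMBEDDINGS = {"p1": [1.0, 0.0], "p2": [0.9, 0.1], "p3": [0.0, 1.0]}
GRAPH = {
    "p1": {"category": "snacks"},
    "p2": {"category": "snacks"},
    "p3": {"category": "dairy"},
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def knn_candidates(query, k):
    """Stage 1: stand-in for the vector store's kNN search; returns IDs only."""
    ranked = sorted(EMBEDDINGS, key=lambda pid: cosine(query, EMBEDDINGS[pid]),
                    reverse=True)
    return ranked[:k]


def enrich(candidate_ids):
    """Stage 2: stand-in for a graph lookup keyed on the candidate IDs."""
    return [{"product_id": pid, **GRAPH[pid]} for pid in candidate_ids]


results = enrich(knn_candidates([1.0, 0.05], k=2))
print(results)
```

In a real deployment, stage 2 would presumably be a parameterized Cypher query that matches nodes whose ID is in the candidate list. The exact query depends on the enrichment needed and is not specified in this report.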

Alternative: Neptune Analytics (graph + vectors unified)

  • Eliminates Neo4j entirely. Graph traversal + vector similarity in one query.
  • Requires migration from Neo4j (openCypher compatible but not identical).
  • Worth evaluating if starting fresh or planning a major migration.

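| EC2 Instance | vCPU | RAM | Network | On-Demand $/hr | Reserved $/hr (1yr) |
|---|---|---|---|---|---|
| r6g.large | 2 | 16 GB | Up to 10 Gbps | $0.1008 | ~$0.063 |
| r6g.xlarge | 4 | 32 GB | Up to 10 Gbps | $0.2016 | ~$0.126 |
| r6g.2xlarge | 8 | 64 GB | Up to 10 Gbps | $0.4032 | ~$0.252 |
| r6g.4xlarge | 16 | 128 GB | Up to 10 Gbps | $0.8064 | ~$0.504 |
| r6g.8xlarge | 32 | 256 GB | 12 Gbps | $1.6128 | ~$1.008 |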

Memory allocation rule of thumb:

  • JVM Heap: 8-16 GB (larger heap = longer GC pauses)
  • Page cache: as large as possible (should cover store size)
  • Vector index: must fit in remaining memory
  • OS overhead: ~2-4 GB
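
The rule of thumb above amounts to a subtraction: whatever remains after heap, OS overhead, and any in-memory vector index is available for page cache. A minimal Python sketch follows. The function name and the default values (8 GB heap, 4 GB OS overhead) are assumptions picked from within the ranges listed, not recommended settings.

```python
def memory_budget_gb(total_ram_gb: float, heap_gb: float = 8.0,
                     os_gb: float = 4.0, vector_index_gb: float = 0.0) -> dict:
    """Split instance RAM into heap, OS overhead, vector index, and page cache."""
    page_cache_gb = total_ram_gb - heap_gb - os_gb - vector_index_gb
    if page_cache_gb <= 0:
        raise ValueError("No memory left for page cache; use a larger instance")
    return {
        "heap_gb": heap_gb,
        "os_gb": os_gb,
        "vector_index_gb": vector_index_gb,
        "page_cache_gb": page_cache_gb,
    }


# Example: a 32 GB instance, graph only, then with a 6 GB vector index.
print(memory_budget_gb(32))
print(memory_budget_gb(32, vector_index_gb=6.0))
```

Comparing the resulting page cache figure against the estimated store size indicates whether the store can stay fully cached on a given instance.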

Disk recommendations:

  • gp3: Baseline 3,000 IOPS, 125 MB/s. Sufficient for current write volume (~430 writes/10min).
  • io2: Provision higher IOPS if write throughput increases significantly with Airflow bulk syncs.

  • No changes needed. Current instance handles graph + vectors at this scale.
  • Monitor store size growth as Kafka consumers backfill and Airflow sync begins.
  • Run MATCH (u:User) RETURN count(u) periodically to track user growth.

  • Upgrade to r6g.xlarge (32 GB RAM). Increase page cache to 16 GB.
  • If adding vectors: evaluate Valkey VSS to offload vector storage from Neo4j.
  • Implement Airflow Phase 1 sources (brand/category/retailer affinity, reward features).

  • Upgrade to r6g.2xlarge or r6g.4xlarge depending on vector strategy.
  • Move vectors to OpenSearch Serverless or Valkey VSS.
  • Consider Neptune Analytics if a full migration is acceptable.
  • Evaluate AuraDB (managed Neo4j) for reduced operational overhead.

Metrics to monitor:

  • Neo4j store size on disk (ls -lh /var/lib/neo4j/data/databases/neo4j/)
  • Page cache hit ratio (target > 98%; a lower ratio means disk reads and latency spikes)
  • JVM heap usage and GC pause times
  • Query latency p95 (current baseline: measure and record)
  • User count growth rate (to forecast when upgrades are needed)

// Node counts by label
MATCH (n) RETURN labels(n)[0] AS label, count(n) AS count ORDER BY count DESC

// Relationship counts by type
MATCH ()-[r]->() RETURN type(r) AS type, count(r) AS count ORDER BY count DESC

// Per-user relationship distribution
MATCH (u:User)-[r:PURCHASED]->(p:Product)
WITH u, count(r) AS purchases
RETURN avg(purchases), max(purchases), min(purchases),
       percentileCont(purchases, 0.5) AS median,
       percentileCont(purchases, 0.95) AS p95

// Timestamp array growth (repeat purchases)
MATCH (u:User)-[r:PURCHASED]->(p:Product) WHERE size(r.timestamps) > 1
WITH u, r, size(r.timestamps) AS ts_count
RETURN avg(ts_count), max(ts_count), percentileCont(ts_count, 0.95)

// Memory configuration
CALL dbms.listConfig() YIELD name, value
WHERE name IN ['server.memory.heap.max_size', 'server.memory.pagecache.size']
RETURN name, value

// Measure actual store size (shell command, run on the EC2 host):
//   ls -lh /var/lib/neo4j/data/databases/neo4j/
// Or, if APOC is installed: CALL apoc.monitor.store() YIELD totalStoreSize