Skip to content

Scalability Analysis: Consumer Graph System

Scalability Analysis: Consumer Graph System

Section titled “Scalability Analysis: Consumer Graph System”

This Neo4j-based consumer recommendation system has excellent scalability for repurchase predictions but potential bottlenecks in the recommendations query at scale. The system is optimized for read-heavy workloads with pre-computed metrics.


  • Users: 1M - 100M nodes
  • Products: 10K - 100K nodes
  • Offers: 100 - 1K active nodes
  • Total Nodes: ~1M - 100M (dominated by Users)
  • PURCHASED: 10M - 10B relationships (10-100 purchases per user avg)
  • SIMILAR_TO: 100K - 10M relationships (sparse product similarity graph)
  • ELIGIBLE: 100M - 10B relationships (many users × many offers)
  • APPLIES_TO: 1K - 100K relationships (offers × applicable products)
  • Total Relationships: ~10M - 20B+
  • Small scale (1M users): ~10GB
  • Medium scale (10M users): ~100GB
  • Large scale (100M users): ~1TB+

2.1 Repurchase Prediction Query ✅ EXCELLENT

Section titled “2.1 Repurchase Prediction Query ✅ EXCELLENT”

Query: GetRepurchasePredictions (lines 183-200)

MATCH (u:User {user_id: $user_id})-[bp:PURCHASED]->(p:Product)
WHERE bp.last >= datetime() - duration({days: $lookback_days})
AND bp.times >= $min_intervals
AND bp.avg_interval_days > 0
WITH p, bp,
bp.last + duration({days: toInteger(bp.avg_interval_days)}) AS next_expected_date
RETURN p.product_id, p.name, bp.last, bp.avg_interval_days,
next_expected_date, bp.times, bp.times - 1
ORDER BY next_expected_date ASC
LIMIT $limitN

Complexity Analysis:

  • Time Complexity: O(P) where P = products purchased by user
  • Index Usage: user_id_unique constraint (indexed lookup)
  • Traversal Pattern: Single-hop from User → PURCHASED → Product
  • Filtering: On relationship properties (pre-computed values)

Scalability Assessment:

MetricSmall (10 purchases)Medium (100 purchases)Large (1000 purchases)
Query Time<5ms<10ms<50ms
Scaling FactorO(1) effectivelyO(1) effectivelyO(1) effectively

Why It Scales:

  1. Pre-computed intervals: avg_interval_days stored on relationships
  2. User-scoped: Query starts with unique user lookup (indexed)
  3. Bounded result: Limited by user’s purchase history
  4. Simple traversal: Single relationship type, no graph expansion
  5. Minimal computation: Just date arithmetic on relationship properties

Bottlenecks: ⚠️ NONE for typical use cases

  • Even power users with 1000s of purchases will see <100ms response times
  • Scales linearly with individual user purchase history (typically bounded)

2.2 Recommendations Query ⚠️ POTENTIAL BOTTLENECK

Section titled “2.2 Recommendations Query ⚠️ POTENTIAL BOTTLENECK”

Query: GetRecommendations (lines 56-86)

MATCH (u:User {user_id:$user_id})
MATCH (u)-[bp:PURCHASED]->(p:Product)
WHERE bp.last >= datetime() - duration({days:$lookback_days})
MATCH (p)-[s:SIMILAR_TO]->(cand:Product)
MATCH (u)-[e:ELIGIBLE]->(o:Offer)-[:APPLIES_TO]->(cand)
WHERE e.start <= $as_of
AND o.start <= $as_of AND $as_of <= o.end
OPTIONAL MATCH (u)-[rp:PURCHASED]->(cand)
WITH p, cand, s, o, rp
WHERE rp IS NULL OR rp.last < datetime() - duration({days:$exclude_days})
WITH p, cand, s, o
ORDER BY p.product_id, s.score DESC
WITH p, collect({cand:cand, score:s.score, o:o})[..$per_seed_cap] AS perSeed
UNWIND perSeed AS row
WITH row.cand AS cand, row.score AS sim_score, row.o AS o
WITH cand, o, sum(sim_score) AS agg_sim
WITH cand, o, toInteger(agg_sim * (log(1 + o.points)/log(2)) * 100) AS rank
ORDER BY rank DESC, o.priority DESC, o.end ASC
RETURN cand.product_id, cand.name, o.offer_id, o.title, o.points, o.end, rank
LIMIT $limitN

Complexity Analysis:

  • Time Complexity: O(P × S × E) where:
    • P = products purchased by user (10-100 typical)
    • S = similar products per seed (capped at per_seed_cap = 80)
    • E = eligible offers per user (10-1000s potential)
  • Worst Case: O(100 × 80 × 1000) = 8M relationship traversals

Traversal Pattern:

  1. User → PURCHASED → Product (P products)
  2. Product → SIMILAR_TO → Candidate Products (P × S candidates)
  3. User → ELIGIBLE → Offer (E offers)
  4. Offer → APPLIES_TO → Candidate (filter join)
  5. User → PURCHASED → Candidate (existence check)

Scalability Assessment:

ScalePurchasesSimilar/ProductOffersTraversalsEst. Time
Small1020102K20-50ms
Medium5050100250K100-200ms
Large1008010008M500-2000ms

Bottlenecks Identified:

  1. ❌ ELIGIBLE relationship fan-out

    • Each user may have 100s-1000s of ELIGIBLE relationships
    • This creates a large cartesian product with SIMILAR_TO candidates
    • Impact: Quadratic growth with number of offers
  2. ❌ Missing index on relationship properties

    • bp.last filtered but no index (full scan of PURCHASED relationships)
    • e.start and o.start/end filtered but limited index coverage
    • Impact: Linear scan of relationships
  3. ❌ Graph expansion

    • Query expands from 1 user → P seeds → (P × S) candidates
    • Then filters against E offers
    • Impact: Potentially millions of intermediate paths
  4. ✅ Mitigations in place:

    • per_seed_cap limits expansion per product (80 max)
    • LIMIT caps final results (20 default)
    • Offers are time-bounded (active offers only)

IndexCoversUsed ByEffectiveness
user_id_uniqueUser.user_idAll queries✅ Excellent
product_id_uniqueProduct.product_idLookups✅ Excellent
offer_id_uniqueOffer.offer_idLookups✅ Excellent
offer_datesOffer.(start, end)Recommendations⚠️ Partial
product_categoryProduct.categoryNot used❌ Unused
user_zipUser.zipNot used❌ Unused
product_brandProduct.brandNot used❌ Unused
  1. Relationship property index on PURCHASED.last

    WHERE bp.last >= datetime() - duration({days:$lookback_days})

    Impact: Full scan of user’s purchase relationships

  2. Relationship property index on ELIGIBLE.start

    WHERE e.start <= $as_of

    Impact: Full scan of user’s eligibility relationships

  3. Composite index on PURCHASED(user_id, last, product_id)

    • Would accelerate both queries significantly
    • Neo4j doesn’t support relationship property indexes in Community Edition

Repurchase Endpoint:

  • Linear scaling: Each user query is independent
  • Bounded complexity: O(user’s purchases) regardless of total users
  • Expected: <50ms @ 100M users

Recommendations Endpoint:

  • ⚠️ Degradation possible: Depends on ELIGIBLE relationship growth
  • ⚠️ If offers scale with users: Query time could grow to 1-2s
  • ⚠️ If offers stay bounded (likely): 200-500ms stable

Repurchase Endpoint:

  • No impact: Product count doesn’t affect user-scoped queries

Recommendations Endpoint:

  • ⚠️ Moderate impact: More SIMILAR_TO relationships
  • ⚠️ Mitigated by per_seed_cap limiting expansion
  • ⚠️ Expected: +50-100ms per 10x product growth

Repurchase Endpoint:

  • No global impact: Only user’s own purchases matter
  • User-level bounded: Even power users <1000 purchases

Recommendations Endpoint:

  • ⚠️ Indirect impact: More historical purchase data
  • ⚠️ Filtered by lookback window: Typically only last 120 days matter
  • ⚠️ Archival strategy needed: Older purchases should be moved to cold storage

Linear Growth (Expected):

  • New users added steadily
  • New purchases added daily
  • Older data remains but is filtered by time windows

Impact on Query Performance:

TimelineUsersPurchasesRepurchase QueryRecommendations Query
Month 1100K1M5ms50ms
Month 6500K10M5ms100ms
Year 11M50M5-10ms200ms
Year 25M500M5-10ms500ms
Year 520M5B10-20ms1-2s ⚠️

Critical Threshold: ~10M users / 1B purchases

  • Recommendations query may exceed 500ms SLA
  • Requires optimization (see Section 6)

  1. Add relationship property indexes (requires Neo4j Enterprise)

    CREATE INDEX purchased_last FOR ()-[r:PURCHASED]-() ON (r.last)
    CREATE INDEX eligible_start FOR ()-[r:ELIGIBLE]-() ON (r.start)

    Expected improvement: 50-70% reduction in query time

  2. Denormalize offer eligibility

    • Pre-compute eligible offers per user
    • Store as User property: eligible_offer_ids: [id1, id2, ...]
    • Update on eligibility changes Expected improvement: 60-80% reduction for large offer sets
  3. Add read replicas

    • All queries are read-only
    • Scale horizontally with replicas
    • Route traffic via load balancer Expected improvement: 5-10x throughput capacity
  1. Implement caching layer

    • Cache recommendations per user (5-15 min TTL)
    • Redis/Memcached for hot data
    • 80-90% cache hit rate expected Expected improvement: 10-20x reduction in DB load
  2. Optimize SIMILAR_TO graph

    • Prune low-score similarities (< 0.6 threshold)
    • Limit max similarities per product to 50
    • Reduce graph fan-out Expected improvement: 30-40% reduction in traversals
  3. Partition historical data

    • Move purchases >180 days to archive table/graph
    • Keep hot data (last 6 months) in main graph
    • Archive accessible for long-term queries only Expected improvement: 40-50% reduction in relationship scans
  1. Pre-compute recommendations

    • Batch job to generate recommendations nightly
    • Store top 20 recommendations per user
    • Serve from pre-computed table Expected improvement: Near-instant queries (<5ms)
  2. Implement graph partitioning

    • Shard by user_id ranges
    • Dedicated Neo4j instances per shard
    • Requires sophisticated routing layer Expected improvement: Unbounded horizontal scaling

Query Performance:

  • P50, P95, P99 response times per endpoint
  • Alert threshold: P95 > 500ms
  • Track over time to detect degradation

Database Health:

  • Neo4j heap usage (alert >80%)
  • Page cache hit rate (alert <85%)
  • Transaction throughput
  • Connection pool saturation

Business Metrics:

  • Queries per user per day
  • Purchase history growth rate
  • Offer creation rate
  • SIMILAR_TO relationship density
alerts:
- name: "Slow Recommendations Query"
condition: "p95_latency > 500ms for 5 minutes"
action: "Scale read replicas or optimize query"
- name: "Slow Repurchase Query"
condition: "p95_latency > 100ms for 5 minutes"
action: "Investigate index issues"
- name: "Database Memory Pressure"
condition: "heap_usage > 85%"
action: "Increase heap or add replicas"
- name: "Purchase Growth Anomaly"
condition: "daily_purchases > 2x 7-day average"
action: "Check for data quality issues"

  • Repurchase prediction endpoint is exceptionally scalable
  • Pre-computed metrics eliminate runtime complexity
  • User-scoped queries naturally partition workload
  • Graph database well-suited for relationship-heavy queries
  • Recommendations query has quadratic growth potential
  • Missing critical relationship property indexes
  • ELIGIBLE relationship fan-out is primary bottleneck
  • No caching layer for frequently accessed data

Phase 1: Current State (0-1M users)

  • System works well out of the box
  • Monitor query performance
  • No immediate optimizations needed

Phase 2: Growth (1-10M users)

  • Implement caching layer (Redis)
  • Add read replicas (2-3 instances)
  • Prune similarity graph
  • Cost: ~$500-1000/month

Phase 3: Scale (10-50M users)

  • Upgrade to Neo4j Enterprise (relationship indexes)
  • Denormalize offer eligibility
  • Partition historical data
  • Add 5-10 read replicas
  • Cost: ~$5000-10000/month

Phase 4: Massive Scale (50M+ users)

  • Pre-compute recommendations (batch jobs)
  • Implement graph sharding
  • Dedicated recommendation clusters
  • Cost: ~$20000+/month

Final Verdict: PRODUCTION-READY with optimization roadmap in place

Section titled “Final Verdict: PRODUCTION-READY with optimization roadmap in place”