Scalability Analysis: Consumer Graph System
Scalability Analysis: Consumer Graph System
Section titled “Scalability Analysis: Consumer Graph System”Executive Summary
Section titled “Executive Summary”This Neo4j-based consumer recommendation system has excellent scalability for repurchase predictions but potential bottlenecks in the recommendations query at scale. The system is optimized for read-heavy workloads with pre-computed metrics.
1. Data Model Scale Analysis
Section titled “1. Data Model Scale Analysis”Node Counts (Projected)
Section titled “Node Counts (Projected)”- Users: 1M - 100M nodes
- Products: 10K - 100K nodes
- Offers: 100 - 1K active nodes
- Total Nodes: ~1M - 100M (dominated by Users)
Relationship Counts (Projected)
Section titled “Relationship Counts (Projected)”- PURCHASED: 10M - 10B relationships (10-100 purchases per user avg)
- SIMILAR_TO: 100K - 10M relationships (sparse product similarity graph)
- ELIGIBLE: 100M - 10B relationships (many users × many offers)
- APPLIES_TO: 1K - 100K relationships (offers × applicable products)
- Total Relationships: ~10M - 20B+
Storage Estimates
Section titled “Storage Estimates”- Small scale (1M users): ~10GB
- Medium scale (10M users): ~100GB
- Large scale (100M users): ~1TB+
2. Query Performance Analysis
Section titled “2. Query Performance Analysis”2.1 Repurchase Prediction Query ✅ EXCELLENT
Section titled “2.1 Repurchase Prediction Query ✅ EXCELLENT”Query: GetRepurchasePredictions (lines 183-200)
MATCH (u:User {user_id: $user_id})-[bp:PURCHASED]->(p:Product)WHERE bp.last >= datetime() - duration({days: $lookback_days}) AND bp.times >= $min_intervals AND bp.avg_interval_days > 0WITH p, bp, bp.last + duration({days: toInteger(bp.avg_interval_days)}) AS next_expected_dateRETURN p.product_id, p.name, bp.last, bp.avg_interval_days, next_expected_date, bp.times, bp.times - 1ORDER BY next_expected_date ASCLIMIT $limitNComplexity Analysis:
- Time Complexity: O(P) where P = products purchased by user
- Index Usage:
user_id_uniqueconstraint (indexed lookup) - Traversal Pattern: Single-hop from User → PURCHASED → Product
- Filtering: On relationship properties (pre-computed values)
Scalability Assessment:
| Metric | Small (10 purchases) | Medium (100 purchases) | Large (1000 purchases) |
|---|---|---|---|
| Query Time | <5ms | <10ms | <50ms |
| Scaling Factor | O(1) effectively | O(1) effectively | O(1) effectively |
Why It Scales:
- ✅ Pre-computed intervals:
avg_interval_daysstored on relationships - ✅ User-scoped: Query starts with unique user lookup (indexed)
- ✅ Bounded result: Limited by user’s purchase history
- ✅ Simple traversal: Single relationship type, no graph expansion
- ✅ Minimal computation: Just date arithmetic on relationship properties
Bottlenecks: ⚠️ NONE for typical use cases
- Even power users with 1000s of purchases will see
<100msresponse times - Scales linearly with individual user purchase history (typically bounded)
2.2 Recommendations Query ⚠️ POTENTIAL BOTTLENECK
Section titled “2.2 Recommendations Query ⚠️ POTENTIAL BOTTLENECK”Query: GetRecommendations (lines 56-86)
MATCH (u:User {user_id:$user_id})MATCH (u)-[bp:PURCHASED]->(p:Product)WHERE bp.last >= datetime() - duration({days:$lookback_days})
MATCH (p)-[s:SIMILAR_TO]->(cand:Product)MATCH (u)-[e:ELIGIBLE]->(o:Offer)-[:APPLIES_TO]->(cand)WHERE e.start <= $as_of AND o.start <= $as_of AND $as_of <= o.end
OPTIONAL MATCH (u)-[rp:PURCHASED]->(cand)WITH p, cand, s, o, rpWHERE rp IS NULL OR rp.last < datetime() - duration({days:$exclude_days})
WITH p, cand, s, oORDER BY p.product_id, s.score DESCWITH p, collect({cand:cand, score:s.score, o:o})[..$per_seed_cap] AS perSeedUNWIND perSeed AS rowWITH row.cand AS cand, row.score AS sim_score, row.o AS o
WITH cand, o, sum(sim_score) AS agg_simWITH cand, o, toInteger(agg_sim * (log(1 + o.points)/log(2)) * 100) AS rankORDER BY rank DESC, o.priority DESC, o.end ASCRETURN cand.product_id, cand.name, o.offer_id, o.title, o.points, o.end, rankLIMIT $limitNComplexity Analysis:
- Time Complexity: O(P × S × E) where:
- P = products purchased by user (10-100 typical)
- S = similar products per seed (capped at
per_seed_cap= 80) - E = eligible offers per user (10-1000s potential)
- Worst Case: O(100 × 80 × 1000) = 8M relationship traversals
Traversal Pattern:
- User → PURCHASED → Product (P products)
- Product → SIMILAR_TO → Candidate Products (P × S candidates)
- User → ELIGIBLE → Offer (E offers)
- Offer → APPLIES_TO → Candidate (filter join)
- User → PURCHASED → Candidate (existence check)
Scalability Assessment:
| Scale | Purchases | Similar/Product | Offers | Traversals | Est. Time |
|---|---|---|---|---|---|
| Small | 10 | 20 | 10 | 2K | 20-50ms |
| Medium | 50 | 50 | 100 | 250K | 100-200ms |
| Large | 100 | 80 | 1000 | 8M | 500-2000ms |
Bottlenecks Identified:
-
❌ ELIGIBLE relationship fan-out
- Each user may have 100s-1000s of ELIGIBLE relationships
- This creates a large cartesian product with SIMILAR_TO candidates
- Impact: Quadratic growth with number of offers
-
❌ Missing index on relationship properties
bp.lastfiltered but no index (full scan of PURCHASED relationships)e.startando.start/endfiltered but limited index coverage- Impact: Linear scan of relationships
-
❌ Graph expansion
- Query expands from 1 user → P seeds → (P × S) candidates
- Then filters against E offers
- Impact: Potentially millions of intermediate paths
-
✅ Mitigations in place:
per_seed_caplimits expansion per product (80 max)LIMITcaps final results (20 default)- Offers are time-bounded (active offers only)
3. Index Coverage Analysis
Section titled “3. Index Coverage Analysis”Current Indexes (schema.go:51-71)
Section titled “Current Indexes (schema.go:51-71)”| Index | Covers | Used By | Effectiveness |
|---|---|---|---|
user_id_unique | User.user_id | All queries | ✅ Excellent |
product_id_unique | Product.product_id | Lookups | ✅ Excellent |
offer_id_unique | Offer.offer_id | Lookups | ✅ Excellent |
offer_dates | Offer.(start, end) | Recommendations | ⚠️ Partial |
product_category | Product.category | Not used | ❌ Unused |
user_zip | User.zip | Not used | ❌ Unused |
product_brand | Product.brand | Not used | ❌ Unused |
Missing Critical Indexes ❌
Section titled “Missing Critical Indexes ❌”-
Relationship property index on PURCHASED.last
WHERE bp.last >= datetime() - duration({days:$lookback_days})Impact: Full scan of user’s purchase relationships
-
Relationship property index on ELIGIBLE.start
WHERE e.start <= $as_ofImpact: Full scan of user’s eligibility relationships
-
Composite index on PURCHASED(user_id, last, product_id)
- Would accelerate both queries significantly
- Neo4j doesn’t support relationship property indexes in Community Edition
4. Scale Projections
Section titled “4. Scale Projections”4.1 Users: 1M → 10M → 100M
Section titled “4.1 Users: 1M → 10M → 100M”Repurchase Endpoint:
- ✅ Linear scaling: Each user query is independent
- ✅ Bounded complexity: O(user’s purchases) regardless of total users
- ✅ Expected:
<50ms@ 100M users
Recommendations Endpoint:
- ⚠️ Degradation possible: Depends on ELIGIBLE relationship growth
- ⚠️ If offers scale with users: Query time could grow to 1-2s
- ⚠️ If offers stay bounded (likely): 200-500ms stable
4.2 Products: 10K → 100K → 1M
Section titled “4.2 Products: 10K → 100K → 1M”Repurchase Endpoint:
- ✅ No impact: Product count doesn’t affect user-scoped queries
Recommendations Endpoint:
- ⚠️ Moderate impact: More SIMILAR_TO relationships
- ⚠️ Mitigated by
per_seed_caplimiting expansion - ⚠️ Expected: +50-100ms per 10x product growth
4.3 Purchases: 10M → 1B → 10B
Section titled “4.3 Purchases: 10M → 1B → 10B”Repurchase Endpoint:
- ✅ No global impact: Only user’s own purchases matter
- ✅ User-level bounded: Even power users
<1000purchases
Recommendations Endpoint:
- ⚠️ Indirect impact: More historical purchase data
- ⚠️ Filtered by lookback window: Typically only last 120 days matter
- ⚠️ Archival strategy needed: Older purchases should be moved to cold storage
5. Performance Over Time
Section titled “5. Performance Over Time”Data Growth Patterns
Section titled “Data Growth Patterns”Linear Growth (Expected):
- New users added steadily
- New purchases added daily
- Older data remains but is filtered by time windows
Impact on Query Performance:
| Timeline | Users | Purchases | Repurchase Query | Recommendations Query |
|---|---|---|---|---|
| Month 1 | 100K | 1M | 5ms | 50ms |
| Month 6 | 500K | 10M | 5ms | 100ms |
| Year 1 | 1M | 50M | 5-10ms | 200ms |
| Year 2 | 5M | 500M | 5-10ms | 500ms |
| Year 5 | 20M | 5B | 10-20ms | 1-2s ⚠️ |
Critical Threshold: ~10M users / 1B purchases
- Recommendations query may exceed 500ms SLA
- Requires optimization (see Section 6)
6. Optimization Recommendations
Section titled “6. Optimization Recommendations”🔴 High Priority (Do Before Scale)
Section titled “🔴 High Priority (Do Before Scale)”-
Add relationship property indexes (requires Neo4j Enterprise)
CREATE INDEX purchased_last FOR ()-[r:PURCHASED]-() ON (r.last)CREATE INDEX eligible_start FOR ()-[r:ELIGIBLE]-() ON (r.start)Expected improvement: 50-70% reduction in query time
-
Denormalize offer eligibility
- Pre-compute eligible offers per user
- Store as User property:
eligible_offer_ids: [id1, id2, ...] - Update on eligibility changes Expected improvement: 60-80% reduction for large offer sets
-
Add read replicas
- All queries are read-only
- Scale horizontally with replicas
- Route traffic via load balancer Expected improvement: 5-10x throughput capacity
🟡 Medium Priority (Before Year 2)
Section titled “🟡 Medium Priority (Before Year 2)”-
Implement caching layer
- Cache recommendations per user (5-15 min TTL)
- Redis/Memcached for hot data
- 80-90% cache hit rate expected Expected improvement: 10-20x reduction in DB load
-
Optimize SIMILAR_TO graph
- Prune low-score similarities (< 0.6 threshold)
- Limit max similarities per product to 50
- Reduce graph fan-out Expected improvement: 30-40% reduction in traversals
-
Partition historical data
- Move purchases >180 days to archive table/graph
- Keep hot data (last 6 months) in main graph
- Archive accessible for long-term queries only Expected improvement: 40-50% reduction in relationship scans
🟢 Low Priority (Nice to Have)
Section titled “🟢 Low Priority (Nice to Have)”-
Pre-compute recommendations
- Batch job to generate recommendations nightly
- Store top 20 recommendations per user
- Serve from pre-computed table
Expected improvement: Near-instant queries (
<5ms)
-
Implement graph partitioning
- Shard by user_id ranges
- Dedicated Neo4j instances per shard
- Requires sophisticated routing layer Expected improvement: Unbounded horizontal scaling
7. Monitoring Recommendations
Section titled “7. Monitoring Recommendations”Key Metrics to Track
Section titled “Key Metrics to Track”Query Performance:
- P50, P95, P99 response times per endpoint
- Alert threshold: P95 > 500ms
- Track over time to detect degradation
Database Health:
- Neo4j heap usage (alert >80%)
- Page cache hit rate (alert
<85%) - Transaction throughput
- Connection pool saturation
Business Metrics:
- Queries per user per day
- Purchase history growth rate
- Offer creation rate
- SIMILAR_TO relationship density
Recommended Alerts
Section titled “Recommended Alerts”alerts: - name: "Slow Recommendations Query" condition: "p95_latency > 500ms for 5 minutes" action: "Scale read replicas or optimize query"
- name: "Slow Repurchase Query" condition: "p95_latency > 100ms for 5 minutes" action: "Investigate index issues"
- name: "Database Memory Pressure" condition: "heap_usage > 85%" action: "Increase heap or add replicas"
- name: "Purchase Growth Anomaly" condition: "daily_purchases > 2x 7-day average" action: "Check for data quality issues"8. Conclusion
Section titled “8. Conclusion”✅ Strengths
Section titled “✅ Strengths”- Repurchase prediction endpoint is exceptionally scalable
- Pre-computed metrics eliminate runtime complexity
- User-scoped queries naturally partition workload
- Graph database well-suited for relationship-heavy queries
⚠️ Concerns
Section titled “⚠️ Concerns”- Recommendations query has quadratic growth potential
- Missing critical relationship property indexes
- ELIGIBLE relationship fan-out is primary bottleneck
- No caching layer for frequently accessed data
🎯 Recommended Scale Plan
Section titled “🎯 Recommended Scale Plan”Phase 1: Current State (0-1M users)
- System works well out of the box
- Monitor query performance
- No immediate optimizations needed
Phase 2: Growth (1-10M users)
- Implement caching layer (Redis)
- Add read replicas (2-3 instances)
- Prune similarity graph
- Cost: ~$500-1000/month
Phase 3: Scale (10-50M users)
- Upgrade to Neo4j Enterprise (relationship indexes)
- Denormalize offer eligibility
- Partition historical data
- Add 5-10 read replicas
- Cost: ~$5000-10000/month
Phase 4: Massive Scale (50M+ users)
- Pre-compute recommendations (batch jobs)
- Implement graph sharding
- Dedicated recommendation clusters
- Cost: ~$20000+/month