Cost Estimation Breakdown & Validation Guide

Transparency Statement

The cost estimates in KNOWLEDGE_GRAPH_GAPS_AND_OPPORTUNITIES.md are ballpark estimates based on:

Industry-standard engineering labor rates
Publicly available API pricing
Typical project complexity multipliers
Common infrastructure costs

These are NOT quotes - they’re planning estimates to help with budgeting. Actual costs will vary based on your specific situation.

Engineering Labor Costs

Where the Numbers Came From

Phase 1: $40-60K (2 engineers, 8 weeks)

2 senior engineers × 8 weeks = 16 engineer-weeks
At $2,500-3,750/week per engineer = $40-60K

Phase 2: $50-80K (2 engineers, 8 weeks)

2 senior engineers × 8 weeks = 16 engineer-weeks
Plus LLM API experimentation costs (~$5-10K)
Total: $50-80K

Phase 3: $60-100K (3 engineers, 8 weeks)

3 senior engineers × 8 weeks = 24 engineer-weeks
Plus infrastructure setup/testing
Total: $60-100K

Assumptions Made

Engineer Level: Senior/Staff level

Years of experience: 5-10+
Skills: Neo4j, Python/Go, ML/AI, system design
Rate assumptions:
- Internal employees: $150-250K/year salary → ~$3K/week fully loaded
- Contractors: $150-300/hour → $6-12K/week
- Agency: $200-400/hour → $8-16K/week

What I Used:

Middle-ground assumption: $2,500-3,750/week
Roughly equivalent to $130-195K annual salary fully loaded
Or $160-240/hour contract rate

How to Validate

Option 1: Use your actual rates

Your Engineer Cost/Week × Number of Engineers × 8 weeks = Phase Cost

Example with $4K/week engineers:
Phase 1: $4K × 2 × 8 = $64K (vs my estimate: $40-60K)
Phase 2: $4K × 2 × 8 + $10K APIs = $74K (vs my estimate: $50-80K)
Phase 3: $4K × 3 × 8 = $96K (vs my estimate: $60-100K)

Option 2: Get quotes

Contact ML/AI consulting firms
Ask for T&M (time and materials) estimates
Typical range: $150-400/hour depending on expertise

Option 3: Use industry benchmarks

Glassdoor salaries for your area
Add 40% for benefits, overhead, management
Divide by 48 working weeks

API & Service Costs

OpenAI Embeddings

Listed Price: $0.0001 per 1K tokens

Source: https://openai.com/api/pricing/

Model: text-embedding-3-small
As of Nov 2024

My Estimate for Your Scale:

Assumptions:
- 100K products to embed
- Average 50 tokens per product (name + short description)
- = 5M tokens
- Cost: 5,000 × $0.0001 = $0.50 initial

Monthly updates:
- 10K products change/month
- = 500K tokens
- Cost: ~$0.05/month

So effectively free for embeddings.

OpenAI GPT-4 API

Listed Prices:

GPT-4o: $2.50 per 1M input tokens, $10 per 1M output tokens
GPT-4-turbo: $10 per 1M input tokens, $30 per 1M output tokens

Source: https://openai.com/api/pricing/

My Estimate: $500-2000/month

Based on:

Assumptions:
- 10K queries/month initially
- Average query: 1K tokens context + 500 tokens response
- = 15M input tokens + 5M output tokens
- Cost: (15 × $2.50) + (5 × $10) = $37.50 + $50 = $87.50/month

But:
- Some queries will be complex (10K+ context)
- Users might ask multiple follow-ups
- Testing/development usage
- Buffer for growth
→ Conservative estimate: $500-2000/month

Reality Check: Most companies doing GraphRAG see:

Light usage (<1K queries/day): $100-500/month
Medium usage (1-10K queries/day): $500-3000/month
Heavy usage (10K+ queries/day): $3000-15000/month

Alternative: Anthropic Claude

Listed Prices:

Claude 3.5 Sonnet: $3 per 1M input, $15 per 1M output
Claude 3 Opus: $15 per 1M input, $75 per 1M output

Source: https://www.anthropic.com/api

Cost would be similar: $500-2000/month at similar usage

Infrastructure Costs

Neo4j Enterprise

My Estimate: $3,000-5,000/month

Where this came from:

Neo4j doesn’t publish public pricing
Based on reported costs from:
- Reddit discussions
- Tech community forums
- Colleagues’ experiences

Typical pricing models:

Aura Professional: $0.50-2.00/hour (~$360-1440/month for always-on)
Aura Enterprise: Contact sales (typically $2K-10K+/month)
Self-hosted Enterprise: License fee + infrastructure

Why you might need Enterprise:

Relationship property indexes (critical for scale)
Advanced security features
Better support SLAs
Clustering capabilities

How to validate:

Contact Neo4j sales: sales@neo4j.com
Get quote for your expected scale
Ask about:
- Nodes: 1-100M
- Relationships: 10M-1B
- Query throughput: 100-1000 QPS
- Data size: 10GB-1TB

Alternative: Stay on Community Edition

Free
No relationship property indexes
Single-instance only
Good enough for Phase 1-2
Upgrade to Enterprise in Phase 3 if needed

GPU Instances (Optional)

My Estimate: $500-1,000/month

Based on:

AWS g5.xlarge: ~$1.00/hour = $720/month
Google Cloud T4: ~$0.35/hour = $252/month
Azure NCv3: ~$0.90/hour = $648/month

You only need this if:

Self-hosting embedding models
Running your own LLMs
High-volume image processing

Most companies don’t need this - use API instead.

Total Cost Breakdown

Phase 1 (Weeks 1-8)

Item	My Estimate	Your Actual
Engineering (2 engineers × 8 weeks)	$40-60K	$___K
OpenAI API (dev/testing)	$100-500	$___
Neo4j (Community free)	$0	$___
Total	$40-60K	$___K

Phase 2 (Weeks 9-16)

Item	My Estimate	Your Actual
Engineering (2 engineers × 8 weeks)	$40-60K	$___K
OpenAI API (production)	$500-2000/mo × 2	$1-4K
Data scraping/acquisition	$5-10K	$___K
Neo4j (Community still OK)	$0	$___
Total	$50-80K	$___K

Phase 3 (Weeks 17-24)

Item	My Estimate	Your Actual
Engineering (3 engineers × 8 weeks)	$60-90K	$___K
OpenAI API (full production)	$500-2000/mo × 2	$1-4K
Neo4j Enterprise	$3-5K/mo × 2	$6-10K
Infrastructure (monitoring, etc.)	$2-5K	$___K
Total	$60-100K	$___K

Ongoing Monthly Costs (After Phase 3)

Item	My Estimate	Your Actual
Neo4j Enterprise	$3-5K	$___K
LLM API (OpenAI/Claude)	$500-2000	$___K
Infrastructure (hosting, monitoring)	$500-1000	$___K
Total/Month	$4-8K	$___K

Hidden Costs Not Included

1. Data Acquisition

Scraping reviews: $5-20K one-time
Recipe database license: $0-50K/year
Nutrition data: $0-10K/year
Product images: bandwidth costs

2. Project Management

Product manager: 25-50% time
Designer: 10-25% time for UX
QA/Testing: 20% of dev time

3. Infrastructure

Hosting (AWS/GCP/Azure): $500-2000/month
Monitoring (Datadog, etc.): $200-500/month
CI/CD pipeline: $100-300/month

4. Compliance/Legal

Data privacy review
Terms of service updates
API terms compliance check

Total Hidden Costs: +30-50% on top of my estimates

Cost Drivers: What Makes It More/Less Expensive

More Expensive If:

🔴 Higher engineer rates

Bay Area: +100% ($6-8K/week)
NYC: +75% ($5-6K/week)
Agency/consultants: +150% ($8-12K/week)

🔴 More complex requirements

Custom ML models: +4-8 weeks
Multi-language support: +2-4 weeks
Mobile apps: +6-12 weeks
Real-time features: +2-4 weeks

🔴 Higher scale

10M+ users: Need Enterprise from start
100M+ users: Need clustering
Global deployment: +complexity

🔴 More stakeholders

More meetings, reviews, alignment
Slower decision-making
Change requests

Less Expensive If:

🟢 Lower engineer rates

Nearshore: -40% ($1,500-2,000/week)
Offshore: -60% ($1,000-1,500/week)
Junior engineers: -50% ($1,250-1,875/week)

🟢 Simpler scope

Skip recipes: -2 weeks
Skip review scraping: -2 weeks
Basic GraphRAG only: -4 weeks

🟢 Open source alternatives

Self-host Llama 3: Save $500-2K/month
Use Sentence Transformers: Save API costs
Community Neo4j: Save $3-5K/month

🟢 Existing infrastructure

Already have Neo4j cluster
Already have ML platform
Already have data pipelines

Validation Checklist

Before committing to these estimates, validate:

Engineering Costs

What’s our actual engineer cost? (salary + benefits + overhead)
Can we use existing team or need contractors?
Do we have the right skills in-house?
What’s our typical project overhead multiplier?

API Costs

Get actual quote from OpenAI sales
Estimate realistic query volume (not optimistic)
Factor in testing/development usage (2-3x production)
Consider caching strategy (can reduce by 50-80%)

Infrastructure

Get quote from Neo4j for our scale
Price out our cloud provider (AWS/GCP/Azure)
Check if we have existing credits/discounts
Factor in data transfer costs

Timeline

Add buffer for unknowns (+20-30%)
Account for holidays/PTO
Consider team ramp-up time
Plan for parallel work vs sequential

Better Cost Estimation Approach

Step 1: Break Down Into Stories

Epic: Semantic Product Search
├─ Story: Design vector schema (3 days)
├─ Story: Generate embeddings for products (2 days)
├─ Story: Create vector index (1 day)
├─ Story: Build search API (3 days)
├─ Story: Frontend integration (5 days)
├─ Story: Testing & tuning (3 days)
Total: 17 days = 3.4 weeks

Step 2: Apply Your Multipliers

3.4 weeks base estimate
× 1.5 (unknown complexity)
× 1.2 (review/QA overhead)
× 1.1 (meetings/coordination)
= 6.7 weeks actual

At $3K/week/engineer:
6.7 × $3K = ~$20K for this epic

Step 3: Sum Up All Epics

Semantic Search: $20K
Ontology Design: $15K
Entity Linking: $25K
Testing: $10K
Total: $70K

This bottom-up approach is more accurate than my top-down estimates.

Recommended Next Steps

For More Accurate Estimates:

Run a spike (1-2 weeks, $6-12K)
- Build minimal vector search
- Test Neo4j vector index
- Try OpenAI API with real data
- Measure actual costs
Get quotes
- OpenAI sales team
- Neo4j sales team
- 2-3 consulting firms
- Compare to my estimates
Prototype Phase 1
- Pick 1-2 features
- Build in 4 weeks
- Measure actual time/cost
- Extrapolate to full project
Bottom-up estimation
- Break into user stories
- Estimate each story
- Add buffers
- Compare to my top-down estimate

Honest Assessment

My Estimates Are Probably:

Too Low If:

You’re in high-cost area (SF, NYC)
You need extensive custom work
You’re risk-averse (want lots of testing)
You have compliance requirements
Your data is messy

About Right If:

You have mid-level engineers
You can use off-the-shelf solutions
You’re OK with some technical debt
Your data is relatively clean
You can move fast

Too High If:

You have excellent in-house talent
You can use open-source models
You have existing infrastructure
You’re willing to cut scope
You can accept higher risk

Reality Check

Most similar projects I’ve seen:

Successful: 6-12 months, $150-300K total
Struggled: 12-18 months, $400-600K total
Failed: 18+ months, $500K+, abandoned

The difference is usually:

Clear scope vs scope creep
Iterative delivery vs big bang
Strong PM vs weak PM
Good data vs bad data

Conclusion

Where My Numbers Came From:

✅ Labor: Industry-standard rates ($130-195K salary equiv) ✅ APIs: Public pricing × estimated usage ✅ Infrastructure: Community reports + typical pricing ❌ Contingency: Not fully included (should add 20-30%) ❌ Hidden costs: Not included (PM, data, etc.)

What You Should Do:

Use my estimates for initial planning/budgeting
Add 30-50% buffer for safety
Get quotes for validation
Run a spike/prototype to derisk
Build bottom-up estimate before committing

Final Number With Buffers:

Phase	My Estimate	+30% Buffer	+50% Buffer
Phase 1	$40-60K	$52-78K	$60-90K
Phase 2	$50-80K	$65-104K	$75-120K
Phase 3	$60-100K	$78-130K	$90-150K
Total	$150-240K	$195-312K	$225-360K

Most Likely Actual Cost: $200-350K over 6-9 months

This accounts for:

Real-world delays
Scope clarification
Integration challenges
Testing/QA
Project overhead

Bottom line: Budget $250K to be safe. If you come in under, great. If you go over, you have buffer.