How We Reduced CRM Data Latency by 96%
Our sales and marketing teams were working with 24-hour-old data. A lead would sign up at 9 AM, and their engagement data and derived insights wouldn't reach the CRM until 9 AM the next day (sometimes later, or not at all). By then, the lead might have already moved on to a competitor or received the wrong marketing touch.
We needed real-time CRM data. HubSpot offered a native reverse-ETL sync from Snowflake that looked perfect on paper. In practice, it broke immediately.
The problem? Cybrary uses UUIDs as the primary user identifier, while HubSpot keys contacts on email addresses. When a user changed their email or used multiple addresses, the sync's change detection broke: updates could be missed and were hard to detect. The native sync was fundamentally incompatible with our data model.
We had two options: pay $450/month for a reverse-ETL service like Hightouch, or build our own. We chose to build. It took a few days and costs ~$50/month to run.
Here's how we did it, and why Snowflake's data share made it feasible.
Build vs Buy: The Decision
When the native sync failed, we evaluated our options:
Option A: Pay for Hightouch (or Census)
- Cost: ~$450/month minimum
- Pros: Works immediately, no engineering time
- Cons: Another vendor dependency, limited customization, recurring cost scales with data volume
Option B: Build Custom
- Cost: ~3 days engineering time + $50/month compute
- Pros: Full control, customizable, scales cheaply
- Cons: Requires engineering time, we own the maintenance
The decision hinged on one question: Could we actually build this quickly?
The answer was yes, thanks to HubSpot's Snowflake data share.
The Key Enabler: Snowflake Data Share
HubSpot provides a Snowflake data share that lets you query their CRM data directly from your Snowflake warehouse. This is a game-changer for building reverse-ETL.
Why it matters:
- No ETL needed. You can query live CRM data without extracting it first.
- Snowflake does the heavy lifting. Comparing 1M+ records is a SQL query, not a Python loop.
- Delta detection is trivial. Join your data against HubSpot's share to find what changed.
This meant we could offload the hardest part—finding which records need updating—to Snowflake. The custom code just needed to fetch results and submit them to HubSpot's API.
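To make that concrete: once the share is accepted, it shows up as a read-only database you can query like any other. Here's a minimal sketch using the Python connector; the credentials and the hubspot_share.contacts name are placeholders for however the share is mounted in your account:

import os
import snowflake.connector

# The HubSpot share appears as a read-only database once accepted; the
# database/table names below are placeholders for however it is mounted.
conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    warehouse="ANALYTICS_WH",  # placeholder warehouse name
)
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM hubspot_share.contacts")  # live CRM data, no ETL
print("Contacts visible through the share:", cur.fetchone()[0])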
What We Built
The architecture is simple: Snowflake finds the deltas, Cloud Run submits them to HubSpot.
Architecture:
─────────────
┌────────────────────────────────────────────────────┐
│                      Snowflake                     │
│                                                    │
│  ┌─────────────────┐          ┌─────────────────┐  │
│  │  Our Warehouse  │          │  HubSpot Share  │  │
│  │   (user data)   │────────▶ │  (current CRM)  │  │
│  │                 │ compare  │                 │  │
│  └─────────────────┘          └─────────────────┘  │
│                                                    │
│  SQL Query: What's different?                      │
│  Returns: ~100-200 changed records                 │
└─────────────────────────┬──────────────────────────┘
                          │
                          │ delta
                          │
 ┌──────────────┐         │
 │   Airflow    │         │
 │ (scheduler)  │         │
 └──────┬───────┘         │
        │ triggers        │
        │ every 15 min    │
        ▼                 │
 ┌──────────────┐         │
 │  Cloud Run   │◀────────┘
 │     Job      │
 │ (~1k lines)  │
 └──────┬───────┘
        │ batch upsert
        ▼
 ┌──────────────┐
 │   HubSpot    │
 │     API      │
 └──────────────┘
How It Works:
─────────────
1. Cloud Composer (managed Airflow) triggers every 15 minutes
2. Cloud Run queries Snowflake for delta (records that changed)
3. Snowflake compares our warehouse vs HubSpot data share
4. Returns only the differences (~100-200 records typically)
5. Cloud Run submits batch update to HubSpot API
6. Results logged for monitoring
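For step 1, the scheduling piece is tiny. Here's a rough sketch of what the Cloud Composer DAG can look like, assuming the Google provider's CloudRunExecuteJobOperator is installed; the project, region, and job names are placeholders:

# A minimal DAG: every 15 minutes, kick off the Cloud Run sync job.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.cloud_run import (
    CloudRunExecuteJobOperator,
)

with DAG(
    dag_id="hubspot_contact_sync",
    schedule_interval="*/15 * * * *",  # every 15 minutes
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    run_sync = CloudRunExecuteJobOperator(
        task_id="run_sync_job",
        project_id="our-gcp-project",       # placeholder
        region="us-central1",               # placeholder
        job_name="hubspot-contact-sync",    # placeholder Cloud Run job name
    )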
The Delta Query
This is where Snowflake data share shines. Instead of scanning all 1M+ records in application code, we let Snowflake do the comparison:
-- Find contacts that need updating
WITH our_data AS (
    SELECT
        user_id,
        email,
        phone,
        properties,
        updated_at
    FROM warehouse.contacts
    WHERE is_active = true
      AND updated_at >= CURRENT_TIMESTAMP - INTERVAL '20 minutes'
),
hubspot_data AS (
    SELECT
        contact_id,
        email,
        phone,
        properties
    FROM hubspot_share.contacts
)
SELECT o.*
FROM our_data o
LEFT JOIN hubspot_data h ON o.email = h.email
WHERE
    -- New contacts (not in HubSpot)
    h.contact_id IS NULL
    -- Or existing contacts with changes (NULL-safe, so a missing value
    -- on either side still counts as a change)
    OR o.phone IS DISTINCT FROM h.phone
    OR o.properties IS DISTINCT FROM h.properties
ORDER BY o.updated_at DESC
This query runs entirely in Snowflake and returns only the records that need updating. The 20-minute lookback against a 15-minute schedule builds in a small overlap, so records that land near a window boundary aren't missed. On average, that's 100-200 contacts per 15-minute window (out of 1M+ total).
The Cloud Run job just fetches these results and submits them to HubSpot's batch API. Simple, fast, cheap.
The Cloud Run Job
The actual sync logic is ~200 lines of Python. The core of it looks like this:
import snowflake.connector
import requests


def sync_contacts():
    """Sync changed contacts from Snowflake to HubSpot."""
    # 1. Query Snowflake for delta
    contacts = execute_snowflake_query(DELTA_QUERY)
    if not contacts:
        return {"contacts_synced": 0}

    # 2. Transform for HubSpot API
    hubspot_batch = [
        {
            "email": c["email"],
            "properties": {
                "phone": normalize_phone(c["phone"]),
                "user_id": c["user_id"],
                # ... more fields
            },
        }
        for c in contacts
    ]

    # 3. Batch upsert to HubSpot
    response = requests.post(
        "https://api.hubapi.com/contacts/v1/contact/batch",
        json=hubspot_batch,
        headers={"Authorization": f"Bearer {HUBSPOT_API_KEY}"},
    )
    response.raise_for_status()  # fail loudly so the Airflow run is marked failed

    # 4. Log results for monitoring
    log_sync_results(response)
    return {"contacts_synced": len(contacts)}
That's it. The complexity is in the SQL query (which Snowflake handles), not the application code.
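The helpers referenced above are just as small. As one example, here's a sketch of what execute_snowflake_query can look like, assuming credentials come from environment variables; the connection parameters are placeholders:

import os

import snowflake.connector
from snowflake.connector import DictCursor


def execute_snowflake_query(query: str) -> list[dict]:
    """Run a query and return rows as dicts (a sketch; connection details are placeholders)."""
    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        warehouse="ANALYTICS_WH",  # placeholder warehouse name
    )
    try:
        cur = conn.cursor(DictCursor)  # rows come back as {column: value} dicts
        cur.execute(query)
        rows = cur.fetchall()
        # Snowflake uppercases unquoted column names, so normalize keys to the
        # lowercase names the transform step above expects.
        return [{k.lower(): v for k, v in row.items()} for row in rows]
    finally:
        conn.close()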
Results
96% latency reduction · 90% cost savings · Built in 3 days
- Data latency: 24 hours → 15 minutes (96% improvement)
- Monthly cost: $450 (Hightouch) → ~$50 (total infrastructure: Cloud Run + Snowflake queries + Composer)
- Build time: 3 days (planning + implementation + testing)
- Records synced: ~32,000 contact updates per month
- Maintenance: ~1 hour per month (monitoring dashboards)
Infrastructure cost breakdown: Cloud Run compute is minimal (~$1/month, scales to zero between syncs). The bulk of the cost is Snowflake compute for delta queries (~$20-30/month) and Cloud Composer for Airflow scheduling (~$20-30/month). Total: ~$40-60/month depending on query volume. We use ~$50 as a typical estimate.
Business impact: Sales and marketing teams can now act on lead data within 15 minutes instead of 24+ hours. Response time is one of the strongest predictors of conversion.
Unexpected Benefits
Beyond speed and cost, building custom unlocked capabilities we hadn't anticipated:
1. Git-Based Version Control
All sync logic lives in version control. We can:
- Review changes in pull requests before deploying
- Roll back bad syncs instantly (git revert)
- Test new logic in dev environments
- See full audit trail of who changed what and when
2. Easy to Add New Syncs
Want to sync a new dataset? Just add a SQL file and a config block. We went from 1 sync to 5 syncs in a month without touching the core pipeline code.
Each sync is just:
- A SQL query file (defines what to sync)
- A config entry (CRON schedule + HubSpot destination)
No vendor approval needed. No new connectors to purchase. Just commit and deploy.
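To give a feel for it, a config entry might look roughly like this (the field names here are illustrative, not our exact schema):

# Illustrative sync registry; one entry per dataset (hypothetical field names).
SYNCS = {
    "contacts": {
        "sql_file": "sql/contacts_delta.sql",   # defines what to sync
        "schedule": "*/15 * * * *",             # cron: every 15 minutes
        "hubspot_object": "contacts",           # destination object type
        "id_property": "email",                 # how HubSpot matches records
    },
    "deals": {
        "sql_file": "sql/deals_delta.sql",
        "schedule": "0 * * * *",                # hourly is enough for deals
        "hubspot_object": "deals",
        "id_property": "deal_id",
    },
}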
3. Custom Logging and Alerting
We built exactly the observability we needed:
- Slack alerts when error rate exceeds 1%
- Dashboards showing sync performance over time
- Per-sync metrics (which syncs are slow? which fail most?)
- Detailed error messages with stack traces
With vendor solutions, you get their logging. With custom, you get exactly what you need.
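As an example of how little code this takes, here's a sketch of the error-rate alert, assuming a standard Slack incoming webhook; the webhook URL comes from an environment variable and the threshold is the 1% mentioned above:

import os
import requests


def alert_if_error_rate_high(failed: int, attempted: int, sync_name: str) -> None:
    """Post to Slack when a sync's error rate exceeds 1% (a sketch)."""
    if attempted == 0:
        return
    error_rate = failed / attempted
    if error_rate > 0.01:
        requests.post(
            os.environ["SLACK_WEBHOOK_URL"],  # standard Slack incoming webhook
            json={
                "text": (
                    f":warning: {sync_name} sync error rate is "
                    f"{error_rate:.1%} ({failed}/{attempted} records failed)"
                )
            },
            timeout=10,
        )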
When to Build vs Buy
Not every company should build their reverse-ETL pipeline. Here's when it makes sense:
Build when:
- You have Snowflake data share (or similar) access to your destination system
- Native integrations don't fit your data model
- You need custom transformation logic
- You're syncing high volumes (vendor costs scale with rows)
- You have 2-3 days of engineering time available
Buy when:
- You don't have data share access (would need to build ETL first)
- Native integration works fine for your use case
- Engineering time is scarce
- You're syncing to multiple destinations (vendor handles this)
The ROI Calculation
For us, the math was clear:
- Build cost: 3 days × $800/day (blended rate) = $2,400
- Monthly savings: $450 (vendor) - $50 (Cloud Run) = $400/month
- Payback period: 6 months
- Annual savings: $4,800/year after payback
Plus the intangible benefits: full control, git-based versioning, custom observability, and the ability to add new syncs without vendor costs.
Key Takeaways
Three lessons from building this:
1. Data Shares Change the Build vs Buy Equation
Without Snowflake data share, this would've taken weeks (build ETL, then build sync logic). With it, we built in days. Data shares make custom integrations feasible for smaller teams.
2. Keep the Cloud Run Job Simple
All the complexity lives in SQL (which Snowflake is great at). The application code just orchestrates API calls. This makes it easy to test, debug, and maintain.
3. Start with One Sync, Then Scale
We built the contact sync first. Once that worked, adding new syncs (deals, companies, custom objects) was trivial. Start small, validate the architecture, then expand.
Conclusion
When HubSpot's native sync failed, we had a choice: pay $450/month for a vendor solution or invest 3 days building our own. We chose to build, and the build paid for itself in 6 months.
The key enabler was Snowflake's data share. It turned what could've been a complex multi-week project into a few days of focused work.
If you're facing a similar decision, consider whether you have data share access. If you do, building custom might be faster and cheaper than you think.
Need help with your data infrastructure?
I help B2B SaaS companies optimize their data stacks through FinOps audits and hands-on implementation work.
Whether you're evaluating build vs buy decisions, migrating between systems, or optimizing existing pipelines, let's talk.