
FinOps Fundamentals

Understanding Cloud Financial Operations for Engineers

What is FinOps?

FinOps is an operational framework and cultural practice that brings technology, finance, and business teams together to make data-driven spending decisions. It is not just a tool or a dashboard; it is shared responsibility for cloud costs across the entire organization.

For engineers, this means you own the cost of what you build. Just as you are responsible for code quality and security, you are responsible for cost efficiency.

The Three Pillars of FinOps


1. Inform: Visibility & Allocation
2. Optimize: Rates & Usage
3. Operate: Continuous Improvement

Inform: See Where Money Goes

  • Cost Allocation: Tag resources so you know which team/project/feature costs what
  • Showback/Chargeback: Make costs visible to the teams that incur them
  • Forecasting: Predict future costs based on usage trends
  • Anomaly Detection: Alert when costs spike unexpectedly

For your capstone: Tag your resources by feature/component so you can see which parts cost the most.
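A minimal sketch of tag-based cost allocation, assuming billing line items are available as records with a cost and a dictionary of tags. The field names (`cost_usd`, `tags`, `feature`) and all figures are hypothetical and do not follow any particular provider's export format.

```python
from collections import defaultdict

def cost_by_tag(line_items, tag_key):
    """Sum cost per value of one tag; items missing the tag go to 'untagged'."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += item["cost_usd"]
    return dict(totals)

# Hypothetical billing records, for illustration only.
line_items = [
    {"resource": "vm-1", "cost_usd": 40.0, "tags": {"feature": "api"}},
    {"resource": "vm-2", "cost_usd": 120.0, "tags": {"feature": "training"}},
    {"resource": "bucket-1", "cost_usd": 15.0, "tags": {"feature": "training"}},
    {"resource": "vm-3", "cost_usd": 25.0, "tags": {}},
]

totals = cost_by_tag(line_items, "feature")
# Print the most expensive feature first.
for name, cost in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: ${cost:.2f}")
```

A large "untagged" total is a finding in its own right: those resources cannot be attributed to any team or feature until they are tagged.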

Optimize: Reduce Waste

  • Right-sizing: Match instance size to actual workload needs
  • Purchasing options: Use reserved/spot instances where appropriate
  • Idle resources: Shut down dev/test environments when not in use
  • Storage tiering: Move cold data to cheaper storage classes

For your capstone: Start with smaller instances and scale up only if needed.

Operate: Make It a Habit

  • Governance: Policies for what can be deployed and how
  • Automation: Auto-scale down, scheduled shutdowns
  • Regular reviews: Weekly cost review meetings
  • Culture: Everyone feels ownership of costs

For your capstone: Add a cost section to your project README.
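The scheduled-shutdown idea from the Automation bullet can be sketched as a simple time check. The function name and the 08:00 to 18:00 weekday window are assumptions chosen for illustration; a real scheduler would also need to handle time zones and call the provider's start/stop API, both of which are left out here.

```python
from datetime import datetime

def should_be_running(now, start_hour=8, end_hour=18):
    """Return True on weekdays from start_hour (inclusive) to end_hour (exclusive)."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and start_hour <= now.hour < end_hour

# A Monday morning, a Monday night, and a Saturday morning.
for moment in (datetime(2024, 1, 8, 9), datetime(2024, 1, 8, 22), datetime(2024, 1, 6, 10)):
    action = "keep running" if should_be_running(moment) else "shut down"
    print(f"{moment:%a %H:%M}: {action}")
```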

Who's Responsible for Cloud Costs?

Everyone. FinOps is a team sport.

1. Engineers: Design efficient systems, right-size resources, write efficient code
2. Product: Prioritize features with cost in mind, define SLAs that balance cost against reliability and performance
3. Finance: Budget allocation, cost reporting, contract negotiations
4. Leadership: Set cost targets, enable a cost-aware culture

The Four Major Cost Levers

1. Right-Sizing Compute
Don't use a large instance when a small one will do. Monitor actual CPU and memory usage and adjust.
Example: Your API uses 10% of a large instance → switch to medium and save roughly 50%, assuming the next size down costs about half as much
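The arithmetic behind the right-sizing example can be sketched as follows. The halving ratios and the $0.20/hour price are assumptions for illustration; actual size ladders and prices vary by provider and instance family.

```python
def step_down(utilization, hourly_price, capacity_ratio=0.5, price_ratio=0.5):
    """Project utilization, price, and savings after moving one size down.

    capacity_ratio and price_ratio are assumptions: the smaller size is taken
    to have half the capacity at half the price.
    """
    new_utilization = utilization / capacity_ratio
    new_price = hourly_price * price_ratio
    savings = 1 - new_price / hourly_price
    return new_utilization, new_price, savings

# Hypothetical numbers: a large instance at $0.20/hour running at 10% CPU.
new_util, new_price, savings = step_down(0.10, 0.20)
print(f"projected utilization: {new_util:.0%}")
print(f"new price: ${new_price:.2f}/hour, savings: {savings:.0%}")
```

Under these assumptions the medium instance would still run at only about 20% utilization, so a further step down may be worth testing against peak load.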
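2. Storage Class Choices
Storage priced for frequent access (hot) costs the most per GB. Rarely accessed (cold) data can sit in cheaper tiers, usually in exchange for slower or costlier retrieval.
Example: Move logs older than 30 days to archive storage → around 80% cheaper per GB, depending on provider and tier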
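A rough sketch of the 30-day log rule, under stated assumptions: the per-GB prices are hypothetical (chosen so that archive is 80% cheaper than hot), the log volume is invented, and retrieval fees and minimum storage durations are ignored.

```python
from datetime import date, timedelta

# Hypothetical per-GB monthly prices; archive is 80% cheaper than hot.
HOT_PRICE = 0.023
ARCHIVE_PRICE = 0.0046

def split_by_age(objects, today, max_hot_age_days=30):
    """Return (hot_gb, archive_gb) given (created_date, size_gb) pairs."""
    cutoff = today - timedelta(days=max_hot_age_days)
    hot_gb = sum(size for created, size in objects if created >= cutoff)
    archive_gb = sum(size for created, size in objects if created < cutoff)
    return hot_gb, archive_gb

def monthly_cost(hot_gb, archive_gb):
    """Monthly storage bill for a given split between the two tiers."""
    return hot_gb * HOT_PRICE + archive_gb * ARCHIVE_PRICE

today = date(2024, 6, 30)
# Hypothetical log history: 10 GB written per day for the last 300 days.
logs = [(today - timedelta(days=age), 10.0) for age in range(300)]

hot_gb, archive_gb = split_by_age(logs, today)
before = monthly_cost(hot_gb + archive_gb, 0)  # everything kept hot
after = monthly_cost(hot_gb, archive_gb)       # old logs moved to archive
print(f"hot: {hot_gb:.0f} GB, archive: {archive_gb:.0f} GB")
print(f"monthly cost: ${before:.2f} -> ${after:.2f} ({1 - after / before:.0%} lower)")
```

The total bill falls by less than the per-GB discount because the most recent logs stay in the hot tier.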
3. Scheduling Resources
Dev/test environments don't need to run 24/7. Schedule them to shut down at night and on weekends.
Example: Dev environment runs 10h/day instead of 24h → save about 58%, and more if it also stays off on weekends
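The 58% figure follows directly from the share of hours the environment no longer runs. A short sketch of that calculation, assuming cost scales linearly with running hours (storage and other always-on charges are ignored):

```python
HOURS_PER_WEEK = 24 * 7

def schedule_savings(hours_per_day, days_per_week=7):
    """Fraction of compute cost saved versus running 24/7."""
    running_hours = hours_per_day * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK

every_day = schedule_savings(10)                       # 10h/day, all 7 days
weekdays_only = schedule_savings(10, days_per_week=5)  # 10h/day, Monday to Friday

print(f"10h/day, every day: {every_day:.0%} saved")
print(f"10h/day, weekdays only: {weekdays_only:.0%} saved")
```

Running 10 hours a day on all seven days gives about 58%; switching off on weekends as well raises the saving to about 70%.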
4. Purchasing Models
On-demand is flexible but the most expensive per hour. Reserved instances trade a usage commitment for a lower rate; spot instances are cheaper still but can be reclaimed at short notice.
Example: Training job on spot instances → up to 90% savings

The Power of Smart Design Choices

1. Always-On GPU Training: on-demand GPU instance running 24/7, ~$2,500/month
2. Scheduled Spot Training: checkpointed jobs on spot instances, running 4h/day, ~$125/month

Same training workload, 95% cost reduction
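One way to reproduce these two figures is sketched below. The source does not state the spot discount; 70% is an assumption, picked because it makes the 4h/day scenario land on about $125. The 30-day month is also an assumption.

```python
def monthly_cost(hourly_rate, hours_per_day, days=30, discount=0.0):
    """Monthly cost for an instance billed by the hour."""
    return hourly_rate * hours_per_day * days * (1 - discount)

# On-demand hourly rate implied by ~$2,500/month for 24/7 use.
on_demand_rate = 2500 / (24 * 30)

always_on = monthly_cost(on_demand_rate, 24)
# Assumption: spot capacity is 70% cheaper than on-demand.
scheduled_spot = monthly_cost(on_demand_rate, 4, discount=0.70)
reduction = 1 - scheduled_spot / always_on

print(f"always-on on-demand: ${always_on:,.0f}/month")
print(f"scheduled spot:      ${scheduled_spot:,.0f}/month")
print(f"reduction:           {reduction:.0%}")
```

Most of the saving comes from running 4 hours instead of 24; the spot discount is applied on top of that.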

ML-Specific Cost Considerations

GPU/TPU Compute

GPU and TPU instances can cost roughly 10-50x more per hour than CPU-only instances. Use them only when the workload needs them, and use spot instances for training when possible.

Training vs Inference

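Training is bursty and, with checkpointing, can tolerate interruption. Real-time inference has to stay available whenever requests arrive. Design each accordingly: interruptible capacity for training, stable capacity for serving.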
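Interruption tolerance in practice means the job can resume from its last saved state. The sketch below is a toy loop, not a real training framework: the `train` function, the JSON checkpoint format, and the simulated interruption are all hypothetical stand-ins.

```python
import json
import os
import tempfile

def train(total_steps, checkpoint_path, checkpoint_every=10, interrupt_at=None):
    """Run a toy training loop that resumes from the last checkpoint if one exists."""
    state = {"step": 0, "loss": 1.0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)  # resume instead of starting over

    while state["step"] < total_steps:
        if interrupt_at is not None and state["step"] == interrupt_at:
            raise RuntimeError("simulated spot interruption")
        state["step"] += 1
        state["loss"] *= 0.99  # stand-in for a real optimizer step
        if state["step"] % checkpoint_every == 0:
            # Write to a temp file, then rename, so an interruption
            # never leaves a half-written checkpoint behind.
            tmp_path = checkpoint_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump(state, f)
            os.replace(tmp_path, checkpoint_path)
    return state

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
try:
    train(100, path, interrupt_at=37)  # instance reclaimed partway through
except RuntimeError:
    pass
final = train(100, path)  # a new instance picks up from the saved checkpoint
print(f"finished at step {final['step']}")
```

Only the work done since the last checkpoint is lost on interruption, so the checkpoint interval is a trade-off between write overhead and repeated work.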

Data Storage

Training datasets can be huge. Store raw data in cheap cold storage, processed data in faster tiers.

Model Registry

Large models (GBs each) add up. Keep only the versions you need, and archive or delete old experiments.
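A retention rule like this can be sketched as a small filter. The record fields (`version`, `stage`, `size_gb`), the "production" stage name, and the keep-latest-three policy are all hypothetical; a real registry would expose its own metadata and deletion API.

```python
def versions_to_prune(versions, keep_latest=3, protected=("production",)):
    """Return versions that are neither among the newest keep_latest
    nor marked with a protected stage, newest first."""
    ordered = sorted(versions, key=lambda v: v["version"], reverse=True)
    keep = {v["version"] for v in ordered[:keep_latest]}
    return [
        v for v in ordered
        if v["version"] not in keep and v.get("stage") not in protected
    ]

# Hypothetical registry contents.
registry = [
    {"version": 1, "stage": None, "size_gb": 2.0},
    {"version": 2, "stage": "production", "size_gb": 2.5},
    {"version": 3, "stage": None, "size_gb": 2.5},
    {"version": 4, "stage": None, "size_gb": 3.0},
    {"version": 5, "stage": None, "size_gb": 3.0},
    {"version": 6, "stage": "staging", "size_gb": 3.0},
]

prune = versions_to_prune(registry)
reclaimed_gb = sum(v["size_gb"] for v in prune)
print("prune versions:", [v["version"] for v in prune])
print(f"space reclaimed: {reclaimed_gb} GB")
```

Listing candidates before deleting anything leaves room for a review step, or for archiving to a cheaper tier instead of deleting.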

Check Your Understanding

1. What does FinOps mean?

  • A tool for tracking cloud costs
  • A cultural practice of shared cost responsibility
  • A finance department's job

2. Which is the best candidate for spot instances?

  • Production API serving real-time predictions
  • Training job that saves checkpoints frequently
  • Critical database server

3. How can you reduce storage costs for old logs?

  • Delete all logs immediately
  • Keep everything in fast storage
  • Move old logs to cheaper archive storage tiers