Cost Governance | HavingATinker

Now bear with me on this one, because I’m going to talk about a subject that may sound like an absolute yawnfest: Cost Governance. I know, I know, there are sexier things to talk about, like Kubernetes Operators and OpenTelemetry. However, Cost Governance aligns rather nicely with a lot of the other skills an engineer possesses and uses on a daily basis. Cost governance also bridges the gap between engineering and business, as we get to really question whether the engineering choices we’re making are what the business needs. Please allow me to try to explain my ludicrous claims.

Cloud providers are a bit of a double-edged sword when it comes to costs. On one hand, they are a great enabler and promote velocity; on the other, they allow your pants to be well and truly pulled down if care is not taken. Gone are the days of needing a large upfront outlay on servers, network equipment, licenses, etc. These have since been replaced with monthly service subscriptions and pay-as-you-go compute/data. This makes it very easy to rack up unexpected costs with very little understanding of the whats and whys.

In my experience across multiple organisations of different sizes, the cloud bill is usually under constant scrutiny, and for good reason too. Regardless of size, the cost should not scale linearly with the business. Having said that, in most cases there isn’t necessarily any cost governance; rather, a watchful eye is kept on the bill. So what do I mean by “cost governance”?

Cost Governance is the practice of controlling, reporting on and optimising the costs on the cloud compute bill. Through cost governance, we are able to promote and foster greater accountability, which then feeds into identifying cost efficiencies.

What Cost Governance is not is a way to stop people from spending money. Its purpose is to improve accountability and, in turn, positively impact the bill. Cost Governance should not be seen or used as a carrot or stick, but rather as a means to allow greater ownership of one’s domain. The stick approach will only garner negative emotions and pushback against the benefits that come with cost governance, e.g. budgets, visibility, and ownership. The key takeaway here is the improved accountability for the cost of the products we build and serve.

Cost Efficiencies

So far I have mentioned cost efficiencies a few times and so it’s probably good to clarify what I really mean by this:

Cost Efficiency is the act of providing value to a business at the lowest possible cost.

With this in mind, our main aim isn’t to reduce spending at all costs, but rather to make sure we are gaining the correct value from our spending and adjusting accordingly. This could come in numerous forms:

Changing clustered DBs to single instances in development environments
Autoscaling for workloads to meet demand
Spot instances for more non-critical workloads
Upfront reservations for long-term workloads
Utilising the latest available compute or storage options
Right-sizing of workloads to optimise for wastage vs usage
Correct storage classes for your data depending on their use case e.g. frequency of access, speed of access
Limit data transfer between zones/egress
Only logging what’s necessary to save on storage/parsing/indexing costs as well as removing the cognitive load when debugging
Reviewing SaaS provider usage to adjust or sunset wasteful contracts
Lifecycle policies to only keep what you need e.g. short-term vs long-term, due diligence vs legal requirement

This isn’t a definitive list, just a few examples of how someone can be cost-efficient by understanding the business requirements and adjusting their approach accordingly. Individually these might not amount to much; collectively, however, I can guarantee with almost certainty that they will positively affect your bill.
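To make one of these a little more concrete, here is a rough Python sketch of a right-sizing check. Everything in it is made up for illustration: the workload names, the `requested_cpu` and `peak_cpu` fields, and the 50% threshold are assumptions, and real figures would come from your own metrics.

```python
def rightsizing_candidates(workloads, max_utilisation=0.5):
    """Return the names of workloads whose peak usage is a small
    fraction of what they request (hypothetical fields and threshold)."""
    candidates = []
    for workload in workloads:
        utilisation = workload["peak_cpu"] / workload["requested_cpu"]
        if utilisation < max_utilisation:
            candidates.append(workload["name"])
    return candidates


# Illustrative data only: CPU cores requested vs. peak cores observed.
workloads = [
    {"name": "api", "requested_cpu": 4.0, "peak_cpu": 1.0},
    {"name": "worker", "requested_cpu": 2.0, "peak_cpu": 1.8},
]

print(rightsizing_candidates(workloads))
```

A workload that shows up here is only a candidate for a conversation, not an automatic downsize; the business requirement behind it still comes first.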

The key thing when doing any of this is to ask: what benefit is this bringing to the business? How will this help the company reach its goals? What purpose is this fulfilling?

Tagging

If there’s anything you take away from this blog, I hope it’s the importance of tagging. Not just tagging, but tagging based on a predefined list of tags. For example, having cost-center: dept-a on one resource and costcenter: dept-a on another will cause inconsistencies, resulting in incorrect monitors, reports, and cost categorisation, and essentially undermining any cost governance you have implemented. We need to ensure there is consistency in how we tag, and that the tags themselves are relevant.
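One way to catch this kind of drift early is to check tags against the predefined list before resources are created. The following is a minimal Python sketch, not a definitive implementation: the `ALLOWED_TAG_KEYS` set and the `validate_tags` helper are hypothetical names, and the keys are just an example of what a predefined list might contain.

```python
# Hypothetical allow-list of tag keys; swap in your own predefined list.
ALLOWED_TAG_KEYS = {"cost-center", "service", "environment", "owner", "repo", "tooling"}


def validate_tags(tags):
    """Return a list of human-readable problems found in a resource's tags."""
    problems = []
    # Keys that are not on the predefined list (e.g. a typo like "costcenter").
    for key in sorted(tags):
        if key not in ALLOWED_TAG_KEYS:
            problems.append(f"unknown tag key: {key}")
    # Keys from the predefined list that the resource is missing.
    for key in sorted(ALLOWED_TAG_KEYS - set(tags)):
        problems.append(f"missing tag key: {key}")
    return problems


print(validate_tags({"costcenter": "dept-a", "service": "fancy-service"}))
```

A check like this could sit in a CI pipeline or a pre-deployment hook, so that a misspelt key is rejected before it ever reaches the bill.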

The reason for such an emphasis on tagging is that it’s the basis of all Cost Governance: it’s how we are able to identify costs by their relevant grouping. Without tags, we are not able to attribute costs effectively to whichever groups we wish to assign them to. These groups could be things like the service owner, the service itself, the environment, or the cost center; essentially, whichever way you feel it would be beneficial to attribute cost. An example of this predefined list could be something like the following:

tags: 
  cost-center: "dept-a"
  service: "fancy-service"
  environment: "production"
  owner: "team-a"
  repo: "repo-name"
  tooling: "crossplane"

Using these tags we could then filter for all resources under the cost-center of dept-a within the production environment, and group the results by service. This would give us a better picture of the services owned by a cost center in production. We could go even deeper by then grouping by owner to understand the accountability and costs associated with owners within a cost center.
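As a rough illustration of that filter-then-group idea, here is a small Python sketch. The line items and the `cost_by` helper are made up for the example; in practice the data would come from your provider’s billing export rather than a hard-coded list.

```python
from collections import defaultdict

# Made-up cost line items carrying the tags from the example above.
line_items = [
    {"cost": 120.0, "tags": {"cost-center": "dept-a", "service": "fancy-service", "environment": "production", "owner": "team-a"}},
    {"cost": 80.0, "tags": {"cost-center": "dept-a", "service": "fancy-service", "environment": "production", "owner": "team-b"}},
    {"cost": 45.5, "tags": {"cost-center": "dept-a", "service": "other-service", "environment": "production", "owner": "team-a"}},
    {"cost": 30.0, "tags": {"cost-center": "dept-a", "service": "fancy-service", "environment": "staging", "owner": "team-a"}},
    {"cost": 200.0, "tags": {"cost-center": "dept-b", "service": "billing", "environment": "production", "owner": "team-c"}},
]


def cost_by(items, group_key, filters):
    """Sum cost per value of `group_key` for items matching every tag in `filters`."""
    totals = defaultdict(float)
    for item in items:
        tags = item["tags"]
        if all(tags.get(key) == value for key, value in filters.items()):
            # Resources missing the grouping tag are collected under "untagged".
            totals[tags.get(group_key, "untagged")] += item["cost"]
    return dict(totals)


dept_a_prod = {"cost-center": "dept-a", "environment": "production"}
print(cost_by(line_items, "service", dept_a_prod))
print(cost_by(line_items, "owner", dept_a_prod))
```

The "untagged" bucket is a deliberate choice in this sketch: it keeps unattributed spend visible instead of silently dropping it.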

Budgets

Once we have the means to attribute the cost to resources, we can then look to assign budgets to these resources. These budgets could be for teams, cost centers, services, etc, but more importantly, it’s about introducing some form of accountability. This isn’t to say the existing owners aren’t already accountable, but this gives us a mechanism to improve or define ownership and a way to raise awareness of costs where there wasn’t much before.

The problem you may come across when assigning budgets is more of a cultural one, e.g. do you have buy-in? Rolling up to someone, dropping a budget on their plate and walking off isn’t helping anyone. The key is to work with the different stakeholders and socialise the idea of budgets and why they are beneficial to everyone involved. This could be in the form of a presentation or guild, perhaps an RFC, or maybe even a 1 on 1 catch-up with the key stakeholder. The key is collaboration and including others on the journey.

Another thing to mention here is to make sure you “eat your own dog food”. If you are expecting others to set budgets and work within them, then you should be doing exactly the same; otherwise, why should they?

I’ve spoken about assigning these budgets, but how do we get to this point? How do we decide on this budget, and how long should this budget be for? While it’s a cop-out, the answer is: it depends. The trick is to understand what you are trying to achieve and tailor your budgets accordingly. In order to calculate your budgets you will need some historical data (for example the past 3-6 months, but ideally more) to give the current trajectory of the spend. Once we have this trajectory we can ask questions like: do you expect growth in specific usage, e.g. data size? Number of databases? Team size? Are you aiming for a reduction, taking into account any proposed changes?

A good resource in this instance would be your account manager at your cloud provider, who can help go through your bills and calculate the expected growth. Another is your finance department, who can help you understand their expectations and requirements. A budget that is too conservative may well set you up for failure, while one that is too liberal doesn’t promote curiosity about spend, reducing the chance that potential efficiencies are investigated and implemented.
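For the trajectory itself, a spreadsheet trend line is usually enough, but the same idea can be sketched in a few lines of Python. This is only one possible approach (a straight-line least-squares fit), and the `project_spend` helper and the monthly figures are hypothetical. Any expected growth or planned reduction would still need to be layered on top by hand.

```python
def project_spend(monthly_spend, months_ahead):
    """Fit a straight line to past monthly spend and extend it forward.

    `monthly_spend` is ordered oldest to newest; the result is the
    projected spend for each of the next `months_ahead` months.
    """
    n = len(monthly_spend)
    if n < 2:
        raise ValueError("need at least two months of history")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(monthly_spend) / n
    # Ordinary least-squares slope and intercept.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, monthly_spend)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return [intercept + slope * (n + i) for i in range(months_ahead)]


# Six months of made-up history, growing by 100 each month.
history = [1000, 1100, 1200, 1300, 1400, 1500]
print(project_spend(history, 3))
```

A straight line is a blunt instrument: it ignores seasonality and one-off spikes, which is exactly where the conversations with your account manager and finance department earn their keep.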

Monitoring

A good way to look at this is to treat your cost governance and budgets like you would a service. Is it performing as you’re expecting? Have you exceeded any expected thresholds? What impact is the current performance having on our business? As mentioned at the beginning of the article, cost governance intersects quite nicely with engineering methodologies, and leaning on that experience makes us more effective at it.

Monitoring could come in numerous forms. Perhaps you want to notify a team when their cost center is exceeding a threshold, or perhaps you want to know when reservation coverage drops below X%. Leading on from this, is this alert actionable? Do we have playbooks for when it does fire? The key thing here is to utilise that engineering mindset and apply the same experience when building out the monitoring aspect of cost governance.

Having engineers across this monitoring aspect feeds into the “you build it, you run it” ethos championed by the DevOps methodology. This helps build that accountability and allows the service owners to have greater ownership and autonomy.
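The two example alerts mentioned above (a cost center exceeding a threshold, and reservation coverage dropping below X%) can be sketched as plain threshold checks. This is an illustrative Python sketch only: `BudgetStatus`, `budget_alerts`, `coverage_below` and the default thresholds are all hypothetical, and a real setup would more likely lean on your provider’s own budgeting and alerting features.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    name: str      # e.g. a cost center or team
    budget: float  # budget for the period
    actual: float  # spend so far in the period


def budget_alerts(statuses, warn_at=0.8):
    """Return alert messages for budgets that are near or over their limit."""
    alerts = []
    for status in statuses:
        if status.budget <= 0:
            raise ValueError(f"{status.name}: budget must be positive")
        used = status.actual / status.budget
        if used >= 1.0:
            alerts.append(f"{status.name}: over budget ({used:.0%} used)")
        elif used >= warn_at:
            alerts.append(f"{status.name}: approaching budget ({used:.0%} used)")
    return alerts


def coverage_below(covered_hours, total_hours, minimum=0.7):
    """True when reservation coverage has dropped below the chosen minimum."""
    return covered_hours / total_hours < minimum


statuses = [
    BudgetStatus("dept-a", budget=1000, actual=850),
    BudgetStatus("dept-b", budget=1000, actual=1200),
    BudgetStatus("dept-c", budget=1000, actual=500),
]
for message in budget_alerts(statuses):
    print(message)
```

Whatever produces the alert, the questions from above still apply: is it actionable, and is there a playbook for when it fires?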

Investigative Tooling

Fantastic, so we have a way to divvy up the resource costs and attribute them somewhere, but how on earth can our budget owners dig into their own costs? This is the trickier part of cost governance, because without knowing the whys behind what’s driving the cost, how can we look to positively impact it? Sadly there isn’t a one-size-fits-all answer, but this is something that needs to be solved before pushing for the service owners to have more accountability around cost governance. Without this, all that we’re doing is pushing a problem onto someone else, which will only inhibit any collaboration and manifest in some particularly unhappy colleagues.

Leading on from this, we also need to ensure the investigative tooling is fit for purpose for the end user: these tools need to be accessible and intuitive. The project owners have so many spinning plates and focuses that we need to ensure their time spent working on cost governance and efficiency measures is effective. Engineers already have so much on their plates these days as the role of a software engineer has evolved, and layering on more expectations around cost governance, if done poorly, is an unwanted distraction and increased cognitive load.

Having gone through this journey before, I’ve found that a positive side effect of implementing said investigative tooling is that you may come across some unexpected findings. These findings can come in the form of wasted resources, overzealous tooling, “leakages”, or maybe even just odd and unexpected behaviours of tools/resources.

Incremental Changes and Measure Often

This may sound like an obvious statement to make, but I feel it’s something worth mentioning regardless.

When cost efficiencies have been identified, we need to first ensure we have a baseline; this baseline will allow us to understand the impact of our changes. Do the changes we make line up with our expectations? Can we correlate changes in the bill with any cost efficiency measures we rolled out?

The key is to make these changes incremental, and not large sweeping changes. I’ve certainly been guilty of this: I’ve seen all these improvements we could make, got excited and tried to throw them all out at once. Whilst it may have a positive outcome, we can’t measure its effectiveness. By bundling everything together it’s difficult to distinguish what each individual change has actually done, thereby making it tricky to know exactly which “levers” to pull next, leading to finger-in-the-air decisions being made.
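As a small illustration of measuring a single change against a baseline, here is a Python sketch that compares average daily cost before and after a rollout. The `change_impact` helper and the daily figures are hypothetical, and it assumes only one change landed in the window, which is the whole point of keeping changes incremental.

```python
from statistics import mean


def change_impact(daily_costs, change_day):
    """Compare average daily cost before and after a change.

    `daily_costs` is ordered by day; `change_day` is the index of the
    first day on which the change was live.
    """
    baseline = mean(daily_costs[:change_day])
    after = mean(daily_costs[change_day:])
    return {
        "baseline": baseline,
        "after": after,
        "delta": after - baseline,
        "delta_pct": (after - baseline) / baseline,
    }


# Made-up daily costs: four days before a change, four days after.
daily_costs = [100, 100, 100, 100, 80, 80, 80, 80]
print(change_impact(daily_costs, change_day=4))
```

If two changes had shipped on the same day, the delta would still show up, but there would be no way to say which lever produced it.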

That’s great and all, but how?

I’m going to caveat this article by saying that most of my experience is with AWS, so I can only give examples of how to do this with AWS. The reason for keeping the sections above quite non-descript and without implementation details is that they are essentially the building blocks. These building blocks are agnostic to the environment or provider, and together they create a healthy foundation for cost governance.

Having said all that I feel like I’d be selling you snake oil and dreams if I didn’t at least mention some of the tools I have had success with in implementing my own brand of cost governance.

Tagging

Cost Allocation Tags - Makes the tags set on resources available within the Cost & Usage Report (CUR) and Cost Explorer
Cost Categories - Grouping costs by organisational structure

Budgets

Budgets - Set budgets and actions based on cost and usage
Google Sheets (or any spreadsheet software) - Take the historical data and set budgets using formulas based on the expected trajectory

Monitoring

Budgets - Able to set alerts using budgets
Cost Anomaly Detection - Continuously monitor cost and usage using machine learning to find anomalies
Slack - Send budget alerts to the relevant channels

Investigative Tooling

Athena - Interactive query service for querying data in S3 buckets such as logs and CUR
S3 Storage Lens - Analyses S3 buckets to better understand object usage and activity
Cost Explorer - View and analyse costs and usage across the organisation
Cost & Usage Report - A comprehensive set of cost and usage data stored in S3

Special Mentions

Kubecost - Real-time cost visibility with reporting and monitoring capabilities for Kubernetes

Wrap Up

My hope for this article is to help others in their pursuit of implementing or building on their cloud cost governance. It’s not a widely spoken about topic (from my personal experience), but is something that every organisation grapples with.

The biggest learning for me is that the tools and methodologies I have in my engineering toolbox feed really nicely into cost governance. I didn’t have to learn something completely new or overhaul my approach, but rather use my current skills to enhance it, whether that be implementing observability for costs or promoting and socialising greater ownership of cost governance. This subject isn’t as snooze-inducing as I initially thought, and I actually enjoyed it because of this intersection of business and tech. Don’t get me wrong, I do enjoy being a hermit and focusing on purely technical subjects, but it certainly opened my eyes.

Another eye-opener for me (which sounds silly on reflection) is how influential and coupled your technical decisions are to the business’s success. I always knew my technical work would contribute to the business, but in a slightly abstract way, e.g. uptime of 99.999% means applications are serving customers. It was difficult for me to really marry the two worlds up. Asking “what impact will this change have on our customers, and how cost efficient is this?” really helps with the decision process. For example, even a simple change (see the cost efficiency examples) which frees up $150 per day (which appears trivial) equates to roughly $4,500 per month, and over the course of a year you’re looking at $54,750 effectively saved. That’s funds for a new or existing SaaS contract, perhaps part funding a new member of a team, or just another way to shout about the positive impact your team is having on the business.

As the recession looms and uncertainty in various markets festers, the topic of cost governance becomes a hell of a lot more relevant for businesses, and hopefully this helps other teams tackle these worries head on.