Hey there! This is Platform Weekly, your weekly climb on the platform engineering jungle gym. This week we're joined by guest writer Fawad Khaliq, CTO and Co-founder of Chkk.

Platform teams need a delightfully different approach, not one that sucks less

This week’s edition of Platform Weekly is an excerpt from Fawad’s latest article. Read the full article here.

Gartner estimates that by 2026, 80% of software engineering organizations will have established platform teams providing reusable services, components, and tools. While this trend sounds elegant and straightforward, the reality is anything but.

The challenges that platform teams experience can be broadly classified into the following buckets.

#1: The "shared responsibility model" pushes complexity to platform teams.

Cloud providers handle only part of the platform stack, leaving platform teams to manage all the layers on top. Consequently, these teams must deliver better-than-yesterday features and scale to application teams while making sure things never break.

#2: Change is a constant… and all changes are availability risks. 

Platform teams face multiple change drivers: security fixes, application team requests for features, cluster upgrades, and frequent add-on updates. This entire inflow of change must be ingested, prioritized, and executed by the platform team. Because any change can cause disruption, each one has to be handled cautiously, and implementing changes takes forever.

#3: Teams can’t automate and hire fast enough to keep up with platform growth and support mission-critical applications.

Scarce platform talent, especially Cloud and DevOps Engineers, makes hiring challenging. Scaling headcount isn't sustainable. On-the-job training consumes skilled engineers' time, entangling them in repeated tasks and firefighting, leaving no room for innovation.

#4: Reactive incident response is necessary but insufficient.

Reactive response means experiencing incidents first-hand and manually researching solutions, which consumes significant engineering resources without preventing future errors. As a result, firefights are a way of life and automation always takes a back seat.

There has to be a better way…

It seems impossible for a single company’s platform team to solve these chronic challenges, but we believe it’s possible if we enable platform teams to “collectively learn” from each other. That calls for a technological solution to ensure that:

  • Platform engineers can learn from the unstructured information available on the internet without having to read walls of text, or track versions through CLIs and APIs, just to update a single component in the infrastructure.
  • Silos are broken across different platform teams so that learnings are programmatically shared, similar to CVE's role in security, creating a “CVE for Availability”. 

Solving these challenges requires a “trusted broker” that can collect information from all the sources, validate it, curate it as programmatic signatures, and publish it broadly for everyone’s benefit.
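
To make "programmatic signatures" a little more concrete, here is a minimal Python sketch of how one curated learning might be encoded and checked against a team's component inventory. Everything in it is an assumption made for illustration: the `AvailabilitySignature` fields, the `AVL-` identifier scheme, and the `example-dns` and `example-ingress` components are hypothetical and do not describe any real product, format, or incident.

```python
from dataclasses import dataclass


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version such as 'v1.9.2' into a comparable tuple."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


@dataclass(frozen=True)
class AvailabilitySignature:
    """Hypothetical record describing one known availability risk."""
    signature_id: str   # made-up identifier scheme, for illustration only
    component: str      # name of the affected component
    introduced_in: str  # first affected version (inclusive)
    fixed_in: str       # first unaffected version (exclusive)
    summary: str        # short human-readable description

    def matches(self, component: str, version: str) -> bool:
        """True if the given component version falls in the affected range."""
        if component != self.component:
            return False
        current = parse_version(version)
        return (parse_version(self.introduced_in) <= current
                < parse_version(self.fixed_in))


def scan(inventory: dict[str, str],
         signatures: list[AvailabilitySignature]) -> list[AvailabilitySignature]:
    """Return every signature that applies to the component inventory."""
    return [
        sig for sig in signatures
        if sig.component in inventory
        and sig.matches(sig.component, inventory[sig.component])
    ]


# Fictional signatures, standing in for what a broker might publish.
SIGNATURES = [
    AvailabilitySignature("AVL-0001", "example-dns", "1.9.0", "1.10.0",
                          "Fictional: lookups fail after a config reload"),
    AvailabilitySignature("AVL-0002", "example-ingress", "2.5.0", "2.6.1",
                          "Fictional: pods restart during an upgrade"),
]

# Fictional inventory: component name -> running version.
INVENTORY = {"example-dns": "1.9.2", "example-ingress": "2.4.0"}

for hit in scan(INVENTORY, SIGNATURES):
    print(f"{hit.signature_id}: {hit.component} is affected ({hit.summary})")
```

The point of the sketch is the shape of the idea, not the details: once a learning is captured as structured data rather than prose, any team can check for it automatically before a change reaches production.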

Read the full article here.

Quick bites

Articles that blew me away:

From the community: