How do you get over 1000 devs to work together to consume resources and assets? Yonatan Boguslavski learned the hard way that building DevOps infrastructure isn’t an easy task. Nevertheless, his persistence led to the development of Port. So when we say we’ve been there and done it, you’ll know we've really done it.
Let’s set the scene, imagine a company with more than 1,000 developers. They work in siloed teams with various products, infrastructure, and technology stacks. Each DevOps team has its own technical specialization and creates internal scripts and control panels to cope - Kafka topics, Kubernetes and Hadoop Clusters, VMs, databases, etc.
Some teams manage their instances in Excel files; some SSH to many different servers - and some search their email history for old ticket requests. The sub-optimal resource utilization and general chaos keeps on growing every single day. Deleting resources becomes an impossible task, mainly because of fear. DevOps teams become scared of deleting an unused instance because this instance can be a part of the production environment in some way.
Does any of this sound familiar?
If it does, you’ll already know that changes need to happen. So here are a few lessons I learned as the Head of Platform Engineering while building a customizable DevPortal to get this giant mess back on track and provide a great developer experience.
Changes are hard but worth it.
The developers had got used to the freedom to work in their own way. You'll meet resistance if you introduce a new way of working and suddenly ‘force’ them to implement it, even if it’s just to add a short and simple YAML file to their Git Repo.
Convincing someone that a project will only have worth in the long term is nearly impossible. So we had to find ways to give the developers immediate value. And it’s not just the developers; the decision stakeholders that greenlit the DevPortal project want to see quick value and direct impact. With this group, prepare to manage expectations - adoption doesn’t happen in a day.
Providing that immediate value meant giving control to the people using the DevPortal; rather than dictating the new system, we allowed them to build it to their requirements. This involvement and the opportunity for self-service helped engagement and interest in the new platform.
Ticketing systems are not enough; self-service for the win
We had many templates for our tickets, and we were constantly improving them. Yet, no matter how hard we tried, there were always special requests or vital missing information. The biggest issue with the ticketing system was the level of manual work our DevOps teams had to do.
Defining any operation with its actual execution made it very transparent between the developer and the infrastructure they wanted to consume. Day-two operations suddenly become much more streamlined. There’s no need to define ‘sub-sub-sub-tasks’ for each ticket type. Developers could re-configure their instance configuration in a self-serve fashion. It was a beautiful thing to watch.
The challenge was dynamic and varied with parameters so each team could provide the required data. We needed to avoid confusion but maintain the developers’ freedom.
Balancing DevOps <> Developer priorities is a fine line
As a person immersed in DevOps, I need to know that infrastructure is deployed ideally, well-monitored, and not hitting peaks. Don’t even mention things like “out of RAM.”
As more and more developers started using the platform we created, we noticed something. They didn’t care if resources were well limited and monitored; in fact, they hated meeting quotas and limitations. Instead, they want a Kafka Consumer that just works. They want a Kubernetes deployment that just scales and always returns that “200 OK” response.
For our team, being open to changes and criticism was vital. Personally, hearing the developer's experiences only ever made me want to improve the platform - even if teams didn’t immediately appreciate our hard work. I learned that a DevPortal will always be a work in progress. And that’s ok. Balancing these priorities will primarily be derived from your organizational culture and structure.
Fit your solution to the organizational structure
This was the lesson that took the longest to learn. But, at the end of the day, it’s all about the people who run your organization and how they manage and communicate their infrastructure allocation. We gave developers out-of-the-box templates called "golden paths," but we also gave more technical teams the ability to consume infrastructure in dynamic ways. We wouldn’t restrict teams to work in one specific way.
Some teams were more “technological” than others. Some teams were more “production critical” than others, which could change depending on the business’s strategic priorities. These differences led us to develop features like quotas per team and product and several levels of visibility into the infrastructure and audit log.
Each DevOps team had a different level of confidence in their “automation readiness,” so we added another manual approval step that occurred when specific parameters were met. Win-win for everyone!
And, of course, the list doesn’t end there. The key takeaway here is to be curious about the differing needs of your various users.
Once the system stabilized, the possibilities were limitless!
I’ve thought long and hard about writing such a bold statement, but the long-term change was remarkable! Being well organized just made everything so much smoother and, well, more organized. The benefits were not only for the developers but also for the DevOps teams.
In addition, we took who performed the operation into consideration. For example: When a developer tried to create an S3 bucket, we provided him with a different set of options to create a bucket, but when the analyst tried to create a bucket, he had only one way to do it because we understood the other options would not be relevant for him in any case.
After going through the inevitable growing pains with adoption and platform positivity, we discussed with DevOps, team leaders, developers, and stakeholders. Then, taking advantage of the fact that everything infrastructure is centralized, we came up with awesome ideas for improvements.
Things like cost optimization, better resource usage, secure and predictable deployment, and code reuse on the self-serve side (e.g., authentication, form creation, task execution) all naturally improved from being more organized.
It was a beautiful thing to witness.
DevOps teams could focus on improving the infrastructure and deepening their expertise. Developers could focus on development and allocate their infrastructure without waiting times without being infrastructure experts. In addition, stakeholders got automated daily, weekly, and monthly reports about resource usage.
And we, as a team, maintained and improved the platform so everyone could enjoy a better development experience and be happier. Which is the ultimate goal, right?
All these lessons (and plenty more) led me to build Port. Working together with a fantastic team, we walk those fine lines daily. We’ve created a customizable platform to fit any organizational structure and workflow. Integration can happen gradually using your existing infrastructure tools, scripts, and processes.