Learning from CyberArk: building an Internal Developer Platform for Self-Service and increased velocity
CyberArk's team had a vision for a better developer experience, and ended up building an internal developer platform in-house. How much faster is deployment to production and how did they accomplish this and create consensus around the need? What are they planning for the future? Read on.
The following is based on an interview with CyberArk’s Guy Brodny, R&D DevOps Group Manager, and Shlomi Benita, Principal DevOps Architect. The DevOps group at CyberArk provides services to the company’s entire engineering organization. At the time, teams were growing at a fast pace, products were moving from on-premises to the cloud, and the complexity of the tooling was increasing. To improve development velocity, quality and security, Guy and Shlomi’s teams embarked on an ambitious Internal Developer Platform project. Doing this radically improved the developer experience at CyberArk, although Guy and Shlomi weren’t aware of the term “developer experience” at the time.
Here’s their story.
Why go build an Internal Developer Platform?
Guy: When I joined CyberArk, we looked at the processes and tools used to develop, release, and deploy code to our SaaS environment. We asked ourselves which structural things were blocking development. We didn’t even call this developer experience. We just wanted to help the company ship code faster and more reliably to customers; our underlying thought was that we’re here for developer enablement.
CyberArk’s product wasn’t originally a cloud product but rather an on-premises product that migrated to the cloud. As a result, testing, building and releasing our code was a complex and tedious effort that required multiple development groups. You can imagine what that looked like. CI and CD were disconnected. People would be literally throwing code over the wall in the most extreme fashion. They did not use the same source control, they did not use the same tools.
Our first realization was that we needed to create a common language: we had three different R&D languages in three different groups. We started on the production side, but with the thought that “production begins with development”. We asked what would make it easy for developers to take ownership of deployment to production, in an efficient, non-overwhelming way. We wanted to take everything into consideration: cognitive load, what developers should be aware of with regard to how their code arrives in production, and what they shouldn’t have to care about. Not all our developers were familiar with a cloud environment, since their product had only recently migrated to the cloud.
An engine that manages cloud infrastructure via API
A UI and an API for interacting with the infrastructure
Some developers used the UI, some used the API, and some built SDKs on top of the API as an extension
Shlomi: Deployment velocity is a big thing - and graphing it can show people some uncomfortable truths. One of the greatest selling points we used to get the formal management OK to work on an internal developer platform was plotting lead time to change on a graph over time. We showed what lead time to change would be without the IDP and what it would be with an IDP. Over time, we presented our estimation that lead time to change would grow worse without an IDP. This convinced management.
What happened when the Internal Developer Platform was launched?
Guy: We started on the CD side. The minute that people didn’t have to go through three gates to get to production, the minute that our “product” was delivered to developers just as if they were a customer of CyberArk’s, tremendous change happened. Developers could experience the product just as the customers do and also received the option to be responsible for their part of the product. The infrastructure ran on a central engine, but each product team could change the template they ran on. Today, if a developer wants to get code into production or check whether code works or not, they have a development environment that is identical to the production environment.
We managed to make a significant cultural change. Today, our internal developer platform is even part of our production environment. Many people raise an eyebrow when they hear that, but we wanted one common unified language for all. The service reliability engineer and the developer should have the same interface. We built an engine for infra management for production but made it easy enough for anyone to use. It was very difficult to get this going, since we were working with thousands of developer customers and were replacing the engine under their feet in the middle of work. Eventually, we accomplished shift-left by creating the following:
A common language
Reduced cognitive load for developers
The real result, which we had promised to management and which is why we started this in the first place, was a significant reduction in time to production.
Can you talk about the effort behind this?
Guy: The development around this project took a year. To begin work, we needed to provide management with a business case. Our main argument was that deploying a change to the production environment could take up to six months. We knew that the internal developer platform would reduce this. Today the same change is ready for our cloud environment at the moment the pull request is merged. It still takes around 4 weeks to push the change to all production customers, but this is much better than what it used to be. We aim to do this even faster in the future.
What part of this success is the developer platform and what part is change management?
Guy: Some of the success is a result of educating our developers and shifting left, but I don’t think this could have happened without the internal developer platform we invested in. To succeed you need an IDP. We’re a large company. Not all developers “know” cloud. If we tell them “here’s the cloud, with all its configurations, work with the cloud’s SDK, use these APIs” they would not know what to do. So it’s not just education, it’s also about having a platform and the guardrails built into it.
Shlomi: The trick was to engage developers and provide the cloud production environment in a way that is accessible to developers. Knowing who owns an environment is also a big piece of that. I provide them with a repository and I say “this is your code. If you change something, you can see what happens”. They can now check their changes - we all know that when we change something we probably begin by breaking it. They get access to the actual customer experience with the changes that they made.
The importance of developer self-service - can a REST API do the trick?
Shlomi: Self-service is a big part of what makes this succeed. I’m truly proud we had this vision three years ago, when developer self-service wasn’t that common. Self-service is crucial since this is what makes developers adopt this approach. You can set up your own environment, deploy, and have fewer DevOps tools to think about, which reduces your cognitive load. You write your deployment code and you’re all set. Once you’re done, you’ll also get feedback on what went wrong. Most importantly, you are independent. You don’t call me and say “hey, can you deploy something for me? Set up an environment? Is it working?”
Initially I thought that giving developers a REST API would do the trick. It simply doesn’t work. I’m not saying an API isn’t possible, but for developers it’s still pretty demanding. They don’t necessarily want to write an SDK (although some would), even though working with a REST API should be easy for them.
Guy: Self-service is all about reduced cognitive load. “You build it you own it” is a nice slogan. But you can’t own things you don’t know. So we help you overcome what you don’t know, even if it isn’t in your domain. Developers want to write code, they don’t want to configure security or manage infra for containers. They also want the ability to reuse lessons we already learned, and we help to reuse common verified grounds.
Shlomi: Self-service not only reduces cognitive load and grows productivity, it’s also very important from the security standpoint. We don’t like telling people that they can’t do something because it may be risky. Self-service can help here - the riskier actions and tools can be accessed through self-service with the proper guardrails. Think of actions as divided between high privilege and minimum privilege, and of how that division shapes what you expose through self-service.
Here’s an example: someone wants to create agents on Jenkins. To do this they need to be a Jenkins admin. Will we give admin privileges to each developer? No. Instead, we create a self-service function that allows you to create agents.
The Jenkins UI is limited in terms of self-service. You have a GitHub repository, Artifactory, a Jenkins pipeline, Datadog integration, Jenkins agents, secrets - it isn’t possible to set this up for a developer in a self-explanatory way. We ended up putting HTML all over the descriptions, with links to documentation and so on. To me, a good developer portal of the future will be self-explanatory.
Guy: Developers don’t care how I created the permissions module in Jenkins and in Git. When we embarked on our internal developer platform, developers used to ask us why they weren’t admins anymore. We told them – we are here for you, tell us what abilities you need. And with self-service they can do anything, with the right guardrails and without being admins.
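The Jenkins agent example can be sketched in code. The following is a minimal illustration, not CyberArk's implementation: the class, the policy values and the in-memory list that stands in for the privileged call to the CI server are all hypothetical. It only shows the pattern described above, where the developer calls a narrow function and the guardrails decide whether the privileged action runs.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AgentRequest:
    team: str
    label: str
    executors: int


class GuardrailError(Exception):
    """Raised when a request falls outside what self-service allows."""


@dataclass
class AgentSelfService:
    # Hypothetical policy values, chosen only for illustration.
    allowed_labels: frozenset = frozenset({"linux", "windows"})
    max_executors: int = 4
    max_agents_per_team: int = 3
    # In-memory record standing in for the privileged call to the CI server.
    created: list = field(default_factory=list)

    def create_agent(self, request: AgentRequest) -> str:
        """Create an agent on the developer's behalf if the request passes every guardrail."""
        if request.label not in self.allowed_labels:
            raise GuardrailError(f"label {request.label!r} is not allowed")
        if not 1 <= request.executors <= self.max_executors:
            raise GuardrailError(f"executors must be between 1 and {self.max_executors}")
        team_agents = [a for a in self.created if a.team == request.team]
        if len(team_agents) >= self.max_agents_per_team:
            raise GuardrailError(f"team {request.team!r} has reached its agent quota")
        # Only this service would hold admin credentials; the caller never does.
        self.created.append(request)
        return f"{request.team}-{request.label}-{len(team_agents) + 1}"
```

A developer would call `AgentSelfService().create_agent(AgentRequest("payments", "linux", 2))` and get an agent name back, while a request for a disallowed label raises `GuardrailError` instead of requiring an admin to say no.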
Tell us about the CI side.
Shlomi: Our first platform was a CD platform. We also created an Internal Developer Platform for CI. This covers setting up a new pipeline, going through the multiple steps required for CI, and taking care of security. We’re a cyber security company, and we cannot afford to have anything that’s less than the topmost security standards. This creates an even heavier load on the developer. I need to tell them “hey, set up this repository, then check another 45 things”. Developers just won’t do this, and from the CD platform we knew that providing an API wouldn’t help.
All of this meant the CI platform needed a UI on top of it, so we decided to create one. What’s the best UI? Jenkins. We know what the problems are with Jenkins - you wrote about that in your blog "It's a trap- Jenkins as self-service UI".
Jenkins has limits. You add an Artifactory, and you add secrets, pipelines of different types, and suddenly the Jenkins forms aren’t good enough. You also can’t show who owns a service, dependencies and more. We’re still dealing with those problems today, and that’s one of the reasons we’re looking at developer portals as a product - buying and not building.
What are the next steps? When will you consider buying a product?
Guy: Let’s first take a look at what we built in house:
A CI platform
A CD platform
A self-hosted test environment for our non-cloud products: infra and configuration management with a full developer portal (created 8 years ago)
And an additional two dozen DevOps tools
Today all this is a product. It needs product management, a user journey, it even raises questions about whether this increases developer retention. We need to deal with scale and reliability. Maybe it means that the next step is to buy a portal.
Shlomi: As the ecosystem becomes more complex, the question of what our next step is becomes real. We have GitHub, Jenkins, JFrog Artifactory and Datadog as the base tools, a variety of build and security tools such as Black Duck or Snyk, as well as multiple cloud vendors and many more. You can go through our corridors with a shopping cart and load it with DevOps tools. Developers can’t deal with all these tools, no one can.
For example, we wanted to implement a security tool that’s supposed to identify when you accidentally included passwords inside your code. The conventional approach would be to have everyone add that tool as a step in their pipeline. But this isn’t that simple. Not only do I have to educate everyone about this, but I also need to take that developer outside their comfort zone. They’re willing to go to GitHub, maybe Jenkins, but now they should also check this password leakage tool? No way. In this case, we solved it through a process of events and hooks that feed into that tool and surface the scan result in GitHub. Each tool I add creates a higher cognitive load. We must offer the developer a one-stop shop with a unified approach. In this way, adding another tool won’t add more cognitive load.
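The idea of running the scan behind an event and reporting the result where developers already look can be sketched roughly as follows. The interview doesn't describe how CyberArk's hooks work, so everything here is an assumption: the simplified event payload, the function names and the single regular expression are hypothetical, and a real scanner uses far richer rules. The result is shaped like a GitHub commit status (`state`, `context`, `description`) so it could be shown next to the commit.

```python
import re

# Illustrative pattern only; it flags lines such as: password = "hunter2"
PASSWORD_PATTERN = re.compile(
    r"""(?i)(password|passwd|secret)\s*[:=]\s*["'][^"']+["']"""
)


def scan_for_secrets(files: dict[str, str]) -> list[tuple[str, int]]:
    """Return (path, line number) for every line that looks like a hard-coded password."""
    findings = []
    for path, content in files.items():
        for number, line in enumerate(content.splitlines(), start=1):
            if PASSWORD_PATTERN.search(line):
                findings.append((path, number))
    return findings


def build_commit_status(findings: list[tuple[str, int]]) -> dict:
    """Turn scan findings into a payload shaped like a commit status."""
    if findings:
        locations = ", ".join(f"{path}:{number}" for path, number in findings)
        return {
            "state": "failure",
            "context": "secret-scan",
            "description": f"Possible hard-coded password at {locations}",
        }
    return {
        "state": "success",
        "context": "secret-scan",
        "description": "No hard-coded passwords found",
    }


def handle_push_event(event: dict) -> dict:
    """Hypothetical hook entry point.

    `event` is a simplified stand-in for a push event:
    {"sha": "<commit id>", "files": {"<path>": "<file content>"}}
    """
    findings = scan_for_secrets(event["files"])
    return {"sha": event["sha"], "status": build_commit_status(findings)}
```

The point of the design is that the developer never opens the scanning tool: the hook runs on every push and the only thing they see is a pass or fail marker in the place they already work.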
What would you like to add to your Internal developer platform in the future?
Shlomi: Showing dependencies between teams, the architecture, a “map” is super important. We’d like to add that. The service catalog is “Waze” for the company: if there’s a problem with a specific service, who should I talk to? How are its deployments and builds? Each service needs its own scorecard. The scorecard should motivate people to act. For instance, in the context of DORA metrics, we don’t want to just tell people they deploy once every two weeks, we also want to tell them that the company average is once a week.
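A scorecard line of that kind could be computed along these lines. This is a minimal sketch under stated assumptions: the function names, the service name and the idea of passing deployment dates in as a list are hypothetical, and the company average is supplied by the caller rather than calculated.

```python
from datetime import date


def deploys_per_week(deploy_dates: list[date], period_days: int) -> float:
    """Average number of deployments per week over the observed period."""
    return len(deploy_dates) / (period_days / 7)


def scorecard_line(
    service: str,
    deploy_dates: list[date],
    period_days: int,
    company_average: float,
) -> str:
    """Report a service's deployment frequency next to the company average."""
    rate = deploys_per_week(deploy_dates, period_days)
    if rate > company_average:
        comparison = "above"
    elif rate < company_average:
        comparison = "below"
    else:
        comparison = "at"
    return (
        f"{service}: {rate:.1f} deploys/week, "
        f"{comparison} the company average of {company_average:.1f}"
    )
```

For a service that deployed twice in 28 days, with a company average of one deployment a week, `scorecard_line("billing", [date(2024, 1, 3), date(2024, 1, 17)], 28, 1.0)` reports 0.5 deploys per week and marks it as below the average, which is the kind of comparison meant to motivate a team to act.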
Guy: Developers want to get info about their product within the product context: security scores, catalogs, even DORA metrics. If they can’t consume this data within their service/product catalog, they won’t care about it. Collecting and visualizing this data is crucial to get your engineering organization to the next level. This is the essence of our vision - we haven't implemented that yet, but this is part of this journey we’re on.
We also know that developers miss out on a lot of tools they could use. Think of the CNCF maps of DevOps tools. Hundreds of tools - no one really knows them all, so they can’t use them.
What about service maturity? What’s your approach?
Shlomi: Exposing scores is important. In terms of service maturity we can tell a team they need to get to a certain level in order to be deployed.
Guy: We need gates that let us create an MVP as soon as possible. Worst case, I can do a rollback. All these maturity indicators in the portal can help you see what stage you're in and when you'll be ready to roll out. You can also see the relative statuses of products that depend on one another. Our main product is based on the work of 10 groups, so we need to think about how all this should be managed.
Some advice for the creation of an internal developer platform in other companies?
Guy: In many ways, the service catalog is your MVP as a DevOps team - a single pane of glass with a catalog of everything. We believe that a portal should give a developer a landing page that holds all the services in the context of their product or service, which is something you can’t achieve without a Developer Portal.
While the CyberArk team had the courage to go ahead with this ambitious project, to serve the entire R&D team, it took a year and a half and required a lot of change management. The problem is that when you do a huge project you can’t do other things. That’s why in the future we want to buy someone else’s product and not necessarily build. We looked at Backstage, but we believe it requires a lot of effort and resources from us, so we probably won’t go down that road. We’re still strategizing about the best way to do this.