Lior Rabin is the DevEx team lead at monday.com. Lior’s team developed an internal developer portal named “Sphera” with the goal of promoting developer velocity through an improved developer experience and self-service. The following is what he told us in an interview about Sphera.
Why create an internal developer platform?
My team, the Developer Experience team, is part of a larger Infrastructure group. We’re there to make sure that developer needs are met, so that developers can be more productive.
At monday, developers have end-to-end ownership of the services they code, and we want to enable that as much as possible. On the other hand, as infra, we can’t let them provision or modify services or resources directly due to the sensitivity of our architecture. At the time, the infra team had become a bottleneck. We had a queue of requests that kept growing and needed someone on-call to respond to requests, such as creating an SQS queue in AWS, or to complaints about approvals and reviews in the microservice creation process. Our first motivation was to stop being a bottleneck so that developers could move faster; we also wanted to automate some of the DevOps work we do.
How did you realize you needed a developer portal?
Our first step was to provide our developers with a CLI that we had built a while earlier. We broadened it so developers could generate their own Terraform files and open a pull request for us to approve and apply, since they did not have the permissions to execute ‘terraform apply’ on their own.
Additional issues came up pretty quickly because the flows were too complex: a developer created a microservice and did not remember to hook it up to CI/CD, the configuration was in a different repository, or something needed for staging was missing. I have to admit that even as infra people we don’t remember everything that needs to be done when a new microservice is created. It’s very difficult to keep this knowledge in a checklist and enforce it systematically while giving all the responsibility to the developer, and our list became longer and longer. So developers still needed our help, and it still took them a week or more to set up a new microservice properly.
At this point we realized that adding automations wouldn’t do the trick, and that we needed a developer portal with a rich user interface that puts our developers on a golden path to success and self-service.
Build vs Buy: what choices did you consider for the internal developer portal?
A year ago, we looked at the various products on the market. We did not want a product, like backstage.io, which relies on YAMLs, since we wanted to manage things with code that we can customize as much as possible to our needs. This led to the decision to develop this ourselves.
Our proof of concept was the microservice creation flow: letting a developer create a microservice and have it run in minutes on their dev environment (time to hello-world). We decided we must have a UI, that using the portal shouldn’t feel like a terminal, but more of a natural flow.
The proof of concept worked well. I must admit the initial UI was ugly, since we’re not UI experts and we did not have a DevEx team at the time. We got great feedback and decided to invest more in the project. Today 70-80% of what my team does is work on the developer portal.
What happened after the proof of concept?
The next stage was the MVP: a table listing the microservices developed at monday.com, with data pulled from GitHub, showing the versions of each service’s npm packages and perhaps indicating whether a newer version is available. This was read-only, of course.
We’re improving Sphera daily. I actually have a box with 150 t-shirts that have “sphera builder” on them. When a developer implements a task from our “quick wins” pool, they get one.
Today developers can view all secrets and create secrets, open pull requests for staging infra, and lately we added the management of feature flags directly from Sphera. At monday.com all our new features are flagged - opened for a certain percentage of our users or accounts. Previously this was managed through a CLI, which is accessed within a production pod (this isn’t optimal from a security standpoint). With feature flags in Sphera, you can see all flags (which wasn’t possible before) and you can close them and update them. Feature flags made Sphera adoption jump because they are used almost daily. We did consider using a 3rd party solution for feature flags, but it made more sense to integrate it into Sphera at this time.
Today, a lot of R&D discussions about where to manage/display service-related features (and other management features) involve Sphera as a proposed solution - that’s a win. On the DevOps side, we also know that anything new needs to be accessed/managed from the portal. Today Sphera has its own microservices, a Sphera agent and a Sphera Kubernetes executor.
What is the adoption rate of your internal developer portal?
Our KPI is the percentage of developers that entered the developer portal at least once in the past week. Today, it stands at around 40%. The more actions we enable through Sphera, the more people use it, especially since we do not allow these actions to be performed outside the portal.
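As a rough illustration of how a KPI like this could be computed, here is a minimal sketch. The function name, the developer names and the dates are all hypothetical sample data, and the way Sphera actually tracks visits is not described in the interview.

```python
from datetime import datetime, timedelta


def weekly_adoption_rate(last_visits, all_developers, now):
    """Share of developers who opened the portal at least once in the past 7 days.

    last_visits maps a developer to the time of their most recent portal visit.
    Developers with no recorded visit simply do not appear in the mapping.
    """
    if not all_developers:
        return 0.0
    cutoff = now - timedelta(days=7)
    active = {
        dev
        for dev, seen in last_visits.items()
        if dev in all_developers and seen >= cutoff
    }
    return len(active) / len(all_developers)


# Hypothetical sample data: five developers, two of them active this week.
now = datetime(2023, 1, 15)
developers = {"dana", "eli", "noa", "omer", "tal"}
last_visits = {
    "dana": datetime(2023, 1, 14),
    "eli": datetime(2023, 1, 10),
    "noa": datetime(2023, 1, 2),  # more than a week ago, so not counted
}

rate = weekly_adoption_rate(last_visits, developers, now)
print(f"{rate:.0%}")  # prints 40%
```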
How does GitOps play with this UI-based approach?
There are some things we won’t do with GitOps. For instance, we won’t create secrets in code. We don’t want to create feature flags in code. I also don’t want developers to have to be Terraform experts.
Where do you want to go next?
With every change we make we get a lot of feature requests. Today one of our main issues is that we don’t have enough people to respond to all of them.
Our vision - and we haven’t gotten there yet - is that the developer will come to work, open the portal and never leave it; it will be the only tab they need to have open. It will provide them with monitoring, infrastructure setup, logging, rollback, resource creation and much more. Today there are so many tools that people get lost. We also want to use this as DevOps, for visibility purposes, drift detection, etc. We have multiple clusters across different regions. As DevOps we can use the portal to get a single view of all the “infra assets” we manage and maintain.
A feature we’re thinking of for the future is to score a service’s readiness level. If a service is not up to date with the packages that we have defined as core, if it is missing mandatory configurations on GitHub, if the service doesn’t exist in all our geo regions - we can calculate a score. A score will let us warn the service owners that their service is below par. If a service drops below a certain threshold, we may want to block deployments to production or something similar.
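One way such a score could work is sketched below. This is only an illustration under assumptions: the three checks mirror the examples mentioned above, but the equal weighting, the region names, the threshold value and all identifiers are hypothetical, since the feature is described as a future idea rather than an existing design.

```python
from dataclasses import dataclass, field

# Hypothetical values, chosen only for the example.
REQUIRED_REGIONS = frozenset({"region-a", "region-b"})
DEPLOY_THRESHOLD = 0.6


@dataclass
class Service:
    name: str
    core_packages_up_to_date: bool
    has_required_github_config: bool
    deployed_regions: set = field(default_factory=set)


def readiness_score(service, required_regions=REQUIRED_REGIONS):
    """Fraction of readiness checks that pass, from 0.0 to 1.0."""
    checks = [
        service.core_packages_up_to_date,
        service.has_required_github_config,
        required_regions <= service.deployed_regions,  # present in every region
    ]
    return sum(checks) / len(checks)


def may_deploy_to_production(service, threshold=DEPLOY_THRESHOLD):
    """Gate deployments on the score, as the interview suggests might be done."""
    return readiness_score(service) >= threshold


healthy = Service("orders", True, False, {"region-a", "region-b"})
lagging = Service("billing", False, False, {"region-a", "region-b"})

print(may_deploy_to_production(healthy))  # two of three checks pass
print(may_deploy_to_production(lagging))  # one of three checks passes
```

A weighted score, where a missing core package counts for more than a missing region, would be a natural refinement once the team decides which checks matter most.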
How did you release this to developers? You changed their workflow…
Developers don’t like it when you change their workflow and especially when you take their permissions away but if you wrap it with something usable and explain what you’re doing, they will accept it. No one likes to have their permissions taken away - but we have security and other needs such as conventions, how to place secrets etc. With Sphera, we can enforce these things while providing our developers with an exceptional self-service experience.
Has there been a reduction in the amount of tickets?
There was a reduction. Most microservice-related tickets (microservice doesn’t work, no permissions, etc.) are gone.
The process has changed. To get a microservice to production a developer needs to provide a design doc for the microservice - this ends up as one request for the infra group and prevents many back and forth tickets. This means one person can take the task and do it instead of different infra people doing this over the course of several weeks.
Is this also useful for DevOps?
We are moving towards a multi-region, multi-cluster world. With so many clusters it is very difficult to understand what is installed where, in which version, etc. In the next versions of Sphera we want to provide DevOps with visibility as we move into a multi-cluster environment.
Another thing we want to do is to show drift detection, both for developers and for DevOps engineers. For each thing we create - a secret, a cluster, SQS, SNS - we want Sphera to alert us when it is missing in one of the environments/regions. Today we have no alerts to show that there are differences between different installations on different infrastructures. Once we have a Terraform CI/CD process, we also want to know when the infra drifts from the code; that happens too. Someone may have manually changed an HPA and forgotten to update the GitOps repository. So the two main use cases for internal developer portals used by DevOps are visibility and alerts.
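The two kinds of drift described above can be sketched in a few lines. This is a minimal illustration, not Sphera’s implementation: the environment names, resource identifiers and HPA values are hypothetical, and a real portal would read them from the cloud provider, the clusters and the GitOps repository.

```python
def find_missing_resources(expected, inventory):
    """Return {resource: [environments where it is missing]}.

    inventory maps an environment or region name to the set of resources
    found there.
    """
    drift = {}
    for resource in sorted(expected):
        missing = sorted(
            env for env, present in inventory.items() if resource not in present
        )
        if missing:
            drift[resource] = missing
    return drift


def find_config_drift(declared, live):
    """Return {key: (declared value, live value)} for every key that differs."""
    keys = declared.keys() | live.keys()
    return {
        key: (declared.get(key), live.get(key))
        for key in sorted(keys)
        if declared.get(key) != live.get(key)
    }


# Hypothetical inventory: the SQS queue was never created in one region.
inventory = {
    "staging": {"secret/db-password", "sqs/orders"},
    "prod-region-a": {"secret/db-password", "sqs/orders"},
    "prod-region-b": {"secret/db-password"},
}
expected = set().union(*inventory.values())
missing = find_missing_resources(expected, inventory)
print(missing)  # {'sqs/orders': ['prod-region-b']}

# Hypothetical HPA settings: someone raised maxReplicas by hand on the cluster.
declared_hpa = {"minReplicas": 2, "maxReplicas": 10}
live_hpa = {"minReplicas": 2, "maxReplicas": 20}
changed = find_config_drift(declared_hpa, live_hpa)
print(changed)  # {'maxReplicas': (10, 20)}
```

Either result being non-empty is what would trigger an alert to the owning team.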