How internal developer portals improve incident management

Overview of what’s missing from your incident management program

To combat incidents, an incident management framework is usually put in place. The first port of call is an incident management tool. These tools give organizations the capability to trigger, escalate and manage incidents effectively. They let organizations configure the on-call rotation, trigger alerts based on events sent from monitoring tools, notify the on-call engineer, and let the various responders communicate around the incident. Then, when the incident is resolved, the tool enables you to close the incident and issue a summary.

What about actually resolving the incident? While incident management tools can handle incident logistics, the onus is on the on-call engineer to resolve it. Instead of having on-call engineers access multiple tools to resolve the incident, a portal can provide a great start, with everything from code owners, to monitoring links, downstream and upstream dependencies and more. On-call engineers need support to deal with the incident beyond what an incident management tool alone can provide, and internal developer portals can do just that.

What on-call engineers need is:

A better overall experience - so they are able to find and access the information they need with ease, from ‘what are the upstream/downstream dependencies’ to underlying infrastructure health metrics, and understanding who the owner of a service is and if a service is being monitored correctly.

Context - with all of the information in one place, so they can understand what’s going on across the organization in real-time, without having to switch between different tools.

Autonomy - so that when they’re investigating the incident, they can perform quick easy-to-execute actions without needing to ask for help. For instance, actionable playbooks with pre-configured self-service actions would enable engineers to use day-2 operations as part of the incident remediation process.

How does the portal support incident management?

While there are many reasons to adopt an internal developer portal as part of a platform engineering initiative, a portal is especially valuable in the context of incident management, providing:

A better experience because:

It is the primary tool developers already use across the SDLC, with an easy-to-use interface for finding the information they need and performing the actions they need. This makes incident management more efficient, because engineers have an easier way to act, remediate and prevent further incidents.
SREs who build the framework to deal with incidents can build a big part of it in the portal. SREs can build scorecards to monitor production readiness of different services and applications, preventing incidents in the first place. SREs can also create self-service actions that on-call can use during the incident resolution process.

Context by using:

A software catalog that reflects the entire engineering ecosystem for the developer, from services through APIs, CI/CD and more, including their chosen incident management tool’s information and capabilities. The integrations and data feed into the portal’s software catalog providing a complete picture. By searching through the software catalog, users can find the information they need about a service, application or owner.

Autonomy through:

Self-service actions, so that engineers can use day-2 operations and don’t need to rely on DevOps or ticketops.
Dashboards tailored to different teams or individuals, showing the data they need.

Four-step strategy to manage incidents using a portal

So how do you actually use the portal to benefit your incident management program?

1. Have your foundations in place

How often have you discovered that critical services aren’t monitored? Or had to send several Slack messages just to find out who owns a specific service? This should never happen, and during an incident it can be critical. In this section, we will explain how you can ensure that you have all your foundations in place.

With a software catalog in place, it’s pretty easy to answer the following questions:

Who’s on call right now?
Who’s the owner of this service?
Is this service properly monitored? And where exactly?
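
The three questions above amount to a single lookup against a catalog record. Here is a minimal sketch of that idea, assuming a hypothetical in-memory catalog; the `Service` fields, the sample entries and the monitoring URL are illustrative and do not reflect any particular portal's data model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Service:
    # Hypothetical catalog record; field names are assumptions for illustration.
    name: str
    owner_team: str
    on_call: Optional[str] = None
    monitors: list[str] = field(default_factory=list)


# Sample entries standing in for data that integrations would normally supply.
CATALOG = {
    "cart-service": Service(
        name="cart-service",
        owner_team="checkout",
        on_call="dana",
        monitors=["https://monitoring.example.com/cart"],
    ),
    "products-service": Service(name="products-service", owner_team="catalog"),
}


def foundations_report(name: str) -> dict:
    """Answer: who owns it, who is on call, and is it monitored (and where)?"""
    svc = CATALOG[name]
    return {
        "owner": svc.owner_team,
        "on_call": svc.on_call or "nobody assigned",
        "monitored": bool(svc.monitors),
        "monitor_links": list(svc.monitors),
    }
```

Calling `foundations_report("products-service")` on this sample data would flag a service with no monitors and nobody on call, which is exactly the gap you want to find before an incident rather than during one.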

For production readiness questions, SREs can use scorecards to define standards and track the compliance of services, so they can understand if something is missing and resolve this.

For instance, before a service goes to production, they may check that monitoring is active and that the service has API documentation, an established on-call rotation and runbooks. SRE teams can just glance at the scorecard to check whether a service is ready, and what is required to make it ready, rather than having to verify this manually. Likewise, they can use the portal to communicate initiatives and easily track them by user, team and more.
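
A production readiness scorecard can be thought of as a set of named rules evaluated against each service. The sketch below assumes a service is a plain dictionary; the rule names mirror the checks mentioned above, while the dictionary keys (`monitoring_active`, `api_docs_url`, and so on) are hypothetical.

```python
from typing import Callable

# A rule takes a service record and returns True when the standard is met.
Rule = Callable[[dict], bool]

# Hypothetical rules; the keys they read are assumptions for illustration.
PRODUCTION_READINESS: dict[str, Rule] = {
    "monitoring is active": lambda s: bool(s.get("monitoring_active")),
    "API documentation exists": lambda s: bool(s.get("api_docs_url")),
    "on-call rotation is defined": lambda s: bool(s.get("on_call_rotation")),
    "runbook is linked": lambda s: bool(s.get("runbook_url")),
}


def evaluate(service: dict) -> dict:
    """Return whether the service is ready and which rules it still fails."""
    missing = [
        name for name, rule in PRODUCTION_READINESS.items() if not rule(service)
    ]
    return {"ready": not missing, "missing": missing}
```

The `missing` list is the useful part: it tells the owning team what is required to make the service ready, rather than just reporting a pass or fail.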

2. Investigate with context you wouldn’t otherwise have

Once an incident is detected - thanks to the integration with the incident management tool - the on-call engineer is notified. They’ll first try to implement a quick fix in order to mitigate (which can also be done using day-2 ops in the portal, more about this later) but then they need to investigate. Of course, they’ll start looking at their monitoring tools, and they’ll get useful information such as network latency, CPU usage etc. But they’re missing important context.

Perhaps a recent deployment caused the incident - this is not necessarily something you can find in the monitoring tool. Next, they’ll want to check the downstream dependencies of the faulty service, to ensure they’re not impacted as well. Yet again, this information isn’t always available in the incident management tool.

As the portal connects with monitoring, Git, CD and security, the on-call engineer will have all the information side-by-side making it quicker and clearer for them to understand what’s happening. Even the monitoring information can be displayed in the portal so that they don't have to jump between multiple tools.
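
To make the two investigation questions concrete, here is a minimal sketch that lists deployments shortly before the incident and walks the dependency graph to find services that could be affected. It assumes deployments are dictionaries with `service` and `at` keys and that `consumers` maps each service to the services that call it directly; both shapes, and the two-hour default window, are assumptions for illustration.

```python
from collections import deque
from datetime import datetime, timedelta


def recent_deployments(
    deployments: list[dict],
    service: str,
    incident_start: datetime,
    window: timedelta = timedelta(hours=2),
) -> list[dict]:
    """Deployments of `service` that landed within `window` before the incident."""
    return [
        d
        for d in deployments
        if d["service"] == service
        and incident_start - window <= d["at"] <= incident_start
    ]


def impacted_services(faulty: str, consumers: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk over everything that depends on `faulty`, directly or not."""
    seen: set[str] = set()
    queue = deque([faulty])
    while queue:
        current = queue.popleft()
        for consumer in consumers.get(current, []):
            if consumer != faulty and consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return sorted(seen)
```

Keeping both answers next to the monitoring data is the point: a deployment twenty minutes before the first alert is a strong lead, and the list of impacted services tells the engineer who else to warn.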

3. Automate actions and remediation

Once the on-call engineer understands what is happening, they need to take action. Sure, they have detailed playbooks explaining step-by-step what to do. These require the engineer to log in to instances, perhaps copy and paste some scripts or change the configuration. While these playbooks are invaluable, they can be enhanced using a portal.

As part of the preparation stage, SREs can create self-service actions directly in the portal that allow responders to execute their existing incident management playbooks for every scenario. The engineer can then start the remediation process using day-2 operations that have been pre-configured for them in the portal - for example:

Request permission to access a cluster
Roll back a service
Scale up a cloud resource
Toggle off a feature flag

Here are some examples from Port’s demo:

When engineers can act directly in the portal, they no longer need to switch between different tools or copy and paste scripts, which increases efficiency, improves the developer experience and reduces cognitive load. Because self-service actions are centralized and the complexity is abstracted away, they reduce the risk of errors and simplify the execution of the necessary remediation steps.
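
One way to picture pre-configured self-service actions is as a registry of named, parameterized functions that an on-call engineer can trigger without knowing the scripts behind them. The sketch below is a hypothetical illustration of that pattern, not how any specific portal implements it; the action names, parameters and returned messages are all assumptions, and the function bodies are placeholders for real automation.

```python
from typing import Callable

# Registry of available actions, keyed by the name shown to the engineer.
ACTIONS: dict[str, Callable[..., str]] = {}


def action(name: str):
    """Decorator that registers a function as a self-service action."""
    def register(fn: Callable[..., str]) -> Callable[..., str]:
        ACTIONS[name] = fn
        return fn
    return register


@action("rollback_service")
def rollback_service(service: str, version: str) -> str:
    # Placeholder: a real action would trigger the deployment pipeline here.
    return f"rolling back {service} to {version}"


@action("toggle_feature_flag")
def toggle_feature_flag(flag: str, enabled: bool) -> str:
    # Placeholder: a real action would call the feature flag service here.
    state = "on" if enabled else "off"
    return f"feature flag {flag} turned {state}"


def run_action(name: str, **inputs) -> str:
    """Look up an action by name and run it with the supplied inputs."""
    if name not in ACTIONS:
        raise KeyError(f"unknown action: {name}")
    return ACTIONS[name](**inputs)
```

For example, `run_action("rollback_service", service="cart", version="1.4.2")` runs the rollback without the engineer logging in to an instance or pasting a script, and an unknown action name fails loudly instead of doing something unexpected.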

Now, imagine having one place with everything you need to solve your incident.
The relevant information, your health metrics and quick actions all at your fingertips.

4. Learn and prevent (and put that on loop)

Continuous learning is crucial for preventing future issues; it enables you to reduce frequent issues and build better products. Accessing useful data in an easy-to-consume way is important, so that your engineering team can go from being reactive and fixing issues, to being proactive and improving the way you develop and respond.

Maturity scorecards to prevent incidents:

By creating maturity scorecards, an engineering team can ensure best practices are always upheld - for example, ensuring you have the right number of ReplicaSets or that critical vulnerabilities are being remediated.

Dashboards to evaluate and analyze performance:

SREs can define the performance metrics that should be monitored. These metrics could include the number of outages, MTTR, and failed deployments. These metrics, as well as others, can then be visualized in the portal’s dashboards.

With continuous monitoring, you can identify which resources are prone to incidents and cause the highest number of outages. You can check how long it takes to recover per team and per service, and establish initiatives to improve. These initiatives may be to train underperforming teams, fine-tune processes or resolve issues with technical debt. You can use the portal to communicate the initiatives and track them by developer, team and service.
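
As a worked example of one such metric, MTTR per service is the average time between an incident being opened and being resolved. The sketch below assumes each incident is a dictionary with `service`, `opened_at` and `resolved_at` keys; that record shape is an assumption for illustration, and real data would come from the incident management tool.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def mttr_by_service(incidents: list[dict]) -> dict[str, timedelta]:
    """Mean time to recovery per service, from opened/resolved timestamps."""
    durations: dict[str, list[timedelta]] = defaultdict(list)
    for incident in incidents:
        elapsed = incident["resolved_at"] - incident["opened_at"]
        durations[incident["service"]].append(elapsed)
    # sum() needs a timedelta start value, since the default start is the int 0.
    return {
        service: sum(times, timedelta()) / len(times)
        for service, times in durations.items()
    }
```

Grouping the same durations by team instead of service only requires changing the key, which is how the per-team and per-service comparisons described above can share one calculation.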

This ongoing evaluation helps in building a more resilient and efficient system.

A more efficient, effective and informed incident management approach

By coupling the power of incident management tools with the simplicity and context-rich nature of a developer portal, engineering teams can better deal with incidents at all levels. The customizability of the portal means that everyone involved in the incident management program - SREs, developers and managers - can benefit in different ways, streamlining the overall approach and improving the success of the program.
