How to develop self-healing apps: 4 key patterns

Rouan Wilsenach, Software Developer, TES Global

Self-healing applications sound futuristic. The phrase conjures images of advanced artificial intelligence (AI) algorithms adjusting complex processes. But while machine learning may be the next big thing in the tech industry, you don’t need any AI at all to build applications that heal themselves.

Most approaches to self-healing apps follow the same basic patterns, which most software engineers already use frequently—without realizing it. Here's how to get started.


3 principles for self-healing apps

First, identify the basic building blocks and find ways to use them in your applications. Here are three principles:

  • Get to know your application. This is the most important principle. It's easy to dive in and start automating self-healing mechanisms, but you'll waste time if you don't understand the production problems your application is having before you begin. First, set up automated alerts to see which error scenarios are most common.
  • Prevention is better than cure. Sure, it's nice to be able to recover from error scenarios automatically, but it's better to prevent the error scenarios wherever possible. Take a holistic look at problems you detect so that you can identify and fix the root causes where possible.
  • A primary goal of self-healing systems should be happier development teams. Self-healing systems bring benefits to users and operations. But an often-overlooked benefit is that an application that fixes itself reduces the support burden on the development team. Less rote and menial work means happier developers.

How to pattern your self-healing app behavior

1. Error handling

The first pattern to look at is error handling. The idea is simple enough: spot an error and adjust how the system responds accordingly.

You've probably been using this pattern for a long time. A familiar example is catching an exception and returning a different HTTP status code, or redirecting the user to a useful error page. You can use this same pattern to implement something a bit more sophisticated.

Let's say you've built your own payment service to process user payments. It saves your company heaps of money because you don't need to pay another vendor, but it's currently under construction and suffering from stability issues.

You could write some fairly simple code to fall back to using a third-party payment provider if your primary payment service fails to respond within a certain time, or if it returns an error. By doing so, you preserve the primary flow of your application, but you've provided an alternate flow in the event of error so that the application (and, more importantly, the user) can continue doing what's important.
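
As a minimal sketch of that fallback, assume each payment provider is wrapped in a function that takes an amount and returns a confirmation string, and that it raises `TimeoutError` or a hypothetical `PaymentError` when something goes wrong. None of these names come from a real payment library; they are placeholders for illustration.

```python
from typing import Callable


class PaymentError(Exception):
    """Hypothetical error raised when a provider cannot complete a charge."""


def charge_with_fallback(
    amount_cents: int,
    primary: Callable[[int], str],
    fallback: Callable[[int], str],
) -> str:
    """Try the in-house service first; use the third-party provider on failure."""
    try:
        return primary(amount_cents)
    except (PaymentError, TimeoutError):
        # The primary flow failed, so take the alternate flow instead of
        # surfacing the error to the user.
        return fallback(amount_cents)
```

Passing the providers in as arguments keeps the primary flow readable and makes it easy to swap in fakes when testing the failure path.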

2. Manage the flow of information

The second pattern concerns managing the flow of information, which is also probably something you're already doing. For example, when an exception is thrown in your application, you probably catch it and log useful information to your application logs before allowing the exception to propagate.

This is a way of directing information to a person who is able to deal with it appropriately. You can use this same pattern to save your team huge amounts of time.

Error scenarios are often related to bad data. Let's say a "Company" entity in an upstream service's database needs to have an "Office Location" field so a user can assign a shipment to that company. In complex or enterprise systems, that data is often owned by another team, and the usual response is to bug that team until it supplies the missing data.

What if, instead, you built a simple dashboard where you could surface problematic "Company" records? Your application could automatically add problematic "Company" records to this dashboard when it encounters them—and could even email the team responsible for the data to remind them to check the dashboard.

This saves the development team time and may also alert the upstream team to a problem that they could fix for good. I'm not suggesting replacing human interaction with a dashboard, but I've found that having such a system in place can help facilitate conversations.
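
One way to picture the dashboard idea is a small collector that records each problematic "Company" once and notifies the owning team. The sketch below is illustrative only: `Company`, `DataIssueDashboard`, `assign_shipment`, and the `notify_owner` callback are hypothetical names, and the in-memory list stands in for whatever storage a real dashboard would use.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Company:
    name: str
    office_location: Optional[str] = None


@dataclass
class DataIssueDashboard:
    """Collects records that cannot be processed until their owner fixes them."""

    notify_owner: Callable[[str], None]  # e.g. emails the upstream team
    issues: List[str] = field(default_factory=list)

    def report(self, company: Company, problem: str) -> None:
        entry = f"{company.name}: {problem}"
        if entry not in self.issues:  # avoid duplicate rows and repeat emails
            self.issues.append(entry)
            self.notify_owner(entry)


def assign_shipment(company: Company, dashboard: DataIssueDashboard) -> bool:
    """Return True if the shipment was assigned, False if the record was flagged."""
    if not company.office_location:
        dashboard.report(company, "missing Office Location")
        return False
    return True
```

The duplicate check matters in practice: without it, a record that fails on every shipment would flood the upstream team with identical reminders.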

3. Adjust problematic data

Adjusting problematic data is the next pattern to examine. Again, this is something you've likely encountered. A good example is automatically falling back to using another email address field if there is no data in the expected email field for a user.

For example, your app might contact a generic company email address when a user at that company does not have a valid email address. Again, you can employ this simple pattern to do something more powerful.
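
The simple version of this fallback fits in a few lines. The sketch below assumes a deliberately rough validity check (a single "@" and a dot in the domain) purely for illustration; `is_valid_email` and `contact_address` are hypothetical helpers, and a real application would likely apply stricter validation.

```python
from typing import Optional


def is_valid_email(address: Optional[str]) -> bool:
    """Very rough check, for illustration only (not full address validation)."""
    if not address or address.count("@") != 1:
        return False
    local, domain = address.split("@")
    return bool(local) and "." in domain


def contact_address(user_email: Optional[str], company_email: str) -> str:
    """Prefer the user's own address; fall back to the generic company one."""
    return user_email if is_valid_email(user_email) else company_email
```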

What if your application paused the workflow currently under way and sent an email to the generic address, asking the user to fill in a short form with the missing information? The user could then enter the correct data, which prevents the same issue from cropping up again, and the system could resume the process once the correct data is received.

This pattern is especially powerful when an interaction in your application depends on two different users. Finding a way to let them facilitate getting the right data from one another is a great way to reduce the support burden on your team.
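
The pause-and-resume idea can be sketched as a tiny state machine. Everything here is an assumption made for illustration: the `Workflow` fields, the three status strings, and the `request_data` callback, which stands in for whatever sends the email with a link to the form.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Workflow:
    user_id: str
    email: Optional[str] = None
    status: str = "running"  # "running", "paused", or "done"


def advance(workflow: Workflow, request_data: Callable[[str], None]) -> None:
    """Finish the workflow, or pause it and ask for the missing email."""
    if workflow.email is None:
        workflow.status = "paused"
        request_data(workflow.user_id)  # e.g. email a link to a short form
        return
    workflow.status = "done"


def submit_missing_email(
    workflow: Workflow, email: str, request_data: Callable[[str], None]
) -> None:
    """Called when the form is submitted: store the data, then resume."""
    workflow.email = email  # saved, so the same gap does not recur
    advance(workflow, request_data)
```

Because the corrected value is stored on the record before the workflow resumes, later runs for the same user should not hit the same gap.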

4. Retry

The final pattern is probably the simplest of all—a retry. A familiar example is retrying a certain number of times when a call to another service fails. This can be very useful when another system is unreliable. But what about when the user is unreliable?
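
For the familiar case of an unreliable service, a minimal retry helper might look like the sketch below. The default of three attempts and a fixed half-second pause are arbitrary choices for illustration, and the helper retries on any exception, which a real system would probably narrow to errors that are known to be transient.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(call: Callable[[], T], attempts: int = 3, delay_seconds: float = 0.5) -> T:
    """Run `call`, retrying on any exception up to `attempts` times in total."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == attempts:
                raise  # out of retries: let the caller see the last error
            time.sleep(delay_seconds)
    raise ValueError("attempts must be at least 1")
```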

You could use the same strategy to find missing actions in your system and encourage users to perform them. For example, if you have a vacation-booking system, you could write a simple script to check for bookings to which the vacation-accommodation host has not yet responded. You can then automatically send an email to remind the host to respond.
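
A reminder script along those lines could be as small as the sketch below. The `Booking` fields, the `send_email` callback, and the 24-hour waiting period are all assumptions made for illustration; a real script would read bookings from the system's own data store and run on a schedule.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Iterable, List, Optional


@dataclass
class Booking:
    booking_id: str
    host_email: str
    requested_at: datetime
    responded_at: Optional[datetime] = None


def remind_unresponsive_hosts(
    bookings: Iterable[Booking],
    send_email: Callable[[str, str], None],
    now: datetime,
    max_wait: timedelta = timedelta(hours=24),
) -> List[str]:
    """Email hosts who have not responded within `max_wait`; return booking ids."""
    reminded = []
    for booking in bookings:
        overdue = now - booking.requested_at > max_wait
        if booking.responded_at is None and overdue:
            message = f"Please respond to booking {booking.booking_id}"
            send_email(booking.host_email, message)
            reminded.append(booking.booking_id)
    return reminded
```

Taking `now` as a parameter, rather than reading the clock inside the function, keeps the check easy to test with fixed dates.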


Keep it simple with automation

Self-healing systems are not about machine learning or artificial intelligence, but about understanding your systems' frequent error scenarios and automating simple recovery steps.

By keeping these four simple patterns in mind, you'll be able to find creative ways to make your application more robust by automating solutions to those cumbersome support issues that keep cropping up.
