A guest blog by Jonas Lenntun from Approved Sweden.
Clearly, we’ll automate!
Automation and efficiency go hand in hand, and both have been talked about in IT since the 1970s. Nevertheless, 40 years on, the majority of companies have yet to internalize and embrace automated processes.
The growing number of devices to be monitored, combined with higher availability requirements, makes it more urgent for companies to review their internal processes. This is especially true as digitization introduces more and more critical e-services that are expected to be available 24 hours a day.
Introducing automation means, in short, removing manual processes that a machine can easily perform according to predetermined routines, faster and in the same way each time.
Some processes have already come a long way in this respect: equipment orders, user setup and server updates, among others, along with a lot of administrative work.
In the IT department, there are three interesting areas with high potential for automating manual processes in order to become more efficient, shorten lead times and reduce repetitive work.
What can machine learning add?
Machine learning has previously been perceived as not directly relevant to traditional monitoring and incident management. But more and more people realize that it is highly relevant for simplifying everyday work.
Instead of manually escalating incidents or sending notifications to on-call staff through complex and blunt rule sets, machine learning can be applied.
We can relatively easily train a machine to automatically identify patterns and then perform the actions we want in a very short time.
We have already begun with automation.
Most likely, you have already begun implementing automation in several areas. Since automation is such a wide-ranging field, this article focuses on activities that increase the value of what monitoring delivers and that are most relevant to you in IT operations.
Three important automation areas
At first sight, escalation looks like a rather simple process to automate. However, the more complex the rules for distributing different types of alarms to different groups depending on certain criteria, the harder it becomes to manage that routing through a static rule set.
Instead of building complex scripts or programs, you can look at an alarm and train a model on where to send it. How the model reaches its conclusion is where machine learning comes into its own: it finds patterns we did not know about.
Large time savings can be made because cases are sent to the correct group without having to wait for a manual decision, which shortens the processing time.
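To make the idea of training on past routing decisions more concrete, here is a minimal sketch in Python. It is not how any particular monitoring product does it: the `AlarmRouter` class, the queue names and the sample alarm texts are all hypothetical, and a small naive Bayes classifier stands in for whatever model you would actually choose.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase an alarm description and split it into word tokens."""
    return text.lower().split()


class AlarmRouter:
    """Tiny multinomial naive Bayes classifier mapping alarm text to a queue."""

    def __init__(self):
        self.queue_counts = Counter()            # alarms seen per queue
        self.word_counts = defaultdict(Counter)  # word frequencies per queue
        self.vocabulary = set()

    def train(self, examples):
        """examples: iterable of (alarm_text, queue) pairs from past incidents."""
        for text, queue in examples:
            self.queue_counts[queue] += 1
            for word in tokenize(text):
                self.word_counts[queue][word] += 1
                self.vocabulary.add(word)

    def predict(self, text):
        """Return the queue with the highest log-probability for this alarm."""
        total_alarms = sum(self.queue_counts.values())
        vocab_size = len(self.vocabulary)
        best_queue, best_score = None, -math.inf
        for queue, count in self.queue_counts.items():
            score = math.log(count / total_alarms)
            total_words = sum(self.word_counts[queue].values())
            for word in tokenize(text):
                # Laplace smoothing so an unseen word does not rule out a queue
                seen = self.word_counts[queue][word]
                score += math.log((seen + 1) / (total_words + vocab_size))
            if score > best_score:
                best_queue, best_score = queue, score
        return best_queue


# Hypothetical history of manually routed alarms (illustrative only).
history = [
    ("sql server database log full", "database-team"),
    ("sql server backup job failed", "database-team"),
    ("windows service print spooler stopped", "server-team"),
    ("logical disk c: free space low", "server-team"),
    ("switch port down on core router", "network-team"),
    ("router interface packet loss high", "network-team"),
]

router = AlarmRouter()
router.train(history)
print(router.predict("sql server database offline"))
```

The point of the sketch is that nobody writes a rule saying "alarms mentioning sql go to the database team"; the mapping is learned from how earlier alarms were handled. In practice you would need far more history and a way to fall back to manual routing when the model is unsure.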
Many errors that occur at the operating system level, or that involve inadvertently stopped services, can easily be reset automatically.
Even though a Windows service can be configured to start again if it stops, it is better to let a monitoring system capture the error. Since a monitoring system can both restore the service and keep statistics, it becomes easier to track any recurring disruption. These statistics also provide a good basis for problem management with the supplier: the dialogue is based on data instead of rumors and gut feeling.
Many recovery actions need to be clearly defined, but it is also possible to use machine learning to train a model that learns which recovery actions to run, in order to minimize complexity.
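The combination of restoring a service and keeping statistics can be sketched roughly as follows. This is only an illustration: `RecoveryLog` and `handle_stopped_service` are hypothetical names, and the `restart` callable stands in for whatever recovery action your monitoring system actually exposes.

```python
from collections import Counter
from datetime import datetime, timezone


class RecoveryLog:
    """Records every automatic restart so recurring faults become visible."""

    def __init__(self):
        self.events = []           # list of (timestamp, service, succeeded)
        self.restarts = Counter()  # restart attempts per service

    def record(self, service, succeeded):
        self.events.append((datetime.now(timezone.utc), service, succeeded))
        self.restarts[service] += 1

    def recurring(self, threshold):
        """Services restarted at least `threshold` times, sorted by name."""
        return sorted(s for s, n in self.restarts.items() if n >= threshold)


def handle_stopped_service(service, restart, log):
    """Try to restart a stopped service and log the outcome.

    `restart` is a hypothetical callable supplied by the monitoring system;
    it should return True when the service is running again.
    """
    try:
        succeeded = bool(restart(service))
    except Exception:
        succeeded = False  # a failed restart is still worth recording
    log.record(service, succeeded)
    return succeeded
```

A list such as `log.recurring(3)` is the kind of data that can be brought to the supplier: it shows which services keep stopping, rather than relying on someone remembering that "it happens a lot".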
Many errors may be difficult to reset automatically, but this does not mean we should rule out automation.
If a disk indicates that it is running out of space, human judgment may be needed to determine what can be cleaned up. But that does not prevent us from collecting diagnostic information for the person who will be performing the task.
Automated diagnostics can, for example, identify the largest directories and the largest files in them, or insert a graph of disk usage into the incident for analysis.
Here too we can use machine learning to determine what to run.
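As an illustration of the disk example, the sketch below collects the largest files and directories under a path using only the Python standard library. The function names are made up for this article, and how the resulting text is attached to the alarm depends entirely on your monitoring system.

```python
import heapq
import os


def largest_files(root, top_n=10):
    """Return the top_n (size_in_bytes, path) pairs under root, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return heapq.nlargest(top_n, sizes)


def directory_totals(root):
    """Return {directory: total bytes of the files directly inside it}."""
    totals = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        total = 0
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                continue
        totals[dirpath] = total
    return totals


def disk_diagnostics(root, top_n=5):
    """Build a plain-text summary to attach to a low-disk-space alarm."""
    lines = [f"Largest files under {root}:"]
    for size, path in largest_files(root, top_n):
        lines.append(f"  {size / 1_048_576:10.1f} MiB  {path}")
    lines.append("Largest directories (direct contents only):")
    totals = directory_totals(root)
    for path in sorted(totals, key=totals.get, reverse=True)[:top_n]:
        lines.append(f"  {totals[path] / 1_048_576:10.1f} MiB  {path}")
    return "\n".join(lines)
```

The script deletes nothing. It only saves the technician the first few minutes of looking around, which is exactly the part that does not need human judgment.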
How do we show results?
- Overall recovery time for automated alarms compared to those that are not
- Overall automation rate – what percentage of alarms are automated
- Automation rate per queue – what percentage of alarms are automated per destination
- Recovery time of automatically escalated alarms compared to those done manually
- Recovery time per escalated destination
- Recovery time of automatic reset compared to manual handling
- Recovery time of automated diagnostics compared to manual handling
These are just a few key figures that are very effective for showing the results of automation and machine learning.
Below you will find an example from Approved's operational analysis tool “IT Service Analytics” (in Swedish) which, with data from Microsoft System Center Operations Manager, can show results after the introduction of automation.
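A few of the key figures above can be computed from a simple list of alarm records, as in the sketch below. The record fields (`queue`, `automated`, `recovery_minutes`) and the sample values are assumptions made for illustration; they are not measured data and not the schema of any specific tool.

```python
from collections import defaultdict
from statistics import mean


def automation_rate(alarms):
    """Percentage of alarms that were handled automatically."""
    if not alarms:
        return 0.0
    automated = sum(1 for a in alarms if a["automated"])
    return 100.0 * automated / len(alarms)


def automation_rate_per_queue(alarms):
    """Automation rate broken down per destination queue."""
    by_queue = defaultdict(list)
    for a in alarms:
        by_queue[a["queue"]].append(a)
    return {q: automation_rate(items) for q, items in by_queue.items()}


def mean_recovery_minutes(alarms):
    """Mean recovery time, split into automated and manual alarms."""
    groups = {"automated": [], "manual": []}
    for a in alarms:
        key = "automated" if a["automated"] else "manual"
        groups[key].append(a["recovery_minutes"])
    return {k: (mean(v) if v else None) for k, v in groups.items()}


# Made-up sample records purely for illustration, not measured data.
alarms = [
    {"queue": "server-team", "automated": True, "recovery_minutes": 4},
    {"queue": "server-team", "automated": False, "recovery_minutes": 50},
    {"queue": "database-team", "automated": True, "recovery_minutes": 6},
    {"queue": "database-team", "automated": True, "recovery_minutes": 10},
]

print(automation_rate(alarms))
print(automation_rate_per_queue(alarms))
print(mean_recovery_minutes(alarms))
```

The same grouping approach extends to the other figures in the list, for example by splitting on the type of automation (escalation, reset or diagnostics) instead of on the queue.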