3 Reasons To Implement Automation & Machine Learning For IT-Operations
by Jonas Lenntun, on 05-Mar-2018 15:01:25
Clearly, we'll automate!
Automation and efficiency go hand in hand, and the idea has been discussed in IT since the 1970s. Nevertheless, 40 years on, the majority of companies have yet to internalize and embrace automated processes.
The growing number of devices to be monitored, combined with higher availability requirements, makes it more urgent for companies to review their internal processes. This is especially true as digitization introduces more and more critical e-services that are expected to be available 24 hours a day.
In short, introducing automation means removing manual processes that a machine can easily perform according to predetermined routines, faster and in the same way each time.
Some processes have already come a long way in this respect, among them equipment orders, user setup and server updates, along with a lot of administrative work.
At the IT department, there are three interesting areas with
What can machine learning add?
Machine learning has previously been perceived as not directly relevant to traditional monitoring and incident management. But more and more people realize that it is highly relevant for simplifying everyday work in every respect.
Instead of manually escalating incidents, or sending notifications to on-call staff through complex and blunt rule sets, machine learning can be applied.
We can relatively easily train a machine to identify patterns automatically and then perform the actions we want in a very short time.
We have already begun with automation.
Most likely, you have already begun implementing automation in several areas. Since automation is such a wide-ranging field, this article focuses on activities that increase the value of what monitoring delivers and that are most relevant to you in IT operations.
Three important automation areas
At first sight, escalation looks like a rather simple process to automate. However, the more complex the rules for distributing different types of alarms to different groups based on certain criteria, the harder it becomes to manage that routing through a static rule set.
Instead of building complex scripts or programs, you can look at an alarm and train a model on where to send it. How the model reaches its conclusion is where machine learning comes into its own: it finds patterns we did not know about.
Large time savings can be made by shortening the processing time, because cases are sent to the correct group without having to wait for a manual decision.
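To make the idea of "training where to send an alarm" concrete, here is a minimal sketch using a small naive Bayes text classifier written with the Python standard library only. The alarm texts, team names and the `AlarmRouter` class are hypothetical and chosen for illustration; a real setup would learn from your own alarm history and would likely use an established machine learning library.

```python
import math
from collections import Counter, defaultdict


class AlarmRouter:
    """Tiny multinomial naive Bayes classifier mapping alarm text to a team."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # team -> word frequencies
        self.team_counts = Counter()             # team -> number of training alarms
        self.vocabulary = set()

    def train(self, alarm_text, team):
        words = alarm_text.lower().split()
        self.word_counts[team].update(words)
        self.team_counts[team] += 1
        self.vocabulary.update(words)

    def route(self, alarm_text):
        words = alarm_text.lower().split()
        total_alarms = sum(self.team_counts.values())
        best_team, best_score = None, -math.inf
        for team, count in self.team_counts.items():
            # Log prior: how often this team handled alarms historically.
            score = math.log(count / total_alarms)
            team_total = sum(self.word_counts[team].values())
            for word in words:
                # Laplace smoothing so unseen words do not zero out the score.
                numerator = self.word_counts[team][word] + 1
                denominator = team_total + len(self.vocabulary)
                score += math.log(numerator / denominator)
            if score > best_score:
                best_team, best_score = team, score
        return best_team


# Hypothetical history: alarm text and the team that ended up resolving it.
history = [
    ("disk space low on sql01", "database"),
    ("sql service stopped on sql02", "database"),
    ("database backup failed on sql01", "database"),
    ("interface down on switch core01", "network"),
    ("packet loss high on router edge02", "network"),
    ("vpn tunnel down on router edge01", "network"),
]

router = AlarmRouter()
for text, team in history:
    router.train(text, team)

print(router.route("backup failed on sql02"))
print(router.route("interface flapping on switch core02"))
```

Note that no routing rule is written by hand here: the destination follows from which words appeared in alarms each team handled before, which is the kind of pattern a static rule set has to spell out explicitly.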
Many errors that occur at the operating system level, or that involve inadvertently stopped services, can easily be reset.
Even though a Windows service can be configured to start again if it stops, it is better to let a monitoring system capture the error. Since a monitoring system can both restore the service and keep statistics, it becomes easier to track any recurring disturbance. These statistics also provide a good basis for the problem management process with the supplier: the dialogue is based on data instead of rumors and gut feeling.
Many resets need to be clearly defined, but to reduce complexity it is also possible to use machine learning to train a model that learns which recovery actions to run.
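The point about combining reset and statistics can be sketched as follows. This is an illustrative outline only: `ServiceWatcher`, its `is_running` and `restart` callables, and the in-memory `state` dictionary are assumptions standing in for whatever probes and actions your monitoring system actually provides.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ServiceWatcher:
    """Restarts stopped services and counts how often each one needed it."""
    is_running: Callable[[str], bool]   # hypothetical probe supplied by the caller
    restart: Callable[[str], None]      # hypothetical restart action
    restart_counts: Counter = field(default_factory=Counter)

    def check(self, service):
        if self.is_running(service):
            return "ok"
        self.restart(service)
        self.restart_counts[service] += 1  # keep statistics for follow-up
        # If the restart did not help, hand the alarm over to a human.
        return "restarted" if self.is_running(service) else "escalate"

    def recurring(self, threshold=3):
        """Services restarted at least `threshold` times: candidates for problem management."""
        return sorted(
            name for name, count in self.restart_counts.items() if count >= threshold
        )


# Simulated environment: one stopped service, one healthy service.
state = {"Spooler": False, "W3SVC": True}
watcher = ServiceWatcher(
    is_running=lambda name: state[name],
    restart=lambda name: state.update({name: True}),
)

results = {name: watcher.check(name) for name in ("Spooler", "W3SVC")}
print(results)
print(watcher.recurring(threshold=1))
```

Unlike a service that silently restarts itself, every reset here leaves a counter behind, so recurring failures can be raised with the supplier on the basis of data.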
Many errors that occur may be difficult to automatically reset, but this does not mean we should exclude automation.
If a disk indicates that it is running out of space, human judgment may be needed to determine what can be cleaned up. But that does not prevent us from collecting diagnostic information for the person who will perform the task.
Automation of diagnostics can be to look at which of the largest directories contain the largest
Here too, we can use machine learning to determine what to run and what not to.
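As an illustration of automated diagnostics for a low-disk alarm, the sketch below collects free space and the largest subdirectories under a path, so the information is already attached when a person picks up the case. The function names and report layout are hypothetical, and the temporary directory exists only to make the example self-contained.

```python
import os
import shutil
import tempfile


def tree_size(path):
    """Total size in bytes of all files below `path`."""
    total = 0
    for current, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(current, name))
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def largest_subdirectories(root, top=5):
    """Return up to `top` (name, bytes) pairs for the biggest subdirectories."""
    sizes = []
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes.append((entry.name, tree_size(entry.path)))
    sizes.sort(key=lambda item: item[1], reverse=True)
    return sizes[:top]


def disk_report(root, top=5):
    """Diagnostic summary to attach to a low-disk alarm."""
    usage = shutil.disk_usage(root)
    return {
        "free_percent": round(100 * usage.free / usage.total, 1),
        "largest": largest_subdirectories(root, top),
    }


# Demo on a throwaway directory tree with known file sizes.
with tempfile.TemporaryDirectory() as root:
    for folder, size in (("logs", 3000), ("tmp", 10)):
        os.mkdir(os.path.join(root, folder))
        with open(os.path.join(root, folder, "data.bin"), "wb") as handle:
            handle.write(b"x" * size)
    report = disk_report(root, top=2)

print(report["largest"])
```

The script deliberately deletes nothing: deciding what can be removed stays with a person, while the time-consuming data gathering is automated.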
How do we show results?
Introducing automation and machine learning in IT operations has many advantages. Since many things now happen without anyone even noticing, follow-up is one of the most important parts of improving results after the introduction.
There are many important key figures to look at before and after the introduction, but the most important is of course "Mean Time To Repair", abbreviated MTTR. In short, this is the time it takes for an alarm to be resolved and closed.
Because we can divide automation into three different categories, we can measure:
- Recovery time overall on the alarms that are automated compared to those that are not
- Automation degree overall - What is the percentage of alarms automated
- Automation rate per queue - What is the percentage of alarms automated per destination
- Recovery time of automatically escalated alarms compared to those done manually
- Recovery time per escalated destination
- Recovery time of automatic reset compared to manual handling
- Recovery time of automated diagnostics compared to manual handling
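A few of these key figures can be computed directly from alarm records. The sketch below assumes a hypothetical `Alarm` record with open and close timestamps, a destination queue and a flag for automated handling; the sample data is invented purely to show the calculation.

```python
from collections import defaultdict, namedtuple
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical alarm record; real field names depend on your monitoring system.
Alarm = namedtuple("Alarm", "opened closed queue automated")


def mttr_minutes(alarms):
    """Mean time to repair in minutes, or None if there are no alarms."""
    durations = [(a.closed - a.opened).total_seconds() / 60 for a in alarms]
    return mean(durations) if durations else None


def automation_degree(alarms):
    """Percentage of alarms that were handled automatically."""
    if not alarms:
        return 0.0
    return 100 * sum(1 for a in alarms if a.automated) / len(alarms)


def automation_rate_per_queue(alarms):
    """Automation degree broken down by destination queue."""
    by_queue = defaultdict(list)
    for alarm in alarms:
        by_queue[alarm.queue].append(alarm)
    return {queue: automation_degree(items) for queue, items in by_queue.items()}


t0 = datetime(2018, 3, 5, 8, 0)
alarms = [
    Alarm(t0, t0 + timedelta(minutes=5), "database", True),
    Alarm(t0, t0 + timedelta(minutes=10), "database", True),
    Alarm(t0, t0 + timedelta(minutes=15), "network", True),
    Alarm(t0, t0 + timedelta(minutes=60), "database", False),
    Alarm(t0, t0 + timedelta(minutes=120), "network", False),
]

automated = [a for a in alarms if a.automated]
manual = [a for a in alarms if not a.automated]

print("MTTR automated:", mttr_minutes(automated), "min")
print("MTTR manual:", mttr_minutes(manual), "min")
print("Automation degree:", automation_degree(alarms), "%")
print("Per queue:", automation_rate_per_queue(alarms))
```

Running the same calculation on data from before and after the introduction gives the comparison the follow-up needs.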
These are just a few of the key figures that are effective in revealing the results of automation and machine learning.
Below you will find an example from the Approved operational analysis tool "IT Service Analytics", which, with data from Microsoft System Center Operations Manager, can show results after the introduction of automation.
Automation of IT operations is a topic that cannot be ignored if you don't want to risk falling behind. The challenge at first is to decide how and where to start. Breaking the work down and analyzing where to put the effort is a common tactic. With automation, you get actions that run 24/7 across all your deliveries, reducing the need for on-call staffing.
We hope this has been a good introduction to why you need to look at automation and machine learning in your organization.
For more information, email us at firstname.lastname@example.org.