
Summary of Root Cause Analysis Ideas and Methods | Ensuring the Stability of IT Systems

2022-05-15 07:23:15 Cloudwise AIOps Community

The Cloudwise AIOps Community was initiated by Cloudwise. Focused on operations and maintenance (O&M) business scenarios, it provides an overall service system of algorithms, computing power, and datasets, and serves as a community for exchanging solutions to intelligent O&M problems. The community is committed to spreading AIOps technology and aims to work with customers, users, researchers, and developers across industries to solve the technical problems of intelligent operations, promote the adoption of AIOps technology in enterprises, and build a healthy, win-win AIOps developer ecosystem.


In recent years, as IT system monitoring capabilities have matured, root cause analysis of problems in running IT systems has attracted the attention of many researchers. This article surveys a large body of literature on root cause analysis in the operations field and, in light of actual operations needs, breaks the root cause analysis problem down into sub-problems and then summarizes and analyzes the solutions to each.

I. The concept and abstraction of IT systems and their stability

An IT system is IT infrastructure. Definitions vary widely, but it is generally understood as the collection of physical equipment and application software needed to run an entire organization, together with the set of human and technical services within the organization that the management budget provides for. The investment in information technology hardware, software, and services often mentioned in the industry is, in effect, IT infrastructure. For a business, these facilities provide the foundation for serving customers, connecting with suppliers, and managing internal operations. IT infrastructure spending often accounts for 25%-30% of a large enterprise's information technology expenditure. The task of IT operations is to keep the service runtime environment as stable as possible, that is, to keep services running smoothly within the resource limits that the IT infrastructure provides. As shown in Figure 1, operations staff monitor the running state of the system (condition monitoring), identify the fault points (fault detection), trace them back to the source of the problem (root cause analysis), and then act on the system (control strategy and control signal generation) to restore normal operation or keep the system stable.

Root cause analysis is an important part of IT operations: its goal is to find out which events actually triggered a given phenomenon or symptom in the IT system. Much like clinical diagnosis, the operator analyzes metric data and system logs together to determine the system's main problem and thereby locate the fault. In many enterprises, the need to control costs and keep services stable creates strong demand for root cause analysis technology. Mature root cause analysis helps operations staff locate system problems quickly, which speeds up repair, resolves problems in the running IT system at the lowest possible cost, increases the time the system runs smoothly, and reduces business losses. In Figure 1, root cause analysis is the key link between the problem discovery module and the problem solving module, for two reasons. On the one hand, today's IT systems are very complex: they easily reach thousands of nodes, and the structure and function of those nodes are tightly coupled, so a single-point problem often has a wide scope of impact and operations staff often cannot identify the fault point directly. On the other hand, to contain the impact of a failure, operations staff need to repair the system based on the results of root cause analysis so that it returns to normal in the shortest possible time.

Traditional operations work is done manually. At this stage, troubleshooting a root cause is often a laborious process: staff must check system logs and monitoring metrics and understand the system state in order to infer the cause of the failure. With the development of automated operations, automatic monitoring and information collection have become increasingly complete, so monitoring is more intuitive for operations staff and far closer to real time. For small IT systems, automated operations greatly strengthens the staff's control over the system and reduces the effort they spend narrowing down the root cause. At the same time, however, with the development of DevOps and cloud technology, IT systems keep growing, and it is not uncommon for a single system to have thousands of nodes. In addition, the move to microservices makes the structure of IT systems change more and more rapidly. For root cause analysis on large-scale systems, the sheer volume of monitoring data means that relying solely on operations staff is no longer adequate.

Because manual troubleshooting still has many shortcomings, automated root cause analysis has become a focus of attention. Automated root cause analysis is the process of using algorithms to analyze a given fault automatically and output recommendations that help operations staff troubleshoot the system. It can reduce the workload of operations staff, shorten the mean time to repair, and increase the system's mean available time. For consumer-facing (to-C) enterprises, it is also an important means of reducing customer complaints, lowering operations costs, and thereby improving economic returns.
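
To make the link between repair time and available time concrete, the standard steady-state availability relation can be written as below. The symbols MTTF (mean time to failure) and MTTR (mean time to repair) and the numbers are illustrative assumptions, not figures from the article.

```latex
% Steady-state availability A as a function of mean time to failure and mean time to repair
\[
  A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}
\]
% Illustrative values only, with MTTF fixed at 720 hours:
\[
  \mathrm{MTTR} = 4\,\mathrm{h}: \quad A = \frac{720}{724} \approx 0.9945
  \qquad
  \mathrm{MTTR} = 1\,\mathrm{h}: \quad A = \frac{720}{721} \approx 0.9986
\]
```

Under these assumed values, shortening repair from four hours to one hour raises availability without any change in how often the system fails, which is the effect faster root cause analysis aims for.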

II. Summary of root cause analysis ideas and methods

The term root cause analysis was not invented by operations. In IT, it originally referred to analyzing the problem points that cause a program to run abnormally, which is what we usually call "finding bugs". Later, as operations and development became increasingly inseparable, the term used by developers gradually spread to the operations field and evolved into what we now understand as root cause analysis in IT operations.

As the name suggests, root cause analysis is the process of going from the symptoms of a problem to its essence. Given the characteristics of IT operations, the root cause analysis problem can be divided into two parts. First, we need to determine the location of the problem at the macro level, giving the relevant location information and the approximate scope of the problem; this process is called root cause range compression. Second, based on the result of root cause range compression, we examine individual points further, pinpoint the problem events on a node, and provide operations staff with the logical evidence they need to solve the problem; this process is called root cause event lookup.

In the rest of this article, we introduce existing ideas and methods for root cause range compression and root cause event lookup, and briefly analyze the advantages and disadvantages of each.

1. Root cause range compression methods

The main purpose of root cause range compression is to screen out the main fault points from the huge volume of IT system monitoring data. The methods used in this process are therefore mainly data-driven and supplemented by operations logic: the statistical characteristics of the data are combined with operations experience to narrow down the scope of the problem source. Because data-driven models and methods are inevitably affected by data quality, the process also needs to estimate the credibility of its results.

Classifier-based models, such as decision trees, support vector machines (binary classification), and neural networks (binary or multi-class classification), convert the system state into features and learn those features and the implicit logic between them in order to judge the system state and the corresponding root cause range. Through statistical learning or machine learning, classifier models can extract system constraints automatically and so infer the root cause range in different situations. These models therefore apply broadly: as long as we can extract features from the system and label the data, they can usually be used for root cause range compression. The downside is that classifier-based methods generally suffer from poor interpretability, and the system knowledge is implicit in the model structure; for root cause analysis, it is hard to verify whether the "operations knowledge" the model has learned is real. In addition, no effective general method for feature extraction and selection has been found so far. Working out which features matter for root cause range compression on a given kind of operations data takes time and accumulated experience. Such methods therefore place certain requirements on the quantity and quality of system monitoring data.
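
As a rough illustration of the classifier idea, the sketch below maps a feature vector describing the system state to a root cause range label and attaches a crude credibility score. It uses a nearest-centroid classifier as a simple stand-in for the decision trees, support vector machines, and neural networks named above. The feature names, labels, and training incidents are all hypothetical.

```python
from collections import defaultdict
from math import dist


def fit_centroids(samples):
    """Average the feature vectors of each label.

    samples: list of (feature_vector, label) pairs.
    Returns {label: centroid_vector}.
    """
    sums = {}
    counts = defaultdict(int)
    for features, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def predict(centroids, features):
    """Return (label, confidence) for one feature vector.

    The confidence is an inverse-distance weight in (0, 1]; it is only a
    crude credibility estimate, not a calibrated probability.
    """
    distances = {label: dist(features, c) for label, c in centroids.items()}
    best = min(distances, key=distances.get)
    weights = {label: 1.0 / (d + 1e-9) for label, d in distances.items()}
    return best, weights[best] / sum(weights.values())


# Hypothetical labelled incidents:
# (cpu_utilisation, error_rate, p99_latency_seconds) -> root cause range.
# Real features would need normalising so that no single one dominates.
TRAINING = [
    ((0.95, 0.02, 0.4), "compute"),
    ((0.90, 0.05, 0.5), "compute"),
    ((0.30, 0.40, 0.3), "application"),
    ((0.35, 0.50, 0.2), "application"),
    ((0.25, 0.03, 2.5), "network"),
    ((0.20, 0.05, 3.0), "network"),
]

centroids = fit_centroids(TRAINING)
label, confidence = predict(centroids, (0.92, 0.04, 0.45))
print(label, round(confidence, 2))
```

Even in this toy form, the learned "knowledge" is just three centroid vectors, which hints at why it is hard to verify what a larger model has actually learned.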

Another common family of models, such as Markov models and stochastic Petri nets, simulates the system structure and uses operations experience built into the model to estimate the root cause range. These models are data-oriented: a graph model provides the general framework of the system, and machine learning then determines the probability distributions of the transitions in the model, automatically producing a probabilistic model of operations knowledge. A currently popular idea within this family is to model the system with statistical learning on the association topology of nodes or metrics, and then encode operations experience as algorithm logic that compresses the root cause range. For example, some methods use the distribution of faults to identify the root cause location, on the premise that if a large number of abnormal services pass through a node, that node is more likely to be the root cause. By combining a system model with operations experience, these methods can achieve better root cause range compression.
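
The premise that a node traversed by many abnormal services is a likelier root cause can be sketched as a simple scoring rule over call paths. The scoring formula (coverage times purity), the node names, and the request data below are assumptions chosen for illustration, not a method taken from a specific paper.

```python
from collections import Counter


def rank_root_cause_candidates(call_paths, abnormal_ids):
    """Rank nodes by how strongly abnormal requests concentrate on them.

    call_paths: {request_id: [node, ...]} (the nodes each request traversed).
    abnormal_ids: set of request ids flagged as abnormal.
    Returns a list of (node, score) sorted from most to least suspicious.
    """
    abnormal_hits = Counter()
    total_hits = Counter()
    for request_id, path in call_paths.items():
        for node in set(path):  # count each node once per request
            total_hits[node] += 1
            if request_id in abnormal_ids:
                abnormal_hits[node] += 1

    n_abnormal = max(len(abnormal_ids), 1)
    scores = {}
    for node, total in total_hits.items():
        coverage = abnormal_hits[node] / n_abnormal  # share of abnormal requests through this node
        purity = abnormal_hits[node] / total         # share of this node's traffic that is abnormal
        scores[node] = coverage * purity
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical traces for five requests.
CALL_PATHS = {
    "r1": ["gateway", "orders", "db-a"],
    "r2": ["gateway", "orders", "db-a"],
    "r3": ["gateway", "users", "db-b"],
    "r4": ["gateway", "orders", "cache"],
    "r5": ["gateway", "payments", "db-a"],
}
ABNORMAL = {"r1", "r2", "r5"}

ranking = rank_root_cause_candidates(CALL_PATHS, ABNORMAL)
for node, score in ranking:
    print(f"{node}: {score:.2f}")
```

Multiplying coverage by purity keeps a shared entry point such as the gateway, which every request passes through, from outranking a node that only abnormal requests touch.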

In general, data-driven methods do help narrow the root cause range, but in most scenarios they still cannot provide enough help for subsequent analysis, and they do not present the details of the problem adequately. At a deeper level, the gap between causality and correlation is hard to bridge: it is difficult to map the correlations extracted by an algorithm onto the causal relationships of fault propagation in the system. Incorporating operations experience into the algorithm can shorten the distance between causality and correlation somewhat, but it still does not give operations staff enough information to repair the problem. With root cause range compression alone, some manual troubleshooting is still required, and it is hard to close the IT system control loop automatically.

2. Root cause event lookup methods

At present there are few solutions to the root cause event lookup problem, and it has not been discussed sufficiently in academia. As early as 1999, Bell Labs proposed a root cause analysis framework based on event reasoning. The framework proposes a root cause reasoning method based on an event relationship graph, and considers both root cause reasoning under incomplete information and the effect that introducing timing information has on the modeling power of the event relationship graph. This method offers useful guidance and reference for root cause analysis in intelligent operations. Unfortunately, most later in-depth research in this direction turned to Petri nets and gradually drifted away from current operations scenarios. From the perspective of analyzing how operations resources are occupied, this approach can still solve the operations problems of some simple systems, but as systems grow, services multiply, and resource occupation becomes more complex, Petri-net-based analysis has gradually become inadequate. Predictably, starting the analysis of operations faults directly from the details of resource occupation places a heavy burden on the analysis engine. Another interesting idea comes from a paper published in IEEE/ACM ToN in 2012. The G-RCA system described in that paper takes a promising approach to root cause analysis: it builds a cause-and-effect graph of events by analyzing the relationships between system events, and then applies different reasoning methods to analyze root cause events. The knowledge extraction method it introduces is also valuable for characterizing and solving root cause analysis problems in the operations field.

In addition, root cause reasoning frameworks or models based on SAT theory or inductive logic programs (probabilistic or deterministic models) have been mentioned in some studies. However, because their reasoning complexity is too high (especially for some non-on-the-fly methods) or because they are far removed from the needs of operations scenarios, they currently show little prospect of being deployed in IT operations.
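
The event-graph idea can be illustrated with a minimal sketch: given a cause-and-effect graph between event types and a set of observed events with timestamps, walk backwards from a symptom and keep the observed events that have no observed upstream cause. This is a simplified illustration under stated assumptions, not the algorithm of the Bell Labs framework or of G-RCA. The event names, the graph, and the rule that a cause must not occur later than its effect are hypothetical, and the graph is assumed to be acyclic.

```python
def find_root_cause_events(causes_of, observed, symptom):
    """Trace a symptom back to its earliest observed causes.

    causes_of: {effect_event: [possible_cause_event, ...]} (assumed acyclic).
    observed: {event: timestamp} for events actually seen in this incident.
    symptom: the observed event to explain.
    Returns root cause events ordered by timestamp, or [] if the symptom
    was not observed.
    """
    if symptom not in observed:
        return []
    roots, stack, seen = [], [symptom], {symptom}
    while stack:
        event = stack.pop()
        # A candidate cause counts only if it was observed no later than its effect.
        active = [cause for cause in causes_of.get(event, [])
                  if cause in observed and observed[cause] <= observed[event]]
        if not active:
            roots.append(event)  # nothing observed upstream: treat as a root cause
        for cause in active:
            if cause not in seen:
                seen.add(cause)
                stack.append(cause)
    return sorted(roots, key=observed.get)


# Hypothetical causal knowledge and one incident's observed events (timestamps in seconds).
CAUSES_OF = {
    "checkout_5xx": ["order_service_timeout", "gateway_overload"],
    "order_service_timeout": ["db_slow_query", "order_service_gc_pause"],
    "db_slow_query": ["db_disk_io_saturated"],
}
OBSERVED = {
    "db_disk_io_saturated": 100,
    "db_slow_query": 105,
    "order_service_timeout": 110,
    "checkout_5xx": 112,
    "order_service_gc_pause": 300,  # happened after the timeout, so it cannot explain it
}

root_events = find_root_cause_events(CAUSES_OF, OBSERVED, "checkout_5xx")
print(root_events)
```

The timestamp check is a small example of how timing information prunes the event graph; the chain of events visited along the way doubles as the logical evidence handed to operations staff.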

Open source benefits

Cloudwise has open-sourced FlyFish, its data visualization platform. By configuring data models, it offers users hundreds of visual chart components, so a polished large-screen dashboard that meets your business needs can be built with zero code. FlyFish also provides flexible extensibility, supporting component development, custom functions, and global event configuration, so that complex requirements can still be developed and delivered efficiently.

Click the links below to give FlyFish a like and a Star. Take part in component development for a chance to win a cash prize of ten thousand yuan.

GitHub Address :

Gitee Address :

Ten thousand yuan cash activities :

copyright notice
Author: Cloudwise AIOps Community. Please include the original link when reprinting. Thank you.
