Incident Management for Explainable and Automated Root Cause Analysis in Cloud Data Centers    

Effective root cause analysis (RCA) of performance issues in modern cloud environments remains a hard problem. Traditional RCA tracks complex issues by their signatures, known as problem incidents. Common approaches to incident discovery rely mainly on the expertise of users who define environment-spec...

Bibliographic Details
Main authors: Arnak Poghosyan, Ashot Harutyunyan, Naira Grigoryan, Nicholas Kushmerick
Format: article
Language: EN
Published: Graz University of Technology 2021
Subjects:
ano
Online access: https://doaj.org/article/2451a06304304fd6b58cd92b364abd4d
id oai:doaj.org-article:2451a06304304fd6b58cd92b364abd4d
record_format dspace
spelling oai:doaj.org-article:2451a06304304fd6b58cd92b364abd4d
2021-11-30T04:30:12Z
Incident Management for Explainable and Automated Root Cause Analysis in Cloud Data Centers
10.3897/jucs.76608
0948-6968
https://doaj.org/article/2451a06304304fd6b58cd92b364abd4d
2021-11-01T00:00:00Z
https://lib.jucs.org/article/76608/download/pdf/
https://lib.jucs.org/article/76608/download/xml/
https://lib.jucs.org/article/76608/
https://doaj.org/toc/0948-6968
Arnak Poghosyan
Ashot Harutyunyan
Naira Grigoryan
Nicholas Kushmerick
Graz University of Technology
article
data center management
performance incidents
ano
Electronic computers. Computer science
QA75.5-76.95
EN
Journal of Universal Computer Science, Vol 27, Iss 11, Pp 1152-1173 (2021)
institution DOAJ
collection DOAJ
language EN
topic data center management
performance incidents
ano
Electronic computers. Computer science
QA75.5-76.95
description Effective root cause analysis (RCA) of performance issues in modern cloud environments remains a hard problem. Traditional RCA tracks complex issues by their signatures, known as problem incidents. Common approaches to incident discovery rely mainly on the expertise of users who define environment-specific sets of alerts and target detection of problems through their occurrence in the monitoring system. Adequate modeling of all possible problem patterns for today's extremely sophisticated data center applications is a very complex task. It may result in alert/event storms that include large numbers of non-indicative precautions. Thus, the crucial task for incident-based RCA is the reduction of redundant recommendations by prioritizing events according to importance/impact criteria or by deriving their meaningful groupings into separable situations. In this paper, we consider automation of incident discovery based on rule induction algorithms that retrieve conditions directly from monitoring datasets without consuming the system events. Rule-learning algorithms are very flexible and powerful for many regression and classification problems, with high-level explainability. Since annotated or labeled data sets are mostly unavailable in this area of technology, we discuss data self-labelling principles which allow transforming originally unsupervised learning tasks into classification problems with further application of rule induction methods to incident detection.
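The self-labelling idea in the description can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the quantile-based labelling, the single-threshold rule learner, and all names and numbers (`self_label`, `induce_rule`, the `cpu`, `mem`, and `latency` samples) are assumptions made only for illustration. Samples are labelled from a KPI in the monitoring data itself, and a one-condition rule of the form "metric > cut" is then searched for over the remaining metrics.

```python
def self_label(kpi, q=0.9):
    """Label a sample 1 when its KPI value exceeds the empirical q-quantile.

    Assumed labelling principle: the labels come from the data itself,
    with no human annotation.
    """
    ordered = sorted(kpi)
    threshold = ordered[int(q * (len(ordered) - 1))]
    return [1 if v > threshold else 0 for v in kpi], threshold


def induce_rule(metrics, labels):
    """Find the single rule 'metric > cut' with the highest accuracy.

    A deliberately simple stand-in for a rule induction algorithm.
    Candidate cuts are midpoints between neighbouring distinct values.
    """
    best = None
    for name, values in metrics.items():
        distinct = sorted(set(values))
        for lo, hi in zip(distinct, distinct[1:]):
            cut = (lo + hi) / 2
            hits = sum((v > cut) == bool(y) for v, y in zip(values, labels))
            acc = hits / len(labels)
            if best is None or acc > best[2]:
                best = (name, cut, acc)
    return best


# Hypothetical monitoring samples (made-up numbers).
latency = [100, 110, 105, 400, 420, 450, 108, 410]  # KPI used for labelling
metrics = {
    "cpu": [20, 25, 30, 85, 90, 95, 22, 88],
    "mem": [50, 55, 52, 51, 54, 53, 56, 50],
}

labels, kpi_threshold = self_label(latency, q=0.5)
rule = induce_rule(metrics, labels)
print(f"incident if {rule[0]} > {rule[1]} (accuracy {rule[2]:.2f})")
```

The learned condition stays human-readable, which is the explainability property the description attributes to rule learners; a practical system would use a proper rule induction method and validate rules on held-out data.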
format article
author Arnak Poghosyan
Ashot Harutyunyan
Naira Grigoryan
Nicholas Kushmerick
title Incident Management for Explainable and Automated Root Cause Analysis in Cloud Data Centers    
publisher Graz University of Technology
publishDate 2021
url https://doaj.org/article/2451a06304304fd6b58cd92b364abd4d