Incident Management for Explainable and Automated Root Cause Analysis in Cloud Data Centers    
Effective root cause analysis (RCA) of performance issues in modern cloud environments remains a hard problem. Traditional RCA tracks complex issues by their signatures, known as problem incidents. Common approaches to incident discovery rely mainly on the expertise of users who define environment-spec...
Main authors: | Arnak Poghosyan, Ashot Harutyunyan, Naira Grigoryan, Nicholas Kushmerick |
---|---|
Format: | article |
Language: | EN |
Published: | Graz University of Technology, 2021 |
Subjects: | |
Online access: | https://doaj.org/article/2451a06304304fd6b58cd92b364abd4d |
id |
oai:doaj.org-article:2451a06304304fd6b58cd92b364abd4d |
---|---|
record_format |
dspace |
spelling |
oai:doaj.org-article:2451a06304304fd6b58cd92b364abd4d; 2021-11-30T04:30:12Z; DOI: 10.3897/jucs.76608; ISSN: 0948-6968; https://doaj.org/article/2451a06304304fd6b58cd92b364abd4d; 2021-11-01T00:00:00Z; https://lib.jucs.org/article/76608/download/pdf/; https://lib.jucs.org/article/76608/download/xml/; https://lib.jucs.org/article/76608/; https://doaj.org/toc/0948-6968; Arnak Poghosyan; Ashot Harutyunyan; Naira Grigoryan; Nicholas Kushmerick; Graz University of Technology; article; data center management; performance incidents; ano; Electronic computers. Computer science; QA75.5-76.95; EN; Journal of Universal Computer Science, Vol 27, Iss 11, Pp 1152-1173 (2021) |
institution |
DOAJ |
collection |
DOAJ |
language |
EN |
topic |
data center management; performance incidents; ano; Electronic computers. Computer science; QA75.5-76.95 |
description |
Effective root cause analysis (RCA) of performance issues in modern cloud environments remains a hard problem. Traditional RCA tracks complex issues by their signatures, known as problem incidents. Common approaches to incident discovery rely mainly on the expertise of users who define environment-specific sets of alerts and target the detection of problems through their occurrence in the monitoring system. Adequately modeling all possible problem patterns for today's extremely sophisticated data center applications is a very complex task. It may result in alert/event storms that include large numbers of non-indicative precautions. Thus, the crucial task for incident-based RCA is the reduction of redundant recommendations, either by prioritizing events according to importance/impact criteria or by grouping them meaningfully into separable situations. In this paper, we consider automation of incident discovery based on rule induction algorithms that retrieve conditions directly from monitoring datasets without consuming the system events. Rule-learning algorithms are very flexible and powerful for many regression and classification problems, and they offer high-level explainability. Since annotated or labeled data sets are mostly unavailable in this area of technology, we discuss data self-labelling principles that allow transforming originally unsupervised learning tasks into classification problems, with further application of rule induction methods to incident detection. |
format |
article |
author |
Arnak Poghosyan; Ashot Harutyunyan; Naira Grigoryan; Nicholas Kushmerick |
author_sort |
Arnak Poghosyan |
title |
Incident Management for Explainable and Automated Root Cause Analysis in Cloud Data Centers     |
publisher |
Graz University of Technology |
publishDate |
2021 |
url |
https://doaj.org/article/2451a06304304fd6b58cd92b364abd4d |
work_keys_str_mv |
AT arnakpoghosyan incidentmanagementforexplainableandautomatedrootcauseanalysisinclouddatacenters AT ashotharutyunyan incidentmanagementforexplainableandautomatedrootcauseanalysisinclouddatacenters AT nairagrigoryan incidentmanagementforexplainableandautomatedrootcauseanalysisinclouddatacenters AT nicholaskushmerick incidentmanagementforexplainableandautomatedrootcauseanalysisinclouddatacenters |
_version_ |
1718406732426772480 |