Towards Improving Code Stylometry Analysis in Underground Forums

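Code Stylometry has emerged as a powerful mechanism to identify programmers. While there have been significant advances in the field, existing mechanisms underperform in challenging domains. One such domain is studying the provenance of code shared in underground forums, where code posts tend to have small or incomplete source code fragments. This paper proposes a method designed to deal with the idiosyncrasies of code snippets shared in these forums. As a novelty, our system fuses a forum-specific learning pipeline with Conformal Prediction to generate predictions with precise confidence levels. We see that identifying unreliable code snippets is paramount to generating high-accuracy predictions, and this is a task where traditional learning settings fail. Overall, our method performs twice as well as the state of the art in a constrained setting with a large number of authors (i.e., 100). When dealing with a smaller number of authors (i.e., 20), it performs at high accuracy (89%). We also evaluate our work under an open-world assumption and see that our method is more effective at retaining samples.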
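
The full description in this record says the system uses Conformal Prediction to attach confidence levels to predictions and to set aside unreliable snippets. The record does not give the authors' pipeline, so the following is only a generic sketch of label-conditional inductive conformal prediction, not the paper's implementation. The function names, the author labels, and the nonconformity scores are hypothetical; in practice the scores would come from some underlying classifier.

```python
from typing import Dict, List, Optional


def p_value(calibration_scores: List[float], test_score: float) -> float:
    """Conformal p-value: share of calibration nonconformity scores
    that are at least as large as the test score (test sample included)."""
    at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (at_least + 1) / (len(calibration_scores) + 1)


def predict_or_reject(
    calibration: Dict[str, List[float]],
    test_scores: Dict[str, float],
    significance: float,
) -> Optional[str]:
    """Return an author only when exactly one label is plausible at the
    chosen significance level; otherwise return None (sample rejected)."""
    p_values = {
        author: p_value(calibration[author], score)
        for author, score in test_scores.items()
    }
    plausible = [a for a, p in p_values.items() if p > significance]
    return plausible[0] if len(plausible) == 1 else None


# Hypothetical calibration scores per candidate author (lower = more typical).
calibration = {
    "alice": [0.1, 0.2, 0.3, 0.4],
    "bob": [0.1, 0.2, 0.3, 0.4],
}

# A snippet that looks typical for "alice" and atypical for "bob".
confident = predict_or_reject(calibration, {"alice": 0.15, "bob": 0.9}, 0.25)

# A snippet that looks atypical for every candidate: no label is plausible.
unreliable = predict_or_reject(calibration, {"alice": 0.9, "bob": 0.9}, 0.25)

# A snippet that looks typical for both candidates: ambiguous, so rejected.
ambiguous = predict_or_reject(calibration, {"alice": 0.15, "bob": 0.15}, 0.25)
```

Rejecting both the "no plausible author" and the "several plausible authors" cases is one simple design choice for filtering unreliable samples; other policies, such as returning the full set of plausible authors, are equally valid under the same framework.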

Bibliographic Details
Main Authors: Tereszkowski-Kaminski Michal, Pastrana Sergio, Blasco Jorge, Suarez-Tangil Guillermo
Format: article
Language: EN
Published: Sciendo 2022
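Subjects: authorship attribution; underground forums; language selection; code clone detection; Ethics (BJ1-1725); Electronic computers. Computer science (QA75.5-76.95)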
Online Access: https://doaj.org/article/0f09d9231f5b412abab528bb2857f5c3
id oai:doaj.org-article:0f09d9231f5b412abab528bb2857f5c3
record_format dspace
spelling oai:doaj.org-article:0f09d9231f5b412abab528bb2857f5c3
2021-12-05T14:11:10Z
Towards Improving Code Stylometry Analysis in Underground Forums
2299-0984
10.2478/popets-2022-0007
https://doaj.org/article/0f09d9231f5b412abab528bb2857f5c3
2022-01-01T00:00:00Z
https://doi.org/10.2478/popets-2022-0007
https://doaj.org/toc/2299-0984
Proceedings on Privacy Enhancing Technologies, Vol 2022, Iss 1, Pp 126-147 (2022)
institution DOAJ
collection DOAJ
language EN
topic authorship attribution
underground forums
language selection
code clone detection
Ethics
BJ1-1725
Electronic computers. Computer science
QA75.5-76.95
description Code Stylometry has emerged as a powerful mechanism to identify programmers. While there have been significant advances in the field, existing mechanisms underperform in challenging domains. One such domain is studying the provenance of code shared in underground forums, where code posts tend to have small or incomplete source code fragments. This paper proposes a method designed to deal with the idiosyncrasies of code snippets shared in these forums. As a novelty, our system fuses a forum-specific learning pipeline with Conformal Prediction to generate predictions with precise confidence levels. We see that identifying unreliable code snippets is paramount to generating high-accuracy predictions, and this is a task where traditional learning settings fail. Overall, our method performs twice as well as the state of the art in a constrained setting with a large number of authors (i.e., 100). When dealing with a smaller number of authors (i.e., 20), it performs at high accuracy (89%). We also evaluate our work under an open-world assumption and see that our method is more effective at retaining samples.
format article
author Tereszkowski-Kaminski Michal
Pastrana Sergio
Blasco Jorge
Suarez-Tangil Guillermo
title Towards Improving Code Stylometry Analysis in Underground Forums
publisher Sciendo
publishDate 2022
url https://doaj.org/article/0f09d9231f5b412abab528bb2857f5c3