Towards Improving Code Stylometry Analysis in Underground Forums
Code Stylometry has emerged as a powerful mechanism to identify programmers. While there have been significant advances in the field, existing mechanisms underperform in challenging domains. One such domain is studying the provenance of code shared in underground forums, where code posts tend to hav...
Main authors: | Tereszkowski-Kaminski Michal, Pastrana Sergio, Blasco Jorge, Suarez-Tangil Guillermo |
---|---|
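Format: | article |
Language: | EN |
Published: | Sciendo 2022 |
Subjects: | authorship attribution; underground forums; language selection; code clone detection; Ethics (BJ1-1725); Electronic computers. Computer science (QA75.5-76.95) |
Online access: | https://doaj.org/article/0f09d9231f5b412abab528bb2857f5c3 |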
id |
oai:doaj.org-article:0f09d9231f5b412abab528bb2857f5c3 |
---|---|
record_format |
dspace |
spelling |
oai:doaj.org-article:0f09d9231f5b412abab528bb2857f5c3; 2021-12-05T14:11:10Z; Towards Improving Code Stylometry Analysis in Underground Forums; 2299-0984; 10.2478/popets-2022-0007; https://doaj.org/article/0f09d9231f5b412abab528bb2857f5c3; 2022-01-01T00:00:00Z; https://doi.org/10.2478/popets-2022-0007; https://doaj.org/toc/2299-0984; Code Stylometry has emerged as a powerful mechanism to identify programmers. While there have been significant advances in the field, existing mechanisms underperform in challenging domains. One such domain is studying the provenance of code shared in underground forums, where code posts tend to have small or incomplete source code fragments. This paper proposes a method designed to deal with the idiosyncrasies of code snippets shared in these forums. As a novelty, our system fuses a forum-specific learning pipeline with Conformal Prediction to generate predictions with precise confidence levels. We see that identifying unreliable code snippets is paramount to generating high-accuracy predictions, and this is a task where traditional learning settings fail. Overall, our method performs twice as well as the state of the art in a constrained setting with a large number of authors (i.e., 100). When dealing with a smaller number of authors (i.e., 20), it achieves high accuracy (89%). We also evaluate our work under an open-world assumption and see that our method is more effective at retaining samples.; Tereszkowski-Kaminski Michal; Pastrana Sergio; Blasco Jorge; Suarez-Tangil Guillermo; Sciendo; article; authorship attribution; underground forums; language selection; code clone detection; Ethics; BJ1-1725; Electronic computers. Computer science; QA75.5-76.95; EN; Proceedings on Privacy Enhancing Technologies, Vol 2022, Iss 1, Pp 126-147 (2022) |
institution |
DOAJ |
collection |
DOAJ |
language |
EN |
topic |
authorship attribution; underground forums; language selection; code clone detection; Ethics (BJ1-1725); Electronic computers. Computer science (QA75.5-76.95) |
description |
Code Stylometry has emerged as a powerful mechanism to identify programmers. While there have been significant advances in the field, existing mechanisms underperform in challenging domains. One such domain is studying the provenance of code shared in underground forums, where code posts tend to have small or incomplete source code fragments. This paper proposes a method designed to deal with the idiosyncrasies of code snippets shared in these forums. As a novelty, our system fuses a forum-specific learning pipeline with Conformal Prediction to generate predictions with precise confidence levels. We see that identifying unreliable code snippets is paramount to generating high-accuracy predictions, and this is a task where traditional learning settings fail. Overall, our method performs twice as well as the state of the art in a constrained setting with a large number of authors (i.e., 100). When dealing with a smaller number of authors (i.e., 20), it achieves high accuracy (89%). We also evaluate our work under an open-world assumption and see that our method is more effective at retaining samples. |
format |
article |
author |
Tereszkowski-Kaminski Michal; Pastrana Sergio; Blasco Jorge; Suarez-Tangil Guillermo |
author_sort |
Tereszkowski-Kaminski Michal |
title |
Towards Improving Code Stylometry Analysis in Underground Forums |
publisher |
Sciendo |
publishDate |
2022 |
url |
https://doaj.org/article/0f09d9231f5b412abab528bb2857f5c3 |
work_keys_str_mv |
AT tereszkowskikaminskimichal towardsimprovingcodestylometryanalysisinundergroundforums AT pastranasergio towardsimprovingcodestylometryanalysisinundergroundforums AT blascojorge towardsimprovingcodestylometryanalysisinundergroundforums AT suareztangilguillermo towardsimprovingcodestylometryanalysisinundergroundforums |
_version_ |
1718371280902684672 |
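
The description above says the system uses Conformal Prediction to attach confidence levels to predictions and to identify unreliable code snippets. This record does not describe the authors' pipeline, so the following is only a minimal sketch of generic split conformal prediction with rejection, assuming some classifier already outputs per-author probabilities. All names (`conformal_threshold`, `prediction_set`, `attribute`), the nonconformity score `1 - probability`, and the sample numbers are illustrative assumptions, not taken from the paper.

```python
from math import ceil


def conformal_threshold(cal_scores, alpha):
    """Nonconformity threshold for a target miscoverage level alpha.

    cal_scores: nonconformity scores of held-out calibration snippets,
    here assumed to be 1 - (probability given to the true author).
    """
    n = len(cal_scores)
    k = ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        # Too few calibration samples for this alpha: accept every label.
        return float("inf")
    return sorted(cal_scores)[k - 1]


def prediction_set(probs, threshold):
    """All candidate authors whose nonconformity score is within the threshold."""
    return {author for author, p in probs.items() if 1.0 - p <= threshold}


def attribute(probs, threshold):
    """Return one author when the prediction set is a singleton, else None.

    Returning None marks the snippet as unreliable (empty or ambiguous set).
    """
    candidates = prediction_set(probs, threshold)
    return next(iter(candidates)) if len(candidates) == 1 else None


# Hypothetical calibration scores and classifier outputs, for illustration only.
cal_scores = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.35, 0.40, 0.60, 0.90]
threshold = conformal_threshold(cal_scores, alpha=0.2)

confident = {"alice": 0.85, "bob": 0.10, "carol": 0.05}
ambiguous = {"alice": 0.45, "bob": 0.45, "carol": 0.10}

print(attribute(confident, threshold))
print(attribute(ambiguous, threshold))
```

In this sketch a snippet is attributed only when exactly one author survives the threshold; ambiguous or empty prediction sets are rejected, which is one simple way to trade retained samples against accuracy.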