Using genderize.io to infer the gender of first names: how to improve the accuracy of the inference

Objective: We recently showed that genderize.io is not a sufficiently powerful gender detection tool due to a large number of nonclassifications. In the present study, we aimed to assess whether the accuracy of inference by genderize.io can be improved by manipulating the first names in the database...

Bibliographic Details
Main author: Paul Sebo
Format: article
Language: EN
Published: University Library System, University of Pittsburgh 2021
Subjects:
Z
R
Online access: https://doaj.org/article/4e0663cfa18448c28fb87b6ae0702cc7
id oai:doaj.org-article:4e0663cfa18448c28fb87b6ae0702cc7
record_format dspace
spelling oai:doaj.org-article:4e0663cfa18448c28fb87b6ae0702cc7
2021-11-22T20:41:00Z
Using genderize.io to infer the gender of first names: how to improve the accuracy of the inference
1536-5050
1558-9439
10.5195/jmla.2021.1252
https://doaj.org/article/4e0663cfa18448c28fb87b6ae0702cc7
2021-11-01T00:00:00Z
https://jmla.pitt.edu/ojs/jmla/article/view/1252
https://doaj.org/toc/1536-5050
https://doaj.org/toc/1558-9439
Paul Sebo
University Library System, University of Pittsburgh
article
accuracy
gender determination
genderize.io
misclassification
name
name-to-gender
performance
Bibliography. Library science. Information resources
Z
Medicine
R
EN
Journal of the Medical Library Association, Vol 109, Iss 4 (2021)
institution DOAJ
collection DOAJ
language EN
topic accuracy
gender determination
genderize.io
misclassification
name
name-to-gender
performance
Bibliography. Library science. Information resources
Z
Medicine
R
description Objective: We recently showed that genderize.io is not a sufficiently powerful gender detection tool due to a large number of nonclassifications. In the present study, we aimed to assess whether the accuracy of inference by genderize.io can be improved by manipulating the first names in the database.
Methods: We used a database containing the first names, surnames, and gender of 6,131 physicians practicing in a multicultural country (Switzerland). We uploaded the original CSV file (file #1), the file obtained after removing all diacritic marks, such as accents and cedilla (file #2), and the file obtained after removing all diacritic marks and retaining only the first term of the compound first names (file #3). For each file, we computed three performance metrics: proportion of misclassifications (errorCodedWithoutNA), proportion of nonclassifications (naCoded), and proportion of misclassifications and nonclassifications (errorCoded).
Results: naCoded, which was high for file #1 (16.4%), was reduced after data manipulation (file #2: 11.7%, file #3: 0.4%). As the increase in the number of misclassifications was small, the overall performance of genderize.io (i.e., errorCoded) improved, especially for file #3 (file #1: 17.7%, file #2: 13.0%, and file #3: 2.3%).
Conclusions: A relatively simple manipulation of the data improved the accuracy of gender inference by genderize.io. We recommend using genderize.io only with files that were modified in this way.
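The data manipulation and the three metrics described in the abstract can be illustrated with a minimal Python sketch. It is not the author's code. The function names (`strip_diacritics`, `first_term`, `performance`) are hypothetical, and two points are assumptions that the abstract does not state: that the terms of a compound first name are separated by hyphens or spaces, and that errorCodedWithoutNA is computed over classified names only while naCoded and errorCoded use all names as the denominator.

```python
import unicodedata


def strip_diacritics(name: str) -> str:
    """Remove combining marks (accents, cedillas) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def first_term(name: str) -> str:
    """Keep only the first term of a compound first name.

    Assumption: terms are separated by hyphens or whitespace.
    """
    parts = name.replace("-", " ").split()
    return parts[0] if parts else ""


def performance(true_genders, predicted):
    """Return (errorCodedWithoutNA, naCoded, errorCoded) as proportions.

    `predicted` holds "male", "female", or None for a nonclassification.
    Assumed denominators: classified names for errorCodedWithoutNA,
    all names for naCoded and errorCoded.
    """
    total = len(true_genders)
    if total == 0 or total != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    na = sum(1 for p in predicted if p is None)
    wrong = sum(
        1 for t, p in zip(true_genders, predicted) if p is not None and p != t
    )
    classified = total - na
    error_without_na = wrong / classified if classified else 0.0
    return error_without_na, na / total, (wrong + na) / total


# File #2 style cleaning, then file #3 style cleaning
print(strip_diacritics("François"))                 # Francois
print(first_term(strip_diacritics("Jean-Pierre")))  # Jean
```

Letters that have no Unicode decomposition (for example "ø") pass through `strip_diacritics` unchanged, so a real pipeline might need an extra mapping for them.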
format article
author Paul Sebo
title Using genderize.io to infer the gender of first names: how to improve the accuracy of the inference
publisher University Library System, University of Pittsburgh
publishDate 2021
url https://doaj.org/article/4e0663cfa18448c28fb87b6ae0702cc7