Hierarchical Contaminated Web Page Classification Based on Meta Tag Denoising Disposal

Web page classification is critical for information retrieval. Most web page classification methods have the following two faults: (1) need to analyze based on the overall web page and (2) do not pay enough attention to the existence of noise information inside the web page, which will thus decrease...

Descripción completa

Guardado en:
Detalles Bibliográficos
Autores principales: Xiang Song, Yi Zhu, Xuemei Zeng, Xingshu Chen
Formato: article
Lenguaje:EN
Publicado: Hindawi-Wiley 2021
Materias:
Acceso en línea:https://doaj.org/article/a3a5ecc31c5942758d466230f88da098
Etiquetas: Agregar Etiqueta
Sin Etiquetas, Sea el primero en etiquetar este registro!
Descripción
Sumario:Web page classification is critical for information retrieval. Most web page classification methods have the following two faults: (1) need to analyze based on the overall web page and (2) do not pay enough attention to the existence of noise information inside the web page, which will thus decrease the efficiency and classification performance, especially when classifying the contaminated web page. To solve these problems, this paper proposes a denoising disposal algorithm. We choose the top-down method for hierarchical classification to improve the prediction efficiency. The experimental results demonstrate that our method is about 7 times faster than the full-page method and achieves good classification results in most categories. The precision of 7 parent categories is all above 88% and is 24% higher than the other meta tag-based method on average.