High-efficient analysis system for massive data of alpine grassland based on Hive


Bibliographic Details
Main authors: LI Liangdan, YE Sha, XIE Xia, HU Yueming, XIE Jianwen, ZHOU Wu, YOU Xiaomin
Format: article
Language: ZH
Published: Agro-Environmental Protection Institute, Ministry of Agriculture 2021
Online access: https://doaj.org/article/079e18fe6ee54b3f94c013e303782b0f
Description
Summary: Addressing alpine grassland degradation requires a comprehensive evaluation of its current state, which in turn requires supporting data. This paper designs and implements a Hive-based high-performance system for analyzing massive alpine grassland data, which can store and analyze such data reliably and efficiently. First, the platform was designed on the Hadoop, Hive, and Sqoop environments and built through steps such as node and cluster configuration. Then, data ETL (Extract-Transform-Load) and data storage were completed by using the EM (Expectation-Maximization) algorithm for data filling, importing the data, and storing it in partitions. Finally, the system implemented a fuzzy query function through mixed function coding and achieved the intended effect. The results showed that as file size, and with it the overall data volume, increased, the total storage and read times of the system rose steadily, whereas the average running time (the average time to process 1 MB of data) decreased, reflecting the system's strong ability to process large amounts of data in parallel. The query efficiency of the Hive cluster and the relational database SQL Server was compared using alpine grassland quadrat monitoring data and some virtual data from counties in Qinghai Province in 2014, with a total data volume of approximately 39.58 million (7.56 GB). At a query data volume of 39.58 million, the Hive cluster's query time was 67.8% of SQL Server's. As the data volume increased, the system's query efficiency exceeded that of SQL Server. The ecological data of alpine grassland were analyzed and processed with HiveQL, and a corresponding control experiment was carried out. The comparison found that the Hive-based analysis produced the same results as the control experiment. In summary, applying distributed data warehouse technology to the storage and analysis of massive alpine grassland data is a significant improvement over traditional data storage and analysis technologies. The system processes massive data efficiently, has strong potential for further development, and meets the storage and analysis requirements of massive alpine grassland data well.
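The summary mentions EM-based data filling without giving details. The following is a minimal sketch of what such a step might look like, not the authors' implementation. It assumes that "data filling" means imputing missing values, that each record has a fully observed covariate `x` and a sometimes-missing measurement `y`, and that the pair follows a bivariate normal distribution. The function name `em_impute` and the sample values are hypothetical.

```python
def em_impute(xs, ys, iterations=50):
    """Fill None entries of ys by EM under an ASSUMED bivariate normal model.

    xs must be fully observed and not all equal; ys must have at least
    one observed value. This is an illustrative model, not the paper's.
    """
    n = len(xs)
    observed = [y for y in ys if y is not None]
    # x is fully observed, so its mean and variance never change.
    mu_x = sum(xs) / n
    s_xx = sum((x - mu_x) ** 2 for x in xs) / n
    # Start from the observed-only statistics of y and zero covariance.
    mu_y = sum(observed) / len(observed)
    s_yy = sum((y - mu_y) ** 2 for y in observed) / len(observed)
    s_xy = 0.0
    for _ in range(iterations):
        slope = s_xy / s_xx
        resid_var = s_yy - s_xy ** 2 / s_xx
        # E-step: expected y and y^2 for each record given x.
        e_y = [y if y is not None else mu_y + slope * (x - mu_x)
               for x, y in zip(xs, ys)]
        e_yy = [ey ** 2 + (resid_var if y is None else 0.0)
                for ey, y in zip(e_y, ys)]
        # M-step: re-estimate the parameters from the expected statistics.
        mu_y = sum(e_y) / n
        s_xy = sum(x * ey for x, ey in zip(xs, e_y)) / n - mu_x * mu_y
        s_yy = sum(e_yy) / n - mu_y ** 2
    slope = s_xy / s_xx
    # Observed values are returned unchanged; gaps get the conditional mean.
    return [y if y is not None else mu_y + slope * (x - mu_x)
            for x, y in zip(xs, ys)]


# Hypothetical sample: the observed points lie on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [3.0, 5.0, None, 9.0, None, 13.0]
print([round(v, 3) for v in em_impute(xs, ys)])
```

In a pipeline like the one described, a step of this kind would run before the data are imported and stored in partitions, so that later queries see complete records.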