High-efficient analysis system for massive data of alpine grassland based on Hive

Addressing alpine grassland degradation requires a comprehensive evaluation of its current state, and that evaluation requires supporting data. This paper designs and implements a Hive-based high-performance system for analyzing massive data of alp...

Bibliographic Details
Main authors: LI Liangdan, YE Sha, XIE Xia, HU Yueming, XIE Jianwen, ZHOU Wu, YOU Xiaomin
Format: article
Language: ZH
Published: Agro-Environmental Protection Institute, Ministry of Agriculture 2021
Subjects:
Online access: https://doaj.org/article/079e18fe6ee54b3f94c013e303782b0f
id oai:doaj.org-article:079e18fe6ee54b3f94c013e303782b0f
record_format dspace
spelling oai:doaj.org-article:079e18fe6ee54b3f94c013e303782b0f
2021-12-03T02:29:43Z
High-efficient analysis system for massive data of alpine grassland based on Hive
2095-6819
10.13254/j.jare.2021.0530
https://doaj.org/article/079e18fe6ee54b3f94c013e303782b0f
2021-11-01T00:00:00Z
http://www.aed.org.cn/nyzyyhjxb/html/2021/6/20210622.htm
https://doaj.org/toc/2095-6819
Journal of Agricultural Resources and Environment, Vol 38, Iss 6, Pp 1152-1163 (2021)
institution DOAJ
collection DOAJ
language ZH
topic alpine grassland
massive data
storage and analysis
hive
Agriculture (General)
S1-972
Environmental sciences
GE1-350
description Addressing alpine grassland degradation requires a comprehensive evaluation of its current state, and that evaluation requires supporting data. This paper designs and implements a Hive-based high-performance system that reliably and efficiently stores and analyzes massive alpine grassland data. First, the platform was built on the Hadoop, Hive, and Sqoop environments and set up through steps such as node and cluster configuration. Then, data ETL (Extract-Transform-Load) and data storage were carried out: missing values were filled using the EM (Expectation-Maximization) algorithm, and the data were imported and stored in partitions. Finally, the system implemented a fuzzy query function through mixed function coding and achieved the intended effect. The results showed that as file size and overall data size increased, the system's total storage and read times increased, whereas the average running time (the average time to process 1 MB of data) decreased, reflecting the system's strong capacity to process large volumes of data in parallel. Query efficiency was compared between the Hive cluster and the relational database SQL Server using alpine grassland quadrat monitoring data and some virtual data from counties in Qinghai Province in 2014, with a total data volume of approximately 39.58 million (7.56 GB). At a query data volume of 39.58 million, the Hive cluster's query time was 67.8% of SQL Server's. As data volume increased, the system's query efficiency exceeded that of SQL Server. The alpine grassland ecological data were analyzed and processed with HiveQL, and a corresponding control experiment was carried out; the comparison found that the Hive analysis produced the same results as the control experiment. In summary, applying distributed data warehouse technology to the storage and analysis of massive alpine grassland data is a significant improvement over traditional data storage and analysis technologies. The system processes massive data efficiently, lends itself well to further development, and meets the storage and analysis requirements of massive alpine grassland data.
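The two efficiency measures named in the abstract can be written out as simple ratios. The sketch below is only illustrative: the function names are hypothetical, and the sample sizes and timings are made-up placeholders rather than measurements from the paper. It shows how total time can rise with data size while the average time per 1 MB falls, and how a relative query time such as 67.8% would be computed.

```python
def avg_time_per_mb(total_seconds: float, size_mb: float) -> float:
    """Average running time: seconds spent per 1 MB of data processed."""
    if size_mb <= 0:
        raise ValueError("size_mb must be positive")
    return total_seconds / size_mb


def relative_query_time(hive_seconds: float, sql_server_seconds: float) -> float:
    """Hive query time expressed as a fraction of the SQL Server query time."""
    if sql_server_seconds <= 0:
        raise ValueError("sql_server_seconds must be positive")
    return hive_seconds / sql_server_seconds


# Hypothetical (size in MB, total seconds) pairs, NOT values from the paper.
runs = [(100.0, 40.0), (1000.0, 150.0), (7000.0, 700.0)]

# Total time grows with size, yet the per-MB cost shrinks.
per_mb = [avg_time_per_mb(seconds, size) for size, seconds in runs]

# Hypothetical timings chosen so that the ratio matches the reported 67.8%.
ratio = relative_query_time(67.8, 100.0)
```

A ratio below 1.0 means the Hive query finished in less time than the SQL Server query on the same data volume.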
format article
author LI Liangdan
YE Sha
XIE Xia
HU Yueming
XIE Jianwen
ZHOU Wu
YOU Xiaomin
title High-efficient analysis system for massive data of alpine grassland based on Hive
publisher Agro-Environmental Protection Institute, Ministry of Agriculture
publishDate 2021
url https://doaj.org/article/079e18fe6ee54b3f94c013e303782b0f