High-efficient analysis system for massive data of alpine grassland based on Hive

Addressing alpine grassland degradation requires a comprehensive evaluation of its current state, which in turn requires supporting data. This paper designs and implements a Hive-based high-performance system for analyzing massive data of alp...

Bibliographic Details
Main Authors:	LI Liangdan, YE Sha, XIE Xia, HU Yueming, XIE Jianwen, ZHOU Wu, YOU Xiaomin
Format:	article
Language:	ZH
Published:	Agro-Environmental Protection Institute, Ministry of Agriculture, 2021
Subjects:	alpine grassland; massive data; storage and analysis; hive; Agriculture (General) S1-972; Environmental sciences GE1-350
Online Access:	https://doaj.org/article/079e18fe6ee54b3f94c013e303782b0f

id	oai:doaj.org-article:079e18fe6ee54b3f94c013e303782b0f
record_format	dspace
spelling	oai:doaj.org-article:079e18fe6ee54b3f94c013e303782b0f; 2021-12-03T02:29:43Z; ISSN 2095-6819; DOI 10.13254/j.jare.2021.0530; 2021-11-01T00:00:00Z; http://www.aed.org.cn/nyzyyhjxb/html/2021/6/20210622.htm; https://doaj.org/toc/2095-6819; Journal of Agricultural Resources and Environment, Vol 38, Iss 6, Pp 1152-1163 (2021)
institution	DOAJ
collection	DOAJ
language	ZH
topic	alpine grassland; massive data; storage and analysis; hive; Agriculture (General) S1-972; Environmental sciences GE1-350
description	Addressing alpine grassland degradation requires a comprehensive evaluation of its current state, which in turn requires supporting data. This paper designs and implements a Hive-based high-performance system that reliably and efficiently stores and analyzes massive alpine grassland data. First, the platform was designed on the Hadoop, Hive, and Sqoop environments and built through steps such as node and cluster configuration. Then, data ETL (Extract-Transform-Load) and data storage were completed by using the EM (Expectation-Maximization) algorithm for data filling, importing the data, and storing it in partitions. Finally, the system implemented a fuzzy query function through mixed function coding and achieved the intended effect. The results showed that as file size and overall data size increased, the system's total storage and reading times kept increasing, whereas the average running time (the average time to process 1 MB of data) decreased, reflecting the system's strong ability to process large amounts of data in parallel as the data volume grows. Using alpine grassland quadrat monitoring data and some virtual data from counties in Qinghai Province in 2014, with a total data volume of approximately 39.58 million (7.56 GB), the data query efficiency of the Hive cluster was compared with that of the relational database SQL Server. When the query data volume was 39.58 million, the Hive cluster's query time was 67.8% of SQL Server's. As the data volume increased, the system's query efficiency exceeded that of SQL Server. The ecological data of alpine grassland were analyzed and processed through HiveQL, and a corresponding control experiment was carried out. The comparison found that the Hive data analysis produced the same results as the control experiment. In summary, applying distributed data warehouse technology to the storage and analysis of massive alpine grassland data is a significant improvement over traditional data storage and analysis technologies. The system processes massive data efficiently and has strong potential for further development, so it can meet the storage and analysis requirements of massive alpine grassland data.
format	article
author	LI Liangdan; YE Sha; XIE Xia; HU Yueming; XIE Jianwen; ZHOU Wu; YOU Xiaomin
author_sort	LI Liangdan
title	High-efficient analysis system for massive data of alpine grassland based on Hive
publisher	Agro-Environmental Protection Institute, Ministry of Agriculture
publishDate	2021
url	https://doaj.org/article/079e18fe6ee54b3f94c013e303782b0f
_version_	1718373912006361088
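
The abstract in this record reports two simple metrics: the average running time (time to process 1 MB of data) and the Hive query time expressed as a fraction of the SQL Server query time. The following is a minimal Python sketch of how such metrics could be computed. It is not the authors' code. The function names, the fixed-overhead cost model, and every number in it are hypothetical and chosen only for illustration. It shows one plausible reason why total time can grow while time per MB falls: a fixed job start-up cost is spread over more data.

```python
def average_time_per_mb(total_seconds: float, size_mb: float) -> float:
    """Average running time: seconds spent per 1 MB of data."""
    if size_mb <= 0:
        raise ValueError("size_mb must be positive")
    return total_seconds / size_mb


def query_time_ratio(cluster_seconds: float, baseline_seconds: float) -> float:
    """Cluster query time as a fraction of the baseline query time."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return cluster_seconds / baseline_seconds


def modeled_total_seconds(size_mb: float,
                          fixed_overhead_s: float = 30.0,
                          seconds_per_mb: float = 0.05) -> float:
    """Hypothetical cost model (an assumption, not from the paper):
    a fixed start-up cost plus a cost proportional to data size."""
    return fixed_overhead_s + seconds_per_mb * size_mb


# Hypothetical data sizes in MB; these are not measurements from the paper.
sizes_mb = [100, 1000, 10000]
totals = [modeled_total_seconds(s) for s in sizes_mb]
per_mb = [average_time_per_mb(t, s) for t, s in zip(totals, sizes_mb)]

for size, total, avg in zip(sizes_mb, totals, per_mb):
    print(f"{size:>6} MB  total={total:8.1f} s  per MB={avg:.4f} s")
```

Under this assumed model, the total time rises with data size while the per-MB average falls, which matches the qualitative trend the abstract describes. A ratio below 1 from `query_time_ratio` would mean the cluster answered faster than the baseline.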
