Efficient Group <i>K</i> Nearest-Neighbor Spatial Query Processing in Apache Spark

Designing and implementing new distributed spatial query algorithms is a current challenge for spatial query processing in distributed computing systems. Apache Spark is a memory-based framework suitable for real-time and batch processing. Spark-based systems allow users...

Bibliographic Details
Main Authors: Panagiotis Moutafis, George Mavrommatis, Michael Vassilakopoulos, Antonio Corral
Format: article
Language: EN
Published: MDPI AG 2021
Subjects: big spatial data; spatial query processing; group nearest-neighbor query; Apache Spark; spatial query evaluation; Geography (General)
Online Access: https://doaj.org/article/828e8861446c43acbd59639ca6eeb6b4
id oai:doaj.org-article:828e8861446c43acbd59639ca6eeb6b4
record_format dspace
spelling oai:doaj.org-article:828e8861446c43acbd59639ca6eeb6b4; 2021-11-25T17:53:04Z; DOI 10.3390/ijgi10110763; ISSN 2220-9964; https://doaj.org/article/828e8861446c43acbd59639ca6eeb6b4; 2021-11-01T00:00:00Z; https://www.mdpi.com/2220-9964/10/11/763; https://doaj.org/toc/2220-9964; ISPRS International Journal of Geo-Information, Vol 10, Iss 763, p 763 (2021)
institution DOAJ
collection DOAJ
language EN
topic big spatial data
spatial query processing
group nearest-neighbor query
Apache Spark
spatial query evaluation
Geography (General)
G1-922
description Designing and implementing new distributed spatial query algorithms is a current challenge for spatial query processing in distributed computing systems. Apache Spark is a memory-based framework suitable for real-time and batch processing. Spark-based systems allow users to work on distributed in-memory data without worrying about the data distribution mechanism and fault tolerance. Given two datasets of points (called Query and Training), the group <i>K</i> nearest-neighbor (G<i>K</i>NN) query retrieves the <i>K</i> points of the Training dataset with the smallest sum of distances to every point of the Query dataset. This spatial query has been actively studied in centralized environments, where several performance-improving techniques and pruning heuristics have been proposed, and a distributed algorithm in Apache Hadoop was recently proposed by our team. Since Apache Hadoop generally exhibits lower performance than Spark, in this paper we present the first distributed G<i>K</i>NN query algorithm in Apache Spark and compare it against the one in Apache Hadoop. This algorithm incorporates programming features and facilities that are specific to Apache Spark, as well as performance-improving techniques applicable in Apache Spark. The results of an extensive set of experiments with real-world spatial datasets are presented, demonstrating that our Apache Spark G<i>K</i>NN solution, with its improvements, is efficient and clearly outperforms processing this query in Apache Hadoop.
format article
author Panagiotis Moutafis
George Mavrommatis
Michael Vassilakopoulos
Antonio Corral
author_sort Panagiotis Moutafis
title Efficient Group <i>K</i> Nearest-Neighbor Spatial Query Processing in Apache Spark
publisher MDPI AG
publishDate 2021
url https://doaj.org/article/828e8861446c43acbd59639ca6eeb6b4
_version_ 1718411871523962880
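
The description in this record defines the GKNN query as retrieving the K Training points with the smallest sum of distances to every Query point. The snippet below is a minimal, centralized brute-force sketch of that definition only. It is not the authors' distributed Spark algorithm and uses none of the pruning heuristics the description mentions. The function name gknn_brute_force, the use of Euclidean distance, and the sample points are assumptions made for illustration.

```python
import heapq
import math


def gknn_brute_force(query, training, k):
    """Return the k training points with the smallest sum of
    Euclidean distances to all query points (hypothetical helper)."""

    def sum_of_distances(point):
        # Total distance from one training point to every query point.
        return sum(math.dist(point, q) for q in query)

    # nsmallest returns results ordered from smallest to largest key.
    return heapq.nsmallest(k, training, key=sum_of_distances)


# Illustrative data, not taken from the paper.
query = [(0.0, 0.0), (2.0, 0.0)]
training = [(1.0, 0.0), (5.0, 5.0), (1.0, 1.0), (-3.0, 0.0)]

result = gknn_brute_force(query, training, 2)
print(result)
```

This sketch evaluates every Training point against every Query point, so its cost grows with the product of the two dataset sizes; that cost is the motivation for the pruning and distribution techniques the description refers to.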