Getting started with GeoThinneR

Overview

GeoThinneR is an R package designed for efficient spatial thinning of point data, particularly useful for species occurrence data and other spatial analyses. It implements optimized neighbor search algorithms based on kd-trees, providing an efficient, scalable, and user-friendly approach. When working with large-scale datasets containing hundreds of thousands of points, manual thinning is infeasible, and existing R tools that provide thinning methods for species distribution modeling often lack scalability and computational efficiency.

Spatial thinning is widely used across many disciplines to reduce the density of spatial point data by selectively removing points based on specified criteria, such as threshold spacing or grid cells. This technique is applied in fields like remote sensing, where it improves data visualization and processing, as well as in geostatistics, and spatial ecological modeling. Among these, species distribution modeling is a key area where thinning plays a crucial role in mitigating the effects of spatial sampling bias.

This vignette provides a step-by-step guide to thinning with GeoThinneR using both simulated and real-world datasets.

Setup and loading data

To get started with GeoThinneR, we begin by loading the packages used in this vignette:

library(GeoThinneR)
library(terra)
library(sf)
library(ggplot2)

We’ll use two datasets to demonstrate the use of GeoThinneR:

1. A simulated dataset with n = 2000 occurrences randomly distributed within a 40 x 20 degree area for two artificial species.
2. A real-world dataset of loggerhead sea turtle (Caretta caretta) occurrences in the Mediterranean Sea, downloaded from GBIF and included with the package.

# Set seed for reproducibility
set.seed(123)

# Simulated dataset
n <- 2000
sim_data <- data.frame(
  lon = runif(n, min = -10, max = 10),
  lat = runif(n, min = -5, max = 5),
  sp = sample(c("sp1", "sp2"), n, replace = TRUE)
)

# Load real-world occurrence dataset
data("caretta")

# Load Mediterranean Sea polygon
medit <- system.file("extdata", "mediterranean_sea.gpkg", package = "GeoThinneR")
medit <- sf::st_read(medit)

Let’s visualize the datasets before thinning:
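For example, a minimal sketch with ggplot2 (using the column names shown above; the resulting maps are not reproduced here):

```r
# Simulated occurrences, colored by species
ggplot(sim_data, aes(x = lon, y = lat, color = sp)) +
  geom_point(size = 0.6) +
  theme_minimal()

# Loggerhead sea turtle occurrences over the Mediterranean Sea polygon
ggplot() +
  geom_sf(data = medit, fill = "grey95") +
  geom_point(data = caretta,
             aes(x = decimalLongitude, y = decimalLatitude),
             size = 0.6, color = "steelblue") +
  theme_minimal()
```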

Quick start guide

Now that our data is ready, let’s use GeoThinneR for spatial thinning. All methods are accessed through the thin_points() function, which requires, at a minimum, a dataset of georeferenced points (WGS84 CRS, EPSG:4326) in matrix, data.frame, or tibble format. The dataset is passed through the data argument, and the lon_col and lat_col parameters define the names of the columns containing longitude and latitude (or x, y) coordinates. The defaults are “lon” and “lat”; if these columns are not present, the first two columns are used.

The thinning method is defined using the method parameter, with the following options: "distance" (distance-based thinning), "grid" (grid-based thinning), and "precision" (precision-based thinning).

The thinning constraint (distance, grid resolution or decimal precision) depends on the method chosen (these will be detailed in the next section). Regardless of the method, the algorithm selects which points to retain and remove. Since this selection is non-deterministic, you can run multiple randomized thinning trials to maximize the number of retained points.

Let’s quickly thin the simulated dataset using a simple distance-based method:

# Apply spatial thinning to the simulated data
quick_thin <- thin_points(
  data = sim_data,     # Dataframe with coordinates
  lon_col = "lon",     # Longitude column name
  lat_col = "lat",     # Latitude column name
  method = "distance", # Method for thinning
  thin_dist = 20,      # Thinning distance in km
  trials = 5,          # Number of trials
  all_trials = TRUE,   # Return all trials
  seed = 123           # Seed for reproducibility
)

quick_thin
#> GeoThinned object
#> Method used: distance 
#> Number of trials: 5 
#> Points retained in largest trial: 1391

This command applies spatial thinning, ensuring no two retained points are closer than 20 km (thin_dist). The algorithm runs 5 random trials (trials) and returns the results of all trials (all_trials = TRUE).

Exploring the GeoThinned object

The output of thin_points() is a GeoThinned object, which contains:

- x$retained: a list of logical vectors indicating which points were retained in each trial
- x$method: the method used for thinning
- x$params: the parameters used for thinning
- x$original_data: the original dataset

You can apply basic methods for inspecting the object such as print(), summary(), and plot(). The summary() function provides detailed information about the thinning process, including the number of points retained in each trial, the nearest neighbor distance (NND), and the coverage of the retained points.

summary(quick_thin)
#> Summary of GeoThinneR Results
#> -----------------------------
#> Method used       : distance 
#> Distance metric   : haversine 
#> 
#> Number of points:
#>   Original              2000
#>   Thinned               1391
#>   Retention ratio      0.696
#> 
#> Nearest Neighbor Distances [km]
#>             Min 1st Qu. Median   Mean 3rd Qu.   Max
#> Original  0.650  11.011 17.159 18.216  24.265 61.83
#> Thinned  20.015  23.050 26.545 28.026  31.380 61.83
#> 
#> Spatial Coverage (Convex Hull Area) [km2]
#>   Original        2458025.112
#>   Thinned         2457093.024

Of the 2000 original points, 1391 (69.6%) were retained, and the minimum distance between points has increased without greatly reducing the spatial coverage or extent of the points compared to the original dataset.

plot(quick_thin)

If we check the number of points retained in each trial, we can see that the 5 trials return different numbers of points. This is due to the randomness of the algorithm when selecting which points to retain; you can set seed for reproducibility. With all_trials = FALSE, only the trial that kept the most points is returned.

# Number of kept points in each trial
sapply(quick_thin$retained, sum)
#> [1] 1387 1385 1388 1391 1387
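We can also verify the distance constraint directly. This sketch recomputes pairwise Haversine distances with fields::rdist.earth() (the same function the brute-force search uses internally) and checks that no two retained points are closer than 20 km:

```r
# Minimum pairwise distance among retained points (in km)
thinned <- largest(quick_thin)
d <- fields::rdist.earth(as.matrix(thinned[, c("lon", "lat")]), miles = FALSE)
diag(d) <- Inf   # ignore self-distances
min(d)           # should be >= 20
```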

You can access the thinned dataset that retained the maximum number of points using the largest() function.

head(largest(quick_thin))
#>          lon        lat  sp
#> 1 -4.2484496 -3.4032602 sp2
#> 2  5.7661027 -3.5548415 sp2
#> 3 -1.8204616 -3.5081961 sp2
#> 6 -9.0888700  1.1634277 sp1
#> 7  0.5621098 -0.5257711 sp1
#> 8  7.8483809 -4.4432328 sp1

If you want to access points from a specific trial, you can use the get_trial() function, which returns the thinned dataset for that trial.

head(get_trial(quick_thin, trial = 2)) 
#>          lon        lat  sp
#> 1 -4.2484496 -3.4032602 sp2
#> 2  5.7661027 -3.5548415 sp2
#> 3 -1.8204616 -3.5081961 sp2
#> 6 -9.0888700  1.1634277 sp1
#> 7  0.5621098 -0.5257711 sp1
#> 8  7.8483809 -4.4432328 sp1

If you want to retrieve the points that were removed in a given trial, negate the corresponding logical vector in the retained list and use it to index the original data.

trial <- 2
head(quick_thin$original_data[!quick_thin$retained[[trial]], ])
#>         lon         lat  sp
#> 4  7.660348  0.14434260 sp1
#> 5  8.809346 -0.07172694 sp1
#> 11 9.136667  3.50963224 sp2
#> 20 9.090073 -4.96464505 sp2
#> 21 7.790786  4.97424144 sp1
#> 32 8.045981  0.77635149 sp1

You can also use as_sf() to convert the thinned dataset to an sf object, which is useful for plotting and spatial analysis.

sf_thinned <- as_sf(quick_thin)
class(sf_thinned)
#> [1] "sf"         "data.frame"
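For example, the converted object can be mapped directly with ggplot2’s geom_sf() (a minimal sketch; the plot is not reproduced here):

```r
ggplot(sf_thinned) +
  geom_sf(size = 0.5) +
  theme_minimal()
```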

Real-world example

GeoThinneR includes a dataset of loggerhead sea turtle (Caretta caretta) occurrence records in the Mediterranean Sea, sourced from GBIF and filtered. We now apply a 30 km thinning distance, returning only the largest trial (the one retaining the most points, all_trials = FALSE).

# Apply spatial thinning to the real data
caretta_thin <- thin_points(
  data = caretta,
  lat_col = "decimalLatitude",   # Column names differ from the defaults,
  lon_col = "decimalLongitude",  # so we specify them explicitly
  method = "distance",
  thin_dist = 30, # Thinning distance in km
  trials = 5,
  all_trials = FALSE,
  seed = 123
)

# With all_trials = FALSE, only the best trial is stored, so largest()
# and get_trial(., 1) return the same thinned dataframe
identical(largest(caretta_thin), get_trial(caretta_thin, 1))
#> [1] TRUE

As you can see, GeoThinneR makes it very easy to apply spatial thinning to your dataset. The next section explains the different thinning methods, how to use them, and the optimizations implemented for large-scale datasets.

Thinning methods

GeoThinneR offers different algorithms for spatial thinning. In this section, we will show how to use each thinning method, highlighting its unique parameters and features.

Distance-based thinning

Distance-based thinning (method="distance") ensures that all retained points are separated by at least a user-defined minimum distance thin_dist (in km). For geographic coordinates, distance = "haversine" (the default) computes great-circle distances; alternatively, depending on the search method, Euclidean distances between 3D Cartesian coordinates are used to identify neighboring points within the distance threshold. The algorithm then iterates through the dataset and removes points that are too close to each other.

In general, the choice of search method depends on the size of the dataset and the desired thinning distance: brute force suits small datasets, kd-trees scale to medium-sized datasets, and local kd-trees are designed for very large datasets.

You can find the benchmarking results for each method in the package manuscript (see references).

dist_thin <- thin_points(sim_data, method = "distance", thin_dist = 25, trials = 3)
plot(dist_thin)

Brute force

This greedy method (search_type="brute") directly computes all pairwise distances and retains points that are far enough apart based on the thinning distance. By default, it computes Haversine distances using the rdist.earth() function from the fields package; with distance = "euclidean", it computes Euclidean distances instead. Its advantage is that it works with the full distance matrix between your points; its drawback is that computing all pairwise comparisons is very time- and memory-consuming.

brute_thin <- thin_points(
  data = sim_data,
  method = "distance",
  search_type = "brute",
  thin_dist = 20,
  trials = 10,
  all_trials = FALSE,
  seed = 123
)
nrow(largest(brute_thin))
#> [1] 1391

Traditional kd-tree

A kd-tree is a binary tree that partitions space into regions to optimize nearest neighbor searches. The search_type="kd_tree" method uses the nabor R package to implement kd-trees via the libnabo library. This method is more memory-efficient than brute force. Since this implementation of kd-trees is based on Euclidean distance, when working with geographic coordinates (distance = "haversine"), GeoThinneR transforms the coordinates into XYZ Cartesian coordinates. Although this method scales better than brute force, its performance can still be improved for very large datasets, since we must identify all neighboring points of each query point.

kd_tree_thin <- thin_points(
  data = sim_data,
  method = "distance",
  search_type = "kd_tree",
  thin_dist = 20,
  trials = 10,
  all_trials = FALSE,
  seed = 123
)
nrow(largest(kd_tree_thin))
#> [1] 1391

Local kd-trees

To reduce the complexity of searching within a single huge kd-tree containing all points from the dataset, we propose a simple modification: creating small local kd-trees by dividing the spatial domain into multiple smaller regions, each with its own tree (search_type="local_kd_tree"). This notably reduces memory usage, allowing this thinning method to be used for very large datasets on personal computers. Additionally, you can run this method in parallel using the n_cores parameter to improve speed.

local_kd_tree_thin <- thin_points(
  data = sim_data,
  method = "distance",
  search_type = "local_kd_tree",
  thin_dist = 20,
  trials = 10,
  all_trials = FALSE,
  seed = 123
)
nrow(largest(local_kd_tree_thin))
#> [1] 1391

Grid-based thinning

Grid sampling is a standard method where the area is divided into a grid, and n points are sampled from each grid cell. This method is very fast and memory-efficient. The only drawback is that you cannot ensure a minimum distance between points. There are two main ways to apply grid sampling in GeoThinneR: (i) define the characteristics of the grid, or (ii) pass your own grid as a raster (terra::SpatRaster).

For the first method, you can use the thin_dist parameter to define the grid cell size (the distance in km is approximated in degrees, one degree being roughly 111.32 km at the equator), or you can pass the resolution of the grid (e.g., resolution = 0.25 for 0.25x0.25-degree cells). If you want to align the grid with external data or covariate layers, you can pass the origin argument as a vector of two values (e.g., c(0, 0)). Similarly, you can specify the coordinate reference system (CRS) of your grid (crs).

system.time(
grid_thin <- thin_points(
  data = sim_data,
  method = "grid",
  resolution = 1,
  origin = c(0, 0),
  crs = "epsg:4326",
  n = 1, # one point per grid cell
  trials = 10,
  all_trials = FALSE,
  seed = 123
))
#>    user  system elapsed 
#>    0.03    0.00    0.06
nrow(largest(grid_thin))
#> [1] 200

Alternatively, you can pass a terra::SpatRaster object, and that grid will be used for the thinning process.

rast_obj <- terra::rast(xmin = -10, xmax = 10, ymin = -5, ymax = 5, res = 1)
system.time(
grid_raster_thin <- thin_points(
  data = sim_data,
  method = "grid", 
  raster_obj = rast_obj,
  trials = 10,
  n = 1,
  all_trials = FALSE,
  seed = 123
))
#>    user  system elapsed 
#>    0.02    0.00    0.00
nrow(largest(grid_raster_thin))
#> [1] 200

Precision thinning

In this approach, coordinates are rounded to a certain precision to remove points that fall too close together. After removing points based on coordinate precision, the coordinate values are restored to their original locations. This is the simplest method and acts more as a filtering step when working with data from different sources with varying coordinate precision. To use it, define the precision parameter, which indicates the number of decimals to which the coordinates should be rounded.

system.time(
precision_thin <- thin_points(
  data = sim_data,
  method = "precision", 
  precision = 0,
  trials = 50,
  all_trials = FALSE,
  seed = 123
))
#>    user  system elapsed 
#>       0       0       0
nrow(largest(precision_thin))
#> [1] 230

These are the methods implemented in GeoThinneR. Depending on your specific dataset and research needs, one method may be more suitable than others.

Additional customization features

Spatial thinning by group

In some cases, your dataset may include different groups, such as species, time periods, areas, or conditions, that you want to thin independently. The group_col parameter allows you to specify the column containing the grouping factor, and the thinning will be performed separately for each group. For example, in the simulated data where we have two species, we can use this parameter to thin each species independently:

all_thin <- thin_points(
  data = sim_data,
  thin_dist = 100,
  seed = 123
)
grouped_thin <- thin_points(
  data = sim_data,
  group_col = "sp",
  thin_dist = 100,
  seed = 123
)

nrow(largest(all_thin))
#> [1] 167
nrow(largest(grouped_thin))
#> [1] 321

Fixed number of points

What if you need to retain a fixed number of points that best covers the area where your data points are located, or for class balancing? The target_points parameter allows you to specify the number of points to keep, and the function will return that number of points spaced as far apart as possible. Additionally, you can set a thin_dist parameter so that no points closer than this distance are retained. Currently, this approach is only implemented using the brute force method (method = "distance" and search_type = "brute"), so be cautious when applying it to very large datasets.

targeted_thin <- thin_points(
  data = caretta,
  lon_col = "decimalLongitude",
  lat_col = "decimalLatitude",
  target_points = 150,
  thin_dist = 30,
  all_trials = FALSE,
  seed = 123,
  verbose = TRUE
)
#> Starting spatial thinning at 2025-04-23 22:41:16 
#> Thinning using method: distance
#> For specific target points, brute force method is used.
#> Thinning process completed.
#> Total execution time: 3.57 seconds
nrow(largest(targeted_thin))
#> [1] 150

Select points by priority

In some scenarios, you may want to prioritize certain points based on a specific criterion, such as uncertainty, data quality, or any other variable. The priority parameter allows you to pass a vector representing the priority of each point. Currently, this feature can be used with the "grid" and "precision" methods.

For example, in the sea turtle data downloaded from GBIF, there is a column named coordinateUncertaintyInMeters. We can use this to prioritize points with lower uncertainty within each grid cell (in the grid method) or when rounding coordinates (in the precision method). Keep in mind that larger uncertainty values should mean lower priority, so we must invert these values.

grid_thin <- thin_points(
  data = caretta,
  lon_col = "decimalLongitude",
  lat_col = "decimalLatitude",
  method = "grid",
  resolution = 2,
  seed = 123
)

# Subtract from the maximum: the highest uncertainty becomes the lowest priority and vice versa
priority <- max(caretta$coordinateUncertaintyInMeters, na.rm = TRUE) - caretta$coordinateUncertaintyInMeters
priority_thin <- thin_points(
  data = caretta,
  lon_col = "decimalLongitude",
  lat_col = "decimalLatitude",
  method = "grid",
  resolution = 2,
  priority = priority,
  seed = 123
)

mean(largest(grid_thin)$coordinateUncertaintyInMeters, na.rm = TRUE)
#> [1] 35513.54
mean(largest(priority_thin)$coordinateUncertaintyInMeters, na.rm = TRUE)
#> [1] 17260.05

Other Packages

The GeoThinneR package was inspired by the work of many others who have developed methods and packages for working with spatial data and thinning techniques. Our goal with GeoThinneR is to offer additional flexibility in method selection and to address specific needs we encountered while using other packages. Other existing tools may also be suitable for your work.

References

Session info

sessionInfo()
#> R version 4.3.3 (2024-02-29 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 11 x64 (build 26100)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=C                   LC_CTYPE=Spanish_Spain.utf8   
#> [3] LC_MONETARY=Spanish_Spain.utf8 LC_NUMERIC=C                  
#> [5] LC_TIME=Spanish_Spain.utf8    
#> 
#> time zone: Europe/Madrid
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] ggplot2_3.5.1    sf_1.0-21        terra_1.7-78     GeoThinneR_2.0.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] s2_1.1.7           sass_0.4.9         utf8_1.2.4         generics_0.1.3    
#>  [5] class_7.3-22       KernSmooth_2.23-22 digest_0.6.35      magrittr_2.0.3    
#>  [9] evaluate_0.24.0    grid_4.3.3         iterators_1.0.14   maps_3.4.2        
#> [13] fastmap_1.1.1      foreach_1.5.2      jsonlite_1.8.8     e1071_1.7-14      
#> [17] DBI_1.2.3          spam_2.10-0        fansi_1.0.6        viridisLite_0.4.2 
#> [21] scales_1.3.0       codetools_0.2-19   jquerylib_0.1.4    cli_3.6.2         
#> [25] rlang_1.1.3        units_0.8-5        munsell_0.5.1      withr_3.0.1       
#> [29] cachem_1.0.8       yaml_2.3.10        tools_4.3.3        dplyr_1.1.4       
#> [33] colorspace_2.1-1   vctrs_0.6.5        R6_2.5.1           matrixStats_1.3.0 
#> [37] nabor_0.5.0        proxy_0.4-27       lifecycle_1.0.4    classInt_0.4-10   
#> [41] pkgconfig_2.0.3    pillar_1.9.0       bslib_0.8.0        gtable_0.3.5      
#> [45] glue_1.7.0         data.table_1.15.4  Rcpp_1.0.12        fields_15.2       
#> [49] highr_0.11         xfun_0.51          tibble_3.2.1       tidyselect_1.2.1  
#> [53] rstudioapi_0.16.0  knitr_1.48         farver_2.1.2       htmltools_0.5.8.1 
#> [57] rmarkdown_2.28     labeling_0.4.3     wk_0.9.2           dotCall64_1.1-1   
#> [61] compiler_4.3.3