`stats::hclust`

with `hclust1d`

The purpose of this vignette is to provide guidelines on replacing
`stats::hclust`

calls with `hclust1d`

calls for
univariate (1D) data, in a plug-and-play manner, i.e. without changing
any surrounding code (or little of the surrounding code, as an option to
the programmer).

To enable use of `hclust1d`

you need to include this line
in your script or markdown notebook:

`library(hclust1d)`

In case of packages you need to import `hclust1d`

in your
`DESCRIPTION`

file.

The main reason a programmer might want to replace
`stats:hclust`

calls with `hclust1d`

in case of
clustering 1D points is that the computational complexity of
`hclust1d`

is \(\mathcal{O}(n\log
n)\), while it is at least quadratic (and \(\mathcal{O}(n^2\log n)\) for case of
general linkage) for multidimensional algorithms.

The better algorithmic time complexity (compared to multidimensional
hierarchical clustering) paired with its efficient `C++`

implementation make `hclust1d`

very fast. The computational
time beats `stats::hclust`

on all sizes of data and is *en
par* with `fastcluster::hclust`

with small data sizes.
However, it is of orders of magnitude faster than both multivariate
clustering routines on larger data sizes.

`hclust1d`

with
`stats::hclust`

Maintaining compatibility of `hclust`

with
`stats::hclust`

was high on a list of design priorities for
`hclust1d`

.

All linkage functions of

`stats::hclust`

are supported in`hclust1d`

, too.Input to

`stats::hclust`

should be`dist`

S3 class structure as produced by`stats::dist`

function and the same input is accepted in`hclust1d`

(with a`distance`

argument explicitly set to`TRUE`

).There are three atypical linkages in

`stats::hclust`

. Namely,`stats::hclust`

requires that the**squared**`dist`

structure is provided for`ward.D`

,`centroid`

and`median`

linkage functions. This is implicit. The same input is accepted in`hclust1d`

(with`distance`

and`squared`

arguments explicitly set to`TRUE`

).The object returned from

`hclust1d`

call is the same S3 class as the result of`stats::hclust`

, namely`hclust`

S3 class.The heights returned from

`hclust1d`

are calculated the same, as in`stats::hclust`

:**squared**respective distances are returned for`ward.D`

,`centroid`

and`median`

linkage functions,**unsquared**respective distances are returned for the remaining linkage functions.

`hclust1d`

The list of all linkage functions supported in `hclust1d`

is available by calling:

```
supported_methods()
#> [1] "complete" "average" "centroid" "true_median" "median"
#> [6] "mcquitty" "ward.D" "ward.D2" "single"
```

The in-depth description of the linkage functions in
`hclust1d`

, together with the inter-cluster distance metric
definition used in case of each linkage function (and returned as the
merging height) can be found in our getting
started vignette.

The choice of a linkage function is the same in `hclust1d`

as in `stats::hclust`

, i.e. by specifying a
`method`

argument and passing the name of a linkage function
into `hclust1d`

as a character string.

To provide an example, the following two calls execute
`average`

linkage hierarchical clustering on distances
computed for a set of 1D points, by passing
`method = "average"`

argument to relevant calls:

```
<- rnorm(10)
points
<- stats::hclust(stats::dist(points), method = "average")
res <- hclust1d(stats::dist(points), method = "average", distance = TRUE) res
```

`hclust1d`

The user of `stats::hclust`

and of `hclust1d`

can select a number of distance metrics when building distance-based
input with `stats::dist`

, by selecting an appropriate name of
a metric and passing it as a `method`

argument to
`stats::dist`

as a character string. Not all of them are
supported in `hclust1d`

. The list of distance metrics
supported in `hclust1d`

is available by calling:

```
supported_dist.methods()
#> [1] "euclidean" "maximum" "manhattan" "minkowski"
```

The trick here is that for 1D points `euclidean`

,
`maximum`

, `manhattan`

and `minkowski`

distances are equivalent.

To provide an example, the following two calls execute
`average`

linkage hierarchical clustering on distances
computed by `minkowski`

\(L_3\) norm for a set of 1D points, by
passing `method = "minkowski"`

and `p = 3`

arguments to relevant `stats::dist`

calls:

```
<- rnorm(10)
points
<- stats::hclust(stats::dist(points, method = "minkowski", p=3), method="average")
res <- hclust1d(stats::dist(points, method = "minkowski", p=3), method="average", distance = TRUE) res
```

We don’t support `members`

argument in
`hclust1d`

.

`stats::hclust`

in case of `ward.D`

,
`centroid`

or `median`

linkage functions**This section DOES NOT apply to ward.D2 linkage
function, despite the similarity in its name**.

For those three linkage functions (`ward.D`

,
`centroid`

or `median`

) the default hierarchical
clustering routine `stats::hclust`

requires *squared*
distance structure calculated on original points. Consequently, as can
be seen from the sections above, to replace a `stats::hclust`

call with `hclust1d`

for 1D data, one needs to replace any
call to

`<- stats::hclust(squared_d, method = linkage_function_name, members = NULL) res `

by a call to

`<- hclust1d(squared_d, method = linkage_function_name, distance = TRUE, square = TRUE) res `

Somewhere in the code above this line, `squared_d`

has
been computed by a call to `stats::dist`

from a vector of 1D
points and subsequently squaring the `stats::dist`

result.

Somewhere below in the code `res`

gets analyzed, but it is
OK, because the results of both calls are compatible.

If the programmer has access to the original

`stats::dist`

result (let’s denote this variable`d`

), the computation of`squared_d`

can be removed (provided it is not used for other purpose) and a call to`<- stats::hclust(squared_d, method = linkage_function_name, members = NULL) res`

can be replaced by a call to

`<- hclust1d(d, method = linkage_function_name, distance = TRUE) res`

If the programmer has access to the original points (let’s denote this variable

`points`

), the computation of`squared_d`

and of`d`

can be removed altogether (provided they are not used for other purpose) and a call to`<- stats::hclust(squared_d, method = linkage_function_name, members = NULL) res`

can be replaced by a call to

`<- hclust1d(points, method = linkage_function_name) res`

`stats::hclust`

in case of all other linkage
functions, besides `ward.D`

, `centroid`

or
`median`

**This section applies to, among others, ward.D2
linkage function**.

For those remaining linkage functions `stats::hclust`

accepts regular (*unsquared*) distance structure calculated on
original points. Consequently, as can be seen from the sections above,
to replace a `stats::hclust`

call with `hclust1d`

for 1D data, one needs to replace any call to

`<- stats::hclust(d, method = linkage_function_name, members = NULL) res `

by a call to

`<- hclust1d(d, method = linkage_function_name, distance = TRUE) res `

Somewhere in the code above this line, `d`

has been
computed by a call to `stats::dist`

from a vector of 1D
points.

Somewhere below in the code `res`

gets analyzed, but it is
OK, because the results of both calls are compatible.

If the programmer has access to the original points (let’s denote
this variable `points`

), the computation of `d`

can be removed (provided it is not used for other purpose) and a call
to

`<- stats::hclust(d, method = linkage_function_name, members = NULL) res `

can be replaced by a call to

`<- hclust1d(points, method = linkage_function_name) res `