Documentation

### This is machine translation

Translated by
Mouseover text to see original. Click the button below to return to the English version of the page.

# pdist2

Pairwise distance between two sets of observations

## Syntax

``D = pdist2(X,Y,Distance)``
``D = pdist2(X,Y,Distance,DistParameter)``
``D = pdist2(___,Name,Value)``
``[D,I] = pdist2(___,Name,Value)``

## Description

example

````D = pdist2(X,Y,Distance)` returns the distance between each pair of observations in `X` and `Y` using the metric specified by `Distance`.```

example

````D = pdist2(X,Y,Distance,DistParameter)` returns the distance using the metric specified by `Distance` and `DistParameter`. You can specify `DistParameter` only when `Distance` is `'seuclidean'`, `'minkowski'`, or `'mahalanobis'`.```
````D = pdist2(___,Name,Value)` specifies an additional option using one of the name-value pair arguments `'Smallest'` or `'Largest'` in addition to any of the arguments in the previous syntaxes. For example, `D = pdist2(X,Y,Distance,'Smallest',K)` computes the distance using the metric specified by `Distance` and returns the `K` smallest pairwise distances to observations in `X` for each observation in `Y` in ascending order.```D = pdist2(X,Y,Distance,DistParameter,'Largest',K)``` computes the distance using the metric specified by `Distance` and `DistParameter` and returns the `K` largest pairwise distances in descending order. ```

example

````[D,I] = pdist2(___,Name,Value)` also returns the matrix `I` using any of the arguments in the previous syntaxes. The matrix `I` contains the indices of the observations in `X` corresponding to the distances in `D`. ```

## Examples

collapse all

Create two matrices with three observations and two variables.

```rng('default') % For reproducibility X = rand(3,2); Y = rand(3,2);```

Compute the Euclidean distance. The default value of the input argument `Distance` is `'euclidean'`. When computing the Euclidean distance without using a name-value pair argument, you do not need to specify `Distance`.

`D = pdist2(X,Y)`
```D = 3×3 0.5387 0.8018 0.1538 0.7100 0.5951 0.3422 0.8805 0.4242 1.2050 ```

`D(i,j)` corresponds to the pairwise distance between observation `i` in `X` and observation `j` in `Y`.

Create two matrices with three observations and two variables.

```rng('default') % For reproducibility X = rand(3,2); Y = rand(3,2);```

Compute the Minkowski distance with the default exponent 2.

`D1 = pdist2(X,Y,'minkowski')`
```D1 = 3×3 0.5387 0.8018 0.1538 0.7100 0.5951 0.3422 0.8805 0.4242 1.2050 ```

Compute the Minkowski distance with an exponent of 1, which is equal to the city block distance.

`D2 = pdist2(X,Y,'minkowski',1)`
```D2 = 3×3 0.5877 1.0236 0.2000 0.9598 0.8337 0.3899 1.0189 0.4800 1.7036 ```
`D3 = pdist2(X,Y,'cityblock')`
```D3 = 3×3 0.5877 1.0236 0.2000 0.9598 0.8337 0.3899 1.0189 0.4800 1.7036 ```

Create two matrices with three observations and two variables.

```rng('default') % For reproducibility X = rand(3,2); Y = rand(3,2);```

Find the two smallest pairwise Euclidean distances to observations in `X` for each observation in `Y`.

`[D,I] = pdist2(X,Y,'euclidean','Smallest',2)`
```D = 2×3 0.5387 0.4242 0.1538 0.7100 0.5951 0.3422 ```
```I = 2×3 1 3 1 2 2 2 ```

For each observation in `Y`, `pdist2` finds the two smallest distances by computing and comparing the distance values to all the observations in `X`. The function then sorts the distances in each column of `D` in ascending order. `I` contains the indices of the observations in `X` corresponding to the distances in `D`.

Define a custom distance function that ignores coordinates with `NaN` values, and compute pairwise distance by using the custom distance function.

Create two matrices with three observations and three variables.

```rng('default') % For reproducibility X = rand(3,3) Y = [X(:,1:2) rand(3,1)] ```
```X = 0.8147 0.9134 0.2785 0.9058 0.6324 0.5469 0.1270 0.0975 0.9575 Y = 0.8147 0.9134 0.9649 0.9058 0.6324 0.1576 0.1270 0.0975 0.9706 ```

The first two columns of X and Y are identical. Assume that `X(1,1)` is missing.

```X(1,1) = NaN ```
```X = NaN 0.9134 0.2785 0.9058 0.6324 0.5469 0.1270 0.0975 0.9575 ```

Compute the Hamming distance.

```D1 = pdist2(X,Y,'hamming') ```
```D1 = NaN NaN NaN 1.0000 0.3333 1.0000 1.0000 1.0000 0.3333 ```

If observation `i` in `X` or observation `j` in `Y` contains `NaN` values, the function `pdist2` returns `NaN` for the pairwise distance between `i` and `j`. Therefore, D1(1,1), D1(1,2), and D1(1,3) are `NaN` values.

Define a custom distance function `nanhamdist` that ignores coordinates with `NaN` values and computes the Hamming distance. When working with a large number of observations, you can compute the distance more quickly by looping over coordinates of the data.

```function D2 = nanhamdist(XI,XJ) %NANHAMDIST Hamming distance ignoring coordinates with NaNs [m,p] = size(XJ); nesum = zeros(m,1); pstar = zeros(m,1); for q = 1:p notnan = ~(isnan(XI(q)) | isnan(XJ(:,q))); nesum = nesum + ((XI(q) ~= XJ(:,q)) & notnan); pstar = pstar + notnan; end D2 = nesum./pstar; ```

Compute the distance with `nanhamdist` by passing the function handle as an input argument of `pdist2`.

```D2 = pdist2(X,Y,@nanhamdist) ```
```D2 = 0.5000 1.0000 1.0000 1.0000 0.3333 1.0000 1.0000 1.0000 0.3333 ```

## Input Arguments

collapse all

Input data, specified as a numeric matrix. `X` is an mx-by-n matrix and `Y` is an my-by-n matrix. Rows correspond to individual observations, and columns correspond to individual variables.

Data Types: `single` | `double`

Distance metric, specified as a character vector, string scalar, or function handle, as described in the following table.

ValueDescription
`'euclidean'`

Euclidean distance (default).

`'squaredeuclidean'`

Squared Euclidean distance. (This option is provided for efficiency only. It does not satisfy the triangle inequality.)

`'seuclidean'`

Standardized Euclidean distance. Each coordinate difference between observations is scaled by dividing by the corresponding element of the standard deviation, ```S = nanstd(X)```. Use `DistParameter` to specify another value for `S`.

`'mahalanobis'`

Mahalanobis distance using the sample covariance of `X`, `C = nancov(X)`. Use `DistParameter` to specify another value for `C`, where the matrix `C` is symmetric and positive definite.

`'cityblock'`

City block distance.

`'minkowski'`

Minkowski distance. The default exponent is 2. Use `DistParameter` to specify a different exponent `P`, where `P` is a positive scalar value of the exponent.

`'chebychev'`

Chebychev distance (maximum coordinate difference).

`'cosine'`

One minus the cosine of the included angle between points (treated as vectors).

`'correlation'`

One minus the sample correlation between points (treated as sequences of values).

`'hamming'`

Hamming distance, which is the percentage of coordinates that differ.

`'jaccard'`

One minus the Jaccard coefficient, which is the percentage of nonzero coordinates that differ.

`'spearman'`

One minus the sample Spearman's rank correlation between observations (treated as sequences of values).

`@distfun`

Custom distance function handle. A distance function has the form

```function D2 = distfun(ZI,ZJ) % calculation of distance ...```
where

• `ZI` is a `1`-by-`n` vector containing a single observation.

• `ZJ` is an `m2`-by-`n` matrix containing multiple observations. `distfun` must accept a matrix `ZJ` with an arbitrary number of observations.

• `D2` is an `m2`-by-`1` vector of distances, and `D2(k)` is the distance between observations `ZI` and `ZJ(k,:)`.

If your data is not sparse, you can generally compute distance more quickly by using a built-in distance instead of a function handle.

For definitions, see Distance Metrics.

When you use `'seuclidean'`, `'minkowski'`, or `'mahalanobis'`, you can specify an additional input argument `DistParameter` to control these metrics. You can also use these metrics in the same way as the other metrics with a default value of `DistParameter`.

Example: `'minkowski'`

Distance metric parameter values, specified as a positive scalar, numeric vector, or numeric matrix. This argument is valid only when you specify `Distance` as `'seuclidean'`, `'minkowski'`, or `'mahalanobis'`.

• If `Distance` is `'seuclidean'`, `DistParameter` is a vector of scaling factors for each dimension, specified as a positive vector. The default value is `nanstd(X)`.

• If `Distance` is `'minkowski'`, `DistParameter` is the exponent of Minkowski distance, specified as a positive scalar. The default value is 2.

• If `Distance` is `'mahalanobis'`, `DistParameter` is a covariance matrix, specified as a numeric matrix. The default value is `nancov(X)`. `DistParameter` must be symmetric and positive definite.

Example: `'minkowski',3`

Data Types: `single` | `double`

### Name-Value Pair Arguments

Specify optional comma-separated pairs of `Name,Value` arguments. `Name` is the argument name and `Value` is the corresponding value. `Name` must appear inside quotes. You can specify several name and value pair arguments in any order as `Name1,Value1,...,NameN,ValueN`.

Example: Either `'Smallest',K` or `'Largest',K`. You cannot use both `'Smallest'` and `'Largest'`.

Number of smallest distances to find, specified as the comma-separated pair consisting of `'Smallest'` and a positive integer. If you specify `'Smallest'`, then `pdist2` sorts the distances in each column of `D` in ascending order.

Example: `'Smallest',3`

Data Types: `single` | `double`

Number of largest distances to find, specified as the comma-separated pair consisting of `'Largest'` and a positive integer. If you specify `'Largest'`, then `pdist2` sorts the distances in each column of `D` in descending order.

Example: `'Largest',3`

Data Types: `single` | `double`

## Output Arguments

collapse all

Pairwise distances, returned as a numeric matrix.

If you do not specify either `'Smallest'` or `'Largest'`, then `D` is an mx-by-my matrix, where mx and my are the number of observations in `X` and `Y`, respectively. `D(i,j)` is the distance between observation `i` in `X` and observation `j` in `Y`. If observation i in `X` or observation j in `Y` contains `NaN`, then `D(i,j)` is `NaN` for the built-in distance functions.

If you specify either `'Smallest'` or `'Largest'` as `K`, then `D` is a `K`-by-my matrix. `D` contains either the `K` smallest or `K` largest pairwise distances to observations in `X` for each observation in `Y`. For each observation in `Y`, `pdist2` finds the `K` smallest or largest distances by computing and comparing the distance values to all the observations in `X`. If `K` is greater than mx, `pdist2` returns an mx-by-my matrix.

Sort index, returned as a positive integer matrix. `I` is the same size as `D`. `I` contains the indices of the observations in `X` corresponding to the distances in `D`.

collapse all

### Distance Metrics

A distance metric is a function that defines a distance between two observations. `pdist2` supports various distance metrics: Euclidean distance, standardized Euclidean distance, Mahalanobis distance, city block distance, Minkowski distance, Chebychev distance, cosine distance, correlation distance, Hamming distance, Jaccard distance, and Spearman distance.

Given an mx-by-n data matrix X, which is treated as mx (1-by-n) row vectors x1, x2, ..., xmx, and an my-by-n data matrix Y, which is treated as my (1-by-n) row vectors y1, y2, ...,ymy, the various distances between the vector xs and yt are defined as follows:

• Euclidean distance

`${d}_{st}^{2}=\left({x}_{s}-{y}_{t}\right)\left({x}_{s}-{y}_{t}{\right)}^{\prime }.$`

The Euclidean distance is a special case of the Minkowski distance, where p = 2.

• Standardized Euclidean distance

`${d}_{st}^{2}=\left({x}_{s}-{y}_{t}\right){V}^{-1}\left({x}_{s}-{y}_{t}{\right)}^{\prime },$`

where V is the n-by-n diagonal matrix whose jth diagonal element is (S(j))2, where S is a vector of scaling factors for each dimension.

• Mahalanobis distance

`${d}_{st}^{2}=\left({x}_{s}-{y}_{t}\right){C}^{-1}\left({x}_{s}-{y}_{t}{\right)}^{\prime },$`

where C is the covariance matrix.

• City block distance

`${d}_{st}=\sum _{j=1}^{n}|{x}_{sj}-{y}_{tj}|.$`

The city block distance is a special case of the Minkowski distance, where p = 1.

• Minkowski distance

`${d}_{st}=\sqrt[p]{\sum _{j=1}^{n}{|{x}_{sj}-{y}_{tj}|}^{p}}.$`

For the special case of p = 1, the Minkowski distance gives the city block distance. For the special case of p = 2, the Minkowski distance gives the Euclidean distance. For the special case of p = ∞, the Minkowski distance gives the Chebychev distance.

• Chebychev distance

`${d}_{st}={\mathrm{max}}_{j}\left\{|{x}_{sj}-{y}_{tj}|\right\}.$`

The Chebychev distance is a special case of the Minkowski distance, where p = ∞.

• Cosine distance

`${d}_{st}=\left(1-\frac{{x}_{s}{{y}^{\prime }}_{t}}{\sqrt{\left({x}_{s}{{x}^{\prime }}_{s}\right)\left({y}_{t}{{y}^{\prime }}_{t}\right)}}\right).$`
• Correlation distance

`${d}_{st}=1-\frac{\left({x}_{s}-{\overline{x}}_{s}\right){\left({y}_{t}-{\overline{y}}_{t}\right)}^{\prime }}{\sqrt{\left({x}_{s}-{\overline{x}}_{s}\right){\left({x}_{s}-{\overline{x}}_{s}\right)}^{\prime }}\sqrt{\left({y}_{t}-{\overline{y}}_{t}\right){\left({y}_{t}-{\overline{y}}_{t}\right)}^{\prime }}},$`

where

`${\overline{x}}_{s}=\frac{1}{n}\sum _{j}{x}_{sj}$`

and

`${\overline{y}}_{t}=\frac{1}{n}\sum _{j}{y}_{tj}.$`
• Hamming distance

`${d}_{st}=\left(#\left({x}_{sj}\ne {y}_{tj}\right)/n\right).$`
• Jaccard distance

`${d}_{st}=\frac{#\left[\left({x}_{sj}\ne {y}_{tj}\right)\cap \left(\left({x}_{sj}\ne 0\right)\cup \left({y}_{tj}\ne 0\right)\right)\right]}{#\left[\left({x}_{sj}\ne 0\right)\cup \left({y}_{tj}\ne 0\right)\right]}.$`
• Spearman distance

`${d}_{st}=1-\frac{\left({r}_{s}-{\overline{r}}_{s}\right){\left({r}_{t}-{\overline{r}}_{t}\right)}^{\prime }}{\sqrt{\left({r}_{s}-{\overline{r}}_{s}\right){\left({r}_{s}-{\overline{r}}_{s}\right)}^{\prime }}\sqrt{\left({r}_{t}-{\overline{r}}_{t}\right){\left({r}_{t}-{\overline{r}}_{t}\right)}^{\prime }}},$`

where

• rsj is the rank of xsj taken over x1j, x2j, ...xmx,j, as computed by `tiedrank`.

• rtj is the rank of ytj taken over y1j, y2j, ...ymy,j, as computed by `tiedrank`.

• rs and rt are the coordinate-wise rank vectors of xs and yt, i.e., rs = (rs1, rs2, ... rsn) and rt = (rt1, rt2, ... rtn).

• ${\overline{r}}_{s}=\frac{1}{n}\sum _{j}{r}_{sj}=\frac{\left(n+1\right)}{2}$.

• ${\overline{r}}_{t}=\frac{1}{n}\sum _{j}{r}_{tj}=\frac{\left(n+1\right)}{2}$.