
# pdist2

Pairwise distance between two sets of observations

## Syntax

```
D = pdist2(X,Y)
D = pdist2(X,Y,distance)
D = pdist2(X,Y,'minkowski',P)
D = pdist2(X,Y,'mahalanobis',C)
D = pdist2(X,Y,distance,'Smallest',K)
D = pdist2(X,Y,distance,'Largest',K)
[D,I] = pdist2(X,Y,distance,'Smallest',K)
[D,I] = pdist2(X,Y,distance,'Largest',K)
```

## Description

`D = pdist2(X,Y)` returns a matrix `D` containing the Euclidean distances between each pair of observations in the mx-by-n data matrix `X` and the my-by-n data matrix `Y`. Rows of `X` and `Y` correspond to observations, and columns correspond to variables. `D` is an mx-by-my matrix, with the (i,j) entry equal to the distance between observation i in `X` and observation j in `Y`. The (i,j) entry is `NaN` if observation i in `X` or observation j in `Y` contains `NaN`s.
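As a cross-check of the behavior described above, the basic computation can be sketched outside MATLAB. This is a minimal NumPy analogue (Python, illustrative only; `pdist2_euclidean` is a hypothetical helper name, not part of any library):

```python
import numpy as np

def pdist2_euclidean(X, Y):
    """Euclidean distance between each row of X (mx-by-n) and Y (my-by-n).

    Returns an mx-by-my matrix D with D[i, j] = ||X[i, :] - Y[j, :]||.
    A NaN anywhere in row i of X or row j of Y propagates to D[i, j],
    matching the NaN behavior described above.
    """
    # Broadcast (mx, 1, n) against (1, my, n), then reduce over coordinates.
    diff = X[:, None, :] - Y[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

X = np.array([[0.0, 0.0], [3.0, 4.0]])
Y = np.array([[0.0, 0.0], [0.0, 4.0]])
D = pdist2_euclidean(X, Y)  # 2-by-2: D[1, 0] is dist((3,4),(0,0)) = 5
```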

`D = pdist2(X,Y,distance)` computes `D` using `distance`. Choices are:

| Metric | Description |
| --- | --- |
| `'euclidean'` | Euclidean distance (default). |
| `'squaredeuclidean'` | Squared Euclidean distance. (This option is provided for efficiency only. It does not satisfy the triangle inequality.) |
| `'seuclidean'` | Standardized Euclidean distance. Each coordinate difference between rows in `X` and `Y` is scaled by dividing by the corresponding element of the standard deviation computed from `X`, `S = nanstd(X)`. To specify another value for `S`, use `D = pdist2(X,Y,'seuclidean',S)`. |
| `'cityblock'` | City block metric. |
| `'minkowski'` | Minkowski distance. The default exponent is 2. To compute the distance with a different exponent, use `D = pdist2(X,Y,'minkowski',P)`, where the exponent `P` is a positive scalar. |
| `'chebychev'` | Chebychev distance (maximum coordinate difference). |
| `'mahalanobis'` | Mahalanobis distance, using the sample covariance of `X` as computed by `nancov`. To compute the distance with a different covariance, use `D = pdist2(X,Y,'mahalanobis',C)`, where the matrix `C` is symmetric and positive definite. |
| `'cosine'` | One minus the cosine of the included angle between points (treated as vectors). |
| `'correlation'` | One minus the sample correlation between points (treated as sequences of values). |
| `'spearman'` | One minus the sample Spearman's rank correlation between observations, treated as sequences of values. |
| `'hamming'` | Hamming distance, the percentage of coordinates that differ. |
| `'jaccard'` | One minus the Jaccard coefficient, the percentage of nonzero coordinates that differ. |

A custom distance function, specified using `@`: `D = pdist2(X,Y,@distfun)`.

A distance function must be of the form

`function D2 = distfun(ZI, ZJ)`

taking as arguments a 1-by-n vector `ZI` containing a single observation from `X` or `Y`, and an m2-by-n matrix `ZJ` containing multiple observations from `X` or `Y`, and returning an m2-by-1 vector of distances `D2`, whose `J`th element is the distance between the observations `ZI` and `ZJ(J,:)`.

If your data is not sparse, it is generally faster to use a built-in distance metric than a function handle.
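To make the calling convention concrete, here is a NumPy sketch (Python, illustrative only) of how `pdist2` applies such a handle. `distfun` here implements a city block distance purely as an example, and `pdist2_with_handle` is a hypothetical driver, not a real API:

```python
import numpy as np

def distfun(ZI, ZJ):
    # ZI: a 1-by-n single observation; ZJ: an m2-by-n block of observations.
    # Returns an m2-vector of distances (here: city block, as an example).
    return np.abs(ZJ - ZI).sum(axis=1)

def pdist2_with_handle(X, Y, fun):
    # Mirror pdist2(X,Y,@distfun): call fun once per observation of X
    # against all observations of Y, stacking the results row by row.
    return np.stack([fun(x, Y) for x in X])

X = np.array([[0.0, 0.0], [1.0, 1.0]])
Y = np.array([[1.0, 2.0], [3.0, 3.0]])
D = pdist2_with_handle(X, Y, distfun)  # 2-by-2 city block distances
```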

`D = pdist2(X,Y,distance,'Smallest',K)` returns a `K`-by-my matrix `D` containing the `K` smallest pairwise distances to observations in `X` for each observation in `Y`. `pdist2` sorts the distances in each column of `D` in ascending order. `D = pdist2(X,Y,distance,'Largest',K)` returns the `K` largest pairwise distances sorted in descending order. If `K` is greater than mx, `pdist2` returns an mx-by-my distance matrix. For each observation in `Y`, `pdist2` finds the `K` smallest or largest distances by computing and comparing the distance values to all the observations in `X`.

`[D,I] = pdist2(X,Y,distance,'Smallest',K)` returns a `K`-by-my matrix `I` containing indices of the observations in `X` corresponding to the `K` smallest pairwise distances in `D`. `[D,I] = pdist2(X,Y,distance,'Largest',K)` returns indices corresponding to the `K` largest pairwise distances.
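The `'Smallest'` behavior and the index output `I` can be sketched in NumPy (Python, illustrative only; note that indices here are 0-based, whereas MATLAB's `I` is 1-based):

```python
import numpy as np

def pdist2_smallest(X, Y, K):
    """Sketch of pdist2(X,Y,'euclidean','Smallest',K).

    Returns (D, I): K-by-my arrays holding, for each column (observation
    of Y), the K smallest distances in ascending order and the 0-based
    row indices into X they came from.
    """
    diff = X[:, None, :] - Y[None, :, :]
    full = np.sqrt((diff ** 2).sum(axis=-1))  # mx-by-my, as pdist2(X,Y)
    K = min(K, X.shape[0])                    # K > mx: keep all mx rows
    I = np.argsort(full, axis=0)[:K, :]       # sort each column ascending
    D = np.take_along_axis(full, I, axis=0)
    return D, I

X = np.array([[0.0], [10.0], [3.0]])
Y = np.array([[1.0], [9.0]])
D, I = pdist2_smallest(X, Y, 2)  # 2-by-2 distances and indices
```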

### Metrics

Given an mx-by-n data matrix `X`, treated as mx (1-by-n) row vectors ${x}_{1},{x}_{2},\dots ,{x}_{mx}$, and an my-by-n data matrix `Y`, treated as my (1-by-n) row vectors ${y}_{1},{y}_{2},\dots ,{y}_{my}$, the various distances between the vectors ${x}_{s}$ and ${y}_{t}$ are defined as follows:

• Euclidean distance

`${d}_{st}^{2}=\left({x}_{s}-{y}_{t}\right)\left({x}_{s}-{y}_{t}{\right)}^{\prime }$`

Notice that the Euclidean distance is a special case of the Minkowski metric, where $p=2$.

• Standardized Euclidean distance

`${d}_{st}^{2}=\left({x}_{s}-{y}_{t}\right){V}^{-1}\left({x}_{s}-{y}_{t}{\right)}^{\prime }$`

where `V` is the n-by-n diagonal matrix whose jth diagonal element is ${S\left(j\right)}^{2}$, where `S` is the vector of standard deviations.

• Mahalanobis distance

`${d}_{st}^{2}=\left({x}_{s}-{y}_{t}\right){C}^{-1}\left({x}_{s}-{y}_{t}{\right)}^{\prime }$`

where `C` is the covariance matrix.

• City block metric

`${d}_{st}=\sum _{j=1}^{n}|{x}_{sj}-{y}_{tj}|$`

Notice that the city block distance is a special case of the Minkowski metric, where $p=1$.

• Minkowski metric

`${d}_{st}=\sqrt[p]{\sum _{j=1}^{n}{|{x}_{sj}-{y}_{tj}|}^{p}}$`

Notice that for the special case of $p=1$, the Minkowski metric gives the city block metric; for the special case of $p=2$, it gives the Euclidean distance; and for the special case of $p=\infty$, it gives the Chebychev distance.

• Chebychev distance

`${d}_{st}={\mathrm{max}}_{j}\left\{|{x}_{sj}-{y}_{tj}|\right\}$`

Notice that the Chebychev distance is a special case of the Minkowski metric, where $p=\infty$.

• Cosine distance

`${d}_{st}=\left(1-\frac{{x}_{s}{{y}^{\prime }}_{t}}{\sqrt{\left({x}_{s}{{x}^{\prime }}_{s}\right)\left({y}_{t}{{y}^{\prime }}_{t}\right)}}\right)$`
• Correlation distance

`${d}_{st}=1-\frac{\left({x}_{s}-{\overline{x}}_{s}\right){\left({y}_{t}-{\overline{y}}_{t}\right)}^{\prime }}{\sqrt{\left({x}_{s}-{\overline{x}}_{s}\right){\left({x}_{s}-{\overline{x}}_{s}\right)}^{\prime }}\sqrt{\left({y}_{t}-{\overline{y}}_{t}\right){\left({y}_{t}-{\overline{y}}_{t}\right)}^{\prime }}}$`

where

`${\overline{x}}_{s}=\frac{1}{n}\sum _{j}{x}_{sj}$` and `${\overline{y}}_{t}=\frac{1}{n}\sum _{j}{y}_{tj}$`
• Hamming distance

`${d}_{st}=\left(#\left({x}_{sj}\ne {y}_{tj}\right)/n\right)$`
• Jaccard distance

`${d}_{st}=\frac{#\left[\left({x}_{sj}\ne {y}_{tj}\right)\cap \left(\left({x}_{sj}\ne 0\right)\cup \left({y}_{tj}\ne 0\right)\right)\right]}{#\left[\left({x}_{sj}\ne 0\right)\cup \left({y}_{tj}\ne 0\right)\right]}$`
• Spearman distance

`${d}_{st}=1-\frac{\left({r}_{s}-{\overline{r}}_{s}\right){\left({r}_{t}-{\overline{r}}_{t}\right)}^{\prime }}{\sqrt{\left({r}_{s}-{\overline{r}}_{s}\right){\left({r}_{s}-{\overline{r}}_{s}\right)}^{\prime }}\sqrt{\left({r}_{t}-{\overline{r}}_{t}\right){\left({r}_{t}-{\overline{r}}_{t}\right)}^{\prime }}}$`

where

• ${r}_{sj}$ is the rank of ${x}_{sj}$ taken over ${x}_{1j},{x}_{2j},\dots ,{x}_{mx,j}$, as computed by `tiedrank`

• ${r}_{tj}$ is the rank of ${y}_{tj}$ taken over ${y}_{1j},{y}_{2j},\dots ,{y}_{my,j}$, as computed by `tiedrank`

• ${r}_{s}$ and ${r}_{t}$ are the coordinate-wise rank vectors of ${x}_{s}$ and ${y}_{t}$, i.e., ${r}_{s}=\left({r}_{s1},{r}_{s2},\dots ,{r}_{sn}\right)$ and ${r}_{t}=\left({r}_{t1},{r}_{t2},\dots ,{r}_{tn}\right)$

• ${\overline{r}}_{s}=\frac{1}{n}\sum _{j}{r}_{sj}=\frac{\left(n+1\right)}{2}$

• ${\overline{r}}_{t}=\frac{1}{n}\sum _{j}{r}_{tj}=\frac{\left(n+1\right)}{2}$
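Several of the special-case identities above — Minkowski with p = 1 and p = 2 reducing to the city block and Euclidean distances, and large p approaching Chebychev — can be checked numerically. A NumPy sketch (Python, illustrative only) for a single pair of observations:

```python
import numpy as np

x = np.array([1.0, 2.0, -3.0])
y = np.array([4.0, 0.0, 1.0])

def minkowski(x, y, p):
    # d_st = (sum_j |x_j - y_j|^p)^(1/p)
    return (np.abs(x - y) ** p).sum() ** (1.0 / p)

cityblock = np.abs(x - y).sum()                           # sum of |diffs|
euclidean = np.sqrt(((x - y) ** 2).sum())                 # L2 norm of diff
chebychev = np.abs(x - y).max()                           # max |diff|
cosine = 1.0 - x.dot(y) / np.sqrt(x.dot(x) * y.dot(y))    # 1 - cos(angle)

# Minkowski special cases: p = 1 -> city block, p = 2 -> Euclidean,
# and p -> infinity approaches the Chebychev distance.
assert np.isclose(minkowski(x, y, 1), cityblock)
assert np.isclose(minkowski(x, y, 2), euclidean)
assert np.isclose(minkowski(x, y, 64), chebychev, atol=1e-2)
```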

## Examples

Generate random data and find the unweighted Euclidean distance, then find the weighted distance using two different methods:

```
% Compute the ordinary Euclidean distance
X = randn(100, 5);
Y = randn(25, 5);
D = pdist2(X,Y,'euclidean');  % euclidean distance

% Compute the Euclidean distance with each coordinate
% difference scaled by the standard deviation
Dstd = pdist2(X,Y,'seuclidean');

% Use a function handle to compute a distance that weights
% each coordinate contribution differently.
Wgts = [.1 .3 .3 .2 .1];
weuc = @(XI,XJ,W)(sqrt(bsxfun(@minus,XI,XJ).^2 * W'));
Dwgt = pdist2(X,Y, @(Xi,Xj) weuc(Xi,Xj,Wgts));
```
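For comparison, the weighted-Euclidean function handle above corresponds to the following NumPy sketch (Python, illustrative only; with unit weights it reduces to the ordinary Euclidean distance):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
Y = rng.standard_normal((25, 5))
Wgts = np.array([0.1, 0.3, 0.3, 0.2, 0.1])

# Same formula as the weuc handle: sqrt(sum_j W(j) * (XI_j - XJ_j)^2).
diff = X[:, None, :] - Y[None, :, :]             # 100-by-25-by-5
Dwgt = np.sqrt((diff ** 2 * Wgts).sum(axis=-1))  # 100-by-25 weighted dists

# Unweighted Euclidean distances, for reference.
Deuc = np.sqrt((diff ** 2).sum(axis=-1))
```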