kde

Kernel density estimate for univariate data

Since R2023b

Syntax

[f,xf] = kde(a)

[f,xf,bw] = kde(a)

[___] = kde(a,Name=Value)

Description

[f,xf] = kde(a) estimates a probability density function (pdf) for the univariate data in the vector a and returns values f of the estimated pdf at the evaluation points xf. kde uses kernel density estimation to estimate the pdf. See Kernel Distribution for more information.

[f,xf,bw] = kde(a) also returns the bandwidth for the kernel smoothing function.

[___] = kde(a,Name=Value) specifies options using one or more name-value arguments. For example, kde(a,ProbabilityFcn="cdf") estimates the cumulative distribution function (cdf) for a instead of the pdf. Use this syntax with any of the output argument combinations in the previous syntaxes.

Examples

Estimate Probability Functions

Generate some normally distributed data.

rng(0,"twister") %  For reproducibility
a = randn(100,1);

Estimate the pdf for the sample data.

[fp,xfp] = kde(a);

fp contains the values for the estimated pdf at the evaluation points in xfp.

Estimate the cdf for the sample data.

[fc,xfc] = kde(a,ProbabilityFcn="cdf");

fc contains the values for the estimated cdf at the evaluation points in xfc. xfc and xfp contain the same evaluation points because they were both calculated with the sample data in a.

Evaluate the pdf and cdf for the normal distribution at the evaluation points.

np = (1/sqrt(2*pi))*exp(-.5*(xfp.^2));
nc = 0.5*(1+erf(xfc/sqrt(2)));

Plot the estimated pdf with the normal distribution pdf.

plot(xfp,fp,"-",xfp,np,"--")
legend("kde estimate","Normal density")

Plot the estimated cdf with the normal distribution cdf.

figure
plot(xfc,fc,"-",xfc,nc,"--")
legend("kde estimate","Normal cumulative",Location="northwest")

The plots show that the estimated pdf and cdf have shapes similar to the pdf and cdf of the standard normal distribution.

Inspect Bandwidth

Generate some normally distributed data.

rng(0,"twister") %  For reproducibility
a = randn(100,1);

Estimate the pdf for the sample data. By default, kde uses the normal-approximation method to calculate the bandwidth for the kernel smoothing function.

[fn,xfn,bwn] = kde(a);

fn contains the values for the estimated pdf at the evaluation points in xfn, and bwn is the bandwidth for the kernel smoothing function.

Estimate the pdf using the plug-in method, and display the bandwidth associated with each estimated pdf.

[p,xp,bwp] = kde(a,Bandwidth="plug-in");
[bwn,bwp]

ans = 1×2

    0.4958    0.5751

The bandwidth calculated with the normal-approximation method is less than the bandwidth calculated with the plug-in method.

Plot the estimated pdfs.

plot(xfn,fn)
hold on
plot(xp,p)
legend("normal-approx","plug-in")

The estimated pdfs have shapes typical of a normal distribution. The peak of the pdf corresponding to the normal-approximation method is higher than the peak of the pdf corresponding to the plug-in method.
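
The link between bandwidth and peak height can also be seen by fixing the bandwidth by hand. The following is a minimal sketch, assuming two illustrative bandwidth values (0.2 and 1.0) and a shared evaluation grid xq; these values are chosen for the illustration and are not defaults of kde.

```matlab
rng(0,"twister")                 % same seed as the example above
a = randn(100,1);

% Shared evaluation grid so that both estimates are compared at the same points.
xq = linspace(-4,4,201);

% Illustrative bandwidths: one narrow, one wide.
[fNarrow,~,bwNarrow] = kde(a,Bandwidth=0.2,EvaluationPoints=xq);
[fWide,~,bwWide]     = kde(a,Bandwidth=1.0,EvaluationPoints=xq);

% A smaller bandwidth smooths less, so the estimate is expected to peak higher.
peakNarrow = max(fNarrow);
peakWide   = max(fWide);
```

With the default normal kernel, peakNarrow should exceed peakWide, because widening the bandwidth spreads the same probability mass over a broader region.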

Compare Kernel Smoothers

Generate some bimodal sample data.

rng(0,"twister") %  For reproducibility
a = [randn(100,1)-5; randn(20,1)+5];

Use the default "normal" kernel smoothing function to estimate the pdf for the sample data. Use the "box", "triangle", and "parabolic" kernel smoothing functions to calculate three more estimates for the pdf.

[f1,xf1] = kde(a);
[f2,xf2] = kde(a,Kernel="box");
[f3,xf3] = kde(a,Kernel="triangle");
[f4,xf4] = kde(a,Kernel="parabolic");

xf1, xf2, xf3, and xf4 contain the same evaluation points because they were each calculated with the sample data in a. f1, f2, f3, and f4 contain the values of each estimated pdf at the evaluation points.

Plot the estimated pdfs.

tiledlayout(2,2)
nexttile
plot(xf1,f1) %  normal
nexttile
plot(xf2,f2) %  box
nexttile
plot(xf3,f3) %  triangle
nexttile
plot(xf4,f4) %  parabolic

The plots show that the four estimated pdfs have similar vertical ranges and two peaks each. The pdf calculated with the "box" kernel appears to be the least smooth of the four estimates.

Input Arguments

`a` — Sample data
numeric vector

Sample data used to estimate the probability function, specified as a numeric vector.

Data Types: single | double

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: kde(a,Kernel="box",Bandwidth=0.8,Weight=wgt) specifies a box kernel smoothing function with a bandwidth of 0.8 and vector of observation weights wgt.

`Bandwidth` — Bandwidth for kernel smoothing function
`"normal-approx"` (default) | `"plug-in"` | positive scalar

Bandwidth for the kernel smoothing function, specified as "normal-approx", "plug-in", or a positive scalar.

When Bandwidth is "normal-approx", kde uses the normal-approximation method, or Silverman's rule of thumb, to calculate the bandwidth.
When Bandwidth is "plug-in", kde uses the improved plug-in method described in [1] to calculate the bandwidth. The plug-in method is sometimes called the Sheather-Jones method.
When Bandwidth is a positive scalar, its value controls the smoothness of the probability function estimate. As the value increases, the probability function estimate gets smoother.

To see how Bandwidth affects the kernel smoothing function, see Kernel.

Example: kde(a,Bandwidth="plug-in")

Data Types: single | double | string | char
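
As a rough illustration of what a normal-approximation bandwidth looks like, the sketch below computes one textbook form of the rule of thumb, h = s*(4/(3n))^(1/5), where s is the sample standard deviation. This form is an assumption made for illustration. The exact scale estimate and constant that kde uses are not stated on this page, so hRule need not match the bw output.

```matlab
rng(0,"twister")                          % for reproducibility
a = randn(100,1);
n = numel(a);

% Assumption: plain sample standard deviation as the scale estimate.
sigmaHat = std(a);

% Normal-reference rule of thumb (one common textbook form).
hRule = sigmaHat*(4/(3*n))^(1/5);

% The rule is proportional to the data scale: doubling the data doubles h.
hRuleScaled = std(2*a)*(4/(3*n))^(1/5);
```

To compare against the value kde actually selects, request the third output, as in [~,~,bw] = kde(a).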

`EvaluationPoints` — Points at which to evaluate estimated probability function
numeric vector

Points at which to evaluate the estimated probability function, specified as a numeric vector. By default, kde evaluates the estimated probability function at NumPoints evenly spaced points that cover the range of the observations in a.

If you specify both the NumPoints and EvaluationPoints name-value arguments, kde ignores NumPoints.

Example: kde(a,EvaluationPoints=linspace(0,10,50))

Data Types: single | double
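
The interaction between EvaluationPoints and NumPoints can be sketched as follows. The grid pts and the NumPoints value of 200 are arbitrary choices for the illustration.

```matlab
rng(0,"twister")                 % for reproducibility
a = randn(100,1);

% Custom evaluation grid with 50 points (illustrative choice).
pts = linspace(-3,3,50);

% NumPoints is also given here; per the description above, kde ignores it
% when EvaluationPoints is specified.
[f,xf] = kde(a,EvaluationPoints=pts,NumPoints=200);

numReturned = numel(f);          % expected to follow pts, not NumPoints
```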

`Kernel` — Type of kernel smoothing function
`"normal"` (default) | `"box"` | `"triangle"` | `"parabolic"` | function handle

Type of kernel smoothing function, specified as a function handle or one of the values in this table.

Value	Equation
`"normal"`	$K_{i}(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{d_{i}^{2}}{2}}$
`"box"`	$K_{i}(x) = \begin{cases} \frac{1}{2\sqrt{3}}, & |d_{i}| \leq \sqrt{3} \\ 0, & |d_{i}| > \sqrt{3} \end{cases}$
`"triangle"`	$K_{i}(x) = \begin{cases} \frac{1}{\sqrt{6}} \left( 1 - \frac{|d_{i}|}{\sqrt{6}} \right), & |d_{i}| \leq \sqrt{6} \\ 0, & |d_{i}| > \sqrt{6} \end{cases}$
`"parabolic"`	$\begin{array}{l} K_{i}(x) = \max \left( 0, \frac{3}{4} u \right), \\ u = \frac{1}{\sqrt{5}} \left( 1 - \frac{z^{2}}{5} \right), \\ z = \max (-\sqrt{5}, \min (d_{i}, \sqrt{5})) \end{array}$

In the table, $d_{i} = \frac{x - a_{i}}{h}$, where h is the bandwidth specified in the Bandwidth name-value argument, and $a_{i}$ is the element at position i in a. A random variable with a pdf defined by one of the kernels in the table has a variance of 1. A parabolic kernel smoothing function is sometimes called an Epanechnikov smoothing function.

If you specify Kernel as a function handle, the function must accept a matrix or column vector of arbitrary length as its only input argument and return a nonnegative matrix or vector of the same size.

For more information about how kde uses the kernel smoothing function to estimate the probability function, see Kernel Distribution.

Example: kde(a,Kernel="parabolic")

Data Types: string | char | function_handle
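
The unit-variance property of the built-in kernels can be checked numerically. The following sketch writes the four kernels from the table as anonymous functions of d and integrates each one with integral. The handle names (normalK, boxK, triK, paraK) are illustrative and are not identifiers used by kde; the sketch checks the formulas as written in the table, not the internal implementation.

```matlab
% Kernels from the table, written as functions of d = (x - a_i)/h.
% The handle names are illustrative and are not kde identifiers.
normalK = @(d) exp(-d.^2/2)/sqrt(2*pi);
boxK    = @(d) (abs(d) <= sqrt(3))/(2*sqrt(3));
triK    = @(d) max(0, 1 - abs(d)/sqrt(6))/sqrt(6);
paraK   = @(d) max(0, 0.75*(1 - min(d.^2,5)/5)/sqrt(5));

kernels = {normalK, boxK, triK, paraK};
limits  = [Inf, sqrt(3), sqrt(6), sqrt(5)];   % half-width of each support

area     = zeros(1,4);
variance = zeros(1,4);
for k = 1:4
    K = kernels{k};
    c = limits(k);
    area(k)     = integral(K, -c, c);                 % total probability
    variance(k) = integral(@(d) d.^2.*K(d), -c, c);   % second moment (mean is 0)
end

disp([area; variance])   % every entry should be close to 1
```

Each of these handles accepts an array and returns a nonnegative array of the same size, which is the contract described above for a custom kernel, so a call such as kde(a,Kernel=triK) would be one way to pass a hand-written kernel.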

`NumPoints` — Number of evaluation points
positive integer scalar

Number of evaluation points for the estimated probability function, specified as a positive integer scalar. By default, NumPoints = max(100,u), where u is the square root of the number of elements in a, rounded to the nearest integer.

If you specify both the NumPoints and EvaluationPoints name-value arguments, kde ignores NumPoints.

Example: kde(a,NumPoints=100)

Data Types: single | double
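
The default rule above is simple arithmetic. The sketch below restates it as an anonymous function; the name defaultNumPoints is hypothetical and only mirrors the documented formula, not the internal code of kde.

```matlab
% Documented default: max(100,u), with u = round(sqrt(number of elements in a)).
defaultNumPoints = @(a) max(100, round(sqrt(numel(a))));

nSmall = defaultNumPoints(zeros(100,1));     % u = 10, so the floor of 100 applies
nLarge = defaultNumPoints(zeros(40000,1));   % u = 200, which exceeds 100
```

Under this rule, the default stays at 100 until a has more than about 10,000 elements.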

`ProbabilityFcn` — Probability function
`"pdf"` (default) | `"cdf"`

Probability function to estimate, specified as "pdf" or "cdf". When ProbabilityFcn is "pdf", kde estimates a probability density function. To estimate a cumulative distribution function, specify ProbabilityFcn as "cdf".

Example: kde(a,ProbabilityFcn="cdf")

`Support` — Interval for sample data
`"unbounded"` (default) | `"positive"` | `"nonnegative"` | `"negative"` | two-element numeric vector

Interval for the sample data, specified as a two-element numeric vector, "unbounded", "positive", "nonnegative", or "negative". The elements of a must be in the interval specified by Support. The estimated probability function evaluates to 0 outside of the interval.

If you specify Support as a two-element vector [L U] or [L;U], L must be less than min(a) and U must be greater than max(a). The interval is open with lower bound L and upper bound U.

If you specify Support as a string, the sample data exists inside an interval described in this table.

Value	Support
`"unbounded"`	$(-\infty, \infty)$
`"positive"`	$(0, \infty)$
`"nonnegative"`	$[0, \infty)$
`"negative"`	$(-\infty, 0)$

Example: kde(a,Support="nonnegative")

Data Types: single | double | string | char
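
Because every element of a must lie inside the support interval, it can help to check the data before calling kde. The following is a minimal sketch with synthetic data; the variable names and the bounds 0 and 1 are illustrative choices.

```matlab
rng(0,"twister")                     % for reproducibility

% Strictly positive data: exponential draws built from uniform values in (0,1).
aPos = -log(rand(200,1));
assert(all(aPos > 0))                % data must lie inside (0,Inf)
[fPos,xfPos] = kde(aPos,Support="positive");

% Data on a bounded interval: uniform draws lie strictly inside (0,1).
aUnit = rand(200,1);
L = 0;
U = 1;
assert(L < min(aUnit) && U > max(aUnit))   % bounds must enclose all the data
[fUnit,xfUnit] = kde(aUnit,Support=[L U]);
```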

`Weight` — Observation weights
nonnegative vector

Observation weights, specified as a nonnegative vector. By default, kde weights all observations in a equally. For more information about how kde uses weights to estimate the probability function, see Kernel Distribution.

Data Types: single | double

Output Arguments

`f` — Estimated function values
numeric vector

Estimated function values, returned as a numeric vector. The length of f is equal to the number of evaluation points in xf.

`xf` — Evaluation points
numeric vector

Evaluation points, returned as a numeric vector. xf has the same size as the EvaluationPoints name-value argument, if EvaluationPoints is specified. Otherwise, the size of xf is given by the NumPoints name-value argument.

`bw` — Bandwidth
positive scalar

Bandwidth for the kernel smoothing function, returned as a positive scalar. You can use the Bandwidth name-value argument to specify the value for bw or the method for calculating bw.

More About

Kernel Distribution

A kernel distribution is a nonparametric representation of a probability density function (pdf) of a random variable. You can use a kernel distribution when a parametric distribution cannot properly describe the data or when you want to avoid making assumptions about the distribution of the data. A kernel distribution is defined by a smoothing function and a bandwidth value, which control the smoothness of the resulting density curve.

The kernel estimator is an estimated probability function for a random variable. For any real value of x, the kernel estimator for the pdf is given by

${\hat{f}}_{h} (x) = \frac{1}{n h} \sum_{i = 1}^{n} w_{i} K (\frac{x - x_{i}}{h}),$

where the x_i values are random samples from an unknown distribution, w_i values are their corresponding weights, n is the sample size, $K$ is the kernel smoothing function, and h is the bandwidth.

For any real value of x, the kernel estimator for the cumulative distribution function (cdf) is given by

${\hat{F}}_{h} (x) = \int_{- \infty}^{x} {\hat{f}}_{h} (t) d t = \frac{1}{n} \sum_{i = 1}^{n} w_{i} G (\frac{x - x_{i}}{h}),$

where $G (x) = \int_{- \infty}^{x} K (t) d t$. The factor 1/h in the pdf estimator cancels in the integration because of the change of variable $u = (t - x_{i})/h$.
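
The pdf estimator can be evaluated directly from its definition. The sketch below does so with a normal kernel, assuming equal weights w_i = 1 and an illustrative bandwidth h = 0.5 (neither value is selected by kde), and then obtains the cdf as the running numerical integral of the pdf estimate.

```matlab
rng(0,"twister")                    % for reproducibility
x = randn(100,1);                   % samples x_i (column vector)
n = numel(x);
h = 0.5;                            % illustrative bandwidth, not chosen by kde
w = ones(n,1);                      % assumption: equal weights w_i = 1

K = @(d) exp(-d.^2/2)/sqrt(2*pi);   % normal kernel smoothing function

% Evaluation grid (row vector) extending 5 bandwidths past the data.
xq = linspace(min(x) - 5*h, max(x) + 5*h, 500);

% n-by-500 matrix of (x - x_i)/h via implicit expansion.
D = (xq - x)/h;

% pdf estimator: (1/(n*h)) * sum_i w_i * K((x - x_i)/h).
fhat = (w.'*K(D))/(n*h);

% cdf estimate as the cumulative trapezoidal integral of fhat.
Fhat = cumtrapz(xq, fhat);
totalArea = Fhat(end);              % should be close to 1
```

Plotting fhat and Fhat against xq gives curves that can be compared with the output of kde for the same data and a matching Bandwidth value.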

For more details, see Kernel Distribution (Statistics and Machine Learning Toolbox).

References

[1] Botev, Z. I., J. F. Grotowski, and D. P. Kroese. "Kernel Density Estimation via Diffusion." The Annals of Statistics, vol. 38, no. 5 (October 1, 2010). https://projecteuclid.org/journals/annals-of-statistics/volume-38/issue-5/Kernel-density-estimation-via-diffusion/10.1214/10-AOS799.full

[2] Bowman, A. W., and A. Azzalini. "Applied Smoothing Techniques for Data Analysis." New York: Oxford University Press Inc., 1997.

[3] Hill, P. D. "Kernel estimation of a distribution function." Communications in Statistics - Theory and Methods 14, no. 3 (January 1985): 605–620.

[4] Jones, M. C. "Simple boundary correction for kernel density estimation." Statistics and Computing, no. 3 (September 1993): 135–146.

[5] Silverman, B. W. "Density Estimation for Statistics and Data Analysis." Chapman & Hall/CRC, 1986.

Version History

Introduced in R2023b
