kde

Kernel density estimate for univariate data

Since R2023b

    Description

    [f,xf] = kde(a) estimates a probability density function (pdf) for the univariate data in the vector a and returns values f of the estimated pdf at the evaluation points xf. kde uses kernel density estimation to estimate the pdf. See Kernel Distribution for more information.

    [f,xf,bw] = kde(a) also returns the bandwidth for the kernel smoothing function.

    [___] = kde(a,Name=Value) specifies options using one or more name-value arguments. For example, kde(a,ProbabilityFcn="cdf") estimates the cumulative distribution function (cdf) for a instead of the pdf. Use this syntax with any of the output argument combinations in the previous syntaxes.

    Examples

    Generate some normally distributed data.

    rng(0,"twister") %  For reproducibility
    a = randn(100,1);

    Estimate the pdf for the sample data.

    [fp,xfp] = kde(a);

    fp contains the values for the estimated pdf at the evaluation points in xfp.

    Estimate the cdf for the sample data.

    [fc,xfc] = kde(a,ProbabilityFcn="cdf");

    fc contains the values for the estimated cdf at the evaluation points in xfc. xfc and xfp contain the same evaluation points because they were both calculated with the sample data in a.

    Evaluate the pdf and cdf for the normal distribution at the evaluation points.

    np = (1/sqrt(2*pi))*exp(-.5*(xfp.^2));
    nc = 0.5*(1+erf(xfc/sqrt(2)));

    Plot the estimated pdf with the normal distribution pdf.

    plot(xfp,fp,"-",xfp,np,"--")
    legend("kde estimate","Normal density")

    Plot the estimated cdf with the normal distribution cdf.

    figure
    plot(xfc,fc,"-",xfc,nc,"--")
    legend("kde estimate","Normal cumulative",Location="northwest")

    The plots show that the estimated pdf and cdf have shapes similar to the pdf and cdf of the standard normal distribution.

    Generate some normally distributed data.

    rng(0,"twister") %  For reproducibility
    a = randn(100,1);

    Estimate the pdf for the sample data. By default, kde uses the normal-approximation method to calculate the bandwidth for the kernel smoothing function.

    [fn,xfn,bwn] = kde(a);

    fn contains the values for the estimated pdf at the evaluation points in xfn, and bwn is the bandwidth for the kernel smoothing function.

    Estimate the pdf using the plug-in method, and display the bandwidth associated with each estimated pdf.

    [p,xp,bwp] = kde(a,Bandwidth="plug-in");
    [bwn,bwp]
    ans = 1×2

        0.4958    0.5751

    The bandwidth calculated with the normal-approximation method is less than the bandwidth calculated with the plug-in method.

    Plot the estimated pdfs.

    plot(xfn,fn)
    hold on
    plot(xp,p)
    legend("normal-approx","plug-in")

    The estimated pdfs have shapes typical of a normal distribution. The peak of the pdf corresponding to the normal-approximation method is higher than the peak of the pdf corresponding to the plug-in method.

    Generate some bimodal sample data.

    rng(0,"twister") %  For reproducibility
    a = [randn(100,1)-5; randn(20,1)+5];

    Use the default "normal" kernel smoothing function to estimate the pdf for the sample data. Use the "box", "triangle", and "parabolic" kernel smoothing functions to calculate three more estimates for the pdf.

    [f1,xf1] = kde(a);
    [f2,xf2] = kde(a,Kernel="box");
    [f3,xf3] = kde(a,Kernel="triangle");
    [f4,xf4] = kde(a,Kernel="parabolic");

    xf1, xf2, xf3, and xf4 contain the same evaluation points because they were each calculated with the sample data in a. f1, f2, f3, and f4 contain the values of each estimated pdf at the evaluation points.

    Plot the estimated pdfs.

    tiledlayout(2,2)
    nexttile
    plot(xf1,f1) %  normal
    nexttile
    plot(xf2,f2) %  box
    nexttile
    plot(xf3,f3) %  triangle
    nexttile
    plot(xf4,f4) %  parabolic

    The plots show that the four estimated pdfs have similar vertical ranges and two peaks each. The pdf calculated with the "box" kernel appears to be the least smooth of the four estimates.

    Input Arguments

    a

    Sample data used to estimate the probability function, specified as a numeric vector.

    Data Types: single | double

    Name-Value Arguments

    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Example: kde(a,Kernel="box",Bandwidth=0.8,Weight=wgt) specifies a box kernel smoothing function with a bandwidth of 0.8 and vector of observation weights wgt.

    Bandwidth

    Bandwidth for the kernel smoothing function, specified as "normal-approx", "plug-in", or a positive scalar.

    • When Bandwidth is "normal-approx", kde uses the normal-approximation method, or Silverman's rule of thumb, to calculate the bandwidth.

    • When Bandwidth is "plug-in", kde uses the improved plug-in method described in [1] to calculate the bandwidth. The plug-in method is sometimes called the Sheather-Jones method.

    • When Bandwidth is a positive scalar, its value controls the smoothness of the probability function estimate. As the value increases, the probability function estimate gets smoother.

    To see how Bandwidth affects the kernel smoothing function, see Kernel.

    Example: kde(a,Bandwidth="plug-in")

    Data Types: single | double | string | char
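
    To give a feel for how a rule-of-thumb bandwidth scales with the data, the following sketch computes one common textbook form of Silverman's rule, h = sigma*(4/(3n))^(1/5), using only base MATLAB functions. The formula and the variable names hRule and hDouble are assumptions made for illustration. kde might use a different spread estimate internally, so hRule need not match the bw output of kde.

```matlab
rng(0,"twister") % For reproducibility
a = randn(100,1);

% Textbook rule-of-thumb bandwidth for a normal kernel (assumed form,
% not necessarily the exact computation that kde performs).
n = numel(a);
sigma = std(a);                       % Sample standard deviation
hRule = sigma*(4/(3*n))^(1/5);

% The rule is proportional to the spread of the data:
% doubling the data doubles the bandwidth.
hDouble = std(2*a)*(4/(3*n))^(1/5);
```

    Because Bandwidth accepts a positive scalar, a value such as hRule can be passed directly, for example kde(a,Bandwidth=hRule).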

    EvaluationPoints

    Points at which to evaluate the estimated probability function, specified as a numeric vector. By default, kde evaluates the estimated probability function at NumPoints evenly spaced points that cover the range of the observations in a.

    If you specify both the NumPoints and EvaluationPoints name-value arguments, kde ignores NumPoints.

    Example: kde(a,EvaluationPoints=linspace(0,10,50))

    Data Types: single | double

    Kernel

    Type of kernel smoothing function, specified as a function handle or one of the values in this table.

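    Value          Equation
    "normal"       $K_i(x) = \frac{1}{\sqrt{2\pi}} e^{-d_i^2/2}$
    "box"          $K_i(x) = \begin{cases} \frac{1}{2\sqrt{3}}, & |d_i| \le \sqrt{3} \\ 0, & |d_i| > \sqrt{3} \end{cases}$
    "triangle"     $K_i(x) = \begin{cases} \frac{1 - |d_i|/\sqrt{6}}{\sqrt{6}}, & |d_i| \le \sqrt{6} \\ 0, & |d_i| > \sqrt{6} \end{cases}$
    "parabolic"    $K_{i,h}(x) = \max\left(0, \tfrac{3}{4}u\right)$, $u = \frac{1 - z^2/5}{\sqrt{5}}$, $z = \max\left(-\sqrt{5}, \min\left(d_i, \sqrt{5}\right)\right)$

    In the table, $d_i = \frac{x - a_i}{h}$, where h is the bandwidth specified in the Bandwidth name-value argument, and $a_i$ is the element at position i in a. A random variable with a pdf defined by one of the kernels in the table has a variance of 1. A parabolic kernel smoothing function is sometimes called an Epanechnikov smoothing function.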

    If you specify Kernel as a function handle, the function must accept a matrix or column vector of arbitrary length as its only input argument and return a nonnegative matrix or vector of the same size.

    For more information about how kde uses the kernel smoothing function to estimate the probability function, see Kernel Distribution.

    Example: kde(a,Kernel="parabolic")

    Data Types: string | char | function_handle
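
    The unit-variance property of the built-in kernels can be checked numerically. The sketch below writes the four kernels as anonymous functions of d = (x - a_i)/h and integrates them with the base MATLAB function integral. The handle names (normalK, boxK, triangleK, parabolicK) are hypothetical, and the parabolic kernel is written without the intermediate variable z, which gives the same values because the max operation zeroes the kernel outside its support.

```matlab
% Kernels written as functions of d = (x - a_i)/h.
normalK    = @(d) exp(-d.^2/2)/sqrt(2*pi);
boxK       = @(d) (abs(d) <= sqrt(3))/(2*sqrt(3));
triangleK  = @(d) max(0,1 - abs(d)/sqrt(6))/sqrt(6);
parabolicK = @(d) max(0,(3/4)*(1 - d.^2/5)/sqrt(5));

kernels = {normalK,boxK,triangleK,parabolicK};

% Break points at the edges of the compact supports help the quadrature.
wp = [-sqrt(6) -sqrt(5) -sqrt(3) sqrt(3) sqrt(5) sqrt(6)];

area = zeros(1,4);
variance = zeros(1,4);
for k = 1:4
    K = kernels{k};
    % Integrate over [-10,10]; the normal tails beyond it are negligible.
    area(k) = integral(K,-10,10,Waypoints=wp);
    variance(k) = integral(@(d) d.^2.*K(d),-10,10,Waypoints=wp);
end
```

    Each handle operates elementwise and returns nonnegative values of the same size as its input, so any of them also meets the stated requirements for a custom Kernel value, for example kde(a,Kernel=triangleK).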

    NumPoints

    Number of evaluation points for the estimated probability function, specified as a positive integer scalar. By default, NumPoints = max(100,u), where u is the square root of the number of elements in a, rounded to the nearest integer.

    If you specify both the NumPoints and EvaluationPoints name-value arguments, kde ignores NumPoints.

    Example: kde(a,NumPoints=100)

    Data Types: single | double
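
    The default rule for NumPoints can be written out as a one-line helper. This is only a restatement of the description above in code. The helper name defaultNumPoints is hypothetical and is not part of kde.

```matlab
% Default number of evaluation points as described above:
% max(100,u), with u = sqrt(numel(a)) rounded to the nearest integer.
defaultNumPoints = @(a) max(100,round(sqrt(numel(a))));

nSmall = defaultNumPoints(zeros(100,1));     % u = 10, so the default is 100
nLarge = defaultNumPoints(zeros(40000,1));   % u = 200, so the default is 200
```

    Under this rule, the default exceeds 100 only when a has enough elements for the rounded square root to pass 100.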

    ProbabilityFcn

    Probability function to estimate, specified as "pdf" or "cdf". When ProbabilityFcn is "pdf", kde estimates a probability density function. To estimate a cumulative distribution function, specify ProbabilityFcn as "cdf".

    Example: kde(a,ProbabilityFcn="cdf")

    Support

    Interval for the sample data, specified as a two-element numeric vector, "unbounded", "positive", "nonnegative", or "negative". The elements of a must be in the interval specified by Support. The estimated probability function evaluates to 0 outside of the interval.

    If you specify Support as a two-element vector [L U] or [L;U], L must be less than min(a) and U must be greater than max(a). The interval is open with lower bound L and upper bound U.

    If you specify Support as a string, the sample data exists inside an interval described in this table.

    Value            Support
    "unbounded"      (-Inf,Inf)
    "positive"       (0,Inf)
    "nonnegative"    [0,Inf)
    "negative"       (-Inf,0)

    Example: kde(a,Support="nonnegative")

    Data Types: single | double | string | char
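
    As a small sketch of the two-element form, the code below checks that every observation lies strictly inside an open interval (L,U) before passing that interval to kde. The data and the bounds are made up for illustration, and the isValid check is a precaution added here, not something kde requires you to write.

```matlab
rng(0,"twister") % For reproducibility
a = rand(100,1);          % Uniform samples strictly inside (0,1)

% The open interval (L,U) must contain every observation.
L = 0;
U = 1;
isValid = (L < min(a)) && (U > max(a));

% With a valid interval, restrict the estimate to it
% (kde requires R2023b or later).
if isValid
    [f,xf] = kde(a,Support=[L U]);
end
```

    For data bounded on one side only, the string values in the table, such as Support="positive", avoid having to pick an artificial second bound.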

    Weight

    Observation weights, specified as a nonnegative vector. By default, kde weights all observations in a equally. For more information about how kde uses weights to estimate the probability function, see Kernel Distribution.

    Data Types: single | double
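
    The following sketch compares an equally weighted estimate with a weighted one. The weight values are hypothetical, and the code assumes that the weight vector has one entry per observation in a, which the description above does not state explicitly.

```matlab
rng(0,"twister") % For reproducibility
a = randn(100,1);

% Hypothetical weights: count each positive observation three times
% as heavily as each nonpositive one.
wgt = ones(size(a));
wgt(a > 0) = 3;

[fu,xfu] = kde(a);               % Equal weights (default)
[fw,xfw] = kde(a,Weight=wgt);    % Weighted estimate

plot(xfu,fu,"-",xfw,fw,"--")
legend("equal weights","weighted")
```

    With weights like these, you would expect the weighted curve to place more density to the right of zero than the default estimate does.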

    Output Arguments

    f

    Estimated function values, returned as a numeric vector. The length of f is equal to the number of evaluation points in xf.

    xf

    Evaluation points, returned as a numeric vector. xf has the same size as the EvaluationPoints name-value argument, if EvaluationPoints is specified. Otherwise, the size of xf is given by the NumPoints name-value argument.

    bw

    Bandwidth for the kernel smoothing function, returned as a positive scalar. You can use the Bandwidth name-value argument to specify the value for bw or the method for calculating bw.

    More About

    Kernel Distribution

    A kernel distribution is a nonparametric representation of a probability density function (pdf) of a random variable. You can use a kernel distribution when a parametric distribution cannot properly describe the data or when you want to avoid making assumptions about the distribution of the data. A kernel distribution is defined by a smoothing function and a bandwidth value, which control the smoothness of the resulting density curve.

    The kernel estimator is an estimated probability function for a random variable. For any real values of x, the kernel estimator for the pdf is given by

    $$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} w_i K\left(\frac{x - x_i}{h}\right),$$

    where the $x_i$ values are random samples from an unknown distribution, the $w_i$ values are their corresponding weights, n is the sample size, K is the kernel smoothing function, and h is the bandwidth.

    For any real values of x, the kernel estimator for the cumulative distribution function (cdf) is given by

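    $$\hat{F}_h(x) = \int_{-\infty}^{x} \hat{f}_h(t)\,dt = \frac{1}{n} \sum_{i=1}^{n} w_i G\left(\frac{x - x_i}{h}\right),$$

    where $G(x) = \int_{-\infty}^{x} K(t)\,dt$. The factor 1/h in the pdf estimator cancels when you integrate $K\left(\frac{t - x_i}{h}\right)$ with respect to t.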

    For more details, see Kernel Distribution (Statistics and Machine Learning Toolbox).
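
    The kernel estimators can be evaluated directly with base MATLAB functions. The sketch below is a hand-written illustration, not the implementation that kde uses. It assumes a normal kernel, unit weights, and a hypothetical bandwidth h = 0.5, and all variable names are made up for the example.

```matlab
rng(0,"twister") % For reproducibility
xi = randn(100,1);            % Samples x_i (column vector)
n = numel(xi);
w = ones(n,1);                % Unit weights (assumption)
h = 0.5;                      % Hypothetical bandwidth value

K = @(d) exp(-d.^2/2)/sqrt(2*pi);        % Normal kernel
G = @(d) 0.5*(1 + erf(d/sqrt(2)));       % Integral of K from -Inf to d

% Evaluation points (row vector) extending 4 bandwidths past the data.
x = linspace(min(xi) - 4*h,max(xi) + 4*h,400);
D = (x - xi)/h;                          % n-by-400 matrix of (x - x_i)/h

fhat = sum(w.*K(D),1)/(n*h);   % Kernel estimator for the pdf
% For the cdf, divide by n only: integrating K((t - x_i)/h) over t
% contributes a factor of h that cancels the 1/h in the pdf estimator.
Fhat = sum(w.*G(D),1)/n;

areaUnderPdf = trapz(x,fhat);            % Should be close to 1
maxGap = max(abs(cumtrapz(x,fhat) - (Fhat - Fhat(1))));
```

    Here areaUnderPdf is close to 1 and maxGap is small, which is a quick consistency check that the cdf estimator is the running integral of the pdf estimator.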

    References

    [1] Botev, Z. I., J. F. Grotowski, and D. P. Kroese. "Kernel Density Estimation via Diffusion." The Annals of Statistics 38, no. 5 (October 2010). https://projecteuclid.org/journals/annals-of-statistics/volume-38/issue-5/Kernel-density-estimation-via-diffusion/10.1214/10-AOS799.full

    [2] Bowman, A. W., and A. Azzalini. Applied Smoothing Techniques for Data Analysis. New York: Oxford University Press Inc., 1997.

    [3] Hill, P. D. "Kernel Estimation of a Distribution Function." Communications in Statistics - Theory and Methods 14, no. 3 (January 1985): 605–620.

    [4] Jones, M. C. "Simple Boundary Correction for Kernel Density Estimation." Statistics and Computing, no. 3 (September 1993): 135–146.

    [5] Silverman, B. W. Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC, 1986.

    Version History

    Introduced in R2023b
