## Naive Bayes Classification

The naive Bayes classifier is designed for use when predictors are independent of one another within each class, but it appears to work well in practice even when that independence assumption is not valid. It classifies data in two steps:

1. Training step: Using the training data, the method estimates the parameters of a probability distribution, assuming predictors are conditionally independent given the class.

2. Prediction step: For any unseen test data, the method computes the posterior probability of that sample belonging to each class. The method then classifies the test data according to the largest posterior probability.

The class-conditional independence assumption greatly simplifies the training step, since you can estimate the one-dimensional class-conditional density for each predictor individually. Although class-conditional independence between predictors does not hold in general, research shows that this optimistic assumption works well in practice. The assumption also allows the naive Bayes classifier to estimate the parameters required for accurate classification using less training data than many other classifiers, which makes it particularly effective for data sets containing many predictors.
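To make the role of the independence assumption concrete, the posterior used in the prediction step can be written as a product of one-dimensional terms. The notation here is introduced for illustration only and is not taken from the software: $$\pi_k$$ denotes the prior probability of class *k*, $$x_d$$ the value of predictor *d*, *D* the number of predictors, and *K* the number of classes.

$$
\hat{P}(Y = k \mid x_1, \dots, x_D) = \frac{\pi_k \prod_{d=1}^{D} P(x_d \mid Y = k)}{\sum_{c=1}^{K} \pi_c \prod_{d=1}^{D} P(x_d \mid Y = c)}
$$

Each factor $$P(x_d \mid Y = k)$$ depends on a single predictor, which is why the training step can estimate the class-conditional densities one predictor at a time.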

### Supported Distributions

The training step in naive Bayes classification is based on estimating *P*(*X*|*Y*), the probability or probability density of predictors `X` given class `Y`. The naive Bayes classification model `ClassificationNaiveBayes` and training function `fitcnb` provide support for normal (Gaussian), kernel, multinomial, and multivariate multinomial predictor conditional distributions. To specify distributions for the predictors, use the `DistributionNames` name-value pair argument of `fitcnb`. You can specify one type of distribution for all predictors by supplying the character vector or string scalar corresponding to the distribution name, or specify different distributions for the predictors by supplying a length *D* string array or cell array of character vectors, where *D* is the number of predictors (that is, the number of columns of *X*).

#### Normal (Gaussian) Distribution

The normal distribution (specify using `'normal'`) is appropriate for predictors that have normal distributions in each class. For each predictor you model with a normal distribution, the naive Bayes classifier estimates a separate normal distribution for each class by computing the mean and standard deviation of the training data in that class.
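As a rough illustration of the per-class estimation described above, the following plain-Python sketch fits a mean and standard deviation to one predictor within each class. It is not the `fitcnb` implementation; the helper names `fit_normal` and `normal_pdf`, the toy data, and the use of the sample (n − 1) standard deviation are assumptions made for this example.

```python
import math
from collections import defaultdict

def fit_normal(values, labels):
    """Estimate a (mean, std) pair per class for a single predictor."""
    by_class = defaultdict(list)
    for v, y in zip(values, labels):
        by_class[y].append(v)
    params = {}
    for y, vs in by_class.items():
        mean = sum(vs) / len(vs)
        # Sample variance with an n - 1 denominator (an assumption of this
        # sketch); each class therefore needs at least two observations.
        var = sum((v - mean) ** 2 for v in vs) / (len(vs) - 1)
        params[y] = (mean, math.sqrt(var))
    return params

def normal_pdf(x, mean, std):
    """Normal density, used as the class-conditional density of the predictor."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Toy data: one predictor, two classes.
values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = [0, 0, 0, 1, 1, 1]
params = fit_normal(values, labels)
print(params)  # {0: (2.0, 1.0), 1: (8.0, 1.0)}
```

In a full classifier, the densities `normal_pdf(x, *params[k])` for every predictor would be multiplied together with the class prior to form the posterior for class `k`.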

#### Kernel Distribution

The kernel distribution (specify using `'kernel'`) is appropriate for predictors that have a continuous distribution. It does not require a strong assumption such as a normal distribution, and you can use it in cases where the distribution of a predictor may be skewed or have multiple peaks or modes. It requires more computing time and more memory than the normal distribution. For each predictor you model with a kernel distribution, the naive Bayes classifier computes a separate kernel density estimate for each class based on the training data for that class. By default, the kernel is the normal kernel, and the classifier selects a width automatically for each class and predictor. The software supports specifying different kernels for each predictor, and different widths for each predictor or class.
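The idea of a per-class kernel density estimate can be sketched in a few lines of plain Python. This is only an illustration with a normal kernel: the function name `kernel_density` is hypothetical, and the width is hand-picked here, whereas the software selects one automatically by a rule this section does not describe.

```python
import math

def kernel_density(x, samples, width):
    """Normal-kernel density estimate at x from one class's training values."""
    total = 0.0
    for s in samples:
        z = (x - s) / width
        total += math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    # Average the kernels and rescale by the width so the estimate integrates to 1.
    return total / (len(samples) * width)

# Training values of one predictor within one class, with two modes.
class_samples = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9]
width = 0.5  # chosen by hand for illustration; not the software's automatic choice

near_mode = kernel_density(1.0, class_samples, width)
between_modes = kernel_density(3.0, class_samples, width)
print(near_mode > between_modes)  # True: the estimate follows both peaks
```

A single normal distribution fitted to the same values would place most of its mass between the two modes, which is the situation where a kernel estimate is worth its extra computing time and memory.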

#### Multivariate Multinomial Distribution

The multivariate multinomial distribution (specify using `'mvmn'`) is appropriate for a predictor whose observations are categorical. Naive Bayes classifier construction using a multivariate multinomial predictor is described below. To illustrate the steps, consider an example where observations are labeled 0, 1, or 2, and a predictor is the weather when the sample was conducted.

1. Record the distinct categories represented in the observations of the entire predictor. For example, the distinct categories (or predictor levels) might include sunny, rain, snow, and cloudy.

2. Separate the observations by response class. For example, segregate observations labeled 0 from observations labeled 1 and 2, and observations labeled 1 from observations labeled 2.

3. For each response class, fit a multinomial model using the category relative frequencies and total number of observations. For example, for observations labeled 0, the estimated probability it was sunny is $$p_{\text{sunny}|0}$$ = (number of sunny observations with label 0)/(number of observations with label 0), and similarly for the other categories and response labels.

The class-conditional, multinomial random variables comprise a multivariate multinomial random variable.
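The three steps above can be sketched in plain Python as follows. The function name `fit_mvmn` and the toy weather data are made up for illustration, and the sketch uses plain relative frequencies exactly as in the text, so a level never seen in a class gets probability zero; it is not a description of what `fitcnb` does internally.

```python
from collections import Counter, defaultdict

def fit_mvmn(observed_levels, labels):
    """Estimate per-class level probabilities for one categorical predictor."""
    # Step 1: record the distinct levels over the entire predictor.
    levels = sorted(set(observed_levels))
    # Step 2: separate the observations by response class.
    by_class = defaultdict(list)
    for level, y in zip(observed_levels, labels):
        by_class[y].append(level)
    # Step 3: relative frequency of each level within each class.
    probs = {}
    for y, obs in by_class.items():
        counts = Counter(obs)
        probs[y] = {level: counts[level] / len(obs) for level in levels}
    return levels, probs

# Toy data: weather at sampling time, with response labels 0, 1, or 2.
weather = ["sunny", "rain", "sunny", "snow", "cloudy", "rain", "sunny", "cloudy"]
labels = [0, 0, 0, 1, 1, 1, 2, 2]
levels, probs = fit_mvmn(weather, labels)
print(levels)             # ['cloudy', 'rain', 'snow', 'sunny']
print(probs[0]["sunny"])  # 2 sunny observations out of 3 with label 0
```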

Here are some other properties of naive Bayes classifiers that use multivariate multinomial distributions.

- For each predictor you model with a multivariate multinomial distribution, the naive Bayes classifier:
  - Records a separate set of distinct predictor levels for each predictor.
  - Computes a separate set of probabilities for the set of predictor levels for each class.
- The software supports modeling continuous predictors as multivariate multinomial. In this case, the predictor levels are the distinct occurrences of a measurement. This can lead to a predictor having many predictor levels. It is good practice to discretize such predictors.

If an *observation* is a set of successes for various
categories (represented by all of the predictors) out of a fixed number of
independent trials, then specify that the predictors comprise a multinomial
distribution. For details, see Multinomial Distribution.

#### Multinomial Distribution

The multinomial distribution (specify using `'DistributionNames','mn'`) is appropriate when, given the class, each *observation* is a multinomial random variable. That is, observation, or row, *j* of the predictor data *X* represents *D* categories, where $$x_{jd}$$ is the number of successes for category (i.e., predictor) *d* in $$n_j = \sum_{d=1}^{D} x_{jd}$$ independent trials. The steps to train a naive Bayes classifier are outlined next.

1. For each class, fit a multinomial distribution for the predictors given the class by:
   1. Aggregating the weighted category counts over all observations. Additionally, the software implements additive smoothing [1].
   2. Estimating the *D* category probabilities within each class using the aggregated category counts. These category probabilities compose the probability parameters of the multinomial distribution.

2. Let a new observation have a total count of *m*. Then, the naive Bayes classifier:
   1. Sets the total count parameter of each multinomial distribution to *m*.
   2. For each class, estimates the class posterior probability using the estimated multinomial distributions.
   3. Predicts the observation into the class corresponding to the highest posterior probability.
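A minimal plain-Python sketch of these training and prediction steps follows. Several details are assumptions made for illustration rather than facts about the software: every observation has unit weight, the smoothing constant `alpha` is 1, class priors are the class relative frequencies, and the names `fit_multinomial` and `predict_multinomial` are hypothetical.

```python
import math

def fit_multinomial(X, y, alpha=1.0):
    """X: list of count vectors; y: class labels. Returns priors and category probabilities."""
    D = len(X[0])
    priors, probs = {}, {}
    for k in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == k]
        priors[k] = len(rows) / len(X)
        # Aggregate the category counts over the class (unit weights assumed).
        totals = [sum(r[d] for r in rows) for d in range(D)]
        # Additive smoothing: add alpha to every category before normalizing.
        denom = sum(totals) + alpha * D
        probs[k] = [(t + alpha) / denom for t in totals]
    return priors, probs

def predict_multinomial(x, priors, probs):
    """Return the predicted class and the class posteriors for count vector x."""
    # The multinomial coefficient for total count m = sum(x) is identical for
    # every class, so it cancels in the posterior and is omitted here.
    scores = {k: math.log(priors[k]) + sum(c * math.log(p) for c, p in zip(x, probs[k]))
              for k in priors}
    top = max(scores.values())
    unnorm = {k: math.exp(s - top) for k, s in scores.items()}
    z = sum(unnorm.values())
    post = {k: v / z for k, v in unnorm.items()}
    return max(post, key=post.get), post

# Toy count data with D = 3 categories.
X = [[3, 0, 1], [2, 1, 0], [0, 4, 1], [1, 3, 2]]
y = ["ham", "ham", "spam", "spam"]
priors, probs = fit_multinomial(X, y)
label, post = predict_multinomial([0, 3, 1], priors, probs)
print(label)  # spam
```

Working with log probabilities avoids numerical underflow when the counts are large; the posteriors are recovered by exponentiating and normalizing at the end.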

Consider the so-called bag-of-tokens model, where there is a bag containing a number of tokens of various types and proportions. Each predictor represents a distinct type of token in the bag, an observation is *n* independent draws (i.e., with replacement) of tokens from the bag, and the data is a vector of counts, where element *d* is the number of times token *d* appears.

A machine-learning application is the construction of an email spam classifier, where each predictor represents a word, character, or phrase (i.e., token), an observation is an email, and the data are counts of the tokens in the email. One predictor might count the number of exclamation points, another might count the number of times the word "money" appears, and another might count the number of times the recipient's name appears. This is a naive Bayes model under the further assumption that the total number of tokens (or the total document length) is independent of response class.
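To show what one observation looks like in this setting, the plain-Python sketch below turns an email into a vector of token counts. The vocabulary, the sample email, and the simple tokenization rule (lowercase words plus the exclamation point as its own token) are all hypothetical choices made for this example.

```python
import re
from collections import Counter

def count_tokens(text, vocabulary):
    """Return the count vector for text, one element per token in vocabulary."""
    # Lowercase words (letters and apostrophes) and "!" are treated as tokens.
    tokens = re.findall(r"[a-z']+|!", text.lower())
    counts = Counter(tokens)
    return [counts[token] for token in vocabulary]

# Hypothetical predictors: exclamation points, the word "money", the recipient's name.
vocabulary = ["!", "money", "alice"]
email = "Alice, claim your money now! Free money!"
x = count_tokens(email, vocabulary)
print(x)       # [2, 2, 1]
print(sum(x))  # 5, the total count over the three predictors
```

Each email becomes one row of the predictor data, and the row sum is the total count for that observation.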

Other properties of naive Bayes classifiers that use multinomial observations include:

- Classification is based on the relative frequencies of the categories. If $$n_j = 0$$ for observation *j*, then classification is not possible for that observation.
- The predictors are not conditionally independent since they must sum to $$n_j$$.
- Naive Bayes is not appropriate when $$n_j$$ provides information about the class. That is, this classifier requires that $$n_j$$ is independent of the class.
- If you specify that the predictors are conditionally multinomial, then the software applies this specification to all predictors. In other words, you cannot include `'mn'` in a cell array when specifying `'DistributionNames'`.

If a *predictor* is categorical, i.e., is multinomial
within a response class, then specify that it is multivariate multinomial. For
details, see Multivariate Multinomial Distribution.

## References

[1] Manning, C. D., P. Raghavan, and M. Schütze.
*Introduction to Information Retrieval*, NY: Cambridge
University Press, 2008.