Large, high-dimensional data sets are common in the modern era of computer-based instrumentation and electronic data storage. High-dimensional data present many challenges for statistical visualization, analysis, and modeling.

Direct data visualization, of course, is impossible beyond a few dimensions. As a result, pattern recognition, data preprocessing, and model selection must rely heavily on numerical methods.

A fundamental challenge in high-dimensional data analysis is the so-called *curse of dimensionality*. Observations in a
high-dimensional space are necessarily sparser and less representative than those in
a low-dimensional space. In higher dimensions, data over-represent the edges of a
sampling distribution, because regions of higher-dimensional space contain the
majority of their volume near the surface. (A *d*-dimensional
spherical shell has a volume, relative to the total volume of the sphere, that
approaches 1 as *d* approaches infinity.) In high dimensions,
typical data points at the interior of a distribution are sampled less
frequently.
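The shell-volume claim is easy to verify numerically: the fraction of a unit *d*-ball's volume lying within a shell of relative thickness ε of the surface is 1 − (1 − ε)<sup>*d*</sup>, which already exceeds 99% for *d* = 100 when ε = 0.05. A minimal sketch:

```matlab
% Fraction of a unit d-ball's volume within 5% of its surface:
% 1 - (1 - 0.05)^d, which approaches 1 as d grows.
thickness = 0.05;
for d = [2 10 50 100]
    fprintf('d = %3d: outer-shell volume fraction = %.3f\n', ...
        d, 1 - (1 - thickness)^d);
end
```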

Often, many of the dimensions in a data set—the measured features—are not useful in producing a model. Features may be irrelevant or redundant. Regression and classification algorithms may require large amounts of storage and computation time to process raw data, and even if the algorithms are successful the resulting models may contain an incomprehensible number of terms.

Because of these challenges, multivariate statistical methods often begin with
some type of *dimension reduction*, in which data are
approximated by points in a lower-dimensional space. Dimension reduction is the goal
of the methods presented in this chapter. Dimension reduction often leads to simpler
models and fewer measured variables, with consequent benefits when measurements are
expensive and visualization is important.

The multivariate linear regression model expresses a *d*-dimensional
continuous response vector as a linear combination of predictor terms
plus a vector of error terms with a multivariate normal distribution.
Let $${y}_{i}={\left({y}_{i1},\dots ,{y}_{id}\right)}^{\prime}$$ denote the response vector for
observation *i*, *i* = 1,...,*n*.
In the most general case, given the *d*-by-*K* design
matrix $${X}_{i}$$ and the *K*-by-1
vector of coefficients $$\beta$$, the multivariate
linear regression model is

$${y}_{i}={X}_{i}\beta +{\epsilon}_{i},$$

where the *d*-dimensional
vector of error terms follows a multivariate normal distribution,

$${\epsilon}_{i}\sim MV{N}_{d}\left(0,\Sigma \right).$$

The model assumes independence between observations,
meaning the error variance-covariance matrix for the *n* stacked *d*-dimensional
response vectors is

$${I}_{n}\otimes \Sigma =\left(\begin{array}{ccc}\Sigma & & 0\\ & \ddots & \\ 0& & \Sigma \end{array}\right).$$
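This block-diagonal covariance is straightforward to construct with a Kronecker product; a small sketch with illustrative values for $$\Sigma$$:

```matlab
% Block-diagonal error covariance I_n (x) Sigma for n = 3 observations
% and a d = 2 dimensional response (values illustrative).
Sigma = [1.0 0.3; 0.3 2.0];   % d-by-d error covariance
n = 3;
V = kron(eye(n), Sigma);      % nd-by-nd matrix with Sigma on the diagonal blocks
```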

If $$y$$ denotes
the *nd*-by-1 vector of stacked *d*-dimensional
responses, and $$X$$ denotes the *nd*-by-*K* matrix
of stacked design matrices, then the distribution of the response
vector is

$$y\sim MV{N}_{nd}(X\beta ,{I}_{n}\otimes \Sigma ).$$

To fit multivariate linear regression models of the form

$${y}_{i}={X}_{i}\beta +{\epsilon}_{i},\text{\hspace{0.17em}}\text{\hspace{0.17em}}\text{\hspace{0.17em}}{\epsilon}_{i}\sim MV{N}_{d}(0,\Sigma )$$

in Statistics and Machine
Learning Toolbox™, use `mvregress`.
This function fits multivariate regression models with a diagonal
(heteroscedastic) or unstructured (heteroscedastic and correlated)
error variance-covariance matrix, $$\Sigma$$, using
least squares or maximum likelihood estimation.
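As a hedged sketch of the cell-array calling form, the following simulates a bivariate response and fits it with `mvregress`, giving each response dimension its own intercept and slope (all data and dimensions here are illustrative, not from the source):

```matlab
% Simulated data: d = 2 responses, K = 4 coefficients
% (an intercept and a slope per dimension).
rng(0);                                  % reproducibility
n = 100; d = 2;
x = randn(n,1);
Y = [1 + 2*x, -1 + 0.5*x] + randn(n,d)*chol([1 0.3; 0.3 1]);
% One d-by-K design matrix per observation: a block design so each
% response dimension gets its own intercept and slope.
X = cell(n,1);
for i = 1:n
    X{i} = kron(eye(d), [1 x(i)]);       % 2-by-4 design matrix
end
[beta, Sigma] = mvregress(X, Y);         % beta is 4-by-1, Sigma is 2-by-2
```

With this design, the estimated `beta` should be close to the generating coefficients [1; 2; −1; 0.5].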

Many variations of multivariate regression might not initially
appear to be of the form supported by `mvregress`, such as:

- Multivariate general linear model
- Multivariate analysis of variance (MANOVA)
- Longitudinal analysis
- Panel data analysis
- Seemingly unrelated regression (SUR)
- Vector autoregressive (VAR) model

In many cases, you can frame these problems in the
form used by `mvregress` (but `mvregress` does not support
parameterized error variance-covariance matrices). For
the special case of one-way MANOVA, you can alternatively use `manova1`.
Econometrics Toolbox™ has functions for VAR estimation.
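For the one-way MANOVA case, a minimal `manova1` sketch on simulated data (group sizes and the mean shift are illustrative):

```matlab
% One-way MANOVA with manova1: 3 measured variables,
% two groups differing by a mean shift of 0.5.
rng(1);                                  % reproducibility
X = [randn(20,3); randn(20,3) + 0.5];
group = [ones(20,1); 2*ones(20,1)];
[dim, p] = manova1(X, group);            % dim: dimension of the group-mean space
```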

The multivariate linear regression model is distinct from the multiple linear
regression model, which models a *univariate* continuous
response as a linear combination of exogenous terms plus an independent and
identically distributed error term. To fit a multiple linear regression model,
use `fitlm`.
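For contrast with the multivariate case above, a minimal `fitlm` sketch for a univariate response (data simulated for illustration):

```matlab
% Multiple linear regression (univariate response) with fitlm.
rng(0);                          % reproducibility
X = randn(50,2);
y = 3 + X*[1.5; -2] + 0.1*randn(50,1);
mdl = fitlm(X, y);               % includes an intercept by default
disp(mdl.Coefficients)           % estimates should be near [3; 1.5; -2]
```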

`fitlm` | `manova1` | `mvregress` | `mvregresslike`

- Set Up Multivariate Regression Problems
- Multivariate General Linear Model
- Fixed Effects Panel Model with Concurrent Correlation
- Longitudinal Analysis