# regstats: The design matrix has more predictor variables than observations.

King To Leung on 31 Jul 2022
Answered: Walter Roberson on 31 Jul 2022
I used the following code to run a regression, and the system shows:
Error using regstats (line 132)
The design matrix has more predictor variables than observations.
My code:
fm_betas=NaN(length(ud),4); % 4 columns for the constant term, size, bm, pe
for i=1:length(ud) % We run a regression for each time period
    tdata=data_crsp(data_crsp(:,c.date)==ud(i),:);
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    reg_results = regstats(tdata(:,c.fut_ret), [log(tdata(:,c.cap)), log(tdata(:,c.bm)), tdata(:,c.pe)], 'linear', {'beta'});
    fm_betas(i,:)=reg_results.beta';
end
mean(fm_betas)
% ud=unique(data_crsp(:,c.date)); % data_crsp is the data set
% I have checked that there are no infinite values in the data

dpb on 31 Jul 2022
The problem is NOT that there are NaN or Inf in the data (although that could also be a cause since they're treated as missing values). The problem is as the error message says: by the time you've selected the subset of data for one or more of your time periods, the resulting height(tdata) < 4, which is the number of coefficients you're trying to estimate (3 independent variables plus 1 intercept).
"You can't do that!" -- you'll have to only fit over periods that have at least that many points; it would be far better to have well more than that.
You'll have to dig into the data set and see where either your selection logic isn't doing what you think or find groupings that have sufficient data in them; we can't see the data...
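One way to start digging is to count how many rows each period has before fitting anything. This is only a sketch on made-up numbers: the matrix `data` and the variable names `grp`, `n_obs`, `n_coef` and `too_few` are stand-ins I invented, since the real data_crsp isn't visible here.

```matlab
% Hypothetical stand-in for data_crsp: column 1 = date, column 2 = any variable.
dates = [202201; 202201; 202201; 202201; 202201; 202202; 202202; 202203];
vals  = (1:8)';
data  = [dates, vals];

% grp maps every row to the index of its period in ud
[ud, ~, grp] = unique(data(:,1));
n_obs  = accumarray(grp(:), 1);   % number of rows in each period
n_coef = 4;                       % constant + 3 predictors

too_few = ud(n_obs < n_coef);     % periods that cannot support the fit
disp([ud, n_obs]);                % period next to its row count
```

With these made-up dates, the second and third periods end up in `too_few`. Rows that contain NaN would reduce the usable count further, so the real check should be done on the rows that survive cleaning.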

Walter Roberson on 31 Jul 2022
reg_results = regstats(tdata(:,c.fut_ret), [log(tdata(:,c.cap)), log(tdata(:,c.bm)), tdata(:,c.pe)], 'linear', {'beta'});
You are providing three predictor variables and one response variable, and you are requesting the 'linear' model, which adds a constant term. You are therefore trying to find four coefficients: the intercept plus one for each of the three variables. Your calculation is effectively
[ones(size(tdata,1),1), log(tdata(:,c.cap)), log(tdata(:,c.bm)), tdata(:,c.pe)] \ tdata(:,c.fut_ret)
In order to do that, you need at least four rows of input.
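To make the row-count requirement concrete, here is a small sketch with made-up numbers. It assumes, as I understand the documentation, that the 'linear' model in regstats includes a constant term, so the design matrix carries a column of ones next to the three predictors. The names `X`, `y`, `D` and `D_small` are hypothetical, and the backslash solve is only meant to mirror what regstats estimates for full-rank data, not to reproduce its internals.

```matlab
% Hypothetical data: 6 observations, 3 predictors.
X = [1 2 0; 2 1 1; 3 5 2; 4 3 1; 5 8 3; 6 4 5];
y = [1; 2; 2; 4; 3; 6];

% Design matrix of a linear model with intercept: constant column + predictors
D = [ones(size(X,1),1), X];
beta = D \ y;              % least-squares solution, one value per column of D

% Keeping only 3 rows leaves a 3-by-4 design matrix:
% more columns (coefficients) than rows (observations).
D_small = D(1:3,:);
disp(size(D_small));
```

A design matrix shaped like `D_small` is the situation the error message describes.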
tdata=data_crsp(data_crsp(:,c.date)==ud(i),:);
What happens if that test finds fewer rows than there are coefficients to estimate?
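One possible way to guard the loop is to skip any period that doesn't have enough usable rows and leave its betas as NaN. The sketch below runs on made-up data: the values in data_crsp, the column map `c`, and the helper names `n_coef`, `cols`, `ok` and `fitted` are all assumptions for illustration. It needs the Statistics and Machine Learning Toolbox for regstats.

```matlab
% Hypothetical data. Columns: date, future return, market cap, book-to-market, P/E.
c = struct('date',1,'fut_ret',2,'cap',3,'bm',4,'pe',5);
data_crsp = [ ...
    1  0.02  100  0.5   12; ...
    1  0.01  250  0.8   15; ...
    1 -0.03  400  1.2    9; ...
    1  0.05  150  0.3   20; ...
    1  0.00  900  0.9   11; ...
    1  0.04  300  0.6  NaN; ...   % missing P/E, dropped below
    2  0.01  120  0.7   14; ...
    2  0.02  500  1.1   10];

ud       = unique(data_crsp(:,c.date));
n_coef   = 4;                          % constant + size, bm, pe
fm_betas = NaN(length(ud), n_coef);
cols     = [c.fut_ret, c.cap, c.bm, c.pe];

for i = 1:length(ud)
    tdata = data_crsp(data_crsp(:,c.date)==ud(i), :);

    % Keep rows with finite values and strictly positive cap and bm,
    % since log() of zero or negative input is not usable here.
    ok = all(isfinite(tdata(:,cols)), 2) & tdata(:,c.cap) > 0 & tdata(:,c.bm) > 0;
    tdata = tdata(ok, :);

    % Not enough observations for 4 coefficients: leave this row as NaN.
    if size(tdata,1) < n_coef
        continue;
    end

    reg_results = regstats(tdata(:,c.fut_ret), ...
        [log(tdata(:,c.cap)), log(tdata(:,c.bm)), tdata(:,c.pe)], 'linear', {'beta'});
    fm_betas(i,:) = reg_results.beta';
end

% Average only over the periods that were actually fitted.
fitted = ~any(isnan(fm_betas), 2);
mean(fm_betas(fitted,:), 1)
```

The threshold of `n_coef` rows is the bare minimum. As noted above, a fit with exactly as many rows as coefficients is not very meaningful, so a higher cutoff is probably the better choice for real data.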