OLS regression with missing values within the Y variable

4 views (last 30 days)
Hello Matlab community, i have a problem regarding making a OLS regression where my y values have missing datapoints. I am currently writing my masters and are doing a research on "Active ETFs". I have gathered a long list of active ETFs over the past decade, due to somee ETFs being "dead" and having different start times i have missing datapoints.
Due to my data being larger than 5mb i cant insert it.
I talked with my superevisor which said i should use a if statement so i could make the OLS only use datapoints with a value. But everything i have tried so far has failed and i am therefore seeking help here.
This is the code i currently have:
% Housekeeping
clear;
clc;
addpath 'Matlab codes'\
% Loading data for the period 2013:01 to 2023:01
RiskFreeRate = readmatrix('RiskFree.xlsx','Range',[3 3 6 121]);
Geo = readtable('Data.xlsx','range', [1 2 1112 2]);
Active_ETFS = readtable('Data.xlsx','range',[2 15 1112 133]);
Benchmark=readmatrix('Benchmarkss.xlsx', 'range', [1 2 5 120]);
famaFrenchFactorsUS = readmatrix('fama_french_5_factor.xlsx','range',[6 1 124 7]);
% Picking out the factors
excessMarketReturn = famaFrenchFactorsUS(:,2)';
SMBUS = famaFrenchFactorsUS(:,3)';
HMLUS = famaFrenchFactorsUS(:,4)';
RMWUS = famaFrenchFactorsUS(:,5)';
CMAUS = famaFrenchFactorsUS(:,6)';
RFUS = famaFrenchFactorsUS(:,7)';
famaFrenchFactorsASIA = readmatrix("Asia_Pacific_ex_Japan_5_Factors.csv",'range',[6 1 124 7]);
% Picking out the factors
excessMarketReturn = famaFrenchFactorsASIA(:,2)';
SMBASIA = famaFrenchFactorsASIA(:,3)';
HMLASIA = famaFrenchFactorsASIA(:,4)';
RMWASIA = famaFrenchFactorsASIA(:,5)';
CMAASIA = famaFrenchFactorsASIA(:,6)';
RFASIA = famaFrenchFactorsASIA(:,7)';
famaFrenchFactorsDev = readmatrix("Developed_5_Factors.csv",'range',[6 1 124 7]);
% Picking out the factors
excessMarketReturn = famaFrenchFactorsDev(:,2)';
SMBDev = famaFrenchFactorsDev(:,3)';
HMLDev = famaFrenchFactorsDev(:,4)';
RMWDev = famaFrenchFactorsDev(:,5)';
CMADev = famaFrenchFactorsDev(:,6)';
RFDev = famaFrenchFactorsDev(:,7)';
famaFrenchFactorsEU = readmatrix("Europe_5_Factors.csv",'range',[6 1 124 7]);
% Picking out the factors
excessMarketReturn = famaFrenchFactorsEU(:,2)';
SMBDev = famaFrenchFactorsEU(:,3)';
HMLDev = famaFrenchFactorsEU(:,4)';
RMWDev = famaFrenchFactorsEU(:,5)';
CMADev = famaFrenchFactorsEU(:,6)';
RFDev = famaFrenchFactorsEU(:,7)';
% All risk free rates
US = RiskFreeRate(1,:);
Euro = RiskFreeRate(2,:);
AsiaPasific = RiskFreeRate(3,:);
%Adding the geographical focus to the ETFs
GeoETF=[Geo Active_ETFS];
GeoETF=table2cell(GeoETF);
Benchmark %I have not done this for the rest of the benchmarks yet
SP500=Benchmark(2,:)-US;
% Create a structure to store the ETF data for each geographical focus
etf_data = struct('AsiaPasific', [], 'Europe', [], 'Global', [], 'USA', []);
% Loop through the ETF matrix and assign each ETF to its corresponding geographical focus
for i = 1:size(GeoETF, 1)
geo_focus = GeoETF{i, 1};
etf_returns = GeoETF(i, 2:end);
% Assign the ETF returns to the corresponding geographical focus in the structure
etf_data.(geo_focus) = [etf_data.(geo_focus); etf_returns];
end
USAmat = cell2mat(etf_data.USA);
APmat = cell2mat(etf_data.AsiaPasific);
Europemat = cell2mat(etf_data.Europe);
Globalmat = cell2mat(etf_data.Global);
USANetReturn = USAmat - US;
EuropeNetReturn = Europemat - Euro;
APNetReturn = APmat - AsiaPasific;
GlobalNetReturn = Globalmat - US;

Answers (1)

Drew
Drew on 17 Apr 2023
Edited: Drew on 18 Apr 2023
Additional info based on the comments below:
It looks like you have a timeseries of ETF data. If you want to model it as a timeseries and do forecasting, you may find it helpful to follow this example of time series forecasting using an ensemble of boosted regression trees: https://www.mathworks.com/help/stats/time-series-forecasting-using-ensemble-of-boosted-regression-trees.html
For a broader look at options for financial modeling, you could consider other tools in the MATLAB computational finance suite (https://www.mathworks.com/solutions/computational-finance/computational-finance-suite.html) including the financial modeling and econometrics toolboxes. The computational finance suite includes tools to "perform time-series analysis and create predictive models."
For regression problems, it is true that the "Response variable must be a numeric vector". The first step for any simple regression modeling is to prepare your data, as described at https://www.mathworks.com/help/stats/linear-regression-model-workflow.html . Once you prepare the data, you can answer these questions:
  • How many observations are there in the data set? (Reference the example https://www.mathworks.com/help/stats/time-series-forecasting-using-ensemble-of-boosted-regression-trees.html for one way of creating a table of observations-with-lag from a timeseries of measurements, if that is something that you want to do with your data.)
  • How many predictor variables are there? What is the datatype of each predictor?
  • What is your response variable? That is, what are you trying to predict? What is the datatype of the response variable? How many unique response variable values are there in the dataset?
First answer:
If your dataset has many missing values, consider using a different type of regression model which has better support for missing values. For example, regression trees built with fitrtree (see https://www.mathworks.com/help/stats/fitrtree.html) or an ensemble of trees built with fitrensemble (https://www.mathworks.com/help/stats/fitrensemble.html) will make use of observations that have some missing values in the predictors. These specific sections of the fitrtree doc mention the handling of missing values in the input predictor data X and the response data Y:
Ordinary Least Squares (OLS) fitting can be done with the fitlm function (https://www.mathworks.com/help/stats/fitlm.html). The fitlm function will not use observations that contain missing values. So, if many or most of your observations include missing values, this is probably not the best choice of model type for your dataset. As mentioned earlier, consider using tree models. You could also consider methods of predictor imputation prior to building an OLS model.
If this answer is helpful for you, please remember to accept this answer.
  3 Comments
Emil Petersen
Emil Petersen on 18 Apr 2023
All my ETFs has a full line of data, meaning that i have data from the beginning of the ETF to either the end of the period or untill it is dead.
Emil Petersen
Emil Petersen on 18 Apr 2023
To answer the questions
  • There are 1119 different ETFs where a large amount of them wasnt created in the first time period. There are 119 different periods all reflecting a 1 months excess return.
  • I am gonna use the 5 Fama/french factors to obtain the pereformance of these ETFs, which will be my base for a descriptive analysis regarding thee differencese in performance of Active ETFs and Mutual funds. The analysis should be closely like "Another Puzzle: The Growth in Actively Managed Mutual Funds" by Marting J. Gruber. https://www.jstor.org/stable/2329222
  • Thee different ETFs are set up into 4 differeent geographical groups, where the largest group is around 780 individual ETFs. The purpose is to measure the performance against relevant benchmarks and be able to tell whether it has outperformed mutual funds of equal geographical focuses.
I am able to conduct the OLS by using Fitlm if im using ETFs which has datapoints in every pereiod. If i set up the data so there arent any NaN values i will be using the wrong benchmark datapoints for the regression.
Thank you so much for trying to help me!!

Sign in to comment.

Products


Release

R2022b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!