Choice of data structure - cellarray or dataset

I have data of the structure 3 columns by 1.5 million rows. Each column is a variable namely cocode, date, price. I would like to extract /group data from the above data structure based on code and date and perform operations on it or ideally create a time series object for each cocode for a specific time period. Which data type would be the best to use for such data ? "cellArray" or "dataset" or any other ?(I am familiar with reading from excel and dumping the data into either of these) Also what is the ideal method to query the data set based on cocode and date without having to use nested loops ? Is it ok to use a for loop to gather all data or use "find" and get the indices of cocode equal to distinct value of cocode in my data(run in a loop) and use the same indices to access other data corresponding to it? Also I would want to query again based on dates from the above resultset and then perform my analysis on it. If I can group values using "accumarray" how will I pass the indices and can I return input "cellarray" to it ? So bottomline I want to group by cocode and then by dates and have the data , may be in a time series object.

 Accepted Answer

  • Store it as double matrix. You need to convert the dates, if in string format, to serial dates with datenum.
  • Use logical indexing (no need of FOR loops)
  • If you want to group data by date and id you may want to give a look at Pivot/UnPivot which I specifically use to work with panels of time series
EDIT
How data can be organized:
% 1. Double matrix
Amat = [repmat((now-2:now).',3,1),...
reshape(repmat(1:3,3,1),[],1),...
rand(9,1)];
% 2. Cell
Acel = [cellstr(datestr(Amat(:,1),'dd/mm/yyyy')),...
num2cell(Amat(:,2)),...
num2cell(Amat(:,3))];
% 3. Dataset
Adat = dataset({Acel(:,1),'Date'},{Amat(:,2:3), 'Cocode','Rend'});
Memory used
whos
Name Size Bytes Class
Acel 9x3 1944 cell --> 1.5e6x3 ~ 308 mb
Adat 9x3 2114 dataset --> 1.5e6x3 ~ 336 mb
Amat 9x3 216 double --> 1.5e6x3 ~ 34 mb
Convert datestrings in serial dates for dataset and cell to save memory
Acel(:,1) = num2cell(datenum(Acel(:,1),'dd/mm/yyyy'));
Adat.Date = datenum(Adat.Date,'dd/mm/yyyy');
whos
Name Size Bytes Class
Acel 9x3 1836 cell
Adat 9x3 1466 dataset
Pros/cons:
  • doule matrix: save on memory --> faster; cannot mix datatypes, so each date should be converted into serial numbers and doesn't have grpstats functionalities as the dataset class but should use accumarray or my Pivot/unPivot.
  • dataset: easier to use and more friendly behavior (grpstats, stack/unstack etc...), can mix datatypes; consumes a LOT of memory
  • cell array: can mix datatypes and consumes less memory than dataset; 'harder' to work with and still uses a LOT of memory
How to group data
Pivot(Amat(:,[2,1,3]))
NaN 1 2 3
7.3457e+005 0.54701 0.18896 0.36848
7.3457e+005 0.29632 0.68678 0.62562
7.3457e+005 0.74469 0.18351 0.78023
Pivot(Acel(:,[2,1,3]))
[ NaN] [ 1] [ 2] [ 3]
'06/03/2011' [0.54701] [0.18896] [0.36848]
'07/03/2011' [0.29632] [0.68678] [0.62562]
'08/03/2011' [0.74469] [0.18351] [0.78023]
unstack(Adat,'Rend','Cocode','AggregationFun',@sum)
Date x1 x2 x3
'06/03/2011' 0.54701 0.18896 0.36848
'07/03/2011' 0.29632 0.68678 0.62562
'08/03/2011' 0.74469 0.18351 0.78023
Oleg

3 Comments

Sir,
I am not sure I understand,pardon my ignorance.
1. Will I be allowed to store a character string (code) date and a double precision as part of the same matrix ? i.e am I allowed to store mixed data types in the same matrix. That brings me back to the question should I use cellaray or dataset.
2. I don't follow how logical indexing will help me. Could you help with an example. I have data in which col1(code) repeats itself and for each code I have a date and corresponding value. So basically I need to go over the entire data gathering together the rows that have the same code in matrix/array or the like. And after doing so, I have to go thru each of these arrays to further group them by weeks/months. And then perform my functions on each of these.
Could you please help me with a more descriptive answer.
Thank You
The only kind of array that can store strings and numbers in the same array, is a cell array.
what about structures and dataset arrays?

Sign in to comment.

More Answers (1)

In addition to what Oleg suggested, does "code" store a small number of possible values, or are they all distinct (for 1.5 million rows)? In the former case, you might want to use a nominal array.
If you have only three columns, why not store them as separate variables? But if you really want to keep them together, either a 1-by-3 cell array or a dataset array would be the most natural, and least memory-intensive (other than numeric, of course).

5 Comments

My data format goes this way,
1. [code, date, price]. There are more than 2000 rows which have the same code. So the answer is, the code repeats. And for each code there is a date. So the code and the date combination is distinct, not just the code.
2. The reason why I'd like them together and not as seperate variables is that for each code, the date and itz value are more like a key value pair. I would like to pull out data for each code based on dates to perform my function.
3. If I opt for a cellarray (columns 1 and 3 are numeric, 2 is string although it is a date) or dataset I am not sure how should I iterate over 1.5 million rows to put have about 500 arrays (number of dictinct codes) each having their own date,value pair. ANd next step would be to go over each of this to categorize them by dates before I run my function. I am not sure if I can accomplish this if I have them as seperate variables for the fear mismatching dates and values.
I'm a bit confused. In 1 you said "the code and the date combination is distinct", but what you said in 2 & 3 sounds like you want to extract each code and each date, then do something. But wouldn't that result in a single value for price?
It sounds quite like using grpstats would do what you want, but I'm not sure. Are you saying that you have a set of codes and a set of dates, and for each unique code-date pair, you want to extract the associated prices and perform some analysis on it? If so, what analysis? Is it something that should return a single value (for each code-date pair)? Or is the date extraction something more complex (like: for each code, divide up the prices by week, based on the date)?
Itz like company code, and itz corresponding stock price for each date. I group all data for each company (i.e) each code and then I want to operate on monthly / weekly datasets of each company. I have 500 and odd distinct companies and itz stock price for several dates making it 1.5 rows and 3 columns.
Give a look at my edited answer.
OK, I'd recommend a nominal array for the company code, then combine everything together in a cell or dataset array. If you go with cells, store the three arrays in a 1-by-3 cell array. I'd also recommend what Oleg suggested about converting the dates to date numbers. You'll most likely need that anyway, plus you can store a double array instead of a cell array of strings.
To do the analysis, see if you can write a function that will return what you need for a single company/code. Then you can use grpstats to do all companies, by specifying code as your grouping variable and your function as the stat to calculate.
Inside your function, use logical indexing to extract the dates you want. You may have to do the analysis in a loop. Look at the various time/date functions in MATLAB to help here. In particular, using date vectors (then datenum to convert to serial date numbers) is a good way to deal with months. For weeks, you can always just add multiples of 7 to a date number.
Logical indexing works something like this: suppose you have dates stored as date numbers in an array "date" and you know the start and end dates, t1and t2. Then
x = price(date>=t1 & date<=t2);
extracts the prices for the given range of dates.

Sign in to comment.

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!