data management for large datasets

Question

Sara on 16 Feb 2012

Direct link to this question

https://se.mathworks.com/matlabcentral/answers/29335-data-management-for-large-datasets

Hi,

I will have two sets of field data -- one taken for six weeks last year, and another taken for two months this year. For EACH dataset I have variables collected from 4-8 different sources, for (up to) 50 days, collected at up to 3 different sites. Both datasets, in their entirety, span about 200 to 300 columns and between 8,000 and 15,000 rows.

Within that, I'm trying to figure out how to set up my code for analysing both sets of data. I want to do several different things:

Analyse the data from each source separately to check for errors
Filter out a large quantity (up to 25%) of data which is poor quality
Check all the filtered data from ONE dataset for trends between days (rows) and variables (columns)
Compare filtered data in one dataset between three sites (e.g. all collected at the same time, on the same days)
Compare different (filtered) variables within a single dataset over time, and
Perform analysis on the changes between both (filtered) datasets.

I have no idea how to structure and maintain my code to allow me to do all of these things. I know some of the tests I want to do, but others I haven't thought of yet. At the moment I have about 10 different programs which load and structure my raw data files in different ways (one builds an array of structs, another subsets the data into variables, etc.), but this is incredibly confusing and has led to a lot of errors and enormous amounts of repetition. Deeply nested structs became impossible to work with last year.

I will also have a set of images I want to analyse at the same time, taken from the same days, so I need to take that into account too.

MATLAB is so powerful and there are so many ways of managing data. Does anyone have any ideas on organising such a large dataset to be able to analyse so many different parts of it?

Answer 1

Richard Willey on 16 Feb 2012

Direct link to this answer

https://se.mathworks.com/matlabcentral/answers/29335-data-management-for-large-datasets#answer_37681

Have you looked into the dataset array that ships with Statistics Toolbox?

The dataset array is a special data type that can store heterogeneous data. (I can have a column of strings, followed by a column of categoricals, followed by a column of doubles, and so on.)

The dataset array ships with a variety of built-in methods that are designed to simplify data analysis. For example, there is a built-in method for "joins", just like you'd find in a relational database. There's also a built-in method for converting your data from a tall format to a wide format (and vice versa).
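
As a rough sketch of the tall/wide conversion, assuming the Statistics Toolbox `dataset` type with its `stack` and `unstack` methods is available: the variable names (`Day`, `TempA`, `TempB`, `Site`, `Temp`) and all values below are invented purely for illustration. In newer MATLAB releases the `table` type provides functions with the same names.

```matlab
% Hypothetical wide-format data: one row per day, one column per site.
% All names and values are made up for illustration.
wide = dataset({[1; 2; 3], 'Day'}, ...
               {[10.1; 10.4; 11.0], 'TempA'}, ...
               {[9.8; 10.0; 10.6], 'TempB'});

% Wide -> tall: stack the two temperature columns into a single 'Temp'
% column, with a new 'Site' indicator recording which column each value
% came from.
tall = stack(wide, {'TempA', 'TempB'}, ...
             'NewDataVarName', 'Temp', 'IndVarName', 'Site');

% Tall -> wide: spread 'Temp' back out into one column per 'Site' value.
wideAgain = unstack(tall, 'Temp', 'Site');
```

The tall layout tends to suit grouping and filtering by site or source, while the wide layout is convenient for comparing variables side by side on the same day.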

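As a practical example, you cite a requirement to "Compare filtered data in one dataset between three sites (eg all collected at the same time, on the same days)". The join operation would make that a lot easier: join the per-site arrays on their shared day and time keys, and matching observations line up in the same row.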
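
A minimal sketch of that idea, assuming one dataset array per site and a shared `Day` key. The site names, variable names, values, and the use of `NaN` to mark poor-quality readings are all assumptions made for illustration, not details from the question.

```matlab
% Hypothetical readings from two sites; names and values are invented.
siteA = dataset({[1; 2; 3; 4], 'Day'}, {[10.1; 10.4; NaN; 11.0], 'TempA'});
siteB = dataset({[2; 3; 4; 5], 'Day'}, {[9.8; 10.0; 10.6; 10.9], 'TempB'});

% Filter out poor-quality rows first (here, simply rows with a missing value).
siteA = siteA(~isnan(siteA.TempA), :);

% Inner join on the shared key: only days present at both sites survive.
both = join(siteA, siteB, 'Keys', 'Day', 'Type', 'inner', 'MergeKeys', true);

% With matching observations in the same row, comparisons become column
% arithmetic.
both.Diff = both.TempA - both.TempB;
```

Keeping each site (or each source) in its own flat dataset array and joining only when a comparison is needed avoids the deeply nested structs described in the question.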
