Big data question: how to generate a variable efficiently and aggregate
I have a file with tens of millions of observations, keyed by a string identifier, which I load as a datastore:
V1      V2          V3   V4
KLM88   2001-06-30  10   COMPANY1
KLM88   2000-12-31  20   COMPANY1
MNH7C   2001-09-30  23   COMPANY1
MNH7C   2001-06-30  15   COMPANY1
MNH7C   2000-12-31  6    COMPANY1
HG9LB   2000-12-31  2    COMPANY1
I also have a MAT-file with some extra information that maps the first variable (V1) to a country:
KLM88   COUNTRYA
MNH7C   COUNTRYA
HG9LB   COUNTRYB
The end result I want is the dataset aggregated (V3 summed) by country, date, and company:
2001-09-30  23   COMPANY1  COUNTRYA
2001-06-30  25   COMPANY1  COUNTRYA
2000-12-31  26   COMPANY1  COUNTRYA
2000-12-31  2    COMPANY1  COUNTRYB
I know I can do this by reading the data chunk by chunk and assigning the country in a for loop. However, that takes a huge amount of time. Are there any other suggestions for how to do this? I am fairly new to the concepts of tall arrays, mapreduce, etc., so I am not sure how to arrive at the result more efficiently.
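One possible way to avoid the per-chunk loop is to treat the data as a tall table, join it against the in-memory lookup table, and let groupsummary do the aggregation. The sketch below is only an illustration under stated assumptions: the sample rows from the question are built in memory so that it runs on its own, the names data, lookup, joined, agg, result, and Country are made up for the example, and it assumes a MATLAB release in which innerjoin (tall table with an in-memory table) and groupsummary accept tall inputs.

```matlab
% Sample rows from the question, built in memory so the sketch is
% self-contained. For the real file this would be tall(ds), where ds is
% the existing datastore (assumption: it yields variables V1..V4).
V1 = ["KLM88"; "KLM88"; "MNH7C"; "MNH7C"; "MNH7C"; "HG9LB"];
V2 = datetime(["2001-06-30"; "2000-12-31"; "2001-09-30"; ...
               "2001-06-30"; "2000-12-31"; "2000-12-31"], ...
              'InputFormat', 'yyyy-MM-dd');
V3 = [10; 20; 23; 15; 6; 2];
V4 = repmat("COMPANY1", 6, 1);
data = tall(table(V1, V2, V3, V4));

% Identifier-to-country lookup, kept as an ordinary in-memory table.
% In practice it would be loaded from the MAT-file; the variable name
% Country is hypothetical.
lookup = table(["KLM88"; "MNH7C"; "HG9LB"], ...
               ["COUNTRYA"; "COUNTRYA"; "COUNTRYB"], ...
               'VariableNames', {'V1', 'Country'});

% Attach the country to every observation by joining on the identifier.
joined = innerjoin(data, lookup, 'Keys', 'V1');

% Sum V3 within each (country, date, company) group.
agg = groupsummary(joined, {'Country', 'V2', 'V4'}, 'sum', 'V3');

% Tall operations are deferred; gather triggers the actual computation
% and returns a small in-memory table.
result = gather(agg);

% Order rows by country, then by date (newest first), for readability.
result = sortrows(result, {'Country', 'V2'}, {'ascend', 'descend'});
disp(result)
```

On the sample rows this should reproduce the totals listed in the question, with the sums in a variable named sum_V3. Because the join and the grouping are evaluated only when gather is called, no explicit loop over chunks is written by hand, and only the small aggregated table is brought into memory.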