
How to perform row operations on a large MATRIX or CELL efficiently?

I have a large cell array on the order of 18000x4 (or a matrix of similar size). Each cell element is a 'name' and therefore a string (in the matrix case it is a number). The operation I need to perform is explained in the following example (please find the relevant files attached). Example:
CELL INPUT
'1' '3' '4' []
'1' '3' '5' []
'1' '3' '6' []
'1' '4' '4' []
'1' '4' '5' []
'1' '4' '6' []
'2' '3' '4' []
'2' '3' '5' []
'2' '3' '6' []
'2' '4' '4' []
'2' '4' '5' []
'2' '4' '6' []
'5' '4' [] []
'5' '7' [] []
'6' '8' '4' []
'6' '8' '7' []
'9' '10' [] []
CELL OUTPUT
'1' '4' [] []
'1' '3' '5' []
'1' '3' '6' []
'2' '4' [] []
'2' '3' '5' []
'2' '3' '6' []
'4' '5' [] []
'5' '7' [] []
'4' '6' '8' []
'6' '7' '8' []
'10' '9' [] []
There are 2 main operations done to get the output:
1) Delete the duplicate entries within each row: (1 4 4) ---> (1 4) and (2 4 4) ---> (2 4) in the above example.
2) After step (1), compare the rows and delete any row that is a superset of another row: (1 3 4), (1 4 5) and (1 4 6) are deleted because they are supersets of (1 4). Similarly, (2 3 4), (2 4 5) and (2 4 6) are deleted because they are supersets of (2 4).
I have written a MATLAB function called 'minimize' that performs these operations. But when I run it on a large cell array of order 8656x4, MATLAB keeps running with no result even after 48 hours. I need to do this for even larger cell arrays, and within a time frame of 1-2 hours.
Could you please provide any method/suggestion to efficiently solve this?
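For reference, the two steps can be transcribed directly as follows (a minimal sketch in Python rather than MATLAB, since the actual 'minimize' code is not shown in this thread; rows are modeled as sets):

```python
def minimize_naive(rows):
    """Naive version of the two operations described above.

    Step 1: remove duplicate entries within each row.
    Step 2: drop every row that is a proper superset of some other row.
    The pairwise comparison in step 2 is O(m^2), which is the bottleneck.
    """
    sets_ = [frozenset(r) for r in rows]                # step 1: per-row de-duplication
    return [sorted(s) for s in sets_
            if not any(other < s for other in sets_)]   # '<' tests proper subset
```

On the 17-row example above this returns the 11 rows of the CELL OUTPUT (up to row order).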
  2 Comments
Carl on 10 Oct 2017
Edited: Carl on 10 Oct 2017
Hi Ashish. I see from your .mat files that you have analogous matrices of type 'cell' and 'double'. Just to clarify, is it only slow when performing those operations on the 'cell' data? Or do you see the same performance when operating on the 'double' type matrices as well?
Initial suggestions for improving performance would be to take a vectorized, rather than loop-based, approach to copying data. For example, change:
for k = 1:size(temp_name,2)
minimal_row(i,k) = temp_name(k);
end
to
minimal_row(i,:) = temp_name;
Or,
for j = 1:size(number_matrix,2)
if number_matrix(i,j) == 0
break;
else
temp_name(j) = number_matrix(i,j) ;
end
end
to
temp_name = number_matrix(i,:);
temp_name = temp_name(temp_name ~= 0);
MATLAB can execute this vectorized syntax far more efficiently than an element-by-element loop.
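One subtlety worth noting about the second rewrite: the original loop stops at the first zero, while the filtered version removes all zeros, so the two agree only if zeros appear solely as trailing padding. The difference can be sketched in Python (illustrative only; the thread's data layout is assumed):

```python
def trailing_nonzeros_loop(row):
    # literal transcription of the original early-exit loop
    temp_name = []
    for x in row:
        if x == 0:
            break
        temp_name.append(x)
    return temp_name

def nonzeros_filtered(row):
    # analogue of Carl's one-liner: take the whole row, then drop the zeros
    return [x for x in row if x != 0]
```

Both give [1, 4, 4] for the zero-padded row [1, 4, 4, 0], but for [1, 0, 4, 0] the loop returns [1] while the filter returns [1, 4].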
Ashish Keshkamat on 11 Oct 2017
Hello Carl, It is relatively faster when I solve using 'double' type matrices as compared to cells, but the difference is not appreciable.
Thank you for your feedback. Let me try the vectorised approach.


Accepted Answer

Ashish Keshkamat on 14 Nov 2017
Hello All,
Thank you for your time and effort in helping me solve this problem. I have figured out how to run 'minimize' more efficiently. This function is part of a bigger program I am working on, and was the bottleneck. There is a pattern in how the input to 'minimize' is generated: I track the duplicates while generating the input, and then compare only those rows and delete the supersets among them. This reduces the comparisons from thousands to hundreds, and sometimes even to tens.
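Ashish's exact bookkeeping is not shown, but one way to realize the "only compare candidate rows" idea (a Python sketch and an assumption on my part, not his actual code) is to observe that after step 1, only rows with fewer than n distinct entries can be a proper subset of another row:

```python
def minimize_candidates(rows, n=4):
    sets_ = [frozenset(r) for r in rows]
    # After de-duplication, a full-length row can never be a proper subset of
    # another row, so only the shortened rows need to be tested as subsets.
    candidates = {s for s in sets_ if len(s) < n}
    return [sorted(s) for s in sets_
            if not any(c < s for c in candidates)]   # '<' tests proper subset
```

This keeps the inner comparison loop over the (typically small) candidate set instead of over all m rows.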
Carl, Thanks for the suggestion of vectorization. I used it in many instances.
Sharath, Thank you for pointing out the flaws in my Algorithm. It helped me to think in different ways to reduce the number of computations.
With all the optimization done, I am able to get my output using input of the order 8656x4 in 20 seconds.
Best Regards, Ashish

More Answers (1)

Sharath Chandran on 11 Oct 2017
Edited: Sharath Chandran on 3 Nov 2017
Hi Ashish,
I reckon this has more to do with your algorithm.
Let us consider a matrix of dimension m x n (m rows and n columns). The time complexity of the first operation you perform is O(m * n). With m = 8656 and n = 4, this is roughly 34,624 computations, which is a reasonable number.
The second operation, on the other hand, has time complexity O(m^2 * n), which comes to at least 299,705,344 computations for m = 8656 and n = 4. This is clearly the bottleneck of your current algorithm, and it is what needs to be optimized.
Alternate approach:
1. After performing step 1, store each row in a hash map - O(m) (let us ignore the cost of the hash-map operations for the time being).
2. For every row, generate its subsets and check whether any of them is present in the hash map. This is not much overhead, as the maximum number of subsets of a 4-element row is 2^4 - 2 = 14; we subtract 2 because we skip the empty set and the set itself - O(2^n) per row.
3. If any subset is already present in the hash map, reject that row, since one of its subsets already exists.
4. Total time complexity: O(m * 2^n), which comes to roughly 138,496 computations - a drastic improvement.
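The four steps above can be sketched as follows (in Python, where the built-in set plays the role of the hash map; in MATLAB, containers.Map with a string key would be one analogue):

```python
from itertools import combinations

def minimize_hashed(rows):
    # Step 1: de-duplicate within each row and store every row in a hash set
    sets_ = [frozenset(r) for r in rows]
    present = set(sets_)                       # O(m) inserts
    kept = []
    for s in sets_:
        items = sorted(s)
        # Step 2: enumerate all proper non-empty subsets (at most 2^4 - 2 = 14
        # for a 4-element row) and look each one up in O(1) on average.
        has_subset = any(
            frozenset(c) in present
            for k in range(1, len(items))
            for c in combinations(items, k)
        )
        # Step 3: reject the row if one of its proper subsets already exists
        if not has_subset:
            kept.append(items)
    return kept                                # Step 4: total O(m * 2^n)
```

On the 17-row example from the question, this reproduces the expected output rows.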
I hope this was helpful.
Please comment below in case you have any queries.
Regards,
MathWorks Technical Support Team
Note: this is not an exact time-complexity analysis, as we have not taken into account the cost of storing data in the hash map.
  1 Comment
Ashish Keshkamat on 7 Nov 2017
Hello Sharath,
Thank you for your feedback. I understand the proposed algorithm reduces the number of computations drastically, but I am not familiar with hash maps (hash tables).
Let me see what can be done.

