Getting an error while joining two files of 2 MB and 3 GB
Hello everyone!
I have to join two files. I am using innerjoin with one key column, 'XDRID', but I am getting the following error. Can anybody help me understand why I am getting this error and how to solve it?
Error using table/innerjoin (line 78)
Left and right key variables 'XDRID' and 'XDRID' are not comparable
because one is a non-cell.
9 Comments
Walter Roberson
on 9 Oct 2016
For the two tables, say A and B, please show class(A.XDRID) and class(B.XDRID), and class(A.XDRID{1}) and class(B.XDRID{1})
naadiya khudabux
on 9 Oct 2016
Edited: Walter Roberson
on 9 Oct 2016
account,phone,XDRID
268451,13637310274,1000077641156912666
90768,18373185301,1000105332868488876
95242,15207485247,1000116352552530729
310253,18797502608,1000390107216725540
312187,15513070156,100040138619028759
237644,13027424020,1000423234229975214
272291,13077304015,1000424625869287301
82124,13297408535,1000590115562800405
313384,15608428878,1000788671238803447
249159,18390986833,1000823205270429980
53951,13875857258,1000937019137751718
260556,13975803629,1000999511586349452
naadiya khudabux
on 9 Oct 2016
Edited: Walter Roberson
on 9 Oct 2016
'160a01184201c500' NaN 3.63391677599642e+18 6.04918968728055e+18 7.30444913473469e+18 NaN NaN 29547 121298690 'CMNET' 103 1467304572016.00 1467304690756.00 1 5 17 NaN NaN NaN NaN 0 NaN 80 889 532 14 4 0 0 0 0 33 100 0 0 1 40 NaN NaN 2 0 1 6 200 40 40 69 'data.video.qiyi.com' NaN NaN NaN 'text/plain' NaN NaN 143 0 3 0 40 0 NaN
'160a011842034200' NaN 3.63391677599642e+18 6.04918968728055e+18 7.30444913473469e+18 NaN NaN 29547 121298690 'CMNET' 103 1467304573036.00 1467304573246.00 1 5 17 NaN NaN NaN NaN 0 NaN 80 804 279 4 3 0 0 0 0 27 69 0 0 90 28 NaN NaN 1 0 1 6 200 28 28 49 'mbdlog.iqiyi.com' NaN NaN NaN 'text/html' NaN NaN 0 0 3 0 28 0 NaN
'160a01184204f700' NaN 3.63391677599642e+18 6.04918968728055e+18 7.30444913473469e+18 NaN NaN 29547 121298690 'CMNET' 103 1467304574196.00 1467304574341.00 1 5 17 NaN NaN NaN NaN 0 NaN 80 858 279 4 3 0 0 0 0 33 60 0 0 29 33 NaN NaN 1 0 1 6 200 33 33 55 'mbdlog.iqiyi.com' NaN NaN NaN 'text/html' NaN NaN 0 0 3 0 33 0 NaN
'160a01184209c800' NaN 7.17462676157379e+18 7.63062400670183e+16 4.97705383067421e+18 NaN NaN 29547 118294295 'CMNET' 103 1467304577391.00 1467304577590.00 1 3 2 NaN NaN NaN NaN 0 NaN 80 892 1086 6 5 0 1 0 0 30 79 0 0 10 50 NaN NaN 1 0 1 6 200 50 50 70 'api.weibo.cn' NaN NaN NaN 'application/json' NaN NaN 0 0 1 0 50 0 3
'160a0118421bcb00' NaN 3.63391677599642e+18 6.04918968728055e+18 7.30444913473469e+18 NaN NaN 29547 121298690 'CMNET' 103 1467304589045.00 1467304619236.00 1 6 36 NaN NaN NaN NaN 0 NaN 80 564 2590 8 10 0 0 0 0 2 60 0 0 10 4 NaN NaN 1 0 1 6 200 4 4 24 'p1.music.126.net' NaN NaN NaN 'image/jpg' NaN NaN 1800 0 3 0 4 0 NaN
'160a0118421bcc00' NaN 3.63391677599642e+18 6.04918968728055e+18 7.30444913473469e+18 NaN NaN 29547 121298690 'CMNET' 103 1467304589006.00 1467304589225.00 1 6 36 NaN NaN NaN NaN 0 NaN 80 448 2207 6 5 0 0 0 0 2 60 0 0 10 109 NaN NaN 1 0 1 6 200 109 109 149 'p1.music.126.net' NaN NaN NaN 'image/jpg' NaN NaN 1666 0 3 0 109 0 NaN
'160a011842376000' NaN 1.59887515923291e+18 4.90256020833500e+18 8.61871928453489e+18 NaN NaN 29547 117726722 'CMNET' 103 1467304608515.00 1467304608789.00 1 1 9 5 0 NaN NaN 0 NaN 80 1142 6381 9 9 0 0 0 0 23 40 0 0 50 95 NaN NaN 1 0 1 5 200 95 95 159 'short.weixin.qq.com' NaN NaN NaN 'application/octet-stream' NaN NaN 5909 0 3 0 95 0 1
'160a0118423b6400' NaN 8.67272675720160e+18 6.73500442591658e+18 3.26382062659079e+18 NaN NaN 29633 252422019 'CMNET' 103 1467304611735.00 1467304618733.00 1 18 10078 NaN NaN NaN NaN 0 NaN 80 1075 264 6 6 0 0 0 0 4 30 0 0 6934 0 NaN NaN 1 0 1 6 1001 0 0 0 'q.dmzj.com' NaN NaN NaN '' NaN NaN 0 0 3 1 0 0 NaN
Walter Roberson
on 9 Oct 2016
Please label the fields for the second table.
It looks like the 8th field of the second table might be the account. Are you trying to join based on that?
naadiya khudabux
on 9 Oct 2016
XDRID RequestTime
__________________ _______________
'160a01184201c500' '1467304572016'
'160a011842034200' '1467304573036'
'160a01184204f700' '1467304574196'
'160a01184209c800' '1467304577391'
'160a0118421bcb00' '1467304589045'
'160a0118421bcc00' '1467304589006'
'160a011842376000' '1467304608515'
'160a0118423b6400' '1467304611735'
naadiya khudabux
on 9 Oct 2016
Consider this my second table. I am trying to join on the basis of 'XDRID'.
naadiya khudabux
on 9 Oct 2016
K>> drop.VariableNames
ans =
Columns 1 through 6
'XDRID' 'RAT' 'IMSI' 'IMEI' 'MSISDN' 'MachineIPAddType'
Columns 7 through 12
'SGWGGSNIPAdd' 'TAC' 'ECI' 'APN' 'AppTypeCode' 'RequestTime'
Columns 13 through 16
'ProcedureEndTime' 'ProtocolType' 'AppType' 'AppSubType'
Columns 17 through 21
'AppContent' 'AppStatus' 'USER_IPv4' 'UserPort' 'L4Protocal'
Columns 22 through 25
'AppServerIP_IPv4' 'AppServerPort' 'ULTraffic' 'DLTraffic'
Columns 26 through 29
'ULIPPacket' 'DLIPPacket' 'ULTCPOoOPacket' 'DLTCPOoOPacket'
Columns 30 through 32
'ULTCPRetransPacket' 'DLTCPRetransPacket' 'TCPSYNAtteDelay'
Columns 33 through 35
'TCPSYNComfirmDelay' 'ULIPFRAGPACKETS' 'DLIPFRAGPACKETS'
Columns 36 through 39
'TCPSYNSuccFirstRe…' 'FirstReqToFirstRe…' 'WindowSize' 'MSSSize'
Columns 40 through 43
'TCPSYNAtte' 'TCPConnStatus' 'SessionIsEnd' 'EventType'
Columns 44 through 46
'HTTPWAPStatus' 'FirstHTTPResPacke…' 'LastHTTPPacketDelay'
Columns 47 through 51
'LastACKPacketDelay' 'HOST' 'URI' 'XOnlineHost' 'UserAgent'
Columns 52 through 55
'HTTP_content_type' 'refer_URI' 'Cookie' 'ContentLength'
Columns 56 through 59
'DestBeha' 'OperBehaIden' 'OperFinishIden' 'OperDelay'
Columns 60 through 61
'Browser' 'PortalAppCollection'
These are the labels for all the columns
Accepted Answer
Walter Roberson
on 9 Oct 2016
The error message you are getting would be consistent with the first column of the second table being named XDRID, the same as the third column of the first table. However, the third column of the first table is numeric, while the first column of the second table is a character vector.
When you do a table join, the fields that have the same name in the two tables must also have the same data type.
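As an illustration, here is a minimal sketch of converting one key column so both sides have the same class before joining. The table contents and variable names here are made up, not taken from the poster's data:

```matlab
% Minimal sketch (table and variable names assumed, not from the
% poster's data): make the key classes match before joining.
A = table([101; 202], {'x'; 'y'}, 'VariableNames', {'XDRID', 'valA'});  % numeric key
B = table({'101'; '202'}, [9; 8], 'VariableNames', {'XDRID', 'valB'});  % cellstr key
% innerjoin(A, B, 'Keys', 'XDRID')   % errors: keys are not comparable
B.XDRID = str2double(B.XDRID);       % convert the cellstr key to double
C = innerjoin(A, B, 'Keys', 'XDRID');
```

Note that for 19-digit IDs like the ones in this thread, converting the numeric side to strings (e.g. with sprintf) may be safer than str2double, which goes through double precision.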
50 Comments
Walter Roberson
on 9 Oct 2016
In the first table, the values such as 1000077641156912666 are 19 digit numbers, with values exceeding the maximum that can be represented in a double precision number, but within range of what can be represented as an uint64, so if the data was read in carefully there is a possibility that all digits are meaningful.
In the second table, the values such as '160a01184201c500' are 16 characters that are hex digits. Those might represent unsigned 64 bit numbers; if so then it is important to know the byte order to do the decoding. I cannot tell at the moment what the intended order is. The hex representation of the values in the first table, the ones starting with 1000*, all come out starting with '0d' hex, which is a byte that does not appear in any of the hex values in the second table, suggesting that if they do represent 64 bit integers then they are in a different range.
In any case, what I speculated earlier is still the case: in one table the values for XDRID are numeric and in the other table they are string. You cannot join tables on a field if the representations for the fields are different between the tables.
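A small sketch of the precision point above, using an ID from the first table (the comparison logic is illustrative; exact conversion of a large literal to uint64 requires a reasonably recent MATLAB):

```matlab
% uint64 keeps all 19 digits; double has only 53 bits of mantissa,
% so near 1e18 consecutive integers round to the same double.
a = uint64(1000077641156912666);
b = uint64(1000077641156912667);
isequal(a, b)            % false: the uint64 values differ
double(a) == double(b)   % true: both round to the same double
```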
naadiya khudabux
on 9 Oct 2016
I have converted them to the same data type, but even then I get the same problem.
Walter Roberson
on 9 Oct 2016
Attach a .mat file that has a small extract of the tables, such as the first 10 rows of each table.
naadiya khudabux
on 10 Oct 2016
@Walter Roberson
Can you please suggest a good tutorial on reading a non-tabular text file using datastore or MapReduce? It would be highly appreciated.
Walter Roberson
on 10 Oct 2016
I cannot think of a tutorial about that topic at the moment. It appears to me that you would need to use a fileDatastore for that.
The objects that you have shown so far have been tabular, not non-tabular.
naadiya khudabux
on 11 Oct 2016
Edited: Walter Roberson
on 13 Nov 2016
'160a01184201c500' NaN 3.63391677599642e+18 6.04918968728055e+18 7.30444913473469e+18 NaN NaN 29547 121298690 'CMNET' 103 1467304572016.00 1467304690756.00 1 5 17 NaN NaN NaN NaN 0 NaN 80 889 532 14 4 0 0 0 0 33 100 0 0 1 40 NaN NaN 2 0 1 6 200 40 40 69 'data.video.qiyi.com' NaN NaN NaN 'text/plain' NaN NaN 143 0 3 0 40 0 NaN
'160a011842034200' NaN 3.63391677599642e+18 6.04918968728055e+18 7.30444913473469e+18 NaN NaN 29547 121298690 'CMNET' 103 1467304573036.00 1467304573246.00 1 5 17 NaN NaN NaN NaN 0 NaN 80 804 279 4 3 0 0 0 0 27 69 0 0 90 28 NaN NaN 1 0 1 6 200 28 28 49 'mbdlog.iqiyi.com' NaN NaN NaN 'text/html' NaN NaN 0 0 3 0 28 0 NaN
'160a01184204f700' NaN 3.63391677599642e+18 6.04918968728055e+18 7.30444913473469e+18 NaN NaN 29547 121298690 'CMNET' 103 1467304574196.00 1467304574341.00 1 5 17 NaN NaN NaN NaN 0 NaN 80 858 279 4 3 0 0 0 0 33 60 0 0 29 33 NaN NaN 1 0 1 6 200 33 33 55 'mbdlog.iqiyi.com' NaN NaN NaN 'text/html' NaN NaN 0 0 3 0 33 0 NaN
'160a01184209c800' NaN 7.17462676157379e+18 7.63062400670183e+16 4.97705383067421e+18 NaN NaN 29547 118294295 'CMNET' 103 1467304577391.00 1467304577590.00 1 3 2 NaN NaN NaN NaN 0 NaN 80 892 1086 6 5 0 1 0 0 30 79 0 0 10 50 NaN NaN 1 0 1 6 200 50 50 70 'api.weibo.cn' NaN NaN NaN 'application/json' NaN NaN 0 0 1 0 50 0 3
'160a0118421bcb00' NaN 3.63391677599642e+18 6.04918968728055e+18 7.30444913473469e+18 NaN NaN 29547 121298690 'CMNET' 103 1467304589045.00 1467304619236.00 1 6 36 NaN NaN NaN NaN 0 NaN 80 564 2590 8 10 0 0 0 0 2 60 0 0 10 4 NaN NaN 1 0 1 6 200 4 4 24 'p1.music.126.net' NaN NaN NaN 'image/jpg' NaN NaN 1800 0 3 0 4 0 NaN
'160a0118421bcc00' NaN 3.63391677599642e+18 6.04918968728055e+18 7.30444913473469e+18 NaN NaN 29547 121298690 'CMNET' 103 1467304589006.00 1467304589225.00 1 6 36 NaN NaN NaN NaN 0 NaN 80 448 2207 6 5 0 0 0 0 2 60 0 0 10 109 NaN NaN 1 0 1 6 200 109 109 149 'p1.music.126.net' NaN NaN NaN 'image/jpg' NaN NaN 1666 0 3 0 109 0 NaN
'160a011842376000' NaN 1.59887515923291e+18 4.90256020833500e+18 8.61871928453489e+18 NaN NaN 29547 117726722 'CMNET' 103 1467304608515.00 1467304608789.00 1 1 9 5 0 NaN NaN 0 NaN 80 1142 6381 9 9 0 0 0 0 23 40 0 0 50 95 NaN NaN 1 0 1 5 200 95 95 159 'short.weixin.qq.com' NaN NaN NaN 'application/octet-stream' NaN NaN 5909 0 3 0 95 0 1
'160a0118423b6400' NaN 8.67272675720160e+18 6.73500442591658e+18 3.26382062659079e+18 NaN NaN 29633 252422019 'CMNET' 103 1467304611735.00 1467304618733.00 1 18 10078 NaN NaN NaN NaN 0 NaN 80 1075 264 6 6 0 0 0 0 4 30 0 0 6934 0 NaN NaN 1 0 1 6 1001 0 0 0 'q.dmzj.com' NaN NaN NaN '' NaN NaN 0 0 3 1 0 0 NaN
naadiya khudabux
on 11 Oct 2016
Edited: naadiya khudabux
on 11 Oct 2016
I have posted this file above; it is 3 GB in size and I have to analyze it. The first column is a phone number that I need to decode, and there is also a time at which the application was requested. I have another file containing analyzed data of phone numbers. So kindly advise me how I should start: with 'datastore with MapReduce' or 'file datastore with MapReduce'? I am using MATLAB R2015b, which does not support tall arrays or fileDatastore.
naadiya khudabux
on 11 Oct 2016
Edited: Walter Roberson
on 13 Nov 2016
Here is the other analyzed file; it is around 2 MB in size.
account,phone,XDRID
268451,13637310274,1000077641156912666
90768,18373185301,1000105332868488876
95242,15207485247,1000116352552530729
310253,18797502608,1000390107216725540
312187,15513070156,100040138619028759
237644,13027424020,1000423234229975214
272291,13077304015,1000424625869287301
82124,13297408535,1000590115562800405
313384,15608428878,1000788671238803447
249159,18390986833,1000823205270429980
53951,13875857258,1000937019137751718
260556,13975803629,1000999511586349452
Walter Roberson
on 11 Oct 2016
I examined '160a01184201c500' as a potential hex representation of a phone number. If we look at the second file, the phone numbers are shown in "leading-1" format, which is the format without the leading '+' that would be used for the North American Numbering Plan numbers for Canada, USA, some US territories, some Caribbean countries, and some British Overseas Territories . For each of those numbers the format is effectively '1' followed by a 3-digit area code, followed by a 3 digit Local Area Exchange, followed by a 4 digit local number. People now typically bundle the 3 digit Local Area Exchange and 4 digit local number together into a 7 digit number, and typically work in terms of 1 plus area code plus 7 digit number.
The valid 3-digit area codes are not permitted to begin with '1', so phone numbers range from about 12000000000 to 19999999999. If you convert 12000000000 to hex you get 02CB417800, and if you convert 19999999999 to hex you get 04A817C7FF. Therefore, if phone numbers are converted to hex, they require 5 bytes (10 hex digits).
We can then ask "But suppose they left off the leading '1' in the hex representation: would that still need 5 bytes?". The answer to that is Yes, in that the hypothetical phone number 9999999999 (with no leading 1) is going to be relatively close to the 5 bytes required for 12000000000. We could then point out that no actual area codes begin with '999' and suggest that maybe it could fit. There is, though, an area code in Texas that is '979', so you need most of the range. Taking it a step further we can ask what the largest phone number without leading 1 is that could be represented in 4 bytes; it is hex2dec('ffffffff') which is only 4294967295, which is very definitely not enough to represent the 18797502608 phone number shown in the second file even if you remove the leading 1.
We have thus established that if the phone numbers are being represented in hex, that we need 5 bytes (10 hex digits).
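Those ranges can be checked directly in MATLAB:

```matlab
dec2hex(12000000000)   % '2CB417800' -> 9 hex digits, i.e. 5 bytes with padding
dec2hex(19999999999)   % '4A817C7FF' -> also 5 bytes
hex2dec('ffffffff')    % 4294967295  -> the most 4 bytes can hold
```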
I then took the sample entry '160a01184201c500' from the first line of your first table, and extracted all possible subsets of 5 bytes from it, all possible arrangements of 5 bytes taken from those 8 bytes, such as 0a01184201 and c501164216. All possible arrangements in order to take into account that I do not know the byte order that was used, or which 5 bytes out of 8 was used. I took all of the possibilities, and I converted to decimal. And then I looked for values that came out in the range 12000000000 to 19999999999, being consistent with the leading 1 shown in the second table. The result was that no arrangement of 5 bytes out of 8 of '160a01184201c500' converts to decimal with a value in the range 12000000000 to 19999999999. Therefore it is not possible that some subset of 5 bytes out of '160a01184201c500' encodes a North American Numbering Plan phone number that begins with a leading 1.
I then checked for values in the range 2000000000 to 9999999999, under the possibility that the encoded phone numbers were missing the leading '1'. 912 out of the 3720 unique possibilities matched that range, from 3305177354 to 7604410390. That is far too many for me to be able to deduce a subset or byte order pattern, so that path is not useful without either a much larger data sample with a known match, or specific information about how the bytes are laid out.
One thing I did not try was the possibility that an odd number of hex digits was used for the phone number encoding.
I proceeded to test the hypothesis that the 16 hex digits were 8 bytes of a standard IEEE 754 binary floating point number, just in some unknown byte order. I tried all 40320 possible arrangements. I weeded out the results that were negative or were not integers. 120 results remained, of which the smallest was 46188113177648, which is several digits too large to be a phone number.
At the moment, your problem is not that the data is big and that you might possibly have to use a tabular datastore and mapreduce: it is that you do not know how to decode the phone number supposedly encoded by '160a01184201c500' into something you can use to match on the second table.
The second table has the values in the range 10^18 in the third column. We could hypothesize that it is those values that have been encoded as hex to appear in the first table. Re-using the possible conversions as IEEE 754 binary floating point numbers, we find none of the potential arrangements of hex are within range (they were like 46188113177648 to 46386386305538). I then converted all of the possible byte reorderings of the hex into uint64 numbers, and checked to see if any of them are in a plausible range. The smallest came out as 1585268172810371781 and the largest as 4811278259851428096. Although those have the right number of digits to match the 18 digit numbers that begin with 1000*, they are too large to match those.
Are you certain that the match is on phone number, and not on "account"??? The 8th field of the first dataset, with values such as 29547, looks like a plausible match against the account numbers such as 268451 in the second table.
If you are certain that the phone number is encoded in that hex, then I need a sample record from each table in which the phone number is known to match between the two tables, so that I can try to figure out how it might be encoded.
naadiya khudabux
on 11 Oct 2016
Thank you so much for your reply and help. Yes, I am not sure which column I should use to join them. Can you explain a bit more how I can join them on the basis of account? And how can I read a large text file? Please suggest some tutorials.
Walter Roberson
on 11 Oct 2016
Attach the complete 2 megabyte file, and attach a subset of the 3 gigabyte file (such as the first 1000 records, maybe together with another 1000 random records). Use the original dataset format for those, not text files. You can zip the two together and attach the .zip, and I will take a look.
Walter Roberson
on 11 Oct 2016
I will have a look at this later; I have been up all night and it is time for me to rest.
Walter Roberson
on 12 Oct 2016
I know. But it's my bedtime again.
My health is not great these days and I am not always able to do things that need more concentration.
naadiya khudabux
on 13 Oct 2016
Can you help me create mapper and reducer functions that change the date from UTC to GMT time format, please?
naadiya khudabux
on 13 Oct 2016
Edited: Walter Roberson
on 13 Oct 2016
I was trying it but couldn't make it work:
dateformated=0;
while hasdata(ds)
T=read(ds);
dateformated =datestr(719529+T.RequestTime/86400000,'yyyy-mm-dd HH:MM:SS');
end
Walter Roberson
on 14 Oct 2016
The examples for datastore use the form
dateformated = [];
while hasdata(ds)
T=read(ds);
dateformated(end+1) =datestr(719529+T.RequestTime/86400000,'yyyy-mm-dd HH:MM:SS');
end
naadiya khudabux
on 14 Oct 2016
Yeah, I have tried it, but I am getting the error "In an assignment A(:) = B, the number of elements in A and B must be the same."
Walter Roberson
on 14 Oct 2016
dateformated(:, end+1) =datestr(719529+T.RequestTime/86400000,'yyyy-mm-dd HH:MM:SS');
Walter Roberson
on 14 Oct 2016
Actually the above will not work, because the final read might be a different size, and the intermediate reads are not always the same size.
newdata = datestr(719529+T.RequestTime/86400000,'yyyy-mm-dd HH:MM:SS');
dateformated(end+1:end+numel(newdata)) = newdata(:);
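A hedged alternative sketch: 'ds' is assumed to be a datastore whose table has a numeric RequestTime column in milliseconds since the Unix epoch. Collecting the formatted dates in a cell array avoids the size-mismatch problems of growing a char matrix whose blocks have different heights:

```matlab
% 719529 is the datenum of 1-Jan-1970; RequestTime is assumed to be
% milliseconds since that epoch.
dateformated = {};
while hasdata(ds)
    T = read(ds);
    newdata = cellstr(datestr(719529 + T.RequestTime/86400000, 'yyyy-mm-dd HH:MM:SS'));
    dateformated = [dateformated; newdata]; %#ok<AGROW>
end
```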
naadiya khudabux
on 15 Oct 2016
Thank you. Yeah, it wasn't working. I just want to remind you that I am still looking for your help to join the two columns.
Walter Roberson
on 15 Oct 2016
I have been working on that for the last several hours. I have code to parse the files, taking into account missing and invalid values, and I have tried to match the fields. As yet, I see no evidence that it is possible to match the two files.
This would go a lot faster if you have documentation on the file format.
Walter Roberson
on 15 Oct 2016
In the file with only three fields, gaoxiaotest2.txt, the third field, there named XDRID, appears to be the field to match on.
In the file with 27 fields, disk1.txt, the 6th field, there named MSISDN, appears to be the field to match on. 412 of the entries match.
arcfmt = ['%u64', '%s', repmat('%u64',1,25)];
fid = fopen('disk1.txt','rt');
dc = textscan(fid, arcfmt, 'HeaderLines', 1, 'Delimiter', ',', 'EmptyValue', uint8(0), 'CollectOutput', 1);
fclose(fid);
arc_rows = [dc{1}, dc{3}];
arc_xdrid = dc{2};
acctfmt = repmat('%u64', 1, 3);
fid = fopen('gaoxiaotest2.txt', 'rt');
for part = 1 : 9
    %we read multiple times to account for bugs in the file
    ac{part} = textscan(fid, acctfmt, 'HeaderLines', 1, 'Delimiter', ',', 'EmptyValue', uint8(0), 'CollectOutput', 1);
end
fclose(fid);
acct_rows = cell2mat( vertcat(ac{:}) );
Now you can match on acct_rows(:,3) and arc_rows(:,5) . (Yes, I did say that it was column 6, but column 2 is a string so I have excluded it from the numeric array.)
My actual code for reading the file is more robust than the above.
The telephone numbers in the files sometimes have a hyphen to indicate a missing digit; sometimes a slash; sometimes a space instead. You should think about how you want to handle that situation.
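One possible sketch (the sample values here are made up) of flagging phone entries that contain one of those placeholder characters before attempting a match:

```matlab
% Flag entries containing '-', '/', or a space standing in for a digit.
phones = {'1363731-274'; '13637310274'; '136373 0274'};
hasMissingDigit = ~cellfun('isempty', regexp(phones, '[-/ ]'));
```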
naadiya khudabux
on 16 Oct 2016
Thank you so much. I made you invest a lot of time on it. Yeah, I have a description file (in Chinese). Now I am going to try it. Thank you once again.
naadiya khudabux
on 18 Oct 2016
Edited: Walter Roberson
on 13 Nov 2016
Hello, I want to ask which format specifier I am supposed to use for this kind of file:
160a01184201c500 3.63392E+18 6.04919E+18 7.30445E+18
160a011842034200 3.63392E+18 6.04919E+18 7.30445E+18
160a01184204f700 3.63392E+18 6.04919E+18 7.30445E+18
160a01184209c800 7.17463E+18 7.63062E+16 4.97705E+18
160a0118421bcb00 3.63392E+18 6.04919E+18 7.30445E+18
160a0118421bcc00 3.63392E+18 6.04919E+18 7.30445E+18
160a011842376000 1.59888E+18 4.90256E+18 8.61872E+18
160a0118423b6400 8.67273E+18 6.735E+18 3.26382E+18
160a0118424ecd00 8.67273E+18 6.735E+18 3.26382E+18
160a0118424ece00 8.67273E+18 6.735E+18 3.26382E+18
160a0118424f3200 8.67273E+18 6.735E+18 3.26382E+18
160a0118424f4e00 8.67273E+18 6.735E+18 3.26382E+18
160a0118424f6a00 8.67273E+18 6.735E+18 3.26382E+18
160a0118424f7000 8.67273E+18 6.735E+18 3.26382E+18
160a0118424f7300 2.06908E+18 8.49562E+17 8.23739E+18
160a0118424f8d00 8.67273E+18 6.735E+18 3.26382E+18
160a0118424fb700 2.06908E+18 8.49562E+17 8.23739E+18
160a0118424fb800 2.06908E+18 8.49562E+17 8.23739E+18
Walter Roberson
on 18 Oct 2016
The format would be '%s%f%f%f' but you might need 'Delimiter', '\t', 'MultipleDelimsAsOne', true . Or possibly not -- it might work without those options.
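A sketch of reading the sample above with that format (the file name is an assumption; the default whitespace delimiting may already be enough):

```matlab
fid = fopen('sample.txt', 'rt');
dc = textscan(fid, '%s%f%f%f');   % default whitespace delimiting
fclose(fid);
xdrid = dc{1};                    % cellstr of hex IDs
vals  = [dc{2}, dc{3}, dc{4}];    % the three numeric columns
```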
naadiya khudabux
on 24 Oct 2016
Edited: naadiya khudabux
on 24 Oct 2016
Hello! Can you please help me to read this file? I always find it hard to read text files. I need more columns, but I am getting 0 0 0 values in the result after reading it. I need to read columns 10, 14, 15, 16, and 17.
Walter Roberson
on 24 Oct 2016
fmt_cell = repmat({'%*s'}, 1, 61);
fmt_cell{10} = '%s'; % APN a string
fmt_cell{14} = '%f'; % ProtocolType a small integer
fmt_cell{15} = '%f'; % AppType a small integer
fmt_cell{16} = '%f'; % AppSubType a not-big integer
fmt_cell{17} = '%f'; % AppContent a not-big integer
fmt = horzcat(fmt_cell{:});
fid = fopen('half.txt', 'rt');
data_cell = textscan(fid, fmt, 'HeaderLines', 1, 'Delimiter', ',', 'EmptyValue', -1);
fclose(fid);
half.APN = data_cell{1};
half.ProtocolType = data_cell{2};
half.AppType = data_cell{3};
half.AppSubType = data_cell{4};
half.AppContent = data_cell{5};
All of the APN entries are the same. Most of the AppContent values are missing and are replaced by -1.
naadiya khudabux
on 7 Nov 2016
Hello Walter Roberson, can you please tell me how I can split the 10 GB file into 10 files? It will be very kind of you. My file sample is attached.
Walter Roberson
on 7 Nov 2016
naadiya khudabux
on 12 Nov 2016
Hey, thank you. I have one more question. I have around 120 GB of data, which I have split into 120 files of one GB each. But when I join those files with my shell file, I am unable to get even one MB of data out of the join. I am missing something and I cannot figure out what. I am using inner join; if I were not using the right key value, I should not be getting any result at all after joining. Would you mind helping me with this?
Walter Roberson
on 12 Nov 2016
I do not know how you are doing the joining.
When you do the splitting, you should be splitting by number of lines, not by exact file size: if you split by exact file size then you will end up splitting in the middle of lines.
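A sketch of line-based splitting (the file names and the lines-per-part count are assumptions):

```matlab
% Split a large text file into parts of a fixed number of lines,
% so no line is ever cut in half.
linesPerPart = 1e6;
fin = fopen('big.txt', 'rt');
part = 0;
while ~feof(fin)
    block = textscan(fin, '%s', linesPerPart, 'Delimiter', '\n', 'Whitespace', '');
    part = part + 1;
    fout = fopen(sprintf('part_%03d.txt', part), 'wt');
    fprintf(fout, '%s\n', block{1}{:});
    fclose(fout);
end
fclose(fin);
```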
naadiya khudabux
on 12 Nov 2016
It doesn't matter; I can remove the last few column values to get the same dimensions. But my problem is figuring out the appropriate key value.
Walter Roberson
on 12 Nov 2016
I no longer speculate on this. You keep changing the file format and the field to be matched on, and the matches have not necessarily been on the plain content -- sometimes it has been necessary to make calculations on the content to find the match. There is no point in my thinking about it. Post samples, say explicitly which fields you are matching on, and if the format of the fields is not exactly the same, describe how the contents are encoded.
I am getting pretty tired of this. I put in a lot of hours figuring out matches for you only to discover that your actual files were nothing like what you had originally posted. I have no energy to go through that again. I am not working on it further without sample data and documentation.
naadiya khudabux
on 12 Nov 2016
Lastly, I have done exactly what you suggested: joining the tables on the basis of columns 3 and 5. But I am not getting the results. I think there is some problem in the key.
naadiya khudabux
on 12 Nov 2016
Edited: Walter Roberson
on 13 Nov 2016
MSISDN column description: user number (TCBCD code) through Gn/S11 interface signaling access. (Is this line a hint for finding the key value? Or do I need to analyze the Interface column values together with the MSISDN column?)
Interface XDRID RAT IMSI IMEI MSISDN ProcedureType RequestTime EndTime Status RequestCause FailureCause MMEUES1APID OldMMEGroupID OldMMECode OldMTMSI MMEGroupID MMECode MTMSI TMSI UserIPv4 MMEIP TAC ECI OtherTAC OtherECI APN
_________ __________________ ___ __________ __________ __________ _____________ ___________ __________ ______ ____________ ____________ ___________ _____________ __________ ________ __________ _______ _____ ____ ________ _____ _____ __________ ________ ________ ___
5 '164c82d268df6200' NaN 1.2351e+18 3.8966e+18 1.9813e+18 3 1.4673e+12 1.4673e+12 0 NaN NaN 3.3738e+07 NaN NaN 0 835 196 0 NaN NaN NaN 29633 1.1829e+08 NaN 0 NaN
5 '164c82d2718abd00' NaN 5.7829e+18 4.0201e+18 1.7189e+18 3 1.4673e+12 1.4673e+12 0 NaN NaN 4.0267e+08 NaN NaN 0 835 202 0 NaN NaN NaN 29521 1.2158e+08 NaN 0 NaN
5 '164c82d294e6f300' NaN 7.7102e+18 5.6221e+18 4.0734e+18 3 1.4673e+12 1.4673e+12 0 NaN NaN 3.356e+08 NaN NaN 0 835 202 0 NaN NaN NaN 29633 1.2148e+08 NaN 0 NaN
5 '164c82d29ca4c700' NaN 1.2351e+18 3.8966e+18 1.9813e+18 3 1.4673e+12 1.4673e+12 0 NaN NaN 3.3624e+07 NaN NaN 0 835 196 0 NaN NaN NaN 29633 1.1829e+08 NaN 0 NaN
5 '164c82d2e39e2e00' NaN 6.8662e+18 1.8811e+17 3.81e+18 3 1.4673e+12 1.4673e+12 0 NaN NaN 3.5237e+08 NaN NaN 0 835 202 0 NaN NaN NaN 29633 1.1829e+08 NaN 0 NaN
5 '164c82d2f290be00' NaN 1.0375e+18 2.1788e+18 8.775e+18 3 1.4673e+12 1.4673e+12 0 NaN NaN 2.8524e+08 NaN NaN 0 835 194 0 NaN NaN NaN 29633 1.1836e+08 NaN 0 NaN
5 '164c82d2f57c6300' NaN 1.3533e+17 4.4186e+18 5.3176e+18 3 1.4673e+12 1.4673e+12 0 NaN NaN 2.8523e+08 NaN NaN 0 832 218 0 NaN NaN NaN 58271 1.1761e+08 NaN 0 NaN
5 '164c82d2f7d05f00' NaN 5.5829e+18 4.8856e+18 2.3275e+18 3 1.4673e+12 1.4673e+12 0 NaN NaN 1.6869e+07 NaN NaN 0 832 218 0 NaN NaN NaN 58271 1.1781e+08 NaN 0 NaN
Walter Roberson
on 13 Nov 2016
I went back to my code, and to the disk1.txt and gaoxiaotest2.txt that you had posted. This time around, I can determine that there are no fields upon which those two files match. Every match was at a location where either the data was missing or the field was 0 or 1. I cross-checked all the combinations.
I do not know now how it is I found matches before; they are not there now when I look.
More Answers (0)