Facebook data utf-8 string decode

Addy on 21 Oct 2018
Commented: Guillaume on 1 Nov 2018

Hi,

I have already asked a question about this (Facebook data decode), but I could not find a solution there.

While searching Stack Overflow, I found that someone had linked to a JavaScript function that does the decoding.

The code is available at the link "UTF-8", and the demo website for that code is "UTF-8 encode/decode website".

If anyone knows JavaScript, could you create a function in MATLAB that converts the UTF-8 escape string to the corresponding character? For example, the decoding should work like this:

"\u00f0\u009f\u0098\u009b" to 😛

"\u00f0\u009f\u0091\u008d" to 👍

"\u00e3\u0080\u0082" to 。

"\u00f0\u009f\u0098\u0082" to 😂 ....

Here is how the JavaScript function worked on the UTF-8 encode/decode website:

[Screenshot: decoded characters]

Can anyone come up with a solution? I have been trying to solve this for days.

I have attached a JSON file extracted from the Facebook data for your reference.

Accepted Answer

Guillaume on 21 Oct 2018
Sorry, I never saw that you finally gave the raw JSON in your original question. Otherwise, I would have responded.
The problem you face is that I don't think there is a function in MATLAB that knows how to parse the escape codes in your JSON properly as UTF-8. jsondecode probably assumes a native character set, not UTF-8. I'll write an enhancement request to MathWorks to suggest adding a character set option to jsondecode.
So, instead, you have to detect these escape codes and replace them with their native equivalent. Thankfully, a regexprep with a dynamic replacement expression can do that in one go:
%read message as text
rawtext = fileread('message.json');
%detect \u escape codes and convert them to native using dynamic replacement regular expressions
nativetext = regexprep(rawtext, '(\\u[A-Za-z0-9]{4})*', '${native2unicode(sscanf($0, ''\\\\u%4x'')'', ''UTF-8'')}');
%decode json
conversation = jsondecode(nativetext);
You can see that the text is now decoded properly (R2018b on Windows; this may not be the case with all versions of MATLAB or all OSes):
>> cellfun(@(s) s.content, conversation.messages, 'UniformOutput', false)
ans =
4×1 cell array
{'You sent a sticker.' }
{'thanks' }
{'ok。i just wanted to find jian。now i find him。 good luck bro?'}
{'Exams are coming up ?' }
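To try the same replacement without the attached file, here is a minimal sketch on an inline JSON string. The sample text and the variable name `msg` are assumptions made for illustration; the regexprep call is the one from the answer, unchanged.

```matlab
% Hypothetical one-field JSON text; the three escapes are the UTF-8 bytes of U+3002.
% Single-quoted char vectors keep the backslashes literally.
rawtext = '{"content":"ok\u00e3\u0080\u0082"}';

% Same dynamic replacement as in the answer: each run of \uxxxx escapes is
% turned into bytes by sscanf and decoded as UTF-8 by native2unicode.
nativetext = regexprep(rawtext, '(\\u[A-Za-z0-9]{4})*', ...
    '${native2unicode(sscanf($0, ''\\\\u%4x'')'', ''UTF-8'')}');

% Decode the repaired JSON text.
msg = jsondecode(nativetext);

% Inspect the character codes rather than relying on console rendering.
codes = double(msg.content);
```

Looking at `codes` instead of the displayed string avoids depending on whether the console font can render the character.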
Explanation of the regexp:
  • \\u: match '\u' (literal '\' has to be escaped in regular expressions), followed by
  • [A-Za-z0-9]{4}: match 4 consecutive alphanumeric characters; the intent is the hexadecimal digits 'ABCDEFabcdef0123456789', for which [A-Fa-f0-9]{4} would be the stricter class
  • (...)*: match continuous runs of the above as many times as necessary
This is then converted by
native2unicode(sscanf($0, '\\u%4x')', 'UTF-8')
in the replacement expression, where $0 is the matched text (a continuous run of \uxxxx, e.g. '\u00f0\u009f\u0091\u008d'). sscanf converts each \uxxxx into its numeric byte value, and native2unicode decodes that UTF-8 byte sequence into MATLAB characters, which are stored as UTF-16.
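The two steps of the replacement expression can be followed by hand on a single run of escapes. This is only an illustrative sketch; the variable names are made up here, and the input is the first example from the question.

```matlab
% One run of escapes from the question: the UTF-8 bytes of U+1F61B.
escapes = '\u00f0\u009f\u0098\u009b';

% sscanf reapplies the format to every \uxxxx and returns a column of values.
bytes = sscanf(escapes, '\\u%4x');      % [240; 159; 152; 155]

% Interpret the values as a UTF-8 byte sequence.
decoded = native2unicode(uint8(bytes'), 'UTF-8');

% MATLAB stores the result as UTF-16 code units, here a surrogate pair.
codeunits = double(decoded);            % [55357 56859], i.e. D83D DE1B
```

The transpose matters because sscanf returns a column vector, while the result should be a row of characters like the text it replaces.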
Addy on 21 Oct 2018
Thank you very much, Guillaume. I'm very happy to be able to continue with my project now :)
Guillaume on 1 Nov 2018
As I was going to submit an enhancement request to MathWorks to allow specifying the encoding of escape sequences in JSON, I started to look into the JSON standard.
The JSON standard does not allow escape sequences that encode UTF-8 bytes, only UTF-16 code units. So MATLAB's jsondecode was correct in decoding the JSON the way it did, and the escape sequences in your JSON are not valid.
If this JSON was produced by Facebook, then Facebook doesn't follow the JSON standard and you should complain loudly to them.
In the meantime, my answer constitutes a workaround for that broken JSON, but I'm not submitting an enhancement request since jsondecode does the correct thing.
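The difference between the two escape styles can be seen with a small sketch. The repair step at the end is an assumption for illustration, not part of the answer above: it only makes sense when every character of the decoded string is really a UTF-8 byte (code 255 or below), so it should not be applied to strings that were escaped correctly.

```matlab
% Standard JSON: U+1F61B written as a UTF-16 surrogate pair.
standard = jsondecode('"\ud83d\ude1b"');

% The same character written as four UTF-8 bytes, each escaped as \u00xx
% (the form found in the Facebook export).
bytewise = jsondecode('"\u00f0\u009f\u0098\u009b"');

% jsondecode treats each escape as one code unit, so the second string
% holds four separate characters rather than one emoji.
standardCodes = double(standard);
bytewiseCodes = double(bytewise);

% Assumed post-hoc repair: reinterpret the four characters as UTF-8 bytes.
repaired = native2unicode(uint8(bytewise), 'UTF-8');
sameText = isequal(repaired, standard);
```

Repairing after jsondecode avoids editing the raw JSON text, at the cost of having to visit every string field in the decoded structure.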
