Problme with Text analysis

3 views (last 30 days)
David MERCIER
David MERCIER on 19 Oct 2021
Answered: DGM on 19 Oct 2021
Hi, I try to clean a table containing both latin and non-latin strings to plot a wordcloud. I used regexprep function but not successfully. I can't remove korean strings. Any idea? Here an example of the code and the output:
pathName = 'Keyword Aug. 2020 to Oct. 2021_MatlabSmall.xlsx';
T = readtable(pathName,'Range','A:B');
% Convert all Character Vector to Lowercase
T.Keyword = lower(T.Keyword);
% Remove not useful keywords
T(strcmp(T.Keyword, '(not provided)'), :)=[];
T(strcmp(T.Keyword, '(not set)'), :)=[];
% Set lower case
T.Keyword = lower(T.Keyword);
% Remove links
T(contains(T.Keyword, 'http'), :)=[];
T(contains(T.Keyword, '.'), :)=[];
T.Keyword = strrep(T.Keyword, ' ', '_');
display(head(T));
% Replace non alphanumerics
T.Keyword = regexprep(T.Keyword,'^a-z','');
8×2 table
Keyword Sessions
_________________________________ ________
'stuff' 390
'forum' 128
'student' 76
'재료' 59
'stuff' 56
'uninstall_stuff_license_manager' 52
'stuff_resource_center' 43
'stuff_student_community' 34

Answers (1)

DGM
DGM on 19 Oct 2021
I'm terrible with regex, but this might get you somewhere. Replaces everything but lowercase alpha and underscores.
A = {'9.banana' 'orange-123_juice' 'ン戦国時' 'apple_sauce' 'abcクルミ' 'peach' 'pear' 'ピラミッド' 'cherry'}.'
A = 9×1 cell array
{'9.banana' } {'orange-123_juice'} {'ン戦国時' } {'apple_sauce' } {'abcクルミ' } {'peach' } {'pear' } {'ピラミッド' } {'cherry' }
B = regexprep(A,'[^a-z_]','')
B = 9×1 cell array
{'banana' } {'orange_juice'} {0×0 char } {'apple_sauce' } {'abc' } {'peach' } {'pear' } {0×0 char } {'cherry' }

Products


Release

R2019a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!