Mass replace text, then split the replaced text into multiple columns - from multiple csv files

I have multiple csv files (Azure.csv, Blue.csv,...until 50th file) which has roughly this kind of data inside it:
ID, Duration, Team A, TeamAComposition, Team B, TeamBComposition
120404 3442 Azure Ae Bee Cee Dee Ee Blue Ef Gee Heij Ai Jei
302033 233 Azure Ae Bee Cee Dee Zet Cyan El Em En Ou Pee
...and so on
I would like to replace the team composition data into integers, so for example "Ae" will be replaced into "1", "Bee" will be "2", "Cee" into "3" and so on, and if possible convert those into 5 different new columns. So the new table is roughly like this
ID, Duration, Team A, AComp1, AComp2, AComp3, AComp4, AComp5, Team B, BComp1, BComp2, BComp3, BComp4, BComp5,
120404 3442 Azure 1 2 3 4 5 Blue 6 7 8 9 10
How can I do this? Thanks in advance.

6 Comments

How do you go from 50 files to 5 tables? How is the grouping performed?
Is a team always composed of 5 members?
How is the mapping between team member names and numbers defined? Alphanumeric ordering?
Do you need help with the parsing of the files as well or have you got that sorted out? If not, please provide an example file.

The 50 files will stay as 50 files, what I meant was splitting "TeamComposition" table into 5 different tables since it contains 5 different information inside of it.

The team is always composed of 5 members, but some members have one word in its name, some have 2 words (More explained below).

I'll attach one of the file as an example, although the format will be a bit different than the one I asked.. In the CSV file I attached, "TeamComposition" will be named as "Team A/B Heroes" and there's some heroes which will have 2 words as its name (example: Vengeful Spirit, Templar Assassin, Dark Seer) that's why "Team A/B Heroes" will sometimes have more than 5 words in it.

The mapping between hero names and numbers will be defined as the attached "hero_names.csv"

I think that instead of tables you actually mean columns. Otherwise, I'm confused.
Most of what you want is easy (assuming you're on a recent version of matlab). The only complication is the lack of clear separation between the hero names in the team composition. Can this be changed? If not, is it guaranteed that there will never be full overlap between names?
e.g. you'd never come across ... Vengeful Spirit Assassin ... with a list of hero names containing 'Vengeful Spirit' and 'Spirit Assassin'. Otherwise, the line could be parsed either as ... Vengeful and Spirit Assassin or Vengeful Spirit and Asssassin ...
Yes column's what I meant (RIP Engrish), sorry xD
Would putting coma on each of the hero names be enough? I could edit the source database to separate the names with coma..
A comma or any other separator not present in the names would remove any ambiguity indeed. It's possible to do the splitting without that separator but if there's a chance of the problem above, then an explicit separator is safer.
How can I achieve the multiple column conversion above? (assuming I've given comma after each hero names) Also I'm using Matlab R2015b

Sign in to comment.

 Accepted Answer

Solution for when heroes are just separated by a space:
In its own m file:
function varargout = split_heros(heroes, varargin)
%split multiple scalar strings (varargin) into heroes id
varargout = cell(1, numel(varargin)*5);
for col = 1:numel(varargin)
ishero = ~cellfun(@isempty, regexp(varargin{col}, heroes.regex));
varargout(5*col-4 : 5*col) = num2cell(heroes.hero_id(ishero));
end
end
To create the table:
heronames = readtable('hero_names.csv');
heronames.regex = strcat(heronames.localized_name, '( |$)'); %regular expression to help splitting list of names
%to wrap in a loop for each of the 50 files:
intable = readtable('DC vs EG.csv');
outnames = [sprintfc('AComp%d', 1:5), sprintfc('BComp%d', 1:5)]; %using undocumented sprintfc. in R2016b or later replace with compose
outtable = [intable(:, [1:6, 9, 11]), ...
rowfun(@(varargin) split_heros(heronames, varargin{:}), intable, 'InputVariables', {'TeamAHeroes', 'TeamBHeroes'}, 'NumOutputs', 10, 'OutputVariableNames', outnames)]
Solution for when heroes are separated by a comma:
In its own m file:
function varargout = split_heros(heroes, varargin)
[~, row] = ismember(strsplit(strjoin([varargin{:}], ','), ','), heroes.localized_name);
varargout = num2cell(heroes.hero_id(row));
end
To create the table:
heronames = readtable('hero_names.csv');
%to wrap in a loop for each of the 50 files, same code as before
intable = readtable('DC vs EG.csv');
outnames = [sprintfc('AComp%d', 1:5), sprintfc('BComp%d', 1:5)]; %using undocumented sprintfc. in R2016b or later replace with compose
outtable = [intable(:, [1:6, 9, 11]), ...
rowfun(@(varargin) split_heros(heronames, varargin{:}), intable, 'InputVariables', {'TeamAHeroes', 'TeamBHeroes'}, 'NumOutputs', 10, 'OutputVariableNames', outnames)]

6 Comments

Hey Guillaume, thanks for the great support as always. I've encountered an error halfway through the conversion.. I noticed MATLAB renamed Nature's Prophet into "Nature''s Prophet" (double apostrophe). This gave me error
Variable index exceeds table dimensions.
and I couldn't manually rename the value back to "Nature's Prophet", it would say 'Unexpected Matlab expression'.
Hum, no matlab would not change the name of any hero. If you wanted to write the name yourself in code, then yes, you'd have to escape the ' with a second one, but the content of char array would still be just one ':
>> str = 'Nature''s prophet' %requires a ' to escape the '
str =
'Nature's prophet'
>> sum(str == char(39))
ans =
1
The only thing that matlab will rename are the header of your DC vs EG table, to make them valid variable name.
In any case, this doesn't explain the 'Variable index exceeds table dimensions.' error. Which part of the code is giving this error?
Note: there are no error checking in any of the code above. If a hero is not found, or a team does not have exactly 5 members, then it will cause problems. In the 2nd solution if the hero names are not separated by a comma and nothing else, it will also cause problem.
Finally, I forgot to say: you mentioned that you're writing these text files from a database. Why not use matlab to query the database directly?
It turns out the error was that readtable converts some of the tables into one column only, in this case I had Alliance vs CDEC.csv (3x11) converted into a 3x1 table. Enclosed are the files if you want to check them.
I don't have access to the database anymore, so I had to sadly, use the separated by space solution ;(
That Alliance vs ... file has a different format than your DC vs EG file. It does throw off readtable but that's easily corrected.
intable = readtable(filename, 'Delimiter', '\t,');
should work for both formats. At least it works for me in R2017b. On the other hand, without a delimiter option R2017b complains about the format of your Alliance ... file instead of mistakenly returning a 3x1 table, so YMMV in R2015b.
Sorry to bother you again Guillame, I've come to another dead end xD. When I tried to convert another file this wall of error text came up.. (csv attached)
Error using table/rowfun>dfltErrHandler (line 309)
Applying the function '@(varargin)split_heros(heronames,varargin{:})' to the 38th row of A generated the
following error:
In an assignment A(:) = B, the number of elements in A and B must be the same.
Error in table/rowfun>@(s,varargin)dfltErrHandler(grouped,funName,s,varargin{:}) (line 196)
errHandler = @(s,varargin) dfltErrHandler(grouped,funName,s,varargin{:});
Error in table/rowfun (line 214)
[b_data{i,:}] = errHandler(struct('identifier',ME.identifier, 'message',ME.message,
'index',i),inArgs{:});
Error in replacenames (line 7)
rowfun(@(varargin) split_heros(heronames, varargin{:}), intable, 'InputVariables', {'TeamAHeroes',
'TeamBHeroes'}, 'NumOutputs', 10, 'OutputVariableNames', outnames)];
Is this due to no commas present? As always, thanks again!
If you're getting an "In an assignment A(:) = B, ..." error most likely it's because for some reason the code didn't identify exactly 5 heros in a team. It either found more or less. As said, I didn't put any error checking. You could add the line
assert(sum(ishero) == 5, '%d heros found in team %d', sum(ishero), col);
after the ishero = ... line in split_heros.m to make the error message clearer.
When I test your 'CDEC vs VG.csv' file, I get a problem on the first row of the table because Barathrum is not in your 'hero_name.csv'. I suspect a similar problem happens for row 38 (where you're getting the error).
Note that the check for hero names that I've implemented for the 1st case is fairly crude. It also assumes that the hero names do not contain any of the following characters: ()[]{}|^$.+*\?, that is any of the characters that have special meaning in regular expressions.
If you want to check which heros are identified for a particular team of a particular row, e.g. team B of row 38:
heroes = intable{38, 'TeamBHeroes'}
ishero = ~cellfun(@isempty, regexp(heroes, heronames.regex));
foundheros = heronames(ishero, :)
which in my case shows that Monkey King is not found (since not in heronames)

Sign in to comment.

More Answers (0)

Categories

Tags

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!