readtable(html file) producing extra empty columns

Question

Original question: In another thread, a similar question was asked about readtable with a csv file. The answer there was to set {'delimiter', ','}. Because htmlImportOptions does not have a 'delimiter' property, that answer does not work for my problem. I found that {'EmptyColumnRule','skip'} is a solution. Unfortunately, it can't be combined with htmlImportOptions, which I use to set DataRows.

Update: the name-value syntax does accept a 'DataRows' option. Also, setting opt.ExtraColumnsRule = 'ignore' makes readtable return only the first column.

% Either use an import options object:
opt = htmlImportOptions;
opt.DataRows = 4;
% opt.EmptyColumnRule = 'skip';  % error: the HTML import options object has no such property
% Update:
opt.ExtraColumnsRule = 'ignore';
readtable(htmlfile, opt)  % reads only the first column; the other, non-extra columns are dropped too
% Or use name-value arguments:
% Original post: readtable(htmlfile, 'EmptyColumnRule', 'skip'), where adding {'DataRows', 4} was an error
% Update: this works
readtable(htmlfile, 'EmptyColumnRule', 'skip', 'DataRows', 4)
% But the two cannot be combined:
readtable(htmlfile, opt, 'EmptyColumnRule', 'skip')  % error

I suppose I can read in the ExtraVar columns first and then delete the empty columns, but I would rather have readtable() handle it.

Thanks for any solutions!

Simon on 11 Sep 2023

@dpb I will see if I can create some sample data.

dpb on 11 Sep 2023

It would seem highly unlikely that simply uploading a few files with one or two records would reveal anything terribly damaging. :) Of course, in some rare instance it might be possible for industrial sabotage to occur with only a handful of numbers, or it may be company policy regardless of whether there's any real danger or not. Or it could be a case such as in my former employment, where the data is part of a classified document which, by those rules, makes anything in the document classified whether the specific pieces of data are sensitive or not. Then nothing can be released (despite our current and former leaders who seem to ignore such rules) without a signoff from a derivative classifier, who likely won't declassify it for you just on general principle.

IOW, I'm just suggesting to really consider the actual content and whether there's a real need not to just use the data as is...of course, it should be relatively simple to just readcell, substitute the numeric values with rand of the same size and write back out...
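A minimal sketch of that substitution idea, under stated assumptions: the cell array C below is a hypothetical stand-in for whatever readcell would return from the real file, and only numeric scalars are replaced so that headers and text survive unchanged.

```matlab
% Hypothetical cell array standing in for the output of readcell on the real file.
C = {'',   '',  'X',  'Y'; ...
     'r1', 'a', 1.5,  20; ...
     'r2', 'b', 3.25, 40};

% Logical mask of the cells that hold a numeric scalar.
isNum = cellfun(@(v) isnumeric(v) && isscalar(v), C);

% Replace every numeric cell with a random value; text cells are left alone.
C(isNum) = num2cell(rand(nnz(isNum), 1));

% The scrubbed array could then be written out for posting, for example:
% writecell(C, 'sample.csv');   % file name is an assumption
```

The layout of the file (empty headers, number of columns) is preserved, which is the part that matters for reproducing an import problem.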

Answer 1

dpb on 10 Sep 2023

Use 'SelectedVariableNames' with the variable(s) desired

I can't tell what you want, specifically; there's a comment about reading only the first column??? If that is so, then

opt = htmlImportOptions;
opt.DataRows = 4; 
opt.ExtraColumnsRule = 'ignore';
opt.SelectedVariableNames = opt.VariableNames(1);  % read only the first column
tData=readtable(htmlfile, opt);

Simon on 11 Sep 2023

Edited: Simon on 11 Sep 2023

@dpb "Use 'SelectedVariableNames' with the variable(s) desired"

I think your answer is in the right direction. I suspect the header rows, which I select to be used as variable names, cause the problem. The headers of the first two columns are empty, and the other headers are, say, X, Y, Z. So the extra variables are created by Matlab as Z_1, Z_2, ... etc.

But the problem is that I have many tables, whose headers may be more or less the same, but not always the same. For example, one table may have headers (empty), (empty), X, Y, Z, and another table (empty), (empty), X, Y, W, Z. Matlab treats the first empty header as Var1, and the rest as 'extra' columns.

I will give 'SelectedVariableNames' a shot. Thank you.

@dpb I have just now tried it. It is not working. Matlab insists on treating the first column, whose header is empty, as the only variable. Even SelectedVariableNames = 1:2 produces an out-of-range error. It's critical for me to have the headers of the other columns as variables, but I don't know what they might be beforehand. I have to unaccept your answer. Sorry!

dpb on 11 Sep 2023

Edited: dpb on 11 Sep 2023

You can't expect a single solution to be able to handle different file formats.

If you can't standardize the file format, then you'll have to have a catalog of what are possible formats you expect to be able to handle and then build a tool which can discern which is which from some characteristic within the file or have a way to know heuristically which is which a priori.

But, if the header line or data content is such that the import options detection heuristics don't return the expected number of variables/columns, then you'll have to use some other technique to identify what are columns/variables. But again, if we can't see examples to work with, we're also shooting blindly at an ill-defined problem.

The mind-(file)reading toolbox has yet to be released...

ADDENDUM:

If, as the original Q? content seems to suggest, the "dead ahead" readtable function returns the expected variables plus some that may occasionally be extra/empty, the most expedient solution may well be to simply read the file and then delete the extraneous columns, instead of going through a lot of gyrations parsing the changeable input file to handle various cases explicitly. That holds, again, if you can't standardize the input form and can't know the specifics of the particular file a priori from some other source.

In a particular case with some more-or-less free format Excel files, I simply read the variable names row first and parse it for the specific variables of interest. In that case, the variables will always exist; just sometimes there are some others that the user may have thrown in ad hoc and sometimes there are some empty columns to parse out, but detectImportOptions will return a correct number of variables. If, in the case of an .html file, that can't be assured, then it's going to be a lot more trouble, but I don't have to mess with such and without specific example(s) from which to work, as noted before, we're pretty much hamstrung.
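As an idea generator only, the approach described above might look roughly like the following. This is not the actual code from that project: the temporary CSV file, the table contents, and the list of wanted names are all assumptions made so the sketch runs on its own, and it relies on detectImportOptions finding the right number of variables.

```matlab
% Write a small hypothetical file so the sketch is self-contained.
fname = [tempname '.csv'];
sample = table([1;2], [3;4], [5;6], [7;8], 'VariableNames', {'X','Y','W','Z'});
writetable(sample, fname);

% Let MATLAB detect the variables, then keep only the ones of interest
% that are actually present in this particular file.
opts   = detectImportOptions(fname);
wanted = {'X', 'Z', 'Q'};   % 'Q' is deliberately absent from the file
opts.SelectedVariableNames = intersect(wanted, opts.VariableNames, 'stable');

T = readtable(fname, opts);   % contains only X and Z
delete(fname);
```

Using intersect means a name missing from a given file is silently skipped rather than raising an error, which suits files whose headers vary from one to the next.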

dpb on 12 Sep 2023

As above I've never had to really mess with parsing HTML much, but it's not set up as a format for scanning by tools such as readtable so it's not at all surprising to me to find you're having difficulties.

While it won't be directly applicable to your case, I'll see if I can strip out the parsing stuff/modifications to the import object I described above into a short piece of example code just as idea generator.

If you can figure out a way to post some examples of what your files actually look like, it would still be the best way to see if somebody can build a better mouse trap.

Simon on 17 Sep 2023

@dpb Thanks for offering to help. I couldn't find a similar sample file to upload here. HTML files have all sorts of defects. You were right in your earlier comments not to rely on one function to parse them correctly. I finally used a simple combination of all and ismissing to remove the extra empty columns after readtable(). I greatly appreciate your feedback.
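The exact code was not posted, but a minimal sketch of the all/ismissing combination could look like this. The table T is a hypothetical stand-in for what readtable(htmlfile) returns, with two all-empty columns named the way the thread describes (Z_1, Z_2).

```matlab
% Hypothetical table standing in for the result of readtable(htmlfile).
T = table([1;2;3], NaN(3,1), {'a';'b';'c'}, repmat({''}, 3, 1), ...
    'VariableNames', {'X', 'Z_1', 'Y', 'Z_2'});

% ismissing flags NaN in numeric variables and '' in cellstr variables;
% all(..., 1) marks the columns in which every row is missing.
emptyCols = all(ismissing(T), 1);

% Delete the entirely empty columns.
T(:, emptyCols) = [];
```

Note that a text column imported as string would hold <missing> or "" depending on the import settings, and "" is not counted as missing by ismissing, so this assumes the default char/cellstr text type.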
