How should XPath be set in TableSelector for htmlImportOptions so readtable( ) can output the first three tables in an html file?
I would like to read the first three tables in an HTML file with a single call to readtable( ), in order to reduce the file-reading time. However, the readtable( ) function from the Database Toolbox seems to read only one table at a time. I have tried setting TableSelector to several different XPath expressions, but they either return an error message or only one table. For example, the one below returns the first table, but not table 2 or 3.
opts = detectHtmlImportOptions(htmlfile);
opts.TableSelector = "//TABLE[position()<4]";
readtable(htmlfile, opts)
I wonder whether, because the output argument of readtable( ) is a single table, it can only read one table at a time.
Another related question.
% Why doesn't lowercase 'table' work?
opts.TableSelector = "//table[1]";
readtable(htmlfile, opts)
% ans=
% 0x1 empty table
% When TABLE[1] uses uppercase letters, readtable( ) outputs the first table correctly.
opts.TableSelector = "//TABLE[1]";
readtable(htmlfile, opts)
Accepted Answer
Kevin Gurney
on 30 Sep 2022
1. Is there a way to return multiple tables from readtable in a single function call?
Answer:
Currently, readtable does not support returning multiple tables with a single function call. However, your expectations for how readtable should work in this case are very reasonable, so thank you for pointing this out as something to consider for a potential future enhancement to readtable.
Conceptually, the XPath expression //TABLE that you are using is "correct", in that it does semantically mean "select all of the <table> elements in the HTML file". However, readtable currently only returns the first table in the set of tables selected by an XPath expression.
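Since one call returns one table, a possible workaround is to call readtable once per table and collect the results in a cell array. The sketch below is only an illustration under some assumptions: it assumes a MATLAB release in which readtable accepts the TableIndex name-value argument for HTML files, and it writes a small made-up three-table HTML file (the file contents, the variable names tables and numTables, and the sample data are all hypothetical) so that it can run on its own. Note that this still parses the file once per call, so it does not reduce the file-reading time.

```matlab
% Build a small HTML file with three sibling tables (illustrative data only).
htmlfile = [tempname '.html'];
fid = fopen(htmlfile, 'w');
fprintf(fid, '<html><body>\n');
for k = 1:3
    fprintf(fid, ['<table><tr><th>A</th><th>B</th></tr>' ...
                  '<tr><td>%d</td><td>%d</td></tr></table>\n'], k, 10*k);
end
fprintf(fid, '</body></html>\n');
fclose(fid);

% Read the first three tables, one readtable call per table.
% Assumption: the TableIndex name-value argument is available for HTML files.
numTables = 3;
tables = cell(1, numTables);
for k = 1:numTables
    tables{k} = readtable(htmlfile, 'TableIndex', k);
end

% Remove the temporary file.
delete(htmlfile);
```

After the loop, tables{1}, tables{2}, and tables{3} hold the three tables in document order.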
2. Why does "TABLE" need to be uppercase in the TableSelector XPath expression?
Answer:
This is a consequence of the fact that (1) XPath is case-sensitive and (2) readtable normalizes HTML element names to uppercase on import. This is a limitation of how readtable interacts with HTML files.
More Answers (1)
Christopher Creutzig
on 30 Sep 2022
readtable currently only returns a single table. There has been talk about a function returning multiple tables, but I don't know of any concrete plans. It may be worth letting support@mathworks.com know you are looking for something like that – given the time things can take from inception to release, it may not always be obvious, but user demand does influence priorities.
As for lowercase table selectors … table selectors are XPath expressions, and XPath is, well, case-sensitive. Most HTML versions/variants (maybe in practice all of them) are case agnostic, although their standards differ in what they regard as the “right” casing to use. htmlTree normalizes to uppercase. But that doesn't mean we could simply treat the XPath expression as case agnostic, as it can contain parts where case matters. I'm not sure if your question is simple curiosity or if this is actually a bump in the road to solving your problems. If the latter, please let us know more.
Nitpick: readtable does not require Database Toolbox, it is in core MATLAB.