Retrieve data from a website with multiple pages

Hi all,
I want to pull the data from this website into a table.
It has 185 pages, so I wrote a for loop to go over all of them.
The problem is that I'm using webread, which seems to read everything into a char array.
What I want is that in each iteration of the for loop the data from this table will be read. How can it be done?
thanks

4 Comments

You have to parse the html internally, unless the website offers an API (RESTful for instance).
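As a minimal sketch of what "parsing the HTML internally" can look like for a simple case (the URL is a placeholder, and the sketch assumes the page serves a plain HTML table whose cells are bare <td>...</td> tags with no attributes, which is often not true in practice):

```matlab
% Hedged sketch: placeholder URL, assumes bare <tr>/<td> tags without attributes.
html  = string(webread("https://example.com/table-page")); % raw HTML as a string
rows  = extractBetween(html, "<tr>", "</tr>");  % one string per table row
cells = extractBetween(html, "<td>", "</td>");  % all cell contents, in page order
```

Real pages usually put class attributes on the tags, so the delimiters have to be adapted to whatever the actual markup looks like.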
Thanks for your answer.
I am new to HTML; I tried to figure out the structure of the HTML file but my knowledge is lacking.
Do you maybe have a reference to a guide, so I can understand what I should look for?
thanks again
You can search the html for the text in the table and guess the structure from what you see.
I think it is a <table>, if I understand correctly. I tried to set weboptions ContentType to 'table', but it says there is only text.
I'm not sure that this is the way to approach it though


 Accepted Answer

My answer doesn't totally solve your problem, but it hopefully addresses your main questions. Before you can parse the HTML itself, there is a first hurdle: webread doesn't read the content of the URL because the website uses some measures against bot attacks (read more: https://stackoverflow.com/questions/53434555/python-requests-enable-cookies-javascript), so that needs to be fixed first.
url = "https://www.health.gov.il/Subjects/FoodAndNutrition/food/Pages/Manufacturer.aspx?WPID=WPQ8&PN=1";
opts = weboptions('Timeout', 5e3);
htmlraw = webread(url, opts);

% webread cannot read the contents as the website requests cookies ========
% credits: https://stackoverflow.com/a/53435185
htmlraw = string(htmlraw);
top = htmlraw.split('<script>');
top = top(2);
if contains(top, "Challenge=")
    Challenge = extractBetween(top, "Challenge=", ";");
    challenge_id = extractBetween(top, "ChallengeId=", ";");
    arr = char(Challenge);
    last_digit = str2double(arr(end));
    arr = sort(arr);
    min_digit = str2double(arr(1));
    subvar1 = (2*str2double(arr(3))) + str2double(arr(2));
    subvar2 = string(2*str2double(arr(3))) + str2double(arr(2));
    power = ((str2double(arr(1)) * 1) + 2)^str2double(arr(2));
    x = double(Challenge) * 3 + subvar1;
    y = cos(pi * subvar1);
    answer = x * y;
    answer = answer - power;
    answer = answer + (min_digit - last_digit);
    answer = string(floor(answer)) + subvar2;
    hdrs = {'X-AA-Challenge' char(Challenge); ...
        'X-AA-Challenge-ID' char(challenge_id); ...
        'X-AA-Challenge-Result' char(answer)};
    % now read the website contents =======================================
    htmlraw = webwrite(url, hdrs, opts); % content of URL in HTML code
end

% extract the columns, based on manually inspecting the HTML code
data = htmlTree(htmlraw); % create an HTML tree from the raw content
hdr = findElement(data, "th").extractHTMLText; % table header
col1 = extractBetween(htmlraw, 'columnLisenceNumber">', "</td>");
col2 = extractBetween(htmlraw, 'HeValue">', "</td>");
col3 = extractBetween(htmlraw, "</table></td><td>", '</td><td class="columnWorkCity');
col4 = extractBetween(htmlraw, 'columnWorkCity">', "</td>");
col5 = extractBetween(htmlraw, 'columnWorkCity">' + ...
    wildcardPattern + "</td><td>", ...
    '</td><td class="gvDescImg showHideImg"');
% description column (last col)
col6_hdr = extractBetween(htmlraw, '<td class="TenderDescription">', '</td>');
col6_hdr = col6_hdr(1:2);
col6 = extractBetween(htmlraw, '<p class="itemDesc TenderDesc">', '</p>');
col6 = reshape(col6, [], 2);

% reorder as a table
% append the header so column 6 can have descriptive info as well
hdr = string([hdr; hdr(end)]); % 7 columns
hdr(6:7) = hdr(6:7) + "_" + col6_hdr;
tab = array2table([col1, col2, col3, col4, col5, col6], ...
    'VariableNames', hdr);
tab = convertvars(tab, 1:width(tab), @string);
tab.(1) = double(tab.(1));
head(tab)
ans = 8×7 table
(column headers, left to right: מספר רישיון = license number, שם יצרן = manufacturer name, כתובת = address, ישוב = city, מחוז = district, פרטים_סוג מזון (מהות היצור): = details_food type, פרטים_קבוצת מזון: = details_food group)
55678    "א. הקר 2009 גלאט למהדרין בע"מ"    "מרכז ספיר 3 ירושלים"    "ירושלים"    "ירושלים"    "ייצור מוצרי בשר קפואים בלבד: בשר בקר טחון, בשר בעלי כנף טחון ומוצריהם, קישקע ממולא, בשר בקר מעובד, בשר בעלי כנף מעובד, ניסור ואריזת בשר בקר קפוא"    "הסעדה"
68795    "א. כ. התעשיינים בע"מ"    "שד הסנהדרין 3 יבנה"    "יבנה"    "מרכז"    "בשר ומוצריו, לרבות עופות וצייד"    "הסעדה (קיטרינג)"
52319    "א.א בורקס ליאון"    "איתן 24 ראשון לציון"    "ראשון לציון"    "מרכז"    "אחסנה בקירור"    "אחסון מזון בקירור"
69047    "א.א בליסימו בע"מ"    "איתן 3 ראשון לציון"    "ראשון לציון"    "מרכז"    "קרחונים אכילים, כולל שרבט וסורבט"    "מחסן קרור/מחסן בטמ' מבוקרת"
67457    "א.א מטעמים הכי טעים בע"מ"    "מודיעין 8 פתח תקווה"    "פתח תקווה"    "מרכז"    "ייצור בצקים ממולאים, ייצור עוגיות יבשות"    "לחם, לחמניות, עוגות שמרים ומאפים"
52312    "א.א. בליסימו בע"מ"    "לזרוב 3 ראשון לציון"    "ראשון לציון"    "מרכז"    "מוצרי מאפה, תערובות להכנתם ובצקים"    "לחמים ולחמניות מאודים"
50780    "א.א. דרך האוכל (חיפה) בע"מ"    "שנקר אריה 47 חיפה"    "חיפה"    "חיפה"    "אחסנת בצקים קפואים"    "יצור מוצרי בשר בקר וצאן טחון בלבד"
52587    "א.א. לרנר מוצרי מזון העמק בע"מ"    "הפועלים 2 באר שבע"    "באר שבע"    "דרום"    "מחסן קרור/מחסן בטמ' מבוקרת"    "בשר ומוצריו, לרבות עופות וצייד"

10 Comments

Thank you very much for your answer!
I'm getting this error:
Array indices must be positive integers or logical values.
Error using webread (line 122)
The connection to URL 'https://www.health.gov.il/Subjects/FoodAndNutrition/food/Pages/Manufacturer.aspx?WPID=WPQ8&PN=5' timed out after 5 seconds. Set options.Timeout to a higher value.
Error in Untitled2 (line 2)
htmlraw = webread(url);
Since I do not fully understand this code, it is hard for me to correct it. Can you please explain what arr is?
And if I understand correctly, are the last 5 lines supposed to extract the data from the table?
I really appreciate your help, thank you!
Ive J on 20 Feb 2022
Edited: Ive J on 20 Feb 2022
Ah, you need to increase the Timeout property. See my edited answer.
> and if I understand correctly, are the last 5 rows supposed to extract the data from the table?
Yes, that's correct. In fact you need to look at htmlraw, which is the content of the website; but it's HTML code, so one has to parse it manually. Thankfully, MATLAB offers a bunch of functions to ease this process (e.g. htmlTree, which parses raw HTML code). Next, you need to look for the table (see here to understand what tr and td mean in an HTML table: https://www.w3schools.com/html/html_tables.asp).
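As a hedged sketch of that idea (it requires the Text Analytics Toolbox, and the "table tr"/"table td" CSS selectors assume the data sits in a plain <table>):

```matlab
% Sketch only: htmlraw is the raw HTML string already fetched from the page.
tree  = htmlTree(htmlraw);                              % parse raw HTML into a tree
rows  = findElement(tree, "table tr");                  % CSS selector: all table rows
cells = extractHTMLText(findElement(tree, "table td")); % cell text as a string array
```

From there, cells can be reshaped by the known column count to rebuild the table, instead of matching delimiter strings with extractBetween.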
this is the error I'm getting:
Array indices must be positive integers or logical values.
The problem seems to be with line 10: Challenge = extractBetween(top, "Challenge=", ";");
I inspected the HTML and didn't find the word "Challenge". As far as I understand, the script splits the HTML file at '<script>' and then extracts from "Challenge=" to ";", is that correct?
Since arr was an empty array when I ran it, it fails at line 13.
Since I didn't find the word "Challenge" in the HTML, I wonder how you decided to search the HTML for this word?
thank you very much for the help! I really appreciate the efforts :)
Can you try it now?
The error is because the website needs cookies for IPs from other regions (and as you can see, MATLAB Online runs it successfully), but apparently this doesn't apply to users within the country (presumably).
Oh, that's interesting, I didn't know it has anything to do with where I ran it from...
Now there is a problem with the function htmlTree (I don't have this toolbox),
but no worries, I ran it with MATLAB Online and it works perfectly!
I really appreciate your help!
Glad it worked.
I used it only for the header; I guess you can extract it manually, or simply copy the header from my code above (since the header is always the same for all tables). htmlTree was introduced in R2018b (Text Analytics Toolbox).
I actually wanted to read the table on each of the 185 pages, so I did it with a for loop and appended the table from each iteration to the main table with vertcat. The header was written only once, to the main table.
I'm trying to figure out why it reads only the first 8 rows out of 12; do you have any idea why this happens?
I'm not sure if I get it right; do you mean you tried something like this?
function unitab = parseFoodAndNutrition(n)
if nargin < 1
    n = 3; % read only 3 pages
end
unitab = cell(n, 1);
for i = 1:n
    fprintf('reading page %d of %d\n', i, n)
    unitab{i} = readEachPage(i);
end
unitab = vertcat(unitab{:});
unitab = convertvars(unitab, 1:width(unitab), @string);
unitab.(1) = double(unitab.(1));
end % END

%% subfunctions ===========================================================
function tab = readEachPage(n)
url = "https://www.health.gov.il/Subjects/FoodAndNutrition/food/Pages/Manufacturer.aspx?WPID=WPQ8&PN=" + n;
opts = weboptions('Timeout', 5e3);
htmlraw = webread(url, opts);

% webread cannot read the contents as the website requests cookies ========
% credits: https://stackoverflow.com/a/53435185
htmlraw = string(htmlraw);
top = htmlraw.split('<script>');
top = top(2);
if contains(top, "Challenge=")
    Challenge = extractBetween(top, "Challenge=", ";");
    challenge_id = extractBetween(top, "ChallengeId=", ";");
    arr = char(Challenge);
    last_digit = str2double(arr(end));
    arr = sort(arr);
    min_digit = str2double(arr(1));
    subvar1 = (2*str2double(arr(3))) + str2double(arr(2));
    subvar2 = string(2*str2double(arr(3))) + str2double(arr(2));
    power = ((str2double(arr(1)) * 1) + 2)^str2double(arr(2));
    x = double(Challenge) * 3 + subvar1;
    y = cos(pi * subvar1);
    answer = x * y;
    answer = answer - power;
    answer = answer + (min_digit - last_digit);
    answer = string(floor(answer)) + subvar2;
    hdrs = {'X-AA-Challenge' char(Challenge); ...
        'X-AA-Challenge-ID' char(challenge_id); ...
        'X-AA-Challenge-Result' char(answer)};
    % now read the website contents =======================================
    htmlraw = webwrite(url, hdrs, opts); % content of URL in HTML code
end

% extract the columns, based on manually inspecting the HTML code
data = htmlTree(htmlraw); % create an HTML tree from the raw content
hdr = findElement(data, "th").extractHTMLText; % table header
col1 = extractBetween(htmlraw, 'columnLisenceNumber">', "</td>");
col2 = extractBetween(htmlraw, 'HeValue">', "</td>");
col3 = extractBetween(htmlraw, "</table></td><td>", '</td><td class="columnWorkCity');
col4 = extractBetween(htmlraw, 'columnWorkCity">', "</td>");
col5 = extractBetween(htmlraw, 'columnWorkCity">' + ...
    wildcardPattern + "</td><td>", ...
    '</td><td class="gvDescImg showHideImg"');
% description column (last col)
col6_hdr = extractBetween(htmlraw, '<td class="TenderDescription">', '</td>');
col6_hdr = col6_hdr(1:2);
col6 = extractBetween(htmlraw, '<p class="itemDesc TenderDesc">', '</p>');
col6 = reshape(col6, [], 2);

% reorder as a table
% append the header so column 6 can have descriptive info as well
hdr = string([hdr; hdr(end)]); % 7 columns
hdr(6:7) = hdr(6:7) + "_" + col6_hdr;
tab = array2table([col1, col2, col3, col4, col5, col6], ...
    'VariableNames', hdr);
% can be done at once in the end
% tab = convertvars(tab, 1:width(tab), @string);
% tab.(1) = double(tab.(1));
end
When I run the above function I get this:
size(unitab)
ans =
36 7
I actually put your entire script in a for loop and changed the URL as i increased. Then in each iteration I appended the result from your script to another table using vertcat. If I understand correctly, the size(unitab) = (36,7) result is for pages 1-3? If so, this is the dimension I'm expecting to receive.
Yes, that's for 3 pages.
Feel free to use the function above! Also be aware that when you send so many requests to a website, it may block your IP (temporarily).
To track down possible parsing bugs, you can also save each table as a mat file. This way, if you expect, let's say, 120 rows and get only 100, you can inspect each table individually. You can do this by adding these lines:
for i = 1:n
    fprintf('reading page %d of %d\n', i, n)
    tab = readEachPage(i);
    save("tab.page." + i + ".mat", "tab") % e.g. tab.page.10.mat contains the table for page 10
    unitab{i} = tab;
end
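If the site ever does start throttling your requests, one common mitigation (a hypothetical tweak here, not tested against this site) is to pause briefly between page fetches:

```matlab
% Sketch: same loop as above, with a short delay between requests.
for i = 1:n
    fprintf('reading page %d of %d\n', i, n)
    unitab{i} = readEachPage(i);
    pause(1) % wait 1 s between requests to lower the chance of an IP block
end
```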


More Answers (0)

Asked:

on 18 Feb 2022

Commented:

on 22 Feb 2022
