How to only follow certain links with a MATLAB spider
Hi, I am struggling to restrict which URLs are followed when using a spider to build a web graph. Basically, I only want the spider to follow links that point to the uni server (shef.ac.uk); any other URLs need to be discarded, as opposed to the current state, where all links are followed. It is probably quite a simple fix.
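% n (maximum number of pages), root (starting URL) and hashfun are
% assumed to be defined elsewhere; they are not shown in this snippet.
U = cell(n,1);               % URLs found so far
hash = zeros(n,1);           % hash of each stored URL, for fast lookup
L = logical(sparse(n,n));    % L(i,j) = 1 when page j links to page i
m = 1;                       % number of URLs currently stored in U
U{m} = root;
hash(m) = hashfun(root);
for j = 1:n
    % Try to open the j-th URL; move on to the next one if that fails.
    try
        disp(['open ' num2str(j) ' ' U{j}])
        page = urlread(U{j});
    catch
        disp(['fail ' num2str(j) ' ' U{j}])
        continue
    end
    % Follow every 'http:' reference found on the page.
    for f = strfind(page,'http:')
        % The URL is taken to run up to the next double quote.
        e = min(strfind(page(f:end),'"'));
        if isempty(e), continue, end
        url = deblank(page(f:f+e-2));
        url(url<' ') = '!';   % mark control characters
        if url(end) == '/', url(end) = []; end
        % Skip URLs with control characters or queries, and image files.
        skips = {'.gif','.jpg','.ico'};
        skip = any(url=='!') | any(url=='?');
        k = 0;
        while ~skip && (k < length(skips))
            k = k+1;
            skip = ~isempty(strfind(url,skips{k}));
        end
        if skip
            if isempty(strfind(url,'.gif')) && isempty(strfind(url,'.jpg'))
                disp([' skip' url])
            end
            continue
        end
        % Check whether the URL is already in the list.
        i = 0;
        for k = find(hash(1:m) == hashfun(url))'
            if isequal(U{k},url)
                i = k;
                break
            end
        end
        % Add a new URL to the list if there is still room.
        if (i == 0) && (m < n)
            m = m+1;
            U{m} = url;
            hash(m) = hashfun(url);
            i = m;
        end
        % Record the link from page j to page i.
        if i > 0
            disp([' link ' int2str(i) ' ' url])
            L(i,j) = 1;
        end
    end
end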
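One possible way to restrict the spider is to test the host part of each URL before it is added to the list. The following is only a minimal sketch, assuming that "pointing to the uni server" means the host is shef.ac.uk or any subdomain of it. The names domain, pat and onDomain are hypothetical and not part of the code above.

```matlab
% Hypothetical helper: true when the host of a URL is the given domain
% or one of its subdomains (e.g. www.shef.ac.uk). Assumption: URLs are
% absolute and start with http:// or https://.
domain = 'shef.ac.uk';

% Pattern: scheme, optional subdomain labels ending in a dot, the escaped
% domain, then either the end of the string or a host delimiter.
pat = ['^https?://([^/:?#]*\.)?' regexptranslate('escape', domain) '([/:?#]|$)'];

% regexp with 'once' returns the start index of the match, or empty.
onDomain = @(url) ~isempty(regexp(url, pat, 'once', 'ignorecase'));

% A few spot checks.
disp(onDomain('http://www.shef.ac.uk/page'))   % host is a subdomain
disp(onDomain('http://shef.ac.uk'))            % host is the domain itself
disp(onDomain('http://notshef.ac.uk'))         % different host
```

With domain, pat and onDomain defined before the main loop, the existing skip test could be extended to skip = any(url=='!') | any(url=='?') | ~onDomain(url); so that off-site URLs are neither stored in U nor recorded in L. Anchoring the pattern at the start of the URL is a deliberate choice: a plain substring search for shef.ac.uk would also accept URLs that merely mention the domain in their path.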
0 Comments
Answers (0)