Extract part of a string with an extension

Hi, I have a long string and I want to extract just the names that have "hdf" as an extension.
I just want to get "MOD11C1.A2013001.005.2013015221704.hdf"
My string is:
U.S. GOVERNMENT COMPUTER
This US Government computer is for authorized users only. By accessing this
system you are consenting to complete monitoring with no expectation of privacy.
Unauthorized access or use may subject you to disciplinary action and criminal
prosecution.
********************************************************************************
</pre>
<pre><img src="/icons/blank.gif" alt="Icon "> Name Last modified Size Description<hr><img src="/icons/back.gif" alt="[DIR]"> Parent Directory -
<img src="/icons/image2.gif" alt="[IMG]"> BROWSE.MOD11C1.A2013001.005.2013015221704.1.jpg 15-Jan-2013 16:29 3.2M
<img src="/icons/image2.gif" alt="[IMG]"> BROWSE.MOD11C1.A2013001.005.2013015221704.2.jpg 15-Jan-2013 16:29 3.3M
<img src="/icons/unknown.gif" alt="[ ]"> MOD11C1.A2013001.005.2013015221704.hdf 15-Jan-2013 16:29 46M
<img src="/icons/unknown.gif" alt="[ ]"> MOD11C1.A2013001.005.2013015221704.hdf.xml 16-Jan-2013 02:15 32K
<hr></pre>
</body></html>
Thanks,
Zeinab
Andrea on 3 Dec 2014
Edited: Andrea on 3 Dec 2014
Thanks. The file always has exactly the same extension, "hdf", and its name always starts with MOD. As you can see, the name I am interested in is MOD11C1.A2013001.005.2013015221704.hdf, but it changes in other loop iterations according to the date; for instance, MOD11C1.A2013001.005.2013015221704.hdf will become MOD11C1.A2013001.005.2013015221705.hdf.
The reason I need this is that I want to read the contents of a web address (which changes inside a loop) with urlread, which returns the content as a string. I then need to use urlwrite to save the files I want according to their filenames (the ones with the hdf extension).
Please see this: str=urlread(path1);
Many thanks. I have already spent more than 6 hours on this!
farz


Accepted Answer

per isakson on 3 Dec 2014
Edited: per isakson on 4 Dec 2014
Here is a solution(?) based on regexp
>> cac = cssm;
>> cac{:}
ans =
MOD11C1.A2013001.005.2013015221704.hdf
ans =
MOD11C1.A2013001.005.2013015221704.hdf
>>
where
function cac = cssm()
str = fileread( 'cssm.txt' );
name_xpr = '[\w\.]+\.hdf';
cac = regexp( str, name_xpr, 'match' );
end
and cssm.txt contains the text of your question. Getting two identical names seems to be correct. You might want to apply unique to the result.
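As a minimal sketch of the unique step mentioned above: the sample text below is only an assumed stand-in for cssm.txt, in which the name appears twice, so the snippet runs without the file.

```matlab
% Assumed stand-in for the contents of cssm.txt: the name occurs twice.
str = sprintf(['I want "MOD11C1.A2013001.005.2013015221704.hdf"\n', ...
               'MOD11C1.A2013001.005.2013015221704.hdf 15-Jan-2013 16:29 46M\n']);

name_xpr = '[\w\.]+\.hdf';
cac = regexp(str, name_xpr, 'match');   % cell array holding both occurrences
n_before = numel(cac);                  % number of raw matches

cac = unique(cac);                      % drop duplicates (result is sorted)
n_after = numel(cac);                   % number of distinct names
```

unique accepts a cell array of character vectors, so it can be applied directly to the output of regexp.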
In response to comments:
My mistake illustrates a problem with regular expressions: they often match unexpected strings. I missed the case where ".hdf" is part of the base name rather than the extension. I have now added the requirement that ".hdf" be followed by "\s, Any white-space character; equivalent to [ \f\n\r\t\v]". However, that white-space character is not included in the output.
>> cssm
ans =
'MOD11C1.A2013001.005.2013015221704.hdf'
function cac = cssm()
str = fileread( 'cssm.txt' );
name_xpr = '[\w\.]+\.hdf(?=\s)'; % <<<<<<< modified
cac = regexp( str, name_xpr, 'match' );
end
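The effect of the lookahead can be checked on a small sample. This is only a sketch: the two lines below are copied from the listing in the question and stand in for the full file.

```matlab
% Two listing lines from the question: a real .hdf file and its .hdf.xml sidecar.
str = sprintf(['MOD11C1.A2013001.005.2013015221704.hdf 15-Jan-2013 16:29 46M\n', ...
               'MOD11C1.A2013001.005.2013015221704.hdf.xml 16-Jan-2013 02:15 32K\n']);

% Without the lookahead, the ".hdf" inside ".hdf.xml" also produces a match.
plain = regexp(str, '[\w\.]+\.hdf', 'match');

% With (?=\s), ".hdf" must be followed by white space; the white space
% itself is checked but not made part of the match.
strict = regexp(str, '[\w\.]+\.hdf(?=\s)', 'match');
```

Here plain should hold two entries and strict only one, because ".hdf" in the second line is followed by a period rather than white space.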
Stephen Cobeldick already proposed this modification to the expression. I like Stephen's list, which helps to pinpoint the unique characteristics of the string. It triggers thinking. Does the filename always start with "MOD"? Could "MOD" appear in the middle of the name? It's risky to deduce rules from small samples. If the name always starts with "MOD", then
name_xpr = '(?<=\s)MOD[\w\.]+\.hdf(?=\s)';
is a better expression.
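A small sketch of what the lookbehind changes. The first sample line uses a hypothetical name, "BROWSE.MOD...hdf", which does not occur in the question and is invented here only to put "MOD" in the middle of a name.

```matlab
% First line: hypothetical name with "MOD" in the middle (assumption for illustration).
% Second line: the wanted name, preceded by a space.
str = sprintf(['BROWSE.MOD11C1.A2013001.005.2013015221704.hdf 15-Jan-2013\n', ...
               ' MOD11C1.A2013001.005.2013015221704.hdf 15-Jan-2013\n']);

% Without the lookbehind, a match may start at "MOD" inside a longer name.
loose = regexp(str, 'MOD[\w\.]+\.hdf(?=\s)', 'match');

% With (?<=\s), "MOD" must be directly preceded by white space.
anchored = regexp(str, '(?<=\s)MOD[\w\.]+\.hdf(?=\s)', 'match');
```

One consequence of this design: a name at the very start of the string has no preceding white space, so the lookbehind version would not match it.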
Andrea on 3 Dec 2014
Thank you. I tried the one with "\s" as you suggested, but it did not work. The previous one worked fine for me but returned all the files with the hdf extension, which was not a big problem. The one you suggested should give me a unique answer, but it isn't working: it returns an empty cell.


More Answers (1)

Stephen23 on 3 Dec 2014
Edited: Stephen23 on 3 Dec 2014
Why not all on one line?
str = fileread('temp.txt');
C = regexp(str,'MOD[\w\.]+\.hdf(?=\s)','match');
C =
'MOD11C1.A2013001.005.2013015221704.hdf'
This matches all substrings that meet the following conditions:
  • starts with 'MOD'
  • ends with '.hdf'
  • contains any combination of alphanumeric characters, underscores and periods
  • is followed by a white-space character (i.e. excludes '....hdf.xml')
As suggested by per isakson, you might also want to apply unique to the output.
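The following sketch ties the pieces together for the urlread/urlwrite workflow described in the question. It rests on assumptions that the thread does not confirm: that the raw page returned by urlread wraps each name in an <a> tag, so the name is followed by a quote or "<" rather than white space, and that the files live under the placeholder address in base_url. For that reason it swaps the white-space lookarounds for negative lookarounds, which is a different technique from the expressions above.

```matlab
% Hypothetical raw listing: names inside <a> tags (assumed, not from the thread).
str = ['<a href="MOD11C1.A2013001.005.2013015221704.hdf">', ...
       'MOD11C1.A2013001.005.2013015221704.hdf</a> 15-Jan-2013 16:29 46M', char(10), ...
       '<a href="MOD11C1.A2013001.005.2013015221704.hdf.xml">', ...
       'MOD11C1.A2013001.005.2013015221704.hdf.xml</a> 16-Jan-2013 02:15 32K'];
base_url = 'http://example.com/data/';   % hypothetical placeholder address

% The name must not be preceded or followed by a word character or a period,
% so "BROWSE.MOD..." and "....hdf.xml" are both rejected.
name_xpr = '(?<![\w\.])MOD[\w\.]+\.hdf(?![\w\.])';

names = unique(regexp(str, name_xpr, 'match'));  % distinct file names
urls  = strcat(base_url, names);                 % one full address per name
```

Each pair could then be passed to urlwrite(urls{k}, names{k}) inside a loop over k to save the file under its own name.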
