Main Content

extractFileText

Read text from PDF, Microsoft Word, HTML, and plain text files

Description

example

str = extractFileText(filename) reads the text data from a file as a string.

example

str = extractFileText(filename,Name,Value) specifies additional options using one or more name-value pair arguments.

Examples

collapse all

Extract the text from sonnets.txt using extractFileText. The file sonnets.txt contains Shakespeare's sonnets in plain text.

str = extractFileText("sonnets.txt");

View the first sonnet.

i = strfind(str,"I");
ii = strfind(str,"II");
start = i(1);
fin = ii(1);
extractBetween(str,start,fin-1)
ans = 
    "I
     
       From fairest creatures we desire increase,
       That thereby beauty's rose might never die,
       But as the riper should by time decease,
       His tender heir might bear his memory:
       But thou, contracted to thine own bright eyes,
       Feed'st thy light's flame with self-substantial fuel,
       Making a famine where abundance lies,
       Thy self thy foe, to thy sweet self too cruel:
       Thou that art now the world's fresh ornament,
       And only herald to the gaudy spring,
       Within thine own bud buriest thy content,
       And tender churl mak'st waste in niggarding:
         Pity the world, or else this glutton be,
         To eat the world's due, by the grave and thee.
     
       "

Extract the text from exampleSonnets.pdf using extractFileText. The file exampleSonnets.pdf contains Shakespeare's sonnets in a PDF file.

str = extractFileText("exampleSonnets.pdf");

View the second sonnet.

ii = strfind(str,"II");
iii = strfind(str,"III");
start = ii(1);
fin = iii(1);
extractBetween(str,start,fin-1)
ans = 
    "II 
      
       When forty winters shall besiege thy brow, 
       And dig deep trenches in thy beauty's field, 
       Thy youth's proud livery so gazed on now, 
       Will be a tatter'd weed of small worth held: 
       Then being asked, where all thy beauty lies, 
       Where all the treasure of thy lusty days; 
       To say, within thine own deep sunken eyes, 
       Were an all-eating shame, and thriftless praise. 
       How much more praise deserv'd thy beauty's use, 
       If thou couldst answer 'This fair child of mine 
       Shall sum my count, and make my old excuse,' 
       Proving his beauty by succession thine! 
         This were to be new made when thou art old, 
         And see thy blood warm when thou feel'st it cold. 
      
       "

Extract the text from pages 3, 5, and 7 of the PDF file.

pages = [3 5 7];
str = extractFileText("exampleSonnets.pdf", ...
    'Pages',pages);

View the 10th sonnet.

x = strfind(str,"X");
xi = strfind(str,"XI");
start = x(1);
fin = xi(1);
extractBetween(str,start,fin-1)
ans = 
    "X 
      
       Is it for fear to wet a widow's eye, 
       That thou consum'st thy self in single life? 
       Ah! if thou issueless shalt hap to die, 
       The world will wail thee like a makeless wife; 
       The world will be thy widow and still weep 
       That thou no form of thee hast left behind, 
       When every private widow well may keep 
       By children's eyes, her husband's shape in mind: 
       Look! what an unthrift in the world doth spend 
       Shifts but his place, for still the world enjoys it; 
       But beauty's waste hath in the world an end, 
       And kept unused the user so destroys it. 
         No love toward others in that bosom sits 
         That on himself such murd'rous shame commits. 
      
       X 
      
       For shame! deny that thou bear'st love to any, 
       Who for thy self art so unprovident. 
       Grant, if thou wilt, thou art belov'd of many, 
       But that thou none lov'st is most evident: 
       For thou art so possess'd with murderous hate, 
       That 'gainst thy self thou stick'st not to conspire, 
       Seeking that beauteous roof to ruinate 
       Which to repair should be thy chief desire. 
     
     
      
       "

If your text data is contained in multiple files in a folder, then you can import the text data into MATLAB using a file datastore.

Create a file datastore for the example sonnet text files. The examples sonnets have file names "exampleSonnetN.txt", where N is the number of the sonnet. Specify the read function to be extractFileText.

readFcn = @extractFileText;
fds = fileDatastore('exampleSonnet*.txt','ReadFcn',readFcn);

Create an empty bag-of-words model.

bag = bagOfWords
bag = 
  bagOfWords with properties:

          Counts: []
      Vocabulary: [1x0 string]
        NumWords: 0
    NumDocuments: 0

Loop over the files in the datastore and read each file. Tokenize the text in each file and add the document to bag.

while hasdata(fds)
    str = read(fds);
    document = tokenizedDocument(str);
    bag = addDocument(bag,document);
end

View the updated bag-of-words model.

bag
bag = 
  bagOfWords with properties:

          Counts: [4x276 double]
      Vocabulary: ["From"    "fairest"    "creatures"    "we"    "desire"    "increase"    ","    "That"    "thereby"    "beauty's"    "rose"    "might"    "never"    "die"    "But"    "as"    "the"    "riper"    "should"    ...    ] (1x276 string)
        NumWords: 276
    NumDocuments: 4

To extract text data directly from HTML code, use extractHTMLText and specify the HTML code as a string.

code = "<html><body><h1>THE SONNETS</h1><p>by William Shakespeare</p></body></html>";
str = extractHTMLText(code)
str = 
    "THE SONNETS
     
     by William Shakespeare"

Input Arguments

collapse all

Name of the file, specified as a string scalar, character vector, or a 1-by-1 cell array containing a character vector.

Data Types: string | char | cell

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: 'Pages',[1 3 5] specifies to read pages 1, 3, and 5 from a PDF file.

Character encoding to use, specified as the comma-separated pair consisting of 'Encoding' and a character vector or a string scalar. The character vector or string scalar must contain a standard character encoding scheme name such as the following.

"Big5"

"ISO-8859-1"

"windows-874"

"Big5-HKSCS"

"ISO-8859-2"

"windows-949"

"CP949"

"ISO-8859-3"

"windows-1250"

"EUC-KR"

"ISO-8859-4"

"windows-1251"

"EUC-JP"

"ISO-8859-5"

"windows-1252"

"EUC-TW"

"ISO-8859-6"

"windows-1253"

"GB18030"

"ISO-8859-7"

"windows-1254"

"GB2312"

"ISO-8859-8"

"windows-1255"

"GBK"

"ISO-8859-9"

"windows-1256"

"IBM866"

"ISO-8859-11"

"windows-1257"

"KOI8-R"

"ISO-8859-13"

"windows-1258"

"KOI8-U"

"ISO-8859-15"

"US-ASCII"

 

"Macintosh"

"UTF-8"

 

"Shift_JIS"

 

If you do not specify an encoding scheme, then the function performs heuristic auto-detection for the encoding to use. The heuristics depend on your locale. If these heuristics fail, then you must specify one explicitly.

This option only applies when the input is a plain text file.

Data Types: char | string

Extraction method, specified as the comma-separated pair consisting of 'ExtractionMethod' and one of the following:

OptionDescription
'tree'Analyze the DOM tree and text contents, then extract a block of paragraphs.
'article'Detect article text and extract a block of paragraphs.
'all-text'Extract all text in the HTML body, except for scripts and CSS styles.

This option supports HTML file input only.

Password to open the PDF file, specified as the comma-separated pair consisting of 'Password' and a character vector or a string scalar. This option only applies if the input file is a PDF.

Example: 'Password','skroWhtaM'

Data Types: char | string

Pages to read from PDF file, specified as the comma-separated pair consisting of 'Pages' and a vector of positive integers. This option only applies if the input file is a PDF file. The function, by default, reads all pages from the PDF file.

Example: 'Pages',[1 3 5]

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

Tips

  • To read text directly from HTML code, use extractHTMLText.

  • To read text separated by lines in a text file, use readlines.

Version History

Introduced in R2017b

expand all