This is machine translation

Translated by Microsoft
Mouseover text to see original. Click the button below to return to the English version of the page.

Note: This page has been translated by MathWorks. Click here to see
To view all translated materials including this page, select Country from the country navigator on the bottom of this page.

addDocument

Add documents to bag-of-words or bag-of-n-grams model

Syntax

newBag = addDocument(bag,documents)

Description

example

newBag = addDocument(bag,documents) adds documents to the bag-of-words or bag-of-n-grams model bag.

Examples

collapse all

Create a bag-of-words model from an array of tokenized documents.

documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);
bag = bagOfWords(documents)
bag = 
  bagOfWords with properties:

          Counts: [2x7 double]
      Vocabulary: [1x7 string]
        NumWords: 7
    NumDocuments: 2

Create another array of tokenized documents and add it to the same bag-of-words model.

documents = tokenizedDocument([ 
    "a third example of a short sentence" 
    "another short sentence"]);
newBag = addDocument(bag,documents)
newBag = 
  bagOfWords with properties:

          Counts: [4x9 double]
      Vocabulary: [1x9 string]
        NumWords: 9
    NumDocuments: 4

If your text data is contained in multiple files in a folder, then you can import the text data into MATLAB using a file datastore.

Create a file datastore for the example sonnet text files. The examples sonnets have file names "exampleSonnetN.txt", where N is the number of the sonnet. Specify the read function to be extractFileText.

readFcn = @extractFileText;
fds = fileDatastore('exampleSonnet*.txt','ReadFcn',readFcn)
fds = 
  FileDatastore with properties:

                       Files: {
                              ' .../tp395bd4b6/textanalytics-ex73762432/exampleSonnet1.txt';
                              ' .../tp395bd4b6/textanalytics-ex73762432/exampleSonnet2.txt';
                              ' .../tp395bd4b6/textanalytics-ex73762432/exampleSonnet3.txt'
                               ... and 1 more
                              }
                 UniformRead: 0
                    ReadMode: 'file'
                   BlockSize: Inf
                  PreviewFcn: @extractFileText
                     ReadFcn: @extractFileText
    AlternateFileSystemRoots: {}

Create an empty bag-of-words model.

bag = bagOfWords
bag = 
  bagOfWords with properties:

          Counts: []
      Vocabulary: [1x0 string]
        NumWords: 0
    NumDocuments: 0

Loop over the files in the datastore and read each file. Tokenize the text in each file and add the document to bag.

while hasdata(fds)
    str = read(fds);
    document = tokenizedDocument(str);
    bag = addDocument(bag,document);
end

View the updated bag-of-words model.

bag
bag = 
  bagOfWords with properties:

          Counts: [4x276 double]
      Vocabulary: [1x276 string]
        NumWords: 276
    NumDocuments: 4

Input Arguments

collapse all

Input bag-of-words or bag-of-n-grams model, specified as a bagOfWords object or a bagOfNgrams object.

Input documents, specified as a tokenizedDocument array, a string array of words, or a cell array of character vectors. If documents is a string array or a cell array of character vectors, then it must be a row vector representing a single document, where each element is a word.

Output Arguments

collapse all

Output model, returned as a bagOfWords object or a bagOfNgrams object. The type of newBag is the same as the type of bag.

Introduced in R2017b