Analyze Text Data Using Multiword Phrases

This example shows how to analyze text using n-gram frequency counts.

An n-gram is a tuple of $n$ consecutive words. For example, a bigram (the case when $n = 2$) is a pair of consecutive words such as "heavy rainfall". A unigram (the case when $n = 1$) is a single word. A bag-of-n-grams model records the number of times that each n-gram appears in each document of a collection.
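
As a rough illustration of this definition, the following sketch lists the bigrams of a single sentence using only base MATLAB functions. The sentence and the variable names are made up for illustration. In practice, bagOfNgrams does this counting for you across a whole collection of documents.

```matlab
% Illustrative sentence (an assumption, not taken from the example data).
words = split("heavy rainfall caused local flooding")';   % 1-by-5 string array

n = 2;                                % n-gram length (2 = bigrams)
numNgrams = numel(words) - n + 1;     % number of n-grams in the sentence
ngrams = strings(numNgrams,n);        % one row per n-gram

% Slide a window of n consecutive words over the sentence.
for k = 1:numNgrams
    ngrams(k,:) = words(k:k+n-1);
end

ngrams
```

A sentence of five words contains four bigrams, the first of which is "heavy rainfall". Setting n to 1 in this sketch returns the individual words (unigrams).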

Compared with a bag-of-words model, a bag-of-n-grams model retains more information about word order in the original text data. For example, a bag-of-n-grams model is better suited to capturing short phrases that appear in the text, such as "heavy rainfall" and "thunderstorm winds".

To create a bag-of-n-grams model, use bagOfNgrams. You can input bagOfNgrams objects into other Text Analytics Toolbox functions such as wordcloud and fitlda.

Load and Extract Text Data

Load the example data. The file factoryReports.csv contains factory reports, including a text description and categorical labels for each event.

filename = "factoryReports.csv";
data = readtable(filename,TextType="string");
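
The loading code above keeps every row of the table. If a data set contains rows with empty reports, one way to drop them is sketched below. The sketch uses a small hypothetical table named demo in place of the real data, and it assumes the text is stored in a string variable named Description.

```matlab
% Hypothetical stand-in for the report table (the second report is empty).
demo = table([ ...
    "Mixer tripped the fuses."
    ""
    "Fried capacitors in the assembler."], ...
    'VariableNames',{'Description'});

% Find the rows whose description has zero characters, then delete them.
idx = strlength(demo.Description) == 0;
demo(idx,:) = [];

height(demo)
```

Applying the same two lines to data instead of demo would remove empty reports from the loaded table.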

Extract the text data from the table and view the first few reports.

textData = data.Description;
textData(1:5)

ans = 5×1 string array
    "Items are occasionally getting stuck in the scanner spools."
    "Loud rattling and banging sounds are coming from assembler pistons."
    "There are cuts to the power when starting the plant."
    "Fried capacitors in the assembler."
    "Mixer tripped the fuses."

Prepare Text Data for Analysis

Create a function that tokenizes and preprocesses the text data so it can be used for analysis. The function preprocessText, listed at the end of the example, performs the following steps:

1. Convert the text data to lowercase using lower.
2. Tokenize the text using tokenizedDocument.
3. Erase punctuation using erasePunctuation.
4. Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.
5. Remove words with 2 or fewer characters using removeShortWords.
6. Remove words with 15 or more characters using removeLongWords.
7. Lemmatize the words using normalizeWords.

Use the example preprocessing function preprocessText to prepare the text data.

documents = preprocessText(textData);
documents(1:5)

ans = 
  5×1 tokenizedDocument:

    6 tokens: item occasionally get stuck scanner spool
    7 tokens: loud rattling bang sound come assembler piston
    4 tokens: cut power start plant
    3 tokens: fry capacitor assembler
    3 tokens: mixer trip fuse

Create Word Cloud of Bigrams

Create a word cloud of bigrams by first creating a bag-of-n-grams model using bagOfNgrams, and then inputting the model to wordcloud.

To count the n-grams of length 2 (bigrams), use bagOfNgrams with the default options.

bag = bagOfNgrams(documents)

bag = 
  bagOfNgrams with properties:

          Counts: [480×921 double]
      Vocabulary: ["item"    "occasionally"    "get"    "stuck"    "scanner"    "loud"    "rattling"    "bang"    "sound"    "come"    "assembler"    "cut"    "power"    "start"    "fry"    "capacitor"    "mixer"    "trip"    "burst"    "pipe"    …    ]
          Ngrams: [921×2 string]
    NgramLengths: 2
       NumNgrams: 921
    NumDocuments: 480
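
In this output, Counts has one row per document and one column per bigram, and each row of Ngrams holds the words of one bigram. The following sketch shows how these two properties relate, using two made-up documents rather than the factory reports. The names toyDocs, toyBag, and summary are illustrative.

```matlab
% Toy documents for illustration only (not from factoryReports.csv).
toyDocs = tokenizedDocument([
    "mixer trip fuse"
    "mixer trip breaker"]);
toyBag = bagOfNgrams(toyDocs);

% Sum each column of Counts to get the total frequency of each bigram
% across all documents.
totalCounts = full(sum(toyBag.Counts,1));

% Pair each bigram (joined into one string) with its total count.
summary = table(join(toyBag.Ngrams," ",2),totalCounts', ...
    'VariableNames',{'Bigram','Total'})
```

Here the bigram "mixer trip" occurs in both toy documents, so its total is 2, while the other two bigrams each occur once.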

Visualize the bag-of-n-grams model using a word cloud.

figure
wordcloud(bag);
title("Text Data: Preprocessed Bigrams")

Fit Topic Model to Bag-of-N-Grams

A latent Dirichlet allocation (LDA) model is a topic model that discovers underlying topics in a collection of documents and infers the word probabilities in topics.

Create an LDA topic model with 10 topics using fitlda. The function fits an LDA model by treating the n-grams as single words.

mdl = fitlda(bag,10,Verbose=0);

Visualize the first four topics as word clouds.

figure
tiledlayout("flow");
for i = 1:4
    nexttile
    wordcloud(mdl,i);
    title("LDA Topic " + i)
end

The word clouds highlight commonly co-occurring bigrams in the LDA topics. The wordcloud function sizes each bigram according to its probability in the specified LDA topic.

Analyze Text Using Longer Phrases

To analyze text using longer phrases, set the NGramLengths option of bagOfNgrams to a larger value.

When working with longer phrases, it can be useful to keep stop words in the model. For example, to detect the phrase "is not happy", keep the stop words "is" and "not" in the model.
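
To see why stop words matter here, consider the following sketch. It uses a made-up sentence, and the names doc and bagWithStops are illustrative. It lists the trigrams found when stop words are kept, then shows which tokens are left after removeStopWords is applied.

```matlab
% Hypothetical sentence for illustration (not from the example data).
doc = tokenizedDocument("the customer is not happy");

% Trigrams when the stop words are kept. A five-token document
% contains three trigrams, including "is not happy".
bagWithStops = bagOfNgrams(doc,NGramLengths=3);
bagWithStops.Ngrams

% Tokens that remain after removing stop words. With the stop words
% gone, the phrase "is not happy" can no longer be detected.
string(removeStopWords(doc))
```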

Preprocess the text. Erase the punctuation using erasePunctuation, and tokenize using tokenizedDocument.

cleanTextData = erasePunctuation(textData);
documents = tokenizedDocument(cleanTextData);

To count the n-grams of length 3 (trigrams), use bagOfNgrams and specify NGramLengths to be 3.

bag = bagOfNgrams(documents,NGramLengths=3);
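
The NGramLengths option also accepts a vector of lengths, which lets a single model count phrases of several sizes. A minimal sketch, using a made-up sentence and the illustrative names toyDoc and toyBag so that the variables of the example are left unchanged:

```matlab
% Hypothetical sentence for illustration (not from the example data).
toyDoc = tokenizedDocument("the mixer is not working");

% Count bigrams and trigrams in a single model.
toyBag = bagOfNgrams(toyDoc,NGramLengths=[2 3]);

toyBag.NgramLengths   % n-gram lengths stored in the model
toyBag.NumNgrams      % four bigrams plus three trigrams
```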

Visualize the bag-of-n-grams model using a word cloud. The word cloud of trigrams better shows the context of the individual words.

figure
wordcloud(bag);
title("Text Data: Trigrams")

View the top 10 trigrams and their frequency counts using topkngrams.

tbl = topkngrams(bag,10)

tbl=10×3 table
                  Ngram                   Count    NgramLength
    __________________________________    _____    ___________

    "in"       "the"         "mixer"       14           3
    "in"       "the"         "scanner"     13           3
    "blown"    "in"          "the"          9           3
    "the"      "robot"       "arm"          7           3
    "stuck"    "in"          "the"          6           3
    "is"       "spraying"    "coolant"      6           3
    "from"     "time"        "to"           6           3
    "time"     "to"          "time"         6           3
    "heard"    "in"          "the"          6           3
    "on"       "the"         "floor"        6           3

Example Preprocessing Function

The function preprocessText performs the following steps in order:

1. Convert the text data to lowercase using lower.
2. Tokenize the text using tokenizedDocument.
3. Erase punctuation using erasePunctuation.
4. Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.
5. Remove words with 2 or fewer characters using removeShortWords.
6. Remove words with 15 or more characters using removeLongWords.
7. Lemmatize the words using normalizeWords.

function documents = preprocessText(textData)

% Convert the text data to lowercase.
cleanTextData = lower(textData);

% Tokenize the text.
documents = tokenizedDocument(cleanTextData);

% Erase punctuation.
documents = erasePunctuation(documents);

% Remove a list of stop words.
documents = removeStopWords(documents);

% Remove words with 2 or fewer characters, and words with 15 or more
% characters.
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);

% Lemmatize the words.
documents = addPartOfSpeechDetails(documents);
documents = normalizeWords(documents,Style="lemma");

end