Text Analytics Toolbox

Analyze and model text data


Text Analytics Toolbox™ provides algorithms and visualizations for preprocessing, analyzing, and modeling text data. Models created with the toolbox can be used in applications such as sentiment analysis, predictive maintenance, and topic modeling.

Text Analytics Toolbox includes tools for processing raw text from sources such as equipment logs, news feeds, surveys, operator reports, and social media. You can extract text from popular file formats, preprocess raw text, extract individual words, convert text into numerical representations, and build statistical models.

Using machine learning techniques such as latent semantic analysis (LSA), latent Dirichlet allocation (LDA), and word embeddings, you can find clusters and create features from high-dimensional text datasets. Features created with Text Analytics Toolbox can be combined with features from other data sources to build machine learning models that take advantage of textual, numeric, and other types of data.


Import and Visualize Text Data

Extract text data from sources such as social media, news feeds, equipment logs, reports, and surveys.

Extract Text Data

Import text data into MATLAB® from single files or large collections of files, including PDF, HTML, and Microsoft® Word® and Excel® files.

Extracting text from a collection of Microsoft Word documents.

Visualize Text

Visually explore text datasets using word clouds and text scatter plots.

Text scatter plot showing the relative frequency of words using font size and color.

Language Support

Text Analytics Toolbox provides language-specific preprocessing capabilities for English, Japanese, and German. Most functions also work with text in other languages.

Import, prepare, and analyze Japanese text.

Preprocess Text Data

Extract meaningful words from raw text.

Clean Text Data

Apply high-level filtering functions to remove extraneous content such as URLs, HTML tags, and punctuation.

Simplify raw text (left) to work with the most meaningful words (right).
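
As a rough illustration of what this kind of filtering does, the following Python sketch strips HTML tags, URLs, and punctuation using only the standard library. It is a conceptual example, not the toolbox's own functions; the name clean_text and the sample string are made up for illustration.

```python
import re
import string

def clean_text(raw: str) -> str:
    """Strip HTML tags, URLs, and ASCII punctuation, then collapse whitespace.

    Hypothetical helper for illustration only; not a toolbox function.
    """
    text = re.sub(r"<[^>]+>", " ", raw)        # replace HTML tags with a space
    text = re.sub(r"https?://\S+", " ", text)  # replace URLs with a space
    # Delete every ASCII punctuation character.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())              # collapse runs of whitespace

cleaned = clean_text("<p>Pump failed! See https://example.com/log?id=7 for details.</p>")
print(cleaned)
```

Tags are removed before URLs so that a closing tag directly after a link is not swallowed into the URL match.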

Filter Stop Words and Normalize Words to Root Form

Prioritize meaningful text data in your analysis by filtering out common words, words that appear too frequently or infrequently, and very long or very short words. Reduce the vocabulary and focus on the broader sense or sentiment of a document by stemming words to their root form or lemmatizing them to their dictionary form.

Removing stop words like “a” and “of” from documents.
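
The filtering described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the stop-word list is a tiny made-up sample, the thresholds are arbitrary, and filter_tokens is a hypothetical name rather than a toolbox function. Stemming and lemmatization are not shown.

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "in", "is"}

def filter_tokens(docs, min_len=2, max_len=15, min_count=1):
    """Drop stop words, very short or very long words, and rare words.

    docs is a list of token lists; counts are taken over the whole collection.
    """
    counts = Counter(tok for doc in docs for tok in doc)
    return [
        [
            t for t in doc
            if t not in STOP_WORDS
            and min_len <= len(t) <= max_len
            and counts[t] >= min_count
        ]
        for doc in docs
    ]

docs = [
    ["the", "pump", "is", "leaking"],
    ["a", "leak", "of", "coolant", "in", "the", "pump"],
]
filtered = filter_tokens(docs)
print(filtered)
```

Raising min_count to 2 keeps only words that occur at least twice across the collection, which here leaves just "pump" in each document.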

Identify Tokens, Sentences, and Parts of Speech

Automatically split raw text into a collection of words using a tokenization algorithm. Add sentence boundaries, part-of-speech details, and other relevant information for context.
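
To make the idea concrete, here is a deliberately naive Python sketch of sentence splitting and tokenization based on regular expressions. It is an assumption-laden illustration, not the toolbox's tokenization algorithm: it will mis-split abbreviations such as "Dr." and it does not attempt part-of-speech tagging. The function names are hypothetical.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Return words (keeping internal hyphens and apostrophes) and punctuation marks."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)

text = "The valve didn't close. Operators re-started the pump!"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
print(tokens)
```

Keeping sentence boundaries alongside the tokens preserves context that later steps, such as part-of-speech tagging, typically rely on.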


Convert Text to Numeric Formats

Convert text data to numeric form for use in machine learning and deep learning.

Word and N-Gram Counting

Calculate word frequency statistics to represent text data numerically.

Identify and visualize the most frequently occurring words in a model.
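
As a small sketch of what word and n-gram counting produces, the Python example below builds word counts, bigram counts, and a document-by-word count matrix from two toy documents. The documents and the helper name ngrams are made up for illustration; this is not the toolbox's implementation.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token tuples in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy tokenized documents (illustrative data only).
docs = [
    ["pump", "leak", "detected"],
    ["pump", "leak", "repaired"],
]

word_counts = Counter(t for d in docs for t in d)
bigram_counts = Counter(g for d in docs for g in ngrams(d, 2))

# Bag-of-words matrix: one row per document, one column per vocabulary word.
vocab = sorted(word_counts)
matrix = [[d.count(w) for w in vocab] for d in docs]

print(vocab)
print(matrix)
print(bigram_counts.most_common(1))
```

Each row of the matrix is a numeric representation of one document that can be passed to a machine learning model.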

Word Embedding and Encoding

Train word-embedding models such as word2vec continuous bag-of-words (CBOW) and skip-gram models. Import pretrained models including fastText and GloVe.

Visualize clusters in a text scatter plot using word embedding. 
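
A word embedding maps each word to a numeric vector so that related words end up close together. The Python sketch below shows how closeness is commonly measured, using cosine similarity. The three-dimensional vectors are invented for illustration; trained embeddings are learned from data and typically have hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors for illustration; not taken from any trained model.
embedding = {
    "pump": [0.9, 0.1, 0.0],
    "valve": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}

def nearest(word):
    """Return the other word whose vector is most similar to the given word's."""
    others = (w for w in embedding if w != word)
    return max(others, key=lambda w: cosine(embedding[word], embedding[w]))

print(nearest("pump"))
```

Words with high pairwise similarity are the ones that appear as clusters when an embedding is projected into a scatter plot.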

Machine Learning with Text Data

Perform topic modeling, classification, and dimensionality reduction with machine learning algorithms such as latent Dirichlet allocation (LDA) and latent semantic analysis (LSA).

Topic Modeling

Discover and visualize underlying patterns, trends, and complex relationships in large sets of text data.

Identifying topics in storm report data.

Deep Learning with Text Data

Perform sentiment analysis and classification with deep learning networks such as long short-term memory networks (LSTMs).

Sentiment Analysis

Identify the attitudes and opinions expressed in text data to categorize statements as being positive, neutral, or negative. Build models that can predict sentiment in real time.

Identifying words that predict positive and negative sentiment. 
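
One simple way to assign positive, neutral, or negative labels is lexicon-based scoring: sum a valence for each known word, squash the total into the range -1 to 1, and apply thresholds. The Python sketch below is loosely modeled on that idea. The lexicon values are invented, the normalization constant and the 0.05 cutoffs are assumptions, and real scorers also handle negation, intensifiers, and punctuation, none of which is shown here.

```python
import math

# Made-up word valences for illustration; not a published lexicon.
LEXICON = {
    "great": 3.1, "good": 1.9, "reliable": 2.0,
    "bad": -2.5, "broken": -2.2, "terrible": -3.0,
}

def compound_score(tokens, alpha=15.0):
    """Sum word valences and normalize the total into (-1, 1)."""
    total = sum(LEXICON.get(t.lower(), 0.0) for t in tokens)
    return total / math.sqrt(total * total + alpha)

def label(score, threshold=0.05):
    """Map a normalized score to a coarse sentiment label."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

for doc in (["Great", "reliable", "pump"], ["the", "pump", "runs"], ["terrible", "broken", "seal"]):
    print(doc, label(compound_score(doc)))
```

Because scoring is a dictionary lookup per word, an approach like this is cheap enough to run on streaming text.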

Text Classification

Classify text descriptions into categories using deep learning networks that take word embeddings as input.

Training a deep neural network to classify text data.

Text generation using Jane Austen’s Pride and Prejudice and a deep learning LSTM network. 

Latest Features

Sentiment Analysis

Evaluate sentiment in text data using sentiment scoring algorithms, including VADER.

Korean Language Support

Perform text analytics on Korean text, including tokenization, lemmatization, part-of-speech tagging, and named entity recognition.

Japanese and Korean Tokenization

Customize tokenization options, including MeCab and user dictionaries.

Deep Learning

Initialize word embedding layers with pretrained word embeddings.

See release notes for details on any of these features and corresponding functions.

Sentiment Analysis with Deep Learning

Analyze the sentiment of live Twitter data to understand how a given term is perceived.

Have Questions?

Contact Sohini Sarkar, Text Analytics Toolbox Technical Expert
