Extract Text Data
Import text data into MATLAB® from single files or large collections of files, including PDF, HTML, and Microsoft® Word® and Excel® files.
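As a minimal sketch of single-file import, the snippet below reads a text file into a string with extractFileText. It assumes the example file sonnets.txt that ships with Text Analytics Toolbox is on the MATLAB path; the file name is only an illustration, and the same function also reads PDF, HTML, and Microsoft Word files.

```matlab
% Read the text of one file into a string scalar.
% Assumption: sonnets.txt (a toolbox example file) is on the MATLAB path.
str = extractFileText("sonnets.txt");

% Split the text into lines and drop the empty ones for a quick look.
textLines = split(str, newline);
textLines(strlength(strtrim(textLines)) == 0) = [];
disp(textLines(1:5))
```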
Visually explore text datasets using word clouds and text scatter plots.
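A word cloud can be sketched in a few lines. The three sentences below are made-up illustration data, not part of any shipped dataset.

```matlab
% A few short documents used only for illustration.
str = [
    "an example of a short sentence"
    "a second short sentence"
    "another example sentence about text"];
documents = tokenizedDocument(str);

% Draw a word cloud in which word size reflects word frequency.
figure
wc = wordcloud(documents);
title("Word Frequencies")
```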
Text Analytics Toolbox provides language-specific preprocessing capabilities for English, Japanese, German, and Korean. Most functions also work with text in other languages.
Clean Text Data
Apply high-level filtering functions to remove extraneous content such as URLs, HTML tags, and punctuation, and to correct spelling.
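One possible cleaning pass over a raw string is sketched below. The input string is a made-up example, and the order of the steps is an assumption rather than a required sequence.

```matlab
% Hypothetical raw input containing an HTML tag, a URL, and punctuation.
str = "<p>Visit https://www.mathworks.com for more info!</p>";

str = eraseTags(str);         % strip HTML and XML tags
str = eraseURLs(str);         % strip HTTP and HTTPS URLs
str = erasePunctuation(str);  % strip punctuation and symbols

disp(str)
```

Spelling correction works on tokenized documents rather than raw strings, through the correctSpelling function.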
Filter Stop Words and Normalize Words to Root Form
Prioritize meaningful text data in your analysis by filtering out common words, words that appear too frequently or infrequently, and very long or very short words. Reduce the vocabulary and focus on the broader sense or sentiment of a document by stemming words to their root form or lemmatizing them to their dictionary form.
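The filtering and normalization steps described above might be chained as follows. The sentence and the length thresholds are illustrative assumptions. Part-of-speech details are added first because lemmatization can use them.

```matlab
% Illustrative sentence; the thresholds below are arbitrary choices.
documents = tokenizedDocument( ...
    "The cats were running quickly through an extraordinarily long corridor");

% Add part-of-speech tags before removing words, so tagging sees full sentences.
documents = addPartOfSpeechDetails(documents);

documents = removeStopWords(documents);        % drop common words such as "the"
documents = removeShortWords(documents, 2);    % drop words with 2 or fewer characters
documents = removeLongWords(documents, 15);    % drop words with 15 or more characters

% Reduce the remaining words to their dictionary (lemma) form.
documents = normalizeWords(documents, 'Style', 'lemma');

disp(documents.Vocabulary)
```

Calling normalizeWords without the 'Style' option stems English words instead of lemmatizing them.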
Extract Linguistic Features
Automatically split raw text into a collection of words using a tokenization algorithm. Add sentence boundaries, part-of-speech details, and other relevant information for context.
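A short sketch of tokenization with added linguistic details follows. The input text is a made-up example.

```matlab
% Tokenize a short, made-up passage.
documents = tokenizedDocument("Text analytics is fun. MATLAB makes it easy.");

% Attach sentence numbers and part-of-speech tags to each token.
documents = addSentenceDetails(documents);
documents = addPartOfSpeechDetails(documents);

% Inspect the per-token details as a table.
tdetails = tokenDetails(documents);
disp(head(tdetails))
```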
Word Embedding and Encoding
Train word-embedding models such as word2vec continuous bag-of-words (CBOW) and skip-gram models. Import pretrained models including fastText and GloVe.
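As a rough sketch, a small CBOW embedding can be trained on the sonnets.txt example file, assuming that file is on the MATLAB path. The dimension and epoch count are arbitrary illustration values, and a corpus this small will not produce a high-quality embedding.

```matlab
% Assumption: sonnets.txt (a toolbox example file) is on the MATLAB path.
str = extractFileText("sonnets.txt");
textData = split(str, newline);
documents = lower(tokenizedDocument(textData));

% Train a continuous bag-of-words model; settings are illustrative only.
emb = trainWordEmbedding(documents, ...
    'Model', 'cbow', ...
    'Dimension', 50, ...
    'NumEpochs', 10, ...
    'Verbose', 0);

% Map a word to its vector, then look up the nearest words to that vector.
vec = word2vec(emb, "love");
nearest = vec2word(emb, vec, 5);
disp(nearest)
```

Setting 'Model' to 'skipgram' trains a skip-gram model instead.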
Discover and visualize underlying patterns, trends, and complex relationships in large sets of text data using machine learning algorithms such as latent Dirichlet allocation (LDA) and latent semantic analysis (LSA).
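A minimal LDA sketch follows. The documents are made-up illustration data and the number of topics is an arbitrary choice; real topic models need far more text to yield meaningful topics.

```matlab
% Made-up documents about two loose themes (weather and cooking).
str = [
    "heavy rain and strong wind expected tonight"
    "sunny weather with light wind tomorrow"
    "storm brings rain and cold weather"
    "bake the bread with flour and butter"
    "stir the soup and add butter and salt"
    "fresh bread and hot soup for dinner"];
documents = removeStopWords(tokenizedDocument(str));

% Build a bag-of-words model and fit an LDA model to it.
bag = bagOfWords(documents);
rng("default")                       % make the fit reproducible
numTopics = 2;                       % illustrative choice
mdl = fitlda(bag, numTopics, 'Verbose', 0);

% List the highest-probability words of the first topic.
topWords = topkwords(mdl, 5, 1);
disp(topWords)
```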
Document Summarization and Keyword Extraction
Automatically extract summaries and relevant keywords from one or more documents, and evaluate the similarity and importance of documents.
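The sketch below shows one way to combine extractive summarization, keyword extraction, and document similarity. The four sentences are made-up illustration data, each treated as one document, and the summary size and keyword limit are arbitrary.

```matlab
% Made-up documents, one sentence each.
str = [
    "MATLAB is a programming platform for engineers and scientists."
    "Text Analytics Toolbox provides tools for working with text data."
    "Word clouds give a quick visual overview of word frequencies."
    "Topic models discover hidden themes in collections of documents."];
documents = tokenizedDocument(str);

% Select the two documents judged most important as the summary.
summary = extractSummary(documents, 'SummarySize', 2);

% Extract up to three keywords per document with the RAKE algorithm.
keywords = rakeKeywords(documents, 'MaxNumKeywords', 3);

% Pairwise cosine similarities between the documents.
similarities = cosineSimilarity(documents);

disp(summary)
disp(keywords)
```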
Identify the attitudes and opinions expressed in text data to categorize statements as being positive, neutral, or negative. Build models that can predict sentiment in real time.
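A rule-based starting point is sketched below using VADER sentiment scores. The two reviews are made-up examples. Scores fall between -1 (negative) and 1 (positive).

```matlab
% Made-up reviews. VADER uses punctuation and capitalization as cues,
% so the text is tokenized without further cleaning.
str = [
    "I love this product, it works great!"
    "This is the worst purchase I have ever made."];
documents = tokenizedDocument(str);

% One compound sentiment score per document.
scores = vaderSentimentScores(documents);
disp(table(str, scores))
```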
Leverage transformer models such as BERT, FinBERT, and GPT-2 to perform transfer learning with text data for tasks such as sentiment analysis, classification, and summarization.
Classify text descriptions into categories using deep learning models built on word embeddings.
Use deep learning to generate new text based on observed text.