textrankKeywords

Extract keywords using TextRank

Syntax

tbl = textrankKeywords(documents)

tbl = textrankKeywords(documents,Name,Value)

Description

tbl = textrankKeywords(documents) extracts keywords and respective scores using TextRank. The function supports English, Japanese, German, and Korean text. For other languages, try using the rakeKeywords function instead.

tbl = textrankKeywords(documents,Name,Value) specifies additional options using one or more name-value pair arguments.

Examples

Extract Keywords Using TextRank

Create an array of tokenized documents containing the text data.

textData = [
    "MATLAB provides really useful tools for engineers. Scientists use many useful tools in MATLAB."
    "MATLAB and Simulink have many features. Use MATLAB and Simulink for engineering workflows."
    "Analyze text and images in MATLAB. Analyze text, images, and videos in MATLAB."];
documents = tokenizedDocument(textData);

Extract the keywords using the textrankKeywords function.

tbl = textrankKeywords(documents)

tbl=7×3 table
                 Keyword                 DocumentNumber    Score 
    _________________________________    ______________    ______

    "many"      "useful"      "tools"          1           5.2174
    "useful"    "tools"       ""               1           3.8778
    "many"      "features"    ""               2           4.0815
    "text"      ""            ""               3                1
    "images"    ""            ""               3                1
    "MATLAB"    ""            ""               3                1
    "videos"    ""            ""               3                1

If a keyword contains multiple words, then the ith element of the string array corresponds to the ith word of the keyword. If the keyword has fewer words than the longest keyword, then the remaining entries of the string array are the empty string "".

For readability, transform the multi-word keywords into a single string using the join and strip functions.

if size(tbl.Keyword,2) > 1
    tbl.Keyword = strip(join(tbl.Keyword));
end
tbl

tbl=7×3 table
          Keyword          DocumentNumber    Score 
    ___________________    ______________    ______

    "many useful tools"          1           5.2174
    "useful tools"               1           3.8778
    "many features"              2           4.0815
    "text"                       3                1
    "images"                     3                1
    "MATLAB"                     3                1
    "videos"                     3                1

Specify Maximum Number of Keywords Per Document

Create an array of tokenized documents containing the text data.

textData = [
    "MATLAB provides really useful tools for engineers. Scientists use many useful MATLAB toolboxes."
    "MATLAB and Simulink have many features. Use MATLAB and Simulink for engineering workflows."
    "Analyze text and images in MATLAB. Analyze text, images, and videos in MATLAB."];
documents = tokenizedDocument(textData);

Extract the top two keywords per document using the textrankKeywords function by setting the 'MaxNumKeywords' option to 2.

tbl = textrankKeywords(documents,'MaxNumKeywords',2)

tbl=5×3 table
                  Keyword                   DocumentNumber    Score 
    ____________________________________    ______________    ______

    "useful"    "MATLAB"      "toolboxes"          1           4.8695
    "useful"    ""            ""                   1           2.3612
    "many"      "features"    ""                   2           4.0815
    "text"      ""            ""                   3                1
    "images"    ""            ""                   3                1

For readability, transform the multi-word keywords into a single string using the join and strip functions.

if size(tbl.Keyword,2) > 1
    tbl.Keyword = strip(join(tbl.Keyword));
end
tbl

tbl=5×3 table
             Keyword             DocumentNumber    Score 
    _________________________    ______________    ______

    "useful MATLAB toolboxes"          1           4.8695
    "useful"                           1           2.3612
    "many features"                    2           4.0815
    "text"                             3                1
    "images"                           3                1

Input Arguments

`documents` — Input documents
`tokenizedDocument` array | string array | cell array of character vectors

Input documents, specified as a tokenizedDocument array, a string array of words, or a cell array of character vectors. If documents is not a tokenizedDocument array, then it must be a row vector representing a single document, where each element is a word. To specify multiple documents, use a tokenizedDocument array.
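As a minimal sketch of the single-document form described above, the following passes a string row vector in which each element is one word. The sample words are an assumption made up for illustration; they are not taken from the examples on this page.

```matlab
% Assumed sample document, already split into words (one element per word).
words = ["MATLAB" "provides" "really" "useful" "tools" "for" "engineers" "."];

% A string row vector is treated as a single document.
tbl = textrankKeywords(words);

% Every row of the result refers to that one document.
disp(tbl)
```

To process several documents in one call, convert the text with tokenizedDocument first, as in the examples above.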

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: textrankKeywords(documents,'MaxNumKeywords',20) returns at most 20 keywords per document.

`MaxNumKeywords` — Maximum number of keywords to return per document
`Inf` (default) | positive integer

Maximum number of keywords to return per document, specified as a positive integer or Inf.

If MaxNumKeywords is Inf, then the function returns all identified keywords.

`Window` — Size of co-occurrence window
2 (default) | positive integer | `Inf`

Size of co-occurrence window, specified as the comma-separated pair consisting of 'Window' and a positive integer or Inf.

When the window size is 2, the function considers a co-occurrence between two candidate keywords only when they appear consecutively in a document. When the window size is Inf, the function considers a co-occurrence between two candidate keywords whenever they both appear in the same document.

Increasing the window size enables the function to find more co-occurrences between keywords, which increases the keyword importance scores. This can result in finding more relevant keywords at the cost of potentially over-scoring less relevant keywords.

For more information, see TextRank Keyword Extraction.
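One way to see the effect of the window size is to run the function twice on the same document and compare the tables. The sketch below reuses one sentence pair from the examples above; the variable names tblDefault and tblWide are chosen here for illustration.

```matlab
textData = "MATLAB provides really useful tools for engineers. Scientists use many useful tools in MATLAB.";
documents = tokenizedDocument(textData);

% Default window of 2: only consecutive candidates count as co-occurring.
tblDefault = textrankKeywords(documents);

% Window of Inf: any two candidates in the same document co-occur.
tblWide = textrankKeywords(documents,'Window',Inf);

disp(tblDefault)
disp(tblWide)
```

The scores in the two tables are not directly comparable in absolute terms, so compare the ranking of keywords rather than the raw values.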

`PartOfSpeech` — Part-of-speech tags
`["noun" "proper-noun" "adjective"]` (default) | string array | cell array of character vectors | character vector | categorical array

Part-of-speech tags to use to extract candidate keywords, specified as the comma-separated pair consisting of 'PartOfSpeech' and a string array, character vector, cell array of character vectors, or categorical array containing one or more of the following class names:

adjective — Adjective
adposition — Adposition
adverb — Adverb
auxiliary-verb — Auxiliary verb
coord-conjunction — Coordinating conjunction
determiner — Determiner
interjection — Interjection
noun — Noun
numeral — Numeral
particle — Particle
pronoun — Pronoun
proper-noun — Proper noun
punctuation — Punctuation
subord-conjunction — Subordinating conjunction
symbol — Symbol
verb — Verb
other — Other

If PartOfSpeech is a character vector, then it must correspond to a single part-of-speech tag.

For more information, see TextRank Keyword Extraction.

Data Types: char | string | cell | categorical
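For instance, to drop adjectives from the candidate set, pass only the noun tags. This is a small sketch that reuses a sentence from the examples above; which keywords come back depends on the tags that the part-of-speech tagger assigns.

```matlab
textData = "MATLAB and Simulink have many features. Use MATLAB and Simulink for engineering workflows.";
documents = tokenizedDocument(textData);

% Restrict candidate keywords to nouns and proper nouns.
tbl = textrankKeywords(documents,'PartOfSpeech',["noun" "proper-noun"]);

disp(tbl)
```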

Output Arguments

`tbl` — Extracted keywords and scores
table

Extracted keywords and scores, returned as a table with the following variables:

Keyword – Extracted keyword, returned as a 1-by-maxNgramLength string array, where maxNgramLength is the number of words in the longest keyword.
DocumentNumber – Document number containing the corresponding keyword.
Score – Score of keyword.

The function merges multiple keywords into a single keyword when they appear consecutively in the corresponding document.

If a keyword contains multiple words, then the ith element of the corresponding string array corresponds to the ith word of the keyword. If the keyword has fewer words than the longest keyword, then the remaining entries of the string array are the empty string "".

For more information, see TextRank Keyword Extraction.

More About

Language Considerations

The textrankKeywords function supports English, Japanese, German, and Korean text only.

The textrankKeywords function extracts keywords by identifying candidate keywords based on their part-of-speech tags. The function uses the part-of-speech tags given by the addPartOfSpeechDetails function, which supports English, Japanese, German, and Korean text only.

For other languages, try using the rakeKeywords function instead and specify an appropriate set of delimiters using the 'Delimiters' and 'MergingDelimiters' options.

Tips

You can experiment with different keyword extraction algorithms to see what works best with your data. Because the TextRank algorithm uses a part-of-speech tag-based approach to extract candidate keywords, the extracted keywords can be short. Alternatively, you can try extracting keywords using the RAKE algorithm, which extracts sequences of tokens appearing between delimiters as candidate keywords. To extract keywords using RAKE, use the rakeKeywords function. To learn more, see Extract Keywords from Text Data Using RAKE.
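A quick way to compare the two approaches is to run both functions on the same documents. The sketch below uses the default options of rakeKeywords and assumes its output table has the same Keyword variable layout as the textrankKeywords output.

```matlab
textData = [
    "MATLAB provides really useful tools for engineers. Scientists use many useful tools in MATLAB."
    "Analyze text and images in MATLAB. Analyze text, images, and videos in MATLAB."];
documents = tokenizedDocument(textData);

% Part-of-speech based candidates (TextRank).
tblTextRank = textrankKeywords(documents);

% Delimiter based candidates (RAKE), default delimiters.
tblRake = rakeKeywords(documents);

disp(tblTextRank)
disp(tblRake)
```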

Algorithms

TextRank Keyword Extraction

For each document, the textrankKeywords function extracts keywords independently using the following steps based on [1]:

Determine candidate keywords:
- Extract tokens with part-of-speech specified by the 'PartOfSpeech' option.
Calculate scores for each candidate:
- Create an undirected, unweighted graph with nodes corresponding to the candidate keywords.
- Add edges between nodes where candidate keywords appear within a window of tokens, where the window size is given by the 'Window' option.
- Compute the centrality of each node using the PageRank algorithm and weight the scores according to the number of candidate keywords. For more information, see centrality.
Extract top keywords from candidates:
- Select the top third of the candidate keywords according to their scores.
- If any of the candidate keywords appear consecutively in a document, then merge them into a single keyword and sum the corresponding scores.
- Return the top k keywords, where k is given by the 'MaxNumKeywords' option.
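The scoring steps above can be sketched with the core MATLAB graph functions. This is an illustration under stated assumptions, not the implementation inside textrankKeywords: the tokens and part-of-speech tags are hard-coded hypothetical values, the window is assumed to slide over all tokens, scaling by the number of candidates is one possible reading of "weight the scores", and the merging of consecutive keywords is left out.

```matlab
% Hypothetical pre-tagged tokens for one document (tags assumed, not computed).
tokens = ["scientists" "use" "many" "useful" "tools" "in" "MATLAB"];
tags   = ["noun" "verb" "adjective" "adjective" "noun" "adposition" "proper-noun"];
window = 2;

% Step 1: candidates are the tokens whose tag is in the allowed set.
isCandidate = ismember(tags,["noun" "proper-noun" "adjective"]);
candidates  = unique(tokens(isCandidate),'stable');

% Step 2: build an undirected, unweighted co-occurrence graph.
A = zeros(numel(candidates));
for i = 1:numel(tokens)
    for j = i+1:min(i+window-1,numel(tokens))
        if isCandidate(i) && isCandidate(j) && tokens(i) ~= tokens(j)
            a = find(candidates == tokens(i));
            b = find(candidates == tokens(j));
            A(a,b) = 1;
            A(b,a) = 1;
        end
    end
end
G = graph(A,cellstr(candidates));

% PageRank centrality, scaled by the number of candidates (assumed scaling).
scores = centrality(G,'pagerank') * numel(candidates);

% Step 3: keep the top third of the candidates by score.
[~,order] = sort(scores,'descend');
numKeep = max(1,ceil(numel(candidates)/3));
topCandidates = candidates(order(1:numKeep));
disp(topCandidates)
```

With a window of 2, only "many"/"useful" and "useful"/"tools" are linked, so "useful" sits at the center of the graph. Setting window to Inf links every pair of distinct candidates.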

Language Details

tokenizedDocument objects contain details about the tokens including language details. The language details of the input documents determine the behavior of textrankKeywords. The tokenizedDocument function, by default, automatically detects the language of the input text. To specify the language details manually, use the Language option of tokenizedDocument. To view the token details, use the tokenDetails function.
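As a brief sketch of setting the language by hand, the following passes the 'Language' option to tokenizedDocument and inspects the result with tokenDetails before extracting keywords. The sample sentence is reused from the examples above.

```matlab
textData = "MATLAB provides really useful tools for engineers.";

% Set the language explicitly instead of relying on automatic detection.
documents = tokenizedDocument(textData,'Language','en');

% The Language variable of the details table shows the value in use.
details = tokenDetails(documents);
disp(details)

tbl = textrankKeywords(documents);
```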

References

[1] Mihalcea, Rada, and Paul Tarau. "TextRank: Bringing Order into Text." In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 404-411. 2004.

Version History

Introduced in R2020b
