tokenDetails
Details of tokens in tokenized document array
Description
tdetails = tokenDetails(documents) returns a table with details about the tokens in documents.
Examples
View Token Details of Documents
Create a tokenized document array.
str = [ ...
    "This is an example document. It has two sentences."
    "This document has one sentence and an emoticon. :)"
    "Here is another example document. :D"];
documents = tokenizedDocument(str);
View the token details of the first few tokens.
tdetails = tokenDetails(documents);
head(tdetails)
ans=8×5 table
Token DocumentNumber LineNumber Type Language
__________ ______________ __________ ___________ ________
"This" 1 1 letters en
"is" 1 1 letters en
"an" 1 1 letters en
"example" 1 1 letters en
"document" 1 1 letters en
"." 1 1 punctuation en
"It" 1 1 letters en
"has" 1 1 letters en
The Type variable contains the type of each token. View the emoticons in the documents.
idx = tdetails.Type == "emoticon";
tdetails(idx,:)
ans=2×5 table
Token DocumentNumber LineNumber Type Language
_____ ______________ __________ ________ ________
":)" 2 1 emoticon en
":D" 3 1 emoticon en
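The same table can also be summarized by token type. The following is a minimal sketch, assuming a MATLAB release that includes the groupcounts function; the variable name typeCounts is introduced here for illustration only.

```matlab
% Recreate the tokenized documents from the example above.
str = [ ...
    "This is an example document. It has two sentences."
    "This document has one sentence and an emoticon. :)"
    "Here is another example document. :D"];
documents = tokenizedDocument(str);
tdetails = tokenDetails(documents);

% Count how many tokens fall into each token type.
% typeCounts is a hypothetical name used only in this sketch.
typeCounts = groupcounts(tdetails,"Type")
```

Each row of typeCounts corresponds to one token type that occurs in the documents, so the group counts sum to the total number of tokens.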
Add Sentence Details to Documents
Create a tokenized document array.
str = [ ...
    "This is an example document. It has two sentences."
    "This document has one sentence."
    "Here is another example document. It also has two sentences."];
documents = tokenizedDocument(str);
Add sentence details to the documents using addSentenceDetails. This function adds the sentence numbers to the table returned by tokenDetails. View the updated token details of the first few tokens.
documents = addSentenceDetails(documents);
tdetails = tokenDetails(documents);
head(tdetails)
ans=8×6 table
Token DocumentNumber SentenceNumber LineNumber Type Language
__________ ______________ ______________ __________ ___________ ________
"This" 1 1 1 letters en
"is" 1 1 1 letters en
"an" 1 1 1 letters en
"example" 1 1 1 letters en
"document" 1 1 1 letters en
"." 1 1 1 punctuation en
"It" 1 2 1 letters en
"has" 1 2 1 letters en
View the token details of the second sentence of the third document.
idx = tdetails.DocumentNumber == 3 & ...
tdetails.SentenceNumber == 2;
tdetails(idx,:)
ans=6×6 table
Token DocumentNumber SentenceNumber LineNumber Type Language
___________ ______________ ______________ __________ ___________ ________
"It" 3 2 1 letters en
"also" 3 2 1 letters en
"has" 3 2 1 letters en
"two" 3 2 1 letters en
"sentences" 3 2 1 letters en
"." 3 2 1 punctuation en
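Once sentence numbers are available, the token table can be grouped by document and sentence. The following is a minimal sketch, assuming a MATLAB release that includes the groupcounts function; the variable name sentenceLengths is introduced here for illustration only.

```matlab
% Recreate the documents from the example above and add sentence details.
str = [ ...
    "This is an example document. It has two sentences."
    "This document has one sentence."
    "Here is another example document. It also has two sentences."];
documents = tokenizedDocument(str);
documents = addSentenceDetails(documents);
tdetails = tokenDetails(documents);

% Count the tokens in each sentence of each document.
% sentenceLengths is a hypothetical name used only in this sketch.
sentenceLengths = groupcounts(tdetails,["DocumentNumber" "SentenceNumber"])
```

The row for the second sentence of the third document should report six tokens, matching the table shown above.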
Add Part-of-Speech Details to Documents
Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.
filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);
View the token details of the first few tokens.
tdetails = tokenDetails(documents);
head(tdetails)
ans=8×5 table
Token DocumentNumber LineNumber Type Language
___________ ______________ __________ _______ ________
"fairest" 1 1 letters en
"creatures" 1 1 letters en
"desire" 1 1 letters en
"increase" 1 1 letters en
"thereby" 1 1 letters en
"beautys" 1 1 letters en
"rose" 1 1 letters en
"might" 1 1 letters en
Add part-of-speech details to the documents using the addPartOfSpeechDetails function. This function first adds sentence information to the documents, and then adds the part-of-speech tags to the table returned by tokenDetails. View the updated token details of the first few tokens.
documents = addPartOfSpeechDetails(documents);
tdetails = tokenDetails(documents);
head(tdetails)
ans=8×7 table
Token DocumentNumber SentenceNumber LineNumber Type Language PartOfSpeech
___________ ______________ ______________ __________ _______ ________ ______________
"fairest" 1 1 1 letters en adjective
"creatures" 1 1 1 letters en noun
"desire" 1 1 1 letters en noun
"increase" 1 1 1 letters en noun
"thereby" 1 1 1 letters en adverb
"beautys" 1 1 1 letters en noun
"rose" 1 1 1 letters en noun
"might" 1 1 1 letters en auxiliary-verb
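The PartOfSpeech variable can be used to select tokens with logical indexing, in the same way as the Type variable. The following is a minimal sketch, assuming sonnetsPreprocessed.txt is on the MATLAB path; the variable name nouns is introduced here for illustration only.

```matlab
% Load and tokenize the sonnets, then add part-of-speech details.
filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);
documents = addPartOfSpeechDetails(documents);
tdetails = tokenDetails(documents);

% Keep only the tokens tagged as nouns.
% nouns is a hypothetical name used only in this sketch.
idx = tdetails.PartOfSpeech == "noun";
nouns = tdetails.Token(idx);

% View the first few nouns.
nouns(1:5)
```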
Input Arguments
documents — Input documents
tokenizedDocument array
Input documents, specified as a tokenizedDocument array.
Output Arguments
tdetails — Table of token details
table
Table of token details. tdetails has the following variables:
Name | Description |
---|---|
Token | Token text, returned as a string scalar. |
DocumentNumber | Index of the document that the token belongs to, returned as a positive integer. |
SentenceNumber | Sentence number of the token in the document, returned as a positive integer. If these details are missing, then first add sentence details to documents using the addSentenceDetails function. |
LineNumber | Line number of the token in the document, returned as a positive integer. |
Type | Type of the token, returned as a categorical value such as letters, punctuation, emoticon, emoji, or other. If these details are missing, then first add type details to documents using the addTypeDetails function. |
Language | Language of the token, returned as a categorical value such as en. These language details determine the behavior of language-dependent functions on the tokens. If these details are missing, then first add language details to documents using the addLanguageDetails function. For more information about language support in Text Analytics Toolbox™, see Language Considerations. |
PartOfSpeech | Part-of-speech tag, returned as a categorical value such as noun, adjective, adverb, or auxiliary-verb. If these details are missing, then first add part-of-speech details to documents using the addPartOfSpeechDetails function. |
Entity | Entity tag. If these details are missing, then first add entity details to documents using the addEntityDetails function. |
Lemma | Lemma form. If these details are missing, then first add lemma details to documents using the addLemmaDetails function. |
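Because some variables appear only after the corresponding details have been added, it can help to check for a variable before using it. The following is a minimal sketch of one way to do this for SentenceNumber; the check via Properties.VariableNames is an illustration, not a documented requirement of tokenDetails.

```matlab
% Tokenize a short document and get its token details.
documents = tokenizedDocument("This is an example document. It has two sentences.");
tdetails = tokenDetails(documents);

% If the SentenceNumber variable is missing, add sentence details
% to the documents and request the token details again.
if ~ismember("SentenceNumber",tdetails.Properties.VariableNames)
    documents = addSentenceDetails(documents);
    tdetails = tokenDetails(documents);
end

% SentenceNumber is now available for indexing.
idx = tdetails.SentenceNumber == 2;
tdetails(idx,:)
```

The same pattern applies to the other optional variables, using the function named in the corresponding row of the table.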
Version History
Introduced in R2018a
R2018b: tokenDetails returns token type emoji for emoji characters
Behavior changed in R2018b
Starting in R2018b, tokenizedDocument detects emoji characters and the tokenDetails function reports these tokens with type "emoji". This makes it easier to analyze text containing emoji characters.
In R2018a, tokenDetails reports emoji characters with type "other".
To find the indices of the tokens with type "emoji" or "other", use the indices idx = tdetails.Type == "emoji" | tdetails.Type == "other", where tdetails is a table of token details.