CNN for text classification
Hi, I am working with textual data, and my main objective is to classify text strings into different classes. I have been following the MathWorks example "Classify Text Data Using Convolutional Neural Network".
I have a question about the role of the word embedding layer when the input is already represented as numbers by the encoding step. Since the input is already numerical, I am unsure how the word embedding layer contributes to the classification task. Could you help me understand this?
Thank you in advance for your reply!
0 Comments
Accepted Answer
Pratyush on 7 Jul 2023
Edited: Pratyush on 7 Jul 2023
Hi Christian. The reason for applying a word embedding layer is to give the network a representation of the text that carries semantic meaning. You are right that the words are already represented as numbers after encoding, but those numbers are simply token indices and carry no semantic information.
For example, we can represent the words "apple" and "fruit" with numbers, but the encoded numbers alone cannot express how similar the two words are. A word embedding assigns each word a vector such that semantically similar words have a small distance between their vectors and dissimilar words have a large one.
So after the word embedding layer, the vectors assigned to "fruit" and "apple" will be a small distance apart. On the other hand, two words that are not semantically close, like "fruit" and "car", will be assigned vectors that are a large distance apart.
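To make the contrast concrete, here is a minimal sketch in plain MATLAB. The integer codes and the 3-D vectors are made up for illustration (they are not the output of any trained network or toolbox function); the point is only that the gap between two codes says nothing about meaning, while the distance between two embedding vectors can.

```matlab
% Hypothetical integer codes, as an encoding step might assign them.
idxApple = 12;
idxFruit = 845;
idxCar   = 13;

% Hand-picked 3-D vectors standing in for learned embeddings (illustrative only).
vApple = [0.9; 0.8; 0.1];
vFruit = [0.8; 0.9; 0.2];
vCar   = [0.1; 0.2; 0.9];

% Gaps between codes are arbitrary: here "apple" looks closer to "car".
codeGapAppleFruit = abs(idxApple - idxFruit);
codeGapAppleCar   = abs(idxApple - idxCar);

% Euclidean distances between vectors reflect the intended similarity.
dAppleFruit = norm(vApple - vFruit);
dAppleCar   = norm(vApple - vCar);

fprintf("apple-fruit: code gap %d, vector distance %.3f\n", codeGapAppleFruit, dAppleFruit);
fprintf("apple-car:   code gap %d, vector distance %.3f\n", codeGapAppleCar, dAppleCar);
```

With these assumed values, the code gap is larger for "apple" and "fruit" than for "apple" and "car", while the vector distance is smaller, which is the behavior the embedding is meant to provide.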
Applying the word embedding layer therefore makes training easier for the network. I hope this answers your question.
You may refer to the resources below to learn more about the word embedding layer.
3 Comments
Pratyush on 7 Jul 2023
Yes, you are right. The embedding layer learns a semantically meaningful vector for each word, even though the words arrive in encoded form. Encoding alone does not provide a meaningful semantic representation.
The vocabulary is simply the set of unique words or tokens in the training dataset. Say the training dataset has 1023 unique words. The word embedding layer then takes the number of unique words (1023) as a parameter, and for each word index from 1 to 1023 it tries to learn a semantically meaningful vector. Here is the relevant line from the documentation example you linked earlier.
wordEmbeddingLayer(embeddingDimension,numWords,Name="emb")
The second parameter numWords was computed earlier as the number of unique words or tokens in the training dataset as shown below. 
enc = wordEncoding(documentsTrain); % Encoding of words
numWords = enc.NumWords; % Total number of unique words
So the required information of the vocabulary is provided to the network as the numWords parameter.
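Conceptually, an embedding layer can be thought of as a trainable lookup table indexed by word number. The sketch below uses only base MATLAB and is not the toolbox's implementation of wordEmbeddingLayer: the weight layout (one column per word), the value of embeddingDimension, and the example index sequence are all assumptions made for illustration.

```matlab
numWords = 1023;            % vocabulary size, as in the example above
embeddingDimension = 100;   % assumed value, chosen for illustration

% Randomly initialized weights, one column per word index (assumed layout).
% During training these values would be updated like any other weights.
W = 0.01 * randn(embeddingDimension, numWords);

% A made-up sequence of word indices standing in for one encoded document.
sequence = [5 17 1023 17];

% The lookup: each index selects its column of W.
X = W(:, sequence);         % embeddingDimension-by-numel(sequence)

disp(size(X));
disp(isequal(X(:,2), X(:,4)));   % the same index always maps to the same vector
```

This is why numWords has to be passed to the layer: it fixes how many vectors the table holds, and every valid word index must fall in the range 1 to numWords.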