Training a custom Word Embedding
The term “Word Embedding” sounds fancy when you first hear it, but it is a simple concept with really wonderful applications. Before getting into what a Word Embedding is, we should understand why we need it.
Humans are much better than computers at understanding textual sentences and their context, whereas computers are much faster than humans at numerical computation and at working with data represented as matrices. So we need a way to map textual data to a numerical form that computers can easily work with. This is where Word Vectorisation comes into the picture: Word Vectorisation is simply a method to map words to vectors.
One naive way of doing this is to simply map a sorted collection of words to fixed numbers. This seems quite easy, so why do we need word embeddings? If we look carefully at this method, it does not give much information about the text data; it is just an alternate way to represent the text. A better approach is to use a Word Embedding.
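To make the idea concrete, here is a minimal sketch of that naive mapping; the tiny vocabulary is just an illustrative example:
vocab = sorted({"man", "made", "computer", "helps"})
word_to_id = {word: idx for idx, word in enumerate(vocab)}
print(word_to_id)   # {'computer': 0, 'helps': 1, 'made': 2, 'man': 3}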
Word Embedding is mainly classified into two categories:
- Frequency based Embedding
- Prediction based Embedding
Frequency based Embedding: Here the text is mapped to a vector based on the number of times each word appears in the text. For example, if we consider the text “Man made computer. Computer helps man”, using frequency based embedding it can be represented as
{ “computer”:2, “helps”:1, “made”:1, “man”:2 } => [2,1,1,2]
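A quick sketch of how that count vector can be produced with Python’s collections.Counter (just to illustrate the idea, not part of the model we train later):
from collections import Counter

text = "Man made computer. Computer helps man"
words = [w.strip(".").lower() for w in text.split()]
counts = Counter(words)
vector = [counts[w] for w in sorted(counts)]   # counts in sorted-vocabulary order
print(dict(sorted(counts.items())))   # {'computer': 2, 'helps': 1, 'made': 1, 'man': 2}
print(vector)                         # [2, 1, 1, 2]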
But the problem with this kind of representation is that it does not keep track of the context of the sentence. So in order to preserve the context of the sentences, methods like CBOW, Skip-gram etc. were introduced.
Prediction based Embedding: Here machine learning algorithms are used to map words to vector form while preserving the context of the data given to the algorithm. Let us consider two sentences: “John loves playing with his dog” and “John plays a lot with his cat”. In these two sentences the words Cat and Dog are related to each other, since John loves to play with his pet. But to a computer the words Cat and Dog end up far apart when we measure the distance between them using a non-prediction based embedding. If we simply use the first approach of mapping words to numbers in order, words like Cat and Car come out closer than Cat and Dog, because the mapping is based on lexicographical order. That is totally wrong, and our computer will never learn the right thing if we represent words this way. So in Prediction based Embedding the machine learning model tries to assign closer vectors to Cat and Dog than to Cat and Car.
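If you want to see this effect without training anything, a small pretrained model from gensim’s downloader works as a quick sketch (the model is downloaded the first time; the exact similarity values depend on the model, but “cat”/“dog” should score clearly higher than “cat”/“car”):
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")   # small pretrained GloVe vectors
print(glove.similarity("cat", "dog"))
print(glove.similarity("cat", "car"))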
Before making this too boring, let me dive into the implementation part …. But if you are interested in knowing more about prediction based word embeddings, you can refer to this link.
So I will be training a Word2Vec embedding using the Indian Food 101 data set. The data can be downloaded from this link.
And we will be using Python and the Gensim library for training the Word2Vec model.
1. First let’s install all the required libraries:
pip install pandas
pip install gensim
pip install nltk
Here pandas is used to read our data from the csv file, and nltk is used for some text processing.
2. Let’s import the libraries
import pandas as pd
import nltk
from nltk import word_tokenize
from string import punctuation
from time import time
from gensim.models import Word2Vec

nltk.download('punkt')   # word_tokenize needs the punkt tokenizer data
3. Read the “Indian Food 101” data which we have already downloaded
df_food=pd.read_csv('indian_food.csv')
df_food.head()
4. Now it’s time for some pre-processing…..
train_data = []
for i in range(len(df_food)):
    row = df_food.iloc[i]
    # combine the related columns of the row into one sentence
    statement = " ".join([row['course'], row['state'],
                          row['name'], row['ingredients']])
    words = word_tokenize(statement)
    words_final = []
    for word in words:
        # drop punctuation and purely numeric tokens
        if (word not in punctuation
                and not word.isdigit()
                and not word[1:].isdigit()):
            words_final.append(word.lower())
    train_data.append(words_final)
Here we are combining the data in the name, course, state and ingredients columns of our dataframe into a single sentence, as they are related to one another. As a result we get one sentence representing each row of the dataframe. The word_tokenize() method then breaks each sentence down into a list of words, and finally we remove the non-alphabetic tokens and the punctuation.
5. Finally it’s time to train our Word2Vec model
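As a quick sanity check we can peek at the result (the exact tokens depend on the downloaded dataset):
print(train_data[0])                     # first tokenised "sentence"
print(len(train_data), "sentences in total")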
import logging
# Setting up the logging to monitor gensim
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s",
                    datefmt='%H:%M:%S', level=logging.INFO)

# initialize our model: ignore words that appear fewer than 2 times,
# use a context window of 3 words
w2v_model = Word2Vec(min_count=2, window=3)
build the vocabulary with our data set
t = time()
w2v_model.build_vocab(train_data, progress_per=10000)
print('Time to build vocab: {} mins'
      .format(round((time() - t) / 60, 2)))
start training ……
t = time()
w2v_model.train(train_data,
                total_examples=w2v_model.corpus_count,
                epochs=100, report_delay=1)
print('Time to train the model: {} mins'
      .format(round((time() - t) / 60, 2)))
Now let’s check how well our Word2Vec model has been trained
w2v_model.wv.most_similar("rajasthan",topn=10)
Now let’s play some games
Pick the odd one out
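Gensim’s doesnt_match() lets us play this game; the word list below is just an illustrative pick, and any words present in the trained vocabulary will do:
print(w2v_model.wv.doesnt_match(["dosa", "idli", "rajasthan"]))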
We can clearly see that our Word2Vec model is trained well enough to identify the word that is out of context.
Now it’s time for some math
Kachori - Gujarat + Kerala = ?
Khaman - Gujarat + Punjab = ?
Let’s see what our model predicts ……
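These analogy queries map onto gensim’s most_similar() with positive and negative word lists. Note that the tokens were lower-cased during pre-processing, and the state spelling must match whatever the dataset actually uses (assumed here to be “gujarat”):
# Kachori - Gujarat + Kerala = ?
print(w2v_model.wv.most_similar(positive=["kachori", "kerala"],
                                negative=["gujarat"], topn=5))

# Khaman - Gujarat + Punjab = ?
print(w2v_model.wv.most_similar(positive=["khaman", "punjab"],
                                negative=["gujarat"], topn=5))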
Thanks for reading….
Manthan M Kulakarni