This repository contains pre-trained Assamese GloVe embedding models generated from a large Assamese corpus. It includes three files with different vector sizes.
This work was inspired by the GloVe project, an unsupervised learning algorithm for obtaining vector representations for words, developed by the Stanford Natural Language Processing Group. I would like to acknowledge the GloVe project for their significant contributions to the field of Natural Language Processing.
To use the pre-trained Assamese GloVe embedding models in your project, you can import them with the `gensim` library in Python. Since the files are in GloVe format, you will first need to convert them to Word2Vec format. Here are the steps to do so:
- Install the `gensim` library using pip:

  ```shell
  pip install gensim
  ```
- Import the `glove2word2vec` function from the `gensim.scripts` module:

  ```python
  from gensim.scripts.glove2word2vec import glove2word2vec
  ```
- Define the input and output file paths:

  ```python
  glove_input_file = 'GloVe.400k.300d.txt'
  word2vec_output_file = 'word2vec.400k.300d.txt'
  ```
- Use the `glove2word2vec` function to convert the GloVe model to Word2Vec format:

  ```python
  glove2word2vec(glove_input_file, word2vec_output_file)
  ```
Once you have converted the GloVe model to Word2Vec format, you can load it into gensim using the `KeyedVectors.load_word2vec_format` method.
- Load the converted Assamese GloVe embedding model using the `KeyedVectors.load_word2vec_format` method:

  ```python
  from gensim.models import KeyedVectors

  model = KeyedVectors.load_word2vec_format('word2vec.400k.300d.txt', binary=False)
  ```
Now you can use the loaded model to perform various NLP tasks, such as word similarity, analogy, and more.
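Word similarity here means cosine similarity between the two word vectors. As a minimal stdlib-only sketch of the formula that `model.similarity` computes (the function name and the toy 3-dimensional vectors are illustrative; the real vectors in this model are much larger):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors for illustration; in practice these would come from
# model[word1] and model[word2].
sim = cosine_similarity([0.1, 0.2, 0.3], [0.2, 0.1, 0.3])
```

The result ranges from -1 (opposite directions) to 1 (same direction); gensim computes the same quantity internally on the stored vectors.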
Converting a GloVe model to a word2vec model with the `gensim.scripts.glove2word2vec` function does not change the embedding values themselves. It only changes the file format, from the GloVe format to the word2vec format, so that the model can be loaded with the `KeyedVectors.load_word2vec_format()` method.
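The only difference between the two text formats is the header: a word2vec text file starts with a `<vocab_size> <vector_size>` line, while a GloVe text file begins directly with the vectors. A stdlib-only sketch of what the conversion does (the function name is illustrative, not part of gensim):

```python
def glove_to_word2vec(glove_path, word2vec_path):
    """Prepend the "<vocab_size> <vector_size>" header required by the
    word2vec text format; the embedding lines are copied unchanged."""
    with open(glove_path, encoding='utf-8') as f:
        lines = f.readlines()
    vocab_size = len(lines)
    vector_size = len(lines[0].split()) - 1  # first token on each line is the word
    with open(word2vec_path, 'w', encoding='utf-8') as f:
        f.write(f"{vocab_size} {vector_size}\n")
        f.writelines(lines)
```

Recent gensim versions (4.x) can also skip the conversion entirely by loading the GloVe file directly with `KeyedVectors.load_word2vec_format(path, binary=False, no_header=True)`.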
To illustrate how to use the model, here are two examples:
- Compute the cosine similarity between two Assamese words:

  ```python
  word1 = 'মেকুৰী'  # cat
  word2 = 'কুকুৰ'  # dog
  cos_sim = model.similarity(word1, word2)
  print(f"Cosine similarity between '{word1}' and '{word2}': {cos_sim:.4f}")
  ```

  Output:

  ```
  Cosine similarity between 'মেকুৰী' and 'কুকুৰ': 0.5876
  ```
- Find the 10 most similar words to a given word:

  ```python
  word = 'আহাৰ'  # food
  topn = 10
  similar_words = model.similar_by_word(word, topn=topn)
  print(f"{topn} most similar words to '{word}':")
  for i, (w, sim) in enumerate(similar_words):
      print(f"{i + 1}. {w} ({sim:.4f})")
  ```

  Output:

  ```
  10 most similar words to 'আহাৰ':
  1. দুপৰীয়াৰ (0.6657)
  2. আহাৰৰ (0.6602)
  3. জুমিয়ে (0.6398)
  4. সুষম (0.6215)
  5. আহাৰো (0.5948)
  6. খাদ্য (0.5947)
  7. শাওণমহীয়া (0.5946)
  8. নিৰামিষ (0.5889)
  9. তিনিসাজ (0.5861)
  10. ভিটামিনযুক্ত (0.5848)
  ```

When using the Assamese GloVe embeddings from this repository, please cite and acknowledge the creators' work with the appropriate link. The pre-trained embeddings can be used for a variety of natural language processing tasks in Assamese, and their use is highly encouraged. Feedback is also welcome, to support the research and development of NLP tools for under-resourced languages.
This work is licensed under the MIT License. Please refer to the LICENSE file for more information.
If you have any questions or suggestions, feel free to create an issue or contact me directly.
Thank you!