ManasBarman229/Assamese-GloVe-Embedding-Model

Pre-trained Assamese GloVe Embedding Model

This repository contains pre-trained Assamese GloVe embedding models trained on a large Assamese corpus. It includes three files with different vector sizes.

Acknowledgment

This work was inspired by the GloVe project, an unsupervised learning algorithm for obtaining vector representations for words, developed by the Stanford Natural Language Processing Group. I would like to acknowledge the GloVe project for their significant contributions to the field of Natural Language Processing.

How to Use

To use the pre-trained Assamese GloVe embedding models in your project, you can easily import them using the gensim library in Python.

You will first need to convert the GloVe file to Word2Vec format. Here are the steps to do so:

  1. Install the gensim library using pip:
    pip install gensim
  2. Import the glove2word2vec function from the gensim.scripts module:
    from gensim.scripts.glove2word2vec import glove2word2vec
  3. Define the input and output file paths:
    glove_input_file = 'GloVe.400k.300d.txt'
    word2vec_output_file = 'word2vec.400k.300d.txt'
  4. Use the glove2word2vec function to convert the GloVe model to Word2Vec format:
    glove2word2vec(glove_input_file, word2vec_output_file)

Once you have converted the GloVe model to Word2Vec format, you can load it into Gensim using the KeyedVectors.load_word2vec_format method.

  1. Load the converted Assamese GloVe embedding model using the KeyedVectors.load_word2vec_format method:
    from gensim.models import KeyedVectors
    
    # Load the Word2Vec-format file written by the conversion step
    model = KeyedVectors.load_word2vec_format('word2vec.400k.300d.txt', binary=False)

Now you can use the loaded model to perform various NLP tasks, such as word similarity, analogy, and more.

Converting a GloVe model to Word2Vec format with the gensim.scripts.glove2word2vec function does not change the embedding values themselves. It only changes the file format by adding the header line (vocabulary size and vector dimension) that the word2vec text format expects, so that the file can be loaded with the KeyedVectors.load_word2vec_format() method.
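
To make the format difference concrete, here is a minimal pure-Python sketch of what such a conversion amounts to: count the rows and the vector dimension, then write a header line followed by the unchanged rows. The function name add_word2vec_header is hypothetical and the sample vectors are made up. This is an illustration of the idea, not gensim's actual implementation.

```python
import os
import tempfile


def add_word2vec_header(glove_path, out_path):
    """Copy a GloVe text file to out_path, prepending '<vocab_size> <dim>'."""
    vocab_size, dim = 0, 0
    # First pass: count rows and read the dimension from the first row.
    with open(glove_path, encoding="utf-8") as src:
        for line in src:
            if vocab_size == 0:
                dim = len(line.split()) - 1  # first token is the word
            vocab_size += 1
    # Second pass: write the header, then copy every row unchanged.
    with open(glove_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        dst.write(f"{vocab_size} {dim}\n")
        for line in src:
            dst.write(line)
    return vocab_size, dim


# Made-up 3-dimensional sample in GloVe text format.
with tempfile.TemporaryDirectory() as tmp_dir:
    src_path = os.path.join(tmp_dir, "glove_sample.txt")
    dst_path = os.path.join(tmp_dir, "word2vec_sample.txt")
    with open(src_path, "w", encoding="utf-8") as f:
        f.write("মেকুৰী 0.1 0.2 0.3\nকুকুৰ 0.2 0.1 0.4\n")
    header_info = add_word2vec_header(src_path, dst_path)
    with open(dst_path, encoding="utf-8") as f:
        converted_lines = f.read().splitlines()

print(header_info)
print(converted_lines[0])
```

The vector rows in the output file are identical to the input, which is why the embedding values are unaffected.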

Example Usage

To illustrate how to use the model, here are two examples:

  1. Compute cosine similarity between some Assamese words
word1 = 'মেকুৰী'
word2 = 'কুকুৰ'
cos_sim = model.similarity(word1, word2)
print(f"Cosine similarity between '{word1}' and '{word2}': {cos_sim:.4f}")

Output:

Cosine similarity between 'মেকুৰী' and 'কুকুৰ': 0.5876
  2. Find the 10 most similar words to a given word
# Find 10 most similar words to a given word
word = 'আহাৰ'
topn = 10
similar_words = model.similar_by_word(word, topn=topn)
print(f"{topn} most similar words to '{word}':")
for i, (w, sim) in enumerate(similar_words):
    print(f"{i+1}. {w} ({sim:.4f})")

Output:

10 most similar words to 'আহাৰ':
1. দুপৰীয়াৰ (0.6657)
2. আহাৰৰ (0.6602)
3. জুমিয়ে (0.6398)
4. সুষম (0.6215)
5. আহাৰো (0.5948)
6. খাদ্য (0.5947)
7. শাওণমহীয়া (0.5946)
8. নিৰামিষ (0.5889)
9. তিনিসাজ (0.5861)
10. ভিটামিনযুক্ত (0.5848)
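
For readers who want to see what these scores mean, the following is a small pure-Python sketch of cosine similarity and a top-n ranking over a dictionary of vectors. The helper names cosine_similarity and top_n_similar and the toy two-dimensional vectors are hypothetical; gensim uses its own optimized routines, so this only illustrates the computation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_n_similar(query, vectors, topn=10):
    """Rank all other words by cosine similarity to the query word."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query  # the query word itself is excluded
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


# Toy vectors, made up for illustration only.
toy_vectors = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
}

ranked = top_n_similar("a", toy_vectors, topn=2)
for i, (w, sim) in enumerate(ranked):
    print(f"{i+1}. {w} ({sim:.4f})")
```

A score near 1 means the two vectors point in almost the same direction, and a score near 0 means they are close to orthogonal.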

When using the Assamese GloVe files from this repository, please cite and acknowledge the work of the creators using the appropriate link. The pre-trained embedding models can be used for a variety of natural language processing tasks in Assamese, and their use is encouraged. Feedback is welcome and helps promote research and development of NLP tools for under-resourced languages.

License

This work is licensed under the MIT License. Please refer to the LICENSE file for more information.

If you have any questions or suggestions, feel free to create an issue or contact me directly.

Thank you!
