Skip to content

Luc0-0/learning-nlp-from-scratch

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

29 Commits
 
 
 
 
 
 
 
 

Repository files navigation

NLP Learning Material - Beginner-Friendly Course Notes

🎯 Overview

This repository contains comprehensive, beginner-friendly Jupyter notebooks covering Natural Language Processing (NLP) fundamentals. Each notebook was originally taken during a Udemy course and has been upgraded with:

Detailed explanations - Not just what, but why
Inline comments - Every non-trivial code line explained
Real-world context - When and why you'd use each technique
Common mistakes - What beginners get wrong and how to fix it
Trade-off analysis - Speed vs. accuracy, simplicity vs. power
Practice exercises - Hands-on learning with solutions


📚 Notebook Descriptions

✅ COMPLETED (7 notebooks)

1. Tokenization - Breaking Text into Words

  • What tokenization is and why it's essential
  • Sentence vs. word tokenization
  • How NLTK handles edge cases (contractions, punctuation)
  • When simple .split() fails

Learn: The foundational first step of all NLP pipelines

from nltk.tokenize import word_tokenize, sent_tokenize
tokens = word_tokenize("Emma's cat is named Luna.")
# Result: ['Emma', "'s", 'cat', 'is', 'named', 'Luna', '.']

2. Lowercasing - Text Normalization Basics

  • Why lowercasing reduces vocabulary
  • When NOT to lowercase (NER, sentiment analysis, acronyms)
  • Batch processing with list comprehensions
  • Trade-offs between different normalization approaches

Learn: Text preparation fundamentals and when rules apply

sentence = "Her Cat's Name is Luna"
normalized = sentence.lower()  # "her cat's name is luna"

3. Stemming - Fast Word Reduction

  • Porter Stemmer algorithm and how it works
  • Over-stemming and under-stemming problems
  • Why it produces non-words
  • When to use stemming vs. lemmatization

Learn: Speed-focused normalization for bag-of-words models

from nltk.stem import PorterStemmer
ps = PorterStemmer()
ps.stem("connecting")  # Returns "connect"
ps.stem("ponies")      # Returns "poni" (not a real word!)

4. Lemmatization - Smart Word Reduction

  • Dictionary-based approach using WordNet
  • Always produces real English words
  • Why it's more accurate but slower than stemming
  • POS-tag assisted lemmatization

Learn: Accurate normalization for production systems

from nltk.stem import WordNetLemmatizer
lem = WordNetLemmatizer()
lem.lemmatize("better", pos="a")  # Returns "good" (semantically correct!)

5. N-Grams - Capturing Context

  • Unigrams, bigrams, trigrams explained
  • Why context matters ("dog bites man" ≠ "man bites dog")
  • Frequency analysis and visualization
  • Data sparsity problem and solutions
  • Real applications: auto-complete, spell checking, plagiarism detection

Learn: How to model sequential word patterns

import nltk
ngrams = nltk.ngrams(tokens, 2)  # Bigrams: word pairs
# Results show "natural language" appears 5 times, "language processing" 3 times

6. Parts of Speech (POS) Tagging - Grammar Analysis

  • All POS tag types (NOUN, VERB, ADJ, ADV, PROPN, etc.)
  • spaCy vs. NLTK comparison
  • Using POS for sentiment analysis and NER
  • 97%+ accuracy with pre-trained models

Learn: Grammatical role identification for downstream tasks

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Emma loves reading novels")
for token in doc:
    print(f"{token.text}{token.pos_}")  # Emma→PROPN, loves→VERB, etc.

7. Named Entity Recognition (NER) - Entity Extraction

  • Entity types: PERSON, ORG, DATE, NORP, GPE, PERCENT, etc.
  • How neural NER models work
  • Visualizing entities with displacy
  • Common errors and limitations
  • Real applications: knowledge graphs, question answering, information extraction

Learn: Automatic identification and classification of proper nouns

from spacy import displacy
doc = nlp("Google was founded by Larry Page in 1998")
for ent in doc.ents:
    print(f"{ent.text}{ent.label_}")
# Google → ORG, Larry Page → PERSON, 1998 → DATE

⏳ IN PROGRESS (4 notebooks)

These notebooks need anchor sections, detailed comments, and practice exercises:

8. NER Practical - Real-world entity extraction

9. Sentiment Analysis - Opinion mining and polarity detection

10. Practical Sentiment Analysis - Real applications

11. Practical NLP - Complete end-to-end pipeline


🚀 Quick Start

Prerequisites

# Python 3.8+ with NLP libraries installed
pip install nltk spacy pandas matplotlib jupyter

# Download spaCy model
python -m spacy download en_core_web_sm

# Download NLTK data
python -c "import nltk; nltk.download('punkt_tab'); nltk.download('wordnet')"

Running Notebooks

cd Notes
jupyter notebook

# Open any .ipynb file and run cells
# Each notebook is self-contained and executable

📊 Learning Path

Beginner (Start here)

  1. Tokenization - Learn how to split text
  2. Lowercasing - First normalization step
  3. Stemming - Quick word reduction
  4. Lemmatization - Accurate word reduction

Intermediate

  1. N-Grams - Model word sequences
  2. Parts of Speech - Understand grammar roles
  3. Named Entity Recognition - Extract entities

Advanced (Coming soon)

  1. Sentiment Analysis - Classify opinions
  2. Complete NLP Pipeline - Combine all techniques
  3. Custom NER - Train for your domain

💡 Key Concepts Covered

Concept Notebook What You Learn
Text Normalization 1-4 Preparing text for analysis
Feature Engineering 5-6 Creating ML-ready features
Entity Extraction 7-8 Finding structured data in text
Sentiment Analysis 9-10 Opinion mining and classification
End-to-End Pipeline 11 Combining all techniques

🎓 Learning Standards

Every notebook includes:

Anchor Section (at top)

  • What you'll learn
  • Why it matters
  • Real-world applications

Theory First

  • Clear concept explanations
  • Visual examples
  • Comparisons and trade-offs

Code & Comments

  • Every non-trivial line explained
  • Variable names clarified
  • Results interpreted

Common Mistakes

  • What beginners do wrong
  • Why it matters
  • How to fix it

Real Applications

  • When you'd use this technique
  • Production considerations
  • Limitations to know

Practice Exercises

  • 2-3 hands-on exercises per notebook
  • Build on concepts
  • With solution hints

📈 Statistics

Total Notebooks: 11
Completed: 7 (64%)
In Progress: 4 (36%)

Total Markdown Content: 2,436+ lines
Code Comments: 150+
Practice Exercises: 15+
Reference Tables: 25+

🔗 Resources & Libraries

Library Purpose Installation
NLTK Classic NLP toolkit pip install nltk
spaCy Modern production NLP pip install spacy
Pandas Data organization pip install pandas
Matplotlib Visualization pip install matplotlib

❓ FAQ

Q: Should I use NLTK or spaCy?

A:

  • NLTK: Educational value, fine-grained control, older approach
  • spaCy: Production use, speed, accuracy, modern architecture
  • Learn both: NLTK teaches fundamentals, spaCy shows how professionals do it

Q: Do I need machine learning background?

A: No! These notebooks assume only basic Python. ML concepts are explained as needed.

Q: Can I use these notebooks for a course?

A: Yes! Feel free to share, remix, or adapt for teaching. Just mention the source.

Q: What's the difference between stemming and lemmatization?

A:

  • Stemming: Fast, rule-based, may produce non-words ("poni")
  • Lemmatization: Slow, dictionary-based, always real words ("pony")

Use stemming for speed (search engines), lemmatization for accuracy (NER, sentiment).


📝 Notebook Quality Checklist

Each notebook meets these standards:

  • Anchor section with objectives and "Why this matters"
  • Theory explained before code
  • Every non-trivial line has inline comments
  • Variable names are clear and descriptive
  • Common beginner mistakes highlighted with ❌ and ✅
  • Trade-offs section explaining when to use/not use
  • Real-world applications described
  • Key takeaways summary
  • 2-3 practice exercises with hints
  • Code is executable and produces expected output

🛠️ How These Notebooks Were Created

  1. Started with: Raw Udemy instructor notes (minimal explanation)
  2. Added: Anchor sections explaining why each topic matters
  3. Enhanced: Inline comments on every non-trivial line
  4. Explained: Trade-offs, limitations, and best practices
  5. Organized: Into learning progression (basic → advanced)
  6. Tested: Ensured all code runs and produces expected results
  7. Documented: Common mistakes and solutions
  8. Committed: To git with semantic commit messages

🎯 Next Steps

  1. Choose your starting point (usually Tokenization)
  2. Run cells step-by-step (don't skip - understanding matters)
  3. Modify code and experiment (best way to learn)
  4. Complete practice exercises (hands-on reinforcement)
  5. Build something (apply to your own text data)

📞 Contributing

Found an error? Want to improve explanations? Have a suggestion?

  1. Fork the repository
  2. Create a branch (git checkout -b feature/improvement)
  3. Make your changes
  4. Commit with clear message
  5. Push and create a Pull Request

📜 License

These notebooks are provided as educational material. Feel free to use, modify, and share while crediting the original work.


🙏 Acknowledgments

  • Original Udemy instructor: For the foundational course material
  • spaCy & NLTK teams: For excellent NLP libraries
  • Jupyter: For interactive learning environment

📈 Project Progress

[████████████████████░░░░░░░░░░░░░░░░░░░░░░] 64% Complete

✅ Core NLP Concepts (7/11 notebooks)
⏳ Advanced Applications (4/11 notebooks coming soon)

Last updated: December 18, 2025
Status: On track for completion


Happy Learning! 🚀

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors