🤖 Sarcasm Detection Using Tokenization and Neural Networks

This project builds a sarcasm detection system using a neural network trained on real news headlines. It processes text by turning words into numbers and feeds them into a model that improves itself through learning.

📊 Dataset

Source: Kaggle - News Headlines Dataset for Sarcasm Detection
Format: JSON (each line = one JSON object)
Fields:
- headline: the text of the news headline
- is_sarcastic: label (1 for sarcastic, 0 for not sarcastic)

🧪 How It Works (Step-by-Step)

1. Load and Prepare Data

Headlines and sarcasm labels are loaded from a JSON file.
First 20,000 examples are used for training, and the rest for testing.

2. Tokenization: Text → Numbers

A Keras Tokenizer is created with a vocabulary size limit of 10,000.
Words in the training data are mapped to unique integers (tokens).
Words not in the vocabulary are replaced with <OOV> (out-of-vocabulary).

3. Padding Sequences

Headline sequences are converted to the same length (100) using post-padding and post-truncation.
This makes the data uniform in size, which is required for input to the neural network.

4. Convert to Numpy Arrays

Both padded sequences and labels are converted into NumPy arrays.
This step is necessary because TensorFlow models require numerical array input.

5. Build Neural Network Model

The model is created using tf.keras.Sequential() with the following layers:

Embedding: Converts each word index into a dense 16-dimensional vector.
GlobalAveragePooling1D: Reduces the sequence into a single feature vector.
Dense (24, relu): Learns hidden patterns.
Dense (1, sigmoid): Outputs a probability score between 0 and 1 for sarcasm.

6. Train the Model

The model is compiled using binary cross-entropy loss and the Adam optimizer.
It is trained for 30 epochs.
During training, the model evaluates itself on the test set after each epoch to improve performance.

7. Evaluate and Test

After training, the model's accuracy is measured on the testing dataset.
You can input new headlines to see the model predict sarcasm probability.

8. Save the Model and Tokenizer

The trained model is saved as sarcasm_model.keras.
The tokenizer is saved using Python’s pickle module as tokenizer.pickle.

🧠 Example Prediction

# Input:
test_sentences = [
    "granny starting to fear spiders in the garden might be real haha crazy",
    "game of thrones season finale showing this sunday night"
]

# Output:
Sentence: "granny starting to fear spiders in the garden might be real haha crazy"
Predicted Sarcasm Probability: 98.55%

Sentence: "game of thrones season finale showing this sunday night"
Predicted Sarcasm Probability: 4.87%

Running in action:

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
.idea		.idea
Data		Data
README.md		README.md
model.py		model.py
sarcasm_model.keras		sarcasm_model.keras
tokenizer.pickle		tokenizer.pickle

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🤖 Sarcasm Detection Using Tokenization and Neural Networks

📊 Dataset

🧪 How It Works (Step-by-Step)

1. Load and Prepare Data

2. Tokenization: Text → Numbers

3. Padding Sequences

4. Convert to Numpy Arrays

5. Build Neural Network Model

6. Train the Model

7. Evaluate and Test

8. Save the Model and Tokenizer

🧠 Example Prediction

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

🤖 Sarcasm Detection Using Tokenization and Neural Networks

📊 Dataset

🧪 How It Works (Step-by-Step)

1. Load and Prepare Data

2. Tokenization: Text → Numbers

3. Padding Sequences

4. Convert to Numpy Arrays

5. Build Neural Network Model

6. Train the Model

7. Evaluate and Test

8. Save the Model and Tokenizer

🧠 Example Prediction

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages