Data Analysis and Text Generation

About the Project

This project implements fundamental Machine Learning and Data Analysis algorithms from scratch. My goal was to dive deep into the mathematics and numerical methods behind modern AI models, moving away from high-level libraries and building the core algorithms directly.

The toolkit is divided into three main modules: Anomaly Detection, Kernel Regression, and Stochastic Text Generation.


1. Anomaly Detection

In any machine learning pipeline, maintaining a clean and consistent dataset is crucial. My goal with this module was to implement an automated system capable of identifying outliers (anomalies) in large datasets.

How it works:

  • Gaussian Distribution: The system calculates the mean and variance of each feature to estimate the probability density of every example in the dataset.
  • Threshold Optimization: Anomaly identification requires an optimal probability threshold; any data point whose probability falls below this threshold is flagged as an outlier.
  • Performance Metrics: To find the best threshold, the model iteratively evaluates candidate values and selects the one that maximizes the F1 score, which balances precision (true positives over all flagged positives) and recall (true positives over all actual outliers).
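The steps above can be sketched in NumPy as follows. This is a minimal illustration, not the repository's code: the function names, the per-feature independence assumption in the density, and the threshold grid are my own choices.

```python
import numpy as np

def estimate_gaussian(X):
    """Per-feature mean and variance of the training set."""
    return X.mean(axis=0), X.var(axis=0)

def multivariate_pdf(X, mu, var):
    """Density under independent Gaussian features (product of 1-D pdfs)."""
    coef = 1.0 / np.sqrt(2 * np.pi * var)
    expo = np.exp(-((X - mu) ** 2) / (2 * var))
    return np.prod(coef * expo, axis=1)

def select_threshold(p_val, y_val, steps=1000):
    """Scan candidate thresholds on a labeled validation set,
    keeping the epsilon that maximizes the F1 score."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_val.min(), p_val.max(), steps):
        preds = p_val < eps              # flag low-probability points
        tp = np.sum(preds & (y_val == 1))
        fp = np.sum(preds & (y_val == 0))
        fn = np.sum(~preds & (y_val == 1))
        if tp == 0:
            continue
        prec = tp / (tp + fp)
        rec = tp / (tp + fn)
        f1 = 2 * prec * rec / (prec + rec)
        if f1 > best_f1:
            best_f1, best_eps = f1, eps
    return best_eps, best_f1
```

The threshold search needs a small labeled validation set: probabilities come from the Gaussian fit on (assumed clean) training data, and epsilon is tuned where the ground truth is known.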

2. Kernel Regression

For the second module, I wanted to tackle non-linear data. While simple linear regression is useful, real-world data is rarely strictly linear. I implemented a Supervised Machine Learning algorithm known as Kernel Regression to estimate outputs for complex, multi-dimensional inputs.

Key Features:

  • Kernel Functions: To avoid computing complex transformations directly, I implemented the "kernel trick", defining three specific kernel functions that compute the dot product in a higher-dimensional space. The supported kernels are:
    • Linear Kernel
    • Polynomial Kernel
    • Gaussian / Radial-Basis Function (RBF) Kernel
  • Gram Matrix & Regularization: The pairwise relationships between data points are stored in a Gram matrix, which is positive semi-definite. I also included a regularization parameter to control the bias-variance trade-off and prevent overfitting.
  • Conjugate Gradient Method: Inverting large matrices is computationally expensive. To optimize the system and scale it efficiently, I integrated the Conjugate Gradient Method, an iterative algorithm that solves systems of linear equations without needing to explicitly invert the large Gram matrix.
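Putting these pieces together, a kernel ridge regressor solved with the Conjugate Gradient Method might look like the sketch below. The names and defaults are my own, and only the RBF kernel is shown; the linear and polynomial kernels the repository supports would slot in the same way.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    """Gaussian / RBF kernel: exp(-gamma * ||x - x'||^2)."""
    sq_dist = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dist)

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A,
    without ever forming A's inverse explicitly."""
    x = np.zeros_like(b)
    r = b - A @ x          # residual
    p = r.copy()           # search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def fit_predict(X_train, y_train, X_test, gamma=1.0, reg=1e-3):
    """Kernel ridge regression: solve (K + reg*I) alpha = y,
    then predict with k(x_test, X_train) @ alpha."""
    K = rbf_kernel(X_train, X_train, gamma)
    alpha = conjugate_gradient(K + reg * np.eye(len(X_train)), y_train)
    return rbf_kernel(X_test, X_train, gamma) @ alpha
```

Adding `reg * I` to the Gram matrix both regularizes the fit and makes the system strictly positive definite, which is what guarantees Conjugate Gradient convergence.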

3. Stochastic Text Generation

My goal for the final module was to build a simplified generative text model using Markov Chains, exploring the foundational concepts behind Natural Language Processing tools like GPT.

Architecture:

  • Tokenization: The model processes text files (a training corpus) by extracting unique words and punctuation marks as tokens.
  • Sliding Window (N-grams): Instead of generating words based purely on the last seen word (which yields poor context), I implemented a sliding window approach to extract features (a context of N tokens) and labels (the next word). This gives the model more context for each prediction.
  • Numerical Encoding: Since numerical indices are far cheaper to process than strings, I built mapping dictionaries that assign a unique integer index to every token and context sequence.
  • Stochastic Matrix Generation: The core of the Markov Chain is stored in a stochastic transition matrix. For any given context (matrix row), the matrix holds the probabilities of all possible upcoming words.
  • Prediction Engine: Text is generated using a weighted random number generator based on the computed stochastic probabilities, resulting in organic, natural-sounding sentences derived directly from the training corpus.
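The pipeline, from n-gram extraction to weighted sampling, can be sketched as below. This is a minimal illustration under my own naming: each row of the stochastic matrix is stored sparsely as a normalized probability list per observed context, which may differ from the repository's integer-indexed dense representation.

```python
import numpy as np
from collections import defaultdict

def build_chain(tokens, n=2):
    """Slide a window of n tokens over the corpus, count each
    following word, and normalize counts into stochastic rows."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(tokens) - n):
        context = tuple(tokens[i:i + n])
        counts[context][tokens[i + n]] += 1
    chain = {}
    for ctx, nxt in counts.items():
        total = sum(nxt.values())        # row sum -> probabilities sum to 1
        words = list(nxt)
        probs = np.array([nxt[w] / total for w in words])
        chain[ctx] = (words, probs)
    return chain

def generate(chain, seed, length=20, rng=None):
    """Weighted random walk over the chain, starting from a seed context."""
    rng = rng or np.random.default_rng()
    out = list(seed)
    for _ in range(length):
        ctx = tuple(out[-len(seed):])
        if ctx not in chain:             # dead end: context never seen
            break
        words, probs = chain[ctx]
        out.append(rng.choice(words, p=probs))
    return out
```

Because each row is normalized, sampling with the row's probabilities reproduces the corpus statistics: a continuation seen twice as often in training is drawn twice as often at generation time.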
