Streaming Weak-SINDy: Sparse Identification of Non-Linear Dynamics for High-Dimensional Data Compression
This repository implements a proprietary framework for the Sparse Identification of Non-linear Dynamics (SINDy) using a Weak Formulation to achieve radical dimensionality reduction in streaming data environments.
⸻
Weak_sindy_compression/
├── src/
│   └── Reduction_with_POD/
│       └── Sample_Data.py    # Core logic for loading data and performing POD
├── sample_data.csv           # Input dataset (n variables × m time steps)
└── README.md                 # Project documentation
In frontier AI systems, the primary bottleneck is often data movement between high-bandwidth memory (HBM) and the compute cores. Traditional lossy compression such as quantization sacrifices numerical stability and "physical" fidelity.
Instead of treating data as a collection of bits, this project treats data as the output of a dynamic physical system. By applying the Weak Form of SINDy, we recover the underlying governing equations, so the stream can be stored as a compact model rather than as raw values.
- Weak Formulation Integration: Utilizing the integral form of the SINDy equation to eliminate the need for numerical differentiation of noisy data.
- Sparse Regression via STLSQ: Implementing Sequentially Thresholded Least Squares to identify the parsimonious model that represents the "Physics" of the data stream.
- Streaming Optimization: Designed for low-latency execution, allowing for real-time dimensionality reduction of model activations or KV-cache states.
The framework implements a two-stage pipeline to compress high-dimensional data by identifying its latent physical manifolds.
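To handle high-dimensional AI state data, we first project the raw data matrix $\mathbf{X} \in \mathbb{R}^{n \times m}$ (n variables × m time steps) onto a low-dimensional basis using Proper Orthogonal Decomposition (POD), computed from the truncated singular value decomposition $\mathbf{X} \approx \mathbf{U}_r \mathbf{\Sigma}_r \mathbf{V}_r^\top$.
By retaining the $r \ll n$ leading modes, each state $\mathbf{x}(t)$ is represented by the coefficient vector $\mathbf{a}(t) = \mathbf{U}_r^\top \mathbf{x}(t)$.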
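The POD stage can be sketched with a plain SVD. This is a minimal illustration rather than the contents of Sample_Data.py: the function name `pod_reduce`, the mean-centering step, and the synthetic rank-2 data are assumptions made for the example.

```python
import numpy as np

def pod_reduce(X, r):
    """Project an (n variables x m time steps) matrix onto its first r POD modes."""
    # Assumption: center each variable over time before the decomposition.
    mean = X.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    U_r = U[:, :r]                       # spatial modes, shape (n, r)
    A = U_r.T @ (X - mean)               # temporal coefficients a(t), shape (r, m)
    energy = np.sum(s[:r] ** 2) / np.sum(s ** 2)  # fraction of variance retained
    return U_r, A, mean, energy

# Synthetic data: 50 variables driven by 2 latent signals (exactly rank 2).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0 * np.pi, 200)
mixing = rng.standard_normal((50, 2))
X = mixing @ np.vstack([np.sin(t), np.cos(t)])

U_r, A, mean, energy = pod_reduce(X, r=2)
X_hat = U_r @ A + mean                   # lift the coefficients back to full space
print(A.shape, float(np.max(np.abs(X - X_hat))))
```

Because the synthetic matrix has rank 2, two modes reproduce it to floating-point precision; on real data the retained-energy ratio is one common way to pick `r`.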
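Once in the reduced space, we identify the governing equations for the coefficients $\mathbf{a}(t)$. Assuming dynamics of the form $\dot{\mathbf{a}}(t)^\top = \Theta(\mathbf{a}(t))\,\Xi$, multiplying by a test function and integrating by parts moves the time derivative onto the test function:

$$-\int \dot{g}(t)\,\mathbf{a}(t)^\top\,dt = \int g(t)\,\Theta(\mathbf{a}(t))\,\Xi\,dt$$

Where:
- $g(t)$ is a compactly supported test function (e.g., a bell-shaped polynomial).
- $\Theta(\mathbf{a}(t))$ is a library of candidate nonlinearities (monomials, interaction terms).
- $\Xi$ is the sparse matrix of coefficients that represents the "Physics" of the stream.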
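The weak-form system can be assembled by integrating the library and the data against a set of shifted test functions. The sketch below is illustrative only: `poly_library`, `bump`, `weak_system`, the quadratic library, the exponent `p=4`, the window settings, and the trapezoid quadrature are all assumptions, and the coefficients are stored time-major with shape (m, r).

```python
import numpy as np

def poly_library(A):
    """Candidate terms [1, a_i, a_i * a_j (i <= j)] for A of shape (m, r)."""
    m, r = A.shape
    cols = [np.ones(m)]
    cols += [A[:, i] for i in range(r)]
    cols += [A[:, i] * A[:, j] for i in range(r) for j in range(i, r)]
    return np.column_stack(cols)

def bump(t, a, b, p=4):
    """Test function g(t) = (t-a)^p (b-t)^p on [a, b] and its derivative."""
    g = (t - a) ** p * (b - t) ** p
    dg = (p * (t - a) ** (p - 1) * (b - t) ** p
          - p * (t - a) ** p * (b - t) ** (p - 1))
    scale = g.max()                      # normalize so every window has peak 1
    return g / scale, dg / scale

def weak_system(t, A, n_windows=20, width=40):
    """Rows: G_k = int g_k Theta dt and B_k = -int g_k' a dt (uniform grid)."""
    Theta = poly_library(A)
    starts = np.linspace(0, len(t) - width, n_windows).astype(int)
    G, B = [], []
    for s in starts:
        tw = t[s:s + width]
        g, dg = bump(tw, tw[0], tw[-1])
        w = np.full(width, tw[1] - tw[0])  # trapezoid quadrature weights
        w[[0, -1]] *= 0.5
        G.append((w * g) @ Theta[s:s + width])
        B.append(-(w * dg) @ A[s:s + width])
    return np.array(G), np.array(B)

# Check on a known system: da/dt = -2a, so a(t) = exp(-2t).
t = np.linspace(0.0, 2.0, 401)
A = np.exp(-2.0 * t)[:, None]
G, B = weak_system(t, A)
Xi = np.linalg.lstsq(G, B, rcond=None)[0]
print(np.round(Xi.ravel(), 3))           # coefficients for the terms [1, a, a^2]
```

No derivative of the data is ever computed: only the analytic derivative of the test function appears, which is what makes the formulation tolerant of noisy samples.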
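We solve for $\Xi$ using Sequentially Thresholded Least Squares (STLSQ), which alternates a least-squares fit with zeroing of coefficients whose magnitude falls below a threshold.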
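A compact version of sequentially thresholded least squares is sketched below. The function name `stlsq`, the default threshold, the iteration count, and the synthetic test problem are assumptions for illustration; the repository's own solver may differ.

```python
import numpy as np

def stlsq(G, B, threshold=0.1, max_iter=10):
    """Sequentially thresholded least squares for G @ Xi = B."""
    Xi = np.linalg.lstsq(G, B, rcond=None)[0]
    for _ in range(max_iter):
        small = np.abs(Xi) < threshold
        Xi[small] = 0.0                  # prune weak terms
        for j in range(B.shape[1]):      # refit each equation on its surviving terms
            keep = ~small[:, j]
            if keep.any():
                Xi[keep, j] = np.linalg.lstsq(G[:, keep], B[:, j], rcond=None)[0]
    return Xi

# Synthetic problem with a known sparse answer and a little noise.
rng = np.random.default_rng(1)
G = rng.standard_normal((100, 5))
Xi_true = np.zeros((5, 2))
Xi_true[1, 0] = -2.0
Xi_true[0, 1] = 0.5
Xi_true[3, 1] = 1.5
B = G @ Xi_true + 1e-3 * rng.standard_normal((100, 2))

Xi = stlsq(G, B, threshold=0.1)
print(np.round(Xi, 3))
```

The threshold sets the trade-off between sparsity and fit: too low keeps spurious terms, too high removes genuine ones, so it usually needs tuning per dataset.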
This combined POD-SINDy approach allows us to represent millions of parameters as a small system of differential equations, providing a path toward low-latency KV-cache reconstruction.
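To illustrate how a state might be recovered from the compressed form, the sketch below integrates an identified model forward in time and lifts the result through the POD modes. Everything here is hypothetical: the helper names (`library`, `rhs`, `integrate_rk4`, `decompress`), the quadratic library, the fixed-step RK4 integrator, and the toy values of `U_r`, `Xi`, and `a0`.

```python
import numpy as np

def library(a):
    """Library [1, a_i, a_i * a_j (i <= j)] evaluated at one state a of shape (r,)."""
    r = a.size
    quad = [a[i] * a[j] for i in range(r) for j in range(i, r)]
    return np.concatenate(([1.0], a, quad))

def rhs(a, Xi):
    """Right-hand side da/dt = Theta(a) @ Xi."""
    return library(a) @ Xi

def integrate_rk4(a0, Xi, t):
    """Classical fourth-order Runge-Kutta on the time grid t."""
    A = np.empty((len(t), a0.size))
    A[0] = a0
    for k in range(len(t) - 1):
        h = t[k + 1] - t[k]
        a = A[k]
        k1 = rhs(a, Xi)
        k2 = rhs(a + 0.5 * h * k1, Xi)
        k3 = rhs(a + 0.5 * h * k2, Xi)
        k4 = rhs(a + h * k3, Xi)
        A[k + 1] = a + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return A

def decompress(U_r, mean, a0, Xi, t):
    """Rebuild the (n x m) data from the modes, an initial condition, and Xi."""
    A = integrate_rk4(a0, Xi, t)         # shape (m, r)
    return U_r @ A.T + mean

# Toy compressed representation: one mode with da/dt = -2a.
U_r = np.array([[0.6], [0.8]])
mean = np.zeros((2, 1))
Xi = np.array([[0.0], [-2.0], [0.0]])    # coefficients for [1, a, a^2]
a0 = np.array([1.0])
t = np.linspace(0.0, 1.0, 101)

X_hat = decompress(U_r, mean, a0, Xi, t)
print(X_hat.shape)
```

In this picture the stored payload is only `U_r`, `mean`, `a0`, and the nonzero entries of `Xi`; reconstruction error then depends on both the POD truncation and the accuracy of the identified model over the integration horizon.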
Prerequisites
• Python 3.8+
• NumPy
Run the Analysis
python3 src/Reduction_with_POD/Sample_Data.py
Ensure your sample_data.csv is formatted with rows as variables and columns as time steps.
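As a quick check of the expected layout, the sketch below writes a small file with rows as variables and columns as time steps, then loads it back. It assumes a comma-delimited file with no header row, which the README does not state; `load_snapshot_matrix` is a hypothetical helper and the file is written to a temporary directory.

```python
import os
import tempfile
import numpy as np

def load_snapshot_matrix(path):
    """Load a CSV as an (n variables x m time steps) array and sanity-check it."""
    # Assumption: comma-delimited numeric values, no header row.
    X = np.loadtxt(path, delimiter=",", ndmin=2)
    if not np.isfinite(X).all():
        raise ValueError("data contains NaN or infinite values")
    return X

# Build a 3-variable, 8-step example and round-trip it through disk.
t = np.linspace(0.0, 1.0, 8)
demo = np.vstack([np.sin(t), np.cos(t), t])
path = os.path.join(tempfile.mkdtemp(), "sample_data.csv")
np.savetxt(path, demo, delimiter=",")

X = load_snapshot_matrix(path)
print(X.shape)  # (3, 8): rows are variables, columns are time steps
```

If a dataset is stored the other way round (one row per time step), transposing it with `X.T` before the POD step gives the layout described above.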
⸻
• Integrate symbolic test functions $\psi(t)$
• Construct feature library $\Theta(\mathbf{a})$
• Implement streaming regression update logic
• Reconstruct original system state from compressed form
⸻
• Russo et al., Streaming Compression of Scientific Data via Weak-SINDy, arXiv:2308.14962
⸻
Catherine Earl
MIT-style license © 2026