Skip to content

TheFausap/RIEMANN

Repository files navigation

Sculpting the Latent Space: Riemannian Neural ODEs for Language Modeling

Abstract Standard Transformer architectures and Euclidean Neural ODEs assume a flat, uniform latent space where state transitions occur along straight lines. However, human language is highly structured, non-linear, and geometrically complex. In this work, we introduce a continuous-time language model that replaces the standard Euclidean latent space with a learned, position-dependent Riemannian manifold. We define the ODE flow as $dz/dt = g^{-1}(z) f_\theta(z)$, where $g(z)$ is a low-rank perturbed metric tensor and $f_\theta(z)$ is a pure vector field.

Through dual regularization—penalizing vector field magnitude and preserving local metric volume—we force the optimizer to actively terraform the latent space. We document the training dynamics of this system on the TinyStories dataset, detailing an initial chaotic "terraforming phase" restricted by geometric capacity, followed by a successful equilibrium using a 64-dimensional metric rank. Finally, we map the trajectory endpoints of the converged manifold, revealing that the Riemannian metric successfully carves out distinct, physically separated topographical basins for specific semantic and syntactic clusters (e.g., distinct valleys for physical nouns and active verbs).

1. Introduction: The Euclidean Problem

In standard deep learning approaches to natural language processing, the latent space is implicitly treated as a flat, Euclidean vector space ($\mathbb{R}^d$). When a neural network processes a sequence of tokens, or when a standard Neural ODE integrates a continuous latent path, the flow between semantic states naturally takes the "lazy" path of least resistance: a straight line.

While computationally convenient, a flat geometry fundamentally ignores the complex, non-linear reality of human language. Words and grammatical structures do not exist at uniform distances from one another. A robust language model should theoretically traverse a complex topology—speeding up through generic transitions and settling deeply into specific, high-friction basins representing concrete semantic concepts.

The Proposition: A Riemannian Latent Space

To force the network to respect the geometry of language, we discard the Euclidean assumption. Instead, we introduce a position-dependent Riemannian metric $g(z)$ that actively warps the latent space, creating a landscape of "friction hills" and "acceleration valleys."

We define the metric as a low-rank perturbation of the identity:

$$g(z) = \epsilon I + \alpha U(z)^T U(z)$$

Where $\epsilon$ provides a base stiffness allowing for flow acceleration, and $U(z)$ is parameterized by a small multi-layer perceptron mapping $\mathbb{R}^d \to \mathbb{R}^{r \times d}$. By applying the Woodbury matrix identity, we can efficiently compute the inverse metric $g^{-1}(z)$ without explicitly forming or inverting a massive $d \times d$ matrix.

The Architecture

Our continuous-time language model operates in three distinct phases:

  1. Encoding: A standard Transformer encoder maps an input sequence of tokens into a starting latent state, $z_0$.
  2. Integration: The state evolves through a continuous ODE governed by our metric and a learned vector field $f_\theta(z)$:

$$\frac{dz}{dt} = g^{-1}(z) f_\theta(z)$$

Crucially, $f_\theta(z)$ is initialized to zero at its final layer, ensuring the ODE begins as an identity map and must actively learn to flow. 3. Decoding: The final integrated state, $z_T$, is passed through an output projection layer to generate the vocabulary logits for cross-entropy loss.

By detaching the ODE from a strict target-driven internal gradient and instead learning a pure vector field, we allow the optimizer to focus entirely on sculpting the underlying geometry of the manifold.

2. Methodology: Forcing the Curve

A neural network naturally defaults to "lazy" representations. If we simply inject a Riemannian metric $g(z)$ into the ODE without constraints, the optimizer will likely collapse $g(z)$ to the identity matrix, effectively reverting to a Euclidean space to avoid complex spatial gradients.

To force the model to actively terraform the latent space, we introduced two competing regularizers. This creates a geometric "tug-of-war" that sculpts the manifold.

The Fuel Cost: Penalizing the Vector Field

The core of our approach is treating the pure vector field $f_\theta(z)$ as "fuel." We apply an L2 penalty to the magnitude of the vector field throughout the integration path:

$$\mathcal{L}_{field} = \lambda_{field} \sum_{t} ||f_\theta(z_t)||^2$$

By penalizing the network for generating large vectors, we force the ODE to seek out computational shortcuts. Because the state transition is governed by $dz/dt = g^{-1}(z) f_\theta(z)$, the network realizes it can achieve a large displacement $dz$ with a very small fuel expenditure $f_\theta(z)$ if it travels through regions where $g^{-1}(z)$ has large eigenvalues. In physical terms, the network is incentivized to dig low-friction "valleys" in the latent space where the flow is naturally accelerated.

The Volume Constraint: Preventing Manifold Collapse

If constrained only by the fuel cost, the network would simply turn the entire latent space into one infinite valley, collapsing the metric into a globally tiny matrix. To prevent this, we introduce a strict volume-preservation penalty:

$$\mathcal{L}_{metric} = \lambda_{metric} (\log \det g(z))^2$$

This penalty demands that the average volume of the latent space remains constant ($\log \det \approx 0$). This is the catalyst for the topography. For every deep, low-friction valley ($\log \det < 0$) the network digs to save fuel, it is mathematically forced to construct a corresponding massive, high-friction hill ($\log \det > 0$) to balance the equation. The trajectories must then learn to weave gracefully through the valleys to avoid the devastating fuel costs of crossing the newly formed hills.

The Engineering Reality: Navigating Hardware Bottlenecks

Implementing this continuous Riemannian flow requires significant computational overhead. Computing the gradients and determinants of the metric requires $O(N^3)$ operations. For a sequence length of 192 and a batch size of 32, the ODE solver must evaluate and invert $6,144$ distinct metric matrices per forward step.

During development on Apple Silicon (M4), we encountered a severe hardware bottleneck: the native Metal Performance Shaders (MPS) backend struggled catastrophically with the kernel launch overheads for thousands of batched solves. Attempts to keep the math strictly on the GPU resulted in massive step latency (over 540 seconds per ODE step).

To bypass this limitation, we implemented a brute-force CPU fallback. By explicitly transferring the metric matrices off the GPU to execute standard torch.linalg.solve and slogdet operations on the CPU, we bypassed the Metal driver bottleneck entirely. Surprisingly, this CPU offloading resulted in a 5x speedup over the native GPU implementation, reducing step times to $\sim110$ seconds. This allowed us to successfully scale the metric rank to 64 dimensions and allowed the complex topography to converge overnight.

3. The Terraforming Phase: Building the Friction Cliff

To observe the network's geometric learning process, we captured telemetry at step 3,000 of training. In this initial phase, the model's geometric capacity was artificially restricted by setting the metric rank to $24$ (for a latent dimension of $d=384$).

The results were mathematically violent. Desperate to satisfy the volume-preservation penalty ($\log \det \approx 0$), but lacking the high-dimensional freedom to distribute the curvature, the optimizer was forced into an extreme topographical solution. It constructed a single, massive "friction cliff"—a dense plateau of high geometric penalty that abruptly dropped into a deep, low-friction trench.

This topography forced the ODE to abandon Euclidean logic entirely. Telemetry from the 3,000-step checkpoint revealed:

  • The 50% Detour Tax: The average straight-line displacement for trajectories was $\mu = 53.45$, but the actual path length traveled by the ODE was $\mu = 83.76$. The vector field actively chose to travel 50% further along the trench floor rather than pay the "fuel cost" of crossing the high-friction plateau.
  • Plummeting Linearity: The linearity metric—defined as the straight-line displacement divided by the actual path length—dropped precipitously to $0.646$.
  • Active Instability: The terrain was not yet balanced. The volume penalty ($\log \det$) was still climbing from $-800$, hovering around $-297$. The network was actively fighting itself, occasionally causing massive spikes in the vector field penalty as trajectories smashed into newly formed topographical walls.

This phase proved that the continuous flow could be forced to curve, but the restricted rank of $24$ lacked the dimensional freedom necessary for an elegant solution.

4. High-Dimensional Equilibrium: The Saddle and the Flow

To provide the network with the necessary canvas to resolve the volume penalty, we restarted the flow and expanded the metric rank to $64$. By step 12,000, the latent space achieved a state of absolute geometric equilibrium.

With nearly triple the dimensional capacity, the optimizer no longer needed to build brutal cliffs. Instead, it smoothed the landscape into a massive, sweeping saddle shape. Visualizations of the latent space at this stage reveal a perfectly functioning non-Euclidean manifold:

  • Absolute Volume Equilibrium: By step 12,300, the $\log \det g(z)$ penalty completely stabilized, oscillating perfectly around $0.003$. The model had successfully balanced the high-friction hills with the low-friction valleys, satisfying the constraint.
  • Deep Attractor Basins: The vector field decay (fDecay) dropped to a highly consistent $0.02$. By the end of the ODE integration steps, the flow had virtually stopped, dropping to 2% of its initial energy. The model was no longer throwing vectors aimlessly; it had dug highly specific, deep attractor basins that reliably caught and held the trajectories.
  • Smooth Geodesic Flow: With the topography balanced, the chaotic 50% detours vanished. Linearity stabilized at $\sim 0.80$. The straight-line displacement was $\mu = 57.84$, while the path length was $\mu = 73.17$. The trajectories were taking calculated, efficient $\sim 25%$ detours, hooking gracefully down the topographical gradients and sliding along established valley floors to reach their semantic destinations.

5. Semantic Cartography: Mapping the Manifold

To prove that this mathematically stable geometry actively correlates with the rules of English grammar, we mapped the semantic meaning of the latent space.

In a standard Transformer, the latent space is static. In our model, however, the ODE physically transports the vectors from a starting point ($z_0$) to an ending point ($z_T$). Crucially, our output projection is only trained to read the destinations ($z_T$).

By sampling 200 trajectories and plotting the specific vocabulary tokens decoded at their exact endpoints, we generated a high-resolution semantic map of the Riemannian manifold. The results (see semantic_map_hr.jpg) demonstrate that the metric did not just group words randomly; it terraformed specific topographical basins for distinct parts of speech:

  • The Noun Valley: The map reveals a distinct, low-elevation basin in the bottom right quadrant. Trajectories plunging into this specific valley terminate in concrete physical nouns, clustering words like "town", "tree", "garden", "boat", "car", and "friends". The manifold physically pulls semantic objects into a shared geographical sink.
  • The Action Peak: Conversely, the trajectories sweeping upwards to the top right of the map carve out a completely separate topographical zone dedicated entirely to active verbs. Here, the flow terminates in tokens like "smiled", "realized", "decided", "started", and "played".
  • The Ejection Zone: Perhaps the most remarkable mathematical artifact is the handling of sequence-terminal states. On the far left of the map, a completely isolated cluster of trajectories shoots away from the main topography, terminating exclusively in <|endoftext|> tokens. The ODE learned to physically eject terminal states completely out of the active semantic manifold, treating the end of a narrative as a mathematical singularity.

6. Conclusion and Future Work

By replacing a standard Euclidean latent space with a learned, position-dependent Riemannian metric, we successfully forced a language model to perform differential geometry.

Through the competing constraints of vector field penalization (fuel cost) and metric volume preservation, the model terraformed a flat 384-dimensional space into a complex topology of high-friction hills and low-friction valleys. Despite the aggressive spatial warping, the model perfectly maintained causal narrative coherence, reliably generating stable TinyStories grammar.

More importantly, the network proved that English syntax has a learnable physical geometry. By routing trajectories through calculated detours to minimize computational "friction," the ODE naturally partitioned the latent space into distinct topographical zones for nouns, verbs, and terminal states.

Future Work: While this experiment successfully terraformed a 64-rank metric on a ~30M parameter model, future iterations will explore scaling this geometry to larger parameter counts. Additionally, exploring complex geometric constraints—such as enforcing Kähler metrics to introduce a structured symplectic form—could yield even tighter semantic clustering and more efficient continuous flows.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors