Abstract
Standard Transformer architectures and Euclidean Neural ODEs assume a flat, uniform latent space where state transitions occur along straight lines. However, human language is highly structured, non-linear, and geometrically complex. In this work, we introduce a continuous-time language model that replaces the standard Euclidean latent space with a learned, position-dependent Riemannian manifold. We define the ODE flow as

$$\frac{dz}{dt} = g(z)^{-1} f_\theta(z)$$
Through dual regularization—penalizing vector field magnitude and preserving local metric volume—we force the optimizer to actively terraform the latent space. We document the training dynamics of this system on the TinyStories dataset, detailing an initial chaotic "terraforming phase" restricted by geometric capacity, followed by a successful equilibrium using a 64-dimensional metric rank. Finally, we map the trajectory endpoints of the converged manifold, revealing that the Riemannian metric successfully carves out distinct, physically separated topographical basins for specific semantic and syntactic clusters (e.g., distinct valleys for physical nouns and active verbs).
In standard deep learning approaches to natural language processing, the latent space is implicitly treated as a flat, Euclidean vector space ($\mathbb{R}^d$).
While computationally convenient, a flat geometry fundamentally ignores the complex, non-linear reality of human language. Words and grammatical structures do not exist at uniform distances from one another. A robust language model should theoretically traverse a complex topology—speeding up through generic transitions and settling deeply into specific, high-friction basins representing concrete semantic concepts.
To force the network to respect the geometry of language, we discard the Euclidean assumption. Instead, we introduce a position-dependent Riemannian metric $g(z)$.
We define the metric as a low-rank perturbation of the identity:
Where
Our continuous-time language model operates in three distinct phases:
- Encoding: A standard Transformer encoder maps an input sequence of tokens into a starting latent state, $z_0$.
- Integration: The state evolves through a continuous ODE governed by our metric and a learned vector field $f_\theta(z)$:

$$\frac{dz}{dt} = g(z)^{-1} f_\theta(z)$$
Crucially, by detaching the ODE from a strict target-driven internal gradient and instead learning a pure vector field, we allow the optimizer to focus entirely on sculpting the underlying geometry of the manifold.
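The integration phase can be sketched numerically as follows. This is a minimal illustration rather than the authors' implementation: it assumes the flow has the form $\dot{z} = g(z)^{-1} f_\theta(z)$, uses a fixed-step Euler solver, and replaces the learned networks with hypothetical stand-ins (`toy_field`, `toy_metric`).

```python
import numpy as np

def integrate_flow(z0, field_fn, metric_fn, n_steps=20, t1=1.0):
    """Euler-integrate dz/dt = g(z)^{-1} f(z) from t=0 to t=t1.

    Returns the full trajectory with shape (n_steps + 1, d).
    """
    dt = t1 / n_steps
    z = np.asarray(z0, dtype=float)
    traj = [z.copy()]
    for _ in range(n_steps):
        g = metric_fn(z)             # (d, d) symmetric positive-definite metric
        f = field_fn(z)              # (d,) raw vector field
        v = np.linalg.solve(g, f)    # velocity g^{-1} f, without forming the inverse
        z = z + dt * v
        traj.append(z.copy())
    return np.stack(traj)

# Hypothetical stand-ins for the learned networks (assumptions, not from the paper).
def toy_field(z):
    return -z                        # pulls every state toward the origin

def toy_metric(z, u=np.array([1.0, 0.0])):
    # Identity plus a position-dependent rank-1 term along u; always positive definite.
    scale = 1.0 + z @ z
    return np.eye(len(z)) + scale * np.outer(u, u)

traj = integrate_flow(np.array([1.0, 1.0]), toy_field, toy_metric)
```

In this toy setup the metric is large along the first axis, so the state moves more slowly in that direction than along the second axis, even though the raw vector field treats both axes identically. That is the sense in which a large metric acts as "friction".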
A neural network naturally defaults to "lazy" representations. If we simply inject a Riemannian metric
To force the model to actively terraform the latent space, we introduced two competing regularizers. This creates a geometric "tug-of-war" that sculpts the manifold.
The core of our approach is treating the pure vector field
By penalizing the network for generating large vectors, we force the ODE to seek out computational shortcuts. Because the state transition is governed by
If constrained only by the fuel cost, the network would simply turn the entire latent space into one infinite valley, collapsing the metric into a globally tiny matrix. To prevent this, we introduce a strict volume-preservation penalty:
This penalty demands that the average volume of the latent space remains constant (
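The two competing regularizers can be sketched as below. The exact formulas are not reproduced in this text, so both forms are assumptions: the fuel cost is taken to be the mean squared norm of the vector field, and the volume penalty is taken to be the squared batch mean of $\log \det g(z)$. The weights `lam_fuel` and `lam_vol` are hypothetical.

```python
import numpy as np

def fuel_cost(f_values):
    """Mean squared norm of the vector field over sampled states.

    f_values: array of shape (n, d), one field evaluation per row.
    """
    return float(np.mean(np.sum(f_values ** 2, axis=1)))

def volume_penalty(metrics):
    """Squared batch mean of log det g(z); zero when the average log-volume is zero.

    metrics: array of shape (n, d, d) of symmetric positive-definite matrices.
    """
    sign, logdet = np.linalg.slogdet(metrics)   # batched over the leading axis
    if np.any(sign <= 0):
        raise ValueError("metric must be positive definite")
    return float(np.mean(logdet) ** 2)

def regularizer(f_values, metrics, lam_fuel=1.0, lam_vol=1.0):
    # lam_fuel and lam_vol are hypothetical weighting coefficients.
    return lam_fuel * fuel_cost(f_values) + lam_vol * volume_penalty(metrics)

balanced = np.stack([2.0 * np.eye(2), 0.5 * np.eye(2)])    # one "hill", one "valley"
collapsed = np.stack([0.5 * np.eye(2), 0.5 * np.eye(2)])   # metric shrunk everywhere
```

Under these assumptions, `volume_penalty(balanced)` is essentially zero because the hill and the valley cancel, while `volume_penalty(collapsed)` is strictly positive. Shrinking the metric everywhere to save fuel is therefore penalized, which is the tug-of-war described above.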
Implementing this continuous Riemannian flow requires significant computational overhead. Computing the gradients and determinants of the metric requires
During development on Apple Silicon (M4), we encountered a severe hardware bottleneck: the native Metal Performance Shaders (MPS) backend struggled catastrophically with the kernel launch overheads for thousands of batched solves. Attempts to keep the math strictly on the GPU resulted in massive step latency (over 540 seconds per ODE step).
To bypass this limitation, we implemented a brute-force CPU fallback. By explicitly transferring the metric matrices off the GPU to execute standard torch.linalg.solve and slogdet operations on the CPU, we bypassed the Metal driver bottleneck entirely. Surprisingly, this CPU offloading resulted in a 5x speedup over the native GPU implementation, reducing step times to
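A minimal sketch of such a CPU fallback is shown below. The helper name `solve_and_logdet_on_cpu` and the toy tensors are hypothetical; only `torch.linalg.solve`, `torch.linalg.slogdet`, and the device transfers are standard PyTorch calls. The device moves are differentiable, so gradients can still flow back to tensors that live on the accelerator.

```python
import torch

def solve_and_logdet_on_cpu(g, f):
    """Solve g v = f and compute log|det g| on the CPU, then move results back.

    g: (batch, d, d) metric matrices; f: (batch, d) vector field values.
    """
    device = g.device
    g_cpu, f_cpu = g.cpu(), f.cpu()
    # Batched linear solve; the right-hand side needs a trailing column dimension.
    v = torch.linalg.solve(g_cpu, f_cpu.unsqueeze(-1)).squeeze(-1)
    sign, logabsdet = torch.linalg.slogdet(g_cpu)
    return v.to(device), logabsdet.to(device)

# Use the Metal backend when present, otherwise run entirely on the CPU.
device = "mps" if torch.backends.mps.is_available() else "cpu"
g = torch.eye(4, device=device).repeat(8, 1, 1) * 2.0   # toy batch of metrics
f = torch.ones(8, 4, device=device)
v, logdet = solve_and_logdet_on_cpu(g, f)
```

Whether the round trip pays off depends on batch size and matrix dimension. The speedup reported here is specific to the hardware and workload described.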
To observe the network's geometric learning process, we captured telemetry at step 3,000 of training. In this initial phase, the model's geometric capacity was artificially restricted by setting the metric rank to
The results were mathematically violent. Desperate to satisfy the volume-preservation penalty (
This topography forced the ODE to abandon Euclidean logic entirely. Telemetry from the 3,000-step checkpoint revealed:
- The 50% Detour Tax: The average straight-line displacement for trajectories was $\mu = 53.45$, but the actual path length traveled by the ODE was $\mu = 83.76$. The vector field actively chose to travel more than 50% further along the trench floor rather than pay the "fuel cost" of crossing the high-friction plateau.
- Plummeting Linearity: The linearity metric, defined as the straight-line displacement divided by the actual path length, dropped precipitously to $0.646$.
- Active Instability: The terrain was not yet balanced. The volume penalty ($\log \det$) was still climbing from $-800$, hovering around $-297$. The network was actively fighting itself, occasionally causing massive spikes in the vector field penalty as trajectories smashed into newly formed topographical walls.
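The displacement, path length, and linearity figures can be computed from a stored trajectory as in the sketch below. The function name `path_stats` and the array layout are assumptions made for illustration.

```python
import numpy as np

def path_stats(traj):
    """Return (displacement, path_length, linearity) for one trajectory.

    traj: array of shape (n_points, d) holding successive ODE states.
    """
    traj = np.asarray(traj, dtype=float)
    displacement = np.linalg.norm(traj[-1] - traj[0])
    # Sum of the lengths of the individual segments between consecutive states.
    path_length = np.linalg.norm(np.diff(traj, axis=0), axis=1).sum()
    linearity = displacement / path_length if path_length > 0 else 1.0
    return displacement, path_length, linearity

# A right-angle path: one unit along the first axis, then one unit along the second.
corner = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
disp, length, lin = path_stats(corner)   # lin = sqrt(2) / 2, about 0.707
```

Dividing the reported means gives $53.45 / 83.76 \approx 0.638$, close to but not identical to the reported $0.646$. A small gap like this would be expected if linearity is computed per trajectory and then averaged, since a mean of ratios generally differs from a ratio of means.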
This phase proved that the continuous flow could be forced to curve, but the restricted rank of
To provide the network with the necessary canvas to resolve the volume penalty, we restarted the flow and expanded the metric rank to 64.
With nearly triple the dimensional capacity, the optimizer no longer needed to build brutal cliffs. Instead, it smoothed the landscape into a massive, sweeping saddle shape. Visualizations of the latent space at this stage reveal a perfectly functioning non-Euclidean manifold:
- Absolute Volume Equilibrium: By step 12,300, the $\log \det g(z)$ penalty completely stabilized, oscillating perfectly around $0.003$. The model had successfully balanced the high-friction hills with the low-friction valleys, satisfying the constraint.
- Deep Attractor Basins: The vector field decay (`fDecay`) dropped to a highly consistent $0.02$. By the end of the ODE integration steps, the flow had virtually stopped, dropping to 2% of its initial energy. The model was no longer throwing vectors aimlessly; it had dug highly specific, deep attractor basins that reliably caught and held the trajectories.
- Smooth Geodesic Flow: With the topography balanced, the chaotic 50% detours vanished. Linearity stabilized at $\sim 0.80$. The straight-line displacement was $\mu = 57.84$, while the path length was $\mu = 73.17$. The trajectories were taking calculated, efficient $\sim 25\%$ detours, hooking gracefully down the topographical gradients and sliding along established valley floors to reach their semantic destinations.
To test whether this mathematically stable geometry correlates with the rules of English grammar, we mapped which tokens the different regions of the latent space decode to.
In a standard Transformer, the latent space is static. In our model, however, the ODE physically transports the vectors from a starting point (
By sampling 200 trajectories and plotting the specific vocabulary tokens decoded at their exact endpoints, we generated a high-resolution semantic map of the Riemannian manifold. The results (see semantic_map_hr.jpg) demonstrate that the metric did not just group words randomly; it terraformed specific topographical basins for distinct parts of speech:
- The Noun Valley: The map reveals a distinct, low-elevation basin in the bottom right quadrant. Trajectories plunging into this specific valley terminate in concrete physical nouns, clustering words like `"town"`, `"tree"`, `"garden"`, `"boat"`, `"car"`, and `"friends"`. The manifold physically pulls semantic objects into a shared geographical sink.
- The Action Peak: Conversely, the trajectories sweeping upwards to the top right of the map carve out a completely separate topographical zone dedicated entirely to active verbs. Here, the flow terminates in tokens like `"smiled"`, `"realized"`, `"decided"`, `"started"`, and `"played"`.
- The Ejection Zone: Perhaps the most remarkable mathematical artifact is the handling of sequence-terminal states. On the far left of the map, a completely isolated cluster of trajectories shoots away from the main topography, terminating exclusively in `<|endoftext|>` tokens. The ODE learned to physically eject terminal states completely out of the active semantic manifold, treating the end of a narrative as a mathematical singularity.
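One way such an endpoint map could be produced is sketched below. The projection method behind the published figure is not stated, so PCA is used here purely as an assumed stand-in. The `endpoints` and `tokens` arrays are synthetic placeholders, not model outputs.

```python
import numpy as np
from collections import defaultdict

def project_endpoints(endpoints):
    """Project (n, d) trajectory endpoints onto their top two principal axes."""
    centered = endpoints - endpoints.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

def centroids_by_token(coords, tokens):
    """Average 2D position of the endpoints that decode to each token."""
    groups = defaultdict(list)
    for xy, tok in zip(coords, tokens):
        groups[tok].append(xy)
    return {tok: np.mean(pts, axis=0) for tok, pts in groups.items()}

# Synthetic stand-in data: two well-separated clusters in 8 dimensions.
rng = np.random.default_rng(0)
endpoints = np.concatenate([rng.normal(0.0, 0.1, (100, 8)),
                            rng.normal(3.0, 0.1, (100, 8))])
tokens = ["tree"] * 100 + ["smiled"] * 100
coords = project_endpoints(endpoints)
centers = centroids_by_token(coords, tokens)
```

With real data, `endpoints` would hold the final ODE states of the 200 sampled trajectories and `tokens` the vocabulary items decoded from them. Comparing the per-token centroids gives a quantitative check on whether nouns, verbs, and terminal tokens really occupy separate regions.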
By replacing a standard Euclidean latent space with a learned, position-dependent Riemannian metric, we successfully forced a language model to perform differential geometry.
Through the competing constraints of vector field penalization (fuel cost) and metric volume preservation, the model terraformed a flat 384-dimensional space into a complex topology of high-friction hills and low-friction valleys. Despite the aggressive spatial warping, the model perfectly maintained causal narrative coherence, reliably generating stable TinyStories grammar.
More importantly, the network proved that English syntax has a learnable physical geometry. By routing trajectories through calculated detours to minimize computational "friction," the ODE naturally partitioned the latent space into distinct topographical zones for nouns, verbs, and terminal states.
Future Work: While this experiment successfully terraformed a 64-rank metric on a ~30M parameter model, future iterations will explore scaling this geometry to larger parameter counts. Additionally, exploring complex geometric constraints—such as enforcing Kähler metrics to introduce a structured symplectic form—could yield even tighter semantic clustering and more efficient continuous flows.