Title: Refactor: Delegate GRPO and Evasion capabilities to OSS Alternatives (OpenRLHF, browserforge) #187
Body:
Background
In line with our "Borrow, Do Not Build" engineering doctrine, we have identified two major subsystems in the CoReason platform whose custom math and logic can be handled by maintained OSS libraries:
- GRPO & PRMs: We currently track RLHF/GRPO advantage scores and Process Reward Model evaluations manually via EpistemicRewardGradientPolicy, CognitiveRewardEvaluationReceipt, and ProcessRewardContract. Calculating PPO/GRPO policy gradients and managing KL-divergence penalties internally across distributed GPUs is unstable at scale.
- Browser Evasion: We built AdversarialEmulationProfile, KinematicNoiseProfile, and EnvironmentalSpoofingProfile for 1/f pink noise mouse movements, JA3 TLS fingerprint spoofing, and WebGL canvas hashing. The cat-and-mouse game of evading CDNs requires daily updates, which makes an internal implementation fragile.
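For reviewers less familiar with the math being handed off, here is a minimal sketch of the group-relative advantage that GRPO is built around: each sampled completion's reward is normalized against the mean and standard deviation of its own sampling group. The function name and the `eps` guard are illustrative assumptions, not code from coreason-manifest. Whether a trainer uses population or sample standard deviation varies, so treat this as a reference point rather than the exact computation any library performs.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its own sampling group (GRPO-style).

    Hypothetical helper for illustration only; `eps` avoids division by
    zero when every completion in the group receives the same reward.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt, scored by a reward model.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Even this small piece needs care around degenerate groups. The policy-gradient and KL-penalty machinery layered on top is the part this issue proposes to stop maintaining in-house.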
Proposed Solution
Rip out the internal logic and delegate to OSS primitives:
- OpenRLHF / HuggingFace TRL: We will use coreason-manifest to label data (creating EpistemicGroundedTaskManifest), but offload actual backpropagation and step-level PRM verification to OpenRLHF.
- browserforge / curl-impersonate: Delete custom TLS and Canvas spoofing math and delegate browser instantiation to maintained OSS libraries. This keeps CoReason logic strictly focused on deterministic navigation/clicking, not environment spoofing.
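One way to picture the browser-side boundary is a small interface that navigation code depends on, with the OSS-backed session injected from outside. The sketch below is an assumption about how that seam could look. `BrowserSession`, `SessionFactory`, `run_navigation`, and the recording stand-ins are hypothetical names, and no browserforge or curl-impersonate API is shown.

```python
from dataclasses import dataclass, field
from typing import Protocol


class BrowserSession(Protocol):
    """Hypothetical minimal surface that navigation code depends on."""

    def goto(self, url: str) -> None: ...
    def click(self, selector: str) -> None: ...


class SessionFactory(Protocol):
    """Hypothetical factory; an OSS-backed adapter would implement this."""

    def new_session(self) -> BrowserSession: ...


def run_navigation(factory: SessionFactory, url: str, selectors: list[str]) -> None:
    """Deterministic navigation only: no fingerprint or TLS logic lives here."""
    session = factory.new_session()
    session.goto(url)
    for selector in selectors:
        session.click(selector)


@dataclass
class RecordingSession:
    """In-memory stand-in used to exercise the boundary without a browser."""

    actions: list[str] = field(default_factory=list)

    def goto(self, url: str) -> None:
        self.actions.append(f"goto {url}")

    def click(self, selector: str) -> None:
        self.actions.append(f"click {selector}")


@dataclass
class RecordingFactory:
    """Hands out one shared RecordingSession so its actions can be inspected."""

    session: RecordingSession = field(default_factory=RecordingSession)

    def new_session(self) -> RecordingSession:
        return self.session


factory = RecordingFactory()
run_navigation(factory, "https://example.com", ["#login", "#submit"])
```

With a seam like this, swapping the spoofing layer becomes a change to the adapter behind `SessionFactory`, and the navigation path stays testable without any network access.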
Tasks
- Remove EpistemicRewardGradientPolicy, CognitiveRewardEvaluationReceipt, and ProcessRewardContract from coreason-manifest.
- Remove AdversarialEmulationProfile, KinematicNoiseProfile, and EnvironmentalSpoofingProfile from coreason-manifest.
- Run universal_ontology_compiler.py to regenerate JSON and Language bindings.
- [...] coreason-runtime.
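To confirm the removals are complete, a quick scan for leftover references to the six class names may help. This is a minimal sketch using only the standard library. The function name and the default file suffixes are assumptions and should be adjusted to whatever file types the repositories actually contain.

```python
import re
from pathlib import Path

# Class names slated for removal in this issue.
REMOVED_NAMES = (
    "EpistemicRewardGradientPolicy",
    "CognitiveRewardEvaluationReceipt",
    "ProcessRewardContract",
    "AdversarialEmulationProfile",
    "KinematicNoiseProfile",
    "EnvironmentalSpoofingProfile",
)
PATTERN = re.compile(r"\b(" + "|".join(REMOVED_NAMES) + r")\b")


def find_stale_references(
    root: Path, suffixes: tuple[str, ...] = (".py", ".json")
) -> list[tuple[str, int, str]]:
    """Return (path, line number, name) for every leftover reference under root."""
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in PATTERN.finditer(line):
                hits.append((str(path), lineno, match.group(1)))
    return hits
```

Calling `find_stale_references(Path("path/to/checkout"))` on a local clone (the path here is a placeholder) should return an empty list once the tasks above are done. Any remaining tuples point at files that still need updating.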