This file gathers formal proofs under explicit hypotheses. The proofs below do not claim to establish the theory in full generality; they prove precise results inside minimal models that capture the central mechanism of the concept.
Consider two stationary policies in the same game:
- pi_exploit: always chooses exploit;
- pi_preserve: always chooses preserve.
Assume:
- r_exploit > r_preserve;
- o_t in [0,1];
- o_{t+1}^{exploit} = kappa_E o_t, with 0 <= kappa_E < 1;
- o_{t+1}^{preserve} = 1 - kappa_P (1 - o_t), with 0 <= kappa_P < 1;
- U_t = r_t + lambda o_t - eta c_t;
- c_0 = chi >= 0 under preserve and c_t = 0 otherwise;
- 0 < delta < 1.
Define:
Gamma = 1/(1-delta) - (1-o_0)/(1-delta kappa_P) - o_0/(1-delta kappa_E).
Then pi_exploit is structurally worse than pi_preserve if:
Delta r + eta chi (1-delta) < lambda (1-delta) Gamma,
where Delta r = r_exploit - r_preserve,
provided that Gamma > 0.
Under pi_exploit, optionality evolves as:
o_t^E = o_0 kappa_E^t.
Under pi_preserve, it evolves as:
o_t^P = 1 - (1-o_0) kappa_P^t.
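Before summing the series, a quick sanity check of these closed forms: the sketch below (with purely illustrative values for o_0, kappa_E, kappa_P, not prescribed by the model) iterates the two recursions and compares them with the formulas above.

```python
# Sanity-check sketch: iterate the optionality recursions and compare with
# o_t^E = o_0 * kappa_E**t and o_t^P = 1 - (1 - o_0) * kappa_P**t.
# Parameter values are illustrative assumptions.
o0, kappa_E, kappa_P = 0.6, 0.7, 0.5

oE, oP = o0, o0
for t in range(10):
    assert abs(oE - o0 * kappa_E**t) < 1e-12              # exploit closed form
    assert abs(oP - (1 - (1 - o0) * kappa_P**t)) < 1e-12  # preserve closed form
    oE = kappa_E * oE               # o_{t+1}^{exploit}  = kappa_E * o_t
    oP = 1 - kappa_P * (1 - oP)     # o_{t+1}^{preserve} = 1 - kappa_P * (1 - o_t)
print("closed forms match the recursions for t = 0..9")
```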
Hence:
J^E = sum_{t=0}^\infty delta^t [r_exploit + lambda o_0 kappa_E^t]
and
J^P = - eta chi + sum_{t=0}^\infty delta^t [r_preserve + lambda (1 - (1-o_0) kappa_P^t)].
Splitting the sums and evaluating the geometric series:
J^E = r_exploit/(1-delta) + lambda o_0/(1-delta kappa_E)
and
J^P = - eta chi + r_preserve/(1-delta) + lambda [1/(1-delta) - (1-o_0)/(1-delta kappa_P)].
Subtracting:
J^E - J^P = Delta r/(1-delta) + eta chi - lambda Gamma.
Therefore J^E < J^P if and only if
Delta r/(1-delta) + eta chi < lambda Gamma.
Multiplying by (1-delta) > 0, we obtain:
Delta r + eta chi (1-delta) < lambda (1-delta) Gamma.
QED.
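The identity J^E - J^P = Delta r/(1-delta) + eta chi - lambda Gamma can also be checked numerically. The sketch below truncates the two discounted sums at a long horizon and compares them with the closed form; all parameter values are illustrative assumptions, not drawn from the text.

```python
# Sketch of Theorem 1's closed form (illustrative, assumed parameter values):
# compare long truncated discounted sums with
# J^E - J^P = Delta_r/(1 - delta) + eta*chi - lambda*Gamma.
r_exploit, r_preserve = 1.0, 0.6
o0, kappa_E, kappa_P = 0.6, 0.7, 0.5
lam, eta, chi, delta = 2.0, 0.3, 0.4, 0.9

T = 2000  # horizon long enough that the truncation error is negligible
JE = sum(delta**t * (r_exploit + lam * o0 * kappa_E**t) for t in range(T))
JP = -eta * chi + sum(
    delta**t * (r_preserve + lam * (1 - (1 - o0) * kappa_P**t)) for t in range(T)
)

Gamma = 1/(1 - delta) - (1 - o0)/(1 - delta*kappa_P) - o0/(1 - delta*kappa_E)
closed_form = (r_exploit - r_preserve)/(1 - delta) + eta*chi - lam*Gamma

print(JE - JP, closed_form)  # the two numbers should agree to roughly 1e-8
```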
Under the hypotheses of Theorem 1, and assuming Gamma > 0, there exists
lambda_* = [Delta r + eta chi (1-delta)] / [(1-delta) Gamma]
such that J^P > J^E for every lambda > lambda_*.
From Theorem 1:
J^E - J^P = Delta r/(1-delta) + eta chi - lambda Gamma.
Thus J^P > J^E is equivalent to
lambda Gamma > Delta r/(1-delta) + eta chi.
Since Gamma > 0, divide both sides by Gamma and bring the right-hand side over the common denominator (1-delta), obtaining:
lambda > [Delta r + eta chi (1-delta)] / [(1-delta) Gamma] = lambda_*.
QED.
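A minimal sketch of the threshold, under the same illustrative (assumed) parameters as above: lambda_* is the point where Delta J = J^P - J^E crosses zero, and Delta J is positive for any lambda above it.

```python
# Sketch of the corollary (assumed parameter values): compute lambda_* and
# check that Delta J = J^P - J^E changes sign exactly at the threshold.
r_exploit, r_preserve = 1.0, 0.6
o0, kappa_E, kappa_P = 0.6, 0.7, 0.5
eta, chi, delta = 0.3, 0.4, 0.9

dr = r_exploit - r_preserve
Gamma = 1/(1 - delta) - (1 - o0)/(1 - delta*kappa_P) - o0/(1 - delta*kappa_E)
lam_star = (dr + eta*chi*(1 - delta)) / ((1 - delta) * Gamma)

def delta_J(lam):
    # closed form from Theorem 1: J^P - J^E = lam*Gamma - dr/(1-delta) - eta*chi
    return lam * Gamma - dr/(1 - delta) - eta*chi

print(lam_star)
print(delta_J(lam_star))                                    # ~0 at the threshold
print(delta_J(lam_star + 0.1) > 0, delta_J(lam_star - 0.1) < 0)  # True True
```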
In the model of Theorem 1, defining
Delta J(lambda) = J^P - J^E,
we have d Delta J / d lambda = Gamma.
In particular, if Gamma > 0, then Delta J is strictly increasing in lambda.
From the expression obtained in Theorem 1:
J^P - J^E = lambda Gamma - Delta r/(1-delta) - eta chi.
Differentiating with respect to lambda,
d Delta J / d lambda = Gamma.
If Gamma > 0, it follows immediately that Delta J is strictly increasing in lambda.
QED.
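Because Delta J is affine in lambda, the slope recovered from any two evaluations equals Gamma; a short sketch with assumed parameter values:

```python
# Sketch (assumed parameters): Delta J is affine in lambda, so the slope
# between any two points recovers Gamma.
o0, kappa_E, kappa_P, delta = 0.6, 0.7, 0.5, 0.9
dr, eta, chi = 0.4, 0.3, 0.4

Gamma = 1/(1 - delta) - (1 - o0)/(1 - delta*kappa_P) - o0/(1 - delta*kappa_E)
delta_J = lambda lam: lam*Gamma - dr/(1 - delta) - eta*chi

slope = (delta_J(2.0) - delta_J(1.0)) / (2.0 - 1.0)
print(slope, Gamma)  # equal up to floating-point rounding: d(Delta J)/d(lambda) = Gamma
```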
In the same model,
d Delta J / d chi = - eta.
Therefore, for eta > 0, the structural advantage of preserve is strictly decreasing in chi.
From:
Delta J = lambda Gamma - Delta r/(1-delta) - eta chi,
differentiating with respect to chi,
d Delta J / d chi = - eta.
If eta > 0, the sign is strictly negative.
QED.
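The analogous check for chi (again with assumed values; Gamma is simply taken as a positive constant here): the slope of Delta J in chi is -eta.

```python
# Sketch for the cost sensitivity (assumed values; Gamma taken as a positive
# constant for illustration): the slope of Delta J in chi equals -eta.
lam, Gamma, dr, delta, eta = 2.0, 7.65, 0.4, 0.9, 0.3

delta_J = lambda chi: lam*Gamma - dr/(1 - delta) - eta*chi
slope = (delta_J(1.0) - delta_J(0.0)) / (1.0 - 0.0)
print(slope, -eta)  # equal up to floating-point rounding
```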
Let r'(x,a,g) = m r(x,a,g) + b, with m > 0. Then:
argmax_a r'(x,a,g) = argmax_a r(x,a,g).
For any actions a_1, a_2,
r'(x,a_1,g) >= r'(x,a_2,g)
if and only if
m r(x,a_1,g) + b >= m r(x,a_2,g) + b.
Subtract b and divide by m > 0, obtaining:
r(x,a_1,g) >= r(x,a_2,g).
Therefore the order induced by r' coincides with the order induced by r, and the maximizer sets are identical.
QED.
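A toy sketch of this invariance (the action set and reward values are hypothetical): a positive affine transform leaves the set of maximizing actions unchanged, including ties.

```python
# Toy sketch (hypothetical action set and rewards): the positive affine
# transform r' = m*r + b leaves the set of maximizing actions unchanged.
r = {"a1": 0.2, "a2": 0.9, "a3": 0.9}   # note the tie between a2 and a3
m, b = 3.0, -5.0                        # any m > 0 and any b

r_prime = {act: m * val + b for act, val in r.items()}

def argmax_set(values):
    best = max(values.values())
    return {act for act, val in values.items() if val == best}

print(argmax_set(r), argmax_set(r_prime))  # same set: {'a2', 'a3'}
```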
If U'_t = a U_t + b, with a > 0, then
J'^pi = a J^pi + b/(1-delta).
Hence, for any policies pi_1, pi_2,
J^{pi_1} > J^{pi_2} if and only if J'^{pi_1} > J'^{pi_2}.
By definition:
J'^pi = sum_{t=0}^\infty delta^t (a U_t + b).
Separating terms:
J'^pi = a sum_{t=0}^\infty delta^t U_t + b sum_{t=0}^\infty delta^t.
Thus:
J'^pi = a J^pi + b/(1-delta).
Since a > 0, the transformation is strictly increasing in J^pi. Therefore it preserves ordering across policies.
QED.
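A sketch with an arbitrary bounded utility stream (randomly generated, purely illustrative): the truncated discounted return of a U_t + b matches a J^pi + b/(1-delta) up to truncation error.

```python
# Sketch with a stand-in utility stream (random, purely illustrative): the
# discounted return of a*U_t + b equals a*J^pi + b/(1 - delta), up to the
# error from truncating the infinite sum.
import random

random.seed(0)
delta, a, b = 0.9, 2.5, -1.0
U = [random.uniform(-1, 1) for _ in range(2000)]  # arbitrary bounded utilities

J = sum(delta**t * u for t, u in enumerate(U))
J_prime = sum(delta**t * (a * u + b) for t, u in enumerate(U))

print(J_prime, a * J + b / (1 - delta))  # agree up to truncation error
```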
Let:
p_stay(r) = sigma(alpha + beta r + gamma e),
with sigma(z) = 1/(1+e^{-z}) and beta > 0. Then p_stay is strictly increasing in r.
The derivative of the logistic function is:
sigma'(z) = sigma(z)(1-sigma(z)).
By the chain rule:
dp_stay/dr = beta sigma(z)(1-sigma(z)).
Since beta > 0 and 0 < sigma(z) < 1, it follows that:
dp_stay/dr > 0.
Therefore higher local reward implies a higher probability of staying in the game.
QED.
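A small sketch of the monotonicity (alpha, beta, and the term gamma e are assumed illustrative constants): p_stay evaluated on an increasing grid of r is itself increasing whenever beta > 0.

```python
# Sketch of the monotonicity claim (alpha, beta and the term gamma*e are
# assumed illustrative constants): p_stay is increasing in r when beta > 0.
import math

alpha, beta, gamma_e = -1.0, 2.0, 0.5   # gamma_e stands for the product gamma*e
sigma = lambda z: 1 / (1 + math.exp(-z))
p_stay = lambda r: sigma(alpha + beta * r + gamma_e)

rs = [i / 10 for i in range(-20, 21)]   # grid of reward values from -2 to 2
assert all(p_stay(r2) > p_stay(r1) for r1, r2 in zip(rs, rs[1:]))
print("p_stay is strictly increasing on the sampled grid")
```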
Assume the competence proxy q in [0,1] determines expected reward in the bad game:
E[r | q] = q r_H + (1-q) r_L,
with r_H > r_L. Assume further that:
- p_stay is increasing in r;
- the time spent in the bad game follows a geometric distribution with parameter 1 - p_stay.
Then expected time in the bad game,
E[tau_bad | q] = 1 / (1 - p_stay(q)),
is increasing in q.
Since r_H > r_L, we have:
dE[r|q]/dq = r_H - r_L > 0.
Because p_stay is increasing in r, the composition p_stay(q) := p_stay(E[r | q]) is increasing in q.
Now consider the function:
f(p) = 1/(1-p), for 0 <= p < 1.
We have:
f'(p) = 1/(1-p)^2 > 0.
Therefore f is increasing in p. Since p_stay(q) increases with q, then:
E[tau_bad | q] = f(p_stay(q))
also increases with q.
QED.
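A sketch composing the three steps (the logistic form of p_stay and all coefficients are carried over from the previous sketch as assumptions): E[tau_bad | q] evaluated on a grid of q is increasing.

```python
# Sketch composing the three steps (logistic p_stay and all coefficients are
# assumptions carried over from the previous sketch):
# E[tau_bad | q] = 1 / (1 - p_stay(q)) increases in q.
import math

r_H, r_L = 1.0, -0.5                    # r_H > r_L, illustrative values
alpha, beta, gamma_e = -1.0, 2.0, 0.5   # beta > 0, assumed as above

sigma = lambda z: 1 / (1 + math.exp(-z))
expected_r = lambda q: q * r_H + (1 - q) * r_L        # E[r | q]
p_stay = lambda q: sigma(alpha + beta * expected_r(q) + gamma_e)
expected_tau = lambda q: 1 / (1 - p_stay(q))          # mean of the geometric law

qs = [i / 20 for i in range(21)]
assert all(expected_tau(q2) > expected_tau(q1) for q1, q2 in zip(qs, qs[1:]))
print("E[tau_bad | q] is increasing on the sampled grid of q")
```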
The proofs above do not complete the theory. They do something more important at this stage: they show that the central thesis already has reproducible formal instances, with clear and checkable conditions, without depending on unbounded optionality.