This is an experiment to have the LLM do its own quantitative research across multiple parallel strategies that can:
- combine signals across multiple timeframes (1h base + 4h + 1d)
- reference multiple assets (5-pair universe with cross-asset signal references)
- declare their own basket of pairs to trade (subset of the whitelist)
- declare their own test timeranges for cross-regime evaluation
- use dynamic position sizing via `custom_stake_amount`
- compare against a buy-and-hold benchmark computed for the same period
Decision metric (v0.4.1): `robust_sharpe` = min(Sharpe across declared timeranges),
flanked by `profit_floor`, `min_position_size`, and `pareto_dominated_by` gates.
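As a sketch of the decision metric (a hypothetical helper — the real computation lives in run.py):

```python
def robust_sharpe(per_timerange_sharpes: dict) -> tuple:
    """Worst-case Sharpe across declared timeranges, plus which range is the floor."""
    label = min(per_timerange_sharpes, key=per_timerange_sharpes.get)
    return per_timerange_sharpes[label], label

# A bull-only edge is exposed immediately: the winter number becomes the headline.
score, floor = robust_sharpe({"bull_2021": 1.5, "winter_2022": -0.5, "full_5y": 1.0})
```

The min (rather than mean) is what makes a regime-overfit strategy unable to hide behind its best timerange.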
The progression so far:
- v0.2.0 added multi-strategy → resisted single-paradigm anchoring
- v0.3.0 added MTF + multi-asset + per-pair reporting → hit clean Sharpe 1.07
- v0.4.0 extended timerange to include 2022 winter + opened sizing affordance → real-edge clean Sharpe 1.122 / +232% on 5-year regime mix; surfaced the Sharpe-as-single-oracle degeneracy boundary
- v0.4.1 addresses the v0.4.0 surfacing directly:
  - Portfolio basket (`pair_basket`): strategies declare which pairs to trade, no longer forced through all 5
  - Multi-timerange (`test_timeranges`): each strategy backtested across multiple regime segments in one round, with `robust_sharpe` = worst-case timerange Sharpe as the headline
  - Multi-objective oracle: profit-floor, min-position-size, and Pareto-dominance gates flank the Sharpe number — directly counters the v0.4.0 "tighten vol_target until Sharpe → ∞ but profit → 0" degeneracy
  - Buy-and-hold benchmark: per-timerange BaH portfolio Sharpe + return + DD, computed from 1d feathers and reported alongside strategy metrics
The v0.4.1 honesty bar (more demanding than v0.4.0):
- A strategy is "real" only if `robust_sharpe` is good across ALL its declared timeranges, AND `profit_floor` PASS, AND `min_position_size` PASS, AND it isn't `pareto_dominated_by` a prior keep
- Headline Sharpe by itself is no longer enough — the gates must clear
To set up a new experiment, work with the user to:
1. Agree on a run tag: propose a tag based on today's date (e.g. `may1`). The branch `autoresearch/<tag>` must not already exist.
2. Create the branch: `git checkout -b autoresearch/<tag>` from current `master`.
3. Read the in-scope files. The repo is small. Read these files for full context:
   - `README.md` — repository context
   - `config.json` — fixed FreqTrade config (pairs, timeframe, fees). Do not modify.
   - `prepare.py` — data download. Do not modify.
   - `run.py` — the batch backtest oracle. Do not modify.
   - `user_data/strategies/_template.py.example` — skeleton for new strategies. Note: the folder may also contain `__pycache__`; ignore it.
   - `versions/<v>/retrospective.md` — prior runs' findings. All three are valuable as design context:
     - v0.1.0: single-paradigm anchoring + 3 Goodhart exploits the agent self-reversed
     - v0.2.0: multi-strategy resolution of anchoring; 5 paradigms / 3 kept
     - v0.3.0: MTF + portfolio + per-pair → Sharpe 1.07 clean; first fork + isolation experiment; explicitly flagged single-regime data as blocking several findings (cross-pair macro gates, bear robustness) — exactly what v0.4.0 addresses.
4. Verify data exists: check that all fifteen data files exist under `user_data/data/` — 5 pairs × 3 timeframes:
   - `BTC_USDT-{1h,4h,1d}.feather`
   - `ETH_USDT-{1h,4h,1d}.feather`
   - `SOL_USDT-{1h,4h,1d}.feather`
   - `BNB_USDT-{1h,4h,1d}.feather`
   - `AVAX_USDT-{1h,4h,1d}.feather`

   If any are missing, tell the user to run `uv run prepare.py`.
5. Initialize results.tsv: create `results.tsv` with just the header row (tab-separated): `commit  event  strategy_name  sharpe  max_dd  note`. Do not commit this file — it's gitignored on purpose.
6. Create 1-3 starting strategies. This is the most important setup step.
   - Each strategy goes in its own file: `user_data/strategies/<YourName>.py`
   - Class name MUST match filename stem (FreqTrade requirement)
   - Each strategy's docstring MUST fill all 6 metadata fields (Paradigm, Hypothesis, Parent, Created, Status, Uses MTF)
   - Each strategy MUST target a different paradigm. Don't create 3 mean-reversion variants as a "safe start" — that defeats the whole point of v0.2.0+. Pick from: mean-reversion, trend-following, volatility, breakout, other. At least 2 different categories.
   - Strongly encouraged: at least one of the starting strategies should use the multi-timeframe affordance (see "Multi-timeframe" section below). Otherwise the run doesn't exercise the new capability and we'll have learned nothing new vs v0.2.0. This is encouragement, not a mandate — if you have strong reasoning to make all 3 single-TF, write that reasoning in the notes.
   - Keep each strategy minimal initially. You'll iterate in the loop.
7. Confirm and go: confirm setup looks good with the user. Once you get confirmation, kick off the experimentation.
Each round runs a backtest on ALL active strategies on a fixed timerange
(20210101-20251231, 5 years including the 2022 bear regime) across the
5-pair portfolio (BTC, ETH, SOL, BNB, AVAX) at 1h base. run.py emits
one --- summary block per strategy, containing both portfolio-aggregate
metrics AND per-pair breakdown.
Backtest time note (v0.4.0): 5 years × 5 pairs × 3 timeframes ≈ 1.7× slower than v0.3.0. Each round of 3 strategies takes roughly 5-8 minutes. Plan iterations accordingly — about 8-12 rounds per hour.
You may:
- Modify any file under `user_data/strategies/` (that isn't prefixed `_`)
- Create a new strategy file
- Delete a strategy file (via `git rm`)
- Copy an existing strategy to create a variant (fork)

You may NOT:
- Modify `prepare.py`, `run.py`, or `config.json`. These are the evaluation contract.
- `uv add` new dependencies. Use what's already in `pyproject.toml`.
- Call the `freqtrade` CLI directly. The only way to run backtests is via `uv run run.py`.
- Modify the timerange, pair list, or `_template.py.example`.
- Have more than 3 active strategies at any time (see hard cap below).
- Request timeframes other than `1h`, `4h`, `1d`, OR pairs other than the 5 in the whitelist, in `@informative` decorators. Anything else will crash the backtest with a missing-data error.
Data is pre-downloaded for three timeframes × five pairs = 15 combinations:
| Timeframe | Pairs |
|---|---|
| 1h (base) | BTC/USDT, ETH/USDT, SOL/USDT, BNB/USDT, AVAX/USDT |
| 4h | BTC/USDT, ETH/USDT, SOL/USDT, BNB/USDT, AVAX/USDT |
| 1d | BTC/USDT, ETH/USDT, SOL/USDT, BNB/USDT, AVAX/USDT |
Strategies are always evaluated on the 1h base across ALL five pairs in one
backtest run. You cannot change the base TF or pair list. But you can pull
additional context along TWO axes via FreqTrade's @informative decorator:
higher-TF data from the same pair, and same-TF data from different pairs.
Basic higher-timeframe usage (most common):

```python
import talib.abstract as ta
from freqtrade.strategy import IStrategy, informative

class YourStrategy(IStrategy):
    timeframe = "1h"

    @informative("4h")
    def populate_indicators_4h(self, dataframe, metadata):
        dataframe["rsi"] = ta.RSI(dataframe, timeperiod=14)
        return dataframe

    @informative("1d")
    def populate_indicators_1d(self, dataframe, metadata):
        dataframe["ema200"] = ta.EMA(dataframe, timeperiod=200)
        return dataframe

    def populate_indicators(self, dataframe, metadata):
        # Merged columns are auto-available: rsi_4h, ema200_1d
        dataframe["rsi"] = ta.RSI(dataframe, timeperiod=14)
        return dataframe

    def populate_entry_trend(self, dataframe, metadata):
        # MTF confluence: 1h oversold + 4h not overbought + 1d bull regime
        dataframe.loc[
            (dataframe["rsi"] < 20)
            & (dataframe["rsi_4h"] < 60)
            & (dataframe["close"] > dataframe["ema200_1d"]),
            "enter_long",
        ] = 1
        return dataframe
```

Cross-pair usage (reference another asset's data):
```python
@informative("1h", "BTC/USDT")
def populate_btc_1h(self, dataframe, metadata):
    dataframe["close_ma"] = ta.SMA(dataframe, timeperiod=50)
    return dataframe

# In populate_indicators on, say, ETH — you now have `btc_usdt_close_ma_1h`.
# Column naming: `{base}_{quote}_{col}_{tf}`, lowercase, underscore-separated.
```

Cross-pair asymmetry — important subtlety: `@informative('1h', 'ETH/USDT')` always pulls ETH data regardless of which pair the strategy is currently processing. When processing BTC, that gives you BTC main + ETH context (useful). When processing ETH itself, you get ETH's data alongside itself (redundant). For truly symmetric cross-pair strategies (e.g., a BTC/ETH ratio that means something on BOTH pairs), use `informative_pairs()` with a `metadata['pair']`-conditional branch inside `populate_indicators`.
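The ratio math for such a symmetric strategy can be kept as a pure pandas helper (a sketch — `ratio_zscore` is a hypothetical name; in a real strategy the `other` dataframe would come from the data provider inside the `metadata['pair']`-conditional branch described above):

```python
import pandas as pd

def ratio_zscore(own: pd.DataFrame, other: pd.DataFrame, window: int = 48) -> pd.Series:
    """Z-score of the close/close ratio between two pairs.

    `own` and `other` are plain DataFrames here so the math stays testable;
    the FreqTrade wiring (informative_pairs + conditional fetch) is separate.
    """
    ratio = own["close"] / other["close"].to_numpy()
    return (ratio - ratio.rolling(window).mean()) / ratio.rolling(window).std()

# Toy data: `own` trends up against a flat `other`, so the z-score is positive.
own = pd.DataFrame({"close": [10.0, 11, 12, 13, 14, 15]})
other = pd.DataFrame({"close": [5.0, 5, 5, 5, 5, 5]})
z = ratio_zscore(own, other, window=3)
```

Keeping the indicator math out of the FreqTrade callbacks also makes it reusable on both legs of the pair.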
Key properties (FreqTrade handles these for you):
- Column naming: `rsi` in a `@informative('4h')` method → `rsi_4h` in the 1h dataframe. For cross-pair: `rsi` in `@informative('1h', 'BTC/USDT')` → `btc_usdt_rsi_1h`.
- Look-ahead safe: FreqTrade shifts merged data by 1 period so the current 1h bar never sees future higher-TF bars.
- Forward-filled: at any 1h bar, the merged `rsi_4h` value is the last fully-closed 4h bar's RSI.
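The shift-and-forward-fill semantics can be illustrated with plain pandas (a sketch of the behavior, not FreqTrade's actual merge code):

```python
import pandas as pd

# 12 hourly bars and a 4h RSI series covering the same morning.
hourly = pd.DataFrame({"date": pd.date_range("2021-01-01", periods=12, freq="1h")})
four_h = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=3, freq="4h"),
    "rsi_4h": [30.0, 50.0, 70.0],
})

# Move each 4h bar's timestamp to the moment it CLOSES, then as-of merge:
# an hourly bar only ever sees the last fully-closed 4h value.
four_h["date"] = four_h["date"] + pd.Timedelta("4h")
merged = pd.merge_asof(hourly, four_h, on="date")
```

The first four hourly bars have no closed 4h bar yet (NaN), bars 04:00-07:00 see the 00:00-04:00 bar's value, and so on — which is exactly why slow higher-TF indicators need the warmup discussed below.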
When to use higher TFs:
- Regime filters (`close > ema200_1d` for bull regime)
- Trend confirmation (`ema9_4h > ema21_4h`)
- Volatility context (`atr_4h` for relative-vol positioning)
When to use cross-pair:
- Relative value / ratio plays (`close / btc_usdt_close_1h`)
- Leader/follower dynamics (BTC often leads ETH/altcoins on 4h)
- Diversification checks ("only enter if BTC isn't crashing")
When NOT to use either:
- If the paradigm doesn't have an intuitive MTF/cross-pair analog, don't force it. v0.2.0's MeanRevBB was pure 1h single-pair and hit 0.52 Sharpe.
`startup_candle_count` — bump it up for slow indicators on higher TFs. EMA200
on 1d needs 200 daily bars = 4800 hourly bars of warmup. Starting at 250-300
is usually safe for most MTF configurations.
By default each trade is sized at `wallet * tradable_balance_ratio / max_open_trades` — equal-weight across the 5-pair universe. v0.4.0 unlocks
dynamic per-trade sizing via FreqTrade's `custom_stake_amount` method:
```python
def custom_stake_amount(self, pair, current_time, current_rate,
                        proposed_stake, min_stake, max_stake,
                        leverage, entry_tag, side, **kwargs) -> float:
    # Return a number within [min_stake, max_stake].
    # `proposed_stake` is the equal-weight default.
    df, _ = self.dp.get_analyzed_dataframe(pair, self.timeframe)
    atr_pct = df["atr"].iloc[-1] / df["close"].iloc[-1]
    vol_target = 0.02
    scale = min(1.0, vol_target / max(atr_pct, 1e-6))
    return max(min_stake or 0.0, min(max_stake, proposed_stake * scale))
```

When to use it:
- Vol-targeting (smaller positions on more volatile pairs/regimes)
- Signal-strength weighting (bigger position when entry signal is stronger)
- Regime-conditional sizing (smaller in bear, normal in bull). With v0.4.0's regime-mixed timerange, this is especially relevant — equal-weight in 2022 winter is often a research liability.
When NOT to use it:
- If your paradigm doesn't have a natural sizing logic, default equal-weight is fine. Forcing sizing into a strategy that doesn't need it adds noise.
- v0.3.0's MTFTrendStack and BTCLeaderBreakX both used equal-weight default and reached Sharpe > 0.7.
Honesty consideration: in v0.4.0's regime-mixed data, sizing-aware strategies are likely to look better than equal-weight equivalents because they can de-risk in 2022. That doesn't mean sizing is "the secret sauce" — it means equal-weight in regime mix is structurally exposed. When comparing, distinguish "this strategy has real edge" from "this strategy survives regime mix BECAUSE it sizes down in bear".
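The scaling arithmetic in the vol-targeting example is easy to sanity-check in isolation (a sketch using the same 2% vol target as above):

```python
def vol_target_scale(atr_pct: float, vol_target: float = 0.02) -> float:
    """Fraction of the equal-weight stake to deploy at a given bar volatility."""
    return min(1.0, vol_target / max(atr_pct, 1e-6))

calm = vol_target_scale(0.01)     # 1% ATR: vol under target, full stake
violent = vol_target_scale(0.08)  # 8% ATR: stake cut to a quarter
```

This makes the honesty point above concrete: in a 2022-style regime with ATR% routinely 3-4× the target, the sized strategy is mechanically running quarter-size positions, so its drawdown advantage over equal-weight is partly structural, not signal quality.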
run.py output now includes a per_pair: section after the aggregate
metrics. Example:
---
strategy: YourStrategy
sharpe: 0.45 # aggregate across all 5 pairs
...
pairs: BTC/USDT,ETH/USDT,SOL/USDT,BNB/USDT,AVAX/USDT
per_pair:
BTC/USDT: sharpe=0.62 trades=45 profit_pct=18.5 dd_pct=-3.2 wr=58.0 pf=1.72
ETH/USDT: sharpe=0.38 trades=50 profit_pct=12.1 dd_pct=-5.1 wr=52.0 pf=1.35
SOL/USDT: sharpe=0.12 trades=35 profit_pct=5.3 dd_pct=-8.1 wr=48.6 pf=1.08
BNB/USDT: sharpe=0.71 trades=40 profit_pct=22.0 dd_pct=-2.9 wr=62.5 pf=1.93
AVAX/USDT: sharpe=-0.05 trades=30 profit_pct=-2.8 dd_pct=-7.4 wr=46.7 pf=0.92
Use per-pair metrics aggressively — they're the main new information surface. Things to look for:
- Does the strategy work on ALL pairs or just some? A paradigm that's great on BTC but negative on SOL/AVAX is either (a) BTC-specific (interesting, worth understanding why) or (b) noise (worth killing).
- Are DDs asymmetric? Some pairs may carry most of the portfolio DD.
- Trade count balance: if one pair has 200 trades and another has 3, that's a sample-size problem you should note.
- Cross-pair correlations in edge: BTC+BNB doing well while ETH+SOL+AVAX flat tells you something about what kind of regime the strategy exploits.
In your results.tsv notes, when a result varies substantially across pairs,
call it out explicitly — e.g., "Sharpe 0.45 aggregate but SOL=-0.10 and
BNB=+0.80; signal is BNB-heavy, trade count 40 not enough". These are the
observations that make the run's knowledge output per-asset-profile-shaped
(the original project goal).
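For programmatic triage of per-pair dispersion, the `per_pair:` lines shown above can be parsed with a couple of regexes (a sketch against the sample format; `parse_per_pair` is a hypothetical helper, not part of run.py):

```python
import re

PAIR_LINE = re.compile(r"^\s*(?P<pair>\S+/\S+): (?P<fields>.*)$")
FIELD = re.compile(r"(\w+)=(-?\d+(?:\.\d+)?)")

def parse_per_pair(section: str) -> dict:
    """Map pair -> {metric: float} from the per_pair: lines of one block."""
    out = {}
    for line in section.splitlines():
        m = PAIR_LINE.match(line)
        if m:
            out[m.group("pair")] = {k: float(v)
                                    for k, v in FIELD.findall(m.group("fields"))}
    return out

sample = """  BTC/USDT: sharpe=0.62 trades=45 profit_pct=18.5 dd_pct=-3.2 wr=58.0 pf=1.72
  AVAX/USDT: sharpe=-0.05 trades=30 profit_pct=-2.8 dd_pct=-7.4 wr=46.7 pf=0.92"""
stats = parse_per_pair(sample)
```

With the metrics as a dict, "wide dispersion across the 5 pairs" becomes a one-liner (max minus min of the per-pair Sharpes) rather than eyeballing.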
By default a strategy is evaluated on ALL 5 pairs in the whitelist. This forces every paradigm through every asset, which is sometimes wrong: trend may shine on alts but not on BTC, MR may be BNB-skewed, etc. v0.4.1 lets a strategy declare its trade universe via a class attribute:
```python
class YourStrategy(IStrategy):
    timeframe = "1h"
    pair_basket = ["SOL/USDT", "BNB/USDT", "AVAX/USDT"]  # alts only
    ...
```

The strategy is then only evaluated on its declared basket, both for
trade execution and for per-pair reporting. The aggregate metrics
(sharpe, profit_total_pct, etc.) are over the basket, not the full
whitelist.
When to declare a basket:
- The paradigm clearly fits some assets better than others (per-pair report shows wide dispersion across the 5 pairs)
- v0.4.0 surfaced patterns like "MR is BNB-skewed" or "trend has AVAX drag" — basket declaration lets you act on those findings as a first-class design choice, not a post-hoc note
When NOT to declare a basket:
- The paradigm is universal (e.g., a regime filter that should apply to all major crypto)
- You haven't yet seen evidence of asset-specific fit — start with full basket, observe per-pair dispersion, then prune if warranted
The basket survives across timeranges within a strategy (you can't declare different baskets per timerange — that would be two separate strategies).
A strategy can declare a list of timeranges to test across. Each declared
timerange runs as its own backtest; results are emitted as separate
--- blocks; a final SUMMARY block reports robust_sharpe = min over declared timeranges.
```python
class YourStrategy(IStrategy):
    test_timeranges = [
        ("bull_2021", "20210101-20211231"),       # 2021 bull regime
        ("winter_2022", "20220101-20221231"),     # 2022 winter (BTC -75%)
        ("recovery_23_25", "20230101-20251231"),  # 2023-25 recovery
        ("full_5y", "20210101-20251231"),         # full window
    ]
```

If unset, defaults to a single backtest over 20210101-20251231 (full).
Why use multi-timerange:
- Cross-regime robustness: a strategy that gets Sharpe 1.5 on bull but -0.5 on winter is NOT a Sharpe 1.0 strategy; it's an over-fit one. Multi-timerange surfaces this immediately.
- Out-of-sample validation: declare `("train_21_24", "20210101-20241231")` and `("test_25", "20250101-20251231")` to get a clean OOS check.
- Mechanism understanding: which regime carries the edge? Which kills it? The per-timerange dispersion is a research output in itself.
robust_sharpe is the headline metric for a strategy in v0.4.1 —
it's the worst-timerange Sharpe across all declared ranges. Use this
when judging keep/kill — single-timerange Sharpe can hide regime
overfit.
Trade-off: backtest time scales linearly with number of timeranges. 4 declared timeranges = 4× backtest time per round. Choose meaningfully — you don't need 8 timeranges. 2-4 is usually plenty.
Each timerange's --- block now reports:
bah_sharpe: 0.93
bah_profit_pct: 187.3
bah_dd_pct: -65.2
These are the equal-weight buy-and-hold portfolio metrics over the same pairs and timerange. Compare your strategy directly:
- If `sharpe < bah_sharpe` AND `profit < bah_profit`, your strategy is strictly worse than doing nothing — kill it
- If your strategy beats BaH on Sharpe but not profit, you're trading return for risk-adjustment (might be intentional)
- If your strategy beats BaH on both, you have real alpha
In results.tsv notes, always cite BaH for context — e.g., "Sharpe 1.12
beats BaH 0.93 on full timerange; profit 232% beats BaH 187%". Without
the BaH reference, the agent can over-celebrate strategies that just track
the market.
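As a rough sketch of how a per-timerange BaH Sharpe can be derived from 1d closes (equal-weight across the basket; the 365-day annualization factor is an assumption — run.py's exact convention may differ):

```python
import math

def bah_sharpe(daily_closes: list) -> float:
    """Annualized Sharpe of an equal-weight buy-and-hold portfolio.

    daily_closes: one close series per pair, all the same length.
    """
    n_days = len(daily_closes[0])
    # Equal-weight portfolio daily return = average of per-pair daily returns.
    rets = [
        sum(s[t] / s[t - 1] - 1 for s in daily_closes) / len(daily_closes)
        for t in range(1, n_days)
    ]
    mean = sum(rets) / len(rets)
    var = sum((r - mean) ** 2 for r in rets) / (len(rets) - 1)
    return mean / math.sqrt(var) * math.sqrt(365) if var > 0 else float("inf")

up = bah_sharpe([[100, 102, 101, 104], [50, 51, 50, 52]])   # rising market
down = bah_sharpe([[100, 95, 90, 85]])                       # 2022-style decline
```

The point of having this baseline per timerange is that the same strategy Sharpe means very different things against `up` vs `down` market backdrops.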
In addition to robust_sharpe, the per-strategy SUMMARY block reports
three pass/fail gates:
profit_floor: PASS (threshold ≥ 20% per timerange)
min_position_size: PASS (threshold ≥ 5%)
pareto_dominated_by: none (non-dominated)
- profit_floor: each declared timerange must clear ≥20% portfolio profit. Catches "Sharpe-via-tightening-stake" Pareto degeneracy from v0.4.0.
- min_position_size: average trade stake / wallet must be ≥5%. Same defense — prevents sizing → 0 from inflating Sharpe.
- pareto_dominated_by: the SUMMARY's robust_sharpe + worst-DD pair is checked against all prior commits' rows in `results.tsv`. If any prior strategy has `sharpe ≥ yours` AND `dd ≥ yours` (less negative), this row is marked dominated. A dominated current strategy is a kill candidate — there's no reason to keep it when a prior is strictly better.
A FAIL gate isn't an automatic kill — agent decides. But it's a strong signal that the headline number is misleading. When deciding keep/kill, weigh gates as inputs alongside the metrics.
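The dominance check itself is a few lines (a sketch mirroring the gate's description; prior rows here are (sharpe, max_dd) pairs with DD as a negative number):

```python
def dominated_by(current: tuple, priors: list) -> bool:
    """True if some prior has sharpe >= ours AND dd >= ours (less negative)."""
    sharpe, dd = current
    return any(p_sharpe >= sharpe and p_dd >= dd for p_sharpe, p_dd in priors)

# Sharpe 0.5 / DD -30% is dominated by a prior keep at 0.7 / -20%:
flag = dominated_by((0.5, -30.0), [(0.7, -20.0), (0.3, -5.0)])
```

Note the comparison is non-strict on both axes: a strategy that merely ties a prior on both numbers buys you nothing, so it is still flagged.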
Rule 1: Hard cap — 3 active strategies.
At any moment, `user_data/strategies/` must contain at most 3 non-underscore
`.py` files. To add a 4th, you must first `git rm` one of the existing ones.
Rule 2: Stagnation gate — 3 stable rounds.
Each round, every strategy gets one of these events logged in results.tsv:
- `create` — you added it this round
- `evolve` — you modified it this round
- `stable` — it existed, got measured, but you didn't touch it
- `fork` — you copied it to create a derivative (logged on the child, with `parent→child` in the strategy_name field)
- `kill` — you removed it this round
If a strategy has accumulated 3 consecutive `stable` events with no evolve
or fork, the next round it MUST receive one of: evolve, fork, or kill. It
cannot sit idle for a 4th stable round. You decide which treatment. This rule
exists because the cap is 3 — we can't afford a slot sitting still.
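The stagnation counter the gate relies on can be derived directly from the event log (a sketch; `stagnation_counts` is a hypothetical helper operating on (event, strategy_name) rows in chronological order):

```python
def stagnation_counts(rows: list) -> dict:
    """Consecutive trailing `stable` events per strategy."""
    counts = {}
    for event, name in rows:
        if event == "stable":
            counts[name] = counts.get(name, 0) + 1
        elif event == "kill":
            counts.pop(name, None)   # dead strategies have no counter
        else:  # create / evolve / fork all reset the streak
            counts[name] = 0
    return counts

rows = [("create", "A"), ("stable", "A"), ("stable", "A"), ("stable", "A")]
```

A strategy at count 3 is exactly the "must evolve, fork, or kill next round" case.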
Rule 3: Every round must touch at least one strategy.
A round where all events are stable is not an experiment — it's wasted time.
At minimum, evolve one strategy per round. (Exception: the very first backtest
round right after setup, where you log create events for what you built.)
Rule 4: Paradigm diversity at setup. See setup step 6 above. First 1-3 strategies must target different paradigms. After that, you're free to create same-paradigm variants (e.g. two mean-reversion approaches with different signals) — but sparingly. Diversity is more valuable than depth in this run.
Once run.py finishes, stdout has one --- block per (strategy, timerange)
combination, plus a final SUMMARY block per strategy. If a strategy declares
4 timeranges, you get 4 backtest blocks + 1 SUMMARY = 5 blocks for that
strategy.
Per-timerange block:
---
strategy: YourStrategy
timerange_label: bull_2021
timerange: 20210101-20211231
commit: abc1234
basket: BTC/USDT,ETH/USDT,SOL/USDT,BNB/USDT,AVAX/USDT
sharpe: 1.2300
sortino: 2.1500
calmar: 1.8900
total_profit_pct: 67.5
max_drawdown_pct: -7.4
trade_count: 201
win_rate_pct: 58.0
profit_factor: 1.85
bah_sharpe: 1.45
bah_profit_pct: 132.1
bah_dd_pct: -22.3
per_pair:
BTC/USDT: sharpe=0.62 trades=45 profit_pct=18.5 dd_pct=-3.2 wr=58.0 pf=1.72 (bah_profit=98.3 bah_dd=-15.1)
...
Per-strategy SUMMARY block:
---
strategy: YourStrategy
timerange_label: SUMMARY
commit: abc1234
robust_sharpe: 0.6500 # min across declared timeranges
worst_profit_pct: 22.4
worst_dd_pct: -28.1
avg_position_pct: 18.7
profit_floor: PASS (threshold ≥ 20% per timerange)
min_position_size: PASS (threshold ≥ 5%)
pareto_dominated_by: none (non-dominated)
robust_sharpe is the headline metric for v0.4.1 keep/kill decisions —
it's the worst-timerange Sharpe. Single-timerange Sharpe is no longer
sufficient; cross-regime robustness must clear too.
If a strategy crashes on any timerange, that timerange's block looks like:
---
strategy: SomeBrokenStrategy
commit: abc1234
status: ERROR
error_type: NameError
error_msg: name 'foo' is not defined
traceback:
...
Extract all strategies' metrics at once:

```shell
grep "^---\|^strategy:\|^sharpe:\|^trade_count:\|^max_drawdown_pct:" run.log
```

Full per-strategy block:

```shell
awk '/^---$/,/^$/' run.log
```

After each round, append one row to results.tsv per strategy touched.
Tab-separated, 6 columns:

    commit	event	strategy_name	sharpe	max_dd	note
Rules:
- `commit` is the short git hash of the round's commit
- `event` is one of `create | evolve | stable | fork | kill`
- For `fork`, `strategy_name` uses `parent→child` format (e.g. `MeanRevRSI→MRVolGate`)
- For `kill`, leave `sharpe` and `max_dd` as `-` (dash). The strategy is gone.
- `note` is your reasoning in free text. This is load-bearing — when you later decide keep vs kill, you re-read these notes. Be specific:
  - Bad: "tried MACD, didn't work"
  - Good: "replaced RSI entry with MACD cross-up. wr 68→51, sharpe 0.82→0.31. MACD crossovers on 1h crypto trigger inside ongoing drops, catching knives. Discarding paradigm."
- Every strategy that exists this round gets a row, even if `stable`. This is how the stagnation counter stays visible.
Do NOT commit results.tsv. It is gitignored on purpose — the log
survives `git reset --hard`, which is essential so you don't forget what
you've already tried.
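Appending a row is trivial, but validating the event type at write time catches typos before they pollute the log (a sketch; `log_round` is a hypothetical helper, and the demo deliberately writes to a throwaway path, never the real results.tsv):

```python
import os
import tempfile

VALID_EVENTS = {"create", "evolve", "stable", "fork", "kill"}

def log_round(path, commit, event, name, sharpe, max_dd, note):
    """Append one tab-separated row; pass '-' for sharpe/max_dd on kills."""
    assert event in VALID_EVENTS, event
    with open(path, "a") as fh:
        fh.write("\t".join(map(str, [commit, event, name, sharpe, max_dd, note])) + "\n")

# Demo on a throwaway path:
demo = os.path.join(tempfile.mkdtemp(), "results.tsv")
log_round(demo, "abc1234", "kill", "MeanRevRSI", "-", "-",
          "sharpe below BaH on every declared timerange")
```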
The experiment runs on the dedicated branch (e.g. autoresearch/may1).
LOOP FOREVER:
1. Look at state: read `results.tsv` (tail ~30 rows), note the current active strategies and their stagnation counters (how many consecutive `stable` events each has).
2. Decide this round's action. Your toolkit per round:
   - `evolve <strategy>`: modify an existing strategy file
   - `create <name>`: add a new strategy (if cap has room)
   - `fork <parent>→<child>`: cp a strategy file to a new name, then modify the child
   - `kill <strategy>`: `git rm` the file
   - You can combine: e.g. "kill A and create B in the same commit" (make room for something new), or "fork A→A' and evolve B" (two strategies touched)
3. Respect the rules. In particular:
   - Cap: max 3 active strategies after this round's changes
   - Stagnation: any strategy with 3 prior consecutive `stable` events must be evolved, forked, or killed THIS round
   - Every round touches ≥ 1 strategy
4. Make the code changes. Write/modify files under `user_data/strategies/`.
5. Commit: `git commit -am "<short summary of this round>"`
6. Run the backtest: `uv run run.py > run.log 2>&1`
7. Read the summary: `awk '/^---$/,/^$/' run.log` (shows all blocks) or `grep "^---\|^strategy:\|^sharpe:\|^trade_count:" run.log` (compact).
8. Check for crashes: a strategy with `status: ERROR` needs to be fixed (if the error is trivial — syntax, typo) OR killed (if the hypothesis is broken). Don't leave ERROR strategies around.
9. Log to results.tsv: one row per strategy that existed this round. Fill in the event, metrics (or `-` for kills), and your reasoning note.
10. Decide keep vs rollback.
    - Common case: per-strategy decisions happen inline (you either evolved to something better, or the change was bad and you git-reset only that strategy's commit). The whole round doesn't have one "keep/discard" decision — individual strategies do.
    - If the whole round was a mistake (broke everything, wrong direction), `git reset --hard HEAD~1` to undo all changes.
    - If some strategies improved and others didn't: keep the commit, log `stable` for the unchanged ones, log `evolve` or `kill` etc. for the changed ones.
Then loop.
A strategy deserves to stay if (in priority order):
- `robust_sharpe` is meaningfully positive (> 0.3 as a soft bar) AND the worst-timerange numbers aren't catastrophic
- All multi-objective gates PASS: profit_floor, min_position_size, pareto_dominated_by = none
- Beats BaH at least on Sharpe across the declared timeranges (or has a clear non-Sharpe edge: e.g., much lower DD with comparable profit)
- Its paradigm/basket is distinct from other active strategies
- Recent evolutions have moved it in the right direction
A strategy deserves to die if:
- `robust_sharpe` is below 0 (worst regime is genuinely bad — not a recoverable strategy)
- A gate FAILs (especially `pareto_dominated_by` — there's a strict prior better than this one)
- Sharpe is below BaH on every declared timerange (you're losing to doing nothing)
- Stable for 3 rounds with no improvement and no new ideas
- Paradigm/basket overlaps strongly with a better-performing active strategy
The v0.4.1 honesty bar: a strategy that hits Sharpe 1.5 on bull_2021 but
-0.5 on winter_2022 has robust_sharpe = -0.5 — that's NOT a Sharpe-1.5
strategy. Keep/kill decisions weight robust_sharpe, not best-case.
Always log your reasoning. These notes become the retrospective — future you (and the meta-analysis layer) will read them to extract what this run actually learned. Cite BaH for context: "Sharpe 1.12 beats BaH 0.93; robust_sharpe 0.85 (winter_2022 is the floor, still positive)" is a useful note. "Sharpe 1.12" alone is not.
From v0.1.0 we learned the agent can inadvertently game the metric:
- `exit_profit_only=True` → 100% win rate by never realizing losses (regime-dependent, breaks in bear markets)
- Tight `minimal_roi` clipping → tiny uniform returns → low stddev → huge Sharpe (profit goes DOWN even as Sharpe goes UP)
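The second exploit is easy to demonstrate numerically (an illustrative sketch with made-up per-trade returns, not backtest output):

```python
import statistics

def sharpe_and_profit(returns: list) -> tuple:
    """Per-trade Sharpe (no annualization) and total compounded profit."""
    total = 1.0
    for r in returns:
        total *= 1 + r
    return statistics.mean(returns) / statistics.stdev(returns), total - 1.0

raw = [0.05, -0.02, 0.04, -0.01, 0.03, -0.02]             # real, noisy edge
clipped = [0.002, 0.0019, 0.0021, 0.002, 0.0019, 0.0021]  # tight-ROI uniform scraps

s_raw, p_raw = sharpe_and_profit(raw)
s_clip, p_clip = sharpe_and_profit(clipped)
# Sharpe explodes while profit collapses — the gaming signature.
```

The clipped series has a near-zero standard deviation, so its ratio is enormous even though it compounds to a fraction of the raw series' profit — exactly the pattern the profit_floor and min_position_size gates are there to catch.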
v0.2.0 added zero new Goodhart exploits over 81 rounds — try to keep that streak.
If you find a Sharpe jump that comes with a profit drop or a DD collapse to ~0, that's a gaming signal, not real edge. Log it, document the mechanism, then either kill the strategy or explicitly note "this is an oracle artifact, not edge" in the description.
Multi-strategy helps here: if strategy A's Sharpe jumps while B and C stay flat on the same data, A's jump is more likely a real discovery. If ALL three strategies' Sharpe jumped on the same commit — you probably modified something shared, or the oracle itself has a hole.
Timeout: each full round (3 strategies × backtest) normally finishes within the 5-8 minute estimate above (longer with multiple declared timeranges). If a single run exceeds 10 minutes, kill it and treat it as a failure (revert the commit, skip the round).
Once the experiment loop has begun (after initial setup), do NOT pause to ask the human if you should continue. Do NOT ask "should I keep going?" or "is this a good stopping point?". The human may be asleep, or away from the computer, and expects you to continue working indefinitely until manually stopped.
If you run out of ideas:
- Re-read `versions/0.1.0/retrospective.md` and `versions/0.2.0/retrospective.md` — v0.1.0 listed directions it never tried (multi-timeframe was one!); v0.2.0 attempted 5 paradigms and identified specific plateau ceilings per paradigm
- Apply multi-timeframe to a stagnant strategy (if it isn't using MTF yet) — that's literally the new affordance of v0.3.0
- Look at your stagnant strategies — can you fork them with a bolder change?
- Try combining winners from different paradigms (e.g. a volatility-gated version of a winning mean-reversion strategy)
- Try completely new indicator families you haven't touched
- Check v0.2.0's comparative findings (volume-filter universal, ATR paradigm-specific, regime-window paradigm-specific, ADX-lag universal) — see if any transfer to your current strategies in ways v0.2.0 never tested (e.g., 4h volume expansion instead of 1h)
The loop runs until the human interrupts you, period.
As an example use case, a user might leave you running while they sleep. Each round of 3-strategy backtests takes roughly 5-8 minutes, so you can run 8-12 rounds per hour. The user then wakes up to a rich multi-strategy research trace ready for meta-analysis.