Walk-forward validation — why most stock-prediction backtests lie

The 'tested on 5 years of data, returned 70% a year' pitch you see in finance Twitter ads is almost always smoke. Here's what's actually happening, why even a rigorous walk-forward backtest still flatters itself, and why we publish our live forward-tested record instead — gap and all.

By William Wu, 19 March 2026 (7 min read)

If you've spent any time on finance Twitter or YouTube, you've seen the pitch.

"I backtested this strategy on 5 years of data. Returned 70% per year."

Maybe it's a moving-average crossover. Maybe it's a fancy "AI signal". The result is always the same: a beautiful upward-sloping equity curve, a few exotic indicators, and a paid newsletter at the end.

I want to walk through why those numbers are almost always smoke — and then I want to be honest about something harder: even our own careful backtests flatter us. That's the whole reason we publish a live record instead.

The honest version of "backtesting"

A backtest is supposed to answer a simple question: if I had run this strategy in the past, how would it have done?

The right way: pretend you don't know the future, decide trades using only the data available at the time, and see how it turns out.

The lazy way (and the way most "AI prediction" pitches do it): take the entire dataset, find the parameters that would have worked best in hindsight, and report those numbers.

If you torture a stock-price dataset long enough, it will confess to anything. There are always some parameters that would have made you rich. The question is whether those parameters will work going forward — and almost always, they won't.

There are three common ways this goes wrong:

1. Look-ahead bias

The model uses information from the future without anyone noticing. Maybe a feature includes tomorrow's close price. Maybe the indicator is calculated over the whole dataset before it is split into training and test sets. The model looks like a genius because it can see the answer.

This sounds dumb but is staggeringly common in retail strategies. Anyone who's coded their own indicator in Python has done it at least once.
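To make the leak concrete, here is a minimal sketch on simulated pure-noise returns, so there is nothing real to predict. The function names, the 3-day window, and the noise parameters are my own illustrative choices, not anything from our production code. The only difference between the two features is whether the averaging window stops at today or reaches one day into the future.

```python
import random

def simulate_returns(n, seed=42):
    """Pure-noise daily returns: nothing here is predictable by construction."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.01) for _ in range(n)]

def hit_rate(returns, feature):
    """Share of days where the sign of feature(t) matches the sign of the return on day t+1."""
    hits = total = 0
    for t in range(2, len(returns) - 1):
        signal = feature(returns, t)
        target = returns[t + 1]
        hits += (signal > 0) == (target > 0)
        total += 1
    return hits / total

def clean_feature(returns, t):
    # Trailing 3-day mean: uses days t-2..t, all known at the close of day t.
    return sum(returns[t - 2 : t + 1]) / 3

def leaky_feature(returns, t):
    # Centered 3-day mean: uses days t-1..t+1, so it contains tomorrow's return.
    return sum(returns[t - 1 : t + 2]) / 3

returns = simulate_returns(5000)
clean_acc = hit_rate(returns, clean_feature)
leaky_acc = hit_rate(returns, leaky_feature)
print(f"clean feature hit rate: {clean_acc:.3f}")
print(f"leaky feature hit rate: {leaky_acc:.3f}")
```

On noise, the clean feature should hover near a coin flip while the leaky one scores far above it, even though both are "just a moving average". A centered rolling window is one of the easiest ways to do this by accident.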

2. Overfitting

You try 50 strategies and report the best one. The other 49 lost money. Run enough variations on pure noise and at least one of them is all but guaranteed to look brilliant.

A famous version of this: David Leinweber showed that you can "predict" the S&P 500 using butter production in Bangladesh. Not because butter actually predicts stocks, but because if you try enough random series, one of them will line up.
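The "try 50, report the best" effect is easy to reproduce. The sketch below is an illustration under stated assumptions: the market is simulated noise, each "strategy" is a coin flip between long and short every day, and the constants (250 days, 50 strategies, the seed) are arbitrary. None of these strategies has any skill, yet the best one will still look like a winner.

```python
import random
import statistics

rng = random.Random(7)
N_DAYS, N_STRATEGIES = 250, 50

# One year of simulated daily market returns with no structure at all.
market = [rng.gauss(0.0, 0.01) for _ in range(N_DAYS)]

def random_strategy_pnl(market, rng):
    """Total return of a strategy that goes long (+1) or short (-1) at random each day."""
    positions = [rng.choice((-1, 1)) for _ in market]
    return sum(p * r for p, r in zip(positions, market))

results = [random_strategy_pnl(market, rng) for _ in range(N_STRATEGIES)]
best = max(results)
median = statistics.median(results)

print(f"best of {N_STRATEGIES}: {best:+.1%}")
print(f"median strategy: {median:+.1%}")
```

The median is the honest summary of what a skill-free strategy earns here. The maximum is what ends up in the newsletter.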

3. Survivorship bias

You test your strategy on today's S&P 500 list. But that list excludes every company that went bankrupt or was delisted over the last 20 years. So your backtest only ever sees the winners.
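A toy simulation shows the size of the distortion. Everything in it is hypothetical: the yearly return distribution, the rule that a company is delisted once it falls below 20% of its starting value, and the assumption that a delisting is a total loss. The point is only the comparison between averaging over every company that existed at the start and averaging over the ones still standing at the end.

```python
import random

rng = random.Random(11)
N_COMPANIES, N_YEARS = 500, 15

def final_value(rng, n_years):
    """Value of 1 unit invested in one company; 0.0 if it was delisted along the way."""
    value = 1.0
    for _ in range(n_years):
        value *= 1.0 + rng.gauss(0.05, 0.35)  # hypothetical yearly return distribution
        if value < 0.2:
            return 0.0  # assumption: a delisting is treated as a total loss
    return value

# The full universe is what an investor at the start could actually have bought.
universe = [final_value(rng, N_YEARS) for _ in range(N_COMPANIES)]
# The survivor list is what "today's index members" looks like in hindsight.
survivors = [v for v in universe if v > 0.0]

full_mean = sum(universe) / len(universe)
survivor_mean = sum(survivors) / len(survivors)

print(f"companies delisted: {len(universe) - len(survivors)} of {len(universe)}")
print(f"mean final value, full universe: {full_mean:.2f}")
print(f"mean final value, survivors only: {survivor_mean:.2f}")
```

The survivors-only average is always the higher of the two, because the filter removes exactly the companies that did worst.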

A "buy and hold forever" strategy looks fantastic on today's S&P 500 because Lehman Brothers isn't on the list anymore. It would look much worse on the actual list of companies that existed in 2007.

What walk-forward actually does

Walk-forward validation is mostly about discipline.

You split your data into chunks. Say each chunk is 12 months. You train the model on chunk 1 only, then test it on chunk 2 — without ever letting the model see chunk 2 during training. Then you re-train on chunks 1 + 2 and test on chunk 3. And so on.

At each step, the model only knows what would have actually been known on that date.

It sounds boring. It is boring. It's also the only way to get backtest numbers that have any chance of holding up in real trading.

Walk-forward typically cuts retail "70%/year" claims down to something between 0% and 12% per year — usually with much bigger drawdowns than the original number suggests. That's the honest version.
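The chunked procedure described in this section can be sketched in a few lines. This is a generic expanding-window splitter, not our production pipeline, and `walk_forward_splits` is a name I made up for the illustration. It matches the "train on chunk 1, test on chunk 2, then train on chunks 1 + 2, test on chunk 3" scheme. A rolling-window variant would drop the oldest chunk at each step instead of keeping everything.

```python
def walk_forward_splits(n_samples, chunk_size):
    """Yield (train, test) index ranges for expanding-window walk-forward validation.

    Step k trains on chunks 1..k and tests on chunk k+1, so the test window
    always lies strictly after everything the model was trained on.
    """
    n_chunks = n_samples // chunk_size
    for k in range(1, n_chunks):
        train = range(0, k * chunk_size)
        test = range(k * chunk_size, (k + 1) * chunk_size)
        yield train, test

# Example: 60 monthly observations split into five 12-month chunks.
splits = list(walk_forward_splits(n_samples=60, chunk_size=12))

for train, test in splits:
    # The guarantee that matters: no training index is at or after a test index.
    assert max(train) < min(test)
    print(f"train on months {train.start}-{train.stop - 1}, "
          f"test on months {test.start}-{test.stop - 1}")
```

Any feature engineering, scaling, or parameter search has to happen inside each training range as well. A splitter like this only helps if nothing computed on the test range flows back into training.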

The part nobody wants to admit: even a clean walk-forward lies a little

Here's the uncomfortable thing I've learned running Trading Agent across 16 markets — US, Canada, UK, Taiwan, China, Singapore, Malaysia, Vietnam, Australia, New Zealand, Korea, Japan, India, Hong Kong, Indonesia, and Thailand.

We do walk-forward properly. No look-ahead, no peeking at the test window, models re-fit on a rolling basis. And on some markets and horizons, those careful offline backtests came back around 54% directional accuracy. Genuinely encouraging.

Then we shipped those same cells and watched them live. The public forward-tested record on those exact markets and horizons settled around 49–50%.

That's a roughly 5-percentage-point haircut — and there was no bug. The walk-forward was clean. So where did the 5 points go?

They went into the research process itself. Before any single backtest runs, you've already made a hundred quiet decisions: which markets to cover, which tickers to anchor on, which horizons to forecast, which features to keep, which model to use. Every one of those choices was made by looking at historical data. Even if no individual backtest peeks at its own test window, the portfolio of things you chose to test was selected because it looked good on history. The optimism leaks in through the selection, not through the test.

So a walk-forward backtest isn't a track record. It's a hypothesis. A disciplined one, an honest one — but still a statement about what might hold up, not proof that it did.

The only thing that settles the question is letting time pass with the predictions locked in public, where you can't quietly drop the cells that disappointed you. That's a live forward-test, and it's the gold standard. The gap between the backtest and the live number isn't noise to be explained away — it's the measurement of how much the backtest overfit. For us, that gap has been about five points, and now we budget for it.
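Here is a deliberately simplified sketch of how selection alone can open a gap like this. It is not a model of our actual research process, and every constant in it is hypothetical. Each "cell" (think one market and horizon) has a true accuracy of exactly 50%. Every backtest is clean, but only the cells with the best backtests get shipped, and then those same cells are scored on fresh live forecasts.

```python
import random

rng = random.Random(3)
N_CELLS, N_BACKTEST, N_LIVE, TOP_K = 200, 400, 400, 20
TRUE_ACCURACY = 0.50  # assumption: no cell has any real edge

def observed_accuracy(n, p, rng):
    """Hit rate measured over n independent forecasts with true accuracy p."""
    return sum(rng.random() < p for _ in range(n)) / n

# Step 1: a clean backtest for every cell. No cell sees its own future.
backtest = [observed_accuracy(N_BACKTEST, TRUE_ACCURACY, rng) for _ in range(N_CELLS)]

# Step 2: ship only the cells whose backtest looked best.
chosen = sorted(range(N_CELLS), key=lambda i: backtest[i], reverse=True)[:TOP_K]
backtest_avg = sum(backtest[i] for i in chosen) / TOP_K

# Step 3: score the shipped cells on new forecasts they were not selected on.
live_avg = sum(observed_accuracy(N_LIVE, TRUE_ACCURACY, rng) for _ in chosen) / TOP_K
gap = backtest_avg - live_avg

print(f"backtest accuracy of shipped cells: {backtest_avg:.3f}")
print(f"live accuracy of shipped cells:     {live_avg:.3f}")
print(f"gap: {gap:.3f}")
```

No individual backtest here is wrong. The optimism comes entirely from step 2, which is the explicit version of all the quieter choices about markets, tickers, horizons, and features.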

Why we publish the live number, not the nicer one

People ask me this directly: why not just put the 54% backtest figure on the homepage? It's real, you computed it honestly, it looks better.

Because it isn't a track record. It's the hypothesis I just described. Publishing it as if it were a result would be exactly the move I spend this whole essay warning you about — dressed up in slightly more respectable clothing.

So here's what we actually report:

The headline number is our blended live record across all 16 markets: about 49.7% directional accuracy — measured on forecasts we made in real time, not reconstructed after the fact.
Our strongest markets are Canada and the US, around 53%. Better than a coin flip on a directional call, and nowhere near "AI will make you rich" territory.
The "Live log" at /predictions shows the same forecasts we published in real time, each one timestamped. You can verify them against your own broker's price history.
We don't backfill, and we don't quietly delete the cells that went against us. A model that can't reliably call direction is itself important information — so it stays on the page.
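Whether a figure like 53% is distinguishable from a coin flip depends on how many forecasts sit behind it. The sketch below uses the Wilson score interval, a standard confidence interval for a proportion. The sample sizes are hypothetical and chosen only to show how the interval narrows. They are not our actual forecast counts, and the interval assumes independent forecasts, which real market calls only approximate.

```python
import math

def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a hit rate; z=1.96 gives roughly 95% coverage."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width

# Same 53% hit rate, three hypothetical sample sizes.
for n in (200, 2000, 20000):
    low, high = wilson_interval(round(0.53 * n), n)
    verdict = "excludes 50%" if low > 0.5 else "cannot rule out a coin flip"
    print(f"n={n:>5}: [{low:.3f}, {high:.3f}]  {verdict}")
```

With only a couple of hundred forecasts the interval still straddles 50%. It takes thousands of resolved calls before a few points of edge can be told apart from luck, which is one reason a live record needs time.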

On each prediction you'll see a directional read — Bullish, Neutral, or Bearish — never "Buy" or "Sell", and never a probability dressed up as a promise. We will never tell you a forecast is "90% certain." Markets don't work that way, and anyone who says otherwise is selling you the Bangladesh butter chart.

Why this matters for you

If somebody is selling you a system and the only number they show is a backtest, ask one question: what was the test method?

If they can't explain it in two sentences, or the answer is "we used the last 5 years", run.

If they say "walk-forward, with these holdout chunks, and here are the per-window stats", you can take it more seriously — but now you know to ask the follow-up: do you have a live forward-tested record, and how far below the backtest did it land? If the live number is missing, or it suspiciously matches the backtest, be skeptical. An honest operator's live number is almost always a little worse than their backtest. Ours is. We tell you by exactly how much.

Some of our forecasts will still be wrong. We say so on every page.

See the evidence for yourself — download the full resolved-prediction dataset, read the live public self-audit (hit-rate confidence intervals, live-vs-backfill split), inspect every model card, or run the research tools on your own data. No hype, just the receipts.

This article is educational content. Nothing here is financial advice, a recommendation to buy or sell any security, or directed at any individual's circumstances. Trading Agent is a quantitative research tool operated by WU Capital Limited (New Zealand). See our Methodology page for technical detail on how the models are built, validated, and shipped.
