I wouldn't draw conclusions from random benchmarks like this without at least opening godbolt to see what's going on.
It really could be anything. For example, final may have enabled inlining in more places, but that may have inlined a very uncommon branch into a hot loop, causing far more instruction-cache misses. Writing compilers is hard, and all optimisation passes rely on imperfect heuristics.
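As a rough illustration of the kind of thing worth checking in the disassembly, here is a minimal sketch. Shape, Square and total_area are made-up names for this example, not taken from the benchmark. It only shows where final gives the compiler permission to devirtualize and inline a call; whether a given compiler actually does so, and whether that helps, depends on the compiler, version and flags.

```cpp
#include <cassert>

// Hypothetical types for illustration only; not from the benchmark under discussion.
struct Shape {
    virtual ~Shape() = default;
    virtual int area() const = 0;
};

// Marking the class final tells the compiler that no further override of
// area() can exist for an object whose static type is Square.
struct Square final : Shape {
    int side;
    explicit Square(int s) : side(s) {}
    int area() const override { return side * side; }
};

// The static type here is Square, so with final the compiler is allowed to
// replace the virtual call with a direct call and possibly inline it.
// Without final, a class derived from Square could override area(), so the
// call may have to go through the vtable.
int total_area(const Square* squares, int n) {
    assert(n >= 0);
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += squares[i].area();
    }
    return sum;
}
```

Pasting this into godbolt at -O2, once with final and once without, and diffing the loop body is a quick way to see whether the indirect call disappears. That alone says nothing about speed; it only shows what the optimiser decided.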
Compiling with PGO might make the results more compelling, if that hasn't already been tried.