This all sounds awesome to me. I don't have a very strong opinion about the format of the benchmarks. I personally have found cargo bench-style benchmarks actually kind of hard to write: it often seems like they have to be so short that they come out quite unreliable, but maybe I'm doing it wrong in other ways. Either way, I don't think the cargo bench style represents "best practices" in any particular way.
I do agree though that having some way to hide "setup" costs and exclude them from the measurement would be useful.
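
To make the "hide setup" idea concrete, here is a rough sketch of what I have in mind, using only the standard library. The `bench_with_setup` helper and its signature are made up for illustration, not an existing API or a proposal for the final design: it runs the setup closure outside the timed region and only accumulates the time spent in the routine itself.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical helper (not an existing API): calls `setup` untimed before
/// each iteration and accumulates only the time spent inside `routine`.
fn bench_with_setup<S, R>(
    iters: u32,
    mut setup: impl FnMut() -> S,
    mut routine: impl FnMut(S) -> R,
) -> Duration {
    let mut total = Duration::ZERO;
    for _ in 0..iters {
        let input = setup(); // setup cost: excluded from the measurement
        let start = Instant::now();
        let output = routine(input);
        total += start.elapsed(); // only the routine is timed
        // Keep the result observable so the work is not optimized away;
        // it is dropped here, after the timer has stopped.
        black_box(output);
    }
    total
}

fn main() {
    let iters = 100;
    let total = bench_with_setup(
        iters,
        // Building the reversed vector is setup, so it is not measured.
        || (0..10_000u64).rev().collect::<Vec<_>>(),
        // Only the sort is measured.
        |mut v| {
            v.sort();
            v
        },
    );
    println!("sort only: {:?} per iteration", total / iters);
}
```

Timing each iteration separately like this adds timer overhead to very short routines, so a real harness would probably want to batch inputs, but it shows the shape of the thing.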