For web use cases at least, the single most important thing is to provide a very fast path for ASCII chars, because that’s what web content is mostly composed of (due not only to the dominance of English content, but also to HTML tags, URLs, CSS and JavaScript syntax… you get the idea). So a first benchmark on ASCII-only content would be good. As a first approximation, you can just generate a big bunch of random ASCII chars.
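A minimal sketch of such a benchmark, assuming Python and using plain `bytes.decode()` as a stand-in for whatever decoder is actually under test (the sample size is an arbitrary choice, just large enough to dominate timing noise):

```python
import random
import time

# Build a large ASCII-only sample out of printable ASCII code points (32..126).
SIZE = 16 * 1024 * 1024
ascii_data = bytes(random.randrange(32, 127) for _ in range(SIZE))

# Time the decoder under test -- plain bytes.decode() here as a stand-in.
start = time.perf_counter()
ascii_data.decode("utf-8")
elapsed = time.perf_counter() - start
print(f"ASCII-only: {SIZE / elapsed / 1e6:.1f} MB/s")
```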
After that, you would also need to demonstrate that your changes do not negatively affect true Unicode content, which can be done by starting from a plain text document in any language you like that has lots of non-ASCII characters. A Chinese text, for example, would make a good “pure Unicode” benchmark, since Chinese writing is composed almost entirely of characters outside the ASCII range, which on top of that encode to multi-byte sequences (typically three bytes each) in UTF-8.
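If no suitable text dump is at hand, a synthetic approximation is to draw random characters from the CJK Unified Ideographs block; it is not real Chinese prose, but it exercises the same three-byte UTF-8 sequences (again only a sketch, with the same assumptions as above):

```python
import random
import time

# Synthetic "pure Unicode" sample: random CJK Unified Ideographs
# (U+4E00..U+9FFF).  Each one encodes to three bytes in UTF-8.
N_CHARS = 8 * 1024 * 1024
cjk_text = "".join(chr(random.randrange(0x4E00, 0xA000)) for _ in range(N_CHARS))
cjk_data = cjk_text.encode("utf-8")   # roughly 3 bytes per character

start = time.perf_counter()
cjk_data.decode("utf-8")
elapsed = time.perf_counter() - start
print(f"pure Unicode: {len(cjk_data) / elapsed / 1e6:.1f} MB/s")
```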
Finally, as a mixed workload, you could dump the source of a Chinese web page (Unicode content interleaved with ASCII markup and code) and see how well your algorithm deals with it. The results page of a Baidu search would be one easy option.
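Lacking a saved page dump, one way to approximate that mix is to interleave runs of ASCII “markup” with short runs of CJK text; the chunk sizes below are arbitrary guesses rather than measurements of real pages:

```python
import random
import time

# Synthetic mixed workload: ASCII "markup" interleaved with short runs of
# CJK text, roughly mimicking the source of a Chinese web page.  A real
# page dump would of course be more faithful.
def chunk_ascii(n):
    return bytes(random.randrange(32, 127) for _ in range(n))

def chunk_cjk(n):
    return "".join(chr(random.randrange(0x4E00, 0xA000)) for _ in range(n)).encode("utf-8")

parts = []
for _ in range(50_000):
    parts.append(chunk_ascii(random.randrange(40, 200)))   # tags, attributes, JS
    parts.append(chunk_cjk(random.randrange(5, 40)))       # visible Chinese text
mixed_data = b"".join(parts)

start = time.perf_counter()
mixed_data.decode("utf-8")
elapsed = time.perf_counter() - start
print(f"mixed: {len(mixed_data) / elapsed / 1e6:.1f} MB/s")
```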
Although my point of view here is somewhat web-centric, I am convinced that most large real-world text files are a combination of ASCII and “true” Unicode, with infrequent switches from one to the other. Think configuration files, system logs, rich text documents, text-based network protocols, computer programs with strings in them… All of these basically embed Unicode text inside an ASCII-only substrate.