Pre-ACP: Un-specialize impl ToString

Proposal

Problem statement

"&str".to_string() is such a fundamental part of the Rust language. It's the second function introduced by "the book" to construct String right after the String::new(). And still is one of the most preferable methods to construct owned String for many experienced Rust developers. But the method is based on the Display trait, which implies std::fmt infrastructures with unavoidable dynamic dispatches and optimization blocker.. which is not. Since it's such a common function we've especially optimized them to ideal code with specialization.

But specialization is not a silver bullet. Since it specializes on particular concrete types, it doesn't compose naturally. For example, (&"&str").to_string() doesn't take the specialized route and performs dynamic dispatch. The same applies to Arc<str>, arrayvec::ArrayString, etc., and even to the specialized primitive types like bool and u8 when they sit behind an extra layer of indirection.

Motivating examples or use cases

The following .to_string() calls invoke the std::fmt infrastructure with dynamic dispatch. Specialization optimizes a few chosen cases, but the conditions are subtle, which can be surprising.

fn cond(s: String) -> bool {...}
// `s` here is `&&str`, which misses the specialized route
let b = vec!["foo", "bar"].iter().any(|s| cond(s.to_string()));

let s: Arc<str> = Arc::from("foo");
s.to_string();

// types defined in external crates

let s = arrayvec::ArrayString::<16>::from("foo").unwrap();
s.to_string();

let s = FourCC::new(r"mpeg").unwrap();
s.to_string();

// and more types, without touching specialization

42_i32.to_string();

std::net::Ipv4Addr::LOCALHOST.to_string();

Solution sketch

The goal of this ACP is to un-specialize impl ToString without affecting performance, by adding a new method to the Display trait. Currently a handful of types are specialized for the ToString trait, and they can be divided into 3 groups:

  • Delegate to str - str, String, Cow<'_, str> etc.
    • Including bool, which is either "true" or "false"
  • Need a small fixed-size buffer - u8, char etc.
  • fmt::Arguments<'_>

To address all those cases, I propose adding another method to trait Display:

trait Display {
	...
	fn try_to_str<'a>( // <- New!
		&'a self,
		fixed_buf: &'a mut [u8],
	) -> Result<&'a str, TryToStrError> {
		Err(TryToStrError::new()) // <- With a default implementation
	}
}

struct TryToStrError {...}

impl TryToStrError {
	fn new() -> Self {...}
	fn with_capacity_hint(self, capacity_hint: Option<usize>) -> Self {...}
	fn capacity_hint(&self) -> Option<usize> {...}
}
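
For concreteness, here's a minimal sketch of how TryToStrError could be filled in, assuming it's nothing more than a carrier for the optional capacity hint (the real layout is unspecified):

#[derive(Debug, Default)]
pub struct TryToStrError {
	capacity_hint: Option<usize>,
}

impl TryToStrError {
	pub fn new() -> Self {
		Self { capacity_hint: None }
	}

	pub fn with_capacity_hint(mut self, capacity_hint: Option<usize>) -> Self {
		self.capacity_hint = capacity_hint;
		self
	}

	pub fn capacity_hint(&self) -> Option<usize> {
		self.capacity_hint
	}
}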

// And replace the existing specialization-based `impl ToString`s
impl<T: Display> ToString for T {
	fn to_string(&self) -> String {
		const FIXED_BUF_CAPACITY: usize = 64;
		let mut fixed_buf = [0_u8; FIXED_BUF_CAPACITY];

		let mut buf = match self.try_to_str(&mut fixed_buf) {
			// Fast path: the value stringified itself into (or as) a &str
			Ok(s) => return s.to_owned(),
			Err(err) => match err.capacity_hint() {
				Some(capacity) => String::with_capacity(capacity),
				None => String::new(),
			},
		};

		let mut formatter = core::fmt::Formatter::new(&mut buf);
		// Bypass format_args!() to avoid write_str calls with zero-length strs
		fmt::Display::fmt(self, &mut formatter)
			.expect("a Display implementation returned an error unexpectedly");
		buf
	}
}

// Example `impl Display` code

impl fmt::Display for str {
	fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {...}

	fn try_to_str<'a>(
		&'a self,
		_fixed_buf: &'a mut [u8],
	) -> Result<&'a str, TryToStrError> {
		// No buffer needed: str returns itself directly.
		Ok(self)
	}
}

impl fmt::Display for u8 {
	fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {...}

	fn try_to_str<'a>(
		&'a self,
		fixed_buf: &'a mut [u8],
	) -> Result<&'a str, TryToStrError> {
		// u8 needs at most 3 bytes ("255")
		const REQUIRED_BUF_CAPACITY: usize = 3;

		if fixed_buf.len() < REQUIRED_BUF_CAPACITY {
			// Feed the exact requirement back as the capacity hint
			return Err(TryToStrError::new()
				.with_capacity_hint(Some(REQUIRED_BUF_CAPACITY)));
		}

		// write the decimal digits into fixed_buf
		let mut n = *self;
		let mut offset = 0;
		if n >= 10 {
			if n >= 100 {
				fixed_buf[offset] = b'0' + n / 100;
				offset += 1;
				n %= 100;
			}
			fixed_buf[offset] = b'0' + n / 10;
			offset += 1;
			n %= 10;
		}
		fixed_buf[offset] = b'0' + n;
		offset += 1;

		// SAFETY: only ASCII digits were written to fixed_buf[..offset]
		Ok(unsafe {
			std::str::from_utf8_unchecked(&fixed_buf[..offset])
		})
	}
}
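
A sanity check for the u8 sketch above could compare it against the existing Display output for every value (this assumes the proposed try_to_str method exists):

#[test]
fn u8_try_to_str_matches_display() {
	let mut fixed_buf = [0_u8; 3];
	for n in u8::MIN..=u8::MAX {
		// copy out of the buffer before the next iteration reuses it
		let fast = n.try_to_str(&mut fixed_buf).unwrap().to_owned();
		assert_eq!(fast, format!("{n}"));
	}
}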

Buffer capacity

To make this change truly zero-cost, it's recommended to keep both FIXED_BUF_CAPACITY (on the caller side) and REQUIRED_BUF_CAPACITY (on the callee side) as constants, so the branch can be optimized out when inlined.

The callee knows its own required buffer capacity, but callers need to pick a capacity that is large enough for practical cases. For example, an f64 needs up to 17 significant digits to round-trip, and a SocketAddr may take up to 58 bytes. I suggest 64 as a base value, since it's small enough for a stack buffer and still large enough for most leaf types.
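
The SocketAddr figure is easy to reproduce: 58 bytes corresponds to a fully expanded IPv6 socket address with a maximum scope id and port (this assumes SocketAddrV6's Display includes the scope id, as on recent Rust):

use std::net::{Ipv6Addr, SocketAddrV6};

fn main() {
	// "[ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff%4294967295]:65535"
	let addr = SocketAddrV6::new(
		Ipv6Addr::new(0xffff, 0xffff, 0xffff, 0xffff, 0xffff, 0xffff, 0xffff, 0xffff),
		u16::MAX, // port
		0,        // flowinfo (not displayed)
		u32::MAX, // scope id
	);
	println!("{addr} is {} bytes", addr.to_string().len()); // 58: fits in 64
}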

Composite types

While the fmt method is designed to be composed, composite types are generally not expected to override the try_to_str method. This method is designed for directly consuming leaf types without any additional formatter arguments, which is exactly what .to_string() does. Wrapper types (like Arc<T>) may override it to delegate to their inner type, as sketched below.
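
For example, the delegating override for Arc<T> could look like this (a sketch mirroring std's existing Display impl for Arc, assuming the proposed method):

impl<T: fmt::Display + ?Sized> fmt::Display for Arc<T> {
	fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
		fmt::Display::fmt(&**self, f)
	}

	fn try_to_str<'a>(
		&'a self,
		fixed_buf: &'a mut [u8],
	) -> Result<&'a str, TryToStrError> {
		// Delegate to the inner type, so Arc<str> etc. keep the fast path.
		(**self).try_to_str(fixed_buf)
	}
}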

Alternatives

We can keep the current specialization-based structure. It's proven good enough for most use cases, and we may still add more specialized impls when needed.

Technically this behavior can be implemented in a third-party crate with its own ToString and DisplayExt traits. But such a .to_string() method can't be applied to other crates' types that only implement the standard Display trait, not the third-party DisplayExt trait.
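
Roughly like this (all names hypothetical):

trait DisplayExt: fmt::Display {
	fn try_to_str<'a>(
		&'a self,
		fixed_buf: &'a mut [u8],
	) -> Result<&'a str, TryToStrError>;
}

trait FastToString {
	fn fast_to_string(&self) -> String;
}

impl<T: DisplayExt + ?Sized> FastToString for T {
	fn fast_to_string(&self) -> String {
		let mut fixed_buf = [0_u8; 64];
		match self.try_to_str(&mut fixed_buf) {
			Ok(s) => s.to_owned(),
			// fall back to std's ToString and its fmt machinery
			Err(_) => self.to_string(),
		}
	}
}

// A foreign type that only implements the standard Display never gets the
// fast path: coherence prevents this crate from adding DisplayExt for it.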

Links and related work

Please help me fill in this section!

4 Likes

Open an ACP

This adds an API only for the purpose of generating optimized code.

Could formatting be improved overall with smarter code generation or a dedicated optimization pass?

For example, if an implementation of Display::fmt only called formatter.write_str(expr), then the compiler could transform it to String::from(expr), throwing away the formatter entirely. This is probably too complex for LLVM, but maybe a dedicated MIR opt pass could do it?
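
That is, for a Display impl shaped like this (hypothetical type):

struct Name(String);

impl fmt::Display for Name {
	fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
		// The whole impl is a single write_str; in principle
		// `name.to_string()` could be rewritten to `String::from(&name.0)`,
		// dropping the Formatter entirely.
		f.write_str(&self.0)
	}
}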

5 Likes

Arguments::as_str is a similar "just for the purpose of optimization" API, so there is some precedent for adding shortcuts for the "essentially just a string" case.

The main reason this is difficult AIUI is the dynamism involved. Even if we ignore potential difficulties around reference validity guarantees[1], while it could be somewhat straightforward to replace fn fmt(&self, &mut Formatter<'_>) -> Result with fn fmt(&self, &mut dyn Write + '_) -> Result, the devirtualization to remove the dyn dispatch (required to actually DCE the fmt machinery) is much less straightforward.

That said, an MIR pass which attempts some amount of devirtualization would be an interesting project. AIUI most MIR opts have been focused on reducing the amount of IR passed to LLVM, and devirtualization would usually move in the other direction, but perhaps there's a heuristic that rustc could use that could remain a net positive?

After current MIR inlining, calling <&String as Display>::fmt(s, f) or <str as Display>::fmt(s, f) with s: &&String, f: &mut Formatter looks the same, modulo debug information. A call to <&String as ToString>::to_string is just a call, whereas <str as ToString>::to_string is fully inlined. (playground links)

ToString::to_string is already marked #[inline] with a note that while unconventional for a generic impl, it has significant perf impact (ref: #74852). <&_ as Display>::fmt does not have such an #[inline] annotation; perhaps adding it would enable <&String as ToString>::to_string to be inlined?

I recall seeing that the heuristic for auto-#[inline] is roughly that no MIR call statements exist in the optimized MIR. <str as ToString>::to_string obviously does include call ops (into allocation, as well as Vec::deref, interestingly[2]).

Subobservation: Vec::deref isn't known to not unwind. I would've hoped it'd just be that MIR always includes unwind edges, but Vec::deref has -> [unwind continue] where a #[rustc_nounwind] call has -> [unwind unreachable]. An MIR pass/opt to record cross-crate functions known to never unwind could potentially unblock some hidden optimizations, if not at the LLVM level, then at least at the MIR level.

Edit to add: reported Vec::deref MIR inlining regression as an issue


  1. I'm not sure exactly how relevant it is here, but it can be difficult to automatically optimize fn(&Scalar) into fn(Scalar) because while it's a validity requirement for the reference to be dereferenceable to sufficient bytes, there's no validity requirement for the bytes to be a valid instance of the scalar (currently; disclaimer: undecided, my own non-normative recollection, etc) even if we derive proof that the address is irrelevant.

  2. And this is despite the function being marked as #[inline]. Here it is open coded (shows there aren't any reachable unwinding edges). Gut guess: the call to std::slice::from_raw_parts::precondition_check is blocking inlining :slightly_frowning_face: Justification: on stable, it inlines and doesn't include that call, making it a single straightline basic block, whereas it does include the UB check on beta and doesn't inline there.

5 Likes

I tried to implement this deduction of functions that don't unwind in MIR in order to make more use of the InstSimplify that I added for #[rustc_nounwind] functions. I couldn't figure out how to do this without query cycles. The deduction is easy for function bodies that do not contain any calls, but doing anything interprocedural in a MIR optimization requires the same query cycle avoidance strategy the inliner uses: inspecting unoptimized MIR.

Someone could implement an opportunistic version, but I would rather have a complete one. When the inliner gives up on obviously good inlinings because of cycle avoidance, it's hard to notice how inconsistent it is; in this case the inconsistency would be a fair bit more obvious.

2 Likes

I'd posit that only handling direct recursion and bailing on cycles would already get 99% of the nounwind inference we'd be able to practically do at the MIR level. Or if not 99% of cases, then 99% of benefit, since the primary benefit comes from keeping simple helpers simple (even when they go slightly over the inlining threshold).

Primarily because I think most nonartificial cases with cycles would require some amount of range analysis to prove panic free, which I don't think we're doing on MIR.

You should probably explain why the added method isn't simply fn(&self) -> String. Actually, at the very moment that I'm writing this, I realise this might be the simple problem of core vs alloc, right?

Anyway, I also feel like you're pessimizing the implementation for str, which now does go through a dynamic function call to dyn Write, doesn't it?


Edit: Thinking some more of core-compatible alternatives to fn(&self) -> String... Maybe something like fn<S: TraitForRelevantApiOfString>(&self) -> S[1] could work fine? And then for object safety... well, a Self: Sized bound could work; except that that precludes the implementation for str which isn't Sized. If we ever got something like what was just mentioned here yesterday though, that might be enough :thinking:

Edit2: Nevermind the previous paragraph. I think it doesn't work this way, because with Self: Sized or Self: NoDyn you still no longer could call the thing in the generic implementation of ToString for T: Display + !Sized.


  1. some of the relevant API of String would probably include a from_str method, a with_capacity constructor, and a push_str method. Maybe that's all that's needed already?
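
A sketch of that shape, with hypothetical names (and bearing in mind the Edit2 caveat above):

trait StringLike {
	fn with_capacity(capacity: usize) -> Self;
	fn push_str(&mut self, s: &str);
}

trait Display {
	...
	// Hypothetical core-compatible method: build any String-like buffer.
	fn to_string_like<S: StringLike>(&self) -> S
	where
		Self: Sized, // needed for object safety, but excludes str (the Edit2 problem)
	{...}
}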

If the method is meant to be an optimization not overridden except by "newtype" and "smart pointer" delegating impls, it'd probably make sense to use something like a struct fmt::Buffer([MaybeUninit<u8>; 64]) as the stack buffer, so size is guaranteed to be constant (but not necessarily exposed downstream), and perhaps guarantee alignment matching the stack's.
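
Something like this sketch, where the name, size, and alignment are all placeholders:

use core::mem::MaybeUninit;

// Opaque fixed-size buffer: the capacity is a guaranteed constant
// without being exposed downstream.
#[repr(align(16))] // hypothetical alignment choice
pub struct Buffer([MaybeUninit<u8>; 64]);

impl Buffer {
	pub const fn new() -> Self {
		Buffer([MaybeUninit::uninit(); 64])
	}

	pub fn as_uninit_mut(&mut self) -> &mut [MaybeUninit<u8>] {
		&mut self.0
	}
}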

Compare with io which uses essentially const DEFAULT_BUF_SIZE: usize = if cfg!(bare_metal) { 512 } else { 8 * 1024 }. Windows MAX_PATH is 260 "characters"... essentially 512 bytes. POSIX PATH_MAX is often 1024 or 4096. A KB of stack might be a bit much to reserve when it's only expected to be actually used for leaf scalars, though.

If this is done, the Formatter should probably track it so that leaf scalars in a compound type can use it instead of making their own stack buffers, if &mut is strong enough to not make that a pessimization.

Separate aside: should the capacity hint be Option<usize>, (usize, Option<usize>) like Iterator::size_hint (where essentially only the lower bound is used anyway), or just a usize "reserve hint" like has been suggested a few times for Iterator? (The only reason String::new() isn't just String::with_capacity(0) is const compatibility, which doesn't matter here.)

Note that try_to_str just provides scratch buffer space, not the resulting String; the returned &str is the actual source of truth. So str would just return Ok(self), never falling through to the dyn Write general case.

2 Likes

Mostly yes. But I believe this method would have other applications like fn write_to(target: &mut String, content: impl Display). Maybe I should add this to the ACP.
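
For instance (a sketch, assuming the proposed try_to_str and capacity hint):

// Appends `content` to an existing String, reusing its allocation.
fn write_to(target: &mut String, content: impl fmt::Display) {
	let mut fixed_buf = [0_u8; 64];
	match content.try_to_str(&mut fixed_buf) {
		Ok(s) => target.push_str(s),
		Err(err) => {
			use core::fmt::Write as _;
			if let Some(capacity) = err.capacity_hint() {
				target.reserve(capacity);
			}
			write!(target, "{content}")
				.expect("a Display implementation returned an error unexpectedly");
		}
	}
}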

No, that's why the lifetime 'a is used. str should return Ok(self) on this method. I'll add this example too.

2 Likes

Indeed, I missed that, but in any case it'll be something good to call out explicitly.

2 Likes

True. I originally used Option<usize> to preserve the String::new() case, which I now realize is meaningless. A plain usize would be better. I don't think the hint needs to be strictly a lower or upper bound, though.

In the case of KBs of buffer, right? Otherwise I don't think it's necessary, as intermediate fmt calls wouldn't utilize the buffer. Leaf scalars may still allocate their own stack buffers, but those won't stack up if they're true leaves.

Alignment might help the optimization, but I'm not sure about the MaybeUninit part. I do expect this buffer is not something people touch regularly, but would it really matter for a 64-byte buffer? It wouldn't hurt ergonomics much, though, as implementors may just MaybeUninit::write_slice() it with the required capacity.

Another possibly less impactful idea might be to transition to to_owned in teaching materials. The "best way" to convert from &str to String has been discussed, so I'll drop a few links here:

1 Like

Wouldn't it be better to try to figure out an "ultra-min" specialisation instead, and make that feature available stably? Or perhaps a limited (but unsafe) specialisation.

2 Likes

Do you think we should specialize both str and &str for this trait? How about &&str and &&&str? Or char, &char, Cow<str>, &Cow<str>, bool, &bool? Or Arc<str>, &Arc<str>, Rc<str>, &Rc<str>, &Arc<String>, or whatever combination somehow seems practical? The purpose of this ACP is to make that question irrelevant.

It'd be a bummer and pretty annoying to lose the performance IMO. When I started using Rust, "don't use to_string, it's slow" was a common correction, sort of like how "always take &str as an arg, not &String" still is. Over time that went away, which is presumably also why use of to_string became so common. This would basically be a step backwards to needing that correction.

I actually didn't remember what the explicit correction was when I started writing this comment, only that there were right and wrong ways to go from literals to Strings. I didn't remember because I haven't had to for years -- I stopped caring when I was explicitly told or read somewhere there was no longer a penalty! I'm certain I've since used to_string where I "shouldn't have" (and wouldn't have), if it had always remained slow.

Looks like this was the announcement...

One method, to_string, comes from a generic API which was previously relatively slow, while the custom to_owned implementation provided better performance. Using specialization, these two functions are now equivalent.

...and in the PR someone said...

Cool! :skull: to the ".to_string()" is slow memes!

and clippy deprecated the lint against to_string (though it was later restored as a pedantic lint, not due to performance concerns).

6 Likes

ACP submitted

1 Like

I have often thought that it is a mistake to teach the use of to_string() as a way to create a String from a &str.

to_string() uses the Display trait to obtain a stringified version of the item you are calling it on. For pretty much any type other than &str it involves converting or interpreting the value into something for display purposes. It is more or less a side effect that when called on a borrowed &str you get an owned String with the same content as the &str.

to_owned() however is explicit: given a reference type, return an owned version of the value. The intent is clear: you want an owned value.

It works for other types as well (as long as they implement ToOwned).
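
For example:

use std::path::{Path, PathBuf};

let s: String = "my string".to_owned();        // &str   -> String
let v: Vec<i32> = [1, 2, 3][..].to_owned();    // &[i32] -> Vec<i32>
let p: PathBuf = Path::new("/tmp").to_owned(); // &Path  -> PathBuf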

I do understand that "my string".to_string() might seem more intuitive to someone new to the language - but on the other hand, the concept of borrowed vs. owned values is so central, the sooner it is highlighted to a new user the better, IMHO.

5 Likes