Is there a good reason why String has no `into_chars`

The stdlib has owning iterators for many collections. It is possible to get an owning iterator over bytes via into_bytes().into_iter(), but it is not possible to get an owning iterator over the chars of a String, such as into_chars, or an IntoIterator impl for String.

Is there a good reason that I am missing?

Edit: I understand why String has no IntoIterator.

1 Like

Why no IntoIterator implementation? Same reason why &str or &String isn’t IntoIterator: String does not favor any iteration mode by making it the default, be it over u8 (UTF-8 code units), over char (Unicode scalar values), or over grapheme clusters (probably the closest to our intuitive notion of “character”, insofar as that notion applies at all), e.g. via the unicode-segmentation crate. This at least nudges users toward researching the exact consequences of each possible choice; e.g. you wouldn’t want to iterate over u8s and pretend only ASCII exists, nor would you want to iterate over chars and pretend you’re getting truly separate “characters”.
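To make the difference concrete, here is a minimal sketch of the three iteration modes over the same text (the grapheme iteration assumes the unicode-segmentation crate as a dependency; the example string is illustrative):

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // "é" written as 'e' plus U+0301 COMBINING ACUTE ACCENT.
    let s = "e\u{0301}";
    println!("{}", s.bytes().count());         // 3 UTF-8 code units
    println!("{}", s.chars().count());         // 2 Unicode scalar values
    println!("{}", s.graphemes(true).count()); // 1 grapheme cluster
}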

Why no owned iterator (over chars) at all though? I actually don’t know; a method for that sounds reasonable to me. Makes me wonder if there’s any prior discussion.

6 Likes

I asked urlo years ago. It was resolved with an XY answer. However, I have faced this issue again and am now asking here.

The fact that an into_chars method is missing is particularly painful because a non-stdlib implementation either does some owning_ref tricks or reimplements the UTF-8 handling of the chars iterator. If I tried either of these, it would be unsound on the first 5 tries.

1 Like

You can write one like this:

fn into_chars(s: String) -> impl Iterator<Item = char> {
    // Byte position of the next char to decode; always on a char boundary.
    let mut i = 0;
    std::iter::from_fn(move || {
        if i < s.len() {
            // The slice starts on a char boundary and is non-empty,
            // so there is always a next char here.
            let c = s[i..].chars().next().unwrap();
            i += c.len_utf8();
            Some(c)
        } else {
            None
        }
    })
}

This doesn't suffer too much from the typical lifetime troubles because the closure can capture the String by value, and doesn't need to return references to anything.

2 Likes

Note that you can implement this iterator without having to do anything nasty:

struct IntoChars {
    string: String,
    // Byte index of the next char to yield; always on a char boundary.
    position: usize,
}

impl Iterator for IntoChars {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        // Slicing at a char boundary is fine; `?` ends iteration once the
        // remaining slice is empty.
        let c = self.string[self.position..].chars().next()?;
        self.position += c.len_utf8();
        Some(c)
    }
}
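For illustration, a hypothetical constructor plus usage of the struct above (not an existing API):

// Hypothetical helper wrapping the struct above; not a std method.
fn into_chars(string: String) -> IntoChars {
    IntoChars { string, position: 0 }
}

fn main() {
    let owned = String::from("héllo");
    // The iterator owns the String, so no borrow of `owned` is needed.
    let upper: String = into_chars(owned).flat_map(char::to_uppercase).collect();
    assert_eq!(upper, "HÉLLO");
}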

An into_chars method for String still sounds reasonable though.

Edit: @quaternic you beat me by about a second.

4 Likes

Oh nice. I forgot about len_utf8. Still, unless there is a good reason for not having into_chars, it would be a good addition to the stdlib. The fact that you can tell the signature and its behavior from the name alone is an indicator that it is consistent with the rest of the stdlib.

Note that it's better to use a non-owning iterator where possible -- it's a simpler construct and a smaller type. (For Vec the owning iterator isn't that much worse, but for arrays it's much, much better to use the slice iterator than the array::IntoIter when you can.)
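To make the "simpler construct and smaller type" point concrete, a quick sketch (the exact sizes printed are platform-dependent and shown only for illustration):

use std::mem::size_of;

fn main() {
    // A slice iterator is just a pair of pointers, regardless of length...
    println!("{}", size_of::<std::slice::Iter<'static, u64>>());
    // ...while array::IntoIter stores the whole array by value plus its
    // "alive" range, so it grows with N.
    println!("{}", size_of::<std::array::IntoIter<u64, 100>>());
}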

So I think its absence is just a reflection of it being something that's not usually needed -- most string handling is written over &strs, rather than forcing Strings, and using chars at all is discouraged as misleading for text since it doesn't solve the problems people wish it did.

It's probably something that's fine to have, but never had a "oh, it's really needed" use case raised. Not to mention that it still needs to do the UTF-8 decoding, so it's not like the owned iterator would be faster.

What's the situation you hit where you wanted an owning iterator over chars? (If the situation three years ago was handled by "you should use bytes instead", maybe the one today is also better handled with an XY answer.)

3 Likes

I suggest opening an ACP

2 Likes

Conceptually, this makes little sense to me. char is a primitive type and it should not be a surprise that it does not represent grapheme clusters.

Of course. And I have enough experience to find the alternatives. It is my understanding that this forum is not about solving specific problems, but about language/libs design.

I claim that

  • the stdlib has an into_chars-shaped hole: every other container has owning iterators, while String only has chars()
  • there is code for which into_chars would be the natural fit, and which currently requires re-implementing this function or making nontrivial changes to the ownership structure
  • I have not seen any downsides to including into_chars

You can disagree.

I considered doing that after checking for reasons why this function does not yet exist.

2 Likes

Not everyone is educated about Unicode. I've had to steer people away from char and emphasize the inherent non-fixed-size nature of grapheme clusters ("character to a human") multiple times.

A String is not a container of chars -- and thinking it is trips up newcomers already. The iterator over what String does own is spelled string.into_bytes().into_iter().


All that being said, I don't have any strong objection to the idea, given enough documentation.[1]


  1. I'd be opposed to IntoIterator but understand you're not suggesting that. ↩︎

2 Likes

There's .chars() there already, and there's the awfully-named char type, so all the footguns are already there. Lack of into_chars() may be causing more struggles with ownership than it is preventing Unicode-ignorant code.

If there was a chance of Rust renaming char to, say, rune, then I think there would be a motivation to hold off with adding into_chars() until it can be into_runes().

Aside: Go’s precedent notwithstanding, “rune” is a terrible name for a type representing a single Unicode codepoint. If anything, “rune” would have been a good name for a grapheme cluster as a “single human-recognized symbol”. (Contrast with glyph in the font / text rendering space which can refer to an entire symbol or just a piece.)

I haven’t seen a good, concise name for a Unicode codepoint. But in my mind, that’s okay, because as noted already there are very very few reasons why you should ever use them, and all of them are circumstances where you should be aware of the relationship between codepoints, code units (bytes), and grapheme clusters. (Examples: a regex engine, locale-aware collation, transcoding, parsing a Unicode-aware format—XML, yuck.)

8 Likes

It kinda is though. Any sequence of codepoints is a valid String. Not necessarily a sensible String, since graphemes can consist of multiple codepoints, but valid in the sense of type-safety. That's why String has push for pushing chars onto a String but no method for pushing bytes, since the latter would be unsound. So if you're slicing and dicing strings at the char level you at least don't have to worry about panics or undefined behaviour.
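As a small, hypothetical illustration of "valid but not necessarily sensible":

fn main() {
    // Any sequence of chars collects into a valid String, even one that is
    // nonsense as text (here: a letter followed by two combining accents).
    let s: String = ['a', '\u{0301}', '\u{0301}'].into_iter().collect();
    assert_eq!(s.chars().count(), 3);

    // There is no safe byte-pushing method because some byte sequences
    // (e.g. a stray 0xFF) are not well-formed UTF-8 at all.
    assert!(String::from_utf8(vec![0xFF]).is_err());
}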

5 Likes

As I have no actual objection, this is chatter for the sake of chatter, but anyway...

I agree, I just meant the heads-up blurb on chars should be copy-pasta'd to into_chars, should it come to be.

They were comparing to the iterators of containers in stdlib, which are generally

  • impl IntoIterator<Item = T> for Container<T>
  • impl<'a> IntoIterator<Item = &'a T> for &'a Container<T>
  • impl<'a> IntoIterator<Item = &'a mut T> for &'a mut Container<T>

And I'm saying it's not the same thing, because this trio of impls cannot be provided for String.

Of course. But: I understand str to be a "packed" representation of [char][1]. A fixedbitset also does not hand out &bool to its contents, but only by-value getters and setters.

But I also understand if String should only be considered a container of sub-strs (as indexed via [..]) and never of chars. But since char is part of the stdlib, that is maybe a hard sell.

But indeed, I am unable to find a method that gets a char at a byte location (obviously panicking on non-char boundaries). Maybe s.split_at(ix).1.chars().next()? However, I do understand why such a function does not exist: the loop for i in 0..s.len() { s.char_at_byte(i) } would be a massive footgun that is easy to miss when not testing on non-ASCII input.


  1. Maybe this mental model is my fault, but it has served me well so far, in particular for explaining why in-place mutation of chars is not possible. Funnily enough, make_ascii_{upper,lower}case seem to be the only operations that mutate the content of a &mut str ↩︎

I think that's typically something like s[i..].chars().next().unwrap()

or non-panicking s.get(i..).and_then(|s| s.chars().next())
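For example, on a non-ASCII string (illustrative values):

fn main() {
    let s = "héllo"; // 'é' occupies bytes 1 and 2
    // Byte 1 is a char boundary, so slicing there is fine.
    assert_eq!(s[1..].chars().next(), Some('é'));
    // Byte 2 is in the middle of 'é': `get` returns None instead of panicking.
    assert_eq!(s.get(2..).and_then(|t| t.chars().next()), None);
    // `&s[2..]` would panic here, since byte 2 is not a char boundary.
}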

1 Like

s[ix..].chars().next() is probably the shortest way.

As an aside, I just realized there doesn't seem to be a way of decoding a char directly from a byte stream, the way there's char::decode_utf16 (and both encode_utf8 and encode_utf16), and neither is there something like [try_]from_utf8(&[u8]) -> char.
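A sketch of the workaround one can use today, leaning on str::from_utf8 and Utf8Error::valid_up_to (the first_char helper is only an illustration, not an existing API):

// Decode the first char from the front of a byte slice, if the bytes start
// with well-formed UTF-8. Hypothetical helper, not part of std.
fn first_char(bytes: &[u8]) -> Option<char> {
    let valid_len = match std::str::from_utf8(bytes) {
        Ok(s) => s.len(),
        Err(e) => e.valid_up_to(),
    };
    // The prefix up to `valid_len` is known to be well-formed UTF-8.
    std::str::from_utf8(&bytes[..valid_len]).ok()?.chars().next()
}

fn main() {
    assert_eq!(first_char("étage".as_bytes()), Some('é'));
    assert_eq!(first_char(&[0xFF, b'a']), None);
}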

For into_bytes, we have

pub fn into_bytes(self) -> Vec<u8, Global>

What would into_chars look like?

I think the idea was

pub fn into_chars(self) -> IntoChars

with a return type

impl Iterator for IntoChars {
    type Item = char;
    // …
}

3 Likes

I know the stdlib worse than I thought... The only other prominent examples of an into_X function returning an iterator seem to be {Hash,BTree}Map::into_{keys,values}(). The idea came from there.