Is there a good reason why String has no `into_chars`?

The Unicode standard calls these things "characters". Quote from the standard:

The Unicode Standard specifies a numeric value (code point) and a name for each of its characters.

I disagree. Much of the time you want to be looking at the character (i.e. code point) level. The Unicode Standard defines all the character properties at that level. You can parse things by looking at characters or their properties.

Grapheme clusters really only come into play if you want to count the clusters (maybe for UI purposes), normalize identical-looking text, ignore accents, or care about how the text looks for some other similar reason.
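To make the code point vs. cluster distinction concrete, here is a minimal sketch using only the standard library. The function names are made up for illustration. Note that std has no grapheme segmentation, so this only shows the byte and code point levels, plus parsing by a per-char property.

```rust
/// Returns (UTF-8 byte length, number of code points) for `s`.
fn byte_and_char_counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

/// Splits off the leading run of alphabetic code points,
/// deciding purely on a per-`char` property.
fn leading_word(s: &str) -> &str {
    let end = s
        .char_indices()
        .find(|(_, c)| !c.is_alphabetic())
        .map(|(i, _)| i)
        .unwrap_or(s.len());
    &s[..end]
}
```

A precomposed "é" (U+00E9) is one code point, while "e" followed by U+0301 is two code points that render as a single cluster. Parsing with leading_word never needs to know about clusters.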

6 Likes

Or, well, render them, which is a pretty important use case for text :) Although one that not many have to worry about at that level, luckily.

5 Likes

Can you please expand on that? Is there a clippy lint to suggest this style of code? I'm accustomed to using into_iter on collections like Vec<T> when I can for T: Copy because it is annoying to dereference.

(Iterator::copied)

3 Likes

It is surprising to me that x.iter().copied() is better than x.into_iter() (or for &a in &x vs for a in x).
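For reference, the two styles being compared look roughly like this (a sketch with hypothetical function names, not code from the thread). Both produce the same values; the difference is only whether the Vec is consumed.

```rust
/// Borrowing style: the caller keeps the collection.
fn sum_borrowed(x: &[u32]) -> u32 {
    x.iter().copied().sum()
}

/// Owning style: the Vec is consumed and freed when the iterator is dropped.
fn sum_owned(x: Vec<u32>) -> u32 {
    x.into_iter().sum()
}
```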

One part of the standard calls them characters.

The official Unicode glossary lists four definitions. Notably, the definition of an "abstract character" includes combining character sequences.

1 Like

The main use case of an owning iterator, IMO, is moving items out of a collection using that iterator, i.e. cases where you explicitly don't want to make a copy, but instead transfer ownership of the item from the collection to something else.
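As a small illustration of that use case (the function name is hypothetical), into_iter on a Vec<String> hands out each String by value, so the chosen element can be returned without cloning:

```rust
/// Moves the longest word out of the Vec; no String is cloned.
fn longest(words: Vec<String>) -> Option<String> {
    words.into_iter().max_by_key(|w| w.len())
}
```

With words.iter() the result would be a &String tied to the Vec, and getting an owned String back would require a clone.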

Bytes are Copy. An owning iterator provides no real advantage over one that borrows the input for Copy types. Likewise for char, which is also a Copy type.

Can you explain or show an example of something you could do with an owning iterator which you can't do with Chars? Otherwise I fail to see what advantage it provides over using Chars and then just dropping the original String when you're done.

1 Like

You may want to return the iterator, and if the String is a local, it needs to be owned by the iterator.

fn get_chars() -> impl Iterator<Item=char> {
    let s = format!("for one reason or another, you have a String");
    s.chars().filter(|c| ['a','e','i','o','u'].contains(c))
    // error[E0597]: `s` does not live long enough
}
5 Likes

If you already own a String, you can trivially create whatever iterator you need without losing ownership of the String, so it's not clear why you'd want an owning iterator in the first place. Just hang onto the string and construct the iterator at point of use. Unless I'm missing something?

The thing you can't do is write a single function which constructs a string and returns a character iterator over it.

Right, just return the String and when you want to iterate over it call .chars().

I get that this is not returning an owning iterator, but it's basically the same, and also more flexible because you can do other stuff with the String. It seems weird to want a String you can only iterate over the chars of.

1 Like

It's not necessarily possible to do that; for example, if you're in the position of needing to satisfy a generic bound of Iterator + 'static, and (probably due to separate reasons) the element type is char.

I'm not claiming this is likely to come up, just that it could, and in that case you need an owning iterator.
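A sketch of what that situation might look like. count_later is a made-up stand-in for whatever API imposes the bound. Passing s.chars() would be rejected because it borrows the local String, so this version falls back on collecting into a Vec<char>, which works but costs an extra allocation.

```rust
// Hypothetical API that requires an iterator holding no borrowed data.
fn count_later<I>(iter: I) -> Box<dyn FnOnce() -> usize>
where
    I: Iterator<Item = char> + 'static,
{
    Box::new(move || iter.count())
}

fn make_counter() -> Box<dyn FnOnce() -> usize> {
    let s = String::from("héllo");
    // `count_later(s.chars())` does not compile: `s` would be borrowed
    // beyond the end of this function.
    let owned: Vec<char> = s.chars().collect();
    count_later(owned.into_iter())
}
```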

4 Likes

Good point.

As others have said, an owning iterator extends the lifetime of the underlying value.

Is it currently possible to implement this function?

  • Assume the string s is created within the function
  • Assume that the filter is some other Fn(char) -> bool. This particular check is ASCII-only, so an owning iterator over bytes would be sufficient.

The solution would be either to

  • collect the chars into a string and return that instead
  • get someone else to allocate that same string and pass &'a str and return impl Iterator<...> + 'a
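Sketched out, the two options above might look like this (function names are placeholders, and the vowel check stands in for any Fn(char) -> bool):

```rust
/// Option 1: collect the filtered chars into a new String and return that.
fn vowels_collected() -> String {
    let s = format!("you have a String");
    s.chars()
        .filter(|c| ['a', 'e', 'i', 'o', 'u'].contains(c))
        .collect()
}

/// Option 2: the caller owns the string; the iterator borrows it for 'a.
fn vowels<'a>(s: &'a str) -> impl Iterator<Item = char> + 'a {
    s.chars().filter(|c| ['a', 'e', 'i', 'o', 'u'].contains(c))
}
```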

Am I missing something? Both require some restructuring of the overall data-flow. This is fine in some cases and I found an alternative that worked for me in my use-case. However, I still prefer to write a chain of iterator-adapters and return the resulting iterator, where it makes sense.

This has been covered in this thread. You can implement this function by rolling your own owning iterator, either via a closure or via a struct; it does not require additional allocation.
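A minimal sketch of the closure variant, assuming nothing beyond std. The name into_chars here is a free function written for illustration, not an existing String method. The closure owns the String and a byte offset, and decodes one char per call.

```rust
use std::iter;

/// Owning char iterator built from a closure; no extra allocation.
fn into_chars(s: String) -> impl Iterator<Item = char> + 'static {
    let mut pos = 0; // byte offset of the next char in `s`
    iter::from_fn(move || {
        let c = s[pos..].chars().next()?;
        pos += c.len_utf8();
        Some(c)
    })
}

/// The function from earlier in the thread, now returning an iterator
/// that owns its String.
fn get_chars() -> impl Iterator<Item = char> {
    let s = format!("for one reason or another, you have a String");
    into_chars(s).filter(|c| ['a', 'e', 'i', 'o', 'u'].contains(c))
}
```

Unlike Chars, this sketch does not implement DoubleEndedIterator or give a useful size_hint; a struct-based version could add those.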

In any case there is an ACP now.

So, first I want to caveat this by saying that it's a very minor thing. The vast majority of loops -- those that are short and those that do a material amount of work per iteration -- will never actually care about the difference. In many of the other cases it still won't make a measurable difference.

But think about what needs to be tracked for a vec::IntoIter<T>: the pointer to the allocation, the offset from the start that you've already consumed, the offset to the end of the available data, and the capacity of the allocation.

That's twice as many things as for an iter::Copied<slice::Iter<'_, T>>, which only needs two things: the pointer to the first element, and the past-the-end pointer. Being a reference (conceptually it's a &'a [T]) it doesn't need to worry about the allocation parts of it.

Less indirection and fewer fields just makes it a bit easier on the optimizer. Sometimes it matters; often it doesn't.
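One way to see the difference in field count is to compare the sizes of the two iterator types. This is a sketch that relies on the current standard library layout, which is an implementation detail and not guaranteed.

```rust
use std::mem::size_of;

/// Returns (size of the owning Vec iterator, size of the borrowing slice iterator).
fn iterator_sizes() -> (usize, usize) {
    (
        size_of::<std::vec::IntoIter<u32>>(),
        size_of::<std::iter::Copied<std::slice::Iter<'static, u32>>>(),
    )
}
```

On current toolchains the owning iterator comes out larger, consistent with it tracking the allocation in addition to the two cursor positions.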

(It's kinda like how it's better to pass &[T] than Vec<T> to a function, even in places where you already have the Vec and it doesn't cause extra allocations.)

2 Likes

Unless the iterator shape fits into an in-place-collectible hole. But that would never be the case for a string.into_chars iterator since mapping one char to another could end up requiring more underlying storage.

Also, non-owning iterators may not be a good idea when you have to Clone the items instead of Copy.

char is always 4 bytes.

These being char, the operation is the same.

I'm overall confused by this post.

Not in a String, no, the underlying allocation is 1-4 bytes per char. Which means you can't reliably perform string.into_chars().map(...).collect::<String>() in-place, which you can for a Vec<u8>.
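A small sketch of why the in-place trick can't be relied on for strings (replace_a is a made-up example): a char value is 4 bytes in memory, but its UTF-8 encoding inside a String is 1 to 4 bytes, so a char-to-char map can grow the buffer.

```rust
/// Maps every 'a' (1 byte in UTF-8) to 'é' (2 bytes in UTF-8).
fn replace_a(s: String) -> String {
    s.chars().map(|c| if c == 'a' { 'é' } else { c }).collect()
}
```

Here "abc" is 3 bytes going in, while the result needs 4 bytes, so the output cannot always reuse the input allocation.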

If you follow the reply chain you'll see a more general statement not specific to char iterators:

Note that it's better to use a non-owning iterator where possible -- it's a simpler construct and a smaller type.
