Should &str implement IntoIterator as equivalent to .chars()?

I came across the fact that &str does not implement IntoIterator, and the explanation seems to be "it is not clear whether you want the &str's bytes or UTF-8 chars".

However I think it would be a sensible default to implement IntoIterator for &str yielding chars, because most of the time you are using a &str as text and not as a byte string. You could still get an iterator of bytes by calling .bytes() anyways. But it is kind of surprising that you can't use a "text" as an impl IntoIterator<Item=char>.
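For reference, here is a minimal sketch of what the explicit form looks like today. The helper `count_vowels` is a made-up name for illustration, not anything from std; the point is only that a function taking `impl IntoIterator<Item = char>` accepts `s.chars()` but not `s` itself.

```rust
// Hypothetical helper: accepts anything that iterates over `char`s.
fn count_vowels(text: impl IntoIterator<Item = char>) -> usize {
    text.into_iter()
        .filter(|c| matches!(c.to_ascii_lowercase(), 'a' | 'e' | 'i' | 'o' | 'u'))
        .count()
}

fn main() {
    let s = "Hello, world";
    // count_vowels(s); // does not compile: `&str` does not implement `IntoIterator`
    let vowels = count_vowels(s.chars()); // explicit choice: Unicode scalar values
    let bytes = s.bytes().count(); // explicit choice: raw UTF-8 bytes
    println!("{vowels} vowels, {bytes} bytes");
}
```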

What do you think?

There's even more than just the option to consider bytes or chars. E.g. there's also grapheme-clusters, as provided by the unicode-segmentation crate. Really, iterating over chars - which are unicode scalar values - can have quite surprising effects, e.g. separating characters from their diacritics, or splitting up more complex emoji into multiple parts.

This is something the documentation of the chars method points out, so I think it really is a good thing that you cannot skip calling this method.
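To illustrate the emoji case, here is a small sketch using only std. The constant name `FAMILY` is just for this example; it spells out a family emoji as three emoji joined by zero-width joiners (U+200D), which most fonts render as a single glyph. Counting grapheme clusters would need an external crate such as unicode-segmentation, so this only compares bytes and chars.

```rust
// One visible glyph, written as escapes: man, ZWJ, woman, ZWJ, girl.
const FAMILY: &str = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}";

fn main() {
    // Neither count matches the single glyph a reader sees.
    println!("bytes: {}", FAMILY.bytes().count());
    println!("chars: {}", FAMILY.chars().count());

    // Iterating over chars splits the emoji into its component scalar values.
    for c in FAMILY.chars() {
        println!("U+{:04X}", c as u32);
    }
}
```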

22 Likes

I tend to think that when it's not obvious what the language should do, we should discourage the "default" and point in the error message to the probably wanted behavior. This is what the compiler does now (though the error message may be improved, for example by saying what chars() is).

If we do that, then it's easy to type chars() if this is what you want. However, if we allowed it, people wouldn't know there is a problem at all until they hit bugs and unexpected results. And Rust in general favors correctness and explicitness over ease of writing.

4 Likes

I feel that this is essentially the same as how Python shouldn't have made for x in dictionary be equivalent to for x in dictionary.keys(). That behavior is Not What You Want often enough that it would be better to be required to explicitly specify .keys(), .values(), or .items() each time, rather than what happens now, which is that you are silently given the wrong iterator if you forget to say .values() or .items() when that was what you needed.

In the same way, it might seem like you want .chars() almost all the time and it's annoying to have to type it, but if Rust had the behavior you wanted, I expect you would find it just as annoying, or possibly worse, to be silently given .chars() in the cases where you actually wanted bytes or grapheme clusters or words or whatever.

10 Likes

Ah, thanks for the hint on grapheme clusters. I was confused at first why "y̆".chars().next() doesn't yield Some('y̆'), but it seems this is actually two characters displayed as one, y and ̆ .
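A quick way to see this, as a sketch: write the string with an explicit escape for the combining breve (U+0306) and collect its chars. `split_scalars` is a hypothetical helper name used only here.

```rust
// Hypothetical helper: collect the Unicode scalar values of a string.
fn split_scalars(s: &str) -> Vec<char> {
    s.chars().collect()
}

fn main() {
    // 'y' followed by U+0306 COMBINING BREVE; displayed as one character.
    let y_breve = "y\u{306}";

    // The first char is the plain base letter, without the diacritic.
    println!("{:?}", y_breve.chars().next());

    // The full sequence is two scalar values.
    println!("{:?}", split_scalars(y_breve));
}
```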

I think I understand now that this is an important distinction when working with multilingual texts, which I probably would have missed if there were an IntoIterator<Item=char> implementation on &str.

4 Likes

Agree fully, but they also had a different kind of consistency to think of - with the in operator, used for containment check. I could see that easily tipping the scales when the design was considered. So there is no perfect choice(?).

1 Like

I actually watched a presentation by Raymond Hettinger in which he asserted that they looked at which requirement comes up more often wrt. for item in dict: do people need keys or key-value pairs? He then went on to say that they found lone keys were needed more often than key-value pairs, so that's why it works the way it does.

Incidentally, this contradicts my own experience and I'm glad Rust chose the other default for std map types.
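For comparison, a rough sketch of how this looks with a std map in Rust. The `sample` function and its contents are made up for illustration. Iterating over `&map` yields key-value pairs, and lone keys or values have to be requested by name.

```rust
use std::collections::BTreeMap;

// Made-up sample data for the illustration.
fn sample() -> BTreeMap<&'static str, u32> {
    BTreeMap::from([("us", 4), ("ie", 8), ("tw", 16)])
}

fn main() {
    let map = sample();

    // The default iteration gives (key, value) pairs.
    for (code, value) in &map {
        println!("{code} => {value}");
    }

    // Keys alone or values alone are explicit method calls.
    let keys: Vec<&str> = map.keys().copied().collect();
    let total: u32 = map.values().sum();
    println!("{keys:?}, total {total}");
}
```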

4 Likes

Interesting, was this decision based on data? Because it contradicts my experience too.

My whole point here is that that's the wrong heuristic. The problem is not that lone keys, or chars, or whatever, aren't wanted more often than the alternative(s). The problem is that if you forget to ask for an alternative when you did need an alternative, you get the wrong behavior, instead of an immediate error.

In Python, this is a really big headache, because "the wrong behavior" has a decent chance of being "your program runs without complaint but its output is garbage". In Rust, it's not so bad, because the compiler checks more stuff, so "the wrong behavior" has a decent chance of still being a compile-time error, but probably not as clear of an error as it would have been if there was no default.

My gut feeling, unsupported by actual research, is that if there are two or more possibilities for what iteration over a container should mean, then the programmer should have to spell it out unless one possibility is used more than 95% of the time, maybe even 99%.

8 Likes

Python is also making different tradeoffs, since its strings are themselves iterable. Your dict might use country codes as keys:

{ "us" : 4,
  "ie" : 8,
  "tw" : 16,
  ... }

And then for k, v in d: will actually run without errors, at least on the surface level, because each two-character key unpacks into k and v. But having collections iterate over the thing they are collections of seems like a very nice property if you can have it, especially in the presence of strong typing.

In the face of ambiguity as to what strings actually are collections of, we should probably refuse the temptation to guess.

1 Like

(Also a Python core dev here, though I've done very little in the way of the design of the language.)

I'd be shocked if this were determined to be true in any sort of robust way, mostly because this functionality has been in the language for so long, and the older a feature is, the more likely its reason for inclusion was "someone was excited and implemented it".

Another fact that's relevant in Python, but not Rust, is a desire for symmetry between __contains__ and __iter__. That is, for x in iter: assert x in iter ought to hold true, and indeed the default implementation of __contains__ simply iterates over self, checking equality on each element.

5 Likes

Oh, I'm absolutely not trying to argue the point that explicitly spelling out .chars() is good! I fully agree with the sentiment that

Instead, I was only trying to provide some historical/anecdotal background as to why it works the way it does in Python.
