`.into` for converting between Vec<u8>, Vec<char> and String

From should be implemented for conversion between these three. How does everyone feel about this?

Desired outcome:

let byte_vec = vec![72, 101, 108, 108, 111, 46];

// Currently impossible. Vec<char> doesn't implement From<Vec<u8>>
let char_vec: Vec<char> = byte_vec.into();

// Currently impossible. String doesn't implement From<Vec<char>>
let string: String = char_vec.into(); // or byte_vec.into()

Harsh reality :frowning:

// Pseudo-code. The examples above were tested, just to prove
// that they don't work. This closely resembles real code,
// but may contain some syntax errors.

let byte_vec = vec![72, 101, 108, 108, 111, 46];

// :(
let char_vec: Vec<char> = byte_vec.iter().map(|x| *x as char).collect();

// Trait import required;
// otherwise it's char_vec.into_iter().collect(), which is still verbose (or, 
// worse yet, ugly) and completely avoidable with a From impl.
use std::iter::FromIterator;
let string: String = String::from_iter(char_vec);

In most cases, Vec<char> should not be used at all, since String is more memory-efficient. I think it's a good thing that Rust std does not include any API (including trait implementations such as the suggested ones) explicitly for Vec<char>, because the lack of such an API signals that Vec<char> is not a good choice for representing strings.
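
To put a number on the memory claim, here is a minimal sketch comparing the payload size of the same text in both representations. The function names are made up for this example, and spare capacity in either container is ignored.

```rust
use std::mem::size_of;

/// Payload bytes when the text is stored as a `String` (UTF-8 encoded).
fn utf8_payload_bytes(text: &str) -> usize {
    // `str::len` reports the length in bytes, not in characters.
    text.len()
}

/// Payload bytes when the same text is stored as a `Vec<char>`.
/// Every `char` occupies four bytes, whatever the code point is.
fn char_vec_payload_bytes(text: &str) -> usize {
    text.chars().count() * size_of::<char>()
}
```

For ASCII-heavy text such as "Hello." the Vec<char> form takes four times the space of the String form.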

Well, first: AFAICT your byte_vec is really an i32_vec without type annotations.

That left aside:

  • What encoding is that byte vector in? (UTF8? UTF32? other?)
  • How do you handle errors (encoding errors, early truncation, illegal code points, ...)?
  • What is wrong with, e.g., String::from_utf8()/_lossy()/_unchecked()?
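
For reference, a minimal sketch of the two safe conversions named in the last bullet, assuming the bytes are meant to be UTF-8. The wrapper names `strict` and `lossy` are hypothetical; the standard library calls inside them are the real ones.

```rust
use std::string::FromUtf8Error;

/// Strict conversion: fails if the bytes are not valid UTF-8.
/// On success the existing buffer is reused rather than copied.
fn strict(bytes: Vec<u8>) -> Result<String, FromUtf8Error> {
    String::from_utf8(bytes)
}

/// Lossy conversion: invalid sequences are replaced with U+FFFD,
/// the Unicode replacement character.
fn lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

The two answer the error-handling question in opposite ways, which is the reason a single `.into()` could not stand in for both.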

Vec<char> can be indexed, while AFAIK String cannot be directly indexed, at least not that easily. In some use cases (e.g. text parsing) that can be really helpful.

I intended for it to be a UTF8 vector. I should've added Vec<u8>.

Not much -- though .into() can be more concise. A case can be made for having two ways to do the same thing, since String::from, .to_string() and .into() already exist for creating a String.

Good point -- maybe only .try_into() should be implemented for Vec<u8> to String. I think an implicit lossy conversion via .into() is a bad idea.

What kind of text parsing are you doing that you need to index into Unicode scalar values as opposed to bytes or unicode graphemes?

I don't think anybody will prefer Vec<char> over String unless they have a good reason to do so, e.g. because random access is needed. I feel that being able to convert between these two directly would be a useful addition. I also feel that a conversion from Vec<char> to Vec<u32>, with a try_from as its inverse, would be a useful addition. However, a direct conversion between Vec<char> and Vec<u8> is ambiguous and rarely needed; that use case is better served by taking a detour through String (for UTF-8) or Vec<u32> (for UTF-32).
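
The Vec<char> and Vec<u32> round trip can already be written out by hand today. A minimal sketch, with hypothetical function names; it uses `char::from_u32`, which returns an `Option` (the standard library also has `TryFrom<u32> for char`, which would fit the try_from wording more closely).

```rust
/// Infallible direction: every `char` is a valid `u32`.
fn chars_to_u32(chars: &[char]) -> Vec<u32> {
    chars.iter().map(|&c| u32::from(c)).collect()
}

/// Fallible direction: surrogates (0xD800..=0xDFFF) and values above
/// 0x10FFFF are not Unicode scalar values, so the whole conversion
/// yields `None` as soon as one of them is found.
fn u32_to_chars(values: &[u32]) -> Option<Vec<char>> {
    values.iter().map(|&v| char::from_u32(v)).collect()
}
```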

I'm on libs-api, and I can at least say that impl From<Vec<u8>> for Vec<char> won't be added because there is too much ambiguity about what it means. From your code, it seems clear that you just want this to mean that each of the bytes gets treated as a single codepoint. (And if so, my goodness, why bother with char at all?) But I initially assumed you wanted the Vec<u8> treated as UTF-8. Either way, writing out the conversion explicitly is much clearer and doesn't require guesswork, and I would argue the use case is rather niche.

As for impl From<Vec<char>> for String, that's a bit more interesting. We do have impl From<char> for String, and there is definitely one unambiguous meaning as to what the conversion is. So I think adding this impl might be plausible, but my initial reaction to it would be to argue against it. Namely, I personally don't see it as a conversion worth encouraging and the FromIterator code makes it much clearer what kinds of costs are being incurred here. Namely, you can't just do an in-place conversion from Vec<char> to String, you actually have to iterate over the chars, encode them to UTF-8 and push the encoded bytes into a String.
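
To make that cost visible, here is a rough sketch of what any Vec<char> to String conversion has to do, written as an explicit loop. The function name is made up for illustration; in practice `chars.iter().collect::<String>()` does the same job.

```rust
/// Builds a `String` from a slice of `char`s by encoding each one to UTF-8.
/// The input buffer cannot be reused: a `char` is four bytes wide, while its
/// UTF-8 encoding takes between one and four bytes.
fn chars_to_string(chars: &[char]) -> String {
    // One byte per char is only a lower bound on the final size.
    let mut out = String::with_capacity(chars.len());
    for &c in chars {
        // `push` encodes `c` as UTF-8 and appends the resulting bytes.
        out.push(c);
    }
    out
}
```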

If both of those impls were added, they would enable you to write code like this:

let bytes = vec![0xFF];
let string = String::from(Vec::<char>::from(bytes));

... at which point I would stand up and loudly proclaim, "WAT?"
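
A sketch of why that chain would be surprising, assuming the byte-to-char impl followed the `as char` semantics from the original post. The helper name is hypothetical.

```rust
/// Treats every byte as the code point with the same number and then
/// encodes the result as UTF-8, i.e. the chain the two impls would enable.
fn bytes_as_code_points(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| char::from(b)).collect()
}
```

With this reading, the single byte 0xFF becomes U+00FF ('ÿ'), and the resulting String holds the two bytes 0xC3 0xBF. `String::from_utf8(vec![0xFF])`, on the other hand, returns an error because 0xFF is not valid UTF-8. The same input gives two very different outcomes depending on which meaning the reader assumed.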

Popping up a level, I suspect your main issue is using Vec<char> at all in the first place. This is a very rarely used type. I'm not saying it's always wrong, but... it's probably wrong. And very unlikely to be common enough to be worth first class conveniences in the standard library. So I would encourage you to share some snippets of code (perhaps in a new thread) asking how it might be written more idiomatically instead of using Vec<char>. You allude to using Vec<char> so that you have random access in parsers, but parsers on String directly make it very easy to get byte offsets into the String, and those in turn also give you random access.
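
As a small illustration of the byte-offset approach, here is a sketch that records where whitespace-separated words start and end in a &str. The function `word_spans` is invented for this example and is not taken from any particular parser.

```rust
/// Returns the byte ranges (start, end) of whitespace-separated words.
/// Offsets come from `char_indices`, so they always fall on char boundaries
/// and can be used to slice the input directly.
fn word_spans(input: &str) -> Vec<(usize, usize)> {
    let mut spans = Vec::new();
    let mut start: Option<usize> = None;
    for (offset, c) in input.char_indices() {
        if c.is_whitespace() {
            // A word just ended: record its span, if one was open.
            if let Some(s) = start.take() {
                spans.push((s, offset));
            }
        } else if start.is_none() {
            // First char of a new word.
            start = Some(offset);
        }
    }
    // A word may run up to the end of the input.
    if let Some(s) = start {
        spans.push((s, input.len()));
    }
    spans
}
```

Once a span is stored, `&input[start..end]` gets back to that word in constant time, which is the random access that Vec<char> was supposed to provide.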

More concise and less clear: which of the conversions would it represent? Probably not unchecked, because it cannot be unsafe, but is it from_utf8().unwrap() or from_utf8_lossy()?

This is different: ToString is a general trait to convert things to String, and it can also convert &str to String. ToOwned is a general trait for going from borrowed to owned, and it can also go from a borrowed &str to an owned String. And there is the specific &str to String From impl. But here there are already three specific conversion methods, so I don't think we should add another one for the sake of brevity.

I also did some toying around in the Rust source just now and it would be pretty difficult to implement some of these conversions because of the type ambiguity they introduce (e.g., does the user mean a Vec<u8> or a Vec<char>?). In pre-existing code, Vec::from was used without type annotations. Also, the #[unstable] tag doesn't work for impls right now, which is... non-ideal, to say the least. I kind of feared that that'd be the case, but completely lazed out on checking.

I think a discussion looking for help would be better suited to the other Rust forum, so I will most likely be taking that conversation there. Thanks for the input!

No, Vec<char> can't be meaningfully randomly indexed. This is because Unicode itself, in all of its representations, is a stateful variable-length coding. Codepoints alone may not have any useful meaning. The n-th char could be the left half of a flag emoji, or an umlaut modifier for the letter preceding it, or there could have been a direction-swapping codepoint somewhere along the way, and you wouldn't even know where n is in the string logically.

You still have to scan the string from the start to build an index of grapheme clusters, word boundaries, or whatever else you're interested in in the text. But if you do such a scan, then you don't need Vec<char>, and indexes into a UTF-8 string are just as good. In any case, a logical "character" can span multiple elements.
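
The flag and umlaut cases are easy to check with the standard library alone. A minimal sketch; the helper name is made up, and grapheme segmentation itself is left out because std does not provide it.

```rust
/// Number of Unicode scalar values (`char`s) in the text. This can be
/// larger than the number of characters a reader would perceive.
fn scalar_count(text: &str) -> usize {
    text.chars().count()
}
```

The precomposed "\u{00FC}" (ü) is a single char, while "u\u{0308}" (u followed by a combining diaeresis) renders the same but is two. The flag "\u{1F1E9}\u{1F1EA}" is also two chars, both regional indicators. Indexing to element 1 of the corresponding Vec<char> lands in the middle of what a reader sees as one character.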

I think there are definitely some use cases for accessing strings by scalar value, in particular because Unicode defines most character properties at this level. But yes, in almost every use case it is best to consider a string as one object, with the separation into scalar values being only one of many possible views.

Parsing algorithms don't use random access. They read the characters consecutively.

To read characters consecutively you can use s.chars(): an iterator over chars.
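
A minimal sketch of that style of parsing: reading a decimal number from the front of a `chars()` iterator with one character of lookahead. `read_number` is a hypothetical helper written for this example.

```rust
use std::iter::Peekable;
use std::str::Chars;

/// Consumes consecutive ASCII digits from the iterator and returns their
/// value. Returns `None` if there is no digit at the current position or
/// if the number does not fit in a `u32`.
fn read_number(chars: &mut Peekable<Chars>) -> Option<u32> {
    let mut value: Option<u32> = None;
    // Peek first, so the character after the number is left unconsumed.
    while let Some(digit) = chars.peek().and_then(|c| c.to_digit(10)) {
        value = Some(value.unwrap_or(0).checked_mul(10)?.checked_add(digit)?);
        chars.next();
    }
    value
}
```

Calling it on `"42+7".chars().peekable()` reads 42 and leaves the iterator positioned at '+', with no indexing involved.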