Index by Regex?


#1

I’m quite new to this place and have never written anything here, as I have been satisfied with Rust so far. Recently, though, while skimming through the Ruby docs, I stumbled upon an interesting idea: indexing a string with a regex.

I thought it would be cool to have such a feature in Rust and threw together a PoC. Example usage would be:

let re = regex!(r"W\w+");
let world = &"Hello, World"[Re(re)];
println!("{}", world);

As you can see, I have to use a newtype to work around the trait implementation rules, which is OK. Still, I think it would be cool to have impl Index<Regex> for str ready to use in the regex crate, which would make the code clearer and more elegant (at least from my POV).

Before submitting a PR on GitHub, I decided to ask the community about the feature: would you accept it in the Rust regex crate? What would you say?
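The newtype workaround described above can be sketched with only the standard library. This is a minimal illustration, not the poster's PoC: `Lit` is a hypothetical newtype that wraps a literal substring instead of a compiled regex, so the example has no external dependencies. With the regex crate, the body of `index` would presumably call `Regex::find` instead of `str::find`.

```rust
use std::ops::Index;

/// Hypothetical newtype standing in for a compiled regex: it wraps a
/// literal substring so the sketch needs nothing beyond the standard library.
struct Lit<'p>(&'p str);

// `Index` and `str` are both foreign to this crate, but the impl is
// accepted because the index type `Lit` is defined locally.
impl<'p> Index<Lit<'p>> for str {
    type Output = str;

    fn index(&self, pat: Lit<'p>) -> &str {
        // `Index` cannot report failure, so a missing match has to panic.
        let start = self.find(pat.0).expect("pattern not found");
        &self[start..start + pat.0.len()]
    }
}

fn main() {
    let world = &"Hello, World"[Lit("World")];
    println!("{}", world);
}
```

One design consequence is visible here: because `index` must return a reference, a pattern that does not match can only panic, unlike a method that returns `Option`.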


#2

We already do not let you index strings, so indexing by regex doesn’t make much sense.


#3

Hmm, I might have missed it somehow. I tried my code with rust-nightly and it worked just fine, and I use str indexing by Range<uint>. I don’t follow the Rust git repository very closely these days (aside from breaking changes, which, well, break my code, but I don’t have much time to hack on my Rust projects either). I think it’s a loss in impressiveness, but nevertheless, thanks for pointing it out. Do you know the reasoning behind the feature removal?


#4

I think we let people index strings by byte indices.


#5

That’s how it works now in nightly, isn’t it? And I think it’s very reasonable, both efficient and expected.


#6

http://doc.rust-lang.org/book/more-strings.html#indexing-strings is what I’m referring to; we haven’t allowed that for a long time. We do let you convert to bytes, which might be what @tbu is referring to.


#7

So you mean I will have to write "abcde".as_bytes()[0..3] instead of "abcde"[0..3]? Not very ergonomic, IMHO, but I see the reason behind it and I have to agree. I think my original question is answered, thank you all for enlightening me :smile:


#8

Then the guide needs to be updated, we allow indexing by bytes for str:

fn main() {
    println!("{:?}", &"foobar"[0..3]);
}

compiles and prints the first three bytes, i.e. "foo" (Playpen).


#9

Not so fast. I think @steveklabnik is just confused. Rust does not allow indexing string slices by a single integer, for various reasons explained in the book. Rust does however let you index string slices by ranges (aka. slicing). That feature is absolutely not going away. It is in fact marked as stable.
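The distinction between indexing by a single integer and slicing by a range can be checked with a few standard-library calls. This is a small sketch; the helper name `prefix` is made up for illustration.

```rust
/// Hypothetical helper: the first `n` bytes of `s` as a `&str`, or `None`
/// when `n` is out of range or falls inside a multi-byte character.
fn prefix(s: &str, n: usize) -> Option<&str> {
    s.get(..n)
}

fn main() {
    let s = "foobar";
    // Slicing by a byte range is allowed and yields a `&str`.
    assert_eq!(&s[0..3], "foo");

    // `s[0]` does not compile: `str` has no `Index<usize>` impl.
    // Going through the bytes makes the intent explicit instead.
    assert_eq!(s.as_bytes()[0], b'f');

    // Range ends must fall on UTF-8 character boundaries.
    let t = "h\u{e9}llo"; // U+00E9 occupies bytes 1..3
    assert!(t.is_char_boundary(1));
    assert!(!t.is_char_boundary(2));

    // `get` is the non-panicking form; `&t[0..2]` would panic here.
    assert_eq!(prefix(t, 2), None);
    assert_eq!(prefix(t, 3), Some("h\u{e9}"));
}
```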

Concerning your original question: I think indexing by regex might be a nice feature. I’m not entirely sure whether this doesn’t already feel like operator abuse, though. I think it is likely fine, because the operand is clearly a regex, but I’d want others to chime in.


#10

With ideas like these, the best plan is probably to make a crate on crates.io. If it gets used frequently, the next step would probably be to open an issue/PR on the regex crate about including it.


#11

Ahh, I was not aware of the indexing by slicing, to be honest. That feels… bad :frowning:

“Not very ergonomic, IMHO,”

The important thing is to realize you’re iterating over bytes, not ‘characters’ or something else. With Unicode, you need to know whether you want codepoints, graphemes, or bytes. Because ASCII is a single-byte encoding, people are used to being able to just do stuff over bytes, but that doesn’t work, as a given character can be multi-byte.
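To make the bytes versus codepoints distinction concrete, here is a short sketch using only the standard library. The helper `counts` is a made-up name. Grapheme clusters are not covered by the standard library, so they only appear in a comment.

```rust
/// Hypothetical helper: (number of bytes, number of code points) in `s`.
fn counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    // ASCII: one byte per code point.
    assert_eq!(counts("abc"), (3, 3));

    // Precomposed U+00E9: one code point stored as two bytes.
    assert_eq!(counts("\u{e9}"), (2, 1));

    // 'e' followed by combining acute accent U+0301: two code points,
    // three bytes, yet a reader sees a single grapheme.
    assert_eq!(counts("e\u{301}"), (3, 2));
}
```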


#12

Yes, I’m aware of differences between chars/codepoints/graphemes/bytes :smile:


#13

I considered putting it into a crate, but I would have to introduce a lot of newtypes, so I decided to ask for advice here first.


#14

It’s a cool hack, not sure if there is much more to say than that :smile:


#15

Wait, aren’t regexes unicode-aware? When I worked with regexes in Perl, C++ (PCRE, re2) and Java they were. If they’re unicode-aware, it should be absolutely no problem to index a string by regex because the index will always return a slice to a correct UTF-8 sequence.
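The argument can be illustrated without a regex engine. In the sketch below, `first_word` is a hand-written, hypothetical stand-in for a Unicode-aware `\w+`: it walks code points via `char_indices`, so the byte offsets it reports always sit on character boundaries and slicing with them cannot panic.

```rust
/// Hypothetical stand-in for a Unicode-aware `\w+`: the byte range of the
/// first run of alphanumeric code points, if there is one.
fn first_word(s: &str) -> Option<(usize, usize)> {
    // Byte offset of the first alphanumeric code point.
    let start = s.char_indices().find(|(_, c)| c.is_alphanumeric())?.0;
    // Length in bytes of the run that begins there.
    let len = s[start..]
        .char_indices()
        .find(|(_, c)| !c.is_alphanumeric())
        .map(|(i, _)| i)
        .unwrap_or(s.len() - start);
    Some((start, start + len))
}

fn main() {
    // U+00A1 (inverted exclamation mark) and U+00F1 (n with tilde) each
    // take two bytes in UTF-8.
    let s = "\u{a1}se\u{f1}or!";
    let (start, end) = first_word(s).expect("no word found");
    assert!(s.is_char_boundary(start) && s.is_char_boundary(end));
    println!("{}", &s[start..end]);
}
```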


#16

I’d like to see it bundled with https://crates.io/crates/regex


#17

That question makes no sense. If you know that Unicode supports hieroglyphs and ancient Sumerian letters, you’ll realize why that’s nearly impossible.

How do you write a regex that returns A where B is below it and C is in the bottom left corner? Regex only works on things that are written left to right. If I write ab* and use an RTL (right-to-left) language, it should match bbbbba!

You could limit Regex only to European-like LTR languages.


#18

You’re conflating (matching on) display with matching a stream of codepoints. There is no reason regex shouldn’t work on Unicode strings. In particular, matching on Unicode categories is supported by the regex crate. You are right, though, that writing regexes in the presence of RTL characters can be confusing. It is likely better to use an escape sequence in those cases. Depending on one’s needs, one might also have to think about Unicode normalization, but none of that fundamentally prevents regex from working.
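A small sketch of the display versus codepoint-stream point, using only the standard library. `matches_a_b_star` is a hypothetical hand-written equivalent of the pattern `ab*`, with the two letters passed as parameters so that Hebrew letters can be substituted. Right-to-left text is stored in logical (reading) order, so the matcher sees the first letter first regardless of where it is drawn on screen.

```rust
/// Hypothetical hand-written equivalent of `^ab*$` over code points:
/// true when `s` is one `a` followed by zero or more `b`s in storage order.
fn matches_a_b_star(s: &str, a: char, b: char) -> bool {
    let mut chars = s.chars();
    chars.next() == Some(a) && chars.all(|c| c == b)
}

fn main() {
    let alef = '\u{5d0}'; // Hebrew letter alef
    let bet = '\u{5d1}'; // Hebrew letter bet

    // Alef is read first, so it is stored first, even though a renderer
    // draws it at the right-hand end of the line.
    assert!(matches_a_b_star("\u{5d0}\u{5d1}\u{5d1}", alef, bet));

    // Reversing the storage order gives a different string that no
    // longer matches.
    assert!(!matches_a_b_star("\u{5d1}\u{5d1}\u{5d0}", alef, bet));

    // The same function handles Latin letters unchanged.
    assert!(matches_a_b_star("abbb", 'a', 'b'));
}
```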


#19

Since regex is no longer a part of the stabilizing distribution, I don’t see why it shouldn’t support more ambitious and experimental APIs like the one suggested here.

regex is a pre-1.0 library; if these features prove undesirable, we can just remove them.


#20

It’s one of those features that are very unlikely to get much attention unless advertised by including them in the existing library. I never knew that Ruby had such a feature, for example, and wouldn’t expect it to be present in any other language or library I’m working with. Even in Ruby, I doubt a lot of people are aware of it and use it. Still, it’s a neat feature to have.