Index by Regex?

kstep · March 11, 2015, 4:19pm

I’m quite new here and have never posted before, as I have been satisfied with Rust so far. Recently, though, while skimming through the Ruby docs, I stumbled upon an interesting idea: indexing a string with a regex.

I thought it would be cool to have such a feature in Rust and threw together a PoC. Example usage would be:

let re = regex!(r"W\w+");
let world = &"Hello, World"[Re(re)];
// println! needs a format string as its first argument
println!("{}", world);

As you can see, I have to use a newtype to work around the trait implementation rules, which is OK. Still, I think it would be cool to have impl Index<Regex> for str ready to use in the regex crate, which would make the code clearer and more elegant (at least from my POV).

Before submitting a PR on GitHub for this, I decided to ask the community about the feature: would you accept it in the Rust regex crate? What would you say?
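A minimal sketch of the newtype workaround described above, for readers who want to try it. To stay dependency-free it uses a hypothetical newtype `Lit` that wraps a literal substring as a stand-in for `Re(regex)`; the name `Lit` and the panic-on-no-match behaviour are assumptions of this sketch, not part of the original PoC.

```rust
use std::ops::Index;

/// Hypothetical stand-in for `Re(regex)`: wraps a literal pattern so the
/// sketch needs no external crate.
struct Lit<'a>(&'a str);

// Allowed by the trait implementation rules because `Lit` is a local type,
// even though both `Index` and `str` are foreign.
impl<'a> Index<Lit<'a>> for str {
    type Output = str;

    fn index(&self, pat: Lit<'a>) -> &str {
        // `index` has to return a reference, so a failed match can only panic.
        let start = self.find(pat.0).expect("pattern did not match");
        &self[start..start + pat.0.len()]
    }
}

fn main() {
    let world = &"Hello, World"[Lit("World")];
    println!("{}", world); // prints "World"
}
```

With the regex crate, the body of `index` would presumably call `Regex::find` on the wrapped regex and slice by the match's start and end offsets instead. Note that an `Index` implementation has no way to report "no match" other than panicking.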

steveklabnik · March 11, 2015, 4:39pm

We already do not let you index strings, so indexing by regex doesn’t make much sense.

kstep · March 11, 2015, 4:53pm

Hmm, I might have missed it somehow. I tried my code with rust-nightly and it worked just fine, and I use str indexing by Range<uint>. I don’t follow the Rust git repository very closely these days (aside from breaking changes, which, well, break my code, but I don’t have much time to hack on my Rust projects either). I think it’s a loss in impressiveness, but nevertheless, thanks for pointing it out. Do you know the reasoning behind the feature removal?

tbu · March 11, 2015, 5:31pm

I think we let people index strings by byte indices.

kstep · March 11, 2015, 5:46pm

That’s how it works now in nightly, isn’t it? And I think it’s very reasonable, both efficient and expected.

steveklabnik · March 11, 2015, 6:18pm

http://doc.rust-lang.org/book/more-strings.html#indexing-strings is what I’m referring to; we haven’t allowed that for a long time. We do let you convert to bytes, which might be what @tbu is referring to.

kstep · March 11, 2015, 6:29pm

So you mean I will have to write "abcde".as_bytes()[0..3] instead of "abcde"[0..3]? Not very ergonomic, IMHO, but I see the reason behind it and I have to agree. I think my original question is answered. Thank you all for the enlightenment.

tbu · March 11, 2015, 6:35pm

Then the guide needs to be updated; we do allow indexing str by byte ranges:

fn main() {
    println!("{:?}", &"foobar"[0..3]);
}

compiles and prints the first three bytes, i.e. "foo" (Playpen).

Florob · March 11, 2015, 6:39pm

Not so fast. I think @steveklabnik is just confused. Rust does not allow indexing string slices by a single integer, for various reasons explained in the book. Rust does, however, let you index string slices by ranges (a.k.a. slicing). That feature is absolutely not going away. It is in fact marked as stable.

Concerning your original question: I think indexing by regex might be a nice feature. I’m not entirely sure whether this doesn’t already feel like operator abuse, though. I think it is likely fine, because the operand is clearly a regex, but I’d want others to chime in.
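A short sketch of the distinction drawn above, using only the standard library. The helper name `byte_prefix` is made up for this example; it uses `str::get`, which returns `None` where plain range indexing would panic.

```rust
/// Hypothetical helper: the first `n` bytes of `s` as a string slice,
/// or `None` if byte `n` is out of range or falls inside a character.
fn byte_prefix(s: &str, n: usize) -> Option<&str> {
    s.get(..n)
}

fn main() {
    let s = "héllo"; // 'é' takes two bytes in UTF-8, so s.len() == 6

    // Range indexing works as long as both ends are on char boundaries.
    assert_eq!(&s[0..1], "h");
    assert_eq!(&s[1..3], "é");

    // `&s[0..2]` would panic: byte 2 is in the middle of 'é'.
    assert!(!s.is_char_boundary(2));
    assert_eq!(byte_prefix(s, 2), None);
    assert_eq!(byte_prefix(s, 3), Some("hé"));

    // There is no `s[0]`; ask for a char or a byte explicitly instead.
    assert_eq!(s.chars().nth(1), Some('é'));
    assert_eq!(s.as_bytes()[0], b'h');
}
```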

seanmonstar · March 11, 2015, 6:58pm

With ideas like these, the best plan is probably to make a crate on crates.io. If it gets used frequently, the next step would probably be to open an issue/PR on the regex crate about including it.

steveklabnik · March 11, 2015, 7:28pm

Ahh, I was not aware of the indexing by slicing, to be honest. That feels... bad

kstep wrote: "Not very ergonomic, IMHO"

The important thing is to realize you're iterating over bytes, not 'characters' or something else. With Unicode, you need to know whether you want codepoints, graphemes, or bytes. Because ASCII is a single-byte encoding, people are used to being able to just operate on bytes, but that doesn't work in general, as a given character can span multiple bytes.
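To make the bytes versus codepoints point concrete, here is a small standard-library sketch. The helper `counts` is a name invented for this example. Grapheme segmentation is not in the standard library, so it is only noted in a comment.

```rust
/// Hypothetical helper: (number of bytes, number of codepoints) in `s`.
fn counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    // Pure ASCII: one byte per codepoint.
    assert_eq!(counts("abc"), (3, 3));

    // Precomposed 'é' (U+00E9): one codepoint, two bytes.
    assert_eq!(counts("é"), (2, 1));

    // 'e' followed by combining acute accent (U+0301): two codepoints,
    // three bytes, yet a reader sees a single character (one grapheme).
    assert_eq!(counts("e\u{301}"), (3, 2));

    println!("{:?}", counts("e\u{301}")); // prints (3, 2)
}
```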

kstep · March 11, 2015, 7:34pm

Yes, I’m aware of the differences between chars/codepoints/graphemes/bytes.

kstep · March 11, 2015, 7:36pm

I considered putting it into a crate, but I would have to introduce a lot of newtypes, hence I decided to ask for advice here first.

bluss · March 11, 2015, 9:27pm

It’s a cool hack; not sure there is much more to say than that.

ArtemGr · March 13, 2015, 10:43pm

Wait, aren’t regexes unicode-aware? When I worked with regexes in Perl, C++ (PCRE, re2) and Java they were. If they’re unicode-aware, it should be absolutely no problem to index a string by regex because the index will always return a slice to a correct UTF-8 sequence.

ArtemGr · March 13, 2015, 10:46pm

I’d like to see it bundled with https://crates.io/crates/regex

DanielFath · March 14, 2015, 9:44pm

That question makes no sense. If you know that Unicode supports hieroglyphs and ancient Sumerian letters, you’ll realize why that’s nearly impossible.

How do you write a regex that matches A where B is below it and C is in the bottom-left corner? Regex only works OK on things that are written left to right. If I write ab* and use an RTL (right-to-left) language, it should match bbbbba!

You could limit Regex only to European-like LTR languages.

Florob · March 14, 2015, 10:37pm

You’re conflating (matching on) display with matching a stream of codepoints. There is no reason regex shouldn’t work on Unicode strings. In particular, matching on Unicode Category is supported by the regex crate. You are right, though, that writing regexes in the presence of RTL characters can be confusing. It is likely better to use an escape sequence in those cases. Depending on one’s needs, one might also have to think about Unicode Normalization, but none of that fundamentally prevents regex from working.

reem · March 14, 2015, 10:38pm

Since regex is no longer a part of the stabilizing distribution, I don’t see why it shouldn’t support more ambitious and experimental APIs like the one suggested here.

regex is a pre-1.0 library; if these features prove undesirable, we can just remove them.

ArtemGr · March 14, 2015, 11:03pm

It’s one of those features which are very unlikely to get much attention unless advertised by being included in an existing library. I never knew that Ruby had such a feature, for example, and wouldn’t expect it to be present in any other language or library I’m working with. Even in Ruby, I doubt a lot of people are aware of it and use it. Still, it’s a neat feature to have.
