How much is rindex used?

TL;DR

How much is rindex used?

Let’s admit it: I rarely use index. Even when I could, I often just summon a regular expression and call it a day.

On the other hand, if we’re just looking for a substring inside a string, the regular expression is overkill and a plain search with index would suffice (I touched upon this button in recent post Literal string 0 is false in Perl).

So yes, I’m aware of index and remember about it when the specific need arises… sometimes. I’ve been much less aware about rindex, the elusive sibling that looks for the substring starting from the end.

Sometimes, though, it’s just what’s needed, right? Like when thinking about a possible implementation for an endswith() function that checks –well– if a string ends with another string.

My recent discover [String::Util][] currently has an implementation that’s based on index:

sub endswith {
	my ($str, $substr) = @_;

	if (!defined($str)) {
		return undef;
	}

	if (!$substr) {
		$substr = $str;
		$str    = $_;
	}

	my $len   = length($substr);
	my $start = length($str) - $len;

	my $ret = index($str, $substr, $start) != -1;

	return $ret;
}

IMHO there’s a bug and it will be hopefully addressed.

So I wondered whether an alternative implementation based on rindex would make sense, not so much from a functional point of view (of course you can code something that works with index, the code above is a demonstration) or even most so-called non-functional points of view (like robustness, etc.), but rather from an efficiency point of view.

So… I decided to whip up a quick, arbitrary and unscientific Benchmark, to get the gist of it:

#!/usr/bin/env perl
use v5.24;
use warnings;
use experimental 'signatures';

use Benchmark qw< :all >;

my @stuff = (
   [ qw< foobar foo > ],
   [ qw< barfoo foo > ],
   [ qw< barfoo foobarbaz > ],
   [ 'bar' . ('foo' x 100), ('foo' x 100) ],
   [ ('bar' x 100) . 'foo', 'foo' ],
);

cmpthese(
   -2,
   {
      original => sub { endswith_original($_->@*) for @stuff },
      rindex   => sub { endswith_rindex($_->@*) for @stuff },
   },
);

sub endswith_original {
   my ($str, $substr) = @_;

   if (!defined($str)) {
      return undef;
   }

   if (!$substr) {
      $substr = $str;
      $str    = $_;
   }

   my $len   = length($substr);
   my $start = length($str) - $len;

   my $ret = index($str, $substr, $start) != -1;

   return $ret;
} ## end sub endswith_original

sub endswith_rindex {
   my ($str, $substr) = @_;
   return unless defined $str;
   ($str, $substr) = ($_, $str) unless defined $substr;
   my $target = length($str) - length($substr);
   return ($target >= 0) && rindex($str, $substr) == $target;
}

It seems to have an edge with these test cases:

$ perl string-util.pl 
              Rate original   rindex
original  595595/s       --     -42%
rindex   1032777/s      73%       --

I’m not sure why, though. The endwith_original implementation takes care to run the index search from the right $start point, and the two functions do more or less the same operations (similar test, the both calculate the $start/$target position where the substring is supposed to start from, then invoke their respective incantation).

The edge increases with longer strings, e.g. changing 100 to 1000 everywhere in the code above leads us to this:

 perl string-util.pl 
             Rate original   rindex
original 168048/s       --     -72%
rindex   601510/s     258%       --

My wild guess is that index might not take advantage of the length of the substring it’s looking for, so it surely $starts from the right spot, but then continues to look for it starting from the following character, then the following one, etc. up to the end of the main string.

Then, of course, I didn’t address the elephant in the room, i.e. the negative case. Let’s restrict our test to the following one:

[ ('bar' x 1000), 'foo' ],

As expected, rindex is the least efficient here, because it sweeps the whole string (almost, I think) while index’s $start allows dismissing the test very quickly:

$ perl string-util.pl 
              Rate   rindex original
rindex    990720/s       --     -77%
original 4230914/s     327%       --

At the end of the day, I start wondering whether index/rindex are actually the right tools for the job when we need to do tests like startswith and endswith: they’re prone to spend a lot of time looking for something in the wrong place!

So, let’s do two things:

settle for an arbitrary but mixed case, with both types of test;
add another contender to the mix, based on substr.

# ...
my @stuff = (
    [ ('bar' x 1000) . 'foo', 'foo' ],
    [ ('bar' x 1000) . 'BAR', 'foo' ],
)
# ...

sub endswith_substr {
   my ($str, $substr) = @_;
   return unless defined $str;
   ($str, $substr) = ($_, $str) unless defined $substr;
   my $target = length($str) - length($substr);
   return ($target >= 0) && substr($str, -$target) eq $substr;
}

This new contender is surely valid, although not a clear winner:

$ perl string-util.pl 
              Rate   rindex   substr original
rindex    854434/s       --     -56%     -61%
substr   1934041/s     126%       --     -11%
original 2178605/s     155%      13%       --

My other wild guess is that substr suffers from doing a copy of the whole string, which takes time.

So what gives? I take two things out if it:

index (and possibly rindex, when used in a startswith implementation) shows a behaviour that is suspicious to me. If my first wild guess is right, there’s probably some space for improvement (disclaimer: I used version 5.32.1, maybe it’s been already optimized!)
I would be curious to code a XS version that does the exact test we’re after!

All in all it’s been an interesting ride, stay safe!

ETOOBUSY 🚀 minimal blogging for the impatient