The challenge

Write a script to implement Hash Join algorithm as suggested by wikipedia.

(In the challenge there is a part that is actually ignored in this post)

Example

Input:

    @player_ages = (
        [20, "Alex"  ],
        [28, "Joe"   ],
        [38, "Mike"  ],
        [18, "Alex"  ],
        [25, "David" ],
        [18, "Simon" ],
    );

    @player_names = (
        ["Alex", "Stewart"],
        ["Joe",  "Root"   ],
        ["Mike", "Gatting"],
        ["Joe",  "Blog"   ],
        ["Alex", "Jones"  ],
        ["Simon","Duane"  ],
    );

Output:

    Based on index = 1 of @players_age and index = 0 of @players_name.

    20, "Alex",  "Stewart"
    20, "Alex",  "Jones"
    18, "Alex",  "Stewart"
    18, "Alex",  "Jones"
    28, "Joe",   "Root"
    28, "Joe",   "Blog"
    38, "Mike",  "Gatting"
    18, "Simon", "Duane"

The questions

I admit to have been puzzled by this challenge.

The original text seems to include a part that is very relevant in the actual implementation of a database engine, where e.g. not all data can fit in memory at the same time and strategies have to be thought to address this:

For each tuple $r$ in the build input $R$

Add $r$ to the in-memory hash table

If the size of the hash table equals the maximum in-memory size:

Scan the probe input $S$, and add matching join tuples to the output relation

Reset the hash table, and continue scanning the build input $R$

Do a final scan of the probe input $S$ and add the resulting join tuples to the output relation

So… what is this challenge requesting, actually? I opted to ignore that part, and concentrate on the basic algorithm description in wikipedia, i.e.:

one of the two input relations is transformed into a hash, where keys point to lists (arrays, in our case) of records matching the selected key, implementing the hash phase;
the other is used for the scan phase.

The resulting function might then be used in the memory-aware mechanism.

Is this a deal?

Oh… another thing: we’re going to consider only single-column keys.

Is this still a deal?

The solution

With the simplifications laid out in The questions section, here’s how we can address the challenge in Perl:

#!/usr/bin/env perl
use v5.24;
use warnings;
use experimental 'signatures';
no warnings 'experimental::signatures';
use Data::Dumper;

sub hash_join ($one, $kone, $two, $ktwo) {
   # make sure ($one, $kone) deal with the shorter of the two relations
   ($one, $kone, $two, $ktwo) = ($two, $ktwo, $one, $kone)
      if $one->@* > $two->@*;

   # hash phase, build a hash from ($one, $kone)
   my %hash_one;
   push $hash_one{$_->[$kone]}->@*, $_ for $one->@*;

   # scan phase
   return map {
      my @record = $_->@*;
      my $key = splice @record, $ktwo, 1;
      next unless exists $hash_one{$key};
      map { [$_->@*, @record] } $hash_one{$key}->@*;
   } $two->@*;
}

my @player_ages = (
        [20, "Alex"  ],
        [28, "Joe"   ],
        [38, "Mike"  ],
        [18, "Alex"  ],
        [25, "David" ],
        [18, "Simon" ],
    );

my @player_names = (
        ["Alex", "Stewart"],
        ["Joe",  "Root"   ],
        ["Mike", "Gatting"],
        ["Joe",  "Blog"   ],
        ["Alex", "Jones"  ],
        ["Simon","Duane"  ],
    );

say join ', ', $_->@* for hash_join(\@player_ages, 1, \@player_names, 0);

Our only concession to optimizations is to (possibly) swap the two input relations to make sure that the hash is built starting from the smaller one. Although this should not change anything from the point of view of complexity (it should be somewhere between $O(N + M)$ and $O(N \cdot M)$, where $N$ and $M$ are the number of records and depending on their contents), it makes sense to use that as a base for building the hash so that collisions and re-arrangements will be less probable.

Here is how that can be translated in Perlish Raku:

#!/usr/bin/env raku
use v6;
sub hash-join (@one, $kone is copy, @two, $ktwo is copy) {
   # make sure ($one, $kone) deal with the shorter of the two relations
   (@one, $kone, @two, $ktwo) = (@two, $ktwo, @one, $kone)
      if @one > @two;

   # hash phase, build a hash from (@one, $kone)
   my %hash_one;
   (%hash_one{$_[$kone]} //= []).push($_) for @one;

   # scan phase
   gather for @two -> $record {
      my @record = |$record;
      my $key = @record.splice($ktwo, 1);
      next unless %hash_one{$key}:exists;
      take [($_, @record).flat] for %hash_one{$key}.List;
   }
}

my @player_ages =
        [20, "Alex"  ],
        [28, "Joe"   ],
        [38, "Mike"  ],
        [18, "Alex"  ],
        [25, "David" ],
        [18, "Simon" ],
    ;

my @player_names =
        ["Alex", "Stewart"],
        ["Joe",  "Root"   ],
        ["Mike", "Gatting"],
        ["Joe",  "Blog"   ],
        ["Alex", "Jones"  ],
        ["Simon","Duane"  ],
    ;

.join(', ').say for hash-join(@player_ages, 1, @player_names, 0);

I can definitely feel my Raku teeth growing, because I got most of the changes needed to translate from Perl from the get go. Like using different sigils and the .flat in the right place. Alas… most is not all, because I had to also add the .List for iterating over the array in %hash_one{$key}. All in all I was happy that this was the only change needed!

And yes, I know. There’s a performance hit with gather/take, which is probably something we should not accept while mimicking a database engine. But it’s such a lovely idiom that I can hardly avoid it in these challenges. Heck, these are supposed to be fun!

OK, enough for today… stay safe folks!

ETOOBUSY 🚀 minimal blogging for the impatient

PWC132 - Hash Join

The challenge

The questions

The solution