PWC098 - Read N-characters

TL;DR

A digression on TASK #1 from the Perl Weekly Challenge #098. Enjoy!

This post is probably longer than it should be; skip to the solution if you want to spare yourself a lot of rambling.

The challenge

You are given file $FILE. Create subroutine readN($FILE, $number) returns the first n-characters and moves the pointer to the (n+1)th character.

The questions

The first question this time was for myself: did you read it correctly? I initially thought that this challenge was a piece of cake, at least for any language with the notion of a filehandle (or an equivalent of it).

Then yes, I realized that the input across the calls is a file name, not a file handle. Which changes the game completely: it will be up to the function to track the state.

Which (at long last!) brings me to the questions that should probably pop up in an interview:

  • How to deal with errors?
    • We will assume that throwing an exception is fine here.
  • Is the file supposed to grow in time?
    • We will assume that it might.
  • What to do when we hit the end of file?
    • We will return as much data as possible, possibly an empty string.
  • What do we mean by character exactly? Is there any specific encoding we should take care of?
    • We will assume that characters are the same as bytes… how convenient!

The solution

As already anticipated in the questions, we will have to keep track of where we are in each file. This is something that is usually done for us by Perl by means of the filehandle abstraction, which is also way more powerful as it allows for the same file to be opened (and tracked) multiple times.

But I’m digressing.

To keep state, since version 5.10 we can leverage the state variable declarator:

sub readN ($FILE, $number) {
   state $at = {};
   # ...    
}
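As a minimal sketch of how a state variable behaves, the hypothetical helper below (the name seen is made up for this illustration) counts how many times each key shows up. The hash reference is initialized only once and survives across calls, which is exactly the property readN needs for its offsets.

```perl
use 5.024;
use warnings;
use experimental qw< signatures >;

# Hypothetical helper, not part of the challenge: counts calls per key.
sub seen ($key) {
   state $count = {};   # initialized on the first call only
   return ++$count->{$key};
}

say seen('foo');   # 1
say seen('foo');   # 2, the hash remembers the previous call
say seen('bar');   # 1, each key has its own counter
```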

If you really want to go a long way back in time and avoid this (why would you want to do this?!?) you will of course have to ditch the function signatures and use a closed-over variable instead, like this:

my $at = {}; # it might be a full-fledged hash instead
sub readN { my ($FILE, $number) = @_;
   # ...
}

This is suboptimal though, because the variable is then leaked to the rest of the program scope. Which usually means that it’s better to protect the whole thing with a block:

{
   my $at = {}; # it might be a full-fledged hash instead
   sub readN { my ($FILE, $number) = @_;
      # ...
   }
}

But I’m digressing.

To track each file’s state, we might do several things, like:

  • keep a filehandle for each file and use it when it is necessary. If no filehandle is present, then a new one will be opened;
  • keep the character number where we are supposed to read next.

The first approach has the advantage of being conceptually simple, as well as calling the open function only once per file. Is this of any help? I honestly don’t know, but I’d say no. Moreover, re-opening the file over and over does not really hurt readability, so there is no harm in not using this approach.

On the other hand… the first approach will keep a file handle allocated for each file, and file handles can generally be considered a scarce resource. There’s a limit on how many you can have, although it can be adjusted. Hence, if we have a solution that can give us the same level of clarity and allow us to spare resources, we might as well go for it.

Which, arguably, might not really be the point here, because we are discussing a challenge. Unearthing these considerations might help understand the problem better in a real-life situation, anyway, so if premature optimization is the root of all evil, blindly disregarding performance as a topic is equally evil in my opinion.
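For comparison, here is a rough sketch of the first approach, the one that is not used in this post: one cached filehandle per file name. The name readN_cached and the throw-away temporary file are made up for the illustration. It relies on sysread, so that the operating system keeps the position and no offset has to be stored.

```perl
use 5.024;
use warnings;
use experimental qw< signatures >;
use Fcntl qw< O_RDONLY >;
use File::Temp qw< tempfile >;

# Hypothetical name, same interface as readN: one cached handle per file.
sub readN_cached ($FILE, $number) {
   state $fh_for = {};   # file name => open filehandle
   my $fh = $fh_for->{$FILE} //= do {
      sysopen my $h, $FILE, O_RDONLY or die "sysopen('$FILE'): $!\n";
      $h;
   };
   my $retval = '';
   my $n = sysread $fh, $retval, $number;
   die "sysread('$FILE'): $!\n" if ! defined $n;
   return $retval;   # the handle itself remembers where we stopped
}

# Throw-away input file, just for the demonstration.
my ($tmp, $name) = tempfile(UNLINK => 1);
print {$tmp} 'abcdefghij';
close $tmp or die "close('$name'): $!\n";

say readN_cached($name, 4);   # abcd
say readN_cached($name, 3);   # efg
```

The price is that every file ever passed to the function keeps a handle open until the program ends.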

But I’m digressing.

Reading data from a file is usually done through the open/read/close triad in a normal Perl enthusiast’s day. Well, actually through readline (in that normal day).

This time, anyway, it’s better to go a bit lower level, e.g. to avoid buffering of inputs. For this reason, we’ll resort to the less-used sysopen/sysread/close triad to go straight to the facilities provided by the operating system.

This approach, though, has its consequences. We are assuming that bytes are the same as characters here, because there seems to be no indication of how to reliably infer the encoding from the file. Were we to support e.g. UTF-8 encoding, then reading $number characters would be a (generally) different beast, and we would have to either figure out how to use the facilities that Perl comes with, or re-implement the decoding at some level.
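Just to give an idea of the different beast, here is a sketch of a character-aware variant, under the loud assumption that the file holds valid UTF-8 (the challenge says nothing about it). The name readN_utf8 is made up. It uses the buffered open/seek/read functions with an :encoding(UTF-8) layer, so read counts characters, while the stored offset is still in bytes and is advanced by the encoded size of the chunk.

```perl
use 5.024;
use warnings;
use experimental qw< signatures >;
use Encode qw< encode >;
use Fcntl qw< SEEK_SET >;
use File::Temp qw< tempfile >;

# Hypothetical variant of readN, assuming valid UTF-8 input.
sub readN_utf8 ($FILE, $number) {
   state $at = {};   # byte offset to resume from, keyed by file name
   open my $fh, '<:encoding(UTF-8)', $FILE or die "open('$FILE'): $!\n";
   seek $fh, $at->{$FILE} // 0, SEEK_SET or die "seek('$FILE'): $!\n";
   my $retval = '';
   my $n = read $fh, $retval, $number;   # counts characters on this handle
   die "read('$FILE'): $!\n" if ! defined $n;
   close $fh or die "close('$FILE'): $!\n";
   # seek works in bytes, so advance by the encoded size of the chunk
   $at->{$FILE} += length(encode('UTF-8', $retval));
   return $retval;
}

# Throw-away input file with a couple of two-byte characters.
my ($tmp, $name) = tempfile(UNLINK => 1);
binmode $tmp, ':encoding(UTF-8)';
print {$tmp} "h\x{e9}llo w\x{f6}rld";
close $tmp or die "close('$name'): $!\n";

binmode STDOUT, ':encoding(UTF-8)';
say readN_utf8($name, 3);   # three characters, four bytes
say readN_utf8($name, 3);   # resumes right after them
```

With malformed input the re-encoded size might not match the bytes actually consumed, so this is an illustration more than a robust solution.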

But I’m digressing.

We decided to keep the “current” location in the file as an integer offset from the start of the file. For this, we rely on another function that I rarely use, that is sysseek.

Both sysopen and sysseek require integer values to pass additional options, e.g. the opening mode or where to start seeking from. For this reason, it’s useful to use the Fcntl module and import a couple of constants to help us refer to the right values instead of hardwiring magic values in the code:

use Fcntl qw< O_RDONLY SEEK_SET >;

It’s a low-level interface, right? I can imagine that the interface might have been changed to accept names/symbols instead of integer values, but I guess it has been kept this way for a couple of reasons:

  • in a low-level interface, you use low-level tools and the common reference is language C;
  • if we have to use a low-level interface, automating this part is probably the least of our problems.

For this reason, the Fcntl approach of providing constants at the expense of a simple additional use statement is fair.
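A small sketch of these constants at work, on a throw-away temporary file created just for the illustration. It also shows a detail of sysseek that is easy to miss: it returns the new position, and position zero comes back as the string "0 but true", so that it can still be tested for success.

```perl
use 5.024;
use warnings;
use Fcntl qw< O_RDONLY SEEK_SET >;
use File::Temp qw< tempfile >;

# Throw-away input file, just for the demonstration.
my ($tmp, $name) = tempfile(UNLINK => 1);
print {$tmp} 'abcdefghij';
close $tmp or die "close('$name'): $!\n";

sysopen my $fh, $name, O_RDONLY or die "sysopen('$name'): $!\n";

# Jump to byte 4, counting from the start of the file.
my $pos = sysseek $fh, 4, SEEK_SET;
defined $pos or die "sysseek('$name'): $!\n";

my $buf = '';
defined(sysread $fh, $buf, 3) or die "sysread('$name'): $!\n";
say "read '$buf' starting at offset $pos";   # read 'efg' starting at offset 4

# Going back to the start yields a true value that is numerically zero.
my $zero = sysseek $fh, 0, SEEK_SET;
say "position zero is reported as '$zero'";

close $fh or die "close('$name'): $!\n";
```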

But I’m digressing.

If you made it this far without hiring a hitman to kill me 😅, here’s my solution to this week’s challenge:

sub readN ($FILE, $number) {
   state $at = {};   # byte offset to resume from, keyed by file name
   sysopen my $fh, $FILE, O_RDONLY or die "sysopen('$FILE'): $!\n";
   sysseek $fh, $at->{$FILE} // 0, SEEK_SET
      or die "sysseek('$FILE'): $!\n";
   my $retval = '';
   my $n = sysread $fh, $retval, $number;
   # check sysread before close, so that $! still refers to the read
   die "sysread($FILE) \@$number: $!\n"  if ! defined $n;
   close $fh or die "close('$FILE'): $!\n";
   $at->{$FILE} += $n;
   return $retval;
}

The file is opened, read and closed at every call as anticipated.

I have to admit that I added a check on the close too, which is something that I rarely do. Not so much because I’m lazy (although I am indeed lazy), but because I’m always dubious about what an error in close should mean for my program. If the sysread was fine… who cares? So, this line is actually more a cargo-cult bow to purity than something that I truly mean/understand in this case.

But I’m digressing. For the last time, I swear! 😄

The whole program, as usual. We default to the program’s file as the input:

#!/usr/bin/env perl
use 5.024;
use warnings;
use experimental qw< postderef signatures >;
no warnings qw< experimental::postderef experimental::signatures >;
use Fcntl qw< O_RDONLY SEEK_SET >;

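sub readN ($FILE, $number) {
   state $at = {};   # byte offset to resume from, keyed by file name
   sysopen my $fh, $FILE, O_RDONLY or die "sysopen('$FILE'): $!\n";
   sysseek $fh, $at->{$FILE} // 0, SEEK_SET
      or die "sysseek('$FILE'): $!\n";
   my $retval = '';
   my $n = sysread $fh, $retval, $number;
   # check sysread before close, so that $! still refers to the read
   die "sysread($FILE) \@$number: $!\n"  if ! defined $n;
   close $fh or die "close('$FILE'): $!\n";
   $at->{$FILE} += $n;
   return $retval;
}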

my $highlight = "\e[1;97;45m"; my $reset = "\e[0m";
my $file = shift || __FILE__;
my @numbers = @ARGV ? @ARGV : qw< 4 5 2 >;
for my $n (@numbers) {
   my $chunk = readN($file, $n);
   say "got $n: $highlight$chunk$reset";
}
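As a quick check of the assumptions in the questions section, the sketch below repeats readN so that it can run alone, then exercises it on a throw-away temporary file. The append_to helper is made up for this illustration. It covers a short read near the end, an empty string at end of file, and new data showing up after the file grows.

```perl
use 5.024;
use warnings;
use experimental qw< signatures >;
use Fcntl qw< O_RDONLY SEEK_SET >;
use File::Temp qw< tempfile >;

sub readN ($FILE, $number) {
   state $at = {};
   sysopen my $fh, $FILE, O_RDONLY or die "sysopen('$FILE'): $!\n";
   sysseek $fh, $at->{$FILE} // 0, SEEK_SET;
   my $retval = '';
   my $n = sysread $fh, $retval, $number;
   die "sysread($FILE) \@$number: $!\n"  if ! defined $n;
   close $fh or die "close('$FILE'): $!\n";
   $at->{$FILE} += $n;
   return $retval;
}

# Hypothetical helper: append some text, closing the file every time.
sub append_to ($name, $text) {
   open my $out, '>>', $name or die "open('$name'): $!\n";
   print {$out} $text;
   close $out or die "close('$name'): $!\n";
}

my ($tmp, $name) = tempfile(UNLINK => 1);
close $tmp or die "close('$name'): $!\n";
append_to($name, 'abcdef');

say '[', readN($name, 4),  ']';   # [abcd]
say '[', readN($name, 10), ']';   # [ef], as much as is available
say '[', readN($name, 10), ']';   # [], empty string at end of file
append_to($name, 'ghi');          # the file grows...
say '[', readN($name, 10), ']';   # [ghi], ...and we pick up from there
```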

Stay safe!

