ETOOBUSY 🚀 minimal blogging for the impatient
Parsing fixed-format lines
TL;DR
Some notes about parsing fixed-format lines in Perl.
In last post Resuming Data::Tubes I discussed a bit about a pipeline to parse a file that had dispositions of data, each composed of a sequence of fixed-format lines (called records).
For these lines, there’s a document (sorry… Italian only!) explaining the meaning of each line type, but the gist is that:
- only a bunch of characters are allowed, encoded either in ASCII or in EBCDIC (this document’s appendix A has it all).
- each line is always exactly 120 bytes/characters long, whose character index starts at 1.
- each line has its own type that always appears in the same position in the field, i.e. positions 2 to 3 (two characters).
- everything that follows is record/line type specific.
Some aggregates of records are grouped together depending on what is being transferred. As an example, a specific type of disposition is composed of 7 records, collectively holding all the info that is needed for the disposition.
In this case, the lines share some additional partial information - i.e. a “disposition local identifier” from character 4 to character 10 - then they restart holding type-specific data.
To address the parsing of these lines, I found it useful to define the mapping of the structure of each record type to the position of the sub-fields I was interested into:
state $parser_for = {
IM => {
data => 's 14-19',
},
10 => {
progressivo => 'n 4-10',
'data-creazione' => 's 11-16',
'data-valuta' => 's 17-22',
causale => 'n 29-33',
'importo-centesimi' => 'n 34-46',
segno => 's 47',
riferimento => 's 58-69',
'conto-creditore' => 'n 80-91',
},
14 => {
progressivo => 'n 4-10',
data => 's 23-28',
importo => 'n 34-46',
segno => 'n 47',
cliente => 's 98-113',
},
...
My parsing instructions are simple: n
means numeric (where I remove
leading 0
characters) and s
means string (where I trim trailing
spaces). A single integer means a single character, otherwise it’s
ranges (usually one, possibly many). Hence, after detecting the record
type, I do the parsing like this:
while (my ($key, $value) = each $parser_for->{$type}->%*) {
my ($ft, @ranges) = split m{\s+}mxs, $value;
$parsed{$key} = '';
for my $range (@ranges) {
my ($start, $stop) = split m{-}mxs, $range;
$stop //= $start;
$start--;
$parsed{$key} .= substr $line, $start, $stop - $start;
}
$parsed{$key} =~ s{\A0+}{}mxs if $ft eq 'n';
$parsed{$key} =~ s{\s+\z}{}mxs if $ft eq 's';
}
The $start
variable is decreased by one to account for the fact that
strings are indexed starting at 0 in Perl. Variable $stop
does not
need this adjustment because it is supposed to point to the position
immediately after the last character to extract with substr
, which it
already does!
So, here it is… my dumb-simple approach to parsing fixed-format lines. What would be yours?