PWC090 - DNA Sequence

TL;DR

Here we are with TASK #1 from the Perl Weekly Challenge #090. Enjoy!

The challenge

DNA is a long, chainlike molecule which has two strands twisted into a double helix. The two strands are made up of simpler molecules called nucleotides. Each nucleotide is composed of one of the four nitrogen-containing nucleobases cytosine (C), guanine (G), adenine (A) and thymine (T).

You are given DNA sequence, GTAAACCCCTTTTCATTTAGACAGATCGACTCCTTATCCATTCTCAGAGATGTGTTGCTGGTCGCCG.

Write a script to print nucleiobase count in the given DNA sequence. Also print the complementary sequence where Thymine (T) on one strand is always facing an adenine (A) and vice versa; guanine (G) is always facing a cytosine (C) and vice versa.

To get the complementary sequence use the following mapping:

T => A
A => T
G => C
C => G

The questions

This challenge left me a bit puzzled for a couple of reasons.

One is that it seems that there’s one fixed input only. Well, I’ll assume that we need to do something general.

The other one is that there’s no clear indication as to the expected output format. Which is in line with the easy-going style of this challenge (and I like it very much, don’t get me wrong!) but a bit at odds with the lack of an example. I can only guess that the author had other things to do, like… ehrm… add my previous contribution to the Perl Weekly Challenge. Blame that on me 😅

The solution

First, a confession.

Forgive me siblings, because I have string-eval-ed.

I mean, there was a beautiful, juicy apple with tr stamped on it, and I couldn’t resist using it. But this led me to eval-ing, in the worst possible form: the string eval.

A bit of background

If you’re puzzled at this point, I owe you an explanation.

The transliteration operator (tr or y for friends) is a handy operator that transforms letters in a string into other letters (or nothing). This is basically what transliteration means.

As an added bonus, in certain cases it also tells us how many transliterations were done.

So… can you see it? We can use this hammer to hit all of our nails:

  • want to count the number of nucleobases? Transliterate them and get the count!
  • want to generate the complementary of the input sequence? Transliterate it according to the mapping!

So… this challenge is really screaming for tr.

The complementary string is easy

Using the transliteration operator for generating the complementary is easy, just provide a sequence of letters to be substituted, and the sequence of substitutes:

my $complementary = $original_sequence =~ tr{ACGT}{TGCA}r;

The syntax means that any single character in sequence ACGT has to be transliterated in the corresponding character in sequence TGCA.

We’re also adding the /r modifier, so that we keep variable $original_sequence unchanged and generate a copy instead, which will end up in $complementary.

This was the easy part, still on the bright side.

Let’s talk about counting those nucleobases…

Counting all occurrences of A is straightforward: we can just substitute all As (deleting them is OK in this case) and collect the number of substitutions by calling tr in scalar context:

my $n_A = $original_sequence =~ tr{A}{}d;

The /d deletes the A - it’s OK for this toy. The scalar context evaluation, as anticipated, provides the number of substitutions back, which is the same as the number of As in the string. Nifty!

Now we have to repeat this for the other nucleobases. Which brings us to a little bump in the path: should we just copy-and-paste the code for $n_A into similar lines for $n_C, $n_G, and $n_T?

This would break the golden rule of refactoring! So we should instead aim for a more generic way of doing this. Something like this:

my %count_for = map {
    $_ => scalar $original_sequence =~ tr{$_}{}d 
} qw< A C G T >;

Easy, right? We iterate over the four different nucleobases, and call tr for each of them. Mission accomplished, let’s chill out now! Beer anyone? A Ginger Ale maybe?

Well… not so fast!

The problem with the transliteration operator

Fact is that our code in the last subsection does not work.

With the transliteration operator, letters are transliterated literally.

Hu?!? Well, I mean that $_ is not considered as the topic variable, but a sequence of two literal characters, i.e. $ followed by _. Which is clearly written in the documentation, by the way:

But there is never any variable interpolation, so $ and @ are always treated as literals.

So we are back to the bloated approach:

my $n_A = $original_sequence =~ tr{A}{}d;
my $n_C = $original_sequence =~ tr{C}{}d;
my $n_G = $original_sequence =~ tr{G}{}d;
my $n_T = $original_sequence =~ tr{T}{}d;

Well… not so fast!

Choose your sin

There are a couple of ways to fix the loop-based approach to this counting problem.

The cleaner way is probably to recognize that the substitution operator helps doing the counting as well, so the following does work indeed:

my %count_for = map {
    $_ => scalar $original_sequence =~ s{$_}{}g
} qw< A C G T >;

Note that we have to specify the /g modifier to take into account all candidates for the substitution.

But heck! What party is this for celebrating the transliteration operator, if we then use another one?!? This is a sin and a shame!

Why not err to the dark side of the Perl hacker?!? Come on… a ssssssmall eval will causssssssse no harm… it’sssssssss there for a purposssssssse…

OK, we can do this instead:

my %count_for = map {
    $_ => eval "scalar \$s =~ tr{$_}{}d"
} qw< A C G T >;

At each iterator, the string is interpolated to generate a valid tr invocation. As an example, if $_ is equal to C, the resulting string is scalar $s =~ tr{C}{}d, which is exactly what we need to do the counting transliteration for nucleobase C.

The eval here is to… evaluate this string within the program. And save tr’s party and birthday present!

The whole thing

In full disclosure of my sins, here is my code for this week’s challenge, so that you can learn from my errors:

#!/usr/bin/env perl
use 5.024;
use warnings;
use experimental qw< postderef signatures >;
no warnings qw< experimental::postderef experimental::signatures >;

sub dna_sequence ($s) {
   my $complementary = $s =~ tr{ACGT}{TGCA}r;
   my %cf = map { $_ => eval "scalar \$s =~ tr{$_}{}d" } qw< A C G T >;
   return (\%cf, $complementary);
}

my $default =
  'GTAAACCCCTTTTCATTTAGACAGATCGACTCCTTATCCATTCTCAGAGATGTGTTGCTGGTCGCCG';
my $sequence = shift || $default;
my ($cf, $complementary) = dna_sequence($sequence);

$|++;
say {*STDERR} $sequence;
say {*STDOUT} $complementary;
say {*STDOUT} "A<$cf->{A}> C<$cf->{C}> G<$cf->{G}> T<$cf->{T}>";

So long… I’ll wave you from hell!


Comments? Octodon, , GitHub, Reddit, or drop me a line!