ETOOBUSY ๐ minimal blogging for the impatient
Cryptopals 3 - Single-byte XOR cipher
TL;DR
After getting our feet wet in the first two challenges, itโs time to do the first gear switching and decrypt some stuff.
As expected, we start soft with something that is just slightly more than a Caesar cipher. In that, once you know how it works, itโs pretty simple to crack: just try every possible rotation out of the 25 possible ones and stop when the text you get makes any sense.
In this case weโre considering a one-character (which is the same as one byte in this series, sorry Unicode!) key that we use for XOR-ing with each character in the ciphertext, but itโs otherwise the same: we apply exactly the same mapping to each character. So here, too, we might just try all the 256 alternatives and stop when it makes sense.
But we live in Artificial Intelligence Machine Learning times, so why not let the computer do the job? One of the most simple ways to do this is to compute a score out of each candidate plaintext and get the one with the best score:
#!/usr/bin/env perl
use v5.24;
use warnings;
use experimental 'signatures';
no warnings 'experimental::signatures';
use List::Util 'sum';
use CryptoPals qw< decode_base16 >;
my $input = '1b37373331363f78151b7f2b783431333d78397828372d363c7837'
. '3e783a393b3736';
my ($decrypted, $score) = single_char_decrypt(decode_base16($input));
say $decrypted;
sub single_char_decrypt ($ciphertext, $scorer = undef) {
# Adjust the scoring function depending on inputs
$scorer //= \&score_by_frequency;
if (ref $scorer eq 'HASH') {
my $weight_for = $scorer;
$scorer = sub { score_by_frequency($_[0], $weight_for) };
}
# Iterate through all possible one-char alternatives. We are assuming
# that one char is one byte here, which basically means English. The
# key for decryption is a sequence of that one character that is the
# same length as the ciphertext, hence $clen here
my $clen = length($ciphertext);
my ($best, $best_score); # keep best scoring solution along the way
for my $ord (0 .. 255) {
my $repeated_key = chr($ord) x $clen;
my $cleartext = $ciphertext ^ $repeated_key;
my $score = $scorer->($cleartext);
($best, $best_score) = ($cleartext, $score)
if !$best_score || $best_score < $score;
} ## end for my $ord (0 .. 255)
# depending on context we also give the score back
return ($best, $best_score // 0) if wantarray;
return $best;
} ## end sub single_char_decrypt
Weโve factored decode_base16
into an external module for ease of
reuse.
The single_char_decrypt
implements the scaffolding for our analysis:
it gets the $ciphertext
and tries out all the possible one-char
(one-byte) alternatives, collecting the best along the way. Scoring is
left to an input $scorer
function, which gets some default but leaves
space for trying out other alternatives.
One of the classical approaches, and most effective too (high gain with
low expense) is to score a candidate plaintext according to the expected
frequency of letters in a language. That is, if the most frequent letter
in English is E
, we would expect that the average text should have
many of them compared to other letters, right?
In our case, then, weโll calculate the score by just adding the score for each letter, assigning the score based on the occurences of the letter from some reference corpus. In our case weโre considering what is available in page English Letter Frequency (based on a sample of 40,000 words) - there are other sources with slightly different numbers, but not too much.
# Simple scoring function, each letter has an associated weight. By default
# we get the English letters frequency, but the hash with the weights for
# each letter can be passed in. Letters will be considered in uppercase.
sub score_by_frequency ($data, $weight_for = undef) {
# Ignore data that contains non-printable non-space chars
return 0 unless $data =~ m{\A [[:print:][:space:]]+ \z}mxs;
state $english_weight_for = {
# https://pi.math.cornell.edu/~mec/2003-2004/cryptography/subs/frequencies.html
E => 21912, # ETAOIN SRHDLU joke...
T => 16587,
A => 14810,
O => 14003,
I => 13318,
N => 12666,
S => 11450,
R => 10977,
H => 10795,
D => 7874,
L => 7253,
U => 5246,
C => 4943,
M => 4761,
F => 4200,
Y => 3853,
W => 3819,
G => 3693,
P => 3316,
B => 2715,
V => 2019,
K => 1257,
X => 315,
Q => 205,
J => 188,
Z => 128,
# set space the same as the most frequent letter, many words are
# followed by a space
' ' => 21912,
};
$weight_for //= $english_weight_for;
return sum map { $weight_for->{uc $_} // 0 } split m{}mxs, $data;
} ## end sub score_as_english_text ($data)
As indicated, weโre expecting fully printable texts here, so we toss
away whatever has non-printable characters. Interestingly, POSIX says
that class :print:
does not contain spaces, even though they seem
pretty printable to me. Whatever.
This calculation operation reminded me of a scalar product, although Iโm not sure I can express why. Not that it makes a real difference though.
One last thing, texts in English do not only contain characters, but
spaces and punctuation too. I decided to ignore punctuation, but spaces
appear frequently so I decided to give them the same frequency of the
most frequent letter (E
in the English case). The rationale is that
many words are followed by a space, so why not.
The example above worksโฆ but Iโll not print the result and spoiler it!
Stay safe and secure!