Counting tags in this blog

TL;DR

Where I need to cleanup tags in this blog.

Looking through some posts, I’ve been hit by the suspect that I might need to do some cleanup in the tags. One example is that I’ve used command-line and command line - shouldn’t I use only one?

Perl to the rescue

The following Perl script helped me out:

 1 #!/usr/bin/env perl
 2 use strict;
 3 use warnings;
 4 use autodie;
 5 
 6 my %count_for;
 7 for my $filename (@ARGV) {
 8    open my $fh, '<', $filename;
 9    while (<$fh>) {
10       my ($tags) = m{\A tags: \s* \[ (.*?) \]}mxs or next;
11       for (split m{,}mxs, $tags) {
12          (my $tag = $_) =~ s{\A\s+|\s+\z}{}gmxs;
13          $count_for{$tag}++;
14       }
15       last;
16    }
17 }
18 
19 print "$_: $count_for{$_}\n" for sort {$a cmp $b} keys %count_for;

After some boilerplate (lines 1 to 4) we get into the real action. Let’s just note that module autodie helps us forget about checking the outcome of opening files (line 8), reading them, or closing them (which does not even appear here, but is performed by Perl behind the scenes).

The strategy is straightforward: first let’s go through all the input files, counting the occurrence of all tags (lines 6 to 17), then print out these counts sorting the tags lexicographically (line 19), so that we can easily spot similar tags with different spellings.

The first loop (line 7) iterates over all input files; this means that our script should be called like this:

$ perl count-tags.pl _posts/*.md
# ...

Each file is opened (line 8) and read line by line, until we hit the line with the tags (which means that we get past line 10). These tags are collected in variable $tags, which we split by commas iterating over the results (line 11), each of which is first trimmed (line 12) and then used to increment the corresponding count in hash %count_for (line 13).

Printing is easy: we iterate over the sorted keys of hash %count_for, and print the key and the corresponding count, one per line (all in line 19).

Does it work?

I daresay… yes, I spotted the following candidates for some cleanup:

command line: 2
command-line: 1
...
math: 1
maths: 19
...
print and play: 1
print-and-play: 1

Looking at the other tags, I think I’ll stick to the space-separated alternatives!

Interested?

The code can be easily copied from this blog post, or downloaded from a local version. Enjoy!


Comments? Octodon, , GitHub, Reddit, or drop me a line!