Extract links/images from files or URLs

TL;DR

If you need to squeeze all link/image URLs out of HTML files or URLs, look no further. It's quick'n'dirty but should serve most needs.

#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;
use Mojo::File;
use Mojo::URL;
use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new->max_redirects(10);
for my $input (@ARGV) {
    my ($dom, $base);
    if ($input =~ m{\A https?:// }imxs) {   # remote URL: download it
        my $tx = $ua->get($input);
        $base = $tx->req->url;              # final URL, after redirects
        $dom  = $tx->result->dom;
    }
    else {                                  # local file: slurp and parse as HTML
        $dom  = Mojo::DOM->new(Mojo::File->new($input)->slurp);
        $base = $ENV{XLINX_BASE};           # optional base for relative URLs
        $base = Mojo::URL->new($base) if defined $base;
    }
    $dom->find('a[href],img[src]')->each(
        sub {
            my $l = $_[0]->attr(lc($_[0]->tag) eq 'a' ? 'href' : 'src');
            say $base ? Mojo::URL->new($l)->to_abs($base)->to_string : $l;
        }
    );
}


This is a quick'n'dirty way of extracting all links (i.e. href attributes of a tags) and images (i.e. src attributes of img tags) out of a list of local files (interpreted as HTML) or URLs (downloaded on the fly). It leverages Mojolicious.
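
For example, assuming the script is saved as xlinx (a name only hinted at by the XLINX_BASE environment variable it honors) and made executable, it accepts any mix of files and URLs on the command line:

./xlinx https://www.example.com/ some-page.html

Output is one URL per line, so it pipes nicely into grep, sort, and friends.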

It's very bare-bones, although it does resolve relative URLs against a base: the final request URL for downloaded pages, or the optional XLINX_BASE environment variable for local files (without it, relative URLs in local files are printed as-is). It should be a good starting point though.
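
So, again assuming the script is called xlinx, a local file can be given a base like this:

XLINX_BASE=https://www.example.com/docs/ ./xlinx saved-page.html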


Comments? Octodon, GitHub, Reddit, or drop me a line!