Feb 14, 2014

Perl: extract URLs from a file

As developers, we quite often need to extract URLs from an input file in our day-to-day work.
Let us automate this by extracting URLs from a text or HTML file with a small Perl snippet.

What does the script do?
The script takes an input file as a command-line argument.
It reads the file line by line, extracts the HTTP URLs, and pushes them into an array.
Finally, it prints the array with Data::Dumper.

How to call the script?
perl get_urls_from_input_file.pl "C:\Files_To_Read\blog_html.txt"
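
As an illustration, if blog_html.txt contained an anchor tag such as <a href="http://www.example.com/index.html"> (a made-up example, not from the original post), the script would print something along these lines:

$VAR1 = [
          'http://www.example.com/index.html'
        ];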


get_urls_from_input_file.pl

use strict;
use warnings;
use Data::Dumper;

# Percent-encoded sequences to strip from matched URLs.
# Note: this hash is declared but not applied in this basic version
# (see the cleanup sketch below the script).
my %substitute = ( '%20' => '', '%3A' => '', '%24' => '' );
my @result_urls;

#perl get_urls_from_input_file.pl 'C:\Files_To_Read\blog_html.txt'

my $infile = $ARGV[0]
    or die "Usage: perl get_urls_from_input_file.pl <input_file>\n";

open(my $fh, '<', $infile) or die "Could not open '$infile': $!";
while ( my $each_line = <$fh> ) {
    chomp $each_line;

    # Capture every quoted http:// URL on the line, not just the first one.
    while ( $each_line =~ /['"](http:\/\/.*?)['"]/g ) {
        push @result_urls, $1;
    }
}
close $fh;

print "\n" . Dumper(\@result_urls);

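The %substitute hash declared at the top is not used anywhere in the script above. If you also want to strip those percent-encoded sequences from each matched URL, a minimal cleanup step could look like the sketch below; the clean_url helper is an assumption of mine, not part of the original script.

# Sketch: strip the percent-encoded sequences listed in %substitute
# from a URL before storing it.
sub clean_url {
    my ($url) = @_;
    for my $encoded ( keys %substitute ) {
        $url =~ s/\Q$encoded\E/$substitute{$encoded}/g;
    }
    return $url;
}

# Inside the while loop, replace the push with:
#     push @result_urls, clean_url($1);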
