Extraction of TLD from urls and sorting domains and subdomains for each TLD file

I have a list of million urls. I need to extract the TLD for each url and create multiple files for each TLD. For example collect all urls with .com as tld and dump that in 1 file, another file for .edu tld and so on. Further within each file, I have to sort it alphabetically by domains and then by subdomains etc.

Can anyone give me a head start for implementing this in perl?

Answers


  1. Use URI to parse the URL,
  2. Use its host method to get the host,
  3. Use Domain::PublicSuffix's get_root_domain to parse the host name.
  4. Use the tld or suffix method to get the real TLD or the pseudo TLD.

use feature qw( say );

use Domain::PublicSuffix qw( );
use URI                  qw( );

my $dps = Domain::PublicSuffix->new();

for (qw(
   http://www.google.com/
   http://www.google.co.uk/
)) {
   my $url = $_;

   # Treat relative URLs as absolute URLs with missing http://.
   $url = "http://$url" if $url !~ /^\w+:/;

   my $host = URI->new($url)->host();
   $host =~ s/\.\z//;  # D::PS doesn't handle "domain.com.".

   $dps->get_root_domain($host)
      or die $dps->error();

   say $dps->tld();     # com  uk
   say $dps->suffix();  # com  co.uk
}

Need Your Help

JsonConverter CanConvert does not receive type

c# json.net

I have a custom JsonConverter, which doesn't seem to be called correctly. I have created the converter, added it to the JsonSerializerSettings.Converters collection and marked the property on the e...

Jquery: rebind function on A link click

javascript jquery function bind unbind

I have a menu that shows on click (similar to the fb mobile menu). When you click anywhere on the viewport, the menu hides again and the link is disabled (i want to unbind the link because the a li...

About UNIX Resources Network

Original, collect and organize Developers related documents, information and materials, contains jQuery, Html, CSS, MySQL, .NET, ASP.NET, SQL, objective-c, iPhone, Ruby on Rails, C, SQL Server, Ruby, Arrays, Regex, ASP.NET MVC, WPF, XML, Ajax, DataBase, and so on.