


NAME

WWW::SimpleRobot - a simple web robot for recursively following links on web pages.


SYNOPSIS


    use WWW::SimpleRobot;

    my $robot = WWW::SimpleRobot->new(
        URLS            => [ 'http://www.perl.org/' ],
        FOLLOW_REGEX    => "^http://www.perl.org/",
        DEPTH           => 1,
        TRAVERSAL       => 'depth',
        VISIT_CALLBACK  =>
            sub {
                my ( $url, $depth, $html, $links ) = @_;
                print STDERR "Visiting $url\n";
                print STDERR "Depth = $depth\n";
                print STDERR "HTML = $html\n";
                print STDERR "Links = @$links\n";
            },
        BROKEN_LINK_CALLBACK  =>
            sub {
                my ( $url, $linked_from, $depth ) = @_;
                print STDERR "$url looks like a broken link on $linked_from\n";
                print STDERR "Depth = $depth\n";
            },
    );

    $robot->traverse;

    my @urls = @{$robot->urls};
    my @pages = @{$robot->pages};

    for my $page ( @pages )
    {
        my $url = $page->{url};
        my $depth = $page->{depth};
        my $modification_time = $page->{modification_time};
    }
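
Continuing from the SYNOPSIS above, the gathered page data can be used directly. This short sketch (assuming only the pages accessor and the hash keys shown there) prints a crawl report sorted by depth:

    # Sketch: report on the pages gathered by $robot->traverse above.
    # Uses only the pages accessor and hash keys shown in the SYNOPSIS.
    use POSIX qw( strftime );

    for my $page ( sort { $a->{depth} <=> $b->{depth} } @{$robot->pages} )
    {
        my $when = defined $page->{modification_time}
            ? strftime( "%Y-%m-%d %H:%M", localtime( $page->{modification_time} ) )
            : 'unknown';
        printf "depth %d  modified %s  %s\n", $page->{depth}, $when, $page->{url};
    }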


DESCRIPTION


    A simple Perl module for recursively following links on web pages. For a
    more elaborate interface, see WWW::Robot. This module uses LWP::Simple to
    fetch pages and HTML::LinkExtor to extract the links from them. Only the
    href attributes of anchor tags are extracted. Extracted links are checked
    against the FOLLOW_REGEX regex to see if they should be followed. A HEAD
    request is made to each of these links, to check that they are
    'text/html' type pages.
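
The steps described above can be pictured with a stand-alone sketch. This is an illustration of the approach, not the module's own source; the URL and regex are placeholders:

    # Sketch: fetch a page, extract anchor hrefs, filter them with a regex,
    # and keep only links whose HEAD request reports 'text/html'.
    use LWP::Simple;
    use HTML::LinkExtor;

    my $start_url    = 'http://www.perl.org/';
    my $follow_regex = qr{^http://www\.perl\.org/};

    my $html = get( $start_url ) or die "can't fetch $start_url\n";
    my $extor = HTML::LinkExtor->new( undef, $start_url );  # absolutize hrefs
    $extor->parse( $html );

    for my $link ( $extor->links )
    {
        my ( $tag, %attr ) = @$link;
        next unless $tag eq 'a' and defined $attr{href};
        my $href = "$attr{href}";
        next unless $href =~ $follow_regex;
        my ( $content_type ) = head( $href );
        next unless $content_type and $content_type =~ m{^text/html};
        print "would follow $href\n";
    }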


BUGS


    This robot doesn't respect the Robot Exclusion Protocol
    (http://info.webcrawler.com/mak/projects/robots/norobots.html) (naughty
    robot!), and doesn't do any exception handling if it can't get pages - it
    just ignores them and goes on to the next page!
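
One way to work around the first of these is to filter candidate start URLs against a site's robots.txt yourself before handing them to the robot, for example with WWW::RobotRules from libwww-perl. The robot name and URLs in this sketch are placeholders:

    # Sketch: honour robots.txt before calling the robot.
    use LWP::Simple;
    use WWW::RobotRules;

    my $rules      = WWW::RobotRules->new( 'MySimpleRobot/0.1' );
    my $robots_url = 'http://www.perl.org/robots.txt';
    if ( defined( my $robots_txt = get( $robots_url ) ) )
    {
        $rules->parse( $robots_url, $robots_txt );
    }

    my @start_urls = grep { $rules->allowed( $_ ) } 'http://www.perl.org/';

    # @start_urls can then be passed as the URLS option to
    # WWW::SimpleRobot->new.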


AUTHOR

Ave Wrigley <Ave.Wrigley@itn.co.uk>


COPYRIGHT

Copyright (c) 2001 Ave Wrigley. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
