ElephantAgent
the agent that never forgets


NAME

ElephantAgent - the agent that never forgets


DESCRIPTION

This is the robot agent that never forgets. One of the major advantages of the original MOMspider link checker was that it didn't need to keep checking robots.txt files every time it was started. This agent does the same by using a disk cache of hosts and status.

Why bother, rather than just using a caching server? Because this agent knows when the robots.txt needs to be fetched again.

host format

a host keeps this state

  last        - time of last visit
  count       - number of visits
  last_robot  - time of last robot check
  robot_stat  - robot status (exclude, open, controled)
  robots_txt  - robot file

This has to be implemented as a complete rewrite of LWP::RobotUA, because RobotUA assumes multi-level hashes (which does not work with MLDBM) and because all of its code directly accesses the contents of its own hash.
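As a rough sketch of the per-host record and of the fetch, modify, store pattern that a flat disk cache tends to force, the following assumes records are serialised with the core Storable module. The plain hash %cache only stands in for a tied DBM file, and the helper names new_host_record and record_visit are hypothetical; none of this is taken from the ElephantUA code itself.

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

# Stand-in for a tied DBM file: one flat string value per host key.
my %cache;

# Hypothetical helper: a fresh record with the fields listed above.
sub new_host_record {
    return {
        last       => 0,        # time of last visit
        count      => 0,        # number of visits
        last_robot => 0,        # time of last robots.txt check
        robot_stat => 'open',   # exclude, open or controled
        robots_txt => '',       # cached robots.txt text
    };
}

# Hypothetical helper: nested values are not changed in place.
# Fetch the whole record, modify the copy, then store it back.
sub record_visit {
    my ($host, $now) = @_;
    my $rec = exists $cache{$host} ? thaw($cache{$host}) : new_host_record();
    $rec->{last} = $now;
    $rec->{count}++;
    $cache{$host} = freeze($rec);
    return $rec;
}

record_visit('www.example.com', time) for 1 .. 3;
my $rec = thaw($cache{'www.example.com'});
print "visits: $rec->{count}\n";   # prints "visits: 3"
```

Keeping each host under a single flat key means the agent never relies on nested hash updates reaching the disk, which is the constraint described above.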

It is possible that people will be running several different user agents in one program (why?) but then wish to share robot exclusion info between them.

decision making

There are many decisions to make:

  should I cache this robots.txt?
    only if relatively short, or if we use this site often

  should I recheck a robots.txt?
    yes if more than $max_hits to the site
    yes if more than $max_time since last check
    yes if more than $max_size from site

  $max_size = 1000 * $robots_txt_size
  $max_hits = 1000
  $max_time = three_weeks

(we should generally use HEAD for re-checking)
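The recheck rules above could be sketched roughly as follows. The thresholds come from the notes; the function name should_recheck and the record fields hits_since_check and bytes_since_check are hypothetical, and the sketch assumes that hits and bytes are counted since the last robots.txt check, which the notes do not state.

```perl
use strict;
use warnings;

# Thresholds from the notes above.
my $max_hits = 1000;
my $max_time = 3 * 7 * 24 * 60 * 60;    # three weeks, in seconds

# Hypothetical helper. $host is a hash reference with the assumed fields
# hits_since_check, bytes_since_check, last_robot and robots_txt.
sub should_recheck {
    my ($host, $now) = @_;

    # $max_size = 1000 * $robots_txt_size; an empty cached file gives 0,
    # so any traffic at all then triggers a recheck.
    my $max_size = 1000 * length($host->{robots_txt});

    return 1 if $host->{hits_since_check} > $max_hits;
    return 1 if $now - $host->{last_robot} > $max_time;
    return 1 if $host->{bytes_since_check} > $max_size;
    return 0;
}

my %host = (
    hits_since_check  => 12,
    bytes_since_check => 4_000,
    last_robot        => time() - 4 * 7 * 24 * 60 * 60,   # four weeks ago
    robots_txt        => "User-agent: *\nDisallow: /private/\n",
);
print should_recheck(\%host, time()) ? "recheck\n" : "use cached copy\n";
```

In this usage example only the age rule fires, since the last check was four weeks ago, so the sketch asks for a recheck.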

  package LWP::ElephantUA;

  $REVISION = q$Revision: 1.3 $;
  $VERSION  = sprintf("%d.%02d", $REVISION =~ /(\d+)\.(\d+)/);

  require LWP::UserAgent;
  @ISA = qw(LWP::UserAgent);

  require WWW::RobotRules;
  require HTTP::Request;
  require HTTP::Response;

  use Carp ();
  use LWP::Debug ();
  use HTTP::Status ();
  use HTTP::Date qw(time2str);


NAME

LWP::RobotUA - A class for Web Robots


SYNOPSIS


  require LWP::RobotUA;

  $ua = new LWP::RobotUA 'my-robot/0.1', 'me@foo.com';

  $ua->delay(10);  # be very nice, go slowly

  ...

  # use it just like a normal LWP::UserAgent

  $res = $ua->request($req);


DESCRIPTION

This class implements a user agent that is suitable for robot applications. Robots should be nice to the servers they visit. They should consult the /robots.txt file to ensure that they are welcome, and they should not send requests too frequently.

But before you consider writing a robot, take a look at <URL:http://info.webcrawler.com/mak/projects/robots/robots.html>.

When you use an LWP::RobotUA as your user agent, you do not really have to think about these things yourself. Just send requests as you would with a normal LWP::UserAgent, and this special agent will make sure you are nice.


SEE ALSO

the LWP::UserAgent manpage


METHODS

LWP::RobotUA is a sub-class of LWP::UserAgent and implements the same methods. The use_alarm() method also decides whether the agent will wait if a request is tried too early (if true), or will return an error response (if false).

In addition, these methods are provided:

$ua = LWP::RobotUA->new($agent_name, $from)

The constructor requires a name for the robot and the mail address of the human running it. The name can be changed later through the agent() method. The mail address can be changed with the from() method.

$ua->delay([$minutes])

Set the minimum delay between requests to the same server. The default is 1 minute.

$ua->host_count($hostname)

Returns the number of documents fetched from this server host.

$ua->host_wait($hostname)

Returns the number of seconds you must wait before you can make a new request to this host.

$ua->as_string

Returns a string that describes the state of the UA. Mainly useful for debugging.


AUTHOR

Gisle Aas <aas@sn.no>
