Search::Circa::Parser
provide functions to parse HTML pages by Circa


NAME

Search::Circa::Parser - provide functions to parse HTML pages by Circa


SYNOPSIS


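      use Search::Circa::Indexer;

      my $index = Search::Circa::Indexer->new;

      # connect() arguments are elided in the original documentation
      $index->connect(...);

      # $url is the page to index, $account the id of its account
      $index->Parser->look_at({ url => $url,
                                idr => $account });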


DESCRIPTION

This module uses HTML::Parser facilities. It is called by Search::Circa::Indexer to index each document. The main method is look_at.


Public Class Interface

new Search::Circa::Indexer object
Create a new Search::Circa::Parser object that carries the properties of the given indexer instance.

look_at refHash
Index a URL. The work done is:
  • Test whether the URL is valid. Return -1 otherwise.

  • Fetch the page and add each word found, using the weight set in the constructor.

  • If the maximum link depth has not been reached, add each link found for the next indexing pass.

Keys for refHashParameters:

url
URL to read

idc
Id of the URL in the links table

idr
Id of the account the URL belongs to

lastModif
(optional) If this parameter is set, Circa does no work on this page when the page is older than the given date.

url_local
(optional) Local URL used to reach the file

categorieAuto
(optional) If $categorieAuto is true, Circa creates and sets the category of the URL from its directory path. For example, http://www.alianwebserver.com/societe/stvalentin/index.html creates the category Societe / StValentin and assigns it to this URL. If $categorieAuto is false, $categorie is used.

niveau
(optional) Depth of the current link.

categorie
(optional) See $categorieAuto.

Returns (-1, 0) if the URL is not valid; otherwise returns the number of words and the number of links found.
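
The categorieAuto behaviour described above can be sketched as follows. This is only an illustration under stated assumptions: category_from_url is a hypothetical helper, not part of Search::Circa::Parser, and it capitalises each directory name with ucfirst. That rule is a guess; it yields "Societe / Stvalentin" for the documented URL, so the module's real naming rule (which produces "StValentin") is evidently different.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Search::Circa::Parser.
# Derives a category label from the directory part of a URL.
sub category_from_url {
    my ($url) = @_;

    # Drop the scheme and host, keeping only the path.
    (my $path = $url) =~ s{^[a-z][a-z0-9+.-]*://[^/]*}{}i;

    # Keep the directory components; drop the file name if there is one.
    my @dirs = grep { length } split m{/}, $path;
    pop @dirs if @dirs && $path !~ m{/$};

    # Assumption: each component is capitalised with ucfirst.
    return join ' / ', map { ucfirst($_) } @dirs;
}

print category_from_url(
    'http://www.alianwebserver.com/societe/stvalentin/index.html'), "\n";
```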

set_agent local
Set the user agent for the Circa robot. If local is 0 or $self->{ConfigMoteur}->{'temporate'} == 0, LWP::UserAgent is used; otherwise LWP::RobotUA is used.
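
The selection rule can be written out as a small sketch. agent_class is a hypothetical helper that only returns the name of the class the documentation says would be chosen; it does not load LWP, and it treats any false value (not just 0) as 0, which is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Search::Circa::Parser.
# $local is the set_agent argument, $temporate stands in for
# $self->{ConfigMoteur}->{'temporate'}.
sub agent_class {
    my ($local, $temporate) = @_;

    # Either flag being 0 (or any false value) selects the plain agent.
    return 'LWP::UserAgent' if !$local || !$temporate;

    # Otherwise the robot agent is selected.
    return 'LWP::RobotUA';
}

print agent_class(0, 1), "\n";   # local is 0
print agent_class(1, 0), "\n";   # temporate is 0
print agent_class(1, 1), "\n";   # neither is 0
```

LWP::RobotUA is a subclass of LWP::UserAgent that honours robots.txt and spaces out its requests, which is presumably why it is the choice when 'temporate' is enabled.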

analyse data, facteur
Split data into words and store them in the global %$RM with their score. The hash structure is ('mots' => facteur), that is, word => score.

data
Buffer to analyse

facteur
Base score for each word
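
A minimal sketch of this kind of word scoring follows. It is not the module's code: %scores is a hypothetical stand-in for %$RM, and the tokenising rule (runs of \w characters, folded to lower case) and the choice to accumulate scores across calls are assumptions.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the module's global %$RM.
my %scores;

# Hypothetical stand-in for analyse(data, facteur).
sub analyse_sketch {
    my ($data, $facteur) = @_;

    # Assumption: a word is a run of \w characters, lower-cased,
    # and repeated words accumulate their score.
    while ($data =~ /(\w+)/g) {
        $scores{ lc $1 } += $facteur;
    }
    return scalar keys %scores;
}

analyse_sketch('Circa indexes pages', 1);    # body text, low weight
analyse_sketch('Circa', 10);                 # e.g. a title, higher weight

printf "%-8s %d\n", $_, $scores{$_} for sort keys %scores;
```

Calling the routine once per kind of content with a different facteur is one way the per-tag weights mentioned for look_at could be applied.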

tag
Method called for each HTML tag found in the page.

text
Method called for the text content of each tag in the page.

check_links tag, links
Check whether the URL $links should be added to Circa. The URL must begin with $self->host_indexed, and its extension must not be one of doc, zip, ps, gif, jpg, gz, pdf, eps, png, deb, xls, ppt, class, GIF, css, js, wav, mid.

If $links is accepted, the URL is returned; otherwise 0 is returned.
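
The filter can be sketched as below. check_link_sketch is a hypothetical stand-in, with $host playing the role of $self->host_indexed. How the module extracts the extension is not documented, so taking the text after the last dot of the final path segment (ignoring any query string or fragment) is an assumption. The list is matched case-sensitively because the documentation names both gif and GIF.

```perl
use strict;
use warnings;

# Extensions the documentation lists as rejected (case-sensitive).
my %rejected = map { $_ => 1 } qw(
    doc zip ps gif jpg gz pdf eps png deb xls ppt class GIF css js wav mid
);

# Hypothetical stand-in for check_links.
sub check_link_sketch {
    my ($host, $link) = @_;

    # The URL must begin with the indexed host prefix.
    return 0 unless index($link, $host) == 0;

    # Look only at what follows the host prefix, without query or fragment.
    my $rest = substr($link, length $host);
    $rest =~ s/[?#].*//s;

    # Assumption: the extension is the text after the last dot
    # in the final path segment.
    my ($ext) = $rest =~ m{\.([^./]+)$};
    return 0 if defined $ext && $rejected{$ext};

    return $link;
}

my $host = 'http://www.alianwebserver.com';
for my $link ("$host/societe/index.html",
              "$host/images/logo.gif",
              'http://elsewhere.invalid/index.html') {
    my $result = check_link_sketch($host, $link);
    print $result ? "accept $link\n" : "reject $link\n";
}
```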


VERSION

$Revision: 1.27 $


SEE ALSO

the Search::Circa::Indexer manpage


AUTHOR

Alain BARBET alian@alianwebserver.com
