
Circa::Indexer
Provides functions to administer Circa, a web search engine running on MySQL


NAME

Circa::Indexer - provides functions to administer Circa, a web search engine running on MySQL


SYNOPSIS


 use Circa::Indexer;
 my $indexor = Circa::Indexer->new;

 # Stop if the MySQL connection fails
 die "MySQL connection error: $DBI::errstr\n"
   if (!$indexor->connect);

 $indexor->create_table_circa;

 $indexor->drop_table_circa;

 $indexor->addSite({url   => "http://www.alianwebserver.com/",
                    email => 'alian@alianwebserver.com',
                    title => "Alian Web Server"});

 my ($nbIndexe,$nbAjoute,$nbWords,$nbWordsGood) = $indexor->parse_new_url(1);
 print   "$nbIndexe pages indexed, ",
   "$nbAjoute pages added, ",
   "$nbWordsGood words indexed, ",
   "$nbWords words read\n";

 $indexor->update(30,1);

See also circa_admin, admin.cgi and admin_compte.cgi.


DESCRIPTION

This is Circa::Indexer, a module that provides functions to administer Circa, a web search engine running on MySQL. Circa is meant for your own web site, or for a list of sites. It indexes the way Altavista does. It can read, add and parse all the URLs found in a page. It adds URLs and words to MySQL so they can be used at search time.

This module provides routines to:

  • Add URLs

  • Create and update each account

  • Parse URLs, index words, and so on

  • Administer the URLs already present

Notes:

  • Files with these extensions are not added: doc, zip, ps, gif, jpg, gz, pdf, eps, png, deb, xls, ppt, class, GIF, css, js, wav, mid

  • The weights given to words are set in the %ConfigMoteur hash
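The extension rule in the notes above can be sketched as a small filter. This is only an illustration: is_skipped is a hypothetical helper, not part of Circa::Indexer, and matching case-sensitively is an assumption drawn from the list naming both gif and GIF.

```perl
use strict;
use warnings;

# Extensions copied from the list above. Matching is case-sensitive here
# because the list names both "gif" and "GIF" (an assumption).
my %SKIPPED = map { $_ => 1 } qw(
    doc zip ps gif jpg gz pdf eps png deb xls ppt class GIF css js wav mid
);

# is_skipped() is a hypothetical helper, not part of Circa::Indexer.
sub is_skipped {
    my ($url) = @_;
    (my $path = $url) =~ s/[?#].*//s;       # drop query string and fragment
    my ($ext) = $path =~ m{\.([^./]+)$};    # text after the last dot of the last segment
    return (defined($ext) && exists $SKIPPED{$ext}) ? 1 : 0;
}

for my $url ('http://www.alianwebserver.com/index.html',
             'http://www.alianwebserver.com/doc/manual.pdf') {
    printf "%-50s %s\n", $url, is_skipped($url) ? 'skipped' : 'kept';
}
```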

How does it work?

Circa parses an HTML document and converts it to text. It counts every word found and stores the result in a hash keyed by word. In addition, it reads the title, keywords and description, and adds a weight to every word found there.

Example. A configuration:


 my %ConfigMoteur=(
  'author'              => 'circa@alianwebserver.com', # Person responsible for the engine
  'temporate'           => 1,  # Wait 8s between requests to a server
  'facteur_keyword'     => 15, # <meta name="KeyWords"
  'facteur_description' => 10, # <meta name="description"
  'facteur_titre'       => 10, # <title></title>
  'facteur_full_text'   => 1,  # rest of the page
  'facteur_url'         => 15, # Words found in the URL
  'nb_min_mots'         => 2,  # minimum weight for a word to be kept
  'niveau_max'          => 7,  # Maximum level to index
  'indexCgi'            => 0,  # Index CGI links (e.g. ?nom=toto&riri=eieiei)
  );

An HTML document:


 <html>
 <head>
 <meta name="KeyWords"
       CONTENT="informatique,computing,javascript,CGI,perl">
 <meta name="Description" 
       CONTENT="Rubriques Informatique (Internet,Java,Javascript, CGI, Perl)">
 <title>Alian Web Server:Informatique,Société,Loisirs,Voyages</title>
 </head>
 <body>
 different word: cgi, perl, cgi
 </body>
 </html>

After parsing, the resulting hash contains:


 $words{'informatique'} = 15 + 10 + 10 = 35
 $words{'cgi'} = 15 + 10 + 1
 $words{'different'} = 1

A word is added to the database if its total is greater than $ConfigMoteur{'nb_min_mots'} (2 by default). If you set this to 1, the database will grow very quickly, but you will be able to run very exact searches with many words, and so do phrase searches. If you do that, keep an eye on the size of the relation table.
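The weighting and the nb_min_mots threshold can be sketched as follows. This is a minimal illustration, not the module's parser: weigh_words and keep_words are hypothetical helpers, the zone names are invented for the sketch, and the text of each zone is passed in directly instead of being extracted from HTML. The sketch adds the zone weight for every occurrence of a word; how the module treats a word repeated within one zone is an assumption here.

```perl
use strict;
use warnings;
use utf8;

my %ConfigMoteur = (
    facteur_keyword     => 15,
    facteur_description => 10,
    facteur_titre       => 10,
    facteur_full_text   => 1,
    nb_min_mots         => 2,
);

# weigh_words() is a hypothetical helper: it takes the text of each zone
# and returns a hash reference of word => total weight.
sub weigh_words {
    my ($zones, $config) = @_;
    my %zone_factor = (
        keywords    => $config->{facteur_keyword},
        description => $config->{facteur_description},
        title       => $config->{facteur_titre},
        body        => $config->{facteur_full_text},
    );
    my %words;
    for my $zone (keys %$zones) {
        next unless exists $zone_factor{$zone};
        # Lowercase, split on anything that is not a letter or digit
        for my $word (grep { length } split /[^\p{L}\p{N}]+/, lc $zones->{$zone}) {
            $words{$word} += $zone_factor{$zone};
        }
    }
    return \%words;
}

# keep_words() is hypothetical too: keep words whose total exceeds nb_min_mots.
sub keep_words {
    my ($words, $config) = @_;
    my %kept = map  { $_ => $words->{$_} }
               grep { $words->{$_} > $config->{nb_min_mots} } keys %$words;
    return \%kept;
}

my %zones = (
    keywords    => 'informatique,computing,javascript,CGI,perl',
    description => 'Rubriques Informatique (Internet,Java,Javascript, CGI, Perl)',
    title       => 'Alian Web Server:Informatique,Société,Loisirs,Voyages',
    body        => 'different word: cgi, perl, cgi',
);
my $words = weigh_words(\%zones, \%ConfigMoteur);
my $kept  = keep_words($words, \%ConfigMoteur);
print "informatique: $words->{informatique}\n";
print "different kept? ", (exists $kept->{different} ? 'yes' : 'no'), "\n";
```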

After a page is read, its HTML links are examined in turn, and so on. Each time, the level grows by one. A URL is added only if its level is less than $ConfigMoteur{'niveau_max'}.
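The level rule can be sketched with an in-memory link table standing in for fetched pages. This is an illustration under assumptions: crawl_levels is a hypothetical helper, and numbering the first URL as level 0 is a choice made for the sketch, not something stated by the module.

```perl
use strict;
use warnings;

# crawl_levels() is a hypothetical helper. $links maps each URL to the
# URLs its page links to (standing in for fetching and parsing the page).
sub crawl_levels {
    my ($root, $links, $niveau_max) = @_;
    my %level = ($root => 0);      # assumption: the first URL is level 0
    my @queue = ($root);
    while (defined(my $url = shift @queue)) {
        my $next_level = $level{$url} + 1;        # level grows by one per hop
        next unless $next_level < $niveau_max;    # too deep: links are not added
        for my $found (@{ $links->{$url} || [] }) {
            next if exists $level{$found};        # already known
            $level{$found} = $next_level;
            push @queue, $found;
        }
    }
    return \%level;
}

my %links = (
    '/'    => ['/a', '/b'],
    '/a'   => ['/a/1'],
    '/a/1' => ['/a/1/x'],
);
my $level = crawl_levels('/', \%links, 2);
printf "%s => level %d\n", $_, $level->{$_} for sort keys %$level;
```

With niveau_max set to 2, only the start page and the pages it links to directly are added; raising the value lets the crawl follow the chain further.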


Class Interface

Constructors and Instance Methods

new PARAMHASH
You can use the following keys in PARAMHASH:
author
Default: 'circa@alianwebserver.com'. Appears (as the user agent) in the log files of the indexed web servers.

temporate
Default: 1, boolean. If true, the indexer waits 8s between requests to the same server and uses LWP::RobotUA. Otherwise it uses LWP::UserAgent, which is quicker because it does not request and parse robots.txt rules, but less clean: a robot should always say who it is, and the delay avoids putting a heavy load on the server.

facteur_keyword
Default: 15, weight of a word found in the KeyWords meta tag

facteur_description
Default: 10, weight of a word found in the description meta tag

facteur_titre
Default: 10, weight of a word found in <title></title>

facteur_full_text
Default: 1, weight of a word found in the rest of the page

facteur_url
Default: 15, weight of a word found in the URL

nb_min_mots
Default: 2, minimal number of times a word must be found to be added

niveau_max
Default: 7, maximum number of link levels to follow

indexCgi
Default: 0, whether or not to follow CGI links (e.g. ?nom=toto&riri=eieiei)
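One way to picture how PARAMHASH keys override the defaults listed above is a plain hash merge. This is a sketch only: build_config is a hypothetical helper, not the module's constructor, and passing the keys as a flat list is an assumption.

```perl
use strict;
use warnings;

# Defaults as documented for PARAMHASH above.
my %DEFAULTS = (
    author              => 'circa@alianwebserver.com',
    temporate           => 1,
    facteur_keyword     => 15,
    facteur_description => 10,
    facteur_titre       => 10,
    facteur_full_text   => 1,
    facteur_url         => 15,
    nb_min_mots         => 2,
    niveau_max          => 7,
    indexCgi            => 0,
);

# build_config() is a hypothetical helper, not the module's constructor.
sub build_config {
    my (%params) = @_;
    return { %DEFAULTS, %params };   # later pairs override earlier ones
}

my $config = build_config(niveau_max => 3, temporate => 0);
printf "%-20s %s\n", $_, $config->{$_} for sort keys %$config;
```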

size_max size
Get or set the maximum size of a file read by the indexer (to avoid memory problems).

host_indexed host
Get or set the host indexed.

set_host_indexed url
Set the base directory from url. It is used to restrict indexing to files found in sub-directories on this server.

proxy proxy address
Get or set the proxy for LWP::RobotUA or LWP::UserAgent.

Example: $circa->proxy('http://proxy.sn.no:8001/');

Methods used for global administration

addSite ref_hash
ref_hash can have these keys: url, email, title, categorieAuto, cgi, rep, file

Creates an account whose first URL is url. Returns the id of the account created.

parse_new_url id account
Parses the pages that have just been added. The program analyses every page whose 'parse' column equals 0.

Returns the number of pages analysed, the number of pages added, and the number of words indexed.

update nb days, id account
Updates the URLs not visited for nb days for account id account. If id account (idp) is not given, 1 is used. URLs never parsed will be indexed.

Returns ($nb,$nbAjout,$nbWords,$nbWordsGood)

  • $nb: Number of links found

  • $nbAjout: Number of links added

  • $nbWords: Number of words found

  • $nbWordsGood: Number of words added

create_table_circa
Creates the tables needed by Circa:
  • categorie : Site categories

  • links : List of URLs

  • responsable : Link to the person responsible for each link

  • relations : List of words / ids of indexed sites

  • inscription : Temporary registrations

drop_table_circa
Drops all the tables in Circa. Be careful!

drop_table_circa_id id account
Drops the tables for account id account

create_table_circa_id id account
Creates the tables needed by Circa for account id account:
categorie
Site categories

links
List of URLs

relations
List of words / ids of indexed sites

stats
List of queries

export [mysqldump], [directory of export]
Exports the data from MySQL to circa.sql in the export directory, using mysqldump.

mysqldump: path to the mysqldump binary. If not given, it is searched for in /usr/bin/mysqldump, /usr/local/bin/mysqldump, /opt/bin/mysqldump.

directory of export: path of the directory where circa.sql will be created. If not given, it is created in $CircaConf::export, or else in the /tmp directory.

import_data [path_of_bin_mysql], [path_of_circa_file]
Imports data into MySQL from circa.sql.

path_of_bin_mysql: path to the mysql binary. If not given, it is searched for in /usr/bin/mysql, /usr/local/bin/mysql, /opt/bin/mysql and $ENV{PATH}.

path_of_circa_file: path of the directory where circa.sql will be read. If not given, it is read from $CircaConf::export, or else from the /tmp directory.
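The lookup described for the mysql binary (an explicit path first, then a few fixed directories, then the PATH entries) could be sketched like this. find_binary is a hypothetical helper written for illustration; it is not the module's own code, and the exact order the module uses is an assumption.

```perl
use strict;
use warnings;
use File::Spec;
use List::Util qw(first);

# find_binary() is a hypothetical helper, not the module's own code.
sub find_binary {
    my ($name, $given) = @_;
    return $given if defined $given;             # an explicit path wins
    my @dirs = ('/usr/bin', '/usr/local/bin', '/opt/bin',
                File::Spec->path);               # then the PATH entries
    # Return the first candidate that is a plain, executable file
    return first { -f $_ && -x _ }
           map   { File::Spec->catfile($_, $name) } @dirs;
}

my $mysql = find_binary('mysql');
print defined($mysql) ? "mysql found at $mysql\n" : "mysql not found\n";
```

The same helper would serve for mysqldump by passing that name instead.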

Methods to administer each account

admin_compte id account
Returns a hash with some information about account id account. Keys are:
responsable
Email address given at account creation

titre
Title given at account creation

nb_links
Number of URLs for this account

nb_words
Number of words stored

last_index
Date of the last indexing run

nb_request
Number of requests made

racine
URL given at account creation

most_popular_word nb item to display, id account
Returns a reference to a hash listing the $max most frequent words in the database for the account owner $id

stat_request id account
Returns some statistics about the requests made to Circa

inscription email, url, titre
Registers a site in a temporary table

HTML functions

header_compte CGI object, id account, url of script
Function used with the CGI admin_compte.cgi. Displays the list of features of admin_compte.cgi for this account

get_liste_liens id account
Returns an HTML select buffer with the list of URLs for account id account

get_liste_liens_a_valider id account,CGI object
Returns an HTML select buffer with the links to validate for account id account

get_liste_site cgi object
Returns an HTML select buffer with the list of accounts

get_liste_langues id account, default value, CGI object
Returns an HTML select buffer with the distinct known languages found at index time

get_liste_mot id account, id url
Returns an HTML buffer with the words found at index time for url id url.


SEE ALSO

the Search::Circa manpage, Root class for circa

the Search::Circa::Parser manpage, Manage Parser of Indexer

circa_admin, command line to use indexer


VERSION

$Revision: 1.39 $


AUTHOR

Alain BARBET alian@alianwebserver.com
