Help-Site Computer Manuals
Software
Hardware
Programming
Networking
  Algorithms & Data Structures   Programming Languages   Revision Control
  Protocols
  Cameras   Computers   Displays   Keyboards & Mice   Motherboards   Networking   Printers & Scanners   Storage
  Windows   Linux & Unix   Mac

Search::ContextGraph
spreading activation search engine

Search::ContextGraph - spreading activation search engine


NAME

Search::ContextGraph - spreading activation search engine


SYNOPSIS


  use Search::ContextGraph;

  my $cg = Search::ContextGraph->new();

  # first you add some documents, perhaps all at once...

  

  my %docs = (

    'first'  => [ 'elephant', 'snake' ],

    'second' => [ 'camel', 'pony' ],

    'third'  => { 'snake' => 2, 'constrictor' => 1 },

  );

  $cg->bulk_add( %docs );

  

  # or in a loop...

  

  foreach my $title ( keys %docs ) {

         $cg->add( $title, $docs{$title} );

  }

  #     or from a file...

  my $cg = Search::ContextGraph->load_from_dir( "./myfiles" );

  # you can store a graph object for later use

  

  $cg->store( "stored.cng" );

  

  # and retrieve it later...

  

  my $cg = ContextGraph->retrieve( "stored.cng" );

  

  

  # SEARCHING

  

  # the easiest way

  my @ranked_docs = $cg->simple_search( 'peanuts' );

  # get back both related terms and docs for more power

  my ( $docs, $words ) = $cg->search('snake');

  # you can use a document as your query

  my ( $docs, $words ) = $cg->find_similar('First Document');

  # Or you can query on a combination of things

  my ( $docs, $words ) = 

    $cg->mixed_search( { docs  => [ 'First Document' ],

                         terms => [ 'snake', 'pony' ]

                     );

  # Print out result set of returned documents

  foreach my $k ( sort { $docs->{$b} <=> $docs->{$a} }

      keys %{ $docs } ) {

      print "Document $k had relevance ", $docs->{$k}, "\n";

  }

  # Reload it

  my $new = Search::ContextGraph->retrieve( "filename" );


DESCRIPTION

Spreading activation is a neat technique for building search engines that return accurate results for a query even when there is no exact keyword match. The engine works by building a data structure called a context graph, which is a giant network of document and term nodes. All document nodes are connected to the terms that occur in that document; similarly, every term node is connected to all of the document nodes that term occurs in. We search the graph by starting at a query node and distributing a set amount of energy to its neighbor nodes. Then we recurse, diminishing the energy at each stage, until this spreading energy falls below a given threshold. Each node keeps track of accumulated energy, and this serves as our measure of relevance.

This means that documents that have many words in common will appear similar to the search engine. Likewise, words that occur together in many documents will be perceived as semantically related. Especially with larger, coherent document collections, the search engine can be quite effective at recognizing synonyms and finding useful relationships between documents. You can read a full description of the algorithm at http://www.nitle.org/papers/Contextual_Network_Graphs.pdf.

The search engine gives expanded recall (relevant results even when there is no keyword match) without incurring the kind of computational and patent issues posed by latent semantic indexing (LSI). The technique used here was originally described in a 1981 dissertation by Scott Preece.


CONSTRUCTORS

new %PARAMS
Object constructor. Possible parameters:
auto_reweight
Rebalance the graph every time a change occurs. Default is true. Disable and do by hand using reweight_graph for better performance in graphs with frequent updates/additions/deletions.

debug LEVEL
Set this to 1 or 2 to turn on verbose debugging output

max_depth
Set the maximum distance to spread energy out from the start node. Default is effectively unlimited. You can tweak it using set_max_depth. Comes in handy if you find searches are too slow.

xs
When true, tells the module to use compiled C internals. This reduces memory requirements by about 60%, but actually runs a little slower than the pure Perl version. Don't bother to turn it on unless you have a huge graph. Default is pure Perl.
START_ENERGY
Initial energy to assign to a query node. Default is 100.

ACTIVATE_THRESHOLD
Minimal energy needed to propagate search along the graph. Default is 1.

COLLECT_THRESHOLD
Minimal energy needed for a node to enter the result set. Default is 1.

load_from_dir DIR [, \&PARSE ]
Load documents from a directory. Takes two arguments, a directory path and an optional parsing subroutine. If the parsing subroutine is passed an argument, it will use it to extract term tokens from the file. By default, the file is split on whitespace and stripped of numbers and punctuation.

load_from_tdm FILENAME
Opens and loads a term-document matrix (TDM) file to initialize the graph. The TDM encodes information about term-to-document links. This is a legacy method mainly for the convenience of the module author. For notes on the proper file format, see the README file. =cut

sub load_from_tdm { my ( $self, $file ) = @_; croak ``TDM file $file does not exist'' unless -f $file; return if $self->{'loaded'}; $self->_read_tdm( $file ); $self->{'loaded'} = 1; $self->reweight_graph(); }

rename OLD, NEW
Renames a document. Will return undef if the new name is already in use.

retrieve FILENAME
Loads a previously stored graph from disk, using Storable.


ACCESSORS

[get|set]_activate_threshold
Accessor for node activation threshold value. This value determines how far energy can spread in the graph. Lower it to increase the number of results. Default is 1.

[get|set]_auto_reweight
Accessor for auto reweight flag. If true, edge weights will be recalculated every time a document is added, updated or removed. This can significantly slow down large graphs. On by default.

[get|set]_collect_threshold
Accessor for collection threshold value. This determines how much energy a node must have to make it into the result set. Lower it to increase the number of results. Default is 1.

[get|set]_debug_mode LEVEL
Turns debugging on or off. 1 is verbose, 2 is very verbose, 0 is off.

[get|set]_initial_energy
Accessor for initial energy value at the query node. This controls how much energy gets poured into the graph at the start of the search. Increase this value to get more results from your queries.

[get|set]_max_depth LEVEL
You can tell the graph to cut off searches after a certain distance from the query node. This can speed up searches on very large graphs, and has little adverse effect, especially if you are interested in just the first few search results. Set this value to undef to restore the default (10^8).


METHODS

add DOC, WORDS
Add a document to the search engine. Takes as arguments a unique doc identifier and a reference to an array or hash of words in the document. For example:

        TITLE => { WORD1 => COUNT1, WORD2 => COUNT2 ... }

or


        TITLE => [ WORD1, WORD2, WORD3 ]

Use bulk_add if you want to pass in a bunch of docs all at once.

add_file PATH [, name => NAME, parse => CODE]
Adds a document from a file. By default, uses the PATH provided as the document identifier, and parses the file by splitting on whitespace. If a fancier title, or more elegant parsing behavior is desired, pass in named arguments as indicated. NAME can be any string, CODE should be a reference to a subroutine that takes one argument (the contents of the file) and returns an array of tokens, or a hash in the form TOKEN => COUNT, or a reference to the same.

bulk_add DOCS
Add documents to the graph in bulk. Takes as an argument a hash whose keys are document identifiers, and values are references to hashes in the form { WORD1 => COUNT, WORD2 => COUNT...} This method is faster than adding in documents one by one if you have auto_rebalance turned on.

degree NODE
Given a raw node, returns the degree (raw node means the node must be prefixed with 'D:' or 'T:' depending on type )

delete DOC
Remove a document from the graph. Takes a document identifier as an argument. Returns 1 if successful, undef otherwise.

has_doc DOC
Returns true if the document with identifier DOC is in the collection

has_term TERM
Returns true if the term TERM is in the collection

distance NODE1, NODE2, TYPE
Calculates the distance between two nodes of the same type (D or T) using the formula:

        distance = ...

=cut

sub distance {
my ( $self, $n1, $n2, $type ) = @_;
croak unless $type;
$type = lc( $type );
croak unless $type =~ /^[dt]$/;
my $key = ( $type eq 't' ? 'terms' : 'documents' );
my @shared = $self->intersection( $key => [ $n1, $n2 ] );
return 0 unless @shared;
#warn ``Found '', scalar @shared, `` nodes shared between $n1 and $n2\n'';




        my $node1 = _nodeify( $type, $n1 );

        my $node2 = _nodeify( $type, $n2 );

        # formula is w(t1,d1)/deg(d1) + w(t1,d2)/deg(d2) ... ) /deg( t1 )



        #warn "Calculating distance\n";

        my $sum1 = 0;

        my $sum2 = 0;

        foreach my $next ( @shared ) {

                my ( undef, $lcount1) =  split m/,/, $self->{neighbors}{$node1}{$next};

                my ( undef, $lcount2) =  split m/,/, $self->{neighbors}{$node2}{$next};

                my $degree = $self->degree( $next );

                #warn "\t degree of $next is $degree\n";

                my $elem1 = $lcount1 / $degree;

                $sum1 += $elem1;

                my $elem2 = $lcount2 / $degree;

                $sum2 += $elem2;

        }

        #warn "sum is $sum1, $sum2\n";

        my $final = ($sum1 / $self->degree( $node1 )) + ( $sum2 / $self->degree( $node2 ));

        #warn "final is $final\n";

        return $final;

        

        

}

distance_matrix TYPE LIMIT
Used for clustering using linear local embedding. Produces a similarity matrix in a format I'm too tired to document right now. LIMIT is the maximum number of neighbors to keep for each node.

intersection @NODES
Returns a list of neighbor nodes that all the given nodes share in common

raw_search @NODES
Given a list of nodes, returns a hash of nearest nodes with relevance values, in the format NODE => RELEVANCE, for all nodes above the threshold value. (You probably want one of search, find_similar, or mixed_search instead).

reweight_graph
Iterates through the graph, calculating edge weights and normalizing around nodes. This method is automatically called every time a document is added, removed, or updated, unless you turn the option off with auto_reweight(0). When adding a lot of docs, this can be time consuming, so either set auto_reweight to off or use the bulk_add method to add lots of docs at once

update ID, WORDS
Given a document identifier and a word list, updates the information for that document in the graph. Returns the number of changes made

doc_count [TERM]
Returns a count of all documents that TERM occurs in. If no argument is provided, returns a document count for the entire collection.

doc_list [TERM]
Returns a sorted list of document identifiers that contain TERM, in ASCII-betical order. If no argument is given, returns a sorted document list for the whole collection.

dump_node NODE
Lists all of the neighbors of a node, together with edge weights connecting to them

dump_tdm [FILE]
Dumps internal state in term-document matrix (TDM) format, which looks like this:

        A B C B C B C

        A B C B C B C

        A B C B C B C

Where each row represents a document, A is the number of terms in the document, B is the term node and C is the edge weight between the doc node and B. Mostly used as a legacy format by the module author. Doc and term nodes are printed in ASCII-betical sorted order, zero-based indexing. Up to you to keep track of the ID => title mappings, neener-neener! Use doc_list and term_list to get an equivalently sorted list

near_neighbors [NODE]
Returns a list of neighbor nodes of the same type (doc/doc, or term/term) two hops away.

term_count [DOC]
Returns the number of unique terms in a document or, if no document is specified, in the entire collection.

term_list [DOC]
Returns a sorted list of unique terms appearing in the document with identifier DOC, in ASCII-betical order. If no argument is given, returns a sorted term list for the whole collection.

word_count [TERM]
Returns the total occurence count for a term, or if no argument is given, a word count for the entire collection. The word count is always greater than or equal to the term count.

search @QUERY
Searches the graph for all of the words in @QUERY. Use find_similar if you want to do a document similarity instead, or mixed_search if you want to search on any combination of words and documents. Returns a pair of hashrefs: the first a reference to a hash of docs and relevance values, the second to a hash of words and relevance values.

simple_search QUERY
This is the DWIM method - takes a query string as its argument, and returns an array of documents, sorted by relevance.

find_by_title @TITLES
Given a list of patterns, searches for documents with matching titles

find_similar @DOCS
Given an array of document identifiers, performs a similarity search and returns a pair of hashrefs. First hashref is to a hash of docs and relevance values, second is to a hash of words and relevance values.

merge TYPE, GOOD, @BAD
Combine all the nodes in @BAD into the node with identifier GOOD. First argument must be one of 'T' or 'D' to indicate term or document nodes. Used to combine synonyms in the graph.

mixed_search @DOCS
Given a hashref in the form: { docs => [ 'Title 1', 'Title 2' ], terms => ['buffalo', 'fox' ], } } Runs a combined search on the terms and documents provided, and returns a pair of hashrefs. The first hashref is to a hash of docs and relevance values, second is to a hash of words and relevance values.

store FILENAME
Stores the object to a file for later use. Not compatible (yet) with compiled XS version, which will give a fatal error.

have_edge RAWNODE1, RAWNODE2
Returns true if the nodes share an edge. Node names must be prefixed with 'D' or 'T' as appropriate.


BUGS


AUTHOR

Maciej Ceglowski <maciej@ceglowski.com>

The spreading activation technique used here was originally discussed in a 1981 dissertation by Scott Preece, at the University of Illinois.

XS implementation thanks to Schuyler Erle.


CONTRIBUTORS


        Schuyler Erle

        Ken Williams

        Leon Brocard  




COPYRIGHT AND LICENSE

Perl module: (C) 2003 Maciej Ceglowski

XS Implementation: (C) 2003 Maciej Ceglowski, Schuyler Erle

This program is free software, distributed under the GNU Public License. See LICENSE for details.

Programminig
Wy
Wy
yW
Wy
Programming
Wy
Wy
Wy
Wy