
WWW::Scraper::JustTechJobs
Scrapes Just*Jobs.com


NAME

WWW::Scraper::JustTechJobs - Scrapes Just*Jobs.com


SYNOPSIS


    require WWW::Scraper;

    my $search = WWW::Scraper->new('JustTechJobs');


DESCRIPTION

This class is a JustTechJobs specialization of WWW::Search. It handles making and interpreting Just*Jobs searches at http://www.Just*Jobs.com (where * is 'Perl', 'Java', etc.).


OPTIONS

search_debug, search_parse_debug, search_ref: as specified in the WWW::Search manpage.


AUTHOR

WWW::Scraper::JustTechJobs is written and maintained by Glenn Wood, http://search.cpan.org/search?mode=author&query=GLENNWOOD.


COPYRIGHT

Copyright (c) 2001 Glenn Wood. All rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.


XML Scaffolding

Look at the idea from the perspective of the XML "scaffold" I'm suggesting for parsing the response HTML.

(This is XML, but looks superficially like HTML)

    <HTML>
      <BODY>
        <TABLE NAME="name" or NUMBER="number">
          <TR TYPE="header"/>
          <TR TYPE="detail*">
            <TD BIND="title" />
            <TD BIND="description" />
            <TD BIND="location" />
            <TD BIND="url" PARSE="anchor" />
          </TR>
        </TABLE>
      </BODY>
    </HTML>

This scaffold describes the relevant skeleton of an HTML document. There are HTML and BODY elements, of course. The <TABLE> entry tells our parser to skip to the TABLE in the HTML named "name", or to skip "number" TABLE entries (default=0, which picks up the first TABLE element). Then the TABLE is described. The first <TR> is described as a "header" row; the parser throws that one away. The second <TR> is a "detail" row (the "*" means multiple detail rows, of course). The parser picks up each <TD> element, extracts its content, and places that in the hash entry corresponding to its BIND= attribute. Thus, the first TD goes into $result->_elem('title') (I needed to learn to use LWP::MemberMixin. Thanks, another lesson learned!). The second TD goes into $result->_elem('description'), etc. (Of course, some of these are _elem_array, but these details will be resolved later.)

The PARSE= in the url TD suggests a way for our parser to do special handling of a data element.

The generic scaffold parser would take this XML and convert it to a hash/array to be processed at run time; we wouldn't actually use XML at run time. A backend author would use that hash/array in his native_setup_search() code, calling the "scaffolder" scanner with that hash as a parameter.

As I said, this works great if the response is TABLE structured, but I haven't seen any responses that aren't that way already.
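The binding step described above can be sketched in a few lines of Perl. This is only an illustration, not the module's actual code: the names %parse, @columns, and bind_rows are hypothetical, the rows are assumed to have been split into raw TD contents already, and results are stored in plain hashes rather than through $result->_elem(). The regular expressions stand in for a real HTML parser.

```perl
use strict;
use warnings;

# Hypothetical PARSE handlers: each turns the raw content of one TD into a value.
my %parse = (
    # "text": strip any markup and keep the clear text.
    text   => sub { my ($raw) = @_; $raw =~ s/<[^>]*>//g; return $raw },
    # "anchor": pull the href out of the first <a> element, if there is one.
    anchor => sub { my ($raw) = @_; return $raw =~ /<a\s[^>]*href="([^"]*)"/i ? $1 : undef },
);

# Column descriptions in TD order, taken from the BIND= and PARSE= attributes.
my @columns = (
    [ title       => 'text'   ],
    [ description => 'text'   ],
    [ location    => 'text'   ],
    [ url         => 'anchor' ],
);

# Hypothetical helper: $rows is an array ref of rows, each an array ref of raw
# TD contents. The first row is the "header" TR and is thrown away.
sub bind_rows {
    my ($rows) = @_;
    my ($header, @details) = @$rows;
    my @hits;
    for my $cells (@details) {
        my %hit;
        for my $i (0 .. $#columns) {
            my ($field, $how) = @{ $columns[$i] };
            $hit{$field} = $parse{$how}->( $cells->[$i] // '' );
        }
        push @hits, \%hit;
    }
    return \@hits;
}

# Made-up sample data standing in for a parsed response TABLE.
my $rows = [
    [ 'Title', 'Description', 'Location', 'Link' ],
    [ '<b>Perl Developer</b>', 'Maintain scrapers', 'Seattle, WA',
      '<a href="http://example.com/job/1">Details</a>' ],
];
my $hits = bind_rows($rows);
print "$hits->[0]{title} -> $hits->[0]{url}\n";
```

Keeping the PARSE handlers in a dispatch table means a backend could add a new kind of special handling without touching the row loop.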

The XML scaffold above converts to an array tree that looks like this:


    my $scaffold = [ 'HTML',
                     [ [ 'BODY',
                       [ [ 'TABLE', 'name' ,                  # or 'name' = undef; multiple <TABLE number=n> mean n 'TABLE's here
                         [ [ 'NEXT', 1, 'NEXT >' ] ,          # meaning how to find the NEXT button.
                           [ 'TR', 1 ] ,                      # meaning "header".
                           [ 'TR', 2 ,                        # meaning "detail*"
                             [ [ 'TD', 1, 'title' ] ,         # meaning clear text binding to _elem('title').
                               [ 'TD', 1, 'description' ] ,
                               [ 'TD', 1, 'location' ] ,
                               [ 'TD', 2, 'url' ]             # meaning anchor parsed text binding to _elem('url').
                             ]
                         ] ]
                       ] ]
                     ] ]
                  ];
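
To show how a "scaffolder" scanner might read such a tree, here is a minimal sketch that walks the array and lists every TD binding in order. The function td_bindings is hypothetical and not part of WWW::Scraper. It assumes each node is an array whose first element is the tag and that any nested array reference is the node's child list. It reads the TD codes 1 and 2 as "clear text" and "anchor", following the comments in the tree.

```perl
use strict;
use warnings;

# The scaffold tree, here with the table name left undefined.
my $scaffold = [ 'HTML',
                 [ [ 'BODY',
                   [ [ 'TABLE', undef,
                     [ [ 'NEXT', 1, 'NEXT >' ],
                       [ 'TR', 1 ],
                       [ 'TR', 2,
                         [ [ 'TD', 1, 'title' ],
                           [ 'TD', 1, 'description' ],
                           [ 'TD', 1, 'location' ],
                           [ 'TD', 2, 'url' ]
                         ]
                     ] ]
                   ] ]
                 ] ]
               ];

# Hypothetical helper: return a list of [ field, parse_code ] pairs, one per
# TD node, in document order.
sub td_bindings {
    my ($node) = @_;
    my ($tag, @rest) = @$node;
    return [ $rest[1], $rest[0] ] if $tag eq 'TD';

    my @found;
    # Any array-ref element of a non-TD node is taken to be its child list.
    for my $children (grep { ref $_ eq 'ARRAY' } @rest) {
        push @found, td_bindings($_) for @$children;
    }
    return @found;
}

# Assumed meaning of the TD codes, taken from the comments in the tree.
my %parse_name = ( 1 => 'text', 2 => 'anchor' );

for my $pair (td_bindings($scaffold)) {
    my ($field, $code) = @$pair;
    print "$field: $parse_name{$code}\n";
}
```

Nodes without children, such as the NEXT entry and the header TR, contribute nothing, so only the four detail columns are reported.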