Alvis::Canonical
Perl extension for converting documents in various formats into the Alvis canonical format for documents


NAME

Alvis::Canonical - Perl extension for converting documents in various formats into the Alvis canonical format for documents


SYNOPSIS


 use Alvis::Canonical;

 # Create a new instance, specify the conversion of both numeric and
 # symbolic character entities to Unicode characters
 my $C=Alvis::Canonical->new(convertCharEnts=>1,
                             convertNumEnts=>1);
 if (!defined($C))
 {
     die("Unable to instantiate Alvis::Canonical.");
 }

 # Convert an HTML document text in UTF-8 to the canonical format.
 # Specify that you want the title and baseURL as well, if any can be
 # determined.
 my ($txt,$header)=$C->HTML($html,
                            {title=>1,
                             baseURL=>1});
 if (!defined($txt))
 {
     die $C->errmsg();
 }


DESCRIPTION

The converter assumes that the input is in UTF-8 and does NOT contain NUL characters ('\0'), or rather that any NULs present carry no meaning and can be removed.
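As a rough sketch of how a caller might satisfy these assumptions before handing text to the converter, the following uses only the core Encode module. The helper name prepare_input is hypothetical and not part of Alvis::Canonical. This page does not say whether HTML() expects UTF-8 bytes or decoded characters, so returning encoded bytes here is an assumption.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not part of Alvis::Canonical): take raw bytes in a
# known encoding and return UTF-8 encoded bytes with all NULs removed.
sub prepare_input
{
    my ($bytes,$encoding)=@_;

    # Decode the bytes into Perl characters using the stated encoding.
    my $chars=decode($encoding,$bytes);

    # NULs are assumed to carry no meaning, so drop them up front.
    $chars=~tr/\0//d;

    # Assumption: the converter wants UTF-8 encoded bytes.
    return encode('UTF-8',$chars);
}

# ISO-8859-2 bytes containing a stray NUL.
my $latin2="caf\xE9\0 au lait";
my $utf8=prepare_input($latin2,'iso-8859-2');
```

If the source encoding is not known, the sourceEncoding option described under METHODS can be left undefined so that the module guesses it.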


METHODS

new()

Available options:


    warnings         Issue warnings about badly faulty original HTML where
                     we have to resort to a heuristic solution.
                     Prints a warning to STDERR documenting the error and
                     the solution. Default: no.
    convertCharEnts  Convert HTML symbolic character entities to UTF-8
                     characters? Default: yes.
    convertNumEnts   Convert HTML numerical character entities to UTF-8
                     characters? Default: yes.
    sourceEncoding   The encoding of the source documents. Default: undef,
                     which means it is guessed.

  my $C=Alvis::Canonical->new(convertCharEnts=>1,
                              convertNumEnts=>1);
  if (!defined($C))
  {
    die("Unable to instantiate Alvis::Canonical.");
  }

HTML($html,$options)

Converts dirty (possibly malformed) HTML to a valid Alvis canonicalDocument. $options is a mechanism for returning the title and base URL of the document: if you want them extracted, set the fields 'title' and 'baseURL' to a defined value. If you know the encoding of the source document, set the option 'sourceEncoding', e.g.


  my ($txt,$header)=$C->HTML($html,
                             {title=>1,
                              baseURL=>1,
                              sourceEncoding=>'iso-8859-2'});

errmsg()

Returns the stack of error messages, if any; otherwise returns an empty string.
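A minimal sketch of how the undefined-return convention of HTML() and errmsg() might be combined when converting several documents. The helper name convert_all and the shape of its return value are hypothetical. It relies only on the HTML() and errmsg() calls documented above and takes the converter as an argument, so it assumes nothing about how the object was constructed.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Alvis::Canonical): convert several HTML
# texts with one converter object, collecting failures instead of dying.
# $converter is expected to provide HTML() and errmsg() as documented above.
sub convert_all
{
    my ($converter,$docs,$options)=@_;
    my (%converted,%failed);

    for my $name (sort keys %$docs)
    {
        # Pass a shallow copy of the options as a precaution, in case the
        # converter stores results in the hash it is given (an assumption).
        my ($txt,$header)=$converter->HTML($docs->{$name},
                                           {%{$options||{}}});
        if (defined($txt))
        {
            $converted{$name}={text=>$txt,header=>$header};
        }
        else
        {
            # An undefined text signals failure; errmsg() describes it.
            $failed{$name}=$converter->errmsg();
        }
    }

    return (\%converted,\%failed);
}
```

Called as convert_all($C,{'a.html'=>$html},{title=>1,baseURL=>1}), this returns one hash reference of converted documents and one of error messages keyed by the same names, so a batch run can report every failure at the end instead of stopping at the first.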


SEE ALSO

Alvis::Convert


AUTHOR

Kimmo Valtonen, <kimmo.valtonen@hiit.fi>


COPYRIGHT AND LICENSE

Copyright (C) 2006 by Kimmo Valtonen

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.
