NAME
    HTML::HTML5::Parser - parse HTML reliably

SYNOPSIS
      use HTML::HTML5::Parser;
  
      my $parser = HTML::HTML5::Parser->new;
      my $doc    = $parser->parse_string(<<'EOT');
      <!doctype html>
      <title>Foo</title>
      <p><b><i>Foo</b> bar</i>.
      <p>Baz</br>Quux.
      EOT
  
      my $fdoc   = $parser->parse_file( $html_file_name );
      my $fhdoc  = $parser->parse_fh( $html_file_handle );

DESCRIPTION
    This library is substantially the same as the non-CPAN module
    Whatpm::HTML. Changes include:

    *       Provides an XML::LibXML-like DOM interface. If you usually use
            XML::LibXML's DOM parser, this should be a drop-in solution for
            tag soup HTML.

    *       Constructs an XML::LibXML::Document as the result of parsing.

    *       Via bundling and modifications, removed external dependencies on
            non-CPAN packages.
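
    As a rough illustration of the points above, the following sketch
    parses the tag soup from the SYNOPSIS and then queries the result with
    ordinary XML::LibXML DOM calls. The XPath expression uses local-name()
    so that it does not depend on which namespace, if any, the elements
    are placed in; that choice is an assumption made for the example.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

# Tag soup: misnested <b>/<i>, unclosed <p> elements and a stray </br>.
my $html = <<'EOT';
<!doctype html>
<title>Foo</title>
<p><b><i>Foo</b> bar</i>.
<p>Baz</br>Quux.
EOT

my $doc = HTML::HTML5::Parser->new->parse_string($html);

# The result is an XML::LibXML::Document, so the usual DOM API applies.
print ref($doc), "\n";

# local-name() matches the element name regardless of namespace.
my @paragraphs = $doc->findnodes('//*[local-name()="p"]');
print scalar(@paragraphs), " paragraph element(s)\n";
print $_->toString, "\n" for @paragraphs;
```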

  Constructor
    "new"
              $parser = HTML::HTML5::Parser->new;

            The constructor does not do anything interesting.

  XML::LibXML-Compatible Methods
    "parse_file", "parse_html_file"
          $doc = $parser->parse_file( $html_file_name [,\%opts] );

        This function parses an HTML document from a file or network;
        $html_file_name can be either a filename or a URL.

        Options include 'encoding' to indicate file encoding (e.g. 'utf-8')
        and 'user_agent' which should be a blessed "LWP::UserAgent" object
        to be used when retrieving URLs.

        If requesting a URL and the response Content-Type header indicates
        an XML-based media type (such as XHTML), XML::LibXML::Parser will be
        used automatically (instead of the tag soup parser). The XML parser
        can be told to use a DTD catalogue by setting the option
        'xml_catalogue' to the filename of the catalogue.

        HTML (tag soup) parsing can be forced using the option 'force_html',
        even when an XML media type is returned. If an options hashref was
        passed, parse_file will set $options->{'parser_used'} to the name of
        the class used to parse the URL, to allow the calling code to
        double-check which parser was used afterwards.

        If an options hashref was passed, parse_file will set
        $options->{'response'} to the HTTP::Response object obtained by
        retrieving the URI.
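
        A minimal sketch of passing an options hashref to "parse_file" and
        reading back 'parser_used' afterwards. The temporary file is only
        a stand-in for a real document, and the '.html' suffix is an
        assumption made so that the input is treated as HTML.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTML::HTML5::Parser;

# Create a small HTML file to parse (illustrative input only).
my ($fh, $filename) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$fh} "<!doctype html>\n<title>Example</title>\n<p>Hello\n";
close $fh or die "close failed: $!";

my %opts   = (encoding => 'utf-8');
my $parser = HTML::HTML5::Parser->new;
my $doc    = $parser->parse_file($filename, \%opts);

# parse_file writes the name of the parser class back into the hashref.
my $used = defined $opts{parser_used} ? $opts{parser_used} : 'unknown';
print "Parsed with: $used\n";
print $doc->toString, "\n";
```

        For a URL, the same hashref could also carry 'user_agent',
        'xml_catalogue' or 'force_html' as described above.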

    "parse_fh", "parse_html_fh"
          $doc = $parser->parse_fh( $io_fh [,\%opts] );

        "parse_fh()" parses an IOREF or a subclass of "IO::Handle".

        Options include 'encoding' to indicate file encoding (e.g. 'utf-8').
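
        A short sketch of "parse_fh" in use. An in-memory filehandle is
        used here purely so the example is self-contained; a handle opened
        on a real file should work the same way.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $html = "<!doctype html>\n<title>From a handle</title>\n<p>Hello\n";

# Open a read handle on the scalar instead of a file on disk.
open my $fh, '<', \$html or die "cannot open in-memory handle: $!";
my $doc = HTML::HTML5::Parser->new->parse_fh($fh);
close $fh;

# local-name() avoids depending on the namespace of the elements.
my ($title) = $doc->findnodes('//*[local-name()="title"]');
print $title->textContent, "\n";
```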

    "parse_string", "parse_html_string"
          $doc = $parser->parse_string( $html_string [,\%opts] );

        This function is similar to "parse_fh()", but it parses an HTML
        document that is available as a single string in memory.

        Options include 'encoding' to indicate the encoding of the string
        (e.g. 'utf-8').

    "load_xml", "load_html"
        Wrappers for the parse_* functions. These should be roughly
        compatible with the equivalently named functions in XML::LibXML.

        Note that "load_xml" first attempts to parse as real XML, falling
        back to HTML5 parsing; "load_html" just goes straight for HTML5.
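
        A hedged sketch of "load_html". It assumes the wrapper accepts the
        same named 'string' argument as XML::LibXML's load_html, which the
        text above only promises to be "roughly compatible"; check the
        installed version before relying on it.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

# Assumption: load_html accepts string => ..., as XML::LibXML's does.
my $doc = HTML::HTML5::Parser->load_html(
    string => "<!doctype html>\n<title>Hi</title>\n<p>Some text\n",
);

print ref($doc), "\n";
print $doc->toString, "\n";
```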

    The push parser and SAX-based parser are not supported. Trying to change
    an option (such as recover_silently) will make HTML::HTML5::Parser carp
    a warning. (But you can inspect the options.)

  Additional Methods
    The module provides a few extra methods for obtaining non-DOM data
    from DOM nodes and for reporting parse errors.

    "error_handler"
        Get/set an error handling function. Must be set to a coderef or
        undef.

        The error handling function will be called with a single parameter,
        a HTML::HTML5::Parser::Error object.

    "errors"
        Returns a list of errors that occurred during the last parse.

        See HTML::HTML5::Parser::Error.
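
        The two error interfaces can be sketched together as follows. The
        handler simply collects each error object; printing an error by
        interpolating it into a string is an assumption about how the
        objects stringify, so see HTML::HTML5::Parser::Error for the
        documented accessors.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my @seen;
my $parser = HTML::HTML5::Parser->new;

# The handler receives one HTML::HTML5::Parser::Error object per call.
$parser->error_handler(sub {
    my ($error) = @_;
    push @seen, $error;
});

# Misnested <b>/<i> tags give the parser something to complain about.
my $doc = $parser->parse_string(
    "<!doctype html>\n<title>Foo</title>\n<p><b><i>Foo</b> bar</i>.\n"
);

# errors() reports what happened during the most recent parse.
my @errors = $parser->errors;
printf "%d via handler, %d via errors()\n", scalar @seen, scalar @errors;
print "$_\n" for @errors;
```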

    "compat_mode"
          $mode = $parser->compat_mode( $doc );

        Returns 'quirks', 'limited quirks' or undef (standards mode).
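
        As a small sketch, the same markup can be parsed with and without
        a doctype to compare the reported modes. The two input strings are
        made up for the example.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;

# Identical content, differing only in the presence of a doctype.
my $with_doctype    = $parser->parse_string("<!doctype html>\n<title>a</title>\n<p>b\n");
my $without_doctype = $parser->parse_string("<title>a</title>\n<p>b\n");

for my $pair ( [ 'with doctype', $with_doctype ], [ 'without doctype', $without_doctype ] ) {
    my ($label, $doc) = @$pair;
    my $mode = $parser->compat_mode($doc);

    # undef means standards mode, so substitute a readable label.
    printf "%s: %s\n", $label, defined $mode ? $mode : 'standards (undef)';
}
```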

    "dtd_public_id"
          $pubid = $parser->dtd_public_id( $doc );

        For an XML::LibXML::Document which has been returned by
        HTML::HTML5::Parser, using this method will tell you the Public
        Identifier of the DTD used (if any).

    "dtd_system_id"
          $sysid = $parser->dtd_system_id( $doc );

        For an XML::LibXML::Document which has been returned by
        HTML::HTML5::Parser, using this method will tell you the System
        Identifier of the DTD used (if any).
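
        A brief sketch showing both identifier methods on a document that
        declares an HTML 4.01 Strict doctype. The input is illustrative.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $html = <<'EOT';
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<title>Example</title>
<p>Hello
EOT

my $parser = HTML::HTML5::Parser->new;
my $doc    = $parser->parse_string($html);

# Both methods take the document returned by this parser.
my $pubid = $parser->dtd_public_id($doc);
my $sysid = $parser->dtd_system_id($doc);

print "Public ID: ", ( defined $pubid ? $pubid : '(none)' ), "\n";
print "System ID: ", ( defined $sysid ? $sysid : '(none)' ), "\n";
```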

    "source_line"
          ($line, $col) = $parser->source_line( $node );
          $line = $parser->source_line( $node );

        In scalar context, "source_line" returns the line number of the
        source code that started a particular node (element, attribute or
        comment).

        In list context, returns a line/column pair. (Tab characters count
        as one column, not eight.)
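
        A minimal sketch of both calling contexts, applied to each
        paragraph of the SYNOPSIS document. Undefined values are printed
        as '?' because the example makes no assumption about which nodes
        carry position data.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $html = <<'EOT';
<!doctype html>
<title>Foo</title>
<p><b><i>Foo</b> bar</i>.
<p>Baz</br>Quux.
EOT

my $parser = HTML::HTML5::Parser->new;
my $doc    = $parser->parse_string($html);

# local-name() avoids depending on the namespace of the elements.
my @paragraphs = $doc->findnodes('//*[local-name()="p"]');

for my $p (@paragraphs) {
    my ($line, $col) = $parser->source_line($p);   # list context
    my $line_only    = $parser->source_line($p);   # scalar context

    printf "p starts at line %s, column %s (scalar context: %s)\n",
        map { defined $_ ? $_ : '?' } $line, $col, $line_only;
}
```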

SEE ALSO
    <http://suika.fam.cx/www/markup/html/whatpm/Whatpm/HTML.html>

AUTHOR
    Toby Inkster, <tobyink@cpan.org>

COPYRIGHT AND LICENSE
    Copyright (C) 2007-2011 by Wakaba

    Copyright (C) 2009-2012 by Toby Inkster

    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself, either Perl version 5.8.1 or, at
    your option, any later version of Perl 5 you may have available.

