package.xml

XML_HTMLSax3
pear.php.net
A SAX parser for HTML and other badly formed XML documents

XML_HTMLSax3 is a SAX based XML parser for badly formed XML documents, such as HTML. The original code base was developed by Alexander Zhukov and published at http://sourceforge.net/projects/phpshelve/. Alexander kindly gave permission to modify the code and license for inclusion in PEAR.

NOTE! This package is now dual licensed under PHP license v3.01 and LGPL 3.0. See the CVS repo link for the actual licenses.

PEAR::XML_HTMLSax3 provides an API very similar to the native PHP XML extension (http://www.php.net/xml), allowing handlers using one to be easily adapted to the other. The key difference is HTMLSax will not break on badly formed XML, allowing it to be used for parsing HTML documents. Otherwise HTMLSax supports all the handlers available from Expat except namespace and external entity handlers. Provides methods for handling XML escapes as well as JSP/ASP opening and close tags.

Version 1.x introduced an API similar to the native SAX extension but used a slow character by character approach to parsing.

Version 2.x has had its internals completely overhauled to use a Lexer, delivering performance *approaching* that of the native XML extension, as well as a radically improved, modular design that makes adding further functionality easy.

Version 3.x is about fine tuning the API, behaviour and providing a mechanism to distinguish HTML "quirks" from badly formed HTML (the latter functionality is not yet implemented).

A big thanks to Jeff Moore (lead developer of WACT: http://wact.sourceforge.net) who's largely responsible for the new design, as well as input from other members at Sitepoint's Advanced PHP forums: http://www.sitepointforums.com/showthread.php?threadid=121246. Thanks also to Marcus Baker (lead developer of SimpleTest: http://www.lastcraft.com/simple_test.php) for sorting out the unit tests.

Harry Fuecks hfuecks hfuecks@phppatterns.com no

2007-12-01 3.0.0 3.0.0 stable stable PHP
Fixed bug #1850 HTMLtoXHTML.php does not produce XHTML [dufuz]
Fixed bug #11607 Requesting License change, emails to listed authors bounce [cdake]
Fixed bug #12159 not clarified license [hfuecks]
This package is now dual licensed under PHP license v3.01 and LGPL 3.0

4.0.5 1.4.0b1 pcre

2004-06-02 3.0.0RC1 3.0.0RC1 beta beta PHP
* Re PEAR version naming rules, you now include XML/HTMLSax3.php and the main class is called XML_HTMLSax3
* Now able to parse Word generated HTML - fixed bug with parsing of XML escape sequences
* API break (minor): no longer extends PEAR
* API break (minor): attributes with no value (like option selected) are now populated with NULL instead of TRUE
* API break (minor): replaced XML_OPTION_FULL_ESCAPES with XML_OPTION_STRIP_ESCAPES - by default you now get back the complete escape sequence
* Added some more examples

2.1.2 2.1.2 stable stable 2003-12-05 PHP
* Bug fixed (thanks Jeff) where badly formed attributes resulted in infinite loop
* Added additional boolean argument to open and close handler calls to spot empty tags like br/ - should not break existing APIs
* Added XML_OPTION_FULL_ESCAPES which (when = 1) passes through the complete content in an XML escape, allowing comment / cdata reconstruction

2.1.1 2.1.1 stable stable 2003-10-08 PHP
* Reporting of byte index with get_current_position() more accurate on opening tags (thanks to Alexander Orlov at x-code.com)
* All parser options now available to PHP versions earlier than 4.3.x, using an implementation of html_entity_decode in PHP

2.1.0 2.1.0 stable stable 2003-09-10 PHP
* Well (unit) tested with SimpleTest

2.0.2 2.0.2 alpha alpha 2003-08-11 PHP
* API is backwards compatible apart from the renaming of parser options
* Performance dramatically increased. Not much slower than Expat
* Better handling of XML comments and CDATA
* Option to trigger additional data handler calls for linefeeds and tabs
* Option to trigger additional data handler calls for XML entities and parse them if required.
* Added public get_current_position() and get_length() methods

1.1 1.1 stable stable 2003-06-26 PHP
* Bug fixes to Attribute_Parser to cope with newline, tag, forward slash and whitespace issues.

1.0 1.0 stable stable 2003-06-08 PHP
* Modifications to file structure to place Attributes_Parser.php and State_Machine.php in subdirectory HTMLSax
* XML_HTMLSax.php includes Attributes_Parser.php and State_Machine.php using require_once()

0.9.0rc2 0.9.0rc2 beta beta 2003-05-18 PHP
* First release under PEAR
* Changed package name to XML_HTMLSax
* Added patch from John Luxford to parse single quoted attributes
* Modified State_Machine to be a simple variable store

0.9.0rc1 0.9.0rc1 beta beta 2003-05-09 PHP
A summary of the main differences between this version of HTML_Sax and HTMLSax2002082201 are as follows;
* Instead of extending HTMLSax with your own "handlers" class, you now use the set_object() method to pass an instance of the class to HTMLSax.
* Class method callbacks are specified using the following methods;
  * set_element_handler('startHandler','endHandler') for <tag> and </tag>
  * set_data_handler('dataHandler') for contents of an element
  * set_pi_handler('piHandler') for <?php ?>, <?xml ?> etc.
  * set_escape_handler('escapeHandler') for anything beginning with <!
  * set_jasp_handler() - set listener for <% %> tags
* Attributes with no value are created and set to true
* Comments are handled and may contain entities; < >
* The callback handlers will all be passed an instance of HTMLSax in the same way as the native PHP XML Expat extension
* Setting of parser options is handled specifically by the set_option() method. Available options are;
  * skipWhiteSpace; instruct the parser to ignore whitespace characters
  * trimDataNodes; trim whitespace inside character data
  * breakOnNewLine; newline characters found in character data are treated as new events triggering another data callback
  * caseFolding; converts element names to uppercase

XML_HTMLSax3-3.0.0/docs/examples/example.html

This is HTML 4.0

This page is HTML 4.0

It contains a number of classic "failings" in terms of well formed XML, such as
- Tags which aren't closed
- Attributes which have no value

A standard XML parser will complain about these, but using HTMLSax the page can be converted to XHTML 1.0.

Do you like XHTML?
XML_HTMLSax3-3.0.0/docs/examples/ExpatvsHtmlSax.php

');

// Time Expat
$start = getmicrotime();
xml_parse($parser, $doc);
$end = getmicrotime();
echo ( "Expat took:\t\t".($end-$start)."\n" );

$start = getmicrotime();
$parser =& new XML_HTMLSax3();
$parser->set_object($handler);
$parser->set_element_handler('openHandler','closeHandler');
$parser->set_data_handler('dataHandler');

// Time HTMLSax
$start = getmicrotime();
$parser->parse($doc);
echo ( "HTMLSax took:\t\t".(getmicrotime()-$start) );
echo ('');
?>

XML_HTMLSax3-3.0.0/docs/examples/HTMLtoXHTML.php

<?php
require_once('XML/HTMLSax3.php');

class HTMLtoXHTMLHandler {
    var $xhtml;
    var $inTitle;
    var $pCounter;

    function HTMLtoXHTMLHandler() {
        $this->xhtml = '';
        $this->inTitle = false;
        $this->pCounter = 0;
    }

    // Handles the writing of attributes - called from $this->openHandler()
    function writeAttrs ($attrs) {
        if (is_array($attrs)) {
            foreach ($attrs as $name => $value) {
                // Watch for 'checked'
                if ($name == 'checked') {
                    $this->xhtml.=' checked="checked"';
                // Watch for 'selected'
                } else if ($name == 'selected') {
                    $this->xhtml.=' selected="selected"';
                } else {
                    $this->xhtml.=' '.$name.'="'.$value.'"';
                }
            }
        }
    }

    // Opening tag handler
    function openHandler(&$parser, $name, $attrs) {
        if ((isset($attrs['id']) && $attrs['id'] == 'title') || $name == 'title') {
            $this->inTitle = true;
        }
        switch ($name) {
            case 'input':
                $this->xhtml .= '<input';
                $this->writeAttrs($attrs);
                $this->xhtml .= " />\n";
                break;
            case 'img':
                $this->xhtml .= '<img';
                $this->writeAttrs($attrs);
                $this->xhtml .= " />\n";
                break;
            case 'br':
                $this->xhtml .= "<br />\n";
                break;
            case 'html':
                $this->xhtml .= "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n";
                break;
            case 'p':
                // Close the previous paragraph before opening a new one
                if ($this->pCounter != 0) {
                    $this->xhtml.="</p>\n";
                }
                $this->xhtml .= '<p>';
                $this->pCounter++;
                break;
            default:
                $this->xhtml .= '<'.$name;
                $this->writeAttrs($attrs);
                $this->xhtml .= ">\n";
                break;
        }
    }

    // Closing tag handler
    function closeHandler(&$parser, $name) {
        if ($this->inTitle) {
            $this->inTitle = false;
        }
        // Close the last open paragraph before the body ends
        if ($name == 'body' && $this->pCounter != 0) {
            $this->xhtml .= "</p>\n";
        }
        $this->xhtml .= "</$name>\n";
    }

    // Character data handler
    function dataHandler(&$parser, $data) {
        $this->xhtml .= $this->inTitle ? 'This is XHTML 1.0' : $data;
    }

    // Escape handler
    function escapeHandler(&$parser, $data) {
        if ($data == 'doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN"') {
            $this->xhtml.='<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">';
        }
    }

    // Return the XHTML document
    function getXHTML () {
        return $this->xhtml;
    }
}

// Get the HTML file
$doc = file_get_contents('example.html');

// Instantiate the handler
$handler =& new HTMLtoXHTMLHandler();

// Instantiate the parser
$parser =& new XML_HTMLSax3();

// Register the handler with the parser
$parser->set_object($handler);

// Set the handlers
$parser->set_element_handler('openHandler','closeHandler');
$parser->set_data_handler('dataHandler');
$parser->set_escape_handler('escapeHandler');

// Parse the document
$parser->parse($doc);

echo $handler->getXHTML();

XML_HTMLSax3-3.0.0/docs/examples/SimpleExample.php

<?php
require_once('XML/HTMLSax3.php');

class MyHandler {
    function openHandler(& $parser,$name,$attrs) {
        echo ( 'Open Tag Handler: '.$name.'<br />' );
        echo ( 'Attrs:<pre>' );
        print_r($attrs);
        echo ( '</pre>' );
    }
    function closeHandler(& $parser,$name) {
        echo ( 'Close Tag Handler: '.$name.'<br />' );
    }
    function dataHandler(& $parser,$data) {
        echo ( 'Data Handler: '.$data.'<br />' );
    }
    function escapeHandler(& $parser,$data) {
        echo ( 'Escape Handler: '.$data.'<br />' );
    }
    function piHandler(& $parser,$target,$data) {
        echo ( 'PI Handler: '.$target.' - '.$data.'<br />' );
    }
    function jaspHandler(& $parser,$data) {
        echo ( 'Jasp Handler: '.$data.'<br />' );
    }
}

$doc=<<<EOD
HTML Sax in Action
<?php echo ( 'This is a processing instruction' ); ?>
PHP
<% document.write('Hello World!'); %>
EOD;

// Instantiate the handler
$handler=new MyHandler();

// Instantiate the parser
$parser=& new XML_HTMLSax3();

// Register the handler with the parser
$parser->set_object($handler);

// Set a parser option
$parser->set_option('XML_OPTION_TRIM_DATA_NODES');

// Set the handlers
$parser->set_element_handler('openHandler','closeHandler');
$parser->set_data_handler('dataHandler');
$parser->set_escape_handler('escapeHandler');
$parser->set_pi_handler('piHandler');
$parser->set_jasp_handler('jaspHandler');

// Parse the document
$parser->parse($doc);
?>

XML_HTMLSax3-3.0.0/docs/examples/SimpleTemplate.php

<?php
require_once('XML/HTMLSax3.php');

class SimpleTemplate {
    var $vars = array();
    var $output = '';

    function setVar($name,$value) {
        $this->vars[$name] = $value;
    }

    function display() {
        echo $this->output;
    }

    // Notice fourth argument
    function open(& $parser,$name,$attrs,$empty) {
        // Should check more carefully but this is just an example...
        if ( $name == 'var' ) {
            if ( isset($this->vars[$attrs['name']]) ) {
                $this->output.= $this->vars[$attrs['name']];
            }
        } else {
            $tag = "<$name";
            foreach ( $attrs as $key => $value ) {
                if ( is_null($value) ) {
                    $tag .= ' '.$key;
                } else {
                    $tag .= " $key=\"$value\"";
                }
            }
            if ( $empty ) {
                $tag .= '/>';
            } else {
                $tag .= '>';
            }
            $this->output .= $tag;
        }
    }

    // Notice third argument
    function close(& $parser,$name,$empty) {
        if ( !$empty ) {
            $this->output.= "</$name>";
        }
    }

    function data(& $parser,$data) {
        $this->output .= $data;
    }

    function escape(& $parser,$data) {
        $this->output .= "<!$data>";
    }

    function pi(& $parser,$target,$data) {
        $this->output .= "<?$target $data?>";
    }

    function jasp(& $parser,$data) {
        $this->output .= "<%$data%>";
    }
}

$tpl=new SimpleTemplate();
$tpl->setVar('title','HTMLSax as a Template Parser');

$para1 = <<<EOD
[...] WACT and PHPOOT. For the most part it allows you to preserve the structure of the original template, preserving whitespace and so on with one or two minor exceptions, such as whitespace between attributes and the quotes used for attributes.
Compare the source template for this example with the output.
EOD;
$tpl->setVar('para1',$para1);

$para2 = <<<EOD
[...]
EOD;
$tpl->setVar('para2',$para2);

// Instantiate the parser
$parser=& new XML_HTMLSax3();

// Register the handler with the parser
$parser->set_object($tpl);

// Set a parser option
$parser->set_option('XML_OPTION_FULL_ESCAPES');

// Set the handlers
$parser->set_element_handler('open','close');
$parser->set_data_handler('data');
$parser->set_escape_handler('escape');
$parser->set_pi_handler('pi');
$parser->set_jasp_handler('jasp');

// Parse the document
$parser->parse(file_get_contents('simpletemplate.tpl'));

$tpl->display();
?>

XML_HTMLSax3-3.0.0/docs/examples/simpletemplate.tpl

<var name="title">



<%
// JASP handler deals with this
document.write("Hello World!");
%>

XML_HTMLSax3-3.0.0/docs/examples/worddoc.htm

 

XML_HTMLSax3-3.0.0/docs/examples/WordDoc.php

<?php
require_once('XML/HTMLSax3.php');

class MyHandler {
    function escape(& $parser,$data) {
        echo ($data."\n\n\n");
    }
}

$h = & new MyHandler();

// Instantiate the parser
$parser=& new XML_HTMLSax3();
$parser->set_object($h);
$parser->set_escape_handler('escape');
if ( isset($_GET['strip_escapes']) ) {
    $parser->set_option('XML_OPTION_STRIP_ESCAPES');
}
?>

Parsing Word Documents

Shows HTMLSax parsing a simple Word generated HTML document and the impact of the option 'XML_OPTION_STRIP_ESCAPES' which can be set like;

$parser->set_option('XML_OPTION_STRIP_ESCAPES');

Word generates some strange XML / HTML escape sequences like <![endif]> - now (3.0.0+) handled by HTMLSax correctly.

XML_OPTION_STRIP_ESCAPES = 0 : XML_OPTION_STRIP_ESCAPES = 1

Starting to parse...

<?php $parser->parse(file_get_contents('worddoc.htm')); ?>

Parsing completed

XML_HTMLSax3-3.0.0/docs/Readme

$Id: Readme,v 1.4 2004/06/02 14:33:38 hfuecks Exp $

++Introduction

XML_HTMLSax3 is a SAX based XML parser for badly formed XML documents, such as HTML. The original code base was developed by Alexander Zhukov and published at http://sourceforge.net/projects/phpshelve/. Alexander kindly gave permission to modify the code and license for inclusion in PEAR.

PEAR::XML_HTMLSax3 provides an API very similar to the native PHP SAX extension (http://www.php.net/xml), allowing handlers using one to be easily adapted to the other. The key difference is HTMLSax will not break on badly formed XML, allowing it to be used for parsing HTML documents. Otherwise HTMLSax supports all the handlers available from Expat except namespace and external entity handlers. Provides methods for handling XML escapes as well as JSP/ASP opening and close tags.

Version 1.x introduced an API similar to the native SAX extension but used a slow character by character approach to parsing.

Version 2.x has had its internals completely overhauled to use a Lexer, delivering performance *approaching* that of the native XML extension, as well as a radically improved, modular design that makes adding further functionality easy.

Version 3.x is about fine tuning the API, behaviour and providing a mechanism to distinguish HTML "quirks" from badly formed HTML.

A big thanks to Jeff Moore (lead developer of WACT: http://wact.sourceforge.net) who's largely responsible for the new design, as well as input from members at Sitepoint's Advanced PHP forums: http://www.sitepointforums.com/showthread.php?threadid=121246. Thanks also to Marcus Baker (lead developer of SimpleTest: http://www.lastcraft.com/simple_test.php) for sorting out the unit tests.

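The claim that handlers written for one parser adapt easily to the other can be illustrated with a small sketch. It is not taken from the package: the MyHandler class, its method names and the sample document are assumptions chosen to mirror the bundled examples. So that it runs without PEAR installed, the handler is registered with the native XML extension; the HTMLSax calls that would register the same object are shown only in comments, and they use nothing beyond the methods that appear in this package's own examples.

```php
<?php
// Hypothetical handler class (not part of XML_HTMLSax3). The callback
// signatures are the ones the native Expat-style extension expects.
class MyHandler
{
    // Events recorded in the order the parser reports them.
    public $events = array();

    public function openHandler($parser, $name, $attrs)
    {
        $this->events[] = 'open:' . $name;
    }

    public function closeHandler($parser, $name)
    {
        $this->events[] = 'close:' . $name;
    }

    public function dataHandler($parser, $data)
    {
        // Record only non-blank character data, with surrounding whitespace trimmed.
        if (trim($data) !== '') {
            $this->events[] = 'data:' . trim($data);
        }
    }
}

// Assumed sample input; it is well formed because the native parser requires that.
$doc = '<p>Hello <b>world</b></p>';
$handler = new MyHandler();

// Native XML extension.
$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0); // keep tag names as written
xml_set_element_handler(
    $parser,
    array($handler, 'openHandler'),
    array($handler, 'closeHandler')
);
xml_set_character_data_handler($parser, array($handler, 'dataHandler'));
xml_parse($parser, $doc, true);

// HTMLSax registration of the same object, per the bundled examples:
//   $parser = new XML_HTMLSax3();
//   $parser->set_object($handler);
//   $parser->set_element_handler('openHandler', 'closeHandler');
//   $parser->set_data_handler('dataHandler');
//   $parser->parse($doc);

echo implode("\n", $handler->events), "\n";
```

The handler object itself does not change between the two parsers; only the registration calls differ. According to the changelog, HTMLSax 2.1.2 and later pass an additional boolean argument to the open and close handlers to flag empty tags, which a handler written for Expat can simply ignore.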
++Uses

Some particular situations where XML_HTMLSax3 can be useful include;

- Template Engines (see WACT for example: http://wact.sf.net)
- Parsing XML documents (such as those online) where the source is out of your control and Expat is choking because it's badly formed.
- Converting HTML to XHTML
- Reading HTML based content from a database and converting to PDF (with help from a PDF generation library and probably PEAR::XML_SaxFilters as well)
- Parsing ASP(.NET) and JSP pages.
- Creating a PHP-GTK based web browser? A PHP CSS Parser exists: http://www.phpclasses.org/browse.html/package/1081.html

++Features

- Won't "break" on badly formed XML. May in some instances get it "wrong" (see Limitations) but will continue parsing.
- Provides an API similar to the native PHP XML extension so switching code from one to the other is typically minimal effort.
- Can be instructed to behave in more or less the same manner as SAX, when dealing with linefeeds, tabs and XML entities
- In addition to handling basic XML elements, attributes and data, it is also capable of dealing with;
  - Processing instructions e.g. <?php ?> / <?xml ?> etc. Within PIs XML entities are not parsed (i.e. ignore &lt; and &gt; )
  - XML Escape markup such as <!-- -->, <![CDATA[ ]]> and <!DOCTYPE >. Within this XML entities are not parsed (useful for JavaScript, for example)
  - JSP / ASP (JASP) marked up with <% %>. Note: You will need to deal with <%@ %> and <%= %> yourself. With JASP markup XML entities are not parsed

++Usage Notes

- Performance-wise, it runs faster on PHP 4.3.0+ thanks to strspn() and strcspn() supporting position arguments. For older PHP versions while loops are used to achieve the same effect, meaning a slightly higher overhead. Note also that setting XML options with XML_HTMLSax3::set_option() slows down the parser, the options being handled by "decorators" which perform some further formatting on XML events which have already been parsed.
- By default, no parser options are set
- Regarding the XML_OPTION_ENTITIES_PARSED, this uses the html_entity_decode() function which is only available in PHP 4.3.0+. To get around this, HTMLSax checks your PHP version and for the function name html_entity_decode. If not found, it defines a function which mirrors the behavior of the native PHP html_entity_decode(). Both XML_OPTION_ENTITIES_PARSED and XML_OPTION_ENTITIES_UNPARSED can be used down to PHP version 4.0.5, due to the regular expression used to find entities.
- For attributes which have just a name but no value e.g.