bastien Aperghis-Tramoni
This program is free software. You can redistribute it and/or modify it
under the terms of the GNU General Public License, version 2 or later.
=cut
html2wml-0.4.11/doc/readme.txt 0100644 0000765 0000024 00000060545 07746772372 0016026 0 ustar 00maddingue staff NAME
Html2Wml -- Program that can convert HTML pages to WML pages
SYNOPSIS
Html2Wml can be used as either a shell command:
$ html2wml file.html
or as a CGI:
/cgi-bin/html2wml.cgi?url=/index.html
In both cases, the file can be either a local file or a URL.
DESCRIPTION
Html2Wml converts HTML pages to WML decks, suitable for being viewed on
a Wap device. The program can be launched from a shell to statically
convert a set of pages, or as a CGI to convert a particular (potentially
dynamic) HTML resource.
Althought the result is not guarantied to be valid WML, it should be the
case for most pages. Good HTML pages will most probably produce valid
WML decks. To check and correct your pages, you can use W3C's softwares:
the *HTML Validator*, available online at http://validator.w3.org and
*HTML Tidy*, written by Dave Raggett.
Html2Wml provides the following features:
* translation of the links
* limitation of the cards size by splitting the result into several
cards
* inclusion of files (similar to the SSI)
* compilation of the result (using the WML Tools, see the section on
"LINKS")
*
a debug mode to check the result using validation functions
OPTIONS
Please note that most of these options are also available when calling
Html2Wml as a CGI. In this case, boolean options are given the value "1"
or "0", and other options simply receive the value they expect. For
example, `--ascii' becomes `?ascii=1' or `?a=1'. See the file
t/form.html for an example on how to call Html2Wml as a CGI.
Conversion Options
-a, --ascii
When this option is on, named HTML entities and non-ASCII characters
are converted to US-ASCII characters using the same 7 bit
approximations as Lynx. For example, `©' is translated to
"(c)", and `ß' is translated to "ss". This option is off by
default.
--[no]collapse
This option tells Html2Wml to collapse redundant whitespaces,
tabulations, carriage returns, lines feeds and empty paragraphs. The
aim is to reduce the size of the WML document as much as possible.
Collapsing empty paragraphs is necessary for two reasons. First,
this avoids empty screens (and on a device with only 4 lines of
display, an empty screen can be quite ennoying). Second, Html2wml
creates many empty paragraphs when converting, because of the way
the syntax reconstructor is programmed. Deleting these empty
paragraphs is necessary like cleaning the kitchen :-)
If this really bother you, you can desactivate this behaviour with
the --nocollapse option.
--ignore-images
This option tells Html2Wml to completly ignore all image links.
--[no]img-alt-text
This option tells Html2Wml to replace the image tags with their
corresponding alternative text (as with a text mode web browser).
This option is on by default.
--[no]linearize
This option is on by default. This makes Html2Wml flattens the HTML
tables (they are linearized), as Lynx does. I think this is better
than trying to use the native WML tables. First, they have extremely
limited features and possibilities compared to HTML tables. In
particular, they can't be nested. In fact this is normal because Wap
devices are not supposed to have a big CPU running at some
zillions-hertz, and the calculations needed to render the tables are
the most complicated and CPU-hogger part of HTML.
Second, as they can't be nested, and as typical HTML pages heavily
use imbricated tables to create their layout, it's impossible to
decide which one could be kept. So the best thing is to keep none of
them.
[Note] Although you can desactivate this behaviour, and although
there is internal support for tables, the unlinearized mode has not
been heavily tested with nested tables, and it may produce
unexpected results.
-n, --numeric-non-ascii
This option tells Html2wml to convert all non-ASCII characters to
numeric entities, i.e., "é" becomes `é', and "ß" becomes
`ß'. By default, this option is off.
-p, --nopre
This options tells Html2Wml not to use the tag. This option
was added because the compiler from WML Tools 0.0.4 doesn't support
this tag.
Links Reconstruction Options
--hreftmpl=*TEMPLATE*
This options sets the template that will be used to reconstruct the
`href'-type links. See the section on "LINKS RECONSTRUCTION" for
more information.
--srctmpl=*TEMPLATE*
This option sets the template that will be used to reconstruct the
`src'-type links. See the section on "LINKS RECONSTRUCTION" for more
information.
Splitting Options
-s, --max-card-size=*SIZE*
This option allows you to limit the size (in bytes) of the generated
cards. Default is 1,500 bytes, which should be small enought to be
loaded on most Wap devices. See the section on "DECK SLICING" for
more information.
-t, --card-split-threshold=*SIZE*
This option sets the threshold of the split event, which can occur
when the size of the current card is between `max-card-size' -
`card-split-threshold' and `max-card-size'. Default value is 50. See
the section on "DECK SLICING" for more information.
--next-card-label=*STRING*
This options sets the label of the link that points to the next
card. Default is "[>>]", which whill be rendered as "[>>]".
--prev-card-label=*STRING*
This options sets the label of the link that points to the previous
card. Default is "[<<]", which whill be rendered as "[<<]".
HTTP Authentication
-U, --http-user=*USERNAME*
Use this option to set the username for an authenticated request.
-P, --http-passwd=*PASSWORD*
Use this option to set the password for an authenticated request.
Proxy Support
-[no]Y, --[no]proxy
Use this option to activate proxy support. By default, proxy support
is activated. See the section on "PROXY SUPPORT".
Output Options
-k, --compile
Setting this option tells Html2Wml to use the compiler from WML
Tools to compile the WML deck. If you want to create a real Wap
site, you should seriously use this option in order to reduce the
size of the WML decks. Remember that Wap devices have very little
amount of memory. If this is not enought, use the splitting options.
Take a look in wml_compilation/ for more information on how to use a
WML compiler with Html2Wml.
-o, --output
Use this option (in shell mode) to specify an output file. By
default, Html2Wml prints the result to standard output.
Debugging Options
-d, --debug[=*LEVEL*]
This option activates the debug mode. This prints the output result
with line numbering and with the result of the XML check. If the WML
compiler was called, the result is also printed in hexadecimal an
ascii forms. When called as a CGI, all of this is printed as HTML,
so that can use any web browser for that purpose.
--xmlcheck
When this option is on, it send the WML output to XML::Parser to
check its well-formedness.
DECK SLICING
The *deck slicing* is a feature that Html2Wml provides in order to match
the low memory capabilities of most Wap devices. Many can't handle cards
larger than 2,000 bytes, therefore the cards must be sufficiently small
to be viewed by all Wap devices. To achieve this, you should compile
your WML deck, which reduce the size of the deck by 50%, but even then
your cards may be too big. This is where Html2Wml comes with the deck
slicing feature. This allows you to limit the size of the cards,
currently only *before* the compilation stage.
Slice by cards or by decks
On some Wap phones, slicing the deck is not sufficient: the WML browser
still tries to download the whole deck instead of just picking one card
at a time. A solution is to slice the WML document by decks. See the
figure below.
_____________ _____________
| deck | | deck #1 |
| _________ | | _________ |
| | card #1 | | | | card | |
| |_________| | | |_________| |
| _________ | |_____________|
| | card #2 | |
| |_________| | . . .
| _________ |
| | ... | | _____________
| |_________| | | deck #n |
| _________ | | _________ |
| | card #n | | | | card | |
| |_________| | | |_________| |
|_____________| |_____________|
WML document WML document
sliced by cards sliced by decks
What this means is that Html2Wml generates several WML documents. In CGI
mode, only the appropriate deck is sent, selected by the id given in
parameter. If no id was given, the first deck is sent.
Note on size calculation
Currently, Html2Wml estimates the size of the card on the fly, by
summing the length of the strings that compose the WML output, texts and
tags. I say "estimates" and not "calculates" because computing the exact
size would require many more calculations than the way it is done now.
One may objects that there are only additions, which is correct, but
knowing the *exact* size is not necessary. Indeed, if you compile the
WML, most of the strings of the tags will be removed, but not all.
For example, take an image tag: ` '. When compiled, the string `"img"' will be replaced by a one
byte value. Same thing for the strings `"src"' and `"alt"', and the
spaces, double quotes and equal signs will be stripped. Only the text
between double quote will be preserved... but not in every cases.
Indeed, in order to go a step further, the compiler can also encode
parts of the arguments as binary. For example, the string
`"http://www."' can be encoded as a single byte (`8F' in this case). Or,
if the attribute is `href', the string `href="http://' can become the
byte `4B'.
As you see, it doesn't matter to know exactly the size of the textual
form of the WML, as it will always be far superior to the size of the
compiled form. That's why I don't count all the characters that may be
actually written.
Also, it's because I'm quite lazy ;-)
Why compiling the WML deck?
If you intent to create real WML pages, you should really consider to
always compile them. If you're not convinced, here is an illustration.
Take the following WML code snipet:
Yahoo!
It's the basic and classical way to code an hyperlink. It takes 42 bytes
to code this, because it is presented in a human-readable form.
The WAP Forum has defined a compact binary representation of WML in its
specification, which is called "compiled WML". It's a binary format,
therefore you, a mere human, can't read that, but your computer can. And
it's much faster for it to read a binary format than to read a textual
format.
The previous example would be, once compiled (and printed here as
hexadecimal):
1C 4A 8F 03 y a h o o 00 85 01 03 Y a h o o ! 00 01
This only takes 21 bytes. Half the size of the human-readable form. For
a Wap device, this means both less to download, and easier things to
read. Therefore the processing of the document can be achieved in a
short time compared to the tectual version of the same document.
There is a last argument, and not the less important: many Wap devices
only read binary WML.
ACTIONS
Actions are a feature similar to (but with far less functionalities!)
the SSI (Server Side Includes) available on good servers like Apache. In
order not to interfere with the real SSI, but to keep the syntax easy to
learn, it differs in very few points.
Syntax
Basically, the syntax to execute an action is:
Note that the angle brackets are part of the syntax. Except for that
point, Actions syntax is very similar to SSI syntax.
Available actions
Only few actions are currently available, but more can be implemented on
request.
include
Description
Includes a file in the document at the current point. Please
note that Html2Wml doesn't check nor parse the file, and if
the file cannot be found, will silently die (this is the
same behavior as SSI).
Parameters
`virtual=url' -- The file is get by http.
`file=path' -- The file is read from the local disk.
fsize
Description
Returns the size of a file at the current point of the
document.
Parameters
`virtual=url' -- The file is get by http.
`file=path' -- The file is read from the local disk.
Notes If you use the file parameter, an absolute path is
recommend.
skip
Description
Skips everything until the first `end_skip' action.
Generic parameters
The following parameters can be used for any action.
for=*output format*
This paramater restricts the action for the given output format.
Currently, the only available format is "`wml'" (when using
`html2chtml' the format is "`chtml'").
Examples
If you want to share a navigation bar between several WML pages, you can
`include' it this way:
Of course, you have to write this navigation bar first :-)
If you want to use your current HTML pages for creating your WML pages,
but that they contains complex tables, or unecessary navigation tables,
etc, you can simply `skip' the complex parts and keep the rest.
unecessary parts for the WML pages
useful parts for the WML pages
LINKS RECONSTRUCTION
The links reconstruction engine is IMHO the most important part of
Html2Wml, because it's this engine that allows you to reconstruct the
links of the HTML document being converted. It has two modes, depending
upon whether Html2Wml was launched from the shell or as a CGI.
When used as a CGI, this engine will reconstructs the links of the HTML
document so that all the urls will be passed to Html2Wml in order to
convert the pointed files (pages or images). This is completly automatic
and can't be customized for now (but I don't think it would be really
useful).
When used from the shell, this engine reconstructs the links with the
given templates. Note that absolute URLs will be left untouched. The
templates can be customized using the following syntax.
Templates
HREF Template
This template controls the reconstruction of the `href' attribute of
the `A' tag. Its value can be changed using the --hreftmpl option.
Default value is `"{FILEPATH}{FILENAME}{$FILETYPE =~
s/s?html?/wml/o; $FILETYPE}"'.
Image Source Template
This template controls the reconstruction of the `src' attribute of
the `IMG' tag. Its value can be changed using the --srctmpl option.
Default value is `"{FILEPATH}{FILENAME}{$FILETYPE =~
s/gif|png|jpe?g/wbmp/o; $FILETYPE}"'
Syntax
The template is a string that contains the new URL. More precisely, it's
a Text::Template template. Parameters can be interpolated as a constant
or as a variable. The template is embraced between curcly bracets, and
can contain any valid Perl code.
The simplest form of a template is `{PARAM}' which just returns the
value of PARAM. If you want to do something more complex, you can use
the corresponding variable; for example `{"foo $PARAM bar"}', or `{join
"_", split " ", PARAM}'.
You may read the Text::Template manpage for more information on what is
possible within a template.
If the original URL contained a query part or a fragment part, then they
will be appended to the result of the template.
Available parameters
URL This parameter contains the original URL from the `href' or `src'
attribute.
FILENAME
This parameter contains the base name of the file.
FILEPATH
This parameter contains the leading path of the file.
FILETYPE
This parameter contains the suffix of the file.
This can be resumed this way:
URL = http://www.server.net/path/to/my/page.html
------------^^^^ ----
| | \
| | \
FILEPATH FILENAME FILETYPE
Note that `FILETYPE' contains all the extensions of the file, so if its
name is index.html.fr for example, `FILETYPE' contains "`.html.fr'".
Examples
To add a path option:
{URL}$wap
Using Apache, you can then add a Rewrite directive so that URL ending
with `$wap' will be redirected to Html2Wml:
RewriteRule ^(/.*)\$wap$ /cgi-bin/html2wml.cgi?url=$1
To change the extension of an image:
{FILEPATH}{FILENAME}.wbmp
PROXY SUPPORT
Html2Wml uses LWP built-in proxy support. It is activated by default,
and loads the proxy settings from the environment variables, using the
same variables as many others programs. Each protocol (http, ftp, etc)
can be mapped to use a proxy server by setting a variable of the form
`PROTOCOL_proxy'. Example: use `http_proxy' to define the proxy for http
access, `ftp_proxy' for ftp access. In the shell, this is only a matter
of defining the variable.
For Bourne shell:
$ export http_proxy="http://proxy.domain.com:8080/"
For C-shell:
% setenv http_proxy "http://proxy.domain.com:8080/"
Under Apache, you can add this directive to your configuration file:
SetEnv http_proxy "http://proxy.domain.com:8080"
but this has the default that another CGI, or another program, can use
this to access external ressources. A better way is to edit Html2Wml and
fill the option `proxy-server' with the appropriate value.
CAVEATS
Html2Wml tries to make correct WML documents, but the well-formedness
and the validity of the document are not guarantied.
Inverted tags (like "bold italic ") may produce unexpected
results. But only bad softwares do bad stuff like this.
LINKS
Download
Html2Wml
This is the web site of the Html2Wml project, hosted by
SourceForge.net. All the stable releases can be downloaded from this
site.
[ http://www.html2wml.org/ ]
Nutialand
This is the web site of the author, where you can find the archives
of all the releases of Html2Wml.
[ http://www.maddingue.org/softwares/ ]
Resources
The WAP Forum
This is the official site of the WAP Forum. You can find some
technical information, as the specifications of all the technologies
associated with the WAP.
[ http://www.wapforum.org/ ]
WAP.com
This site has some useful information and links. In particular, it
has a quite well done FAQ.
[ http://www.wap.com/ ]
The World Wide Web Consortium
Altough not directly related to the Wap stuff, you may find useful
to read the specifications of the XML (WML is an XML application),
and the specifications of the different stylesheet languages (CSS
and XSL), which include support for low-resolution devices.
[ http://www.w3.org/ ]
TuxMobil
This web site is dedicated to Mobile UniX systems. It leads you to a
lot of useful hands-on information about installing and running
Linux and BSD on laptops, PDAs and other mobile computer devices.
[ http://www.tuxmobil.org/ ]
Programmers utilities
HTML Tidy
This is a very handful utility which corrects your HTML files so
that they conform to W3C standards.
[ http://www.w3.org/People/Raggett/tidy ]
Kannel
Kannel is an open source Wap and SMS gateway. A WML compiler is
included in the distribution.
[ http://www.kannel.org/ ]
WML Tools
This is a collection of utilities for WML programmers. This include
a compiler, a decompiler, a viewer and a WBMP converter.
[ http://pwot.co.uk/wml/ ]
WML browsers and Wap emulators
Opera
Opera is originaly a Web browser, but the version 5 has a good
support for XML and WML. Opera is available for free for several
systems.
[ http://www.opera.com/ ]
wApua
wApua is an open source WML browser written in Perl/Tk. It's easy to
intall and to use. Its support for WML is incomplete, but sufficient
for testing purpose.
[ http://fsinfo.cs.uni-sb.de/~abe/wApua/ ]
Tofoa
Tofoa is an open source Wap emulator written in Python. Its
installation is quite difficult, and its incomplete WML support
makes it produce strange results, even with valid WML documents.
[ http://tofoa.free-system.com/ ]
EzWAP
EzWAP, from EZOS, is a commercial WML browser freely available for
Windows 9x, NT, 2000 and CE. Compared to others Windows WML
browsers, it requires very few resources, and is quite stable. Its
support for the WML specs seems quite complete. A very good
software.
[ http://www.ezos.com/ ]
Deck-It
Deck-It is a commercial Wap phone emulator, available for Windows
and Linux/Intel only. It's a very good piece of software which
really show how WML pages are rendered on a Wap phone, but one of
its major default is that it cannot read local files.
[ http://www.pyweb.com/tools/ ]
Klondike WAP Browser
Klondike WAP Browser is a commercial WAP browser available for
Windows and PocketPC.
[ http://www.apachesoftware.com/ ]
WinWAP
WinWAP is a commercial Wap browser, freely available for Windows.
[ http://www.winwap.org/ ]
WAPman
WAPman from EdgeMatrix, is a commercial WAP browser available for
Windows and PalmOS.
[
http://www.edgematrix.com/edge/control/MainContentBean?page=download
s ]
Wireless Companion
Wireless Companion, from YourWap.com, is a WAP emulator available
for Windows.
[ http://www.yourwap.com/ ]
Mobilizer
Mobilizer is a Wap emulator available for Windows and Unix.
[ http://mobilizer.sourceforge.net/ ]
QWmlBrowser
QWmlBrowser (formerly known as WML BRowser) is an open source WML
browser, written using the Qt toolkit.
[ http://www.wmlbrowser.org/ ]
Wapsody
Wapsody, developed by IBM, is a freely available simulation
environment that implements the WAP specification. It also features
a WML browser which can be run stand-alone. As Wapsody is written in
Java/Swing, it should work on any system.
[ http://alphaworks.ibm.com/aw.nsf/techmain/wapsody ]
WAPreview
WAPreview is a Wap emulator written in Java. As it uses an HTML
based UI and needs a local web proxy, it runs quite slowly.
[ http://wapreview.sourceforge.net ]
PicoWap
PicoWap is a small WML browser made by three French students.
[ http://membres.lycos.fr/picowap/ ]
ACKNOWLEDGEMENTS
Werner Heuser, for his numerous ideas, advices and his help for the
debugging
Igor Khristophorov, for his numerous suggestions and patches
And all the people that send me bug reports: Daniele Frijia, Axel
Jerabek, Ouyang
AUTHOR
Sébastien Aperghis-Tramoni
COPYRIGHT
Copyright (C)2000, 2001, 2002 Sébastien Aperghis-Tramoni
This program is free software. You can redistribute it and/or modify it
under the terms of the GNU General Public License, version 2 or later.
html2wml-0.4.11/doc/style.css 0100644 0000765 0000024 00000003233 07746772372 0015671 0 ustar 00maddingue staff body {
background: white;
color: black;
font-family: "Lucida", "Verdana", "Arial", sans-serif;
font-size: 80%;
margin: 0;
padding: 1ex;
}
h1, h2, h3 {
font-family: "Lucida", "Verdana", "Arial", sans-serif;
background: transparent;
color: #006699;
}
code, pre {
font-family: "Courier New", monospace;
}
table {
border-collapse: collapse;
border-spacing: 0;
border-width: 0;
color: inherit;
}
td {
padding: 0.1ex 0.2em;
}
th {
background: #CCC;
color: inherit;
padding: 0.2ex 0.5em;
text-align: left;
}
th a:link, th a:visited {
background: transparent;
color: black;
}
img {
border: 0;
vertical-align: top;
}
form { margin: 0; }
input { margin: 2px; }
a:link, a:visited {
background: transparent;
color: #006699;
}
a:hover, a:focus {
background: #EEE;
}
div {
border-width: 0;
}
dt {
margin-top: 1em;
}
pre {
background: #eeeeee;
border: 1px solid #888888;
color: black;
padding-top: 0.25em;
padding-bottom: 0.25em;
white-space: pre;
}
hr {
display: none;
}
#index {
display: block;
border: outset 2px #006699;
position: fixed; top: 5px; right: 5px;
margin-left: 3px;
padding: 0.6ex 1.5em;
background: white;
font-size: 80%;
}
ul#index { padding-left: 2em; }
#index ul { padding-left: 2em; }
ul#index > li, ul#index > ul { display: none; }
ul#index:hover > li, ul#index:hover > ul { display: block; }
@media print {
/* h1 { page-break-before: always } */
h1, h2, h3,
h4, h5, h6 { page-break-after: avoid }
ul, ol, dl { page-break-before: avoid }
body { font-family: "Times New Roman", serif; }
p { text-align: justify; }
#index { display: none }
}
html2wml-0.4.11/html2chtml.cgi 0100755 0000765 0000024 00000161306 07746772372 0016025 0 ustar 00maddingue staff #!/usr/bin/perl
# ########################################################################### #
# Html2cHtml #
# ========== #
# Author: Sebastien Aperghis-Tramoni #
# #
# This program converts HTML pages to compactHTML pages. #
# See the documentation for more informations. #
# #
# This program is available under the GNU General Public License. #
# #
# You can find the original archive of this program on the author's web site #
# http://www.maddingue.org/softwares/ #
# #
# and on the web site of Html2Wml on SourceForge #
# http://www.html2wml.org/ #
# #
# Copyright (c)2000, 2001, 2002 Sebastien Aperghis-Tramoni #
# ########################################################################### #
use strict;
use CGI;
use File::Basename;
use Getopt::Long;
use HTML::Parser;
use LWP::UserAgent;
use POSIX qw(isatty);
use Text::Template;
use URI;
use URI::URL;
use vars qw($program $version);
$program = 'Html2cHtml';
$version = '0.4.11-chtml.03';
# --------------------------------------------------------------------------- #
# Static configuration #
# #
# If you want to hard-code some parameters of Html2cHtml, this is the #
# place to edit. Please check the documentation for more information. #
# #
my %defaults = (
## proxy settings
'proxy-server' => '', ## proxy server
## cHTML version and identifier
'wmlvers' => q||,
## characters encoding
'encoding' => 'iso-8859-1',
## links reconstruction default options
'hreftmpl' => '{FILEPATH}{FILENAME}{$FILETYPE =~ s/s?html?/wml/o; $FILETYPE}',
'srctmpl' => '{FILEPATH}{FILENAME}{$FILETYPE =~ s/gif|png|jpe?g/wbmp/o; $FILETYPE}',
## links reconstruction in CGI mode
'relative-url' => 1, ## use relative self path ?
);
my %options = (
'help' => 0, ## show the usage and exit
'version' => 0, ## show the program name and version and exit
## conversion options
'ascii' => 0, ## convert named entities to US-ASCII
'collapse' => 1, ## collapse white space characters
'compile' => 0, ## compile WML to binary tokenized data
'ignore-images' => 0, ## completly ignore image links
'img-alt-text' => 1, ## replace IMG tags with their ALT attribute
'linearize' => 1, ## suppress the tables tags
'nopre' => 0, ## don't use PRE tag
'numeric-non-ascii' => 0, ## convert non-ASCII characters to numeric entities
## links reconstruction options
'hreftmpl' => $defaults{hreftmpl},
'srctmpl' => $defaults{srctmpl},
## card splitting options
'split-card' => 1, ## slice the document by cards
'split-deck' => 0, ## slice the document by decks
'max-card-size' => 4_096, ## maximum size of data per card
'card-split-threshold' => 50, ## card split threshold
'next-card-label' => '[>>]', ## label of the link to go to the next card
'prev-card-label' => '[<<]', ## label of the link to go to the previous card
## HTTP authentication
'http-user' => '', ## HTTP user
'http-passwd' => '', ## HTTP password
## proxy support
'proxy' => 1, ## turn proxy support on/off
## debugging options
'debug' => undef,## activate the debug mode
'xmlcheck' => 0, ## perform a well-formedness check (using XML::Parser)
);
# You should not edit below this line unless you know what you are doing. #
# --------------------------------------------------------------------------- #
#
# globals
#
sub debug;
sub warning;
sub fatal;
use vars qw($cgi);
$cgi = 0;
my $agent; ## LWP user agent
my $result; ## WML deck in text format
my $binary; ## WML deck in binary format
my $xmlckres = '';
my $complres = '';
my %optname = (
'a' => 'ascii',
'c' => 'xmlcheck',
'd' => 'debug',
'i' => 'ignore-images',
#'h' => 'help', ## shell only
'k' => 'compile',
'n' => 'numeric-non-ascii',
#'o' => 'output', ## shell only
'p' => 'nopre',
'P' => 'http-passwd',
's' => 'max-card-size',
't' => 'card-split-threshold',
'U' => 'http-user',
#'v' => 'version', ## shell only
);
my %optchar = ();
## used by the html parser
use vars qw(%state);
%state = (
doc_uri => '', ## document absolute URI
doc_name => '', ## document file name
self_url => '', ## the CGI's URL for self-referencing
self_srv => '', ## the CGI's server
output => '', ## buffer for storing output
type => 'chtml', ## type of the output
decks => {}, ## hash that contains the decks, indexed by their id
skip => 0, ## skip switch (on/off)
stack => [], ## tag stack
cardsize => 0, ## size of the current card/deck
cardid => 'wdf000', ## ID of the current card/deck (stands for "WML Document - Fragment 000")
title => '', ## title of the WML deck
encoding => '', ## encoding of the document
);
my %entities = (); ## named entities conversion table
my %num2ascii = (); ## non-ASCII characters to ASCII equivalent conversion table
#
# The following two hashes are based on the WML DTD. They are the hardcoded
# conversion tables which describe the legal syntax of WML tags.
#
my %dtdent = ();
## parameter entities
$dtdent{heading} = 'h1,h2,h3,h4,h5,h6';
$dtdent{list} = 'ul,ol,dir,menu';
## text markup
$dtdent{phrase} = 'dfn';
$dtdent{special} = 'a,img,br';
$dtdent{form} = 'input,select,textarea';
$dtdent{text} = "$dtdent{phrase},$dtdent{special},$dtdent{form}";
## HTML content model
$dtdent{block} = "p,$dtdent{list},pre,dl,div,center,blockquote,form,hr";
$dtdent{flow} = "$dtdent{text},$dtdent{block}";
$dtdent{emph} = 'em,strong,b,i,u,big,small';
$dtdent{layout} = 'br';
$dtdent{fields} = "$dtdent{flow},input,select,fieldset";
$dtdent{body} = "$dtdent{heading},$dtdent{text},$dtdent{block}";
my %with = (
html => { action => 'keep', nest => 'head,body', unique => 1 },
## header tags
head => { action => 'keep', nest => 'meta,title,base', unique => 1 },
meta => { action => 'keep', nest => 'EMPTY', attributes => 'http-equiv' },
title => { action => 'keep', nest => '' },
base => { action => 'keep', nest => 'EMPTY', atrributes => 'href' },
style => { action => 'skip' },
script => { action => 'skip' },
## structural tags
body => { action => 'keep', nest => $dtdent{body}, unique => 1 },
div => { action => 'keep', nest => $dtdent{body}, attributes => 'align' },
center => { action => 'keep', nest => $dtdent{body} },
h1 => { action => 'keep', nest => "$dtdent{text}", attributes => 'align' },
h2 => { action => 'keep', nest => "$dtdent{text}", attributes => 'align' },
h1 => { action => 'keep', nest => "$dtdent{text}", attributes => 'align' },
h4 => { action => 'keep', nest => "$dtdent{text}", attributes => 'align' },
h5 => { action => 'keep', nest => "$dtdent{text}", attributes => 'align' },
h6 => { action => 'keep', nest => "$dtdent{text}", attributes => 'align' },
p => { action => 'keep', nest => "$dtdent{text}", attributes => 'align' },
pre => { action => 'keep', nest => "$dtdent{text}", attributes => 'align' },
plaintext => { action => 'keep', nest => "$dtdent{text}", attributes => 'align' },
blockquote => { action => 'keep', nest => $dtdent{body} },
br => { action => 'keep', nest => 'EMPTY' },
hr => { action => 'keep', nest => 'EMPTY', attributes => 'align,' },
## lists
dl => { action => 'keep', nest => 'dt,dd' },
dt => { action => 'keep', nest => $dtdent{text} },
dd => { action => 'keep', nest => $dtdent{flow} },
ol => { action => 'keep', nest => 'li' },
ul => { action => 'keep', nest => 'li' },
dir => { action => 'keep', nest => "li,$dtdent{block}" },
menu => { action => 'keep', nest => "li,$dtdent{block}" },
li => { action => 'keep', nest => $dtdent{flow} },
## tables
caption => { action => 'replace', new_value => 'p', render => 'b' },
'tr' => { action => 'replace', new_value => 'p' },
th => { action => 'replace', new_value => 'p', render => 'b' },
#td => { action => 'keep', nest => "$dtdent{emph},$dtdent{layout},img,a,anchor" },
## link tags
a => { action => 'keep', nest => "$dtdent{text}", attributes => 'name,href,accesskey' },
img => { action => 'keep', nest => 'EMPTY', attributes => 'src,align,width,height,hspace,vspace,alt,border' },
frame => { action => 'replace', new_value => 'p' },
area => { action => 'replace', new_value => 'p' },
## style tags
em => { action => 'keep', nest => $dtdent{flow} },
strong => { action => 'keep', nest => $dtdent{flow} },
b => { action => 'keep', nest => $dtdent{flow} },
i => { action => 'keep', nest => $dtdent{flow} },
u => { action => 'keep', nest => $dtdent{flow} },
big => { action => 'keep', nest => $dtdent{flow} },
small => { action => 'keep', nest => $dtdent{flow} },
## form tags
form => { action => 'keep', nest => $dtdent{body}, attributes => 'action,method,enctype' },
input => { action => 'keep', nest => 'EMPTY', attributes => 'type,name,value,checked,size,maxlength,src,align'},
'select' => { action => 'keep', nest => 'option', attributes => 'name,size,multiple' },
option => { action => 'keep', nest => 'EMPTY', attributes => 'selected,value' },
textarea => { action => 'keep', nest => 'EMPTY', attributes => 'name,rows,cols' },
);
#
# The following hash hardcodes the parent-lookup for each element
# of the WML syntax, i.e. for each element, it gives the prefered
# parent element.
#
my %reverse = (
html => '',
## header tags
head => 'html', meta => 'head',
title => 'head', base => 'head',
## structural tags
body => 'html', div => 'body', center => 'body', p => 'body',
h1 => 'body', h2 => 'body', h3 => 'body', h4 => 'body',
h5 => 'body', h6 => 'body', pre => 'body', plaintext => 'body',
br => 'p', hr => 'p',
## lists tags
dl => 'p', dt => 'dl', dd => 'dl',
ol => 'p', ul => 'p', dir => 'p', menu => 'p',
li => 'ul',
## link tags
a => 'p', img => 'p',
## style tags
b => 'p', i => 'p', u => 'p',
strong => 'p', em => 'p',
big => 'p', small => 'p',
## form tags
form => 'body', input => 'form', 'select' => 'form', option => 'select',
textarea => 'form',
);
#
# main
#
$| = 1;
my $time = time;
Getopt::Long::Configure(qw(no_auto_abbrev));
fileparse_set_fstype('Unix'); ## this is because I use fileparse() to
## split the URL fragments
## CGI security options
$CGI::POST_MAX = 1024 * 1; # max 1K posts
$CGI::DISABLE_UPLOADS = 1; # no uploads
load_entities();
## create the user agent
$agent = new LWP::UserAgent protocols_forbidden => ['file'];
$agent->agent("[$program/$version ".$agent->agent.']');
for my $opt (keys %optname) {
$optchar{$optname{$opt}} = $opt
}
## constructing %num2ascii using data from %entities
for my $ent (keys %entities) {
$num2ascii{$entities{$ent}[0]} = $entities{$ent}[1]
}
if(@ARGV or isatty(\*STDOUT)) {
## launched from shell
my @opts = (
## usage options
qw(help|h|H version|v|V),
## conversion options
qw(ascii|a! collapse! ignore-images|i img-alt-text!
linearize! nopre|p numeric-non-ascii|n),
## links reconstructions options
qw(hreftmpl=s srctmpl=s),
## card splitting options
qw(split-card split-deck
max-card-size|s=i card-split-threshold|t=i
next-card-label=s prev-card-label=s),
## HTTP authentication
qw(http-user|U=s http-passwd|P=s),
## proxy support
qw(proxy|Y!),
## output options
qw(compile|k output|o=s),
## debugging options
qw(debug|d:i xmlcheck|c!)
);
## getting options
GetOptions(\%options, @opts);
version() if $options{version};
usage() if $options{help};
usage() unless @ARGV;
apply_options();
## converting the file
$result = html2wml(shift);
} else {
## launched from web
$cgi = new CGI;
$agent->agent($cgi->user_agent . ' ' . $agent->agent);
$cgi->compile(qw(param url header));
## get the options
for my $param ($cgi->param) {
my $option = length($param) == 1 ? $optname{$param} : $param;
next unless exists $options{$option};
$options{$option} = $cgi->param($param)
}
apply_options();
$state{doc_name} = 'output';
## creating static part of the self url
my $cgi_options = '';
for my $param ($cgi->param) {
next if 'url,id' =~ /\b$param\b/;
my $value = $cgi->param($param);
next unless $value;
next unless exists $options{$param} or exists $options{$optname{$param}};
my $opt = exists $optchar{$param} ? $optchar{$param} : $param;
$cgi_options .= "$param=$value;"
}
$state{self_url} = $cgi->url(-relative => $defaults{'relative-url'}) . "?$cgi_options";
## send debug header if needed
print $cgi->header if $options{'debug'};
## execute main part
$result = html2wml($cgi->param('url') || '/');
}
## special case: splitting by decks
if($cgi and $options{'split-deck'}) {
## return the desired deck (as specified by the id parameter)
## or the first deck if none has been specified
$result = $state{decks}{ $cgi->param('id') || (sort keys %{$state{decks}})[0] }
}
## XML check
if($options{xmlcheck}) {
eval {
require XML::Parser;
my $parser = new XML::Parser Style => 'Tree', ErrorContext => 2;
$parser->parse($result);
};
$@ =~ /Can't locate/ and $@ = '(XML::Parser not available)';
$xmlckres = $@ ? "\nExpat errors\n$@" : "Expat: well-formed";
eval {
require XML::LibXML;
my $parser = new XML::LibXML;
$parser->validation(1);
$parser->expand_entities(1);
$parser->load_ext_dtd(1);
my $dom = $parser->parse_string($result);
#die "document isn't valid" unless $dom->is_valid;
};
$@ =~ /Can't locate/ and $@ = ', (XML::LibXML not available)';
$xmlckres .= $@ ? "\nGnome-XML errors\n$@" : ", Gnome-XML: valid";
eval {
require XML::Checker::Parser;
#XML::Checker::Parser::map_uri(
# "-//WAPFORUM//DTD WML 1.2//EN" => "file:///Users/madingue/Documents/Softwares/Html2Wml/devel/t/wml12.dtd"
#);
my $parser = new XML::Checker::Parser Style => 'Tree',
ErrorContext => 2, ParseParamEnt => 1, #NoLWP => 1,
SkipExternalDTD => 1, KeepCDATA => 1;
$parser->parse($result);
};
$@ =~ /Can't locate/ and $@ = ', (XML::Checker not available)';
$xmlckres .= $@ ? "\nXML-Checker errors\n$@" : ", XML-Checker: valid";
}
## WML tokenization
if($options{compile}) {
$binary = '';
my $buf;
eval {
require IPC::Open2;
require IO::Handle;
my $in = new IO::Handle;
my $out = new IO::Handle;
my $pid = IPC::Open2::open2($out, $in, $defaults{wmlc}, '-', '-');
print $in $result;
close($in);
$binary = join '', <$out>;
close($out);
waitpid($pid, 0);
};
$complres = $@
}
if($options{'debug'}) { ## debug output
$time = time - $time;
my @times = times;
$times[0] += $times[2]; ## total user time
$times[1] += $times[3]; ## total system time
$times[2] = $times[0] + $times[1]; ## total time
my $i = 1;
$result .= "\n";
$result =~ s/^/@{[sprintf '%3d', $i++]}: /gm; ## add lines number
$result = simple_wrap($result);
if($cgi) {
print qq|\n\n$program -- Debug Mode \n|,
qq|\n|,
qq|\n\n$program -- Debug Mode \n|,
qq|This is the result of the conversion of the document |,
qq|$state{doc_uri} by $program v$version.
\n|,
qq| \n|,
htmlize($result),
qq| \nResult of XML check:
\n|,
htmlize($xmlckres);
print qq| \nResult of WML compilation:
\n|,
($complres ? "$complres\n" : hextype($binary)), " \n"
if $options{compile};
printf " \nTime: $time wallclock secs (%.2f usr + %.2f sys = %.2f cpu)
\n",
@times[0..2];
print qq|\n\n|
} else {
my $s = "$program -- Debug Mode\n";
print $s, '-'x length($s), "\n",
$result, "\n", ' -'x5, "\n",
$xmlckres, "\n";
print ' -'x5, "\nCompiled WML\n", ' -'x5, "\n",
($complres ? "$complres\n" : hextype($binary))
if $options{compile};
print ' -'x5, "\n";
printf "Time: $time wallclock secs (%.2f usr + %.2f sys = %.2f cpu)\n", @times[0..2];
}
} else { ## normal output
my $out = \*STDOUT;
if($options{'output'}) {
open(OUT, ">$options{output}") or fatal "cannot write to '$options{output}': $!\n";
$out = \*OUT;
}
if($options{'compile'}) {
print $out $cgi->header(
-type => 'application/vnd.wap.wmlc',
-content_length => length $result
) if $cgi;
print $out $binary;
} else {
print $out $cgi->header(
-type => "text/vnd.wap.wml; charset=$state{encoding}",
-content_length => length $result
) if $cgi;
print $out $result;
}
}
#
# apply_options()
# -------------
sub apply_options {
if($options{'linearize'}) {
delete @with{qw(table tr td th)};
$with{'caption'} = { action => 'replace', new_value => 'p', render => 'b' };
$with{'tr'} = { action => 'replace', new_value => 'p' };
delete @reverse{qw(table tr td)};
}
if($options{'ignore-images'}) {
delete $with{'img'};
}
if(not defined $options{'debug'}) {
$options{'debug'} = 0;
} elsif($options{'debug'} == 0) {
$options{'debug'} = 1;
}
if($options{'debug'}) {
$options{'xmlcheck'} = 1;
}
if($options{'nopre'}) {
delete $with{'pre'};
$with{'pre'} = { action => 'replace', new_value => 'p' };
$with{'plaintext'} = { action => 'replace', new_value => 'p' };
}
if($cgi) {
$options{'split-card'} = 0;
$options{'split-deck'} = 1;
## security: don't allow to modify the templates
## when called as a CGI
$options{'hreftmpl'} = $defaults{'hreftmpl'};
$options{'srctmpl'} = $defaults{'srctmpl'};
}
## security: check if the templates contains suspicious code
## if the templates have changed
if($options{hreftmpl} ne $defaults{hreftmpl} or $options{srctmpl} ne $defaults{srctmpl}) {
my $forbidden = join '|', '[``]', map {"\\b$_\\b"}
qw(eval exec system unlink kill fork open sysopen pipe socket);
$options{hreftmpl} = $defaults{hreftmpl} if $options{hreftmpl} =~ /$forbidden/;
$options{srctmpl} = $defaults{hreftmpl} if $options{srctmpl} =~ /$forbidden/;
}
$options{'cardsize-limit'} = $options{'max-card-size'} - $options{'card-split-threshold'};
if($^O eq 'MacOS') {
$options{'compile'} = 0;
}
if($options{'compile'}) {
$options{'prev-card-label'} = '[<<]';
}
## proxy support
if($options{'proxy'}) {
if($defaults{'proxy-server'}) {
## use hardcoded settings
$agent->proxy([qw(http ftp gopher)] => $defaults{'proxy-server'});
} else {
## load from environment
$agent->env_proxy();
}
}
## cHTML: there is no "multi-body" in cHTML as there is the
## multi-card mechanism in WML, so we always activate the
## document slicing (split-deck option)
$options{'split-card'} = 0;
$options{'split-deck'} = 1;
}
#
# html2wml()
# --------
sub html2wml {
my $url = shift;
my $file = '';
my $type = '';
my $enc = '';
my $converter = new HTML::Parser api_version => 3;
return unless $url;
## read the file
if($url =~ m{https?://}) { ## absolute uri
($file,$type,$enc) = get_url($url)
} elsif(not $cgi) { ## local file
$file = read_file($url)
} else { ## absolute url relative to the server
($file,$type,$enc) = get_url( $url = URI::URL->new($url, $cgi->url)->abs )
}
$enc ||= '';
$enc =~ s/charset=//i;
url_encode($url);
$state{doc_uri} = $url;
($state{self_srv}) = ($state{self_url} =~ m|^(https?://[\w.-]+(?::\d+)?)/|);
## strip the DOCTYPE
$file =~ s/]+>//go;
## try to get the document charset encoding
if(not $enc and $file =~ m|meta +http-equiv.+charset=["']?([a-zA-Z0-9_-]+)['"]?|i) {
$enc = lc $1
}
$state{encoding} = $enc || $defaults{'encoding'};
$type ||= '';
## if it's an image, call send_image()
if(index($type, 'image') >= 0 or $url =~ /\.(?:gif|jpg|png)$/i) {
@_ = ($file, $url);
goto &send_image
}
## get the document title
if($file =~ m|([^<]+) |i) {
$state{title} = $1;
convert_entities($state{title});
clean_spaces($state{title});
}
## WML header
$state{skip} = 0;
$state{output} = "$defaults{wmlvers}\n";
## affectation of the HTML::Parser handlers
$converter->unbroken_text(1);
$converter->handler(start => \&start_tag, 'tagname, attr');
$converter->handler(end => \&end_tag, 'tagname');
$converter->handler(text => \&text_tag, 'text, is_cdata');
$converter->handler(comment => \&comment_tag, 'tokens');
#$converter->handler(declaration => \&default_handler, 'text');
#$converter->handler(process => \&default_handler, 'text');
#$converter->handler(default => \&default_handler, 'text');
## begin the conversion
$converter->parse($file);
$converter->eof;
## flush the stack
while(my $tag = pop @{$state{stack}}) {
$state{output} .= "$tag>"
}
post_conversion_cleanup();
$state{decks}{$state{cardid}} = $state{output};
return $state{output}
}
#
# post_conversion_cleanup()
# -----------------------
#
sub post_conversion_cleanup {
## convert alone ampersand characters to entities
$state{output} =~ s/\&\s/\& /go;
## correct unclosed numeric entities
$state{output} =~ s/(\\d+)([^\d;])/$1;$2/go;
## convert the named HTML entities to numeric entities
convert_entities($state{output});
## convert non-ASCII characters to numeric entities
if($options{'numeric-non-ascii'}) {
$state{output} =~ s/([\x80-\xFF])/''.ord($1).';'/eg;
}
## escape $ chars
#$state{output} =~ s/\$/\$\$/go; ## not needed in cHTML
collapse($state{output}) if $options{'collapse'};
## set the title of the card
if(length $state{title}) {
my $title = $state{title};
$title =~ s/"/\"/go;
$title =~ s/\$/\$\$/go;
$title =~ s/(\\d+)([^\d;])/$1;$2/go;
$state{output} =~ s/|>|go; ## collapse spaces inside tags
$_[0] =~ s|\s+/>|/>|go; ## collapse spaces inside empty tags
$_[0] =~ s|<(\w+) +|<$1 |g; ## collapse spaces between tag and attributes
$_[0] =~ s|\s+|
|go; ## collapse spaces at the begining of a paragraph
$_[0] =~ s|\s+
||go; ## collapse spaces at the end of a paragraph
## collapse empty paragraphs
$_[0] =~ s|]*>\s*
||go;
$_[0] =~ s|]*>\s*(?: )+\s*
||go;
$_[0] =~ s|]*>\s*(?:\ \s*)+
||go;
$_[0] =~ s|]*>\s*(?:\ \s*)+
||go;
$_[0] =~ s|]*>\s*(?:\[IMG\]\s*)+
||go;
$_[0] =~ s|<(\w+)>\s*\1>||go;
## collapse multiple lines
$_[0] =~ s/\n+/\n/go;
$_[0] =~ s/(?: +\n)+/\n/go;
}
#
# get_url()
# -------
# This function gets and returns the file from the given URI.
# If called in a array context, returns the file content and the associated
# MIME type (as given by the server).
#
sub get_url {
my $uri = shift;
my $quiet = shift || 0;
if($cgi and index($uri, 'file:') == 0) {
cgi_error(q|For security reasons, the file: scheme is not allowed.|)
}
my $request = new HTTP::Request GET => $uri;
my $response = $agent->request($request);
if($response->is_error) {
if($response->status_line == 401) {
## Authorization required
my($realm) = ($response->header('WWW-Authenticate') =~ /realm=(.+)/);
my $self = "$state{self_url}url=$state{doc_uri}";
if($options{'http-user'} and $options{'http-passwd'}) {
$request->www_authenticate($response->header('WWW-Authenticate'));
$request->authorization_basic($options{'http-user'}, $options{'http-passwd'});
$response = $agent->request($request);
} else {
if($cgi) {
print $cgi->header(-type => 'text/vnd.wap.wml'), <<"PASSFORM"; exit
$defaults{wmlvers}
Please enter your user name and password for $realm.
User:
Password:
PASSFORM
} else {
fatal <<"PASSASK"
website requires authentication
The web site requires you to authenticate in order to process your request.
Please enter your user name and password for $realm.
Use the --http-user and --http-passwd options (or their short counterparts
-U and -P). Check the documentation for more information.
PASSASK
}
}
} else {
my $err = <<"ERR";
The following error occured while trying to access the following URL
-- $uri --
Error @{[ $response->status_line ]}
ERR
if($cgi) {
if($quiet) {
warning "can't fetch file:\n", $err;
return '';
} else {
cgi_error($err)
}
} else {
fatal "fetch error\n\n", $err
}
}
}
return wantarray ? ($response->content, $response->content_type,
$response->content_encoding) : $response->content
}
#
# read_file()
# ---------
# This function reads and returns the file from the local disk.
#
sub read_file {
my $filepath = shift;
my $quiet = shift || 0;
my $dir = dirname($filepath);
my $file = basename($filepath);
chdir($dir) if $dir;
open(FILE, $file) or my $failed = 1;
if($failed) {
if($quiet) {
warning("can't read file '$file': $!\n") and return ''
} else {
fatal("can't read file '$file': $!\n")
}
}
local $/ = undef;
$file = ;
close(FILE);
return $file
}
#
# send_image()
# ----------
# This function allow Html2Wml to send WBMP images to the client.
# Currently, it send an empty hardcoded image, but support for
# conversion from common formats (GIF, JPEG, PNG) will be added soon.
#
sub send_image {
my $data = shift;
my $path = shift;
my $pixel = pack 'C*', 0, 0, 1, 1, 0xFF; ## this is one white pixel
## TODO: add the code to allow conversion using an external program
print $cgi->header(-type => 'image/wbmp', -content_length => length $pixel), $pixel;
exit
}
#
# convert_entities()
# ----------------
# This function converts the named HTML entities into numeric entities.
#
sub convert_entities {
my $ascii = $options{ascii};
## try to correct unclosed named entities
$_[0] =~ s/(&\w{2,6})\b([^;])/$1;$2/go;
## convert numeric entities and non-ASCII characters
## to ASCII equivalent if requested
if($ascii) {
$_[0] =~ s/(\d+);/$num2ascii{$1}/g;
$_[0] =~ s/([\x80-\xFF])/$num2ascii{ord($1)}/g;
}
my $code = q| while($_[0] =~ /&(\w+);/g) { |
. q| my $ent = $1; |
. q| if(exists $entities{$ent}) { |
.($ascii ? q| my $chr = $entities{$ent}[1]; |
: q| my $chr = ''.$entities{$ent}[0].';'; | )
. q| $_[0] =~ s/&$ent;/$chr/g |
. q| } |
. q| } |;
eval $code;
if($_[0] =~ /&(\w{2,6});?/) {
## there are some residual unknown or incorrect named entities
while($_[0] =~ /&(\w{2,6});?/g) {
my $ent = $1;
## check if $ent is a known entity
if(exists $entities{$ent}) {
warning "unclosed entity: $ent, corrected\n";
my $chr = $ascii ? $entities{$ent}[1] : ''.$entities{$ent}[0].';';
$_[0] =~ s//$chr/;
next
}
my($e1,$e2) = ('','');
## split the entity in two parts and check if the first part
## is a valid entity name
## entities names are between 2 and 6 characters long, so this
## loop won't be executed more than 4 times
for my $i (2..length($ent)) {
$e1 = substr($ent, 0, $i);
$e2 = substr($ent, $i);
last if exists $entities{$e1}
}
if(exists $entities{$e1}) {
warning "unknown entity: $ent, replaced with $e1\n";
my $chr = $ascii ? $entities{$e1}[1] : ''.$entities{$e1}[0].';';
$_[0] =~ s//$chr$e2/;
} else {
warning "unknown entity: $ent\n";
$_[0] =~ s//&$ent/;
}
}
}
## escape the remaining ampersands
$_[0] =~ s/&(\w+[^;])/&$1/g;
}
#
# clean_spaces()
# ------------
sub clean_spaces {
$_[0] =~ s/\t+/ /go;
$_[0] =~ s/^\s+/ /go;
$_[0] =~ s/ +/ /go;
}
#
# HTML::Parser start tag handler
#
sub start_tag {
my($tag, $attr) = @_;
local $_;
return unless exists $with{$tag};
return if $state{skip};
my $curr_tag = ($with{$tag}{action} eq 'replace' ? $with{$tag}{new_value} : $tag);
my $prev_tag = scalar @{$state{stack}} ? ${$state{stack}}[-1] : 0;
## prevent incorrect auto-nesting
return if $curr_tag eq $prev_tag and $with{$curr_tag}{unique} and $with{$curr_tag}{nest} !~ /\b$curr_tag\b/;
## special case: replacing image with its alternative text when necessary
if($curr_tag eq 'img' and $options{'img-alt-text'}) {
my $alt = $attr->{alt} || $attr->{title} || $attr->{id} || $attr->{name} || '[IMG]';
text_tag($alt) and return
}
## special case: tag
if($tag eq 'frame') {
if($prev_tag eq 'p') { pop @{$state{stack}}; $state{output} .= '' }
if($prev_tag eq 'wml') { push @{$state{stack}}, 'card'; $state{output} .= '' }
my $link = xlate_url($$attr{src}, 'href');
$state{output} .= qq|Frame: $$attr{name}
|;
}
## special case: image map tag
if($tag eq 'area') {
if($prev_tag eq 'p') { pop @{$state{stack}}; $state{output} .= '' }
if($prev_tag eq 'wml') { push @{$state{stack}}, 'card'; $state{output} .= '' }
my $link = xlate_url($$attr{href}, 'href');
$state{output} .= qq|Image map: $$attr{href}
|;
}
## special case: when inside a don't allow opening tags
if($prev_tag eq 'a' and $with{a}{nest} !~ /\b$curr_tag\b/) {
return
}
## special case: is replaced by
if($curr_tag eq 'a' and not exists $attr->{href}) {
$curr_tag = $tag = 'anchor';
$attr->{id} = exists $attr->{id} ? $attr->{id} : $attr->{name};
delete $attr->{name};
}
## special case: element that defines a base URL
if($tag eq 'base') {
$state{doc_uri} = URI::URL->new($attr->{href}, $state{doc_uri})->abs;
}
## reconstruct well-formed attributes list with only the allowed ones
if(exists $with{$curr_tag}{attributes} and scalar keys %$attr) {
my @attrs = ();
for my $param (keys %$attr) {
if(index($with{$curr_tag}{attributes}, $param) >= 0) {
$param = $with{$tag}{attrconv}{$param} if exists $with{$tag}{attrconv}{$param};
my $value = $attr->{$param};
if($param eq 'align' or $param eq 'type') {
$value = lc $value
} elsif($param eq 'href' or $param eq 'src') {
$value = xlate_url($value, $param);
}
convert_entities($value);
push @attrs, qq|$param="$value"|;
}
}
$attr = join ' ', '', @attrs;
} else {
$attr = ''
}
## set the skip mode state
$state{skip} = 1 if $with{$curr_tag}{action} eq 'skip';
debug [2], "\n(start tag) <$tag> => action: ", ($with{$tag}{action} ? $with{$tag}{action} : 'clear'), ($curr_tag ne $tag ? " with $curr_tag " : ''), ($attr? ", attributes:$attr" : ''), "\n";
if($with{$curr_tag}{action} eq 'keep') {
# TODO: this part of the syntax repairing engine will have to be
# re-written. Maybe a loop on the stack to check whether the
# tree is correct, and in case not, insert the missing ones
if(scalar @{$state{stack}}) {
debug [2], " -> syntax repair: closing tags ";
## syntax repair: close the tags that were left opened
while($prev_tag = pop @{$state{stack}}) {
if($with{$prev_tag}{nest} =~ /\b$curr_tag\b/
or $with{$prev_tag}{nest} =~ /\b$reverse{$curr_tag}\b/) {
push @{$state{stack}}, $prev_tag;
last
}
debug [2], "$prev_tag> ";
$state{output} .= "$prev_tag>";
}
debug [2], "\n";
}
## syntax repair: open the tags that should have been opened
if($with{$prev_tag}{nest} !~ /\b$curr_tag\b/) {
debug [2], " -> syntax repair: opening tags ";
my($inner_tag,$outter_tag) = ($curr_tag,$prev_tag);
my @nesting_tags = ();
while($reverse{$inner_tag} and $reverse{$inner_tag} ne $outter_tag) {
$inner_tag = $reverse{$inner_tag};
debug [2], "<$inner_tag> ";
unshift @nesting_tags, $inner_tag;
}
push @{$state{stack}}, @nesting_tags;
for my $t (@nesting_tags) { $state{output} .= "<$t>" }
debug [2], "\n";
debug [2], " new stack: (@{$state{stack}})\n";
}
}
## clean up a little
collapse($state{output}) if $options{'collapse'};
## split the card if needed ## disabled for cHTML
#if($state{cardsize} > $options{'cardsize-limit'}
# and exists $with{$tag}{special} and $with{$tag}{special} =~ /nowidow/) {
# split_card()
#}
## simple tag translation
if($with{$curr_tag}{action} eq 'keep') {
if($with{$curr_tag}{nest} eq 'EMPTY') {
$state{cardsize} += length($curr_tag) + length($attr);
$state{output} .= "<$curr_tag$attr/>"
} else {
$state{cardsize} += length($curr_tag) + length($attr);
$state{output} .= "<$curr_tag$attr>";
push @{$state{stack}}, $curr_tag;
}
} else {
## do nothing
}
## additional rendering effects
if(defined $with{$tag}{render}) { ## note that it's $tag, not $curr_tag
for my $t (split ',', $with{$tag}{render}) {
$state{cardsize} += length $t;
$state{output} .= "<$t>"
}
}
}
#
# HTML::Parser end tag handler
#
sub end_tag {
my($tag) = @_;
return unless exists $with{$tag};
my $curr_tag = ($with{$tag}{action} eq 'replace' ? $with{$tag}{new_value} : $tag);
## special case: anchors
if($tag eq 'a' and ${$state{stack}}[-1] eq 'anchor') { $curr_tag = $tag = 'anchor'}
debug [2], "( end tag ) $curr_tag> stack = (@{$state{stack}})\n\n";
$state{skip} = 0 if $with{$tag}{action} eq 'skip';
return if $state{skip};
return if exists $with{$tag}{nest} and $with{$tag}{nest} eq 'EMPTY';
## additional rendering effects
if(defined $with{$tag}{render}) { ## note that it's $tag, not $curr_tag
for my $t (reverse split ',', $with{$tag}{render}) {
$state{cardsize} += length $t;
$state{output} .= "$t>"
}
}
## special case: /card cleans up the stack
if($curr_tag eq 'card') {
while(${$state{stack}}[-1] ne $curr_tag) {
my $t = pop @{$state{stack}};
$state{cardsize} += length $t;
$state{output} .= "$t>";
}
my $s = qq|\n
|;
$state{cardsize} += length $s ;#- 25;
$state{output} .= $s;
}
## closing element
if(${$state{stack}}[-1] eq $curr_tag and $with{$curr_tag}{action} eq 'keep') {
$state{cardsize} += length $curr_tag;
$state{output} .= "$curr_tag> ";
pop @{$state{stack}};
} else {
## do nothing
}
## clean up a little
collapse($state{output}) if $options{'collapse'};
## check current card size ## disabled for cHTML
#if($curr_tag ne 'card' and $curr_tag ne 'wml' and $state{cardsize} > $options{'cardsize-limit'}) {
# split_card()
#}
}
#
# HTML::Parser text handler
#
sub text_tag {
my($text) = @_;
my $curr_tag = ${$state{stack}}[-1] || '';
debug [3], "(text node) stack = (@{$state{stack}})\n- - - - -\n$text\n- - - - -\n";
return if $state{skip};
return if $text =~ /^\s*$/s; ## skip empty lines
## add a para tag if we're on the card node
if($curr_tag eq 'card') {
$state{cardsize} += 4;
$state{output} .= "\n";
push @{$state{stack}}, 'p';
}
clean_spaces($text) if $options{'collapse'} and $curr_tag ne 'pre';
#
# TODO: add the code that split too long chunks of text
#
$state{output} .= $text;
$state{cardsize} += length $text;
}
#
# HTML::Parser comment tag handler
#
sub comment_tag {
my($comment) = @_;
local $_;
$comment = join '', @$comment;
debug [3], "( comment ) stack = (@{$state{stack}})\n $comment\n";
## Actions engine
if($comment =~ /^\s*\[(\w+)\s*(.*)\]\s*$/) {
my $action = $1;
my %attributes = map { /\G(\w+)=["']([^"']+)["']/g } split /\s+/, $2;
for my $attr (keys %attributes) {
if($attr eq 'for') {
return if $attributes{$attr} ne $state{type};
}
if($attr eq 'virtual' and $attributes{virtual} !~ /^http:/) {
$attributes{virtual} = URI::URL->new( $attributes{virtual}, $state{doc_uri} )->abs
}
}
for($action) {
/include/ and do {
my $buf;
if($attributes{virtual}) {
$buf = get_url($attributes{virtual}, 1);
} elsif($attributes{file}) {
$buf = read_file($attributes{file}, 1)
}
$state{output} .= $buf;
$state{cardsize} = length $buf;
};
/skip/ and do {
$state{skip} = 1;
};
/end_skip/ and do {
$state{skip} = 0;
};
/fsize/ and do {
my $buf;
if($attributes{virtual}) {
$buf = length get_url($attributes{virtual}, 1);
} elsif($attributes{file}) {
$buf = length read_file($attributes{file}, 1)
}
$state{output} .= $buf;
$state{cardsize} = length $buf;
};
}
}
}
#
# HTML::Parser default handler
#
sub default_handler {
my($text) = @_;
debug [2], "( default ) [$text]\n\n";
}
#
# split_card()
# ----------
# This function closes the current card and creates a new one.
#
sub split_card {
my @stack = @{$state{stack}};
shift @stack; ## shift the tag
shift @stack; ## shift the tag
my $id = $state{cardid}++;
$state{cardsize} = 0;
debug [2], "(splitcard) stack = (@{$state{stack}})\n\n";
for my $tag (reverse @stack) { $state{output} .= "$tag>" }
my $doc_uri = $state{doc_uri};
# strip the server part if the document and this CGI are on the same server
$doc_uri =~ s/^$state{self_srv}//o if $cgi;
my $link_to_next = $options{'split-deck'} ?
"$state{self_url}url=$doc_uri;id=$state{cardid}" : "#$state{cardid}";
$state{output} .= join '', qq|\n|,
qq| $options{'next-card-label'}
|,
qq|\n\n|;
if($options{'split-deck'}) {
post_conversion_cleanup();
$state{output} .= '';
$state{decks}{$id} = $state{output};
$state{output} = "$defaults{wmlvers}\n";
}
$state{output} .= qq|\n|;
for my $tag (@stack) { $state{output} .= "<$tag>" }
}
#
# xlate_url()
# ---------
# This function translates the given url so that the pointed document will
# pass through this CGI for conversion when in CGI mode, or construct a url
# that fits the needs of the webmaster using the given template, if present.
#
sub xlate_url {
my $url = shift; ## $url is the url from a href or a src attribute
my $type = shift; ## $type is 'src' or 'href'
## URL encode special characters
url_encode($url);
## we only treat http URLs
return $url if $url =~ /^(\w+):/ and index($1, 'http') != 0;
if($cgi) {
## CGI mode
# create the absolute URL relative to the document
my $link = URI::URL->new($url, $state{doc_uri})->abs;
# strip the server part if the URL and this CGI are on the same server
$link =~ s/^$state{self_srv}//o;
return "$state{self_url}url=$link"
} else {
## shell mode
## we don't touch URLs other than http(s):
return $url if $url =~ m|^(\w+):| and index($1, 'http') < 0;
## This is where the link reconstruction engine lives... (waah... :)
if($options{"${type}tmpl"} and $url !~ m|^https?://|) {
## we don't touch absolute urls
my $tmpl = $options{"${type}tmpl"};
my $uri = new URI $url, 'http';
if($uri->path) {
my($filename,$filepath,$filetype) = fileparse($uri->path, '((?:\.\w+)+)');
my $init_vars = qq|{
sub FILEPATH { q<$filepath> }
sub FILENAME { q<$filename> }
sub FILETYPE { q<$filetype> }
sub URL { q<$url> }
}|;
my $new_url = new Text::Template TYPE => 'STRING', SOURCE => $init_vars.$tmpl
or fatal("can't construct template: $Text::Template::ERROR\n");
return $new_url->fill_in(HASH => {
'FILEPATH' => $filepath,
'FILENAME' => $filename,
'FILETYPE' => $filetype,
'URL' => $url
}) or fatal("$Text::Template::ERROR\n")
} else {
return $url
}
} else {
return $url
}
}
}
#
# url_encode()
#
sub url_encode {
$_[0] =~ s'[$]'%24'go;
$_[0] =~ s'&'%26'go;
$_[0] =~ s';'%3b'go;
$_[0] =~ s'='%3d'go;
$_[0] =~ s'[?]'%3f'go;
}
#
# htmlize()
# -------
# This function translate the given text into HTML
#
sub htmlize {
my $str = shift;
## convert special chars to entities
$str =~ s/&/\&/go;
$str =~ s/\</go;
$str =~ s/>/\>/go;
## add a small syntax highlighting
$str =~ s{(\<[!?/]?)(\w+)(.*?)([!?/]?\>)}
{$1$2 $3 $4 }gs;
$str =~ s{\<!--(.*?)--\>}{\<!--$1 --\>}gs;
$str =~ s{href="([^\"]+)"}{href="$1 "}gs;
return "$str "
}
#
# hextype()
# -------
# This function generates a human readable representation of binary data
#
sub hextype {
my $data = shift; ## data to print
my $colwidth = shift || 16; ## width of ASCII column
my $half = $colwidth/2;
my $line = 1;
my $out = '';
while(length $data) {
my @hex = unpack 'C'x$colwidth, substr($data, 0, $colwidth);
substr($data, 0, $colwidth) = '';
$out .= sprintf '%3d: '. ((('%02x 'x$half).' ')x2) .' ', $line++, @hex;
$out .= sprintf ''.('%s'x$half)x2 . "\n", map { $_ > 32 ? chr : '.' } @hex;
}
return $out
}
#
# simple_wrap()
# -----------
# This function wraps the text given in parameter.
#
sub simple_wrap {
my $orig = ref $_[0] ? $_[0] : \$_[0];
my $text = '';
my $curlen = 0;
my $beg = ' 'x5;
my $cols = 75;
while($$orig =~ m/(\s*\S+\s+)/gm) {
if($curlen + length($1) > $cols) {
$text .= "\n$beg$1";
$curlen = 1 + length($beg) + length($1)
} else {
$text .= $1;
$curlen += length $1;
}
$curlen = 0 if index($1, "\n") >= 0;
}
return $text
}
#
# load_entities()
# -------------
#
sub load_entities {
%entities = (
## Special entities
quot => [ 34, '"'],## double quote
quote => [ 34, '"'],## double quote
amp => [ 38, '&'],## ampersand
apos => [ 39, '''],## single quote
lt => [ 60, '<'],## less than sign
gt => [ 62, '>'],## greater than sign
## Spacing characters
nbsp => [ 32, ' '], ## non-breaking space (real value #160)
ensp => [ 32, ' '], ## en space (real value: #8194, U+2002)
emsp => [ 32, ' '], ## em space (real value: #8195, U+2003)
thinsp => [ 32, ' '], ## thin space (real value: #8201, U+2009)
zwnj => [ 0, '' ], ## zero width non-joiner (real value: #8204, U+200C)
zwj => [ 0, '' ], ## zero width joiner (real value: #8205, U+200D)
## Latin Extended-A entities + Mathematical symbols
sbquo => [130, ','], ## single low-9 quotation mark
fnof => [131, 'f'], ## latin small f with hook = florin
bdquo => [132, ',,'], ## double low-9 quotation mark
hellip => [133, '...'], ## horizontal ellipsis
dagger => [134, ' '], ## dagger
Dagger => [135, ' '], ## double dagger
circ => [136, '^'], ## modifier letter circumflex accent
permil => [137, 'o/oo'], ## per mille sign
Scaron => [138, 'S'], ## latin capital letter S with caron
lsaquo => [139, '<'],## single left-pointing angle quotation mark
OElig => [140, 'OE'], ## latin capital ligature OE
lsquo => [145, "'"], ## left single quotation mark
rsquo => [146, "'"], ## right single quotation mark
ldquo => [147, '"'], ## left double quotation mark
rdquo => [148, '"'], ## right double quotation mark
bull => [149, 'o'], ## bullet
ndash => [150, '-'], ## en dash
mdash => [151, '--'], ## em dash
tilde => [152, '~'], ## small tilde
trade => [153, '(tm)'], ## trademark sign
scaron => [154, 's'], ## latin small letter s with caron
rsaquo => [155, '>'],## single right-pointing angle quotation mark
oelig => [156, 'oe'], ## latin small ligature oe
Yuml => [159, 'Y'], ## latin capital letter Y with diaeresis
## ISO-Latin-1 entities
iexcl => [161, '!'],
cent => [162, '-c-'],
pound => [163, '-L-'],
curren => [164, 'CUR'],
yen => [165, 'YEN'],
brvbar => [166, '|'],
sect => [167, 'S:'],
uml => [168, '"'],
copy => [169, '(c)'],
ordf => [170, '-a'],
laquo => [171, '<<'],
'not' => [172, 'NOT'],
shy => [173, '-'],
reg => [174, '(R)'],
macr => [175, '-'],
deg => [176, 'DEG'],
plusmn => [177, '+/-'],
sup2 => [178, '^2'],
sup3 => [179, '^3'],
acute => [180, "'"],
micro => [181, 'u'],
para => [182, 'P:'],
middot => [183, '.'],
cedil => [184, ','],
sup1 => [185, '^1'],
ordm => [186, '-o'],
raquo => [187, '>>'],
frac14 => [188, ' 1/4'],
frac12 => [189, ' 1/2'],
frac34 => [190, ' 3/4'],
iquest => [191, '?'],
Agrave => [192, 'A'],
Aacute => [193, 'A'],
Acirc => [194, 'A'],
Atilde => [195, 'A'],
Auml => [196, 'Ae'],
Aring => [197, 'A'],
AElig => [198, 'AE'],
Ccedil => [199, 'C'],
Egrave => [200, 'E'],
Eacute => [201, 'E'],
Ecirc => [202, 'E'],
Euml => [203, 'E'],
Igrave => [204, 'I'],
Iacute => [205, 'I'],
Icirc => [206, 'I'],
Iuml => [207, 'I'],
ETH => [208, 'DH'],
Ntilde => [209, 'N'],
Ograve => [210, 'O'],
Oacute => [211, 'O'],
Ocirc => [212, 'O'],
Otilde => [213, 'O'],
Ouml => [214, 'Oe'],
'times' => [215, '*'],
Oslash => [216, 'O'],
Ugrave => [217, 'U'],
Uacute => [218, 'U'],
Ucirc => [219, 'U'],
Uuml => [220, 'Ue'],
Yacute => [221, 'Y'],
THORN => [222, 'P'],
szlig => [223, 'ss'],
agrave => [224, 'a'],
aacute => [225, 'a'],
acirc => [226, 'a'],
atilde => [227, 'a'],
auml => [228, 'ae'],
aring => [229, 'a'],
aelig => [230, 'ae'],
ccedil => [231, 'c'],
egrave => [232, 'e'],
eacute => [233, 'e'],
ecirc => [234, 'e'],
euml => [235, 'e'],
igrave => [236, 'i'],
iacute => [237, 'i'],
icirc => [238, 'i'],
iuml => [239, 'i'],
eth => [240, 'e'],
ntilde => [241, 'n'],
ograve => [242, 'o'],
oacute => [243, 'o'],
ocirc => [244, 'o'],
otilde => [245, 'o'],
ouml => [246, 'o'],
divide => [247, '/'],
oslash => [248, 'o'],
ugrave => [249, 'u'],
uacute => [250, 'u'],
ucirc => [251, 'u'],
uuml => [252, 'u'],
yacute => [253, 'y'],
thorn => [254, 'p'],
yuml => [255, 'y'],
);
}
#
# warning()
# -------
sub warning {
print STDERR 'html2wml: warning: ', @_
}
#
# fatal()
# -----
sub fatal {
print STDERR 'html2wml: fatal: ', @_;
exit -1;
}
#
# debug()
# -----
sub debug {
if($options{'debug'}) {
my $level = ref $_[0] ? shift->[0] : 1;
print STDERR @_ if $level <= $options{'debug'}
}
}
#
# version()
# -------
sub version {
print "$program/$version\n"; exit
}
#
# usage()
# -----
sub usage {
print STDERR <<"USAGE"; exit
usage: $0 [options] file [-o output]
options:
-a, --ascii use 7 bits ASCII emulation to convert named entities
--nocollapse don't collapse spaces and empty paragraphs
--hreftmpl=template set the template for the links reconstruction engine
-i, --ignore-images completly ignore image links
--noimg-alt-text don't replace the images by their alternative text
--nolinearize don't linearize the tables
-n, --numeric-non-ascii convert non-ASCII characters to numeric entities
-p, --nopre don't use the tag
--split-card slice the document by cards (default)
--split-deck slice the document by decks
-s, --max-card-size=size set the card size upper limit
-t, --card-split-threshold=size set the card splitting threshold
--next-card-label=label set the label of the link to the next card
--prev-card-label=label set the label of the link to the previous card
-U, --http-user set the HTTP user
-P, --http-passwd set the HTTP password
-Y, --proxy use proxy settings provided by environnement
--noproxy don't use proxy
-k, --compile compile the result in binary form
-o, --output=outfile select the outpout (stdout if none specified)
-d, --debug=n activate the debug mode (always prints to stdout)
-c, --xmlcheck activate the XML well-formedness and validity check
-h, --help show this help screen and exit
-v, --version show the program name and version and exit
Read the documentation for more information.
USAGE
}
#
# cgi_error()
# ---------
sub cgi_error {
if($options{'debug'}) {
print <<"OUTPUT"; exit
Html2Wml - Error
Html2Wml - Error
This program was called with incorrect parameters or an error occured
when processing the request. Please check your request and try again.
@_
$program v$version
OUTPUT
} else {
print $cgi->header(-type => 'text/vnd.wap.wml'), <<"OUTPUT"; exit
$defaults{wmlvers}
This program was called with incorrect parameters or an error occured
when processing the request. Please check your request and try again.
@_
_____ $program v$version
OUTPUT
}
}
1;
html2wml-0.4.11/html2wml.cgi 0100755 0000765 0000024 00000165641 07746772372 0015523 0 ustar 00maddingue staff #!/usr/bin/perl
# ########################################################################### #
# Html2Wml #
# ======== #
# Author: Sebastien Aperghis-Tramoni #
# #
# This program converts HTML pages to WML pages. #
# See the documentation for more informations. #
# #
# This program is available under the GNU General Public License. #
# #
# You can find the original archive of this program on the author's web site #
# http://www.maddingue.org/softwares/ #
# #
# and on the web site of Html2Wml on SourceForge #
# http://www.html2wml.org/ #
# #
# Copyright (c)2000, 2001, 2002 Sebastien Aperghis-Tramoni #
# ########################################################################### #
use strict;
use CGI;
use File::Basename;
use Getopt::Long;
use HTML::Parser;
use LWP::UserAgent;
use POSIX qw(isatty);
use Text::Template;
use URI;
use URI::URL;
use vars qw($program $version);
$program = 'Html2Wml';
$version = '0.4.11';
# --------------------------------------------------------------------------- #
# Static configuration #
# #
# If you want to hard-code some parameters of Html2Wml, this is the #
# place to edit. Please check the documentation for more information. #
# #
my %defaults = (
## proxy settings
'proxy-server' => '', ## proxy server
## path to the WML compiler
'wmlc' => '/usr/local/bin/wmlc',
## WML version and identifier
'wmlvers' => q||,
## characters encoding
'encoding' => 'iso-8859-1',
## links reconstruction default options
'hreftmpl' => '{FILEPATH}{FILENAME}{$FILETYPE =~ s/s?html?/wml/o; $FILETYPE}',
'srctmpl' => '{FILEPATH}{FILENAME}{$FILETYPE =~ s/gif|png|jpe?g/wbmp/o; $FILETYPE}',
## links reconstruction in CGI mode
'relative-url' => 1, ## use relative self path ?
);
my %options = (
'help' => 0, ## show the usage and exit
'version' => 0, ## show the program name and version and exit
## conversion options
'ascii' => 0, ## convert named entities to US-ASCII
'collapse' => 1, ## collapse white space characters
'compile' => 0, ## compile WML to binary tokenized data
'ignore-images' => 0, ## completly ignore image links
'img-alt-text' => 1, ## replace IMG tags with their ALT attribute
'linearize' => 1, ## suppress the tables tags
'nopre' => 0, ## don't use PRE tag
'numeric-non-ascii' => 0, ## convert non-ASCII characters to numeric entities
## links reconstruction options
'hreftmpl' => $defaults{hreftmpl},
'srctmpl' => $defaults{srctmpl},
## card splitting options
'split-card' => 1, ## slice the document by cards
'split-deck' => 0, ## slice the document by decks
'max-card-size' => 1_400, ## maximum size of data per card
'card-split-threshold' => 50, ## card split threshold
'next-card-label' => '[>>]', ## label of the link to go to the next card
'prev-card-label' => '[<<]', ## label of the link to go to the previous card
## HTTP authentication
'http-user' => '', ## HTTP user
'http-passwd' => '', ## HTTP password
## proxy support
'proxy' => 1, ## turn proxy support on/off
## debugging options
'debug' => undef,## activate the debug mode
'xmlcheck' => 0, ## perform a well-formedness check (using XML::Parser)
);
# You should not edit below this line unless you know what you are doing. #
# --------------------------------------------------------------------------- #
#
# globals
#
sub debug;
sub warning;
sub fatal;
use vars qw($cgi);
$cgi = 0;
my $agent; ## LWP user agent
my $result; ## WML deck in text format
my $binary; ## WML deck in binary format
my $xmlckres = '';
my $complres = '';
my %optname = (
'a' => 'ascii',
'c' => 'xmlcheck',
'd' => 'debug',
'i' => 'ignore-images',
#'h' => 'help', ## shell only
'k' => 'compile',
'n' => 'numeric-non-ascii',
#'o' => 'output', ## shell only
'p' => 'nopre',
'P' => 'http-passwd',
's' => 'max-card-size',
't' => 'card-split-threshold',
'U' => 'http-user',
#'v' => 'version', ## shell only
);
my %optchar = ();
## used by the html parser
use vars qw(%state);
%state = (
doc_uri => '', ## document absolute URI
doc_name => '', ## document file name
self_url => '', ## the CGI's URL for self-referencing
self_srv => '', ## the CGI's server
output => '', ## buffer for storing output
type => 'wml', ## type of the output
decks => {}, ## hash that contains the decks, indexed by their id
skip => 0, ## skip switch (on/off)
stack => [], ## tag stack
cardsize => 0, ## size of the current card/deck
cardid => 'wdf000', ## ID of the current card/deck (stands for "WML Document - Fragment 000")
title => '', ## title of the WML deck
encoding => '', ## encoding of the document
form => { ## hash that contains the current form data
href => '', ## - URL
method => '', ## - method
fields => [], ## - fields list
},
);
my %entities = (); ## named entities conversion table
my %num2ascii = (); ## non-ASCII characters to ASCII equivalent conversion table
#
# The following two hashes are based on the WML DTD. They are the hardcoded
# conversion tables which describe the legal syntax of WML tags.
#
my %dtdent = ();
$dtdent{emph} = 'em,strong,b,i,u,big,small';
$dtdent{layout} = 'br';
$dtdent{text} = $dtdent{emph};
$dtdent{flow} = "$dtdent{text},$dtdent{layout},img,anchor,a,table";
$dtdent{task} = 'go,prev,refresh,noop';
$dtdent{fields} = "$dtdent{flow},input,select,fieldset";
my %with = (
html => { action => 'replace', new_value => 'wml' },
wml => { action => 'keep', nest => 'head,template,card', unique => 1 },
## header tags
head => { action => 'keep', nest => 'meta,access', unique => 1 },
# meta => { action => 'keep', nest => 'EMPTY', attributes => 'http-equiv,name,content,forua,scheme' },
template => { action => 'keep', nest => 'do,onevent', unique => 1 },
title => { action => 'skip' },
base => { action => 'replace', new_value => '' },
style => { action => 'skip' },
script => { action => 'skip' },
## structural tags
body => { action => 'replace', new_value => 'card' },
card => { action => 'keep', nest => 'onevent,timer,do,p,pre', unique => 1 },
h1 => { action => 'replace', new_value => 'p', render => 'big,strong', special => 'nowidow' },
h2 => { action => 'replace', new_value => 'p', render => 'big', special => 'nowidow'},
h3 => { action => 'replace', new_value => 'p', render => 'strong', special => 'nowidow' },
h4 => { action => 'replace', new_value => 'p', special => 'nowidow' },
h5 => { action => 'replace', new_value => 'p', special => 'nowidow' },
h6 => { action => 'replace', new_value => 'p', special => 'nowidow' },
li => { action => 'replace', new_value => 'p' },
dt => { action => 'replace', new_value => 'p' },
dd => { action => 'replace', new_value => 'p' },
div => { action => 'replace', new_value => 'p' },
p => { action => 'keep', nest => "$dtdent{fields},do", attributes => 'align' },
br => { action => 'keep', nest => 'EMPTY' },
pre => { action => 'keep', nest => 'a,br,i,b,em,strong,input,select' },
tt => { action => 'replace', new_value => 'pre' },
## tables tags
table => { action => 'keep', nest => 'tr', attributes => 'title,align' },
caption => { action => 'skip' },
'tr' => { action => 'keep', nest => 'td' },
th => { action => 'replace', new_value => 'td' },
td => { action => 'keep', nest => "$dtdent{emph},$dtdent{layout},img,a,anchor" },
## link tags
a => { action => 'keep', nest => 'br,img', attributes => 'id,name,href,title,accesskey',
attrconv => { name => 'id' } },
anchor => { action => 'keep', nest => 'br,go,img', attributes => 'id,title,accesskey' },
img => { action => 'keep', nest => 'EMPTY', attributes => 'id,src,alt,align' },
frame => { action => 'replace', new_value => 'p' },
area => { action => 'replace', new_value => 'p' },
## style tags
em => { action => 'keep', nest => $dtdent{flow} },
strong => { action => 'keep', nest => $dtdent{flow} },
b => { action => 'keep', nest => $dtdent{flow} },
i => { action => 'keep', nest => $dtdent{flow} },
u => { action => 'keep', nest => $dtdent{flow} },
big => { action => 'keep', nest => $dtdent{flow} },
small => { action => 'keep', nest => $dtdent{flow} },
## events
'do' => { action => 'keep', nest => $dtdent{task}, attributes => 'type,label,name,optional' },
onevent => { action => 'keep', nest => $dtdent{task}, attributes => 'type' },
## tasks
'go' => { action => 'keep', nest => 'postfield,setvar', attributes => 'href,method,enctype,sendreferer,cache-control,accept-charset' },
postfield=> { action => 'keep', nest => 'EMPTY', attributes => 'name,value' },
setvar => { action => 'keep', nest => 'EMPTY', attributes => 'name,value' },
prev => { action => 'keep', nest => 'setvar' },
refresh => { action => 'keep', nest => 'setvar' },
noop => { action => 'keep', nest => 'EMPTY' },
## form tags
form => { action => 'replace', new_value => '' },
'select' => { action => 'keep', nest => 'optgroup,option', attributes => 'title,name,value,multiple' },
optgroup => { action => 'keep', nest => 'optgroup,option', attributes => 'title' },
option => { action => 'keep', nest => 'onevent', attributes => 'title,value' },
input => { action => 'keep', nest => 'EMPTY', attributes => 'name,type,value,title,size,maxlength'},
fieldset => { action => 'keep', nest => "$dtdent{fields},do", attributes => 'title' },
timer => { action => 'keep', nest => 'EMPTY', attributes => 'name,value' },
);
#
# The following hash hardcodes the parent-lookup for each element
# of the WML syntax, i.e. for each element, it gives the prefered
# parent element.
#
my %reverse = (
## head tags
wml => '', head => 'wml', meta => 'head',
access => 'head', template => 'wml', onevent => 'template',
## structural tags
card => 'wml', p => 'card', pre => 'card', br => 'p',
## tables tags
table => 'p', 'tr' => 'table', td => 'tr',
## link tags
a => 'p', anchor => 'p', img => 'p',
## style tags
b => 'p', i => 'p', u => 'p',
strong => 'p', em => 'p',
big => 'p', small => 'p',
## form tags
'select' => 'p', option => 'select', optgroup => 'select',
'do' => 'p', input => 'p', fieldset => 'p',
);
#
# main
#
$| = 1;
my $time = time;
Getopt::Long::Configure(qw(no_auto_abbrev));
fileparse_set_fstype('Unix'); ## this is because I use fileparse() to
## split the URL fragments
## CGI security options
$CGI::POST_MAX = 1024 * 1; # max 1K posts
$CGI::DISABLE_UPLOADS = 1; # no uploads
load_entities();
## create the user agent
$agent = new LWP::UserAgent protocols_forbidden => ['file'];
$agent->agent("[$program/$version ".$agent->agent.']');
for my $opt (keys %optname) {
$optchar{$optname{$opt}} = $opt
}
## constructing %num2ascii using data from %entities
for my $ent (keys %entities) {
$num2ascii{$entities{$ent}[0]} = $entities{$ent}[1]
}
if(@ARGV or isatty(\*STDOUT)) {
## launched from shell
my @opts = (
## usage options
qw(help|h|H version|v|V),
## conversion options
qw(ascii|a! collapse! ignore-images|i img-alt-text!
linearize! nopre|p numeric-non-ascii|n),
## links reconstructions options
qw(hreftmpl=s srctmpl=s),
## card splitting options
qw(split-card split-deck
max-card-size|s=i card-split-threshold|t=i
next-card-label=s prev-card-label=s),
## HTTP authentication
qw(http-user|U=s http-passwd|P=s),
## proxy support
qw(proxy|Y!),
## output options
qw(compile|k output|o=s),
## debugging options
qw(debug|d:i xmlcheck|c!)
);
## getting options
GetOptions(\%options, @opts);
version() if $options{version};
usage() if $options{help};
usage() unless @ARGV;
apply_options();
## converting the file
$result = html2wml(shift);
} else {
## launched from web
$cgi = new CGI;
$agent->agent($cgi->user_agent . ' ' . $agent->agent);
$cgi->compile(qw(param url header));
## get the options
for my $param ($cgi->param) {
my $option = length($param) == 1 ? $optname{$param} : $param;
next unless exists $options{$option};
$options{$option} = $cgi->param($param)
}
apply_options();
$state{doc_name} = 'output';
## creating static part of the self url
my $cgi_options = '';
for my $param ($cgi->param) {
next if 'url,id' =~ /\b$param\b/;
my $value = $cgi->param($param);
next unless $value;
next unless exists $options{$param} or exists $options{$optname{$param}};
my $opt = exists $optchar{$param} ? $optchar{$param} : $param;
$cgi_options .= "$param=$value;"
}
$state{self_url} = $cgi->url(-relative => $defaults{'relative-url'}) . "?$cgi_options";
## send debug header if needed
print $cgi->header if $options{'debug'};
## execute main part
$result = html2wml($cgi->param('url') || '/');
}
## special case: splitting by decks
if($cgi and $options{'split-deck'}) {
## return the desired deck (as specified by the id parameter)
## or the first deck if none has been specified
$result = $state{decks}{ $cgi->param('id') || (sort keys %{$state{decks}})[0] }
}
## XML check
if($options{xmlcheck}) {
## for development only :-)
## $result =~ s{"http://www.wapforum.org/DTD/wml12.dtd"} #XXX#
## {"/Users/madingue/Documents/Softwares/Html2Wml/devel/t/wml12.dtd"}; #XXX#
## ---
eval {
require XML::Parser;
my $parser = new XML::Parser Style => 'Tree', ErrorContext => 2;
$parser->parse($result);
};
$@ =~ /Can't locate/ and $@ = '(XML::Parser not available)';
$xmlckres = $@ ? "\nExpat errors\n$@" : "Expat: well-formed";
eval {
require XML::LibXML;
my $parser = new XML::LibXML;
$parser->validation(1);
$parser->expand_entities(1);
$parser->load_ext_dtd(1);
my $dom = $parser->parse_string($result);
die "document isn't valid" unless $dom->is_valid;
};
$@ =~ /Can't locate/ and $@ = ', (XML::LibXML not available)';
$xmlckres .= $@ ? "\nGnome-XML errors\n$@" : ", Gnome-XML: valid";
eval {
require XML::Checker::Parser;
XML::Checker::Parser::map_uri(
"-//WAPFORUM//DTD WML 1.2//EN" => "file:///Users/maddingue/Documents/Softwares/Html2Wml/devel/t/wml13.dtd"
);
my $parser = new XML::Checker::Parser Style => 'Tree',
ErrorContext => 2, ParseParamEnt => 1, #NoLWP => 1,
SkipExternalDTD => 1, KeepCDATA => 1;
$parser->parse($result);
};
$@ =~ /Can't locate/ and $@ = ', (XML::Checker not available)';
$xmlckres .= $@ ? "\nXML-Checker errors\n$@" : ", XML-Checker: valid";
}
## WML tokenization
if($options{compile}) {
$binary = '';
my $buf;
eval {
require IPC::Open2;
require IO::Handle;
my $in = new IO::Handle;
my $out = new IO::Handle;
my $pid = IPC::Open2::open2($out, $in, $defaults{wmlc}, '-', '-');
print $in $result;
close($in);
$binary = join '', <$out>;
close($out);
waitpid($pid, 0);
};
$complres = $@
}
if($options{'debug'}) { ## debug output
$time = time - $time;
my @times = times;
$times[0] += $times[2]; ## total user time
$times[1] += $times[3]; ## total system time
$times[2] = $times[0] + $times[1]; ## total time
my $i = 1;
$result .= "\n";
$result =~ s/^/@{[sprintf '%3d', $i++]}: /gm; ## add lines number
$result = simple_wrap($result);
if($cgi) {
print qq|\n\n$program -- Debug Mode \n|,
qq|\n|,
qq|\n\n$program -- Debug Mode \n|,
qq|This is the result of the conversion of the document |,
qq|$state{doc_uri} by $program v$version.
\n|,
qq| \n|,
htmlize($result),
qq| \nResult of XML check:
\n|,
htmlize($xmlckres);
print qq| \nResult of WML compilation:
\n|,
($complres ? "$complres\n" : hextype($binary)), " \n"
if $options{compile};
printf " \nTime: $time wallclock secs (%.2f usr + %.2f sys = %.2f cpu)
\n",
@times[0..2];
print qq|\n\n|
} else {
my $s = "$program -- Debug Mode\n";
print $s, '-'x length($s), "\n",
$result, "\n", ' -'x5, "\n",
$xmlckres, "\n";
print ' -'x5, "\nCompiled WML\n", ' -'x5, "\n",
($complres ? "$complres\n" : hextype($binary))
if $options{compile};
print ' -'x5, "\n";
printf "Time: $time wallclock secs (%.2f usr + %.2f sys = %.2f cpu)\n", @times[0..2];
}
} else { ## normal output
my $out = \*STDOUT;
if($options{'output'}) {
open(OUT, ">$options{output}") or fatal "cannot write to '$options{output}': $!\n";
$out = \*OUT;
}
if($options{'compile'}) {
print $out $cgi->header(
-type => 'application/vnd.wap.wmlc',
-content_length => length $result
) if $cgi;
print $out $binary;
} else {
print $out $cgi->header(
-type => "text/vnd.wap.wml; charset=$state{encoding}",
-content_length => length $result
) if $cgi;
print $out $result;
}
}
#
# apply_options()
# -------------
sub apply_options {
if($options{'linearize'}) {
delete @with{qw(table tr td th)};
$with{'caption'} = { action => 'replace', new_value => 'p', render => 'b' };
$with{'tr'} = { action => 'replace', new_value => 'p' };
delete @reverse{qw(table tr td)};
}
if($options{'ignore-images'}) {
delete $with{'img'};
}
if(not defined $options{'debug'}) {
$options{'debug'} = 0;
} elsif($options{'debug'} == 0) {
$options{'debug'} = 1;
}
if($options{'debug'}) {
$options{'xmlcheck'} = 1;
}
if($options{'nopre'}) {
delete $with{'pre'};
$with{'pre'} = { action => 'replace', new_value => 'p' };
}
if($cgi) {
$options{'split-card'} = 0;
$options{'split-deck'} = 1;
## security: don't allow to modify the templates
## when called as a CGI
$options{'hreftmpl'} = $defaults{'hreftmpl'};
$options{'srctmpl'} = $defaults{'srctmpl'};
}
## security: check if the templates contains suspicious code
## if the templates have changed
if($options{hreftmpl} ne $defaults{hreftmpl} or $options{srctmpl} ne $defaults{srctmpl}) {
my $forbidden = join '|', '[``]', map {"\\b$_\\b"}
qw(eval exec system unlink kill fork open sysopen pipe socket);
$options{hreftmpl} = $defaults{hreftmpl} if $options{hreftmpl} =~ /$forbidden/;
$options{srctmpl} = $defaults{hreftmpl} if $options{srctmpl} =~ /$forbidden/;
}
$options{'cardsize-limit'} = $options{'max-card-size'} - $options{'card-split-threshold'};
if($^O eq 'MacOS') {
$options{'compile'} = 0;
}
if($options{'compile'}) {
$options{'prev-card-label'} = '[<<]';
}
## proxy support
if($options{'proxy'}) {
if($defaults{'proxy-server'}) {
## use hardcoded settings
$agent->proxy([qw(http ftp gopher)] => $defaults{'proxy-server'});
} else {
## load from environment
$agent->env_proxy();
}
}
}
#
# html2wml()
# --------
sub html2wml {
my $url = shift;
my $file = '';
my $type = '';
my $enc = '';
my $converter = new HTML::Parser api_version => 3;
return unless $url;
## read the file
if($url =~ m{https?://}) { ## absolute uri
($file,$type,$enc) = get_url($url)
} elsif(not $cgi) { ## local file
$file = read_file($url)
} else { ## absolute url relative to the server
($file,$type,$enc) = get_url( $url = URI::URL->new($url, $cgi->url)->abs )
}
$enc ||= '';
$enc =~ s/charset=//i;
url_encode($url);
$state{doc_uri} = $url;
($state{self_srv}) = ($state{self_url} =~ m|^(https?://[\w.-]+(?::\d+)?)/|);
## strip the DOCTYPE
$file =~ s/]+>//go;
## try to get the document charset encoding
if(not $enc and $file =~ m|meta +http-equiv.+charset=["']?([a-zA-Z0-9_-]+)['"]?|i) {
$enc = lc $1
}
$state{encoding} = $enc || $defaults{'encoding'};
$type ||= '';
## if it's an image, call send_image()
if(index($type, 'image') >= 0 or $url =~ /\.(?:gif|jpg|png)$/i) {
@_ = ($file, $url);
goto &send_image
}
## get the document title
if($file =~ m|([^<]+) |i) {
$state{title} = $1;
convert_entities($state{title});
clean_spaces($state{title});
}
## WML header
$state{skip} = 0;
$state{output} = join '', q|\n$defaults{wmlvers}\n|;
## affectation of the HTML::Parser handlers
$converter->unbroken_text(1);
$converter->handler(start => \&start_tag, 'tagname, attr');
$converter->handler(end => \&end_tag, 'tagname');
$converter->handler(text => \&text_tag, 'text, is_cdata');
$converter->handler(comment => \&comment_tag, 'tokens');
#$converter->handler(declaration => \&default_handler, 'text');
#$converter->handler(process => \&default_handler, 'text');
#$converter->handler(default => \&default_handler, 'text');
## begin the conversion
$converter->parse($file);
$converter->eof;
## flush the stack
while(my $tag = pop @{$state{stack}}) {
$state{output} .= "$tag>"
}
post_conversion_cleanup();
$state{decks}{$state{cardid}} = $state{output};
return $state{output}
}
#
# post_conversion_cleanup()
# -----------------------
#
sub post_conversion_cleanup {
## convert alone ampersand characters to entities
$state{output} =~ s/\&\s/\& /go;
## correct unclosed numeric entities
$state{output} =~ s/(\\d+)([^\d;])/$1;$2/go;
## convert the named HTML entities to numeric entities
convert_entities($state{output});
## convert non-ASCII characters to numeric entities
if($options{'numeric-non-ascii'}) {
$state{output} =~ s/([\x80-\xFF])/''.ord($1).';'/eg;
}
## escape $ chars
$state{output} =~ s/\$([^(])/\$\$$1/g;
collapse($state{output}) if $options{'collapse'};
## set the title of the card
if(length $state{title}) {
my $title = $state{title};
$title =~ s/"/\"/go;
$title =~ s/\$/\$\$/go;
$title =~ s/(\\d+)([^\d;])/$1;$2/go;
$state{output} =~ s/|>|go; ## collapse spaces inside tags
$_[0] =~ s|\s+/>|/>|go; ## collapse spaces inside empty tags
$_[0] =~ s|<(\w+) +|<$1 |g; ## collapse spaces between tag and attributes
$_[0] =~ s|\s+|
|go; ## collapse spaces at the begining of a paragraph
$_[0] =~ s|\s+
||go; ## collapse spaces at the end of a paragraph
## collapse empty paragraphs
$_[0] =~ s|]*>\s*
||go;
$_[0] =~ s|]*>\s*(?: )+\s*
||go;
$_[0] =~ s|]*>\s*(?:\ \s*)+
||go;
$_[0] =~ s|]*>\s*(?:\ \s*)+
||go;
$_[0] =~ s|]*>\s*(?:\[IMG\]\s*)+
||go;
$_[0] =~ s|<(\w+)>\s*\1>||go;
## collapse multiple lines
$_[0] =~ s/\n+/\n/go;
$_[0] =~ s/(?: +\n)+/\n/go;
}
#
# get_url()
# -------
# This function gets and returns the file from the given URI.
# If called in a array context, returns the file content and the associated
# MIME type (as given by the server).
#
sub get_url {
my $uri = shift;
my $quiet = shift || 0;
if($cgi and index($uri, 'file:') == 0) {
cgi_error(q|For security reasons, the file: scheme is not allowed.|)
}
my $request = new HTTP::Request GET => $uri;
my $response = $agent->request($request);
if($response->is_error) {
if($response->status_line == 401) {
## Authorization required
my($realm) = ($response->header('WWW-Authenticate') =~ /realm=(.+)/);
my $self = "$state{self_url}url=$state{doc_uri}";
if($options{'http-user'} and $options{'http-passwd'}) {
$request->www_authenticate($response->header('WWW-Authenticate'));
$request->authorization_basic($options{'http-user'}, $options{'http-passwd'});
$response = $agent->request($request);
} else {
if($cgi) {
print $cgi->header(-type => 'text/vnd.wap.wml'), <<"PASSFORM"; exit
$defaults{wmlvers}
Please enter your user name and password for $realm.
User:
Password:
PASSFORM
} else {
fatal <<"PASSASK"
website requires authentication
The web site requires you to authenticate in order to process your request.
Please enter your user name and password for $realm.
Use the --http-user and --http-passwd options (or their short counterparts
-U and -P). Check the documentation for more information.
PASSASK
}
}
} else {
my $err = <<"ERR";
The following error occured while trying to access the following URL
-- $uri --
Error @{[ $response->status_line ]}
ERR
if($cgi) {
if($quiet) {
warning "can't fetch file:\n", $err;
return '';
} else {
cgi_error($err)
}
} else {
fatal "fetch error\n\n", $err
}
}
}
return wantarray ? ($response->content, $response->content_type,
$response->content_encoding) : $response->content
}
#
# read_file()
# ---------
# This function reads and returns the file from the local disk.
#
sub read_file {
my $filepath = shift;
my $quiet = shift || 0;
my $dir = dirname($filepath);
my $file = basename($filepath);
chdir($dir) if $dir;
open(FILE, $file) or my $failed = 1;
if($failed) {
if($quiet) {
warning("can't read file '$file': $!\n") and return ''
} else {
fatal("can't read file '$file': $!\n")
}
}
local $/ = undef;
$file = ;
close(FILE);
return $file
}
#
# send_image()
# ----------
# This function allow Html2Wml to send WBMP images to the client.
# Currently, it send an empty hardcoded image, but support for
# conversion from common formats (GIF, JPEG, PNG) will be added soon.
#
sub send_image {
my $data = shift;
my $path = shift;
my $pixel = pack 'C*', 0, 0, 1, 1, 0xFF; ## this is one white pixel
## TODO: add the code to allow conversion using an external program
print $cgi->header(-type => 'image/wbmp', -content_length => length $pixel), $pixel;
exit
}
#
# convert_entities()
# ----------------
# This function converts the named HTML entities into numeric entities.
#
sub convert_entities {
my $ascii = $options{ascii};
## try to correct unclosed named entities
$_[0] =~ s/(&\w{2,6})\b([^;])/$1;$2/go;
## convert numeric entities and non-ASCII characters
## to ASCII equivalent if requested
if($ascii) {
$_[0] =~ s/(\d+);/$num2ascii{$1}/g;
$_[0] =~ s/([\x80-\xFF])/$num2ascii{ord($1)}/g;
}
my $code = q| while($_[0] =~ /&(\w+);/g) { |
. q| my $ent = $1; |
. q| if(exists $entities{$ent}) { |
.($ascii ? q| my $chr = $entities{$ent}[1]; |
: q| my $chr = ''.$entities{$ent}[0].';'; | )
. q| $_[0] =~ s/&$ent;/$chr/g |
. q| } |
. q| } |;
eval $code;
if($_[0] =~ /&(\w{2,6});?/) {
## there are some residual unknown or incorrect named entities
while($_[0] =~ /&(\w{2,6});?/g) {
my $ent = $1;
## check if $ent is a known entity
if(exists $entities{$ent}) {
warning "unclosed entity: $ent, corrected\n";
my $chr = $ascii ? $entities{$ent}[1] : ''.$entities{$ent}[0].';';
$_[0] =~ s//$chr/;
next
}
my($e1,$e2) = ('','');
## split the entity in two parts and check if the first part
## is a valid entity name
## entities names are between 2 and 6 characters long, so this
## loop won't be executed more than 4 times
for my $i (2..length($ent)) {
$e1 = substr($ent, 0, $i);
$e2 = substr($ent, $i);
last if exists $entities{$e1}
}
if(exists $entities{$e1}) {
warning "unknown entity: $ent, replaced with $e1\n";
my $chr = $ascii ? $entities{$e1}[1] : ''.$entities{$e1}[0].';';
$_[0] =~ s//$chr$e2/;
} else {
warning "unknown entity: $ent\n";
$_[0] =~ s//&$ent/;
}
}
}
## escape the remaining ampersands
$_[0] =~ s/&(\w+[^;])/&$1/g;
}
#
# clean_spaces()
# ------------
sub clean_spaces {
$_[0] =~ s/\t+/ /go;
$_[0] =~ s/^\s+/ /go;
$_[0] =~ s/ +/ /go;
}
#
# HTML::Parser start tag handler
#
sub start_tag {
my($tag, $attr) = @_;
local $_;
return unless exists $with{$tag};
return if $state{skip};
my $curr_tag = ($with{$tag}{action} eq 'replace' ? $with{$tag}{new_value} : $tag);
my $prev_tag = scalar @{$state{stack}} ? ${$state{stack}}[-1] : 0;
## prevent incorrect auto-nesting
return if $curr_tag eq $prev_tag and $with{$curr_tag}{unique} and $with{$curr_tag}{nest} !~ /\b$curr_tag\b/;
## special case: replacing image with its alternative text when necessary
if($curr_tag eq 'img' and $options{'img-alt-text'}) {
my $alt = $attr->{alt} || $attr->{title} || $attr->{id} || $attr->{name} || '[IMG]';
text_tag($alt) and return
}
## special case: tag
if($tag eq 'frame') {
if($prev_tag eq 'p') { pop @{$state{stack}}; $state{output} .= '' }
if($prev_tag eq 'wml') { push @{$state{stack}}, 'card'; $state{output} .= '' }
my $link = xlate_url($$attr{src}, 'href');
$state{output} .= qq|Frame: $$attr{name}
|;
}
## special case: image map tag
if($tag eq 'area') {
if($prev_tag eq 'p') { pop @{$state{stack}}; $state{output} .= '' }
if($prev_tag eq 'wml') { push @{$state{stack}}, 'card'; $state{output} .= '' }
my $link = xlate_url($$attr{href}, 'href');
$state{output} .= qq|Image map: $$attr{href}
|;
}
## special case: when inside a don't allow opening tags
if($prev_tag eq 'a' and $with{a}{nest} !~ /\b$curr_tag\b/) {
return
}
## special case: is replaced by
if($curr_tag eq 'a' and not exists $attr->{href}) {
$curr_tag = $tag = 'anchor';
$attr->{id} = exists $attr->{id} ? $attr->{id} : $attr->{name};
delete $attr->{name};
}
## special case: element that defines a base URL
if($tag eq 'base') {
$state{doc_uri} = URI::URL->new($attr->{href}, $state{doc_uri})->abs;
}
## special case: form declaration
if($tag eq 'form') {
$state{form}{href} = xlate_url($attr->{action});
$state{form}{method} = lc($attr->{method}) || 'get';
$state{form}{enctype} = $attr->{enctype} || '';
return;
}
## special case: form input
if($curr_tag eq 'input' and $attr->{type} ne 'submit') {
push @{$state{form}{fields}}, $attr->{name};
## special case: hidden form input
if($attr->{type} eq 'hidden') {
$state{output} .= q| |;
return;
}
}
## special case: form submission
if($curr_tag eq 'input' and $attr->{type} eq 'submit') {
my $method = $state{form}{method};
$state{output} .= '';
if($method eq 'post') {
$state{output} .= qq||;
for my $field (@{$state{form}{fields}}) {
$state{output} .= qq| |;
}
$state{output} .= ' ';
} else {
warning("unknown method '$method'; defaulting to 'get'") if $method ne 'get';
my @query = ();
for my $field (@{$state{form}{fields}}) {
push @query, qq|$field=\$($field)|;
}
$state{output} .= join '', qq| ';
}
$state{output} .= ' ';
return;
}
## reconstruct well-formed attributes list with only the allowed ones
if(exists $with{$curr_tag}{attributes} and scalar keys %$attr) {
my @attrs = ();
for my $param (keys %$attr) {
if(index($with{$curr_tag}{attributes}, $param) >= 0) {
$param = $with{$tag}{attrconv}{$param} if exists $with{$tag}{attrconv}{$param};
my $value = $attr->{$param};
if($param eq 'align' or $param eq 'type') {
$value = lc $value
} elsif($param eq 'href' or $param eq 'src') {
$value = xlate_url($value, $param);
}
convert_entities($value);
push @attrs, qq|$param="$value"|;
}
}
$attr = join ' ', '', @attrs;
} else {
$attr = ''
}
## set the skip mode state
$state{skip} = 1 if $with{$curr_tag}{action} eq 'skip';
#debug [2], "\n(start tag) <$tag> => action: ", ($with{$tag}{action} ? $with{$tag}{action} : 'clear'), ($curr_tag ne $tag ? " with $curr_tag " : ''), ($attr? ", attributes:$attr" : ''), "\n";
if($with{$curr_tag}{action} eq 'keep') {
# TODO: this part of the syntax repairing engine will have to be
# re-written. Maybe a loop on the stack to check whether the
# tree is correct, and in case not, insert the missing ones
if(scalar @{$state{stack}}) {
#debug [2], " -> syntax repair: closing tags ";
## syntax repair: close the tags that were left opened
while($prev_tag = pop @{$state{stack}}) {
if($with{$prev_tag}{nest} =~ /\b$curr_tag\b/
or $with{$prev_tag}{nest} =~ /\b$reverse{$curr_tag}\b/) {
push @{$state{stack}}, $prev_tag;
last
}
#debug [2], "$prev_tag> ";
$state{output} .= "$prev_tag>";
}
#debug [2], "\n";
}
## syntax repair: open the tags that should have been opened
if($with{$prev_tag}{nest} !~ /\b$curr_tag\b/) {
#debug [2], " -> syntax repair: opening tags ";
my($inner_tag,$outter_tag) = ($curr_tag,$prev_tag);
my @nesting_tags = ();
while($reverse{$inner_tag} and $reverse{$inner_tag} ne $outter_tag) {
$inner_tag = $reverse{$inner_tag};
#debug [2], "<$inner_tag> ";
unshift @nesting_tags, $inner_tag;
}
push @{$state{stack}}, @nesting_tags;
for my $t (@nesting_tags) { $state{output} .= "<$t>" }
#debug [2], "\n";
#debug [2], " new stack: (@{$state{stack}})\n";
}
}
## clean up a little
collapse($state{output}) if $options{'collapse'};
## split the card if needed
if($state{cardsize} > $options{'cardsize-limit'}
and exists $with{$tag}{special} and $with{$tag}{special} =~ /nowidow/) {
split_card()
}
## simple tag translation
if($with{$curr_tag}{action} eq 'keep') {
if($with{$curr_tag}{nest} eq 'EMPTY') {
$state{cardsize} += length($curr_tag) + length($attr);
$state{output} .= "<$curr_tag$attr/>"
} else {
$state{cardsize} += length($curr_tag) + length($attr);
$state{output} .= "<$curr_tag$attr>";
push @{$state{stack}}, $curr_tag;
}
} else {
## do nothing
}
## additional rendering effects
if(defined $with{$tag}{render}) { ## note that it's $tag, not $curr_tag
for my $t (split ',', $with{$tag}{render}) {
$state{cardsize} += length $t;
$state{output} .= "<$t>"
}
}
}
#
# HTML::Parser end tag handler
#
sub end_tag {
my($tag) = @_;
return unless exists $with{$tag};
my $curr_tag = ($with{$tag}{action} eq 'replace' ? $with{$tag}{new_value} : $tag);
## special case: anchors
if($tag eq 'a' and ${$state{stack}}[-1] eq 'anchor') { $curr_tag = $tag = 'anchor'}
## special case: form
if($tag eq 'form') { return }
#debug [2], "( end tag ) $curr_tag> stack = (@{$state{stack}})\n\n";
$state{skip} = 0 if $with{$tag}{action} eq 'skip';
return if $state{skip};
return if exists $with{$tag}{nest} and $with{$tag}{nest} eq 'EMPTY';
## additional rendering effects
if(defined $with{$tag}{render}) { ## note that it's $tag, not $curr_tag
for my $t (reverse split ',', $with{$tag}{render}) {
$state{cardsize} += length $t;
$state{output} .= "$t>"
}
}
## special case: /card cleans up the stack
if($curr_tag eq 'card') {
while(${$state{stack}}[-1] ne $curr_tag) {
my $t = pop @{$state{stack}};
$state{cardsize} += length $t;
$state{output} .= "$t>";
}
my $s = qq|\n
|;
$state{cardsize} += length $s ;#- 25;
$state{output} .= $s;
}
## closing element
if(${$state{stack}}[-1] eq $curr_tag and $with{$curr_tag}{action} eq 'keep') {
$state{cardsize} += length $curr_tag;
$state{output} .= "$curr_tag> ";
pop @{$state{stack}};
} else {
## do nothing
}
## clean up a little
collapse($state{output}) if $options{'collapse'};
## check current card size
if($curr_tag ne 'card' and $curr_tag ne 'wml' and $state{cardsize} > $options{'cardsize-limit'}) {
split_card()
}
}
#
# HTML::Parser text handler
#
sub text_tag {
my($text) = @_;
my $curr_tag = ${$state{stack}}[-1] || '';
#debug [3], "(text node) stack = (@{$state{stack}})\n- - - - -\n$text\n- - - - -\n";
return if $state{skip};
return if $text =~ /^\s*$/s; ## skip empty lines
## add a para tag if we're on the card node
if($curr_tag eq 'card') {
$state{cardsize} += 4;
$state{output} .= "\n";
push @{$state{stack}}, 'p';
}
clean_spaces($text) if $options{'collapse'} and $curr_tag ne 'pre';
#
# TODO: add the code that split too long chunks of text
#
$state{output} .= $text;
$state{cardsize} += length $text;
}
#
# HTML::Parser comment tag handler
#
sub comment_tag {
my($comment) = @_;
local $_;
$comment = join '', @$comment;
#debug [3], "( comment ) stack = (@{$state{stack}})\n $comment\n";
## Actions engine
if($comment =~ /^\s*\[(\w+)\s*(.*)\]\s*$/) {
my $action = $1;
my %attributes = map { /\G(\w+)=["']([^"']+)["']/g } split /\s+/, $2;
for my $attr (keys %attributes) {
if($attr eq 'for') {
return if $attributes{$attr} ne $state{type};
}
if($attr eq 'virtual' and $attributes{virtual} !~ /^http:/) {
$attributes{virtual} = URI::URL->new( $attributes{virtual}, $state{doc_uri} )->abs
}
}
for($action) {
/include/ and do {
my $buf;
if($attributes{virtual}) {
$buf = get_url($attributes{virtual}, 1);
} elsif($attributes{file}) {
$buf = read_file($attributes{file}, 1)
}
$state{output} .= $buf;
$state{cardsize} = length $buf;
};
/skip/ and do {
$state{skip} = 1;
};
/end_skip/ and do {
$state{skip} = 0;
};
/fsize/ and do {
my $buf;
if($attributes{virtual}) {
$buf = length get_url($attributes{virtual}, 1);
} elsif($attributes{file}) {
$buf = length read_file($attributes{file}, 1)
}
$state{output} .= $buf;
$state{cardsize} = length $buf;
};
}
}
}
#
# HTML::Parser default handler
#
sub default_handler {
my($text) = @_;
#debug [2], "( default ) [$text]\n\n";
}
#
# split_card()
# ----------
# This function closes the current card and creates a new one.
#
sub split_card {
my @stack = @{$state{stack}};
shift @stack; ## shift the tag
shift @stack; ## shift the tag
my $id = $state{cardid}++;
$state{cardsize} = 0;
#debug [2], "(splitcard) stack = (@{$state{stack}})\n\n";
for my $tag (reverse @stack) { $state{output} .= "$tag>" }
my $doc_uri = $state{doc_uri};
# strip the server part if the document and this CGI are on the same server
$doc_uri =~ s/^$state{self_srv}//o if $cgi;
my $link_to_next = $options{'split-deck'} ?
"$state{self_url}url=$doc_uri;id=$state{cardid}" : "#$state{cardid}";
$state{output} .= join '', qq|\n|,
qq| |,
qq| |,
qq|
\n \n|;
if($options{'split-deck'}) {
post_conversion_cleanup();
$state{output} .= ' ';
$state{decks}{$id} = $state{output};
$state{output} = join '', q|\n$defaults{wmlvers}\n|;
}
$state{output} .= qq|\n|;
for my $tag (@stack) { $state{output} .= "<$tag>" }
}
#
# xlate_url()
# ---------
# This function translates the given url so that the pointed document will
# pass through this CGI for conversion when in CGI mode, or construct a url
# that fits the needs of the webmaster using the given template, if present.
#
sub xlate_url {
my $url = shift; ## $url is the url from a href or a src attribute
my $type = shift; ## $type is 'src' or 'href'
## URL encode special characters
url_encode($url);
## we only treat http URLs
return $url if $url =~ /^(\w+):/ and index($1, 'http') != 0;
if($cgi) {
## CGI mode
# create the absolute URL relative to the document
my $link = URI::URL->new($url, $state{doc_uri})->abs;
# strip the server part if the URL and this CGI are on the same server
$link =~ s/^$state{self_srv}//o;
return "$state{self_url}url=$link"
} else {
## shell mode
## we don't touch URLs other than http(s):
return $url if $url =~ m|^(\w+):| and index($1, 'http') < 0;
## This is where the link reconstruction engine lives... (waah... :)
if($options{"${type}tmpl"} and $url !~ m|^https?://|) {
## we don't touch absolute urls
my $tmpl = $options{"${type}tmpl"};
my $uri = new URI $url, 'http';
if($uri->path) {
my($filename,$filepath,$filetype) = fileparse($uri->path, '((?:\.\w+)+)');
my $init_vars = qq|{
sub FILEPATH { q<$filepath> }
sub FILENAME { q<$filename> }
sub FILETYPE { q<$filetype> }
sub URL { q<$url> }
}|;
my $new_url = new Text::Template TYPE => 'STRING', SOURCE => $init_vars.$tmpl
or fatal("can't construct template: $Text::Template::ERROR\n");
return $new_url->fill_in(HASH => {
'FILEPATH' => $filepath,
'FILENAME' => $filename,
'FILETYPE' => $filetype,
'URL' => $url
}) or fatal("$Text::Template::ERROR\n")
} else {
return $url
}
} else {
return $url
}
}
}
#
# url_encode()
#
sub url_encode {
$_[0] =~ s'[$]'%24'go;
$_[0] =~ s'&'%26'go;
$_[0] =~ s';'%3b'go;
$_[0] =~ s'='%3d'go;
$_[0] =~ s'[?]'%3f'go;
}
#
# htmlize()
# -------
# This function translate the given text into HTML
#
sub htmlize {
my $str = shift;
## convert special chars to entities
$str =~ s/&/\&/go;
$str =~ s/\</go;
$str =~ s/>/\>/go;
## add a small syntax highlighting
$str =~ s{(\<[!?/]?)(\w+)(.*?)([!?/]?\>)}
{$1$2 $3 $4 }gs;
$str =~ s{\<!--(.*?)--\>}{\<!--$1 --\>}gs;
$str =~ s{href="([^\"]+)"}{href="$1 "}gs;
return "$str "
}
#
# hextype()
# -------
# This function generates a human readable representation of binary data
#
sub hextype {
my $data = shift; ## data to print
my $colwidth = shift || 16; ## width of ASCII column
my $half = $colwidth/2;
my $line = 1;
my $out = '';
while(length $data) {
my @hex = unpack 'C'x$colwidth, substr($data, 0, $colwidth);
substr($data, 0, $colwidth) = '';
$out .= sprintf '%3d: '. ((('%02x 'x$half).' ')x2) .' ', $line++, @hex;
$out .= sprintf ''.('%s'x$half)x2 . "\n", map { $_ > 32 ? chr : '.' } @hex;
}
return $out
}
#
# simple_wrap()
# -----------
# This function wraps the text given in parameter.
#
sub simple_wrap {
my $orig = ref $_[0] ? $_[0] : \$_[0];
my $text = '';
my $curlen = 0;
my $beg = ' 'x5;
my $cols = 75;
while($$orig =~ m/(\s*\S+\s+)/gm) {
if($curlen + length($1) > $cols) {
$text .= "\n$beg$1";
$curlen = 1 + length($beg) + length($1)
} else {
$text .= $1;
$curlen += length $1;
}
$curlen = 0 if index($1, "\n") >= 0;
}
return $text
}
#
# load_entities()
# -------------
#
sub load_entities {
%entities = (
## Special entities
quot => [ 34, '"'],## double quote
quote => [ 34, '"'],## double quote
amp => [ 38, '&'],## ampersand
apos => [ 39, '''],## single quote
lt => [ 60, '<'],## less than sign
gt => [ 62, '>'],## greater than sign
## Spacing characters
nbsp => [ 32, ' '], ## non-breaking space (real value #160)
ensp => [ 32, ' '], ## en space (real value: #8194, U+2002)
emsp => [ 32, ' '], ## em space (real value: #8195, U+2003)
thinsp => [ 32, ' '], ## thin space (real value: #8201, U+2009)
zwnj => [ 0, '' ], ## zero width non-joiner (real value: #8204, U+200C)
zwj => [ 0, '' ], ## zero width joiner (real value: #8205, U+200D)
## Latin Extended-A entities + Mathematical symbols
sbquo => [130, ','], ## single low-9 quotation mark
fnof => [131, 'f'], ## latin small f with hook = florin
bdquo => [132, ',,'], ## double low-9 quotation mark
hellip => [133, '...'], ## horizontal ellipsis
dagger => [134, ' '], ## dagger
Dagger => [135, ' '], ## double dagger
circ => [136, '^'], ## modifier letter circumflex accent
permil => [137, 'o/oo'], ## per mille sign
Scaron => [138, 'S'], ## latin capital letter S with caron
lsaquo => [139, '<'],## single left-pointing angle quotation mark
OElig => [140, 'OE'], ## latin capital ligature OE
lsquo => [145, "'"], ## left single quotation mark
rsquo => [146, "'"], ## right single quotation mark
ldquo => [147, '"'], ## left double quotation mark
rdquo => [148, '"'], ## right double quotation mark
bull => [149, 'o'], ## bullet
ndash => [150, '-'], ## en dash
mdash => [151, '--'], ## em dash
tilde => [152, '~'], ## small tilde
trade => [153, '(tm)'], ## trademark sign
scaron => [154, 's'], ## latin small letter s with caron
rsaquo => [155, '>'],## single right-pointing angle quotation mark
oelig => [156, 'oe'], ## latin small ligature oe
Yuml => [159, 'Y'], ## latin capital letter Y with diaeresis
## ISO-Latin-1 entities
iexcl => [161, '!'],
cent => [162, '-c-'],
pound => [163, '-L-'],
curren => [164, 'CUR'],
yen => [165, 'YEN'],
brvbar => [166, '|'],
sect => [167, 'S:'],
uml => [168, '"'],
copy => [169, '(c)'],
ordf => [170, '-a'],
laquo => [171, '<<'],
'not' => [172, 'NOT'],
shy => [173, '-'],
reg => [174, '(R)'],
macr => [175, '-'],
deg => [176, 'DEG'],
plusmn => [177, '+/-'],
sup2 => [178, '^2'],
sup3 => [179, '^3'],
acute => [180, "'"],
micro => [181, 'u'],
para => [182, 'P:'],
middot => [183, '.'],
cedil => [184, ','],
sup1 => [185, '^1'],
ordm => [186, '-o'],
raquo => [187, '>>'],
frac14 => [188, ' 1/4'],
frac12 => [189, ' 1/2'],
frac34 => [190, ' 3/4'],
iquest => [191, '?'],
Agrave => [192, 'A'],
Aacute => [193, 'A'],
Acirc => [194, 'A'],
Atilde => [195, 'A'],
Auml => [196, 'Ae'],
Aring => [197, 'A'],
AElig => [198, 'AE'],
Ccedil => [199, 'C'],
Egrave => [200, 'E'],
Eacute => [201, 'E'],
Ecirc => [202, 'E'],
Euml => [203, 'E'],
Igrave => [204, 'I'],
Iacute => [205, 'I'],
Icirc => [206, 'I'],
Iuml => [207, 'I'],
ETH => [208, 'DH'],
Ntilde => [209, 'N'],
Ograve => [210, 'O'],
Oacute => [211, 'O'],
Ocirc => [212, 'O'],
Otilde => [213, 'O'],
Ouml => [214, 'Oe'],
'times' => [215, '*'],
Oslash => [216, 'O'],
Ugrave => [217, 'U'],
Uacute => [218, 'U'],
Ucirc => [219, 'U'],
Uuml => [220, 'Ue'],
Yacute => [221, 'Y'],
THORN => [222, 'P'],
szlig => [223, 'ss'],
agrave => [224, 'a'],
aacute => [225, 'a'],
acirc => [226, 'a'],
atilde => [227, 'a'],
auml => [228, 'ae'],
aring => [229, 'a'],
aelig => [230, 'ae'],
ccedil => [231, 'c'],
egrave => [232, 'e'],
eacute => [233, 'e'],
ecirc => [234, 'e'],
euml => [235, 'e'],
igrave => [236, 'i'],
iacute => [237, 'i'],
icirc => [238, 'i'],
iuml => [239, 'i'],
eth => [240, 'e'],
ntilde => [241, 'n'],
ograve => [242, 'o'],
oacute => [243, 'o'],
ocirc => [244, 'o'],
otilde => [245, 'o'],
ouml => [246, 'o'],
divide => [247, '/'],
oslash => [248, 'o'],
ugrave => [249, 'u'],
uacute => [250, 'u'],
ucirc => [251, 'u'],
uuml => [252, 'u'],
yacute => [253, 'y'],
thorn => [254, 'p'],
yuml => [255, 'y'],
);
}
#
# warning()
# -------
sub warning {
print STDERR 'html2wml: warning: ', @_
}
#
# fatal()
# -----
sub fatal {
print STDERR 'html2wml: fatal: ', @_;
exit -1;
}
#
# debug()
# -----
sub debug {
if($options{'debug'}) {
my $level = ref $_[0] ? shift->[0] : 1;
print STDERR @_ if $level <= $options{'debug'}
}
}
#
# version()
# -------
sub version {
print "$program/$version\n"; exit
}
#
# usage()
# -----
sub usage {
print STDERR <<"USAGE"; exit
usage: $0 [options] file [-o output]
options:
-a, --ascii use 7 bits ASCII emulation to convert named entities
--nocollapse don't collapse spaces and empty paragraphs
--hreftmpl=template set the template for the links reconstruction engine
-i, --ignore-images completly ignore image links
--noimg-alt-text don't replace the images by their alternative text
--nolinearize don't linearize the tables
-n, --numeric-non-ascii convert non-ASCII characters to numeric entities
-p, --nopre don't use the tag
--split-card slice the document by cards (default)
--split-deck slice the document by decks
-s, --max-card-size=size set the card size upper limit
-t, --card-split-threshold=size set the card splitting threshold
--next-card-label=label set the label of the link to the next card
--prev-card-label=label set the label of the link to the previous card
-U, --http-user set the HTTP user
-P, --http-passwd set the HTTP password
-Y, --proxy use proxy settings provided by environnement
--noproxy don't use proxy
-k, --compile compile the result in binary form
-o, --output=outfile select the outpout (stdout if none specified)
-d, --debug=n activate the debug mode (always prints to stdout)
-c, --xmlcheck activate the XML well-formedness and validity check
-h, --help show this help screen and exit
-v, --version show the program name and version and exit
Read the documentation for more information.
USAGE
}
#
# cgi_error()
# ---------
sub cgi_error {
if($options{'debug'}) {
print <<"OUTPUT"; exit
Html2Wml - Error
Html2Wml - Error
This program was called with incorrect parameters or an error occured
when processing the request. Please check your request and try again.
@_
$program v$version
OUTPUT
} else {
print $cgi->header(-type => 'text/vnd.wap.wml'), <<"OUTPUT"; exit
$defaults{wmlvers}
This program was called with incorrect parameters or an error occured
when processing the request. Please check your request and try again.
@_
_____ $program v$version
OUTPUT
}
}
1;
html2wml-0.4.11/INSTALL 0100644 0000765 0000024 00000004562 07746772372 0014311 0 ustar 00maddingue staff INSTALLATION NOTES
Installation should be quite simple once you have installed the required
modules (see below).
Edit the Makefile to check the location where you want to install Html2Wml
(variable PREFIX), then type "make" to check your configuration, then
"make install".
REQUIREMENTS
Html2Wml requires Perl 5.004 or later.
In addition, Html2Wml needs the following Perl modules in order to run.
You can find them on the CPAN (http://www.cpan.org/).
* CGI, FileHandle, File::Basename, Getopt::Long, IPC::Open2, POSIX
included in standard Perl distribution
* HTML::Parser
included in the HTML-Parser package
* LWP::UserAgent
included in the libwww-perl package
* URI, URI::URL
included in the URI package
* Text::Template
included in the Text-Template package
Optionaly, you can also install these modules, but they are *not* needed,
unless you want to use the --xmlcheck option.
* XML::Parser
included in the XML-Parser package
* XML::LibXML
included in the XML-Parser package
* XML::Checker
included in the XML-Parser package
If you want to create compiled WML decks, please check the README file
in the wml_compilation/ directory.
TESTED PLATFORMS
Html2Wml is currently developped under Mac OS X.
It has been tested on the following platforms:
_________________________________________________________________
| Operating System | WML conversion | WML compilation (*) |
|--------------------------|----------------|---------------------|
| Mac OS X / Darwin | OK | WT: OK, KG: n/a |
| Mac OS Classic (PowerPC) | OK | WT: n/a, KG: n/a |
| GNU/Linux based distros | OK | WT: OK, KG: OK |
| FreeBSD, NetBSD, OpenBSD | OK | WT: OK, KG: ??? |
-----------------------------------------------------------------
(*) WT = WML Tools compiler, KG = Kannel Gateway based compiler
If your system isn't listed here and that Html2Wml works without problem,
you can mail me so that I add it on the list.
On the contrary, if your system is listed but that you have troubles
running Html2Wml, feel free to ask me.
html2wml-0.4.11/Makefile 0100644 0000765 0000024 00000002670 07746772372 0014716 0 ustar 00maddingue staff # =========================================================================== #
# Configuration
PREFIX=/usr/local
CGIDIR=/home/httpd/cgi-bin
# =========================================================================== #
# Do not edit the following unless you know what are you are doing
# Html2Wml version
VERSION=0.4.11
# Standard subdirectories locations
BINDIR=$(PREFIX)/bin
MANDIR=$(PREFIX)/man
DOCDIR=$(PREFIX)/doc/html2wml
MAN1DIR=$(MANDIR)/man1
# Default target
default: check-config
@echo
@echo "Current configuration: "
@echo PREFIX=$(PREFIX)
@echo " => BINDIR=$(BINDIR)"
@echo " MANDIR=$(MANDIR)"
@echo " DOCDIR=$(DOCDIR)"
@echo " CGIDIR=$(CGIDIR)"
@echo
@echo "Type \"make install\" to install Html2Wml."
@echo "Type \"make installcgi\" to install Html2Wml in your CGI directory."
# Installaton targets
install: check-config directories
install -m 555 html2wml.cgi $(BINDIR)/html2wml
install -m 444 doc/html2wml.man $(MAN1DIR)/html2wml.1
install -m 444 COPYING INSTALL NEWS NOTES README TODO ChangeLog $(DOCDIR)
make -C t/ install DOCDIR=$(DOCDIR)
make -C doc/ install DOCDIR=$(DOCDIR)
installcgi: check-config
install -m 555 html2wml.cgi $(CGIDIR)
uninstall:
rm -f $(BINDIR)/html2wml
rm -f $(MAN1DIR)/html2wml.1
rm -rf $(DOCDIR)
directories:
install -d $(BINDIR) $(MAN1DIR) $(DOCDIR)
check-config:
@echo "Checking if your config looks good..."
@perl checkconfig.pl
@echo "Sounds ok."
html2wml-0.4.11/Makefile.debian 0100644 0000765 0000024 00000002415 07746772372 0016134 0 ustar 00maddingue staff # =========================================================================== #
# Configuration
DESTDIR=
CGIDIR=$(DESTDIR)/usr/lib/cgi-bin
# =========================================================================== #
# Do not edit the following unless you know what are you are doing
# Html2Wml version
VERSION=0.4.11
# Standard subdirectories locations
BINDIR=$(DESTDIR)/usr/bin
MANDIR=$(DESTDIR)/usr/share/man
DOCDIR=$(DESTDIR)/usr/share/doc/html2wml
MAN1DIR=$(MANDIR)/man1
# Default target
default: check-config
@echo
@echo "Current configuration: "
@echo DESTDIR=$(DESTDIR)
@echo " => BINDIR=$(BINDIR)"
@echo " MANDIR=$(MANDIR)"
@echo " DOCDIR=$(DOCDIR)"
@echo " CGIDIR=$(CGIDIR)"
@echo
@echo "Type \"make install\" to install Html2Wml."
@echo "Type \"make installcgi\" to install Html2Wml in your CGI directory."
# Installaton targets
install: check-config directories
install -m 555 html2wml.cgi $(BINDIR)/html2wml
installcgi: check-config
install -m 555 html2wml.cgi $(CGIDIR)
uninstall:
rm -f $(BINDIR)/html2wml
rm -f $(MAN1DIR)/html2wml.1
rm -rf $(DOCDIR)
directories:
install -d $(BINDIR) $(MAN1DIR) $(DOCDIR)
check-config:
@echo "Checking if your config looks good..."
@perl checkconfig.pl
@echo "Sounds ok."
html2wml-0.4.11/NEWS 0100644 0000765 0000024 00000006622 07746772372 0013756 0 ustar 00maddingue staff WHAT'S NEW (version 0.4.11)
Several enhancements and bugfixes in the URL translation
engine.
Extended the pure ASCII characters conversion.
Html2cHtml has been updated with the new code.
WHAT'S NEW (version 0.4.10)
Several important bugfixes. The quality of the output WML
should be greatly improved.
Added the "skip" action and the "for" action parameter.
Added proxy support.
Includes an alpha HTML to cHTML (iMode) converter.
WHAT'S NEW (version 0.4.9)
Minor bugfix.
WHAT'S NEW (version 0.4.8)
Many bugfixes and some small optimizations.
Entities conversion has been improved. URLs are now correctly
encoded. Warnings messages have been made more consistent.
Security fix: accessing local files in CGI mode should now be
completly impossible.
WHAT'S NEW (version 0.4.7)
Many bugs and issues were corrected, including the WML compilation
which now works correctly. Some small optimizations were added.
WHAT'S NEW (version 0.4.6)
Added HTTP basic authentication.
Added option --output to specifiy an output other than stdout.
The rendering has been improved in several ways to reduce the noise.
Unclosed named and numeric entities are handled and fixed.
Several bugs were fixed. Several internal clean-ups.
WHAT'S NEW (version 0.4.5)
Added support for frames and image maps. Added full support for
Latin Extended-A entities, plus some others non-standard entities.
Several bugs and warnings corrected.
WHAT'S NEW (version 0.4.4)
Added the support for deck splitting (in CGI mode only).
HTML highlighting was enhanced. Anchors are correctly tranformed.
Some bugs were corrected, plus some code speedups and some minor
imrovments.
WHAT'S NEW (version 0.4.3)
Added the support for sending the charset encoding information.
Images are replaced by their alternative text (see the --img-alt-text
option). In order to save up some bytes in the CGI mode, generated
links take use relative URLs when possible, and short options are
available.
Plus several internal cleans-up, bugs corrections and others less
important features added.
Thanks to Igor Khristophorov for his suggestions and his patches.
WHAT'S NEW (version 0.4.2)
Non-ASCII characters and ampersands are now converted to their
corresponding entities. A small bug that prevented the proper
handling of https URLs has been corrected. Plus some other minor
changes.
WHAT'S NEW (version 0.4.1)
The debug mode doesn't use Text::Wrap anymore, the result was really
too messy. I wrote a small function, simple_wrap(), which does what
I had in mind. Particularly, it preserves the line numbering of the
output.
Added support for very bad-formed HTML documents. This includes
particularly documents with no HTML or BODY tags.
Added the --ignore-images option, which tells Html2Wml to completly
ignore the image link. It's enabled by default in CGI mode.
WHAT'S NEW (version 0.4.0)
The link reconstruction engine now uses Text::Template, which offer
a great flexibility to change the links.
The warnings Html2Wml could generate have been corrected. This fix
should also act as a speed-up of the code.
The debug mode has been improved.
The documentation has been completely rewritten.
html2wml-0.4.11/NOTES 0100644 0000765 0000024 00000001055 07746772372 0014065 0 ustar 00maddingue staff NOTE FOR TESTERS
If you want to test Html2Wml using the CGI mode, you can use the file
form.html located in the t/ directory, which contains the appropriate
form to call Html2Wml. This form also allow you to test the different
options of Html2Wml.
Html2Wml can be used this way to view HTML pages converted to WML
using a WML browser or a Wap emulator.
Also, there are a few sample HTML pages in the t/samples/ directory
that can show you how Html2Wml converts different qualities of HTML
into WML.
html2wml-0.4.11/README 0100644 0000765 0000024 00000002421 07746772372 0014130 0 ustar 00maddingue staff
: : : : ::: : : :
: : : : : : : : :
: : ::: :::::: : : : : :::::: :
::::: : : : : : : : : : : : : :
: : : : : : : : : : : : : : : :
: : : : : : : : : : : : : : : :
: : : : : : : ::::: : : : : : :
Html2Wml is a Perl program that converts HTML documents
to WML decks, i.e. documents that are viewable on a Wap
device. It is available under the GNU General Public
License version 2 or later, and can be downloaded from
the web site of the author
http://www.maddingue.org/techie/
or on the web site of Html2Wml hosted by SourceForge
http://htmlwml.sourceforge.net/
For installation notes, please read the INSTALL file.
A small but quite complete documentation is available in
the doc/ directory in several formats: POD (perldoc),
nroff (man), text, HTML and PDF. They are normally
installed when you 'make install'.
If you have some questions, suggestions, comments, critics,
etc, please send a mail to htmlwml-general@lists.sourceforge.net
:. .: : : .
: : : .. .: .: . .. .. . . ... Sebastien Aperghis-Tramoni
: : :.: :.: :.: : : : :.: :.: ::. maddingue@free.fr
..:
html2wml-0.4.11/t 0040755 0000765 0000024 00000000000 07746772373 0013441 5 ustar 00maddingue staff html2wml-0.4.11/t/form.html 0100644 0000765 0000024 00000003365 07746772372 0015354 0 ustar 00maddingue staff
Html2Wml Test Page
Html2Wml Test Page
This page illustrates how to call Html2Wml from an HTML form.
It also shows that most of the options of Html2Wml can be used in
CGI mode.
If you uncheck the "Debug" option, the result will be sent as
a real WML deck, which is not viewable with standard web browsers.
html2wml-0.4.11/t/Makefile 0100644 0000765 0000024 00000000406 07746772372 0015154 0 ustar 00maddingue staff help:
@echo "available targets"
@echo " install install the examples in the documentation directory"
install:
@install -d $(DOCDIR)/t
@install -m 444 *.html $(DOCDIR)/t
@install -d $(DOCDIR)/t/samples
@install -m 444 samples/* $(DOCDIR)/t/samples
html2wml-0.4.11/t/samples 0040755 0000765 0000024 00000000000 07746772373 0015105 5 ustar 00maddingue staff html2wml-0.4.11/t/samples/inc_para.wml 0100644 0000765 0000024 00000000115 07746772372 0017452 0 ustar 00maddingue staff This paragraph has been included from another file.
html2wml-0.4.11/t/samples/inc_prev.wml 0100644 0000765 0000024 00000000107 07746772372 0017504 0 ustar 00maddingue staff
html2wml-0.4.11/t/samples/mix1.html 0100644 0000765 0000024 00000004644 07746772372 0016734 0 ustar 00maddingue staff
Html2Wml Test Document -- Typical page
Html2Wml Test Document -- Typical page
This page is supposed to contain most elements that most pages
on the Internet usually contain: some CSS, a bit of JavaScript, a
table. And some of the end tags are missing in order to test the
syntax repair engine of Html2Wml.
This text is not enclosed in a P nor in a DIV tag. It is a child
of the BODY element. This is valid in HTML, but won't in WML.
This paragraph contains a link to Html2Wml home page .
This one contains an image.
Paragraph with a break between the two dashes - - (this
text should be on next line).
This is an incorrect paragraph: it has no closing tag.
This following table is simpler than the previous but has a
very ugly HTML code. (And this paragraph is closed twice).
This paragraph is correct, but the bold and italic
are inverted (this is BAD).
Just before this paragraph, there was an horizontal rule (HR tag),
but here is currently no way to render that.
html2wml-0.4.11/t/samples/mix2.html 0100644 0000765 0000024 00000002264 07746772372 0016731 0 ustar 00maddingue staff
Html2Wml Test Document -- Typical page
Html2Wml Test Document -- Typical page
This page comes from CGI::WML test pages.
I am some text.
this is italic
bold
This parargaph is not closed
html2wml-0.4.11/t/samples/struct.html 0100644 0000765 0000024 00000003371 07746772372 0017376 0 ustar 00maddingue staff
Html2Wml Test Document -- Structured document
Html2Wml Test Document -- Structured document
This is an example of a HTML page which uses only standard structural
tags (H1, H2, P). This is typical of HTML pages created by people who
conform to W3C standards.
Pourquoi ne pas utiliser les produits Kro ?
Introduction
Ce document explique pourquoi ce N'est Pas Bien d'utiliser les
produits Kro.
Digression
Première raison : les produits Kro ne sont pas bien faits.
Résultat, la plupart sont très instables. Et quand un
produit Kro est relativement stable, cela signifie que Kro a pompé
une bonne partie d'un produit semblable mais de meilleure qualité,
fabriqué par des gens plus sérieux que Kro.
Autant donc acheter le produits des gens sérieux plutôt
que le produit Kro.
Deuxième raison : les produits Kro étant instables,
ils sont source d'un énervement prolongé qui conduit
à un stress permanent. Les produits Kro sont donc mauvais pour
la santé.
Troisième raison : le stress fait que les consommateurs des
produits Kro dégage plus de méthane, ce qui est mauvais
pour la couche d'ozone. Kro contribue donc à la destruction de
la couche d'ozone.
Conclusion
Il ne faut pas utiliser les produits de Kro.
Note anti-neuneu
Pour les ceusses qui auraient un peu de mal, c'est bien
évidemment parodique.. Il fallait juste que je mette
un texte entre les tags, alors, j'ai écris ce qui
m'est passé par la tête ;-)
html2wml-0.4.11/t/samples/t-a.html 0100644 0000765 0000024 00000000772 07746772372 0016535 0 ustar 00maddingue staff
Links
Links
normal link
link with a bold tag inside
link inside a bold tag
link inside a bold tag but end tags are inverted
link with a paragraph tag inside
link inside a paragraph tag
link inside a paragraph tag but end tags are inverted