unhtml-2.3.9/0000700000232200023220000000000010274157514013216 5ustar pbuilderpbuilderunhtml-2.3.9/debian/0000755000232200023220000000000010274163127014447 5ustar pbuilderpbuilderunhtml-2.3.9/debian/changelog0000644000232200023220000001165710401355706016331 0ustar pbuilderpbuilderunhtml (2.3.9) unstable; urgency=low * New maintainer (closes: #309264). -- Víctor Pérez Pereira Wed, 1 Mar 2006 13:36:29 -0400 unhtml (2.3.8) unstable; urgency=low * QA upload. Switch to debhelper. -- Santiago Vila Wed, 3 Aug 2005 17:43:46 +0200 unhtml (2.3.7) unstable; urgency=low * Orphaning the package. Maintainer changed to Debian QA Group -- Al Stone Sun, 15 May 2005 20:25:51 -0600 unhtml (2.3.6) unstable; urgency=low * Forgot to change the Maintainer field in debian/control. Doh. * Closes: bug#134447 -- lintian warnings about standards level * Added test cases and tests directory * Closes: bug#58137 -- arbitrary limit on tag lengths removed * Closes: bug#58135 -- too naive about <> in plain text; unhtml now checks to see if a tag is a known html tag and only removes the text if it is. -- Al Stone Sun, 25 Apr 2004 22:07:18 -0600 unhtml (2.3.5) unstable; urgency=low * New maintainer. * Closes: bug#234419 -- ITA. * Closes: bug#164613 -- typo in README.debian and long description of the package -- Al Stone Fri, 23 Apr 2004 17:28:49 -0600 unhtml (2.3.4) unstable; urgency=low * Having already orphaned this package here is my last upload for it setting the package maintainer to Debian QA Group. -- Paul Seelig Wed, 10 Mar 2004 12:51:46 +0100 unhtml (2.3.3) unstable; urgency=low * Fixed override disparity -- Paul Seelig Fri, 21 Dec 2001 16:26:55 +0200 unhtml (2.3.2) unstable; urgency=low * Fixed some of the minor lintian errors which are listed at "http://lintian.debian.org/reports/mPaul_Seelig.html". -- Paul Seelig Wed, 19 Dec 2001 19:36:43 +0200 unhtml (2.3.1) unstable; urgency=low * Correcting falsely defined Build-Depends. 
(closes: #90108) -- Paul Seelig Mon, 19 Mar 2001 13:50:57 +0200 unhtml (2.3.0) unstable; urgency=low * Adapt version number to reflect selfcontainedness of package source. * Added Build-Depends in debian/control file. (closes: #89150) -- Paul Seelig Thu, 15 Mar 2001 03:47:17 +0200 unhtml (2.3-1) unstable; urgency=low * Increased MAX_TAG_SIZE in ops.h from 1024 to 65536 (closes: #64215) * Since there is no upstream maintainer for the sources anymore i've decided to make this package a self contained Debian package, thus getting rid of any dependencies on external source tarballs. -- Paul Seelig Sun, 18 Feb 2001 14:10:42 +0200 unhtml (2.2-5) frozen unstable; urgency=low * Recompiled with proper glibc support for frozen -- Paul Seelig Wed, 19 Jan 2000 12:51:04 +0200 unhtml (2.2-4) unstable; urgency=low * Added some patches sent by "Prophet of the way " to enable proper conversion of HTML entities. Closes Bug#47366. -- Paul Seelig Wed, 27 Oct 1999 18:29:10 +0200 unhtml (2.2-3) unstable; urgency=low * Upstream author agreed to make his source really conform to the GPL This should address Bug#17828 and unhtml is now DFSG compliant. :-) * Changed maintainer email address to "pseelig@debian.org". * Bumped up to standards version 3.0.1 -- Paul Seelig Wed, 13 Oct 1999 13:04:07 +0200 unhtml (2.2-2) unstable frozen; urgency=low * Moved to non-free until copyright gets properly resolved. This should address Bug#17828. * Changed maintainer email address to "pseelig@debian.org". -- Paul Seelig Sun, 06 Dec 1998 16:09:59 +0200 unhtml (2.2-1) unstable; urgency=low * New upstream release -- Paul Seelig Wed, 04 Feb 1998 16:09:59 +0200 unhtml (2.1-4) unstable; urgency=low * Corrected typos in debian/control file and debian/README.debian closing bug #16790. 
-- Paul Seelig Fri, 16 Jan 1998 15:31:52 +0200 unhtml (2.1-3) unstable; urgency=low * Recompiled with libc6 -- Paul Seelig Wed, 10 Dec 1997 10:31:52 +0200 unhtml (2.1-2) unstable; urgency=low * The original author agreed with the change of his programs' sources and name and provided a modified version of his sources under "ftp://kombat.acadiau.ca/pub/linux/Utils/clean-2.1/unhtml-2.1.tar.gz" which in turn was used for building this Debian package. -- Paul Seelig Mon, 29 Sep 1997 18:06:36 +0200 unhtml (2.1-1) unstable; urgency=low * Initial Release based on package clean-2.1 which was renamed to 'unhtml' for mnemonic and description reasons. * Changed all occurrences of "clean" to "unhtml" in the clean-2.1 source package. The original author agreed with the change of his programs' sources and name. -- Paul Seelig Wed, 20 Aug 1997 01:18:21 +0200 unhtml-2.3.9/debian/compat0000644000232200023220000000000210274156330015643 0ustar pbuilderpbuilder4 unhtml-2.3.9/debian/control0000644000232200023220000000075210401355476016060 0ustar pbuilderpbuilderSource: unhtml Section: text Priority: extra Maintainer: Víctor Pérez Pereira Build-Depends: debhelper (>= 4) Standards-Version: 3.6.2 Package: unhtml Architecture: any Depends: ${shlibs:Depends} Description: Remove the markup tags from an HTML file This program removes all HTML tags from an HTML file and directs its output to stdout. It can be used as a filter for getting the text content of an HTML file without the need of firing up a web browser. unhtml-2.3.9/debian/copyright0000644000232200023220000000602710401355476016411 0ustar pbuilderpbuilderThis package was debianized by Paul Seelig on Wed, 20 Aug 1997 01:18:21 +0200. 
It is currently maintained by Víctor Pérez Pereira It is based on the package 'clean-2.1' which was renamed in agreement with the original author to 'unhtml-2.1' for mnemonic and description reasons and was downloaded under the name of "unhtml-2.1.tar.gz" from the author's original site at "ftp://kombat.acadiau.ca/pub/linux/Utils/clean-2.1/". Copyright: Name: unhtml Version: 2.1 Written By: VisionWare (Kevin Swan) Copyright (c) 1996 Completed: Last revisions made April 25, 1996 This software is freely distributable under the GNU Public License, and may be altered in any way. The only condition for redistribution is that this README file be included with the revised/redistributed software, with the only modifications being additions describing what's been changed. The upstream author authorized me to rerelease his source in order to make it really conform to the terms of the GPL. Here is his e-mail in this regard making "unhtml" really DFSG compliant: ---------- snip --------------- Date: Wed, 9 Dec 1998 02:51:18 -0400 (AST) From: Kevin Swan <013639s@dragon.acadiau.ca> To: pseelig@mail.uni-mainz.de Subject: Re: Uploaded unhtml-2.2-2 (source i386) to master Resent-Date: Sun, 13 Dec 1998 23:38:03 +0100 (CET) Resent-From: Paul Seelig Resent-To: pseelig@ietpd1.sowi.uni-mainz.de Resent-Subject: Re: Uploaded unhtml-2.2-2 (source i386) to master > Can i take it for granted now that unhtml can be released under the > GPL now? May i include any mail reply from you to this message into > the Debian package? Yes, you may. :) You are likely much more familiar with the terms of the GPL than I, and I would appreciate it if you would make the necessary modifications to my copyright agreement to make it agree with the GPL spirit. Thank you in advance, Sincerely, Kevin Swan. 
-- ========================================================================= Kevin Swan Acadia University 013639s@dragon.acadiau.ca Fifth Year http://dragon.acadiau.ca/~013639s/ BCSH ========================================================================= ---------- snip --------------- This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License with the Debian GNU/Linux distribution in file /usr/share/common-licenses/GPL; if not, write to the Free Software Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA. unhtml-2.3.9/debian/README.debian0000644000232200023220000000064310274155745016562 0ustar pbuilderpbuilderunhtml for Debian ----------------- This program removes all HTML tags from a HTML file and directs its output to stdout. It can be used as a filter for getting the text content of a HTML file without the need of firing up a web browser. Additional patches for removal of HTML entities were kindly added by "Prophet of the way ". 
Paul Seelig , Wed, 20 Aug 1997 01:18:21 +0200 unhtml-2.3.9/debian/rules0000755000232200023220000000115710274156524015536 0ustar pbuilderpbuilder#!/usr/bin/make -f package = unhtml CC = gcc CFLAGS = -g -Wall ifeq (,$(findstring noopt,$(DEB_BUILD_OPTIONS))) CFLAGS += -O2 endif build: $(MAKE) CC="$(CC)" CFLAGS="$(CFLAGS)" touch build clean: dh_clean rm -f build -$(MAKE) clean binary-indep: build binary-arch: build dh_clean dh_installdirs usr/bin install unhtml `pwd`/debian/$(package)/usr/bin dh_installdocs Readme.html Readme.txt dh_installman unhtml.1 dh_installchangelogs dh_strip dh_compress dh_fixperms dh_shlibdeps dh_gencontrol dh_md5sums dh_builddeb binary: binary-indep binary-arch .PHONY: binary binary-arch binary-indep clean unhtml-2.3.9/Readme.html0000600000232200023220000000326407243730632015310 0ustar pbuilderpbuilder Unhtml - Document Parser

Unhtml

Kevin Swan, 013639s@dragon.acadiau.ca
Version 2.3
Last revised: February 3, 1998

DESCRIPTION

unhtml is a program I developed in April of 1996 to remove the HTML formatting from documents and print the result to the standard output stream. Earlier versions treated everything between a less-than sign and a greater-than sign as an HTML tag and removed it. This version is more intelligent in that it also recognizes <SCRIPT> blocks.

Please report any bugs you find or comments you have to 013639s@dragon.acadiau.ca
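
The script-aware stripping described above can be illustrated with a small sketch. It is not the code shipped in unhtml.c: strip_tags and starts_with_ci are hypothetical helpers that work on an in-memory string rather than a stream, and they do not decode entities or check tag names against a list of known HTML tags. The sketch only shows why <SCRIPT> blocks need special handling: a '>' inside script code must not be mistaken for the end of a tag.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Case-insensitive test that s begins with prefix (prefix is lowercase). */
static int starts_with_ci(const char *s, const char *prefix)
{
    while (*prefix != '\0') {
        if (tolower((unsigned char)*s) != *prefix)
            return 0;            /* also stops safely at the end of s */
        s++;
        prefix++;
    }
    return 1;
}

/*
 * Hypothetical helper, not part of unhtml: copies `in` to `out`, dropping
 * everything between '<' and '>' as well as the contents of <script>
 * blocks.  `out` must have room for strlen(in) + 1 bytes.
 */
static void strip_tags(const char *in, char *out)
{
    assert(in != NULL && out != NULL);

    while (*in != '\0') {
        if (*in != '<') {
            *out++ = *in++;      /* plain text passes through unchanged */
            continue;
        }
        if (starts_with_ci(in, "<script")) {
            /* Skip to the closing tag; a '>' in the script code is ignored. */
            in++;
            while (*in != '\0' && !starts_with_ci(in, "</script"))
                in++;
        }
        /* Drop the rest of the current tag, up to and including '>'. */
        while (*in != '\0' && *in != '>')
            in++;
        if (*in == '>')
            in++;
    }
    *out = '\0';
}
```

Called on the string "<b>Hi</b> <script>if (a > b) x();</script>there", the sketch writes "Hi there" to the output buffer; without the script check, the '>' in the comparison would end the tag early and leak script code into the output.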

DISTRIBUTION

This software is freely distributable under the GNU Public License, and may be altered in any way. The only condition for redistribution is that this README file be included with the revised/redistributed software, with the only modifications being additions describing what's been changed.

The software's original author, Kevin Swan 013639s@dragon.acadiau.ca, is willing to negotiate different licensing terms with interested parties for commercial reuse of the source code.

INSTALLATION

You can compile and install unhtml by doing the following as root:


         make && make install

Kevin Swan unhtml-2.3.9/Makefile0000600000232200023220000000127110043051657014654 0ustar pbuilderpbuilder# # Makefile for unhtml # # If you are parsing an HTML file that has a large Javascript program # in it, you may need to set MAX_TAG_SIZE to something higher and # recompile # CC = gcc MAKE = make RM = rm -f CFLAGS = -Wall # CFLAGS = -Wall -DDEBUG all: unhtml unhtml: unhtml.o ops.o esc.o $(CC) -o unhtml unhtml.o ops.o esc.o unhtml.o: unhtml.c ops.h esc.h $(CC) $(CFLAGS) -c unhtml.c ops.o: ops.c ops.h $(CC) $(CFLAGS) -c ops.c esc.o: esc.c esc.h $(CC) $(CFLAGS) -c esc.c check: all $(MAKE) -C tests clean: $(RM) core *.o unhtml install: cp unhtml /usr/local/bin chmod 755 /usr/local/bin/unhtml cp unhtml.1 /usr/local/man/man1 chmod 644 /usr/local/man/man1/unhtml.1 unhtml-2.3.9/Readme.txt0000600000232200023220000000230607243730632015157 0ustar pbuilderpbuilder ======================== Unhtml - Document Parser ======================== Kevin Swan, 013639s@dragon.acadiau.ca Version 2.3 Last revised: February 21, 1999 DESCRIPTION unhtml is a program developed by me in April of 1996 to remove the HTML formatting from documents and print the output to the standard output stream. It treated any occurrences of a greater than or less than sign as HTML tags and removes them. In this version, it is more intelligent in the sense that it recognizes <SCRIPT> blocks. Please report any bugs you find or comments you have to 013639s@dragon.acadiau.ca DISTRIBUTION This software is freely distributable under the GNU Public License, and may be altered in any way. The only condition for redistribution is that this README file be included with the revised/redistributed software, with the only modifications being additions describing what's been changed. The software's original author Kevin Swan <013639s@dragon.acadiau.ca> is agreeing to negotiate different licensing terms with interested parties for commercial reuse of the source code. 
INSTALLATION You can compile and install unhtml by doing the following as root: make && make install Kevin Swan unhtml-2.3.9/cnv.awk0000644000232200023220000000014107243730632014516 0ustar pbuilderpbuilder { printf("%-10s, %3d,\n","\""$1";\"",$2) } END { printf("%-10s, %3d\n","\"\"",0) } unhtml-2.3.9/esc.c0000644000232200023220000001174507243730632014156 0ustar pbuilderpbuilder#define DEBUG 0 /* 1:dummy main 2:numerical ascii codes off */ #include "esc.h" #include #include #include #define THRU 0 /* THRU mode simply prints chars */ #define HOLD 1 /* HOLD mode stocks chars in bff[] */ #define MXBF 7 /* length of longest escape sequence excluding '&' */ #define CHARSET 256 /* set to 256 if ISO-8859-1 (European) */ static int scmp(char *, char *); static int flush(int); static int mode=THRU; static char bff[MXBF]; static int index=0; #if DEBUG == 1 /* dummy main */ void main(){ char ch; while( (ch=getchar()) != EOF) m_putchar(ch); m_putchar(EOF); } #endif struct table { char * seq; int n; } ktbl[] = { {"gt;" , 62}, {"lt;" , 60}, {"amp;" , 38}, {"quot;" , 34}, #if CHARSET == 128 {"nbsp;" , 32}, {"shy;" , 45}, #elif CHARSET == 256 {"nbsp;" , 160}, {"iexcl;" , 161}, {"cent;" , 162}, {"pound;" , 163}, {"curren;" , 164}, {"yen;" , 165}, {"brvbar;" , 166}, {"sect;" , 167}, {"uml;" , 168}, {"copy;" , 169}, {"ordf;" , 170}, {"laquo;" , 171}, {"not;" , 172}, {"shy;" , 173}, {"reg;" , 174}, {"macr;" , 175}, {"deg;" , 176}, {"plusmn;" , 177}, {"sup2;" , 178}, {"sup3;" , 179}, {"acute;" , 180}, {"micro;" , 181}, {"para;" , 182}, {"middot;" , 183}, {"cedil;" , 184}, {"sup1;" , 185}, {"ordm;" , 186}, {"raquo;" , 187}, {"frac14;" , 188}, {"frac12;" , 189}, {"frac34;" , 190}, {"iquest;" , 191}, {"Agrave;" , 192}, {"Aacute;" , 193}, {"Acirc;" , 194}, {"Atilde;" , 195}, {"Auml;" , 196}, {"Aring;" , 197}, {"AElig;" , 198}, {"Ccedil;" , 199}, {"Egrave;" , 200}, {"Eacute;" , 201}, {"Ecirc;" , 202}, {"Euml;" , 203}, {"Igrave;" , 204}, {"Iacute;" , 205}, {"Icirc;" , 206}, {"Iuml;" , 207}, 
{"ETH;" , 208}, {"Ntilde;" , 209}, {"Ograve;" , 210}, {"Oacute;" , 211}, {"Ocirc;" , 212}, {"Otilde;" , 213}, {"Ouml;" , 214}, {"times;" , 215}, {"Oslash;" , 216}, {"Ugrave;" , 217}, {"Uacute;" , 218}, {"Ucirc;" , 219}, {"Uuml;" , 220}, {"Yacute;" , 221}, {"THORN;" , 222}, {"szlig;" , 223}, {"agrave;" , 224}, {"aacute;" , 225}, {"acirc;" , 226}, {"atilde;" , 227}, {"auml;" , 228}, {"aring;" , 229}, {"aelig;" , 230}, {"ccedil;" , 231}, {"egrave;" , 232}, {"eacute;" , 233}, {"ecirc;" , 234}, {"euml;" , 235}, {"igrave;" , 236}, {"iacute;" , 237}, {"icirc;" , 238}, {"iuml;" , 239}, {"eth;" , 240}, {"ntilde;" , 241}, {"ograve;" , 242}, {"oacute;" , 243}, {"ocirc;" , 244}, {"otilde;" , 245}, {"ouml;" , 246}, {"divide;" , 247}, {"oslash;" , 248}, {"ugrave;" , 249}, {"uacute;" , 250}, {"ucirc;" , 251}, {"uuml;" , 252}, {"yacute;" , 253}, {"thorn;" , 254}, {"yuml;" , 255}, #endif {"" , 0} }; int m_putchar(int chr){ struct table *ptr; if ( mode == THRU ) switch (chr){ case '&': mode=HOLD; return '&'; case EOF: return EOF; default : return putchar(chr); /* most chars pass through here */ } /* mode == HOLD */ else switch (chr) { case '&': return flush(HOLD); case EOF: return flush(THRU); case ';': bff[index++]=';'; /* delimiter */ for(ptr=ktbl; !scmp(ptr->seq,bff); ptr++) ; chr= ptr->n; /* chr == 0 if no match */ #if DEBUG == 0 || DEBUG == 2 if(chr==0 && bff[0]=='#') /* numerical ascii code */ /* without the following strong tests seqs like "K;" would fall through */ { if( (bff[1]=='X' || bff[1]=='x') && isxdigit(bff[2]) && isxdigit(bff[3]) ) /* hexadecimal */ /* /&#[Xx][0-9A-Da-d][0-9A-Da-d];/ */ chr=strtoul( &bff[2], NULL, 16 ); else if( (bff[2]==';' && isdigit(bff[1]) ) /* decimal */ /* /&#[0-9];/ */ || (bff[3]==';' && isdigit(bff[1]) && isdigit(bff[2]) ) /* decimal */ /* /&#[0-9][0-9];/ */ || (bff[4]==';' && (bff[1]=='1' || bff[1]=='2') && isdigit(bff[2]) && isdigit(bff[3]) ) /* decimal */ /* /&#[12][0-9][0-9];/ */ ) chr=strtoul( &bff[1], NULL, 10); } #endif if( ( chr<0 ) 
|| ( chr==0) || ( 0<=chr && chr<= 8 ) || ( 11<=chr && chr<= 12 ) || ( 14<=chr && chr<= 31 ) || ( 127<=chr && chr<=159 ) || ( CHARSET<=chr ) ) return flush(THRU); /* no match or undefined */ else { index=0; mode=THRU; return putchar(chr); } /* print converted character */ default : bff[index++]=chr; return(index==MXBF)?flush(THRU):chr; } } static int flush(int md){ /* sends chars in bff to output stream */ int r; int idx; register int i; idx=index; index=0; mode=md; if((r=putchar('&'))==EOF) return EOF; for(i=0;i #include #include #include "ops.h" /* * Checks if a given tag is an HTML script opening tag, . * It checks in a case-insensitive manner. * * Given: a string that is an HTML tag. * Return: 1 if that tag is a script closing tag, 0 otherwise. */ int isScriptClosingTag (char *tag) { int i = 0; while (tag[i] != '\0') { switch (i) { case 0: if (tag[i] != '<') return 0; break; case 1: if (tag[i] != '/') return 0; break; case 2: if ((tag[i] != 's') && (tag[i] != 'S')) return 0; break; case 3: if ((tag[i] != 'c') && (tag[i] != 'C')) return 0; break; case 4: if ((tag[i] != 'r') && (tag[i] != 'R')) return 0; break; case 5: if ((tag[i] != 'i') && (tag[i] != 'I')) return 0; break; case 6: if ((tag[i] != 'p') && (tag[i] != 'P')) return 0; break; case 7: if ((tag[i] != 't') && (tag[i] != 'T')) return 0; break; default: return 1; } /* switch */ i++; } /* while */ return 0; } /* * Checks if a given tag is an actual HTML tag * It checks in a case-insensitive manner. * * Given: a string that could be an HTML tag. 
* Return: 1 if that tag is an HTML tag, 0 otherwise */ int isRealHtmlTag (char *tag) { static char *html[] = { "a", "abbrev", "acronym", "address", "applet", "area", "b", "base", "basefont", "bdo", "big", "blockquote", "body", "br", "button", "caption", "center", "cite", "code", "col", "colgroup", "dd", "del", "dfn", "dir", "div", "dl", "dt", "em", "fieldset", "font", "form", "frame", "frameset", "h1", "h2", "h3", "h4", "h5", "h6", "head", "hr", "html", "i", "iframe", "img", "input", "ins", "isindex", "kbd", "label", "legend", "li", "link", "map", "menu", "meta", "noframes", "noscript", "object", "ol", "optgroup", "option", "p", "param", "pre", "q", "s", "samp", "script", "select", "small", "span", "strike", "strong", "style", "sub", "table", "tbody", "td", "textarea", "tfoot", "th", "thead", "title", "tr", "tt", "u", "ul", "var", 0 }; int ii, jj, len; int result; static char *ptr = 0; static int taglen = 256; result = 0; /* * keep around a static buffer for the tag -- initialize it the * first time thru */ len = strlen(tag) + 1; len = (len > taglen) ? 
len : taglen; if (!ptr) { ptr = (char *)malloc(len); if (!ptr) { fprintf (stderr, "Cannot malloc in tag test (%d bytes)\n", len); exit(1); } memset(ptr, 0, len); taglen = len; } /* * if the tag's longer than the space we've already set aside, * make a bigger space */ if (len > taglen) { free(ptr); while (taglen < len) taglen *= 2; ptr = (char *)malloc(taglen); if (!ptr) { fprintf (stderr, "Cannot malloc in tag test (%d bytes)\n", len); exit(1); } } memset(ptr, 0, taglen); /* * copy the useful parts of the tag into our buffer */ jj = 0; for (ii = 0; ii < len; ii++) { if (tag[ii] == '<') continue; if (tag[ii] == '/' && ii == 1) continue; if (tag[ii] == ' ' || tag[ii] == '>' || tag[ii] == '\n') break; ptr[jj++] = tag[ii]; } /* * see if the tag is an actual html tag */ for (ii = 0; html[ii] != 0; ii++) if (strcasecmp(ptr, html[ii]) == 0) { result = 1; break; } return result; } unhtml-2.3.9/ops.h0000600000232200023220000000205710043073145014165 0ustar pbuilderpbuilder/* * File: ops.h * Program: unhtml * Written by: Kevin Swan, 013639s@dragon.acadiau.ca) * Completed: February 3, 1998 * Version: 2.3 */ #include #include #ifndef OPS_H #define OPS_H #ifndef MAX_TAG_SIZE #define MAX_TAG_SIZE 256 #endif /* * Checks if a given tag is an HTML script opening tag, . * It checks in a case-insensitive manner. * * Given: a string that is an HTML tag. * Return: 1 if that tag is a script closing tag, 0 otherwise. */ int isScriptClosingTag (char *tag); /* * Checks if a given tag is an actual HTML tag * It checks in a case-insensitive manner. * * Given: a string that could be an HTML tag. 
* Return: 1 if that tag is an HTML tag, 0 otherwise */ int isRealHtmlTag (char *tag); #endif unhtml-2.3.9/unhtml.10000600000232200023220000000250106465767436014630 0ustar pbuilderpbuilder.\" @(#)unhtml.1 1.16 90/02/15 SMI; from Linux 1.2.8 and up .TH UNHTML 1 "3 February 1998" .SH NAME unhtml \- strip the HTML formatting from a document or the standard input stream and display it to the standard output .SH SYNOPSIS .B unhtml .B \-version | [ .I filename ] .LP .SH DESCRIPTION .LP Parses text read from the standard input, or a file if a file name is supplied, and removes any HTML formatting it finds. Prints the resulting cleansed text to the standard output for easy redirection. The version included with this man page has been improved to handle comments and scripts. .LP .SH OPTIONS .TP .B \-version Version. .B unhtml will display its version and exit. .SH EXAMPLES .LP This example simply scans a file called "index.html" and prints the file to the standard output with the HTML formatting removed. The standard output is redirected to a file called "index.txt" which, after running, will contain the plain text of the .html file. .LP .RS .ft B .nf example% unhtml index.html > index.txt .fi .ft R .LP .SH BUGS Currently, if the output is redirected to a file of the same name as the input file, the result will be an empty file of the same name, but this is really an idiosyncracy of the redirect operator, and cannot be corrected in the program. .SH DEVELOPMENT This document is Copyright (C) 1998 by Kevin Swan. unhtml-2.3.9/unhtml.c0000600000232200023220000001515110043105221014654 0ustar pbuilderpbuilder/* * File: unhtml.c * Program: unhtml * Written by: Kevin Swan, 013639s@dragon.acadiau.ca * Completed: February 3, 1998 * Version: 2.3 * * Usage: * unhtml -version | [ filename ] * * Specification: * unhtml is a program which removes HTML formatting from a stream, writing * to output to a stream. If it is invoked with a file name, it attempts * to read from the named file. 
If it is invoked without a filename, it
 * reads from stdin.  It writes all output to stdout.
 */

/* Standard headers required by the calls below. */
#include <ctype.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "ops.h"
#include "esc.h"

char *VERSION = "unhtml Version 2.3 Copyright (C) 1998 by Kevin Swan";
char *USAGE = "unhtml -version | [ filename ]";

int main(int argc, char *argv[]) {

  /*
   * Variables local to the program.
   */
  FILE *inStream;
  char *tag;
  char *tmp;
  int tag_size;
  int ch;        /* int, not char, so that EOF is distinct from every byte */
  int i, j;

  /*
   * Do argument checking.  If more than one command line argument was
   * given, print a usage error and stop.
   */
  if (argc > 2) {
    fprintf (stderr, "Usage: %s\n", USAGE);
    return 1;
  }

  /*
   * If the user simply requested the version of the program, print that
   * information and terminate.
   */
  if (argc == 2)
    if (strcmp (argv[1], "-version") == 0) {
      printf ("%s\n", VERSION);
      return 0;
    }

  /*
   * Allocate tag space, now that we know we need to do some actual work.
   */
  tag_size = MAX_TAG_SIZE;
  tag = (char *)malloc(tag_size);
  if (!tag) {
    fprintf (stderr, "Cannot malloc tag space (%d bytes).\n", tag_size);
    return 1;
  }

  /*
   * If an input file was specified, try to open a read stream on it.
   */
  if (argc == 2) {
    if ((inStream = fopen (argv[1], "r")) == NULL) {
      fprintf (stderr, "Error opening file [%s] for reading.\n", argv[1]);
      /* Never pass a non-literal string as the format argument. */
      fprintf (stderr, "Usage: %s\n", USAGE);
      return 1;
    }
  } else

    /*
     * Otherwise, just use the standard input stream.
     */
    inStream = stdin;

  /*
   * Read tokens from the stream until we hit an opener for an HTML tag.
   */
  while (1) {

    ch = fgetc (inStream);

    /*
     * If we hit the end of the file, we're done.
     */
    if (ch == EOF)
      break;

    /*
     * If the character is not a tag opener, just print it.
     */
    if (ch != '<') {
      m_putchar (ch);
      continue;
    }

    /*
     * If we get this far, we've hit an HTML tag.  Read it into the
     * variable tag.
     */
    memset(tag, 0, tag_size);
    i = 0;
    while (ch != EOF) {
      tag[i] = ch;

      /* A '<' not followed by '/' or a letter cannot start a tag. */
      if (i == 1 && ch != '/' && !isalpha(ch)) {
        m_putchar(tag[0]);
        m_putchar(ch);
        break;
      }

      i++;
      if (ch == '>') {

        /*
         * If it's really an html tag, then toss it.  Otherwise, it could
         * have been just a '<' sign in the text.
         */
        if (!isRealHtmlTag(tag)) {
#ifdef DEBUG
          fprintf(stderr, "not: %s\n", tag);
#endif
          for (j = 0; j < i; j++)
            m_putchar(tag[j]);
        }

        /*
         * The buffer keeps its grown size here: releasing it at this point
         * would discard the tag before it is terminated and checked below.
         */
        break;
      }

      /*
       * Grow the buffer while there is still room for the next character
       * and the terminating '\0'.
       */
      if (i >= (tag_size - 1)) {
        tag_size *= 2;
        tmp = (char *)malloc(tag_size);
        if (!tmp) {
          fprintf (stderr, "Cannot malloc tag space (%d bytes).\n", tag_size);
          return 1;
        }
        memset(tmp, 0, tag_size);
        memcpy(tmp, tag, i);
        free(tag);
        tag = tmp;
      }

      ch = fgetc (inStream);
    }
    tag[i] = '\0';

#ifdef DEBUG
    fprintf (stderr, "Read in the tag \"%s\"\n", tag);
#endif

    /*
     * If it was a script opener, it is a special case.  We may find
     * '>' characters inside the <SCRIPT> ... </SCRIPT> pair that are not
     * associated with a tag.  In addition, comment delimiters are
     * found inside the script tag pairs.  So, if we get a script
     * tag, skip ahead to the closing script tag.
     */
    if (isScriptOpeningTag (tag)) {

#ifdef DEBUG
      fprintf (stderr, "\"%s\" is a script opener.\n", tag);
#endif

      /*
       * This loop is necessary to ensure that we don't swallow up the
       * closing tag while filling the buffer if we happen to
       * hit a '<' character in some comparison in the scripting language.
       */
      while (1) {

#ifdef DEBUG
        fprintf (stderr, "1. Read till we hit a '<'.\n");
#endif

        /*
         * Read until we hit a '<'.
         */
        ch = fgetc (inStream);
        while (ch != EOF)
          if (ch == '<')
            break;
          else
            ch = fgetc (inStream);

        if (ch == EOF) {
          ungetc (ch, inStream);
          break;
        }

#ifdef DEBUG
        fprintf (stderr, "2. Read till we hit a '>' or a '<', filling the buffer.\n");
#endif

        /*
         * Hit a '<'.  Read till we hit a '>' or a '<', filling the buffer.
         */
        i = 1;
        memset(tag, 0, tag_size);
        tag[0] = '<';
        ch = fgetc (inStream);
        while (ch != EOF) {
          tag[i] = ch;
          i++;

          /*
           * Grow the buffer while there is still room for the next
           * character and the terminating '\0'.
           */
          if (i >= (tag_size - 1)) {
            tag_size *= 2;
            tmp = (char *)malloc(tag_size);
            if (!tmp) {
              fprintf (stderr, "Cannot malloc tag space (%d bytes).\n", tag_size);
              return 1;
            }
            memset(tmp, 0, tag_size);
            memcpy(tmp, tag, i);
            free(tag);
            tag = tmp;
          }

          if ((ch == '>') || (ch == '<'))
            break;
          ch = fgetc (inStream);
        } /* while */

        if (ch == EOF) {
          ungetc (ch, inStream);
          break;
        }

        tag[i] = '\0';

#ifdef DEBUG
        fprintf (stderr, "Read tag: \"%s\"\n", tag);
#endif

        /*
         * A second '<' means the first one was not a tag opener; push it
         * back and start over from there.
         */
        if (ch == '<') {
          ungetc (ch, inStream);
          continue;
        } else if (isScriptClosingTag(tag))
          break;
      }

#ifdef DEBUG
      fprintf(stderr, "Got to the end of the script, found \"%s\"\n", tag);
#endif

      /*
       * At this point, we should be ready to read the first character
       * after the closing '>' of the tag.
       */
      continue;
    }

    /*
     * If it was a comment opener, skip to the comment closer.
     */

    /*
    ch = fgetc (inStream);
    if (ch == EOF)
      break;
    for (i = 0 ; i < 10 ; i++) {
      if ((ch = fgetc (inStream)) == EOF)
        break;
      if (ch == '>')
        break;
      tag[i] = ch;
    }
    if (ch == EOF)
      break;
    */

  }

  m_putchar(EOF);  /* for the rare case in which chars remain in bff */

  /*
   * Try to peacefully close the stream, if it is not stdin.
*/ if (argc == 2) if (fclose(inStream)) fprintf (stderr, "Error %d closing file.\n", errno); return 0; } unhtml-2.3.9/tests/0000755000232200023220000000000010043105311014350 5ustar pbuilderpbuilderunhtml-2.3.9/tests/tmp10000644000232200023220000000002310043105311015147 0ustar pbuilderpbuilder something here unhtml-2.3.9/tests/Makefile0000644000232200023220000000203510043072761016024 0ustar pbuilderpbuilder# # Makefile for unhtml test suite # CC = gcc MAKE = make RM = rm -f CFLAGS = -Wall # CFLAGS = -Wall -DDEBUG all: check check: clean test1 test2 test3 test4 test5 @( val=`diff -q expected.results results` ; \ if [ "$$val" ] ; \ then \ echo "one or more tests failed" ; \ exit 1 ; \ else \ echo "all tests passed" ; \ fi ; ) test1: @echo " running test1 ..." @../unhtml test1.html > tmp1 @( diff -q tmp1 test1.out && echo test1 >> results ) test2: @echo " running test2 ..." @../unhtml test2.html > tmp2 @( diff -q tmp2 test2.out && echo test2 >> results ) test3: @echo " running test3 ..." @../unhtml test3.html > tmp3 @( diff -q tmp3 test3.out && echo test3 >> results ) test4: @echo " running test4 ..." @../unhtml test4.html > tmp4 @( diff -q tmp4 test4.out && echo test4 >> results ) test5: @echo " running test5 ..." @../unhtml test5.html > tmp5 @( diff -q tmp5 test5.out && echo test5 >> results ) clean: @$(RM) core *.o unhtml results @$(RM) tmp1 tmp2 tmp3 tmp4 tmp5 unhtml-2.3.9/tests/tmp20000644000232200023220000000226010043105311015155 0ustar pbuilderpbuilder Unhtml - Document Parser Unhtml Kevin Swan, 013639s@dragon.acadiau.ca Version 2.3 Last revised: February 3, 1998 DESCRIPTION unhtml is a program developed by me in April of 1996 to remove the HTML formatting from documents and print the output to the standard output stream. It treated any occurrences of a greater than or less than sign as HTML tags and removes them. In this version, it is more intelligent in the sense that it recognizes