bidiv/Makefile

PREFIX=/usr/local
BIN_DIR=$(PREFIX)/bin
MAN_PATH=$(PREFIX)/man
CC_OPT_FLAGS=-O2 -Wall
CFLAGS= $(CC_OPT_FLAGS) $(DEFS) `fribidi-config --cflags`
LDFLAGS=`fribidi-config --libs`

all: bidiv

bidiv: bidiv.o
	$(CC) -o bidiv bidiv.o $(LDFLAGS)

clean:
	rm -f bidiv.o *~

clobber: clean
	rm -f bidiv

.c.o:
	$(CC) -c $(CFLAGS) $<

install: all
	strip bidiv
	cp bidiv $(BIN_DIR)
	chmod 755 $(BIN_DIR)/bidiv
	cp bidiv.1 $(MAN_PATH)/man1/bidiv.1
	chmod 644 $(MAN_PATH)/man1/bidiv.1

uninstall:
	rm -f $(BIN_DIR)/bidiv $(MAN_PATH)/man1/bidiv.1

bidiv/README

TODO: write a decent README...

In one of the Ivrix meetings we had, one of the problems raised was that we
don't have (or have, but don't use) a standard method of representing Hebrew
text files. Such Hebrew text files include:

  1. Hebrew E-Mail
  2. Simple Hebrew texts (e.g., a README in Hebrew)
  3. Manual pages in Hebrew.

This is why I wrote a very simple program based on Fribidi, that acts as a
filter or viewer (with semantics similar to the 'cat' command) for logical-
order ISO-8859-8 or UTF8 files: The logical text is converted to visual
ISO-8859-8 (or UTF8) assuming a fixed number of characters per line
(automatically determined or given with the -w parameter).

I called this program 'bidiv', for "bidi viewer". You can get it from
ftp://ftp.ivrix.org.il/pub/ivrix/src/cmdline/bidiv-1.0.tgz

It's a tiny program indeed: the executable is only 5k (when using the
fribidi shared library).

By the way, I also put a copy of fribidi on the ivrix ftp:
ftp://ftp.ivrix.org.il/pub/ivrix/src/lib/fribidi-0.1.15.tar.gz

Example usages of bidiv:

  1. bidiv README | less

  2. man something | bidiv | less
     (or groff -man -Tlatin1 something.1 |sed 's/.^H\(.\)/\1/g' |../bidiv -w 65)

  3. set "bidiv" as a filter for your mail program (mutt, pine, etc.) for
     viewing mail with the ISO 8859-8 character set.
----------------------------------------------------------------------------
Copyright (c) 2001-2006 Nadav Har'El
License: GNU General Public License version 2

bidiv/WHATSNEW

Release 1.5: Jan 7, 2006
  * Fixed buffer size error that caused buffer overflows (and crashes) in
    certain cases. Thanks to Dan Kenigsberg and to Shachar Raindel for
    reporting the bug and suggesting patches.
  * Update bidiv(1) manual to reflect bidiv's UTF8 capabilities (that
    already existed in release 1.4).
  * Clarify GPL license in bidiv.c and README, and include GPL's "COPYING"
    license file in the bidiv distribution.

Release 1.4: Oct 21, 2002
  * Fix auto direction so that if the first line of a paragraph has no
    direction, the direction of the previous paragraph is used temporarily.
    See tests/1. (Nadav Har'El)
  * utf8 is output (instead of iso8859-8) when LC_CTYPE is something.UTF-8.
    Note that if both the input and output are utf8, then the text may
    contain things other than Hebrew, with no ill effects.
    Currently there is no option to force utf8 (or iso8859-8) output other
    than changing the locale.
  * When the first character was non-ASCII and didn't look like a Hebrew
    character, we assumed it and the next character compose utf8. But when
    this doesn't make sense (e.g., the second character is ascii, e.g., a
    newline) we ruined two characters. Better ruin only one by our
    transformation :)

Release 1.3: Apr 17, 2001
  * Small changes to manual page. (Nadav Har'El)
  * Fix serious bug in utf8 handling. Now utf8 is recognized on a
    byte-by-byte basis. (Nadav Har'El)

Release 1.2: Apr 14, 2001
  * Improved manual page (Nadav Har'El)
  * Experimental code to guess whether text is iso8859-8 or utf8. Actually
    does it on a line-per-line basis.
    (Nadav Har'El)

Release 1.1: Feb 20, 2001
  * Fixed compilation errors on some systems (Nadav Har'El)
  * Better makefile (Tzafrir Cohen)
  * Rudimentary README (Tzafrir Cohen, Nadav Har'El)
  * Fixed exit status (Nadav Har'El)
  * Rudimentary manual page (Nadav Har'El)
  * RPM "spec" file (Tzafrir Cohen)

Release 1.0:
  * Initial release (Nadav Har'El)

bidiv/bidiv.1

'\" t
.\" Copyright (c) 2001, Nadav Har'El
.TH bidiv 1 "7 Jan 2006" "Bidiv" "Ivrix"
.SH NAME
bidiv \- bidirectional text filter
.SH SYNOPSIS
.B bidiv
[
.B \-plj
] [
.BI \-w\ width
]
.BI [\| file\|.\|.\|. \|]
.SH DESCRIPTION
.B bidiv
is a filter, or viewer, for bidirectional text stored in logical order.
It converts such text into visual-order text which can be viewed on
terminals that do not handle bidirectionality. The output visual-order text
is formatted assuming a fixed number of characters per line (automatically
determined or given with the
.B -w
parameter).
.B bidiv
is oriented towards Hebrew, and assumes the input to be Hebrew and ASCII
text encoded in one of the two common logical-order encodings: ISO-8859-8-i
or UTF-8. Actually, bidiv guesses the encoding of its input on a
character-by-character basis, so the input may be a mix of ISO-8859-8-i and
Hebrew UTF-8.
.BR bidiv 's
output is visual-order text, in either the ISO-8859-8 or UTF-8 encoding,
depending on your locale setting.
.B bidiv
reads each
.I file\^
in sequence, converts it into visual order
and writes it on the standard output.
Thus:
.PP
.RS
.B "$ bidiv file"
.RE
.PP
prints
.B file
on your terminal (assuming it has the appropriate fonts, but no
bidirectionality support), and:
.PP
.RS
.B "$ bidiv file1 file2 | less"
.RE
.PP
concatenates
.B file1
and
.BR file2 ,
and shows the results using the pager
.BR less .
.PP
If no input file is given,
.B bidiv
reads from the standard input file.
For more ideas on how to use
.BR bidiv ,
see the
.B EXAMPLES
section below.
.SH OPTIONS
.TP
.B \-p
Paragraph-based direction (default): When formatting a bidirectional
output line,
.B bidiv
needs to be aware of that line's base direction. A line whose base
direction is RTL (right to left) gets right-justified and its first element
appears on the right. Otherwise, the line is left-justified and its first
element appears on the left.
The
.B \-p
option tells
.B bidiv
to choose a base direction per paragraph, where a paragraph is delimited by
an empty line. This is bidiv's default behavior, and usually gives the
expected results on most texts and emails.
The direction of the entire paragraph is chosen according to the first
strongly-directioned character (i.e., an alphabetic character) appearing in
the paragraph.
Currently, if the first output line of a paragraph has no directional
characters (e.g., a line of minus signs before an email signature, or a
line containing only numbers) that line is output with the same direction
as the previous paragraph, but it does not determine the direction of the
rest of the paragraph.
If the first line of the first paragraph does not have a direction, the RTL
direction is arbitrarily chosen.
.TP
.B \-l
Line-based direction: This option selects an alternative method of choosing
each output line's base direction. When this option is enabled, the base
direction of each output line is determined on its own (again, according to
the first character on the line with a strong direction).
This method may give wrong results in the case where a line starts with a
word of the opposite direction. This case is rare, but does happen under
random line-splitting circumstances, or when the text is defining words of
a foreign language.
.\"TODO: maybe add another option to choose direction based on _input_ line. Would
.\"that be useful at all? I doubt it.
.TP
.B \-j
Do not justify: By default, RTL lines are right-justified, i.e., they are
padded with spaces on the left when shorter than the required line width
(see the
.B \-w
option). The
.B \-j
option tells
.B bidiv
not to perform this justification, and to leave short lines unpadded.
.TP
.BI \-w\ width
.B bidiv
formats its output for lines of the given width. Lines are split when
longer than this width, and RTL lines are right-justified to fill that
width unless the
.B \-j
option is given.
When the
.B \-w
option is not given,
.B bidiv
uses the value of the
.B COLUMNS
variable, which is usually automatically defined by the user's shell.
When both the
.B \-w
option and the
.B COLUMNS
variable are missing, the default of 80 columns is used.
.SH OPERANDS
The following operand is supported:
.TP 8
.I file
A path name of an input file.
If no
.I file
is specified, the standard input is used.
.\"If
.\".I file
.\"is
.\".RB ` \|\-\| ',
.\".B bidiv
.\"will read from the standard input at that point in the sequence.
.\".B bidiv
.\"will not close and reopen standard input when
.\"it is referenced in this way, but will accept multiple occurrences of
.\".RB ` \|\-\| '
.\"as
.\".IR file .
.\"TODO: the part about the "-" is currently not true...
.SH EXAMPLES
.TP 3
1.
bidiv README | less
.TP 3
2.
man something | bidiv | less
(or groff -man -Tlatin1 something.1 |sed 's/.^H\\(.\\)/\\1/g' |../bidiv -w 65)
.TP 3
3.
set "bidiv" as a filter for your mail program (mutt, pine, etc.) for viewing
mail with the ISO 8859-8-i character set, and Hebrew UTF-8 mail.
.SH ENVIRONMENT
.B COLUMNS
see
.B -w
option.
.SH "EXIT STATUS"
The following exit values are returned:
.TP 4
.B 0
All input files were output successfully.
.TP
.B >0
An error occurred.
.SH "AUTHOR"
Written by Nadav Har'El, http://nadav.harel.org.il.
Please send bug reports and comments to nyh@math.technion.ac.il.
The latest version of this software can be found in
.B ftp://ftp.ivrix.org.il/pub/ivrix/src/cmdline
.SH "SEE ALSO"
.BR cat (1),
.BR fribidi (3)

bidiv/bidiv.c

/* Bidiv: Bidi View

   Converts a text file with bidirectional text (in iso8859-8 or UTF-8
   encoding) into visual 8bit text (iso8859-8) (or UTF-8, depending on
   locale) suitable for viewing on a constant-width-font text terminal
   (or piped to a program like "less").

   Usage: bidiv [options] [file]...
   (same semantics as cat(1))

   TODO: This should be a general utility which is useful not only to
   Hebrew: it should work for Arabic and other RL languages too.

   Copyright (c) 2001-2006 Nadav Har'El
   License: GNU General Public License version 2

   This is version 1.5. See WHATSNEW file for version history.
*/

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fribidi/fribidi.h>

/* In the future, this should be done with autoconf */
#define HAVE_LOCALE
#ifdef HAVE_LOCALE
#include <locale.h>
#include <langinfo.h>
#include <string.h>
#endif

char *progname;
int width=80;
/* auto_basedir=0 doesn't change the base direction (it remains the
   base_dir previously defined by a command line option, or automatically),
   =1 means change the base direction on every newline, and
   =2 means change it every paragraph (where a paragraph starts after a
   completely empty line).
*/
FriBidiCharType base_dir=FRIBIDI_TYPE_N, prev_base_dir=FRIBIDI_TYPE_LTR;
int auto_basedir=2;
int justify=1;
int out_utf8=0;

void
bidiv(FILE *fp)
{
	char *in, *out;
	FriBidiChar *unicode_in, *unicode_out;
	int len, i, c;
	int rtl_line;

	/* Note: rather than reading the entire line with gets, as we did
	   in the first version of this program, this program reads only
	   "width" character segments.
	   There's a problem though: if we can't decide the base direction
	   from the first width characters of a line (or paragraph), we
	   don't know what to do.
	   If this becomes a real problem, we should in that case read in
	   more characters (up to some limit) until we do know what to do */
	int newline=1, newpara=1;

	in=(char *)malloc(width+1);
	out=(char *)malloc(width*7+1); /* 7 is the maximum number of
					  bytes in one UTF8 char? */
	/* We use (width+1) not just width, to leave place for a
	   double-width null character, which we might or might not need
	   to write (we don't add one to unicode_in, but it appears that
	   fribidi_log2vis puts one (a single null byte?) in unicode_out).
	   But it can't hurt to be extra safe and leave 2 extra bytes on
	   both.
	*/
	unicode_in=(FriBidiChar *)malloc(sizeof(FriBidiChar)*(width+1));
	unicode_out=(FriBidiChar *)malloc(sizeof(FriBidiChar)*(width+1));

	c=0;
	while(c!=EOF){
		if(auto_basedir==1){
			if(newline){
				if(base_dir!=FRIBIDI_TYPE_N)
					prev_base_dir=base_dir;
				base_dir=FRIBIDI_TYPE_N;
			}
		} else if(auto_basedir==2){
			if(newpara){
				if(base_dir!=FRIBIDI_TYPE_N)
					prev_base_dir=base_dir;
				base_dir=FRIBIDI_TYPE_N;
			}
		}
		/* Get the next segment of a line, up to "width" characters
		   or a newline */
		newpara=0;
		for(len=0; len0177 && c<0340) {
				/* UTF8 2 byte character */
				int c1;
				if((c1=getc(fp))==EOF)
					break;
				/* if c1 doesn't make sense as a second utf8
				   character, the whole thing isn't utf8. But
				   c wasn't 8859-8, or we would have noticed
				   that sooner :( So this surely won't work :(
				   Maybe we should assume it's iso8859_1? At
				   least this hack prevents us gobbling two
				   characters as one invalid unicode char.
				   TODO: in this case, maybe turn on a "not
				   hebrew" flag, and stop filtering if both
				   input and output are iso?
				*/
				if(c1<0x80||c1>0xbf){
					ungetc(c1, fp);
					unicode_in[len]=
					  fribidi_iso8859_8_to_unicode_c(c);
				} else
					unicode_in[len]=((c & 037) << 6) + (c1 & 077);
				newline=0;
#endif
			} else {
#ifdef TRY_UTF8
				/* 240-256 (0341-0377) - presumably 8859-8.
				   In the future we will have a language
				   option, which will control this (as well
				   as the output encoding).
				*/
				unicode_in[len]=
					fribidi_iso8859_8_to_unicode_c(c);
#else
				in[len]=c;
#endif
				newline=0;
			}
		}
		if(len==0){
			if(newpara){
				putchar('\n');
				continue;
			} else {
				/* this is a case where we had "width"
				   characters followed by a newline; In the
				   previous iteration we outputted these
				   characters, and a newline, and now we come
				   to this newline, and we don't need to
				   output it again! */
				continue;
			}
		}
#ifndef TRY_UTF8
		in[len]='\0';
		fribidi_iso8859_8_to_unicode(in, unicode_in);
#endif

		/* output the line */
		fribidi_log2vis(unicode_in, len, &base_dir,
				unicode_out, NULL, NULL, NULL);
		/* if base_dir is still FRIBIDI_TYPE_N (we haven't found any
		   strong-direction character in this paragraph (or something
		   else - depending on option) - we prefer the base direction
		   of the previous paragraph (prev_base_dir) to the arbitrary
		   FRIBIDI_TYPE_N
		*/
		if(base_dir==FRIBIDI_TYPE_N){
#if 0
			base_dir=prev_base_dir;
			fribidi_log2vis(unicode_in, len, &base_dir,
					unicode_out, NULL, NULL, NULL);
#else
			FriBidiCharType tmp_dir=prev_base_dir;
			fribidi_log2vis(unicode_in, len, &tmp_dir,
					unicode_out, NULL, NULL, NULL);
			rtl_line= (tmp_dir==FRIBIDI_TYPE_RTL);
#endif
		} else if(base_dir==FRIBIDI_TYPE_RTL)
			rtl_line=1;
		else
			rtl_line=0;

		if(out_utf8)
			fribidi_unicode_to_utf8(unicode_out, len, out);
		else
			fribidi_unicode_to_iso8859_8(unicode_out, len, out);
		/* if rtl_line (i.e., base_dir is RL), and we didn't fill the
		   entire width, we need to pad with spaces. Maybe in the
		   future this should be an option.
*/ if(justify && rtl_line && len0) width=i; } #ifdef HAVE_LOCALE setlocale(LC_CTYPE, ""); out_utf8= !strcmp(nl_langinfo(CODESET),"UTF-8"); #endif /* parse command line options */ while((c=getopt(argc, argv, "w:lpj"))!=-1){ switch(c){ case 'w': /* set line width*/ width=atoi(optarg); if(width<=0){ fprintf(stderr, "%s: width argument must be positive: %s.\n", progname, optarg); return 2; } break; case 'l': /* choose direction every line */ auto_basedir=1; break; case 'p': /* choose direction every paragraph */ auto_basedir=2; break; case 'j': /* do not justify (pad with spaces) RTL lines */ justify=0; break; case '?': case ':': fprintf(stderr, "usage: %s [-plj] [-w width] [file]...\n", progname); return 2; break; default: fprintf(stderr,"internal error...\n"); return 2; } } argv+=optind; argc-=optind; if (argc>0){ int i; for(i=0;i Copyright (C) This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA Also add information on how to contact you by electronic and paper mail. If the program is interactive, make it output a short notice like this when it starts in an interactive mode: Gnomovision version 69, Copyright (C) year name of author Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. This is free software, and you are welcome to redistribute it under certain conditions; type `show c' for details. 
The hypothetical commands `show w' and `show c' should show the appropriate
parts of the General Public License.  Of course, the commands you use may
be called something other than `show w' and `show c'; they could even be
mouse-clicks or menu items--whatever suits your program.

You should also get your employer (if you work as a programmer) or your
school, if any, to sign a "copyright disclaimer" for the program, if
necessary.  Here is a sample; alter the names:

  Yoyodyne, Inc., hereby disclaims all copyright interest in the program
  `Gnomovision' (which makes passes at compilers) written by James Hacker.

  <signature of Ty Coon>, 1 April 1989
  Ty Coon, President of Vice

This General Public License does not permit incorporating your program into
proprietary programs.  If your program is a subroutine library, you may
consider it more useful to permit linking proprietary applications with the
library.  If this is what you want to do, use the GNU Library General
Public License instead of this License.