unicode-2.8/000077500000000000000000000000001377312441100130145ustar00rootroot00000000000000
unicode-2.8/COPYING000066400000000000000000000000101377312441100140360ustar00rootroot00000000000000
GPL v3
unicode-2.8/MANIFEST.in000066400000000000000000000001341377312441100145500ustar00rootroot00000000000000
include README
include README-paracode
include COPYING
include unicode.1
include paracode.1
unicode-2.8/README000066400000000000000000000130541377312441100136770ustar00rootroot00000000000000
This file is in UTF-8 encoding.

To use the unicode utility, you need:

- python >=2.6 (the str format() method is needed), preferably a wide
  unicode build; however, python3 is recommended
- the python optparse library (part of the standard library since python 2.3)
- the UnicodeData.txt file (http://www.unicode.org/Public), which you should
  put into /usr/share/unicode/, ~/.unicode/ or the current working directory
  - apt-get install unicode-data   # Debian
  - dnf install unicode-ucd        # Fedora
- if you want to see Unicode block information, you also need the Blocks.txt
  file, which you should put into /usr/share/unicode/, ~/.unicode/ or the
  current working directory
- if you want to see UniHan properties, you also need the Unihan.txt file,
  which should be put into /usr/share/unicode/, ~/.unicode/ or the current
  working directory

Enter a regular expression, a hexadecimal number or some characters as an
argument. unicode will try to guess what you want to look up; see the
manpage if you want to force other behaviour (the manpage is also the best
documentation). In particular, -r forces searching for a regular expression
in the names of characters, and -s forces unicode to display information
about the given characters.

Here are just some examples:

$ unicode.py euro
U+20A0 EURO-CURRENCY SIGN
UTF-8: e2 82 a0  UTF-16BE: 20a0  Decimal: &#8352;
₠
Category: Sc (Symbol, Currency)
Bidi: ET (European Number Terminator)

U+20AC EURO SIGN
UTF-8: e2 82 ac  UTF-16BE: 20ac  Decimal: &#8364;
€
Category: Sc (Symbol, Currency)
Bidi: ET (European Number Terminator)

$ unicode.py 00c0
U+00C0 LATIN CAPITAL LETTER A WITH GRAVE
UTF-8: c3 80  UTF-16BE: 00c0  Decimal: &#192;
À (à)
Lowercase: U+00E0
Category: Lu (Letter, Uppercase)
Bidi: L (Left-to-Right)
Decomposition: 0041 0300

You can specify a range of characters as arguments; unicode will show these
characters in a nice tabular format, aligned to 256-codepoint boundaries.
Use two dots ".." to indicate the range, e.g.

unicode 0450..0520

will display the whole cyrillic, armenian and hebrew blocks (characters
from U+0400 to U+05FF)

unicode 0400..

will display just characters from U+0400 up to U+04FF

Use --fromcp to query codepoints from other encodings:

$ unicode --fromcp cp1250 -d 200
U+010C LATIN CAPITAL LETTER C WITH CARON
UTF-8: c4 8c  UTF-16BE: 010c  Decimal: &#268;
Č (Č)
Uppercase: U+010C
Category: Lu (Letter, Uppercase)
Bidi: L (Left-to-Right)
Decomposition: 0043 030C

Multibyte encodings are supported:

$ unicode --fromcp big5 -x aff3

and multi-char strings are supported, too:

$ unicode --fromcp utf-8 -x c599c3adc5a5

On format (--format='...'):

The format string tells unicode which information should be displayed.
There is one (and only one) escape character recognised, \n for a new line.
You can use standard python .format() syntax. The following variables are
recognized:

{black} {red} {green} {yellow} {blue} {magenta} {cyan} {white}
  -- ANSI colours (foreground)
{on_black} {on_red} ...
  -- ANSI colours (background)
{no_colour} {default} {bold} {underline} {blink} {reverse} {concealed}
  -- self-explanatory ANSI escape codes
{ordc}
  -- unicode codepoint of the character (integer)
{name}
  -- unicode name of the character
{utf8}
  -- utf8 representation of the character (hexadecimal)
{utf16be}
  -- utf16 representation of the character (hexadecimal)
{decimal}
  -- decimal representation of the character
{opt_additional}
  -- optional representation in an additional charset (-c); empty string if
     not specified
{pchar}
  -- the character itself
{opt_flipcase}
  -- upper- or lowercase opposite of the character, in parentheses; empty
     if the character is not cased
{opt_uppercase}{opt_lowercase}
  -- optional string describing the uppercase or lowercase variant of the
     character; empty if the character is not cased
{category} {category_desc}
  -- character category and its human readable description
{opt_numeric}{numeric_desc}
  -- the string `Numeric value:' and the numeric value of the character;
     both empty if the character has no numeric value
{opt_digit}{digit_desc}
  -- the string `Digit value:' and the digit value of the character; both
     empty if the character has no digit value
{opt_bidi}{bidi}{bidi_desc}
  -- the string `Bidi:', the bidi property and a human readable description
     of the bidi property; empty if the character has no bidi category
{mirrored_desc}
  -- the string 'Character is mirrored' if the character is mirrored, empty
     otherwise
{opt_combining}{combining_desc}
  -- the string `Combining: ', the combining class and a human readable
     description of the combining class; empty if the character is not
     combining
{opt_decomp}{decomp_desc}
  -- the string `Decomposition: ' and a hexadecimal sequence of
     decomposition characters; empty if the character has no decomposition
{opt_unicode_block}{opt_unicode_block_desc}
  -- the string `Unicode block:', the range of the unicode block and a
     description of said unicode block for the given character
{opt_eaw}{eaw_desc}
  -- the string `East Asian width:' and the human readable value of the
     East Asian width
unicode-2.8/README-paracode000066400000000000000000000022711377312441100154520ustar00rootroot00000000000000
Written by Radovan Garabík .
For new versions, look at
http://kassiopeia.juls.savba.sk/~garabik/software/unicode/

-------------------

paracode exploits the full power of the Unicode standard to convert the
text into a visually similar stream of glyphs, while using completely
different codepoints. It is an excellent didactic tool demonstrating the
principles and advanced use of the Unicode standard.

paracode is a command line tool working as a filter, reading standard input
in UTF-8 encoding and writing to standard output.

Use the optional -t switch to select which tables to use. The special name
'all' selects all the tables. Note that selecting the 'other',
'cyrillic_plus' and 'cherokee' tables (and 'all') makes use of rather
esoteric characters, and not all fonts contain them. The special table
'mirror' uses a quite different character substitution, is not selected
automatically with 'all' and does not work well with anything except plain
ascii alphabetical characters.
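The conversion itself is simple: the input is decomposed, individual
characters are looked up in the selected substitution tables, and the
result is recomposed. The following self-contained python sketch
illustrates the idea; the two-entry table here is only an assumed excerpt
for demonstration, the real tables live in the paracode script itself:

    #!/usr/bin/python3
    # minimal sketch of the paracode approach, not the real implementation
    import sys
    import unicodedata

    # tiny demonstration table (assumed excerpt; paracode ships larger ones)
    table = {
        'A': '\N{CYRILLIC CAPITAL LETTER A}',
        'o': '\N{CYRILLIC SMALL LETTER O}',
    }

    def convert(s):
        # NFKD decomposition, so that base letters are substituted
        # independently of their accents
        s = unicodedata.normalize('NFKD', s)
        s = ''.join(table.get(c, c) for c in s)
        # recompose the substituted text
        return unicodedata.normalize('NFKC', s)

    # read UTF-8 text on stdin, write the converted text to stdout
    for line in sys.stdin:
        sys.stdout.write(convert(line))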
Example:

paracode -t cyrillic+greek+cherokee
paracode -t cherokee <input >output
paracode -r -t mirror <input >output

Possible tables are: cyrillic cyrillic_plus greek other cherokee all
unicode-2.8/changelog000077700000000000000000000000001377312441100177552debian/changelogustar00rootroot00000000000000
unicode-2.8/debian/000077500000000000000000000000001377312441100142365ustar00rootroot00000000000000
unicode-2.8/debian/README.Debian000066400000000000000000000003621377312441100163000ustar00rootroot00000000000000
unicode for Debian
------------------

packaged as a native package, the source resides at
http://kassiopeia.juls.savba.sk/~garabik/software/unicode.html

 -- Radovan Garabík , Fri, 7 Feb 2003 15:09:19 +0100
unicode-2.8/debian/changelog000066400000000000000000000216221377312441100161130ustar00rootroot00000000000000
unicode (2.8-1) unstable; urgency=low

  * display ASCII table (either traditional or the EU–UK Trade and
    Cooperation Agreement version)
  * tidy up manpage (closes: #972047) (closes: #972063)
  * fix decoding paracode arguments (closes: #939196)

 -- Radovan Garabík  Wed, 30 Dec 2020 17:13:32 +0100

unicode (2.7-1) unstable; urgency=low

  * add East Asian width
  * hack to consider regular expressions ending with '$' (closes: #830996)
  * do not flush stdout (closes: #902018)
  * better upper/lowercase from internal python db (closes: #848098)
  * convert to quilt

 -- Radovan Garabík  Thu, 27 Dec 2018 18:17:29 +0100

unicode (2.6) unstable; urgency=low

  * fix crash when using Uxxxx (as opposed to U+xxxx) (closes: #836594)
  * improve message when there are too many characters to display
    (closes: #868490)
  * python3 by default (closes: #874672)

 -- Radovan Garabík  Mon, 09 Apr 2018 16:41:29 +0200

unicode (2.5) unstable; urgency=low

  * do not display control characters in tabular output
  * Look for Blocks.txt in ~/.unicode

 -- Radovan Garabík  Sun, 19 Mar 2017 09:52:06 +0100

unicode (2.4) unstable; urgency=medium

  * fix unihan properties listing
  * don't assume Blocks.txt is in ASCII, fixes crash using python2 and
    UnicodeData v9.0 (closes: #828020)

 -- Radovan Garabík  Fri, 24 Jun 2016 16:23:05 +0200

unicode (2.3) unstable; urgency=low

  * convert to setuptools

 -- Radovan Garabík  Thu, 02 Jun 2016 15:47:19 +0200

unicode (2.2) unstable; urgency=low

  * display unicode block (thanks to drastus for inspiration)

 -- Radovan Garabík  Sun, 24 Apr 2016 10:45:41 +0200

unicode (2.1) unstable; urgency=low

  * add octal character code
  * fix crash when displaying numeric and digit properties from internal
    python database (closes: #810927)
  * fix default unicodedata path(s)

 -- Radovan Garabík  Thu, 14 Jan 2016 17:24:58 +0100

unicode (2) unstable; urgency=low

  * new version, completely reworked formatting
  * full python3 support
  * drop old python2.5 compatibility
  * implement --brief (closes: #708318)
  * implement --format (closes: #605642)
  * update Debian packaging

 -- Radovan Garabík  Wed, 21 Oct 2015 20:50:31 +0200

unicode (1) unstable; urgency=low

  * added --wt to query wiktionary
  * fix (somewhat) tabular display of fullwidth characters; try
    unicode 4000..5000
  * this is the last version that tries to keep rigorously compatibility
    with older python versions (going even back to pre-2.3)

 -- Radovan Garabík  Sun, 22 Mar 2015 09:15:59 +0100

unicode (0.9.8) unstable; urgency=low

  * update bidi categories (closes: #759346)

 -- Radovan Garabík  Thu, 28 Aug 2014 10:55:53 +0200

unicode (0.9.7) unstable; urgency=low

  * add option to recognise binary input numerical codes
  * do not suggest console-data
  * change Suggest to Recommend for unicode-data (closes: #683852), both this
and above suggested by Tollef Fog Heen * do not throw an exception when run under an undefined locale * on error, exit with nonzero existatus * preliminary python3 support * mention -s and -r in the README (closes: #664277) * other minor tweaks and improvements -- Radovan Garabík Sat, 24 Nov 2012 11:18:06 +0200 unicode (0.9.6) unstable; urgency=low * add option to recognise octal input numerical codes * add option to convert input numerical codes from an arbitrary charset * don't suggest perl-modules anymore (closes: #651479), thanks to mike castleman * clarify searching for hexadecimal codepoints in the manpage (closes: #643284) * better error messages if the codepoint exceeds sys.maxunicode -- Radovan Garabík Sun, 29 Jul 2012 13:46:18 +0200 unicode (0.9.5) unstable; urgency=low * do not raise an exception on empty string argument (closes: #601503), thanks to Etienne Millon for reporting the bug -- Radovan Garabík Sun, 21 Nov 2010 14:50:29 +0100 unicode (0.9.4) unstable; urgency=low * recognise split unihan files (closes: #551789) -- Radovan Garabík Sun, 07 Feb 2010 18:36:29 +0100 unicode (0.9.3) unstable; urgency=low * run pylint & pychecker – fix some previously unnoticed bugs -- Radovan Garabík Mon, 04 May 2009 22:40:51 +0200 unicode (0.9.2) unstable; urgency=low * giving "latin alpha" as an argument will now search for all the character names containing the "latin.*alpha" regular expression, not _either_ "latin" or "alpha" strings (closes: #439146), idea from martin f. krafft. * added forgotten README-paracode to the docfiles -- Radovan Garabík Thu, 30 Oct 2008 18:58:48 +0100 unicode (0.9.1) unstable; urgency=low * add package URL to debian/copyright and debian/README.Debian (closes: #495555) -- Radovan Garabík Sat, 23 Aug 2008 10:28:02 +0200 unicode (0.9) unstable; urgency=low * include paracode utility * clarify GPL version (v3) -- Radovan Garabík Wed, 19 Sep 2007 19:01:55 +0100 unicode (0.8) unstable; urgency=low * fix traceback when letter has no uppercase or lowercase forms -- Radovan Garabík Sun, 1 Oct 2006 21:42:33 +0200 unicode (0.7) unstable; urgency=low * updated to use unicode-data (closes: #386853) * data files can be bzip2'ed now * use data from unicode data files, not from python unicodedata module (the latter tends to be obsolete) -- Radovan Garabík Sat, 16 Sep 2006 21:44:34 +0200 unicode (0.6) unstable; urgency=low * fix stupid undeclared options bug (thanks to Tim Hatch) * remove absolute path from z?grep, rely on OS's default PATH to execute the command(s) * add default path to UnicodeData.txt for MacOSX systems -- Radovan Garabík Wed, 4 Jan 2006 19:57:54 +0100 unicode (0.5) unstable; urgency=low * work around browser invocations that cannot handle UTF-8 in URL's -- Radovan Garabík Sun, 1 Jan 2006 00:59:60 +0100 unicode (0.4.9) unstable; urgency=low * better directional overriding for RTL characters * query wikipedia with -w switch * better heuristics guessing argument type -- Radovan Garabík Sun, 11 Sep 2005 18:30:59 +0200 unicode (0.4.8) unstable; urgency=low * catch an exception if locale.nl_langinfo is not present (thanks to Michael Weir) * default to no colour if the system in MS Windows * put back accidentally disabled left-to-right mark - as a result, tabular display of arabic, hebrew and other RTL scripts is much better (the bug manifested itself only on powerful i18n terminals, such as mlterm) -- Radovan Garabík Fri, 26 Aug 2005 14:25:58 +0200 unicode (0.4.7) unstable; urgency=low * some UniHan support (closes: #187214) * --color as a synonum for --colour 
added (closes: #273503) -- Radovan Garabík Thu, 4 Aug 2005 16:36:07 +0200 unicode (0.4.6) unstable; urgency=low * change charset guessing (closes: #241889), thanks to Євгeнiй Meщepяĸoв (Eugeniy Meshcheryakov) for the patch * closes: #229857 - it has been closed together with 215267 -- Radovan Garabík Tue, 20 Apr 2004 15:39:34 +0200 unicode (0.4.5) unstable; urgency=low * catch exception if input sequence is invalid in given encoding (closes: #188438) * automatically find and symlink UnicodeData.txt from perl, if installed (thanks to LarstiQ for the patch) (closes: #215267) * change architecture to 'all' (closes: #215264) -- Radovan Garabík Wed, 21 Jan 2004 10:30:38 +0100 unicode (0.4) unstable; urgency=low * added option to choose colour output (closes: #187215) -- Radovan Garabík Wed, 9 Apr 2003 16:37:39 +0200 unicode (0.3.1) unstable; urgency=low * added python to Build-depends (closes: #183662) * properly quote hyphens in manpage (closes: #186151) * do not use UTF-8 in manpage (closes: #186193) * added versioned dependency for python2.3 (closes: #186444) -- Radovan Garabík Mon, 24 Mar 2003 14:39:31 +0100 unicode (0.3) unstable; urgency=low * Initial Release. -- Radovan Garabík Fri, 7 Feb 2003 15:09:19 +0100 unicode-2.8/debian/compat000066400000000000000000000000021377312441100154340ustar00rootroot000000000000009 unicode-2.8/debian/control000066400000000000000000000007761377312441100156530ustar00rootroot00000000000000Source: unicode Section: utils Priority: optional Maintainer: Radovan Garabík Build-Depends: debhelper (>= 4), dh-python, python3 Standards-Version: 4.3.0 Package: unicode Architecture: all Depends: ${misc:Depends}, ${python3:Depends} Suggests: bzip2 Recommends: unicode-data Description: display unicode character properties unicode is a simple command line utility that displays properties for a given unicode character, or searches unicode database for a given name. unicode-2.8/debian/copyright000066400000000000000000000006331377312441100161730ustar00rootroot00000000000000This program was written by Radovan Garabík on Fri, 7 Feb 2003 15:09:19 +0100, and packaged for Debian as a native package. The sources and package can be downloaded from: http://kassiopeia.juls.savba.sk/~garabik/software/unicode/ Copyright: © 2003-2016 Radovan Garabík released under GPL v3, see /usr/share/common-licenses/GPL unicode-2.8/debian/dirs000066400000000000000000000000101377312441100151110ustar00rootroot00000000000000usr/bin unicode-2.8/debian/docs000066400000000000000000000000301377312441100151020ustar00rootroot00000000000000README README-paracode unicode-2.8/debian/rules000077500000000000000000000033451377312441100153230ustar00rootroot00000000000000#!/usr/bin/make -f # Sample debian/rules that uses debhelper. # GNU copyright 1997 to 1999 by Joey Hess. # Uncomment this to turn on verbose mode. #export DH_VERBOSE=1 # This is the debhelper compatibility version to use. #export DH_COMPAT=4 configure: configure-stamp configure-stamp: dh_testdir # Add here commands to configure the package. touch configure-stamp build: build-arch build-indep build-arch: build-stamp build-indep: build-stamp build-stamp: configure-stamp dh_testdir # Add here commands to compile the package. #$(MAKE) #/usr/bin/docbook-to-man debian/unicode.sgml > unicode.1 touch build-stamp clean: dh_testdir dh_testroot rm -f build-stamp configure-stamp # Add here commands to clean up after the build process. 
#-$(MAKE) clean dh_clean install: build dh_testdir dh_testroot dh_prep dh_installdirs # Add here commands to install the package into debian/unicode. #$(MAKE) install DESTDIR=$(CURDIR)/debian/unicode cp unicode paracode $(CURDIR)/debian/unicode/usr/bin # Build architecture-dependent files here. #binary-arch: build install # We have nothing to do by default. # Build architecture-independent files here. binary-indep: build install dh_testdir dh_testroot # dh_installdebconf dh_installdocs # dh_installexamples # dh_installmenu # dh_installlogrotate # dh_installemacsen # dh_installpam # dh_installmime # dh_installinit # dh_installcron dh_installman unicode.1 paracode.1 # dh_installinfo # dh_undocumented dh_installchangelogs # dh_link dh_strip dh_compress dh_fixperms # dh_makeshlibs dh_installdeb # dh_perl # dh_shlibdeps dh_python3 dh_gencontrol dh_md5sums dh_builddeb binary: binary-indep binary-arch .PHONY: build clean binary-indep binary-arch binary install configure unicode-2.8/debian/source/000077500000000000000000000000001377312441100155365ustar00rootroot00000000000000unicode-2.8/debian/source/format000066400000000000000000000000151377312441100167450ustar00rootroot000000000000003.0 (quilt) unicode-2.8/paracode000077500000000000000000000157171377312441100145330ustar00rootroot00000000000000#!/usr/bin/python3 import sys, unicodedata from optparse import OptionParser # for python2 compatibility, decode from utf-8 if sys.version_info[0] < 3: decode = unicode encode = lambda x, enc: x.encode(enc) else: # for python3, the input is already unicode string decode = lambda x, enc: x encode = lambda x, enc: x table_cyrillic = { 'A' : u'\N{CYRILLIC CAPITAL LETTER A}', 'B' : u'\N{CYRILLIC CAPITAL LETTER VE}', 'C' : u'\N{CYRILLIC CAPITAL LETTER ES}', 'E' : u'\N{CYRILLIC CAPITAL LETTER IE}', 'H' : u'\N{CYRILLIC CAPITAL LETTER EN}', 'I' : u'\N{CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I}', 'J' : u'\N{CYRILLIC CAPITAL LETTER JE}', 'K' : u'\N{CYRILLIC CAPITAL LETTER KA}', 'M' : u'\N{CYRILLIC CAPITAL LETTER EM}', 'O' : u'\N{CYRILLIC CAPITAL LETTER O}', 'P' : u'\N{CYRILLIC CAPITAL LETTER ER}', 'S' : u'\N{CYRILLIC CAPITAL LETTER DZE}', 'T' : u'\N{CYRILLIC CAPITAL LETTER TE}', 'X' : u'\N{CYRILLIC CAPITAL LETTER HA}', 'Y' : u'\N{CYRILLIC CAPITAL LETTER U}', 'a' : u'\N{CYRILLIC SMALL LETTER A}', 'c' : u'\N{CYRILLIC SMALL LETTER ES}', 'e' : u'\N{CYRILLIC SMALL LETTER IE}', 'i' : u'\N{CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I}', 'j' : u'\N{CYRILLIC SMALL LETTER JE}', 'o' : u'\N{CYRILLIC SMALL LETTER O}', 'p' : u'\N{CYRILLIC SMALL LETTER ER}', 's' : u'\N{CYRILLIC SMALL LETTER DZE}', 'x' : u'\N{CYRILLIC SMALL LETTER HA}', 'y' : u'\N{CYRILLIC SMALL LETTER U}', } table_cyrillic_plus = { 'Y' : u'\N{CYRILLIC CAPITAL LETTER STRAIGHT U}', 'h' : u'\N{CYRILLIC SMALL LETTER SHHA}', } table_greek = { 'A' : u'\N{GREEK CAPITAL LETTER ALPHA}', 'B' : u'\N{GREEK CAPITAL LETTER BETA}', 'E' : u'\N{GREEK CAPITAL LETTER EPSILON}', 'H' : u'\N{GREEK CAPITAL LETTER ETA}', 'I' : u'\N{GREEK CAPITAL LETTER IOTA}', 'K' : u'\N{GREEK CAPITAL LETTER KAPPA}', 'M' : u'\N{GREEK CAPITAL LETTER MU}', 'N' : u'\N{GREEK CAPITAL LETTER NU}', 'O' : u'\N{GREEK CAPITAL LETTER OMICRON}', 'P' : u'\N{GREEK CAPITAL LETTER RHO}', 'T' : u'\N{GREEK CAPITAL LETTER TAU}', 'X' : u'\N{GREEK CAPITAL LETTER CHI}', 'Y' : u'\N{GREEK CAPITAL LETTER UPSILON}', 'Z' : u'\N{GREEK CAPITAL LETTER ZETA}', 'o' : u'\N{GREEK SMALL LETTER OMICRON}', } table_other = { '!' 
: u'\N{LATIN LETTER RETROFLEX CLICK}', 'O' : u'\N{ARMENIAN CAPITAL LETTER OH}', 'S' : u'\N{ARMENIAN CAPITAL LETTER TIWN}', 'o' : u'\N{ARMENIAN SMALL LETTER OH}', 'n' : u'\N{ARMENIAN SMALL LETTER VO}', } table_cherokee = { 'A' : u'\N{CHEROKEE LETTER GO}', 'B' : u'\N{CHEROKEE LETTER YV}', 'C' : u'\N{CHEROKEE LETTER TLI}', 'D' : u'\N{CHEROKEE LETTER A}', 'E' : u'\N{CHEROKEE LETTER GV}', 'G' : u'\N{CHEROKEE LETTER NAH}', 'H' : u'\N{CHEROKEE LETTER MI}', 'J' : u'\N{CHEROKEE LETTER GU}', 'K' : u'\N{CHEROKEE LETTER TSO}', 'L' : u'\N{CHEROKEE LETTER TLE}', 'M' : u'\N{CHEROKEE LETTER LU}', 'P' : u'\N{CHEROKEE LETTER TLV}', 'R' : u'\N{CHEROKEE LETTER SV}', 'S' : u'\N{CHEROKEE LETTER DU}', 'T' : u'\N{CHEROKEE LETTER I}', 'V' : u'\N{CHEROKEE LETTER DO}', 'W' : u'\N{CHEROKEE LETTER LA}', 'Y' : u'\N{CHEROKEE LETTER GI}', 'Z' : u'\N{CHEROKEE LETTER NO}', } table_mirror = { 'A' : u'\N{FOR ALL}', 'B' : u'\N{CANADIAN SYLLABICS CARRIER KHA}', 'C' : u'\N{LATIN CAPITAL LETTER OPEN O}', 'D' : u'\N{CANADIAN SYLLABICS CARRIER PA}', 'E' : u'\N{LATIN CAPITAL LETTER REVERSED E}', 'F' : u'\N{TURNED CAPITAL F}', 'G' : u'\N{TURNED SANS-SERIF CAPITAL G}', 'H' : u'H', 'I' : u'I', 'J' : u'\N{LATIN SMALL LETTER LONG S}', 'K' : u'\N{LATIN SMALL LETTER TURNED K}', # fixme 'L' : u'\N{TURNED SANS-SERIF CAPITAL L}', 'M' : u'W', 'N' : u'N', 'O' : u'O', 'P' : u'\N{CYRILLIC CAPITAL LETTER KOMI DE}', 'R' : u'\N{CANADIAN SYLLABICS TLHO}', 'S' : u'S', 'T' : u'\N{UP TACK}', 'U' : u'\N{ARMENIAN CAPITAL LETTER VO}', 'V' : u'\N{N-ARY LOGICAL AND}', 'W' : u'M', 'X' : u'X', 'Y' : u'\N{TURNED SANS-SERIF CAPITAL Y}', 'Z' : u'Z', 'a' : u'\N{LATIN SMALL LETTER TURNED A}', 'b' : u'q', 'c' : u'\N{LATIN SMALL LETTER OPEN O}', 'd' : u'p', 'e' : u'\N{LATIN SMALL LETTER SCHWA}', 'f' : u'\N{LATIN SMALL LETTER DOTLESS J WITH STROKE}', 'g' : u'\N{LATIN SMALL LETTER B WITH HOOK}', 'h' : u'\N{LATIN SMALL LETTER TURNED H}', 'i' : u'\N{LATIN SMALL LETTER DOTLESS I}' + u'\N{COMBINING DOT BELOW}', 'j' : u'\N{LATIN SMALL LETTER LONG S}' + u'\N{COMBINING DOT BELOW}', 'k' : u'\N{LATIN SMALL LETTER TURNED K}', 'l' : u'l', 'm' : u'\N{LATIN SMALL LETTER TURNED M}', 'n' : u'u', 'o' : u'o', 'p' : u'd', 'q' : u'b', 'r' : u'\N{LATIN SMALL LETTER TURNED R}', 's' : u's', 't' : u'\N{LATIN SMALL LETTER TURNED T}', 'u' : u'n', 'v' : u'\N{LATIN SMALL LETTER TURNED V}', 'w' : u'\N{LATIN SMALL LETTER TURNED W}', 'x' : u'x', 'y' : u'\N{LATIN SMALL LETTER TURNED Y}', 'z' : u'z', '0' : '0', '1' : u'I', '2' : u'\N{INVERTED QUESTION MARK}\N{COMBINING MACRON}', '3' : u'\N{LATIN CAPITAL LETTER OPEN E}', '4' : u'\N{LATIN SMALL LETTER LZ DIGRAPH}', '6' : '9', '7' : u'\N{LATIN CAPITAL LETTER L WITH STROKE}', '8' : '8', '9' : '6', ',' : "'", "'" : ',', '.' : u'\N{DOT ABOVE}', '?' : u'\N{INVERTED QUESTION MARK}', '!' : u'\N{INVERTED EXCLAMATION MARK}', } tables_names = ['cyrillic', 'cyrillic_plus', 'greek', 'other', 'cherokee'] table_default = table_cyrillic table_default.update(table_greek) table_all={} for t in tables_names: table_all.update(globals()['table_'+t]) def main(): parser = OptionParser(usage="usage: %prog [options]") parser.add_option("-t", "--tables", action="store", default='default', dest="tables", type="string", help="""list of tables to use, separated by a plus sign. Possible tables are: """+'+'.join(tables_names)+""" and a special name 'all' to specify all these tables joined together. 
There is another table, 'mirror', that is not selected in 'all'.""")
    parser.add_option("-r", "--reverse",
          action="count", dest="reverse", default=0,
          help="Reverse the text after conversion. Best used with the 'mirror' table.")
    (options, args) = parser.parse_args()

    if args:
        to_convert = decode(' '.join(args), 'utf-8')
    else:
        to_convert = None

    tables = options.tables.split('+')
    tables = ['table_'+x for x in tables]
    tables = [globals()[x] for x in tables]
    table = {}
    for t in tables:
        table.update(t)

    def reverse_string(s):
        l = list(s)
        l.reverse()
        r = ''.join(l)
        return r

    def do_convert(s, reverse=0):
        if reverse:
            s = reverse_string(s)
        l = unicodedata.normalize('NFKD', s)
        out = []
        for c in l:
            out.append(table.get(c, c))
        out = ''.join(out)
        out = unicodedata.normalize('NFKC', out)
        return out

    if not to_convert:
        if options.reverse:
            lines = sys.stdin.readlines()
            lines.reverse()
        else:
            lines = sys.stdin
        for line in lines:
            l = decode(line, 'utf-8')
            out = do_convert(l, options.reverse)
            sys.stdout.write(encode(out, 'utf-8'))
    else:
        out = do_convert(to_convert, options.reverse)
        sys.stdout.write(encode(out, 'utf-8'))
        sys.stdout.write('\n')

if __name__ == '__main__':
    main()
unicode-2.8/paracode.1000066400000000000000000000030771377312441100146630ustar00rootroot00000000000000
.\" Hey, EMACS: -*- nroff -*-
.TH PARACODE 1 "2005-04-16"
.SH NAME
paracode \- command line Unicode conversion tool
.SH SYNOPSIS
.B paracode
.RB [ \-t
.IR tables ]
string
.SH DESCRIPTION
This manual page documents the
.B paracode
command.
.PP
\fBparacode\fP exploits the full power of the Unicode standard to convert
the text into a visually similar stream of glyphs, while using completely
different codepoints. It is an excellent didactic tool demonstrating the
principles and advanced use of the Unicode standard.
.PP
\fBparacode\fP is a command line tool working as a filter, reading standard
input in UTF-8 encoding and writing to standard output.
.
.SH OPTIONS
.TP
.BI \-t tables
.BI \-\-tables tables
Use the given list of conversion tables, separated by a plus sign. The
special name 'all' selects all the tables.
Note that selecting the 'other', 'cyrillic_plus' and 'cherokee' tables (and
'all') makes use of rather esoteric characters, and not all fonts contain
them.
The special table 'mirror' uses a quite different character substitution,
is not selected automatically with 'all' and does not work well with
anything except plain ascii alphabetical characters.

Example:

paracode \-t cyrillic+greek+cherokee

paracode \-t cherokee <input >output

paracode \-r \-t mirror <input >output

Possible tables are: cyrillic cyrillic_plus greek other cherokee all
.
.TP
.B \-r
Display text in reverse order after conversion, best used together with
\-t mirror.
.
.SH SEE ALSO
.BR iconv (1)
.
.SH AUTHOR
Radovan Garab\('ik
unicode-2.8/setup.cfg000066400000000000000000000000351377312441100146330ustar00rootroot00000000000000
[bdist_wheel]
universal = 1
unicode-2.8/setup.py000066400000000000000000000026021377312441100145260ustar00rootroot00000000000000
import io
import os
from setuptools import setup

os.chdir(os.path.abspath(os.path.dirname(__file__)))

setup(name='unicode',
      version='2.8',
      scripts=['unicode', 'paracode'],
#      entry_points={'console_scripts': [
#          'unicode = unicode:main',
#          'paracode = paracode:main']},
      description="Display unicode character properties",
      long_description="""
Display unicode character properties:
Enter a regular expression, a hexadecimal number or some characters as an
argument. unicode will try to guess what you want to look up.
Use a four-digit hexadecimal number followed by two dots to display the
given unicode block in a nice tabular format.
""",
      author="Radovan Garabik",
      author_email='radovan.garabik@kassiopeia.juls.savba.sk',
      url='http://kassiopeia.juls.savba.sk/~garabik/software/unicode.html',
      license='GNU GPL v3',
      keywords=['unicode', 'character properties', 'encoding'],
      classifiers=[
          'Development Status :: 5 - Production/Stable',
          'Environment :: Console',
          'Intended Audience :: Developers',
          'License :: OSI Approved :: GNU General Public License v3 (GPLv3)',
          'Programming Language :: Python',
          'Programming Language :: Python :: 2',
          'Programming Language :: Python :: 3',
          'Topic :: Text Editors :: Text Processing',
          'Topic :: Utilities'])
unicode-2.8/unicode000077500000000000000000001037231377312441100143760ustar00rootroot00000000000000
#!/usr/bin/python3

from __future__ import unicode_literals

import os, glob, sys, unicodedata, locale, gzip, re, traceback, encodings, io, codecs
import webbrowser, textwrap, struct
#from pprint import pprint

# bz2 was introduced in 2.3, but we want this to work even if for some
# reason it is not available
try:
    import bz2
except ImportError:
    bz2 = None

try:
    import lzma
except ImportError:
    lzma = None

def is_ascii(s):
    "test if string s consists completely of ascii characters"
    try:
        s.encode('ascii')
    except UnicodeEncodeError:
        return False
    return True

PY3 = sys.version_info[0] >= 3

if PY3:
    import subprocess as cmd
    from urllib.parse import quote as urlquote
    import io
    def out(*args):
        "print args, converting them to the output charset"
        for i in args:
            #sys.stdout.flush()
            sys.stdout.buffer.write(i.encode(options.iocharset, 'replace'))
    # ord23 is used to convert elements of a byte array in python3, which are already integers
    ord23 = lambda x: x
    chr_orig = chr
else:
    # python2
    # getoutput() and getstatusoutput() methods have
    # been moved from commands to the subprocess module
    # with Python >= 3.x
    import commands as cmd
    from urllib import quote as urlquote
    def out(*args):
        "print args, converting them to the output charset"
        for i in args:
            sys.stdout.write(i.encode(options.iocharset, 'replace'))
    ord23 = ord
    # python3-like chr
    chr_orig = chr
    chr = unichr
    str = unicode
    range = xrange

from optparse import OptionParser

VERSION='2.8'

# list of terminals that support bidi
biditerms = ['mlterm']

try:
    locale.setlocale(locale.LC_ALL, '')
except locale.Error:
    pass

# guess terminal charset
try:
    iocharsetguess = locale.nl_langinfo(locale.CODESET) or "ascii"
except locale.Error:
    iocharsetguess = "ascii"

if os.environ.get('TERM') in biditerms and iocharsetguess.lower().startswith('utf'):
    LTR = chr(0x202d) # left to right override
else:
    LTR = ''

colours = {
    'no_colour' : "",
    'default'   : "\033[0m",
    'bold'      : "\033[1m",
    'underline' : "\033[4m",
    'blink'     : "\033[5m",
    'reverse'   : "\033[7m",
    'concealed' : "\033[8m",

    'black'     : "\033[30m",
    'red'       : "\033[31m",
    'green'     : "\033[32m",
    'yellow'    : "\033[33m",
    'blue'      : "\033[34m",
    'magenta'   : "\033[35m",
    'cyan'      : "\033[36m",
    'white'     : "\033[37m",

    'on_black'  : "\033[40m",
    'on_red'    : "\033[41m",
    'on_green'  : "\033[42m",
    'on_yellow' : "\033[43m",
    'on_blue'   : "\033[44m",
    'on_magenta': "\033[45m",
    'on_cyan'   : "\033[46m",
    'on_white'  : "\033[47m",

    'beep'      : "\007",
}

general_category = {
    'Lu': 'Letter, Uppercase',
    'Ll': 'Letter, Lowercase',
    'Lt': 'Letter, Titlecase',
    'Lm': 'Letter, Modifier',
    'Lo': 'Letter, Other',
    'Mn': 'Mark, Non-Spacing',
    'Mc': 'Mark, Spacing Combining',
    'Me': 'Mark, Enclosing',
    'Nd': 'Number, Decimal Digit',
    'Nl': 'Number, Letter',
    'No': 'Number, Other',
    'Pc': 'Punctuation, 
Connector', 'Pd': 'Punctuation, Dash', 'Ps': 'Punctuation, Open', 'Pe': 'Punctuation, Close', 'Pi': 'Punctuation, Initial quote', 'Pf': 'Punctuation, Final quote', 'Po': 'Punctuation, Other', 'Sm': 'Symbol, Math', 'Sc': 'Symbol, Currency', 'Sk': 'Symbol, Modifier', 'So': 'Symbol, Other', 'Zs': 'Separator, Space', 'Zl': 'Separator, Line', 'Zp': 'Separator, Paragraph', 'Cc': 'Other, Control', 'Cf': 'Other, Format', 'Cs': 'Other, Surrogate', 'Co': 'Other, Private Use', 'Cn': 'Other, Not Assigned', } bidi_category = { 'L' : 'Left-to-Right', 'LRE' : 'Left-to-Right Embedding', 'LRO' : 'Left-to-Right Override', 'R' : 'Right-to-Left', 'AL' : 'Right-to-Left Arabic', 'RLE' : 'Right-to-Left Embedding', 'RLO' : 'Right-to-Left Override', 'PDF' : 'Pop Directional Format', 'EN' : 'European Number', 'ES' : 'European Number Separator', 'ET' : 'European Number Terminator', 'AN' : 'Arabic Number', 'CS' : 'Common Number Separator', 'NSM' : 'Non-Spacing Mark', 'BN' : 'Boundary Neutral', 'B' : 'Paragraph Separator', 'S' : 'Segment Separator', 'WS' : 'Whitespace', 'ON' : 'Other Neutrals', 'LRI' : 'Left-to-Right Isolate', 'RLI' : 'Right-to-Left Isolate', 'FSI' : 'First Strong Isolate', 'PDI' : 'Pop Directional Isolate', } comb_classes = { 0: 'Spacing, split, enclosing, reordrant, and Tibetan subjoined', 1: 'Overlays and interior', 7: 'Nuktas', 8: 'Hiragana/Katakana voicing marks', 9: 'Viramas', 10: 'Start of fixed position classes', 199: 'End of fixed position classes', 200: 'Below left attached', 202: 'Below attached', 204: 'Below right attached', 208: 'Left attached (reordrant around single base character)', 210: 'Right attached', 212: 'Above left attached', 214: 'Above attached', 216: 'Above right attached', 218: 'Below left', 220: 'Below', 222: 'Below right', 224: 'Left (reordrant around single base character)', 226: 'Right', 228: 'Above left', 230: 'Above', 232: 'Above right', 233: 'Double below', 234: 'Double above', 240: 'Below (iota subscript)', } eaw_description = { 'F': 'fullwidth', 'H': 'halfwidth', 'W': 'wide', 'Na':'narrow', 'A': 'ambiguous', 'N': 'neutral' } def get_unicode_blocks_descriptions(): "parses Blocks.txt" unicodeblocks = {} # (low, high): 'desc' f = None for name in UnicodeBlocksFiles: f = OpenGzip(name) if f: break if not f: return {} for line in f: if line.startswith('#') or ';' not in line or '..' 
not in line: continue ran, desc = line.split(';') desc = desc.strip() low, high = ran.split('..') low = int(low, 16) high = int(high, 16) unicodeblocks[ (low,high) ] = desc return unicodeblocks unicodeblocks = None def get_unicode_block(ch): "return start_of_block, end_of_block, block_name" global unicodeblocks if unicodeblocks is None: unicodeblocks = get_unicode_blocks_descriptions() ch = ord(ch) for low, high in unicodeblocks.keys(): if low<=ch<=high: return low, high, unicodeblocks[ (low,high) ] def get_unicode_properties(ch): properties = {} if ch in linecache: fields = linecache[ch].strip().split(';') proplist = ['codepoint', 'name', 'category', 'combining', 'bidi', 'decomposition', 'dummy', 'digit_value', 'numeric_value', 'mirrored', 'unicode1name', 'iso_comment', 'uppercase', 'lowercase', 'titlecase'] for i, prop in enumerate(proplist): if prop!='dummy': properties[prop] = fields[i] if properties['lowercase']: properties['lowercase'] = chr(int(properties['lowercase'], 16)) if properties['uppercase']: properties['uppercase'] = chr(int(properties['uppercase'], 16)) if properties['titlecase']: properties['titlecase'] = chr(int(properties['titlecase'], 16)) properties['combining'] = int(properties['combining']) properties['mirrored'] = properties['mirrored']=='Y' else: properties['codepoint'] = '%04X' % ord(ch) properties['name'] = unicodedata.name(ch, '') properties['category'] = unicodedata.category(ch) properties['combining'] = unicodedata.combining(ch) properties['bidi'] = unicodedata.bidirectional(ch) properties['decomposition'] = unicodedata.decomposition(ch) properties['digit_value'] = str(unicodedata.digit(ch, '')) properties['numeric_value'] = str(unicodedata.numeric(ch, '')) properties['mirrored'] = unicodedata.mirrored(ch) properties['unicode1name'] = '' properties['iso_comment'] = '' properties['lowercase'] = properties['uppercase'] = properties['titlecase'] = '' ch_up = ch.upper() ch_lo = ch.lower() ch_title = ch.title() if ch_up != ch: properties['uppercase'] = ch_up if ch_lo != ch: properties['lowercase'] = ch_lo if ch_title != ch_up: properties['titlecase'] = ch_title properties['east_asian_width'] = get_east_asian_width(ch) return properties def do_init(): HomeDir = os.path.expanduser('~/.unicode') HomeUnicodeData = os.path.join(HomeDir, "UnicodeData.txt") global UnicodeDataFileNames UnicodeDataFileNames = [HomeUnicodeData, '/usr/share/unicode/UnicodeData.txt', '/usr/share/unicode-data/UnicodeData.txt', '/usr/share/unidata/UnicodeData.txt', '/usr/share/unicode/ucd/UnicodeData.txt', './UnicodeData.txt'] + \ glob.glob('/usr/share/unidata/UnicodeData*.txt') + \ glob.glob('/usr/share/perl/*/unicore/UnicodeData.txt') + \ glob.glob('/System/Library/Perl/*/unicore/UnicodeData.txt') # for MacOSX HomeUnihanData = os.path.join(HomeDir, "Unihan*") global UnihanDataGlobs UnihanDataGlobs = [HomeUnihanData, '/usr/share/unidata/Unihan*', '/usr/share/unicode-data/Unihan*', '/usr/share/unicode/Unihan*', './Unihan*'] HomeUnicodeBlocks = os.path.join(HomeDir, "Blocks.txt") global UnicodeBlocksFiles UnicodeBlocksFiles = [HomeUnicodeBlocks, '/usr/share/unicode/Blocks.txt', '/usr/share/unicode-data/Blocks.txt', '/usr/share/unidata/Blocks.txt', './Blocks.txt'] # cache where grepped unicode properties are kept global linecache linecache = {} def get_unihan_files(): fos = [] # list of file names for Unihan data file(s) for gl in UnihanDataGlobs: fnames = glob.glob(gl) fos += fnames return fos def get_unihan_properties_internal(ch): properties = {} ch = ord(ch) global unihan_fs for f in 
unihan_fs: fo = OpenGzip(f) for l in fo: if l.startswith('#'): continue line = l.strip() if not line: continue char, key, value = line.strip().split('\t') if int(char[2:], 16) == ch: properties[key] = value.decode('utf-8') elif int(char[2:], 16)>ch: break return properties def get_unihan_properties_zgrep(ch): properties = {} global unihan_fs ch = ord(ch) chs = 'U+%X' % ch for f in unihan_fs: if f.endswith('.gz'): grepcmd = 'zgrep' elif f.endswith('.bz2'): grepcmd = 'bzgrep' elif f.endswith('.xz'): grepcmd = 'xzgrep' else: grepcmd = 'grep' cmdline = grepcmd+' ^'+chs+r'\\b '+f status, output = cmd.getstatusoutput(cmdline) if not PY3: output = unicode(output, 'utf-8') output = output.split('\n') for l in output: if not l: continue char, key, value = l.strip().split('\t') if int(char[2:], 16) == ch: properties[key] = value elif int(char[2:], 16)>ch: break return properties # basic sanity check, if e.g. you run this on MS Windows... if os.path.exists('/bin/grep'): get_unihan_properties = get_unihan_properties_zgrep else: get_unihan_properties = get_unihan_properties_internal def error(txt): out(txt) out('\n') sys.exit(1) def get_gzip_filename(fname): "return fname, if it does not exist, return fname+.gz, if neither that, fname+.bz2, if neither that, fname+.xz, if neither that, return None" if os.path.exists(fname): return fname if os.path.exists(fname+'.gz'): return fname+'.gz' if os.path.exists(fname+'.bz2') and bz2 is not None: return fname+'.bz2' if os.path.exists(fname+'.xz') and lzma is not None: return fname+'.xz' return None def OpenGzip(fname): "open fname, try fname.gz or fname.bz2 or fname.xz if fname does not exist, return file object or GzipFile or BZ2File object" fname = get_gzip_filename(fname) fo = None if not fname: return None if fname.endswith('.gz'): fo = gzip.GzipFile(fname) elif fname.endswith('.bz2'): fo = bz2.BZ2File(fname) elif fname.endswith('.xz'): fo = lzma.open(fname) else: fo = io.open(fname, encoding='utf-8') return fo if fo: # we cannot use TextIOWrapper, since it needs read1 method not implemented by gzip|bz2 fo = codecs.getreader('utf-8')(fo) return fo def GrepInNames(pattern, prefill_cache=False): f = None for name in UnicodeDataFileNames: f = OpenGzip(name) if f != None: break if f: if pattern.endswith('$'): pattern = pattern[:-1]+';' pat = re.compile(pattern, re.I) if not f: out( """ Cannot find UnicodeData.txt, please place it into /usr/share/unidata/UnicodeData.txt, /usr/share/unicode/UnicodeData.txt, ~/.unicode/ or current working directory (optionally you can gzip it). Without the file, searching will be much slower. 
""" ) if prefill_cache: if f: for l in f: if pat.search(l): r = myunichr(int(l.split(';')[0], 16)) linecache[r] = l f.close() else: if f: for l in f: if pat.search(l): r = myunichr(int(l.split(';')[0], 16)) linecache[r] = l yield r f.close() else: for i in range(sys.maxunicode): try: name = unicodedata.name(chr(i)) if pat.search(name): yield myunichr(i) except ValueError: pass def valfromcp(n, cp=None): "if cp is defined, then the 'n' is considered to be from that codepage and is converted accordingly" "the output is a list of codepoints (integers)" if cp: xh = '%x' %n if len(xh) % 2: # pad hexadecimal representation with a zero xh = '0'+xh cps = ( [xh[i:i+2] for i in range(0,len(xh),2)] ) cps = ( int(i, 16) for i in cps) # we have to use chr_orig (it's original chr for python2) and not 'B' # because unicode_literals it will be unicode, which # is not permitted in struct.pack in python2.6 cps = ( struct.pack(chr_orig(0x42),i) for i in cps ) # this works in both python3 and python2, unlike bytes([i]) cps = b''.join(cps) cps = cps.decode(cp) cps = [ord(x) for x in cps] return cps else: return [n] def myunichr(n): try: r = chr(n) return r except OverflowError: traceback.print_exc() error("The codepoint is too big - it does not fit into an int.") except ValueError: traceback.print_exc() err = "The codepoint is too big." if sys.maxunicode <= 0xffff: err += "\nPerhaps your python interpreter is not compiled with wide unicode characters." error(err) def guesstype(arg): if not arg: # empty string return 'empty string', arg elif not is_ascii(arg): return 'string', arg elif arg[:2]=='U+' or arg[:2]=='u+': # it is hexadecimal number try: val = int(arg[2:], 16) if val>sys.maxunicode: return 'regexp', arg else: return 'hexadecimal', arg[2:] except ValueError: return 'regexp', arg elif arg[0] in "Uu" and len(arg)>4: try: val = int(arg[1:], 16) if val>sys.maxunicode: return 'regexp', arg else: return 'hexadecimal', arg[1:] except ValueError: return 'regexp', arg elif len(arg)>=4: if len(arg) in (8, 16, 24, 32): if all(x in '01' for x in arg): val = int(arg, 2) if val<=sys.maxunicode: return 'binary', arg try: val = int(arg, 16) if val>sys.maxunicode: return 'regexp', arg else: return 'hexadecimal', arg except ValueError: return 'regexp', arg else: return 'string', arg def process(arglist, t, fromcp=None, prefill_cache=False): # build a list of values, so that we can combine queries like # LATIN ALPHA and search for LATIN.*ALPHA and not names that # contain either LATIN or ALPHA result = [] names_query = [] # reserved for queries in names - i.e. -r for arg_i in arglist: if t==None: tp, arg = guesstype(arg_i) if tp == 'regexp': # if the first argument is guessed to be a regexp, add # all the following arguments to the regular expression - # this is probably what you wanted, e.g. 
                # 'unicode cyrillic be' will now search for the 'cyrillic.*be' regular expression
                t = 'regexp'
        else:
            tp, arg = t, arg_i
        if tp=='hexadecimal':
            val = int(arg, 16)
            vals = valfromcp(val, fromcp)
            for val in vals:
                r = myunichr(val)
                result.append(r)
        elif tp=='decimal':
            val = int(arg, 10)
            vals = valfromcp(val, fromcp)
            for val in vals:
                r = myunichr(val)
                result.append(r)
        elif tp=='octal':
            val = int(arg, 8)
            vals = valfromcp(val, fromcp)
            for val in vals:
                r = myunichr(val)
                result.append(r)
        elif tp=='binary':
            val = int(arg, 2)
            vals = valfromcp(val, fromcp)
            for val in vals:
                r = myunichr(val)
                result.append(r)
        elif tp=='regexp':
            names_query.append(arg)
        elif tp=='string':
            unirepr = arg
            for r in unirepr:
                result.append(r)
        elif tp=='empty string':
            pass # do not do anything for an empty string

    if result and prefill_cache:
        hx = '|'.join('%04X'%ord(x) for x in result)
        list(GrepInNames(hx, prefill_cache=True))
    if names_query:
        query = '.*'.join(names_query)
        for r in GrepInNames(query):
            result.append(r)
    return result

def maybe_colours(colour):
    if options.use_colour:
        return colours[colour]
    else:
        return ""

# format key and value
def printkv(*l):
    for i in range(0, len(l), 2):
        if i < len(l)-1:
            out('%s: %s\n' % (l[i], l[i+1]))
        else:
            out('%s\n' % l[i])

def print_characters(clist, maxcount, format_string, query_wikipedia=0, query_wiktionary=0):
    counter = 0
    for c in clist:
        if query_wikipedia:
            webbrowser.open('http://en.wikipedia.org/wiki/%s' % urlquote(c.encode('utf-8')))
        if query_wiktionary:
            webbrowser.open('http://en.wiktionary.org/wiki/%s' % urlquote(c.encode('utf-8')))
        counter += 1
        if maxcount and counter > options.maxcount:
            sys.stdout.flush()
            sys.stderr.write("\nToo many characters to display, more than %s, use --max 0 (or other value) option to change it\n" % options.maxcount)
            return
        properties = get_unicode_properties(c)
        ordc = ord(c)
        if properties['name']:
            name = properties['name']
        else:
            name = " - No such unicode character name in database"
        if 0xd800 <= ordc <= 0xdfff:
            # surrogate
            utf8 = utf16be = 'N/A'
        else:
            utf8 = ' '.join([("%02x" % ord23(x)) for x in c.encode('utf-8')])
            utf16be = ''.join([("%02x" % ord23(x)) for x in c.encode('utf-16be')])
        decimal = "&#%s;" % ordc
        octal = "\\0%o" % ordc
        addcharset = options.addcharset
        if addcharset:
            try:
                in_additional_charset = ' '.join([("%02x" % ord23(x)) for x in c.encode(addcharset)])
            except UnicodeError:
                in_additional_charset = "NONE"
        category = properties['category']
        category_desc = general_category[category]
        if category == 'Cc':
            # control character
            pchar = ''
        else:
            if properties['combining']:
                pchar = " "+c
            else:
                pchar = c
        uppercase = properties['uppercase']
        lowercase = properties['lowercase']
        opt_uppercase = opt_lowercase = ''
        flipcase = None
        if uppercase:
            ord_uppercase = ord(properties['uppercase'])
            opt_uppercase = '\n{green}Uppercase:{default} {ord_uppercase:04X}'.format(**locals())
            flipcase = uppercase
        elif lowercase:
            ord_lowercase = ord(properties['lowercase'])
            opt_lowercase = '\n{green}Lowercase:{default} {ord_lowercase:04X}'.format(**locals())
            flipcase = lowercase
        opt_numeric = ''
        numeric_desc = ''
        if properties['numeric_value']:
            opt_numeric = 'Numeric value: '
            numeric_desc = properties['numeric_value']+'\n'
        opt_digit = ''
        digit_desc = ''
        if properties['digit_value']:
            opt_digit = 'Digit value: '
            digit_desc = properties['digit_value']+'\n'
        opt_bidi = ''
        bidi_desc = ''
        bidi = properties['bidi']
        bidi_desc = bidi_category.get(bidi, bidi)
        if bidi:
            opt_bidi = 'Bidi: '
            bidi_desc = ' ({0})\n'.format(bidi_desc)
        mirrored_desc = ''
        mirrored = properties['mirrored']
        if mirrored:
            mirrored_desc = 'Character is mirrored\n'
        opt_combining = ''
        comb = properties['combining']
        combining_desc = ''
        if comb:
            opt_combining = 'Combining: '
            combining_desc = "{comb} ({comb_class})\n".format(comb=comb, comb_class=comb_classes.get(comb, '?'))
        opt_decomp = ''
        decomp_desc = ''
        decomp = properties['decomposition']
        if decomp:
            opt_decomp = 'Decomposition: '
            decomp_desc = decomp+'\n'
        if properties['east_asian_width']:
opt_eaw = 'East Asian width: ' eaw = properties['east_asian_width'] eaw_desc = '{eaw} ({desc})'.format(eaw=eaw, desc=eaw_description.get(eaw, eaw)) opt_unicode_block = '' opt_unicode_block_desc = '' unicode_block = get_unicode_block(c) if unicode_block: low, high, desc = unicode_block opt_unicode_block = 'Unicode block: ' opt_unicode_block_desc = "{low:04X}..{high:04X}; {desc}\n".format(low=low,high=high,desc=desc) if addcharset: opt_additional = ' {green}{addcharset}:{default} {in_additional_charset}'.format(**locals()) else: opt_additional = '' if flipcase: opt_flipcase = ' ({flipcase})'.format(**locals()) else: opt_flipcase = '' formatted_output = format_string.format(**locals()) out(formatted_output) if options.verbosity>0: uhp = get_unihan_properties(c) for key in uhp: printkv(key, uhp[key]) def get_east_asian_width(c): eaw = 'east_asian_width' in unicodedata.__dict__ and unicodedata.east_asian_width(c) return eaw def print_block(block): #header out(" "*10) for i in range(16): out(".%X " % i) out('\n') #body for i in range(block*16, block*16+16): hexi = "%X" % i if len(hexi)>3: hexi = "%07X" % i hexi = hexi[:4]+" "+hexi[4:] else: hexi = " %03X" % i out(LTR+hexi+". ") for j in range(16): c = chr(i*16+j) if unicodedata.category(c) == 'Cc': c_out = ' ' else: c_out = c if unicodedata.combining(c): c_out = " "+c # fallback for python without east_asian_width (probably unnecessary, since this script does not work with <2.6 anyway) fullwidth = get_east_asian_width(c)[0] in 'FW' if not fullwidth: c_out = ' '+c_out out(c_out) out(' ') out('\n') out('\n') def print_blocks(blocks): for block in blocks: print_block(block) def is_range(s, typ): sp = s.split('..') if len(sp)!=2: return False if not sp[1]: sp[1] = sp[0] elif not sp[0]: sp[0] = sp[1] if not sp[0]: return False low = list(process([sp[0]], typ)) # intentionally no fromcp here, ranges are only of unicode characters high = list(process([sp[1]], typ)) if len(low)!=1 or len(high)!=1: return False low = ord(low[0]) high = ord(high[0]) low = low // 256 high = high // 256 + 1 return range(low, high) def unescape(s): return s.replace(r'\n', '\n') ascii_cc_names = ('NUL', 'SOH', 'STX', 'ETX', 'EOT', 'ENQ', 'ACK', 'BEL', 'BS', 'HT', 'LF', 'VT', 'FF', 'CR', 'SO', 'SI', 'DLE', 'DC1', 'DC2', 'DC3', 'DC4', 'NAK', 'SYN', 'ETB', 'CAN', 'EM', 'SUB', 'ESC', 'FS', 'GS', 'RS', 'US') def display_ascii_table(): print('Dec Hex Dec Hex Dec Hex Dec Hex Dec Hex Dec Hex Dec Hex Dec Hex') for row in range(0, 16): for col in range(0, 8): cp = 16*col+row ch = chr(cp) if 32<=cp else ascii_cc_names[cp] ch = 'DEL' if cp==127 else ch frm = '{:3d} {:02X} {:2s}' if cp < 32: frm = '{:3d} {:02X} {:4s}' elif cp >= 96: frm = '{:4d} {:02X} {:2s}' cell = frm.format(cp, cp, ch) print(cell, end='') print() brexit_ascii_diffs = { 30: ' ', 31: ' ', 34: "'", 123: '{}{', 125: '}}', 127: ' ', 128: ' ', 129: ' ', } def display_brexit_ascii_table(): print(' + | 0 1 2 3 4 5 6 7 8 9') print('---+-----------------------------------------------') for row in range(30, 130, 10): print('{:3d}'.format(row), end='|') for col in range(0, 10): cp = col+row ch = brexit_ascii_diffs.get(cp, chr(cp)) cell = ' {:3s} '.format(ch) print(cell, end='') print() format_string_default = '''{yellow}{bold}U+{ordc:04X} {name}{default} {green}UTF-8:{default} {utf8} {green}UTF-16BE:{default} {utf16be} {green}Decimal:{default} {decimal} {green}Octal:{default} {octal}{opt_additional} {pchar}{opt_flipcase}{opt_uppercase}{opt_lowercase} {green}Category:{default} {category} ({category_desc}); 
{green}{opt_eaw}{default}{eaw_desc}
{green}{opt_unicode_block}{default}{opt_unicode_block_desc}{green}{opt_numeric}{default}{numeric_desc}{green}{opt_digit}{default}{digit_desc}{green}{opt_bidi}{default}{bidi}{bidi_desc}
{mirrored_desc}{green}{opt_combining}{default}{combining_desc}{green}{opt_decomp}{default}{decomp_desc}
'''

def main():
    parser = OptionParser(usage="usage: %prog [options] arg")
    parser.add_option("-x", "--hexadecimal",
          action="store_const", const='hexadecimal', dest="type",
          help="Assume arg to be hexadecimal number")
    parser.add_option("-o", "--octal",
          action="store_const", const='octal', dest="type",
          help="Assume arg to be octal number")
    parser.add_option("-b", "--binary",
          action="store_const", const='binary', dest="type",
          help="Assume arg to be binary number")
    parser.add_option("-d", "--decimal",
          action="store_const", const='decimal', dest="type",
          help="Assume arg to be decimal number")
    parser.add_option("-r", "--regexp",
          action="store_const", const='regexp', dest="type",
          help="Assume arg to be regular expression")
    parser.add_option("-s", "--string",
          action="store_const", const='string', dest="type",
          help="Assume arg to be a sequence of characters")
    parser.add_option("-a", "--auto",
          action="store_const", const=None, dest="type",
          help="Try to guess arg type (default)")
    parser.add_option("-m", "--max",
          action="store", default=10, dest="maxcount", type="int",
          help="Maximal number of codepoints to display, default: 10; 0=unlimited")
    parser.add_option("-i", "--io",
          action="store", default=iocharsetguess, dest="iocharset", type="string",
          help="I/O character set, I am guessing %s" % iocharsetguess)
    parser.add_option("--fcp", "--fromcp",
          action="store", default='', dest="fromcp", type="string",
          help="Convert numerical arguments from this encoding, default: no conversion")
    parser.add_option("-c", "--charset-add",
          action="store", dest="addcharset", type="string",
          help="Show hexadecimal representation in this additional charset")
    parser.add_option("-C", "--colour",
          action="store", dest="use_colour", type="string", default="auto",
          help="Use colours, on, off or auto")
    parser.add_option('', "--color",
          action="store", dest="use_colour", type="string", default="auto",
          help="synonym for --colour")
    parser.add_option("-v", "--verbose",
          action="count", dest="verbosity", default=0,
          help="Increase verbosity (reads Unihan properties - slow!)")
    parser.add_option("-w", "--wikipedia",
          action="count", dest="query_wikipedia", default=0,
          help="Query wikipedia for the character")
    parser.add_option("--wt", "--wiktionary",
          action="count", dest="query_wiktionary", default=0,
          help="Query wiktionary for the character")
    parser.add_option("--list",
          action="store_const", dest="list_all_encodings", const=True,
          help="List (approximately) all known encodings")
    parser.add_option("--format",
          action="store", dest="format_string", type="string", default=format_string_default,
          help="formatting string")
    parser.add_option("--brief", "--terse",
          action="store_const", dest="format_string", const='{pchar} U+{ordc:04X} {name}\n',
          help="Brief format")
    parser.add_option("--ascii",
          action="store_const", dest="ascii_table", const=True,
          help="Display ASCII table")
    parser.add_option("--brexit-ascii", "--brexit",
          action="store_const", dest="brexit_ascii_table", const=True,
          help="Display ASCII table (EU–UK Trade and Cooperation Agreement version)")

    global options
    (options, arguments) = parser.parse_args()

    format_string = unescape(options.format_string)

    do_init()

    if options.list_all_encodings:
        all_encodings = os.listdir(os.path.dirname(encodings.__file__))
        all_encodings = set([os.path.splitext(x)[0] for x in all_encodings])
        all_encodings = list(all_encodings)
        all_encodings.sort()
        print(textwrap.fill(' '.join(all_encodings)))
        sys.exit()

    if options.ascii_table:
        display_ascii_table()
        sys.exit()
    if options.brexit_ascii_table:
        display_brexit_ascii_table()
        sys.exit()

    if len(arguments)==0:
        parser.print_help()
        sys.exit()

    if options.use_colour.lower() in ("on", "1", "true", "yes"):
        # we reuse the options.use_colour, so that we do not need to use another global
        options.use_colour = True
    elif options.use_colour.lower() in ("off", "0", "false", "no"):
        options.use_colour = False
    else:
        options.use_colour = sys.stdout.isatty()
    if sys.platform == 'win32':
        options.use_colour = False

    l_args = [] # list of non range arguments to process
    for argum in arguments:
        if PY3:
            # in python3, argv is automatically decoded into unicode
            # but we have to check for surrogates
            argum = argum.encode(options.iocharset, 'surrogateescape')
        try:
            argum = argum.decode(options.iocharset)
        except UnicodeDecodeError:
            error("Sequence %s is not valid in charset '%s'." % (repr(argum), options.iocharset))
        is_r = is_range(argum, options.type)
        if is_r:
            print_blocks(is_r)
        else:
            l_args.append(argum)

    if l_args:
        global unihan_fs
        unihan_fs = []
        if options.verbosity>0:
            unihan_fs = get_unihan_files() # list of file names for Unihan data file(s), empty if not available
            if not unihan_fs:
                out("""
Unihan_*.txt files not found. In order to view Unihan properties,
please place the files into /usr/share/unidata/, /usr/share/unicode/,
~/.unicode/ or current working directory (optionally you can gzip or
bzip2 them).

You can get the files by unpacking
ftp://ftp.unicode.org/Public/UNIDATA/Unihan.zip

Warning, listing UniHan Properties is rather slow.

""")
                options.verbosity = 0

        processed_args = process(l_args, options.type, options.fromcp, prefill_cache=True)
        print_characters(processed_args, options.maxcount, format_string, options.query_wikipedia, options.query_wiktionary)

if __name__ == '__main__':
    main()
unicode-2.8/unicode.1000066400000000000000000000101471377312441100145270ustar00rootroot00000000000000
.\" Hey, EMACS: -*- nroff -*-
.TH UNICODE 1 "2003-01-31"
.SH NAME
unicode \- command line unicode database query tool
.SH SYNOPSIS
.B unicode
.RI [ options ]
string
.SH DESCRIPTION
This manual page documents the
.B unicode
command.
.PP
\fBunicode\fP is a command line unicode database query tool.
.SH OPTIONS
.TP
.B \-h
.B \-\-help
Show help and exit.
.TP
.B \-x
.B \-\-hexadecimal
Assume
.I string
to be a hexadecimal number
.TP
.B \-d
.B \-\-decimal
Assume
.I string
to be a decimal number
.TP
.B \-o
.B \-\-octal
Assume
.I string
to be an octal number
.TP
.B \-b
.B \-\-binary
Assume
.I string
to be a binary number
.TP
.B \-r
.B \-\-regexp
Assume
.I string
to be a regular expression
.TP
.B \-s
.B \-\-string
Assume
.I string
to be a sequence of characters
.TP
.B \-a
.B \-\-auto
Try to guess the type of
.I string
from one of the above (default)
.TP
.BI \-m MAXCOUNT
.BI \-\-max= MAXCOUNT
Maximal number of codepoints to display, default: 10; use 0 for unlimited
.TP
.BI \-i IOCHARSET
.BI \-\-io= IOCHARSET
I/O character set. For maximal pleasure, run \fBunicode\fP on a UTF-8
capable terminal and specify IOCHARSET to be UTF-8. \fBunicode\fP tries to
guess this value from your locale, so with a properly set up locale, you
should not need to specify it.
.TP
.BI \-\-fcp= CHARSET
.BI \-\-fromcp= CHARSET
Convert numerical arguments from this encoding, default: no conversion.
Multibyte encodings are supported. This is ignored for non-numerical
arguments.
.TP
.BI \-c ADDCHARSET
.BI \-\-charset\-add= ADDCHARSET
Show hexadecimal representation of displayed characters in this additional
charset.
.TP
.BI \-C USE_COLOUR
.BI \-\-colour= USE_COLOUR
USE_COLOUR is one of
.B on
.B off
.B auto

.B \-\-colour=on
will use ANSI colour codes to colourise the output

.B \-\-colour=off
won't use colours.

.B \-\-colour=auto
will test if standard output is a tty, and use colours only when it is.

.B \-\-color
is a synonym of
.B \-\-colour
.TP
.B \-v
.B \-\-verbose
Be more verbose about displayed characters, e.g. display Unihan
information, if available.
.TP
.B \-w
.B \-\-wikipedia
Spawn browser pointing to the English Wikipedia entry about the character.
.TP
.B \-\-wt
.B \-\-wiktionary
Spawn browser pointing to the English Wiktionary entry about the character.
.TP
.B \-\-brief
Display character information in brief format
.TP
.BI \-\-format= fmt
Use your own format for character information display. See the README for
details.
.TP
.B \-\-list
List (approximately) all known encodings.
.TP
.B \-\-ascii
Display ASCII table
.TP
.B \-\-brexit\-ascii
.B \-\-brexit
Display ASCII table (EU–UK Trade and Cooperation Agreement 2020 version)
.SH USAGE
\fBunicode\fP tries to guess the type of an argument. In particular, if the
argument looks like a valid hexadecimal representation of a Unicode
codepoint, it will be considered to be such. Using

\fBunicode\fP face

will display information about U+FACE CJK COMPATIBILITY IDEOGRAPH-FACE, and
it will not search for 'face' in character descriptions \- for the latter,
use:

\fBunicode\fP \-r face

For example, you can use any of the following to display information about
U+00E1 LATIN SMALL LETTER A WITH ACUTE (\('a):

\fBunicode\fP 00E1

\fBunicode\fP U+00E1

\fBunicode\fP \('a

\fBunicode\fP 'latin small letter a with acute'

You can specify a range of characters as arguments; \fBunicode\fP will show
these characters in a nice tabular format, aligned to 256-codepoint
boundaries. Use two dots ".." to indicate the range, e.g.

\fBunicode\fP 0450..0520

will display the whole cyrillic, armenian and hebrew blocks (characters
from U+0400 to U+05FF)

\fBunicode\fP 0400..

will display just characters from U+0400 up to U+04FF

Use \-\-fromcp to query codepoints from other encodings:

\fBunicode\fP \-\-fromcp cp1250 \-d 200

Multibyte encodings are supported:

\fBunicode\fP \-\-fromcp big5 \-x aff3

and multi-char strings are supported, too:

\fBunicode\fP \-\-fromcp utf-8 \-x c599c3adc5a5
.SH BUGS
The tabular format does not deal well with full-width, combining, control
and RTL characters.
.SH SEE ALSO
ascii(1)
.SH AUTHOR
Radovan Garab\('ik