ChangeLog of MCPP and its accompanying Validation Suite
2008/11/30 kmatsui
* V.2.7.2
* Enabled some CPU-specific predefined macros in compiler-
independent-build as well as compiler-specific-build, because
there are some occasions which require those macros when we use
some compiler system's header files. Created init_cpu_macro().
(configure.ac, noconfig.H, configed.H, main.c, system.c)
* Enabled -m32 and -m64 options even on 32-bit systems and on
compiler-independent-build, when the OS is a UNIX-like one. These
options change some predefined macros. (system.c)
* Made the -z option output #include lines themselves. (system.c)
* Fixed a bug of source line numbering in library-build. (by
Dwayne Boone) (main.c)
* Fixed a few minor bugs.
* Dropped support for Borland C 4.0.
* Updated the shell-scripts, makefiles and difference files, and
revised some mistakes.
* Updated the documents.
2008/05/19 kmatsui
* V.2.7.1
* Fixed a bug of newline synchronization on -K option. Created
sync_linenum(). (Thanks to Benjamin Smedberg) (directive.c)
* Made GCC-specific-build on x86_64 and ppc64 have two sets of
predefines for 32bit mode and 64bit mode, and implemented -m32
and -m64 options. (Thanks to Benjamin Smedberg) (configure.ac,
set_mcpp.sh, system.c)
* Stopped using freopen() so that a main program which links
libmcpp can use stdin, stdout and stderr. (by Benoit Foucher)
(main.c)
* Fixed a bug of file-handle leak on -MD and -MF options.
(Thanks to Masashi Fujita) (system.c)
* Added ports to Visual C++ 6.0 and Borland C++ 5.9 (aka C++
Builder 2007). (by Dwayne Boone) (vc6.dif, bc59.dif, eval.c)
* Revised declaration of stpcpy(). (internal.H)
* Split mcpp_out.h from mcpp_lib.h.
* Made library-build install also an mcpp executable and minimal
documents as well as libmcpp for a convenience of library
package. Created main_mcpplib.c. Made all the binary packages
library-build. (by Yutaka Niibe and kmatsui) (Makefile.am, src/
Makefile.am)
* Split config/cygwin_root and config/mingw_root from
configure.ac to avoid trouble with backslash character handling
in some versions of bash. (Thanks to Aleksandar Samardzic)
* Changed autoconf 2.59 to 2.61 and automake 1.9.6 to 1.10.
* Updated the documents accordingly.
2008/03/23 kmatsui
* V.2.7
* Created macro notification mode, implemented -K option and '#
pragma MCPP debug macro_call' sub-directive to enable this mode
on STD mode. Created get_src_location(), print_macro_inf(),
print_macro_arg(), close_macro_inf(), chk_magic_balance(),
remove_magics(), some MAC_* macros to define magic characters,
struct LINE_COL, MACRO_INF. Revised many functions.
(Specifications mostly by Taras Glek, partly by Samuel,
implemented mostly by kmatsui and partly by Taras Glek and
Samuel). (internal.H, main.c, directive.c, eval.c, expand.c,
support.c, system.c)
* Created -k option to keep horizontal white spaces as they are
and convert comment to spaces of the same length on STD mode.
(Specs by Taras Glek, implemented by kmatsui). (internal.H,
main.c, directive.c, mbchar.c, support.c)
* Implemented GCC2-spec variadic macro on STD mode in GCC-
specific-build. (by Taras Glek and kmatsui). (directive.c)
* Enabled GCC-like buggy handling of macro containing 'defined'
token in #if directive on GCC-specific-build. (by Taras Glek).
(expand.c)
* Reordered initialization steps and enabled undefining of
predefined macros not required by the Standard. Created undef_macros().
Removed undef_a_predef(). (main.c, system.c)
* Enabled non-conforming predefined macros such as 'linux' by
default on GCC-specific-build for compatibility with GCC.
Removed undef_gcc_macros(). Created DEF_NOARGS_* macros for
diagnostics sake. (internal.H, system.c)
* Fixed a bug of file searching failure when a file is specified
by relative path in -include option. Split is_full_path() from
open_include(). (thanks to Benjamin Smedberg) (system.c)
* Fixed a bug of mcpplib initialization which caused problem on
CygWIN. (main.c, system.c, lib.c)
* Fixed a bug of unterminated source file handling. (thanks to
Phil Knight) (support.c)
* Made norm_path() check existence of directory/file before
normalization. As a result, a non-existent directory specified
by -I option was disabled, "non-existent/../existent" was judged
as non-existent before wrongly normalizing to "existent", and #
include "directory" was made not to open. Created norm_dir().
(thanks to Taras Glek and Dave Mandelin) (system.c)
* Stopped converting path-lists on Windows to lowercase letters.
Changed the path-list comparing function on Windows from strcmp() to
strcasecmp() or stricmp(). (system.c)
* Changed allocation of buffer for -M* options and incdir[],
fnamelist[], once_list[] from fixed size to dynamically
enlarging ones. (system.c)
* Made #line output for GCC-specific-build closer to GCC.
Changed FILEINFO and DEFBUF struct, moved sharp() from main.c to
system.c, revised many functions. (system.c, support.c, main.c,
directive.c)
* Absorbed lib.c into system.c. Renamed getopt() to mcpp_getopt
(), also variables opt* to mcpp_opt*, and made them static in order
to prevent linking of glibc getopt(). (thanks to Dwayne Boone)
* Fixed a bug of UTF-8 multibyte character handling, enabled
4-byte-long sequences, and enabled checking of overlong sequences
and UTF-16 surrogate pairs. (by Matt Wozniski) (mbchar.c,
support.c)
* Fixed a bug of tokenization in KR and OLD modes. (support.c)
* Changed FILENAME_MAX to PATH_MAX and FILENAMEMAX to PATHMAX,
because FILENAME_MAX is too short on some systems. (thanks to
Dwayne Boone)
* Bundled some variables into structs (std_limits, option_flags,
etc.). Tidied up the sources, removing unused code and rewriting
old comments. (most of the sources)
* Ported to Mac OS X / Apple-GCC. Enabled searching of
"framework" directories for #include. Enabled to search "header
map" file. Enabled #import, which is #include with
unconditional "once only" feature. Implemented -F, -arch,
-isysroot options. Created init_framework(), search_framework(),
search_subdir(), search_header_map(), hmap_hash(). (system.c,
directive.c, set_mcpp.sh, unset_mcpp.sh, configure.ac, src/
Makefile.am)
* Ported to Visual C++ 2008. Enabled '$' in identifier by
default in Visual-C-specific-build and GCC-specific-build.
(system.H, internal.H, support.c, system.c)
* Added documentation on source checking of firefox 3.0pre.
Added comments on system headers in Mac OS X. (mcpp-manual.html)
* Updated all the documents. (mainly by kmatsui, partly by
Taras Glek)
2007/05/19 kmatsui
* V.2.6.4
* Fixed memory leaks in subroutine-build related to file->
filename, sharp_filename and others. (by Juergen Mueller and
kmatsui). (main.c, directive.c, support.c, system.c)
* Revised expanding() and expanding_macro[] to fix memory leaks.
Created clear_exp_mac(). (internal.H, expand.c, support.c)
* Fixed a bug of accessing non-allocated memory. (by isr).
(support.c)
* Revised output of // comment by -C option. Output // comment
as it is, not converting to /* */. (thanks to Taras Glek).
(support.c)
* Changed output of line top white spaces in other than
POST_STANDARD mode to preserve them as they are, rather than
squeezing to one space, in order to make output more human-
readable. (main.c, support.c)
* Removed the settings to be compiled with C++. (configed.H,
noconfig.H, noconfig/*.mak)
* Updated version-info for shared-library-build from 0:0:0 to 0:
1:0.
* Changed installation directory of some documents in stand-
alone-and-compiler-independent-build by configure or by binary
packages.
* Updated the documents. Note that cpp-test.html was not
updated.
2007/04/07 kmatsui
* V.2.6.3
* Fixed a bug of some #line directive handling which wrongly
affected #include path. Added a new member for real file name
to struct FILEINFO, and made the #line directive not affect the
real file name. (internal.H, main.c, support.c, system.c)
* Enabled dereferencing of symbolic linked directory (as well as
file) of #include path-list and include directory. Split
deref_syml() from norm_path(). (system.c)
* Revised again diagnostic messages for some macro expansions.
(internal.H, expand.c, support.c)
* Relaxed token checking and syntax checking in lang_asm mode.
(expand.c, support.c)
* Implemented GCC3-spec variadic macro for GCC-specific-build.
(internal.H, directive.c, expand.c)
* Added some predefined macros for GCC-specific-build. (system.c)
* Revised output routines abstracting output device, and
implementing optional memory buffer output when built with
MCPP_LIB macro. Created mcpp_lib.h, mcpp_lib_fputs(),
mcpp_lib_fputc(), mcpp_lib_fprintf(), mcpp_use_mem_buffers(),
mcpp_get_mem_buffer(), mcpp_set_out_func(),
mcpp_reset_def_out_func(), mem_putc(), mem_puts(),
append_to_buffer(), function pointers mcpp_fputs, mcpp_fputc,
mcpp_fprintf and some macros. This update disabled compilation
by C++. (All were contributed by Greg Kress and slightly
modified by kmatsui) (internal.H, main.c, directive.c, eval.c,
expand.c, mbchar.c, support.c, system.c, lib.c, mcpp_lib.h)
* Renamed some global names in order to lessen the possibility
of name collisions in subroutine-build. Renamed the variables
mode, cplus, line, debug, type[] and work[] to mcpp_mode,
cplus_val, src_line, mcpp_debug, char_type[] and work_buf[]
respectively. Renamed the functions install(), eval(), expand(),
get() and unget() to install_macro(), eval_if(), expand_macro(),
get_ch() and unget_ch() respectively. (internal.H, main.c,
directive.c, eval.c, expand.c, mbchar.c, support.c, system.c)
* Added 'mcpplib' target to make subroutine (library) build in
configure.ac and noconfig/*.mak.
* Revised some other minor points. (all sources)
* Changed default setting of noconfig.H to that of FreeBSD 6.* /
stand-alone / GCC 3.4. (noconfig.H)
* Added documentation on source checking of glibc 2.4. (mcpp-
manual.html)
* Abolished 'install-data' and 'uninstall-data' targets of
configured makefile. On the other hand, made 'install' target
install also mcpp-manual.html.
* Provided stand-alone-and-compiler-independent-build binary
packages port, rpm, deb, zip and their corresponding source
packages.
2006/11/12 kmatsui
* V.2.6.2
* Renamed control.c as directive.c and renamed control() as
directive().
* Fixed a bug of #else handling in pre-Standard modes.
(directive.c)
* Fixed a bug of mcpp-specific directives such as #debug or
#put_defines in pre-Standard modes. (system.c)
* Fixed a bug of warning options for GCC-specific-builds.
(system.c)
* Fixed a bug of macro expansion timing in #include directive
line. (system.c)
* Revised some other minor points, moved cur_file() from main.c
to system.c. (main.c, eval.c, system.c)
* Revised diagnostic messages for some macro expansions.
(internal.H, expand.c, support.c)
* Fixed a bug of nested includes with relative paths. (thanks
to Leo Savernik). (system.c)
* Fixed memory leaks in routines related to normalizing path-
list. (by Juergen Mueller). (system.c)
* Added MCPP_LIB setting to use mcpp as a subroutine from other
main program. Created init_main(), init_directive(), init_eval(),
init_support(), init_system(), init_lib(), clear_filelist() and
clear_symtable(). Created testmain.c as a sample source. (all
were contributed by Juergen Mueller and slightly modified by
kmatsui). (internal.H, main.c, directive.c, eval.c, expand.c,
support.c, system.c, lib.c)
* Changed the macro STAND_ALONE to INDEPENDENT.
* Changed the terminology of building methods in the documents.
(INSTALL, mcpp-porting.html, mcpp-manual.html)
* Rewrote and converted the text files in 'doc' and 'doc-jp'
directories into html files.
* Updated and corrected many points of the documents.
2006/08/12 kmatsui
* V.2.6.1
* Enabled automatic conversion from [CR+LF] to [LF]. (support.c)
* Set the limit of #include nesting to INCLUDE_NEST (default:
256) in order to prevent infinitely recursive #includes.
(system.H, system.c)
* Revised white space handling in

mcpp is a C preprocessor developed by kmatsui (Kiyoshi Matsui), based on the DECUS cpp written by Martin Minow and then rewritten entirely. mcpp means Matsui cpp. This software is supplied as source code, and to use mcpp on any system, a small amount of compiler-system-specific modification is required before it can be compiled into an executable. *1 This document explains how to port the source to different compiler systems. Please refer to the separate manual called "mcpp-manual.html" for the operating instructions of the generated executable. All these sources and related documents are provided as open-source software. Before going into detail, some of the mcpp features are introduced here.

Note:
*1 mcpp V.2.6.3 onward provides some binary packages too, at the following site. This document, however, does not explain them. As for the binary packages, see the web page.

mcpp is a portable preprocessor, supporting various operating systems, including Linux, FreeBSD and Windows. Its source is widely portable and can be compiled by any compiler which supports Standard C (ANSI/ISO C). The library functions used are only the classic ones. To port mcpp to a compiler system, in many cases one only needs to change some macro definitions in the header files and simply compile; in the worst case, adding several dozen lines to a source file would be enough.

To process multi-byte characters (Kanji), it supports Japanese EUC-JP, shift-JIS and ISO2022-JP, Chinese GB-2312, Taiwanese Big-5 and Korean KSC-5601 (KSX 1001), as well as UTF-8. For shift-JIS, ISO2022-JP or Big-5, mcpp can complement the compiler-proper if the compiler does not recognize them.

mcpp has various behavioral modes. Other than the Standard-conforming mode, there are a K&R 1st mode, a "Reiser" cpp mode, and what I call post-Standard mode. mcpp also has an execution option to act as a C++ preprocessor.

Unlike many existing preprocessors, the Standard mode of mcpp has the highest conformance to the Standards: all of C90, C99 and C++98. It has been developed aiming to become the reference model of the Standard C preprocessor. The version of the Standard to conform to can be specified by an execution option. *1 In addition, it provides several useful enhancements: #pragma MCPP debug, which traces the process of macro expansion or #if expression evaluation, and the header file "pre-preprocessing" facility. mcpp also provides several useful execution options, such as those for the warning level or the include directories.

Even if there are mistakes in the source, mcpp deals with them suitably, issuing accurate and plain diagnostic messages, without running out of control or displaying misguiding error messages. It also displays warnings for portability problems. Detailed documents are attached as well.

In spite of its high quality, mcpp's code size and memory usage are relatively small. A disadvantage of mcpp, if any, is its slower processing speed. It takes two or three times the time of GCC 3.*, 4.* / cc1; but seeing that its processing speed is almost the same as that of Borland C 5.5 / cpp32, and that it runs a little bit faster when the header file pre-preprocessing facility is used, it cannot be described as particularly slow. mcpp puts an emphasis on Standard conformance, source portability and operability in a small memory space, which makes this level of processing speed inevitable.
The Validation Suite for Standard C Preprocessing, which is used to test the extent to which a preprocessor conforms to Standard C, and its documentation cpp-test.html, which contains the results of applying the Validation Suite to various preprocessors, are also released with mcpp. When looking through that file, you will notice how many conformance-related problems even the so-called Standard-conforming preprocessors have.

Note:
*1 ISO/IEC 9899:1990 (JIS X 3010-1993) had been used as the C Standard, but in 1999, ISO/IEC 9899:1999 was adopted as a new Standard. This document calls the former C90 and the latter C99. The former is generally called ANSI C or C89, because it migrated from ANSI X3.159-1989. ISO/IEC 9899:1990 plus its Amendment 1995 is sometimes called C95. The C++ Standards are ISO/IEC 14882:1998 and its corrigendum version ISO/IEC 14882:2003. This document calls both of them C++98.

Though this document was a text file in the older versions, it changed to an html file at V.2.6.2.

Note:
*1 The outline of the "Exploratory Software Project" can be seen at the following site (Japanese pages only).

mcpp from V.2.3 through V.2.5 had been located at: In April 2006, the mcpp project moved to: The older versions of mcpp, cpp V.2.2 and Validation Suite V.1.2, are located at the following Vector web site. They are in the directory called dos/prog/c, but they are not for MS-DOS exclusively. The sources are for UNIX, WIN32 and MS-DOS. The documents are Japanese only. The text files in the archives available at Vector use [CR]+[LF] as a <newline> and encode Kanji in shift-JIS for DOS/Windows. On the other hand, those from V.2.3 through V.2.5 available at SourceForge use [LF] as a <newline> and encode Kanji in EUC-JP for UNIX. From V.2.6 on, two types of archive, a .tar.gz file with [LF]/EUC-JP and a .zip file with [CR]+[LF]/shift-JIS, are provided.

The source of mcpp consists of five header files and seven *.c files. The parts which depend on the OS or the compiler system are contained in the four files configed.H, noconfig.H, system.H and system.c. Either configed.H or noconfig.H is used, depending on the compiling method; they are never used simultaneously. There are also a few library function sources in system.c. When mcpp is compiled by a compiler system, these files need to be modified to match that compiler system. There are several types of mcpp executable, corresponding to its building methods.
The building methods of mcpp have the following two axes: whether the executable is compiler-independent or compiler-specific, and whether it is a stand-alone preprocessor or a subroutine (library) called from another program.

The following sections from 3.1 through 3.9 explain compiler-specific-builds. "mcpp for GCC", "implemented for Visual C" and the like in this document mean GCC-specific-build, Visual-C-specific-build, respectively.

There are two ways to compile mcpp. The first is to automatically generate a header file named config.h and a Makefile by executing the 'configure' script. After generating them, just run 'make; make install'. The header file named configed.H will be used in this way. However, the configure script can only be used on UNIX-like systems and on CygWIN or MinGW.

The other way is to 'make' using a makefile for each compiler system, with the header file modified/edited (if required) by the difference files. noconfig.H will be used in this case. The difference files and makefiles are in the 'noconfig' directory. Even for systems which can use the configure script, editing the header files and makefiles directly allows you to control the compilation in detail. However, difference files are only available for the supported compiler systems.

In this chapter, I explain how to compile mcpp using the difference files. Please refer to INSTALL for the configure script.

Note:
*1 While V.2.6 and V.2.6.1 called this 'stand-alone-build', V.2.6.2 changed the name upon the creation of the subroutine-build.
*2 mcpp V.2.6.3 and later provides some binary packages at the SourceForge site. They are all stand-alone and compiler-independent-builds.

The C/C++ compiler systems I could use are the following, and mcpp has been ported to all of them. Therefore, it has been verified that this source code can be compiled, and that the generated preprocessors run correctly, on each of these compiler systems. In every case the CPU used is the x86 type. The systems are all 32-bit versions, except Ubuntu, which is a 64-bit version. In addition, there are reports from some users on Visual C++ V.6.0, Visual C++ 2002 and C++Builder 2007 (aka BCC V.5.9), so you can compile mcpp on them, too.

Setting up to create mcpp executables with these compiler systems is quite easy. One only needs to change some macro definitions in noconfig.H. The *.dif files in the noconfig directory are difference files for modifying noconfig.H, which is by default set for FreeBSD 6.* / GCC 3.4, for use with each compiler system. For Visual C++ 2005, as an example, running the patch command in the src directory modifies these files. patch is a standard UNIX command and has been ported to Windows and other systems. Of course, you can also edit the source file directly, referring to the difference file, without using patch. Modifications to match your own system, such as specifying the include directories, have to be done by yourself, apart from the modifications made by the difference file. Makefiles for each compiler system, which compile these modified sources, are also attached. (See sec. 3.7.)

All the following operations should be done in the src directory, and they are all modifications of noconfig.H unless otherwise mentioned. For any of the following compiler systems, in order to make the compiler-specific-build, change the definition of the macro COMPILER from INDEPENDENT to the macro for the compiler system, for example MSC for Visual C. You can also overwrite the definition of COMPILER by a make option, such as 'make COMPILER=MSC'. If you modify noconfig.H by applying the difference file, the compiler-specific settings will also be modified for that compiler system, so you need not rewrite the definition of COMPILER in the file. Then, if you do 'make' with the option defining COMPILER, a compiler-specific-build will be made; otherwise a compiler-independent-build will be made.

In case the default include directories are different from the ones in this file, the macros C_INCLUDE_DIR1 and C_INCLUDE_DIR2 should be rewritten. If C++ has its own include directories different from those of C, these should be written in CPLUS_INCLUDE_DIR1, CPLUS_INCLUDE_DIR2 and CPLUS_INCLUDE_DIR3. (These directories can also be specified by environment variables or by the -I option at execution time.) All of these are compiler-system-specific directories. Include directories are also set in system.c. In UNIX terms, those set by system.c are the OS-specific one (usually /usr/include) and the site-specific one (usually /usr/local/include). As for Windows, no include directory is set in system.c nor in noconfig.H by default; they are to be specified by the environment variables INCLUDE and CPLUS_INCLUDE. If required, you should also change the built-in macro names defined by macros such as COMPILER_STD1 or COMPILER_STD2.

The default multi-byte character encoding is set to EUC-JP on UNIX and shift-JIS on Windows. If required, modify the macro called MBCHAR to change the encoding.
(The change of multi-byte character encoding can also be done by environment variables, execution options and #pragma.) Certain compiler systems do not support encodings such as shift-JIS or Big5, and their tokenization goes wrong when a byte with the value 0x5c, the same as '\\', appears within a multi-byte character. For these systems, mcpp needs a special setting to compensate for this inability of the compiler. Please refer to sec 4.1.1.5 for this setting.

With regard to the attached makefiles, you need to rewrite BINDIR, the directory where the executables of the compiler system are located. In GCC V.3 and V.4, the preprocessor is absorbed into the compiler-proper (cc1, cc1plus). So, to use mcpp, you must replace the calls of gcc and g++ with shell-scripts which execute first mcpp, then cc1 or cc1plus. The attached makefiles set this up automatically at installation. For the details, please see mcpp-manual.html#3.9.7. When the user does not have write permission for BINDIR, you must do 'sudo make COMPILER=GNUC install' on UNIX-like systems. On Windows, you must modify the permission of the directory with an administrator account prior to installation.

By default, the source is set to be compiled by GCC (GNU C) V.3.4 on FreeBSD 6.* and to make an mcpp of compiler-independent-build. In order to make the compiler-specific-build for FreeBSD 6.* / GCC V.3.4.*, first change the definition of COMPILER from INDEPENDENT to GNUC. Then, just complete it by compiling.
You can also overwrite COMPILER by the 'make COMPILER=GNUC' command. For other versions of GCC, modify the version number in VERSION_MSG and the version-number macros: for the first, write the major version number of GCC, and for the second, the minor version number, both as string-literals. The third is the value of the macro __GNUG__, which is the same as the first. And for the fourth, write the same number as the first, as a digit.
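For illustration, assuming that these four ordinals correspond to the macros COMPILER_EXT_VAL, COMPILER_EXT2_VAL, COMPILER_CPLUS_VAL and GCC_MAJOR_VERSION named below (an assumption on my part; verify against the comments in noconfig.H itself), a hypothetical setting for GCC 4.1.1 might look like:

    /* Hypothetical values for GCC 4.1.1 -- check noconfig.H itself.   */
    #define VERSION_MSG         "GCC 4.1.1"
    #define COMPILER_EXT_VAL    "4"     /* major version, string-literal */
    #define COMPILER_EXT2_VAL   "1"     /* minor version, string-literal */
    #define COMPILER_CPLUS_VAL  "4"     /* __GNUG__, same as the major   */
    #define GCC_MAJOR_VERSION   4       /* the major version, as a digit */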
If the version of FreeBSD is not 6.*, change the corresponding values. Furthermore, in case the include directories are different from the default ones of FreeBSD 6.*, you need to change the corresponding definitions; in some cases you may also need to set CPLUS_INCLUDE_DIR3 and C_INCLUDE_DIR1. If the version of GCC is 2.7-2.95, change the macro STDC_VERSION to 199409L. Even for other UNIX-like OSes, if the compiler system is GCC, I suspect one only needs to change things like these version numbers, the setting of the include directories, or the OS-specific built-in macros. (See sec 4.1.1.)

To change the setup from GCC on FreeBSD to GCC on Linux, change the definition of the macro SYSTEM to the one for Linux. Then modify the macros, as on FreeBSD: COMPILER, VERSION_MSG, COMPILER_EXT_VAL, COMPILER_EXT2_VAL, COMPILER_CPLUS_VAL, GCC_MAJOR_VERSION, CPLUS_INCLUDE_DIR1, CPLUS_INCLUDE_DIR2, C_INCLUDE_DIR1. For GCC 2.*, modify the value of STDC_VERSION. You should verify the include directories of your installation. The difference files in the 'noconfig' directory named linux_gcc2953.dif, linux_gcc32.dif, linux_gcc336.dif, linux_gcc343.dif and linux_gcc411.dif are for VineLinux 4.0 / GCC V.2.95.3, V.3.2, V.3.3.6, V.3.4.3 and V.4.1.1, respectively. For the compiler-specific-build, change COMPILER too.

The include directories may vary between distributions of Linux. Also, if another version of GCC is installed in addition to the system standard one, it should create another include directory for that specific version; specify the particular directory using the above macros. The specification of getopt() in glibc differs from the standard (POSIX) one, so please use the mcpp_getopt() in system.c instead.

On Mac OS X, you can install GCC by installing the Xcode package. After that installation, you will find many gcc, cc, g++, c++ and such in /usr/bin.
In Mac OS X 10.5 Leopard on Intel-Mac, i686-apple-darwin9-gcc-4.0.1 and i686-apple-darwin9-g++-4.0.1 are the native compilers for the machine.
The cross compilers are installed, too.
On Intel-Mac, powerpc-apple-darwin9-gcc-4.0.1 and powerpc-apple-darwin9-g++-4.0.1 are the cross compilers to generate a binary for PowerPc.
The names just gcc and g++ are symbolic links to gcc-4.0 and g++-4.0,
which behave as native compilers by default, but when invoked with the '-arch ppc' option, call the compiler-propers cc1 or cc1plus for powerpc.
Note that these compilers are also installed into /Developer/usr/bin. These are GCCs with many extensions specific to Mac OS X made by Apple.
The compiler system of Mac OS X differs from the GCCs on other systems in some important aspects.
First, it handles the special directories called "framework" as its system header directories.
Second, it can generate both of the binaries for Intel-Mac and PowerPc-Mac on either machine.
Moreover, it has a mechanism to make a "universal binary", which is a bundle of both binaries and is able to run on either machine.
In fact, gcc-4.0, i686-apple-darwin9-gcc-4.0.1, powerpc-apple-darwin9-gcc-4.0.1 and their corresponding g++s and cc1, cc1plus and other compiler-propers in /usr/libexec/gcc/SYSTEM/4.0.1 are all universal binaries for i386 and ppc.
(SYSTEM is i686-apple-darwin9 and powerpc-apple-darwin9.)
If we copy these universal binaries from Intel-Mac to PowerPc-Mac, they will run with the native and cross positions reversed, I suppose.
In addition, Intel-Mac even executes most ppc binaries, automatically translating them to x86 code. To sum up: there are many gccs and g++s in /usr/bin and /Developer/usr/bin, and there are also links to them; there are libexec directories for x86 and ppc; an executable may contain two binaries bundled in it; and a binary for ppc runs on x86. Such being the situation, we easily lose track of which is which. Be careful.
Here, I take examples of Mac OS X 10.5 (Leopard) on Intel-Mac.
On PowerPc-Mac, read these sections swapping i686 and powerpc (ppc).
On Mac OS X 10.4 (Tiger), read darwin9 as darwin8. It is quite simple to install mcpp with the native compiler.
To make settings for Mac OS X 10.5 / GCC 4.0.1 on Intel-Mac, apply mac_gcc401_i686.dif to noconfig.H.
Use mac_osx.mak as a Makefile.
The command sequence 'make; sudo make install' will generate a compiler-independent-build, and 'make COMPILER=GNUC; sudo make COMPILER=GNUC install' will install a GCC-specific-build.
The compilers in /usr/bin will be used for these commands, since usually /Developer/usr/bin is not in $PATH. To install mcpp on Intel-Mac with or for the cross compiler for PowerPc, apply mac_gcc401_powerpc.dif to noconfig.H.
Then edit the Makefile (mac_osx.mak). For a compiler-independent-build, change the definition of variables NAME, CC and CXX to those containing "powerpc" as noted by the comments in mac_osx.mak.
Then, do 'make; sudo make install'.
The binary is the one compiled "with the cross compiler", so it should run on a PowerPc-Mac. For a GCC-specific-build, change the definition of the variables NAME, INCDIR, BINDIR, target_cc and arch to those containing "powerpc" (do not change CC and CXX), and do 'make COMPILER=GNUC; sudo make COMPILER=GNUC install'.
This is a binary "for the cross compiler" running on Intel-Mac, hence runs on Intel-Mac. On PowerPc-Mac, mac_gcc401_powerpc.dif makes settings for the native compiler, and mac_gcc401_i686.dif do for the cross compiler, in reverse of Intel-Mac.
To compile mcpp with or for the cross compiler for Intel-Mac on PowerPc-Mac, change the definition of the variables above to those containing "i686". To make a universal binary, just enable the variable UFLAGS in mac_osx.mak by removing the '#' which comments out the line.
All the other settings are the same as in the section above.

For CygWIN V.1.3.10 / GCC V.2.95.3, add the changes in cyg1310.dif to noconfig.H. Then, rewrite the macro CYGWIN_ROOT_DIRECTORY to define CygWIN's root directory on Windows. The letters in the path-list are case-insensitive. For other versions, porting should be possible by modifying macros such as VERSION_MSG, C_INCLUDE_DIR?, CPLUS_INCLUDE_DIR? and CYGWIN_ROOT_DIRECTORY. Although CygWIN is a system on Windows, it simulates the UNIX file system. Therefore, mcpp treats CygWIN/GCC in almost the same way as UNIX/GCC, and presets the include directories as mcpp on UNIX does.

For MinGW / GCC V.3.4.5, add the changes in mingw345.dif to noconfig.H. Then, rewrite the macros MSYS_ROOT_DIRECTORY and MINGW_DIRECTORY to define MSYS's / and /mingw directories on Windows. The letters in the path-list are case-insensitive. For other versions, porting should be possible by modifying macros such as VERSION_MSG, C_INCLUDE_DIR?, CPLUS_INCLUDE_DIR?, MSYS_ROOT_DIRECTORY and MINGW_DIRECTORY. The path-list for the include directories may be either an absolute path such as "c:/dir/mingw/include" or MinGW's own path such as "/mingw/include". Since MinGW does not support symbolic links, the GCC-specific-build of mcpp cannot be invoked from gcc through a symbolic link. Moreover, MinGW / gcc refuses to invoke a shell-script even if it is named cc1. Therefore, compiling mcpp generates an executable named cc1.exe instead of a shell-script. In execution, gcc invokes this cc1.exe, from which mcpp.exe or GCC's cc1.exe/cc1plus.exe are invoked. Although the include directories are preset in the GCC-specific-build, they are not set in the compiler-independent-build, so you should specify them by the environment variables INCLUDE and CPLUS_INCLUDE.

For LCC-WIN32 2003-08 or 2006-03, noconfig.H needs to be changed as per lcc0308.dif or lcc0603.dif, respectively. For other versions, the VERSION_MSG macro needs to be modified.

For Visual C++ 6.0, 2002, 2003, 2005 and 2008, it needs the modifications of vc6.dif, vc2002.dif, vc2003.dif, vc2005.dif and vc2008.dif, respectively. For the compiler-specific-build, modify COMPILER or overwrite it by the nmake option, of course. For other versions of Visual C, besides modifying the VERSION_MSG macro, the values of the predefined macros _MSC_VER and _MSC_FULL_VER should be changed by modifying the definitions of COMPILER_EXT_VAL and COMPILER_EXT2_VAL, respectively.

For Borland C V.5.5 and V.5.9 (C++Builder 2007) / bcc32, it needs to be changed with bc55.dif or bc59.dif, respectively. For other versions of Borland C++, besides the VERSION_MSG macro, the values of the predefined macros __TURBOC__, __BORLANDC__ and __BCPLUSPLUS__ should be changed by modifying the macros COMPILER_STD2_VAL, COMPILER_EXT_VAL and COMPILER_CPLUS_VAL in noconfig.H. (Refer to sec 4.1.1.1.) If the version can handle digraphs, the definition of HAVE_DIGRAPHS needs to be changed. If the version has the __STDC_VERSION__ macro, change the definition of STDC_VERSION.

The DECUS cpp seems to have supported RT-11/DECUS C and RSX/DECUS C on PDP-11, VMS/VAX-11C, PDP-11/UNIX and VAX/ULTRIX - some kind of C on VAX. It also seems to have supported quite old versions of Microsoft C and Lattice C on MS-DOS. I removed these settings, as I suppose they are no longer required and I cannot maintain them.

system.H includes configed.H when the macro HAVE_CONFIG_H is defined to non-0; otherwise it includes noconfig.H. PART 1 and PART 2 of the mcpp settings are in configed.H and noconfig.H, and PART 3 is in system.H.
In these files, the macros which are required for porting to each compiler system are defined. When porting to a compiler system which has not been ported to yet, one needs to add from a few lines to a dozen lines in PART 1. PART 1 is the definition dependent on the OS and the target compiler, PART 2 is the definition dependent on the host system, and PART 3 is the definition of the mcpp behavior specification. In configed.H and noconfig.H, the target compiler system is assumed to be the same as the host, so PART 2 needs to be modified when it is different. When you do a porting with a configuration different from the default, please make sure to look through these files.

system.c absorbs the discrepancies of OS or compiler which cannot be absorbed solely by the configed.H (noconfig.H) or system.H macros. To port to a new compiler system, adding some tens of lines of source to this file may be required. This file contains the handling of the options for mcpp invocation, the usage message, the include directories, the handling of OS-specific directory paths when opening header files or source files, the processing of #pragma, and the processing of compiler-specific extension directives. Most of them are setups for the target OS and the target compiler system.

Among library functions, the C source code for getopt() and stpcpy(), which are not in the Standard, is written in system.c. Though mcpp also uses getcwd(), stat() and, on UNIXes, readlink(), these are not included here, because they depend on the OS and cannot be written portably. They are the only three low-level functions used in mcpp. Though they are not Standard C functions, they are required by POSIX, and every compiler system seems to provide them. *1, *2 mcpp's usage of library functions does not depend on specification differences between compiler systems, so these functions of any compiler system will not cause a problem unless there is a bug.

Note:
*1 On MinGW, spawnv() is used too.
*2 mcpp up to V.2.6.4 had a separate source file lib.c.
But, it was absorbed into system.c on V.2.7, since the functions written in it decreased to only two (getopt() and stpcpy()).
At the same time, getopt() was renamed to mcpp_getopt() in order to prevent linking troubles.
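For reference, stpcpy() behaves like strcpy() but returns a pointer to the end of the copied string; a typical portable implementation (a sketch of the usual semantics, not necessarily identical to the code in system.c) is:

    /* Copy src to dest, returning a pointer to the terminating null
       character of dest.                                               */
    char *  stpcpy( char * dest, const char * src)
    {
        while ((*dest = *src++) != '\0')
            dest++;
        return  dest;
    }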
In the source code of mcpp, stdio.h, string.h, stdlib.h, ctype.h, errno.h, limits.h, time.h, sys/types.h and sys/stat.h are included unconditionally. For UNIX-like systems, unistd.h is also included. There should not be a compiler system which lacks these.

The *.mak files are the makefiles for each compiler system, and allow a detailed setup. The 'make' command itself is assumed to be the one attached to each compiler system, or the standard one for the system. For Visual C, 'nmake' should be used instead of 'make'. Except for FreeBSD/GCC, modify noconfig.H as follows (assume the system is xyz): edit the macros COMPILER and VERSION_MSG, and edit the macros such as C_INCLUDE_DIR? in noconfig.H to suit your own system. After copying the corresponding noconfig/xyz.mak to Makefile, and setting up the target directory to match your system, run make. For other compiler systems, please write the necessary makefile referring to these files. The dependencies of the source files are simple: system.H needs to be included before internal.H.

To recompile mcpp using mcpp itself, place the executable into the location where the preprocessor of the compiler system resides. For instance, in the case of GCC 2.95, rename the resident cpp0 to something like cpp0_gnuc and link cpp0 to whichever cpp you use at the time. Therefore, if mcpp is the preprocessor you are going to use, you need to make cpp0 a link to mcpp. On Windows, you need to copy the one you are going to use to cpp32.exe or such. *1 You can set the name of the mcpp executable by the make variable NAME. (The same thing in BC make requires 'make -DNAME=mcpp'. For UCB make, -D can be either added or not. For GNU make, -D should not be added.)

Using the attached makefiles, 'make install' does not do any detailed work. Except for GCCs (i.e. freebsd.mak, linux.mak, mac_osx.mak, cygwin.mak and mingw.mak), please do the rest of the work manually. Please copy the resident preprocessor to another name beforehand, so as to prevent it from being deleted by 'make install'.

When you recompile mcpp using a one-pass compiler such as Visual C or Borland C, you should supply the output file of mcpp as the source file to the compiler. (For instance, output the preprocessed result of the source file main.c as main.i, and compile that with cl or bcc32.) When recompiling using mcpp, if the "pre-preprocess" functionality for the header files is used, the preprocessing time will be reduced dramatically. When you use the attached makefile, run the corresponding make target for UCB make, GNU make or MS nmake (or for BC make), which automatically pre-preprocesses the header files, then preprocesses, then compiles. For LCC-Win32's 'make', the 'if' statement cannot be used, so you need to edit the makefile and recompile. The details of the modification are in the makefile itself as comments.

In BSD make, GNU make or MS nmake, if you run make with the option MALLOC=KMMALLOC, this links the malloc() which I wrote. About this, please refer to 4.extra. For BC make, the same thing can be done by the option -DKMMALLOC. To link my malloc() with the make of LCC-Win32, you need to edit the makefile.

Note:
*1 In FreeBSD, the standard directory in which the preprocessor is located is /usr/libexec. See mcpp-manual.html#2.1. In Linux, it is located in a really deep directory such as /usr/lib/gcc-lib/i686-redhat-linux/3.3.2. In Linux/GCC, depending on the distribution or the version, this directory setting in the makefile needs to be modified. There are various different include directories, which you need to check.
Also, in Linux and FreeBSD, there is /usr/bin/cpp, which calls cpp0 or cc1, and gcc also calls cpp0 or cc1.

Though some configuration is required to port to each compiler system, compiling mcpp's source code can be done by any compiler system which satisfies the C90 specifications. *1, *2 The char type can be either signed or unsigned. Floating point operation is not necessary. This source code is written so as not to be affected by the minor discrepancies between compiler systems. Of course, it is necessary to avoid the compiler system's own bugs in order to actually compile with it, and these cannot be found out until one actually tries. When I was porting to some compiler systems, there were a few cases where it took me a long time to trace a bug and find a workaround.

The compiler systems which mcpp does not support are those with special character sets or special CPUs, as well as pre-C90 compilers. EBCDIC is not supported. CPUs whose integer operation is not two's complement are also not supported: on such CPUs, mcpp may run incorrectly when an overflow occurs in a #if expression.

Note:
*1 Up to V.2.5, the mcpp source was compilable even by a K&R 1st compiler. From V.2.6, it presupposes a C90 compiler, because the K&R spec is no longer required by current compiler systems. I tidied up the source and this document accordingly.
*2 Up to V.2.6.2, the mcpp source was compilable by C++, too. V.2.6.3 and later, however, is compilable only by C.

There is no need for the compiler system which compiles the mcpp source code (the host) and the compiler system which will use the generated mcpp executable (the target) to be the same. If they are different, select the target by SYSTEM and COMPILER, and the host by HOST_SYSTEM and HOST_COMPILER, within noconfig.H (configed.H). Also, the definitions in PART 1 are the settings for the target, and the ones in PART 2 are for the host. system.c is mainly for the target. However, there are the following limitations. By the way, the host and the target stated here have nothing to do with those of a cross-compiler. Cross-compiling is the job of the compiler itself, and in principle the preprocessor is not concerned with it. When mcpp is ported "to a cross-compiler", the cross-compiler is the target compiler system here; as for the host compiler, you need to use one which is not the cross-compiler. When mcpp is compiled "by a cross-compiler", the cross-compiler is the host compiler system, and the target of the cross-compiler becomes the target compiler system.

This section describes the compiler systems which older versions of mcpp supported and whose support was later dropped. Although the following systems were supported in the older versions, mcpp V.2.4 removed the settings for them. The documents on the following compiler systems were removed in V.2.5. In V.2.6, the code for the above two was removed, and the code and documents on the following compiler systems were removed. V.2.6 also removed all the settings for MS-DOS and other small-memory systems, and removed the settings for pre-C90 compiler systems. V.2.7.2 removed the setting for the following system. These compiler systems are old ones, and the number of their users has probably become small by now.

As for DJGPP, define SYSTEM and HOST_SYSTEM to SYS_WIN32 and HAVE_INTMAX_T and HAVE_INTTYPES_H to FALSE in noconfig.H, and define NBUFF as about 1/4 of the default value in system.H.
*1 As for the compiler systems on MS-DOS, define SYSTEM and HOST_SYSTEM to SYS_WIN32 and HAVE_LONG_LONG to FALSE in noconfig.H, define NBUFF as about 1/16 of the default and IDMAX as about 1/4 of the default in system.H, and define SBSIZE as about 1/8 of the default in directive.c. Then compile with the large memory model.

Note:
*1 It was reported that DJGPP / GCC 4.1.0 successfully compiled mcpp V.2.6.1 with this setting.

mcpp can be built as an independent preprocessor which behaves on its own, not depending on any compiler system. Making a compiler-independent-build is quite easy, because almost the only requirement is that the compiler system can compile mcpp's source successfully. The invocation options and other specifications of the compiler-independent-build are the same regardless of the compiler with which mcpp is compiled. The include directories are not preset, except /usr/include and /usr/local/include on UNIX-like systems; hence you have to specify the rest of them by environment variables or by the -I option. *1, *2

To make a compiler-independent-build with GCC, simply run the configure script and then 'make; make install' in mcpp's root directory. In this case, the header file configed.H is used. For further details of configuring, see the document INSTALL.

On a system where configure is not applicable, you can patch noconfig.H using the corresponding difference file in the noconfig directory, if mcpp has already been ported to the compiler system. No other modification of the source is needed. As a makefile, you can copy the corresponding *.mak file in the noconfig directory and edit the variable BINDIR to specify the installation directory. Then, in the src directory, do 'make' and 'make install'. In case the version of the compiler differs a little from an already-ported version, first apply the patch for the nearest version, then edit noconfig.H.

For the compiler systems to which mcpp has not yet been ported, edit noconfig.H and modify or add several macros. First, define HOST_COMPILER appropriately. Next, define COMPILER as INDEPENDENT, and define VERSION_MSG appropriately. There is no target compiler for the compiler-independent-build, so nothing is required in PART 1. PART 2 depends on the extent to which the host compiler implements the Standard's specifications, and also on whether the necessary functions are provided.
The most often encountered discrepancy among compilers is the implementation of 'long long' or its corresponding data type.
In Visual C 2002, 2003, 2005 and Borland C 5.5, the type is '__int64'.
Its length modifier for printf() is 'I64', not 'j' nor 'll', except in Visual C 2005.
Hence, define the macro LL_FORM as "I64" for these compilers.
On MinGW, the specifier is also "I64", though it has long long.
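As an illustration of how such a length-modifier macro is typically used, spliced into a format string by string-literal concatenation (a sketch; the type name LL_T below is only for this example, and the Visual C 2005 exception noted above is ignored for brevity):

    #include <stdio.h>

    #if     defined(_MSC_VER) || defined(__BORLANDC__)
    #define LL_FORM "I64"               /* printf() length modifier     */
    typedef __int64     LL_T;
    #else
    #define LL_FORM "ll"
    typedef long long   LL_T;
    #endif

    int main( void)
    {
        LL_T    val = (LL_T)1234567 * 1000000;
        printf( "%" LL_FORM "d\n", val);    /* "%I64d\n" or "%lld\n"    */
        return  0;
    }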
In Visual C 2008, by contrast, the type name 'long long' is available. If the compiler's library does not have the function stpcpy(), define HOST_HAVE_STPCPY to FALSE. Write a makefile yourself, referring to the *.mak files. (Refer to 3.7 too.)

Mac OS X provides a cross compiler as well as a native compiler, so that we can make binaries for both x86 and powerpc on either machine.
Moreover, it provides a mechanism to make a "universal binary", which is a bundle of binaries for x86 and powerpc.
Therefore, the settings necessary to compile mcpp on Mac OS X are a bit complex.
I explain the compiler-independent-build and the compiler-specific-build together in 3.1.4.

By the way, at the SourceForge site, mcpp V.2.6.3 and later supplies some binary packages of compiler-independent-build. These are packaged according to the respective packaging methods, and their packaging specifications are found in the setting files contained in each corresponding source package. All the packages for FreeBSD, Linux and Mac OS X use the configure script to compile.

Note:
*1 In mcpp V.2.4 and V.2.5, the specification of the compiler-independent-build was a compromise with the compiler's specification. From V.2.6 on, the specification is its own, independent from the compiler.
*2 Even if you could compile mcpp on MS-DOS, you may run into a shortage of memory at execution time, since memory is very scarce on MS-DOS. You must largely cut down the translation limits before compiling. Refer to 3.10.

mcpp can be compiled as a subroutine to be called from some other main program. Like the stand-alone preprocessor, this subroutine accepts execution options from the caller, preprocesses the specified input file, writes the result into the output file, then returns to the caller. It can be called repeatedly on several input files, if required. It does not, however, pass the output token by token to its caller. The subroutine-build can also write its output into an on-memory buffer.
mcpp is compiled as a subroutine when the macro MCPP_LIB is defined to non-0. The entry to this subroutine is mcpp_lib_main(), a rename of main() in the stand-alone-build, and accordingly it is declared as:

    int     mcpp_lib_main( int argc, char ** argv);

If you use GCC, do not define the macro COMPILER to GNUC; leave it defined as INDEPENDENT, because the GCC-specific setting generates an executable to be installed into GCC's libexec directory and called from the gcc command, which is not the desired one here. On the other hand, if you use Visual C, Borland C or LCC-Win32, you can define COMPILER either to INDEPENDENT or to one of MSC, BORLANDC and LCC, because these compilers have no independent preprocessor and never call mcpp even if it is compiled with compiler-specific settings. Each compiler-specific setting defines the compiler-specific predefined macros, the compiler-specific options and some peculiar language specifications. You can choose either the compiler-specific or the compiler-independent setting for your convenience.

There are two ways of compiling the subroutine (library) build: using the configure script, and using the noconfig/*.mak makefiles. You can use the configure script if you compile with GCC. On Linux and FreeBSD, the configure-based commands will generate libmcpp.a and libmcpp.so.$(SHLIB_VER) of compiler-independent-build, and will install them into /usr/local/lib by default.
Then libmcpp.so and libmcpp.so.0 will be created as links to libmcpp.so.$(SHLIB_VER).
The *.a is a static library, and the *.so is a shared library.
The file named libmcpp.la will also be created, which is for the tool 'libtool'.
$(SHLIB_VER) is 0.0.0 on mcpp V.2.6.3, 0.0.1 on V.2.6.4, 0.1.0 on V.2.7, 0.2.0 on V.2.7.1 and 0.3.0 on V.2.7.2.
Also, the header files mcpp_lib.h and mcpp_out.h are installed into /usr/local/include; they are necessary for a user program to use libmcpp. On Mac OS X, the name of the shared library is *.dylib, not *.so.
And, if you specify some CPUs by -arch option, such as "make CFLAGS+='-arch i386 -arch ppc'", a universal binary for those CPUs will be generated.
In addition, you can widen the range of Mac OS X version on which the binary will run by -isysroot and -mmacosx-version-min= options.
For example, a shared library can be generated on Leopard as a universal binary for i386 and ppc that is usable on Tiger. On CygWIN and MinGW, a DLL named *.dll will be generated instead of *.so.
On CygWIN, libmcpp.a, libmcpp.dll.a, libmcpp.la will be generated and installed into /usr/local/lib, and cygmcpp-0.dll will be generated and installed into /usr/local/bin.
On MinGW, things are the same as on CygWIN, except that cygmcpp-0.dll changes to libmcpp-0.dll.
To use the DLL, link your main program against libmcpp.dll.a, which is an 'import library' for the DLL. Also main_libmcpp.c is compiled and installed into /usr/local/bin under the name of mcpp which links the generated libmcpp.so.
As you see in the source, you should include mcpp_lib.h to use libmcpp. *1 Minimal documents are installed too.
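For illustration, a minimal main program using libmcpp might look like the following sketch, assuming the declaration of mcpp_lib_main() quoted above (the option and file names are just samples; see testmain.c for the authoritative usage):

    #include    "mcpp_lib.h"

    int main( void)
    {
        /* Argument vector as if mcpp were invoked from the shell:
           preprocess input.c into output.i, keeping comments (-C).    */
        char *  args[] = { "mcpp", "-C", "input.c", "output.i", NULL };

        return  mcpp_lib_main( 4, args);    /* mcpp's exit status      */
    }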
You can compile testmain.c as well and link it against one of the libraries.
The configure does not do this.
See 3.12.2. Note: *1 The compiler-independent-build of stand-alone mcpp, which does not use libmcpp, is installed into the same directory under the same name.
So, they overwrite each other.
As for the documents, the same ones are installed into the same directory, so they also overwrite each other.

If you use 'noconfig.H' and the makefiles in the 'noconfig' directory (after applying an appropriate patch to noconfig.H according to the compiler and its version, and adjusting the directory settings in the makefile), run 'make' with the 'mcpplib' target. While these commands make a compiler-independent-build, you can also make a compiler-specific-build by adding an option 'COMPILER=MSC' or the like, unless the compiler is GCC. On Linux, FreeBSD, Mac OS X, CygWIN and MinGW, the result is the same as that of the configure above, except that libmcpp.la is not created and the documents are not installed.
$(SHL_VER) is 0, 1, 2, 3 on mcpp V.2.6.3-V.2.6.4, V.2.7, V.2.7.1, V.2.7.2, respectively.
$(SHLIB_VER) is 0.0.0, 0.0.1, 0.1.0, 0.2.0, 0.3.0 on mcpp V.2.6.3, V.2.6.4, V.2.7, V.2.7.1, V.2.7.2, respectively.
$(DLL_VER) is 0 on any of mcpp V.2.6.3, V.2.6.4, V.2.7, V.2.7.1 and V.2.7.2.
If the first digit of $(SHLIB_VER) or $(DLL_VER) is the same, the versions with a higher second or third digit of $(SHLIB_VER) are upward-compatible with those with a lower one. On Windows, a so-called 'import library' is generated too, which is used to link against the DLL.
In order to use the DLL, you must link your main program against this import library.
The static library and the import library are installed into the $(LIBDIR) specified in the makefile, and the DLL itself is installed into the $(BINDIR).
Note that on Windows any DLL must be located in one of the directories listed in the environment variable PATH. mcpp_lib.h and mcpp_out.h are installed into $(INCDIR). main_mcpplib.c is compiled and installed into $(BINDIR) under the name of mcpp, which links libmcpp. On Windows, the option 'DLL_IMPORT=1' specifies linking against the DLL.
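The usual mechanism behind such an option is sketched below; the macro name MCPP_DECLSPEC here is hypothetical, not necessarily the one used in mcpp_lib.h:

    /* Generic sketch of the DLL import/export idiom on Windows.       */
    #if     defined(_WIN32) && defined(DLL_IMPORT)
    #define MCPP_DECLSPEC   __declspec( dllimport)  /* client of the DLL */
    #elif   defined(_WIN32) && defined(DLL_EXPORT)
    #define MCPP_DECLSPEC   __declspec( dllexport)  /* building the DLL  */
    #else
    #define MCPP_DECLSPEC                           /* static library    */
    #endif

    extern MCPP_DECLSPEC int    mcpp_lib_main( int argc, char ** argv);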
You can test libmcpp further by compiling testmain.c as a main program and linking it against one of the libraries. testmain.c also has a sample which passes output via a memory buffer.
To enable the memory buffer, include the header file 'mcpp_lib.h' in your main program, and use the functions declared in it.
The option 'OUT2MEM=1' will enable this in testmain.c.
Note that the macro OUT2MEM is only for testmain.c, not for mcpp.
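A sketch of the memory-buffer usage follows, assuming the interface suggested by the function names above: mcpp_use_mem_buffers() to divert the output to memory and mcpp_get_mem_buffer() to fetch it, with the stream selector OUT coming from mcpp_out.h (verify the exact signatures in mcpp_lib.h):

    #include    <stdio.h>
    #include    "mcpp_lib.h"

    int main( void)
    {
        char *  args[] = { "mcpp", "input.c", NULL };   /* sample argv  */
        char *  result;

        mcpp_use_mem_buffers( 1);           /* divert output to memory  */
        mcpp_lib_main( 2, args);            /* preprocess input.c       */
        result = mcpp_get_mem_buffer( OUT); /* the preprocessed text    */
        if (result)
            fputs( result, stdout);
        return  0;
    }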
When you use the library-build of mcpp, write your makefile referring to noconfig/*.mak, and write your main program referring to testmain.c. There are two kinds of library: static and shared.
On Windows the latter is called DLL (dynamically-linked library).
A static library is a collection of *.o (*.obj) files, which is linked (i.e. copied) to an executable at compile time.
All the global names of functions and data in the static library are visible to the executable, and are exposed to the danger of name collisions.
This is a problem for mcpp, since it had been developed without considering library use until V.2.6.1.
On the other hand, a shared library (or a DLL) is linked at run time, and shared by several executables, if any, at the same time. On Windows, the global names in the DLL are not visible from outside, and only the names explicitly exported can be imported.
From the DLL of mcpp, for example, only the names in mcpp_lib.h can be imported. On UNIXes, the global names in the shared library are all visible from outside by default, hence you have to be sensitive to name collisions.
GCC 4.0 and later, however, can cope with this problem.
From the shared library of mcpp V.2.7 and later, compiled by GCC V.4.0 or later, only the names in mcpp_lib.h are visible. *1, *2

To sum up, it is recommended to use a DLL on Windows and a shared library compiled by GCC 4.0 or later on UNIXes, because you need not worry about name collisions with these libraries.

Note:
*1 mcpp V.2.7 and later uses the '#pragma GCC visibility *' directive, which has been implemented since GCC V.4.0.
*2 Though GCC V.4.1 and later have the '-fstack-protector' option, that option does not seem to coexist with the '#pragma GCC visibility hidden' directive, so it cannot be used to compile libmcpp.
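As note *1 says, this relies on GCC's standard visibility mechanism; a minimal sketch of the idiom (not mcpp's actual header) is:

    /* With GCC 4.0 or later, hide every symbol by default ...         */
    #pragma GCC visibility push(hidden)
    void    internal_helper( void);     /* not visible outside the .so  */
    #pragma GCC visibility pop

    /* ... and export only the deliberately public entry points.       */
    __attribute__((visibility("default")))
    int     mcpp_lib_main( int argc, char ** argv);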
I think you should be able to understand most of what is written in these header files if you read them; I have written lots of comments as well. Just in case, I note the following. noconfig.H (configed.H) contains PART 1 and PART 2 of the settings, and PART 3 is in system.H.

First, select the target system (the system for which mcpp is to be built) and the host system (the system which compiles mcpp). Though there is a certain naming convention for SYSTEM and COMPILER, it is easier to look at the source code. Though this is overstating it a bit, SYSTEM is only used for the type of the path-lists of include files and to know the standard include directory of the OS, so one does not need to be concerned with it too much. There are some other macros predefined in system.c according to run-time options, such as CPU-dependent ones. Besides, GCC V.3.3 or later predefines many macros; hence the mcpp installation auto-generates four specific header files named mcpp_g*.h for those macros. All the macros predefined by the above settings are disabled by the -N option at execution time.

The macro called MBCHAR is used to specify the type of encoding for multi-byte characters. In mcpp, all the following encodings are implemented at the same time. MBCHAR only specifies the default encoding, which can be changed by environment variables, options or #pragma at execution time. (Refer to mcpp-manual.html#2.3, mcpp-manual.html#2.8 and mcpp-manual.html#3.4 for how to use them.) By the way, the behavior of the compiler as regards multi-byte characters may vary depending on the environment at execution time. Set these macros to match your environment; regarding this, please refer to mcpp-manual.html#2.8.

The next two are written in PART 2 for convenience. Set them to TRUE when both the target and the host system have the nominated type, otherwise set them to FALSE. In noconfig.H and configed.H, the target system is assumed to be the same as the host system; if not, PART 2 needs to be rewritten. Also in PART 1, there are a few parts which assume the target is the same as the host, for example the lines using the predefined macros of the host compiler. Modify them, if required.

In system.H, there are macro definitions to specify the behavioral specification of mcpp. There is a variable named 'mcpp_mode' in the mcpp source, and the value of this variable determines the framework of mcpp's behavior, such as the macro expansion method, the available preprocessing directives and the predefined macros. There are the following four modes (four values of the variable 'mcpp_mode') in mcpp: STD, POST_STD, KR and OLD_PREP. The mode of mcpp is specified by run-time options; therefore, in compiling mcpp, nothing is required to be set concerning these modes. Nevertheless, you must understand the differences between these behavioral modes in order to set the other macros correctly. In this document, I group OLD_PREP and KR into pre-Standard mode, and group STD and POST_STD into Standard mode. For the details of the specification of these modes, refer to mcpp-manual.html#2.1.

Note:
*1 UCN is a C++98 / C99 specification: notation of a Unicode character value by a hexadecimal escape sequence beginning with \u or \U. (See mcpp-manual.html#3.7, cpp-test.html#2.8, cpp-test.html#4.6.)

The specification becomes better with bigger sizes for each of these, but the bigger the sizes of NWORK, NBUFF, NMACWORK or SBSIZE, the more memory is used. Other than the buffers, the actual memory consumption increases with the number of macro definitions.
The specification becomes better with bigger sizes for each of these, but the bigger NWORK, NBUFF, NMACWORK or SBSIZE is, the more memory is used. Besides the buffers, the actual memory consumption increases with the number of macro definitions. (Strictly speaking, it is not the number of macro definitions themselves but the total length of the macro definitions that matters. The internal format of macro definitions is written as 'struct defbuf' in internal.H.) NMACWORK, NEXP and RESCAN_LIMIT consume stack. The other settings do not need much memory, but setting values above the defaults in system.H may be meaningless in actual processing. The minimal translation limits required by C90 or C99 are written toward the end of system.H. The translation limits of C++98 are also written there, but unlike those of the C Standards, they are not a required specification.

Some settings, mainly for the target compiler systems, are written here. Source code for getopt() and stpcpy(), among non-Standard library functions, is also written here. getopt() is renamed mcpp_getopt() in order to prevent linking troubles. stpcpy() is used when HOST_HAVE_STPCPY == FALSE.

"kmmalloc -- malloc() with debugging functions" is a portable source of malloc(), free(), realloc() and calloc() which I wrote to improve memory efficiency and debugging convenience. It comes with a debug routine, and unexpected bugs can be caught if it is linked in. *1, *2 The reason why I provide the -DKMMALLOC -D_MEM_DEBUG -DXMALLOC options in noconfig/*.mak is to link my malloc() with its debug routines. If mcpp, linked with this, exits with error number EFREEP, EFREEBLK, EALLOCBLK, EFREEWRT or ETRAILWRT, it indicates an mcpp bug. If you define any of BSD_MALLOC, DB_MALLOC or MALLOC_DBG to 1 and compile mcpp, the corresponding debugging malloc() is used instead of mine. In any case, to use a malloc() other than the system library's, you have to build that library before you compile mcpp. About this, please see the document of kmmalloc. (This document is written in Japanese only, sorry.)

Note: *1 kmmalloc is at the following location. *2 On CygWIN, my malloc() cannot be used, as the library structure does not allow other malloc()s to be used.
Also, Visual C 2005 and 2008 have the same kind of problem.
"The Validation Suite for the Standard C conformance of preprocessing" is also made public along with mcpp. I tried to make it able to verify every specification of Standard C preprocessing. Of course, mcpp is checked with this suite; it was also compiled by all the above-mentioned compiler systems and verified. Therefore, I do not think many bugs or wrong specifications remain, but some may have been left. When porting to a compiler system never ported to before, there may also be bugs in the compiler system itself. If you find unusual behavior, please contact me, after checking the following points.

If a diagnostic message of "Bug: ..." is displayed, that is definitely a bug of mcpp or of the compiler system (more likely of mcpp). If mcpp runs out of control while processing jumbled "source", that is also a bug. Of course, mcpp in the modes other than STD behaves "incorrectly" on the Validation Suite; that is the specification (even then, it should not run out of control). Please see 4.1.3 for the details of the specifications. There is a library called kmmalloc which I wrote, containing functions such as malloc(). (Refer to 4.extra.) If mcpp is linked with my malloc() and exits with error number 120-124 (or 2120-2124 for some compilers), that is definitely a bug of mcpp or of the compiler (possibly of a library function). Also, if you write it somewhere in the sample source used in the test, the information on the heap memory is output at that location and at the end; if the error message "Heap error: ..." is shown there, that is also a bug of mcpp or of the compiler system. If any bugs are found, please repeat the test, enclosing each part of the sample source with #if 0 and #endif, and mark out where the bug is. Please attach the following data to the bug report.

I tried to write mcpp so that it can be ported relatively easily to any compiler system. However, I have only a small number of compiler systems at hand, and porting to the others will require adding some source code. I am looking forward to hearing porting reports for those compiler systems, and I would like to feed such reports back into the source. Please include the following data in a porting report.

For a compiler-specific-build, to check whether it has been ported correctly, it may be easiest to replace the resident preprocessor first and then re-compile mcpp itself using the pre-preprocess functionality. After that, use the Validation Suite for STD mode. However, the Suite requires a lot of effort when debugging repeatedly, since there are so many files. During debugging, first compile 'n_std.c' to see whether it compiles and executes correctly. Some compiler drivers may not have an option to pass arguments on to mcpp; please refer to mcpp-manual.html#2.2 for that. Alternatively, you can first preprocess with mcpp, then pass the output to the compiler. If it fails, check manually where the problem is, using the sample n_std.t. If this succeeds, check e_std.t, m_*.t, unspcs.t, warns.t and misc.t. In "post-Standard" mode, n_post.t and e_post.t should be used. Process these with the mcpp -QCz23 options (except -3 for post-Standard). If mcpp was compiled with STDC == 0, add the -S1 -V199409L options as well. As the comments are also output thanks to the -C option, you should be able to see whether the processing result is the expected one. If you use the program cpp_test.c in the Validation Suite, you can run the sample tests of n_*.c and i_*.c automatically.
(This only checks yes or no and does not tell the details. Also, other tests such as e_*.?, u_*.?, unspcs.? and warns.? are not included. To test mcpp itself, it is quicker to compile n_std.c.) The Validation Suite also has testcases for GCC / testsuite. Therefore, when mcpp is ported to one of the versions of GCC, mcpp can be tested automatically by replacing GCC's preprocessor with mcpp, provided that GCC / testsuite is installed. About this, please see cpp-test.html#3.2.3 and mcpp-manual.html#3.9.7.

mcpp provides a configure script usable on UNIX-like systems. However, I have no knowledge of compiler systems other than GCC on UNIX systems, so some options need to be specified to configure a compiler-specific-build. Someone who uses those compiler systems should know, or be able to check, the details of these options; if you know them, please let me know. I would like to develop the configure script further. Please refer to INSTALL for the configure script. When you cannot port successfully, please let me know what is happening; if you attach the following data, I may be able to return a ported source. In environments where configure can be used, much of this data can be obtained through it. By the way, from mcpp V.2.6 onward, pre-C90 compiler systems are no longer supported. If the host compiler and the target compiler are different, I need all the above data for both systems.

Listed like this, there seem to be many things to check; but in practice most compiler systems share characteristics with the ones already ported, so there should not be many problems in porting just to the point of running. Implementing the execution options, the #pragma directives and the non-standard specifications is the relatively time-consuming part; these can be done gradually after porting just to the point of running. The only annoying cases are when you get caught by compiler bugs.

The Validation Suite results of the preprocessors of the compiler systems I have are summarized in cpp-test.html#6. Please let me know the results of testing other compiler systems. It may take some effort, as there are many items; the test by cpp_test.c does not take long, so please send me at least that. In the case of GCC, the automatic test by the Validation Suite can be used.

Besides reporting bugs, please send me feedback on anything, such as the handiness of mcpp, the diagnostic messages, the source code, the Validation Suite, my interpretation of Standard C, or the way the documents are written. This preprocessor was created as a hobby, but it is the result of devoting six and a half years, with lots of ideas, up until V.2.0. Having done that much work, I want to make it as good as I can. About the C preprocessor, I think I have done almost everything meaningful, except testing and porting on the compiler systems I do not have. I would like to improve it if any problems remain. The code of Martin Minow was very clear, viceless and easy to understand, and I learned a lot just by reading that source. The people interested in this field may be very few, but I am looking forward to feedback and information. Please send information and feedback to the "Open Discussion Forum" at: or by e-mail.

When I started messing about with DECUS cpp in January 1992, I had never dreamed it would stretch this long; I just thought I would change it a bit during the new year's break.
Once I started, I realized I had to read the source properly, and it took me about two months to read through; I did it because the source was worth reading. Then I revised some of the specification to adapt it to C90. Up to this point, everything went as planned. Then I realized that I did not really know the C90 preprocessor specification precisely. When I read P. J. Plauger & Jim Brodie, "Standard C" (1989), the function-like macro expansion method turned my preconceptions around completely. (The Japanese translation of this part was mistranslated.) So I bought a copy of Standard C and repeatedly read those difficult sentences related to preprocessing. As a result, I found that C90 preprocessing differs in many points from the traditional one; the addition of the # and ## operators is only a small part of it. Above all, I puzzled over the function-like macro expansion routine. I thought it over for two or three weeks, consulting the cpp source of E. Ream, and then wrote a new macro expansion routine for C90. I have never worked my brain so hard as when thinking out that algorithm. That was April 1992.

Well, I thought I was over the hump and my playing with cpp was finished, but it took almost six further years. There were not many problems that made me suffer during the rest; nevertheless, it took that long. That was partly because I got tired of thinking and could not always concentrate on messing around with cpp. But that was not all: I also did the following things.

In this list, the documentation took the longest. Especially in the last four years, only a little time went into changing the source, while most of it was spent writing documentation. That is why the documentation grew to such a volume, but the time taken was not only because of the volume. While writing the documents, uncertain parts of the specification kept coming up; each time, I re-read the Standards and revised the source code. The time spent revising the source was not long, but the number of revisions was large. I read not only the preprocessing parts but the whole of the Standard carefully, including the Rationale of ANSI C. It is as if I learned C90 by creating the preprocessor; I also came to understand the problems of the C90 standard through this.

At first, I wrote a few test programs as samples. However, I found unexpected bugs every time I wrote a sample and ran it on mcpp. So I decided to write a Validation Suite that would test every specification of C90 preprocessing. The problems of C90 became obvious through writing this Validation Suite. Complying with the irregular parts of C90 was troublesome and a bit meaningless in itself, but I am sure there were more meaningful gains. What I learned through this work is the following.

This way of thinking is a sort of perfectionism. Most things in the world cannot be achieved by perfectionism, and software is no exception. Nevertheless, there are some areas where perfectionism has a very important role; language processing systems may be one of them. I could spend so many years on this, through and through, because it was my hobby. But six and a half years was too long. I kept wondering who would use the program after I had spent so many years perfecting it. I think this must be about the size limit for a program written as a hobby. Anyway, as I have come this far with mcpp, I will keep maintaining it.
Therefore, could everyone please send me feedback, bug reports and porting reports.

After releasing mcpp V.2.0, I updated it to V.2.1, V.2.2 and then V.2.3. These updates adapted mcpp to C99 and the officially approved ISO C++, increased the supported systems, and fixed bugs. I could update quite easily up to V.2.2; it took only three months from V.2.0 to V.2.2. In contrast, it took nearly four years from V.2.2 to V.2.3. The main reason was that I became busy and did not have enough time. After turning 60 in July 2000, I cut my job down to four days a week and went back to playing with cpp again.

V.2.3 took not only time but quite a lot of work as well. When I ported mcpp to GCC V.2.9x, I found that I had to modify a lot to keep compatibility with GCC/cpp. I added many options and implemented the extended specifications. I also eased the restrictions of the Standard, downgrading some errors to warnings and removing the most frequent warnings from the default warning class. Many of those modifications were backward steps and not enjoyable. In particular, maintaining both the C99 specification and parts of the "traditional" pre-C90 specification was very much against my will. Unfortunately, this is a reality of the open source world, and I had to meet certain expectations. By relaxing the restrictions of the Standard, I think mcpp also became easier to use as a replacement for the resident preprocessors of other compiler systems.

During the update to V.2.3, mcpp and the Validation Suite were selected for the 2002 "Exploratory Software Project" of the Information-technology Promotion Agency, Japan (IPA). I found out about this project by chance and applied, and the project manager, Yutaka Niibe, selected me. So the development proceeded from July 2002 to February 2003 with IPA's funding and based on PM Niibe's advice. The translation of the documents was undertaken by HighWell. Though this is relatively small software, it has become a kind of life work of mine after spending so much time on it. I was confident of its quality, but disappointed that it had no opportunity to become known to the world. Finally, that opportunity was given. To accomplish the project, I cut my job down to three days a week.

These were the things I had intended to do in this project. Then Project Manager Niibe proposed the following points: As I wanted to do these things too, I gratefully added them to the project. In reality, however, my project was delayed again and again for various reasons. First, I was hit by a disk crash. Whenever I did something new, it took a long time, since I had to use software I had never used before. It was also my first time compiling GCC from source, and I ran into a few problems there. Updating the massive volume of documents, and reviewing and correcting the English versions, also took considerable time. Furthermore, my mother was admitted to hospital. As a result, part of the project, such as support for the commercial compiler systems, had to be given up in the end.

As I had always worked by digging one hole deeper and deeper, it took a long time when I had to widen the hole. When an amateur programmer digs deep into a subject, that is the only way to do it. Nevertheless, for the result to go out into the world, the hole had to be widened to some extent.
In the process of widening the hole, I managed to learn some new software and to stand in the front line of development, receiving advice and encouragement from Project Manager Niibe. I was also delighted to see my documents coming back in flowing English. Though being pressed for time was painful, each experience was fresh and fun.

The "Exploratory Software Project" did not end there. Project Manager Ichiji selected mcpp as a continuing project for the year 2003. So I started on some unfinished tasks from the previous year, as well as on some areas in which I had no previous experience. This time my six-year-old PC ran into trouble, and there were further troubles while upgrading the hardware and OS. It also took time to learn new software, and of course the development fell behind schedule. The condition of my mother, who had been out of hospital and relatively well, worsened toward the end of the project, which was another source of anxiety. (My mother died in February 2004.) However, thanks to Project Manager Ichiji setting a reasonable due date, I could work through the tasks thoroughly without rushing. I accomplished tasks such as the port to Visual C++, the creation of the configure script, and support for the various multi-byte character encodings. I also managed to clean up the source code, which, though inconspicuous, cannot be neglected by me as the author. The time-consuming work of updating the Japanese and English documents was accomplished with the co-operation of HighWell. With these achievements, I was named one of the software "super-creators" by PM Ichiji! Though that may overestimate my ability, I think it was the years of mcpp development that were recognized, and I am very glad. I think mcpp has become the world's best quality C/C++ preprocessor, thanks to the "Exploratory Software Project", which lasted nearly two years. As a middle-aged amateur programmer, I am satisfied that I did my best. I am continuing to update mcpp even after the project. Many tasks remain; to achieve them and to make mcpp widely known, I will continue to proceed steadily.

mcpp is a C preprocessor developed by kmatsui (Kiyoshi Matsui), based on DECUS cpp written by Martin Minow and then rewritten entirely. mcpp means Matsui cpp. This software is supplied as source code, and to use mcpp with any compiler system, a small amount of modification to adapt it to the compiler system is required before it can be compiled into an executable. *1 This document describes the specifications of the mcpp executables already ported to certain compiler systems. Those who want to know more about mcpp, or want to port it to other compiler systems, should refer to the mcpp source and its document mcpp-porting.html. All the sources and related documents are provided as open-source software. Before going into detail, some of mcpp's features are introduced here.

Note: *1 mcpp V.2.6.3 onward also provides some binary packages, at the following site.

mcpp is a portable preprocessor supporting various operating systems, including Linux, FreeBSD and Windows. Its source is highly portable and can be compiled by any compiler which supports Standard C or C++ (ANSI/ISO C or C++). Only the classic library functions are used.
To port mcpp to a compiler system, in many cases one only needs to change some macro definitions in the header files and simply compile. In the worst case, adding several dozen lines to a source file would be enough.

To process multi-byte characters (Kanji), it supports Japanese EUC-JP, shift-JIS and ISO2022-JP, Chinese GB-2312, Taiwanese Big-5 and Korean KSC-5601 (KSX 1001), as well as UTF-8. For shift-JIS, ISO2022-JP or Big-5, mcpp can complement the compiler-proper if the latter does not recognize them.

mcpp has various behavioral modes. Besides the Standard-conforming mode, there are a K&R 1st mode, a "Reiser" cpp mode and what I call post-Standard mode. mcpp also has an execution option to run as a C++ preprocessor. Unlike many existing preprocessors, the Standard mode of mcpp has the highest conformance to the Standards: all of C90, C99 and C++98. It has been developed aiming to become the reference model of the Standard C preprocessor. The version of the Standard can be selected by an execution option. *1 In addition, it provides several useful enhancements: '#pragma MCPP debug', which traces the process of macro expansion or #if expression evaluation, and the header file "pre-preprocessing" facility. mcpp also provides several useful execution options, such as the warning level and include directory specification options.

Even if there are mistakes in the source, mcpp deals with them suitably, issuing accurate and plain diagnostic messages, without running out of control or displaying misleading error messages. It also warns about portability problems. Detailed documents are attached as well.

In spite of its high quality, mcpp's code size and memory usage are relatively small. A disadvantage of mcpp, if any, is its slower processing speed. It takes two or three times as long as GCC 3.*, 4.* / cc1; but seeing that its speed is almost the same as that of Borland C 5.5/cpp32, and that it runs a little faster when the header file pre-preprocessing facility is used, it cannot be called particularly slow. mcpp puts emphasis on Standard conformance, source portability and operability in a small memory space, which makes this level of processing speed inevitable.

The Validation Suite for Standard C Preprocessing, which tests the extent to which a preprocessor conforms to Standard C, and its documentation cpp-test.html, which contains the results of applying the Validation Suite to various preprocessors, are also released with mcpp. Looking through that file, you will notice that so-called Standard-conforming preprocessors have many conformance problems.

During the development of mcpp V.2.3, it was selected as one of the "Exploratory Software Projects for 2002" by the Information-technology Promotion Agency (IPA), Japan, along with its Validation Suite. From July 2002 to February 2003, the project, financed by IPA, proceeded under the advice of project manager Yutaka Niibe. I asked "HighWell, Inc.", Tokyo, to translate all the documents; for the technical details, I revised and corrected the translated documents. mcpp was again adopted as one of the "Exploratory Software Projects" in 2003, under project manager Hiroshi Ichiji, and the update of mcpp proceeded to the next version, V.2.4. *2 After the project, I am still updating mcpp and the Validation Suite.

Note: *1 ISO/IEC 9899:1990 (JIS X 3010-1993) had been used as the C Standard, but in 1999, ISO/IEC 9899:1999 was adopted as the new Standard. This document calls the former C90 and the latter C99.
The former is generally called ANSI C or C89, because it originated as ANSI X3.159-1989. ISO/IEC 9899:1990 + Amendment 1995 is sometimes called C95. The C++ Standards are ISO/IEC 14882:1998 and its corrigendum version ISO/IEC 14882:2003; this document calls both of them C++98. *2 The outline of the "Exploratory Software Project" can be seen at the following site (Japanese only).

mcpp from V.2.3 through V.2.5 had been located at: In April 2006, the mcpp project moved to: The old versions, cpp V.2.2 and Validation Suite V.1.2, are located at the following Vector web site. They are in the directory called dos/prog/c, but they are not exclusively for MS-DOS: the sources are for UNIX, WIN32 and MS-DOS. Their documents are Japanese only. The text files in those archives available at Vector use [CR]+[LF] as a <newline> and encode Kanji in shift-JIS, for DOS/Windows. On the other hand, the archives from V.2.3 through V.2.5 available at SourceForge use [LF] as a <newline> and encode Kanji in EUC-JP, for UNIX. From V.2.6 on, two types of archive are provided: a .tar.gz file with [LF]/EUC-JP and a .zip file with [CR]+[LF]/shift-JIS. Though this manual was a text file in the older versions, it has been an html file since V.2.6.2.

There are two types of build (or compiling configuration) of the mcpp executable. *1, *2 Each mcpp executable has the following 5 behavioral modes, regardless of the build type. The mode of mcpp is specified by run-time options as follows: In this document, I group OLDPREP and KR into the pre-Standard modes, and STD, COMPAT and POSTSTD into the Standard modes. Since COMPAT mode is almost the same as STD mode, STD includes COMPAT unless otherwise mentioned. *3

There are differences in the macro expansion methods between the Standard and pre-Standard modes. Roughly speaking, this is the difference between C90 and pre-C90. The biggest difference lies in the expansion of function-like macros (macros with arguments). For an argument containing macros, mcpp in Standard mode substitutes it for the parameter in the replacement list only after expanding the argument completely, while in pre-Standard mode it substitutes the argument unexpanded and then expands it at rescan time. Also, in Standard mode a macro is in principle not expanded recursively, even if the macro definition is directly or indirectly recursive; in pre-Standard mode, a recursive macro definition causes infinite recursion and becomes an error at expansion time.

The handling of \ at line end also differs by mode. In Standard mode, after trigraph processing, every sequence of <backslash><newline> is deleted before tokenization; in pre-Standard mode, such sequences are deleted only within string literals and in #define lines.

There is a subtle difference in tokenization (token parsing, decomposition into tokens). Standard mode tokenizes on the principle of "token-based processing"; to put it concretely, spaces are inserted around an expanded macro to prevent unexpected merging with adjacent tokens. Pre-Standard mode retains traces of the traditional, convenient and tacit tokenization and of the macro expansion method of "character-based text replacement". About these, please see cpp-test.html#2. A sketch of the macro expansion difference is given below.
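To illustrate the expansion difference described above, consider this hypothetical, directly recursive definition (not a Validation Suite sample):

    #define f( a)   a + f( a)
    f( x);
    /* Standard mode: the argument is expanded first, and 'f' is not
     * re-expanded recursively on the rescan, so the result is:
     *     x + f( x);
     * pre-Standard mode: the rescan expands f() again and again,
     * ending in an infinite-recursion error.  */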
In Standard mode, numeric tokens are handled as "preprocessing numbers", according to the Standard specification; in pre-Standard mode, numeric tokens are the same as integer constant tokens or floating point tokens. The suffixes 'U', 'u', 'LL' and 'll' of integer constants and the suffixes 'F', 'f', 'L' and 'l' of floating point constants are not recognized as part of the token in pre-Standard mode. Wide-character string literals and wide-character constants are recognized as single tokens only in Standard mode.

Digraphs, #error, #pragma and the _Pragma() operator are available only in Standard mode. Also, the -S <n> option (strict-ansi specs) and the -+ option (running as a C++ preprocessor) are usable only in Standard mode. The predefined macros __STDC__ and __STDC_VERSION__ are defined in Standard mode, and are not defined in pre-Standard mode. 'defined' in #if, and #elif, cannot be used in pre-Standard mode. Macros cannot be used within the argument of #include or #line in pre-Standard mode. The predefined macros __FILE__, __LINE__, __DATE__ and __TIME__ are not defined in pre-Standard mode. On the other hand, #assert, #asm (#endasm), #put_defines and #debug are available only in pre-Standard mode. A #if expression is evaluated in long / unsigned long or long long / unsigned long long in Standard mode, and in (signed) long only in pre-Standard mode. sizeof (type) in a #if expression can be used only in pre-Standard mode. Trigraphs and UCNs (universal character names) are available only in STD mode. The output of diagnostic messages also differs slightly between the modes; please see chapter 5 for details. Any other items, which have no distinct rules between K&R 1st and the Standards, follow the C90 rules in pre-Standard mode as well.

The differences of OLDPREP mode from KR mode, and of POSTSTD and COMPAT modes from STD mode, are as follows: Moreover, there is a mode called lang-asm.
That is a mode to process anomalous sources: assembler sources which nevertheless have C comments, directives and macros embedded.
While POSTSTD mode cannot be combined with it, STD, KR and OLDPREP get into this mode when it is specified by an option.
See 2.5 for its specifications.

For the above reasons, mcpp executables have some differing specifications, so please read this manual carefully. This chapter describes first the common options, next the behavioral-mode-dependent options, then the options common to most compiler systems, and finally the compiler-dependent options of each compiler-specific-build.

Note: *1 There is another build named subroutine-build, which is called as a subroutine from some other main program. The behavioral specification of a subroutine-build is, however, the same as that of either a compiler-specific-build or a compiler-independent-build, according to its compile-time settings. Hence, this manual does not treat subroutine-build separately.
As for subroutine-build, refer to mcpp-porting.html#3.12. *2 The binary packages provided at the SourceForge site are compiler-independent-builds. *3 mcpp used to have two separate executables for Standard mode and pre-Standard mode; they were integrated into one at V.2.6. *4 This option is for compatibility with GCC, Visual C++ and other major implementations. 'compat' means "compatible mode".

The <arg> and [arg] shown below indicate required and optional arguments, respectively. Note that the characters <, >, [ and ] themselves must not be entered. mcpp invocation takes the form of: Note that you may have to replace the above "mcpp" with another name, depending on how mcpp is installed. When out_file (an output path) is omitted, stdout is used unless the -o option is specified. When in_file (an input path) is omitted, stdin is used. Diagnostic messages are output to stderr unless the -Q option is specified. If any of these files cannot be opened, preprocessing is terminated with an error message.

For an option with an argument, white-space characters may or may not be inserted between the option character and the argument; in other words, both "-I<arg>" and "-I <arg>" are acceptable. For options without arguments, both "-Qi" and "-Q -i" are valid. For an option with an argument, a missing required argument causes an error, except for the -M option. If the -D, -U, -I or -W option is specified multiple times, each occurrence is valid. For the -S, -V or -+ option, only the first occurrence is valid. For the -2 or -3 option, the setting toggles each time the option is specified. For all other options, the last occurrence is valid. The option letters are case sensitive. The switch character is '-', not '/', even on Windows. When invalid options are specified, a usage message is displayed. To see the valid options, enter a command such as "mcpp -?". In addition to the usage message, there are several error messages, but they are self-explanatory, so I omit their explanations.

This section covers the options common across mcpp modes and compiler systems. The -M* options output source file dependency lines for a makefile. When there are several source files, a -M* option is specified for each of them, and the outputs are merged into one file, the dependency description lines are aligned. These options are similar to those of GCC, but there are several differences. *1

Note: *1 mcpp differs from GCC in that:

mcpp has several behavioral modes; for their specifications, refer to sec 2.1. This manual shows lists of the various mcpp behaviors by mode, which may not make for easy reading; please be patient. In this manual, all the uppercase names that do not begin with "__" and are displayed in italics, such as DIGRAPHS_INIT, TRUE, FALSE, etc., are macros defined in system.H. These macros are used only for compiling mcpp itself; a generated mcpp executable does not predefine them. You must understand this point clearly.

The following options are available in Standard mode: The following option is available in STD mode:

Note: *1 C++'s __STDC__ is undesirable and causes many problems. The GCC documentation says that __STDC__ needs to be predefined in C++ because many header files expect __STDC__ to be defined; the header files are to be blamed for this. For the parts common to C90, C99 and C++, "#if __STDC__ || __cplusplus" should be used. *2 Unlike C99, the C++ Standard makes much of UCNs; so did the C 1997/11 draft. A half-hearted implementation is not permitted.
However, implementing Unicode in earnest is too heavy a burden for a preprocessor. *3 In C90, mcpp treats // as a comment but issues a warning. *4 This is for compatibility with GCC. *5 If you install the GCC-specific-build of mcpp, cc1 (cc1plus) is set up to be handed the file preprocessed by mcpp, along with the -fpreprocessed option.
Though this option means that the input is already preprocessed,
cc1 still processes comments.
Therefore, you can safely pass the output of -K to cc1 with -fpreprocessed.
Furthermore, if you add the -save-temps option to the gcc (g++) command, the preprocessed output is left as a *.i (*.ii) file, which a refactoring tool can read. *6 Comment insertion by the -K option causes column shifts in the source, and this makes the *.S files of GCC, which are not C/C++ sources and are compiled with the -x assembler-with-cpp option, impossible to assemble.
Also, comments kept by the -C option are sometimes confused with those inserted by the -K option.
Therefore, these options cannot be used at the same time. *7 This option fails to keep the column position in some particularly complex cases.
When line splicing by a <backslash><newline> and line splicing by a line-crossing comment are intermingled in one output line, or when a comment crosses more than 256 lines, the column position is lost.
Note that each '\v' and '\f' is converted to a space, with or without this option. The following two options can be used on UNIX-like systems, in both compiler-independent-build and GCC-specific-build.
On GCC-specific-build, however, these get an error if the underlying GCC does not support them. Since GCC has so many options, the GCC-specific-build of mcpp uses some options that differ from those of the other builds, in order to avoid conflicts with GCC's own options. Note that the options of a compiler-independent-build are all the same even if it is compiled by GCC. The options common to the builds other than GCC-specific-build are as follows.

If the token that follows the line-top # does not match any of the C directives above, mcpp outputs the line as it is, without raising an error. If the token that follows the line-top # is not even an identifier or a pp-number, mcpp discards the line with a warning, but without raising an error. The old-fashioned string literals above are concatenated into "A very very\nlong long\nstring literal". These sometimes occur in GNU source code; the corresponding option of GCC is -x assembler-with-cpp or -lang-asm.
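The example referred to above is not reproduced in this text; presumably it was a string literal spanning several source lines, something like the following reconstruction (an assumption):

    char *s = "A very very
    long long
    string literal";

In this mode the pieces are joined with embedded newlines, giving "A very very\nlong long\nstring literal".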
To use mcpp in place of a compiler system's resident preprocessor, install it under an appropriate name in the directory where the resident preprocessor is located. Before copying mcpp, be sure to rename the resident preprocessor so that it is not overwritten. For settings on Linux, FreeBSD or CygWIN, see 3.9.5. For settings in GCC 3.* and 4.*, see also 3.9.7 and 3.9.7.1.
For MinGW, see 3.9.7.1. Possibly the compiler driver cannot pass some options to mcpp in the normal manner. However, GCC provides the almighty option -Wp, which allows you to pass any option to the preprocessor. For example, if you specify as follows: the -W31 and -Q23 options are passed to the preprocessor. The options you want to pass to the preprocessor have to be specified following -Wp, with each option delimited by ','. *1, *2 For other compiler systems, if the source of the compiler driver is available, it is recommended to add this kind of almighty option to it. If you modify the compiler driver source so that, for example, when -P<opt> is specified, only -<opt> is passed to the preprocessor, it would be very convenient, because any option could then be passed. An alternative way to use all the options of mcpp is to write a makefile which first preprocesses with mcpp, then compiles the output of mcpp as a source file. For this method, refer to sections 2.9 and 2.10.

The following options are available in some compiler-specific-builds; the compiler-independent-build does not have these options, of course. The following options are available in the LCC-Win32-specific-build. The following options are available in the Visual C-specific-build. mcpp on Mac OS X accepts the following option in both GCC-specific-build and compiler-independent-build. mcpp on Mac OS X accepts the following option in GCC-specific-build. The following options (up to the end of this section 2.6) are available in the GCC-specific-build. Note that, since __STDC__ is set to 1 for GCC, the result is the same with or without the -S1 option. The following are available across the modes. The following options are available in Standard mode. For STD mode, the following options are available (these cannot be used in POSTSTD mode). The following option is available in pre-Standard mode of the GCC-specific-build. The next option is available in the CygWIN GCC-specific-build. mcpp neither treats the following options as errors nor does anything about them (it sometimes issues a warning).

In GCC V.3.3 or later, the preprocessor has been absorbed into the compiler, and an independent preprocessor no longer exists. Moreover, gcc often passes to the preprocessor options not meant for it, even when invoked with the -no-integrated-cpp option. The GCC-specific-build of mcpp for V.3.3 or later ignores the following options, when it cannot recognize them, as that kind of false option.

Note: *1 -Wa and -Wl are almighty options for the assembler and the linker, respectively. The documentation of UNIX/System V/cc describes these options; probably, GCC provides the -W<x> options for compatibility. *2 In GCC V.3, cpp was absorbed into cc1 (cc1plus); therefore, the options specified with -Wp are normally passed to cc1 (cc1plus). To have cpp (cpp0), not cc1, do the preprocessing, the -no-integrated-cpp option must be specified on gcc invocation. *3 GCC V.3.3 or later predefines several dozen macros. The -dD option does not regard these macros as predefined, and does output them. *4 The output of the -dM option is similar to that of '#pragma MCPP put_defines' ('#put_defines'), with the following differences: *5 Refer to 3.9.6.3.

In the compiler-independent-build of mcpp, no include directories are set up other than /usr/include and /usr/local/include on UNIX systems. Other directories, if required, must be specified by environment variables or run-time options. The environment variables of the compiler-independent-build are INCLUDE for C and CPLUS_INCLUDE for C++.
Searching for an included file starts from the directory of the including source file by default (refer to 4.2 for the search rule).
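For instance, under a hypothetical layout, the default rule works like this:

    /* in /home/user/project/src/main.c (hypothetical path): */
    #include "util.h"   /* /home/user/project/src/util.h is tried first,
                           before the configured include directories */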
Besides, on Linux there is a confusion of include directories, and a special setup is necessary to cope with it; refer to 3.9.9 for the problem. For the default include directories of GCC-specific-build, refer to the noconfig/*.dif files; for the search rules and the environment variable names, refer to 4.2. For the environment variables LC_ALL, LC_CTYPE and LANG, refer to 2.8.

mcpp can process the following multi-byte character encodings. The encoding used during execution can be specified in the following ways (priority is given in this order): The way to specify an <encoding> is basically the same across '#pragma __setlocale', the -e option and the environment variables. In the table below, the encoding on the left-hand side is specified by the <encoding> on the right-hand side. <encoding> is not case sensitive, and '-' and '_' are ignored; moreover, if it contains a '.', the character sequence up to the '.' is ignored. Therefore, EUC_JP, EUC-JP, EUCJP, euc-jp, eucjp and ja_JP.eucJP are all regarded as the same. '*' represents any character sequence of zero or more bytes (iso8859-1 and iso8859-2 are equivalent to iso8859*). If any of the following encodings is specified, mcpp no longer recognizes multi-byte characters: C, en* (english), latin* and iso8859*. When a non-ASCII ISO-8859 Latin-<n> single-byte character set is used, one of these encodings must be specified. When an empty name is given (#pragma __setlocale( "")), the encoding is restored to the default.
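For instance, switching the encoding in mid-source might look like this (a hypothetical fragment; per the matching rule above, "sjis" is taken as one plausible spelling for shift-JIS):

    #pragma __setlocale( "sjis")    /* shift-JIS from here on           */
    char *kanji = "...";            /* literal interpreted as shift-JIS */
    #pragma __setlocale( "")        /* restore the default encoding     */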
Only in the Visual C-specific-build, the following encoding names can be specified with '#pragma setlocale'. This is for compatibility with Visual C++. It is recommended to use these names, because the Visual C++ compiler-proper cannot recognize encoding names other than these ('-' can be omitted for mcpp, but not for the Visual C++ compiler-proper). In Visual C++, the default multi-byte character encoding varies depending on what the language parameter and the "Region and Language Options" of Windows are set to; however, a #pragma setlocale specification takes precedence over these Windows settings.

GCC sometimes fails to handle the shift-JIS, ISO2022-JP and Big-5 encodings, which may contain a byte of value 0x5c; so the GCC-specific-build of mcpp complements GCC there. *1

Note: *1 If the --enable-c-mbchar option is specified when configuring GCC itself, that GCC recognizes an encoding specified by the environment variable LANG set to one of C-EUCJP, C-SJIS or C-JIS, according to gcc's info.
This way of configuring seems to have been available from 1998 onward, but it has seldom been used, and its implementation does not work.
Although the GCC-specific-build of mcpp had supported such environment variables as LANG=C-SJIS, that feature was removed in V.2.7.

Compilers whose preprocessor is integrated into the compiler itself are called one-pass compilers; these include Visual C, Borland C and LCC-Win32. Such compilers have become popular because they achieve somewhat higher processing speed. However, the time spent on preprocessing is becoming shorter anyway, thanks to better hardware performance, and in the first place there is much value in preprocessing being a common phase, mostly independent of the run-time environment and the compiler system. The growing popularity of one-pass compilers is not desirable: it will bring more compiler-system-specific specifications. In any case, it is impossible to replace the preprocessor of a one-pass compiler with mcpp. To use mcpp, a source program is preprocessed with mcpp, and the output is then passed to the one-pass compiler; as you see, preprocessing takes place twice, which is wasteful but inevitable. Using mcpp still has the merit of source checking, and makes available functions which the resident preprocessor does not have. To use mcpp with a one-pass compiler, the procedure must be written in a makefile. For sample procedures, refer to the makefiles used to re-compile mcpp itself, such as visualc.mak, borlandc.mak and lcc_w32.mak. Although the GCC 3 and 4 compilers now integrate the preprocessing facility into themselves, gcc provides an option to use an external preprocessor; use that option when mcpp is used (see 3.9.7).

It is difficult to use mcpp in an Integrated Development Environment (IDE), because an IDE's GUI follows compiler-system-specific specifications whose internal interfaces are usually not made available to third parties; furthermore, one-pass compilers make it harder to insert a phase in which to use mcpp. This subsection describes how to make mcpp available in the Windows / Visual C++ 2003, 2005 and 2008 IDEs; use the compiler-specific-builds for Borland C and LCC-Win32 on the command line. It is also described here how to make mcpp available with Mac OS X / Xcode.app / Apple-GCC.

mcpp cannot be used in a normal "project", since the internal specifications of Visual C++'s IDE are not made available to third parties and the compiler is a one-pass compiler. However, once a makefile that uses mcpp is created, Visual C++'s IDE can recognize the makefile, and you can create a "makefile project" from it. This allows you to utilize most of the IDE functions, including source editing, search and source-level debugging. "Creating a Makefile Project" in the Visual C++ 2003 documentation, the Visual C++ 2005 Express Edition Help and the Visual C++ 2008 Express Edition Help describe how to make a makefile project. Perform the following procedure to create a makefile project. You can then use every function, including Edit, Build, Rebuild and Debug.

Note: *1 On VC 2003 and 2005, to use the debugging function under Windows XP Professional or Windows 2000, a user must belong to a group called "Debugger users"; Windows XP Home Edition does not provide such a group, so one has to log in as an administrator. On VC 2008, this limitation on user groups was lifted. *2 In order to use the source-level debugging function, the makefile must be written so that cl.exe is called with the -Zi option appended, to generate debugging information. *3 If you start Visual Studio by selecting "Start" -> "Programs", environment variables, such as those for the include directories, are not set.
In order to have these variables set, you should open the 'Visual Studio command prompt' and start Visual Studio from it by typing, on VC 2003: and on VC 2005 Express Edition and VC 2008 Express Edition: *4 You must have write permission for the directory into which you install mcpp.
If you try to install into the 'bin' or 'lib' directory of the compiler system, the permissions should be set carefully from an administrator account.
It is recommended to make the user account belong to the "Power users" or "Authenticated users" group, and to give that group the "write" and "modify" permissions on the directory.
Another way of controlling the permissions is to install the compiler system into a directory on which the user has write permission, such as a shared directory. You can use Xcode.app, the IDE on Mac OS X, with mcpp without problems. *1 Xcode.app uses the gcc (g++) in /Developer/usr/bin, rather than the one in /usr/bin, for some reason.
(/Developer is the default install directory for Xcode.)
To use mcpp in Xcode.app, you must install a GCC-specific-build for the gcc (g++) in that directory.
You should do as follows to install it.
(${mcpp_dir} means the directory where the source of mcpp is placed.) The installation method is the same as that for the gcc in /usr/bin, except for the PATH setting.
So, please refer to INSTALL for installation for a cross-compiler or installation of a universal binary. After installing mcpp in this way, you can use Xcode.app without any special settings for mcpp.
Also, the Apple-GCC-specific *.hmap files, which are "header map" files generated by Xcode.app, are read and processed by mcpp.
However, mcpp does not process precompiled headers.
It processes '#include "header.pch"' as an ordinary #include.
Also, mcpp does not preprocess Objective-C and Objective-C++; *.m and *.mm source files are handed directly to cc1obj and cc1objplus, bypassing mcpp. When you use mcpp-specific options, specify them as follows:

Note: *1 Here we refer to Mac OS X Leopard / Xcode 3.0.

mcpp has its own enhancements. Each compiler-system-resident preprocessor also has its own enhancements, some of which are not available in mcpp. This section covers these enhancements and their compatibility problems.

In principle, mcpp outputs #pragma lines as they are. This principle applies even to the #pragma lines processed by mcpp itself, because the compiler-proper may interpret the same #pragma in its own way. However, mcpp does not output lines beginning with '#pragma MCPP', since these are for mcpp only. Also, mcpp does not output lines of '#pragma GCC' followed by 'poison', 'dependency' or 'system_header'. Moreover, mcpp outputs none of '#pragma once', '#pragma push_macro' or '#pragma pop_macro', because they are useless in the later phases.
On the other hand, '#pragma GCC visibility *' is output, because it is meaningful to the compiler and the linker. *1 mcpp compiled with EXPAND_PRAGMA == TRUE expands macros in #pragma lines (actually, EXPAND_PRAGMA is set to TRUE only in the Visual C-specific-build and the Borland C-specific-build); however, #pragma lines whose first token is STDC, MCPP or GCC are never macro-expanded.

#pragma sub-directives are implementation-defined, hence there is a risk that a sub-directive of the same name has different meanings in different compiler systems; some device is necessary to avoid name collisions. Moreover, when EXPAND_PRAGMA == TRUE, there should be a device to keep the name of a #pragma sub-directive itself from being macro-expanded. This is why the mcpp-specific sub-directives begin with '#pragma MCPP' and are not subject to macro expansion; the device is adopted from '#pragma STDC' of C99 and '#pragma GCC' of GCC 3. '#pragma once' is, however, implemented as it is, since this pragma has been implemented by many preprocessors and now carries no risk of name collision. '#pragma __setlocale' is prefixed with "__" instead of MCPP, because it also has meaning for the compiler-proper, and because the prefix avoids the user name space.

Note: *1 The GCC-specific-build of mcpp supports only '#pragma GCC system_header' among the pragmas starting with GCC; it does not support '#pragma GCC poison' or '#pragma GCC dependency'.

mcpp in Standard mode uses '#pragma MCPP put_defines', '#pragma MCPP preprocess' and '#pragma MCPP preprocessed'. Pre-Standard mode uses #put_defines, #preprocess and #preprocessed. Let me explain, taking the #pragma forms as the example. When mcpp encounters a '#pragma MCPP put_defines' directive, it outputs all the macros defined at that point in the form of #define lines. Of course, #undef-ed macros are not output. The macros that cannot be #defined or #undef-ed, such as __STDC__ and the like, are also output in the form of #define lines, but enclosed in comment marks. (Since __FILE__ and __LINE__ are special macros defined dynamically at each macro invocation, the replacement lists output here mean nothing.) In pre-Standard mode and POSTSTD mode, mcpp does not memorize the parameter names of function-like macro definitions, so these directives mechanically represent the first, second, third parameters as a, b, c, ... and so on; from the 27th parameter, they continue with a1, b1, c1, ..., a2, b2, c2, ... and so on. If you enter this directive after invoking mcpp from the keyboard, without specifying input and output files, all the predefined macros are listed. A comment is also output to indicate the source file name where each macro definition was found, as well as its line number. If you invoke mcpp with options such as -S1 or -N, you will see a different set of predefined macros.
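The put_defines output might look roughly like this (a hypothetical sketch, not actual mcpp output):

    #define MAX( a, b)  ((a) > (b) ? (a) : (b))    /* an ordinary user macro    */
    /* #define __STDC__ 1 */                       /* un-#define-able macro,
                                                      enclosed in comment marks */

The real output also carries, for each definition, a comment with the source file name and line number, in a format not reproduced here.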
When mcpp encounters a '#pragma MCPP preprocess' directive, it outputs the following line: This line indicates that the source file has already been preprocessed by mcpp. When mcpp encounters a '#pragma MCPP preprocessed' directive, it determines that the source file has been preprocessed by mcpp, and continues to output the code it reads as it is, until it encounters a #define line. When mcpp does encounter a #define line, it determines that the rest of the source file is all #define lines and defines the macros; at this point, mcpp memorizes the source file name and line number recorded in the comment. *1, *2 A '#pragma MCPP preprocessed' applies only to the lines that follow it in the source file where it is found; if that source file is an #included one, '#pragma MCPP preprocessed' no longer applies once control returns to the #including file.

Note: *1 The actual processing is a little more complex. When mcpp encounters a '#pragma MCPP preprocessed', it outputs the lines it reads just as they are, except for #line lines, which a compiler-specific-build of mcpp converts into a format the compiler-proper can accept. mcpp disregards a predefined Standard macro there, because its #define line is enclosed in comment marks. *2 Therefore, the information on where each macro definition was found is not lost during pre-preprocessing.

With the above directives, you can "pre-preprocess" header files; this considerably saves the entire preprocessing time. I think the explanation so far has already given you an understanding of how to pre-preprocess header files, but to deepen that understanding, let me take mcpp's own source code as an example. The mcpp source consists of eight *.c files, of which seven include "system.H" and "internal.H"; no other headers are included. The source looks like this: system.H includes noconfig.H or configed.H, as well as several standard header files. mcpp.H is not a source file I provide, but the "pre-preprocessed" header file we are going to generate. To generate mcpp.H (of course, after setting up noconfig.H and the other headers), invoke mcpp as follows: For compiler systems such as GCC, also specify the -b option. Enter the following directives from the keyboard: Enter "end-of-file" to terminate mcpp. This completes mcpp.H, which consists of the preprocessed system.H and internal.H, followed by a set of #define lines. Including mcpp.H gives the same effect as including system.H and internal.H, but its size is a fraction of that of the original header files, including the standard ones, because the #if's and comments have been eliminated. It takes far less time to include mcpp.H in the seven *.c files than to include system.H and internal.H seven times. By using '#pragma MCPP preprocess', much more time can be saved. On compilation, use the -DPREPROCESSED=1 option. It is recommended to write the above procedure into a file and have the makefile refer to it; the makefile and preproc.c appended to the mcpp sources contain the procedure, so please refer to them. Although the usage of an independent preprocessor is limited with one-pass compilers like Visual C, Borland C or LCC-Win32, the pre-preprocessing facility is useful even for those.

The pre-preprocessing facility of header files is similar to the -dD option of GCC, but differs from it in that: As far as the pre-preprocessing facility is concerned, mcpp is more accurate and practical than GCC.
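The invocation command and the keyboard input are not reproduced in this text. Based on the description above, a plausible session is sketched below; the exact command line and option details are assumptions:

    /* typed at the keyboard, after invoking mcpp with mcpp.H as output: */
    #pragma MCPP preprocess     /* emits the '#pragma MCPP preprocessed' line */
    #include "system.H"
    #include "internal.H"
    #pragma MCPP put_defines    /* appends the accumulated #define lines      */
    /* then enter end-of-file */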
The #pragma once directive is available in Standard mode. #pragma once is also available in GCC, Visual C, LCC-Win32 and the compiler-independent preprocessor called Wave. This directive is used when you want a header file to be included only once. With this directive in a header file, mcpp includes the header file only once, even if #include lines for that file appear many times. Usually, compiler-system-specific standard header files prevent duplicate definitions by using the following code: #pragma once provides similar functionality to this. Macro guards, however, always involve reading the header file. (The preprocessor cannot skip reading the code as people do: it must read the entire header file looking for the #if's and #endif's; it must read a comment before it can determine whether a line is a directive line, that is, a line with # at the beginning followed by a preprocessing directive; to do so, it must identify string literals; after all, it must read through the entire header file and perform most of the tokenization.) #pragma once eliminates the need even to access the header file, resulting in improved processing speed for multiple inclusion.

To determine whether two header files are identical, the file name characters, including the directory names of the search path, are compared. Windows is not case sensitive; therefore "/DIR1/header.h" and "/DIR2/header.h" are regarded as distinct, while "header.h" and "HEADER.H" are regarded as the same on Windows, but as distinct on UNIX-like systems. A directory is memorized after conversion to an absolute path, and a symbolic link on UNIX systems is memorized after dereferencing. Moreover, the path-list is normalized by removing redundant parts such as "foo/../". So, identical files are always determined correctly. *1, *2, *3

I borrowed the idea of #pragma once from GCC V.1.*/cpp. GCC V.2.* and V.3.* still have this functionality, but it is regarded as obsolete. The specification of GCC V.2.*/cpp was changed as follows: if an entire header file is enclosed by #ifndef _MACRO, #define _MACRO and #endif, cpp memorizes this, and inclusion occurs only once even without #pragma once. However, this GCC V.2 and V.3 specification sometimes does not work with commercially available compiler systems that are not based on the GCC specification, due to differences in the standard header file notation; in addition, it is more complex to implement. For these reasons, I decided to implement only #pragma once. As with other preprocessors, it is not advisable to rely on #pragma once alone when the same header files are used; it is recommended to combine #pragma once with macro guards (a sketch is given further below).

Note that #pragma once must not be written in <assert.h>; for the reason, see cpp-test.html#5.1.2. The same can be said of <cassert> and <cassert.h> of C++. Another problem is that the recent GCC/GLIBC system has header files, like <stddef.h>, which are repeatedly #included by other system headers: they define macros such as __need_NULL, __need_size_t and __need_ptrdiff_t and then #include <stddef.h>, and each time they do so, definitions such as NULL, size_t and ptrdiff_t are made in <stddef.h>. The same can be said of <errno.h> and <signal.h>, and even of <stdio.h>: other system headers define macros such as __need_FILE and __need___FILE and then #include <stdio.h>, and each time they do so, definitions such as FILE may be made in <stdio.h>. #pragma once cannot be used in such header files. *4

Note: *1 The normalized result can be seen by '#pragma MCPP debug path'; see 3.5.1. '#pragma MCPP put_defines' and the diagnostics use the same result, too.
'#pragma MCPP put_defines' and diagnostics use the same result, too.
*2 On CygWIN, /bin and /usr/bin are really the same directory, as are /lib and /usr/lib; supposing / is C:/dir/cygwin on Windows, /cygdrive/c/dir/cygwin is the same as /. mcpp treats these directories as identical, converting the path-list to the format /cygdrive/c/dir/cygwin/dir-list/file.
*3 On MinGW, / and /usr are really the same directory. Supposing / is C:/dir/msys/1.0, /c/dir/msys/1.0 is the same as /; and supposing /mingw is C:/dir/mingw, /c/dir/mingw is the same as /mingw. mcpp treats each of these as the same directory, converting the path-list to the format c:/dir/msys/1.0/dir-list/file or c:/dir/mingw/dir-list/file.
*4 This applies at least to Linux/GCC 2.9x, 3.* and 4.* with glibc 2.1, 2.2 and 2.3. FreeBSD 4, 5 and 6 have much simpler system headers because they do not use glibc.

With a small number of header files, writing #pragma once into them does not require much effort, but it would be tremendous work if there are many. I wrote a simple tool to insert the line into header files automatically. tool/ins_once.c is a tool written for old versions of GCC. As Borland C 5.5 conforms to the same standard header notation as GCC, the tool can be used there, too. However, it should not be used on systems, like glibc 2, that have the many exceptions shown above. Even on compiler systems where the tool is usable, some header files do not strictly conform to the GCC notation; GCC's own read-once functionality does not work properly for those header files, either.

Compile ins_once.c, move to a directory such as /usr/include or /usr/local/include under UNIX, and make the header files writable with 'chmod -R u+w *' if necessary. Then run ins_once in its checking mode: it reports the header files that do not begin with #ifndef or #if !defined. Modify those files manually. Then run ins_once in its inserting mode: if the first directive in a header file is #ifndef or #if !defined, a '#pragma once' line is inserted immediately below that line. Only the root user, or a user with appropriate permission, is eligible for this modification. If you changed access permissions, use 'chmod -R u-w *' afterwards to restore the original permissions. ins_once provides several options; select the one most appropriate for your system.

ins_once makes a rough check so as not to write a #pragma once line more than once into the same header file even if it is executed several times, but the check is not very strict. Being of a temporary and tentative nature, ins_once performs hardly any tokenization. It worked as I expected with FreeBSD 2.0 and 2.2.7 and with Borland C 5.5, but it may not work properly for unusual header files. So, before executing this tool, be sure to back up the original files. Have the shell expand any wild-card. (In case of buffer overflow, execute ins_once several times, specifying part of your system headers at a time.)

These directives are provided for compatibility with GCC. GCC provides the #include_next and #warning directives. Although these directives are non-conforming, not only do some source programs use them, some glibc 2 system header files do, too. Taking this situation into consideration, I implemented #include_next and #warning in GCC-specific-build to allow compilation of such programs; mcpp does issue a warning when it finds these directives. Regardless of the compiler system mcpp is ported to, mcpp in Standard mode also implements '#pragma MCPP warning'.
With the following directive, mcpp skips the directory of the current file and searches for header.h starting from the next directory of the search path:

    #include_next "header.h"

CygWIN and MinGW ignore distinctions of alphabetical case in header names. The following directive outputs 'any message' to stderr as a warning message:

    #warning any message

Unlike #error, this is not counted as an error.

When I ported mcpp to Visual C, I implemented these directives in mcpp, and then made them available for the other systems as well. '#pragma MCPP push_macro( "MACRO")' and '#pragma MCPP pop_macro( "MACRO")' "push" a macro definition (MACRO) onto, and "pop" it from, the macro definition stack. '#pragma push_macro( "MACRO")' and '#pragma pop_macro( "MACRO")' are also available in Visual C. push_macro saves a macro definition to the stack, and pop_macro retrieves it. The pushed macro definition remains valid after push_macro; to invalidate it, use #undef, or redefine the macro with a new definition. push_macro can be used multiple times for the same macro name.

'#pragma __setlocale( "<encoding>")' changes the current multi-byte character encoding to <encoding>. The argument of __setlocale must be a string literal. For the available <encoding>s, refer to 2.8. This directive allows you to use several encodings in one translation unit. In Visual C, '#pragma __setlocale' cannot be used; use '#pragma setlocale' instead. The encoding specification must be conveyed not only to mcpp but also to the compiler-proper, and the latter recognizes only '#pragma setlocale'. For other compiler systems, when the compiler-proper cannot recognize an encoding, mcpp compensates for it. There is not yet any compiler-proper which recognizes '#pragma __setlocale'.

'#pragma MCPP debug' and '#pragma MCPP end_debug' are for Standard mode; #debug and #end_debug are for pre-Standard mode. A '#pragma MCPP debug <args>' directive can be written anywhere in a source program. <args> specifies the types of debug information; one directive can take several <arg>s, and at least one must be specified. mcpp begins to output the debug information when it finds this directive, and stops when it encounters '#pragma MCPP end_debug <args>'. The <args> of end_debug can be omitted, in which case all types of debug output are turned off. If <args> contains an argument not supported by mcpp, mcpp issues a warning, but all the preceding arguments are still regarded as valid. All debug information is written to the same output as the preprocessing output, to keep the two synchronized. Therefore, this directive usually prevents compilation.
Nevertheless, the output of '#pragma MCPP debug macro_call' embeds its information in comments, and can be re-preprocessed and compiled. When you notice that something is wrong with a preprocessing result, enclose the code you want to examine in these directives, for example:

    #pragma MCPP debug expand
        /* the lines in question    */
    #pragma MCPP end_debug

As this facility was originally used for debugging mcpp itself, it was not developed with end users in mind. You may not understand its behavior without reading the mcpp source code, and you may sometimes feel that it outputs too much information, but it is useful for tracing the preprocessing process. Be patient.

The following debug information types can be specified with <arg>.

With the 'path' argument (i.e. '#pragma MCPP debug path' or '#debug path'), mcpp displays the include directories in the search path (excluding the current and source directories, from which searching begins) in order of priority, the highest first. In addition, for each #include directive, mcpp displays all the directories, including the current one, which it actually searched for the #include file.

With the 'token' argument, mcpp displays each source line it reads, and then each token on that line, together with its type, as the token is read. The token here is, more precisely, a preprocessing-token (pp-token). Not only the pp-tokens on a source line but also those that mcpp reads again internally during macro expansion are displayed, repeatedly. However, some 1-byte tokens are not displayed, for the mcpp program's own convenience. Each pp-token is classified into one of several types; of the SEP type, tokens other than <newline> are not normally displayed, and control codes such as <newline> are displayed as <^J> or <^M>.

With the 'expand' argument, mcpp traces the expansion process of each macro invocation. mcpp in Standard mode then behaves as follows. For a macro invocation, mcpp displays the macro definition; each argument is read, substituted for the corresponding parameter in the replacement list, and the replacement list is rescanned; mcpp displays this whole process. Nested macros are rescanned and expanded one by one, and if an argument itself contains a macro, mcpp traces the above process recursively before the parameter substitution. Each time control is passed to, or returned from, a certain set of mcpp internal functions, mcpp displays trace information along with the function name (expand_macro, replace, collect_args and others); reading the mcpp source code will give you a concrete idea of what each function does. Except for expand_macro, these functions are indirectly recursive with one another. For replace and collect_args, mcpp displays the data that it internally stacks during macro expansion. This data is displayed using internal codes; among them, <SRC> is used only in STD mode, not in POSTSTD or COMPAT mode. It is recommended to use '#pragma MCPP debug token' together with this. If you also specify '#pragma MCPP debug macro_call' or the -K option, macro notifications are output embedded in comments.
However, in replace() and its subordinate routines, some magic characters (internal codes) are written into or removed from the input stream, instead of comments.
These magic characters are displayed in a readable notation. If you specify the -v option too, the MAC_END and ARG_END markers also display the same numbers as their corresponding starting markers. For #debug expand, mcpp uses internal routines considerably different from those used in Standard mode; the explanation is omitted.

With the 'if' argument, mcpp displays #if, #elif, #ifdef and #ifndef lines and reports their evaluation results (true or false). For a skipped #if section, no report is made.

With the 'expression' argument, mcpp traces the evaluation of each #if or #elif expression. DECUS cpp, on which mcpp was originally based, provided this for the purpose of debugging cpp itself, and I have scarcely modified it. It outputs a very long list of internal functions, as well as variable names and their values. Unless you read the mcpp source code, you may not understand these variables; even without the source code, however, you can manage to see how mcpp pushes the values of a complex expression onto its evaluation stack and pops them off again.

With the 'getc' argument, mcpp outputs detailed data each time it calls get_ch(), a function that reads one byte. When mcpp in Standard mode scans a pp-token, this routine is called to read only the first byte of the pp-token; with #debug getc, mcpp calls this routine throughout token scanning, resulting in a tremendous amount of output. Either way, these directives output a huge amount of data, so you will scarcely need them.

With the 'memory' argument, mcpp reports, once, the status of the heap memory it has internally allocated or released using malloc(), realloc() and free(). Only the kmmalloc I developed, and some other implementations of malloc(), provide this functionality; refer to mcpp-porting.html#4.extra. With other malloc()s, mcpp neither causes an error nor reports a status. mcpp reports the heap memory status again when it terminates with this directive in effect; the same happens when mcpp terminates due to running out of memory.

With the 'macro_call' argument, mcpp starts macro notification mode.
In this mode, for each macro definition and each macro expansion, line and column information referring to the source file is output, embedded in comments. On a macro call with arguments, location information for each argument is reported, too. Token concatenation within a macro, however, may lose the macro information about the tokens that existed before the concatenation. Macro notification mode is designed to allow reconstruction of the original source position from the preprocessed output. Its primary purpose is to allow C/C++ refactoring tools to refactor source code without having to implement a special-purpose C preprocessor; the mode is also handy for debugging macro expansions. *1 The goal of macro notification mode is to annotate every macro expansion while still allowing the output to be compiled.
'#pragma MCPP debug expand', on the other hand, traces macro expansions and outputs detailed information, but its output is not compilable.

Note:
*1 Most of the specifications of macro notification mode were proposed by Taras Glek.
He is working on refactoring of sources at the Mozilla project: http://blog.mozilla.com/tglek/

For example, a macro definition directive produces an output annotation recording the name of the defined macro and the line and column of its definition. Line and column numbers start from 1.
When you specify the -K option, predefined macros are output too; they have no location information. A #undef line produces output in the format /*undef [lnum]*//*[NAME]*/, where [lnum] and [NAME] indicate the line number of the #undef line and the name of the undefined macro. Within source code other than directives, macros are expanded with markers indicating the start and the end of each macro expansion.
The format allows HTML-like nesting: /*<...*/ signals the start of a macro expansion, and /*>*/ its end.
The marker for the start of a macro expansion takes the same format as that for the start of a macro definition, with the leading /*m replaced by /*<. For a macro with arguments, markers indicating the source location of each argument, and markers indicating the start and the end of each argument's expansion, are output too.
The marker for an argument's location takes the format /*!...*/.
When a macro is found within an argument, information on that macro is output recursively, with its location information if it appears in the source file.
Macro argument markers also have a disambiguating naming scheme: an argument name encodes the name of the invoked macro and a serial number. This way, if someone calls 'BAZ(BAZ(a,b), c)', it is possible to distinguish the nested macros of the same name, and their arguments, from each other.
The argument number starts from 0.
The location format then follows the argument name. The marker for the start of an argument takes a similar format, and the marker for the end of an argument is the same as that for the end of a macro expansion: /*>*/. Moreover, when the -v option is specified along with -K, the markers for the end of a macro expansion and for the end of an argument expansion also show the macro name and the same number as their corresponding starting markers. All ending markers correspond to the last preceding starting marker of the same nesting level.
Hence, their correspondence can be determined automatically even without the -v option.

On a #if (#elif, #ifdef, #ifndef) line, information on the macros in the line is shown. On a #if line, the annotation starts with /*if [lnum]*/, where [lnum] indicates the current line number.
Then one or more /*[NAME]*/ markers follow if macros are found: one /*[NAME]*/ for each macro.
The annotation terminates with /*i T*/ or /*i F*/, indicating that the directive evaluated to true or to false, respectively.
Unlike for a macro on a line other than a directive, the expansion result is not displayed.
On a line such as '#if 1', which has no macro, no /*[NAME]*/ is displayed. Annotations on #elif, #ifdef and #ifndef likewise start with /*elif [lnum]*/, /*ifdef [lnum]*/ and /*ifndef [lnum]*/, respectively, are followed by /*[NAME]*/ markers if macros are found, and terminate with /*i T*/ or /*i F*/. Within blocks that are to be skipped, no annotation is displayed. On a #else line, information is displayed in the format /*else [lnum]:[C]*/, where [lnum] is the current line number and [C] is 'T' or 'F', indicating whether the #else - #endif block is to be compiled or to be skipped. On a #endif line, information is displayed in the format /*endif [lnum]*/, where [lnum] indicates the current line number.
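Putting these formats together, a conditional group like the following would be annotated roughly as shown (a sketch assembled from the formats described above; actual -K output may differ in spacing and placement):

    line 5:   #if     FOO       -->   /*if 5*//*FOO*//*i F*/
    line 7:   #else             -->   /*else 7:T*/
    line 9:   #endif            -->   /*endif 9*/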
Of course, each #endif corresponds to the last #if (#ifdef, #ifndef) not yet closed. In addition, in macro notification mode, the output format of the filename in #line lines differs from the default:
the filename is output as a "normalized" full path-list.
See 3.2.
This is for the convenience of makers of refactoring tools.

#assert is available in pre-Standard mode, except in GCC-specific-build. #assert provides functionality equivalent to the #error directive of Standard C. Standard C code of the form:

    #if     !(EXPRESSION)
    #error  assertion failed
    #endif

can be expressed as:

    #assert (EXPRESSION)

The argument of #assert is evaluated as a #if expression. If it evaluates to true (non-zero), mcpp does nothing; if it evaluates to false (0), mcpp displays a diagnostic message and then the argument line (after processing line splicing and comments). mcpp counts this as an error but continues processing. This #assert is quite different from that of System V or GCC.

mcpp in pre-Standard mode regards a block enclosed by the #asm and #endasm directives as assembler code. mcpp implements this functionality for Microware C/6809 only; to implement it in another compiler system, do_old() and put_asm() in system.c must be modified. Within a #asm block, mcpp performs trigraph conversion and deletes <backslash><newline> sequences, but it neither performs comment processing, checks tokens or characters, nor deletes white-space characters at the beginning of a line. Also, it does not expand a token that happens to have the same name as a macro, but outputs it as it is. Other directive lines have no meaning within a #asm block.

These #asm and #endasm directives do not conform to Standard C. In the first place, extension directives in any form other than "#pragma sub-directive" are non-conforming, and changing the names to #pragma asm and #pragma endasm would not solve the problem: in Standard C, source code must consist of a C token sequence (more precisely, a preprocessing-token sequence), and an assembler program is not a C token sequence. To use assembly code in Standard C, there is no way but to embed it in a string literal token; you then have to implement a built-in function in the compiler-proper that processes such a string literal, and call it with the assembly code as its string argument. However, this is not suitable for longer assembly code, in which case you had better write the assembly code in a separate file, like a library function, and assemble and link it. This may seem inconvenient, but separating the assembler portion completely is necessary to write a portable C program. It is recommended to write assembly code in a separate file rather than to use #asm.

These features are available in Standard mode. The -V199901L option, which sets __STDC_VERSION__ to 199901L, enables the C99 features described below. The same applies to C++ when the -V option sets __cplusplus to 199901L or more. Although the C++ Standard provides for only a couple of these features, mcpp in Standard mode provides all of them, for better compatibility with C99. Standard mode also allows variable argument macros even in the C90 and C++ modes. *1 For example, given this macro definition and invocation:

    #define debug( ...)     fprintf( stderr, __VA_ARGS__)
    debug( "X = %d\n", x);

the macro is expanded as:

    fprintf( stderr, "X = %d\n", x);

The ... in the parameter list corresponds to one or more arguments; in the above example, the ... corresponds to __VA_ARGS__ in the replacement list. During a macro invocation, the several arguments corresponding to the ..., including the commas between them, are concatenated and treated as one argument.

_Pragma( "foo bar") has the same effect as the line '#pragma foo bar'. The argument of the _Pragma() operator must be one string literal or wide string literal. For a wide string literal, the prefix (L) is deleted, and the rest is treated the same as a string literal.
For a string literal, " enclosing that string literal is deleted, and \" and \\ in that literal is replaced with " and \, respectively, before it is treated as a #pragma argument. #pragma must be written somewhere in one logical line and its argument is not macro-expanded at least for C90. On the other hand, the _Pragma() operator can be written anywhere in source code (even in a replacement list), which gives the same effect with #pragma written in a logical line. The _Pragma() operator generated during macro expansion is also valid. This flexibility provides the pragma directive with a wide range of portability and allows a header file to absorb the difference in #pragma among compiler systems. (For this sample, see pragmas.h and pragmas.t of "Validation Suite".) *2 C99 stipulates a #if expression is of maximum integer type. As "long long" and "unsigned long long" are required types, the type of an #if expression is "long long / unsigned long long" or larger. C90 and C++98 stipulate the type is long / unsigned long. mcpp, however, evaluates it by long long / unsigned long long even in C90 and C++98, and issues a warning when the value is out of range of long / unsigned long. *1 Note: *1 This is for compatibility with GCC and Visual C++ 2005, 2008. It is difficult also for other compiler systems to implement C99 specifications all at once. Probably, they will begin to implement them little by little with __STDC_VERSION__ set to 199409L or so. *2 C99 says that a #pragma argument that begins with STDC is not macro-expanded. For other #pragma arguments, whether macro is expanded is implementation-defined. mcpp of compiler-specific-builds have some specifications peculiar to each compiler system.
Such particular specifications, other than those concerning execution options and #pragma, are explained in this section. GCC has had a variadic macro of its own specification since V.2, as shown in 3.9.1.6; in this manual we call it the GCC2-spec variadic macro.
Moreover, GCC implemented one more variadic spec in V.3, as shown in 3.9.6.3, which we call the GCC3-spec variadic macro.
GCC V.2.95 and later implement the C99 variadic macro, too.
Nevertheless, software such as glibc and the Linux system headers does not use the C99 variadic macro, nor even the GCC3-spec one, but still uses the GCC2-spec. mcpp of GCC-specific-build in STD mode has implemented the GCC3-spec variadic since V.2.6.3, and the GCC2-spec one since V.2.7, to avoid these inconveniences with Linux and some other software.
Yet, mcpp warns at any use of them.
The GCC-spec variadics, especially the GCC2-spec one, are not only unportable but also syntactically unclean.
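For reference, the three notations look like this (a sketch; the macro names are illustrative):

    /* GCC2-spec: the variable arguments have a name; ', ## args' removes
       the preceding comma when the variable arguments are absent.         */
    #define debug_gcc2( fmt, args...)   fprintf( stderr, fmt , ## args)

    /* GCC3-spec: C99-style '...' and __VA_ARGS__, combined with the GCC
       ', ## __VA_ARGS__' comma-removal trick.                             */
    #define debug_gcc3( fmt, ...)       fprintf( stderr, fmt , ## __VA_ARGS__)

    /* C99: portable, but at least one argument must correspond to '...'. */
    #define debug_c99( fmt, ...)        fprintf( stderr, fmt, __VA_ARGS__)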
Their use in newly written sources is not recommended. Visual C had not implemented the variadic macro up to Visual C 2003.
But Visual C 2005 finally implemented it. Its specification is the C99 one, with a modification like that of the GCC3-spec: when the variable arguments are absent, Visual C removes the immediately preceding comma, without using the '##' token that the GCC3-spec uses. The specification is illustrated in the sketch below.
The Visual C document says that in its third example the comma is not removed; in actuality, however, Visual C removes the comma even in that case. Visual-C-specific-build of mcpp has implemented this special handling in STD mode since V.2.7. Still, it warns at every use of this spec.
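A sketch of the basic behavior (the macro name is illustrative; this does not reproduce the disputed example from the Visual C document):

    #define debug( fmt, ...)    printf( fmt, __VA_ARGS__)

    debug( "x = %d\n", x);  /* -> printf( "x = %d\n", x);                  */
    debug( "done\n");       /* -> printf( "done\n");
                               with no variable arguments, Visual C removes
                               the comma preceding __VA_ARGS__             */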
As for the third example of the Visual C document, mcpp implements the documented behavior as its own spec and does not remove the comma there. On a macro containing a 'defined' token, GCC expands it differently from usual when it appears in a #if line.
I explain this problem in 3.9.4.6. mcpp of GCC-specific-build, from V.2.7 onward, handles this sort of macro in STD mode the way GCC does.
This behavior was implemented to cope with a few wrong macros used in the Linux system headers.
Yet, mcpp warns at any use of a 'defined' token in a macro on a #if line.
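The problematic pattern is of this shape (a minimal sketch; HAVE_FOO and FOO are hypothetical names):

    #define HAVE_FOO    defined( FOO)  /* expands to a 'defined' token     */

    #if     HAVE_FOO    /* undefined behavior in Standard C; GCC, and mcpp
                           of GCC-specific-build, evaluate the expanded
                           'defined( FOO)' as if it were written here      */
    #endif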
You should write a correct #if expression instead.

Borland C has the asm keyword, used to embed assembly code, for example:

    asm {
        mov  ax, 4
    }

This is quite irregular and deviates from the C grammar. If a token there happens to have the same name as a macro, it will be macro-expanded; this holds for Borland C itself as well as for mcpp. It is recommended to write assembler programs in separate .asm files.
mcpp does not treat this specially. Visual C++ also has the __asm keyword, which provides similar functionality. GCC provides a Standard-conforming built-in function asm(), used as asm( " mov x,4\n"). GCC also has the #import directive, a feature of Objective-C imported into C/C++.
#import is a #include with an implicit '#pragma once'.
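That is (the header name is illustrative):

    #import "header.h"      /* first occurrence: header.h is included      */
    #import "header.h"      /* later occurrences are ignored, as if
                               header.h contained a '#pragma once' line    */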
It is occasionally used in C/C++ sources on Mac OS X. mcpp V.2.7 and later implement this directive only on Mac OS X, in both GCC-specific-build and compiler-independent-build. Visual C has peculiar directives named #import and #using; they have the form of preprocessing directives, but in fact they are directives for the compiler and the linker.
The #import of Visual C has no relation to that of GCC; mcpp of Visual-C-specific-build outputs these lines as they are.

Although I tried to develop mcpp so that the GCC-specific-build provides compatibility with GCC/cpp (cc1) to an extent that does not hinder practical use, it is still incompatible in many respects. First of all, as shown in Chapter 2, there are many differences in execution options. mcpp implements neither the -A option nor non-conforming directives such as #assert and #ident. *1 Fortunately, there seem to be quite few sources that cannot be compiled for lack of this compatibility. More problematic are the sources that assume the special behavior of old preprocessors. Most such source code receives a warning when -pedantic is specified to GCC. mcpp in Standard mode, by default, behaves almost the same as GCC with -pedantic, since it implements Standard-conforming error checking. However, since GCC/cpp by default allows such Standard violations without issuing any diagnostic, some sources take advantage of this. In most cases it is very easy to rewrite such non-conforming code into Standard-conforming code, so it is pointless to go out of one's way to write non-conforming code, only to impair portability and, worse, to provide a hotbed of bugs. When you find such code, do not hesitate to correct it. *2

Note:
*1 The functionality of #assert and #ident should be implemented using #pragma, if necessary. The same can be said of #include_next and #warning, but these directives are sometimes used in GCC systems, so I grudgingly implemented them in GCC-specific-build; a warning is issued when they are used.
*2 Sections 3.9 through 3.9.3 of this document were written in 1998, when sources depending on traditional preprocessors were frequently found.
Since then, such sources have greatly decreased, while sources depending on the local features and implementation trivialities of GCC have much increased.
Sections 3.9.4 and later, especially 3.9.8 and later, describe mainly such problems. (2008/03)

Taking the FreeBSD 2.2.2-R (1997/05) kernel source code as an example, this section explains some preprocessing problems. All the directories that appear in this section are under /sys (/usr/src/sys). Of the items I point out below, 3.9.1.7 and 3.9.1.8 are not necessarily Standard violations, and they work as expected in mcpp, but mcpp issues a warning because their coding is confusing. 3.9.1.6 is an enhancement; C99 provides the same functionality, though in a notation different from GNUC/cpp's.

Assembly code is embedded, by way of string literals spreading across lines, in i386/apm/apm.c, i386/isa/npx.c, i386/isa/seagate.c, i386/scsi/aic7xxx.h, dev/aic7xxx/aic7xxx_asm.c, dev/aic7xxx/symbol.c, gnu/ext2fs/i386-bitops.h and pc98/pc98/npx.c. When no " closing a string literal appears by the end of the line, GCC/cpp, by default, interprets the string literal as ending at the end of the line; this coding is based on that specification. In addition, the compiler-proper seems to interpret the whole content of asm() as one string literal spreading across lines. I think that assembler source code should be written in a separate file, but if you want to embed it in a ".c" file by all means, close the string literal on each line and rely on the concatenation of adjacent string literals, in a manner like this (the instructions here are only illustrative), instead of using that confusing coding:

    asm( "  movl  %eax, %ebx\n"
         "  movl  %ecx, %edx\n");

Standard C conforming preprocessors will accept this.

A #endif line trailed by a bare macro name (of the form '#endif DEBUG') appears in ddb/db_run.c, netatalk/at.h, netatalk/aarp.c, net/if-ethersubr.c, i386/isa/isa.h, i386/isa/wdreg.h, i386/isa/tw.c, i386/isa/b004.c, i386/isa/matcd/matcd.c, i386/isa/sound/sound_calls.h, i386/isa/pcvt/pcvt_drv.c, pci/meteor.c and pc98/pc98/pc98.h. Such a line should be changed to enclose the name in a comment: '#endif /* DEBUG */'. To my surprise, i386/apm/apm.c contains an even stranger directive line of this kind; that code must never have been debugged nor used.

gnu/i386/isa/dgb.c has a duplicate definition of a macro, and some header files have a macro definition conflicting with it. Standard C regards duplicate definitions as a "violation of constraint", but how they are treated depends on the compiler system: some make the first definition valid after issuing an error message, while others, like GCC 2/cpp, make the last definition valid without issuing any message by default. To make the last definition valid portably, an #undef of the macro should be added immediately before the last definition.

i386/isa/if_ze.c and i386/isa/if_zp.c have the #warning directive. This is the only non-Standard directive I found in the kernel source. To conform to Standard C, there is no way but to comment these lines out. mcpp accepts #warning.

gnu/ext2fs/ext2_fs.h and i386/isa/mcd.c have macros taking a variable number of arguments, defined along these lines (the replacement list is abbreviated here):

    #define MCD_TRACE( fmt, a...)   printf( fmt, ## a)

This is a GCC-specific enhanced specification and cannot be used with other compiler systems. The "## a" above can be written simply as "a": with ##, when a macro invocation has no argument corresponding to "a...", the preceding comma is deleted. C99 also provides for variable argument macros, but in a notation that differs from GCC's; the above example is written in C99 as:

    #define MCD_TRACE( fmt, ...)    printf( fmt, __VA_ARGS__)

The most annoying difference is that C99 requires one or more arguments corresponding to "..." in a macro invocation, while GNUC/cpp allows zero or more arguments corresponding to "a...". To cope with this, when there is no argument corresponding to "...", mcpp issues a warning instead of an error.
Therefore, you can change the above definition to the C99 notation with every invocation argument corresponding one-to-one to the "...", for instance:

    #define MCD_TRACE( ...)     printf( __VA_ARGS__)

This is simpler. However, this way of writing has the disadvantage that a comma immediately before an empty argument remains, resulting in, for example, printf( fmt, ). In that case, there is no way but to write the macro definition in accordance with the C99 specifications, or to avoid empty arguments in macro invocations by passing a harmless token such as NULL or 0, as in MCD_TRACE( fmt, NULL). *1

Note:
*1 GCC 2.95.3 and later also implement variable argument macros in the C99 syntax, and it is recommended to use that syntax. The GCC-specific one has the flexibility of allowing zero variable arguments, but its notation is bad in that (1) for the "args..." parameter, white space must not be inserted between "args" and "...", yet such a pp-token is not permitted in C/C++, and (2) it is not desirable for the notation of the token concatenation operator to carry a different meaning in a replacement list. Zero variable arguments ought rather to be allowed on the basis of the C99 notation. GCC 3 introduced a notation for variable argument macros that is a mixture of GCC 2's traditional notation and the C99 one; for details, refer to 3.9.6.3.

Macro invocations whose first argument is empty appear in nfs/nfs.h, nfs/nfsmount.h, nfs/nfsmode.h, netinet/if_ether.c, netinet/in.c, sys/proc.h, sys/socketvars.h, i386/scsi/aic7xxx.h, i386/include/pmap.h, dev/aic7xxx/scan.l, dev/aic7xxx/aic7xxx_asm.c, kern/vfs_cache.c, pci/wd82371.c, vm/vm_object.h and vm/device/pager.c, and likewise in /usr/include/nfs/nfs.h. C99 approved empty arguments, but C90 regarded them as undefined. Considering that an argument may happen to become empty during a nested macro invocation, empty arguments should be approved; still, it is neither necessary nor desirable to write an empty argument in source code. Note that for a one-argument macro there is a syntactic ambiguity between an empty argument and a lack of argument. Taking everything into consideration, it is recommended to pass an explicitly defined empty macro, conventionally named EMPTY, rather than to write a bare empty argument:

    #define EMPTY
    FOO( EMPTY, x);     /* rather than FOO( , x); the macro FOO is illustrative */

Any Standard C conforming preprocessor will accept this notation. By the way, some of the header files (in the nfs directory) listed above neither contain the relevant macro definitions nor #include any other header. Such header files assume that the definitions exist in sys/queue.h and that the *.c programs will #include sys/queue.h first. These files give rise to ambiguity.

kern/kern_mib.c has macro definitions whose first, empty argument cannot be changed to EMPTY, because in the corresponding macro definitions in sys/sysctl.h these arguments are not macro-expanded. The arguments of the SYSCTL_OID macro, including the first one, are not macro-expanded either. In such a case, there is no way but to leave the empty argument as it is. *1

Note:
*1 C99 approves empty arguments as legitimate. Considering macros such as SYSCTL_NODE() and SYSCTL_OID(), the EMPTY macro is not almighty, and using empty arguments has its reasons. In addition, even if EMPTY is used, nested macro invocations may still produce empty arguments. Still, for source readability, using EMPTY is recommended whenever possible.

i386/include/endian.h, as well as /usr/include/machine/endian.h, has four macro definitions of the same kind. The problem is the ntohl definition.
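The definitions presumably have this shape (the names and the body here are reconstructed for illustration):

    #define ntohl               __byte_swap_long    /* object-like macro   */
    #define __byte_swap_long(x) ((u_long)(x))       /* function-like macro;
                                                       real body swaps bytes */

    addr = ntohl( addr);    /* 'ntohl' expands to '__byte_swap_long', which
                               is rescanned together with the following
                               '( addr)' and expanded like a function-like
                               macro invocation                            */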
Although ntohl is an object-like macro, it is expanded to a function-like macro name, then rescanned together with the subsequent text, and expanded as if it were a function-like macro. This way of macro expansion has been regarded as an implicit specification since K&R 1st, and Standard C somehow approved it as legitimate. However, as I discuss in other documents, it is this specification that makes macro expansion unnecessarily complicated and brings confusion into the Standard's text. It is a bug of the specification. *1 This ntohl is in reality a function-like macro, written as an object-like macro with the parameter list omitted. You had better define it as the function-like macro that it is, with an explicit parameter list, as in the ntohl(x) of 4.4BSD-Lite; that causes no problem. i386/isa/sound/os.h has the same kind of macro definitions, which should be rewritten in the same way.

Note:
*1 ISO 9899:1990 Corrigendum 1:1994 regarded the notation as undefined. C99 replaced that article with another. However, the Standard documents are still confusing on this point. For details, see cpp-test.html#2.7.6.

Some kernel sources are contained in ".S" files, that is, they are written in assembler. These sources contain #include's and #ifdef's, which require preprocessing. To preprocess them, FreeBSD 2.2.2-R calls 'cc' with the '-x assembler-with-cpp' option; 'cc' calls '/usr/libexec/cpp' with the '-lang-asm' option and then calls 'as'. Of course, this way of using .S files is non-conforming. Such assembler source must not contain a token that happens to have the same name as a macro; white space between tokens and at the beginning of a line must be retained during preprocessing; and if the first token on a line is a # introducing an assembler comment, special processing is required on the preprocessor's side. This not only considerably limits the usable preprocessors but also increases the possibility of unknowingly introducing bugs, so using .S files in this way is not recommended. *1 To preprocess source code for use with several types of machines, the code should be written in a form that can be saved in ".c" files, not ".S"; 4.4BSD-Lite actually adopts this way of coding.

Note:
*1 In FreeBSD 2.0-R, these kernel sources were contained not in *.S but in *.s files, and the Makefile was written so as to call 'cpp', instead of 'cc', to process them; 'cc' then calls 'as'. The 'cpp' called is '/usr/bin/cpp', a shell-script that invokes '/usr/libexec/cpp -traditional'. That method was more convenient in that it provided a way to change the preprocessor by modifying the script.

I compiled all the source files in /usr/src/lib/libc of FreeBSD 2.2.2-R. There was no problem, probably because most of them come from 4.4BSD-Lite without much modification. It is quite rare, and impressive, that such a huge number of source files of excellent quality is gathered together. At only one place, in gen/getgrent.c, I found a line with a surplus ";" at its end.

As seen so far, writing Standard-conforming source code, with better portability and in a more secure manner, neither requires much effort nor brings any demerit. In spite of that, why does source code less conforming to the Standards still exist at all? Comparing the FreeBSD 2.0-R kernel sources with those of 2.2.2-R, the non-conforming ones do not decrease in number. The problem is that newer sources are not necessarily more conforming to the Standards. There are few non-conforming sources in 4.4BSD-Lite.
This is probably because the 4.4BSD sources had been rewritten to conform to Standard C and POSIX. However, in the process of bringing these sources into FreeBSD, the old writing style revived in some of them. For example, although the ntohl shown above is written as ntohl(x) in 4.4BSD-Lite, it is written as ntohl in FreeBSD. Why did a notation once put away revive? I blame GCC/cpp for this revival, because it passes these non-conforming sources without issuing any diagnostic. If -pedantic had been the default behavior, the old-style sources would never have revived. (If -pedantic-errors had been the default, though, GCC/cpp would not have come into practical use, because too many sources would have failed to compile.) The gcc man page describes the -pedantic option thus: "There is no reason to use this option except for satisfying pedants." Now that eight years have passed since Standard C was established, it is high time GCC/cpp made -pedantic the default, even without going so far as -pedantic-errors. *1 In FreeBSD 2.0-R, nested comments were sometimes found; in 2.2.2-R they have disappeared, because GCC/cpp no longer allows them. This has nothing to do with -pedantic, but I mention it to show how influential a preprocessor's source checking is.

Note:
*1 I wrote subsection 3.9.3 in 1998. Since then, the gcc man page and info have deleted this wording; the specification, however, remains almost the same.

I compiled the glibc (GNU LIBC) 2.1.3 sources (February 2000). Unlike FreeBSD libc, they showed many problems. Some sources are written on the basis of GCC/cpp's undocumented specifications, and in those cases it took me a lot of time to identify what was going on.

sysdeps/i386/dl-machine.h and stdlib/longlong.h have many multi-line string literals, some of them very long. compile/csu/version-info.h, created by make, also has a multi-line string literal. Of course this is non-conforming, but GCC treats such a literal as a string literal with embedded <newline>s. The -lang-asm (-x assembler-with-cpp, -a) option lets mcpp convert a multi-line string literal into a conforming equivalent. However, this option cannot work for a string literal with a directive inserted in the middle, as shown in 3.9.1.1, in which case there is no way but to rewrite the source.

#include_next appears in the following files: catgets/config.h, db2/config.h, include/fpu_control.h, include/limits.h, include/bits/ipc.h, include/sys/sysinfo.h, locale/programs/config.h and sysdeps/unix/sysv/linux/a.out.h. sysvipc/sys/ipc.h has #warning. Although these directives are not approved by Standard C, #include_next in particular is indispensable for glibc 2, so mcpp for GCC implements both #include_next and #warning. The problem with #include_next is not only that it is a Standard violation, but also that which headers actually get included depends on the include directory settings and search order, which users can change via environment variables. When glibc is installed, some files in glibc's include directory are copied to /usr/include and used as system headers. That these headers contain #include_next means the system headers have become patchy; it seems to be time to reorganize them.
The following files contain definitions of, and invocations of, macros with a variable number of arguments in the GCC specification: elf/dl-lookup.c, elf/dl-version.c, elf/ldsodefs.h, glibc-compat/nss_db/db-XXX.c, glibc-compat/nss_files/files-XXX.c, linuxthreads/internals.h, locale/loadlocale.c, locale/programs/linereader.h, locale/programs/locale.c, nss/nss_db/db-XXX.c, nss/nss_files/files-XXX.c, sysdeps/unix/sysdep.h, sysdeps/unix/sysv/linux/i386/sysdep.h and sysdeps/i386/fpu/bits/mathinline.h. This is a deviation from the C99 Standard; you must rewrite the source code before you can use mcpp. *1

Note:
*1 This is the spec dating from GCC 2. There is another GCC 3 spec, which is a compromise between the GCC 2 and C99 specs; see 3.9.6.3.

The following files have macro invocations with empty arguments: catgets/catgetsinfo.h, elf/dl-open.c, grp/fgetgrent_r.c, libio/clearerr_u.c, libio/rewind.c, libio/clearerr.c, libio/iosetbuffer.c, locale/programs/ld-ctype.c, locale/setlocale.c, login/getutent_r.c, malloc/thread-m.h, math/bits/mathcalls.h, misc/efgcvt_r.c, nss/nss_files/files-rpc.c, nss/nss_files/files-network.c, nss/nss_files/files-hosts.c, nss/nss_files/files-proto.c, pwd/fgetpwent_r.c, shadow/sgetspent_r.c, sysdeps/unix/sysv/linux/bits/sigset.h and sysdeps/unix/dirstream.h. math/bits/mathcalls.h, in particular, contains as many as 79 empty arguments. This header is installed as /usr/include/bits/mathcalls.h and is #included by /usr/include/math.h. Even with an EMPTY macro, nested macro invocations would generate a lot of empty arguments here. Are there no clearer ways to write these macros?

The following files contain object-like macros whose definitions are replaced by function-like macro names: argp/argp-fmtstream.h, ctype/ctype.h, elf/sprof.c, elf/dl-runtime.c, elf/do-rel.h, elf/do-lookup.h, elf/dl-addr.c, io/ftw.c, io/ftw64.c, io/sys/stat.h, locale/programs/ld-ctype.c, malloc/mcheck.c, math/test-*.c, nss/nss_files/files-*.c, posix/regex.c, posix/getopt.c, stdlib/gmp-impl.h, string/bits/string2.h, string/strcoll.c, sysdeps/i386/i486/bits/string.h, sysdeps/generic/_G_config.h and sysdeps/unix/sysv/linux/_G_config.h. Of these, some function-like macros, like the ones in math/test-*.c, are first replaced with an object-like macro name and then further replaced with a function-like macro name. Why did these macros have to be written this way?

sysdeps/generic/_G_config.h, sysdeps/unix/sysv/linux/_G_config.h and malloc/malloc.c contain a macro definition which expands to the "defined" pp-token, apparently along these lines:

    #define HAVE_MREMAP     defined(__linux__) && !defined(__arm__)

The intention of this definition is that, with the directive:

    #if     HAVE_MREMAP

the line is expected to be expanded as:

    #if     defined(__linux__) && !defined(__arm__)

However, the behavior is undefined in Standard C when a #if line comes to contain a "defined" pp-token as a macro expansion result. Apart from that, this macro definition is strange in the first place. The HAVE_MREMAP macro is first replaced with its replacement list, and then the identifiers defined, __linux__ and __arm__ are rescanned for further macro replacement; any of them that is a macro is expanded. Here, defined cannot be defined as a macro (otherwise it causes another undefined result), and if __linux__ is defined as 1 and __arm__ is not defined, the macro is finally expanded as:

    defined(1) && !defined(__arm__)

defined(1), of course, is a syntax error in a #if expression. However, GCC/cpp stops macro expansion at the (1) and regards that as the final macro expansion result of the #if line.
Since this is "undefined" anyhow, this GNU specification cannot be described as wrong, but it lacks of consistency in that how to expand a macro differs between macros in a #if line and in other lines. At least, it lacks of portability. *1 The above code should be written as follows: I hope this kind of confusing code be eliminated as early as possible. *2 Note: *1 GCC 2/cpp internally treats defined in a #if line as a special macro. For this reason, when GCC/cpp rescans the following sequence of tokens for macro expansion, it evaluates it as a #if expression, as a result of special handling of defined pseudo-macro, instead of expanding the original macro. In other words, distinction between macro expansion and #if expression evaluation is ambiguous. This problem relates to GCC/cpp' own program structure. GCC 2/cpp has a de facto main routine rescan(), which is a macro rescanning routine. This routine reads and processes source file from the beginning to the end, during the course of which, it calls a preprocessing directive processing routine. Although implementing everything using macros is a traditional program structure of a macro processor, this structure can be thought to cause mixture of macro expansion and other processing. *2 In glibc 2.4, this macro was corrected.
Nevertheless, many other macros of the same sort have been newly defined.

The files named *.S contain assembler source code requiring preprocessing. Some of them have preprocessing directives such as #include, #define and #if. In addition, the file compile/csu/crti.S, generated by make, contains lines which, from a syntactic point of view, a preprocessor cannot tell to be invalid preprocessing directives or valid assembler comments. GCC seems to leave these lines as they are during preprocessing, treating them as assembler comments. Concatenation of pp-tokens by the ## operator also sometimes generates an invalid pp-token there, and GCC/cpp outputs such pp-tokens without issuing any diagnostic. For compatibility with GCC, I reluctantly decided that, with the -lang-asm (-x assembler-with-cpp, -a) option, mcpp does not treat these non-conforming directives and the invalid pp-tokens generated by ## as errors, but outputs them as they are, issuing a warning. Essentially, these sources should be processed by an assembler macro processor. GNU does provide one, called gasp, but for some reason it seems to be scarcely used.

When invoked with the -dM option, GCC outputs only macro definitions; this is used by stdlib/isomac.c in the 'make check' routine. The problem with isomac.c is that it accepts only GCC/cpp's macro definition file format and regards a comment or a blank line as an error.

glibc's make sometimes uses a program called rpcgen. The problem with rpcgen is that it accepts only GCC/cpp's output format of the preprocessor line-number information, i.e. lines of the form '# 1 "file.c"', and regards the Standard '#line 1 "file.c"' format, or any other variation, as an error. I reluctantly decided that GCC-specific-build of mcpp uses the GCC format by default. rpcgen's specification is poor in that it is based on one particular compiler system's format and cannot accept the standard one.

glibc 2.1's makefiles often use the -include option and sometimes the -isystem and -I- options. The former can be substituted by a #include at the beginning of the source. The latter two are rarely necessary: they are needed only for updating the system headers themselves. Only GCC-specific-build of mcpp implements these two options, and I would like these scarcely needed options to be made obsolete. *1

Note:
*1 GCC/cpp provides several more options specifying include directories and their search order, such as -iprefix, -iwithprefix and -idirafter. It also provides the -remap option, which specifies a mapping between long file names and MS-DOS 8+3 format file names. On CygWIN systems, the specs files contain these options, but they are not really necessary, because the include directories can be specified with environment variables, and because such mapping is no longer needed on CygWIN.

This is not a problem of glibc but of GCC: __VERSION__, __SIZE_TYPE__, __PTRDIFF_TYPE__ and __WCHAR_TYPE__ are GCC/cpp predefined macros, although their names do not appear in the documentation. On Vine Linux 2.1 (egcs-1.1.2) systems, __VERSION__ is set to "egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)". On many systems, including Linux/i386, the values of the other three macros have the types unsigned int, int and long int, respectively, though on FreeBSD and CygWIN systems the types differ slightly (I do not know why). Why do those predefined macros remain undocumented? The strangest thing of all is the undocumented environment variable named SUNPRO_DEPENDENCIES.
sysdeps/unix/sysv/linux/Makefile contains a script that uses it. The intent of the script is to specify a file name in the environment variable SUNPRO_DEPENDENCIES, and to have cpp write the macro definitions in the source code, and the dependency description lines between source files, to that file. I had no other way to find out how this environment variable works but to read the GCC/cpp source code (egcs-1.1.2/gcc/cccp.c). In addition, there is another environment variable, DEPENDENCIES_OUTPUT, with a similar function; the difference between the two is that SUNPRO_DEPENDENCIES also outputs dependency lines for system headers, while DEPENDENCIES_OUTPUT does not. Only GCC-specific-build of mcpp enables these two environment variables, and I would like these undocumented specifications to be made obsolete as early as possible.

Linux (i386)/GCC 2 appends the -Asystem(unix), -Acpu(i386) and -Amachine(i386) assertions to the cpp invocation options by means of the specs file. As far as glibc 2.1.3 for Linux/x86 is concerned, there seems to be no source code that uses this functionality.

It is a big problem that glibc's system headers have become patchy and very complicated: a small difference in settings may result in a big difference in preprocessing results. On the other hand, glibc 2.1.3 did not contain the #else junk, #endif junk or duplicate macro definitions that were found in the FreeBSD 2.2.2 kernel sources. In some respects, the glibc 2.1 sources are better organized than the FreeBSD 2 kernel sources. As a whole, however, there were quite a few sources in glibc 2.1 that rely on GCC-specific specifications, which impairs portability to other compiler systems, although such sources form only a small portion of the several thousand source files. Dependence on GCC-local specifications is not desirable for program readability and maintainability. I hope that GCC V.3 will make these local specifications obsolete and that all the source code based on them will be rewritten.

You must modify some of the source code as described above before you can use mcpp to compile the glibc 2.1 sources. *1 In addition to the options specified in the Makefile or the specs file, you must specify the -lang-asm (-xassembler-with-cpp) option, so as to process the *.S files containing multi-line string literals or assembler comments, before you can invoke mcpp; usually you can leave this option specified when preprocessing the other files, too. When you want to switch between GCC/cpp and mcpp, or to change the default options, you had better arrange the preprocessor-invoking script accordingly. *2

Note:
*1 mcpp V.2.7 implemented these specs.
Hence, editing the sources is not necessarily required.
*2 If you use 'configure' and 'make' to compile GCC-specific-build of mcpp, the 'make install' command will set up the script appropriately. The only thing left for you is to add the '-Q -lang-asm' options to the script.

Another problem in using mcpp is that it issues a huge amount of warning messages. You can redirect them to a file using the -Q option, but when you preprocess a large body of source code such as glibc, several hundred MB or more of 'mcpp.err' files are created in total, so it is impossible to look through them all. Taking a closer look at mcpp.err, you will find the same warnings issued repeatedly, because the same *.h files are #included by many source programs; to make the files more readable, filter out the duplicated warnings, e.g. by sorting the file and removing the duplicated lines.

I first compiled the GCC 3.2 sources on Linux and FreeBSD, then used the generated gcc to compile mcpp, and then recompiled the GCC 3.2 sources using mcpp for preprocessing. New GCC compilers are bootstrapped in several phases of make: gcc, cc1 and the rest generated in an earlier phase are used to recompile themselves, the resulting compiler drivers and compiler-propers are used again to recompile themselves, and so on. During the bootstrap, gcc exists under the name xgcc. Besides cc1 and cc1plus, GCC 2 has a separate preprocessor called cpp. In GCC 3, cpp was absorbed into cc1 and cc1plus; however, there still exists a separate preprocessor, cpp0. To have cpp0 do the preprocessing, the -no-integrated-cpp option must be specified when gcc or g++ is invoked. Therefore, to have mcpp do the preprocessing, you must use a shell-script that makes gcc (xgcc) or g++ invoke mcpp first and then invoke cc1 or cc1plus. *1

In the GCC compiler system, the settings of the system headers and their search order are becoming very complex, so a small difference in settings may result in a difference in preprocessing results, and even successful compilation was often difficult to attain. In addition, compilation and the tests require a lot of other software, and older versions of such software may cause failures in compilation or in the tests. Compilation also sometimes failed because of hardware problems on my machines. Actually, I failed to compile the GCC 3.2 source under FreeBSD 4.4R; I had to upgrade FreeBSD to 4.7R, and change the software packages to those for FreeBSD 4.7R, before I succeeded. *2

I used VineLinux 2.5 on two PCs. Although compilation of the GCC 3.2 sources with GCC 2.95.3 was successful on one PC (K6/200MHz), recompilation of the GCC 3.2 sources with the generated GCC 3.2/cc1 failed with many segmentation faults. When I changed the CPU from K6 to AthlonXP, the recompilation succeeded with no segmentation faults; the hardware may have caused the problem. When I compiled the GCC 3.2 sources with GCC 2.95.4 under FreeBSD on the K6, "make -k check" of the generated gcc was almost successful; when I then recompiled GCC 3.2 itself with the generated GCC 3.2, about 20 percent of the testsuite for g++ and libstdc++-v3 failed, whereas with the AthlonXP everything went fine. Here too the hardware may have been the cause. On both VineLinux PCs, when I recompiled the GCC 3.2 sources using GCC 3.2 itself with mcpp, "make -k check" of the generated gcc was successful; however, about 20 percent of the testsuite for g++ and libstdc++-v3 failed.
*3, *4, *5 In any case, the cause of these testsuite failures seems to lie not in the generated compilers themselves, such as gcc, g++, cc1 and cc1plus, but in the header files or some other settings. mcpp cannot be described as completely compatible with GCC/cpp, but it is highly compatible, so mcpp and GCC/cpp can be used interchangeably. The GCC 3.2 sources were compiled in the environments described above; only C and C++ were compiled.

Note:
*1 I had to do this for each bootstrap stage. Since the makefile is too large and too complex to change, I employed an inelegant method: I kept sitting in front of the PC screen during the entire bootstrap, and at the end of each stage I entered ^C and replaced xgcc and the others with shell-scripts.
*2 Owing to the dependencies between packages, the system falls into confusion unless appropriate versions are installed. Actually, for this reason, my FreeBSD temporarily failed to invoke kterm.
*3 "make -k check" cannot be used with mcpp, because the diagnostics of mcpp differ from those of GCC.
*4 "make -k check" seems to require an English environment, so the LANG environment variable should be set to C.
*5 All the testsuite failures were caused by the inability of the pthread_* functions, such as pthread_getspecific and pthread_setspecific, to be linked in the library i686-pc-linux-gnu/libstdc++-v3/src/.libs/libstdc++.so.5.0.0. When a correctly generated library was installed, "make -k check" was successful. On FreeBSD this problem never happened; this is probably due to small differences in settings.

The very old way of coding was no longer found in the GCC 3.2 sources. Multi-line string literals were made obsolete as late as GCC 3.2: GCC 3.2 processes a source with a multi-line string literal as you would expect, but issues a warning. limits.h and syslimits.h in build/gcc/include, generated during make, have #include_next; when GCC 3.2 is installed, these header files are copied to limits.h and syslimits.h in lib/gcc-lib/i686-pc-linux-gnu/3.2/include. The GCC 3.2 sources do not have #warning. They have some variable argument macros, but most of them are found in the testsuite and are nothing but test samples. Although GCC 3.2 still supports variable argument macros in the GCC 2 notation, the ones using __VA_ARGS__ (the C99 notation) are found more frequently in the GCC 3.2 sources. In GCC 3, variable argument macros in a mixed notation of GCC 2 and C99 are also found, of this shape: *1

    #define eprintf( fmt, ...)      fprintf( stderr, fmt, ## __VA_ARGS__)

This definition corresponds to the following one of the GCC2 spec:

    #define eprintf( fmt, args...)  fprintf( stderr, fmt, ## args)

According to the GCC specifications, in the absence of arguments corresponding to "...", the comma immediately before "##" is deleted; so eprintf( "done.\n") is expanded as:

    fprintf( stderr, "done.\n")

As far as this example is concerned, the specification seems convenient, but it is not desirable in that (1) a comma in the replacement list of a macro definition is not always a delimiter of parameters, (2) it gives the token concatenation operator (##) an additional, unrelated functionality, and (3) it makes the rules more complex by allowing exceptions. *2, *3, *4

Note:
*1 This manual calls the variadic macro specification dating from GCC 2 the GCC2 spec, and the one created in GCC 3 the GCC3 spec.
*2 While in GCC 2.* the 'args...' in the definition of a GCC2-spec variadic macro must not be separated as 'args ...', GCC 3 allows the intervening spaces.
*3 When the -ansi option (or any -std=c* or -std=iso* option) is specified, however, GCC does not remove the comma even if the variable arguments are absent.
Nevertheless, the '##' disappears silently. *4 mcpp V.2.6.3 implemented the GCC3-spec variadic macro for STD mode on GCC-specific-build only (the notations are sketched below).
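For reference, the notations discussed above can be sketched as follows. The macro name eprintf follows the variadic examples in the GCC documents; the definitions below are illustrations, not quotations from the GCC 3.2 sources.

/* the same hypothetical macro, defined in each of the three
   notations in turn (only one definition would exist at a time)   */

/* GCC2 spec (since GCC 2):                                        */
#define eprintf( format, args...)   fprintf( stderr, format , ## args)

/* C99 spec:                                                       */
#define eprintf( format, ...)   fprintf( stderr, format, __VA_ARGS__)

/* mixed notation of GCC2 and C99, found in GCC 3:                 */
#define eprintf( format, ...)   fprintf( stderr, format , ## __VA_ARGS__)

/* with the GCC2 spec or the mixed notation, an invocation which
   lacks the variable arguments, such as:
        eprintf( "done\n");
   is expanded with the comma before '##' deleted:
        fprintf( stderr, "done\n");                                */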
V.2.7 implemented the GCC2-spec one as well.

Apart from #include-ed system headers, such as /usr/include/bits/mathcalls.h and /usr/include/bits/sigset.h, empty arguments in a macro invocation are found only in gcc/libgcc2.h of the GCC 3.2 sources themselves. *1

Note:
*1 These two header files are copied into the system header directory when glibc is installed. They do not exist on FreeBSD, because glibc is not used there.

gcc/fixinc/gnu-regex.c and libiberty/regex.c have object-like macros that are replaced with a function-like macro name. /usr/lib/bison.simple, a #included file, also has such macros. These macros are all related to alloca. For example, libiberty/regex.c has the following macro definitions. This should be written as follows: Why did they omit (size)? In addition, regex.c also has another alloca, which is defined as follows: Their writing style is inconsistent. Furthermore, regex.c has a #include "regex.c" line, i.e. it #includes itself. regex.c is a strange and unnecessarily complicated source.

The GCC 3.2 sources do not have macros that expand to 'defined'. According to the GCC 3.2 documents, this type of macro is preprocessed in the same way as in GCC 2/cpp, but GCC 3.2 issues a warning indicating that it "may not be portable". However, GCC 3.2 does not issue a warning for the example shown in 3.9.4.6. cpp.info of GCC 3 says: However, the gcc/config directory has several *.S files.

Make of GCC 3.2 uses neither rpcgen nor the -dM option. However, the specifications of rpcgen and the -dM option do not seem to have changed from the previous versions. Options to specify include directories are frequently used in make of GCC 3.2. Sometimes, the -isystem option is used to specify several system include directories at one time. Is it inevitable to use this option when compiling software that updates the system headers themselves? I think they had better use an environment variable to specify all the system include directories. On the other hand, the GCC 3/cpp documents discourage the use of the -iwithprefix and -iwithprefixbefore options. GCC provides many options to specify include directories; will GCC move toward reorganizing them or reducing their number? *1

Note:
*1 The GCC 3.2 Makefile uses the -iprefix option in a standalone manner (without -iwithprefix or -iwithprefixbefore), although the -iprefix option makes sense only when used with one of these two options following it.

GCC 2 did not document predefined macros such as __VERSION__, __SIZE_TYPE__, __PTRDIFF_TYPE__ and __WCHAR_TYPE__. Even with the -dM option, their existence was unknown. GCC 3 not only documents them but also enhances -dM to show their definitions. GCC 3 also documents the SUNPRO_DEPENDENCIES environment variable, which GCC 2 did not. (I do not know why this environment variable is needed.)

GCC 3 implements the following #pragmas: Of these, the GCC 3.2 sources use poison and system_header. mcpp does not support these #pragmas, because I do not think them necessary. (I omit the explanation of their specifications.) *1 GCC 3 deprecates the assertion directives, such as #assert, although gcc, by default, specifies the -A option.

In GCC 2, the -traditional option is implemented in one and the same cpp, resulting in a strange mixture of very old specifications and C99 ones. In GCC 3, the preprocessor was divided into two: the non-traditional cpp0 and tradcpp0. The -traditional option is valid only for gcc; cpp0 does not provide it. gcc -traditional invokes tradcpp0 for preprocessing. tradcpp0 is getting closer to a true traditional preprocessor from before C90.
They say that they will no longer maintain tradcpp0 except for serious bugs. The strange specifications of GCC 2/cpp seem to have been significantly revised.

Note:
*1 mcpp V.2.7 onward supports #pragma GCC system_header on GCC-specific-build.

As seen above, as far as preprocessing is concerned, the GCC 3.2 sources are much improved over the glibc 2.1.3 sources, in that the traditional way of writing has been almost eliminated and meaningless options are no longer used. GCC 3.2/cpp0 itself is also much superior to GCC 2/cpp in that it regards the traditional specifications as obsolete and articulates the token-based principle. Undocumented specifications have been significantly reduced. Although these improvements are still not sufficient, GCC is certainly moving in the right direction. However, the GNU / Linux system headers have become so complex that it is difficult to grasp their entire structure, which may be one of the biggest causes of problems in the GNU / Linux system.

Another pitiful fact is that the preprocessor has been absorbed into the compiler-proper. Therefore, to use mcpp, the -no-integrated-cpp option must be specified when invoking gcc or g++. If you compile a large amount of source files with complicated or many makefiles, or if some program automatically invokes gcc, you should create a shell-script that invokes gcc or g++ with the -no-integrated-cpp option automatically specified. Let me take an example of this. Place the following shell-scripts in the directory where gcc and g++ reside (on my Linux, /usr/local/gcc-3.2/bin), under the names of gcc.sh and g++.sh, respectively. Move to this directory and enter the following commands: In the directory where cpp is located (on my Linux, /usr/local/gcc-3.2/lib/gcc-lib/i686-pc-linux-gnu/3.2), create a script that executes mcpp when cpp0 is invoked, as you did for GCC 2 (see 3.9.5). By doing this, gcc or g++ first invokes mcpp and then invokes cc1 or cc1plus with the -fpreprocessed option appended. -fpreprocessed indicates that the source has already been preprocessed. *1 Note that when a GCC version other than the system standard one is installed, additional include directory settings may be required. mcpp embeds these settings when mcpp itself is compiled, thus eliminating the need to set them with environment variables.

If possible, I would like to replace the cpplib source, the preprocessing part of cc1 and cc1plus, with mcpp. However, the source files that define the internal interface between cpplib and cc1 or cc1plus, as well as the external interface between cpplib and the user programs that use it, amount to as much as 46KB. It is impossible to replace. Why are the interfaces so complex? It is a pity.

Note:
*1 mcpp gets all the necessary information by 'configure' and sets these scripts by 'make install'.

Although GCC 3.2 seemed to go in the direction of better portability, GCC turned toward a different goal in 3.3 and 3.4. V.3.3 and 3.4 differ from 3.2 in the following points: GCC / cc1 is becoming one huge and complex compiler, absorbing the preprocessor and some of the system headers' contents. I doubt whether this is a better way of constructing a compiler, especially of developing an open source one. As regards mcpp, it is a nuisance that gcc arbitrarily hands the preprocessor some irrelevant options. Since it is risky to ignore all the options unrecognized by mcpp, I did not adopt that approach. Although mcpp ignores the wrong options such as -c or -m*, which are frequently handed over from gcc, it will get an error if other unexpected options are passed on.
In order to avoid conflicts with those wrong options, mcpp V.2.5 changed some options: -c to -@compat, -m to -e, and some others. To use mcpp with GCC 3.2 or earlier, it is only necessary to replace the invocation of cpp0 with mcpp. To use mcpp with GCC 3.3 or later, it is necessary to divide the invocation of cc1 into mcpp and cc1. src/set_mcpp.sh will write shell-scripts for this purpose into the GCC libexec directory on mcpp installation. The 'make install' command will also get the GCC predefined macros using the -dM option and set them for mcpp. *1, *2, *3

In addition, GCC 3.4 changed the processing of multi-byte characters. Its document says: *4 There is a trend to identify "internationalization" with "unicodization", especially among Western people who do not use multi-byte characters. It seems that this trend has reached GCC. What is worse, GCC 3.4 and later do not implement their own specification sufficiently. In actuality, it behaves as follows:

mcpp takes the -e <encoding> option to specify an encoding, and the GCC-specific-build inserts a <backslash> before any byte in a multi-byte character which has the same value as <backslash>, '"' or '\'', when the encoding is one of BIG-5, shift-JIS or ISO2022-JP, in order to compensate for GCC's inability. However, it does not convert the encoding to UTF-8. mcpp also treats -finput-charset as the same option as -e. I adopted these specifications because: *7

Note:
*1 The output of the -dM option, however, differs slightly depending on the other options.
What is worse, most of the predefined macros are undocumented ones.
As a result, the whole picture cannot be grasped easily. *2 MinGW does not support symbolic links. Though the 'ln -s' command exists, it does not link but only copies. Moreover, MinGW's GCC refuses to invoke a shell-script even if it is named cc1. To cope with this, mcpp's MinGW GCC-specific-build generates a binary executable named cc1.exe (copied also to cc1plus.exe) which invokes mcpp.exe or GCC's cc1.exe/cc1plus.exe. *3 CygWIN / GCC has the -mno-cygwin option, which alters the system include directories and GCC's predefined macros. From mcpp V.2.6.1 onward, the CygWIN GCC-specific-build supports this option and generates two sets of header files for the predefined macros. *4 On GCC in my FreeBSD 6.3, multi-byte character conversion to UTF-8 does not work at all, though libiconv seems to be linked to it.
It was the same with FreeBSD 5.3 and 6.2. *5 This conversion seems to be done not in the preprocessing phase but in the compilation phase.
The output of the -E option is still UTF-8. *6 GCC V.4.1-4.3 fail to compile, due to a bug of GCC, if the -save-temps or -no-integrated-cpp option is specified at the same time as a -f*-charset option. *7 When you pass the output of mcpp to cc1, you should specify neither the -fexec-charset nor the -finput-charset option. I compiled the glibc 2.4 (March 2006) source and checked the preprocessing problems in it.
As a compiler system, I used GCC 4.1.1 with mcpp 2.6.3.
Since my machine is x86 type, I did not check the code for other CPUs.
MCPP-PORTING
== How to Port MCPP ==
for V.2.7.2 (2008/11)
Kiyoshi Matsui (kmatsui@t3.rim.or.jp)
Contents
1. Overview
1.1. High portability
1.2. Standard C mode with highest conformance and other modes
1.3. Notations in this Document
This document uses the following typographical conventions:
Navy colored constant-width font is used for code snippets and command line inputs.
Maroon colored constant-width font is used for Standard predefined macros or any other macros found in some codes.
Italic font is used for the macros defined in the mcpp source file named system.H. This document uses these names to denote various mcpp settings as well as the macros themselves. Note that these macros are used only in the compilation of mcpp, and that the mcpp executable does not have such macros.
2. History
I release these as open-source software. I do not have any relationship with DECUS.
The original version did not have a version number, but I refer to it as "DECUS cpp" to differentiate it from mcpp.
In March 2004, V.2.4.1 was released. In this version, recursive macro expansion was revised.
In August 2006, V.2.6.1 was released. In this version, porting to MinGW was added, some bugs were fixed and some relatively small improvements were done.
In November 2006, V.2.6.2 was released. In this version, most of the text file documents were converted to html, some bugs were fixed, and subroutine-build of mcpp was implemented by contribution from Juergen Mueller.
In April 2007, V.2.6.3 was released. In this version, the compatibility of GCC-specific-build with GCC was increased, and output to a memory buffer was implemented in subroutine-build by a contribution from Greg Kress. In addition, binary packages began to be provided for some systems.
In May 2007, V.2.6.4 was released. This is a bug-fixed version of V.2.6.3.
In May 2008, V.2.7.1 was released.
This is a bug-fixed version of V.2.7.
This version changed each binary package on UNIX-like systems to install a shared library or DLL of mcpp and an mcpp executable linking the library.
In November 2008, V.2.7.2 was released.
This is a bug-fixed version of V.2.7.1.
http://www.vector.co.jp/soft/dos/prog/se081188.html
http://www.vector.co.jp/soft/dos/prog/se081189.html
http://www.vector.co.jp/soft/dos/prog/se081186.html
3. How to port mcpp to each compiler system: Overview
It is quite easy to make a compiler-independent-build. It is explained in 3.11. *2
3.1. Already supported compiler systems: Making Compiler-specific-build of mcpp
FreeBSD 6.3                  GCC V.3.4.6
Vine Linux 4.2               GCC V.2.95.3, V.3.2, V.3.3.6, V.3.4.3, V.4.1.1
Fedora Linux 9               GCC V.4.3.0
Debian Linux 4.0             GCC V.4.1.2
Ubuntu Linux 8.04 / x86_64   GCC V.4.2.3
Mac OS 10.5                  GCC V.4.0.1
CygWIN 1.3.10                GCC V.2.95.3
CygWIN 1.5.18                GCC V.3.4.4
MinGW (MSYS 1.0.11)          GCC V.3.4.5
WIN32                        Visual C++ 2003, 2005, 2008
WIN32                        Borland C++ V.5.5J
WIN32                        LCC-Win32 2003-08, 2006-03
patch -c < ..\noconfig\vc2005.dif
Copy the makefile into the src directory as:
copy ..\noconfig\visualc.mak Makefile
3.1.1. Commonly required settings
#define COMPILER INDEPENDENT
#define COMPILER MSC
#define VERSION_MSG "GCC 3.4"
#define VERSION_MSG "Visual C 2005"
nmake COMPILER=MSC
nmake COMPILER=MSC install
make COMPILER=GNUC
make COMPILER=GNUC install
3.1.2. FreeBSD / GCC V.2.*, V.3.*, V.4.*
#define COMPILER INDEPENDENT
#define COMPILER GNUC
#define COMPILER_EXT_VAL "3"
#define COMPILER_EXT2_VAL "4"
#define COMPILER_CPLUS_VAL "3"
#define GCC_MAJOR_VERSION 3
#define SYSTEM_EXT_VAL "6" /* V.5.*: 5, V.6.*:6 */
#define CPLUS_INCLUDE_DIR1 "/usr/include/c++/3.4"
#define CPLUS_INCLUDE_DIR2 "/usr/include/c++/3.4/backward"
#define STDC_VERSION 0L
3.1.3. Linux / GCC V.2.*, V.3.*, V.4.*
#define SYSTEM SYS_FREEBSD
#define SYSTEM SYS_LINUX
#define COMPILER_SP3_VAL "int"
#define COMPILER_SP3_VAL "long int"
gcc -xc -E -v /dev/null
g++ -xc++ -E -v /dev/null
3.1.4. Mac OS X / Apple-GCC V.4.*
3.1.4.1. Native compiler versus cross compiler
3.1.4.2. Universal binary
3.1.5. CygWIN / GCC V.2.*, 3.*
For CygWIN V.1.5.18 / GCC V.3.4.4, apply cyg1518.dif.
#define CYGWIN_ROOT_DIRECTORY "C:/pub/compilers/cygwin"
3.1.6. MinGW / GCC V.3.*
#define MSYS_ROOT_DIRECTORY "C:/Program Files/MSYS/1.0"
#define MINGW_DIRECTORY "C:/Program Files/MinGW"
3.1.7. LCC-WIN32 2003-08, 2006-03
3.1.8. Visual C++ V.6.0, 2002, 2003, 2005, 2008
3.1.9. Borland C++ V.5.*
3.2. Compiler systems to which the DECUS-cpp had been ported
3.3. noconfig.H, configed.H, system.H
3.4. system.c
3.5. Library functions
3.6. Standard headers
3.7. Makefile and recompilation using mcpp
patch -c < ../noconfig/xyz.dif
make
make install
make clean
internal.H further includes mcpp_lib.h.
ln -sf mcpp cpp0
make NAME=mcpp
make PREPROCESSED=1
make -DPREPROCESSED=1
For further information, see mcpp-manual.html#3.9.5 and mcpp-manual.html#3.9.7. In GCC V.3 and V.4, the preprocessor is absorbed into the compiler (cc1, cc1plus), so the invocations of gcc and g++ need to be replaced with shell-scripts if you want to use mcpp.
3.8. Compiler systems which can compile mcpp
3.9. Host compiler system and target compiler system
3.10. Compiler systems on MS-DOS and DJGPP
MS-DOS          Turbo C V.2.0
OS-9/6x09       Microware C
DJGPP V.1.12    GCC V.2.7.1
MS-DOS          LSI C-86 V.3.3 Trial Version
MS-DOS          Borland C 4.0
Plan 9          pcc
Win32           Borland C 4.0
However, if you would like to compile mcpp with one of these compilers, you will easily succeed in making a compiler-independent-build by the following procedure, as long as the compiler has most of the C90 specifications. A compiler-specific-build, on the other hand, is not easy, because it requires various settings.
However, since memory is scarce on MS-DOS, you may get an "out of memory" error when preprocessing a source which has many long macro definitions, even if you have compiled mcpp with these settings.
3.11. Making compiler-independent-build of mcpp
./configure; make; make install
3.12. Making subroutine-build of mcpp
int mcpp_lib_main( int argc, char ** argv);
3.12.1. Using configure
./configure --enable-mcpplib
make
sudo make install
make CFLAGS+='-isysroot /Developer/SDKs/MacOSX10.4u.sdk -mmacosx-version-min=10.4 -arch i386 -arch ppc'
3.12.2. Using noconfig/*.mak
make MCPP_LIB=1 mcpplib
make MCPP_LIB=1 mcpplib_install
For Visual C, use 'nmake' instead of 'make'.
Since the 'make' command attached to LCC-Win32 cannot handle the 'if' directive, you must edit the makefile whenever you run a different 'make'.
On Mac OS X, if you remove the '#' which comments out the definition of the variable UFLAGS in mac_osx.mak, a universal binary will be generated for each library.
On Windows, the names of the libraries differ from those on Linux, as shown in the table below.
FreeBSD / GCC:      static library libmcpp.a;  shared library libmcpp.so.$(SHL_VER)
Linux / GCC:        static library libmcpp.a;  shared library libmcpp.so.$(SHLIB_VER)
Mac OS X / GCC:     static library libmcpp.a;  shared library libmcpp.$(SHLIB_VER).dylib
CygWIN / GCC:       static library libmcpp.a;  DLL cygmcpp-$(DLL_VER).dll;  import library of DLL libmcpp-$(DLL_VER).dll.a
MinGW / GCC:        static library libmcpp.a;  DLL libmcpp-$(DLL_VER).dll;  import library of DLL libmcpp-$(DLL_VER).dll.a
Visual C, Borland C, LCC-Win32:  static library mcpp.lib;  DLL mcpp$(DLL_VER).dll;  import library of DLL mcpp$(DLL_VER).lib
make testmain
make testmain_install
3.12.3. Static library and shared library or DLL
4. How to port mcpp to each compiler system: Details
4.1. Setting of noconfig.H, configed.H, system.H
Select the OS on which the target compiler operates. The names of the OSes are defined right after this; define an appropriate name for an OS which is not defined there.
Select the target compiler system. The names of the compilers are defined right after this; define an appropriate name for a compiler system which is not defined there. When COMPILER is defined as INDEPENDENT, a compiler-independent-build of mcpp will be made, which has no target compiler. In this case, most of the settings in PART 1 are ignored.
Write the version information of the host compiler as a string literal to be displayed by -v option or by usage().
Select the host OS and the host compiler system. If these are the same as the target, set as
#define HOST_SYSTEM SYSTEM
#define HOST_COMPILER COMPILER
4.1.1. PART 1: Setting of Target system: for Compiler-specific-build
4.1.1.1. Predefined macros
Specify the unique macro names of the compiler system, which will be predefined by mcpp, as string literals. Leave any unnecessary ones undefined (they should not be defined as 0 tokens). *_OLD generates the old-style macros which do not begin with '_' (underscore); these won't be predefined at mcpp execution time if 1 or more is specified for <n> of the -S<n> option.
(Except GCC-specific-build which predefines these macros even in STD mode, unless -ansi or -std=iso* option is specified.)
In *_STD?, *_EXT and *_EXT2, always specify a macro name beginning with '_'. *_STD1 begins with '__', and *_STD2 both begins and ends with '__'. For SYSTEM_EXT, SYSTEM_EXT2, COMPILER_STD1, COMPILER_STD2, COMPILER_EXT and COMPILER_EXT2, the values of the macros are also specified, by SYSTEM_EXT_VAL, SYSTEM_EXT2_VAL, COMPILER_STD1_VAL, COMPILER_STD2_VAL, COMPILER_EXT_VAL and COMPILER_EXT2_VAL, respectively. Each value is defined as a string literal, i.e. the integer enclosed in '"'. A macro that expands to a 0 token is defined as "". If nothing is specified, the value of the macro becomes 1. All the other predefined macros (the ones specified by SYSTEM_OLD, SYSTEM_STD1, SYSTEM_STD2, COMPILER_OLD) have a value of 1.
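For illustration only, a hypothetical set of these settings for a Linux / GCC target might read as follows; the real names and values should be taken from the configed.H generated by 'configure', or from the settings for a similar system in noconfig.H:

#define SYSTEM_OLD      "linux"             /* old style, no '_'        */
#define SYSTEM_STD1     "__linux"           /* begins with '__'         */
#define SYSTEM_STD2     "__linux__"         /* begins and ends with '__'*/
#define SYSTEM_EXT      "__gnu_linux__"
#define SYSTEM_EXT_VAL  "1"                 /* value, as a string literal */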
Write the compiler system's unique special predefined macro names as string literals, and define their values by SYSTEM_SP_OLD_VAL and SYSTEM_SP_STD_VAL.
Write the compiler system's unique special predefined macro names as string literals, and define their values by COMPILER_SP1_VAL, COMPILER_SP2_VAL and COMPILER_SP3_VAL.
Specify the name and the value of the compiler system's unique predefined macro which is defined when the -+ option (C++ preprocessing) is specified, as string literals as above. If COMPILER_CPLUS_VAL is not specified, the macro's value becomes 1. The name has to begin with '_'. If not required, leave COMPILER_CPLUS itself undefined.
4.1.1.2. Include directories and others
Specify the include directories of the standard header files searched by mcpp. CPLUS_INCLUDE_DIR? should be set when the include directories of C++ differ from those of C. (When invoking mcpp, these are enabled by the -+ option.) Since /usr/include and /usr/local/include on UNIX are set in system.c, the compiler-system-specific directories should be set in C_INCLUDE_DIR?.
Define the names of the environment variables with which the include directories for the standard header files searched by mcpp are specified at execution time.
ENV_CPLUS_INCLUDE_DIR is the name of the environment variable which specifies the include directory of C++.
Each of them defaults to "INCLUDE" and "CPLUS_INCLUDE", respectively. For GCC-specific-build, "C_INCLUDE_PATH" and "CPLUS_INCLUDE_PATH" are the defaults.
Other search paths are setup in system.c and by the -I option. (About the priority of these, see mcpp-manual.html#4.2.)
When writing multiple paths in the above environment variables, write the separator into the literal constant: ':' as in /usr/local/abc/include:/usr/local/xyz/include, or ';' as in C:/BC55/INCLUDE;C:/BC55/LOCAL/INCLUDE.
Specify the default rule for searching include files: when processing a directive such as #include "../dir/header.h", which directory should be searched first for the relative path. If this is set to CURRENT, the search starts from the current directory of the mcpp invocation. If set to SOURCE, it starts from the directory of the source file (the includer). If set to (CURRENT & SOURCE), it searches the relative path from the current directory first, and then from the directory of the source file.
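A minimal sketch, assuming that the setting in question is the macro called SEARCH_RULE in noconfig.H (configed.H):

#define SEARCH_RULE     (CURRENT & SOURCE)
    /* #include "../dir/header.h" searches the relative path from the
       current directory first, then from the includer's directory    */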
4.1.1.3. The output format of line number information and others
Specify the format for passing the file name and the line number information from mcpp to the compiler-proper.
#line 123 "fname"
The format of the above Standard C source code is set as default. Write an alternative sequence into the string literal to replace this "#line " for compilers which use other formats.
#123 "fname"
If the format is the one above, define this as "# ". If it is a peculiar format which is neither of the above, define the format to match. (In some cases, additions may be needed to sharp() or other functions in main.c.)
When mcpp is used as the front end of a one-pass compiler, such as Visual C or Borland C, the output of mcpp has to be Standard C source code so that it can be passed to the built-in preprocessor. Hence, the line number information has to be transferred in the first format.
If EMFILE is not the macro in <errno.h> for the value of errno which means "too many open files (for the process)", define EMFILE to that macro name. (Of course, you can also add it to <errno.h> itself.)
If the target compiler is a so-called one-pass compiler, in which the preprocessor is not separated, set this to TRUE; otherwise set it to FALSE. If this is set to TRUE, all the predefined macros of the compiler system are output enclosed within comment marks by #pragma MCPP put_defines (#put_defines). This is to prevent duplicate definitions, as the output will be preprocessed again when it is passed to the compiler.
Though GCC 3 and 4 integrate the preprocessor into the compiler, this macro should be set to FALSE, as an independent preprocessor can also be used.
Define this as TRUE for an OS which does not distinguish upper and lower case in file names, such as Windows; otherwise set this to FALSE.
4.1.1.4. Settings corresponding to the system's language specifications
Set this to TRUE for a compiler which expands macros in a #pragma line unless STDC is the first argument of that line. The default is FALSE. For Visual C and Borland C, set this to TRUE, as the arguments of #pragma lines there are always subject to macro expansion. In C99, it is implementation-defined whether or not the arguments are macro-expanded, and in C90 the arguments are never expanded. However, mcpp, if and only if it is a build for Visual C or Borland C, expands macros even in C90 mode, except when the argument of the #pragma line starts with STDC or MCPP.
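As an illustration, assuming that this setting is the one named EXPAND_PRAGMA in system.H, a compiler-specific-build with EXPAND_PRAGMA == TRUE would process a source like this:

#define MAX_DEPTH   8
#pragma inline_depth( MAX_DEPTH)    /* output as: #pragma inline_depth( 8)  */
#pragma STDC FP_CONTRACT OFF        /* not macro-expanded: begins with STDC */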
Set this to TRUE when the compiler can process digraphs, otherwise set this to FALSE.
This defines the default value of the predefined macro __STDC__ in the target compiler. If __STDC__ is not defined, set this to 0.
This defines the default value of the predefined macro __STDC_VERSION__ in the target compiler. If __STDC_VERSION__ is not defined, set to 0L.
Write values of CHAR_BIT, UCHAR_MAX, LONG_MAX, ULONG_MAX in <limits.h> of the target compiler system.
Be careful about the definition of UCHAR_MAX, because limits.h on some systems has a wrong definition. Define it as a signed value such as 255 or 0xff; do not define it as an unsigned value such as 255U or 0xffU. (See cpp-test.html#5.1.3.)
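For example, for a typical 32-bit target these settings would read as follows (the internal names here, CHARBIT, UCHARMAX, LONGMAX and ULONGMAX, are assumed from noconfig.H; note the signed notation of UCHARMAX, as explained above):

#define CHARBIT         8               /* CHAR_BIT  of target <limits.h>   */
#define UCHARMAX        0xff            /* UCHAR_MAX, in signed notation    */
#define LONGMAX         0x7fffffffL     /* LONG_MAX                         */
#define ULONGMAX        0xffffffffUL    /* ULONG_MAX                        */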
4.1.1.5. Multi-byte characters
Define the encoding for multi-byte characters (Kanji, in the case of Japanese) of the target.
The first five are all encodings in which a character occupies 2 bytes, without shift-states. Though wchar_t is a 4-byte type in some compiler systems even where the encodings of multi-byte and wide characters are 2-byte, the preprocessor is not concerned with the type of wchar_t: since multi-byte or wide characters occupy 2 bytes in the source code, it processes them accordingly.
EUC_JP          Japanese extended UNIX code (UJIS)
SJIS            Japanese shift-JIS (MS-Kanji)
GB2312          Chinese EUC-like GB2312 (simplified-Chinese)
BIGFIVE         Taiwanese Big Five (traditional-Chinese)
KSC5601         Korean EUC-like KSC-5601 (KSX 1001)
ISO2022_JP      International standard ISO-2022-JP1 Japanese
UTF8            A type of Unicode encoding, UTF-8
ISO-2022-* are encodings with shift-states. UTF-8 encodes a 2-byte Unicode character into 1, 2 or 3 bytes; Kanji (Chinese characters) become 3 bytes.
When MBCHAR is defined as 0, multi-byte characters are not processed by default; environment variables, options or #pragmas can change this at execution time.
Set this to TRUE when the compiler-proper processes shift-JIS; if it does not, set this to FALSE.
In shift-JIS, there are cases where the second byte of a Kanji has the value 0x5c, which is the same as '\\'. If the compiler-proper does not recognize shift-JIS, it interprets that byte as an escape sequence and gets an error at tokenization.
If SJIS_IS_ESCAPE_FREE is set to FALSE, mcpp processes shift-JIS itself: when 0x5c appears as the second byte of a shift-JIS Kanji within a string literal or character constant, mcpp adds one more 0x5c at the final output. This tentatively makes an English-version compiler support encodings such as shift-JIS.
Similarly, set this to TRUE when the compiler-proper processes Big Five, and to FALSE if not.
Similarly, set this to TRUE if the compiler-proper processes ISO-2022-JP, and to FALSE if not. With ISO-2022-*, there may be bytes which match not only '\\' but also '\'' or '"'. If ISO2022_JP_IS_ESCAPE_FREE is FALSE, mcpp inserts a 0x5c byte before every byte matching '\\', '\'' or '"'.
4.1.1.6. Target and host system common settings
Set this to TRUE for a compiler system which has the 'long long' data type. Also set this to TRUE for compilers, such as Visual C up to 2005 or Borland C 5.*, which do not have 'long long' but have the same-size data type __int64 and provide a length modifier to display its value by printf().
Visual C 2008 has the 'long long' type.
If the data type called 'intmax_t' is defined, set this to TRUE.
If the system has 'long long', define the length modifier used by printf() to display the maximum integer type of the host compiler system. This is "j" in C99. The length modifier of 'long long' is "ll" (ell-ell) in C99. In Visual C 2003 or older and Borland C 5.*, use "I64" to display the value of '__int64'; also use "I64" in MinGW.
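A hypothetical setting for Visual C 2003, for instance, might look like this (the names HAVE_LONG_LONG and LL_FORM are assumed here; check noconfig.H for the real ones):

#define HAVE_LONG_LONG      TRUE    /* has __int64, same size as 'long long'    */
#define LL_FORM             "I64"   /* printf( "%I64d", ...); "ll" in C99       */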
4.1.2. PART 2: Setting of Host system
If the library of the host system has stpcpy(), define this as TRUE; if not, define it as FALSE. If it is set to FALSE, the stpcpy() in system.c is used.
The value of PATH_MAX defined in <limits.h> of the host system.
If it is not available, FILENAME_MAX in <stdio.h> will be used.
#if _MSC_VER >= 1200
4.1.3. PART 3: Setting of the mcpp behavior specification
4.1.3.1. Several behavioral modes of new and old
There is another mode called COMPAT which is a variation of STD.
"Reiser" model cpp behavior.
K&R 1st specification mode.
Standards (C90, C99, C++98) conforming mode.
Special "post-Standard" mode created by the author, based on the Standards and simplified removing all the irregular specifications.
4.1.3.2. Specifying the details of the behavioral modes
When executing as a C++ preprocessor by -+ option, the Standard macro __cplusplus is predefined to this value. This is 199711L for C++98. This can be changed at the execution time by -V option.
Specify the initial state of trigraph processing. The -3 option reverses this state: if this is set to TRUE, trigraphs are recognized by default and become unrecognized when invoked with the -3 option; if this is set to FALSE, it is the other way around, i.e. trigraphs are not recognized by default and become recognized with -3.
Specify the initial state of digraph processing. The -2 option reverses this state: if this is set to TRUE, digraphs are recognized by default and become unrecognized when invoked with the -2 option; if this is set to FALSE, they are not recognized by default and become recognized with -2.
If HAVE_DIGRAPHS == FALSE, mcpp converts digraphs to normal tokens.
Set this to TRUE for making UCN (universal character name) effective when invoked by -V199901L or -+ options. Default is set to TRUE. *1
Set this to TRUE to be able to use multi-byte characters in identifiers when invoked by -V199901L. Default is set to FALSE.
Typedef for the maximum integer type. If the intmax_t and uintmax_t types exist, define it to them; else, if the compiler system has long long and unsigned long long, define it to those; else, if it has __int64 and unsigned __int64, define it to these; else, define it to long and unsigned long. Note that long long and unsigned long long are required in C99.
Define the maximum value of uexpr_t.
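A sketch of these typedefs, assuming the names expr_t / uexpr_t used in the mcpp sources and hypothetical setting names HAVE_INTMAX_T and HAVE_LONG_LONG; the maximum value of an unsigned type can be written portably as a cast of -1:

#if     HAVE_INTMAX_T
typedef intmax_t            expr_t;
typedef uintmax_t           uexpr_t;
#elif   HAVE_LONG_LONG
typedef long long           expr_t;
typedef unsigned long long  uexpr_t;
#else
typedef long                expr_t;
typedef unsigned long       uexpr_t;
#endif
#define UEXPRMAX    ((uexpr_t) -1)      /* hypothetical name for the maximum */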
4.1.3.3. Specifying translation limits
Defines the limit on the number of rescans during macro expansion in Standard mode. It does not need to be set to too big a value, as the number of rescans is usually small in Standard mode.
Defines the limit on the number of rescans during macro expansion in pre-Standard mode. An infinite loop can occur through recursive macro expansion; this limit stops it.
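For example, without this limit, the following pre-Standard expansion would never terminate:

#define f(a)    f(a)        /* the replacement list contains the macro itself */
    f( x)                   /* each rescan yields f( x) again, indefinitely   */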
Define the maximum length +1 of a logical line (the line spliced by deleting '\' at the end of physical lines of the source). A line whose comments have been converted to spaces (it can spread over multiple logical lines, depending on the comments) has to be within this length, too.
Define the internal buffer size for macro expansion. The result of expanding the macros within one logical line (when a macro call spreads over multiple lines, the result of its expansion) has to be within this size. This buffer is also used as the maximum length for internally memorizing the replacement list of one macro definition. This should be greater than or equal to NBUFF*2 and greater than or equal to NWORK*2.
Defines the maximum length of an output line of mcpp. This cannot be more than the maximum length +1 that the compiler-proper can accept, and it cannot be more than the value of NBUFF. When the line length after macro expansion exceeds this, mcpp divides the line into lines shorter than this value and outputs them. The length of a string literal has to be within NWORK-2. (The length of a string literal here is not the number of elements of the char array, but the length of the string literal token in the source code, including the '"' on both sides; e.g. \n is counted as 2 bytes, and the 'L' prefix of a wide string literal is also counted.)
For GCC-specific-build and Visual C-specific-build, however, mcpp uses NMACWORK instead of NWORK for output line length.
In other words, it does not divide output lines, because these compilers can accept very long lines.
Defines the maximum length of an identifier. A name longer than this value is not an error, but is cut down to this length.
Defines the maximum number of arguments of function-like macros. This cannot be bigger than UCHARMAX.
Defines the limit of the nesting level of parentheses in a #if expression. (In reality, the nesting level is not directly decided by this; specifically, up to twice this number of constant tokens and three times this number of operator tokens can be used within one expression, counting a pair of parentheses as 2.)
Defines the limit of the nest level of #if (#ifdef, #ifndef) sections (how many levels #if and so on can be nested).
Defines the limit of the nesting level of #include. This prevents infinite recursion of #includes. This can exceed the limit imposed by the OS on the number of simultaneously opened files.
Defines the number of elements of the hash table in which macros are internally classified by hash value and stored. This has to be a power of 2. mcpp operates correctly even when this is much smaller than the number of macros, but processing is slightly quicker when it is set bigger.
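Putting the translation limits together, a hypothetical configuration in the style of noconfig.H might read as follows (the names follow the descriptions above; the values are only examples and should be tuned to the target):

#define NBUFF           (16 * 1024)     /* logical line length + 1              */
#define NWORK           NBUFF           /* max length of an output line         */
#define NMACWORK        (NBUFF * 4)     /* macro expansion buffer; >= NBUFF*2, >= NWORK*2 */
#define IDMAX           1024            /* significant length of identifiers    */
#define NMACPARS        255             /* macro parameters; <= UCHARMAX        */
#define NEXP            64              /* parenthesis nest in #if expressions  */
#define BLK_NEST        256             /* nest of #if sections                 */
#define INCLUDE_NEST    256             /* nest of #include                     */
#define NHASH           1024            /* macro hash table; a power of 2       */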
4.2. system.c
Defines the path-delimiter of the OS. PATH_DELIM must not be '\\' (for the program's convenience); it is set to '/' even for Windows. Of course, you can use '\\' in user programs, but it is converted to '/' internally.
Defines, as a string literal, the suffix of the object files generated by the compiler: "o" for compilers on UNIXes, "obj" for compilers on Windows. This is used for the output of the dependency lines for makefiles when one of the -M* options is specified.
Implements the invocation options. When porting mcpp to a compiler system to which it hasn't been ported yet, you may need to add some lines to match the compiler driver of the system. When you add to do_options(), you also need to add to set_opt_list() and usage() correspondingly.
do_options() calls mcpp_getopt(), which has the same specification as getopt() of POSIX, so you have to decide whether each option character takes an argument or not. As a basic rule, options like -P and -P- cannot be implemented simultaneously. (If this is necessary for compatibility with the compiler system's resident preprocessor, it can be done; refer to the implementation of the -M option.) Also, a longer option such as -trigraphs has to be implemented with 't' as the option character and 'rigraphs' as its argument.
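A minimal sketch of this pattern; the declaration of mcpp_getopt() is assumed here to mirror POSIX getopt(), mcpp_optarg is assumed to be the counterpart of optarg, and the variables and the option string are only for illustration:

#include <stdio.h>      /* EOF      */
#include <string.h>     /* strcmp() */

extern int      mcpp_getopt( int argc, char * const * argv, const char * opts);
extern char *   mcpp_optarg;

static int      tflag;                  /* hypothetical flag for -trigraphs */

static void parse_options( int argc, char ** argv)
{
    int     opt;

    while ((opt = mcpp_getopt( argc, argv, "23t:")) != EOF) {
        switch (opt) {
        case 't':                       /* 't' is declared as taking an argument */
            if (strcmp( mcpp_optarg, "rigraphs") == 0)
                tflag = 1;              /* the whole option was -trigraphs  */
            break;
        /* ... other option characters ... */
        }
    }
}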
Sets option characters for do_options().
Usage message. The options are classified by modes and written in alphabetical order.
Sets the include directories. Besides the compiler-specific directories specified in noconfig.H (configed.H) by the macros C_INCLUDE_DIR? or CPLUS_INCLUDE_DIR?, /usr/include and /usr/local/include on UNIX OSes are also set here. (The include directories specified by the environment variables whose names are defined in noconfig.H or configed.H by the macros ENV_C_INCLUDE_DIR and ENV_CPLUS_INCLUDE_DIR are set up in set_env_dirs().)
Implements the processing of #pragma. A #pragma sub-directive which mcpp does not process is passed to the compiler-proper as is. Those which mcpp processes by itself, such as '#pragma MCPP debug', are processed by the functions called from this function. The sub-directives which mcpp processes by itself principally begin with the name MCPP. '#pragma MCPP *' lines themselves are not output; neither is a '#pragma once' line. On the other hand, a '#pragma __setlocale' line is output. In Standard C, the extension directives of individual compiler systems have to be implemented as #pragma sub-directives.
If you require preprocessing directives which don't conform to Standard C (ones which are not #pragma sub-directives, such as #assert, #asm, #endasm, #include_next, #warning, #put_defines, #debug), add a function which processes each of them and call it from here. (For GCC, however, #include_next and #warning can also be used in STD mode.)
4.extra malloc()
5. Bug reporting and porting report
5.1. Is this a bug?
5.2. Check for malloc() related bugs
#pragma MCPP debug memory
5.3. Bug report
5.4. Porting report
Since the diagnostic messages are output to the file called mcpp.err by the -Q option, read it using a pager or the like. The -z option omits the output of the header files.
Digraphs and trigraphs become valid with -2 or -3, respectively. -S1 and -V199409L set __STDC__ to 1 and __STDC_VERSION__ to 199409L, respectively.
To test C99 compatibility, check n_std99.t and e_std99.t with the -V199901L option.
5.5. Information about Configure for other compiler systems than GCC
5.6. I will try to port mcpp if you send me the data.
Also, for compiler systems where the processed result of #line 1000 does not become #line 1000 "t_line.c" but takes some other format, such as #1000 "t_line.c", modify it to #line 1000 "t_line.c" and pass it through to the compiler-proper; then check whether it is recognized or not. (If #line 1000 "t_line.c" does not cause an error, an error message should be issued on the line of "error line;". Check how the line number is displayed in that error message.)
/* t_line.c */
#include <stdio.h>
#line 1000
error line;
int main(void)
{
return 0;
}
5.7. Please report the test of other compiler systems by the Validation Suite.
5.8. The feedback for improvement
6. Long way to mcpp
6.1. Three days to plan and six years to develop
6.2. V.2.3
6.3. Selected to "Exploratory Software Project"
MCPP-MANUAL
== How to Use MCPP ==
for V.2.7.2 (2008/11)
Kiyoshi Matsui (kmatsui@t3.rim.or.jp)
Contents
1. Overview
1.1. High portability
1.2. Standard C mode with highest conformance and other modes
http://www.vector.co.jp/soft/dos/prog/se081188.html
http://www.vector.co.jp/soft/dos/prog/se081189.html
http://www.vector.co.jp/soft/dos/prog/se081186.html
1.3. Notations in this Manual
This manual uses the following typographical conventions:
Navy colored constant-width font is used for code snippets and command line inputs.
Maroon colored constant-width font is used for Standard predefined macros or any other macros found in some codes.
Italic font is used for the macros defined in the mcpp source file named system.H. This manual uses these names to denote various mcpp settings. Note that these macros are used only in the compilation of mcpp, and that the mcpp executable does not have such macros.
2. Invocation Options and Environment Settings
2.1. Two Kinds of Build and Five Behavioral Modes
The STD mode (default).
The COMPAT mode.
The POSTSTD mode.
The KR mode.
The OLDPREP mode.
Expands recursive macros further than the Standard's specification: on expanding a recursive macro, the range in which the same name is not re-replaced is set narrower than in the Standard.
Refer to cpp-test.html#3.4.26 for the specifications of recursive macro expansion, and see test-t/recurs.t for a sample of a recursive macro. *4
This mode differs from STD mode in the following points:
2.2. How to Specify Invocation Options
mcpp [-<opts> [-<opts>]] [in_file] [out_file] [-<opts> [-<opts>]]
2.3. Common Options
Output the comments in the source code as well. This option is useful for debugging. Note that a comment is moved ahead of its logical source line on output; this is because comments are processed before macro expansion or directive processing, and a comment may appear in the middle of a macro invocation.
Define a macro named "macro". This option can be used to change the definitions of predefined macros other than __STDC__, __STDC_VERSION__, __FILE__, __LINE__, __DATE__, __TIME__ and __cplusplus. (__STDC_HOSTED__, a C99 predefined macro, can exceptionally be redefined by this option, because some compiler systems, like GCC V.3, use the -D option to define __STDC_HOSTED__.) To specify a value, use "=<value>"; if "=<value>" is omitted, 1 is assumed. (Note that in bcc32, the macro is defined as a zero token by default.) Do not enter white-space characters immediately before "="; if a white-space character is entered immediately after "=", the macro is defined as a zero token.
A macro with arguments can be defined by this option.
This option can be specified repeatedly.
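For example (with hypothetical file and macro names; the quotes are for the shell):

mcpp -DDEBUG -DVERSION=2 "-Dmax(a,b)=((a)>(b)?(a):(b))" foo.c foo.i

Here DEBUG is defined as 1, VERSION as 2, and max as a function-like macro with two parameters.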
Change a multi-byte character encoding to <encoding>. For <encoding>, refer to 2.8.
Specify the first directory in the include directory search order with <directory>. For the search order, refer to 4.2. If a directory name contains spaces, it has to be enclosed in '"'s.
Specify the directory from which mcpp begins searching when it encounters a #include "header" directive (i.e. not the <header> format). -I1, -I2 and -I3 indicate the current directory, the source file's (i.e. the includer's) directory, and both, respectively. For details, see 4.2.
On outputting a diagnostic message, mcpp displays only the one-line diagnostic, without additional information such as source lines. (By default, a one-line diagnostic message is followed by the source line having the problem; if the source line in question is in a #included file, all the #including file names and their including line numbers are also displayed in sequence. For a diagnostic on a macro, mcpp also displays its definition information.)
When Validation Suite is used in the GCC testsuite, this option has to be specified to output a diagnostic message in the same format as GCC.
Output lines that describe the dependencies among source files. The output destination is the file specified in the command line, or stdout if omitted. If a dependency description is too long to fit in one line, it is folded over the following lines. The preprocessing result is not output.
Almost the same as -M, except that the following header files are not output.
GCC-specific-build, however, differs from this: it outputs the header files, excluding only the system headers, as GCC does.
Almost the same as -M, except that the preprocessing result is output, to the file specified in the command line or to stdout. If FILE is specified, mcpp outputs the dependency description lines to that file; otherwise, they are output to a file having the same base name as the source file, with the suffix ".d" instead of ".c".
Almost the same as -MD, except that, like -MM, the files regarded as system headers are not output. The file to which mcpp outputs the dependency description lines is the same as with -MD [FILE].
The dependency lines are output to FILE. -MF FILE takes precedence over -MD FILE or -MMD FILE.
"Phony targets" are also output. Each included file can be written as a phony target without a dependency as follows:
test.o: test.c test.h
test.h:
The target name is output as TARGET instead of foo.o. For example, -MT '$(objpfx)foo.o' outputs the following line:
$(objpfx)foo.o: foo.c
Same as -MT, except that a string that has a special meaning to 'make' is 'quoted' as follows:
$$(objpfx)foo.o: foo.c
Disable all the predefined macros, including those that begin with "_", except for the ones required by Standards and __MCPP. The Standard predefined macros include __FILE__, __LINE__, __DATE__, __TIME__, __STDC__, __STDC_VERSION__, as well as __STDC_HOSTED__ for C99 and __cplusplus for C++. If you want to disable __MCPP, use the -U option.
Output the preprocessed source to the file. If this option is omitted, the second file argument is regarded as the output path, so this option is not necessary; however, some compiler drivers use it.
Do not output line number information for the compiler-proper. This option is specified when you want to use mcpp for purpose other than C preprocessing.
Output diagnostic messages to the "mcpp.err" file in the current directory. As the messages are appended to this file, it may become big; delete it from time to time.
Undefine the predefined macro named "macro". This option cannot undefine __FILE__, __LINE__, __DATE__, __TIME__, __STDC__, __STDC_VERSION__ (and __STDC_HOSTED__ for C99), nor __cplusplus when invoked with the -+ option.
Output the mcpp version and a search order of include directories to stderr.
However, when the -K option, explained in 2.4, or the '#pragma MCPP debug macro_call' directive, explained in 3.5.8, is specified, this option changes its meaning.
Specify a warning level with <level>. <level> should be 0 or the "OR" of any one or more of the values 1, 2, 4, 8 and 16; each of these values indicates a warning class. For example, if -W 5 is specified, warnings of classes 1 and 4 are output. If 0 is specified, no warnings are output. If this option is specified several times, all the specified values are "ORed" together; for example, -W 1 -W 2 -W 4 is equivalent to -W 7. Instead of -W 7 you can also write -W "1|2|4". (Enclose it in '"'s so that | is not interpreted as a pipe.) If this option is omitted, -W 1 is assumed. For the warning messages, refer to 5.5 through 5.9.
The preprocessing results of #included files are not output, but their macros are defined. The #include lines themselves are output instead, though #include lines within an included file are not output. This option is used for debugging of preprocessing.
2.4. Options by mcpp Behavioral Modes
Behave as a C++ preprocessor. mcpp predefines the __cplusplus macro (its value is defined in system.H and defaults to 1), interprets the text from // to the end of a logical line as a comment, and recognizes "::", ".*" and "->*" as single tokens. It evaluates the "true" and "false" tokens in a #if expression to 1 and 0, respectively. If __STDC__ and __STDC_VERSION__ are defined, they are undefined. (For GCC-specific-mcpp, __STDC__ is not undefined, for compatibility with GCC.) The predefined macros that do not begin with '_' are also undefined. However, extended characters are not converted to UCN. *1, *2
Reverse the initial setting of digraph processing. With DIGRAPHS_INIT == FALSE, mcpp recognizes digraphs; otherwise, it doesn't.
Define the value of __STDC_HOSTED__ macro with <n>.
Change the value of __STDC__ to <n> in C. (In C++, this option is ignored.) The range of <n> has to be 0-9. With <n> set to 1 or higher, the predefined macros that do not begin with '_', such as unix and linux, are disabled. 'S' indicates __STDC__. If this option is omitted, __STDC__ is set to its default value (i.e. 1).
For a GCC-specific-build, -pedantic, -pedantic-errors or -lang-c89 is equivalent to -S1, and any subsequent -S is ignored.
For compatibility with GCC, this option does not disable the non-conforming predefined macros such as unix, linux and i386.
These macros are disabled only by -ansi or -std=iso* options.
Change the value of the predefined macro __STDC_VERSION__ for C, or __cplusplus for C++, to <value>. <value> is of long type. (In C95, C99 and the C++ Standard, this value is 199409L, 199901L and 199711L, respectively.) With __STDC__ set to 0, __STDC_VERSION__ is always set to 0L, overriding the -V option.
If this option is omitted for C, __STDC_VERSION__ is set to the value of STDC_VERSION in system.H (for GCC V.2.7-V.2.9, 199409L; for others, 0L). If specifying -V199901L results in __STDC_VERSION__ >= 199901L, mcpp conforms to the following C99 specifications (see 3.7):
Note that although C99 provides for variable argument macros, mcpp allows them in the C90 and C++ modes. *4
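For example, a C99-notation variadic macro such as the following hypothetical one is accepted by mcpp even in C90 mode:

#include <stdio.h>

#define DEBUG_PRINTF( format, ...)  fprintf( stderr, format, __VA_ARGS__)

/* DEBUG_PRINTF( "x = %d\n", x);
   expands to: fprintf( stderr, "x = %d\n", x);     */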
In C++ too, when specifying -V199901L results in __cplusplus >= 199901L, mcpp enters the C99 compatibility mode, providing enhancements 2-4 above (1 is enabled unconditionally, and 5 is almost the same). These are mcpp's own enhancements, which do not conform to the C++ Standard.
The -D option cannot be used with __STDC__, __STDC_VERSION__, and __cplusplus. This is to distinguish system-defined macros from user-defined ones.
Reverse initial settings for the trigraphs processing. With TRIGRAPHS_INIT == FALSE, mcpp recognizes trigraphs. Otherwise, it does not.
Enable the macro notification mode, which embeds macro notifications into comments. This mode is designed to allow reconstruction of the original source position from the preprocessed output. The primary purpose of the macro notification mode is to allow C/C++ refactoring tools to refactor source code without having to implement a special-purpose C preprocessor. This mode is also handy for debugging macro expansions. The goal of the macro notification mode is to annotate every macro expansion while still allowing the code to be compiled. *5
This mode is also enabled by the following pragma:
#pragma MCPP debug macro_call
The -K option has almost the same effect as writing this pragma at the top of an input file, except that predefined macros are notified only by this option.
About the specs of macro notification, see 3.5.8. #pragma MCPP debug macro_call.
This option implies -k option.
In this mode, the -v option changes its meaning and outputs more verbose notations.
On the other hand, the -a (-x assembler-with-cpp) and -C options automatically disable the -K option. *6
Keep horizontal white spaces ('\t' and space characters) without squeezing them into one space.
Comments are converted to spaces of the same length.
This option serves to keep the column positions of the source file in the preprocessed output, except within macro expansions. (The column positions of macros can be known by the -K option.) *7
2.5. Common Options Except for Some Compiler Systems
Predefine macros for 32bit mode.
If the CPU is x86_64 or ppc64, predefined macros for 64bit mode are used by default.
With this option, however, those for i386 or ppc respectively are used.
Predefine macros for 64bit mode.
If the CPU is i386 or ppc, predefined macros for 32bit mode are used by default.
With this option, however, those for x86_64 or ppc64 respectively are used.
Accept the following notations used in some assembler sources without causing an error.
#APP
# + any comment.
"A very very
long long
string literal"
This option cannot be used in POSTSTD mode.
This manual calls this mode lang-asm mode.
This mode is recommended when you use mcpp as a macro processor for some text other than C/C++, for example, as a cpp called from xrdb.
Cancel the default include directories and enable only the ones specified with an environment variable or the -I option. GCC-specific-build uses -nostdinc instead of -I-. In GCC, the -I- option provides quite different functionality; see 2.6.
2.6. Options by Compiler System
gcc -Wp,-W31,-Q23
Define the __LCCDEBUGLEVEL macro as <n>.
Defines the __LCCOPTIMLEVEL macro as 1.
Define the macro _M_IX86_FP as 1, 2 respectively.
Same as -include <file> for GCC.
If <n> is one of 3, 4, 5, 6, B, define the macro _M_IX86 as 300, 400, 500, 600, 600, respectively.
Define the macro _CPPRTTI to 1.
Define the macro _CPPUNWIND to 1.
Define the macro __MSVC_RUNTIME_CHECKS to 1.
Define the macro _CHAR_UNSIGNED to 1.
If -RTC1, -RTCc, -RTCs, -RTCu and such option is specified, define the macro __MSVC_RUNTIME_CHECKS to 1.
Specify that the source is written in C. The result is the same with or without this option.
Same as -+.
Same as -N.
Same as -W17 (-W1 -W16).
Same as -j.
Same as -W0.
Same as -I-.
Undefine the macro _MSC_EXTENSIONS and prohibit '$' in identifiers.
Define the macros _NATIVE_WCHAR_T_DEFINED and _WCHAR_T_DEFINED to 1.
Define the macro _VC_NODEFAULTLIB to 1.
Put the <framework> directory to top of the framework directories to search.
The standard framework directories are /System/Library/Frameworks and /Library/Frameworks by default.
Change the target architecture to <arch> from the default one.
This changes some of the predefined macros.
<arch> should be i386 or x86_64 on the preprocessor for x86, ppc or ppc64 on the one for ppc.
(You can specify any of these 4 for gcc command.
gcc command invokes the preprocessor for x86 on '-arch i386' or '-arch x86_64' options, and the one for ppc on '-arch ppc' or '-arch ppc64' options.)
Same as -fno-dollars-in-identifiers.
Output line number information just like C sources.
The format used to pass the line number information from a preprocessor to compiler-proper is usually as follows:
#line 123 "filename"
Most compiler systems can use this C source format, but some cannot. The default behavior of mcpp is such that, in a compiler-specific-build for a compiler system which cannot use the C source format, mcpp outputs the line number information in a format the compiler-proper can accept.
With this option specified, however, even a compiler-specific-build for a compiler system that does not accept the C source format outputs the line number information in that format. This option is used with '#pragma MCPP preprocess' to pre-preprocess header files.
Output valid macro definitions in the form of #define lines at the end of preprocessing.
With the -dD option specified, the preprocessing result is output too. Predefined macros are not output.
With the -dM option specified, the preprocessing result is not output, and predefined macros are output except the Standard predefined ones. *3, *4
Define the macro __EXCEPTIONS to 1.
-fno-exceptions does not define this macro.
Same as -e <encoding>. Note that GCC converts the <encoding> to UTF-8 with this option, whereas mcpp does not convert any encoding.
Prohibit '$' in identifiers. (Allow it by default.)
Any of these options defines both of the macro __PIC__, __pic__ to 1.
Define the macro __SSP__ to 1.
Define the macro __SSP_ALL__ to 2.
Emit a special line as the second line of preprocessor's output to convey the current working directory.
Switch the treatment of the -I <directory> options before and after this option: the directories specified with -I before -I- are used to search only for headers in the #include "header.h" form, while the directories specified with -I after -I-, if any, are used to search for all #include directives. In addition, during the former search, the includer's directories are not used.
Include the <file> before processing the main source file. This is equivalent to writing #include <file> at the beginning of the main source file.
Add <dir> to the include path for #include "header.h" form.
Use <dir> as the logical root directory for system headers, that is, prefix <dir> to the path-list of system header directory.
For example, if the default include directory is /usr/include and <dir> is /Developer/SDKs/MacOSX10.4u.sdk, then alter the include directory to /Developer/SDKs/MacOSX10.4u.sdk/usr/include.
Add <dir> to the include path immediately before system-specific directories and immediately after site-specific directories.
Perform C preprocessing. The same as not specifying this option at all.
Predefine a macro __MMX__ to 1.
-mno-mmx undefines __MMX__.
Same as -I- for other compiler systems.
Same as -N.
If ? is a non-0 digit, define a macro __OPTIMIZE__ to 1.
Same as -W1. The result is the same with or without this option.
Same as -W16.
Same as -W17. (With -Wall, mcpp does not issue class 2 and 4 warnings, because these warnings are issued frequently and are annoying for Linux or some other systems' standard header files. Class 8 warnings are generally superfluous and bothersome, but are helpful for confirming portability. To get all of these classes, be sure to specify gcc -Wp,-W31.)
Same as -W0.
Define macro __STRICT_ANSI__ as 1.
Disable non-conforming predefined macros such as linux, i386.
Do not remove the comma preceding an absent variable argument of a GCC-spec variadic macro. *5
Recognize digraphs. Digraphs specification is also reversed by -2.
Same as -S1. Not only C90 but also C95 specifications are applied. The result is the same with or without this option.
Almost the same as -S1, except that these imply -ansi.
Same as -V199901L.
Same as -V199901L and also imply -ansi.
Perform C++ preprocessing. Same as -+.
Same as -+ and also implies -ansi.
Same as -W7 (i.e. -W1 -W2 -W4).
Specify a version of the Standard. For C, <n> is 9899; for C++, 14882. If <n> is 9899, <ym> is any of 1990, 199409, 1999 and 199901; if <n> is 14882, <ym> is 199711. If you enter a value other than these for <ym>, __STDC_VERSION__ or __cplusplus is set to that value; in this case, <ym> must be specified in six digits, like 200503.
These options imply -ansi.
On the other hand, -std=gnu* does not imply -ansi, and -pedantic does not imply -ansi either.
Same as -a for other compiler systems.
Specify lang-asm mode.
In GCC-specific-build, a macro __ASSEMBLER__ is defined to 1, and '$' in identifiers are prohibited.
When the main source file is named *.S, lang-asm mode is implicitly specified without this option.
Recognize trigraphs. Trigraphs specification is also reversed by -3.
Same as -@old.
Alter the include directory from /usr/include to /usr/include/mingw, and alter the predefined macros from those for cygwin1.dll to those for msvcrt.dll.
mcpp ignores this option. In GCC, this option is equivalent to writing #assert <predicate (answer)> in the source code. Standard C does not permit extension directives other than #pragma. Fortunately, so far, gcc by default passes an equivalent macro with the -D option, so there is no actual problem unless a source program uses #assert, which is a rare case.
2.7. Environment Variables
2.8. Multi-Byte Character Encodings
EUC-JP          Japanese extended UNIX code (UJIS)
shift-JIS       Japanese MS-Kanji
GB-2312         EUC-like Chinese encoding (Simplified Chinese)
Big-Five        Taiwanese encoding (Traditional Chinese)
KSC-5601        EUC-like Korean encoding (KSX 1001)
ISO-2022-JP1    International standard Japanese
UTF-8           A kind of Unicode encoding
EUC-JP          eucjp, euc, ujis
shift-JIS       sjis, shiftjis, mskanji
GB-2312         gb2312, cngb, euccn
BIG-FIVE        bigfive, big5, cnbig5, euctw
KSC-5601        ksc5601, ksx1001, wansung, euckr
ISO-2022-JP1    iso2022jp, iso2022jp1, jis
UTF-8           utf8, utf
Not specified   c, en*, latin*, iso8859*
shift-JIS       japanese, jpn
GB-2312         chinese-simplified, chs
BIG-FIVE        chinese-traditional, cht
KSC-5601        korean, kor
Not specified   C, english
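One of these names can also be given from within a source file by '#pragma __setlocale', described in 3.4. A minimal sketch (mine; I assume the aliases in the table above are accepted there just as with the option and the environment variable):
#pragma __setlocale( "utf8")    /* treat multi-byte characters in this source as UTF-8 */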
GCC info also says that, besides LANG, the environment variables LC_ALL and LC_CTYPE can be used to specify an encoding. In practice, however, whether you use LC_ALL or LC_CTYPE or neither makes a difference only in the diagnostic messages.
2.9. How to Use mcpp in One-Pass Compilers
2.10. How to Use mcpp in IDE
2.10.1. How to Make mcpp Available in Visual C++ IDE
Let me explain the appropriate values for these fields, taking as an example the compiler-independent-build of mcpp itself. (The name of the mcpp executable is assumed to be mcpp.exe.)
"Build command line": nmake
"Output": mcpp.exe
"Clean command": nmake clean
"Rebuild command line": nmake PREPROCESSED=1
To make the Visual C-specific-build of mcpp, add an option COMPILER=MSC as:
"Build command line": nmake COMPILER=MSC
"Output": mcpp.exe
"Clean command": nmake clean
"Rebuild command line": nmake COMPILER=MSC PREPROCESSED=1
Since a Makefile project does not provide a 'make install' equivalent command, you must write the makefile in such a way that the commands you specify in "Build command line" and "Rebuild command line" also perform installation. *4
If you do not compile mcpp, "Build command line" and "Rebuild command line" can be the same.
When completed, click "Finish".
devenv <Project File> /useenv
vcexpress <Project File> /useenv
2.10.2. How to Make mcpp Available in Mac OS X / Xcode.app
export PATH=/Developer/usr/bin:$PATH
${mcpp_dir}/configure --enable-replace-cpp
make
sudo make install
From screen top menu bar of Xcode.app, select "Project" > "Edit Project Settings".
The "project editor" window will appear.
Then, select "Build" pane of it, and edit "Other C flags" item.
The options should be specified following '-Wp,' and separated by commas, for example:
-Wp,-23,-W3
3. Enhancements and Compatibility
3.1. #pragma MCPP put_defines, #pragma MCPP preprocess and others
#pragma MCPP put_defines
#pragma MCPP preprocessed
3.1.1. Pre-preprocessing of Header File
#if PREPROCESSED
#include "mcpp.H"
#else
#include "system.H"
#include "internal.H"
#endif
mcpp > mcpp.H
#pragma MCPP preprocess
#include "system.H"
#include "internal.H"
#pragma MCPP put_defines
3.2. #pragma once
#pragma once
#ifndef __STDIO_H
#define __STDIO_H
/* Contents of stdio.h */
#endif
#ifndef __STDIO_H
#define __STDIO_H
#pragma once
/* Contents of stdio.h */
#endif
However, the path-list in the #line line is usually not normalized.
By default, the #line line is output as specified by the #include line, prepending the normalized include path, if any.
But, if the -K option is specified, it is normalized so as to be easily utilized by other tools.
3.2.1. Tool to Write #pragma once to Header Files
chmod -R u+w *
ins_once -t *.h */*.h */*/*.h
ins_once *.h */*.h */*/*.h
3.3. #pragma MCPP warning, #include_next, #warning
#include_next <header.h>
#pragma MCPP warning any message
#warning any message
3.4. #pragma MCPP push_macro, #pragma __setlocale and others
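A minimal sketch of the push_macro / pop_macro pair (mine, assuming the Visual C-compatible string-argument form):
#define PI 3.14159
#pragma MCPP push_macro( "PI")      /* save the current definition of PI */
#undef PI
#define PI 3.0                      /* temporary redefinition */
#pragma MCPP pop_macro( "PI")       /* PI is 3.14159 again */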
3.5. #pragma MCPP debug, #pragma MCPP end_debug, #debug, #end_debug
#pragma MCPP debug token expand
/* Coding you want to debug */
#pragma MCPP end_debug
path        Displays the include file search path.
token       Parses tokens one by one and displays their types.
expand      Traces the macro expansion process.
macro_call  Embeds macro notifications into comments on each macro definition and macro expansion.
if          Displays the result (true or false) of #if, #elif, #ifdef and #ifndef.
expression  Traces #if expression evaluation.
getc        Traces preprocessing byte by byte.
memory      Displays the status of heap memory used by mcpp.
3.5.1. #pragma MCPP debug path, #debug path
When a header file with #pragma once specified is #included again, the message to that effect is displayed.
Moreover, mcpp normalizes the path-list removing the redundant part such as "foo/../", and displays the result when the normalized path-list differs from the original one.
Also, mcpp dereferences a symbolic link to its linked file, and displays the result when such a conversion occurs.
3.5.2. #pragma MCPP debug token, #debug token
NAM     Identifier
NUM     Preprocessing-number
OPE     Operator or punctuator
STR     String literal
WSTR    Wide string literal
CHR     Character constant
WCHR    Wide character constant
SPE     Special pp-tokens, such as $ and @
SEP     Token separator (white space)
3.5.3. #pragma MCPP debug expand, #debug expand
expand_macro    Entrance routine for macro expansion
replace         Expands a macro one level down.
collect_args    Collects arguments.
prescan         Scans a replacement list and processes the # and ## operators.
substitute      Substitutes parameters with arguments.
rescan          Rescans a replacement list.
<n>         The n'th parameter
<TSEP>      Token delimiter inserted by mcpp
<MAGIC>     Code that inhibits re-replacement of the macro of the same name
<RT_END>    Code that indicates the end of a replacement list
<SRC>       Code that indicates an identifier taken from the source file while rescanning
<MACm>        Call of the m'th macro contained in one macro call
<MAC_END>     The end of the macro call started by the previous MACm
<MACm:ARGn>   The n'th argument of the m'th macro call
<ARG_END>     The end of the argument started by the previous MACm:ARGn
3.5.4. #pragma MCPP debug if, #debug if
3.5.5. #pragma MCPP debug expression, #debug expression
3.5.6. #pragma MCPP debug getc, #debug getc
3.5.7. #pragma MCPP debug memory, #debug memory
3.5.8. #pragma MCPP debug macro_call
In addition, some information is output on #undef, #if (#elif, #ifdef, #ifndef) and #endif, too.
This mode is specified also by the -K option.
3.5.8.1. Comment on #define
#define NULL 0L
#define BAR(x, y) x ## y
#define BAZ(a, b) a + b
/*mNULL 1:9-1:16*/
/*mBAR 2:9-2:25*/
/*mBAZ 3:9-3:24*/
3.5.8.2. Comment on #undef
#undef BAZ
/*undef 10*//*BAZ*/
3.5.8.3. Comment on Macro Expansion
foo(NULL);
foo(BAR(some_, var));
foo = BAZ(NULL, 2);
bar = BAZ(BAZ(a,b), c);
foo(/*<NULL 4:5-4:9*/0L/*>*/);
foo(/*<BAR 5:5-5:20*//*!BAR:0-0 5:9-5:14*//*!BAR:0-1 5:16-5:19*/some_var/*>*/);
foo = /*<BAZ 6:7-6:19*//*!BAZ:0-0 6:11-6:15*//*!BAZ:0-1 6:17-6:18*//*<BAZ:0-0*//*<NULL 6:11-6:15*/0L/*>*//*>*/ + /*<BAZ:0-1*/2/*>*//*>*/;
bar = /*<BAZ 7:7-7:23*//*!BAZ:0-0 7:11-7:19*//*!BAZ:0-1 7:21-7:22*//*<BAZ:0-0*//*<BAZ 7:11-7:19*//*!BAZ:1-0*//*!BAZ:1-1*//*<BAZ:1-0*/a/*>*/ + /*<BAZ:1-1*/b/*>*//*>*//*>*/ + /*<BAZ:0-1*/c/*>*//*>*/;
foo(/*<NULL 4:5-4:9*/0L/*NULL>*/);
foo(/*<BAR 5:5-5:20*//*!BAR:0-0 5:9-5:14*//*!BAR:0-1 5:16-5:19*/some_var/*BAR>*/);
foo = /*<BAZ 6:7-6:19*//*!BAZ:0-0 6:11-6:15*//*!BAZ:0-1 6:17-6:18*//*<BAZ:0-0*//*<NULL 6:11-6:15*/0L/*NULL>*//*BAZ:0-0>*/ + /*<BAZ:0-1*/2/*BAZ:0-1>*//*BAZ>*/;
bar = /*<BAZ 7:7-7:23*//*!BAZ:0-0 7:11-7:19*//*!BAZ:0-1 7:21-7:22*//*<BAZ:0-0*//*<BAZ 7:11-7:19*//*!BAZ:1-0*//*!BAZ:1-1*//*<BAZ:1-0*/a/*BAZ:1-0>*/ + /*<BAZ:1-1*/b/*BAZ:1-1>*//*BAZ>*//*BAZ:0-0>*/ + /*<BAZ:0-1*/c/*BAZ:0-1>*//*BAZ>*/;
3.5.8.4. Comment on #if (#elif, #ifdef, #ifndef)
#define NULL 0L
#define BAR(x, y) x ## y
#define BAZ(a, b) a + b
#include "bar.h"
#ifdef BAR
#ifndef BAZ
#if 1 + BAR( 2, 3)
#endif
#else
#if 1
#endif
#if BAZ( 1, BAR( 2, 3))
#undef BAZ
#endif
#endif
#endif
#line 1 "/dir/foo.c"
#line 1 "/dir/bar.h"
/*mNULL 1:9-1:16*/
/*mBAR 2:9-2:25*/
/*mBAZ 3:9-3:24*/
#line 2 "/dir/foo.c"
/*ifdef 2*//*BAR*//*i T*/
/*ifndef 3*//*BAZ*//*i F*/
/*else 6:T*/
/*if 7*//*i T*/
/*endif 8*/
/*if 9*//*BAZ*//*BAR*//*i T*/
/*undef 10*//*BAZ*/
#line 11 "/dir/foo.c"
/*endif 11*/
/*endif 12*/
/*endif 13*/
3.5.8.5. Comment on #else and #endif
3.5.8.6. #line Output
3.6. #assert, #asm, #endasm
#if ULONG_MAX/2 < LONG_MAX
#error Bad unsigned long handling.
#endif
#assert LONG_MAX <= ULONG_MAX/2
Preprocessing assertion failed
asm (
" leax _iob+13,y\n"
" pshs x\n"
);
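For comparison, the pre-Standard form which #asm and #endasm bracket would look as below (my reconstruction from the asm() example above, assuming the enclosed lines are passed through as assembler text):
#asm
        leax    _iob+13,y
        pshs    x
#endasm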
3.7. New C99 Features (_Pragma() operator, Variadic Macro and others)
A variable argument macro takes a form like:
#define debug(...) fprintf(stderr, __VA_ARGS__)
debug( "X = %d\n", x);
fprintf(stderr, "X = %d\n", x);
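The _Pragma() operator named in this section's title lets a pragma be written where a directive cannot appear, for example inside a macro replacement list. A minimal sketch (mine, using the MCPP pragma of 3.5 as the operand):
#define DEBUG_EXPAND _Pragma( "MCPP debug expand")
DEBUG_EXPAND    /* processed as if: #pragma MCPP debug expand */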
3.8. Particular specifications for certain compiler systems
3.8.1. Variadic macro of GCC and Visual C
#define EMPTY
#define VC_VA( format, ...) printf( format, __VA_ARGS__)
VC_VA( "var_args: %s %d\n", "Hello", 2005); /* printf( "var_args: %s %d\n", "Hello", 2005); */
VC_VA( "absence of var_args:\n"); /* printf( "absence of var_args:\n"); */
VC_VA( "empty var_args:\n", EMPTY); /* printf( "empty var_args:\n", ); */ /* trailing comma */
3.8.2. Handling of 'defined' by GCC
3.8.3. Asm Statement in Borland C and Other Special Syntaxes
asm {
mov x,4;
...;
}
3.8.4. #import and Others
3.8.4.1. #import of Mac OS X / GCC
3.8.4.2. #import and #using of Visual C
3.9. Problems of GCC and Compatibility with GCC
3.9.1. Preprocessing FreeBSD 2/Kernel Source
3.9.1.1. Multi-Line String Literal
asm("
asm code0
#ifdef PC98
asm code1
#else
asm code2
#endif
...
");
asm(
" asm code0\n"
#ifdef PC98
" asm code1\n"
#else
" asm code2\n"
#endif
" ...\n"
);
3.9.1.2. #else junk, #endif junk
#endif MACRO
#endif /* MACRO */
3.9.1.3. #ifdef 0
#ifdef 0
#if 0
3.9.1.4. Duplicate Definition of Macro
#define DEBUG
#undef DEBUG
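For reference, a small sketch (mine) of what Standard C permits here: re-defining a macro without an intervening #undef is allowed only when the new definition is token-for-token identical.
#define DEBUG 1
#define DEBUG 1     /* OK: benign re-definition, identical to the first */
#undef DEBUG
#define DEBUG 2     /* OK: #undef-ed before being re-defined differently */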
3.9.1.5. #warning
3.9.1.6. Variable Argument Macros
#define MCD_TRACE(fmt, a...) \
{ \
if (mcd_data[unit].debug) { \
printf("mcd%d: status=0x%02x: ", \
unit, mcd_data[unit].status); \
printf(fmt, ## a); \
} \
}
# define ext2_debug(fmt, a...) { \
printf("EXT2-fs DEBUG (%s, %d): %s:", \
__FILE__, __LINE__, __FUNCTION__); \
printf(fmt, ## a); \
}
#define MCD_TRACE( ...) \
{ \
if (mcd_data[unit].debug) { \
printf("mcd%d: status=0x%02x: ", \
unit, mcd_data[unit].status); \
printf( __VA_ARGS__); \
} \
}
# define ext2_debug( ...) { \
printf("EXT2-fs DEBUG (%s, %d): %s:", \
__FILE__, __LINE__, __FUNCTION__); \
printf( __VA_ARGS__); \
}
#define MCD_TRACE(fmt, ...) \
{ \
if (mcd_data[unit].debug) { \
printf("mcd%d: status=0x%02x: ", \
unit, mcd_data[unit].status); \
printf(fmt, __VA_ARGS__); \
} \
}
# define ext2_debug(fmt, ...) { \
printf("EXT2-fs DEBUG (%s, %d): %s:", \
__FILE__, __LINE__, __FUNCTION__); \
printf(fmt, __VA_ARGS__); \
}
3.9.1.7. Empty Argument in Macro Call
LIST_HEAD(, arg2)
TAILQ_HEAD(, arg2)
CIRCLEQ_HEAD(, arg2)
SLIST_HEAD(, arg2)
STAILQ_HEAD(, arg2)
#define EMPTY
LIST_HEAD(EMPTY, arg2)
TAILQ_HEAD(EMPTY, arg2)
CIRCLEQ_HEAD(EMPTY, arg2)
SLIST_HEAD(EMPTY, arg2)
STAILQ_HEAD(EMPTY, arg2)
SYSCTL_NODE(, arg2, arg3, arg4, arg5, arg6, arg7, arg8, arg9)
#define SYSCTL_NODE(parent, nbr, name, access, handler, descr) \
extern struct linker_set sysctl_##parent##_##name; \
SYSCTL_OID(parent, nbr, name, CTLTYPE_NODE|access, \
(void*)&sysctl_##parent##_##name, 0, handler, "N", descr); \
TEXT_SET(sysctl_##parent##_##name, sysctl__##parent##_##name);
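To see what the empty first argument does to the token pasting above, here is a minimal sketch (mine; empty macro arguments are officially permitted in C99):
#define PASTE3( a, b, c)    a ## b ## c
PASTE3( sysctl_, , kern)    /* expands to the identifier: sysctl_kern */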
3.9.1.8. Object-Like Macros Replaced with Function-like Macro Name
#define __byte_swap_long(x) (replacement text)
#define NTOHL(x) (x) = ntohl ((u_long)x)
#define ntohl __byte_swap_long
#define ntohl(x) __byte_swap_long(x)
#define INB inb
#define INW inb
#define INB(x) inb(x)
#define INW(x) inb(x)
3.9.1.9. Preprocessing .S File
asm(
" asm code0\n"
#ifdef Machine_A
" asm code1\n"
#else
" asm code2\n"
#endif
" ...\n"
);
3.9.2. Preprocessing FreeBSD 2/libc Source
#endif;
3.9.3. Problems Concerning GCC 2/cpp
3.9.4. Preprocessing Linux/glibc 2.1
3.9.4.1. Multi-Line String Literal
#define MACRO asm("
instr 0
instr 1
instr 2
")
#define MACRO asm("\n instr 0\n instr 1\n instr 2\n")
3.9.4.2. #include_next, #warning
3.9.4.3. Variable Argument Macros
3.9.4.4. Empty Argument During Macro Calls
3.9.4.5. Object-Like Macros Replaced with Function-like Macro Name
3.9.4.6. Macros Expanded to 'defined'
#define HAVE_MREMAP defined(__linux__) && !defined(__arm__)
#if HAVE_MREMAP
#if defined(__linux__) && !defined(__arm__)
defined(__linux__) && !defined(__arm__) (1)
defined(1) && !defined(__arm__)
#if defined(__linux__) && !defined(__arm__)
#define HAVE_MREMAP 1
#endif
defined(__linux__) && !defined(__arm__)
3.9.4.7. Preprocessing .S File
#APP
#NO_APP
3.9.4.8. Problems of rpcgen and -dM Option
#123 "filename"
#line 123
#line 123 "filename"
3.9.4.9. -include, -isystem and -I- Options
3.9.4.10. Undocumented Predefined Macros
__VERSION__, __SIZE_TYPE__, __PTRDIFF_TYPE__, __WCHAR_TYPE__
3.9.4.11. Undocumented Environment Variables
SUNPRO_DEPENDENCIES='$(@:.h=.d)-t $@' \
$(CC) -E -x c $(sysinclude) $< -D_LIBC -dM | \
... \
etc.
3.9.4.12. Other Problems
3.9.5. To Use mcpp with GCC 2
#!/bin/sh
/usr/lib/gcc-lib/i386-redhat-linux/egcs-2.91.66/mcpp -Q -lang-asm "$@"
The -Q option is optional; however, I recommend using it, because a large amount of diagnostic messages may be issued and -Q records them into the file mcpp.err (see below).
chmod a+x mcpp.sh
mv cpp cpp_gnuc
ln -sf mcpp.sh cpp
With these commands, when gcc calls cpp, the mcpp.sh linked to cpp is executed, and mcpp.sh calls mcpp with the above options placed before the ones passed by gcc.
ln -sf cpp_gnuc cpp
3.9.5.1. To Sort mcpp's Warnings
grep 'fatal:' `find . -name mcpp.err`
grep 'error:' `find . -name mcpp.err`
grep 'warning:' `find . -name mcpp.err` | sort -k3 -u > mcpp-warnings-sorted
grep 'warning:' `find . -name mcpp.err` | sort -k3 | uniq > mcpp-warnings-all
grep 'warning: Replacement' `find . -name mcpp.err` | sort -k3 | uniq | less
After you get an overall idea of what source lines are causing what kinds of errors or warnings, you can see a particular mcpp.err by "less" and then, if necessary, see the source file in question.
In addition, you can sandwich the source code in question with '#pragma MCPP debug expand' and '#pragma MCPP end_debug' and preprocess it again to see the output, in which case I recommend you to invoke mcpp in the following manner so that preprocessing results and diagnostic messages are output to the same file:
mcpp <-opts> in-file.c > in-file.i 2>&1
When you use "make", you must temporarily change the above shell-script.
3.9.6. Preprocessing GCC 3.2 Source
OS              make        library       CPU
VineLinux 2.5   GNU make    glibc 2.2.4   Celeron/1060MHz
VineLinux 2.5   GNU make    glibc 2.2.4   K6/200MHz, AthlonXP/2.0GHz
FreeBSD 4.7R    UCB make    libc.so.4     K6/200MHz, AthlonXP/2.0GHz
3.9.6.1. Multi-Line String Literal
3.9.6.2. #include_next and #warning
3.9.6.3. Variable Argument Macros
#define eprintf( fmt, ...) fprintf( stderr, fmt, ##__VA_ARGS__)
#define eprintf( fmt, args...) fprintf( stderr, fmt, ##args)
eprintf( "success!\n") ==> fprintf( stderr, "success!\n")
3.9.6.4. Empty Arguments in Macro Invocation
3.9.6.5. Object-Like Macros Replaced with Function-Like Macros
#define REGEX_ALLOCATE alloca
#define alloca( size) __builtin_alloca( size)
#define REGEX_ALLOCATE( size) alloca( size)
#define alloca __builtin_alloca
3.9.6.6. Macros Expanded to 'defined'
3.9.6.7. Preprocessing of .S Files
Wherever possible, you should use a preprocessor geared to the
language you are writing in. Modern versions of the GNU assembler have
macro facilities.
3.9.6.8. rpcgen and -dM Option
3.9.6.9. -include, -isystem and -I- Options
3.9.6.10. Undocumented Predefined Macros
3.9.6.11. Undocumented Environment Variables
3.9.6.12. Other Problems
#pragma GCC poison
#pragma GCC dependency
#pragma GCC system_header
3.9.7. To Use mcpp with GCC 3 or 4
#!/bin/sh
/usr/local/gcc-3.2/bin/gcc_proper -no-integrated-cpp "$@"
#!/bin/sh
/usr/local/gcc-3.2/bin/g++_proper -no-integrated-cpp "$@"
chmod a+x gcc.sh g++.sh
mv gcc gcc_proper
mv g++ g++_proper
ln -sf gcc.sh gcc
ln -sf g++.sh g++
3.9.7.1. To Use mcpp with GCC 3.3 and 3.4-4.1
3.9.8. Preprocessing Linux/glibc 2.4
The old-fashioned "multi-line string literal" has disappeared.
#include_next is found in the following source files. Its occurrences have increased compared with the version of six years before.
catgets/config.h, elf/tls-macros.h, include/bits/dlfcn.h, include/bits/ipc.h, include/fpu_control.h, include/limits.h, include/net/if.h, include/pthread.h, include/sys/sysctl.h, include/sys/sysinfo.h, include/tls.h, locale/programs/config.h, nptl/sysdeps/pthread/aio_misc.h, nptl/sysdeps/unix/sysv/linux/aio_misc.h, nptl/sysdeps/unix/sysv/linux/i386/clone.S, nptl/sysdeps/unix/sysv/linux/i386/vfork.S, nptl/sysdeps/unix/sysv/linux/sleep.c, sysdeps/unix/sysv/linux/ldsodefs.h, sysdeps/unix/sysv/linux/siglist.h
Though the following is not part of glibc itself but a testcase file used to test glibc by 'make check', #include_next is found in it, too.
sysdeps/i386/i686/tst-stack-align.h
#warning appears in sysvipc/sys/ipc.h. This directive is in a block to be skipped in normal processing, and does not cause any problem.
There are definitions of variable argument macros in the following files. All of them follow the old spec dating from GCC2; there is not a single macro of the C99 spec, nor even of the GCC3 spec.
elf/dl-lookup.c, elf/dl-version.c, include/libc-symbols.h, include/stdio.h, locale/loadlocale.c, locale/programs/ld-time.c, locale/programs/linereader.h, locale/programs/locale.c, locale/programs/locfile.h, nptl/sysdeps/pthread/setxid.h, nss/nss_files/files-XXX.c, nss/nss_files/files-hosts.c, sysdeps/generic/ldsodefs.h, sysdeps/i386/fpu/bits/mathinline.h, sysdeps/unix/sysdep.h, sysdeps/unix/sysv/linux/i386/sysdep.h
The following testcase files also have variadic macro definitions of GCC2 spec.
localedata/tst-ctype.c, posix/bug-glob2.c, posix/tst-gnuglob.c, stdio-common/bug13.c
Moreover, many of the calls of these macros lack an actual argument for the variable argument. As many as 142 files have such macro calls, and in 120 of them the replacement list has a ", ##" sequence immediately preceding the variable argument, hence removal of the ',' occurs.
As a variable argument macro specification, the C99 one is portable and recommendable. However, it is not so easy to rewrite GCC-spec macros into C99 ones. Neither the GCC2-spec nor the GCC3-spec variadic macro necessarily corresponds to the C99 spec one-to-one, because the GCC specs remove the preceding comma when the variable argument is absent. If you rewrite a GCC-spec macro definition into a C99 one, you also need to rewrite the macro calls that lack the variable argument and supply an argument.
In glibc 2.1.3, GCC2-spec macros were not so many, and it was not heavy work for a user to rewrite them with an editor. In glibc 2.4, however, such macro definitions have increased, and their calls especially have increased vastly. As a consequence, it is now practically impossible for a user to rewrite them.
To cope with this situation, mcpp V.2.6.3 onward implemented the GCC3-spec variadic macro in the GCC-specific-build only. Furthermore, mcpp V.2.7 implemented the GCC2-spec one, too. However, you should not write GCC2-spec macros in your sources, because that spec deviates too far from the token-based principle. Since the GCC2 spec corresponds to the GCC3 spec one-to-one, it is easy to rewrite a macro definition into the GCC3 spec, and calls of that macro need not be rewritten. Macros already written in the GCC2 spec become a little clearer if rewritten this way. *1
To rewrite a GCC2-spec variadic macro into a GCC3-spec one, for example, change:
#define libc_hidden_proto(name, attrs...) hidden_proto (name, ##attrs)
to:
#define libc_hidden_proto(name, ...) hidden_proto (name, ## __VA_ARGS__)
That is, change the parameter attrs... to ..., and change attrs in the replacement-list to __VA_ARGS__.
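For instance, a call of this macro (a hypothetical invocation of mine, not quoted from glibc) reads the same under either definition, even when the variable argument is absent:
libc_hidden_proto (fputs)   /* the ", ##" removes the comma: hidden_proto (fputs) */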
Note:
*1 As for variadic macro of GCC2 spec and GCC3 spec, see 3.9.1.6, 3.9.6.3 respectively.
Macro calls with an empty argument are found in as many as 488 source files. They have greatly increased since the old version. C99's approval of empty macro arguments may have influenced this tendency.
In particular, math/bits/mathcalls.h has as many as 79 macro calls with an empty argument. That is the same as in the old version.
The following files have object-like macro definitions replaced to function-like macro names:
argp/argp-fmtstream.h, hesiod/nss_hesiod/hesiod-proto.c, intl/plural.c, libio/iopopen.c, nis/nss_nis/nis-hosts.c, nss/nss_files/files-hosts.c, nss/nss_files/files-network.c, nss/nss_files/files-proto.c, nss/nss_files/files-rpc.c, nss/nss_files/files-service.c, resolv/arpa/nameser_compat.h, stdlib/gmp-impl.h, string/strcoll_l.c, sysdeps/unix/sysv/linux/clock_getres.c, sysdeps/unix/sysv/linux/clock_gettime.c
elf/link.h has function-like macro definitions replaced to function-like macro names. For example:
#define ELFW(type)          _ElfW (ELF, __ELF_NATIVE_CLASS, type)   /* sysdeps/generic/ldsodefs.h:46 */
#define _ElfW(e,w,t)        _ElfW_1 (e, w, _##t)                    /* elf/link.h:32 */
#define _ElfW_1(e,w,t)      e##w##t                                 /* elf/link.h:33 */
#define __ELF_NATIVE_CLASS  __WORDSIZE                              /* bits/elfclass.h:11 */
#define __WORDSIZE          32                          /* sysdeps/wordsize-32/bits/wordsize.h:19 */
#define ELF32_ST_TYPE(val)  ((val) & 0xf)                           /* elf/elf.h:429 */
With the above macro definitions, the following line appears:
&& ELFW(ST_TYPE) (sym->st_info) != STT_TLS /* elf/do-lookup.h:81 */
In this macro call, ELFW(ST_TYPE) is expanded through the following steps:
ELFW(ST_TYPE)
_ElfW(ELF, __ELF_NATIVE_CLASS, ST_TYPE)
_ElfW_1(ELF, 32, _ST_TYPE)
ELF32_ST_TYPE
Then, ELF32_ST_TYPE with the subsequent sequence (sym->st_info) is expanded to ((sym->st_info) & 0xf). That is to say, the function-like macro call _ElfW_1(ELF, 32, _ST_TYPE) is expanded to the name of another function-like macro, ELF32_ST_TYPE.
These macros become clearer if the first 3 of the above 6 definitions are written as:
#define ELFW( type, val)        _ElfW( ELF, __ELF_NATIVE_CLASS, type, val)
#define _ElfW( e, w, t, val)    _ElfW_1( e, w, _##t, val)
#define _ElfW_1( e, w, t, val)  e##w##t( val)
and if they are used as:
&& ELFW(ST_TYPE, sym->st_info) != STT_TLS
Although these arguments may seem a little redundant, they are more natural than the original ones if we think of function-call syntax.
The following files contain macro definitions whose replacement-lists have the 'defined' token. *1
iconv/skeleton.c, sysdeps/generic/_G_config.h, sysdeps/gnu/_G_config.h, sysdeps/i386/dl-machine.h, sysdeps/i386/i686/memset.S, sysdeps/mach/hurd/_G_config.h, sysdeps/posix/sysconf.c
Those macros are used in some #if lines of the following files, and also in some of the above files themselves.
elf/dl-conflict.c, elf/dl-runtime.c, elf/dynamic-link.h
In glibc 2.1.3, malloc/malloc.c had a macro definition of HAVE_MREMAP whose replacement-list contained the 'defined' token. In glibc 2.4, that macro definition has been revised to a portable one; nevertheless, unportable macros of the same sort have increased in other source files.
In a #if expression, the result of expanding a macro whose replacement-list has the 'defined' token is undefined according to the Standards, and it is mere self-satisfaction of GCC to preprocess the expression plausibly and arbitrarily. In order to make these sources portable among other preprocessors, at least the definitions of these macros should be rewritten, and in some cases the calls of the macros should be rewritten, too. *2
In most cases, simple rewriting as seen in 3.9.4.6 is sufficient. In some cases, however, this method does not work: those are the cases where the evaluation result of 'defined MACRO' differs depending on its timing. For example, sysdeps/i386/dl-machine.h has the following macro definition, which is used in some #if expressions in other files.
#define ELF_MACHINE_NO_RELA defined RTLD_BOOTSTRAP
Rewriting the definition as follows will not do.
#if defined RTLD_BOOTSTRAP
#define ELF_MACHINE_NO_RELA 1
#endif
The macro RTLD_BOOTSTRAP is defined in elf/rtld.c, if and only if that file is included before dl-machine.h. In other words, the evaluation result of 'defined RTLD_BOOTSTRAP' depends on the order of inclusion of the two files. In order to rewrite these sources portably, the macro ELF_MACHINE_NO_RELA should be abandoned, since it is a useless macro found only in #if lines, and the #if line:
#if ELF_MACHINE_NO_RELA
should be rewritten as:
#if defined RTLD_BOOTSTRAP
In glibc, this portable style of #if line is found in many places; at the same time, the undefined style of the above example is also found in some places.
Note:
*1 On Linux, /usr/include/_G_config.h is the header file installed from glibc's sysdeps/gnu/_G_config.h; therefore it has the same sort of macro definition:
#define _G_HAVE_ST_BLKSIZE defined (_STATBUF_ST_BLKSIZE)
This should be rewritten to:
#if defined (_STATBUF_ST_BLKSIZE)
#define _G_HAVE_ST_BLKSIZE 1
#endif
*2 mcpp V.2.7 and later in STD mode of the GCC-specific-build handles a 'defined' token generated by macro expansion in a #if line like GCC does. Yet, such bug-to-bug compatibility should not be depended on.
*.S files are provided for each CPU type, so their number is very large and amounts to more than 1000. The files for any one CPU type, such as x86, are only a portion of them.
A *.S file is an assembler source with C preprocessing directives such as #if or #include, C comments and C macros inserted. Since an assembler source does not consist of C token sequences, preprocessing it with a C preprocessor carries some risks. To process an assembler source, the preprocessor must pass through such characters as % or $ (which are not used in C except in string literals or character constants) as they are, and retain the existence or absence of spaces as it is. Furthermore, the preprocessor must relax syntax checking so as to pass sequences which would be errors in C source. On the other hand, it must process #if lines and macros like C, and must do some sort of error checking, too. What a nuisance! These specifications have no logical basis at all; they are GCC's local and mostly undocumented behaviors, no more.
To illustrate the problems, let me take an example of the following fragment from nptl/sysdeps/unix/sysv/linux/i386/i486/pthread_cond_wait.S.
        .byte 8                 # Return address register
                                # column.
#ifdef SHARED
        .uleb128 7              # Augmentation value length.
        .byte 0x9b              # Personality: DW_EH_PE_pcrel
                                # + DW_EH_PE_sdata4
'#ifdef SHARED' is intended to be a C directive.
On the other hand, the latter part of each line, starting with #, is supposed to be a comment.
'# column.' is, however, syntactically indistinguishable from an invalid directive, since its # is the first non-white-space character of the line.
'# + DW_EH_PE_sdata4' even causes a syntax error in C.
Another file has the following line, in which a single quote appears unpaired.
In C, a pair of single quotes is used to quote a character constant, and an unmatched single quote causes a tokenization error.
movl 12(%esp), %eax # that `fixup' takes its parameters in regs.
The above pthread_cond_wait.S also has the following line which is a macro call.
versioned_symbol (libpthread, __pthread_cond_wait, pthread_cond_wait, GLIBC_2_3_2)
The macros are defined as:
# define versioned_symbol(lib, local, symbol, version) \
    versioned_symbol_1 (local, symbol, VERSION_##lib##_##version)   /* include/shlib-compat.h:65 */
# define versioned_symbol_1(local, symbol, name) \
    default_symbol_version (local, symbol, name)                    /* include/shlib-compat.h:67 */
# define default_symbol_version(real, name, version) \
    _default_symbol_version(real, name, version)                    /* include/libc-symbols.h:398 */
# define _default_symbol_version(real, name, version) \
    .symver real, name##@##@##version                               /* include/libc-symbols.h:411 */
#define VERSION_libpthread_GLIBC_2_3_2 GLIBC_2.3.2          /* Created by make: abi-versions.h:145 */
The line is expected to be expanded as:
.symver __pthread_cond_wait, pthread_cond_wait@@GLIBC_2.3.2
The problem is the definition of _default_symbol_version. There is no C token containing '@' (except string-literals and character-constants). Though pthread_cond_wait@@GLIBC_2.3.2 is a sequence generated by concatenating parts with the ## operator, it is not a C token. The concatenation generates illegal tokens in the midst of its processing, too. The macro uses the ## operator of C, nevertheless its syntax is far from C.
In order to do a sort of preprocessing on an assembler source, essentially an assembler macro processor should be used.
To embed assembler code in C, it is recommended to use the asm() or __asm__() construct whenever possible, putting the assembler code into a string literal, and to name the file *.c, not *.S.
libc-symbols.h has another version of the above macro as follows, which is used for *.c files.
This macro can be processed by a Standard-conforming C preprocessor without problems.
# define _default_symbol_version(real, name, version) \
    __asm__ (".symver " #real "," #name "@@" #version)
glibc also has many *.c and *.h files which use asm() or __asm__(). Nevertheless, it has far more *.S files.
If you process an assembler source with a C preprocessor anyway, you should at least use /* */ or // as the comment notation instead of #. In fact, many sources of glibc use /* */ or //, whereas some sources use #.
Having said so, mcpp V.2.6.3 onward largely relaxed grammar checking in lang-asm mode to process these unusual sources, considering that glibc 2.4 has too many *.S files and that sources outside the C grammar have increased since 2.1.3.
The problem of stdlib/isomac.c, to which I referred in 3.9.4.8, is the same in glibc 2.4.
Also the problem of rpcgen is unchanged.
In addition, glibc 2.4 has the file scripts/versions.awk, which presupposes GCC's peculiar behavior concerning the number of spaces at the top of lines in preprocessed output. In order to use mcpp or other preprocessors, this file should be revised as follows.
$ diff -c versions.awk*
*** versions.awk        2006-12-13 00:59:56.000000000 +0900
--- versions.awk.orig   2005-03-23 10:46:29.000000000 +0900
***************
*** 50,56 ****
  }
  # This matches the beginning of a new version for the current library.
! /^ *[A-Z]/ {
        if (renamed[actlib "::" $1])
                actver = renamed[actlib "::" $1];
        else if (!versions[actlib "::" $1] && $1 != "GLIBC_PRIVATE") {
--- 50,56 ----
  }
  # This matches the beginning of a new version for the current library.
! /^  [A-Za-z_]/ {
        if (renamed[actlib "::" $1])
                actver = renamed[actlib "::" $1];
        else if (!versions[actlib "::" $1] && $1 != "GLIBC_PRIVATE") {
***************
*** 65,71 ****
  # This matches lines with names to be added to the current version in the
  # current library.  This is the only place where we print something to
  # the intermediate file.
! /^ *[a-z_]/ {
        sortver=actver
        # Ensure GLIBC_ versions come always first
        sub(/^GLIBC_/," GLIBC_",sortver)
--- 65,71 ----
  # This matches lines with names to be added to the current version in the
  # current library.  This is the only place where we print something to
  # the intermediate file.
! /^   / {
        sortver=actver
        # Ensure GLIBC_ versions come always first
        sub(/^GLIBC_/," GLIBC_",sortver)
-isystem and -I- options are no longer used.
On the other hand, the -include option is used extremely frequently. The header file include/libc-symbols.h is included by this option as many as 7000 times. -include is an option to push a #include line out of the source into the makefile. It makes the source incomplete, and is not recommendable.
This is not a problem of glibc but of GCC. While a few important predefined macros were undocumented in GCC 2, they became documented in GCC 3. On the other hand, GCC 3.3 and later predefine many macros, and most of them are undocumented.
debug/tst-chk1.c has a queer part which is not processed as intended by preprocessors other than GCC, unless revised as follows.
$ diff -cw tst-chk1.c*
*** tst-chk1.c  2007-01-11 00:31:45.000000000 +0900
--- tst-chk1.c.orig     2005-08-23 00:12:34.000000000 +0900
***************
*** 113,119 ****
  static int
  do_test (void)
  {
-   int arg;
    struct sigaction sa;
    sa.sa_handler = handler;
    sa.sa_flags = 0;
--- 113,118 ----
***************
*** 135,146 ****
    struct A { char buf1[9]; char buf2[1]; } a;
    struct wA { wchar_t buf1[9]; wchar_t buf2[1]; } wa;
  #ifdef __USE_FORTIFY_LEVEL
!   arg = (int) __USE_FORTIFY_LEVEL;
  #else
!   arg = 0;
  #endif
!   printf ("Test checking routines at fortify level %d\n", arg);

    /* These ops can be done without runtime checking of object size.  */
    memcpy (buf, "abcdefghij", 10);
--- 134,146 ----
    struct A { char buf1[9]; char buf2[1]; } a;
    struct wA { wchar_t buf1[9]; wchar_t buf2[1]; } wa;
+   printf ("Test checking routines at fortify level %d\n",
  #ifdef __USE_FORTIFY_LEVEL
!           (int) __USE_FORTIFY_LEVEL
  #else
!           0
  #endif
!           );

    /* These ops can be done without runtime checking of object size.  */
    memcpy (buf, "abcdefghij", 10);
Contrary to its innocent looks, the original source defines printf() as a macro, and as a consequence #ifdef and the other directive-like lines are usually eaten as part of the arguments of the macro call. According to the Standards, the result is undefined when there is a line within the arguments of a macro which would otherwise act as a directive. Since directive processing and macro expansion should be done in the same translation phase, it is an arbitrariness of GCC to process the directives first. In the first place, processing of the '#ifdef __USE_FORTIFY_LEVEL' line itself involves macro processing; therefore it is extremely arbitrary to process this line and the other directive-like lines first and then expand the printf() macro. C preprocessing should be done sequentially from the top.
The configure script of glibc also has a portion that relies on GCC's peculiar help message: the script searches the compiler's help message for the "-z relro" option. If you use mcpp as a preprocessor, this portion does not yield the expected result. In spite of this problem, fortunately, compilation and testing of glibc complete normally.
By the way, while GCC up to 3.2 appended many useless -A options by default on its invocation, GCC 3.3 onward ceased to do it.
Most of the portability problems I had found in glibc 2.1.3 have not been cleared in glibc 2.4, the version six years newer. On the contrary, the number of sources lacking portability has increased.
There have been a few improvements, such as the disappearance of multi-line string literals and of the -isystem and -I- options, and of the -A options on the GCC side.
Meanwhile, sources with unportable features have largely increased: #include_next, GCC2-spec variadic macros and calls of them without the variable argument, macro definitions with a 'defined' token in the replacement-list, *.S files and the -include option. Macro calls with an empty argument have also increased. Above all, it is most annoying that writings which do not correspond to Standard C one-to-one, and hence cannot easily be converted to portable ones, have increased.
All of these are problems of dependency on GCC's local specifications and undocumented behaviors. In large-scale software such as glibc, once such unportable sources are created, it becomes difficult to revise them, because many source files are interrelated. As a consequence, the same writings tend to be inherited for years, and even new sources are written so as to suit the old interfaces. For example, the fact that only GCC2-spec variadic macros are used, and neither the C99 spec nor the GCC3 spec at all, shows this relationship directly. Besides, even when some unportable parts in a few sources are revised, the old unportable codings often newly appear in other sources at the same time. The old-style writings are not easily cleared away.
On the other hand, a change of GCC's behavior breaks many sources, and the possible influence grows greater with time; therefore it becomes difficult for GCC to change its behavior. I think that both GCC and glibc need to tidy up their old local specifications and old interfaces drastically in the near future.
On Linux, the system compiler is GCC, and the standard library is glibc. In these circumstances, there are some system headers which presuppose only GCC. Those are obstacles to using compiling tools other than GCC, such as the compiler-independent-build of mcpp. For example, stddef.h and some other Standard header files are located only in GCC's version-specific include directory, and are not found in /usr/include. These are rude deficiencies of the system header structure, and mcpp needs some workarounds for them.
On Linux, GCC installs a version-specific include directory such as /usr/lib/gcc-lib/SYSTEM/VERSION/include, where the Standard headers stddef.h, limits.h and some others are located. These headers, and GCC's behavior on them, are queer. The problems on CygWIN are the same as on Linux. Mac OS X also has a few problems with some Standard headers.
In the first place, on Linux, five of the Standard C header files, float.h, iso646.h, stdarg.h, stdbool.h and stddef.h, are located only in the GCC version-specific directory, not in /usr/include nor /usr/local/include. The system headers on Linux seem to more or less intend that compiler systems other than GCC use only /usr/include, while GCC uses its version-specific directory in addition to /usr/include. In fact, /usr/include lacks some Standard headers; that is the problem for non-GCC compilers and preprocessors.
If a non-GCC preprocessor also uses the GCC version-specific directory, then in limits.h of that directory the preprocessor encounters #include_next, which is a GCC-specific directive. If that is the case, why doesn't the preprocessor implement #include_next? Even then, limits.h causes a problem, because it is not cleanly written. What is worse, GCC V.3.3 or later practically predefines by itself the macros to be defined by limits.h, hence the header is useless for other preprocessors.
Besides, as for GCC itself, it shows queer behavior with #include_next in this header.
Although these problems are complicated to explain, I will describe them here, because they have been neglected for years for some reason.
Note that only the compiler-independent-build of mcpp suffers from this problem; the GCC-specific-build is not affected.
The include directories for GCC are typically set as:
/usr/local/include
/usr/lib/gcc-lib/SYSTEM/VERSION/include
/usr/include
These are searched from upper to lower. The second is the GCC-specific include directory. SYSTEM is i386-vine-linux, i386-redhat-linux or such, VERSION is 3.3.2, 3.4.3 or such. If you install another version of GCC into /usr/local, the /usr/lib/gcc-lib part above will become /usr/local/lib/gcc. In C++, some other directories are set with higher priority than /usr/local/include. For GCC V.3.* and 4.*, those are:
/usr/include/c++/VERSION
/usr/include/c++/VERSION/SYSTEM
/usr/include/c++/VERSION/backward
The names of these directories seem GCC-specific; nevertheless, no other C++ standard directories exist, so other preprocessors have no directories to use but these. For GCC 2.95, the include directory for C++ was:
/usr/include/g++-3
In addition, the directories specified by -I option or by environment variables are prepended to the list.
Let me take as an example limits.h in C on GCC V.3.3 or later, focusing on the definition of LONG_MAX, to keep the explanation below simple. There are two limits.h: one in /usr/include and another in the version-specific directory.
#include <limits.h>
By this line, GCC includes /usr/lib/gcc-lib/SYSTEM/VERSION/include/limits.h. This header file starts as:
#ifndef _GCC_LIMITS_H_
#define _GCC_LIMITS_H_
#ifndef _LIBC_LIMITS_H_
#include "syslimits.h"
#endif
Then, GCC includes /usr/lib/gcc-lib/SYSTEM/VERSION/include/syslimits.h which is a short file as:
#define _GCC_NEXT_LIMITS_H
#include_next <limits.h>
#undef _GCC_NEXT_LIMITS_H
Now, limits.h is included again. Which limits.h? Since this directive is #include_next, it would skip the /usr/lib/gcc-lib/SYSTEM/VERSION/include, and would search /usr/include. GCC's cpp.info says:
This directive works like `#include' except in searching for the specified file: it starts searching the list of header file directories _after_ the directory in which the current file was found.
In fact, however, GCC does not include /usr/include/limits.h, but somehow includes /usr/lib/gcc-lib/SYSTEM/VERSION/include/limits.h again.
This time _GCC_LIMITS_H_ has already been defined, so the block beginning with the line:
#ifndef _GCC_LIMITS_H_
is skipped, and the next block is evaluated:
#else
#ifdef _GCC_NEXT_LIMITS_H
#include_next <limits.h>
#endif
#endif
Again, this is just the same '#include_next <limits.h>' which was found in /usr/lib/gcc-lib/SYSTEM/VERSION/include/syslimits.h. Does GCC include /usr/lib/gcc-lib/SYSTEM/VERSION/include/limits.h, the current file, again as the previous time, and run into infinite recursion? No, it does not; it includes /usr/include/limits.h this time. The behavior of GCC is beyond my understanding.
In /usr/include/limits.h, <features.h> and some other headers are included. Also, /usr/include/limits.h has a block beginning with the line:
#if !defined __GNUC__ || __GNUC__ < 2
In this block, <bits/wordsize.h> is included, and the Standard-required macros are defined depending on whether the word size is 32 bits or 64 bits. For example, if the word size is 32 bits, LONG_MAX is defined as:
#define LONG_MAX 2147483647L
Of course, GCC skips this block. Then, reaching the end of this file, it returns to /usr/lib/gcc-lib/SYSTEM/VERSION/include/limits.h. Ending this second inclusion of the file, it returns to /usr/lib/gcc-lib/SYSTEM/VERSION/include/syslimits.h. Then this file ends, too, and GCC returns to the first inclusion of /usr/lib/gcc-lib/SYSTEM/VERSION/include/limits.h. In this file, after the part cited above, there are definitions of the Standard-required macros. For instance, LONG_MAX is defined as:
#undef LONG_MAX
#define LONG_MAX __LONG_MAX__
Then, the file ends.
#include <limits.h>
The processing of this line has now ended. After all, LONG_MAX is defined to __LONG_MAX__, and that is all. What is __LONG_MAX__? As a matter of fact, GCC V.3.3 or later predefines many macros, including __LONG_MAX__, which is predefined to 2147483647L on a 32-bit system. The situation of the other Standard-required macros is almost the same as LONG_MAX: they are defined using the predefined ones. If so, what is the purpose of these complicated header files and this #include_next handling at all?
The behavior of GCC V.2.95, V.3.2, V.3.4, V.4.0 and V.4.1 on #include_next is the same as V.3.3. That is to say:
#include_next <limits.h>
by this line in /usr/lib/gcc-lib/SYSTEM/VERSION/include/syslimits.h, GCC includes /usr/lib/gcc-lib/SYSTEM/VERSION/include/limits.h, and by the same line in this file:
#include_next <limits.h>
it includes /usr/include/limits.h. As a result, in processing the line:
#include <limits.h>
/usr/lib/gcc-lib/SYSTEM/VERSION/include/limits.h is included twice. This duplicate inclusion happens to produce the same result; nevertheless, it is redundant, and above all, the behavior differs from the specification and is not consistent. In addition, the following part of the file is redundant if the behavior accords with the specification.
#else
#ifdef _GCC_NEXT_LIMITS_H
#include_next <limits.h>
#endif
Now, what happens to a compiler or preprocessor other than GCC using the Linux standard headers? stddef.h and some other Standard headers are not found in /usr/include nor /usr/local/include. If so, how about using the GCC version-specific directory as well?
#include <limits.h>
By this line, the preprocessor includes /usr/lib/gcc-lib/SYSTEM/VERSION/include/limits.h, and from this file it includes /usr/lib/gcc-lib/SYSTEM/VERSION/include/syslimits.h, and in this file, it sees the line:
#include_next <limits.h>
Then, how about implementing #include_next? If #include_next is implemented as specified, by this line the preprocessor searches the "next" include directory, /usr/include, and includes /usr/include/limits.h. Then, this non-GCC preprocessor processes the block beginning with this line:
#if !defined __GNUC__ || __GNUC__ < 2
In this block it defines LONG_MAX as:
#define LONG_MAX 2147483647L
and defines the other macros appropriately, too. Then it ends this file and returns to /usr/lib/gcc-lib/SYSTEM/VERSION/include/syslimits.h. Then it ends this file and returns to /usr/lib/gcc-lib/SYSTEM/VERSION/include/limits.h. And it encounters these lines:
#undef LONG_MAX
#define LONG_MAX __LONG_MAX__
At the end of the long run, all the correct definitions are canceled, and they become the undefined name __LONG_MAX__ or such!
Up to GCC V.3.2, the corresponding part of the version-specific limits.h had lines like:
#define __LONG_MAX__ 2147483647L
Hence, the canceled macros were redefined correctly. Although most of the processing was useless, the results were correct. With the header files of V.3.3 or later, a non-GCC preprocessor is taken around here and there only to get vain results.
The problems are summarized as below: *1, *2, *3, *4
Under these problems lies the excessively complicated system header structure. The extension directive #include_next adds to the complication. The use of this directive is very limited: though GCC and glibc use it in compiling and installing themselves, it does not exist in the installed system headers except limits.h. And that rare example in limits.h causes GCC the above-mentioned confusion. This raises a question about the reason for its existence.
Anyway, the compiler-independent-build of mcpp needs the following workarounds for the present. In order to avoid confusion, the compiler-independent-build does not implement #include_next nor use GCC-specific include directories.
For the GCC-specific-build of mcpp, no special setting is required, because it has the GCC-specific include directory list, implements #include_next as specified, and predefines the macros as GCC does.
Note:
*1 I have checked the descriptions of this 3.9.9 section on Linux / GCC 2.95.3, 3.2, 3.3.2, 3.4.3, 4.0.2, 4.1.1, 4.3.0 and on CygWIN / GCC 2.95.3, 3.4.4. On CygWIN, the behavior on #include_next was as specified on GCC 2.95.3, but on 3.4.4 it changed to the same behavior as on Linux. The C++ include directory on CygWIN was /usr/include/g++-3 on 2.95.3, while on 3.4.4 it is /usr/lib/gcc/i686-pc-cygwin/3.4.4/include/c++ and its sub-directories.
*2 On FreeBSD 6.2 or 6.3 and its bundled GCC 3.4.6, all the Standard C headers are present in /usr/include, #include_next is not used in any system headers, and GCC specific C include directory does not exist. However, C++ include directories are GCC version dependent as /usr/include/c++/3.4, /usr/include/c++/3.4/backward.
Even on FreeBSD, an installation of another version of GCC creates a GCC-version-specific include directory. Most of the headers in that directory are redundant. However, the headers in /usr/include remain unchanged.
*3 On Mac OS X Leopard / Apple-GCC 4.0.1, as on Linux, there is a GCC-version-specific include directory, #include_next is used in limits.h and a few other headers, and two limits.h are found, too. However, the #include_next in syslimits.h has been deleted by Apple. float.h, iso646.h, stdarg.h, stdbool.h and stddef.h are all found in /usr/include, hence not so many special settings are necessary for mcpp. But float.h and stdarg.h are only for GCC and Metrowerks (for powerpc), so if you use them with mcpp, you must rewrite float.h yourself and make stdarg.h include the GCC-version-specific one. Note that some definitions in float.h differ between x86 and powerpc.
*4 On MinGW / GCC 3.4.*, though the include directories and their precedence differ from the other systems, the behavior of GCC on #include_next is the same, and also some Standard headers are not in the standard include directory /mingw/include but in its version-specific directory.
*5 float.h for an i386 system can be written as follows, referring to GCC's settings:
/* float.h */
#ifndef _FLOAT_H___
#define _FLOAT_H___
#define FLT_ROUNDS 1
#define FLT_RADIX 2
#define FLT_MANT_DIG 24
#define DBL_MANT_DIG 53
#define LDBL_MANT_DIG 64
#define FLT_DIG 6
#define DBL_DIG 15
#define LDBL_DIG 18
#define FLT_MIN_EXP (-125)
#define DBL_MIN_EXP (-1021)
#define LDBL_MIN_EXP (-16381)
#define FLT_MIN_10_EXP (-37)
#define DBL_MIN_10_EXP (-307)
#define LDBL_MIN_10_EXP (-4931)
#define FLT_MAX_EXP 128
#define DBL_MAX_EXP 1024
#define LDBL_MAX_EXP 16384
#define FLT_MAX_10_EXP 38
#define DBL_MAX_10_EXP 308
#define LDBL_MAX_10_EXP 4932
#define FLT_MAX 3.40282347e+38F
#define DBL_MAX 1.7976931348623157e+308
#define LDBL_MAX 1.18973149535723176502e+4932L
#define FLT_EPSILON 1.19209290e-7F
#define DBL_EPSILON 2.2204460492503131e-16
#define LDBL_EPSILON 1.08420217248550443401e-19L
#define FLT_MIN 1.17549435e-38F
#define DBL_MIN 2.2250738585072014e-308
#define LDBL_MIN 3.36210314311209350626e-4932L
#if defined (__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
#define FLT_EVAL_METHOD 2
#define DECIMAL_DIG 21
#endif /* C99 */
#endif /* _FLOAT_H___ */
On V.2.7, mcpp began to support Mac OS X / GCC. This section describes the problems of that system found through mcpp. The author, however, does not know the system very well yet; he has only compiled mcpp itself and firefox on it, and knows nothing about Objective-C or Objective-C++.
Since GCC is practically the only compiler on this system now, some dependencies on GCC-local specs are found in some of its system headers. Such dependencies are not as numerous as on Linux, maybe because its standard library is not glibc; but they are not as few as on FreeBSD. Some tidying up is desirable.
Another characteristic of this system is that the system compiler is a GCC largely modified and extended by Apple. In the system headers and some sources of Apple on Mac OS X, dependencies on Apple-GCC-local specs are more conspicuous than those on general-GCC-local specs. In particular, the extended specs to support both Intel Macs and PowerPC Macs on one machine are the most characteristic.
Here, we refer to the system of Mac OS X Leopard / Apple-GCC 4.0.1.
Occurrences of the GCC-local directive #include_next are not many, but they are found in float.h, stdarg.h and varargs.h in /usr/include/ and in the files of the same names in /Developer/SDKs/MacOSX10.*.sdk/usr/include/.
All of them serve to include different real header files depending on whether the compiler is GCC or Metrowerks.
When the compiler is GCC, stdarg.h, for example, does '#include_next <stdarg.h>'.
The limits.h in the GCC-version-specific include directory has #include_next as on Linux, but the one in syslimits.h has been removed, and it has been tidied up a bit.
Though this directive is used modestly, it is a problem that float.h and stdarg.h presuppose only GCC and Metrowerks.
They could be written more portably, as on FreeBSD. *1
In addition, #include_next for GCC in a header in /usr/include is nonsense, because the priority of that include directory is lower than the GCC-version-specific one.
Consequently this #include_next is never executed.
The other GCC-local directive, #warning, is sometimes found in objc/, wx-2.8/wx/ and a few other directories in /usr/include/, and in their corresponding directories in /Developer/SDKs/MacOSX*.sdk/usr/include/.
Most of these directives are warnings against obsolete or deprecated files or usages.
backward_warning.h in /usr/include/c++/VERSION/backward/ and its corresponding file in /Developer/SDKs/MacOSX*.sdk/ exist to execute #warning against these deprecated headers.
And all the headers in those directories include this header.
This is the same as on Linux and FreeBSD.
Note:
*1 About how to use these headers with the compiler-independent-build of mcpp, refer to 3.9.9.4 and its note 3.
/usr/include/sys/cdefs.h and its corresponding file of the same name in /Developer/SDKs/MacOSX*.sdk/ have a macro definition as:
#define __DARWIN_NO_LONG_LONG (defined(__STRICT_ANSI__) \
        && (__STDC_VERSION__-0 < 199901L) \
        && !defined(__GNUG__))
And it is used in stdlib.h and a few others as:
#if __DARWIN_NO_LONG_LONG
This macro should be defined as: *1
#if defined(__STRICT_ANSI__) \
        && (__STDC_VERSION__-0 < 199901L) \
        && !defined(__GNUG__)
#define __DARWIN_NO_LONG_LONG 1
#endif
Note:
*1 As for its reason, see 3.9.4.6 and 3.9.8.6.
gssapi.h, krb5.h, profile.h in /System/Library/Frameworks/Kerberos.framework/Headers have queer #endif lines like:
#endif \* __KERBEROS5__ */
This '\* __KERBEROS5__ */' seems intended to be a comment. I cannot understand why they had to invent such a writing. Though GCC usually warns about it, Apple-GCC does not issue any warning, whatever options such as -pedantic are specified. Apple-GCC does not warn in the following case, either. It still trails a sense of pre-C90.
#endif __KERBEROS5__
As far as the compilation of the firefox 3.0b3pre source is concerned, none of the following special usages of macros, which are frequently found in glibc sources and Linux system headers, are found in the Mac OS X system headers included from the firefox source.
Apple-GCC has some peculiar specifications different from the general GCC.
Specs to generate binaries for both Intel Macs and PowerPC Macs on either machine
Mac OS X has a pair of compilers, for x86 and for ppc. (One is a native compiler, and the other is a cross compiler.) This pair of Apple-GCCs has its own option, -arch. If you specify multiple CPUs, as in '-arch i386 -arch ppc', gcc is invoked repeatedly, binaries for the specified CPUs are generated, and a "universal binary" which bundles all the binaries is created. They also have another peculiar option, -mmacosx-version-min=. You can use this option along with the -isysroot or --sysroot option to widen the range of compatibility of the binary to older versions of Mac OS X to some extent. These specs are convenient for making a binary package for Mac OS X.
As for preprocessing, you should remember that some predefined macros differ depending on the CPU specified.
"framework" directories
Mac OS X has "framework" directories inherited from NeXTstep. Framework is a hierarchical directory that contains shared resources such as header files, library, documents, and some other resources. To include a header file in these directories, such a directive is used as:
#include <Kerberos/Kerberos.h>
This format is the same with:
#include <sys/stat.h>
However, these two have quite different meanings. While the latter includes the file sys/stat.h in some include directory (in this case /usr/include), <Kerberos/Kerberos.h> is not a path-list, and Kerberos is not even a directory name. It refers to the file Kerberos.framework/Headers/Kerberos.h in the framework directory /System/Library/Frameworks. And in fact, Kerberos.framework/Headers is a symbolic link to Kerberos.framework/Versions/Current/Headers. This is the simplest case of framework header file location; there are many other, far more complex cases.
Who invented such a complex system? It burdens a preprocessor, because the preprocessor must search for system headers in framework directories, repeatedly building and rebuilding path-lists. Some headers further include many other headers.
"header map" file
Xcode.app is an IDE on Mac OS X. It uses a "header map" file, which is a list of include files. One of the tools of Xcode checks source files, searches for the files to be included, and records the path-lists of the files in a file named *.hmap; Apple-GCC then refers to it instead of the include directories. This is an extended feature of Apple-GCC.
The header map file is a device to lessen the burden of header file searching. However, it is a binary file of a peculiar specification and lacks transparency. To lessen the heavy burden of framework header searching, reorganizing the framework system itself is more desirable.
Tokens in #endif line
As shown in the previous section, Apple-GCC does not issue even a warning, whatever junk is on a #endif line and whatever options are specified. It is quite an anachronism.
Non-ASCII characters in comments
This is not a problem of GCC, but of the system headers in the framework directories. In many headers, some non-ASCII characters are frequently found in comments, such as the copyright mark (0xA9) and other characters of ISO-8859-* (?). They are nuisances in a multi-byte character environment, even in comments. A little more consciousness of character encoding is desired. Though characters of this kind are sometimes found in /usr/include on Linux, too, they are far more often found in the framework headers of Mac OS X.
I compiled the source of firefox development version 3.0b3pre (January 2008), or 3.0-beta3-prerelease, with GCC, replacing its preprocessor with mcpp V.2.7, on Linux/x86 + GCC 4.1.2 and Mac OS X + GCC 4.0.1. mcpp was executed with the -Kv option, passing its output to cc1 (cc1plus). As a result, the compilations completed successfully, and the firefox binaries were generated. *1
The preprocessing portability of the firefox source is rather high on the whole. Dependencies on GCC-local specifications, such as those frequently found in glibc sources, are not found often. It is portable enough to officially support both GCC on Linux and Mac OS X and Visual C++ on Windows.
The preprocessing portability of a source is, however, not necessarily sufficient just because GCC and Visual C pass it. In the sections below, I will check some problems, sometimes comparing them with the glibc sources. I omit explanations of GCC's problems here to avoid duplication; for those, refer to 3.9.4 and 3.9.8, which also comment on the glibc sources. *2
Note:
*1 I checked out the sources from the CVS repository of mozilla.org. One of the motivations to compile the firefox source was to test the -K option of mcpp. This option was proposed by Taras Glek, who was working on refactoring of C/C++ sources at mozilla.com. So, I also used the firefox source to test the -K option and other behaviors of mcpp. For the -K (-Kv) option, refer to 2.4.
*2 There is a list of coding guidelines for firefox, as below.
But its content is too old.
portable-cpp
The following GCC-local specs, which are sometimes used in the glibc sources, are not used in the firefox sources. Though compiling firefox on Linux involves system headers, some of which contain such things as GCC2-spec variadic macros, those are not firefox sources themselves.
The following features are not used even in recent glibc, and not used in firefox at all.
However, many #include_next directives are found, though only in one directory: config/system_wrappers/, which is generated by configure. All of the 900 files generated in that directory are short header files of the same pattern. For example, stdio.h is:
#pragma GCC system_header
#pragma GCC visibility push(default)
#include_next <stdio.h>
#pragma GCC visibility pop
This is code to utilize the '#pragma GCC visibility *' directive implemented in GCC 4.*. At the same time, there is a file 'config/gcc_hidden.h' as below. This file is specified by the -include option for most of the translation units, and is read in at the start of each unit.
#pragma GCC visibility push(hidden)
The system_wrappers directory must be the include directory with the highest priority, so it should be specified as the first argument of the -I option; a sketch of such an invocation follows. In spite of this constraint, this usage of #include_next is simple and seems to have no problem.
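For illustration, a compiler invocation along these lines (the directory and header names are from the firefox tree; the source file name is a placeholder) satisfies the constraint:

gcc -I config/system_wrappers -include config/gcc_hidden.h -c foo.c

Here config/system_wrappers comes first among the -I directories, so that each wrapper header is found first and its #include_next falls through to the real system header.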
On the other hand, for many sources in the nsprpub directory, the '-fvisibility=hidden' option is used instead of '-include gcc_hidden.h', and the headers in system_wrappers are not used. This nsprpub directory apparently remains to be reorganized.
Many sources use C99 specifications without specifying C99. GCC uses the "gnu89" spec by default for *.c sources, which is a compromise spec of C90 plus some C99 specs and GCC-local specs. Some of the firefox sources use the following C99 specs implicitly, depending on GCC's default behavior.
Empty argument in macro call
Though empty arguments in macro calls are rare in firefox, these 3 files have them. The only macro actually called with an empty argument is the one named NS_ENSURE_TRUE.
layout/style/nsHTMLStyleSheet.cpp, layout/generic/nsObjectFrame.cpp, intl/uconv/src/nsGREResProperties.cpp
Also the following files in gfx/cairo/cairo/src/ have them. Here the only macro is slim_hidden_ulp2.
cairoint.h, cairo-font-face.c, cairo-font-options.c, cairo-ft-font.c, cairo-ft-private.h, cairo-image-surface.c, cairo-matrix.c, cairo-pattern.c, cairo-scaled-font.c, cairo-surface.c, cairo-xlib-surface.c, cairo.c
Though these empty macro arguments are used on Linux, they are not used on Mac OS X. In any case, they are not tricky ones.
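For illustration, an empty macro argument looks like this (the definition shown is a made-up sketch, not the actual firefox macro):

#define NS_ENSURE_TRUE(x, ret)  if (!(x)) return ret    /* illustrative definition */

void f(char *ptr)
{
    NS_ENSURE_TRUE(ptr, );      /* the second argument is empty:       */
}                               /* expands to  if (!(ptr)) return ;    */

An empty argument is valid in C99, but in C90 its behavior is undefined, which is why a Standard-conscious preprocessor flags it.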
Translation limits beyond C90
The length of identifiers, the nesting level of #include, the number of macro definitions and so forth often exceed the C90 translation limits.
Identifiers longer than 31 bytes are found especially frequently in the directory gfx/cairo/cairo/src/.
Nesting of #include deeper than 8 levels and more than 1024 macro definitions are often found, too.
These are almost inevitable on Linux and Mac OS X, since merely including some system headers often reaches these limits.
Using // comment in C source
Some C sources have this type of comment. The list of guidelines prohibits it. However, it causes few problems nowadays.
The above specifications are also available in Visual C 2005 and 2008. Since GCC has the -std=c99 option, we could use it to specify C99 explicitly. Visual C, however, has no option to specify a version of the Standard, so we cannot help using C99 specs implicitly. Therefore, the firefox sources cannot be blamed for using C99 specs implicitly, given the current state of the major compiler systems. *1
By the way, the firefox sources do not use variadic macros for some reason, in spite of using some other C99 specs implicitly. Visual C up to 2003 did not implement variadic macros; is that why firefox did not use the feature? The circumstances have changed since Visual C 2005 implemented it.
Note:
*1 For C++, GCC defaults to the "gnu++98" spec, which is explained as "C++98 plus GCC extensions". In fact, however, it has some C99 specs mixed in. Meanwhile, Visual C says that it is based on C90 for C and C++98 for C++. In fact, both C and C++ of Visual C have C99 features mixed in, as well as a few Visual C extensions, especially in Visual C 2005 and 2008. Both GCC and Visual C have such a mixture of Standard versions and their own extensions and modifications, which brings about some ambiguities. The absence of an option in Visual C to specify a version of the Standard is the most inconvenient problem.
An object-like macro whose replacement text is a function-like macro name is sometimes found in many other programs, and is found also in the firefox sources below, though not frequently.
content/base/src/nsTextFragment.h, modules/libimg/png/mozpngconf.h, modules/libjar/zipstub.h, modules/libpr0n/src/imgLoader.h, nsprpub/pr/include/obsolete/protypes.h, nsprpub/pr/include/private/primpl.h, nsprpub/pr/include/prtypes.h, parser/expat/lib/xmlparse.c, security/nss/lib/jar/jarver.c, security/nss/lib/util/secport.h, xpcom/glue/nsISupportsImpl.h
In addition, building firefox creates, in a directory of the development environment, many links to header files, which are copied into /usr/include/firefox-VERSION/ when you install the development environment for firefox. Some of these header files are symbolic links to the above files. mozilla-config.h, which is created by configure, has a macro definition of this kind, too.
These macros should be written as function-like macros to improve readability; see the example below. In fact, many other macros in the firefox sources are defined as function-like macros replaced by another function-like macro with the same arguments. There are coding style differences among the authors; it would be better to set a coding guideline on this matter.
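For example (the names are hypothetical, chosen only to show the pattern described above):

/* an object-like macro whose replacement text is a function-like macro name: */
#define strtoull        PR_strtoull
#undef  strtoull
/* a clearer alternative: a function-like macro forwarding the same arguments: */
#define strtoull(s, e, b)   PR_strtoull((s), (e), (b))

The second form makes it visible at the definition that the macro takes arguments.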
A macro with a 'defined' token in its replacement text, sometimes found in glibc, is found in firefox only once.
modules/oji/src/nsJVMConfigManagerUnix.cpp defines a macro as:
#define NS_COMPILER_GNUC3 defined(__GXX_ABI_VERSION) && \
                          (__GXX_ABI_VERSION >= 102) /* G++ V3 ABI */
and uses it in the same file as:
#if (NS_COMPILER_GNUC3)
This macro should be removed and the #if line should be rewritten as:
#if defined(__GXX_ABI_VERSION) && (__GXX_ABI_VERSION >= 102) /* G++ V3 ABI */
Maybe this file is meant to be compiled only by GCC; nevertheless, it is not good practice to depend on a preprocessor's wrong implementation.
Note:
*1 The GCC-specific-build of mcpp V.2.7 enabled GCC-like handling of 'defined' in a macro on a #if line. But mcpp warns at it, and you had better revise the code.
The following files in the jpeg directory have #endif lines with comments lacking comment marks. All of these lines appeared in some recent updates.
jmorecfg.h, jconfig.h, jdapimin.c, jdcolor.c, jdmaster.c
Though this style of writing was frequently seen in sources for UNIX-like systems until the middle of the 1990s, it has almost completely disappeared nowadays, and cannot be found even in the glibc sources. GCC usually warns at it, as expected. For all that, these sources take such a writing style. Only Apple-GCC does not warn at it. Have these sources been edited on Mac OS?
The assembler sources are written as *.s (*.asm) files, some of which contain macros; but in principle, they do not call for a preprocessor.
On Mac OS X / ppc, however, there is one exception. xpcom/reflect/xptcall/src/md/unix/xptcinvoke_asm_ppc_rhapsody.s calls for a preprocessor, because it has a #if block containing only one line. The block seems to be unnecessary by now.
Compilation of firefox begins with configure, which generates mozilla-config.h. In the compilation of most of the sources, this header file is specified by the -include option; config/gcc_hidden.h is specified similarly. Why don't the sources #include these headers at their top?
Some silent redefinitions of macros are found, though they are rare.
In the compilation of most of the sources, the -DZLIB_INTERNAL option is specified. In other words, the macro is defined as 1. It is, however, defined by some sources in modules/zlib/src/ as:
#define ZLIB_INTERNAL
That is, it is defined with no token in its replacement text. And it is used as:
# ifdef ZLIB_INTERNAL
Though the difference does not produce a different result in this case, different definitions of the same macro are not recommended. Maybe the option in the Makefile is redundant.
xpcom/build/nsXPCOMPrivate.h defines the macro MAXPATHLEN differently from /usr/include/sys/param.h. This discrepancy stems from an inconsistency among the related header files about whether or not to include /usr/include/sys/param.h. The related header files should be reorganized.
On Mac OS X, the assert macro, once defined in /usr/include/assert.h, is redefined in netwerk/dns/src/nsIDNKitInterface.h. An '#undef assert' should precede it.
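A minimal sketch of the safe pattern (the replacement definition and idn_assert are hypothetical, not the actual nsIDNKitInterface.h code):

#include <assert.h>

#undef  assert                          /* remove the <assert.h> definition first */
#define assert(expr)  idn_assert(expr)  /* hypothetical replacement               */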
On Mac OS X, in modules/libreg/src/VerReg.c, a queer redefinition of the macro VR_FILE_SEP occurs as:
#if defined(XP_MAC) || defined(XP_MACOSX)
#define VR_FILE_SEP ':'
#endif
#ifdef XP_UNIX
#define VR_FILE_SEP '/'
#endif
because, on Mac OS X, configure defines both XP_MACOSX and XP_UNIX. This redefinition may be an intended one; anyway, it is misleading. It would be better to write as below, clearly showing the priority of XP_UNIX.
#ifdef XP_UNIX
#define VR_FILE_SEP '/'
#elif defined(XP_MAC) || defined(XP_MACOSX)
#define VR_FILE_SEP ':'
#endif
The following files have overly long comments stretching over several hundred lines or more.
extensions/universalchardet/src/base/Big5Freq.tab, extensions/universalchardet/src/base/EUCKRFreq.tab, intl/unicharutil/src/ignorables_abjadpoints.x-ccmap, layout/generic/punct_marks.ccmap
Especially in the directories intl/uconv/ucv*/, there are many files with overly long comments. There is even a case of a single comment stretching over 8000 lines! All of these files are named *.uf or *.ut, and are mapping tables between Unicode and various Asian encodings, generated automatically by some tool. They do not seem to be C/C++ sources, but they are included from other C++ sources. Most parts of these files are comments, which seem to be a sort of document or table for some other tool.
It is not recommended to include long documents or tables in source files. They should be separated from source files, even if placed in the source tree.
Though these files are used on Linux, they are not used on Mac OS X. On the other hand, on Mac OS X, system headers in the framework directories are frequently used, and some of them are queer files mostly occupied by comments.
The newline encoding in the firefox source is [LF]. A few files, however, have a small block of lines ending with [CR][LF]. All of these [CR][LF] lines seem to be fragments inserted as patches. Some conversion tool should be used when one edits a source file on Windows.
I used mcpp to preprocess some sample programs provided with Visual C++ 2003, 2005 and 2008. The system headers seem to have only the few compatibility problems shown below. These problems are often seen in other compiler systems and do not have a serious impact on preprocessing.
Although the Linux system headers and glibc sources often contain coding based on GCC-local specifications, the Visual C++ system headers have only a little Visual C++-local coding.
I found only one outrageous macro in Visual C++. Vc7/PlatformSDK/Include/WTypes.h has the following macro definition: *1
#define _VARIANT_BOOL /##/
This macro definition is used in oaidl.h and propidl.h in Vc7/PlatformSDK/Include/ as follows:
_VARIANT_BOOL bool;
What does this macro aim at?
This macro seems to expect _VARIANT_BOOL to be expanded into // and the line to be commented out. Actually, this expectation is met by Visual C's cl.exe!
In the first place, // is not a token (preprocessing-token). Macros should be processed and expanded after the source is parsed into tokens and each comment is converted into one space. Therefore, it is irrational for a macro to generate a comment. When this macro is expanded into //, the result is undefined, because // is not a valid preprocessing-token.
In order to use these header files with mcpp, comment out these macro definitions and change the many _VARIANT_BOOL occurrences as follows:
#if !__STDC__ && (_MSC_VER <= 1000)
_VARIANT_BOOL bool;
#endif
If you use only Visual C 5.0 or later, this line can be simply commented out as follows:
// _VARIANT_BOOL bool;
This macro is indeed out of the question; however, it is Visual C / cl.exe, which allows such an outrageous macro to be preprocessed into a comment, that should be blamed. This example reveals the following serious problems in this preprocessor:
Probably, the cl.exe preprocessor was developed based on a very old, somewhat character-based preprocessor. It is easy to presume that the preprocessor has been upgraded by repeated partial revisions of that old preprocessor.
There are many preprocessors which presumably have a very old program structure. GCC 2/cpp, shown in 3.9, is one of them. Repeated partial revision of such a preprocessor only makes its program structure more complicated. However much such revision may be made, there are limits to the quality such a preprocessor can achieve. Unless the old source is given up and completely rewritten, a clear and well-structured preprocessor cannot be obtained.
In GCC 3/cpp0, a total revision was made to GCC 2: the entire source code was rewritten. So, GCC 3/cpp0 has become quite different from GCC 2. Although mcpp was initially developed based on the source of an old preprocessor, DECUS cpp, its source code was totally rewritten early on.
Note:
*1 Visual C++ 2005 Express Edition does not contain the Platform SDK. However, you can download the "Platform SDK for Windows 2003" and use it with VC2005. wtypes.h, oaidl.h and propidl.h in this PlatformSDK/Include directory have the same macro definition and usage as the VC2003 Platform SDK.
Also in Visual C++ 2008, in the header files of the same names in the 'Microsoft SDKs/Windows/v6.0A/Include' directory, that macro definition and its usage are quite the same.
Another problem is the use of '$' in identifiers. Its use in macro names suddenly increased in the system headers of Visual C++ 2008. Though such macros were also found in Visual C++ 2005, they were rare. In Visual C++ 2008, however, they are found here and there.
'Microsoft Visual Studio 9.0/VC/include/sal.h' is the most conspicuous one. This header defines macros for so-called SAL (standard source code annotation language) of Microsoft, and has many names containing '$'. This file is included from many standard headers via 'Microsoft Visual Studio 9.0/VC/include/crtdefs.h', so most sources are compiled with these macros without knowing it.
If you specify the -Za option when invoking the compiler cl, SAL is disabled and all of the names with '$' in sal.h disappear. The necessity of this notation is, however, hard to understand. Though GCC also enables '$' in identifiers by default, its actual use is rarely found nowadays.
Names of this kind are also found in the system headers named specstrings*.h in the 'Microsoft SDKs/Windows/v6.0A/Include' directory. They are included from Windows.h via WinDef.h, and the names with '$' do not disappear even if the -Za option is specified; the option only causes errors. So, you cannot use the -Za option to compile a source which includes Windows.h.
This chapter does not contain all the C preprocessor specifications. For details on Standard C preprocessing, refer to cpp-test.html. For mcpp behaviors in each mode, refer to 2.1. This chapter covers several preprocessor-related specifications, including those called implementation-defined by Standards. For more details on mcpp implementation-defined-behaviors, see Chapter 5, "Diagnostic Messages".
The header file internal.H defines the values returned by mcpp to a parent process. mcpp returns 0 on success; on error, it returns errno if errno != 0, and 1 if errno == 0. Success means that no error has occurred.
This section explains the order in which mcpp searches directories for an include file when it encounters a #include directive.
With the -I- option (-nostdinc option for GCC-specific-build and -X for Visual C-specific-build), the directories specified in 4.4 and later are not searched.
The ANSI C Rationale says the ANSI committee intends to define the current directory as the base directory. I think this is acceptable, in that the base directory is always constant and the specification is clearer. However, some implementations, such as UNIX, seem to take the source file's directory as the base, at least for #include "header". The compiler-independent-build of mcpp also takes the source file's directory as the base, following the majority.
This section explains how to construct a header-name pp-token and extract a file name from it.
Evaluation of #if expression depends on the largest integer type of the host compiler (by which mcpp was compiled) and that of the target compiler (which uses mcpp). Since the compiler-independent-build has no target compiler, the type depends only on the host compiler.
mcpp in Standard mode evaluates #if expression in the common largest integer type of the host and target compiler. Nevertheless, mcpp in pre-Standard mode evaluates it in (signed) long.
In compiler systems having the type "long long", if __STDC_VERSION__ is set to 199901L or higher using the -V199901L option, mcpp evaluates a #if expression in "long long" or "unsigned long long", according to the C99 specification. Although C90 and C++98 stipulate that a #if expression is evaluated in long / unsigned long, mcpp evaluates it in long long / unsigned long long even in C90 or C++98 mode, and issues a warning in case the value overflows the range of long / unsigned long. *1
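For example, on a compiler whose long is 32 bits and long long is 64 bits:

#if 0x100000000 != 0    /* 2^32: out of the range of 32-bit unsigned long;   */
#endif                  /* evaluated correctly in (unsigned) long long, but
                           with a warning in C90 / C++98 mode                */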
Visual C and Borland C 5.5 do not have a "long long" type, but have an __int64 type of the same length. So, a #if expression is evaluated in __int64 / unsigned __int64. (However, since the LL and ULL suffixes cannot be used in Visual C++ 2002 or earlier and Borland C 5.5, these suffixes must not be used in coding other than on #if lines.)
In addition, when you invoke with the -+ option for C++ preprocessing, mcpp evaluates pp-tokens 'true' and 'false' in a #if expression to 1LL (or 1L) and 0LL (or 0L), respectively.
mcpp in Standard mode evaluates a #if expression as follows. For a compiler without long long, please read "long long" and "unsigned long long" hereinafter, until the end of 4.5, as "long" and "unsigned long", respectively. For pre-Standard mode, read all of them as "long".
Anyway, an integer constant token always has a non-negative value.
In pre-Standard mode, an integer constant token is evaluated within the range of non-negative long. A token beyond that range is diagnosed as an out of range error. All the operations are performed within the range of long.
If both the host and target compilers have the type unsigned long long and the range of unsigned long long of the host is narrower than that of the target, a constant beyond the host's range is evaluated as an out-of-range error.
If an operation on constant tokens produces a result out of the range of long long, an out-of-range error occurs. If it produces a result out of the range of unsigned long long, a warning is issued. This also applies to intermediate results.
Since a bitwise right shift of a negative value, or a division operation involving one, is not portable, mcpp issues a warning. If an operation mixing unsigned and signed operands converts a signed negative value to an unsigned positive value, a warning is also issued. How these values are evaluated depends on the specification of the compiler-proper of the host system.
C90 and C++98 make it a rule that a preprocessor evaluates a #if expression in long / unsigned long (in C99, the maximum integer type is used). These specifications are rougher than those for compiler-propers. A (#)if expression is often evaluated differently between the preprocessor and the compiler-proper, especially when sign extension is involved.
In addition, since keywords are not recognized during Standard C preprocessing, sizeof and casts cannot be used in a #if expression. Of course, neither variables, enumeration constants, nor floating point numbers can be used there. Standard mode allows the "defined" operator in a #if expression, as well as the #elif directive. Except for these differences, mcpp evaluates a #if expression in accordance with the precedence and associativity of operators, just as compiler-propers do. In a binary operation, an arithmetic conversion often takes place to equalize the types of both sides; if one operand is unsigned long long and the other is long long, both are converted to unsigned long long.
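For example, a sketch of the conversion rule just described:

#if -1 > 0U     /* -1 is converted to unsigned and becomes the maximum value,
                   so this condition is true; mcpp warns about the conversion
                   of a negative value to unsigned, as noted above           */
#endif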
Note:
*1 mcpp up to V.2.5 evaluated #if expressions in C90 and C++98 in long long / unsigned long long internally, and issued an error on overflow of long / unsigned long. From V.2.6 onward, mcpp degraded the error to a warning for compatibility with GCC and Visual C.
Constant tokens in a #if expression include identifiers (macros and non-macros), integer tokens and character constants. How character constants are evaluated is implementation-defined and lacks portability. Even (#)if 'const' is sometimes evaluated differently between preprocessor and compiler-proper; note that the Standards do not even guarantee that the two evaluate it to the same value.
mcpp in POSTSTD mode does not evaluate a character constant in a #if expression, since it is almost meaningless, and makes it an error.
Like other integer constant tokens, mcpp evaluates a character constant in a #if expression within the range of long long or unsigned long long. (In pre-Standard mode, long only.)
A multi-byte character or a wide character is generally evaluated in a 2-byte type, except for the UTF-8 encoding, which is evaluated in a 4-byte type, since UTF-8 has a variable length. mcpp does not support EUC's 3-byte encoding scheme. (A 3-byte character is recognized as 1 byte + 2 bytes; as a consequence, its value is still evaluated correctly.) Although there are some implementations using a 2-byte encoding scheme that define wchar_t as 4 bytes, mcpp has no relevance to wchar_t. The following paragraphs describe 2-byte multi-byte character encodings.
Multi-byte character constants, such as 'X', are evaluated to ((first byte value << 8) + second byte value), where 8 is the value of CHAR_BIT in <limits.h>. Note that 'X' is used here to designate a multi-byte character; though 'X' itself is not a multi-byte character, it is used to avoid character garbling.
Let me take as an example multi-character character constants, such as 'ab', '\x12\x3', and '\x123\x45'. 'a', 'b', '\x12', '\x3' and '\x123' are each regarded as one byte. When a multi-character character constant is evaluated, each byte, starting from the highest one, is evaluated within the range [0, 0xFF] and combined by shifting left by 8. (0xFF is the value of UCHAR_MAX in <limits.h>.) If the value of one escape sequence exceeds 0xFF, an out-of-range error occurs. Therefore, in an implementation with the ASCII character set, the above three tokens are evaluated to 0x6162, 0x1203 and an error, respectively.
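In code, the rule works out like this (assuming ASCII and CHAR_BIT == 8, as above):

#if 'ab' == 0x6162          /* ('a' << 8) + 'b' == (0x61 << 8) + 0x62   */
#endif
#if '\x12\x3' == 0x1203     /* each escape sequence is one byte         */
#endif
/* '\x123\x45' would be an out-of-range error: 0x123 exceeds 0xFF       */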
L'X' is evaluated to the same value as 'X'. Let me take as an example multi-character wide character constants, such as L'ab', L'\x12\x3', and L'\x123\x45'. L'a', L'b', L'\x12', L'\x3', L'\x123', and L'\x45' are each regarded as one wide character. When a multi-character wide character constant is evaluated, each wide character, starting from the highest one, is evaluated within the range [0, 0xFFFF] and combined by shifting left by 16. If the value of one escape sequence exceeds the maximum value of an unsigned 2-byte integer, an out-of-range error occurs. Therefore, in an implementation with the ASCII character set, the above three tokens are evaluated to 0x00610062, 0x00120003, and 0x01230045, respectively.
If the value of a multi-character character constant or a multi-character wide character constant exceeds the range of unsigned long long, an out-of-range error occurs.
With __STDC_VERSION__ or __cplusplus set to 199901L or higher, mcpp evaluates a Universal Character Name (UCN) in the form of \uxxxx and \Uxxxxxxxx as a hex escape sequence. (I know this evaluation is nonsense but no other way.)
If the compiler-proper of the target system uses a signed char or signed wchar_t, a character constant in a (#)if expression may be evaluated differently between mcpp and the compiler-proper. The range that causes a range error may also differ between them. In addition, the evaluation of multi-character character constants and multi-byte character constants varies among preprocessors and among compilers. Standard C does not define whether, with CHAR_BIT set to 8, 'ab' is evaluated to 'a' * 256 + 'b' or 'a' + 'b' * 256.
In general, character constants should not be used in an #if expression, as long as you have an alternative method. I think an alternative method always exists.
Standard C stipulates that preprocessing is a process independent of run-time environments and compiler-proper specifications, thus prohibiting the use of sizeof and casts in a #if expression. However, pre-Standard mode allows sizeof (type) in a #if expression. This was done as a part of my effort to add necessary modifications to DECUS cpp, such as long long and long double processing, while retaining its original functionality. As for casts, I neither implemented them nor had the will to do so, because it would require troublesome work.
A series of macros beginning with S_, such as S_CHAR, in eval.c defines the size of each type. Under a cross implementation, these macros must be modified to specify the sizes, as integer values, of the types used on the target system.
I have to admit that mcpp does not provide the full functionality of #if sizeof. mcpp just ignores the words "signed" or "unsigned" preceding char, short, int, long, and long long when they appear in a #if sizeof. Also, mcpp does not support sizeof (void *). I know this is a half-hearted implementation, but I do not want to increase the number of flags in system.H in vain for this non-conforming feature. I initially thought of removing the sizeof code from the original version, because I did not intend to support casts at all; but on second thought, I decided to make a small amount of modifications to make use of the existing code.
mcpp in principle compresses a white-space sequence, excluding <newline>, into one space character as a token separator during tokenization in translation phase 3. If the -k or -K option is specified in STD mode, however, it outputs horizontal white spaces as they are, without compressing. It also deletes a white-space sequence at the end of a line.
A white-space sequence at the beginning of a line is deleted in POSTSTD mode, and output as-is in the other modes. The latter is special treatment for the convenience of human reading. *1
This compression and deletion occur during this intermediate phase. The next phase, phase 4, involves macro expansion and processing of preprocessing-directive lines. Macro expansion may sometimes produce several space characters before and after a macro. Of course, the number of space characters does not affect compilation results.
Standard C says that whether an implementation compresses a white-space sequence into one space character during translation phase 3 is implementation-defined, but you usually do not have to worry about this. A <vertical-tab> or <form-feed> in a preprocessing directive line may adversely affect portability, since its behavior is undefined in Standard C; mcpp converts it to one space character.
Note:
*1 Up to V.2.6.3, mcpp squeezed white spaces at the top of a line into one space. V.2.6.4 changed this behavior.
This section describes the specifications of the mcpp executables generated when the DIFfile and makefile for each compiler system in the noconfig directory are used to compile mcpp with default settings. When the configure script is used to compile mcpp, the generated mcpp may differ depending on configure's results; however, as long as the OS and compiler system versions are the same, the generated mcpps should be the same except for the include directories.
The compiler-independent-build of mcpp has constant specifications regardless of the compiler system with which mcpp was compiled, except for a few features dependent on the OS and CPU.
There are compiler-independent-build and compiler-specific-build for mcpp executables, and each executable has several behavioral modes. For those, refer to 2.1. This section describes the settings centering on STD mode.
DIFfiles and makefiles are for the following compiler systems:
FreeBSD 6.3                     GCC V.3.4
Vine Linux 4.2 / x86            GCC V.2.95, V.3.2, V.3.3, V.3.4, V.4.1
Debian GNU/Linux 4.0 / x86      GCC V.4.1
Ubuntu Linux 8.04 / x86_64      GCC V.4.2
Fedora Linux 9 / x86            GCC V.4.3
Mac OS X Leopard / x86          GCC V.4.0
CygWIN                          1.3.10 (GCC V.2.95), 1.5.18 (GCC 3.4)
MinGW & MSYS                    GCC 3.4
WIN32                           LCC-Win32 2003-08, 2006-03
WIN32                           Visual C++ 2003, 2005, 2008
WIN32                           Borland C++ V.5.5
In addition, for the following compilers, which I don't have, difference files contributed by some users are included here.
WIN32    Visual C++ V.6.0, 2002
WIN32    Borland C++ V.5.9 (C++Builder 2007)
Of all the macros defined in noconfig.H and system.H, the settings of those mentioned below are identical for every mcpp executable, regardless of the compiler system.
Each mcpp is compiled with DIGRAPHS_INIT == FALSE, and thus enables digraphs when the -2 (-digraphs) option is specified.
With TRIGRAPHS_INIT == FALSE, trigraphs are enabled with the -3 (-trigraphs) option.
With OK_UCN set to TRUE, Universal Character Name (UCN) can be used in C99 and C++.
With OK_MBIDENT set to FALSE, multi-byte characters cannot be used in identifiers.
With STDC set to 1, the initial value of __STDC__ is 1.
The translation limits are set as follows.
NMACPARS      (Maximum number of macro arguments)                    255
NEXP          (Maximum number of nested levels of #if expressions)   256
BLK_NEST      (Maximum number of nested levels of #if sections)      256
RESCAN_LIMIT  (Maximum number of nested levels of macro rescans)      64
IDMAX         (Valid length of identifier)                          1024
INCLUDE_NEST  (Maximum number of #include nest levels)               256
NBUFF         (Maximum length of a source line) *1                 65536
NWORK         (Maximum length of an output line)                   65536
NMACWORK      (Size of internal buffers used for macro expansion) 262144
On GCC-specific-build and Visual C-specific-build, however, NMACWORK is used as the maximum length of an output line.
The following macro differs by OS, regardless of build type.
MBCHAR (Default encoding of multibyte character):
Linux, FreeBSD, Mac OS X    EUC-JP
WIN32, CygWIN, MinGW        SJIS
The settings of the macros below are different among compiler systems.
STDC_VERSION (Initial value of __STDC_VERSION__):
Compiler-independent, GCC 2    199409L
Others                         0L
HAVE_DIGRAPHS (Are digraphs output as they are?):
Compiler-independent, GCC, Visual C    TRUE
Others                                 FALSE
EXPAND_PRAGMA (Is a #pragma line macro-expanded in C99?):
Visual C, Borland C    TRUE
Others                 FALSE
GCC 2.7-2.95 defines __STDC_VERSION__ as 199409L. In GCC V.3.* and V.4.*, however, __STDC_VERSION__ is no longer predefined by default and is defined in accordance with an execution option. mcpp's setting for GCC follows these variations.
If STDC_VERSION is set to 0L, mcpp predefines __STDC_VERSION__ as 0L. Specifying the -V199409L option then sets __STDC__ and __STDC_VERSION__ to 1 and 199409L, respectively, and allows only predefined macros that begin with '_', resulting in mcpp running in a strictly C95-conforming mode. The -V199901L option specifies C99 mode.
In C99 mode, mcpp predefines __STDC_HOSTED__ as 1.
mcpp itself predefines none of __STDC_ISO_10646__, __STDC_IEC_559__ and __STDC_IEC_559_COMPLEX__. These values are compiler-system-specific. On glibc 2 / x86, the system header defines __STDC_IEC_559__ and __STDC_IEC_559_COMPLEX__ as 1. Other compiler systems do not define them.
If HAVE_DIGRAPHS is set to FALSE, digraphs are output after being converted to the usual tokens.
The argument of a #pragma line beginning with STDC, MCPP or GCC is never macro-expanded, even if EXPAND_PRAGMA == TRUE.
Include directories are set as follows:
System-specific or site-specific directories under UNIX-like OSs are as follows (common to compiler-independent-build and compiler-specific-build):
FreeBSD, Linux, Mac OS X, CygWIN    /usr/include, /usr/local/include
Mac OS X also has the framework directories, set by default to /System/Library/Frameworks and /Library/Frameworks.
On MinGW, /mingw/include is the default include directory.
The CygWIN GCC-specific-build changes /usr/include to /usr/include/mingw with the -mno-cygwin option.
For the implementation-specific directories that vary among compiler systems and their versions, see the DIFfiles. The compiler-independent-build does not set implementation-specific directories. mcpp for the compiler systems on Windows does not preset any directory but uses the environment variables INCLUDE and CPLUS_INCLUDE. These environment variables are used by the compiler-independent-build, too.
If these default settings do not suit you, change the settings and recompile mcpp, or use environment variables or the -I option.
When the length of a preprocessed line exceeds NWORK-1, mcpp generally divides it into several lines so that each line's length becomes equal to or less than NWORK-1. A string literal's length must be equal to or less than NWORK-2. mcpp of GCC-specific-build and Visual C-specific-build, however, does not divide output lines.
Again for confirmation: the macros mentioned above in italics are used only to compile mcpp, and are not built-in macros of an mcpp executable.
If you invoke mcpp without an input file and enter '#pragma MCPP put_defines', the built-in macros will be displayed.
With __STDC__ set to 1 or higher, the macros that do not begin with '_' are deleted. The -N (-undef) option deletes all the macros other than __MCPP. After -N, you can use -D to define macros over again. When you use a compiler system version different from those specified here, -N and -D allow you to redefine your version macros without recompiling mcpp. The -D option also allows you to redefine a particular macro without using -N or -U.
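For example, an invocation along these lines (a sketch using the options described above; the macro values are illustrative) removes the built-in macros and redefines version macros:

mcpp -N -D__GNUC__=4 -D__GNUC_MINOR__=3 some_source.c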
When you use the -+ (-lang-c++) option to specify C++ preprocessing, __cplusplus is predefined with its initial value of 1L. In addition, some other macros are also predefined:
Although there are some predefined macros in GCC, those predefined by GCC itself were few until GCC V.3.2; most of them are passed from gcc to cpp via the -D option. So, it is not necessary for mcpp to define them for compatibility. However, mcpp predefines these macros for use in a stand-alone manner, such as pre-preprocessing.
GCC V.3.3 and later suddenly predefines 60 or 70 macros. The GCC-specific-build of mcpp V.2.5 and later for GCC V.3.3 or later also includes these predefined macros, besides the above ones. These GCC-specific predefined macros are written in the mcpp_g*.h header files, which are generated by the installation of mcpp.
Since FreeBSD, Linux, CygWIN, MinGW / GCC, LCC-Win32 and Visual C 2008 have the type long long, a #if expression is evaluated in long long or unsigned long long. Visual C 6.0, 2002, 2003, 2005 and Borland C 5.5 do not have a "long long" type, but have __int64 and unsigned __int64 instead; these types are used.
In the above compiler systems, type long ranges:
[-2147483647-1, 2147483647] ([-0x7fffffff-1, 0x7fffffff])
and unsigned long ranges:
[0, 4294967295] ([0, 0xffffffff]).
In the compiler systems with type long long, it ranges:
[-9223372036854775807-1, 9223372036854775807] ([-0x7fffffffffffffff-1, 0x7fffffffffffffff]),
and unsigned long long ranges:
[0, 18446744073709551615] ([0, 0xffffffffffffffff]).
All the compiler-propers of the above compiler systems internally represent a signed integer as a two's complement number, and bit operations act on that representation. The same applies to mcpp's #if expressions.
A right shift of a negative integer is an arithmetic shift; the same applies to mcpp's #if expressions. (Right-shifting an integer by one bit halves the value with the sign retained.)
In an integer division or modulus operation where either or both operands are negative, an algebraic operation like that of Standard C's ldiv() function is performed. The same applies to mcpp's #if expressions.
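In code, these rules mean, for example:

#if ((-4 >> 1) == -2) && ((-7 / 2) == -3) && ((-7 % 2) == -1)
/* arithmetic shift and ldiv()-style division; note that mcpp still warns
   that these operations on negative values are not portable, as noted above */
#endif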
These OSs use the ASCII basic character set. So does mcpp.
There is a memory management routine, kmmalloc, that I developed. This routine has malloc(), free(), realloc() and other memory handling functions. On systems other than CygWIN and Visual C 2005 / 2008, kmmalloc is linked when the MALLOC=KMMALLOC (or -DKMMALLOC=1) option is specified to make, and its heap memory debugging routine is linked as well. mcpp for Linux and LCC-Win32 uses EFREEP, EFREEBLK, EALLOCBLK, EFREEWRT and ETRAILWRT with errno values 2120, 2121, 2122, 2123 and 2124 assigned; other mcpps use 120, 121, 122, 123 and 124. (Refer to mcpp-porting.html#4.extra.) *2
On systems other than GNU and Visual C, you should preset the environment variable TZ (for example, JST-9 in Japan). Otherwise, the __DATE__ and __TIME__ macros are not set correctly.
Note:
*1 This limit also applies to a line spliced by <backslash><newline> deletion. Moreover, it applies to the line after a comment is converted into a space, where multiple logical lines may be concatenated by a comment spreading across them.
*2 CygWIN 1.3.10 and 1.5.18 provide a malloc() that has an internal routine named _malloc_r(), which is called by a few other library functions, so this malloc() cannot be replaced with another malloc(). Also, in Visual C 2005 and 2008, the program-terminating routine calls an internal routine of the resident malloc(), hence another malloc() cannot be used.
This section covers diagnostic messages issued by mcpp, as well as their meaning. By default, these messages are output to stderr. With the -Q option, they are redirected to the mcpp.err file in the current directory. A diagnostic message is output in the following manner:
If the -j option is specified, mcpp outputs neither the above 2 nor 3.
Diagnostic messages are divided into three levels:
fatal error    Indicates an error so serious that it is no longer meaningful to continue preprocessing.
error          Indicates a syntax or usage error.
warning        Indicates code that lacks portability or may contain a bug.
Warnings are further divided into five classes:
Class 1     Source code may contain a bug or at least lacks portability.
Class 2     Source code will probably present no problem in practical use, but is problematic in terms of Standard conformance.
Class 4     Source code will probably present no problem in practical use, but is problematic in terms of portability.
Class 8     Rather surplus warnings: on skipped #if groups, on sub-expressions of a #if expression whose evaluation is skipped, etc.
Class 16    Warnings on trigraphs and digraphs.
Warnings other than those of Class 1 or 2 are rather specific to mcpp.
mcpp has various types of diagnostic messages. For example, STD mode provides the following types of diagnostics for each level and class.
fatal error         17 types
error               76 types
warning class 1     49 types
warning class 2     15 types
warning class 4     17 types
warning class 8     30 types
warning class 16     2 types
Principally, these messages point out the coding in question. The diagnostic messages below have a sample token or numeric value from source code embedded. For the messages with a macro name embedded, the value the macro expands into is shown in the real messages.
Depending on the case, the same message may be issued as a warning or as an error; this manual gives a detailed description at its first occurrence, and for subsequent occurrences the message is only listed.
Of the errors shown below, some, such as a buffer overflow, occur due to mcpp's own restrictions. Some macros in system.H define translation limits, such as buffer sizes. Enlarge the buffer size and recompile mcpp if necessary; however, be careful not to increase it too much. A large buffer on a system with a small amount of memory may cause frequent "out of memory" errors.
A fatal error occurs and preprocessing is terminated when it is no longer possible to continue preprocessing due to an I/O error or a shortage of memory, or no longer meaningful to do so due to a buffer overflow. A status value of failure is returned to the parent process.
The following four errors may also be caused by a buffer overflow at a token that is not particularly long, during macro expansion; in that case, you should divide the macro invocation.
mcpp issues an error message when it finds a grammatical error. Standard C stipulates that a compiler system shall issue a diagnostic message when it encounters a violation of syntax rules or constraints. Principally, Standard mode issues an error message for this type of violation, but sometimes issues a warning.
mcpp issues an error message or a warning for most items that are undefined in Standard C. However, mcpp issues neither an error nor a warning for the following undefined items:
For details on what is a violation of a syntax rule or constraint, or undefined, unspecified or implementation-defined in Standard C preprocessing, refer to cpp-test.html.
Even if an error occurs, mcpp continues preprocessing as long as the error is not a fatal one. mcpp shows the number of errors and returns a status of failure to the parent process when it exits.
The following several messages are all token-related errors. For the first four, mcpp skips the line in question and continues preprocessing. The first three are string-literal or other token-related errors, indicating that a closing quotation mark is not found by the end of the logical line. This type of error occurs when you write, outside of a string literal or comment, text that does not take the form of a preprocessing-token sequence, as shown below:
#error I can't understand.
As preprocessing-tokens are not as strictly defined as the C tokens of the compiler-proper, most character sequences are regarded as pp-token sequences, as long as they belong to the source character set. Therefore, only this type of coding causes a preprocessing-token error. Pp-token errors may occur even in a skipped #if group.
This section covers the messages issued when a source file ends with an unterminated #if section or macro invocation. If the file is the main input file (not an included file), the message "End of input", not "End of file", is issued.
These diagnostic messages are issued as an error or warning, depending on mcpp modes.
Standard mode issues these messages as errors, in which case mcpp skips the macro invocation in question and restores the relationship between paired directives in a #if section to that of when the file was initially included.
On the other hand, pre-Standard mode issues warnings. OLDPREP mode does not even issue a warning, except on an unterminated macro call.
This section covers errors caused by ill-balanced directives such as #if and #else. Even if mcpp finds an imbalance among these directives, it continues processing, assuming that the group being processed still continues. mcpp checks whether directives are balanced even in a skipped #if group.
A #if (#ifdef) section is the block between #if (#ifdef or #ifndef) and #endif. A #if (#elif, #else) group is a smaller block within the #if (#ifdef) section: say, between #if (#ifdef or #ifndef) and #elif, between #elif and #else, or between #else and #endif, as illustrated below.
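To illustrate the terms:

#if A           /* begins a #if section         */
    /* ...  a #if group   ...                   */
#elif B
    /* ...  a #elif group ...                   */
#else
    /* ...  a #else group ...                   */
#endif          /* ends the #if section         */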
The following two errors occur when #asm and #endasm are not balanced. These messages are issued only by the compiler-specific-build for a particular compiler system, and only in pre-Standard mode.
This section covers simple syntax errors on directive lines beginning with #. The errors discussed hereinafter, until 5.4.12, do not occur within a skipped #if group. (mcpp invoked with the -W8 option issues a warning for an unknown directive.)
When mcpp finds a directive line with a syntax error, it ignores the line and continues processing; in that case, it neither regards a #if as the beginning of a section nor changes line numbers on a #line. If a #include or #line line has a macro in its arguments, Standard mode expands the macro and then checks the syntax; pre-Standard mode does not expand the macro.
Although the messages below do not show the directive name in question, the source line that follows the message shows it. (A directive line with a comment converted to a space character always becomes one line, which is called a "preprocessed line" here.)
The following error occurs only in Standard mode, and the directive is ignored. OLDPREP mode issues neither an error nor a warning; KR mode issues a warning and continues preprocessing as if there had been no "junk" text.
This section covers syntax-related errors in the #if, #elif and #assert directives. If a #if (#elif) line has one of these errors, mcpp evaluates it to false, skips the #if (#elif) group, and continues processing.
For a skipped #if (#ifdef, #ifndef, #elif or #else) group, mcpp checks the validity of the preprocessing tokens and the balance of these directives, but not other grammatical errors.
A #if line may have a sub-expression whose evaluation is skipped. For example, in the case of #if a || b, if "a" evaluates to true, "b" is not evaluated at all. However, the following 14 types of syntax errors or translation limit errors are checked even if they are located in a sub-expression whose evaluation is skipped.
The following error messages are relevant to #if sizeof. Only pre-Standard mode issues these errors.
The following errors do not occur in a sub-expression whose evaluation is skipped. (mcpp invoked with the -W8 option issues a warning.)
The Standards say that a #if expression is evaluated in the largest integer type in C99, and in long / unsigned long in C90 and C++98. mcpp evaluates it in long long / unsigned long long even in C90 or C++98, and issues a warning on a value outside the range of long / unsigned long in C90 and C++98. In this subsection, please read long long / unsigned long long below as long / unsigned long for compilers without long long, and as long in pre-Standard mode. In POSTSTD mode, a character constant is not available in a #if expression and causes a different error.
The following errors are relevant to sizeof. They are not issued in a sub-expression whose evaluation is skipped (with the -W8 option, a warning is issued). They occur only in pre-Standard mode.
This section covers #define-related errors. A macro is not defined if an error occurs at its #define. The # and ## operator related errors occur in Standard mode; __VA_ARGS__ related errors also occur in Standard mode. Although the variadic macro is a C99 specification, mcpp allows these macros in C90 and C++ modes for compatibility with GCC and Visual C++ 2005 / 2008. (A warning is issued.)
In STD mode of the GCC-specific-build, if you write a GCC2-spec variadic macro using __VA_ARGS__, you will get this error; __VA_ARGS__ should be used only in GCC3-spec or C99-spec variadics.
This section covers #undef related errors.
This section covers macro expansion errors. mcpp displays the macro definition, as well as the source filename and line number where it is found. The errors related to the # and ## operators can occur only in Standard mode.
When the following errors occur, the macro invocation is skipped.
The following errors can occur only with the -K option in STD mode. They mean that the macro is extremely complex and the buffer for macro notification ran short; this is almost impossible to happen in real life.
The following two are checked when mcpp is invoked with the -V199901L option. The same thing can be said when mcpp is invoked with the -V199901L option in C++ mode.
A warning is issued when source, although syntactically correct, possibly contains a coding mistake or has a portability problem. Warnings are divided into five classes: 1, 2, 4, 8, and 16. These classes are enabled by the -W <n> option on mcpp invocation; <n> specifies an OR-ed value of any of 1, 2, 4, 8, and 16. Class 4, for example, can be specified explicitly with -W4, and implicitly with any -W<n> where <n> is 1|4, 1|2|4, 2|4, 1|4|8, 4|8, 4|16, etc., because the AND-ed value of such <n> and 4 is 4 (non-zero).
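For example (a sketch of the option arithmetic described above; the source file name is a placeholder):

mcpp -W5 some_source.c      (5 == 1|4: enables class 1 and class 4 warnings)
mcpp -W17 some_source.c     (17 == 1|16: enables class 1 and class 16 warnings)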
Standard mode issues an error message for most source code that causes Standard C undefined behavior, but a warning for some.
Likewise, Standard mode always issues a warning for source code that uses Standard C unspecified behaviors, except for the following:
Standard mode issues a warning for many implementation-defined behaviors, except for the following:
As you see, mcpp can perform almost all the portability checks necessary at a preprocessing level.
POSTSTD mode is identical with STD mode except for some specification differences described in section 2.1.
Regardless of the number of warnings, mcpp always returns a status of success. mcpp invoked with the -W0 option does not issue a warning.
Besides character codes, ISO-2022-JP has shift sequences. Apart from the shift sequences, all the multi-byte characters other than UTF-8 are two bytes long.
Encoding        first byte              second byte
shift-JIS       0x81-0x9f, 0xe0-0xfc    0x40-0x7e, 0x80-0xfc
EUC-JP          0x8e, 0xa1-0xfe         0xa1-0xfe
KS C 5601       0xa1-0xfe               0xa1-0xfe
GB 2312-80      0xa1-0xfe               0xa1-0xfe
Big Five        0xa1-0xfe               0x40-0x7e, 0xa1-0xfe
ISO-2022-JP     0x21-0x7e               0x21-0x7e
On unterminated lines or comments, the following messages are issued. OLDPREP mode does not issue these warnings.
The following warning messages are issued in pre-Standard mode. Pre-Standard mode ignores these conditions and continues processing until it reaches the end of input, which can cause many unexpected results. Standard mode issues errors instead. OLDPREP mode does not issue even a warning, except on an unterminated macro.
The following message is issued only in Standard mode.
The following message is issued only in STD mode.
#define THIS$AND$THAT(a, b) ((a) + (b))
mcpp interprets this as follows:
#define THIS $AND$THAT(a, b) ((a) + (b))
and issues a warning. Of course, this is quite a rare case.
The following warnings are issued only in lang-asm mode.
The following warnings on #pragma lines are issued only in Standard mode. The lines are in principle output in spite of the warnings. However, lines to be processed by the preprocessor itself, such as most lines beginning with #pragma MCPP or #pragma GCC, are not output. Pragmas for the compiler or linker, such as #pragma GCC visibility *, are output without warning.
The GCC-specific-build issues the following warnings:
The GCC-specific-build issues a Class 2 warning for a line with #pragma GCC followed by either 'poison' or 'dependency', and does not output the line. The GCC V.3 resident preprocessor processes such a line, but mcpp does not.
The following warnings are issued only in pre-Standard mode. Standard mode regards them as errors.
KR mode issues the following warning. Standard mode issues the same warning only for #pragma once, #pragma MCPP put_defines, #pragma MCPP push_macro, #pragma MCPP pop_macro, #pragma push_macro, #pragma pop_macro, #pragma MCPP debug, #pragma MCPP end_debug, and #endif in STD mode of the GCC-specific-build; for other directives, Standard mode issues an error. OLDPREP mode issues neither an error nor a warning.
The following three warnings are relevant to an argument of #if, #elif, or #assert:
The following warnings are also relevant to the argument of #if, #elif or #assert. They are not issued in a sub-expression whose evaluation is skipped. (mcpp invoked with the -W8 option issues them.)
The following warnings are relevant to operations and types in a constant expression on #if, #elif or #assert lines. These warnings, too, are not issued in a skipped sub-expression. (mcpp invoked with -W8 issues them.)
mcpp evaluates a #if expression in long long / unsigned long long even in C90 or C++98, and issues a warning on a value outside the range of long / unsigned long in C90 and C++98. mcpp also warns about the LL suffix outside of C99. These warnings are of class 1 in the compiler-independent-build and class 2 in compiler-specific-builds. In POSTSTD mode, character constants cannot be used in a #if expression, hence no warning is issued. (They cause errors.)
With these warnings, mcpp displays the macro definition, followed by the source filename and line number where the macro is defined.
The following two are issued only in OLDPREP mode. (In other modes, they cause an error.)
This section covers line number related warnings.
In C90, when you use #line to specify a value slightly below 32767, you won't receive an error; but sooner or later, the line number will exceed 32767, in which case mcpp continues to increase the line number while issuing a warning. Some compiler-propers may not accept such a large line number; it is not desirable to specify a large number with #line.
This section covers warnings about code that does not contain a bug but causes a portability problem.
mcpp evaluates a #if expression in long long / unsigned long long even in C90 or C++98, and issues a warning on a value outside the range of long / unsigned long in C90 and C++98. The LL suffix outside of C99 mode also gets a warning, as does the i64 suffix in the compiler-specific-builds for Visual C and Borland C. These warnings are of class 1 in the compiler-independent-build and class 2 in compiler-specific-builds.
Only the Standard mode issues the following five warnings:
Define an EMPTY macro with no replacement text, if possible, and then write EMPTY wherever an empty argument would be written, as in the example below.
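For example:

#define EMPTY
#define MACRO(a, b)     /* some macro taking two arguments */
MACRO(x, EMPTY)         /* instead of  MACRO(x, )  which draws the warning */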
The following warning is issued only in POSTSTD mode.
The following two warnings are issued only in some compiler systems. Of course, the coding in question is valid in those particular systems, but it lacks portability, so a warning is issued to remind users of it.
Standard C guarantees some minimum translation limits. It is desirable for a preprocessor to impose translation limits that exceed these values, but a source program that relies on a preprocessor's own translation limits restricts its portability. mcpp provides some macros in system.H that allow you to set the translation limits to any values you like. mcpp in Standard mode issues a warning for source code that exceeds a limit guaranteed by Standard C. However, these messages are excluded from Classes 1 and 2, because they may be issued frequently, depending on the standard headers of compiler systems or on source programs.
The following warnings are not issued in a skipped #if group.
With __STDC_VERSION__ >= 199901L, the Standard specified translation limits are as follows:
Length of logical source line                                    4095 bytes
Length of string literal, character constant, or header name    4095 bytes
Identifier length                                                  63 characters
Depth of nested #includes                                          15
Depth of nested #ifs, #ifdefs, or #ifndefs                         63
Depth of nested parentheses in #if expression                      63
Number of macro parameters                                        127
Number of definable macros                                       4095
Note that the length of a UCN or multi-byte-character in an identifier is counted as the number of characters, not bytes. (A queer stipulation)
When mcpp is invoked with the -+ option to specify C++ preprocessing, the Standard's guideline on translation limits is as follows:
Length of logical source line                                   65536 bytes
Length of string literal, character constant, or header name   65536 bytes
Identifier length                                                1024 characters
Depth of nested #includes                                         256
Depth of nested #ifs, #ifdefs, or #ifndefs                        256
Depth of nested parentheses in #if expression                     256
Number of macro parameters                                        256
Number of definable macros                                      65536
Note that mcpp allows a maximum of 255 macro parameters; so, when the number reaches 256, mcpp issues an error.
The following warnings are excluded from Classes 1 and 2, because they would be issued too frequently.
The following two warnings are issued only in Standard mode.
This warning is issued only with the -K option in STD mode.
There is little chance that the indicated source code contains a bug, but these messages are issued to call attention to it. mcpp invoked with the -W8 option issues these warnings.
Even in a skipped #if group, whether preprocessing directives, such as #ifdef, #ifndef, #elif, #else and #endif, are balanced is checked. mcpp invoked with the -W8 option also checks non-conforming or unknown directives there. Standard mode issues a warning when the nesting depth of #if exceeds 8.
The following warnings are related to #if expressions. Given an expression of #if a || b, for example, if "a" is true, "b" is not evaluated. However, mcpp invoked with -W8 issues warnings even on non-evaluated sub-expressions, in which case the note "in non-evaluated sub-expression" is appended.
Trigraphs and digraphs are not used at all in an environment where they need not be used; if they are found in such an environment, attention needs to be paid. The purpose of the -W16 option is to find such trigraphs and digraphs. On the other hand, these warnings are very bothersome in an environment where trigraphs or digraphs are used on a regular basis, because they would be issued very frequently. For this reason, I set up a separate class for these warnings. In any case, mcpp issues these messages only when trigraphs or digraphs are enabled. The digraph warnings are for Standard mode only, and the trigraph warnings are for STD mode only.
The digraphs and the regular pp-tokens they correspond to are:

<% -> {
<: -> [
%: -> #
%> -> }
:> -> ]
%:%: -> ##

Therefore, the compiler proper does not need to handle digraphs. However, POSTSTD mode converts a digraph into the regular pp-token during translation phase 1. The difference between the modes appears when a # operator converts a digraph into a string: STD mode converts the digraph sequence directly into a string, while POSTSTD mode converts it into the regular pp-token first, and then into a string. In addition, if a string literal contains a character sequence equivalent to a digraph sequence, STD mode does not convert it, while POSTSTD mode converts it into the character sequence of the corresponding pp-tokens.
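For instance, here is a minimal sketch of the difference (the macro name str is only illustrative):

#define str( a)     # a

str( <%)        /* STD mode: "<%"     POSTSTD mode: "{"  */
str( %:%:)      /* STD mode: "%:%:"   POSTSTD mode: "##" */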
Diagnostic Message | Fatal error | Error | Warning class 1 | Warning class 2 | Warning class 4 | Warning class 8 | Warning class 16
---|---|---|---|---|---|---|---
"..." isn't the last parameter | 5.4.7 | ||||||
"/*" in comment | 5.5.1 | ||||||
"and" is defined as macro | 5.5.3 | ||||||
"defined" shouldn't be defined | 5.4.7 | ||||||
"MACRO" has not been defined | 5.5.3 | ||||||
"MACRO" has not been pushed | 5.5.3 | ||||||
"MACRO" is already pushed | 5.5.3 | ||||||
"MACRO" wasn't defined | 5.8 | ||||||
"op" of negative number isn't portable | 5.5.4 | 5.8 | |||||
"__STDC__" shouldn't be redefined | 5.4.7 | ||||||
"__STDC__" shouldn't be undefined | 5.4.8 | ||||||
"__VA_ARGS__" without corresponding "..." | 5.4.7 | ||||||
"__VA_ARGS__" cannot be used in GCC2-spec variadic macro | 5.4.7 | ||||||
## after ## | 5.4.7 | ||||||
#error | 5.4.10 | ||||||
#include_next is not allowed by Standard | 5.6 | 5.8 | |||||
#warning | 5.5.7 | ||||||
'$' in identifier "THIS$AND$THAT" | 5.6 | ||||||
16 bits can't represent escape sequence L'\x12345' | 5.4.6 | 5.8 | |||||
2 digraph(s) converted | 5.9 | ||||||
2 trigraph(s) converted | 5.9 | ||||||
8 bits can't represent escape sequence '\x123' | 5.4.6 | 5.8 | |||||
_Pragma operator found in directive line | 5.4.12 | ||||||
Already seen #else at line 123 | 5.4.3 | ||||||
Bad defined syntax | 5.4.5 | ||||||
Bad pop_macro syntax | 5.5.3 | ||||||
Bad push_macro syntax | 5.5.3 | ||||||
Buffer overflow expanding macro "macro" at "something" | 5.4.9 | ||||||
Buffer overflow scanning token "token" | 5.3.3 | ||||||
Bug: | 5.3.1 | ||||||
Can't open include file "file-name" | 5.4.11 | ||||||
Can't use a character constant 'a' | 5.4.5 | ||||||
Can't use a string literal "string" | 5.4.5 | ||||||
Can't use the character 0x24 | 5.4.5 | ||||||
Can't use the operator "++" | 5.4.5 | ||||||
Constant "123456789012" is out of range of (unsigned) long | 5.5.4 | 5.6 | 5.8 | ||||
Constant "1234567890123456789012" is out of range | 5.4.6 | 5.8 | |||||
Converted 0x0c to a space | 5.7 | ||||||
Converted [CR+LF] to [LF] | 5.5.1 | 5.6 | |||||
Converted \ to / | 5.6 | ||||||
Division by zero | 5.4.6 | 5.8 | |||||
Duplicate parameter names "a" | 5.4.7 | ||||||
Empty argument in macro call "MACRO( a, ," | 5.6 | ||||||
Empty character constant '' | 5.4.1 | 5.5.1 | |||||
Empty parameter | 5.4.7 | ||||||
End of file with no newline, supplemented the newline | 5.5.2 | ||||||
End of file with unterminated #asm block started at line 123 | 5.4.2 | 5.5.2 | |||||
End of file with unterminated comment, terminated the comment | 5.5.2 | ||||||
End of file with \, deleted the \ | 5.5.2 | ||||||
End of file within #if (#ifdef) section started at line 123 | 5.4.2 | 5.5.2 | |||||
End of file within macro call started at line 123 | 5.4.2 | 5.5.2 | |||||
Excessive ")" | 5.4.5 | ||||||
Excessive token sequence "junk" | 5.4.4 | 5.5.3 | |||||
File read error | 5.3.2 | ||||||
File write error | 5.3.2 | ||||||
GCC2-spec variadic macro is defined | 5.6 | ||||||
Header-name enclosed by <, > is an obsolescent feature | 5.6 | ||||||
I64 suffix is used in other than C99 mode "123i64" | 5.6 | 5.8 | |||||
Identifier longer than 31 bytes "very_very_long_name" | 5.7 | ||||||
Ignored #ident | 5.5.3 | 5.8 | |||||
Ignored #sccs | 5.5.3 | 5.8 | |||||
Illegal #directive "123" | 5.4.4 | 5.5.3 | 5.8 | ||||
Illegal control character 0x1b in quotation | 5.5.1 | ||||||
Illegal control character 0x1b, skipped the character | 5.4.1 | ||||||
Illegal digit in octal number "089" | 5.5.1 | ||||||
Illegal multi-byte character sequence "XY" in quotation | 5.5.1 | ||||||
Illegal multi-byte character sequence "XY" | 5.4.1 | ||||||
Illegal parameter "123" | 5.4.7 | ||||||
Illegal shift count "-1" | 5.5.4 | 5.8 | |||||
Illegal UCN sequence | 5.4.1 | ||||||
In #asm block started at line 123 | 5.4.3 | ||||||
Integer character constant 'abcde' is out of range of unsigned long | 5.5.4 | 5.6 | 5.8 | ||||
Integer character constant 'abcdefghi' is out of range | 5.4.6 | 5.8 | |||||
Less than necessary N argument(s) in macro call "macro( a)" | 5.4.9 | 5.5.5 | |||||
Line number "0x123" isn't a decimal digits sequence | 5.4.4 | 5.5.6 | |||||
Line number "2147483648" is out of range of 1,2147483647 | 5.4.4 | ||||||
Line number "32768" got beyond range | 5.5.6 | ||||||
Line number "32768" is out of range of 1,32767 | 5.5.6 | ||||||
Line number "32769" is out of range | 5.5.6 | ||||||
LL suffix is used in other than C99 mode "123LL" | 5.5.4 | 5.6 | 5.8 | ||||
Logical source line longer than 509 bytes | 5.7 | ||||||
Macro "MACRO" is expanded to "defined" | 5.5.4 | ||||||
Macro "MACRO" is expanded to "sizeof" | 5.5.4 | ||||||
Macro "MACRO" is expanded to 0 token | 5.5.4 | ||||||
Macro "macro" needs arguments | 5.8 | ||||||
Macro started at line 123 swallowed directive-like line | 5.5.5 | ||||||
Macro with mixing of ## and # operators isn't portable | 5.7 | ||||||
Macro with multiple ## operators isn't portable | 5.7 | ||||||
Misplaced ":", previous operator is "+" | 5.4.5 | ||||||
Misplaced constant "12" | 5.4.5 | ||||||
Missing ")" | 5.4.5 | ||||||
Missing "," or ")" in parameter list "(a,b" | 5.4.7 | ||||||
More than 1024 macros defined | 5.7 | ||||||
More than 31 parameters | 5.7 | ||||||
More than 32 nesting of parens in #if expression | 5.7 | ||||||
More than 8 nesting of #if (#ifdef) sections | 5.7 | 5.8 | |||||
More than 8 nesting of #include | 5.7 | ||||||
More than BLK_NEST nesting of #if (#ifdef) sections | 5.3.3 | ||||||
More than INCLUDE_NEST nesting of #include | 5.3.3 | ||||||
More than necessary N argument(s) in macro call "macro( a, b, c)" | 5.4.9 | ||||||
More than NEXP*2-1 constants stacked at "12" | 5.4.5 | ||||||
More than NEXP*3-1 operators and parens stacked at "+" | 5.4.5 | ||||||
More than NMACPARS parameters | 5.4.7 | ||||||
Multi-character or multi-byte character constant 'XY' isn't portable | 5.7 | 5.8 | |||||
Multi-character wide character constant L'ab' isn't portable | 5.7 | 5.8 | |||||
Negative value "-1" is converted to positive "18446744073709551615" | 5.5.4 | 5.8 | |||||
No argument | 5.4.4 | 5.5.3 | |||||
No header name | 5.4.4 | ||||||
No identifier | 5.4.4 | ||||||
No line number | 5.4.4 | ||||||
No space between macro name "MACRO" and repl-text | 5.5.3 | ||||||
No sub-directive | 5.5.3 | ||||||
No token after ## | 5.4.7 | ||||||
No token before ## | 5.4.7 | ||||||
Not a file name "name" | 5.4.4 | ||||||
Not a formal parameter "id" | 5.4.7 | ||||||
Not a header name "UNDEFINED_MACRO" | 5.4.4 | ||||||
Not a line number "name" | 5.4.4 | ||||||
Not a valid preprocessing token "+12" | 5.4.9 | 5.6 | |||||
Not a valid string literal | 5.4.9 | ||||||
Not an identifier "123" | 5.4.4 | 5.5.3 | |||||
Not an integer "1.23" | 5.4.5 | ||||||
Not in a #if (#ifdef) section | 5.4.3 | ||||||
Not in a #if (#ifdef) section in a source file | 5.4.3 | 5.5.3 | |||||
Operand of _Pragma() is not a string literal | 5.4.12 | ||||||
Operator ">" in incorrect context | 5.4.5 | ||||||
Old style predefined macro "linux" is used | 5.5.5 | ||||||
Out of memory (required size is 0x1234 bytes) | 5.3.2 | ||||||
Parsed "//" as comment | 5.6 | ||||||
Preprocessing assertion failed | 5.4.10 | ||||||
Quotation longer than 509 bytes "very_very_long_string" | 5.7 | ||||||
Recursive macro definition of "macro" to "macro" | 5.4.9 | ||||||
Removed ',' preceding the absent variable argument | 5.5.5 | ||||||
Replacement text "sub(" of macro "head" involved subsequent text | 5.5.5 | 5.8 | |||||
Rescanning macro "macro" more than RESCAN_LIMIT times at "something" | 5.4.9 | ||||||
Result of "op" is out of range | 5.4.6 | 5.8 | |||||
Result of "op" is out of range of (unsigned) long | 5.5.4 | 5.6 | 5.8 | ||||
Shift count "40" is larger than bit count of long | 5.5.4 | 5.6 | 5.8 | ||||
sizeof is disallowed in C Standard | 5.8 | ||||||
sizeof: Illegal type combination with "type" | 5.4.6 | 5.8 | |||||
sizeof: No type specified | 5.4.5 | ||||||
sizeof: Syntax error | 5.4.5 | ||||||
sizeof: Unknown type "type" | 5.4.6 | 5.8 | |||||
Skipped the #pragma line | 5.6 | ||||||
String literal longer than 509 bytes "very_very_long_string" | 5.7 | ||||||
The macro is redefined | 5.5.4 | ||||||
This is not a preprocessed source | 5.3.4 | ||||||
This preprocessed file is corrupted | 5.3.4 | ||||||
Too long comment, discarded up to here | 5.7 | ||||||
Too long header name "long-file-name" | 5.3.3 | ||||||
Too long identifier, truncated to "very_long_identifier" | 5.5.1 | ||||||
Too long line spliced by comments | 5.3.3 | ||||||
Too long logical line | 5.3.3 | ||||||
Too long number token "12345678901234" | 5.3.3 | ||||||
Too long pp-number token "1234toolong" | 5.3.3 | ||||||
Too long quotation "long-string" | 5.3.3 | ||||||
Too long source line | 5.3.3 | ||||||
Too long token | 5.3.3 | ||||||
Too many magics nested in macro argument | 5.4.9 | ||||||
Too many nested macros in tracing MACRO | 5.4.9 | ||||||
UCN cannot specify the value "0000007f" | 5.4.1 | 5.8 | |||||
Undefined escape sequence '\x' | 5.5.4 | 5.8 | |||||
Undefined symbol "name", evaluated to 0 | 5.7 | 5.8 | |||||
Unknown #directive "pseudo-directive" | 5.4.4 | 5.5.4 | 5.8 | ||||
Unknown argument "name" | 5.5.3 | ||||||
Unterminated character constant 't understand. | 5.4.1 | ||||||
Unterminated expression | 5.4.5 | ||||||
Unterminated header name | 5.4.1 | ||||||
Unterminated macro call "macro( a, (b,c)" | 5.4.9 | ||||||
Unterminated string literal | 5.4.1 | ||||||
Unterminated string literal, catenated to the next line | 5.5.1 | ||||||
Variable argument macro is defined | 5.6 | ||||||
Wide character constant L'abc' is out of range of unsigned long | 5.5.4 | 5.6 | 5.8 | ||||
Wide character constant L'abc' is out of range | 5.4.6 | 5.8 |
I developed the Validation Suite to verify conformance of preprocessing to Standard C/C++, and released it along with the mcpp source. The Validation Suite is intended to let you verify all of the Standard C preprocessing specifications. Of course, I used the Validation Suite to check mcpp. What is more, I have compiled mcpp on many compiler systems to verify its behavior. Therefore, I am confident that mcpp is now almost flawless, free of bugs and of misinterpretations of the specifications; however, I cannot deny the possibility that it still contains some bugs.
If you find strange behavior, do not hesitate to let me know. If you receive a diagnostic message saying "Bug: ...", it is undoubtedly a bug of mcpp or of the compiler system. (Probably mcpp's.) However illegal a user program may be, if mcpp loses control, it is mcpp that is to blame.
When you report a bug, please be sure to provide the following information:
Other than bugs, I would appreciate it if you gave me feedback on mcpp usage, the diagnostic messages, or this manual.
For your feedback or information, please post to "Open Discussion Forum" at:
or send via e-mail.
I completely rewrote DECUS cpp by Martin Minow to create a portable C preprocessor called mcpp V.2. mcpp stands for 'Matsui cpp'. This preprocessor is provided as source code and can be ported to various compiler systems by modifying some macros in the header files at compilation time. In addition, the executable has various behavioral modes, such as a Standard C (ISO/ANSI/JIS C) mode and others. Among these modes, the Standard C mode literally implements strict Standard C preprocessing.
While implementing this preprocessor, I also created a testing tool called the "Validation Suite for Standard C Conformance of Preprocessing". This suite also covers C++. This document explains the Validation Suite. The Validation Suite is open to the public as open-source software, along with this documentation, under a BSD-style license.
The Validation Suite became available to the public on NIFTY SERVE/FC/LIB 2 in August 1998, and also later on http://www.vector.co.jp/pack. It did not have a version number, however, and so is assumed to be version 1.0.
V.1.1 supports the C99 draft in August 1997 and is an update to V.1.0. V.1.1 was also made public on NIFTY SERVE/FC/LIB 2 and vector/software pack in September, 1998.
V.1.2 supports the official C++ Standard release and is a small update to V.1.1. It also became available on NIFTY SERVE/FC/LIB 2 and vector/software pack in November 1998.
V.1.3 supports the official C99 release and is an update to V.1.2. In addition, behavioral test samples were rewritten so that they can be used by the GCC / testsuite.
V.1.3, while it was under development, was adopted as the year 2002's "Exploratory Software Project" at the Information-technology Promotion Agency, Japan (IPA) by Yutaka Niibe Project Manager. From July 2002 to February 2003, the development continued under the grants-in-aid from IPA and Niibe PM's advice. The English version of the document was created under my supervision with the translation work outsourced to HighWell, Inc. In February 2003, mcpp V.2.3 and Validation Suite V.1.3 were released on m17n.org.
In addition, mcpp and Validation Suite were adopted as the year 2003's "Exploratory Software Project" by Hiroshi Ichiji Project Manager. This allowed an update to V.2.4 and V.1.4. *1
mcpp and the Validation Suite have kept being updated after the project. V.2.5 and V.1.5 were released in March 2005; Validation Suite V.1.5 changed the allocation of points and some other matters. In July 2006, mcpp V.2.6 and Validation Suite V.1.5.1 were released; Validation Suite V.1.5.1 updated the test results of the preprocessors. In November 2006, mcpp V.2.6.2 and Validation Suite V.1.5.2 were released. In April 2007, mcpp V.2.6.3 and Validation Suite V.1.5.3 were released. V.1.5.2 and V.1.5.3 of the Validation Suite have no substantial change, only small updates of this document. In May 2007, mcpp V.2.6.4 was released, and in March 2008, mcpp V.2.7 and Validation Suite V.1.5.4 were released; Validation Suite V.1.5.4 updated a few testcases and added test data on Visual C++ 2008. In May 2008, mcpp V.2.7.1 was released, and in November 2008, mcpp V.2.7.2 and Validation Suite V.1.5.5 were released; Validation Suite V.1.5.5 added test data on Wave V.2.0.
Note:
*1 The overview of the Exploratory Software Project can be found below (in Japanese only).
mcpp from V.2.3 through V.2.5 had been located at:
http://www.m17n.org/mcpp/
In April 2006, mcpp project moved to:
mcpp V.2.2 and Validation Suite V.1.2 are located at the following Vector web sites. They are in the directory called dos/prog/c, but they are not exclusively for MS-DOS: the sources are for UNIX, WIN32, and MS-DOS. The documents are in Japanese only.
http://www.vector.co.jp/soft/dos/prog/se081188.html
http://www.vector.co.jp/soft/dos/prog/se081189.html
http://www.vector.co.jp/soft/dos/prog/se081186.html
The text files in these archives available at Vector use [CR]+[LF] as a <newline> and encode Kanji in shift-JIS, for DOS/Windows. On the other hand, those from V.2.3 through V.2.5 available at SourceForge use [LF] as a <newline> and encode Kanji in EUC-JP, for UNIX. From V.2.6 on, two types of archives are provided: a .tar.gz file with [LF]/EUC-JP and a .zip file with [CR]+[LF]/shift-JIS.
ISO/IEC 9899:1990 (JIS X 3010-1993) had been used as the C Standard, but in 1999, ISO/IEC 9899:1999 was adopted as a new Standard. This document calls the former C90 and the latter C99. The former is generally called ANSI C or C89, because it migrated from ANSI X3.159-1989. ISO/IEC 9899:1990 plus its 1995 Amendment is sometimes called C95. The C++ Standards are ISO/IEC 14882:1998 and its corrigendum version ISO/IEC 14882:2003; this document calls both of them C++98.
The Standards referred to in this explanation are as follows.
C90:
    ANSI X3.159-1989 (ANSI, New York, 1989)
    ISO/IEC 9899:1990(E) (ISO/IEC, Switzerland, 1990)
    ibid. Technical Corrigendum 1 (ibid., 1994)
    ibid. Amendment 1: C Integrity (ibid., 1995)
    ibid. Technical Corrigendum 2 (ibid., 1996)
    JIS X 3010-1993 (JIS Handbook 59-1994, Tokyo, 1994, Japanese Standards Association)
C99:
    ISO/IEC 9899:1999(E)
    ibid. Technical Corrigendum 1 (2001)
    ibid. Technical Corrigendum 2 (2004)
C++:
    ISO/IEC 14882:1998(E)
ANSI X3.159 contained "Rationale." It was not adopted by ISO C90 for some reason, but reappeared in ISO C99. This "Rationale" is also referred to from time to time.
PDF versions of C99 and C++ Standards can be obtained online on the following sites. (The open-std.org site provides drafts which can be freely downloaded.)
C99, C++98, C++03:
C99+TC1+TC2:
C99 Rationale final draft in October, 1999:
Though this document was a text file in older versions, it was changed to an HTML file at V.1.5.2. This document uses the following typographical conventions:
Before explaining the Validation Suite, I will discuss the overall characteristics of Standard C (ANSI/ISO/JIS C) preprocessing. This is not a textbook-style explanation; rather, it is intended to point out the concepts and issues of Standard C by comparison with K&R 1st.
In this explanation, I will concentrate on the differences between K&R 1st and C90 first, then between C90 and C99, then between C90 and C++, in this order. C99 is available as a Standard, but it has not yet been implemented much on actual compiler systems; it is therefore more realistic to center on C90.
This chapter shows few samples; please refer to the Validation Suite, since it is itself a collection of samples.
There were endless varieties of dialects among pre-Standard C language implementations. Above all, there were almost no standards for preprocessing. The reason was that the preprocessing specification in the reference, the 1st edition of "The C Programming Language" by Kernighan & Ritchie, was too simplistic and ambiguous. In addition, preprocessing seems to have been thought of as a bonus to the language proper. Many features were nevertheless added to preprocessing by each implementation after K&R 1st: some supplemented flaws in the language proper, while others tried to maintain portability among different implementations. In any case there were too many differences among implementations; the truth is that it was nowhere close to being portable.
C90 provided a clear specification of preprocessing, which had been a cause of confusion for many years. Some well-known new features were added. What is more important, however, is that C90 provides virtually the first overall specification of preprocessing. A basic point of view on "what is preprocessing?", which had been vague thus far, can be seen everywhere in this specification. Preprocessing in C90 is not just K&R 1st + alpha. In order to understand this, I believe we need to grasp clearly not only the "new features" but also these basics. Unfortunately, however, the basics of preprocessing are not summarized together in the body of the Standard, and are mentioned only briefly in the "Rationale", a commentary on the Standard. Even more unfortunately, it contains incoherent parts which seem to be the result of compromises with conventional preprocessing. Therefore, I will summarize the basic characteristics of C90 preprocessing and examine their issues.
Characteristics different from pre-Standard processing or newly defined are summarized as the following four points.
These principles are examined below in turn.
No preprocessing procedure was described in K&R 1st, which caused much confusion. C90 specifies and defines the translation phases, which can be summarized as follows. *1

1. Trigraph sequences are replaced, and physical source characters are mapped to the source character set.
2. Every <backslash><newline> sequence is deleted, splicing physical source lines into logical lines.
3. The source is decomposed into preprocessing tokens and white spaces; each comment is replaced by one space.
4. Preprocessing directives are executed and macro invocations are expanded; a #included file is processed from phase 1 through phase 4, recursively.
5. Escape sequences in character constants and string literals are converted into the execution character set.
6. Adjacent string literals are concatenated.
7. The tokens are compiled (compilation proper).
8. Linking.
Needless to say, these steps do not actually have to be performed as separate phases, as long as they lead to the same result.
Among these phases, phases 1 to 4 (or to 6) belong to the range of preprocessing. Usually processing goes up to phase 4, since token separators such as <newline> need to be retained when the preprocessor is an independent program that outputs the preprocessing result as an intermediate file (if escape sequences such as \n were converted in phase 5, they could no longer be distinguished from <newline> and other token separators). This Validation Suite also tests up to phase 4.
Note:
*1 C90 5.1.1.2 Translation phases; C99 5.1.1.2 Translation phases. C99 added _Pragma() operator processing in phase 4. Some words were added to the description, but with no particular change in meaning.
*2 In the C99 draft of November 1997, a multi-byte character not included in the basic source character set was specified to be converted, before trigraph replacement, into a Unicode hexadecimal sequence of the \Uxxxxxxxx format, that is, a universal character name. This sequence is re-converted into the execution character set in phase 6. This is similar to the C++ Standard.
This specification is vague, and furthermore the load it puts on implementations is great. Fortunately, this phase 1 processing was deleted in the draft of January 1999, and it remains deleted in the official version of C99.
K&R 1st described line splicing by <backslash><newline> only for two cases -- continuing a long string literal and continuing a long #define line; no others were defined.
C90 clearly defines that <backslash><newline> is deleted in phase 2, before lines are decomposed into preprocessing tokens and token separators in phase 3. This allows any lines, and even any tokens, to be spliced.
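For example (a minimal sketch; note that the continuation lines must start at column 0, since any leading white space would remain after splicing):

int foo\
bar = 0;        /* spliced into the single identifier "foobar"      */
char *s = "abc\
def";           /* spliced into the single string literal "abcdef"  */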
Also, as trigraph processing is done in phase 1, the ??/<newline> sequence is similarly deleted. On the other hand, 0x5c code in the second byte of Kanji characters is not a <backslash> for the implementation with ASCII as its basic character set and Shift-JIS Kanji Characters as the multi-byte character encoding since one Kanji character is one multi-byte character.
It is good that the translation phases became clear, but I wonder whether it is necessary to support line splicing in the middle of a token. Although this was the only way in K&R 1st to write a long string literal that cannot fit on one line of the screen, there is no need to start a new line in the middle of a token on purpose, since adjacent string literals are concatenated in C90. Line splicing is required only for writing a long control line. If that were the only issue, it would have been better to reverse phases 2 and 3.
Still, the current specification seems to have been adopted for backward compatibility, so that source written for the K&R 1st way of continuing string literals by line splicing can still be processed. The specification is almost meaningless for new source in practice; however, it is probably appropriate, since it is simple, comprehensible, and easiest to implement.
The concept of the preprocessing token (pp-token, for short) was also introduced for the first time in C90. Since it does not seem to be widely known, however, I will summarize it first. The following are specified as pp-tokens. *1
header-name
identifier
preprocessing-number
character-constant
string-literal
operator
punctuator
Non-white-space character other than above
These look like nothing special at a casual glance, but they are quite different from tokens proper. Tokens are the following.
keyword
identifier
constant (floating-constant, integer-constant, enumeration-constant, character-constant)
string-literal
operator
punctuator
Pp-tokens differ from tokens in the following points.
Surprisingly, only string-literals and character-constants are the same. The most important differences of all are the absence of keywords and the existence of the preprocessing-number in place of numeric tokens. We will discuss these two items further.
Note:
*1 C90 6.1 Lexical elements C99 6.4 Lexical elements
In C99, the operator was absorbed into the punctuator, both as a pp-token and as a token. "Operator" became a term simply for operator functionality, not for a token type; the same punctuator token (punctuator pp-token) may function as a punctuator or as an operator depending on the context. Also, _Pragma was added as a pp-token operator.
A keyword is recognized for the first time in phase 7; in the preprocessing phases, a keyword is handled as an identifier. For preprocessing, an identifier is a macro name or an identifier that has not been defined as a macro. That means that even a macro with the same name as a keyword can be used. *1
This specification is indispensable in order to separate preprocessing from implementation-dependencies. This, for example, prohibits using a cast or sizeof in #if expressions. *2
Note:
*1 To be more precise, a parameter name in a macro definition is also an identifier. In addition, a preprocessing directive name is a special identifier and has a similar characteristic to a keyword. Whether this is a directive, however, is judged by syntax. If it is not in a valid place for directives, it is simply an identifier, which may be subject to macro expansion.
*2 Refer to 3.4.14.7 and 3.4.14.8.
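For instance, here is a minimal sketch of what these rules imply (the definitions are of course only illustrative):

/* Accepted in phase 4: "while" is a mere identifier here.          */
#define while( x)   for ( ; (x) ; )

/* Rejected: "sizeof" in an #if expression is a mere identifier,
 * too; since it is not a macro, it evaluates to 0, and the
 * expression below is a syntax error, not a size computation.
 *
 * #if sizeof (int) == 4
 * #endif
 */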
Preprocessing-numbers (pp-number, for short) are specified as below. *1, *2
digit
.digit
pp-number digit
pp-number nondigit
pp-number e sign
pp-number E sign
pp-number .
Non-digits are letters and underscores.
In summary, pp-number includes all floating-constants and integer-constants, and even non-numeric sequences such as 3E+xy. The pp-number was adopted to make preprocessing simple, and is considered helpful for tokenizing this kind of sequence before any semantic interpretation. *3
It is true that tokenization becomes simple; however, a non-numeric pp-number is not a valid token, and therefore it must disappear before preprocessing completes. Deliberate use of non-numeric pp-numbers in source is highly unlikely. The only case I can think of is that a numeric pp-number and another type of pp-token are concatenated into a non-numeric pp-number in a macro defined with the ## operator, and the result is stringized by a macro defined with the # operator. Any pp-token becomes a valid token once it is put into a string literal. However, without accepting the existence of non-numeric pp-numbers, the one generated by concatenation would not become a valid pp-token (the result would be undefined).
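A minimal sketch of that only case (the macro names are illustrative):

#define CAT( a, b)  a ## b
#define STR( x)     # x
#define XSTR( x)    STR( x)

/* "3E+" is itself one pp-number by the grammar above; ## glues it
   to "xy", yielding the non-numeric pp-number "3E+xy".  It never
   becomes a valid token, but stringizing it is legal:              */
XSTR( CAT( 3E+, xy))    /* -> "3E+xy" */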
Although this type of usage is extremely special and need not be examined in detail, pp-numbers provide an interesting subject matter in terms of token-oriented preprocessing.
Note:
*1 C90 6.1.8 Preprocessing numbers C99 6.4.8 Preprocessing numbers
*2 In C99, the sequences pp-number p sign and pp-number P sign were added to enable hexadecimal notation for floating point numbers. In addition, the nondigit above was replaced by identifier-nondigit. This change accompanied the approval of UCNs (universal character names) and implementation-defined multi-byte characters in identifiers. (Refer to 2.8.) In other words, a UCN can be used in a pp-number, and an implementation using multi-byte characters is supported. Although no UCN or multi-byte character appears in a valid numeric token, this is allowed for the cases of concatenation by the ## operator and stringizing by the # operator.
*3 C89 Rationale 3.1.8 Preprocessing numbers C99 Rationale 6.4.8 Preprocessing numbers
C90 acquired a pp-token concatenation capability with the ## binary operator in macro definitions. This is known as a "new feature" of C90; however, it is something introduced to replace a hack rather than a genuinely new feature. What I would like to call attention to here is that it is essential for token-oriented preprocessing.
The traditional token concatenation method used the specification that a comment was replaced by zero characters, known as the so-called "Reiser model" cpp. On other occasions, token concatenation occurred unintentionally in preprocessors with character-oriented operations, and there were hacks taking advantage of that, too. I would say that all of these exploited flaws of character-oriented preprocessing.
On the other hand, C90 allows explicit token concatenation by token-oriented operations. A source file is decomposed into sequences of pp-tokens and white spaces in translation phase 3. The only cases where pp-tokens are combined later are ## operator concatenation, # operator stringizing, header-name construction, and the concatenation of adjacent string literals or wide-character string literals. The handling of non-numeric pp-numbers is clear if their existence is considered in this context. That is to say, C90 tokenization has the following principles.
Pp-tokens are not concatenated implicitly. Concatenation must be done explicitly by ## operators. Pp-tokens concatenated once will never be separated again.
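A minimal sketch of the contrast (the macro names are illustrative):

/* Pre-Standard "Reiser model" hack: the comment was deleted with
   no space left, implicitly gluing the two parts together.  In
   C90 a comment becomes one space, so this yields two tokens:      */
#define OLD_CAT( a, b)  a/**/b

/* C90: concatenation is explicit, by the ## operator:              */
#define CAT( a, b)      a ## b

CAT( foo, bar)      /* -> the single identifier "foobar" */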
In pre-Standard character-oriented preprocessing, expansion of a macro call sometimes caused the resulting token sequence to be concatenated unintentionally with the tokens before and after it. This can be regarded as something that must not occur in token-oriented C90 preprocessing. *1
Note:
*1 Refer to 3.4.21.
In C90, the #if expression is evaluated in one pair of types: long and unsigned long. This also simplified preprocessing, and helped to reduce the implementation-dependent parts at the same time. Compared with the int size, which varies greatly between implementations, long/unsigned long is 32 bits for the most part, and sometimes 64 or 36 bits. This assures considerable portability for general #if expressions. *1, *2, *3
Note:
*1 C90 6.8.1 Conditional inclusion -- Semantics
*2 C99 6.10.1 Conditional inclusion
In C99, the type of the #if expression was changed to the maximum integer type of the compiler system. Since C99 requires long long/unsigned long long, this means long long/unsigned long long or wider. This, however, reduces the portability of #if expressions.
*3 There will be more implementations with a 64-bit long in the future, but I am not sure that is a good thing... By the way, I personally believe that integer type sizes should be defined differently. The constraint of sizeof (short) <= sizeof (int) <= sizeof (long) has, since the arrival of 64-bit compiler systems, constrained everything and left no type portable; it would be better to remove this constraint and to decide each type by absolute size.
The Standard C preprocessing specifications above allow the source code of a preprocessor itself to be written portably, because the preprocessor needs to know nothing about the implementation-dependent parts of the compiler proper. Only peripheral parts such as the following become problems when you actually try to write a preprocessor portably on Standard C compiler systems.
Items 2 and 3 above are necessary only when they are implemented for compatibility with existing implementations. I expect that 2 will be standardized in the #line 123 "filename" format, the same as in a source file. 3 is not necessary (logically, leaving convenience aside) for Standard C preprocessing. 4 can be written in such a form that no special implementation is required, depending on the source code (though implementation is easier for a basic character set if a table is used in the source code). 5 will not be a problem, since in reality the integer type sizes of a host implementation are seldom smaller than those of the target.
Needless to say, mcpp was also created on the grounds that these Standard C preprocessing specifications are independent of the compiler proper (though it contains many #if sections in order to assure portability, since mcpp is intended to be ported to many compiler systems).
In C90, the expansion of macros with arguments is specified by modeling it after function calls, and such a macro is called a function-like macro. If macros are contained in an argument, they are in principle expanded before the argument is substituted for the parameter.
This point was not clear in pre-Standard implementations. I suspect that most used the method in which an argument was substituted for a parameter without the macros within it being expanded, and those macros were then expanded at rescanning. So to speak, editor-like repetition of text replacement seems to lie behind this type of expansion. In general, editor-like repetition of text replacement is fine for macro expansion without arguments, but I doubt it should have been extended to macro expansion with arguments, as many preprocessors did.
However, this method causes strange ways of using macros, totally different from the function call-like appearance in the source. When nested macros with arguments are called, situations arise where it is not clear which argument is which, and implementation-specific behaviors increase. In short, the editor-like repetition of text replacement became too heavy a load as macro expansion in C grew to take on advanced features.
In consideration of this confusion, the grammar was organized in Standard C by positioning function-like macro calls as counterparts of function calls. The Rationale states the following principles on which the Standard C specifications of macros are grounded. *1
- Allow macros to be used wherever functions can be.
- Define macro expansion such that it produces the same token sequence whether the macro calls appear in open text, in macro arguments, or in macro definitions.
This stands to reason for function calls, but it had not held for macros with arguments. It is obvious that these are not the principles of editor-like repetition of text replacement.
Note:
*1 C89 Rationale 3.8.3 Macro replacement C99 Rationale 6.10.3 Macro replacement
What is essential for achieving the principle of macro expansion parallel to function calls is to expand the macros in an argument first, before substituting the argument for the parameter. Moreover, a macro call within an argument must be completed within that argument (it is an error if it is not completed); a macro within an argument must not absorb the text right after the argument. Thus, nested function-like macro calls can maintain logical clarity. *1
Note:
*1 Refer to 3.4.25.
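A minimal sketch of this behavior (the macro names are illustrative):

#define add( x, y)  ((x) + (y))
#define twice( x)   add( x, x)

/* The argument twice( 1) is expanded completely within itself
   before being substituted for the parameter x:                    */
add( twice( 1), 2)      /* -> ((((1) + (1))) + (2)) */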
Operands of the # operator, however, are not macro-expanded. Operands of the ## operator are not macro-expanded, either; the pp-token generated by concatenation becomes subject to macro expansion at rescanning. Why is this specification needed?
This specification is meaningful precisely when an argument includes macros. It is helpful when you want to stringize or concatenate a token sequence, including macro names, just as it is written. Conversely, to expand a macro before stringizing or concatenating, you wrap it in another macro that does not use the # or ## operator. In order for a programmer to be able to choose either of these, the specification that no macro expansion is performed on the operands of the # and ## operators is needed. *1
Note:
*1 Refer to 3.4.25, misc.t/PART 3 and 5. A typical example where the specification of no macro expansion for the operand of the # operator helps is the assert() macro. Refer to 5.1.2.
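A minimal sketch of this choice (str / xstr is the conventional idiom, also used in the Standard's example quoted later in this chapter):

#define str( s)     # s
#define xstr( s)    str( s)
#define FOO         foo

str( FOO)       /* -> "FOO" : the operand of # is not expanded      */
xstr( FOO)      /* -> "foo" : the wrapper expands its argument      */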
Now, after a macro call is replaced with its replacement list, and the arguments of a function-like macro are expanded and then substituted for the parameters in the replacement list, the replacement list is rescanned in search of more macro calls.
This rescanning has been a specification since K&R 1st, and the "editor-like repetition of text replacement" approach seems to lie in its background. In Standard C, however, the arguments of a function-like macro have already been expanded completely, except for the operands of the # and ## operators. What on earth is expanded at rescanning?
Rescanning is necessary for the macros in the replacement list other than parameters, and for the macros generated by ## operators. What else needs rescanning are the so-called cascaded macros, where macro "definitions" are nested. If the "arguments" of macro calls are nested, they usually do not get expanded again at rescanning, since they have been expanded within the nesting structure before rescanning (though there are exceptions; refer to 2.7.6 and 3.4.27).
Cascaded macros are expanded one after another, but this sometimes poses a problem: the case where the macro definition itself is recursive. If such a macro were expanded as is, the expansion would fall into infinite recursion. The same problem occurs not only in the direct recursive case, where a macro is included in its own definition, but also with indirect recursion among two or more definitions. In order to avoid this situation, Standard C adds the specification that "if the name of the macro being replaced is found during the rescan of the replacement list, it is not replaced." The phrasing is difficult, but its intention is easy to understand.
This is a point where the function-like macro has a grammar different from the function. It is also different from editor-like replacement. Since this is a macro-specific specification and has long served as a convenient behavior available only with macros, I think it is appropriate to keep it by defining it clearly.
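A minimal sketch of both cases (the macro names are illustrative):

#define foo     a foo b     /* direct recursion   */
#define m1      m2
#define m2      m1          /* indirect recursion */

foo     /* -> a foo b : the inner "foo" is not replaced again       */
m1      /* -> m1      : m1 -> m2 -> m1, and the rescan stops        */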
So far I have covered only the good, or simple and clear, aspects of the Standard C preprocessing specifications. Looking into more detail, however, there are parts that are irregular, or that lack utility or portability in proportion to their implementation overhead. Most of these exist because traditional or implicit pre-Standard preprocessing methods could not be settled. This kind of useless area, like an appendix, confuses the specification and makes implementation troublesome. Standard C also includes a few parts which caused new, unnecessary complications. These problems are sorted out below.
Although header-names enclosed by < and > have been used traditionally since K&R 1st, they are extremely exceptional and irregular as tokens. Because of this, Standard C has many undefined and implementation-defined parts regarding the header-name pp-token. For example, the behavior is undefined if the /* sequence is included. Also, where the sequence is not tokenized as one header-name, as when <stdio.h> is divided into the multiple pp-tokens <, stdio, ., h, and >, those parts must be combined into one pp-token as far as the #include line is concerned, and the method of doing so is implementation-defined. Tokenization is performed in translation phase 3; however, if the line turns out to be a #include directive in phase 4, tokenization needs to be redone. I would have to call this a very illogical specification. The processing in the case where a space exists between the (temporary) pp-tokens once tokenized in phase 3 is also implementation-defined. A directive such as #include <stdio.h> appears to have the most portability, but it has low portability with respect to preprocessing implementations. The irregularity increases even more when the argument of the #include line is a macro.
Header-names enclosed by " and " have no such problems. However, \ is not handled as an escape character, just as in header-names enclosed by < and >, and this is the difference from string literals. It is not illogical that no escape sequence exists in a header-name, which is processed in phase 4, since escape sequences are processed in phase 6. (\ within a header-name is undefined by the Standard; this must be a consideration to ease implementation. In reality, no problem occurs unless \ comes right before the closing " inside " and "; it is a little more complicated inside < and >.) *1
Also, the only difference between #include <stdio.h> and #include "stdio.h" is that the former searches a specific implementation-defined location, while the latter first searches (a relative path from) the current directory, and then the same location as <stdio.h> if the file is not found there. (As Standard C does not assume an OS, it does not use the term "current directory", but this is how it is interpreted on most operating systems.) Therefore, #include <stdio.h> can simply be written as #include "stdio.h".
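In other words, a minimal sketch (the user file name is only illustrative):

#include <stdio.h>      /* searched only in the implementation-
                           defined location                          */
#include "mylib.h"      /* searched first from the current directory,
                           then in the same location as <stdio.h>    */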
Having two kinds of header-name formats does bring the readability advantage of spotting at a glance the distinction between user-defined and system-provided headers. But there is no need to go to the trouble of providing an irregular token for that purpose: all it takes is to distinguish one from the other by different suffixes, as in "stdio.h" and "usr.H". (Just in case: it is acceptable for a system not to distinguish uppercase from lowercase in filenames, since this is a readability issue; naturally, the names could also be "usr.hdr", "usr.usr", "usr.u", etc.)
I believe that header-names enclosed by < and > should be abolished, since they serve no purpose as a language specification and complicate preprocessing tokenization. They cannot be abolished out of the blue, but I would like them to be specified as an obsolescent feature.
Note:
*1 UCN starting with \ was introduced by C99, which is a little troublesome.
The next problem is the handling of white-space token separators between the pp-tokens of a # operator's operand: one or more white spaces are compressed into one space, and no space is inserted where there was none.
This is a half-defined specification. In order to ensure token-based operations, the existence of token separators must have no influence. For that purpose, it should have been defined either that all token separators are deleted, or that one space is placed between every pair of pp-tokens. C89 Rationale 3.8.3.2 (C99 Rationale 6.10.3.2) states that the # operator was decided this way "as a compromise between token-based and character-based preprocessing discipline".
This compromise led to an extra burden rather than easing preprocessor implementation, and brought ambiguity into complicated macro expansions as well. The example below is shown as Example 4 of C90 6.8.3 / C99 6.10.3 Macro replacement -- Examples.
#define str(s)      # s
#define xstr(s)     str(s)
#define INCFILE(n)  vers ## n

#include xstr(INCFILE(2).h)
This #include line is expanded as:
#include "vers2.h"
This example is filled with problems. There is no vagueness in INCFILE(2) being replaced by vers2. However, the expansion result of INCFILE(2).h, the argument of xstr(), is a sequence of three pp-tokens: vers2, ., and h. The expansion example in the Standard treats these three pp-tokens as having no white space between them. This involves several issues.
All this ambiguity and complexity comes from the incompleteness of the token separator handling in the operands of # operators.
I think it would be better to specify that the # operator stringizes its operand after separating each pp-token with a single space, in order to avoid implicit concatenation of pp-tokens and the resulting complicated problems, and to show exactly what kind of pp-token sequence the stringized argument was. If it were defined that way, this macro would expand to "vers2 . h"; needless to say, this would not be an appropriate macro.
As this example shows, the only case where inserting a space where none exists is troublesome is a macro for the #include line that uses the # or ## operator. A #include line, processed in translation phase 4, cannot use the concatenation of string literals, which is processed in phase 6. However, a macro for a #include line can simply be defined as one string literal, without bothering to parameterize it using the # and ## operators. Sacrificing token-based principles just for this parameterization is a great loss on balance.
In the Standard C preprocessing specification, the syntax is token-based, while the semantics of the # operator is suddenly character-based, losing logical consistency.
Moreover, this example of the Standard assumes the specification, which is not necessarily clear from the Standard text. It is an inappropriate example and should be deleted.
Note:
*1 mcpp was compelled to be implemented in the same way.
There is a specification similar to the white-space handling in # operator operands with regard to macro re-definition. It is defined as follows: a macro re-definition must be equivalent to the original definition; to be equivalent, the number and names of the parameters must be the same and the replacement lists must have the same spelling; for white spaces in the replacement list, however, only their existence must agree, while their number may differ.
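A minimal sketch of this rule (the macro is illustrative):

#define add( x, y)  ((x) + (y))
#define add( x, y)  ((x)  +  (y))   /* OK: only the number of white
                                       spaces differs               */
/* Not equivalent, hence constraint violations:
#define add( x, y)  ((x)+(y))         -- existence of white space
                                         differs
#define add( a, b)  ((a) + (b))       -- parameter names differ
*/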
If the specification of the # operator is as above, this is an obvious conclusion, since the same handling is necessary for the white spaces in a replacement list. Still, the root of the problem is the specification of the # operator.
If the # operator is handled in such a way that one space exists between every pp-token in operands, there will be no issue regarding the existence of white spaces for macro re-definition.
Moreover, this can be generalized in a preprocessor implementation by placing, as a principle, one space between every pair of pp-tokens in the source. By doing so, tokenization for macro expansion can be done easily and accurately. However, there are two exceptions to this principle: one is <newline> in a preprocessing directive line, and the other is whether white space exists between a macro name and the subsequent '(' in a macro definition. These have traditionally been the basis of preprocessing in C and cannot be changed after all these years.
I mentioned in 2.7.3 that parameter names must match at macro re-definition in the specification, but I believe this is an excessive requirement. Parameter names, of course, make no difference to macro expansion. However, in order to check re-definitions, a preprocessor needs to store the parameter names of all macro definitions, and within the specification their only use is this re-definition check. It is not a great idea to impose overhead on implementations only for an almost meaningless check.
I think it is better to remove the specification that parameter names must match at macro re-definition.
The #if expression, the argument of the #if line, is an integer constant expression. Its evaluation must be independent of the execution environment, since it is done in preprocessing. Because of that, casts, the sizeof operator, and enum constants, which require references to the execution environment (they are first evaluated in translation phase 7), are excluded from #if expressions, in contrast to standard integer constant expressions. Character constants (and wide character constants), however, are not excluded.
Character constant evaluation is implementation-defined in many respects, and has little portability.
Thus, little can be predicted about how the evaluation is done, since character constant values in #if expressions have no portability among implementations, and may differ even within one implementation depending on the translation phase.
In general, the specification of the C language integer types has few ambiguous parts. Negative value handling in computation is implementation-defined, but it is CPU-dependent, and there are few parts that implementers can decide at their own option -- with character constant evaluation as the one exception. There, many aspects are determined at the discretion of the implementer, beyond the CPU specification, the basic character set, and the multi-byte character encoding.
For the character constants of #if expressions, the range of implementation discretion increases immensely, and no agreement is guaranteed between translation phases. Even when such a constant is evaluated, it is hardly clear what has been evaluated. Character constant evaluation should normally be thought of as requiring a reference to the execution environment. Standard C preprocessing removed the operations that require such a reference, yet somehow spared character constants alone; a specification that must not require the reference was forced into existence, and that created the ambiguity.
How is this type of character constant in #if expressions used? The value of a char variable is often compared with a character constant in the compilation phase, but there is no such usage in the preprocessing phase, where no variables exist. I cannot think of any appropriate example of using a character constant in an #if expression. It is a useless thing and should be removed from #if expressions, just like casts and sizeof. Removing it would affect even less existing source than removing casts or sizeof would.
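A minimal sketch of the problem:

/* The value of 'A' here is implementation-defined, and may even
   differ from the value of 'A' in the later compilation phase:     */
#if 'A' == 65
    /* reached on ASCII-based implementations, but not guaranteed */
#endif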
A macro call is first replaced with the replacement list, which is then rescanned. The messiest thing in the rescanning specification of Standard C is that the token sequence after the macro call is rescanned together with the replacement list, as if it followed the replacement list. This deviates completely from the principle of modeling the function-like macro after function calls, and is the most outstanding factor making macro expansion incomprehensible. I think that this specification of the subsequent token sequence as a rescanning subject should be removed, and that the rescanning subject should be limited to the replacement list alone.
Actually, rescanning the subsequent token sequence seems to have been a long-standing implicit specification since around K&R 1st. It is no longer necessary in Standard C, but remains as a legacy, like an appendix. Since this issue concerns the basis of macro expansion, I will study it further in detail below.
It is not easy to describe the macro rescanning method in writing, and the text in the Standard or in K&R 2nd is not easy to understand either. For example, K&R 2nd A.12 says "the replacement list is repeatedly rescanned". The Standard, however, does not say "repeatedly": it can be read as one rescanning; it can also be read as recursive rescanning, though it is not clearly described as such either.
This cannot be explained accurately without using an actual example. Furthermore, it cannot be understood intuitively without explaining the implementation method. That is how close macro rescanning is to the traditional implementation of macro expansion.
First, let us take a silly example. To simplify the problem, assume that x and y are not macros. How is this macro call expanded?
#define FUNC1( a, b)    ((a) + (b))
#define FUNC2( a, b)    FUNC1 OP_LPA a OP_CMA b OP_RPA
#define OP_LPA          (
#define OP_RPA          )
#define OP_CMA          ,

FUNC2( x, y)

1: FUNC2( x, y)
2: FUNC1 OP_LPA x OP_CMA y OP_RPA
3: FUNC1 ( x , y )
It becomes clear at once that 1: is replaced with 2:, and that 3: is generated by rescanning 2:. Then, is 3: a macro call? More specifically, should it be rescanned again from the beginning?
Is rescanning something repeated many times from the beginning, or something whose applicable range is gradually narrowed down recursively? The truth is neither.
As a matter of fact, rescanning seems to have been performed as a certain exceptional kind of recursion, or as a certain kind of repetition leading to the same result. Its classical example is the one in the Macro Processing chapter of "Software Tools" by Kernighan & Plauger. This was later developed into the M4 macro processor, and is not itself a C preprocessor; it is indicated there that the macro processor was originally designed and implemented in C by Ritchie. Here we can see the prototype of preprocessor implementations.
In this macro processor, rescanning is realized by sending the replacement list back to the input for re-reading whenever there is a macro call. When there is another macro call in the replacement list, the new replacement list is sent back and re-read as if "it had been in the input originally". As the book puts it, "it provides an elegant way to implement the rescanning of macro replacement text" and "this is how we handle the recursion implicit in nested sources of input"; this method greatly helps the macro processor program to be well structured and easily understood.
Many C preprocessors perform rescanning by putting the replacement list into a pseudo input, a kind of stack, and re-reading it.
In the example above, if FUNC1 turns out not to be a macro call at the point when 2: is rescanned, that token is settled then and there, and the subsequent replacement applies to OP_LPA and what follows, whether by repetition or by recursion. When OP_LPA is replaced with ( and that turns out not to be a macro, x and what follows become the next subject. In this way tokens are settled sequentially from the beginning, and 3: is the final result. It is no longer a macro call.
This method, in use since "Software Tools" (or even before), is certainly a concise implementation method. Though not mentioned in "Software Tools", it also has a pitfall: since the replacement list sent back to the input is read consecutively with the source, rescanning may scan the part after the original macro call, beyond the replacement list. With nested macros, the nesting level may shift unnoticed during rescanning. A macro without arguments that expands into the name of a macro with arguments, or an abnormal macro whose replacement list comprises the first half of another macro call with arguments, causes this situation.
#define add( x, y)  ((x) + (y))
#define head        add(

head a, b)
This is the example. This strange macro call is expanded to ((a) + (b)). For some reason, this ended up being officially acknowledged by Standard C: in fact, this macro is legal rather than undefined.
I cannot believe that C preprocessors were intended for abnormal macros like this. Rather, I suspect that the original C preprocessor implementation was as described above and so happened to expand these macros silently somehow, that some programs consciously took advantage of these holes, and that this became a de facto standard specification which was finally approved in Standard C. That is to say, a small defect in the original C preprocessor implementation led to a strange de facto standard and left its trail even in Standard C. This is the reason an appendix is an appendix.
Now, returning to the topic of whether rescanning is recursive or repetitive: I believe it is not necessarily wrong to call it either an irregular recursion or an irregular repetition. It is recursion, but with the strange characteristic that its applicable range is not always narrowed down, as in ordinary recursion, but rather gradually shifted. It is also repetition; however, the repetition is not from the beginning, but from the middle, gradually including the following parts.
Therefore, it is possible to process the whole text, after all comments and preprocessing directives are processed, from beginning to end with this shifted rescanning alone. In fact, such a method is used in "Software Tools", and some current C preprocessor sources use a similar way. In other words, rescanning is synonymous with macro expansion, and indeed with the macro expansion of the whole text.
The fact that the rescanning subject shifts gradually causes many problems. The next example was given in C89 Rationale 3.8.3.4 (C99 Rationale 6.10.3.4) as an example of a macro whose expansion is unclear. It states that the reason this behavior was not defined in the specification was that "the Committee saw no useful purpose in specifying all the quirks of preprocessing for such questionably useful constructs". This example is suggestive, however; rather, it was impossible to define this behavior as a specification.
#define f(a)    a*g
#define g(a)    f(a)

f(2)(9)
In this example, f(2) is first replaced with 2*g. If the "subsequent preprocessing tokens" were not to be rescanned, macro expansion would be completed here, and f(2)(9) would become the token sequence 2*g(9). However, as the "subsequent token sequence" is included, this g(9) forms a macro call and is replaced with f(9). Here it is not clear whether this f(9) should be replaced with 9*g again, or whether the rule of no re-replacement of the macro with the same name applies. The token sequence f(9) is generated by rescanning the continuation of g -- that is, the end of the first replacement result of f(2) -- together with the (9) of the "subsequent token sequence", and it is unclear whether this lies inside or outside the nesting of the f(2) call.
This problem was corrected in C90 Corrigendum 1, which adds the next example to Annex G.2 Undefined behavior.
-- A fully expanded macro replacement list contains a function-like macro name as its last preprocessing token (6.8.3).
This correction, however, only causes more confusion.
First of all, the meaning of the wording "fully expanded macro replacement list" is not clear. It can only be interpreted as "the replacement list after the macros within the arguments are expanded, if there are arguments". In that case, in the f(2)(9) example, f(2) is replaced with 2*g before any consideration of re-replacement of the macro with the same name, and the behavior is already undefined when this is rescanned, since g is a function-like macro name. In other words, whenever f is called under these definitions of f and g, the behavior is undefined.
If this "correction" is applied, the following example for macro rescanning in ISO/IEC 9899:1990 6.8.3 Examples will be undefined to begin with.
#define f(a)    f(x * (a))
#define x       2
#define g       f
#define w       0,1
#define t(a)    a
t(t(g)(0) + t)(1);          /* f(2 * (0)) + t(1); */
g(x+(3,4)-w)                /* f(2 * (2+(3,4)-0,1)) */
The Standard states that these macro calls are expanded as shown in the comments, but that is not the case if the Corrigendum is applied. With these definitions of f and g, the behavior becomes undefined whenever the identifier g appears, because f, a function-like macro name, is the only, and hence the last, pp-token in the replacement list of g.
t(t(g)(0) + t)(1)
At first, the argument of the first t call will be expanded.
t(g)(0) + t
Since there is another macro call, t(g), it will be expanded, but its argument must be expanded first.
g
And if this is replaced with f, it will become undefined here.
Even if replacements are continued as is, it will become:
t(f)
f
And it will be undefined again since the last pp-token of the t(f) expansion result is f. If replacements continue further, it will become:
f(0) + t
f(x * (0))
f(2 * (0))
f(2 * (0)) + t
t(f(2 * (0)) + t)
f(2 * (0)) + t
This completes the expansion of the first t call in any case, but the behavior is undefined for the third time, since the end of this replacement list is the function-like macro name t.
How about the following?
g(x+(3,4)-w)
This will be undefined by the time g is replaced with f.
The result is confusion: the Examples contradict G.2.
Even leaving the examples in Examples aside, the correction in the Corrigendum does not relieve the confusion. First of all, G.2 is not a normative part of the Standard, and this addition has no grounds in the text of the Standard proper, which says only that the "subsequent token sequence" is also to be rescanned. Secondly, even if this Corrigendum were included in the Standard proper,
#define head add(
in the previous example is correct, since the macro name add is not the last pp-token of the replacement list, whereas
#define head add
is undefined. This is too unbalanced. There is also the problem that the wording "fully expanded" is unclear in meaning. *1, *2
It goes without saying that these are quirks brought about by making the "subsequent token sequence" a subject of rescanning. The more plausible the Standard tries to make this sound, the more confusing it gets. The Standard states the rule forbidding re-replacement of a same-named macro in extremely difficult sentences, and part of that difficulty also comes from these quirks.
On the other hand, Standard C specifies that, for a function-like macro call, macro expansion within an argument is performed only within that argument. This is no wonder: it would be turmoil if macro expansion within an argument could eat up the text behind it.
As a result, however, an imbalance arises between macros within an argument and those outside.
#define add( x, y)      ((x) + (y))
#define head            add(
#define quirk1( w, x, y)    w x, y)
#define quirk2( x, y)   head x, y)
head a, b);
quirk1( head, a, b);
quirk2( a, b);
In the quirk1() call, after the first argument, head, is replaced with add(, the rescan finds an incomplete macro call, which is a constraint violation. Put simply, it is an error. However, quirk2() and head a, b) are not errors, but are expanded as:
((a) + (b))
It may sound repetitious, but this kind of absurdity all comes from the fact that even the "subsequent token sequence" is subject to macro rescanning in general. As a matter of implementation, even with the method of sending the replacement list back to the input, nesting-level information must be added so that argument expansion can be performed independently of the other parts of the text. With that information available, it would be easy not to rescan the "subsequent token sequence" in general. Rather, under the current half-baked specification, the processing has to change depending on whether it is inside an argument or not, which puts an extra load on implementations.
Macro expansion in C has traditionally been influenced by editor-like string replacement. We can say that pre-Standard macro expansion is editors' string replacement with features piled on, complicated to excess.
By contrast, Standard C took the trouble of naming the macro with arguments a "function-like macro". I guess it tried to bring the call syntax closer to a function call. The specification that a macro in an argument is fully expanded before being substituted for the parameter, and the specification that this expansion is performed only within the argument, conform to this principle. However, the principle is spoiled by the specification that macro rescanning in general includes the subsequent token sequence. That is an inheritance of repeated text replacement from its ancestor.
If the subsequent token sequence were removed from the subjects of rescanning, macro expansion could have been defined as completely recursive, with the applicable range narrowed (or at least not extended) forward or backward on every recursion. It would then have been clearly a macro worthy of the name "function-like macro". I cannot think there is much source code that would have been broken by this decision. I can only think that the ANSI C committee could not make the decision to cut off an appendix inherited from its ancestor. *3
I wished C99 would cut it off cleanly, but the appendix has survived again.
Note:
*1 An object-like macro which expands into a function-like macro name is sometimes seen in actual programs, as below.
#define add( x, y)  ((x) + (y))
#define sub( x, y)  ((x) - (y))
#define OP          add
OP( x, y);
This is not as abnormal as the earlier expansion into the first half of a function-like macro call, but there is no reason why it must be written this way. It is better to define a function-like macro nesting another function-like macro, as below.
#define OP( x, y) add( x, y)
*2 The reason for this correction by the Corrigendum is found in the "Record of Responses to Defect Reports" by the C90 ISO C committee (SC22/WG14) (#017 / Question 19). The question on the macro expansion of f(2)(9) in ANSI C Rationale 3.8.3.4 was brought up again. The direct issue in this example must have been the scope of the rule "prohibiting re-replacement of a same-named macro", but the committee answered it as a general problem not limited to same-named macros. They did not realize that this interpretation could contradict the Examples.
In addition, the wording "fully expanded" is strange. When f(2) is replaced with 2*g and rescanned up to g, is it fully expanded? If so, no more replacement is performed, and therefore nothing becomes undefined, either. If it is not yet fully expanded, g is rescanned together with the succeeding (9) and replaced by f(9). If that is fully expanded, the last pp-token of 2*f(9) is not a function-like macro name, so the answer does not apply here either. In other words, the response says "after macro expansion is completed" when the very issue is when macro expansion ends. Thus, when macro expansion ends became still more confusing.
In the C99 draft of November 1997, this Corrigendum item was included in Annex K.2 Undefined behavior, but in the draft of August 1998 it was deleted and replaced by the following paragraph in Annex J.1 Unspecified behavior, which was eventually adopted in C99.
When a fully expanded macro replacement list contains a function-like macro name as its last preprocessing token and the next preprocessing token from the source file is a (, and the fully expanded replacement of that macro ends with the name of the first macro and the next preprocessing token from the source file is again a (, whether that is considered a nested replacement.
It seems that the committee finally realized the contradiction in the Corrigendum. The fundamental problems in the text proper of the Standard, however, still remain, and when macro expansion ends is, in the end, unspecified. Furthermore, distinguishing by the presence of '(' in the source means the result differs depending on whether the same macro call appears in the source or in the replacement list of another macro. This is an inconsistent specification.
On this issue, refer also to section 3.4.26.
Also, the specification of macro expansion in the C++ Standard is the same as in C90, with neither an equivalent of C90 Corrigendum 1 nor the specification added in Annex J.1 of C99.
*3 Even with this decision, FUNC2( x, y) in the previous example would become FUNC1 (x, y) during argument expansion if it appeared in the argument of another macro call, and would be expanded again into ((x) + (y)) at the rescanning of the outer macro. In other words, the final expansion results would differ depending on whether it appears in an argument or not. However, this is a problem of another level and not an inconvenience.
With respect to ISO/IEC 9899:1990, Corrigendum 1 was released in 1994, Amendment 1 in 1995, and finally Corrigendum 2 in 1996.
Corrigendum 1 mostly contains trivial corrections of wording; only 2 affect preprocessing. One is the one regarding macro rescanning described in 2.7.6 above.
The other is an extremely special-case specification regarding macro names in macro definitions that include $ or similar characters.
In Standard C, '$' is not accepted as a character of an identifier, though traditionally there have been implementations allowing it. In the example of 18.9 in test-t/e_18_4.t, $ is a pp-token by itself in Standard C. The macro name is then THIS, and everything from $ on becomes the replacement list of an object-like macro, a result totally different from the intention of the program, which is a function-like macro named THIS$AND$THAT.
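The sample in question is of this sort (a sketch of mine; the actual text is in test-t/e_18_4.t):
#define THIS$AND$THAT( a, b)    ((a) + (b))
    /* intended: a function-like macro named THIS$AND$THAT        */
    /* Standard C: an object-like macro THIS whose replacement    */
    /* list is the pp-token sequence  $AND$THAT( a, b) ((a) + (b)) */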
In Corrigendum 1, an exception was added for this kind of example: if the replacement list of an object-like macro begins with a non-basic character, the macro name and the replacement list must be separated by white space. A Standard C implementation must therefore issue a diagnostic message for the example in 18.9. This is meant to prevent source using $ or @ in macro names from being silently preprocessed into an unintended result. It is a painstaking specification, but it is annoying that exceptions of this kind keep increasing. In implementations not accepting $ and/or @ in identifiers, such macros always cause an error in the compilation phase even if they pass preprocessing, so this exception does not really seem necessary. *1
In addition, ISO 9899:1990 had an ambiguous constraint that the pp-token header-name can appear only in #include directives. This was corrected in Corrigendum 1 so that a header-name is recognized only in a #include directive.
The core of Amendment 1 is multi-byte characters, wide characters, and the library functions operating on such strings; accordingly, the <wchar.h> and <wctype.h> standard headers were added. In addition, the <iso646.h> standard header and the specification of digraphs were added as alternatives to trigraphs, that is, as notations for tokens and pp-tokens using characters not included in the ISO 646 character set. <iso646.h> is a quite simple header which defines some operators as macros, and it poses no special problems. *2
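For reference, the macros of <iso646.h> are essentially the following (listed from memory of Amendment 1; the layout is mine):
#define and     &&
#define and_eq  &=
#define bitand  &
#define bitor   |
#define compl   ~
#define not     !
#define not_eq  !=
#define or      ||
#define or_eq   |=
#define xor     ^
#define xor_eq  ^=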
The problem is the digraph. It is very similar to the trigraph, and its usage is almost the same, yet its position in preprocessing is completely different. A trigraph is a character and is converted into the corresponding regular character in translation phase 1, while a digraph is a token (pp-token). If a digraph sequence is stringized by the # operator, it must be stringized as is, without conversion (this # itself may also be written as %: in digraph form.) Because of this, and only this, implementations need to retain digraphs as pp-tokens at least until phase 4 completes. If they are converted at all, it happens later, converting the digraph sequences remaining as tokens but not those within string literals.
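For instance (a minimal sketch; str is a hypothetical macro name):
#define str( a)     # a
str( %:%:)      /* must become "%:%:", not "##" */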
This imposes an unnecessary burden on implementations. It would be more concise for an implementation to recognize a digraph as a character, just like a trigraph, and convert it in phase 1. There is no benefit in keeping it as a pp-token. The Amendment itself notes that the difference between a digraph and a usual token shows up only when they are stringized. One might object that writing a string literal like "%:" would be troublesome if conversion were done in phase 1, but the same holds for trigraphs, and it is too special a problem to matter. If it must be written, "%" ":" suffices. Digraphs should be re-positioned so that they are converted in phase 1 as alternatives to trigraphs.
Corrigendum 2 has no corrections regarding preprocessing.
Note:
*1 This specification disappeared in C99 and C++ Standard.
In C99, the following generalized specification was added to 6.10.3 Macro replacement/Constraints instead.
There shall be white space between the identifier and the replacement list in the definition of an object-like macro.
As this too is an exception specification concerning tokenization, it is not praiseworthy.
*2 In the C++ Standard, these identifier-like operators are tokens, not macros. It is difficult to understand why (could it be an idea to cut down what preprocessing must do?), and it is troublesome for implementations in any case.
C90 5.1.1.2 and C99 5.1.1.2 Translation phases, phase 3, contain a redundant specification, though it is harmless.
A source file shall not end in a partial preprocessing token or comment.
Since translation phase 2 specifies that a source file must not end without a <newline> or with <backslash><newline>, a source file that passes phase 2 always ends with a <newline> not preceded by <backslash>. It can never end with a partial preprocessing token. As for the categories of partial preprocessing tokens, there are ", ', <, and > unmatched within a logical line, but these are considered undefined by C90 6.1 Lexical Elements/Semantics and are not problems limited to the end of the source. The words "partial preprocessing token or" are unnecessary.
C90 6.8.1, C99 6.10.1 Conditional inclusion/Constraints contains wording which invites misunderstanding.
it shall not contain a cast; identifiers (including those lexically identical to keywords) are interpreted as described below;
This "it shall not contain a cast; " is superfluous. In the succeeding parts and Semantics, it is made clear that all identifiers including an identifier the same as a keyword are expanded if macro and remaining identifiers are evaluated as 0. A cast does not need to be considered. In the (type) syntax, it is clear that type is handled as a simple identifier.
On the contrary, if this is in a constraint, it can be interpreted that the implementation recognizes the cast syntax and must output a diagnostic message. That is not the intention of the Standard. There is no keyword in translation phase 4, cast has no way to be recognized. As far as this is concerned, sizeof is also the same. It is strange that only cast is mentioned without mentioning sizeof. This type of wording is called "superfluous."
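To illustrate (a sketch of mine):
#if     (int)1  /* 'int' is no keyword in phase 4: not a macro, so     */
#endif          /* it is evaluated as 0; (0)1 is then a mere syntax    */
                /* error in the #if expression -- a cast is never seen */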
The following translation limits regarding preprocessing were added in C99.
Length of a source logical line                                 4095 bytes
Length of a string literal, character constant, and header name 4095 bytes
Length of an internal identifier                                63 characters
Number of #include nesting                                      15 levels
Number of #if, #ifdef, #ifndef nesting                          63 levels
Number of parenthesis nesting in an expression                  63 levels
Number of parameters of a macro                                 127
Number of macros definable                                      4095
Variadic macros (variable argument macros) work as below. Given the macro definition,
#define debug(...) fprintf(stderr, __VA_ARGS__)
a macro call,
debug( "X = %d\n", x);
is expanded as:
fprintf(stderr, "X = %d\n", x);
In other words, ... in the parameter list stands for one or more arguments, and __VA_ARGS__ in the replacement list corresponds to them. Even if multiple arguments correspond to ... in a macro call, the result of merging them, including the ','s, is handled as one argument.
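For example, with the same definition of debug(), the call
debug( "X = %d, Y = %d\n", x, y);
is expanded as:
fprintf(stderr, "X = %d, Y = %d\n", x, y);
Here the three arguments corresponding to ..., commas included, merge into the single __VA_ARGS__ argument.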
Among the undefined behaviors of C90, there are some for which adequately meaningful interpretations are possible. An empty argument in a macro call is one of them, and there are cases where interpreting it as an argument of 0 pp-tokens is useful. This became a valid argument in C99.
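A sketch of the C99 behavior (str is a hypothetical macro name):
#define str( a)     # a
str()       /* the empty argument consists of 0 pp-tokens;    */
            /* the # operator applied to it yields ""         */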
C99 introduces an operator called _Pragma, where _Pragma( "foo bar") is converted into #pragma foo bar. In C90, the argument of a #pragma line is not macro-expanded, a line resembling a #pragma directive produced by macro expansion is not handled as a directive, and #pragma cannot be written in the replacement list of a macro definition. The _Pragma expression, on the other hand, can be written in a macro replacement list, and the #pragma resulting from it is handled as a directive. The _Pragma extension tries to improve the portability of the cumbersome #pragma.
It would be simpler to specify instead that the argument of #pragma is subject to macro expansion, without this kind of irregular extension; that would largely achieve the intended portability improvement. In that case, however, there would still remain the restriction that #pragma cannot be written in a macro, and #pragma arguments that must not be macro-expanded would have to be renamed to start with __ to keep them out of the user name space. Though the _Pragma() operator is irregular, its implementation is not so troublesome, and it is a reasonable specification.
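For illustration, #pragma can now be generated from a macro this way (a sketch; PRAGMA, PACK and the pack pragma itself are assumptions of mine, not from the Standard):
#define PRAGMA( text)   _Pragma( # text)
#define PACK( n)        PRAGMA( pack( n))
PACK( 1)        /* has the same effect as:  #pragma pack( 1) */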
There are too many issues with the introduction of Unicode. First of all, implementations must prepare huge tables for conversion between multi-byte characters and Unicode, causing large overheads; it is virtually impossible to implement on systems of 16 bits or less, and there are systems that do not handle Unicode at all. In addition, in many cases a Unicode character and a multi-byte character do not map one-to-one. It seems too aggressive to put Unicode into the C language standard in the name of programming language internationalization.
In C99, UCN handling was drastically reduced compared with the draft of November 1997 and with the C++ Standard, so the preprocessing load became relatively small. Therefore, a certain implementation became possible in mcpp as well. *1
However, considerable load still remains on the compiler proper. Also, since these are unreadable notations, I expect that, like trigraphs, they will end up hardly used. *2
Note:
*1 In the draft of November 1997, almost the same as the C++ Standard, all extended characters not in the basic source character set were supposed to be converted into UCNs in translation phase 1 and converted again into the execution character set in phase 5.
If this were implemented, presumably a tool would be invoked to do these conversions before and after processing. As the conversion is OS-dependent, separate tools would be realistic.
*2 According to C99 Rationale 5.2.1 Character sets, this specification assumes that the unreadable notations are converted to and from multi-byte-character source by a tool included in the compiler system used. That must mean separating the multi-byte-character string literal parts into a separate file for processing. I wonder how practical that is.
The problems in the Standard C preprocessing specifications and my opinions mentioned above are also my requests for future Standard C. In summary, they are the following items.
These are all intended to reorganize the irregular rules and to make the preprocessing specifications simple and clear. There is no doubt that they would make preprocessing easier to understand, and conversely, they should cause little annoyance.
I believe mcpp V.2 implements in Standard mode all the preprocessing specifications of Standard C, including the parts I do not think highly of. In the 'post-Standard' mode, preprocessing with the modifications above is implemented (excluding UCN and the use of multi-byte characters in identifiers.)
Amendment 1 and Corrigendum 1 in C90 took the direction of increasing the irregularity of preprocessing rather than cleaning it up.
C99 added various new features, but did not clean up the confusion of logic above either, unfortunately. *1
Regarding the specifications added in C99, I have the following request.
In addition to the problems described above, there is an issue regarding the evaluation rules for the integer types applicable to #if expressions.
Since this is not a preprocessing-specific issue, it will not be discussed further.
Also, C90 defined the result of / or % with one or both operands negative as implementation-defined, which was a terrible specification. In C99, this became the same specification as div() and ldiv().
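That is, in C99 the quotient truncates toward zero (my illustration):
-7 / 2      /* C99: -3;  in C90, -3 or -4 (implementation-defined) */
-7 % 2      /* C99: -1;  in C90, -1 or 1, correspondingly          */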
Note:
*1 Various defect reports regarding C99, the responses to them, and corrigendum drafts are found on the ftp site below. This is the official ftp server of ISO/SC22/WG14, and you can ftp as anonymous, at least for now. (SC stands for subcommittee and WG for working group; SC22 deliberates programming language standards and WG14 handles the C standard.)
The items in the test-t, test-c, test-l, and tool directories, together with this cpp-test.html itself, are the "Validation Suite for Standard C Conformance of Preprocessing" developed by myself. It tests in detail the level of Standard C (ANSI/ISO/JIS C) conformance of preprocessing in any given compiler system. It is intended to cover all the preprocessing specifications defined in Standard C, and it also covers C++ preprocessing. There are also many additions addressing matters outside the specifications.
Standard C conformance requires not only that compiler systems behave correctly, but also that their documents accurately state the necessary items. I will explain this in 3.5.
The test-t directory contains 183 sample text files. Of the 183, 30 are header files, 145 are pieces of sample text, and 8 are files gathering the small pieces of sample text together. All but the header files are named in the *.t format, except some named *.cc. These files have nothing to do with the compilation phases; they test the preprocessing phases. Therefore, they are not necessarily in correct C program format; they are rather sample texts for testing preprocessing.
As a Standard C implementation may compress preprocessing and compilation into a single process, testing preprocessing separately is not always possible; it depends on the implementation. In that sense, these *.t samples themselves do not conform to Standard C. However, there are many implementations in which preprocessing alone can be tested, and the specifications and problems in fact become clear when it can be separated. The *.t sample files are for such implementations.
The *.cc files are samples for C++ preprocessing, provided for preprocessors which do not accept files named *.c or *.t as C++ source. They have the same content as the corresponding *.t files.
Among the sample text files, there are those whose names start with n_ (meaning normal), i_ (implementation-dependent), m_ (multi-byte character), and e_ (erroneous.)
Files starting with n_ are samples containing no errors, nothing causing undefined behavior, and no implementation-defined parts. Preprocessors conforming to Standard C must be able to process these properly.
Files starting with i_ are samples depending on implementation-defined specifications regarding character sets, assuming the ASCII basic character set. (*) Preprocessors of Standard C conforming implementations with the ASCII character set must be able to process these properly without errors.
Files starting with e_ are samples containing some violation of a syntax rule or constraint, in other words, errors. Preprocessors conforming to Standard C must diagnose these and must not overlook them.
Files with a number following n_, i_, m_ or e_ are samples testing C90 preprocessing and the preprocessing specifications common to C90 and C99. Among the header files, pragmas.h, ifdef15.h, ifdef31.h, ifdef63.h, long4095.h and nest9.h through nest15.h are samples testing C99 preprocessing specifications; the others are for the specifications common to C90 and C99.
Files with letters other than std or post after n_, i_, e_, or u_ are samples for C99 and C++. n_dslcom.t, n_ucn1.t, e_ucn.t and u_concat.t test preprocessing specifications common to C99 and C++98; n_bool.t, n_cnvucn.t, n_cplus.t, e_operat.t and u_cplus.t are for C++; the rest are for C99.
The files named ?_std.t combine the pieces of the C90 files into single files.
?_std99.t is the equivalent for C99. The ?_post.t and ?_post99.t files are bonus files, used for testing mcpp in 'post-Standard' mode.
The files named u_*.t are bonus files, pieces which test undefined behaviors; undefs.t combines them into one file, and unbal?.h is a header file used by them. unspcs.t tests unspecified behaviors, and warns.t, which belongs to none of the above, describes texts for which warnings are desirable. unspcs.t and warns.t are also bonuses. The files named m_*.t are samples for several encodings of multi-byte character and wide character sets; it is desirable that many encodings be processed properly. m_*.t belong to the quality test items, like u_*.t.
misc.t, recurs.t and trad.t are real bonuses. misc.t is a collection of tests: items found in the Standards and other documents, tests whose results differ depending on the internal representation of the integer types, tests related to translation phases 5 and 6, tests of extended features, and others. recurs.t is a special case of recursive macros, and trad.t is a sample for the old "Reiser model cpp".
There are 133 files in the test-c directory. 26 of them are header files (24 are the same as those in test-t), 102 are pieces of sample source, 3 are files which combine the pieces of sample source, and the other 2 are files used for automatic testing. Among these, 32 are bonus sample sources. The source files other than headers are named *.c and are in C program format.
Naturally, the file names start with n_, i_, m_ or e_. Those starting with n_ are strictly conforming programs of Standard C (containing no errors and no implementation-dependent parts). Implementations must be able to compile these files correctly without errors and execute them correctly. On correct execution, the messages below are displayed.
started
success
n_std.c and i_std.c are exceptions; for these, the above messages are not displayed. Only the end message,
<End of "n_std.c">
, is displayed. Otherwise, some error message is displayed. Some of the files starting with i_ are samples of character constants assuming ASCII; implementations with the ASCII character set must be able to compile and execute these correctly, just like those starting with n_. The files starting with e_ must be correctly diagnosed by the compiler system at compilation (preprocessing.)
Testing by compilation and execution is the most proper testing method. However, while this method detects the existence of an error in an implementation, in some cases it is not clear where the error is. You can make a more accurate evaluation by running only the preprocessor on the *.c files and looking through the results, as far as the implementation allows (the *.t files are even more straightforward.)
The files called ?_std.c combine the pieces into single files.
The files named u_*.c are bonus pieces which test undefined behaviors; undefs.c collects them in one file. unspcs.c tests unspecified behaviors, while warns.c, which belongs to none of the above, is the file of texts for which warnings are desirable. unspcs.c and warns.c are bonuses. Those starting with m_ are samples of several multi-byte character encodings.
C99 tests are not included in the test-c directory, since no compiler proper supports C99 fully. C++ tests are only in the test-t directory.
The test-l directory contains samples for testing translation limits beyond the specifications. All 144 files are bonuses. They are a mix of *.c, *.t, and *.h files.
Many *.h files are duplicated across the test-t, test-c, and test-l directories. If the duplicated header files were gathered into one directory, a way of including such as the following would become necessary.
#include "../test-t/nest1.h"
However, how files specified in this kind of path-list format are searched for (where the base directory is located, etc.) is entirely implementation-defined, and compatibility is not guaranteed. To avoid this problem, the header files are placed in each directory regardless of duplication. (Even the concept of "directory" is excluded from the C Standard.)
The tool directory includes tools necessary for automatic testing.
The files in the cpp-test directory are testcases for GCC / testsuite. They are rewrites of the samples in the test-t and test-l directories.
When testing with the Validation Suite, if a compiler system has options which bring it closer to Standard C, all of them should be set (refer to 6.1 for a concrete example.)
The test-t and test-c directories each contain 2 kinds of samples: big files with multiple pieces put together, and small files divided into pieces. If a preprocessor conforms to Standard C well, only the big combined files are necessary to test n_*. If the level of conformance is not high, however, a preprocessor will fall into confusion partway through such a file, and the remaining items cannot be tested. Therefore, the small pieces are provided as well. Since testing becomes a lot of trouble if the pieces are divided into too many files, I made a reasonable compromise. Depending on the implementation, even these small files may not be processed to the end; in such an event, please divide the sample into still smaller pieces and continue testing.
As the #error directive terminates processing in some implementations, samples testing #error are not included in the big combined files. Since an #include error also often terminates processing, it is not included in the big combined files either.
The *.t samples are used when a preprocessor is an independent program, or when a compiler has an option to output the text after preprocessing. By checking the preprocessing results of these files directly, you can compare them against the correct results written in the comments. Since the preprocessing results can be viewed directly, a more accurate judgment can be made this way, as far as the implementation permits.
Many *.c programs include the "defs.h" header. Two kinds of assert macro definitions are written in "defs.h"; change the 0 of one #if 0 to 1 to select one of them. The first merely includes <assert.h>. The second is an assert macro which does not abort on assertion failure. This, of course, is not a correct assert macro, but it is more convenient for these tests.
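The second definition is of this sort (a sketch of mine, not the literal contents of defs.h):
#if     0   /* set this 0 to 1 to choose the standard assert()     */
#include <assert.h>
#else       /* an assert() which reports but does not abort        */
#include <stdio.h>
#define assert( exp)    ((exp) ? (void)0 : (void)fprintf( stderr,  \
        "Assertion failed: %s, line %d\n", # exp, __LINE__))
#endif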
In multi-byte character processing, the behavior of an implementation may differ depending on the runtime environment, so testing m_* requires attention (refer to 4.1.)
This kind of testing has an inherent difficulty: the test of one item may be tripped up by another failure in the implementation. For example, if <limits.h> has an error and is included to test #if, it is not clear whether <limits.h> or #if is being tested. The tests which compile and execute *.c files are more troublesome than those which preprocess *.t files: if the final result is wrong, it shows that there is some error in the implementation, but not necessarily in the item tested.
In this Validation Suite, I tried to contrive ways of aiming at the target item. However, the Validation Suite itself has to be portable, and to test one item, the correct implementation of other language specifications must be assumed. Therefore, the preprocessing items used as such "assumptions" are implicitly tested within tests targeted at other items. Please note that this implicit scoring also exists in the "allocation of points" described next. When an implementation fails a sample, it may not be possible to judge, without looking at other tests, whether it failed on the item actually targeted or on some other factor.
Each test item has its own allocation of points, and the marking criteria are also written down. Standard C has no subsets; strictly speaking, unless all items match the specifications, an implementation cannot be called Standard C conforming. In reality there are not many such implementations, so we cannot help using a measure of Standard C conformance level to evaluate implementations. Moreover, as the items differ greatly in importance, simply counting the number of passed items will not do; a weighting by importance should be applied.
Of course, this weighting has no objective criteria. The marking of this Validation Suite was decided by myself and has no firm basis. Still, it should serve as a guideline for evaluating the conformance levels of implementations objectively.
n_*, i_*, e_*, and d_* are tests related to Standard conformance, and they are generally marked in units of 2 points. In the tests outside the Standards and in quality evaluation, q_* is marked in units of 2 points and the rest in units of 1 point. Where a diagnostic message should be displayed, no points are scored if the message displayed is wide of the mark. A partial score may be given to a diagnostic which is not outright wrong but somewhat off the point. An implementation is free to issue diagnostics on a correct program provided it processes the program correctly; wrong diagnostics, however, are subject to deduction.
If you compile the cpp_test.c program in the tool directory and run it in the test-c directory, you can test n_*.c and i_*.c for C90 automatically. However, this only scores pass or fail and provides no detail, and it does not include tests such as e_*.?. It just gives a brief view of the C90 conformance level of a preprocessor. No C99 tests are included, because most compilers do not yet support C99 sufficiently. *1, *2, *3
cpp_test is used as follows, taking Visual C++ 2005 as an example.
cpp_test VC2005 "cl -Za -TC -Fe%s %s.c" "del %s.exe" < n_i_.lst
The second and subsequent arguments each need to be enclosed in " and ". (In case the shell strips the "s, enclose everything from the second argument through the last together in ' and ' instead.) %s is replaced by a sample program name without .c, such as n_* and i_*.
The first argument specifies the name of the compiler system. It must be within 8 bytes and must not include '.'. Files with this name plus .out, .err, or .sum are created.
The second argument is the command to compile.
The third and later arguments are commands to delete the files no longer necessary. Multiple such arguments are allowed.
n_i_.lst is in the test-c directory. It lists the names of n_*.c and i_*.c with the .c removed.
Depending on the implementation, processing of some source file may run away. In such an event, change that source name in n_i_.lst to a name which does not exist, 'none' for example, then run the test again.
Running cpp_test this way compiles and executes n_*.c and i_*.c sequentially. The stderr outputs of the sample programs are recorded in the n_*.err and i_*.err files, and the scoring results are written in a column of VC2005.sum. There are only the 3 kinds of marks below.
*: Pass
o: Compiles, but the execution result failed.
-: Could not be compiled.
In VC2005.out, the command line that invoked cpp_test is recorded, as are any messages the compiler system wrote to stdout. Messages the compiler system wrote to stderr, if any, are recorded in VC2005.err.
Look at these for more information.
Now, use the following command.
paste -d'\0' side_cpp *.sum > cpp_test.sum
This combines the *.sum files, the test results for each compiler system, horizontally into one table recorded in cpp_test.sum. side_cpp is the side column of the table, where the test item titles are written; it is in the test-c directory.
The cpp_test.sum that I created this way is located in the doc directory. In 6.2, the detailed results of manual testing are given; they cover more preprocessors than cpp_test.sum. Among those preprocessors are some that are not invoked by the compiler driver of any compiler system; those cannot be tested automatically by cpp_test.
Note:
*1 This cpp_test.c was written based on runtest.c and summtest.c in "Plum-Hall Validation Sampler."
*2 cpp_test.c does not behave as expected when compiled with Borland C / bcc32. cpp_test calls system() redirecting stdout and stderr, but with bcc32 the standard I/O handles do not seem to be inherited by the descendant process. Compiled with Visual C or LCC-Win32, cpp_test.c works without problems.
*3 m_36_*.c are tests of encodings which have a byte of value 0x5C ('\\'). cpp-test does not use them, since some systems do not use these encodings.
The GCC source contains something called testsuite. If you do 'make check' after compiling the GCC sources, the testcases of this testsuite are checked one after another and the results are reported.
Since V.1.3, my Validation Suite has included an edition rewritten so that it can be used as a GCC testsuite. Putting it into testsuite allows automatic checking by 'make check'. While the cpp_test tool of 3.2.2 can test only the samples named n_* or i_*, testsuite also allows samples which require diagnostic messages, such as e_*, w_*, and u_*, to be tested automatically. This set of testcases is applicable to cpp0 (cc1, cc1plus) of GCC 2.9x and later, and to mcpp.
Here, I will explain how to use the Validation Suite in GCC / testsuite.
The cpp-test directory of the Validation Suite is the edition for GCC / testsuite, created by rewriting the test-t and test-l directories; it contains a directory for each of test-t and test-l.
GCC / testsuite, however, cannot change the execution environment. The files named m_* or u_1_7_* are testcases for several multi-byte character encodings; since those testcases each need a different environment, at least on GCC 3.3 and earlier, they are excluded from this testsuite edition. *1
GCC and testsuite specifications have changed many times so far and are expected to change again. Partial fixes to the Validation Suite may accordingly become necessary, especially when diagnostics are added or changed. So far, however, no extensive fix seems necessary unless the version of GCC is extremely old. The testcases in cpp-test have been verified with cpp0 (cc1, cc1plus) of each of GCC 2.95.3, 3.2, 3.3.2, 3.4.3, 4.0.2 and 4.1.1, and with mcpp.
Runtime options in the testsuite cannot be changed according to the target implementation. In fact, multiple standards coexist, and it is necessary to specify the standard version with the '-std=' option; this option does not exist in older versions of GCC. Therefore, my testsuite applies to GCC 2.9x and later and to mcpp V.2.3 and later.
The testsuite is executed by interpreting comments of the following format written in the testcases. These comments do not affect tests on other compiler systems.
/* { dg-do preprocess } */
/* { dg-error "out of range" "" } */
Samples containing a dg-error or dg-warning comment test diagnostic messages. Testing multiple compiler systems is supported by writing the diagnostic messages of each compiler system separated by '|' (OR).
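For example, a testcase expecting a diagnostic from any of several preprocessors might be written as below (the message texts here are hypothetical):
/* { dg-do preprocess } */
/* { dg-error "out of range| too large| overflow" "" } */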
The tests are executed by a tool called DejaGnu, concretely by a shell script called runtest. The setup of DejaGnu is written in files named *.exp, which are scripts for a tool called Expect; Expect, in turn, is a program written in the command language Tcl.
Therefore, using testsuite requires all these tools, in versions appropriate to the testsuite. The same applies when my Validation Suite is used.
Note:
*1 In fact, GCC does not work properly even if the environment variable is set.
My Validation Suite is used in GCC / testsuite in the following manner.
First, copy the cpp-test directory to an appropriate directory in testsuite of GCC.
The cpp-test directory was created by copying the necessary files from the test-t and test-l directories and adding the configuration file cpp-test.exp. The suffix of the files named *.t is mostly changed to .c; the suffix of the files for C++ is changed to .C.
Most samples test the preprocessor only. Two samples, however, cannot test the preprocessor alone due to problems in DejaGnu and Tcl; they are compiled and run instead (and named *_run.c). These two samples contain the line:
{ dg-options "-ansi -no-integrated-cpp" }
where -no-integrated-cpp is an option of GCC 3 and 4. GCC 2 does not support this option, so it needs to be removed in order to test on GCC 2. To accommodate both GCC 2 and GCC 3 or 4, there are two variants, *_run.c.gcc2 and *_run.c.gcc3, of these two testcases; link the appropriate one to *_run.c.
Below, I take the example of GCC 3.4.3 on my Linux system. Suppose the GCC 3.4.3 sources are located in /usr/local/gcc-3.4.3-src and GCC is compiled in /usr/local/gcc-3.4.3-objs.
cp -r cpp-test /usr/local/gcc-3.4.3-src/gcc/testsuite/gcc.dg
This copies files under cpp-test to the gcc.dg directory.
By doing this, if you
make bootstrap
in /usr/local/gcc-3.4.3-objs to compile the GCC source files and you
make -k check
then the entire testsuite including testcases in cpp-test will be tested.
Also, testing with cpp-test only is done as below, in the /usr/local/gcc-3.4.3-objs/gcc directory.
make check-gcc RUNTESTFLAGS=cpp-test.exp
The testsuite logs are recorded in gcc.log and gcc.sum under the ./testsuite directory.
When you do 'make check', depending on the environment, you may need to set the environment variables DEJAGNULIBS and TCL_LIBRARY, as explained in INSTALL/test.html of the GCC sources.
In addition, the environment variables LANG and LC_ALL should be set to C, to set the environment to English.
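For example, in a Bourne-like shell:
export LANG=C LC_ALL=C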
Please note that what 'make check' uses after compiling GCC are the xgcc, cc1, cc1plus, cpp0, etc. generated in the gcc directory, not the gcc, cc1 and the like already installed.
Tests can be executed in any directory as follows.
runtest --tool gcc --srcdir /usr/local/gcc-3.4.3-src/gcc/testsuite cpp-test.exp
Logs are output to the current directory. In this case, what is tested are the gcc, cc1, and cpp0 already installed. cpp-test still requires the testsuite directory, since that directory contains various configuration files for GCC (config.*, *.exp).
The argument 'gcc' of "runtest --tool gcc" must be exactly 'gcc'. If the name of the compiler to be tested is not 'gcc', for example 'cc' or 'gcc-3.4.3', make a symbolic link so that the compiler can be invoked by the name 'gcc'.
Also, cpp-test contains testcases for warnings, covering cases where it is thought desirable for a preprocessor to issue a warning. The GCC preprocessor passes fewer than half of those cases; however, not passing does not mean that the behavior is wrong or that the preprocessor was compiled improperly. This is not an issue of right or wrong, but of the "quality" of the preprocessor.
This cpp-test can also test mcpp. Therefore, substituting mcpp for the GCC preprocessor and calling
make check-gcc RUNTESTFLAGS=cpp-test.exp
in the gcc directory checks mcpp in Standard mode automatically. Tests can also be done with the runtest command in any directory.
runtest --tool gcc --srcdir /usr/local/gcc-3.4.3-src/gcc/testsuite cpp-test.exp
When mcpp is run under GCC 3 or 4, all the cpp-test testcases except one should pass. One more testcase fails when run under GCC 2; however, that is because gcc calls mcpp with the -D__cplusplus=1 option, and is not mcpp's fault.
Please refer to mcpp-manual.html#3.9.5 and mcpp-manual.html#3.9.7 for how to substitute mcpp for the GCC preprocessing. To apply the testsuite, mcpp needs to be invoked with the -23j options: -2 enables digraphs, -3 enables trigraphs, and -j suppresses adding information such as source lines to the diagnostic output. Do not use other options. Also, the testsuite can test mcpp in Standard mode only, no other mode.
The method above is applied after building GCC; however, automatic testing can also be done by 'configure' and 'make' of mcpp itself, as long as GCC / testsuite is installed and ready to run. This case is the easiest, as 'make check' performs the necessary settings automatically. See the INSTALL of mcpp for this method.
GCC has had a testsuite for a long time, but up to V.2.9x it had very few samples concerning preprocessing; you can see how little attention was paid to preprocessing. The number of preprocessing testcases increased considerably in V.3.x. You can tell preprocessing was given more importance, as it was completely renewed with up-to-date preprocessor sources and documents.
However, these testcases are still quite unbalanced. The causes seem to lie in the following nature of testsuite.
This is a way of debugging peculiar to open source projects, made possible because GCC is used by many excellent programmers around the world. At the same time, however, this method may have brought about the randomness and imbalance of the testcases.
In addition, most of these testcases are valid only for GCC and cannot be used with other compiler systems. Also, the testcases for GCC 3 include many that cannot even be applied to GCC 2 / cpp, because of differences in preprocessing output spacing and in diagnostic messages.
On the other hand, my Validation Suite was originally written only to debug my own preprocessor, and was later rewritten so that the entire preprocessing specifications are tested. The samples are largely organized systematically.
It should therefore be of considerable value to add these systematic testcases to GCC / testsuite.
Also, my testsuite edition of the Validation Suite is written so that it can test three preprocessors: GCC 2.9x / cpp, GCC 3.x, 4.x / cc1 (cc1plus), and mcpp. In other words, the regular expression facility of DejaGnu and Tcl is used to absorb the implementation differences in preprocessing output spacing and diagnostic messages. *1
Below are the results of applying the testsuite edition of the Validation Suite to these three preprocessors (tested mcpp in March, 2007, others in October, 2006).
Below is the case where the preprocessor is replaced by mcpp V.2.6.3 in GCC 4.1.1.
=== gcc Summary ===
# of expected passes            264
# of unexpected failures        1
# of expected failures          4
/usr/local/bin/gcc version 4.1.1
The one failure is due to the lack of implementation of the universal-character-name <=> multi-byte character conversion of C++98.
Here is the GCC 3.2 / cc1 case.
=== gcc Summary ===
# of expected passes            216
# of unexpected failures        51
# of unexpected successes       2
# of expected failures          2
/usr/local/bin/gcc version 3.2
Most of the failures are due to missing warnings.
GCC 4.1.1 / cc1 is almost the same.
=== gcc Summary ===
# of expected passes            214
# of unexpected failures        53
# of unexpected successes       2
# of expected failures          2
/usr/local/bin/gcc version 4.1.1
Here is the GCC 2.95.3 / cpp0 case.
=== gcc Summary ===
# of expected passes            181
# of unexpected failures        87
# of unexpected successes       3
# of expected failures          1
gcc version 2.95.3 20010315 (release)
It issues fewer warnings than GCC 3, 4 / cc1, and some of its diagnostic messages are off the point. Half of the new C99 and C++98 specifications are not implemented yet, either.
The number of items differs among the GCC versions, since multiple failures can occur in one testcase.
Note:
*1 This makes the dg scripts in my testcases difficult to read, with frequent use of \ and other symbols. The regular expression processing of DejaGnu and Tcl has a considerable number of peculiarities and flaws, and ingenuity was required to achieve all of this automatic testing on multiple compiler systems. Currently, however, the runtime options used in the testcases have to be common to those compiler systems.
Standard C implementations must, of course, process correct source correctly, but they must also issue diagnostic messages for erroneous source. Standard C also contains portions whose behavior is left to the implementation or is not defined. In summary, they are as below. *1
Among these, only programs and data of 1 are called strictly conforming (it is understood that 2 and 3 may be included if their results do not differ across implementations or special cases.)
Programs and data in 1, 2, and 3 only are called conforming programs.
How diagnostic messages are issued is implementation-defined. The requirement is that one or more diagnostic messages of some kind be issued for a translation unit that contains some violation of syntax rule or constraint. Whether diagnostic messages are issued for programs with no such violation is up to the implementation. However, strictly conforming programs, and conforming programs matching the implementation-defined or unspecified behavior of the implementation, must be processed correctly to the end.
Violations of syntax rules or constraints are called "errors" in this document. Among the e_* files of this Validation Suite, many contain multiple errors. In the scoring below, a compiler system is expected to issue one or more diagnostic messages per error. However, there may be compiler systems that issue only one diagnostic (such as "violation of syntax rules or constraints") per translation unit no matter how many errors it contains, and there may be compiler systems that get confused after an error. These are problems of "quality", not of Standard conformance level. Please break the samples into pieces and retest as needed. The problems of quality are discussed separately in 4.
Note:
*1 C90 3 Definitions of Terms; C90 4 Compliance; C90 5.1.1.3 Diagnostics; C99 3 Terms, definitions, and symbols; C99 4 Conformance; C99 5.1.1.3 Diagnostics
*2 Although C++98 differs from C90 and C99 in these terms, the meanings do not differ much.
Each test item is explained one by one below; this also serves as a description of Standard C preprocessing itself. The specifications in common with K&R 1st are not explained again. The item numbers are common to the *.t and *.c files.
Since 9 characters of the basic character set of C are not included in the Invariant Code Set of ISO 646:1983, these characters can be written in source using the 3-character sequences below. This is a new specification introduced in C90. *1
??=  #       ??(  [       ??/  \
??)  ]       ??'  ^       ??<  {
??!  |       ??>  }       ??-  ~
These 9 trigraph sequences are replaced by the equivalent characters in translation phase 1. On systems where these 9 characters can be typed on the keyboard, it is of course unnecessary to use trigraphs. However, preprocessing conforming to Standard C must be able to perform trigraph conversion even on those systems.
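For instance (a minimal sketch):
??=define OR( a, b)     a ??! ??! b
        /* becomes in phase 1:  #define OR( a, b) a || b */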
Scoring: 6. 6 points if all 9 are processed correctly; 2 points are deducted for each trigraph not processed properly, with 0 as the lower limit.
Note:
*1 C90 5.2.1.1 Trigraph sequences; C90 5.1.1.2 Translation phases; C99 5.2.1.1 Trigraph sequences; C99 5.1.1.2 Translation phases
Since trigraph conversion is performed prior to tokenization in translation phase 3 and to control-line processing in phase 4, trigraphs can be written anywhere on a control line.
Scoring: 2.
There are only the 9 trigraphs listed above; a sequence starting with ?? other than these is never converted into another character, nor is the ?? removed. Preprocessing must be able to handle sequences which mix trigraphs with ?'s that do not form trigraphs.
Scoring: 2.
When there is a \ at the end of a line immediately followed by a <newline>, this <backslash><newline> sequence is deleted unconditionally in translation phase 2, and as a result the 2 lines are spliced. In the Standards, a line of the source file is called a physical line, while a line made by removing any <backslash><newline> is called a logical line, to distinguish the two. Processing in translation phase 3 operates on logical lines. *1
In K&R 1st, a #define line and a string constant can be continued onto the next source line using <backslash><newline>, but other cases are not mentioned. Actual implementations allow other control lines to be connected as well, not only #define.
Note:
*1 C90 5.1.1.2 Translation phases; C99 5.1.1.2 Translation phases
Connection of #define lines is accepted by K&R 1st and most implementations.
Scoring: 4. 4 points for processing correctly, 0 points otherwise.
Some implementations cannot handle <backslash><newline> in unusual places, such as inside a parameter list, even on a #define line.
Scoring: 2.
<backslash><newline> inside a string literal has been supported since K&R 1st.
Scoring: 2.
In Standard C, <backslash><newline> must be removed unconditionally, even inside an identifier or anywhere else.
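For instance (a sketch of mine):
int LO\
NG = 1;     /* phase 2 splices the lines: the identifier is LONG */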
Scoring: 2.
<backslash> is not only \, but also the trigraph ??/. Since ??/ in source is converted into \ in translation phase 1, it is naturally \ itself in phase 2.
Scoring: 2.
In translation phase 3, a logical line is broken into pp-tokens and white spaces. A comment is converted into a single space at that time. *1
Here, an implementation may convert consecutive white spaces (including comments) into a single space. However, a <newline> is not converted and stays as it is in any case, because the preprocessing-directive processing of the next phase, phase 4, operates on this "line".
When a comment extends over multiple lines, the lines are effectively spliced by the comment.
Note:
*1 C90 5.1.1.2 Translation phases; C99 5.1.1.2 Translation phases
In the old, so-called Reiser-model cpp, comments functioned as token separators only internally and were removed before output. Taking advantage of this, there was a technique of using comments for token concatenation. This usage deviates from K&R 1st, however, and was clearly rejected by Standard C, which uses the ## operator for token concatenation.
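A sketch contrasting the two (CAT_OLD and CAT are hypothetical names):
#define CAT_OLD( a, b)  a/**/b      /* Reiser-model cpp: var1           */
#define CAT( a, b)      a ## b     /* Standard C token concatenation   */
CAT_OLD( var, 1)    /* in Standard C: 'var 1' (the comment is a space) */
CAT( var, 1)        /* var1                                            */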
Scoring: 6.
From K&R 1st to C90, comments started with /* and ended with */. *1
However, C99 started supporting C++ style of comment, //. *2
Scoring: 4.
In C90, this should be processed as a mere sequence of the pp-tokens '/' and '/', not as a comment. However, as implementations handling // as a comment were common even before C99, mcpp treats it as a comment and issues a warning in C90 mode.
Note:
*1 C90 6.1.9 Comments
*2 C99 6.4.9 Comments
A preprocessing directive starting with # occupies a "line", but this "line" is not necessarily a physical line of the source. It may be a logical line spliced by <backslash><newline>, or a "line" extended over multiple physical and logical lines by a comment. This is no surprise if you consider the order of translation phases 1 through 4.
Scoring: 4.
Some directive lines extend over several physical lines through both <backslash><newline> and comments. Preprocessors which do not implement the translation phases properly cannot handle these correctly.
Scoring: 2.
In C90 Amendment 1 (1994), alternative spellings called digraphs were added for some of the operators and punctuators. *1
In C99, a character notation called UCN (universal character name) was added. *2
In e.4.?, token errors are covered.
Note:
*1 Amendment 1 / 3.1 Operators, 3.2 Punctuators (added to ISO 9899 / 6.1.5, 6.1.6); C99 6.4.6 Punctuators
*2 C99 6.4.3 Universal character names
Digraphs are handled as tokens (pp-tokens.) '%:' is another spelling of '#'; it can be used both as the first pp-token of a preprocessing directive line and as the stringizing operator.
Scoring: 6.
Unlike trigraphs, digraphs are tokens (pp-tokens), so they are stringized with their spelling as is, without conversion (a meaningless specification.)
Scoring: 2.
UCN is recognized in string literals, character constants, and identifiers.
Scoring: 8. 4 points if a UCN in a string literal passes through preprocessing as is; 2 points each if a UCN in a character constant or in an identifier is processed correctly. It does not count if UCNs are simply not recognized and output as is.
A UCN can also be used inside a pp-number. However, it has to disappear from the number token by the end of preprocessing. This specification exists in C99 but not in C++.
Scoring: 2.
A UCN must be 8 hexadecimal digits if it starts with \U, or 4 hexadecimal digits if it starts with \u.
A UCN must not specify a value in the range [0000..009F] or [D800..DFFF]. However, 0024 ($), 0040 (@), and 0060 (`) are valid.
Scoring: 4. 1 point each for each correct diagnosis regarding 4 samples.
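For illustration, under C99 rules (the names are hypothetical):

#define \u00C5ngstrom 1    /* UCN in an identifier: OK                      */
char *s = "\u00E9";        /* UCN in a string literal: OK                   */
int  c  = '\u20AC';        /* UCN in a character constant: OK               */
char *t = "\u0041";        /* violation: 0041 is below 00A0 and not $, @, ` */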
Even sequences not in the C token format are recognized as pp-tokens. Therefore, there are not many error cases in the tokenization of preprocessing.
However, there are some cases other than this which become undefined behavior (refer to 4.2.)
Empty character constants are violations of syntax rules in a preprocessing #if line or compiling. *1
Scoring: 2.
Note:
*1 C90 6.1.3.4 Character constants -- Syntax C99 6.4.4.4 Character constants -- Syntax
Spaces, tabs, vertical-tabs, form-feeds, carriage-returns, and new-lines are all white spaces. White spaces that are not in string literals, character constants, header names, or comments usually have a meaning as a token separator. However, new-lines that remain until translation phase 4 are special and become pp-directive separators. There are slight restrictions on the white spaces that can be used in pp-directive lines.
In Standard C, it is guaranteed that spaces and tabs before and after the #, the first pp-token on a preprocessing directive line, give the same result whether they exist or not (*1.) In K&R 1st this was not clear, and there actually were implementations that did not accept spaces and tabs before or after #.
Spaces and tabs elsewhere on the line are accepted simply as token separators in Standard C, and K&R 1st is interpreted the same way.
However, in the case where white spaces other than spaces and tabs occur on a pp-directive line, the behavior is undefined (refer to u.1.6 of 4.2.)
Scoring: 6.
Note:
*1 C90 6.8 Preprocessing directives -- Description, Constraints C99 6.10 Preprocessing directives -- Description, Constraints
#include is the most basic pp-directive since K&R 1st. Nevertheless, the specifications of this directive in Standard C have quite a few undefined and implementation-defined portions (*1.) The reasons are given below.
The most basic test below is not given a separate category in n.6.*; it is included with other test items, on the premise that failure to process it is out of the question, since most other tests could not be performed at all.
#include <ctype.h>
#include "header.h"
Note:
*1 C90 6.8.2 Source file inclusion C99 6.10.2 Source file inclusion
There are 2 formats of header-names. The only difference is that the format enclosed by < and > searches for the header in an implementation-defined specific location (possibly multiple locations), while the one enclosed by " and " searches for the source file first in an implementation-defined manner, and upon failure performs the same search as the format enclosed by < and >. Therefore, the format enclosed by " and " can include standard headers as well. This point was the same in K&R 1st.
In Standard C, the same standard header can be included many times. Either way, however, this is not a preprocessor issue, but a matter of how the standard headers are written (refer to 5.1.1.)
Scoring: 10. 4 points if only one of the 2 samples is processed.
In K&R 1st, no macro could be used on the #include line, however, it was officially permitted in Standard C. In case the #include argument does not match either of the 2 formats, the macro included there is expanded. The result must match either one of the 2 formats.
This specification has something subtle. For example, how should the following source be handled?
#define MACRO header.h>
#include <MACRO
Should MACRO be expanded first as below?
#include <header.h>
Or, should this be an error as > matching < does not exist prior to macro expansion?
I cannot believe the Standards were written with this level of detail in mind. Therefore, I believe it is more straightforward to handle < and > as quotation delimiters, similar to " and ". Still, expanding macros first cannot be said to be against the Standards. This Validation Suite does not include tests that poke at holes in this type of specification.
Scoring: 6.
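For example, the following is valid, since the macro expands to a form matching the " " format (HEADER is a hypothetical name):

#define HEADER "header.h"
#include HEADER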
This is not so interesting, but it does not have to be a single macro.
Scoring: 2.
The #line pp-directive is not usually used directly by a user; it is used to pass along the filename and line numbers of the original source in case some other tool pre-processes the source (which is not necessarily C.) As it has been around since K&R 1st, it must traditionally have had some purpose of its own.
In addition, #line or its variant is used to pass along a filename and line number information to a compiler proper for preprocessor output in general. However, this is not defined as a specification.
The filename and line number specified in #line become the value of predefined macro, __FILE__ and __LINE__ (in addition, __LINE__ will be incremented for every physical line.) *1
Note:
*1 C90 6.8.4 Line control C99 6.10.4 Line control C90 6.8.8 Predefined macro names C99 6.10.8 Predefined macro names
#line specifying a line number and a filename existed in K&R 1st, but the filename was not enclosed by " and ".
The filename is a string literal in Standard C, but a different token from the header-name for #include. Strictly speaking, it has subtle problems in \ handling and the like. However, no problems arise for valid source (it is fortunate that the filename for #line is not in the <stdio.h> format.)
Scoring: 6.
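A typical usage (the file name is, of course, hypothetical):

#line 1234 "virtual.c"
/* from here on, __LINE__ counts from 1234 and __FILE__ is "virtual.c" */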
The filename argument is optional; it does not have to exist. This is the same as in K&R 1st (it is undefined if no line number is specified.)
Scoring: 4.
In K&R 1st, the #line argument could not use a macro. This is permitted in Standard C.
Scoring: 4.
The line number range for the #line directive was [1..32767] in C90, but was extended to [1..2147483647] in C99.
Scoring: 2.
The filename argument must be a string literal. This is not so interesting, but a wide string literal is a violation of constraint (other pp-tokens are, for some reason, undefined. This is an imbalanced specification.)
Scoring: 2.
#error is a directive newly introduced in Standard C. It displays an error message that includes an argument as a part at preprocessing. In old implementations, there were some with directives such as #assert. However, these were not standard.
It is not specified that #error should end a process. According to the Rationale, it was not specified since the Standard cannot make requirements to that extent. However, it says that ending a process is the ANSI C committee's intention. *1
Note:
*1 C90 6.8.5 Error directive ANSI C Rationale 3.8.5 C99 6.10.5 Error directive C99 Rationale 6.10.5
Macros on the #error line are not expanded. The only control lines on which macros are expanded are #if (#elif), #include, and #line. *1
Scoring: 8. 2 points if the line is processed with macros expanded. Whether the process is terminated does not matter.
Note:
*1 C90 6.8 Preprocessing directives -- Semantics C99 6.10 Preprocessing directives -- Semantics
This is not so interesting, but the #error line argument is optional and it does not have to exist.
Scoring: 2.
#pragma was also introduced in Standard C. Extended directives which are unique to an implementation are all supposed to be implemented by #pragma sub-directives. *1
Note:
*1 C90 6.8.6 Pragma directive C99 6.10.6 Pragma directive
Different #pragma sub-directives are recognized by each implementation. Since portable programs could not be written if every unrecognized #pragma became an error, a #pragma that an implementation cannot recognize is ignored. In preprocessing, only the #pragma sub-directives recognized as relevant to preprocessing are processed, while all the rest are passed to the compiler proper as is.
Scoring: 10. #pragma must not cause an error; issuing a warning, however, is acceptable. 10 points when preprocessing does not issue an error even if the compiler proper does; that should not occur, but it is not a preprocessing mistake if it does. 0 points, as a matter of convenience, when this distinction does not exist because the preprocessor is not independent.
C99 introduced the _Pragma() operator which has the same effect as #pragma but can be written in a macro definition as opposed to #pragma.
In addition, when the pp-token following #pragma is STDC, the pragma has a standard meaning, and macro expansion is prohibited in that case. In other cases, whether macro expansion is applied is implementation-defined.
Scoring: 6.
The _Pragma() operator argument must be a string literal.
Scoring: 2.
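A sketch of the usual idiom, assuming a pack(1) pragma exists in the implementation (PRAGMA is a hypothetical name):

#define PRAGMA(s) _Pragma(#s)
PRAGMA(pack(1))    /* has the same effect as: #pragma pack(1) */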
#if, #else, and #endif have been supported since K&R 1st. (*1) In the implementations which cannot use #if, none of n.10.1, n.11.*, n.12.*, n.13.*, e.12.*, and e.14.* can be processed as well as many other tests.
Note:
*1 C90 6.8.1 Conditional inclusion C99 6.10.1 Conditional inclusion
#elif was added in Standard C. By using this, we can avoid illegibility of multiple #if nesting.
Macros can be used in the #if expression. The identifier not defined as a macro is evaluated as 0.
In Standards, #if to the corresponding #endif is called a #if section and a block divided by #if (#ifdef, #ifndef), #elif, #else, or #endif in the section is called a #if group. The #if and #elif line expressions are called a #if expression.
Scoring: 10.
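In this terminology (A and B are hypothetical macros):

#if A          /* a #if group            */
#elif B        /* a #elif group          */
#else          /* a #else group          */
#endif         /* closes the #if section */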
In a skipped #if group, preprocessing directives are not processed, other than checking #if, #ifdef, #ifndef, #elif, #else, and #endif in order to trace the correspondence of #if groups; macros are not expanded, either.
However, tokenization still takes place. First of all, that is because C source is a sequence of comments and pp-tokens from beginning to end. Secondly, comments must be processed at least in order to check the correspondence of #if and the others, and in order to check whether /* and the like are comment symbols, it must be made sure that they are not inside a string literal or character constant.
Scoring: 6.
An operator, defined, was introduced for #if expressions in Standard C. This integrates #ifdef and #ifndef into #if (#elif) and prevents illegibility of multiple #ifdef nesting.
The operand for the defined operator is an identifier. Both enclosing it by ( and ) and not doing so are supported as a writing style. Both mean the same and are evaluated as 1 if an operand is defined as a macro, 0 otherwise.
Scoring: 8. 2 points if only one of the 2 #if sections is processed.
defined is one of the operators and provides either value of 1 or 0. Therefore, it is possible to do an operation on its result and another expression.
Scoring: 2.
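For illustration (MACRO and OTHER are hypothetical names):

#if defined MACRO && defined (OTHER)        /* both styles are equivalent      */
#endif
#if defined (MACRO) + defined (OTHER) == 2  /* the 0/1 result can be operated on */
#endif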
The #if expressions were vaguely defined as constant expressions in K&R 1st and their types were not clear. In C90, the #if expressions are integer constant expressions and it was made clear that int and unsigned int have the same internal representation as long and unsigned long respectively. In other words, the #if expressions including sub-expressions in them are all evaluated in long and unsigned long. To restate, they are handled as if constant tokens all had L or UL suffix. *1
In addition, in C99 the type of the #if expression became the maximum integer type of each implementation. In the standard header <stdint.h>, this type is given the typedef names intmax_t and uintmax_t. As long long is required in C99, the #if expression type is long long or wider. The suffixes LL (ll) and ULL (ull) are used to write long long/unsigned long long constants. *2
The length modifier %ll with an appropriate conversion specifier (such as %lld, %llu, and %llx) is used to display a long long/unsigned long long value in printf(). The length modifier for displaying an intmax_t or uintmax_t value is %j (%jd, %ju, %jx, and others.) *3
Note:
*1 C90 6.8.1 Conditional inclusion -- Semantics *2 C99 6.10.1 Conditional inclusion -- Semantics *3 C99 7.19.6.1 The fprintf function -- Description
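A minimal sketch of these conversion specifiers:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    long long ll = 1234567890123LL;
    printf("%lld %jd\n", ll, (intmax_t)ll);  /* %ll for long long, %j for intmax_t */
    return 0;
}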
In C90, constant expressions of the long type in the #if expressions must be evaluated.
Scoring: 6.
In C90, constant expressions of the unsigned long type in the #if expressions must be evaluated.
Scoring: 4.
Octal numbers must be evaluated in the #if expressions. This was similar in K&R 1st; however, a constant exceeding the maximum value of long was evaluated as a negative long in K&R 1st, while it is evaluated as unsigned long in C90. *1
Scoring: 4. 2 points when recognized as an octal number but evaluated as a negative number or overflow. 0 points for not recognized as an octal number.
Note:
*1 C90 6.1.3.2 Integer constants C99 6.4.4.1 Integer constants
Hexadecimal numbers must also be evaluated in the #if expressions. This too was similar in K&R 1st; the difference is that a constant exceeding the maximum value of long was evaluated as a negative long in K&R 1st, while it is evaluated as unsigned long in C90.
Scoring: 4. 2 points when recognized as a hexadecimal number but evaluated as a negative number or overflow. 0 points for not recognized as a hexadecimal number.
Constant tokens with the suffix L or l must also be evaluated in the #if expressions. For constants not exceeding the maximum value of long, this is the same as in K&R 1st. The suffix does not matter for evaluation in preprocessing.
Scoring: 2.
Constant tokens with the suffix U or u must also be evaluated in the #if expressions. This notation was not supported in K&R 1st and was officially accepted in C90.
Scoring: 6.
Negative numbers must be also handled in the #if expressions. This is a specification since K&R 1st.
Scoring: 4.
In C90 it is a violation of constraint, or simply an error, if the value of an integer constant token appearing in the #if expression is not in the range representable in long or unsigned long. This is not defined directly as a specification for the #if expressions; rather, there is a specification regarding constant expressions in general, and the #if expressions are no exception. *1
The character constant overflow is tested in e.32.5, e.33.2, and e.35.2.
Scoring: 2.
Note:
*1 C90 6.4 Constant expressions -- Constraints C99 6.6 Constant expressions -- Constraints
At least, constant expressions in long long/unsigned long long must be evaluated in the #if expressions in C99.
The constant exceeding the maximum value of long long is evaluated as unsigned long long.
Suffixes, LL, ll, ULL, and ull, are added in C99. *1
Scoring: 10. 2 points each for processing each of 5 samples correctly.
Note:
*1 C99 6.4.4.1 Integer constants C99 6.6 Constant expressions -- Constraints
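For illustration, #if expressions of this kind must evaluate in C99 (the values are arbitrary):

#if 9223372036854775807LL > 0x7FFFFFFFLL
#endif
#if 18446744073709551615ULL == 0xFFFFFFFFFFFFFFFFULL
#endif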
The constant or constant expression which exceeds the intmax_t or uintmax_t range is a violation of constraint in the #if expression for C99.
Scoring: 2. In case there is no <stdint.h>, it is acceptable to write a macro appropriately.
The #if expressions are a kind of integer constant expressions. Compared with usual integer constant expressions, there are differences below.
Function calls and comma operators cannot be used in integer constant expressions. Since constant expressions are not variables, no assignments, increments, decrements, nor arrays can be used.
In n.13, evaluation rules common with generic integer constant expressions are tested. Among these, n.13.5 is different from K&R 1st. n.13.6 is the area which was all different in pre-Standard #if expressions. n.13.13 and n.13.14 were not clear in K&R 1st. The rest is unchanged since K&R 1st. *2
The n.13.* tests use only small values, so that this rule can be tested even in implementations where only int values can be evaluated in the #if expressions. The defined operator and >= are tested not in n.13, but elsewhere.
Note:
*1 In C++ Standard, 'true' and 'false' are treated differently and evaluated as 1 and 0 respectively. These are not macros, but keywords; however, they are treated as boolean literals in preprocessing.
*2 C90 6.3 Expressions C90 6.4 Constant expressions C99 6.5 Expressions C99 6.6 Constant expressions
Bit shift operations have no troublesome issues regarding positive numbers, at least.
Scoring: 2.
Since the bit pattern of the same positive value is the same regardless of CPU or implementation as long as it is within the range of the type, operations such as ^, |, and &, which may appear to depend on CPU specifics, have exactly the same results in that range on any implementation.
Scoring: 2.
All implementations should be able to process these.
Scoring: 4.
All implementations should be able to process these, too (it seemed so, but not so in reality.)
Scoring: 2.
Usual arithmetic conversions are performed for many binary operators in order to match the types of both sides. In K&R 1st, usual arithmetic conversions were performed for the shift operators as well, while this is not the case in Standard C. This is an adequate specification, considering that the right-hand value is always a small positive number, and that converting a negative number to a positive one also changes the bit pattern in internal representations other than 2's complement, which would create confusion. *1
Scoring: 2.
Note:
*1 C90 6.3.7 Bitwise shift operators -- Semantics C99 6.5.7 Bitwise shift operators -- Semantics
The Standard states, for many binary operators, "If both operands have arithmetic type, the usual arithmetic conversions are performed on them." However, this is not mentioned for << and >>. There is an explanation of this point in C89 Rationale 3.3.7 (C99 Rationale 6.5.7.)
Usual arithmetic conversions are applied to the operands on both sides of binary operators such as *, /, %, +, -, <, >, <=, >=, ==, !=, &, ^, and |, in order to match the types of the two sides. The same goes for the second and third operands of the ternary operator ? :. Therefore, if one side has an unsigned type, the other side is converted to the unsigned type, which causes a negative value to be converted into a positive one.
In standard integer constant expressions, the integer promotions are applied to an operand before the usual arithmetic conversions if the operand has an integer type narrower than int. In the #if expressions, however, no integer promotion occurs, since all operands are handled as the same-size type.
Scoring: 6. 2 points each for processing each of the 3 tests correctly.
This evaluation rule is the same as in K&R 1st; however, this sample cannot be processed by K&R 1st based implementations, since the constant token 0U uses the U suffix.
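The classic consequence of this rule (illustrative):

#if -1 > 0U    /* -1 is converted to unsigned and becomes a huge value: true */
#endif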
The order of evaluation is defined for the || and && operators and the right side is not evaluated if the result is determined by the left side evaluation. In ? :, either the second or third is evaluated but not the other as a result of the first operand evaluation. *1
Therefore, even division by 0 is not an error in the term not evaluated.
Scoring: 6. Subtract 2 points from 6 for each of the 5 samples that fails; 0 points if 3 or more samples fail. Subtract 2 points for a wrong diagnostic, even if the implementation processes the sample successfully.
Note:
*1 In the ? : operator, however, the usual arithmetic conversion is performed between the second operand and the third. It is strange to perform a conversion even on an operand that is not evaluated. In particular, as the type of an integer constant token used in a #if expression is not determined until its value is evaluated, the value has to be evaluated anyway just to determine the type (though the type question is merely whether it is signed or unsigned.) Still, no division by 0 is allowed. This is rather messy.
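For illustration (these must not be diagnosed as division by 0):

#if 0 && 1 / 0    /* the right operand of && is not evaluated */
#endif
#if 1 || 1 / 0    /* nor the right operand of || here         */
#endif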
From n.13.8 to n.13.12 are tests for the grouping of sub-expressions in the #if expressions. Sub-expressions are grouped according to the precedence and associativity of operators. Though in standard integer constant expressions there are parts decided by syntax prior to precedence, the #if expressions have no syntax issues other than grouping with ( and ). n.13.8 to n.13.10 are tests for associativity; n.13.8 tests the associativity of the unary operators -, +, !, and ~. All unary operators associate from right to left.
Scoring: 2.
The conditional operator, ? :, is associated from right to left.
Scoring: 2.
All binary operators are associated from left to right. n.13.10 tests << and >>.
Scoring: 2.
Here, we test expressions including unary operators, -, +, and !, and binary operators, *, /, and >>, which have different precedence and associativity.
Scoring: 2.
Here, we test the grouping of even more complex expressions including the unary operators -, +, ~, and !, the binary operators -, *, %, >>, &, |, ^, ==, and !=, and the ternary operator ? :.
Scoring: 2.
The use of macros is allowed in the #if expressions. These macros are usually expanded into integer constants; however, we do not test that here, since it is covered by the n.10.1, n.12.1, n.12.2, and n.13.7 tests.
Though macros that expand into operators are not common, they should in principle be handled as operators. A standard header called <iso646.h> was specified in ISO C 1990/Amendment 1, which defines some operators as macros (*1.) The purpose seems to be that source can be written without using characters such as &, |, !, ^, and ~. Preprocessing is required to expand these macros in #if and handle them as operators.
On the other hand, there is a specification in which it is undefined if macros in the #if expressions are expanded to 'defined'. I suspect that defined is handled separately since it is similar to an identifier (refer to u.1.19.)
Scoring: 4.
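A minimal sketch:

#include <iso646.h>
#if 1 and not 0    /* the macros expand to: 1 && ! 0 */
#endif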
Note:
*1 In C++ Standard, these identifier-like operators are not macros but tokens for some reasons.
A #if expression including a macro that expands to zero tokens is not common, either. It should be evaluated after the macro is removed (expanded.)
Scoring: 2.
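For illustration (NOTHING is a hypothetical name):

#define NOTHING
#if NOTHING 1    /* evaluated as: 1 */
#endif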
e.14.1 to e.14.10 are tests for violations of syntax rules and of constraints in the #if expressions. Compiler systems must issue a diagnostic message for any source containing one of these. *1
Note:
*1 C90 6.8.1 Conditional inclusion C90 6.4 Constant expressions C99 6.10.1 Conditional inclusion C99 6.6 Constant expressions
As the #if expressions are integer constant expressions and pointers cannot be used, string literals cannot be used.
Scoring: 2.
As the #if expressions are constant expressions, operators with side effects and variables cannot be used. A --B is different from A - -B and is a violation of constraint.
Scoring: 4. 2 points if one of the 4 samples cannot be correctly diagnosed. 0 points if 2 or more.
Missing one operand in a binary operator or parenthesis is also a violation of syntax rules.
Scoring: 2.
The argument of the defined operator on a #if line may or may not be enclosed in ( and ); however, it is a violation of constraints if only one of the pair of parentheses exists.
Scoring: 2.
Only #if without any expression is certainly a violation of syntax rules.
Scoring: 2.
An identifier not defined as a macro is evaluated as 0; however, a #if line whose argument disappears after macro expansion is a violation of syntax rules.
Scoring: 2.
sizeof, a pp-token, is simply treated as an identifier and evaluated as 0 in the #if expression if it is not defined as a macro. A pp-token called int is the same. Therefore, sizeof (int) becomes 0 (0) and it is a violation of syntax rules.
Scoring: 2.
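That is, a line like the following is a syntax-rule violation (illustrative):

#if sizeof (int) == 4    /* becomes: 0 (0) == 4 */
#endif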
Just as e.14.7, (int)0x8000 becomes (0)0x8000 and it is a violation of syntax rules.
Scoring: 2.
This e.14.9 and the next e.14.10 admit of several interpretations as to whether a diagnostic message should be issued; the specifications are ambiguous. The Standards contain the following specifications.
C90 6.4 Constant expressions -- Constraints
C99 6.6 Constant expressions -- Constraints
Each constant expression shall evaluate to a constant that is in the range of representable values for its type.
The applicable range in this specification is not clear, however, it is clear that this is applied to at least where constant expressions are necessary. The #if expressions must be constant expressions. On the other hand, there are specifications below.
C90 6.3.5 Multiplicative operators -- Semantics
C99 6.5.5 Multiplicative operators -- Semantics
if the value of the second operand is zero, the behavior is undefined.
C90 6.1.2.5 Types
C99 6.2.5 Types
A computation involving unsigned operands can never overflow,
Which specification should be applied between division by 0 and unsigned operation? It seems either interpretation is possible.
However, we will adopt the following interpretation here: division by 0 in a context where a constant expression is required is treated like a result that does not fit in the range of the type, and a diagnostic message must be issued. That is because it seems appropriate to issue a diagnostic, since only an error in the program can cause this type of result, and constant expressions are evaluated at compilation rather than at execution. In addition, it is unnatural to treat only division by 0 as an exception. However, since the specification gets doubly vague in case the result of an unsigned operation is out of range, we do not include that here, but interpret it as undefined.
ISO 9899:1990/Corrigendum 1 added a specification, "A conforming implementation shall produce at least one diagnostic message .. if .. translation unit contains a violation of any syntax rule or constraint, even if the behavior is also explicitly specified as undefined or implementation-defined." This was carried on by C99. *1
Scoring: 2.
Note:
*1 C99 5.1.1.3 Diagnostics
In C90, values of the #if expressions must be in the range representable as long/unsigned long.
Scoring: 4. 4 points for correctly diagnosing all 4 tests. 2 points for correctly diagnosing 2 or 3 tests. 0 points if only 1 or none is correctly diagnosed.
The n.15.* tests cover #ifdef and #ifndef. These are exactly the same in K&R 1st and Standard C. e.15 tests the violations of syntax rules for them. *1
Note:
*1 C90 6.8 Preprocessing directives -- Syntax C90 6.8.1 Conditional inclusion C99 6.10 Preprocessing directives -- Syntax C99 6.10.1 Conditional inclusion
Scoring: 6.
Scoring: 6.
Arguments on the #ifdef and #ifndef lines must be identifiers.
Scoring: 2.
Arguments on the #ifdef and #ifndef lines must not have extra tokens other than identifiers.
Scoring: 2.
It is a violation of syntax rules not to have any arguments.
Scoring: 2.
Next is a test of violations of syntax rules for #else and #endif. This syntax has not changed since K&R 1st. (However, Standard C introduced a new specification that a diagnostic message must be issued for a violation of syntax rules or constraints.) *1
Note:
*1 C90 6.8 Preprocessing directives -- Syntax C99 6.10 Preprocessing directives -- Syntax
The #else line must not have any other tokens.
Scoring: 2.
The #endif line must not have any other tokens.
Do not write below.
#if MACRO
#else ! MACRO
#endif MACRO
Use below instead.
#if MACRO
#else /* ! MACRO */
#endif /* MACRO */
Scoring: 2.
Next are tests of violations of syntax rules in the matching of #if (#ifdef, #ifndef), #elif, #else, and #endif. This syntax is almost unchanged since K&R 1st, except that #elif was added by Standard C. In addition, K&R 1st was not clear on the point that these must match within the unit of a source file. *1
Note:
*1 C90 6.8 Preprocessing directives -- Syntax C99 6.10 Preprocessing directives -- Syntax
#endif without a preceding #if is obviously a violation of syntax rule.
Scoring: 2.
#else without a corresponding #if is also an error.
Scoring: 2.
Having another #else after a #else is also prohibited.
Scoring: 2.
#elif after #else is not allowed.
Scoring: 2.
#if, #else, and #endif must match within the unit of a source file (preprocessing file). It is not acceptable to treat an included file as if its text had appeared in the including file from the beginning.
Scoring: 2.
Scoring: 2.
Forgetting #endif actually happens quite often, however, compiler systems must issue a diagnostic message for that.
Scoring: 2.
For the #define syntax, the # and ## operators were added in C90, whereas they did not exist in K&R. The rest is unchanged. *1
In C99, variable argument macros were added (refer to 2.8.) *2
Note:
*1 C90 6.8 Preprocessing directives -- Syntax C90 6.8.3 Macro replacement *2 C99 6.10 Preprocessing directives -- Syntax C99 6.10.3 Macro replacement
The first token on the #define line is the macro name. However, when white space immediately follows it, the second token is considered to be the beginning of the replacement list, even if it is '(', and the definition is not considered a function-like macro. If there is no token after the macro name, the macro is defined with zero tokens (an empty replacement list).
Scoring: 30. 10 points if only one of the 2 macros is defined correctly.
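A sketch of the distinction (the names are hypothetical):

#define OBJ (a)      /* object-like: the replacement list is "(a)" */
#define FUNC(a) (a)  /* function-like macro with one parameter     */
#define EMPTY        /* defined with an empty replacement list     */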
If '(' is immediately after a macro name without having white spaces between, it is considered to be the beginning of the function-like macro parameter list. This specification has been around since K&R 1st and has a trace of character-oriented preprocessing which is influenced by the existence of white spaces. Nothing can be done at this point.
Scoring: 20.
In a so-called "Reiser" model preprocessor, the same spelling as a parameter in string literals or character constants in the replacement list, that portion was substituted for an argument by macro expansion. However, this was not accepted by Standard C nor K&R 1st. This replacement is a specification characterized well by character-oriented preprocessing, however, it is out of the question in token-oriented processing.
Scoring: 10.
Variable argument macros were introduced in C99.
Scoring: 10.
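A typical C99 variadic macro (debug is a hypothetical name; fprintf needs <stdio.h>):

#define debug(fmt, ...) fprintf(stderr, fmt, __VA_ARGS__)
debug("%s:%d\n", __FILE__, __LINE__);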
The first token on the #define line must be an identifier.
Scoring: 2.
If there is not a single token on the #define line, it is a violation of syntax rules.
Scoring: 2.
An empty parameter is also a violation of syntax rules. *1
Scoring: 2.
Note:
*1 C90 6.5.4 Declarators -- Syntax C99 6.7.5 Declarators -- Syntax
Duplicate parameter names in the parameter list of one macro definition are a violation of constraints. *1
Scoring: 2.
Note:
*1 C90 6.8.3 Macro replacement -- Constraints C99 6.10.3 Macro replacement -- Constraints
A parameter in a macro definition must be an identifier. *1
Scoring: 2.
Note:
*1 The ... parameter was added in C99. __VA_ARGS__ in the replacement list is a special parameter name that corresponds to it.
Though '$' is not accepted as a character within an identifier in Standard C, there are compiler systems that accept it. This sample is the kind that can be seen in source compiled by such systems. Since Standard C interprets $ as one character and one pp-token, the macro name here is THIS, and everything from $ onward becomes the replacement list of an object-like macro, which is totally different from the intended function-like macro named THIS$AND$THAT.
In the Corrigendum 1 of ISO 9899:1990, an exceptional specification was added to this type of example. A conforming implementation must issue a diagnostic message regarding this example. *1
Conversely, C99 specified that, in general, white space must exist between the macro name and the replacement list in an object-like macro definition. *2
Scoring: 2.
Note:
*1 Addition to Constraints in C90 6.8 by Corrigendum 1.
This specification, however, has disappeared in C++ Standard.
*2 C99 6.10.3 Macro replacement -- Constraints
Variable argument macros in C99 use __VA_ARGS__ in the replacement list corresponding to a parameter ... in the parameter list of the macro definition. This identifier must not be used anywhere else.
Scoring: 2. 2 points if 2 samples are correctly diagnosed. 0 points if only one is diagnosed correctly.
Macro re-definition was not mentioned in K&R 1st and implementations were all different as well. In Standard C, some re-definition of the original definition is allowed, but not a different one. Macros are virtually not re-defined (unless they are voided by #undef). *1, *2
Note:
*1 C90 6.8.3 Macro replacement -- Constraints C99 6.10.3 Macro replacement -- Constraints
*2 However, many compiler systems issue a warning and accept redefinition. So does mcpp starting with V.2.4, to maintain compatibility with existing compiler systems.
Re-definition in which only the number of white spaces differs is allowed.
Scoring: 4.
White spaces include ones extending over source lines by <backslash><newline> sequence or comments.
Scoring: 4.
Re-definition where token sequences in a replacement list are different is a violation of constraint.
Scoring: 4.
Re-definition where the existence of white spaces in a replacement list is different is a violation of constraints. This has a trace of character-oriented preprocessing.
Scoring: 4.
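A sketch covering the preceding items (FOUR is a hypothetical name):

#define FOUR 2 + 2
#define FOUR  2  +  2   /* OK: only the amount of white space differs      */
#define FOUR 2+2        /* violation: the existence of white space differs */
#define FOUR 2 * 2      /* violation: different token sequence             */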
As re-definition of different parameter usage is essentially a different definition, it is a violation of constraints.
Scoring: 4.
Re-definition where only parameter names are different but which is essentially the same is a violation of constraints. This seems to be an excessive constraint.
Scoring: 2.
As a macro name belongs to one name space, a function-like macro and an object-like macro cannot use the same name.
Scoring: 2.
Since no keyword exists in preprocessing, an identifier with same name as a keyword can be defined as a macro and expanded. *1
Note:
*1 C90 6.1 Lexical elements -- Syntax C99 6.4 Lexical elements -- Syntax C89 Rationale 3.8.3 (C99 Rationale 6.10.3) Macro replacement
Scoring: 6.
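For illustration:

#define int long    /* legal: "int" is a mere identifier in preprocessing */
int i;              /* the compiler proper sees: long i;                  */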
Tokenizing a source file into pp-tokens is performed in translation phase 3. The cases where multiple pp-tokens are later concatenated into one pp-token are limited to: concatenation by expanding a macro defined with the ## operator, stringizing by expanding a macro defined with the # operator, and concatenation of adjacent string literals. Therefore, it is interpreted that implicit concatenation of multiple pp-tokens must not happen. This is obvious from the principle of token-oriented preprocessing. *1
Note:
*1 C90 5.1.1.2 Translation phases C90 6.8.3 Macro replacement C99 5.1.1.2 Translation phases C99 6.10.3 Macro replacement
In case preprocessing is done by an independent program, the 3 -'s in the output of this sample must be separated by some sort of token separator, so that the compiler proper can tell that they are 3 separate pp-tokens.
Scoring: 4.
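A sketch of the kind of case meant here (NEG is a hypothetical name, not the Suite's sample):

#define NEG(x) -x
int i = -NEG(-1);   /* three '-'s that must stay separate: - - -1, never merged into "--" */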
Even if a macro is invoked in an argument of outer macro, expansion result of the macro should not be merged with its surrounding pp-tokens in outer macro's replacement list.
Scoring: 2.
Preprocessing numbers were introduced by Standard C. They cover a wider range than integer constant tokens and floating point constant tokens put together, and may include identifier-like portions. They were specified in order to simplify tokenization in preprocessing. However, when a macro-like sequence occurs in a pp-number, a wrong result may arise unless this simple tokenization is done exactly. *1
Note:
*1 C90 6.1.8 Preprocessing numbers C99 6.4.8 Preprocessing numbers
Since the sequence, 12E+EXP, is one pp-number, it will not be expanded even if a macro, EXP, is defined.
Scoring: 4.
A pp-number starts with a digit, or with a '.' followed by a digit.
Scoring: 2.
In C90, + or - can appear inside a pp-number only if it immediately follows E or e. 12+EXP is different from 12E+EXP and is divided into 3 pp-tokens, 12, +, and EXP. These are a pp-number, an operator, and an identifier, respectively. EXP is expanded if it is a macro.
Scoring: 2.
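A sketch (EXP here is a hypothetical macro):

#define EXP 2
12E+EXP    /* one pp-number: EXP is not expanded (the number is invalid later) */
12+EXP     /* three pp-tokens: expands to 12+2 */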
In C99, the sequence of + or - following P or p in a pp-number was added to write a floating point number in hexadecimal.
In order to display a floating point number in printf(), conversion specifiers such as %a and %A are used. *1
Scoring: 4.
Note:
*1 C99 7.19.6.1 The fprintf function -- Description
## is an operator newly introduced in Standard C and used only in the replacement list on the #define line. The pp-tokens before and after a ## are concatenated into one pp-token. If pp-tokens before and after ## are parameters, they are first replaced by actual arguments at macro expansion and concatenated. *1
Note:
*1 C90 6.8.3.3 The ## operator C99 6.10.3.3 The ## operator
This is an example of the simplest function-like macro using the ## operator.
Scoring: 6.
Since the operands of the ## operator are not macro-expanded, an apparently meaningless auxiliary macro such as xglue() in this example is often used together with it: the macro in the argument is expanded first, and then the results are concatenated. The 12e+2 sequence generated by the macro call in this sample is a valid pp-number.
Scoring: 2.
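A sketch of this pattern, consistent with the description above (HIGH is a hypothetical name):

#define glue(a, b) a ## b
#define xglue(a, b) glue(a, b)
#define HIGH 12e+
glue(ab, c)      /* -> abc: the operands are not expanded                  */
xglue(HIGH, 2)   /* the argument is expanded first -> glue(12e+, 2) -> 12e+2 */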
There must be some pp-tokens before and after the ## operator in a replacement list. This is an example of an object-like macro. It is meaningless to use ## in an object-like macro, but not an error.
Scoring: 2.
This is an example of a function-like macro definition without pp-tokens before or after the ## operator in a replacement list.
Scoring: 2.
The # operator was introduced in Standard C. It is used only in the replacement list for the #define line which defines a function-like macro. The operands of the # operator are parameters and corresponding actual arguments at those macro expansion are converted to string literals. *1
Note:
*1 C90 6.8.3.2 The # operator C99 6.10.3.2 The # operator
The argument corresponding to the operand for the # operator is enclosed by " and " on both ends and stringized.
Scoring: 6. 2 points if a space is inserted between tokens, as in "a + b".
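For illustration (str is a hypothetical name):

#define str(a) #a
str(abc)    /* -> "abc"                                       */
str(a+b)    /* -> "a+b"; "a + b" would insert spaces wrongly  */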
In case the argument corresponding to the operand of the # operator comprises a sequence of multiple pp-tokens, the white spaces between those pp-tokens are each converted into one space in the stringized result. No space is inserted where there is no white space. That is to say, the result differs according to the existence of white space, though not according to its amount (this still has a trace of character-oriented preprocessing.) White spaces before and after the argument are deleted.
Scoring: 4.
In case a string literal or character constant is in the argument corresponding to the operand of the # operator, a \ is inserted immediately before every \ or " in it, including the " that encloses a string literal. This is the same as the way of writing a string literal that displays string literals or character constants as they are.
Scoring: 6.
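For illustration (str as above):

#define str(a) #a
str("hi\n")    /* -> "\"hi\\n\"" */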
As the <backslash><newline> sequences are removed in translation phase 2, they do not exist at macro expansion.
Scoring: 2.
Macro expansion routine generally guards the expansion result with token separators in order to avoid token-merging with surrounding tokens. (See 3.4.21.) In case of stringization, however, the inserted token-separators should not remain.
These cumbersome issues arise from character-oriented portion of the Standard mixed into token-based principle.
Scoring: 2.
The operand for the # operator must be a parameter name.
Scoring: 2.
When macros in an argument of a function-like macro call should be expanded was not mentioned in K&R 1st, and pre-C90 preprocessors all differed; most expanded them while rescanning the replacement list. In Standard C, it was specified that the argument is identified first, then macros in it are expanded, and then it is substituted for the parameter. The order is similar to the argument evaluation of function calls, which is easy to understand. However, in case the argument corresponds to a parameter that is an operand of the # or ## operator, a macro name contained in it is not considered a macro call and is not expanded. *1
Note:
*1 C90 6.8.3.1 Argument substitution C99 6.10.3.1 Argument substitution
A macro in an argument is expanded after the argument is identified, and then substituted for the parameter in the replacement list. Therefore, what was identified as one argument remains one argument, even if it appears to be multiple arguments after expansion.
Scoring: 4.
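A sketch (the names are hypothetical):

#define TWO 1, 2
#define first(a, b) a
first(TWO, 3)    /* TWO is one argument; the result is: 1, 2 */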
Similarly, arguments that become zero tokens after expansion are legitimate.
Scoring: 2.
In case the operand for the ## operator is a parameter, the argument corresponding to it does not get macro-expanded. Therefore, it is necessary to nest another macro in order to do macro expansion.
Since xglue() does not use the ## operator in this example, glue( a, b), which is an argument for it, gets macro-expanded and becomes 'ab' and the replacement for xglue() becomes glue(ab, c). This is rescanned and 'abc' is the final result.
Scoring: 2.
Since glue() is directly called, a macro does not get expanded even if a macro name is in the argument.
Scoring: 6.
The argument corresponding to the parameter which is an operand for the # operator is not macro-expanded.
Scoring: 4.
Macro expansion in an argument is done only within that argument. An incomplete macro call is a violation of constraints. Though a function-like macro name by itself is not a macro call, it becomes the beginning of a macro call sequence if followed by '('. Once it begins, a ')' corresponding to this '(' must exist. *1
Scoring: 4.
Note:
*1 C90 6.8.3 Macro replacement -- Constraints C99 6.10.3 Macro replacement -- Constraints
In case macro definition is recursive, expanding the macro as is causes infinite recursion. Because of this, recursive macros could not be used in K&R 1st. Standard C introduced a new specification that the macro name replaced once does not get replaced again in order to allow recursive macro expansion, preventing infinite recursion. This specification is quite difficult, but can be paraphrased as below. *1
After all this paraphrasing, it is still difficult. In particular, item 3 comes from the traditional macro-rescanning specification of including the subsequent token sequence, and it complicates the problem unnecessarily. The Standard has issued a corrigendum making corrections on this subject, but that only makes it more confusing. Furthermore, the Standard changes macro expansion depending on whether the subsequent token sequence is in the source file or not. This is an inconsistent specification. *2
Not only is macro expansion involving the subsequent token sequence uncommon, it is doubly uncommon that a macro whose re-replacement has been prohibited at that point appears again. The Validation Suite has no test item with this type of macro other than n.27.6. I hope this abnormal specification of macro expansion "including a subsequent token sequence" will be removed. *3
Note:
*1 C90 6.8.3.4 Rescanning and further replacement C99 6.10.3.4 Rescanning and further replacement
*2 Refer to 2.7.6.
*3 At the newsgroup comp.std.c, there has been some controversy on the Standard's specification about recursive macro expansion. Mainly two interpretations have been insisted on this subject. recurs.t is one of the samples used in the discussion. Refer to the comment in recurs.t. This Validation Suite does not evaluate preprocessors behavior on this sample.
mcpp V.2.4.1 or later in Standard mode implements both ways of recursive macro expansion. By default, mcpp sets the range of non-re-replacement of the same name as wide as in the explanation 1-5 above, and expands this sample as 'NIL(42)'. With the '-@compat' option, mcpp sets the range narrower and expands this sample as '42'. The difference between these two specifications appears, in item 3 above, when the first half of a function-like macro call of A is in the replacement list of B. By default, mcpp prohibits re-replacement of A even if only the name of A is in the replacement list of B. With the -@compat option, on the other hand, mcpp prohibits re-replacement if and only if the name of A and the argument list with its surrounding '(' and ')' are all in the replacement list of B, and it does not distinguish whether the name is in the source file or not.
This is an example of direct recursion for object-like macros.
Scoring: 2.
This is an example of indirect recursion for object-like macros.
Scoring: 2.
This is an example of direct recursion for function-like macros.
Scoring: 2.
This is an example of indirect recursion for function-like macros.
Scoring: 2.
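Sketches of the object-like cases (names hypothetical; function-like macros behave likewise):

#define FOO FOO    /* direct recursion                       */
#define BAR BAZ
#define BAZ BAR    /* indirect recursion                     */
FOO                /* -> FOO: expanded once, then stops      */
BAR                /* -> BAR: BAR -> BAZ -> BAR, then stops  */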
In Standard C, there is a difficult specification meaning "the macro whose re-replacement has been prohibited will not be replaced at rescan in another context, either." What this applies to in concrete terms is the handling, at the rescan of the parent macro, of a macro in an argument: when there is a recursive macro in an argument, it is replaced only once, and not again at the rescan of the parent macro.
Scoring: 2.
Rescanning of macro replacement lists has been a specification since K&R 1st. Macros found at rescan are replaced as long as they are not recursive macros. This takes care of nested macro definitions. There was no special change in Standard C, though what was not obvious in K&R 1st became somewhat clearer. *1
Note:
*1 C90 6.8.3.4 Rescanning and further replacement C99 6.10.3.4 Rescanning and further replacement
This is same in K&R 1st as well.
Scoring: 6.
This is same in K&R 1st as well.
Scoring: 4.
The argument corresponding to the operand for the ## operator is not macro-expanded, however, the pp-token newly generated by pp-token concatenation is subject to macro expansion at rescan.
Scoring: 2.
A function-like macro name not followed by '(' is not considered a macro call. If a function-like macro name is obtained from an argument and a function-like macro call is formed by using that name in the replacement list, it will be expanded.
Scoring: 6.
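A sketch (names hypothetical):

#define add(a, b) ((a) + (b))
#define apply(m, x, y) m(x, y)
apply(add, 1, 2)    /* substitution yields add(1, 2); the rescan expands it */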
The unusual macro whose replacement list forms the first half of a function-like macro call was an unspoken specification pre-Standard, but was officially accepted in Standard C. The pp-token sequence subsequent to the replacement list is included to complete the macro call.
Scoring: 2.
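A sketch (names hypothetical):

#define call(x) x
#define half call(
half 42 )    /* the subsequent tokens complete the call: -> 42 */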
In general, a macro of the same name is not re-replaced during rescanning. In some unusual cases, however, it is re-replaced: in a nested macro call, when the replacement list involves the subsequent pp-tokens and the same name is found in the source file, the name is re-replaced.
Scoring: 2.
The number of arguments must match that of parameters in function-like macro calls. It is also the same in the function-like macros found at the rescan of the replacement list. However, the number of arguments may not be easy to recognize intuitively in tricky macros since arguments are separated by ,.
Scoring: 2.
C90 introduced a specification that 5 special macros are predefined by implementations. *1, *2
Furthermore, a macro named __STDC_VERSION__ was introduced in C90/Amendment 1.
__FILE__ and __LINE__ are quite special macros whose definitions change dynamically. They are used in assert(), debugging tools, and so on. The rest of the Standard predefined macros do not change during the processing of one translation unit.
Note:
*1 C90 6.8.8 Predefined macro names *2 C99 6.10.8 Predefined macro names
This is defined to be a string literal for the name of the source file being preprocessed. Since it gives the name of the file included by #include while that file is processed, it changes even within one translation unit.
Scoring: 4. 0 points for just a name as in n_28.t, rather than a string literal as in "n_28.t".
This is defined to the decimal constant for the line number in the source file being preprocessed. The line number starts with 1.
Scoring: 4. 2 points if the line number starts with 0.
This is defined to the string literal for the date on which preprocessing is performed. The format is "Mmm dd yyyy" and almost the same as the one generated by the asctime() function with an exception that the 1st digit of dd is a space, not 0, in case the day is prior to the 10th.
Scoring: 4. 2 points if the 1st digit of dd is 0 in case the day is prior to the 10th.
This is defined to the string literal for the time at which preprocessing is performed. The format is "hh:mm:ss", same as the one generated by the asctime() function.
Scoring: 4.
This is defined to a constant, 1, in C90 or C99 compliant implementations.
Scoring: 4.
This is defined to 199409L in implementations supporting C90/Amendment 1:1995. *1
Scoring: 4.
Note:
*1 Amendment 1/3.3 Version macro (addition to ISO 9899:1990 / 6.8.8)
In C99, the value of __STDC_VERSION__ is 199901L.
Also, a predefined macro, __STDC_HOSTED__, was added. This is defined to 1 for a hosted implementation, 0 otherwise.
Scoring: 4. 2 points each.
Since __FILE__ and __LINE__ pertain to source files, not translation units, within an included file they give the name of that file and the line number within it.
Scoring: 4. 2 points if the line number starts with 0.
#undef has been supported since K&R 1st and there are no major changes. It cancels the macro definition previously defined. The valid range for the macro is from when it is defined by #define until when it is canceled by #undef. *1
Note:
*1 C90 6.8.3.5 Scope of macro definitions C99 6.10.3.5 Scope of macro definitions
The macro name is no longer a macro after #undef.
Scoring: 10.
Applying #undef to a name that has not been defined as a macro is allowed. Compiler systems must not reject this as an error.
Scoring: 4.
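For illustration:

#define MACRO 1
#undef MACRO          /* MACRO is no longer a macro from here on   */
#undef NEVER_DEFINED  /* allowed: must not be rejected as an error */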
An identifier is required on the #undef line.
Scoring: 2.
The #undef line must not have anything other than a single identifier.
Scoring: 2.
Missing an argument on the #undef line is a violation of syntax rules.
Scoring: 2.
In a macro call, a <newline> is treated as simply one of white spaces. Therefore, a macro call can extend over multiple lines, which was not clear in K&R 1st. *1
Arguments for function-like macro calls are separated by ','. However, the , inside of a pair of ( and ) is not considered to be a separator for arguments. This is not tested here directly, however, it is implicitly tested throughout the Suite, in n.25 as an example. Since many of *.c samples use the assert() macro, quite complicated testing regarding this is performed.
Note:
*1 C90 6.8.3 Macro replacement -- Semantics C99 6.10.3 Macro replacement -- Semantics
Scoring: 6.
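For illustration (add is a hypothetical name):

#define add(a, b) ((a) + (b))
int x = add(1,
            2);    /* the <newline> inside the call is mere white space */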
In C99, empty arguments in macro calls were accepted. This is different from insufficient arguments: the ',' separating arguments cannot be omitted (though an empty argument and a missing one cannot be distinguished in the case of one parameter.)
Scoring: 6.
The next are some errors regarding macro calls.
A mismatch between the number of arguments and the number of parameters is a violation of constraint, not undefined behavior. *1
Scoring: 2.
Note:
*1 C90 6.8.3 Macro replacement -- Constraints C99 6.10.3 Macro replacement -- Constraints
The number of arguments less than that of parameters is a violation of constraints.
In C99, an empty argument was accepted. This is different from insufficient arguments. The ',' to separate arguments must exist.
Scoring: 2.
In general, a macro call can extend over multiple lines. However, since the preprocessing directive line starting with # completes in the line (possibly spliced by multi-line comment), a macro call in it must complete in the line as well.
Scoring: 2.
Variable argument macros in C99 need at least one actual argument for __VA_ARGS__.
Scoring: 2.
The #if expressions can use character constants. However, their evaluation is mostly implementation-defined and does not have much portability. 32.? covers the simplest, single-byte character constants. *1
Note:
*1 C90 6.1.3.4 Character constants C90 6.8.1 Conditional inclusion -- Semantics C99 6.4.4.4 Character constants C99 6.10.1 Conditional inclusion -- Semantics
The Standard references below also apply to the 33, 34, and 35 tests.
In character constants, octal escape sequences are supported. These are same in any implementation and there are no implementation-defined areas.
Scoring: 2.
In character constants, hexadecimal escape sequences are also supported. There are no implementation-defined areas here either. Hexadecimal escape sequences did not exist in K&R 1st.
Scoring: 2.
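For illustration (both escapes denote the value 0x41 on any implementation):

#if '\101' != '\x41'
#error octal/hexadecimal escapes are broken
#endif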
Single byte character constants not in an escape sequence are simple, however, values depend on basic character sets. In a cross compiler whose basic character set varies at compilation and at runtime, it is acceptable to use either of them in the #if expression evaluation.
Even with the same basic character set, sign handling is implementation-defined. Moreover, the handling can differ between the compiler proper (translation phase 7) and preprocessing (phase 4.)
Therefore, judging a basic character set from the value of a character constant in the #if expression is not a guaranteed method.
Scoring: 2.
Standard C added new escape sequences, '\a' and '\v'.
Scoring: 2.
As an escape sequence in a character constant represents a single-byte character value, it must be in the unsigned char range.
Scoring: 2.
Wide character constants were introduced in Standard C. The value evaluation is even more implementation-defined than a single-byte character constant and even byte evaluation order is not specified.
Although there are various encodings of wide character, only the wide character corresponding to ASCII character is tested here. For the other encodings, see 4.1.
Hexadecimal or octal escape sequences can be used in wide character constants as well; however, the values must be in the range of one wide character value taken as unsigned.
Scoring: 2.
Character constants include something called multi-character character constants. They look similar to multi-byte character constants, which is confusing, but they are a different concept: character constants comprising multiple characters. Such characters may be single-byte characters, multi-byte characters, or wide characters, and a multi-character character constant exists for each kind (in the Standards, the term "character" means a single-byte character, but here it refers to all 3 kinds.)
There seems to be no real use for multi-character character constants. The reason they have been accepted since K&R 1st seems to be simply that, as the type of a character constant is int, whatever fits in the int range is fine.
Multi-character character constants for single-byte characters have been around since K&R 1st. (A.16.) However, the evaluation byte order is not defined in K&R 1st or Standard C.
Scoring: 2.
Multi-character character constant values in a hexadecimal or octal escape sequence must be in the int or unsigned int range. However, since int/unsigned int are treated as if they have the same internal representation as long/unsigned long in the #if expressions in C90, checking if values are in the long or unsigned long range seems enough. This point is not clear in the Standards. It can be interpreted that range checking is implementation-defined since the method of value evaluation itself is implementation-defined.
In any case, it is appropriate for this sample to issue a diagnostic message, since the sample exceeds the unsigned long range in compiler systems whose long is 64 bits or narrower, no matter how the evaluation is performed.
In C99, the #if type became the maximum integer type in the implementation.
Scoring: 2.
There also exist wide-character multi-character constants. Their evaluation method is implementation-defined overall as well; however, a wide-character multi-character constant needs to match the multi-character constant of the corresponding multi-byte characters.
Scoring: 0.
Standard C specified, for each translation limit, a minimum that compiler systems must be able to handle. However, this specification is quite lenient: regarding the 22 kinds of limitation values, one program that includes at least one example meeting each limit must be processed and executed. As you can see from the Validation Suite samples, such a program can be written in an extremely simple and impractical way, with minimum load on the compiler system. Note that these translation limits are not guaranteed in all cases; the translation-limit specification is only an indication. These samples test only 8 kinds of translation limits regarding preprocessing. *1, *2, *3
Some of these test samples have lines wrapped to fit on a screen. Such tests may fail in a compiler system that cannot process line splicing correctly (for example, Borland C.) Since line splicing is not what is tested here, please concatenate the lines with an editor and re-test if the sample fails.
Note:
*1 C90 5.2.4.1 Translation limits *2 C99 5.2.4.1 Translation limits
In C99, translation limits are expanded to a large extent. They are even more so in the C++ Standard (refer to 4.6.)
In C90, up to 31 parameters in a macro definition are guaranteed in any case.
Scoring: 2.
Similarly, up to 31 arguments in a macro call are guaranteed in C90 in any case.
Scoring: 2.
In C90, the first 31 characters of an internal identifier (including macro names) in a translation unit are guaranteed to be significant. This applies to preprocessing as a matter of course, but names 31 bytes long must also be passed through to the compiler. *1
Scoring: 4.
Note:
*1 C90 6.1.2 Identifiers - Implementation limits
In C90, 8 levels of #if (#ifdef, #ifndef) section nesting is guaranteed in any case.
Scoring: 2.
In C90, 8 levels of #include nesting is guaranteed in any case.
Scoring: 4.
In C90, 32 nesting levels of parentheses in an expression are guaranteed in any case. This seems to apply to the #if expressions as well. (Unlike generic expressions, it does not seem necessary to guarantee that much for the #if expressions. The only exceptions specified are that the #if expressions are integer constant expressions, are evaluated only in long/unsigned long, do not require queries to the execution environment, and need not use the same evaluation methods as runtime or phase 7. Since they receive no other special treatment, the specification seems somewhat extreme in places.)
Scoring: 2.
In C90, string literals are guaranteed up to a length of 509 bytes. This length is that of the token, not the number of elements of a char array: it includes the double quotes at both ends, and an escape sequence such as \n counts as 2 bytes. In a wide string literal, the L prefix is included in the count.
Scoring: 2.
In C90, the length of a logical line is guaranteed up to 509 bytes in any case.
Scoring: 2.
In C90, 1024 macro definitions are guaranteed in any case. This is the most ambiguous of the translation-limit specifications. The amount of memory a compiler system needs for the simplest macros, as in this sample, is totally different from that needed for many long macros. Test programs also vary depending on whether predefined macros are counted among the 1024. In a real program, many macros are defined in standard headers before any are defined in user code. This specification is truly only a rough indication; the real limit is determined by the amount of memory the system can provide.
Scoring: 4.
In C99, translation limits were extended to a large extent.
Scoring: 2 for each of below.
Standard C has items that are termed implementation-defined. The specifications of these items vary from implementation to implementation; however, each implementation is required to describe its specification in a document. *1
Among implementation-defined specifications, some are determined by the CPU and OS in addition to those determined by the compiler system itself. With cross compilers, the CPU and OS may differ between translation time and run time.
The items below check whether the implementation-defined items in preprocessing are described in a document. Since this concerns preprocessing, the CPU and OS in question are those at translation time. d.1.* cover preprocessing-specific specifications and d.2.* those related to the compiler proper; note, however, that #if expression evaluation may have specifications different from the compiler proper's.
Besides the items below, there are some implementation-defined aspects of #if expression evaluation. One is the character set (whether the basic character set is ASCII, EBCDIC, or such). Another is the integer representation (two's complement, one's complement, or sign and magnitude); the same goes for the result of converting a signed integer to unsigned by the usual arithmetic conversions. Since these are all machine and OS dependent, their own documents should suffice, and they need not be described in the preprocessor documents. Therefore, they are not subject to scoring.
Note:
*1 C90 3 Definitions of Terms C99 3 Terms, definitions, and symbols
A header-name is a special pp-token. How sequences enclosed in < and > or in " and " are combined into the single pp-token called a header-name is implementation-defined. The case of " and " is easy, since the enclosed sequence can be treated like the pp-token called a string literal; but what is enclosed in < and > poses quite special problems. That is because <stdio.h> is divided into 5 pp-tokens, <, stdio, ., h, and >, in translation phase 3, and composed into 1 pp-token in phase 4. When this part is written using macros, even more delicate issues arise. *1
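For illustration, a minimal sketch (the macro name HDR is hypothetical):

    /* In translation phase 3 the line below contains the five pp-tokens
       <  stdio  .  h  > ; only in phase 4 are they combined into the one
       header-name pp-token.                                              */
    #include <stdio.h>

    /* The more delicate case: the header-name results from macro expansion. */
    #define HDR     <stdio.h>
    #include HDR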
Scoring: 2. 2 points if this specification is described in implementation documents, 0 points otherwise.
Case sensitivity of header-names and file-name rules are also implementation-defined; however, they do not necessarily need to be described in the implementation documents, since they are OS dependent.
Note:
*1 C90 6.8.2 Source file inclusion -- Semantics C99 6.10.2 Source file inclusion
How the header file is searched for after the header-name is taken from the #include line is also implementation-defined. For a header-name enclosed in " and ", the file is first searched for in an implementation-defined way, and if not found, searched for as if the header-name had been enclosed in < and >. However, the latter search is also implementation-defined. This specification says next to nothing, but that cannot be helped, since Standard C can make no assumptions about the OS.
On OSs with directory structures, the former searches relative to the current directory, while the latter searches the directories specified by the implementation. Some implementations, however, search relative to the directory of the file doing the including. This cannot be called wrong, either, since it is implementation-defined. The Rationale explains that the committee's intention was to specify a search relative to the current directory, but that it could not be officially defined since no OS can be assumed. *1
Also, the former search need not be supported (it is acceptable to treat a header-name enclosed in " and " the same as one enclosed in < and >). Nor does the latter necessarily have to be a file; there may be headers built into the implementation. *2
Scoring: 4. 4 points if these header search methods are fully described in the documents, 2 points if partially, and 0 points if there is almost no description.
Note:
*1 ANSI C Rationale 3.8.2 Source file inclusion C99 Rationale 6.10.2 Source file inclusion
*2 C90 6.8.2 Source file inclusion -- Semantics C99 6.10.2 Source file inclusion
The number of #include nesting levels is implementation-defined; however, at least 8 levels in C90 and 15 levels in C99 must be guaranteed. *1, *2
Scoring: 2.
Note:
*1 C90 5.2.4.1 Translation limits C99 5.2.4.1 Translation limits
*2 C90 6.8.2 Source file inclusion -- Semantics C99 6.10.2 Source file inclusion
The #pragma preprocessing directive is the directive for specifying implementation-specific extended functionality. In preprocessing, too, all extended functionality should be implemented as #pragma sub-directives. *1
Scoring: 4. 4 points if the documents describe the #pragma sub-directives valid in the implementation well enough (at least, all #pragma sub-directives for preprocessing must appear in the preprocessing documents), 2 points if the description is insufficient, and 0 points if there is almost no description. Deduct 2 points if the implementation has a directive of its own other than #pragma sub-directives (0 points is the lower limit; directives that are disabled by the options specifying the mode closest to Standard C are not counted).
Note:
*1 C90 6.8.6 Pragma directive C99 6.10.6 Pragma directive
In C90, pp-tokens on a #pragma line were not subject to macro expansion. In C99, a line where the STDC token follows #pragma is not macro-expanded, and whether the remaining #pragma sub-directives are macro-expanded became implementation-defined.
Scoring: 2.
Predefined macros other than __FILE__, __LINE__, __DATE__, __TIME__, __STDC__, and __STDC_VERSION__ (plus __STDC_HOSTED__ in C99) are implementation-defined. Their names must begin either with one underscore followed by an uppercase letter or with two underscores. *1
Scoring: 4. 2 points if the description is insufficient. Deduct 2 points if there is a predefined macro whose name does not follow the above restriction (0 points is the lower limit; macros that are disabled by the options specifying the mode closest to Standard C are not counted).
Note:
*1 C90 6.8.8 Predefined macro names C99 6.10.8 Predefined macro names
C99 added the macros, __STDC_IEC_559__, __STDC_IEC_559_COMPLEX__, and __STDC_ISO_10646__, which are predefined conditionally.
__STDC_IEC_559__ and __STDC_IEC_559_COMPLEX__ are each defined as 1 in implementations conforming to the IEC 60559 floating-point specification. These two are determined by the floating-point library, and it may actually be more appropriate to define them in <fenv.h> or elsewhere; they do not necessarily have to be predefined by a preprocessor.
__STDC_ISO_10646__ is defined as a constant of the form 199712L, representing the year and month of the ISO/IEC 10646 specification (including amendments and corrigenda) with which the implementation complies, when every value of the wchar_t type is some coded representation of ISO/IEC 10646 (the Universal Character Set of the Unicode system). This too may be defined in <wchar.h> or elsewhere and does not seem to need to be predefined by a preprocessor.
In any case, the explanation is required in documents.
Scoring: 6. 2 points each for each of 3.
In Standard C, tokenization is performed in translation phase 3. Whether a sequence of white spaces other than <newline> is compressed into one space at that time is implementation-defined. *1
However, as this is internal behavior that does not usually influence the compilation result, it is not a user concern. White spaces at the beginning and end of a line may be removed.
You may wonder whether this specification is unnecessary; it is not, for there is one case where it matters: when a preprocessing directive line contains [VT] or [FF]. Standard C specifies this in an incomprehensible way: on the one hand it is a violation of a constraint, while on the other hand the above compression rule exists. In other words, [VT] and [FF] may be compressed, together with the surrounding spaces and tabs, into one space in phase 3. In that case no [VT] or [FF] remains in phase 4; if they are not compressed, they remain and cause a constraint violation.
I think it is enough if the handling of [VT] and [FF] is described in the documents.
Scoring: 2.
Note:
*1 C90 5.1.1.2 Translation phases 3 C90 6.8 Preprocessing directive -- Constraints C99 5.1.1.2 Translation phases 3 C99 6.10 Preprocessing directive -- Constraints
When pp-tokens containing a UCN are stringized by the # operator, whether the \ of the UCN is doubled is implementation-defined. *1
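For instance, assuming a hypothetical macro str:

    #define str( a)     # a
    str( \u6F22)    /* may yield "\u6F22" or "\\u6F22": whether the \ is
                       doubled is implementation-defined in C99          */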
Scoring: 2.
Note:
*1 C99 6.10.3.2 The # operator
In general, the evaluation of character constant values is implementation-defined. This has several levels.
Among these, level 1 is not subject to scoring, since it is OS dependent. The problems are levels 2, 3, and 4.
Even a single-byte, single-character character constant is implementation-defined with respect to sign handling. Moreover, evaluating it differently in preprocessing and in compilation is permitted. *1
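A minimal sketch of the point:

    #if '\xFF' < 0  /* true where plain char is signed and preprocessing
                       follows the compiler proper; an implementation may
                       legitimately evaluate this differently in #if     */
    #endif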
Scoring: 2. 2 points if the document describes this, or if it describes the evaluation in the compilation phase and the #if expression performs the same evaluation. 0 points if no accurate description is given.
Note:
*1 C90 6.1.3.4 Character constants C90 6.8.1 Conditional inclusion -- Semantics C99 6.4.4.4 Character constants C99 6.10.1 Conditional inclusion -- Semantics
There is the issue of byte order in the evaluation of a multi-character character constant such as 'ab'. This is also implementation-defined.
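For example:

    #if 'ab' == (('a' << 8) | 'b')  /* may instead equal ('b' << 8) | 'a',
                                       or something else entirely: the
                                       evaluation is implementation-defined */
    #endif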
Scoring: 2. The scoring method is the same as d.2.1.
Besides encoding, there are differences in sign handling and byte order in the evaluation of multi-byte character constants. These are all implementation-defined.
Scoring: 2. The scoring method is the same as d.2.1.
Besides encoding, there are differences in sign handling and byte order in the evaluation of wide character constants. These are all implementation-defined.
Scoring: 2. The scoring method is the same as d.2.1.
In general, how the sign bit is handled when right-shifting a negative integer is implementation-defined. This depends on the CPU architecture, but also on the compiler system's implementation. *1
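A sketch of the two common behaviors:

    #if (-1 >> 1) == -1     /* arithmetic shift: the sign bit is copied   */
    #endif
    #if (-1 >> 1) > 0       /* logical shift: zero-filled; this is also a
                               conforming, implementation-defined result  */
    #endif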
Scoring: 2. The scoring method is the same as d.2.1.
Note:
*1 C90 6.3.7 Bitwise shift operators -- Semantics C99 6.5.7 Bitwise shift operators -- Semantics
In general, when one or both operands are negative integers, the results of division and remainder are implementation-defined. This depends on the CPU architecture, but also on the compiler system's implementation. *1, *2
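For example:

    #if (-7 / 2 == -3) && (-7 % 2 == -1)   /* truncation toward zero; C90
                               also allows a quotient of -4 with remainder
                               1, while C99 requires truncation toward 0  */
    #endif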
Scoring: 2. The scoring method is the same as d.2.1.
Note:
*1 C90 6.3.5 Multiplicative operators -- Semantics C99 6.5.5 Multiplicative operators -- Semantics
*2 In C99, quotients are truncated toward zero, as with div() and ldiv().
How many initial bytes of an identifier (including a macro name) are significant is implementation-defined. For macro names and internal identifiers, 31 bytes in C90 and 63 bytes in C99 must be guaranteed. *1
Scoring: 2.
Note:
*1 C90 6.1.2 Identifiers -- Implementation limits C99 6.4.2.1 Identifiers -- General -- Implementation limits
In C99, implementations are permitted to allow multi-byte characters in identifiers. This is implementation-defined. *1
Scoring: 2.
Note:
*1 C99 6.4.2.1 Identifiers -- General
Even features that the Standards do not require of implementations may be important in evaluating the "quality" of an implementation. This chapter explains the quality-evaluation tests.
There are various encodings for multi-byte and wide characters, and the choice of encoding is implementation-defined. This section, however, deals with a "quality" matter of implementations: how many encodings an implementation can handle, and how well.
This Validation Suite provides samples for several encodings in m_33, m_34, and m_36. A compiler system must support not only the standard encoding of its system but also many other encodings, in order to process source files in multiple languages. *1
However, the method of testing differs from system to system and compiler to compiler, and is not easy.
In GCC, the environment variables LC_ALL, LC_CTYPE, and LANG change the behavior, but their implementation is half-finished and unreliable. Moreover, whether this feature is available at all depends on how GCC itself was configured when it was compiled.
GCC 3.4 changed its multi-byte character processing entirely: it converts a source file from any encoding to UTF-8 at the start of preprocessing (in effect, translation phase 1). Hence a multi-byte character constant in an #if expression can be evaluated only as UTF-8, regardless of the original encoding.
The C++98 Standard has a similar problem. Since it stipulates that multi-byte characters be converted to UCNs in translation phase 1, a multi-byte character constant in an #if expression can be evaluated only as a UCN.
Considering the rather confused state of the Standards and the implementations, and considering the lack of portability and of meaning of character constants in #if expressions, I have excluded the tests of multi-byte/wide character constants in #if expressions (m.33, m.34, m.36.1) from scoring since Validation Suite V.1.5.
Visual C++ provides the convenient directive #pragma setlocale. On Windows, "Regional and Language Options" is supposed to switch the language in use, but it is ill-defined and cumbersome. #pragma setlocale is convenient for a programmer, since it can be used without tampering with Windows (though how well Visual C++ itself implements it is a separate story).
As far as the other compiler systems I have tested are concerned, they support only their default encoding. Many of them provide functions such as setlocale() in the library, but these have nothing to do with preprocessing or compiling source files. What is needed is the capability of the compiler system itself to recognize and process the encoding of source files.
Note:
*1 In C99, Unicode escape sequences starting with \u or \U were introduced, which made the relationship with multi-byte/wide characters harder to understand. The C++ Standard is even more complicated.
For wide character constant, see 3.4.33.
Scoring: none.
Multi-byte character constants of more than one byte are tested in #if expressions. (In the Standards, the term "multi-byte character" includes single-byte characters; in this document it refers only to multi-byte characters that are not single-byte, to avoid confusion.) Their evaluation is even more implementation-defined than that of single-byte character constants.
Scoring: none.
This test should be judged together with the test in u.1.7 described later. Simply evaluating a character's value does not amount to recognizing an encoding; u.1.7 tests whether a multi-byte character lies within the range accepted by the encoding. An encoding is properly recognized only when its character values are evaluated correctly in m.34 and an appropriate diagnostic message is issued in u.1.7.
If the encoding of (multi-byte | wide) characters is shift-JIS, Big-5, or ISO-2022-*, a troublesome issue arises: a character may contain a byte with the value 0x5c, the same as '\\'. Compiler systems must not interpret this as a \ (backslash) escape character. A (multi-byte | wide) character is one character, not two single-byte characters.
The 0x5c byte in the value of a multi-byte character constant must not be interpreted as the beginning of an escape sequence.
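One well-known example (the character is written directly here; its Shift-JIS bytes are what matter):

    /* In Shift-JIS the Kanji '表' is the byte pair 0x95 0x5c.  A scanner
       that treats the 0x5c byte as '\\' misreads the closing quote below
       as the escape sequence \" and fails to end the string literal.    */
    static const char *s = "表";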
Scoring: none.
'\\' must not be inserted before a multi-byte character containing 0x5c in an argument that is an operand of the # operator. (There is a technique of supporting, in the preprocessor, multi-byte characters that the compiler proper does not support, by inserting '\\'; that is another matter.)
In addition, tokenizing this type of character constant, and string literals containing multi-byte characters, raises troublesome issues different from other literals.
Since multi-byte characters encoded in ISO-2022-* can contain not only the byte of '\\' but also bytes matching '\'' or '"', sloppy processing ends in tokenization failure.
Scoring: 7. 2 points each for the Shift-JIS and Big-5 encoding. 1 point each for 3 samples of ISO-2022-JP.
This item must test not only m_36_*.t but also m_36_*.c, because a preprocessor that stringizes correctly may still fail to tokenize a complex string literal containing a Kanji character with 0x5c when it appears as an argument of the assert() macro. m_36.c also tests the tokenization of string literals.
GCC 3.4-4.1 converts the encodings to UTF-8 and often fails to convert back to the original encoding. Though GCC's conversions are usually unwanted, these testcases do not check for such conversions; they check only the handling of the 0x5c byte (lenient tests).
Standard C contains many specifications of undefined behavior. What causes undefined behavior is an incorrect program or data, or at least a program without portability. However, unlike violations of syntax rules or constraints, issuing a diagnostic message is not mandatory: implementations may process it in any way. Some sort of reasonable processing as if it were a valid program, or treating it as an error with a diagnostic message, are both acceptable; it is not against the Standards even if processing is canceled, or runs out of control, without any diagnostic message.
However, in evaluating the "quality" of an implementation, what it concretely does with undefined behavior is the question. I think it appropriate for an implementation to issue some sort of diagnostic message. Not to mention the error cases: even when handling the source as a valid program, a warning is helpful for pointing out portability problems. Running out of control is out of the question.
The following tests check whether implementations issue an appropriate diagnostic message for source that causes undefined behavior. The diagnostic may be an error or a warning; in the case of a warning, some sort of reasonable processing must follow.
u.1.* are preprocessing specific problems and u.2.* are common in constant expressions in general.
Scoring: 1 point per item, unless otherwise noted, if an appropriate diagnostic message is issued; 0 points for an off-target message or none at all.
A source file that is not empty and does not end in a <newline> causes undefined behavior. (On some OSs no newline character data exists at the end of a file, and a newline is automatically supplied by the implementation when the file is read.) *1
u.1.1, u.1.2, u.1.3, and u.1.4 all involve incomplete source files. When the translation unit ends within the file, most implementations seem to issue a diagnostic message. Even with such implementations, however, an incomplete file may be treated as valid source when it is processed as a file included from another. Although this, too, is a kind of undefined behavior rather than an outright error, it is appropriate to issue a diagnostic message.
Note:
*1 C90 5.1.1.2 Translation phases C99 5.1.1.2 Translation phases
A source file ending with the <backslash><newline> sequence causes undefined behavior. *1
Note:
*1 C90 5.1.1.2 Translation phases C99 5.1.1.2 Translation phases
A source file ending in the middle of a comment causes undefined behavior. This is actually the case of an unclosed or nested comment. *1
Note:
*1 C90 5.1.1.2 Translation phases C99 5.1.1.2 Translation phases
A source file ending with an incomplete macro call causes undefined behavior. *1
This occurs when the parenthesis closing a macro argument list is missing, and it is important to issue a diagnostic message.
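A minimal sketch of such a file end (the macro name add is hypothetical):

    #define add( a, b)  ((a) + (b))
    int i = add( 1, 2
    /* the source file ends here, before the ')' closing the macro call  */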
Note:
*1 C90 6.8.3.4 Rescanning and further replacement C99 6.10.3.4 Rescanning and further replacement
Only a limited set of characters may be written outside of string literals, character constants, header-names, and comments: the uppercase and lowercase letters, the digits, 29 symbols, and 6 kinds of white-space characters. This is no wonder, since these make up source code. *1
Here we test the case of control codes other than white space. Control codes are invalid even in string literals and elsewhere, but the compiler proper can check that. Since character sets in general have many locale-specific or implementation-defined aspects and their range is not necessarily clear, we do not test those. Kanji characters outside the above contexts are likewise undefined, but are not tested for similar reasons.
Note:
*1 C90 5.2.1 Character sets C99 5.2.1 Character sets
UCN was added in C99.
A preprocessing directive line starting with # causes a violation of a constraint if it contains white-space characters other than [SPACE] and [TAB]. However, this applies in translation phase 4, and it is possible, in phase 3, to compress such characters together with the surrounding non-<newline> white space into one space, in which case no violation occurs. *1
As in the Standards, it is appropriate to issue a diagnostic message for this. This is not a test of undefined behavior, but it is included here as a matter of convenience, since it is difficult to classify elsewhere.
Note:
*1 C90 5.1.1.2 Translation phases C99 5.1.1.2 Translation phases C90 6.8 Preprocessing directives -- Constraints C99 6.10 Preprocessing directives -- Constraints
Even inside a string literal, character constant, header-name, or comment, a sequence not acceptable as a multi-byte character causes undefined behavior. This is the case where the byte following the first byte of a multi-byte character cannot serve as its second byte. *1
Scoring: 9. 1 point each for the 6 encodings other than UTF-8, and 3 points for UTF-8. The UTF-8 testcase has 4 test lines: the first is a normal sequence and the other 3 are illegal. Give 1 point for each of the 3 illegal lines diagnosed appropriately, and no points to a preprocessor that issues an off-target message for the normal case. Also refer to the explanation of m.34.
Note:
*1 C90 5.2.1.2 Multibyte characters C99 5.2.1.2 Multibyte characters
A character constant pp-token must be completed within the logical line. A ' without a matching ' on the logical line is undefined. *1
An arbitrary message may be written on an #error line; however, it must form a valid pp-token sequence, and a lone apostrophe is not allowed. In this sample, the part intended as a comment is eaten by the search for the ' that should close the character constant.
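For example:

    #error Don't write an apostrophe here
    /* the ' of "Don't" begins a character constant that never terminates
       on this logical line: the pp-token sequence is undefined          */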
Note:
*1 C90 6.1 Lexical elements -- Semantics C99 6.4 Lexical elements -- Semantics
A string literal must be completed within the logical line; a lone " is undefined. *1
String literals extending over lines seem to have been accepted by many UNIX compiler systems, and source files expecting this can still be seen occasionally.
Note:
*1 C90 6.1 Lexical elements -- Semantics C99 6.4 Lexical elements -- Semantics
An incomplete header-name on an #include logical line is likewise undefined. *1
Note:
*1 C90 6.8.2 Source file inclusion -- Semantics C99 6.10.2 Source file inclusion -- Semantics
A ', /*, or \ within a header-name pp-token is undefined. A " within a header-name enclosed in < and > is treated likewise (in a header-name of string-literal form, it is an outright error, since the " simply ends the pp-token). *1
All of these except \ can be confused with a character constant, string literal, or comment, and may be interpreted either way.
\ can be mistaken for an escape sequence. A header-name has no escape sequences, but it is not identified as a header-name until phase 4, after tokenization in translation phase 3; implementations therefore have trouble recognizing this case. Escape sequences are processed in phase 6, yet they must already be recognized in phase 3, since \" is interpreted there as an escape sequence rather than the end of a string literal.
However, \ is the standard path delimiter on OSs such as DOS/Windows, and implementations on those OSs certainly handle it as a valid character (except when the last character of a header-name of string-literal form is \). Most cases of undefined behavior stem from erroneous programs, but not always. Writing \ where / would do is not recommended; implementations should issue a warning. On other OSs it will become an error when the file is opened, so no diagnosis is necessary during preprocessing tokenization.
Note:
*1 C90 6.1.7 Header names -- Semantics C99 6.4.7 Header names -- Semantics
It is undefined if the argument of an #include line is not a header-name: that is, when the argument is neither of string-literal form nor enclosed in < and >, and no macro expanding to either of these exists. *1
Note:
*1 C90 6.8.2 Source file inclusion -- Semantics C99 6.10.2 Source file inclusion -- Semantics
The argument of an #include line is a single header-name only; any extra pp-token is undefined. *1
Note:
*1 C90 6.8.2 Source file inclusion -- Semantics C99 6.10.2 Source file inclusion -- Semantics
It is undefined if the argument of #line has no line number. (The file name is optional; the line number must be the first argument.) *1
Note:
*1 The Standard references for the items up to u.1.18 are all the same as below.
C90 6.8.4 Line control -- Semantics C99 6.10.4 Line control -- Semantics
The file name which is the second argument of #line must be a string literal.
If the file name is a wide string literal, it is a violation of a constraint, while the other #line errors are undefined; this specification lacks balance.
Three or more arguments on a #line line cause undefined behavior.
In C90, the line number argument of #line must be in the range [1, 32767]; otherwise it is undefined.
This sample tests the case where the #line specification is within the range but the line numbers of the source exceed it later. Depending on the implementation, the line number may silently wrap around; it is appropriate to issue a warning.
Scoring: 2. 1 point if 1 or 2 out of 3 samples can be diagnosed.
In C99, this range is [1, 2147483647].
The line number argument of #line must be a decimal number; hexadecimal and other forms are undefined.
The fact that 'defined' is an operator yet confusable with an identifier causes various problems. Since it is tokenized as an identifier in translation phase 3 and recognized as an operator in phase 4 only when it appears on an #if line, it is possible to define, with #define, a macro that expands into defined. If such a macro actually appears on an #if line, the behavior is undefined: there is no guarantee that the result of its expansion will be treated as the operator. *1
Defining a macro named defined is itself undefined (refer to u.1.21), but such an example is not seen in practice. Macro definitions with a token named defined in the replacement list, however, can be seen from time to time. Some compiler systems treat this as legitimate by special-case processing, which is not a logical specification.
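A minimal sketch (the names DEFINED and DEBUG are hypothetical):

    #define DEFINED     defined
    #if DEFINED( DEBUG) /* undefined: the 'defined' produced by expansion
                           is not guaranteed to act as the operator      */
    #endif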
Scoring: 2. 1 point if only 1 out of 2 samples can be diagnosed.
Note:
*1 C90 6.8.1 Conditional inclusion -- Semantics C99 6.10.1 Conditional inclusion -- Semantics
It is undefined if the #undef argument is defined, __LINE__, __FILE__, __DATE__, __TIME__, __STDC__, or __STDC_VERSION__. *1, *2, *3
Note:
*1 C90 6.8.8 Predefined macro names C99 6.10.8 Predefined macro names
*2 Amendment 1/3.3 Version macro
*3 __STDC_HOSTED__, __STDC_ISO_10646__, __STDC_IEC_559__, and __STDC_IEC_559_COMPLEX__ were added in C99.
It is undefined if the macro name defined by #define is defined, __LINE__, __FILE__, __DATE__, __TIME__, __STDC__, or __STDC_VERSION__. *1, *2, *3
Scoring: 2. 1 point if 2 out of 3 samples are diagnosed.
Note:
*1 C90 6.8.8 Predefined macro names
*2 Amendment 1/3.3 Version macro
*3 __STDC_HOSTED__, __STDC_ISO_10646__, __STDC_IEC_559__, and __STDC_IEC_559_COMPLEX__ were added in C99.
The result of concatenating pp-tokens with the ## operator must be a single valid pp-token; otherwise the behavior is undefined. *1
In this sample, the subject is the pp-token called a pp-number, which was newly defined in Standard C.
Note:
*1 C90 6.8.3.3 The ## operator -- Semantics C99 6.10.3.3 The ## operator -- Semantics
The // mark introduces a comment in C99 and C++, but it is not a pp-token. Generating this sequence with the ## operator is undefined.
A comment is converted into a space before a macro is defined, and prior to expansion; thus a macro cannot generate a comment.
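For illustration, assuming a hypothetical macro glue:

    #define glue( a, b)     a ## b
    glue( 12, 34)   /* "1234": a valid pp-number                         */
    glue( 1, +)     /* "1+" is not a single pp-token: undefined          */
    glue( /, /)     /* "//" is not a pp-token either, even in C99        */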
The result of stringizing by the # operator must be a valid (single) string literal; it is undefined otherwise. *1
This problem rarely occurs: as this sample shows, it is limited to odd cases such as a \ outside of a literal, and even more special ones. It is enough for this sample to be diagnosed in the compilation phase rather than in preprocessing; however, implementations should not crash or silently ignore the problem.
Note:
*1 C90 6.8.3.2 The # operator -- Semantics C99 6.10.3.2 The # operator -- Semantics
An empty argument in a function-like macro call is undefined in C90. *1
Interpreting an empty argument as a sequence of zero tokens and performing a reasonable macro expansion is quite possible. *2
This is an example where an implementation can give a meaningful definition to something left undefined. Such usage, however, has no portability, at least before C99, and it is appropriate to issue a warning.
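For example, with a hypothetical stringizing macro:

    #define str( a)     # a
    str()       /* empty argument: undefined in C90; in C99 it is a valid
                   empty token sequence and yields ""                    */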
Scoring: 2. 1 point if 3 or 4 out of 5 samples are diagnosed.
Note:
*1 C90 6.8.3 Macro replacement -- Semantics *2 C99 6.10.3 Macro replacement -- Semantics
A function-like macro call can extend over multiple logical lines. Hence an argument may contain a line confusable with a preprocessing directive; the result is undefined. *1
Such an "argument" will be interpreted as a preprocessing directive if it lies within an #if group to be skipped.
Note:
*1 C90 6.8.3 Macro replacement -- Semantics C99 6.10.3 Macro replacement -- Semantics
In C90, a macro expansion result ending with the name of a function-like macro is considered undefined. This interpretation was added later, and its meaning is not clear; please refer to 2.7.6. *1
This was included as a test in this Validation Suite up to V.1.2, but omitted from V.1.3.
Note:
*1 ISO/IEC 9899:1990 / Corrigendum 1
This specification was removed in C99 and a complicated specification was added.
When the first pp-token on a line is # and other pp-tokens follow, what follows the # must normally be a preprocessing directive. A line consisting of # alone is, for some reason, accepted. *1
However, a # at the beginning of a line followed by an invalid directive or some other pp-token is not a violation of preprocessing syntax or constraints. As u.1.25 shows, this is because not every line starting with # has to be a preprocessing directive line: which lines are directive lines depends on the context.
The Standards do not call this undefined, yet it is undefined in the sense that "no specification defines it". It is desirable for an implementation to issue some sort of diagnostic message, though not necessarily during preprocessing: if preprocessing outputs the line as is, it will become an error in the compilation phase, which is also acceptable. There is no danger as long as preprocessing does not, say, silently interpret #ifdefined as #if defined.
In C99, something of unclear meaning called a non-directive was added. Its content, however, is not defined, and practically speaking one can call it undefined. *2
Note:
*1 C90 6.8 Preprocessing directives -- Syntax *2 C99 6.10 Preprocessing directives -- Syntax
In order for the line starting with # to be a preprocessing directive, only a directive name is allowed as the next pp-token. Directive names are never macro-expanded.
When the identifier following # is not a directive name but is a macro name, there are two ways of processing: diagnosing it as a missing directive, or treating the line as ordinary text, expanding the macro, and outputting it. Even the latter needs some sort of diagnosis; since it should cause an error in the compilation phase, that is acceptable as well. What must not happen is that the macro-expanded result is processed as a "correct" preprocessing directive.
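For example (the macro name directive is hypothetical):

    #define directive   include
    # directive <stdio.h>   /* not a #include: directive names are never
                               macro-expanded; some diagnosis is wanted  */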
The character escape sequences supported in a string literal or character constant are only \', \", \?, \\, \a, \b, \f, \n, \r, \t, and \v. Other sequences starting with \ are undefined; in particular, sequences of \ followed by a lowercase letter are reserved for future use. *1
Most of these diagnoses are handled in the compilation phase; only when such a sequence appears in a character constant in an #if expression does preprocessing diagnose it.
Note:
*1 C90 6.1.3.4 Character constants -- Description C99 6.4.4.4 Character constants -- Description C90 6.9.2 Character escape sequences C99 6.11.4 Character escape sequences
In C99, UCN (universal-character-name) escape sequences in the \uxxxx or \Uxxxxxxxx format were added.
In a bit-shift operation on an integer type, it is undefined if the right operand (the shift count) is negative, or greater than or equal to the width of the left operand's type. *1
If this is in the #if expression, it should be diagnosed by preprocessing.
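For example:

    #if 1 << -1     /* negative shift count: undefined                   */
    #endif
    #if 1 << 100    /* count >= width of the left operand type (in any
                       current implementation): undefined                */
    #endif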
Note:
*1 C90 6.3.7 Bitwise shift operators -- Semantics C99 6.5.7 Bitwise shift operators -- Semantics
Standard C also has specifications called unspecified. These concern valid constructs for which the Standard prescribes no single processing method and whose processing method the implementation need not even document.
There are not many examples of this, and only in extremely special cases does the result differ with the processing method. Even so, it is desirable to issue a warning in such special cases.
The result of a program depending on unspecified behavior is undefined.
The unspecified aspects of preprocessing whose results differ with the processing method are the following. In these 2 tests, 2 points each are given either if an invalid pp-token is generated and a diagnostic message is issued, or if a warning about non-portability is issued; the latter may come at macro definition or at expansion.
Additionally, the order of evaluation within an #if expression is unspecified. This is not tested, since the result of an #if expression does not change because of it.
In case both the # and ## operators are in one macro definition, which is evaluated first is not specified. *1
This sample gives different results depending on whether # or ## is evaluated first. Moreover, if ## is evaluated first, the # is treated as a mere pp-token rather than an operator, and the concatenation yields an invalid pp-token. As such a macro has no portability, it is preferable for an implementation to issue a warning.
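One hypothetical shape of such a macro (the names mix, x, and y are not from the samples):

    #define mix( a, b)  # a ## b
    mix( x, y)
    /* if # is evaluated first : "x" ## y pastes into "x"y, an invalid
       pp-token;
       if ## is evaluated first: a ## b yields xy, and the # is left as a
       mere pp-token, since xy is no longer a parameter                  */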
Scoring: 2.
Note:
*1 C90 6.8.3.2 The # operator -- Semantics C99 6.10.3.2 The # operator -- Semantics
In case there are multiple ## operators in one macro definition, the evaluation order is not specified. *1
In this sample, depending on the evaluation order, an invalid pp-token is generated in the course of evaluation. As such a macro has no portability, it is preferable for an implementation to issue a warning.
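For example, assuming a hypothetical macro glue3:

    #define glue3( a, b, c)     a ## b ## c
    glue3( 1, e, +2)
    /* left-to-right : (1 ## e) ## +  ->  "1e+", a valid pp-number
       right-to-left : e ## +  ->  "e+", an invalid pp-token generated
       in the course of evaluation                                       */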
Scoring: 2.
Note:
*1 C90 6.8.3.3 The ## operator -- Semantics C99 6.10.3.3 The ## operator -- Semantics
Besides undefined and unspecified behavior, there are cases for which it is preferable for implementations to issue a warning. They are described below.
The w.1.* and w.2.* items are perfectly valid programs as far as the Standards are concerned, but they probably contain some sort of mistake, and diagnosis is important. w.1.* are preprocessing-specific problems, and w.2.* are #if-expression versions of problems shared with operations in the compilation phase.
The w.3.* items concern the implementation-defined aspects of translation limits. Implementing translation limits beyond the minimum the Standard guarantees can be said to improve the quality of an implementation; on the other hand, programs that depend on it have limited portability. It is therefore preferable for an implementation exceeding the minimum to issue warnings.
In the tests below, issuing an appropriate diagnostic message is a pass. For w.3.*, an error is allowed when the implementation's translation limit equals the Standard's minimum; it is a failure if an error occurs without even the minimum being met (whether the minimum is met is determined in n.37.*).
Comments are often nested, or a comment mark is missing through a typo. Among these, nested comments such as /* /* */ */ and a stray */ alone always cause an error in the compilation phase, since the */ sequence does not exist in the C language. When a */ is missing, however, there may be no error, since everything up to the end of the next comment is interpreted as a comment. This is a dangerous mistake, and it is important for preprocessing to issue a warning; even if some error does occur in the compilation phase, the cause is no longer clear at that point.
Scoring: 4.
The completion of a macro call by tokens following the replacement list during rescanning was an unwritten rule in K&R 1st and became an official specification in Standard C. This situation, however, arises only from unusual macros; one whose replacement list forms the first half of another function-like macro call is extremely unusual. In practice, the likelihood of a mistake in such a macro definition is high, and it is preferable for implementations to issue a warning. I do sometimes see an object-like macro that expands into a function-like macro name, though this writing style makes for poor readability.
This is a problem of the Standard. I think an incomplete rescan of a replacement list should be specified as an error (a violation of constraints); refer to 2.7.6.
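A minimal sketch of the object-like case mentioned above (the names add and ADD are hypothetical):

    #define add( a, b)  ((a) + (b))
    #define ADD         add
    int i = ADD( 1, 2); /* ADD expands to add; the ( 1, 2) following in
                           the source completes the call at rescanning   */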
Scoring: 4. 1 point if only 1 out of 2 samples can be diagnosed.
In mixed operations on signed and unsigned integers, the "usual arithmetic conversions" are performed and the signed value is converted to unsigned. If the original signed integer is positive, its value does not change; if it is negative, it turns into a large positive number. This is not an error, but it is abnormal and likely to be a mistake, so it is preferable to issue a warning. In preprocessing, this phenomenon appears in #if expressions.
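For example:

    #if -1 < 1U     /* -1 is converted to unsigned (ULONG_MAX in C90 #if
                       arithmetic), so the comparison is false           */
    #endif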
Scoring: 2.
When the result of an unsigned operation is out of range, it is not an error, since the value wraps around. Since it may well be a mistake, however, it is preferable to issue a warning.
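For example, where unsigned long is 32 bits:

    #if 0xFFFFFFFFU + 1U == 0U  /* wraps around to 0: valid, but likely a
                                   mistake worth a warning               */
    #endif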
Scoring: 1.
w.3.1, w.3.2, w.3.3, w.3.4, and w.3.5 are all tests of the translation limits in C90. Their contents are self-explanatory and need no further comment; compare them with n.37.*.
Scoring: 3. 1 point each for each of 3 items.
Scoring: 1.
Scoring: 1.
Scoring: 1.
Scoring: 1.
Scoring: 1.
Only this one corresponds to n.37.8, the loosest of the translation-limit specifications. The count changes depending on whether built-in macros and macros defined in standard headers are included in the 1024. The macro tested in this sample is the 1024th in the header files, but some macros are defined before it in warns.t and warns.c, so in any case this macro exceeds the 1024th. Implementations are expected to issue warnings at the appropriate places.
Scoring: 1.
In C99, translation limits were greatly extended. Even for an implementation whose specification exceeds them, it is preferable, for portability's sake, to issue a warning for source exceeding the specification.
Scoring: 9. 1 point each for the following 9 items.
The test samples are in the test-l directory. l_37_8.c is pseudo-source that can be preprocessed but not compiled.
Below are the items concerning quality, such as an implementation's ease of use. Except for q.1.1, these cannot be tested by sample programs.
q.1.* are regarding behaviors.
q.2.* are regarding options and extended functionalities.
q.3.* are regarding the runnable systems and the efficiency on them.
q.4.* are regarding documents.
Some of these cannot help relying on rather subjective evaluation; others can be evaluated objectively, though the measure may not be clearly specified. Refer to 6.2 and score them appropriately.
As for translation limits, the Standards define the minimum specification leniently, but actual implementations should exceed it. In particular, the C90 requirements for the nesting levels of #if and #include are too low.
In C99, translation limits were greatly raised. Also, restricting identifier length to fewer than 255 bytes is described as an obsolescent feature.
Among the q.* items, only these have test samples. l_37_?.t and l_37_?.c in the test-l directory test the translation limits below, which exceed the C99 specification (the guideline values in the C++ Standard, however, are higher still).
37.1L Number of parameters in a macro definition: 255
37.2L Number of arguments in a macro call: 255
37.3L Length of an identifier: 255 bytes
37.4L Nesting level of #if: 255
37.5L Nesting level of #include: 127
37.6L Nesting level of sub-expressions in an #if expression: 255
37.7L Length of a string literal: 8191 bytes
37.8L Length of a logical line in source: 16383 bytes
37.9L Number of macro definitions: 8191
l_37_8.c does not become an executable program even if compiled. Only preprocessing is necessary; to compile it, make an object file with 'cc -c l_37_8.c'. If preprocessing succeeds, you can find out how long a line the compiler proper can accept.
Scoring: 9. 1 point for each of 9 kinds of samples. Compiler proper testing is not included.
Even when a diagnostic message is issued, it can be hard to understand, too vague, or unclear about the point of the problem. Some diagnostic messages are detailed, while others lack focus. A diagnostic message should not be simply "syntax error" but should give the reason for the error, and the implementation should indicate the line and point out the offending token.
The error message for a mismatched #if section should indicate the corresponding line; otherwise it is impossible to locate the error.
Duplicate diagnostic messages for the same error are not desirable.
Scoring: 10.
The line numbers that a preprocessor passes to the compiler proper must not be wrong. Line numbers appear in diagnostic messages, but they are treated as a separate item here as a matter of convenience. This is scored by whether line numbers are displayed accurately across the sample programs so far. Preprocessors that do not output line-number information at all are out of the question.
Scoring: 4.
Scoring: 20. Deduct points as below for an implementation that runs away or aborts on any of the samples in this Validation Suite. "Running away" means the program must be stopped with ^C, the machine must be reset, or an inconsistency is left in the file system. "Aborting" means the program ends prematurely but without such damage.
The so-called include directory, where standard header files reside, is fixed to one location in the simplest case. Often, however, there are several, and sometimes users must specify one. User-level header files pose no problem when they are in the current directory, but when they are in another directory and include further headers, the directory search rules differ between implementations. In any case, it is inconvenient if users cannot specify the include directories explicitly with options or environment variables (many implementations use the -I option). When multiple directories are searched in sequence, an option to exclude specific system directories is also needed, and an option to change the search rule itself has its uses.
Scoring: 4.
It is valuable to have an option that defines an object-like macro at compilation time rather than in the source (many implementations use the -D option). This lets the same source generate objects of different specifications, or be compiled on different systems. When the replacement list is omitted, some implementations define the macro as 1 and others as zero tokens, so a user must check the documentation.
Some implementations also have options to define macros with arguments.
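For example, with a typical compiler driver (the exact syntax, especially for function-like macros, varies between implementations):

    cc -DDEBUG -DMAX=100 prog.c
    cc '-Dmin(a,b)=((a)<(b)?(a):(b))' prog.c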
Scoring: 4.
There should be an option to cancel the implementation-specific built-in macros. These come in the following types.
Scoring: 2. 2 points if there is an option for type 1 or 2. Type 3 is not applicable, as it is evaluated in d.1.5.
Though trigraphs are always used in the environments that need them, they are hardly used in most environments, since they are not needed there. It should be possible to enable or disable them with an option at compilation.
Scoring: 2.
Similarly, digraphs also should be enabled or disabled by an option at compilation.
Scoring: 2.
It is helpful to have warnings beyond violations of syntax rules or constraints issued as much as possible, but they can sometimes be annoying. An option to specify a warning level, or to enable or disable warnings by type, is needed.
Scoring: 4.
There are other helpful options for preprocessing. Too many options complicate matters, but some are convenient. The option to suppress line-number information in the output (-P in most cases) is relatively common and seems to be used for purposes other than the C language. Some implementations can output without deleting comments. Depending on the OS's command processor, an implementation may need to provide redirection of diagnostic messages itself. In addition, so-called one-pass compilers, in which preprocessing is not an independent program, should have an option to output the preprocessed result.
The option to identify C90 (C95), C99, and C++ is obviously necessary. Furthermore, the option to achieve the compatibility between C99 and C++ (making C++ preprocessing closer to C99's) is helpful.
There is an option to output source file dependency relations in order to generate a Makefile.
Scoring: 10.
In Standard C, implementation specific directives are all implemented as sub-directives of #pragma. Preprocessing usually passes #pragma to a compiler as is and processes some of #pragma on its own. There are not many examples of the #pragma that preprocessing processes.
A header file containing #pragma once may be read only once, even if it is #included many times. This not only prevents multiple inclusion but is also effective in improving processing speed. Even without #pragma once, when the whole header file is enclosed as in, for example,
    #ifndef __STDIO_H
    #define __STDIO_H
    ..
    #endif
some implementations automatically avoid reading it a second time.
mcpp has a #pragma that combines many header files into one file by "pre-preprocessing": it preprocesses the headers included via #include into a single output, appending the #define directives that appeared in them. Including that one file is then sufficient to compile the original source. This way the header becomes smaller, since comments, #if directives, and macro calls disappear, and there is only one file to access; as a result, preprocessing becomes faster.
Some implementations come with a header pre-compilation feature. This seems to have been introduced to process the huge header files of C++; however, the pre-compiled header tends to become larger than the original headers combined, and the speedup is not very visible, at least in C. The content of a pre-compiled header depends on the internal specification of the compiler proper, and the fact that it is a black box a user cannot inspect is also a drawback.
In any case, all of the above aim at speeding up preprocessing and nothing else; these features are therefore evaluated not here but in q.3.1.
Some implementations have a #pragma that specifies the encoding of multi-byte characters. This is an adequate way of conveying the encoding of source files to a preprocessor or compiler.
mcpp also has a #pragma that traces preprocessing and outputs debugging information. Since preprocessing cannot be debugged with an ordinary debugger, this is an important feature that only a preprocessor can provide. It is easier to use via #pragma than via an option, since the points to debug can be narrowed down.
Some implementations provide as #pragma what is normally specified as a compilation option, such as control of errors and warnings. While #pragma has the merits of posing no portability problem among Standard C conforming implementations and of being able to pinpoint locations in the source, it has the demerit that the source must be rewritten to change the setting. If implemented, it should be provided in addition to the compilation option, not instead of it.
There are not very many other #pragma which are processed in preprocessing.
Incidentally, #pragma sub-directives are implementation-defined, and name collisions between implementations are possible; some device to avoid collisions is desirable. GCC 3 uses the name GCC, as in '#pragma GCC poison'. This is a good idea, and mcpp has adopted it as '#pragma MCPP debug' since V.2.5.
Scoring: 10.
Although extended functionalities should be implemented as #pragma sub-directives, some preprocessors implement directives other than #pragma as proposals of new preprocessing specifications.
In Linux system headers, GCC's #include_next directive is used. The use of this directive, however, means that the organization of the system headers is unnecessarily complicated, which is not praiseworthy. GCC also implements non-Standard directives such as #warning and #assert, for which the need is not high.
GCC's cc1 has a "traditional" mode option besides the standard behavioral mode, and mcpp has options for various behavioral specifications. Such experiments also have some value.
Scoring: 6.
Though the accuracy of processing and diagnosis is the most important thing, the faster the speed is, the better.
Any #pragma or option intended to improve speed is invoked when measuring the speed.
Scoring: 20. The speed of a program that does nothing but copy input to output is set at 20 points, and comparative scoring is done on speed relative to this benchmark. Refer to 6.2 for concrete measurements. Since absolute speed varies with hardware, comparisons should be made on the same class of hardware. In addition, the speed of processing the same program depends on the amount of standard headers read; comparison should be done with mcpp as the point of reference.
To measure time, the time command (built into bash or tcsh) is used on UNIX systems. On Windows, a time command is available in bash if Cygwin is installed; WindowsNT also has a similar command called TimeThis in the "resource kit". On systems where none of these is available, compile tool/clock_of.c and use it (though it is rather inaccurate). *1
Since preprocessing finishes in an instant on recent hardware, its time is hard to measure. On Windows, the inclusion of Windows.h is, fortunately or unfortunately, a very heavy task and can be used for measurement. On UNIX-like systems, some relatively heavy task has to be found in the glibc sources or the like.
Note:
*1 Some WindowsNT resource kit programs can be used on WindowsXP while others cannot. TimeThis seems to be usable.
The smaller the memory usage is, the better it is. This is a serious problem especially where there is a strict limitation in memory usage.
As preprocessing is part of compilation, what actually matters is the memory usage of the overall compiler system. When preprocessing is performed by an independent preprocessor, the compiler proper usually consumes more memory, so the preprocessor's usage is rarely a problem; with many macro definitions and the like, however, a preprocessor may consume more. Memory usage includes not only the size of the program but also its data area.
Scoring: 20. There is no appropriate command to measure the memory usage of a program. On UNIX-like systems, the memory usage of cc1 or mcpp can be roughly observed with the 'top' command while running 'make' on some large software such as glibc or firefox. On Windows, the memory usage of a preprocessor during inclusion of Windows.h can be roughly observed with the Task Manager.
The portability of the preprocessor source itself becomes an issue when it replaces the resident preprocessor of a compiler system, or when it is updated or customized. The following are subject to evaluation.
Scoring: 20. I have read only one and a half of the sources in full and merely glanced at the rest, so this scoring is just a guess.
In d.*, we saw only whether there are documents regarding "implementation-defined items" by Standard C. Here, we will also evaluate the quality of documents.
In addition to implementation-defined items, the following are necessary documents at minimum.
In addition, having the description of the overall preprocessing specification including the Standard C would be of course much better.
Accuracy, readability, searchability, easiness of viewing and others become subjects for evaluation. The document for porting is included in the q.3.3 evaluation.
Scoring: 10.
Nowadays most compiler systems provide C and C++ together, and in that case the same preprocessor seems to be used for both. Since preprocessing is almost the same in both, separate preprocessors are unnecessary; however, the two are not exactly identical.
If you compare C++ Standard with C90, the C++ preprocessing is the C90 preprocessing plus the specifications below.
Length of a source logical line: 65536 bytes
Length of a string literal, a character constant, and a header-name: 65536 bytes
Length of an identifier: 1024 characters
#include nesting: 256 levels
#if, #ifdef, and #ifndef nesting: 256 levels
#if expression parentheses nesting: 256 levels
Number of macro parameters: 256
Number of definable macros: 65536
In C99, only the second of these are the same and the others are different. In C99, there occurred new differences due to additions such as p+ and P+ sequences in floating point numbers, official multi-byte character support in identifiers, variable macros, legitimization of empty arguments, macro expansion for the argument on the #pragma line, _Pragma() operator, #if expression evaluation in long long and wider, concatenation of a neighboring wide-character-string-literal and a character-string-literal as a wide-character-string-literal and others. UCN became a specification only in translation phase 5 in C99. The constraint on UCN is also slightly different. Though translation limits were also largely extended in C99 compared with C90, it was not as extreme as C++ Standard and there are differences here, too.
These differences may not seem numerous, but they are enough that C and C++ cannot use the same preprocessing. In addition, within C, C90 (C95) and C99 cannot use the same preprocessing either.
Besides, predefining __STDC__ in C++ is a source of trouble and not desirable.
Although some implementations define __cplusplus using the -D option, that is inappropriate, since the macro then behaves as just another user-defined macro.
Although whether each of ::, .*, and ->* is treated as one token hardly matters in preprocessing, handling them correctly is always best.
Therefore, to share a preprocessor among C90 (C95), C99, and C++, a decent implementation seems to use runtime options and switch its processing to accommodate the differences above.
Please note that mcpp does not support the following two points among the specifications above, since implementing them is too much of a burden for their value. I believe what is implemented is almost sufficient in practice.
The conformance tests for the C++-specific preprocessing specifications added to C90 preprocessing are shown below.
For implementations that do not recognize a file as C++ source unless its name has the form *.cc, copy the files to *.cc before testing.
No translation-limit samples are provided beyond those in the test-l directory. In C++, translation limits are only guidelines, so they are not subjects for scoring here. The length of header names is not tested either, since it is OS-dependent.
Note:
*1 C++ 2.1 Phases of translation
*2 C++ 2.7 Comments
*3 C++ 2.12 Operators and punctuators
*4 C++ 2.13.5 Boolean literals
In C99, <stdbool.h> defines bool, true, false, and __bool_true_false_are_defined as macros for _Bool, 1, 0, and 1, respectively.
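That header is so small that, in essence, the whole of a C99 <stdbool.h> can be written as:

#define bool   _Bool
#define true   1
#define false  0
#define __bool_true_false_are_defined  1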
*5 C++ 2.5 Alternative tokens
*6 C++ 16.8 Predefined macro names
*7 C++ Annex B Implementation quantities
*8 C++ 16.2 Source file inclusion
In C90 6.8.2, only the first 6 characters before the '.' in a header name were guaranteed significant; C99 6.10.2 guarantees 8 characters. This restriction was removed in C++.
*9 In C99, whether an extra \ is inserted when a UCN is stringized by the # operator is implementation-defined. C++ has no such provision.
If the extra \ is added, the UCN no longer converts back to a multi-byte character, so an implementation without the extra \ is the better one. Under the C++ specification, however, omitting the extra \ is a "wrong" implementation.
Scoring: 4.
Scoring: 2.
Scoring: 4.
Scoring: 2.
Scoring: 2. Even if the test appears successful, it is invalid if token concatenation "succeeds" without any warning when the same source is processed as C.
Scoring: 2.
Scoring: 4. 1 point for __cplusplus < 199711L.
Scoring: 1. 1 point if a diagnostic message such as a warning is issued.
Scoring: 2.
Below are issues faced when a preprocessor is actually used, apart from the Standard conformance level and quality of the preprocessor itself.
Samples in this Validation Suite include some standard headers. Unless those headers are correctly written, the preprocessor itself cannot be tested accurately.
The following points are prone to problems in practice in standard header implementations.
The standard headers must not only contain all the function declarations, type definitions, and macro definitions, but must also meet the conditions below.
The range of reserved identifiers is specified by the Standard, and all other identifiers must be left open to users. Since names beginning with one or two '_' are reserved for the implementation's use, they can be used in standard headers (conversely, they must not be defined by users.)
This is a somewhat constraining specification: traditional names outside the Standards cannot be used in Standard C unless they are changed to names starting with '_'. In POSIX, which became a starting point for the libraries and standard headers of Standard C, names outside Standard C are likewise enclosed by:
#ifdef _POSIX_SOURCE
...
#endif
At least when this part is enabled, the implementation is no longer pure Standard C.
Also, even if function names such as open(), creat(), close(), read(), write(), and lseek() do not appear in the standard headers, implementing fopen(), fclose(), fread(), fgets(), fgetc(), fwrite(), fputs(), fputc(), fseek() and the like in terms of open() etc. still violates the user's name space indirectly. Therefore, walling off open() etc. behind _POSIX_SOURCE, or separating them into other headers such as <unistd.h>, is cosmetic and meaningless.
These "system call functions" should either be renamed to names starting with '_', or, for those that are truly essential, be included in Standard C itself.
Although condition 2 is not clearly described in the specification, a standard header that includes another standard header usually ends up declaring names which it may not declare, and is thus caught by condition 1. Each header including <stddef.h> is against this rule. To avoid it, a non-standard header with a different name, say <sys/_defs.h>, can be provided and included by the standard headers (including <stddef.h> itself), with every name used there starting with '_', as in the sketch below. *4, *5
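A minimal sketch of that technique (the header name and the type names here are hypothetical; the actual types are implementation-defined):

/* sys/_defs.h -- private header; every name stays in the
   reserved ('_'-prefixed) name space */
#ifndef _SYS_DEFS_H
#define _SYS_DEFS_H
typedef unsigned int    _Sizet;
typedef int             _Ptrdifft;
#endif

/* stddef.h -- draws its names from the private header */
#ifndef _STDDEF_H
#define _STDDEF_H
#include <sys/_defs.h>
typedef _Sizet     size_t;
typedef _Ptrdifft  ptrdiff_t;
#define NULL  ((void *) 0)
#endif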
3 will not become an issue in reality.
4 caused problems in old implementations but is complied with by most current implementations.
The method of enclosing the whole standard header as below is common.
#ifndef _STDIO_H
#define _STDIO_H
...
#endif
In addition, there is the method of using an extended directive such as #pragma once.
What becomes an issue in 5 is that some implementations write macros using sizeof or casts in their standard headers. In Standard C, sizeof and casts are not allowed in the #if expression. The implementations that actually use sizeof or casts in standard headers (Borland C 5.5) also accept them in the #if expression, so they must regard this as an extended specification.
As long as a user does not write sizeof or casts in the #if expressions of his or her own program, no portability problem or other trouble arises. However, this way of preprocessing is not an "extension" of Standard C but a "deviation" from it. The reason is that #if is processed in translation phase 4, and no keywords exist in that phase: keywords are first recognized in phase 7. In phase 4, names spelled like keywords are all treated as plain identifiers, and identifiers not defined as macros all evaluate to 0 on the #if line. So sizeof (int) becomes 0 (0), and (int)0x8000 becomes (0)0x8000, violating the syntax rules; implementations must issue a diagnostic message for this. Not issuing one is not an "extension" but a "deviation" from Standard C. Recognizing only some keywords in phase 4 strains the logical structure of preprocessing and blurs its meaning as a "pre"-process phase preceding compilation proper (translation phase 7.) *6
Note:
*1 C90 7.1.3 Reserved identifiers C99 7.1.3 Reserved identifiers
*2 C90 7.1.2 standard headers C99 7.1.2 standard headers
*3 C90 7.1.7 Use of library functions C99 7.1.4 Use of library functions
*4 This method is used in the book below, which has many points that serve as especially useful references for implementing compiler systems.
P. J. Plauger "The Standard C Library", 1992, Prentice Hall
*5 In the GNU glibc system, standard headers such as <stddef.h> are read multiple times by the standard headers themselves. What is defined on those occasions, however, seems to be limited to names in the reserved range, so this does not violate the Standards; still, it is not a good method, since it hurts the readability of the standard headers and makes maintenance difficult. Using a file like <sys/_defs.h> is better.
Next, we will look at individual standard headers. From what I have seen of the standard headers attached to various implementations, the most problematic ones are <assert.h> and <limits.h>. Although these two are the easiest headers, they tend to contain implementation errors, since their specifications were newly defined by Standard C. Their usage is covered briefly here.
First, <assert.h>. *1, *2
Unlike the other standard headers, including this file multiple times does not always have the same effect: whether the user has defined NDEBUG changes the result each time the file is included. In other words, where needed, this header is used as below.
#undef NDEBUG
#include <assert.h>

assert( ...);
And in areas where debugging is complete, it becomes:
#define NDEBUG
#include <assert.h>

assert( ...);
If NDEBUG is defined, assert(...); disappears during macro expansion even though it remains in the source (even if ... is an expression with side effects, those side effects do not occur, since the expression is never evaluated.)
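For example, in this small program (a sketch of the point just made), i is never incremented:

#include <stdio.h>

#define NDEBUG              /* debugging finished for this area */
#include <assert.h>

int main( void)
{
    int     i = 0;
    assert( i++ == 0);      /* expands to ((void) 0): i++ never runs */
    printf( "%d\n", i);     /* prints 0, not 1 */
    return 0;
}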
In order for <assert.h> to be used like this, it must not be enclosed by:
#ifndef _ASSERT_H
#define _ASSERT_H
...
#endif
#pragma once must not be written, either.
Also, as this shows, assert() is a macro whose definition changes: <assert.h> must #undef assert and then redefine the assert macro according to NDEBUG.
In a call assert(expression) with NDEBUG undefined, nothing happens if the expression is true; if it is false, the failure is reported on the standard error output. The report displays the expression as written (not macro-expanded), together with the source file name and line number. This is easy to implement as long as the # operator and the __FILE__ and __LINE__ macros are correctly implemented.
In reality, some old implementations implement neither the # operator nor <assert.h> correctly. Many samples in this Validation Suite include <assert.h>, and the preprocessor itself cannot be tested correctly if <assert.h> is wrongly written. Since it is easy to write <assert.h> correctly, it is better to rewrite any incorrect one among an implementation's files. The following is an example from C89 Rationale 4.2.1.1. Here too, the correct result cannot be obtained unless the # operator is correctly implemented, but that is a preprocessor problem and cannot be helped.
#undef assert
#ifdef NDEBUG
#define assert( ignore) ((void) 0)
#else
extern void __gripe( char *_Expr, char *_File, int _Line);
#define assert( expr) \
    ((expr) ? (void) 0 : __gripe( #expr, __FILE__, __LINE__))
#endif
The __gripe() function can be written as below (the name __gripe can be anything, as long as it starts with '_'.)
#include <stdio.h>
#include <stdlib.h>

void __gripe( char *_Expr, char *_File, int _Line)
{
    fprintf( stderr, "Assertion failed: %s, file %s, line %d\n",
        _Expr, _File, _Line);
    abort();
}
Some implementations write fprintf(), fputs(), or abort() directly in <assert.h> without using a function like __gripe(). That is also acceptable, but it then requires declarations of those functions, and the FILE type and stderr become necessary as well; since <stdio.h> cannot simply be included, this gets quite complicated. Implementing a separate function leaves no room for mistakes.
This is not so significant, but when everything is implemented in the macro, a duplicate string literal is generated at every call. If the implementation does not optimize by merging duplicate string literals into one, this is unwise in terms of code size.
Note:
*1 C90 7.2 Diagnostics <assert.h> C99 7.2 Diagnostics <assert.h>
*2 Starting with C99, the assert() macro also displays the name of the function from which it was called. The predefined identifier __func__ was introduced for this kind of purpose.
This standard header is where the macros representing the ranges of the integer types are defined. These macros must be written so that their values match the specification, and the following conditions must also be met. *1
There are implementations that use casts. We discussed in 5.1 that sizeof and casts in the #if expression are outside the range of Standard C. The very reason <limits.h> was newly specified is so that a preprocessor need not query the execution environment with constructs like casts and sizeof.
For example, instead of
#if (int)0x8000 < 0
and
#if sizeof (int) < 4
the following is used:
#include <limits.h>
#define VALMAX ...
#if INT_MAX < VALMAX
In #if, cast and sizeof are not necessary if <limits.h> macros are used.
Examples where <limits.h> macros have the wrong type can be seen from time to time. These do not stem from the preprocessor specification; rather, it seems that whoever wrote <limits.h> forgot condition 2 above, or the rules of integer promotion, the usual arithmetic conversions, and the evaluation of integer constant tokens.
For example, there is a definition as below.
#define UCHAR_MAX 255U
The values of unsigned char all lie in the range of int (if CHAR_BIT is 8), so an object of type unsigned char is promoted to int by the integer promotions. Therefore, UCHAR_MAX must also evaluate as int. Written as 255U, however, it evaluates as unsigned int. It must be:
#define UCHAR_MAX 255
Either way may seem to cause no problems in practice, but that is not necessarily so. An operation involving an unsigned type triggers the "usual arithmetic conversions", forcing a conversion from the signed type to the unsigned type of the same size, so the result of a value comparison can differ.
This is easy to see in the following:
assert( -255 < UCHAR_MAX);
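A runnable sketch of the difference (the two macro names are hypothetical stand-ins for the wrong and the right definition):

#include <stdio.h>

#define UCHAR_MAX_WRONG  255U
#define UCHAR_MAX_RIGHT  255

int main( void)
{
    /* int compared with int: holds as expected */
    printf( "%d\n", -255 < UCHAR_MAX_RIGHT);   /* prints 1 */
    /* -255 is converted to a huge unsigned int by the usual
       arithmetic conversions, so the comparison fails */
    printf( "%d\n", -255 < UCHAR_MAX_WRONG);   /* prints 0 */
    return 0;
}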
This mistake is related to the fact that Standard C changed the integer promotion and usual arithmetic conversion rules from those adopted by many earlier implementations. The unsigned char, unsigned short, and unsigned long types were not in K&R 1st but were implemented in many compilers later, and in most of them anything unsigned was always converted to unsigned.
Under the integer promotions of Standard C, however, unsigned char and unsigned short are promoted to int as long as all their values fit in int, and to unsigned int otherwise. Likewise, in the usual arithmetic conversion between unsigned int and long, unsigned int values are converted to long if they all fit in long, and to unsigned long otherwise. This is the change from "unsigned preserving rules" to "value preserving rules"; the stated reason for it is that value preserving is more intuitively predictable. These rules require caution in <limits.h>. *2
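A small demonstration of the value-preserving conversion between unsigned int and long (the result depends on the sizes assumed in the comments):

#include <stdio.h>

int main( void)
{
    unsigned int    u = 1;
    long            l = -2;
    /* Where long can hold every unsigned int value (e.g. 64-bit
       long, 32-bit int), u converts to long, l + u is -1, and 1
       is printed.  Where long has the same size as int, the sum
       is computed in unsigned long and 0 is printed.             */
    printf( "%d\n", l + u < 0);
    return 0;
}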
In all the examples below, short is 16 bits and long is 32 bits. The value of USHRT_MAX is 65535, but how to write it depends on whether int is 16 bits or 32 bits.
#define USHRT_MAX 65535U    /* if int is 16 bits */
#define USHRT_MAX 65535     /* if int is 32 bits */
If int is 16 bits, unsigned short does not fit in int, so it is promoted to unsigned int, and USHRT_MAX must also evaluate as unsigned int; written as plain 65535 it would evaluate as long, so the suffix 'U' is necessary. On the other hand, if int is 32 bits, all unsigned short values fit in int and are promoted to int, so USHRT_MAX must evaluate as int and 'U' must not be attached. There is, however, a definition that sidesteps the distinction.
#define USHRT_MAX 0xFFFF
This example evaluates correctly whether int is 16 or 32 bits. In Standard C, an octal or hexadecimal integer constant token without a U, u, L, or l suffix is evaluated in the first of int, unsigned int, long, and unsigned long that can represent its (non-negative) value. Thus 0xFFFF evaluates as 65535 of type unsigned int if int is 16 bits, and as 65535 of type int if int is 32 bits. A decimal integer token without a suffix, on the other hand, is evaluated in the first of int, long, and unsigned long, so 65535 evaluates as long if int is 16 bits and as int if int is 32 bits. *3
C99 added long long and unsigned long long, as well as the _Bool type, which holds only the values 0 and 1, and made other integer types implementable too. The integer promotion rules were extended accordingly, and integer constant tokens that cannot be represented as unsigned long are evaluated as long long or unsigned long long.
With the increased number of integer types and the acceptance of implementation-defined ones, the size relations between types became confusing, so the concept of integer conversion rank was introduced. The concept is a little complex, but there is no need to worry in practice. For the standard integer types, the ranks order as below.
long long > long > int > short > char > _Bool
The point here is that ranks are distinguished even between types of the same implemented size, for example between long and int when both are 32 bits. *4, *5
Note:
*1 C90 5.2.4.2.1 Sizes of integral types <limits.h>
   C99 5.2.4.2.1 Sizes of integer types <limits.h>
*2 C90 6.2.1 Arithmetic operands
   C99 6.3.1 Arithmetic operands
*3 C90 6.1.3.2 Integer constants
   C99 6.4.4.1 Integer constants
*4 C99 6.4.4.1 Integer constants
*5 C99 added the standard headers <stdint.h> and <inttypes.h> to absorb the differences in integer types between implementations. They typedef type names other than the long/short names, since the number of integer types grew with the arrival of 64-bit systems and the corresponding relations became confusing. However, there are 26 such type names, 42 macros for their maximum and minimum values, 56 macros expanding to fprintf() format specifiers, and another 56 for fscanf(). Although the load on implementations is not large, this much complexity gives the impression of terminal symptoms.
Among all the macros in <limits.h>, the most error-prone are INT_MIN and LONG_MIN on systems with two's-complement internal representation. INT_MIN on an implementation with 16-bit int and 32-bit long, in particular, exhibits all the problems above, so it is covered separately here.
In this case, the range of int is understood to be [-32768,32767]. There is no problem with defining INT_MAX as 32767 or 0x7FFF in any implementation. However, I have seen INT_MIN defined as below.
#define INT_MIN (-32767)
Why does this kind of definition, which differs from the real minimum, appear?
On the other hand, as might be expected, no implementation defines it this way:
#define INT_MIN (-32768)
-32768 consists of two tokens, - and 32768. Since 32768 is not in the range representable in int, it is evaluated as long; -32768 therefore means -(long)32768.
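The same trap can be observed one size up on today's common systems. A sketch, assuming 32-bit int and 64-bit long:

#include <stdio.h>

int main( void)
{
    /* 2147483648 does not fit in int, so the constant is typed
       long, and -2147483648 is a long expression, not an int */
    printf( "%u\n", (unsigned) sizeof (-2147483648));       /* 8 */
    printf( "%u\n", (unsigned) sizeof (-2147483647 - 1));   /* 4 */
    return 0;
}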
Some make a definition like this:
#define INT_MIN ((int)0x8000)
The comment on definitions using casts need not be repeated. This one is invalid anyway, since 0x8000 by itself means (unsigned int)32768.
Then, how can INT_MIN be defined so that it evaluates as int -32768, without a cast?
#define INT_MIN (-32767-1)
This is fine. The 32767 may also be written as INT_MAX or 0x7FFF. The definition contains a subtraction, but that is not a problem (unary - is an operator in the first place.) *1, *2
#define INT_MIN (~INT_MAX)
#define INT_MIN (1<<15)
These are also correct definitions.
#define INT_MIN (-32767)
I can imagine that whoever wrote this gave up on defining the correct value because the idea of using an arithmetic operation did not occur to them.
Then, is the definition -32767 wrong, or acceptable?
The bottom line is that it is wrong.
INT_MIN is defined as the macro representing the minimum value of int. If INT_MIN is -32767, what does that mean? What, then, is INT_MIN-1 at all? Or ~INT_MAX and 1<<15?
INT_MIN-1, in this view, seems to be thought of as a bit pattern representing an out-of-range value, something like "NaN" in floating point operation.
Compared with what Standard C specifies for the integer types, however, this interpretation has no basis. First, for op1 << op2 or op1 >> op2, the result is undefined when op2 is negative or is at least the number of bits of op1's type; otherwise it is not undefined and always yields a definite value of an integer type. The result of ~op is int if op is int, and the results of op1 & op2, op1 | op2, op1 << op2, and op1 >> op2 are int if both op1 and op2 are int. Therefore the results of ~INT_MAX and 1<<15 are both int. You may think 1<<15 overflows, but it does not: since a bit operation returns the value corresponding to the resulting bit pattern, overflow cannot occur.
In C, the integer operations are in general well defined; undefined areas are extremely few. In particular, the relationship between bit patterns and values is completely one-to-one, except that two bit patterns exist for 0 in the one's-complement and sign-magnitude representations. This has been consistent from K&R 1st to Standard C. C has no way to write a bit pattern as such, and a "Not-a-Number" could only be written as something like (-32767-1), which, as you can see, is an int value itself. The C89 Rationale mentions some of the grounds and makes clear that there is no room in the integer types for bit patterns representing an "invalid integer" or "illegal integer". *3, *4
With two's-complement representation, ~INT_MAX is the value of INT_MIN, and I must say that any definition larger than that is wrong.
Note:
*1 I first saw this definition in "The Standard C Library" by P. J. Plauger. This style of limits.h has become popular in recent implementations.
However, the limits.h in this book also contains a mistake. Its definitions for a compiler system with 16-bit int and 32-bit long include:

#define UINT_MAX 65535
#define USHRT_MAX 65535

These evaluate as long. The correct definitions are:

#define UINT_MAX 65535U
#define USHRT_MAX 65535U

*2 In recent compiler systems, *_MIN is typically defined in the form (-*_MAX - 1) and mistakes are few, though they are still found occasionally. Vc7/include/limits.h and Vc7/crt/src/include/limits.h in Visual C++ 2003 contain:
#define LLONG_MIN 0x8000000000000000
0x8000000000000000 evaluates as unsigned long long. Since this type has the highest rank, the result of integer promotion has the same type, and the value can never become negative. Therefore,
#if LLONG_MAX > LLONG_MIN
does not turn out as expected.
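To see the failure concretely, here is a self-contained sketch; the two underscore-suffixed macros mimic the faulty Visual C++ 2003 values, and a C99 preprocessor with 64-bit #if arithmetic is assumed:

#define LLONG_MAX_ 0x7FFFFFFFFFFFFFFF
#define LLONG_MIN_ 0x8000000000000000
/* both operands compare as unsigned, and LLONG_MIN_ is the
   larger value, so the condition is false: */
#if LLONG_MAX_ > LLONG_MIN_
#error "never reached"
#endif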
LLONG_MIN in include/limits.h of LCC-Win32 2003-08, 2006-03 is as below.
#define LLONG_MIN -9223372036854775808LL
9223372036854775808LL violates a constraint, since the token's value overflows the range of signed long long. If LLONG_MIN is defined as:

#define LLONG_MIN -9223372036854775808LLU

then 9223372036854775808LLU becomes unsigned long long. Applying the unary - operator to an unsigned type does not change the result type; the result is reduced modulo 2^64 and remains the huge positive value 9223372036854775808, never a negative one, so this definition is wrong as well.
In Visual C++ 2003 and LCC-Win32, all the other *_MIN definitions are of the form (-*_MAX - 1); why is only LLONG_MIN wrong? Defined as below, there is no problem.
#define LLONG_MIN (-LLONG_MAX - 1LL)
In Visual C++ 2005, this definition was corrected.
*3 C89 Rationale 3.1.2.5 Types C99 Rationale 6.2.6.2 Integer types
*4 C99 allows implementations to treat specific bit patterns as "trap representations", which cause an exception.
I do not know what sort of implementations actually fall under this provision.
ISO C 9899:1990 / Amendment 1 added the standard header <iso646.h>. It provides alternative spellings, written only in the invariant character set of ISO 646, for the operators containing &, |, ~, ^, or !. Trigraphs also provide replacement spellings for |, ~, and ^, but trigraphs lack readability; <iso646.h> instead defines 11 operators as macros at the token level.
Implementing it is very easy, and the following example suffices. Since the macro expansion is done during preprocessing, it causes no trouble for implementations. *1
/* iso646.h   ISO 9899:1990 / Amendment 1 */
#ifndef _ISO646_H
#define _ISO646_H
#define and     &&
#define and_eq  &=
#define bitand  &
#define bitor   |
#define compl   ~
#define not     !
#define not_eq  !=
#define or      ||
#define or_eq   |=
#define xor     ^
#define xor_eq  ^=
#endif
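Used in a program, the macros read naturally (a hypothetical example):

#include <iso646.h>

int is_digit_or_dot( int c)
{
    return c >= '0' and c <= '9' or c == '.';
}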
Note:
*1 In the C++ Standard, these identifier-like operators are specified as operator tokens rather than macros, which is a troublesome and meaningless specification for implementations.
The compiler systems tested and the execution methods are as below, sorted by release date.
Runtime options vary slightly among C95 (C90), C99, and C++98.
Where <assert.h> or <limits.h> had problems, testing was done after rewriting them correctly.
Number : OS / Compiler System / Execution program (version)
    Runtime options
    Comment

1 : Linux / / DECUS cpp
    C95: cpp

The original DECUS cpp by Martin Minow (June 1985.) It was ported at the time to systems such as various DEC systems, UNIX, and MS-DOS; what was used in this test was a version modified by kmatsui and compiled with GCC on Linux. Macros were rewritten so that it clears as many of the translation-limit specifications as possible.
2 : FreeBSD 2.2.7 / GCC V.2.7.2.1 / cpp (V.2.0)
    GO32 / DJGPP V.1.12 / cpp (V.2.0)
    WIN32 / BC 4.0 / cpp (V.2.0)
    MS-DOS / BC 4.0, TC 2.0 / cpp (V.2.0)
    MS-DOS / LSI C-86 V.3.3 / cpp (V.2.0)
    OS-9/6x09 / Microware C/09 / cpp (V.2.0)
    C95: cpp -23 (-S1 -S199409L) -W15
         gcc -ansi -Wp,-2,-W15
    C99: cpp -23 (-S1) -S199901L -W15
    C++: cpp -23+ -S199711L -W15

Open-source software by kmatsui (August 1998), called mcpp; a rewrite of DECUS cpp. I compiled it with GCC on Linux and used that executable for this test.
3 : WIN32 / Borland C++ V.5.5J / cpp32 (August, 2000)
    C95: cpp32 -A -w
         bcc32 -A -w
    C99: cpp32 -A -w
    C++: cpp32 -A -w

Trigraphs are processed by neither cpp32 nor bcc32; instead a conversion program called trigraph.exe is provided, which merely sets up an alibi of "Standard conformance." For Borland C, I used this program to convert trigraphs in advance (a lenient test). This trigraph.exe, however, also processes line splicing by <backslash><newline>, so line numbers get out of alignment (scores are deducted in q.1.2.)
4 : Linux, CygWIN / GCC V.2.95.3 (March, 2001) / cpp0
    C95: cpp0 -D__STRICT_ANSI__ -std=c89 -$ -pedantic -Wall
         gcc -ansi -pedantic -Wall
    C99: cpp0 -std=c9x -$ -Wall
    C++: g++ -E -trigraphs -$ -Wall

Since GCC is a portable source, whoever ports it should prepare documentation for the specific system ported to; no such document is provided, however. Only GNU cpp.info exists as a cpp document.
5 : Linux / GCC V.3.2 (August, 2002) / cpp0
    C95: cpp0 -D__STRICT_ANSI__ -std=iso9899:199409 -$ -pedantic -Wall
         gcc -std=iso9899:199409 -pedantic -Wall
    C99: cpp0 -std=c99 -$ -Wall
    C++: g++ -E -trigraphs -$ -Wall

Compiled from source by kmatsui, configured with the --enable-c-mbchar option.
6 : Linux / / ucpp (V.1.3)
    C95: ucpp -wa -c90
    C99: ucpp -wa

Open-source software by Thomas Pornin (January 2003); a portable, compiler-independent preprocessor. I compiled it on Linux with GCC.
7 : WIN32 / Visual C++ 2003 / cl
    C95: cl -Za -E -Wall -Tc
    C99: cl -E -Wall -Tc
    C++: cl -E -Wall -Tp

Since the -E option does not process comments and <backslash><newline> properly, compilation tests were used as well (April, 2003.)
8 : WIN32 / LCC-Win32 2003-08 / lcc
    C95: lcc -A -E
         lcc -A
    C99: lcc -A -E
    C++: lcc -A -E

An integrated development environment written by Jacob Navia, based on open-source software by C. W. Fraser & Dave Hanson (August, 2003.) The preprocessing part is based on source originally written for Plan9 by Dennis Ritchie.
9 : WIN32, Linux, etc. / / wave (V.1.0)
    C95: wave
    C99: wave --c99
    C++: wave

Open-source software by Hartmut Kaiser (January 2004); a portable, compiler-independent preprocessor implemented with the C++ "Boost preprocessor library" by Paul Mensonides et al. I used a Windows binary made by the author of wave.
10 : FreeBSD, Linux, CygWIN / GCC 2.95, 3.2
     WIN32, MS-DOS / Visual C 2003, BCC, etc. / mcpp_std (V.2.4)
    C95: mcpp_std -23 (-S1 -V199409L) -W31
         gcc -ansi -Wp,-2,-W31
    C99: mcpp_std -23 (-S1) -V199901L -W31
    C++: mcpp_std -23+ -V199711L -W31

mcpp V.2.4 (February 2004) in Standard mode, compiled with GCC on Linux.
11 : Linux / GCC V.3.4.3 (November, 2004) / cc1, cc1plus
    C95: gcc -E -std=iso9899:199409 -pedantic -Wall
    C99: gcc -E -std=c99 -$ -Wall
    C++: g++ -E -std=c++98 -$ -Wall
12 : WIN32 / Visual C++ 2005 / cl
    C95: cl -Za -E -Wall -Tc
    C99: cl -E -Wall -Tc
    C++: cl -E -Wall -Tp

Since the -E option does not process comments and <backslash><newline> properly, compilation tests were used as well. (September, 2005.)
13 : WIN32 / LCC-Win32 2006-03 / lcc
    C95: lcc -A -E
         lcc -A
    C99: lcc -A -E
    C++: lcc -A -E
LCC-Win32 2006-03 (March, 2006.)
14 : Linux / GCC V.4.1.1 (May, 2006) / cc1, cc1plus
    C95: gcc -E -std=iso9899:199409 -pedantic -Wall
    C99: gcc -E -std=c99 -$ -Wall
    C++: g++ -E -std=c++98 -$ -Wall

15 : WIN32 / Visual C++ 2008 / cl
    C95: cl -Za -E -Wall -Tc
    C99: cl -E -Wall -Tc
    C++: cl -E -Wall -Tp

Since the -E option does not process comments and <backslash><newline> properly, compilation tests were used as well. (December, 2007.)
16 : Linux, WIN32, etc. / / wave (V.2.0)
    C95: wave (--c99)
    C99: wave --c99
    C++: wave

wave V.2.0 (August 2008), compiled by kmatsui on Linux/GCC and Windows/Visual C++ from the source contained in Boost C++ Libraries V.1.36.0 with its default settings, used with the configuration files for the GCC and Visual C++ header files.
17 : FreeBSD, Linux, Mac OS X, CygWIN, MinGW / GCC 2.95-4.1
     WIN32 / Visual C 2003-2008, BCC, LCC-Win32 / mcpp (V.2.7.2)
    C95: mcpp -23 (-S1 -V199409L) -W31
         gcc -ansi -Wp,-2,-W31,-fno-dollars-in-identifiers
    C99: mcpp -23 (-S1) -V199901L -W31
    C++: mcpp -23+ -V199711L -W31
mcpp V.2.7.2 (2008/11)
(Columns 1-17 correspond to the compiler systems numbered 1-17 in the list above.)

item max 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

[K&R: Processing of sources conforming to K&R and C90] (31 items)
n.2.1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
n.2.2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
n.2.3 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
n.6.1 10 10 10 10 10 10 10 10 10 4 10 10 10 10 10 10 10 10
n.7.2 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
n.10.2 6 0 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
n.12.3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
n.12.4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
n.12.5 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
n.12.7 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
n.13.1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
n.13.2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
n.13.3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
n.13.4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
n.13.7 6 6 6 4 6 6 6 4 6 0 6 6 4 4 6 4 6 6
n.13.8 2 0 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2
n.13.9 2 2 2 2 2 2 2 0 2 0 2 2 0 2 2 0 2 2
n.13.10 2 2 2 2 2 2 2 0 0 2 2 2 2 0 2 2 2 2
n.13.11 2 0 2 2 2 2 2 0 0 0 2 2 2 0 2 2 2 2
n.13.12 2 0 2 2 2 2 2 2 0 0 2 2 2 0 2 2 2 2
n.15.1 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
n.15.2 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
n.18.1 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30
n.18.2 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
n.18.3 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
n.27.1 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
n.27.2 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
n.29.1 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
n.32.1 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2
i.32.3 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2
i.35.1 2 2 2 2 2 2 2 0 0 0 2 2 0 0 2 0 1 2
stotal 166 150 166 164 166 166 166 156 158 140 166 166 160 156 166 160 165 166

[C90: Processing of strictly conforming sources] (76 items)
n.1.1 6 0 6 6 6 6 6 6 6 0 6 6 6 6 6 6 6 6
n.1.2 2 0 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2
n.1.3 2 0 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2
n.2.4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
n.2.5 2 0 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2
n.3.1 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
n.3.3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
n.3.4 2 0 2 0 2 2 0 2 2 2 2 2 2 2 2 2 2 2
n.4.1 6 0 6 0 6 6 6 6 0 0 6 6 6 0 6 6 6 6
n.4.2 2 0 2 0 2 2 2 2 0 0 2 2 2 0 2 2 2 2
n.5.1 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
n.6.2 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
n.6.3 2 0 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2
n.7.1 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
n.7.3 4 0 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
n.8.1 8 0 8 8 8 8 8 8 8 8 8 8 8 8 8 8 2 8
n.8.2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
n.9.1 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
n.10.1 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
n.11.1 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
n.11.2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
n.12.1 6 0 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
n.12.2 4 0 4 4 4 4 4 4 4 0 4 4 4 4 4 4 4 4
n.12.6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
n.13.5 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2
n.13.6 6 0 6 6 6 6 4 6 4 0 6 6 4 4 6 4 6 6
n.13.13 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
n.13.14 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
n.19.1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
n.19.2 4 2 4 4 4 4 4 4 4 2 4 4 4 4 4 4 4 4
n.20.1 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
n.21.1 4 0 4 0 4 4 4 0 4 4 4 4 0 4 4 0 4 4
n.21.2 2 0 2 0 2 2 2 0 2 2 2 2 0 2 2 0 2 2
n.22.1 4 0 4 0 4 4 4 4 4 0 4 4 4 4 4 4 4 4
n.22.2 2 0 2 0 2 2 2 2 2 0 2 2 2 2 2 2 2 2
n.22.3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2
n.23.1 6 2 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
n.23.2 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2
n.24.1 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
n.24.2 4 0 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
n.24.3 6 0 6 0 6 6 6 6 6 6 6 6 6 6 6 6 6 6
n.24.4 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
n.24.5 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
n.25.1 4 2 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
n.25.2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
n.25.3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
n.25.4 6 0 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
n.25.5 4 0 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
n.26.1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
n.26.2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
n.26.3 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
n.26.4 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
n.26.5 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
n.27.3 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
n.27.4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 2 4
n.27.5 2 2 2 2 2 2 2 0 2 0 2 2 0 2 2 0 0 2
n.27.6 2 0 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
n.28.1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
n.28.2 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
n.28.3 4 0 4 4 4 4 2 4 4 4 4 4 4 4 4 4 4 4
n.28.4 4 0 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
n.28.5 4 0 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
n.28.6 4 0 4 0 4 4 2 0 0 4 4 4 0 0 4 0 4 4
n.28.7 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
n.29.2 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
n.30.1 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
n.32.2 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2
n.37.1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
n.37.2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
n.37.3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
n.37.4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
n.37.5 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
n.37.6 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
n.37.7 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
n.37.8 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
n.37.9 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
stotal 286 160 284 252 286 286 278 274 272 240 286 286 272 272 286 272 274 286

[C90: Processing of implementation defined portions] (1 item)
i.32.4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
stotal 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

[C90: Diagnosing of violation of syntax rule or constraint] (50 items)
e.4.3 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
e.7.4 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
e.12.8 2 0 2 2 2 2 2 2 0 2 2 2 2 0 2 2 2 2
e.14.1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
e.14.2 4 2 4 2 4 4 2 2 4 4 4 4 4 4 4 4 4 4
e.14.3 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2
e.14.4 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2
e.14.5 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
e.14.6 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
e.14.7 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
e.14.8 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
e.14.9 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
e.14.10 4 0 4 2 0 0 0 0 0 0 4 0 0 0 0 0 4 4
e.15.3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
e.15.4 2 2 2 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2
e.15.5 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
e.16.1 2 2 2 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2
e.16.2 2 2 2 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2
e.17.1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
e.17.2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
e.17.3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
e.17.4 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2
e.17.5 2 0 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2
e.17.6 2 0 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2
e.17.7 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
e.18.4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
e.18.5 2 2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2
e.18.6 2 0 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2
e.18.7 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
e.18.8 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2
e.18.9 2 0 2 0 2 2 2 2 0 0 2 0 2 0 2 2 0 2
e.19.3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
e.19.4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
e.19.5 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
e.19.6 2 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
e.19.7 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
e.23.3 2 0 2 2 2 2 2 2 0 0 2 2 2 0 2 2 0 2
e.23.4 2 2 2 2 2 2 2 2 0 0 2 2 2 0 2 2 0 2
e.24.6 2 2 2 2 2 2 2 2 0 0 2 2 2 0 2 2 0 2
e.25.6 4 0 4 0 4 4 4 4 4 4 4 4 4 4 4 4 4 4
e.27.7 2 0 2 2 2 2 2 0 2 2 2 2 0 2 2 0 2 2
e.29.3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
e.29.4 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
e.29.5 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
e.31.1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
e.31.2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
e.31.3 2 2 2 0 2 2 2 1 2 2 2 2 1 2 2 1 2 2
e.32.5 2 0 2 2 2 2 0 2 0 0 2 2 2 0 2 2 2 2
e.33.2 2 0 2 0 0 2 0 2 0 0 2 2 2 0 2 2 2 2
e.35.2 2 0 2 1 2 2 0 2 0 2 2 2 2 0 2 2 2 2
stotal 112 74 112 92 104 108 98 100 92 86 112 106 105 92 108 105 104 112

[C90: Documents on implementation defined behaviors] (13 items)
d.1.1 2 0 2 0 0 2 0 0 0 0 2 2 0 0 2 0 0 2
d.1.2 4 2 4 4 4 4 0 4 0 0 4 4 4 0 4 4 2 4
d.1.3 2 0 2 0 0 2 0 0 2 2 2 2 0 0 2 0 2 2
d.1.4 4 0 4 4 4 4 0 4 4 2 4 4 4 4 4 4 2 4
d.1.5 4 2 4 4 2 4 4 4 4 4 4 2 4 4 2 4 4 4
d.1.6 2 0 2 0 0 1 0 0 0 0 2 1 0 0 1 0 0 2
d.2.1 2 0 2 2 2 2 2 0 0 0 2 2 2 0 2 2 0 2
d.2.2 2 0 2 2 0 2 0 0 0 0 2 2 2 0 2 2 0 2
d.2.3 2 0 2 0 0 0 0 0 0 0 2 0 0 0 0 0 0 2
d.2.4 2 0 2 0 0 0 0 0 0 0 2 0 0 0 0 0 0 2
d.2.5 2 0 2 0 0 0 0 2 0 0 2 0 2 0 0 2 0 2
d.2.6 2 0 2 2 0 0 0 2 0 0 2 0 2 0 0 2 0 2
d.2.7 2 0 2 2 0 2 0 2 0 0 2 2 2 0 2 2 0 2
stotal 32 4 32 20 12 23 6 18 10 8 32 21 22 8 21 22 10 32

[C90: Degree of Standard C conformance] (171 items)
mttl90 598 390 596 530 570 585 550 550 534 476 598 581 561 530 583 561 555 598

[C99: Conformance to new features] (20 items)
n.dslcom 4 0 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
n.ucn1 8 0 0 0 0 6 8 2 0 2 8 6 8 0 6 8 6 8
n.ucn2 2 0 0 0 0 0 2 0 0 0 2 2 2 0 2 2 0 2
n.ppnum 4 0 4 0 4 4 4 0 0 0 4 4 0 4 4 0 0 4
n.line 2 0 2 2 2 2 2 2 0 2 2 2 2 0 2 2 2 2
n.pragma 6 0 6 0 0 6 6 0 0 2 6 6 0 0 6 0 2 6
n.llong 10 0 0 0 10 10 8 10 0 0 10 10 10 0 10 10 10 10
n.vargs 10 0 10 0 10 10 10 0 0 10 10 10 10 2 10 10 10 10
n.stdmac 4 0 2 0 0 4 4 0 0 4 4 4 0 0 4 0 4 4
n.nularg 6 0 6 0 6 6 6 2 0 6 6 6 2 0 6 2 6 6
n.tlimit 18 0 18 14 18 18 17 18 14 18 18 18 18 12 18 18 16 18
e.ucn 4 0 0 0 0 0 2 0 0 2 4 0 2 0 0 2 2 4
e.intmax 2 0 0 0 2 2 2 0 0 0 2 1 0 0 1 0 2 2
e.pragma 2 0 2 0 0 2 2 0 0 2 2 2 0 0 2 0 2 2
e.vargs1 2 0 0 0 0 2 1 0 0 1 2 2 0 0 2 0 2 2
e.vargs2 2 0 2 0 0 0 0 0 0 0 2 0 0 0 0 0 0 2
d.pragma 2 0 2 0 0 2 2 2 0 0 2 2 2 0 2 2 2 2
d.predef 6 0 0 0 0 6 6 0 0 0 6 6 0 0 6 0 0 6
d.ucn 2 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 2
d.mbiden 2 0 0 0 0 2 2 1 0 0 2 2 1 0 2 1 0 2
mttl99 98 0 58 20 56 86 88 41 18 53 98 87 61 22 87 61 70 98

[C++: Conformance to new features not in C90] (9 items)
n.dslcom 4 0 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
n.ucn1 4 0 0 0 0 4 4 2 0 2 4 4 2 0 4 2 2 4
n.cnvucn 4 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
n.bool 2 0 0 0 0 2 0 0 0 2 2 2 0 0 2 0 0 2
n.token1 2 0 2 0 0 2 0 0 2 2 2 2 0 2 2 0 2 2
n.token2 2 0 0 0 0 2 0 2 0 2 2 2 2 0 2 2 2 2
n.cplus 4 0 2 2 2 2 0 4 0 4 4 2 4 0 2 4 4 4
e.operat 2 0 0 0 0 2 0 0 0 2 2 2 0 0 2 0 2 2
d.tlimit 2 0 2 0 0 2 0 1 0 0 2 2 1 0 2 1 0 2
mttl++ 26 0 10 6 6 20 9 13 6 18 22 20 13 6 20 13 16 22

[C90: Qualities / 1 : handling of multibyte character] (1 item)
m.36.2 7 0 2 2 0 0 0 4 0 0 7 5 2 0 5 2 0 7
stotal 7 0 2 2 0 0 0 4 0 0 7 5 2 0 5 2 0 7

[C90: Qualities / 2 : diagnosis of undefined behaviors] (29 items)
u.1.1 1 0 1 0 1 1 0 0 0 1 1 1 0 0 1 0 1 1
u.1.2 1 0 1 0 1 1 0 1 0 1 1 1 0 0 1 0 1 1
u.1.3 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
u.1.4 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
u.1.5 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1
u.1.6 1 0 1 0 1 1 0 0 1 0 1 1 0 1 1 0 0 1
u.1.7 9 0 1 0 0 0 0 0 0 0 6 6 0 0 6 0 0 9
u.1.8 1 1 1 0 1 1 0 0 0 0 1 0 0 1 0 0 0 1
u.1.9 1 1 1 0 1 1 1 0 1 0 1 0 0 1 0 0 0 1
u.1.10 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
u.1.11 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1
u.1.12 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
u.1.13 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
u.1.14 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
u.1.15 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
u.1.16 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1
u.1.17 2 0 2 0 1 1 0 0 1 0 2 1 0 1 1 0 1 2
u.1.18 1 0 1 0 1 1 1 0 0 0 1 1 0 0 1 0 1 1
u.1.19 2 0 2 0 0 1 1 0 0 1 2 1 0 0 1 0 1 2
u.1.20 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1
u.1.21 2 0 2 1 0 1 2 2 2 2 2 1 2 2 1 2 2 2
u.1.22 1 0 1 0 0 1 1 0 1 1 1 1 0 1 1 0 1 1
u.1.23 1 1 1 0 1 0 0 1 1 0 1 0 1 1 0 1 0 1
u.1.24 2 0 2 0 0 0 0 0 0 0 2 0 0 0 0 0 2 2
u.1.25 1 0 1 0 0 1 0 0 0 0 1 1 0 0 1 0 1 1
u.1.27 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1
u.1.28 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1
u.2.1 1 1 1 1 1 1 0 1 0 1 1 1 1 0 1 1 1 1
u.2.2 1 0 1 1 0 1 1 0 0 0 1 1 0 0 1 0 0 1
stotal 41 10 33 16 19 26 20 18 19 16 38 30 17 21 30 17 25 41

[C90: Qualities / 3 : Diagnosis of unspecified behaviors] (2 items)
s.1.1 2 0 2 0 0 0 2 0 0 0 2 0 0 0 0 0 0 2
s.1.2 2 0 2 0 0 0 0 0 0 0 2 0 0 0 0 0 0 2
stotal 4 0 4 0 0 0 2 0 0 0 4 0 0 0 0 0 0 4

[C90: Qualities / 4 : Diagnosis of suspicious cases] (12 items)
w.1.1 4 4 4 0 4 4 0 0 0 0 4 4 0 0 4 0 0 4
w.1.2 4 0 4 0 0 0 0 0 0 2 4 0 0 0 0 0 2 4
w.2.1 2 0 2 1 0 0 0 0 0 0 2 2 0 0 2 0 0 2
w.2.2 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1
w.3.1 1 1 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1
w.3.3 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1
w.3.4 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1
w.3.5 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1
w.3.6 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1
w.3.7 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1
w.3.8 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1
w.3.9 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1
stotal 19 5 19 1 4 4 0 0 0 2 19 6 0 1 6 0 2 19

[C90: Qualities / 5 : Other features] (17 items)
q.1.1 9 0 9 6 9 9 8 7 4 9 9 9 7 5 9 7 8 9
q.1.2 10 6 10 4 8 10 4 4 4 4 10 10 4 4 10 4 4 10
q.1.3 4 4 4 2 4 4 4 4 4 4 4 4 4 4 4 4 4 4
q.1.4 20 10 20 10 20 20 20 10 20 10 20 20 10 20 20 10 10 20
q.2.1 4 2 4 2 4 4 4 4 2 4 4 4 4 2 4 4 4 4
q.2.2 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
q.2.3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
q.2.4 2 2 2 0 2 2 0 0 0 0 2 2 0 0 2 0 0 2
q.2.5 2 0 2 0 0 0 0 0 0 0 2 0 0 0 0 0 0 2
q.2.6 4 0 4 2 4 4 2 4 2 0 4 4 4 2 4 4 0 4
q.2.7 10 4 6 4 8 8 4 4 4 2 8 8 4 4 8 4 4 8
q.2.8 10 0 6 2 2 2 0 6 4 4 8 2 6 4 2 6 4 8
q.2.9 6 2 2 0 2 2 0 0 0 2 2 2 0 0 2 0 2 4
q.3.1 20 10 8 8 14 12 8 10 10 6 8 12 10 10 12 10 6 8
q.3.2 20 20 20 18 16 16 18 16 18 14 18 14 14 18 12 14 12 16
q.3.3 20 10 14 0 10 12 8 0 0 8 14 12 0 0 12 0 10 16
q.4.1 10 2 6 6 4 6 2 4 6 4 4 6 4 6 6 4 4 8
stotal 157 78 123 70 113 117 88 79 84 77 123 115 77 85 113 77 78 129

[C90: Qualities] (61 items)
mttl90 228 93 181 89 136 147 110 101 103 95 191 156 96 107 154 96 105 200

[C99: Qualities of new features] (3 items)
u.line 2 0 2 0 1 1 0 0 0 0 2 1 0 0 1 0 2 2
u.concat 1 0 1 0 0 1 0 0 0 0 1 1 0 0 1 0 0 1
w.tlimit 8 0 8 0 0 0 3 2 0 0 8 0 2 0 0 2 0 8
mttl99 11 0 11 0 1 2 3 2 0 0 11 2 2 0 2 2 2 11

[C++: Qualities of features not in C90] (1 item)
u.cplus 1 0 1 1 0 0 0 1 0 1 1 0 1 0 0 1 1 1
mttl++ 1 0 1 1 0 0 0 1 0 1 1 0 1 0 0 1 1 1

[Overall] (265 items)
gtotal 962 483 857 646 769 840 760 708 661 643 921 846 734 665 846 734 749 930
1 : Linux / / DECUS cpp
This was written around the early stage of the ANSI draft, and its Standard conformance level is low by now. Diagnostic messages are adequate, but there is almost no documentation. It is a well-structured, stable program.
The portability of the source is high, and it has been ported to several compiler systems. The source code reads like a textbook, and you can learn a lot just by reading it. I modeled mcpp after this source.
3 : WIN32 / Borland C++ V.5.5J / cpp32
The C90 conformance level is relatively high, and the troublesome shift-JIS encoding is respectably supported. Documents are well provided. Although it usually issues diagnostic messages for the e_* cases, most of them are hasty and of poor quality.
Its quality beyond the Standards is poor, and there are no notable extension features. Few diagnostic messages are issued for undefined constructs, and the program sometimes runs away. Merely supporting the Standards seems to be the best it could do.
Unlike Turbo C, it is no longer fast: the merits of a one-pass compiler seem to have disappeared, leaving only the demerits. I wonder how long Borland will continue this style.
4 : Linux, CygWIN / GCC V.2.95.3 / cpp0
The C90 and C95 Standard conformance level is quite high, and the diagnostic messages are accurate. The behavior is nearly stable and the speed is extremely fast. The options are plentiful, almost too abundant; mcpp modeled some of its options after them.
Though there were a few painful bugs in older versions, V.2.95 has almost none.
The remaining issues are that the new specifications of C99 and C++98 are not implemented, diagnostic messages are insufficient, documentation is lacking, there are many extensions against the Standards which do not use #pragma, many obsolete pre-Standard specifications remain hidden, and multi-byte character encoding support is half-finished and not of practical level.
The cpp.info document is excellent as an explanation of GCC/cpp overall and of Standard C preprocessing. It is really too bad, however, that documentation of the implementation-defined areas does not exist for CygWIN, FreeBSD, or Linux. "Portability" is not only for programs.
The source is full of patches and difficult to read, and the program structure still drags along the old macro-processor structure. Nevertheless, since the GCC compiler system is good overall, it has been ported to many systems.
5 : Linux / GCC V.3.2 / cpp0
GCC V.3 changed the preprocessor source completely from V.2, and changed the documentation completely at the same time. Token-based principles are now fully enforced, warnings are issued where pre-Standard specifications are still allowed, and the number of undocumented specifications decreased. On the whole it has improved largely in the direction I had hoped for, and future improvements should be easier now that the program structure has been renewed.
Diagnostic messages, documentation, C99 support, and multi-byte character support are still insufficient. The speed is slightly slower than V.2, but it is still one of the fastest.
It is troublesome, however, that the header files have become complex and that setting the search order of include directories is getting complicated. Also, while old options are no longer necessary, many new options have been introduced, and it is taking forever for the options to get organized. It is unfortunate that the internal interface between the preprocessing and compilation parts is complicated for some reason, although preprocessing became combined with the compiler proper in V.3.
6 : Linux / / ucpp (V.1.3)
Its characteristics are C99 support, open source, and portability. The Standard conformance level is rather high. This version is supposed to support UCN and UTF-8, but the support is insufficient. The diagnostic messages are somewhat poor, and the documentation is not sufficient either.
7 : WIN32 / Visual C++ 2003 / cl
12 : WIN32 / Visual C++ 2005 / cl
15 : WIN32 / Visual C++ 2008 / cl
Though 2003 implements few C99 specifications, 2005 implements more than half of them. Some bugs remain, however, in the C90-and-earlier specifications. The most fundamental problem is confusion in the translation phases; the upgrades must have been rework of some very old source code.
The diagnostic messages are often somewhat off the point. An error often terminates preprocessing, which makes this software bothersome to use. Updating of the manuals sometimes falls far behind the implementation.
The merits are large translation limits and a relatively large number of #pragma directives; #pragma setlocale in particular is useful. It is problematic, however, that the #pragma line is macro-expanded even in C90 mode and that #pragma sub-directives use the user's name space.
2008 is almost the same as 2005 as far as preprocessing is concerned; the only difference is the implementation of the __pragma() operator, a substitute for the _Pragma() of C99. The system headers used by Visual C++ had only a few problems on the whole up to 2005, but in 2008 the number of macros with '$' suddenly increased for some reason.
8 : WIN32 / LCC-Win32 2003-08 / lcc
13 : WIN32 / LCC-Win32 2006-03 / lcc
Jacob Navia modified the preprocessing part of Dennis Ritchie's source code for Plan9, but it has not been debugged enough, and there are quite a number of bugs in the #if expression evaluation and elsewhere. The specifications since C95 are not supported, and documentation is lacking.
There are few differences in preprocessing between 2003-08 and 2006-03.
9 : WIN32, Linux, etc. / / wave (V.1.0)
16 : WIN32, Linux, etc. / / wave (V.2.0)
This preprocessor has been developed primarily for the "meta-programming" of C++ STL. Wave has a unique construction: it consists mainly of C++ libraries, and its source consists mainly of header files. The examples of meta-programming use recursive macros heavily, and Wave expands those macros as GCC/cpp or the -@compat option of mcpp does, i.e. limiting the scope of the "inhibition of a once-replaced macro's re-replacement" more narrowly than the Standard's wording. (see 3.4.26.)
Wave also intends to be usable as an ordinary preprocessor conforming to C++98 and C99. Though the degree of perfection was not high in V.1.0, it was greatly improved in V.2.0; reportedly a lot of bugs were fixed after V.1.0 using the validation suite of mcpp. Further improvement of the diagnostics and documents is still desired. Another problem is that a few mistakes are found in the author's interpretation of the Standards, as expressed in wave's diagnostics and its accompanying testcases.
11 : Linux / GCC V.3.4.3 / cc1, cc1plus
14 : Linux / GCC V.4.1.1 / cc1, cc1plus
The scores are almost the same as V.3.2, but the construction of the preprocessing has changed greatly. While V.3.2 seemed to be heading toward portability, GCC changed direction in V.3.3 and 3.4: it has become one huge and complex compiler, removing the independent preprocessor, predefining many macros, and restoring some old specifications that V.3.2 had obsoleted. It is a question whether these changes can be called improvements. It has also given UTF-8 a privileged place among the many multi-byte character encodings, which I am afraid may narrow the breadth of multi-lingualization.
V.4.1 has no big differences from V.3.4 as far as preprocessing is concerned.
2 : FreeBSD, DJGPP, WIN32, MS-DOS, OS-9/09 / / mcpp (V.2.0)
10 : FreeBSD, Linux, CygWIN, WIN32, MS-DOS / / mcpp (V.2.4)
17 : FreeBSD, Linux, Mac OS X, CygWIN, MinGW, WIN32 / / mcpp (V.2.7.2)
Since I created and tested this myself, the conformance level is of course the best; it should be the world's most accurate preprocessor. The plentifulness and accuracy of the diagnostic messages and the detail of the documentation are also the best, useful options and #pragma directives are provided, and the C99 specification is fully supported in V.2.3 and later. The portability of the source is the best as well.
However, some features are still to be implemented, so I would appreciate your contribution.
Testing many preprocessors shows that nowadays many have a high level of C90 Standard conformance, yet each compiler system still has many issues. I am not going to speak for mcpp, since it scores full marks on most items.
More compiler systems can now process the n_* samples correctly. GCC 2.95 and later, BC (Borland C) 5.5, LCC-Win32 2003-08 and later, Visual C++ (VC) 2003 and later, ucpp 1.3, and Wave 2.0 have reached a level with few problems in practice. Still, each compiler system has unexpected bugs.
Most surprising is that compiler systems including Visual C often raise division-by-0 errors on n_13_7.t (n_13_7.c): they fail to handle the "short-circuit evaluation" of &&, ||, and the ternary operator, a basic part of the C specification. Borland C issues a warning on n_13_7.c, and issues only the same warning for the real division by 0 in e_14_9.c as well. In Turbo C, the real division by 0 and the skipped subexpression caused the same error, and in Borland C the same diagnostic was merely downgraded to a warning. This is an example of this compiler system's "hasty diagnostic messages".
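A sketch in the spirit of n_13_7.t (the actual sample may differ): none of the divisions below may be evaluated, because &&, || and ?: short-circuit in the #if expression just as they do at run time.

#if 0 && 1 / 0
#error "never reached"
#endif
#if 1 || 1 / 0
/* taken; the right operand of || must be skipped */
#endif
#if 0 ? 1 / 0 : 1
/* taken; the second operand of ?: must be skipped */
#endif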
Among the C90 specifications, some implementations have errors in stringizing with the # operator.
The specifications added by Amendment 1 and Corrigendum 1 are implemented to some extent by GCC 2.95 and later, VC 2003 and later, and ucpp.
Of the C99 specifications, only GCC 3.2 and later and ucpp implement a large part, though not all. The // comment has been implemented by many compiler systems for quite some time. In addition, GCC has long long, has considerable headroom in its translation limits, and properly processes empty macro arguments. GCC has a variadic macro of its own specification, but the C99-specified one has also been implemented since 2.95, and _Pragma() since 3.2. UCN is implemented only by ucpp and by VC 2005 and later; GCC 3.2 and later implements UCN in string literals and character constants only. Wave 2.0 implements more than half of the C99 specifications.
Of C++98, GCC 3.2 and later, Wave, and VC 2005 and later implement most of the specifications.
The queer C++98 specification of converting extended characters to UCN is not yet implemented by any preprocessor.
In processing the implementation-defined areas of i_*, many cannot handle wide characters in the #if expression. Though this is specified in the Standards, using not only wide characters but character constants of any kind in the #if expression is almost meaningless, and it would not hurt if they could not be used. This kind of meaningless specification should be removed from the Standard.
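The construct meant here is something like the following (a hypothetical i_*-style fragment, not the actual sample):

#if L'A' == 0x41
/* the wide execution character set looks ASCII-compatible */
#endif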
Visual C supports relatively many encodings of multi-byte characters, though not enough; the other preprocessors are poor in this respect. The implementations in GCC 2.95 and 3.2 are half-finished and do not reach a practical level. GCC 3.4-4.1 has begun to support many encodings by converting them all to UTF-8, but the actual implementation is still not at a practical level for some encodings.
On systems using shift-JIS or BIG-5, tokenization of literals and stringizing with the # operator require attention. Visual C supports these well, and BC 5.5J also supports shift-JIS.
In the diagnostic messages for e_*, GCC 2.95 and later are superior. Though Visual C and ucpp issue diagnostics on comparatively many items, they are often vague or off target. Very few preprocessors diagnose overflow in the #if expression; only BC, ucpp, and GCC do so to some extent.
In documenting the implementation-defined areas, GCC 3.2 and later are at an adequate level; the rest are all very poor.
In the diagnostics for u_*, only GCC 3.2 and later are adequate; the rest are very poor. I do not think it acceptable for compiler systems to do nothing just because the result is undefined.
Almost no compiler system handles s_* and w_*. It is unexpected that very few even warn about nested comments.
In the other qualities, GCC stands out with plentiful options, accurate diagnostic messages, high speed, and portability.
Overall, GCC V.3.2 and later excels the most in Standard conformance level, ease of use, and stability, without many big problems.
Certainly, mcpp excels in most aspects, though it is inferior in speed alone.
After this huge volume of testing, what I realize is the importance of test samples. mcpp is the result of creating samples and debugging in parallel. Without enough samples you cannot even notice the existence of bugs, and then it is anything but debugging.
If the Standards came with an exhaustive set of test samples like this, the quality of every compiler system would improve dramatically. Creating exhaustive test samples also reveals the problems in the Standards themselves: test samples are an illustration of the Standards.
I look forward to opinions about this Validation Suite and preprocessing test reports for various compiler systems using this tool. Please use the "Open Discussion Forum" at:
or email.
If you perform detailed testing of a preprocessor, cut out the score table of 6.2, fill it in, and send it. To calculate the totals, please compile and use tool/total.c. If the score table is cpp_test.old and you run

total 18 cpp_test.old cpp_test.new

the stotal (sub-total), mttl (mid-total), and gtotal (grand-total) fields are computed and written to cpp_test.new. In place of "18", specify the number of the compiler system.
You can test GCC automatically with the testsuite edition of the Validation Suite. I am waiting for test reports on various versions of GCC: please send me the log files (gcc.sum and gcc.log), and I will supplement my testsuite edition with the diagnostics of the various versions if any differences exist.
The development of the Validation Suite and mcpp is in progress in the mcpp project at SourceForge mentioned above. Please send me email if you would like to join the development.